Lecture Notes in Electrical Engineering
Volume 73

For other titles published in this series, go to
www.springer.com/series/7818

Guy Gogniat  Dragomir Milojevic
Adam Morawiec  Ahmet Erdogan



Editors

Algorithm-Architecture Matching for Signal and Image Processing

Best papers from Design and Architectures for Signal and Image Processing 2007 & 2008 & 2009

Editors
Guy Gogniat
Lab-STICC-CNRS, UMR 3192, Centre de
Recherche
Université de Bretagne Sud – UEB
BP 92116
56321 Lorient Cedex
France
guy.gogniat@univ-ubs.fr
Dragomir Milojevic
Université libre de Bruxelles
CP 165-56, Av. FD Roosevelt 50
1050 Bruxelles
Belgium
dmilojev@ulb.ac.be

Adam Morawiec
ECSI
Av. de Vignate 2
38610 Gières
France
adam.morawiec@ecsi.org
Ahmet Erdogan
School of Engineering
The University of Edinburgh
Mayfield Road
EH9 3JL Edinburgh
United Kingdom
ahmet.erdogan@ee.ed.ac.uk

e-ISSN 1876-1119
ISSN 1876-1100
ISBN 978-90-481-9964-8
e-ISBN 978-90-481-9965-5
DOI 10.1007/978-90-481-9965-5
Springer Dordrecht Heidelberg London New York
Library of Congress Control Number: 2010938790
© Springer Science+Business Media B.V. 2011
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by
any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written
permission from the Publisher, with the exception of any material supplied specifically for the purpose
of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Cover design: VTEX, Vilnius
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Advances in signal and image processing together with increasing computing power
are bringing mobile technology closer to applications in a variety of domains like
automotive, health, telecommunication, multimedia, entertainment and many others. The development of these leading applications, involving a large diversity of
algorithms (e.g. signal, image, video, 3D, communication, cryptography) is classically divided into three consecutive steps: a theoretical study of the algorithms, a
study of the target architecture, and finally the implementation. Such a linear design
flow is reaching its limits due to intense pressure on design cycle and strict performance constraints. The approach, called Algorithm-Architecture Matching, aims to
leverage design flows with a simultaneous study of both algorithmic and architectural issues, taking into account multiple design constraints, as well as algorithm and
architecture optimizations that could not be achieved if the two were considered separately. Introducing new design methodologies is mandatory when facing new
emerging applications such as advanced mobile communication or graphics
using sub-micron manufacturing technologies or 3D Integrated Circuits. This diversity forms a driving force for the future evolution of embedded system design
methodologies.
The main expectations from system designers’ point of view are related to methods, tools and architectures supporting application complexity and design cycle reduction. Advanced optimizations are essential to meet design constraints and to enable a wide acceptance of these new technologies.
This book presents a collection of selected contributions addressing different aspects of the Algorithm-Architecture Matching approach, ranging from sensors to architecture design. The scope of this book reflects the diversity of potential algorithms, including signal, communication, image, video and 3D-Graphics, implemented
onto various architectures from FPGA to multiprocessor systems. Several synthesis and resource management techniques leveraging design optimizations are also
described and applied to numerous algorithms.
The contributions of this book are split into three parts addressing major issues
when designing embedded systems. The first part proposes key contributions in the
domain of architectures for embedded applications and especially for image and
telecommunication processing. The second part focuses on data acquisition and design techniques for embedded systems. First, an optimized sensor for image acquisition is detailed. Then several multiplication and division operators are described.
The end of this part proposes several contributions in the domain of partial and dynamic reconfiguration for signal and image processing. This technology leads to
complex design issues which are addressed in this part. The third part targets
embedded systems design. RTOS for embedded systems and scheduling techniques
are first addressed. Finally CAD tools for signal and image processing are detailed.
The coverage of this book is broad and provides an in-depth analysis of existing techniques and methodologies to design embedded systems targeting image and signal
processing.
Guy Gogniat
Dragomir Milojevic
Adam Morawiec
Ahmet Erdogan

Contents

Part 1  Architectures for Embedded Applications

Lossless Multi-Mode Interband Image Compression and Its Hardware Architecture . . . . . 3
Xiaolin Chen, Nishan Canagarajah, and Jose L. Nunez-Yanez

Efficient Memory Management for Uniform and Recursive Grid Traversal . . . . . 27
Tomasz Toczek and Stéphane Mancini

Mapping a Telecommunication Application on a Multiprocessor System-on-Chip . . . . . 53
Daniela Genius, Etienne Faure, and Nicolas Pouillon

Part 2  Data Acquisition and Embedded Systems

A Standard 3.5T CMOS Imager Including a Light Adaptive System for Integration Time Optimization . . . . . 81
Gilles Sicard, Estelle Labonne, and Robin Rolland

Approximate Multiplication and Division for Arithmetic Data Value Speculation in a RISC Processor . . . . . 95
Daniel R. Kelly, Braden J. Phillips, and Said Al-Sarawi

RANN: A Reconfigurable Artificial Neural Network Model for Task Scheduling on Reconfigurable System-on-Chip . . . . . 117
Daniel Chillet, Sébastien Pillement, and Olivier Sentieys

A New Three-Level Strategy for Off-Line Placement of Hardware Tasks on Partially and Dynamically Reconfigurable Hardware . . . . . 145
Ikbel Belaid, Fabrice Muller, and Maher Benjemaa

End-to-End Bitstreams Repository Hierarchy for FPGA Partially Reconfigurable Systems . . . . . 171
Jérémie Crenne, Pierre Bomel, Guy Gogniat, and Jean-Philippe Diguet

Part 3  Embedded Systems Design

SystemC Multiprocessor RTOS Model for Services Distribution on MPSoC Platforms . . . . . 197
Benoît Miramond, Emmanuel Huck, Thomas Lefebvre, and François Verdier

A List Scheduling Heuristic with New Node Priorities and Critical Child Technique for Task Scheduling with Communication Contention . . . . . 217
Pengcheng Mu, Jean-François Nezan, and Mickaël Raulet

Multiprocessor Scheduling of Dataflow Programs within the Reconfigurable Video Coding Framework . . . . . 237
Jani Boutellier, Christophe Lucarz, Victor Martin Gomez, Marco Mattavelli, and Olli Silvén

A High Level Synthesis Flow Using Model Driven Engineering . . . . . 253
Sébastien Le Beux, Laurent Moss, Philippe Marquet, and Jean-Luc Dekeyser

Generation of Hardware/Software Systems Based on CAL Dataflow Description . . . . . 275
Richard Thavot, Romuald Mosqueron, Julien Dubois, and Marco Mattavelli

Index . . . . . 293

Contributors

Said Al-Sarawi CHiPTec, Centre for High Performance Integrated Technologies
and Systems, The University of Adelaide, Adelaide, Australia,
alsarawi@eleceng.adelaide.edu.au
Ikbel Belaid University of Nice-Sophia Antipolis/LEAT-CNRS, 250 rue Albert
Einstein, 06560 Valbonne, France, Ikbel.Belaid@unice.fr
Maher Benjemaa National Engineering School of Sfax/Research Unit ReDCAD,
Sfax, Tunisia, Maher.Benjemaa@enis.rnu.tn
Sebastien Le Beux Institut des Nanotechnologies de Lyon, Ecole Centrale de
Lyon, 36, Avenue Guy de Collongue, 69134 Ecully Cedex, France,
Sebastien.Le-Beux@ec-lyon.fr
Pierre Bomel LAB-STICC, Université Européenne de Bretagne, Lorient, France,
pierre.bomel@univ-ubs.fr
Jani Boutellier Computer Science and Engineering Laboratory, University of
Oulu, Oulu, Finland, jani.boutellier@ee.oulu.fi
Nishan Canagarajah University of Bristol, Bristol, UK,
Nishan.Canagarajah@bristol.ac.uk
Xiaolin Chen University of Bristol, Bristol, UK, Xiaolin.Chen@bristol.ac.uk
Daniel Chillet University of Rennes 1, IRISA/INRIA, BP 80518, 6 rue de Kerampont, F22305 Lannion, France, Daniel.Chillet@irisa.fr
Jérémie Crenne LAB-STICC, Université Européenne de Bretagne, Lorient, France,
jeremie.crenne@univ-ubs.fr
Jean-Luc Dekeyser LIFL and INRIA Lille Nord-Europe, Parc Scientifique de
la Haute Borne, Park Plaza, Bât A, 40 avenue Halley, Villeneuve d’Ascq 59650,
France, Jean-Luc.Dekeyser@lifl.fr
Jean-Philippe Diguet LAB-STICC, Université Européenne de Bretagne, Lorient,
France, jean-philippe.diguet@univ-ubs.fr
Julien Dubois Laboratoire LE2I, Université de Bourgogne, 21000 Dijon, France,
julien.dubois@u-bourgogne.fr
Etienne Faure SoC Department, LIP6, 4 place Jussieu, 75252 Paris Cedex, France,
etienne.faure@lip6.fr
Daniela Genius SoC Department, LIP6, 4 place Jussieu, 75252 Paris Cedex,
France, daniela.genius@lip6.fr
Guy Gogniat LAB-STICC, Université Européenne de Bretagne, Lorient, France,
guy.gogniat@univ-ubs.fr
Victor Martin Gomez Computer Science and Engineering Laboratory, University
of Oulu, Oulu, Finland, victor.martin@ee.oulu.fi
Emmanuel Huck ETIS Laboratory, UMR CNRS 8051, Université de Cergy-Pontoise/ENSEA, 6, avenue du Ponceau, 95014 Cergy-Pontoise, France,
huck@ensea.fr
Daniel R. Kelly CHiPTec, Centre for High Performance Integrated Technologies
and Systems, The University of Adelaide, Adelaide, Australia,
dankelly@eleceng.adelaide.edu.au
Estelle Labonne TIMA Laboratory (CNRS, Grenoble INP, UJF), Grenoble, France
Thomas Lefebvre ETIS Laboratory, UMR CNRS 8051, Université de Cergy-Pontoise/ENSEA, 6, avenue du Ponceau, 95014 Cergy-Pontoise, France,
lefebvre@ensea.fr
Christophe Lucarz Microelectronic Systems Laboratory, École Polytechnique
Fédérale de Lausanne, Lausanne, Switzerland, christophe.lucarz@epfl.ch
Stéphane Mancini GIPSA-lab, INPG-CNRS, 961 rue de la Houille Blanche Domaine Universitaire-B.P. 46, 38402, Saint Martin d’Heres, France,
stephane.mancini@gipsa-lab.inpg.fr
Philippe Marquet LIFL and INRIA Lille Nord-Europe, Parc Scientifique de la
Haute Borne, Park Plaza, Bât A, 40 avenue Halley, Villeneuve d’Ascq 59650,
France, Philippe.Marquet@lifl.fr
Marco Mattavelli SCI-STI-MM, Ecole Polytechnique Fédérale de Lausanne, CH
1015 Lausanne, Switzerland, marco.mattavelli@epfl.ch
Marco Mattavelli Microelectronic Systems Laboratory, École Polytechnique
Fédérale de Lausanne, Lausanne, Switzerland, marco.mattavelli@epfl.ch
Benoît Miramond ETIS Laboratory, UMR CNRS 8051, Université de Cergy-Pontoise/ENSEA, 6, avenue du Ponceau, 95014 Cergy-Pontoise, France,
miramond@ensea.fr
Romuald Mosqueron SCI-STI-MM, Ecole Polytechnique Fédérale de Lausanne,
CH 1015 Lausanne, Switzerland, romuald.mosqueron@epfl.ch

Laurent Moss Ecole Polytechnique de Montréal, Campus de l’Université de Montréal, 2500, chemin de Polytechnique, 2900 boulevard Edouard-Montpetit, Montréal, Quebec H3T 1J4, Canada, Laurent.Moss@polymtl.ca
Pengcheng Mu Ministry of Education Key Lab for Intelligent Networks and Network Security, School of Electronic and Information Engineering, Xi’an Jiaotong
University, Xi’an 710049, P.R. China, pengchengmu@gmail.com
Fabrice Muller University of Nice-Sophia Antipolis/LEAT-CNRS, 250 rue Albert
Einstein, 06560 Valbonne, France, Fabrice.Muller@unice.fr
Jean-François Nezan IETR/Image and Remote Sensing Group, CNRS UMR
6164/INSA Rennes, 20, avenue des Buttes de Coësmes, 35043 Rennes Cedex,
France, jnezan@insa-rennes.fr
Jose L. Nunez-Yanez University of Bristol, Bristol, UK,
J.L.Nunez-Yanez@bristol.ac.uk
Braden J. Phillips CHiPTec, Centre for High Performance Integrated Technologies and Systems, The University of Adelaide, Adelaide, Australia,
phillips@eleceng.adelaide.edu.au
Sébastien Pillement University of Rennes 1, IRISA/INRIA, BP 80518, 6 rue de
Kerampont, F22305 Lannion, France
Nicolas Pouillon SoC Department, LIP6, 4 place Jussieu, 75252 Paris Cedex,
France, nicolas.pouillon@lip6.fr
Mickaël Raulet IETR/Image and Remote Sensing Group, CNRS UMR 6164/INSA
Rennes, 20, avenue des Buttes de Coësmes, 35043 Rennes Cedex, France,
mraulet@insa-rennes.fr
Robin Rolland CIME Nanotech, Grenoble, France
Olivier Sentieys University of Rennes 1, IRISA/INRIA, BP 80518, 6 rue de Kerampont, F22305 Lannion, France
Gilles Sicard TIMA Laboratory (CNRS, Grenoble INP, UJF), Grenoble, France
Olli Silvén Computer Science and Engineering Laboratory, University of Oulu,
Oulu, Finland, olli.silven@ee.oulu.fi
Richard Thavot SCI-STI-MM, Ecole Polytechnique Fédérale de Lausanne, CH
1015 Lausanne, Switzerland, richard.thavot@epfl.ch
Tomasz Toczek GIPSA-lab, INPG-CNRS, 961 rue de la Houille Blanche Domaine
Universitaire-B.P. 46, 38402, Saint Martin d’Heres, France,
tomasz.toczek@gipsa-lab.inpg.fr
François Verdier ETIS Laboratory, UMR CNRS 8051, Université de Cergy-Pontoise/ENSEA, 6, avenue du Ponceau, 95014 Cergy-Pontoise, France,
verdier@ensea.fr

Part 1

Architectures for Embedded Applications

Lossless Multi-Mode Interband Image
Compression and Its Hardware Architecture
Xiaolin Chen, Nishan Canagarajah,
and Jose L. Nunez-Yanez

Abstract This paper presents a novel Lossless Multi-Mode Interband image Compression (LMMIC) scheme and its hardware architecture. Our approach detects the
local features of the image and uses different modes to encode regions with different
features adaptively. Run-mode is used in homogeneous regions, while ternary-mode
and regular-mode are used on edges and other regions, respectively. In regular mode,
we propose a simple band shifting technique as interband prediction and a gradient-based switching strategy to select between intraband and interband prediction. We
also enable intraband and interband adaptation in the run-mode and ternary-mode.
The advantage of LMMIC is to adaptively “segment” the image and use suitable
methods to encode different regions. The simplicity of our scheme makes it amenable to hardware implementation. Experimental results show that LMMIC achieves superior compression ratios, with the benefits of encoding any number of bands and easy
access to any band. We also describe the hardware architecture for this scheme.

1 Introduction
The rapid advance of multimedia technology generates a huge amount of image data,
most of which are multispectral images. We define “multispectral images” here as
images containing more than one spectral band. This includes a wide range of images from colour images to hyperspectral images. For instance, colour images, often
stored as JPEG, BMP, or TIF format, have at least three bands, e.g. red, green and
blue (RGB). In most printing systems, a four-band colour space CMYK (cyan, magenta, yellow and black) is commonly used. Moreover, some high fidelity image
capture systems collect the spectral reflectance measured at different wavelengths
X. Chen ()
University of Bristol, Bristol, UK
e-mail: Xiaolin.Chen@bristol.ac.uk
G. Gogniat et al. (eds.), Algorithm-Architecture Matching for Signal and Image Processing,
Lecture Notes in Electrical Engineering 73,
DOI 10.1007/978-90-481-9965-5_1, © Springer Science+Business Media B.V. 2011


in order to accurately capture the colour of a physical surface. For example, the
VASARI imaging system [16, 23] developed at the National Gallery in London uses
a seven-channel multispectral camera to capture paintings. In remote sensing, the
LANDSAT 7 [15] satellite images have seven spectral bands, and the AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) [7] hyperspectral images contain 224
contiguous bands. These images form the basis of widely used web mapping services, e.g. Google Earth. In medical imaging, multispectral images also prevail.
Examples include magnetic resonance imaging (MRI) which can simultaneously
measure multiple characteristics of an object [6], and medical images formed by
different medical imaging modalities such as MRI, CT and X-ray [8]. These images
are normally compressed for transmission and storage. As many applications, e.g.
remote sensing imaging, medical imaging, pre-press imaging and archiving [30],
demand perfect reconstruction of images, lossless compression of multispectral images attracts increasing interest. Also, for applications that need to transmit image
data instantly after acquisition, real-time compression is desirable. As software compression often suffers from heavy CPU and memory consumption, we aim to design an efficient, hardware-amenable lossless image compression
scheme, with the capability of real-time processing.
Unlike gray-scale images, multispectral images have not only spatial but also spectral (also called interband) redundancy among different spectral bands. Moreover, the
existence of multiple spectral bands raises two problems worth considering: (1) to
encode any number of bands, which is not offered by schemes with certain restricted
structure; (2) to enable random access to any band such that any band can
be retrieved without processing the whole multispectral image. For example, each
band of the LANDSAT image contains different information of the earth – water
body or vegetation moisture contents etc., and different combinations of bands illustrate
different aspects such as minerals, soil and so on [15]. Therefore, it
is desirable to have random band access. Many compression schemes in the literature
have achieved good compression ratios, but few consider these merits.
In the literature, both transform and prediction techniques are used in interband image compression. Popular transform-based interband coding techniques include vector quantization [12, 14, 21], discrete cosine transform [4], discrete wavelet transform [25], and vector-lifting schemes [2]. These techniques are efficient in reducing spectral redundancy, but their high computational complexity and sometimes
jointly encoding several bands (e.g. 16 bands in [25]) are obstacles for hardware
implementation and real-time processing. On the other hand, predictive coding performs
well in removing not only spatial redundancy but also spectral redundancy.
Wu extended Intraband CALIC [29] to Interband CALIC [30], which is claimed to
offer one of the best interband compression results in the literature but requires complex computation of interband correlation coefficients and context formation. SICLIC
[1] is a simple and efficient coder based on LOCO-I [28], but its 3-band joint-run
mode, while offering good bit rates, constrains it from encoding any number of
bands.
To address these problems, we propose a Lossless Multi-Mode Interband image
Compression (LMMIC) scheme. The proposal of this scheme is inspired by the


concept of segmentation. Segmentation, in a general sense, is to partition an image into multiple segments in order to change the representation of an image into
something meaningful or easy to analyse. However, traditional segmentation, e.g.
statistical model-based methods [9] and graph-based methods [5], is too complex to
implement in real-time compression. The novelty of our scheme is to apply the concept of segmentation to group pixels with similar features and use different methods
to encode them. We call this method the multi-mode strategy, where a new ternary-mode is designed to detect and encode the edges and smooth areas, and a run-length
coder [28] is used to encode the homogeneous regions, while the rest of the image,
say the texture regions, is coded by a regular-mode which consists of intraband and
interband prediction. We propose a simple band shifting technique for interband
prediction and adopt the Gradient-Adjusted Prediction (GAP) from CALIC [29] for
intraband prediction. A new gradient-based switch is also designed to select the
better predictor in regular-mode and to allow intraband and interband adaptation
in run-mode and ternary-mode. The proposed scheme offers not only excellent
compression ratios but also the distinctive flexibility of encoding any
number of bands. As LMMIC only involves a limited number of additions and shifts, it is hardware amenable. Note that the proposed scheme in this paper is for
general-purpose (e.g. space, medical, archiving) images with more than one spectral
band, and is not specifically geared to hyperspectral images. Since some of the techniques for hyperspectral images make specific use of the structure of these images,
for example, by including a band ordering process [26] or by clustering a number of
bands [25], we do not include those highly specialized and not necessarily hardware
amenable methods in our comparison study. However, we acknowledge that refining
our techniques for specific use with hyperspectral images is an interesting topic for
further research, and its findings will be reported elsewhere.
This paper is organized as follows. In Sect. 2 we present an overview of LMMIC.
Then we explain the core techniques – multi-mode strategy in Sect. 3 and band shifting and gradient-based switching in Sect. 4. Context modelling is briefly described
in Sect. 5. The performance comparison with other state-of-the-art schemes is presented in Sect. 6. We propose the hardware architecture to support LMMIC in Sect. 7
and conclude our work in Sect. 8.

2 An Overview of LMMIC
An image contains many features, such as smooth regions, edges, texture etc. The
complexity of an image is an obstacle for compression, thus segmentation (also
referred to as region-based methodology) is a viable approach to help with distinguishing these features. The lossless image compression method TMW [11], which
achieves the best gray-scale image compression ratio so far, uses segmentation to
analyse the image in the first pass. Shen [24] took advantage of the region-growing
algorithm for segmentation of lossless compression of medical images. Ratakonda
[20] used multiscale segmentation to encode general images. However, they are all


Fig. 1 Schematic of the proposed image compression system

complex two-pass schemes, so they cannot meet real-time processing requirements well. Due to the complexity of segmentation, we skip the conventional segmentation methods, and instead propose to apply its concept, using a simple switch to
detect the image features adaptively, and choose suitable modes to encode these features. This switching technique resembles the function of “segmenting” the image
into different areas. This is the idea that our scheme is based on.
Figure 1 shows the schematic of LMMIC. It consists of preprocessing, prediction,
context modelling and arithmetic coding. At the beginning, a base band is chosen.
It is encoded independently using an intraband compression method. Then a preprocessing step is performed on all bands except the base band to calculate the band
difference between the current band and the base band. The output of the preprocessing includes a difference band, an original band and a base band. They are then fed
into the multi-mode predictor. The prediction step includes run-mode, ternary-mode
and regular mode. In each mode, there is a choice between intraband and interband
operations. For regular-mode, we propose a band shifting technique for interband
prediction, while the intraband prediction is based on the Gradient-Adjusted Predictor (GAP) [29]. A new gradient-based switching is designed to select the better
predictor. We also enable adaptation on intraband and interband operations for run-mode and ternary-mode. This multi-mode strategy applies to all bands in an image.
The context modelling is constructed in a similar but simpler way as in [29] to
further exploit higher order redundancy. The probability estimator and arithmetic
coding are described together with the hardware architecture in Sect. 7. LMMIC
offers not only very good compression ratios, but also the distinctive feature of encoding any number of bands, since the structure of our scheme does not enforce any
restriction on the number of bands to be coded. The simple multi-mode strategy and
switching method make it hardware amenable.


3 Multi-Mode Strategy
As stated before, the multi-mode strategy is based on the concept of segmentation.
Just as segmenting regions correctly matters in segmentation, it is crucial to decide carefully under which circumstances the system should work in which
mode, since it is exactly these entry conditions that “segment” the image. In this section, we present how each mode works and the conditions for entering each mode.
Prior to that, the preprocessing step is briefly described below.

3.1 Preprocessing
In Fig. 1, to start the process, one band is chosen as base band. For instance the
band G in RGB images, or the first band received in other multispectral images. The
base band is coded with intraband coding only. Then a preprocessing step simply
subtracts the base band from the current band to get the band difference.
Band_diff = Band_curr − Band_base.    (1)

This seems to be a simple act, but leads to a lot of benefits. For example in RGB images, the bands that can be used during prediction without violating the reversibility
of the algorithm are
G, R, B, R − G, B − G.    (2)

Instead of having three original bands, now we have five “bands” (some are difference bands) that can be used in prediction. For images with any number of bands,
there always exist an original band and a difference band for encoding each band.
This enables adaptation between intraband and interband prediction. Also, in this
way, each band is only coupled with the base band and no multi-band joint encoding is performed. Since the base band is coded independently, it can be retrieved
any time without processing other bands. Once the base band is retrieved, the current band can be retrieved. This allows the flexibility of compressing any number
of bands and enables easy access to any band. For any band except the base band,
multi-mode strategy is applied on both the original band and the difference band.
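A minimal C sketch of the preprocessing step, assuming 8-bit pixels; the function names are ours, not the chapter's. The decoder recovers the current band losslessly by adding the base band back to the difference band.

#include <stddef.h>
#include <stdint.h>

/* Difference band of (1): diff = current - base.
 * 8-bit pixels need a signed 16-bit range for the difference. */
void compute_band_diff(const uint8_t *curr, const uint8_t *base,
                       int16_t *diff, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++)
        diff[i] = (int16_t)curr[i] - (int16_t)base[i];
}

/* Lossless reconstruction at the decoder: current = base + diff. */
void reconstruct_band(const uint8_t *base, const int16_t *diff,
                      uint8_t *curr, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++)
        curr[i] = (uint8_t)(base[i] + diff[i]);
}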

3.2 Run-Mode
Run-length coding is simple and efficient in grouping identical symbols [28]. It encodes the occurrence, i.e. the number of times that the symbol occurs consecutively,
also called run-length. We use run-length coding to encode the homogeneous regions of the image. Figure 2 shows the neighboring pixels of the current pixel X
according to their geographical positions. When W = N = NW = NE, the current
pixel is assumed to be in a homogeneous region and the coder attempts to encode it in run-mode. If X is identical to W, the run-length increases by one; otherwise the “run” stops
and the current run-length is encoded. The latter case means that encoding the symbol
in run-mode is unsuccessful, so the symbol is encoded in regular-mode.
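A minimal C sketch of the run-mode logic just described. The function name is ours, the printf call merely stands in for the entropy coder, and the column index is assumed to satisfy 1 <= c < width - 1 so that all four neighbours exist.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* prev and curr are the previous and current image rows; c is the column of
 * the current pixel X. Run-mode is entered when the causal neighbours
 * W, N, NW, NE are identical; the run then grows while X equals W.
 * Returns -1 if the entry condition fails, otherwise the run length. */
long try_run_mode(const uint8_t *prev, const uint8_t *curr,
                  size_t c, size_t width)
{
    uint8_t W = curr[c - 1], N = prev[c], NW = prev[c - 1], NE = prev[c + 1];

    if (!(W == N && N == NW && NW == NE))
        return -1;                    /* entry condition not met */

    long run = 0;
    while (c + (size_t)run < width && curr[c + (size_t)run] == W)
        run++;                        /* count consecutive pixels equal to W */

    printf("run length %ld\n", run);  /* stand-in for encoding the run length */
    return run;                       /* a pixel that ended the run falls back
                                         to regular-mode */
}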


Fig. 2 Neighboring pixels of
the current pixel

3.3 Ternary-Mode
Ternary-mode is our new proposal and is inspired by the binary-mode in CALIC,
which works on the binary area where there are only two different pixels in the
neighborhood, e.g. black and white texts. However, unlike CALIC, our new ternary-mode targets two types of areas – clear edge areas and smooth but not exactly
homogeneous areas. An edge, which appears as an abrupt change in pixel intensity,
is very difficult to predict. Therefore, instead of predicting it, we propose to record
the similarity between the edge pixels and their neighbouring pixels. In areas where
a sharp edge occurs, as shown in Fig. 3a, pixels on the edge tend to be the same or
similar but pixels at the two sides of the edge are usually different; also, in areas
where a less sharp edge occurs, as in Fig. 3b, pixel values tend to be changing
gradually from non-edge area to edge area. In both cases, we assume that within
a small neighborhood of the current pixel, say the seven neighbouring pixels in
Fig. 2, there are no more than three distinctive symbols and the ternary-mode is
triggered. In operation, pixel W, as the first unique pixel value, is represented as s1.
Then the other six pixels in the neighbourhood are evaluated and the second and
third distinctive pixels are represented as s2 and s3, respectively. By comparing the
current pixel with s1, s2 and s3, we can assign a value to the current pixel by
    T = 0,       if X = s1;
        1,       if X = s2;
        2,       if X = s3;                                  (3)
        escape,  otherwise.

Fig. 3 Areas where ternary-mode is performed


In other words, the current pixel can be denoted by the order of its value appearance
in the neighbourhood, given the condition that the checking in the neighbourhood
is always conducted in the same order. “Escape” happens when the current pixel
X is not equal to any of the pixel values in the neighbourhood and thus encoding
in this mode fails. It is a way of switching among different modes. Figure 3 shows
the areas where the ternary-mode is performed. T indicates the symbols encoded by
ternary-mode, while R indicates run-mode, and the colour is the gray-level of the
symbols. The figure shows that edge areas can be largely covered by this mode.
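A C sketch of the ternary-mode decision behind (3). The scan order of the seven neighbours, the function name and the numeric value used for the escape event are our illustrative choices.

#include <stdint.h>

#define TERNARY_ESCAPE 3   /* our encoding of the "escape" event */

/* neigh[] holds the seven causal neighbours of X scanned in a fixed order,
 * starting with W. If they contain at most three distinct values s1, s2, s3,
 * the mode is entered and X is coded by its order of appearance T as in (3);
 * otherwise, or when X matches none of s1..s3, an escape is signalled. */
int ternary_symbol(const uint8_t neigh[7], uint8_t X, int *entered)
{
    uint8_t s[3];
    int ns = 0;

    for (int i = 0; i < 7; i++) {
        int known = 0;
        for (int j = 0; j < ns; j++)
            if (neigh[i] == s[j]) { known = 1; break; }
        if (!known) {
            if (ns == 3) { *entered = 0; return TERNARY_ESCAPE; } /* >3 values */
            s[ns++] = neigh[i];               /* record a new distinct value */
        }
    }
    *entered = 1;                             /* entry condition met */
    for (int j = 0; j < ns; j++)
        if (X == s[j]) return j;              /* T = 0, 1 or 2 */
    return TERNARY_ESCAPE;                    /* X not found: encoding fails */
}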
We choose to use three distinctive pixel values but not fewer or more for the
following reasons:
1. In the cases shown in Fig. 3, there are usually more than two distinctive pixel
values in the neighbourhood. Using only two pixel values cannot adequately describe the edge conditions.
2. In many cases, image edges are more complicated than the ones shown in Fig. 3.
However, allowing more distinctive pixel values is very likely to result in more
negative than positive effect, as explained below:
a. It makes entering the ternary-mode too easy, if four or more different pixel
values are allowed in a 7 pixel neighbourhood. This would fail to characterize
the specific areas that are suitable to be encoded in ternary-mode;
b. It would lead to a lot more “escapes”. Because more random areas are classified as applicable in ternary-mode, the current pixel is more likely to fail to
find a match with any of the pixels in a more diverse neighbourhood;
c. Allowing more distinctive pixel values would increase the alphabet size, and
hence the bits that are needed to encode pixels.
The alphabet size for encoding in this mode is only 4 instead of 256 in the original
form, so lower entropy can be obtained. Ternary-mode also works as a “backup” of
the run-mode in smooth but not exactly homogeneous regions.

3.4 Regular-Mode
The regular-mode is triggered, either when the entry conditions for run-mode and
ternary-mode cannot be met, or when encoding in other modes fails. The regular-mode consists of intraband and interband prediction, which is selected according
to the local features adaptively. Details of the interband prediction and switching
strategy are discussed in the next section.

4 Band Shifting and Gradient-Based Switching
We design a simple band shifting technique for interband prediction, and adapt the
GAP from CALIC [29] for intra-band prediction. However, the performance of interband prediction depends on the interband correlation. In the case of strong inter-


band correlation, interband prediction is preferred, otherwise intra-band prediction
is selected. A gradient-based switching method is proposed for the selection.

4.1 Band Shifting for Inter-Band Prediction
In the regions where bands are strongly correlated, pixel changes in one band often
happen in another band. For instance, Fig. 4 plots the pixel values of one line in
band G and band B of the image “peppers” respectively. It is clear that the dot plot
of band G has a similar trend with the dash plot of band B. We also notice that
although changing in a similar trend, the difference between two bands varies from
areas to areas. Thus directly subtracting band B from band G tends to result in big
errors. The ideal way would be to move the base band to a position that is as close
to the current band as possible so that only a small difference between the current
band and the shifted base band needs to be coded. There are a lot of possible ways
to predict the value for band shifting. Since band shifting is only performed when
the interband correlation is high, we assume that in this case the band difference
is reasonably small and varies in a regular way. Therefore, we propose to use the
simple Median Edge Detector [28] to predict the band shifting value. We denote the
band difference at positions W, N, NW as W_diff, N_diff, NW_diff, and calculate
the band shifting by
if NW_diff >= max(N_diff, W_diff)
    shift_band = min(N_diff, W_diff);
else if NW_diff <= min(N_diff, W_diff)
    shift_band = max(N_diff, W_diff);
else
    shift_band = N_diff + W_diff - NW_diff;
end
The solid plot in Fig. 4 shows that this prediction method successfully generates a
zero-mean band difference between the current band and the shifted reference band.
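A compact C rendering of the prediction above (names are ours): the Median Edge Detector is applied to the band differences at W, N and NW, and the predicted shift is added to the co-located base-band pixel to form the interband prediction of the current pixel.

/* Median Edge Detector of [28], applied here to band differences. */
static int med(int a, int b, int c)
{
    int mx = a > b ? a : b, mn = a < b ? a : b;
    if (c >= mx) return mn;
    if (c <= mn) return mx;
    return a + b - c;
}

/* base_X is the base-band pixel co-located with the current pixel X;
 * W_diff, N_diff, NW_diff are the band differences at positions W, N, NW. */
int interband_predict(int base_X, int W_diff, int N_diff, int NW_diff)
{
    int shift_band = med(N_diff, W_diff, NW_diff);   /* predicted shift */
    return base_X + shift_band;                      /* shifted base band */
}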

Fig. 4 Plots of one line in band G and band B and their difference after shifting


4.2 Gradient-Based Switching
For the regular-mode in the multi-mode strategy, we use the band shifting technique
for interband prediction and adopt the GAP [29] for intraband prediction. Since the
performance of the two predictors varies in different regions of an image depending
on the spatial and spectral correlation, it is critical to decide which predictor to use
in different areas. As we aim at designing a hardware amenable scheme, complex
calculation of interband correlation coefficients is not desirable. We propose a simple switching method based on the local horizontal and vertical gradients, which are
calculated by
dh = |W − WW| + |N − NW| + |N − NE|,    (4)
dv = |W − NW| + |N − NN| + |NE − NNE|,    (5)

where dv and dh are the vertical and horizontal gradients, respectively. When calculating the interband gradients, W, N, NW, NE, NN, WW and NNE are substituted
by the interband difference at the same positions. Interband gradients indicate how
closely the two bands change in the same way. In addition to the gradients, the
previous prediction error is taken into account to evaluate how well the predictor
performs in the local area. Therefore, for both intraband and interband prediction,
we calculate the switching coefficient S by
S = dv + dh + |e_w|,    (6)

where e_w denotes the prediction error at position W. The predictor that gives the smaller
S is selected to encode the current pixel. We counted the proportions of pixels that
are treated by intraband and interband prediction in the regular-mode respectively,
in band B and R on a set of RGB images in Table 1. We also calculated how often
the predictor that gives smaller errors is selected, as right_ratio. The proportions of
intraband and interband prediction do not sum up to 1 because the remaining pixels are
processed by the run-mode or ternary-mode. On average, more than 40% interband
prediction is selected, meaning that there is a substantial amount of interband redundancy. The simple gradient-based switching technique has achieved over 70%
correct choice in selecting the better predictor.

Table 1 Proportions of pixels using intraband and interband prediction and the ratios at which the better predictor is selected in the regular mode

Image     b_intra   b_inter   b_right_ratio   r_intra   r_inter   r_right_ratio
cmpnd1    6.41      17.66     70.13           4.62      18.64     73.33
cmpnd2    2.06      24.10     83.63           1.64      24.91     82.21
cats      1.78      47.78     87.12           1.10      47.56     90.29
water     5.33      44.19     76.53           3.04      45.49     82.15
lena      33.97     65.90     59.36           64.09     34.43     67.33
peppers   17.41     76.89     71.38           22.53     76.60     72.26
bike3     52.46     27.23     64.82           52.41     29.19     60.67
average   17.06     43.39     73.28           21.35     39.55     75.46
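A C sketch of the switch defined by (4)–(6); the routine is evaluated once on pixel values (intraband) and once on the band differences at the same positions (interband), and the predictor with the smaller coefficient S encodes the pixel. Function names are ours.

#include <stdlib.h>

/* e_w is the prediction error previously made at position W by the predictor
 * being evaluated; the other arguments are the causal neighbours, given either
 * as pixel values or as band differences. */
int switching_coefficient(int W, int N, int NW, int NE,
                          int NN, int WW, int NNE, int e_w)
{
    int dh = abs(W - WW) + abs(N - NW) + abs(N - NE);    /* eq. (4) */
    int dv = abs(W - NW) + abs(N - NN) + abs(NE - NNE);  /* eq. (5) */
    return dv + dh + abs(e_w);                           /* eq. (6) */
}

/* The predictor with the smaller coefficient is selected for the pixel. */
int use_interband(int S_intra, int S_inter)
{
    return S_inter < S_intra;
}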

4.3 Adaptation in Run-Mode and Ternary-Mode
The above gradient-based switching is not only used in selecting the intraband and
interband prediction in regular-mode, but is also modified to be used in enabling
adaptation in the run-mode and ternary-mode. Since the run-mode encodes pixels
directly and the ternary-mode only records the similarity among pixels, there is no
prediction error generated by these two modes. Therefore, we eliminate the term of
error from (6) to obtain an adaptation coefficient S′.
In the run-mode, when run-length is 0, either the pixels in the current band or in
the band difference are selected according to which neighbourhood gives a smaller S′.
The selected pixels are used in the run-mode and a flag is used to indicate this
selection. When run-length is not 0, the previously used pixel values – whether
from the current band or the band difference, are used to keep the continuity of the
run.
In the ternary-mode, the whole neighbourhood of pixels used for ternary-mode
are selected either from the current band or the band difference based on the value
of S′. Since the coefficient S′ can be calculated before receiving the current pixel, it
is guaranteed that the current pixel used for comparing with its neighbouring pixels
is from the same source. This adaptation in run-mode and ternary-mode improves
the spatial and spectral decorrelation performance of our proposed scheme in our
experiments.
We show in Table 2 the effect of using the original band, difference band or the
two combined in the run-mode and ternary-mode. The “adaptation” in the fourth
column means choosing the bands adaptively. The resulting bit rates vary for different images, but on average, using the adaptation technique slightly improves the
compression ratios.
Table 2 Compression ratios comparison of LMMIC using different neighbourhoods in the run-mode and ternary-mode, in bits per pixel

Image      Current band   Difference band   Adaptation
cmpnd1     1.117          1.054             1.057
cmpnd2     1.037          0.972             0.969
cats       1.813          1.822             1.823
water      1.434          1.436             1.442
lena       4.233          4.230             4.233
peppers    3.339          3.363             3.356
bike3      4.274          4.353             4.289
average    2.464          2.461             2.453

5 Context Modelling
Context modeling is to group the prediction residue based on some local features,
named contexts, in order to obtain a lower conditional entropy. In the proposed
scheme, context is formed in a similar manner with CALIC [29] but is simplified to
reduce the memory usage. We take 6 context symbols (W, N, N W, N E, N N, W W )
to compare with the predicted value to obtain a texture pattern t, representing the
local texture feature. Also, to indicated the activity of errors in a context, the coding
contexts are generated with the local gradients dv, dh and a previous prediction
error e of pixel W . The coding contexts are then quantized into 8 levels to form the
coding context indexes. Combining the texture pattern and the coding contexts, a set
of 512 compound contexts are formed with 6 bits texture pattern and 3 bits coding
context indexes. In the case of interband context formation, original pixel values are
replaced by the band difference at the same positions. These contexts are also used
to generate an error feedback to the predictor, which will be discussed in Sect. 7.1.
The 8 coding contexts are used to calculate the occurrence probability of pixels in
the probability estimator presented in Sect. 7.2.
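A sketch of how the 6-bit texture pattern and the 3-bit quantized error-energy index could be packed into one of the 512 compound contexts. The comparison direction and the quantizer thresholds below are illustrative assumptions; the chapter does not list the actual values.

#include <stdlib.h>

/* neigh[] holds W, N, NW, NE, NN, WW; predicted is the predictor output.
 * The error energy dv + dh + |e_w| is quantized to 8 levels (3 bits). */
int compound_context(const int neigh[6], int predicted,
                     int dv, int dh, int e_w)
{
    int texture = 0;
    for (int i = 0; i < 6; i++)
        texture = (texture << 1) | (neigh[i] >= predicted); /* 6-bit pattern */

    static const int thr[7] = { 5, 15, 25, 42, 60, 85, 140 }; /* assumed */
    int energy = dv + dh + abs(e_w);
    int q = 0;
    while (q < 7 && energy > thr[q])
        q++;                                   /* 3-bit coding context index */

    return (texture << 3) | q;                 /* one of 512 compound contexts */
}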

6 Performance Comparison
The performance of LMMIC is presented in this section. We firstly give two examples to show the areas where different modes apply in images. In example 1, Fig. 5b
shows the regions where different modes are performed, compared with the original
image Fig. 5a. Run-mode works on the grey areas, which are smooth and homogeneous; ternary-mode works on the white regions, which often lie on the edge of the
homogeneous regions, some smooth regions or clear edges; regular-mode works on
the dark regions, which are mostly texture or noisy areas. In example 2, we only
apply ternary-mode and regular-mode on image Fig. 6a to emphasize the function
of ternary-mode. In Fig. 6b, white areas indicate where the ternary-mode applies,
while the black areas indicate where the regular-mode applies. We can see that all
the texts including those in the picture are covered by the ternary-mode, as well as
some of the clear edges and homogeneous areas.
To have a quantitative measure of the proportion that the three modes apply on
an image, we count the number of pixels being treated by each mode in Table 3.
The “run” in the second column stands for the number of pixels being coded by the
run-mode. The “ternary” in the third column means the number of pixels satisfying
the entry condition of the ternary-mode, and the “successful ternary” in the fourth
column means the number of pixels being successfully coded by the ternary-mode.
The “successful ternary ratio” in the fifth column is the proportion of the “successful ternary” in the whole image. Table 3 shows that run-mode does not happen often
except in relatively simple images like “bike3-g” and “CMPND1-G”. Ternary-mode
occurs more often but still takes on a small proportion, as shown in the fifth column
of “successful ternary ratio”. But it has a roughly 60% successful rate when comparing the fourth and the third column, which means 60% of the pixels entering the

ternary-mode can be successfully coded by it. This makes it improve the overall bit rates compared to applying the regular-mode alone.

Fig. 5 Example 1: (a) Original image “bike3-g” and (b) image indicating different modes applied

Table 3 Mode counts of LMMIC

Images      Run      Ternary   Successful ternary   Successful ternary ratio (%)
lena        148      8117      4363                 0.017
barb2       7492     10708     5585                 0.013
hotel       2826     16823     11686                0.028
bike3-g     109012   146804    113654               0.158
CMPND1-G    276114   31227     27833                0.071
The experimental results in terms of compression ratios are presented in Table 4.
We choose a set of standard 3-band RGB images, a 4-band CMYK image “park” and
a 7-band LANDSAT 7 image “coastal” from CCSDS (the Consultative Committee
for Space Data Systems). The RGB images include continuous-tone images (“cats”,
“water”, “lena” and “peppers”), compound images (“cmpnd1” and “cmpnd2”) and
synthetic image (“bike3”). We compare the proposed compression scheme with
JPEG2000 [27], intra-band CALIC [29], IB-CALIC [30] and SICLIC [1]. The results of IB-CALIC and SICLIC are extracted from [1]. Some results are absent
due to the unavailability of the programs. JPEG2000 is the current standard for image compression. The results of IB-CALIC are claimed to be one of the best in

the literature in terms of general interband image compression, but not hyperspectral image compression, which is not the scope of our proposed method either. SICLIC is a good trade-off between compression ratio and complexity. Table 4 shows that LMMIC outperforms JPEG2000 and intraband CALIC by 10% and 14%, respectively. It is superior to SICLIC on average, though slightly inferior to IB-CALIC, which has higher computational complexity. Since the interband coding in LMMIC only couples two bands, it has the flexibility of compressing images with any number of bands and easy access to any band.

Fig. 6 Example 2: (a) Original image “CMPND1-G” and (b) image indicating ternary-mode and regular-mode areas

Table 4 Bit rates comparison on selected schemes, in bits per pixel per band

Image     JPEG2000   CALIC   IB-CALIC   SICLIC   LMMIC
cmpnd1    1.44       1.21    1.02       1.12     1.05
cmpnd2    1.30       1.22    0.92       0.97     0.97
cats      1.75       2.49    1.81       1.85     1.82
water     1.41       1.74    1.51       1.45     1.44
lena      4.53       4.40    –          4.46     4.23
peppers   3.41       4.62    –          3.25     3.35
bike3     5.17       4.21    –          4.41     4.29
average   2.72       2.84    –          2.50     2.45
park      5.72       5.39    –          –        5.30
coastal   2.89       2.68    –          –        2.62

7 Hardware Architecture
Hardware amenability is one of the priorities in the design of the proposed scheme.
Therefore, as previously described, the whole procedure, including the prediction
and mode switching, only involves a limited number of additions and shifts, and
memory usage is strictly controlled. In this section we propose the suitable hardware
architecture to support the proposed compression scheme, which can be mainly divided into two parts – the architectures of lossless image modelling and encoding.
The modelling part includes prediction and context modelling, while the encoding part includes probability estimator and binary arithmetic coder. We will discuss
them below in detail.

7.1 Lossless Image Modelling
Lossless image modelling here serves both gray-scale and multispectral images.
The user can specify which category the input image belongs to. The data flow of
the image compression scheme is shown in Fig. 7.
To optimize the speed and hence the throughput, the data flow of the scheme is
designed as two pipelines running in parallel, as shown in Fig. 7. Line 1, indicated
by the flow on the left in blue, operates for the current symbol. It takes in the input
symbol and selects the suitable mode to encode it. The mean of errors and context
index calculated from Line 2 are fed into the multi-mode prediction and probability
estimator. The output from the multi-mode prediction, which is either the “runs” of
the symbols, or the symbol order from the ternary mode, or the prediction error from
the regular mode, is used to drive the probability estimator and arithmetic coder.
Line 2, indicated by the flow on the right in red, works for the next symbol. It takes
the input symbol to update the contexts, and calculates the prediction value and context index under the selected mode for the next symbol. The advantage of dividing
the procedure into two parallel pipelines is, while not violating the sequential constraint, to halve the execution time and hence obtain higher throughput. A summary
of the operations in each pipeline is as follows.


Fig. 7 Data flow of the prediction and context modeling module

Line 1:
1. Select a suitable mode and calculate the prediction error ε = X − (X̂ + error_mean);
2. Update the sum of errors sum and the number of pixels count in each context;
3. Map the prediction error ε to ε̃;
4. Update the coding context index Q;
5. Encode the prediction error with the probability estimator and arithmetic coder.
Line 2:

1. Update contexts with new symbol;
2. Calculate primary prediction value X̂ for regular mode, and evaluate the condition of entering run-mode and ternary-mode;
3. Calculate the texture pattern;
4. Calculate the error energy and the context index;
5. Calculate the mean of the errors error_mean.
During the above process of image modelling, there are two issues worthy of special care, while the rest of the process is relatively straightforward numerical computation.
• There is usually heavy memory usage in image compression algorithms, either
for storing prediction context or coding context, which are defined in Sect. 5. To
minimize the memory usage for prediction context, we only store three lines of
image pixel values in memory. We use three pointers to indicate the line orders.
New input symbols are always stored in line A. Every time when a line is filled
up, the three pointers to each line rotate in such an order that the oldest line is
discarded and the newly formed line is saved.


• As introduced in [29], an error feedback technique is used in the prediction step to
adjust the prediction bias in each context. We adapt this approach in our scheme,
but we calculate the prediction bias based on a different context formation and
provide special treatment for hardware amenability. The mean of errors ε̄(C) in
each context C is assumed to be the most probable prediction offset error, and
hence is a good observation of the bias of the predictor. We improve the prediction
by adding this term. It is calculated by

ε̄(C) = sum/count,    (7)

where sum and count are the sum and occurrence count of errors in the context C,
respectively. The calculation of ε̄(C) requires arbitrary division, which is resource
demanding in hardware implementation. To solve this problem, instead of having
an infinite range, we store the count with only 5 bits (2^5 − 1 = 31). When the
count reaches its maximal value 31, it is halved by right-shifting one bit; meanwhile sum is also halved so as to maintain the mean ε̄(C). For images with 8
bits per pixel, the mapped prediction error ranges from 0 to 2^8 − 1. Therefore
we only need 13 bits (2^5 × 2^8 = 2^13) plus one sign bit to store the sum of errors. Experimental results prove that this rescaling technique slightly improves
the compression ratio by “aging” the observed data. However, division is always
a difficult problem in hardware, especially when the dividend can be as large as
13 bits. To make this division practical, we bound the dividend sum to 10 bits
for two reasons: firstly, experiments on our image test set show that the chance
of the sum being larger than 1023 is less than 0.001%; secondly, extraordinarily
large errors tend not to reflect the true behavior of the context, because prediction errors should be moderately small given an adequate predictor. Therefore,
we can safely clip the sum value to 1023. We use the most significant bits of
the divisor count in the division, with the dividend being rescaled accordingly to
maintain the same result. Consequently, we only need a lookup table of 1 KByte
(2 × (2^10/2) = 1024) to complete the fast division. Although the result of the division
is only an approximation, it does not affect the compression performance in our
experiments.
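A C sketch of the per-context error feedback with the 5-bit count and 10-bit sum bounds described above. The struct and function names are ours, and the plain integer division stands in for the small lookup table used in the hardware version.

#include <stdint.h>

typedef struct {
    int16_t sum;    /* signed sum of prediction errors, clipped to +/-1023 */
    uint8_t count;  /* occurrence count, kept in 5 bits (0..31) */
} ctx_stats_t;

/* Accumulate a new prediction error and return the context mean of (7). */
int context_bias_update(ctx_stats_t *s, int error)
{
    s->sum += (int16_t)error;
    if (++s->count >= 31) {        /* rescale when the 5-bit count saturates */
        s->count /= 2;             /* "ages" the observed data */
        s->sum   /= 2;
    }
    if (s->sum >  1023) s->sum =  1023;   /* bound the dividend to 10 bits */
    if (s->sum < -1023) s->sum = -1023;

    return s->count ? s->sum / s->count : 0;   /* mean of errors in context */
}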

7.2 Probability Estimator and Arithmetic Coding
To encode the prediction errors generated from the image modelling stage, arithmetic coding is chosen in our implementation due to its superior performance compared with
other coding methods in terms of compression ratios. In actual implementation, it is
advantageous to separate the arithmetic coding procedure as probability estimation
and coding process [22]. Probability estimation is to calculate the probabilities of
the data source adaptively and the coding process uses these probabilities to recursively calculate the proportion within the interval [0, 1), which is used to encode the
input symbols. In this section, we briefly introduce the arithmetic coder we choose
in our proposal and present the architecture of the probability estimator.


7.2.1 Arithmetic Coding
Since the probability estimator serves the arithmetic coder, we can decide on the implementation of the arithmetic coder first. A general arithmetic coder handles data
source with multi-symbol alphabets. While providing good performance, it is complex and not easily amenable to hardware implementation. Alternatively, binary
arithmetic coding has become a popular implementation of the arithmetic coding.
It was proposed by Langdon and Rissanen in 1981 [10] and later adopted by the
bi-level lossless image compression standard JBIG. Binary arithmetic coding works
with binary source alphabet (0 and 1) and thereby the cumulative distribution vector
of the alphabet, denoted by pcum, is simply pcum = [0 p0 1], with p0 being the
probability of symbol 0. The cumulative distribution of each symbol is required to
be updated adaptively for each coded symbol in arithmetic coding. As Moffat [13]
and Said [22] pointed out, this task can be much simpler in binary arithmetic coding
due to the simple data structure. Binary coding also avoids implementation of some
resource expensive components such as multiplication, and thus obtains efficiency
gains. Due to the amenability and fast speed of binary arithmetic coder for software
and hardware implementation, we adopt it in our proposed system. For more details
about the binary arithmetic coder we used, the readers are referred to [13, 22]. An
implementation of this binary arithmetic coder is published in previous work of our
group [18].
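For intuition only, the following C sketch shows the interval narrowing at the heart of binary arithmetic coding with the cumulative distribution [0, p0, 1]. It uses floating point and omits renormalization, carry handling and termination, so it is a conceptual illustration rather than the coder of [13, 22] used in our implementation.

#include <stdio.h>

static double low = 0.0, high = 1.0;   /* current coding interval [low, high) */

/* Each bit narrows the interval in proportion to p0, the probability of 0. */
void encode_bit(int bit, double p0)
{
    double split = low + (high - low) * p0;
    if (bit == 0)
        high = split;   /* symbol 0 takes the lower part of the interval */
    else
        low = split;    /* symbol 1 takes the upper part */
}

int main(void)
{
    int bits[] = { 0, 1, 0, 0 };
    for (int i = 0; i < 4; i++)
        encode_bit(bits[i], 0.7);       /* fixed p0 = 0.7 just for the demo */
    /* Any number inside the final interval identifies the coded bit string. */
    printf("final interval: [%f, %f)\n", low, high);
    return 0;
}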

7.2.2 Probability Estimation
Overview As the prediction errors generated by the prediction step have a multi-symbol alphabet, e.g. 2^8 = 256 symbols for images with 8 bits per pixel, they cannot be processed directly by a binary arithmetic coder. We construct a probability
estimator to adaptively calculate the probabilities of symbol occurrence in each context and decompose these probabilities into binary symbols (0 or 1). It enables the
application of a simple and efficient binary arithmetic coder and hence results in
full pipelining and high throughput. This probability estimator is based on the one
presented in [17]. In [17], the probability estimator is optimized for the statistical
lossless general data compression algorithm which includes variable order contexts,
while in our implementation we modified it for the fixed-order contexts of image
compression. In particular, we studied the effect of using different numbers of bits to
store the symbol occurrence count and the initialization of the probability estimator.
We will explain in detail soon below.
In the context modelling described here, we divide the prediction errors into different groups, which we called coding context. We assume that the prediction errors
are independent identically distributed within each context. Therefore, probability
estimation is performed for each coding context.
Working Mechanism of the Context Trees The main part of the probability estimator is a tree structure stored in an SRAM. Generally, each coding context is represented by an (n + 1)-level (n is the number of bits per pixel) balanced binary tree with 2^n


leaves, one associated with each symbol in the alphabet. For instance, a 9-level tree with 2^8 = 256 leaves is used for each context of the prediction errors ranging from 0 to 255. In each tree node, a register is used to store the symbol occurrence count. Initially, all the symbols in the alphabet are assigned a certain probability, equal probabilities in most cases, and the whole probability range sums up to 1. Thus each tree leaf has an initial value representing its probability, and every other tree node holds the sum of the values of its two subtrees. When a new symbol is received, the values of the corresponding tree nodes increase to reflect the change of the probability distribution of symbol occurrence.
Here we demonstrate the working mechanism of the tree for probability estimation in Fig. 8. It shows a simplified binary tree with 3 levels and 4 leaves, which represents a four-symbol alphabet. The number in each tree node denotes the symbol occurrence count. The underlined value on top of each tree leaf is the symbol value represented by that leaf. Firstly, all the trees need to be initialized before being used. In Fig. 8a, all the tree nodes are initialized to 0, except the tree root and the

Fig. 8 Simplified tree structure of the probability estimator


escape node. Escape here means that coding is not successful. It happens when a valid probability for a symbol cannot be found. For example, when an input symbol has not occurred before and hence its tree node value (symbol occurrence count) is 0, the escape event happens and the new symbol cannot be coded directly. We show how escape affects the tree in the rest of Fig. 8, and more discussion of escape follows below. In the initialization, the escape node is assigned 1 to avoid coding failure. The value of the tree root is the sum of the symbol occurrences and the escape events, and hence is 1. We store the tree root value as the total count. It is clear that at this stage, the whole probability range is assigned to escape. In Fig. 8b, a new symbol 0 is received. This symbol has not been seen before, so its tree node is 0. In this case, escape happens and the escape count increases by 1. Meanwhile, the tree leaf for symbol 0 also increases to reflect the symbol occurrence. Correspondingly, the tree nodes that symbol 0 passes when walking down from the second-level tree node increase accordingly. The coordinate below the tree reveals the current probability distribution, where symbol 0 has 1/3 and escape takes 2/3. A similar situation happens in Fig. 8c, where a new symbol 1 is received. Both escape and symbol 1 increase their occurrence counts. This happens for every symbol occurring for the first time after the tree has been initialized to 0. But when a symbol already has a valid probability (non-zero occurrence count), as in Fig. 8d, the input symbol can be coded directly. Figure 8d shows a symbol 1 arriving for the second time. This time only the tree leaf of symbol 1 increases, raising its probability to 1/3. As more symbols are received, fewer escapes happen and symbols are most likely to be coded directly. But there are exceptions, which we discuss in the next two paragraphs.
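As an aid to reading Fig. 8, the following is a small software sketch of the occurrence-count tree for a four-symbol alphabet; the class name and flat-array layout are illustrative, and unlike the hardware the escape count is kept outside the tree here rather than in a dedicated escape node.

class ContextTree:
    """Occurrence-count tree in a flat array: index 1 is the root, node k has
    children 2k and 2k+1, and leaves num_symbols..2*num_symbols-1 hold the
    per-symbol counts."""

    def __init__(self, num_symbols=4):
        self.num_symbols = num_symbols
        self.nodes = [0] * (2 * num_symbols)   # index 0 is unused
        self.escape = 1                        # initialised to 1 (see text)

    def total(self):
        # root value (all symbol occurrences) plus the escape events
        return self.nodes[1] + self.escape

    def update(self, symbol):
        """Record one occurrence of `symbol`; return True if escape happened."""
        leaf = self.num_symbols + symbol
        escaped = self.nodes[leaf] == 0        # unseen symbol -> escape event
        if escaped:
            self.escape += 1
        node = leaf                            # increment leaf and ancestors
        while node >= 1:
            self.nodes[node] += 1
            node //= 2
        return escaped

tree = ContextTree()
print(tree.update(0), tree.total())   # True 3   (Fig. 8b: escape 2/3, symbol 0 1/3)
print(tree.update(1), tree.total())   # True 5   (Fig. 8c)
print(tree.update(1), tree.total())   # False 6  (Fig. 8d: symbol 1 now has 2/6 = 1/3)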
Context Tree Initialization In our proposed scheme, we use 8 trees for the regular-mode and one for the run-mode. Each tree has 2^8 = 256 nodes. For ternary-mode, we use 64 trees, each of which has 4 nodes corresponding to the four options (s1, s2, s3 and “escape”) in ternary-mode. These trees are dynamically updated during the coding process and are thus called dynamic trees. In addition, we use two trees, one of each kind, to represent the “escape” condition. Because these two trees do not change during the coding procedure, they are called static trees. Symbols coded by the static trees do not achieve any compression, as the static trees do not change to reflect any probability distribution changes. Therefore, escape is undesirable. Then why do we still want to initialize the tree nodes to 0 even though this causes more escapes to happen?
We considered two alternative initialization possibilities: (a) initializing all the tree nodes to 1, simply to reduce escapes; (b) since the prediction errors tend to follow a Laplacian distribution [29], initializing the tree nodes in a Laplacian way – assigning higher probability to small symbols and letting the probability decrease along the Laplacian curve as the symbol value grows. Therefore, we carry out the following experiment. We apply the same lossless image compression scheme to the image “lena”. The only difference among these runs is the initialization of the probability estimator. The comparison of compression ratios


Table 5 Performance comparison on different initialization of probability estimator

Initialization    Escape count    Compression ratio (bpp)
0                 697             4.137
1                 229             4.142
Laplacian         560             4.140

and escape counts is shown in Table 5. Although the compression ratios do not vary dramatically under the different initialization methods, the comparison helps us understand the problem. When the trees are initialized to 0, despite the large number of escapes, this option performs best because it ignores the symbols that never or rarely occur and thus reduces the code space. When the trees are initialized to 1, all occurrence counts are set to 1, even for symbols that never appear. This might slightly distort the original histogram of prediction errors. When the trees are initialized with a Laplacian distribution, this actually helps build up the desirable error histogram at the start, but the advantage may soon be outweighed by the possibly incorrect forced occurrence counts of some symbols. Based on these observations, we initialize the occurrence counts of all the symbols to 0.
Choice of Context Tree Node Size Escape takes place when the occurrence count of the input symbol is 0. This happens not only after initialization, but also when the tree node values are rescaled. Since the occurrence count is stored in a register, the size of the register determines its limit, which is the maximum occurrence count. When the maximum occurrence count of a symbol is reached, the occurrence counts need to be rescaled. Rescaling is done by halving the occurrence count, which is easily executed by a right shift. In order to maintain the probability distribution, all the tree nodes need to be rescaled. Consequently, the occurrence counts of symbols that have been received fewer than twice are rescaled to 0, resulting in escape when those symbols occur later. Therefore, the size of the register (i.e. the number of bits) used to store the symbol occurrence count needs to be chosen carefully. We carry out experiments on the image “lena” using the same image compression algorithm but with different numbers of bits for storing the occurrence count. The resulting average compression bit rates are shown in Fig. 9. We can see that the number of bits affects the compression performance considerably. When the maximum occurrence count is too small, more escapes are likely to happen; when the maximum occurrence count is too big, the “aging” effect of the observed data is attenuated. Therefore, we choose 14 bits according to the results of Fig. 9.
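The rescaling step can be sketched as follows, reusing the flat-array layout of the earlier ContextTree sketch; halving every node independently and then rebuilding the internal sums is a simplification of the hardware, which performs the operation lazily through the scale bits described later in this section. Note how counts of 1 drop to 0, re-enabling escape for those symbols.

MAX_COUNT = (1 << 14) - 1          # 14-bit registers, as chosen above

def maybe_rescale(nodes, escape):
    """Halve every count once a register reaches its 14-bit limit."""
    if max(nodes[1:]) < MAX_COUNT and escape < MAX_COUNT:
        return nodes, escape
    nodes = [c >> 1 for c in nodes]            # right shift halves each count
    escape = max(escape >> 1, 1)               # keep escape valid
    half = len(nodes) // 2
    for k in range(half - 1, 0, -1):           # rebuild internal node sums
        nodes[k] = nodes[2 * k] + nodes[2 * k + 1]
    return nodes, escape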
Output of Probability Estimation In order to drive the binary arithmetic coder, the probability estimator outputs three values: the decision bit, the cumulative probability of 0 (denoted cum0) and the cumulative probability of 1 (denoted cum1). The cumulative probabilities of 0 and 1 here mean the cumulative probabilities of going left or going right in the context tree. They can be calculated from the


Fig. 9 Average bit rates under different probability precision, in bpp

Fig. 10 Arithmetic coding using the output of probability estimator

tree node values. Let us recall Fig. 8d. When the symbol 1 arrives, the total count increases from 5 to 6. Since this symbol has appeared before, it goes to the left branch rather than escape. Here it outputs one decision bit, 0. The cum0 value is the value of the current tree node, which is 3, and cum1 is the value of its parent tree node, which is 6 (i.e. cum0 plus the count of the sibling branch). Figure 10 shows how this procedure works and the effect of these outputs. The values are assigned based on the tree in Fig. 8d. The number in parentheses denotes the level of the tree. So at the first level, cum0 = 3, cum1 = 6. Because the decision bit is 0, the probability 1/2 is chosen. Since symbol 1 is 01 in binary, it goes to the left branch in the next level. In the interval selected from the previous level in Fig. 10, the whole range is assigned to 0, so we have cum0 = cum1 = 3. As the decision bit is 0, the interval between 0 and cum0 is chosen. In the last level, symbol 1 goes to the right branch of the tree. So the selected interval is between cum0 and cum1 and its probability is 2/3. The total probability of symbol 1 can be calculated as (1/2) × (2/3) = 1/3. In this way, the probability estimator only needs to output the decision bit, cum0 and cum1 at each clock cycle, and a total of 9 cycles are needed to encode one 8-bit symbol.
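The per-level outputs of this worked example can be reproduced with the following sketch. It assumes one possible tree layout in which the symbol subtree is the left child of the root and escape its right child, so the first decision distinguishes a coded symbol from an escape; the exact layout and escape handling of the real estimator [17] may differ. Here cum0 is taken as the count of the left child and cum1 as the count of the parent, so the binary coder sees P(left) = cum0/cum1 at each level.

# Assumed counts for Fig. 8d: node 1 = root (6), node 2 = coded symbols (3),
# node 3 = escape (3), node 4 = {s0, s1} (3), node 5 = {s2, s3} (0),
# leaves 8..11 = s0..s3 with counts 1, 2, 0, 0. Node k has children 2k, 2k+1.
counts = {1: 6, 2: 3, 3: 3, 4: 3, 5: 0, 8: 1, 9: 2, 10: 0, 11: 0}

def decompose(symbol, bits=2):
    """Return the (decision_bit, cum0, cum1) triple emitted at each level."""
    node, out = 1, []
    decisions = [0] + [(symbol >> (bits - 1 - i)) & 1 for i in range(bits)]
    for bit in decisions:              # first decision: 0 = coded, 1 = escape
        cum0, cum1 = counts[2 * node], counts[node]
        out.append((bit, cum0, cum1))
        node = 2 * node + bit          # walk down the chosen branch
    return out

print(decompose(1))   # [(0, 3, 6), (0, 3, 3), (1, 1, 3)] -> (1/2) * 1 * (2/3) = 1/3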
Architecture of Probability Estimator Figure 11 shows a simplified diagram of
the probability estimator. It is modified from our previous work in [17]. As men-


Fig. 11 Architecture of the probability estimator

tioned earlier, the probability estimator in [17] works for general data compression, while this diagram is modified to work for our proposed image compression. Two SRAM memories are used to store the context trees and the total count for each tree, respectively. The size of the SRAM for the context trees is
Ncon × Stree × Wnode                (8)

where Ncon is the number of contexts, Stree is the number of tree nodes in each tree, and Wnode is the width of the register used to store the information of each tree node. This includes not only the 14 bits used to store the tree node value, but also two scale bits and two reset bits. They are used to control whether the left or right sub-tree of the current tree node needs to be rescaled or reset. With the scale bits and reset bits, the scale and reset operations can be done while the input symbol traverses down the tree, instead of being executed separately and wasting clock cycles. The input symbol, which is the prediction error, is shifted one bit each clock cycle. Its most significant bit is sent to the context tree for probability calculation. Meanwhile, the context index is read in to specify the corresponding context tree. The increment denotes the amount added to each tree node when a symbol passes through it. We set the increment to 1, but it can also be set to larger values if the user wants to accelerate the “aging” effect of rescaling. The middle count and top count are output as cum0 and cum1, whose role in calculating the probability was explained above.
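As a numeric illustration of (8), assuming only the regular-mode trees quoted above (8 contexts of 256 nodes, with 14-bit counts plus two scale bits and two reset bits), the context-tree storage amounts to:

N_con, S_tree, W_node = 8, 256, 14 + 2 + 2    # regular-mode trees only
size_bits = N_con * S_tree * W_node
print(size_bits, size_bits / 8 / 1024)        # 36864 bits, i.e. 4.5 KBytes

The run-mode, ternary-mode and static trees add their own contributions in the same way.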


The probability estimator maps the probability data into a set of binary decisions and cumulative occurrence counts, which enable the use of efficient binary arithmetic coding. It needs 9 clock cycles to code an 8-bit symbol, without using any extra clock cycles for rescaling and resetting. Therefore, it is highly efficient and allows high throughput. Some preliminary implementation results of the regular-mode working on grayscale images are published in [3, 19]. This part of the architecture, together with the binary arithmetic coder, enables the system to process one bit per clock cycle, which translates into a throughput of around 123 Mbits/s on a Xilinx Virtex 4 FPGA. A full implementation of the proposed system will be part of our future work.

8 Conclusions
An original Lossless Multi-Mode Interband image Compression (LMMIC) scheme
is proposed. The concept of segmentation is well ingrained in this scheme to deal
with different regions in the image adaptively. The simple and efficient band shifting
technique and the switching strategy successfully remove the interband redundancy.
Experiments show that LMMIC achieves highly competitive compression ratios and provides the flexibility of compressing any number of bands as well as easy access to any band, features not offered by many other schemes. The complexity of the scheme is strictly controlled and hardware amenability is maintained. A corresponding hardware architecture is also proposed to support the functionality of the proposed algorithm.
Acknowledgements The authors would like to thank the support from EPSRC under grant EP/D011639/1.

References
1. Barequet R, Feder M (1999) SICLIC: a simple inter-color lossless image coder. In: Proc data
compression conf, pp 501–510
2. Benazza-Benyahia A, Pesquet J-C, Hamdi M (2002) Vector-lifting schemes for lossless
coding and progressive archival of multispectral images. IEEE Trans Geosci Remote Sens
40(9):2011–2024
3. Chen X, Canagarajah N, Nunez-Yanez JL, Vitulli R (2007) Hardware architecture for lossless
image compression based on context-based modelling and arithmetic coding. In: Proc IEEE
int system on chip conf, pp 251–254
4. Dragotti PL, Poggi G, Ragozini ARP (2000) Compression of multispectral images by three-dimensional SPIHT algorithm. IEEE Trans Geosci Remote Sens 38(1):416–428
5. Felzenszwalb PF, Huttenlocher DP (2004) Efficient graph-based image segmentation. Int J
Comput Vis 59(2):167–181
6. Hu J-H, Wang Y, Cahill PT (1997) Multispectral code excited linear prediction coding and its
application in magnetic resonance images. IEEE Trans Image Process 6(11):1555–1566
7. Jet Propulsion Laboratory, California Institute of Technology. Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). http://aviris.jpl.nasa.gov/


8. Kayyali MSE. Multispectral technology applications. http://www.articlealley.com/article_1604_11.html
9. Kim J, Fisher JW, Yezzi A, Cetin M, Willsky AS (2005) A nonparametric statistical method for
image segmentation using information theory and curve evolution. IEEE Trans Image Process
14(10):1486–1502
10. Langdon GG, Rissanen JJ (1981) Compression of black–white images with arithmetic coding.
IEEE Trans Commun 29(6):858–867
11. Meyer B, Tischer PE (1997) TMW – a new method for lossless image compression. In: Proc
int picture coding symp
12. Mielikainen J, Toivanen P (2002) Improved vector quantization for lossless compression of
AVIRIS images. In: Proc XI European signal processing conf
13. Moffat A, Neal R, Witten IH (1998) Arithmetic coding revisited. ACM Trans Inf Sys
16(3):256–294
14. Motta G, Rizzo F, Storer JA (2003) Compression of hyperspectral imagery. In: Proc data
compression conf, pp 333–342
15. National Aeronautics Space Administration (NASA): the Landsat program. http://landsat.
gsfc.nasa.gov/
16. National gallery: visual arts system for archiving and retrieval of images. http://users.ecs.
soton.ac.uk/km/projs/vasari/
17. Nunez-Yanez JL, Chouliaras VA (2005) A configurable statistical lossless compression core
based on variable order Markov modeling and arithmetic coding. IEEE Trans Comput
54(11):1345–1359
18. Nunez-Yanez JL, Chouliaras VA (2005) Design and implementation of a high-performance
and silicon efficient arithmetic coding accelerator for the H.264 advanced video codec. In:
Proc IEEE int conf on application-specific systems, architecture processors, pp 411–416
19. Nunez-Yanez JL, Chen X, Canagarajah N, Vitulli R (2007) Dynamic reconfigurable hardware
for lossless compression of image, video and general data content. In: Proc XXII conf on
design of circuits and integrated systems. Invited paper
20. Ratakonda K, Ahuja N (2002) Lossless image compression with multiscale segmentation.
IEEE Trans Image Process 11(11):1228–1237
21. Ryan MJ, Arnold JF (1997) The lossless compression of AVIRIS images by vector quantization. IEEE Trans Geosci Remote Sens 35(3):546–550
22. Said A (2004) Introduction to arithmetic coding – theory and practice. Imaging Systems Laboratory, HP Laboratories Palo Alto
23. Saunders D, Cupitt J (2003) Image processing at the national gallery: the VASARI project.
The National Gallery, Technical Bulletin 14(1):72–85. London, UK
24. Shen L, Rangayyan RM (1997) A segmentation based lossless image coding method for
high-resolution medical image compression. IEEE Trans Med Imaging 16(3):301–307
25. Tang X, Pearlman WA, Modestino JW (2003) Hyperspectral image compression using three-dimensional wavelet coding. Proc SPIE, vol. 5022. SPIE, Bellingham, pp 1037–1047
26. Tate SR (1997) Band ordering in lossless compression of multispectral images. IEEE Trans
Comput 46(4):477–483
27. Taubman DS, Marcellin MW (1996) JPEG2000 image compression fundamentals, standards
and practice. Kluwer, Norwell
28. Weinberger MJ, Seroussi G, Sapiro G (1996) LOCO-I: a low complexity, context-based, lossless image compression algorithm. In: Proc data compression conf, pp 140–149
29. Wu X, Memon N (1997) Context-based adaptive, lossless image coding. IEEE Trans Commun
45(4):437–444
30. Wu X, Memon N (2000) Context-based lossless interband compression – extending CALIC.
IEEE Trans Image Process 9(6):994–1001

Efficient Memory Management for Uniform and
Recursive Grid Traversal
Tomasz Toczek and Stéphane Mancini

Abstract This chapter presents the usefulness of predictive and adaptive caching methods for the traversal of both uniform and recursive 3D grid structures. Recursive data structures are used in several image processing kernels, and their efficient management is one of the challenges in saving silicon area and reducing the power consumption due to data transport. The described architectures greatly reduce the bandwidth requirements by exploiting the spatial and temporal locality of memory accesses during ray shooting in uniform and recursive grids. To maximize the cache efficiency, the original kernel is transformed into a “phase-locked” ray-packet based propagation algorithm. Our results show that well-suited caching strategies can indeed yield significant performance gains during the traversal of both uniform and hierarchical grids. This emphasizes the relevance of semi-general purpose multi-dimensional predictive caches.

1 Introduction
The management of high quantities of data is a challenge for many digital systems.
This problem is getting more and more complex with the increase of the quantity
of memory embedded in digital integrated systems such as System on Chip (SoC).
As an example, the International Technology Roadmap for Semiconductors Consortium predicts that memory will occupy 90% of a circuit in the coming years. Thus, to alleviate the well-known memory wall, it is mandatory to provide efficient memory hierarchies that are optimized jointly with a cache-friendly optimization of applications.
T. Toczek ()
GIPSA-lab, INPG-CNRS, 961 rue de la Houille Blanche Domaine Universitaire-B.P. 46, 38402,
Saint Martin d’Heres, France
e-mail: tomasz.toczek@gipsa-lab.inpg.fr
G. Gogniat et al. (eds.), Algorithm-Architecture Matching for Signal and Image Processing,
Lecture Notes in Electrical Engineering 73,
DOI 10.1007/978-90-481-9965-5_2, © Springer Science+Business Media B.V. 2011


Several image processing algorithms use multi-resolution images to perform tasks such as vision [4], video compression [3] and 3D rendering [1]. Multi-resolution images are used in vision algorithms to integrate some global information at the pixel level: the results at the low resolution constrain the computations at the more detailed levels. In video compression, the motion estimation step also benefits from a multi-resolution pyramid of the input images: the coarse-grain motion vectors at the low resolution guide the computations at higher resolutions, preventing the algorithm from “falling” into local minima. Mip-map images are used in real-time 3D rendering to apply textures at a level of resolution that fits the raster-scan sampling of the scene triangles. These multi-resolution textures speed up the rendering and increase the quality of the rendered images by preventing subsampling (aliasing). The efficient management of multi-resolution and recursive data structures is one of the challenges in designing embedded systems for image processing.
Multi-resolution 3D grids are widely used in applications such as realistic ray-tracing rendering, medical visualization, volume rendering and tomographic reconstruction. These applications extensively use the ray shooting kernel to simulate the
propagation of light in a 3D volume. Ray shooting based algorithms are widely considered as both computationally and bandwidth hungry. They have however the advantage of producing high quality results while being conceptually simple, thereby
being a good example of what modern architectures may be used for. The principle
is to compute ray paths through diversely structured scenes.
In this chapter we will focus on the traversal of both uniform and recursive grids,
which can be seen as either space indexing means or direct volume representations.
On the one hand, space indexing means are used to store a primitive-based scene
(typically a set of textured triangles) in an easily traversable structure, commonly
known as an acceleration structure (AS). The AS traversal returns a list of primitives or even a more complex sub-scene to which some further computations are
performed. Examples of this approach include SAH-based kd-trees, bounding box
hierarchies, and so on [7]. Direct volume representations, on the other hand, involve
partitioning the space into small cubic fragments called voxels, and storing a meaningful value for each voxel, such as color, density, and so on. This is especially
useful in medical visualization and imaging. With no loss of generality, we choose
the later direct volume representation as our performance case study.
This chapter demonstrates the usefulness of the nD-AP Cache architecture (nDimensional Adaptive and Predictive Cache) [12, 15] and the associated methodology for some ray-shooting based applications. The results demonstrate the versatility of the nD-AP Cache, which performs well for the uniform as well as the
recursive grid traversal. To tackle the later aspect, we used a set of small communicating nD-AP Caches to cache part of the scene tree.
To fit the prediction mechanisms, the original ray (line) propagation kernel needs to be transformed to exhibit more temporal and spatial locality. To that end, a “phase-locked” ray-packet based propagation algorithm is proposed. The idea is to enable a virtual loop to synchronize the propagation of a set of rays. Hence, this transformation improves the very low data-reuse ratio of the original iterative algorithm, in which a grid cell is traversed only once.


The proposed memory hierarchies and “phase-locked” propagation algorithm,
both for uniform and recursive grids, are validated by performance measures performed both on an emulation platform and through CABA (Cycle Accurate, Bit
Accurate) simulations. The experiments lead to some improvements of the cache
features so as to adapt the nD-AP Cache to the ray shooting kernel and any processing with similar patterns of memory references.
In Sect. 2, we briefly describe some of the hardware designed so far for memory
management for ray shooting. In Sects. 3 through 6, we present our architecture,
which includes the above mentioned cache and traversal pipelines. Finally, we discuss its performances in Sect. 7 and conclude in Sect. 8.

2 State of the Art
2.1 Dataset Traversal
The iterative parametric methods for scene traversal are the most popular ones. They
are conceptually easy and their implementation is often efficient. Most of them are
variants of the DDA (Digital Differential Analyzer) algorithm adapted to ray tracing
[2], which parametrizes the ray and performs all the computations in the parameter
space thereafter. Figure 1 illustrates the variables used by the DDA on a 2D example.
The algorithm iterates on the cells of a uniform grid, storing the parameters of
the intersection between the ray and the yet-uncrossed cell faces’ planes along each
axis (called tx and ty in Fig. 1). In 3D, the next cell is the neighbor of the current one sharing the face that corresponds to the smallest of the (tx, ty, tz) parameters (that is, ty in our example). This smallest parameter tl (where l ∈ {x, y, z}) is updated by adding Δtl, the parameter difference between the intersection points of the ray and the two faces of a cell orthogonal to the given axis l. In our example, the next cell is the one just above the dark gray one, and the new face intersection parameters are (tx, ty, tz) ← (tx, ty + Δty, tz). The Δtx and

Fig. 1 Geometrical meaning
of the variables used by the
DDA algorithm


Δty parameters are computed once and for all for a given ray at the initialization of the DDA algorithm. The resulting traversed nodes are shown in light gray in Fig. 1.
It should be noted that the DDA algorithm can be adapted to use projective geometry [14], which permits a higher traversal accuracy, and is a good substitute for
floating-point arithmetics. It is the method we used in the architectures described
hereafter.
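For illustration, the following is a floating-point sketch of the DDA traversal described above, in the spirit of [2]; the function and variable names are illustrative, the grid is assumed to have unit-sized cells with the ray origin inside it, and the actual units use projective geometry [14] rather than floating point.

def traverse(origin, direction, size):
    """Return the sequence of grid cells crossed by the ray (Amanatides/Woo-style DDA)."""
    cell = [int(origin[a]) for a in range(3)]          # starting cell
    step, t, dt = [0] * 3, [0.0] * 3, [0.0] * 3
    for a in range(3):
        if direction[a] > 0:
            step[a] = 1
            t[a] = (cell[a] + 1 - origin[a]) / direction[a]
        elif direction[a] < 0:
            step[a] = -1
            t[a] = (cell[a] - origin[a]) / direction[a]
        else:
            step[a], t[a] = 0, float("inf")
        dt[a] = abs(1.0 / direction[a]) if direction[a] != 0 else float("inf")
    cells = []
    while all(0 <= cell[a] < size for a in range(3)):
        cells.append(tuple(cell))
        a = min(range(3), key=lambda k: t[k])          # axis of smallest t
        cell[a] += step[a]                             # move to that neighbor
        t[a] += dt[a]                                  # cross the next face
    return cells

# e.g. traverse((0.5, 0.5, 0.5), (1.0, 0.4, 0.0), 8)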
When performing ray casting, the contribution1 of each traversed cell is taken
into account for the resulting pixel value computation. This is called compositing,
and can be implemented in a variety of ways; for our tests, we used voxel-based
volume rendering, considering each voxel as having uniform density, and integrating
this density along the ray as a compositing rule. We could have used virtually any
front-to-back compositing method instead. Also, the compositing unit can be turned
into a primitive intersection test unit when the grid is used as a space indexing
structure instead of a direct volume representation.
Parametric methods are quite versatile, and very similar techniques can be used
for recursive grid traversal [9], the most well known special case being the traversal
of octrees [20].
Amongst the non-parametric traversal and compositing methods, a very visually
satisfying one consists in sampling the volume along the rays at regular intervals [8,
19]. It is especially efficient in the case of regular grids, where an element can be
accessed in constant time. Aside from the oversampling/undersampling issues that
may arise, this kind of approach tends to perform poorly quality-wise when used for
the projection step of iterative tomographic reconstruction.
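A minimal sketch of this regular-interval sampling alternative is given below; sample() is an assumed callback returning the volume density at a point, and the accumulation is only a crude stand-in for a real compositing rule.

def ray_march(sample, origin, direction, t_max, dt=0.5):
    """Accumulate volume samples taken every dt along the ray."""
    t, accum = 0.0, 0.0
    while t < t_max:
        p = [origin[a] + t * direction[a] for a in range(3)]
        accum += sample(*p) * dt
        t += dt
    return accum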

2.2 Memory Management
Most of the ray shooting dedicated hardware designs with an emphasis on efficient memory management were done in the field of volume rendering. This can be explained by the fact that volume rendering naturally involves very large data sets, up to 1024^3 grids with recent medical appliances for instance. This is why memory bandwidth is likely to become the performance bottleneck of such systems. Therefore, most memory management strategies put a strong emphasis on data reuse and access locality through diverse means.
The cache efficiency of software systems can be optimized thanks to multi-threaded software [6, 11]: each thread deals with a small set of rays and the speed-up comes from the fact that some grid data used by each thread are in the cache memory. This implies that rays are shot in coherent packets. Multi-threading based techniques are limited by task context switches, cache thrashing and the available computing power.
1 Which may be emitted or re-emitted light (rendering), density (PET reconstruction), attenuation
(X-ray based reconstruction), . . . .


The Cube series is an example of regular-grid-sampling-based volume rendering
hardware designed to spare this very bandwidth. While Cube-3 [18] is a regular ray
parallel ray tracing architecture, Cube-4 [19] is based on object-order ray-tracing.
The Cube-4 hardware is designed so that each voxel is fetched exactly once per frame from the central memory. Therefore, it is optimally efficient in terms of memory bandwidth usage, if we assume that every voxel contributes to each frame. However, Cube-4 has a number of severe limitations, one of them being a scene size limited to 256^3 equally sized voxels. Moreover, the very principle of object-order ray tracing upon which Cube-4 was built makes the implementation of perspective rendering impractical; that is why VolumePro, a commercial implementation of Cube-4, only supports isometric rendering. Several other problems were underlined by [17].
VoxelCache [8] is a cache specifically crafted for sampling-based ray casting as well. It is small enough to be implemented on reconfigurable hardware. It uses only a single external memory bus, but has an internal 8-memory-bank organization. This makes it possible to fetch a tri-linearly interpolated sample every cycle. VoxelCache also has a prefetching mechanism, requiring that beams of 4 × 4 coherent rays be shot. Despite the fact that it was designed for uniform grids, VoxelCache was successfully used for full-octree-based volume rendering as well [22]. This however required the use of off-chip SRAM to keep the performance high, and the caching strategy was shown to be inefficient for large scenes due to cache thrashing. In contrast, in our approach we tried to use inexpensive components found on most prototyping boards as much as possible, while focusing on arbitrarily large sparse octree-like acceleration structures, which are much more flexible than full octrees.
Since hardware systems benefit from uniform memory accesses, a solution is to allow only parallel rays, possibly rendering a given volume slice by slice to produce an illusion of perspective. Such a strategy underlies volume rendering on commodity PC textured rasterization hardware [10, 21]. Some parallel rays are sampled at the same interval to provide “plane parallelism”. Perspective volume rendering is then simulated by a perspective transformation of the resulting image. This solution is very popular for visualization because it is fast, but its accuracy is too low for tomographic image reconstruction.
More recently, programmable graphics hardware has been used for volume rendering. It is quite suitable for brute force ray casting through uniform grids [16]: the
built-in texture cache can handle 3D scenes efficiently, and the Single Instruction,
Multiple Thread (SIMT) programming model is suited for casting rays in beams, ensuring reference locality. There are also other ways to perform volume visualization
on GPGPUs, for instance through the well known marching cubes algorithm [13],
which converts a volumetric scene into polygons before displaying it. Nowadays,
GPUs can afford not only to display the generated polygons, but also to build them
from the volumetric data in the first place [16], which used to be done on the CPU.
The drawbacks of GPUs include a high power consumption, small per-core on-chip memory quantities making them unsuitable for recursive algorithms with high stack-space requirements, and a loss of efficiency in case of diverging code paths between
threads, bad load distribution, and so on. Implementations of ray tracing algorithms
are rather prone to such pitfalls.


Fig. 2 An overview of a
complete rendering system

3 System Architecture
The architecture we propose is composed of two main parts: an adaptive and predictive cache for either uniform or recursive grids and a traversal unit, capable of
determining the sequence of grid cells traversed by each ray of a coherent beam.
The generated sequences are meant to be communicated to a compositing unit.
Since we chose visualization as an application, once a ray of the beam ends, its
accumulated value may be written to a frame buffer. Figure 2 presents an overview
of a whole rendering architecture as implemented on a prototyping board.
The traversal unit and the cache were designed and synthesized with the Xilinx Virtex 4 technology as a target. Depending on the number of units and the size of the FPGA, the results were obtained either by actual on-board runs or by CABA simulations.

4 The nD-AP Cache
The 3D-AP Cache is an instance of the nD-AP Cache which aims at caching multidimensional data as originally described in [15]. Also it performs pre-fetching by
estimating the future zones of data the processing unit is supposed to fetch. As
shown Fig.3(a), the nD-AP Cache provides a virtual interface to the computing unit
that issues multi-dimensional indexes in the data structure. The nD-AP Cache performs both the mappings between indexes and the external memory addresses and
the internal memory addresses.


Fig. 3 The 3D-AP Cache aims at performing prefetching in multi-dimensional arrays

The pre-fetching mechanism relies on tracking the indexes along each dimension. The trackers try to estimate the zones of indexes the computing unit may fetch from an analysis of the past fetches and a prediction model. The nD-AP Cache architecture is modular and several kinds of trackers are available. In the following, we consider a first-order SC tracker whose prediction model assumes that the indexes of the fetch sequence evolve as a fast displacement superimposed on a low-speed displacement. The trackers compute the estimated means of the indexes on each axis and request the control unit to grab a new zone of data when the estimated means cross a guard zone. This mechanism ensures that the new zone is uploaded before the references reach the cached zone boundaries, as demonstrated in [12] (see Fig. 3(b)). This prefetching is effective when the 3D-AP Cache is tuned in such a way that the cut-off frequencies of the estimators and the different thresholds fit the average speed of the indexes (v), their local amplitude (Al) and the background memory latency (see [12] for more details). Above a threshold latency the cache efficiency drops, but the cache parameters may be set to fit a given latency.
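The tracking idea can be illustrated in one dimension as follows; the class name, the first-order estimator and the guard-band test are a simplified stand-in for the tracker variants of the actual nD-AP Cache [12, 15], which runs one tracker per dimension.

class Tracker1D:
    """Low-pass filter the accessed indexes and recenter the cached zone when
    the estimated mean crosses a guard band near the zone boundaries."""

    def __init__(self, zone_size, guard, alpha=0.1):
        self.zone_size, self.guard, self.alpha = zone_size, guard, alpha
        self.mean = 0.0
        self.zone_start = 0

    def access(self, index):
        """Update the index estimate; return the new zone start if a recentred
        zone should be prefetched, else None."""
        self.mean += self.alpha * (index - self.mean)   # first-order estimator
        low = self.zone_start + self.guard
        high = self.zone_start + self.zone_size - self.guard
        if low <= self.mean <= high:
            return None
        self.zone_start = int(self.mean) - self.zone_size // 2
        return self.zone_start

# e.g. tracker = Tracker1D(zone_size=64, guard=8); tracker.access(i) per fetch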

5 Uniform Grids
5.1 Uniform Grid Traversal
The traversal pipeline for regular grids is shown in Fig. 4. The RCPG-U unit (Ray
Casting in Projective Geometry-Uniform grid) is made of a traversal unit connected
to a 3D-AP Cache. The Line integral unit implements the compositing processing to
compute sinograms in tomography applications. The traversal unit gets parameters
from the PowerPC processor and manages the phase-locked propagation of a beam
of rays. As can be seen, the memory references to the volume do not depend on


Fig. 4 RCPG-U pipeline and cache interface

the fetched data, and could theoretically be determined ahead of time. We nonetheless chose not to exploit this property, so as not to increase the hardware complexity of the traversal unit. This allows us to test the behavior of the cache while providing it with no clues regarding the future accesses.
In order to obtain good performance even when caching only very small parts of the volume, we rely on “phase-locked” traversal. The phase-locked propagation enables a virtual loop over the traversed cells. Indeed, as shown in Fig. 5, the RCPG-U

Fig. 5 The phase-locked propagation of a beam of rays improves the spatial and temporal locality
of grid traversal


algorithm is synchronized over a set of rays to propagate them together along a main
direction, orthogonal to a phase plane. This direction is collinear to the major axis
which is the closest to the overall direction of the beam of rays, and is pre-computed
at the same time the initial states of the rays of the beams are generated (typically, by
a hard or soft processor). For each ray, a step of the RCPG-U algorithm returns the
next cell to cross. This is iterated while the resulting cell remains in the phase plane.
When all the rays of the set are out of the phase plane, then the phase is updated
to the next one. The process loops until all the rays exit the volume. The phase acts
as a virtual loop, the innermost one being the ray index. Since it is known that the
phase moves in only one direction and that a lot of consecutive accesses will request
cells sharing the same phase, we can afford to cache only a very narrow slice of the
scene along the phase axis. Along the two other axes, the cached zone needs to be
just broad enough to contain all the rays of the beam.
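A software sketch of this phase-locked loop is given below; step() is assumed to perform one traversal step for a ray (returning the next cell, or None when the ray leaves the volume), visit() stands for the cache access and compositing, and the phase is assumed to increase by one slab at a time, so only the loop structure is meaningful here.

def propagate_beam(rays, step, visit, phase_axis, volume_size):
    """rays: list of mutable ray states with a 'cell' entry (None once the ray
    has left the volume)."""
    phase = 0
    while rays and phase < volume_size:
        for ray in rays:
            # advance this ray while it stays inside the current phase slab
            while ray["cell"] is not None and ray["cell"][phase_axis] == phase:
                visit(ray["cell"])
                ray["cell"] = step(ray)
        rays = [r for r in rays if r["cell"] is not None]   # drop exited rays
        phase += 1                                           # next phase slab
    return rays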

5.2 Uniform Grid Caching
The experiments on Ray Casting have raised the need for more efficient tracking mechanisms and new cache behaviors. To manage different classes of fetch sequences, multi-mode trackers implement several tracking mechanisms together, which may be dynamically selected. The mapping of fetch indices to the internal cache addresses can also be dynamically chosen to fit different cache sizes. The trackers are joined together to allow more complex cache zone displacements: a displacement request from a tracker makes the other trackers re-center on their estimated centers. Furthermore, the user module can now select the priority of misses over cache updates.
The dynamic selection of the priority of misses is especially efficient when the misses are served faster than the uploading of new zones into the cache. At first sight, updates of the cached zone should have a higher priority than misses, because this prevents the user module from continually requesting misses if the trackers are too slow. But fetches can evolve differently along each dimension, and misses can have different priorities depending on the cache geometry. When the user module requests a high-priority miss, the cache update is interrupted, the miss is served and the update then resumes.
The phase-locked propagation of rays benefits from these improvements. The multi-mode tracking is necessary because the virtual loop on the propagation direction can be efficiently tracked by a linear tracker, while the other dimensions are tracked by statistical trackers. The misses have high priority along the directions perpendicular to the phase direction and low priority along the latter. Indeed, the phase increases (or decreases) uniformly faster than the other directions in the average case. The worst case is when the rays have a 45-degree angle with all or some of the main axes.


6 Recursive Grids
6.1 Caching the RG Data Structure
We generalize the above described pipeline and cache to the traversal of a generalization of octrees [5] we call recursive grids (for the lack of an established name).
As their name somewhat implies, recursive grids are sparse hierarchical structures,
and therefore induce a dependency between the memory access sequence and the
data fetched from memory. Predictive prefetch mechanisms become therefore unavoidable in order to reach acceptable performances.
After a description of recursive grids, we will show how 3D-AP Caches can handle them, and which access patterns should be used to maximize caching efficiency.

6.1.1 Recursive Grids
A 2^n × 2^n × 2^n recursive grid is a tree with the following characteristics:
• each node is a cube
• each node has either a datum, or 2^(3n) equally-sized children
As a consequence, every node of the recursive grid has the same size as any other
node on the same depth. We call the leaf nodes voxels. Figure 6(a) shows a 4 × 4 × 4
recursive grid, with two non-voxel children at the root node. It is clear that, under
our formulation, octrees are 2 × 2 × 2 recursive grids.
2^n × 2^n × 2^n recursive grids, especially for “moderately large” values of n such as 2 or 3, hold several interesting properties that few other hierarchical acceleration structures do. While still having the benefits of sparse hierarchical structures, they also offer more regularity than most of them, allowing more efficient caching. It can be noticed that recursive grids are a subset of adaptive grids [9], and hence inherit
most of their advantages when it comes to their traversal.
For our tests, we chose to encode a 4 × 4 × 4 recursive grid by an array of at
most 32768 nodes, the node 0 being the root. The exact coding of a node is itself

Fig. 6 Recursive grids and their phase-locked traversal


that of an array of 64 words of 16 bits, each coding for a child (a 15-bit datum and a 1-bit flag determining whether the child is subdivided or not). The datum a child contains is either a density or the index of the child node.
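A sketch of this node encoding is given below; whether the subdivision flag sits in the most significant bit is an assumption of this sketch, as are the function names.

def child_entry(grid, node_index, child_index):
    """grid: flat list of 16-bit words, 64 words per node, node 0 = root."""
    word = grid[64 * node_index + child_index]
    subdivided = (word >> 15) & 1        # assumed flag position
    datum = word & 0x7FFF                # low 15 bits
    return subdivided, datum

def lookup(grid, node_index, child_index):
    subdivided, datum = child_entry(grid, node_index, child_index)
    if subdivided:
        return ("node", datum)           # datum is the index of the child node
    return ("density", datum)            # leaf child: datum is a density value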

6.1.2 RG Cache
The Recursive Grid (RG) Cache, described in Fig. 7, aims at caching a part of the recursive grid by exploiting the spatial coherency of references. Furthermore, it provides a virtual interface to the processing unit. This means that the processing unit issues a 3D coordinate (x, y, z) together with a resolution level and the cache provides the corresponding datum, so that the processing unit does not have to manage the tree structure. The RG Cache uses the nD-AP Cache as a basic block.
The proposed strategy is to cache each level of resolution with an nD-AP Cache
and to perform prefetching in the tree. Indeed, when a reference occurs, it is likely
that the next reference is at a close coordinate, either at the same, upper or lower
resolution. Each nD-AP Cache is in charge of tracking references at a resolution
and the neighboring ones.
The tree manager (TM) unit returns parts of the scene requested by the nD-AP
Caches and maintains a coherent state of the caches at different resolutions. Each
time a cache requests a part of the scene, the TM reads the corresponding data at
the upper level to determine if the requested block is a leaf or a child node. In the latter case, the TM fetches the data at the obtained address to fill the nD-AP Cache. As a consequence, the nD-AP Cache is slightly modified to allow the TM to read an nD-AP Cache concurrently with reads at the processing unit interface. Also, the cached zone at level n has to lie inside the cached zone at level n − 1 to maintain cache coherency. Without this constraint, if the level-n cache requested a part outside the zone held in the level n − 1 cache, it would be too slow to get the data by traversing the tree from the root node.
Each of the caches is optimized to manage data at its level of resolution. The size
of embedded cache memories fits the level of resolution to save space. For example,
because the level 1 cache contains only 4 × 4 × 4 = 64 data, it doesn’t need trackers
and has a simplified control management.

6.1.3 Improving Reference Locality
Just like we did for uniform grid traversal, we use a phase-locked propagation principle in order to keep the accesses coherent and the cached zone minimal. More specifically, once the phase axis is chosen, we propagate rays in a way that ensures the cell accesses during traversal are ordered by their coordinate on the phase axis. An example of such an access sequence is given in Fig. 6(b). Section 6.2.2 gives more details as to how this can be implemented. Just like previously, the cached zone should be narrow along the phase axis.

Fig. 7 Recursive Grid (RG) Cache



Fig. 8 Architecture of the
recursive traversal unit

6.2 Recursive Grid Traversal
We implemented a hardwired traversal unit for efficient ray shooting through recursive grids. It works on beams of up to 256 rays. Figure 8 shows its structure. Its
three main parts are:
• the phase-locked beam propagation unit, which determines the order in which grid nodes are fetched so as to minimize the probability of cache misses; to do so, it holds the states of all the rays of the current beam in such a way that it is able to ensure they all propagate at the same speed along the phase axis
• the RG Cache interface, which performs cache requests while pipelining ray parameters linked to the request
• the neighbor finding unit, which determines the next node a ray should access and
its updated parameters, based on the response from the cache and the former ray
parameters
The cache interface simply contains two ray state holding FIFOs, which minimize the impact of small pipeline bubbles. Therefore, in the rest of this section, we
will focus on the two other components.

6.2.1 Neighbour Finding Unit
We use a slightly modified version of the DDA algorithm for neighbor finding, in
a fashion very similar to that presented in [2]. As we need to be able to vertically
traverse the tree as well, we adapted the principles exposed for octrees in [20] to 2^n × 2^n × 2^n recursive grids, by considering each of our nodes as an n-level perfect

ray_state.r ← unitary direction vector of our ray
ray_state.t ← current cell entry point parameter
ray_state.(tx, ty, tz) ← border intersection parameters
ray_state.(Δtx, Δty, Δtz) ← parameter increments
ray_state.(px, py, pz) ← current cell absolute position
ray_state.depth ← depth of the current cell
max_depth ← maximum depth of the tree

Listing 1 Variables characterizing a ray state

octree. On a side note, this approach works for adapting pretty much any algorithm working on octrees to recursive grids, without increasing its asymptotic cost.
To sum up, let us consider that a ray propagation state is fully characterized by the variables of Listing 1. The “parameter increments” are the differences between the parameters of the intersection points between the ray and two opposite faces of our cell, for each of the three possible such pairs of faces. The absolute position of the current cell is given in units corresponding to the size of cells located at a given maximum depth (the constant max_depth). If this depth is 6, for instance, the maximum detail level of a 4 × 4 × 4 recursive grid will be the same as that of a 4096^3 uniform grid (since (2^2)^6 = 4096).
Let’s suppose without loss of generality that our ray propagates in the positive
direction along each axis.2 Our neighbor finding algorithm is then summarized by
the pseudo code of Listing 2. Basically, our traversal unit advances to the next cell if the current cell is a leaf (this takes one cycle), or dives further if it is an internal node (n cycles for 2^n × 2^n × 2^n grids). Figure 9(a) presents the unit’s organization. As the diving and advancing modes are mutually exclusive, there is hardware reuse between them which does not appear on this figure for the sake of clarity.
One should note that the input and output data of the neighbor finding unit may grow moderately large depending on their exact coding. We call ray propagation state the structure composed of the variables seen in Listing 1, with the exception of the direction vector of the ray (which does not matter for the traversal, assuming all the other variables are known). Assuming max_depth = 5, n = 2, and 32 bits per parameter, we need 260 bits per ray state. For each ray, those parameters need to be initially computed from the ray geometry before they are fed to the propagation unit. This involves obtaining the intersection point parameters between the ray and each of the faces of the root node cube.

6.2.2 Phase-Locked Ray Beam Propagation
The phase-locked propagation unit (PPU), as shown in Fig. 9(b), manages the indexes of rays to enter the propagation pipeline. The indexes of active rays are stored
2 Indeed, i