Event-Driven Message Passing

and

Parallel Simulation of Global Illumination

Tomáš Plachetka

2003

A dissertation submitted to the Faculty of Computer Science, Electrical
Engineering and Mathematics of the University of Paderborn in partial
fulfillment of the requirements for the degree doctor rerum naturalium
(Dr. rer. nat.)

Acknowledgements

There are many people who supported this work. I can mention only a few of

them here but I am thankful to them all. Firstly I would like to express my

sincerest thanks to my advisor, Prof. Dr. Burkhard Monien from the University

of Paderborn, who was always very understanding when I was floating midway

between science and engineering and who always gave me enough freedom in

making decisions about what to do next.

I would like to thank Prof. Dr. Branislav Rovan and Prof. Dr. Peter Ružička

from Comenius University in Bratislava, who turned my attention to parallel and

distributed computing a long time ago. I would also like to thank Dr. Andrej

Ferko, Dr. Ľudovít Niepel and Dr. Eugen Ružický from Comenius University,

who showed me that computer graphics is an interesting area of research where

parallel processing can be very helpful.

I was lucky to meet many other people who would never say no when I

needed to discuss problems or ideas even though they did not directly match

their own research interests. In the last few years I consulted mostly with Dr. Ulf-

Peter Schroeder, Axel Keller, Prof. Dr. Friedhelm Meyer auf der Heide, Dr. Olaf

Schmidt, Dr. Jürgen Schulze, Dr. Thorsten Falle, Dr. Rainer Feldmann, Dr. Ulf

Lorenz and Dr. Adrian Slowik. Thank you.

My further thanks go to all my colleagues at the University of Paderborn,

Comenius University in Bratislava, Sheffield Hallam University, to the people I

worked with on various projects, and to my students and technical staff. They

all contributed to making our working days a joy. Special thanks to Geraldine

Brehony for the proofreading of a great part of this text.

I am obliged to all my friends who did not forget me when I was spending

more of my spare time with books and computers than with them. Finally, my

deepest gratitude goes to my whole family, especially to my parents and to my

little sister, for their love and patience.

Contents

1 Introduction
    1.1 Current parallel programming standards
        1.1.1 Polling in non-trivial parallel applications
    1.2 Photorealistic image synthesis
        1.2.1 Measures of photorealism
        1.2.2 Photorealistic rendering systems
        1.2.3 Light phenomena and their simulation
    1.3 Outline of this thesis

2 Event-driven message passing
    2.1 Non-trivial parallel applications
    2.2 Development of parallel programming
        2.2.1 Occam programming language
        2.2.2 Transputer
        2.2.3 Occam, Transputer and non-trivial parallel applications
    2.3 Current message passing standards: PVM and MPI
    2.4 Point-to-point message passing in PVM and MPI
        2.4.1 Message assembling and sending
        2.4.2 Message receiving and disassembling
    2.5 Unifying framework for message passing
        2.5.1 Components of the message passing framework
        2.5.2 Application process
        2.5.3 Basic message passing operations
        2.5.4 Message passing system
        2.5.5 Language binding
        2.5.6 Operation binding
    2.6 Threaded non-trivial PVM and MPI applications
        2.6.1 Threads and thread-safety
        2.6.2 Polling in threaded non-trivial PVM and MPI applications
        2.6.3 Polling in communication libraries
        2.6.4 Limits of active polling
        2.6.5 Previous work related to thread-safety of PVM and MPI
        2.6.6 Quasi-thread-safe PVM and MPI
        2.6.7 Towards a complete thread-safety of PVM and MPI
    2.7 TPL: Event-Driven Thread Parallel Library
        2.7.1 Concept
        2.7.2 Process startup and termination
        2.7.3 Thread management
        2.7.4 Message passing
        2.7.5 Message handling and message callbacks
        2.7.6 Message packing and unpacking
        2.7.7 Error handling and debugging
        2.7.8 Flow control
    2.8 Efficiency benchmarks
        2.8.1 ONE-SIDED THREADED PINGPONG
        2.8.2 SYMMETRICAL THREADED PINGPONG
        2.8.3 Summary of benchmarking results
    2.9 Conclusions
        2.9.1 Overlapping of Communication and Computation

3 Global illumination
    3.1 Physics of light
    3.2 3D modeling
        3.2.1 Modeling of colour spectrum
        3.2.2 Modeling of surface geometry
        3.2.3 Modeling of surface materials
        3.2.4 Modeling of light sources
        3.2.5 Modeling of camera
    3.3 The global illumination problem
        3.3.1 Rendering equations
    3.4 Approaches to the global illumination problem
        3.4.1 Direct methods
        3.4.2 Approximation methods
    3.5 Conclusions

4 Ray tracing
    4.1 The basic ray tracing algorithm
    4.2 Sequential optimisation techniques
        4.2.1 Bounding volumes
        4.2.2 Bounding slabs
        4.2.3 Light buffers
    4.3 Persistence of Vision Ray Tracer
    4.4 Parallel ray tracing
        4.4.1 Existing approaches
        4.4.2 Image space subdivision
        4.4.3 Setting of parameters in the perfect load balancing algorithm
        4.4.4 Distributed object database
        4.4.5 Experiments
        4.4.6 Further extensions and improvements
    4.5 Conclusions

5 Radiosity
    5.1 Southwell relaxation
        5.1.1 Shooting radiosity algorithm
    5.2 Form factor computation
        5.2.1 Monte Carlo form factor computation
    5.3 Discretisation of surface geometry
    5.4 Illumination storage and reconstruction
    5.5 Energy transfer
        5.5.1 Shooting radiosity algorithm using the ray tracing shader
    5.6 Visualisation
    5.7 Experiments
        5.7.1 Form factors
        5.7.2 Experiments with the box scene
        5.7.3 Experiments with large scenes
    5.8 Conclusions

6 Summary
    6.1 Towards portable 3D standards

A MPI progress rule tester

B Threaded pingpong benchmark
    B.1 TPL 2.0
    B.2 PVM 3.4
    B.3 MPI (MPI 1, MPI 2)

List of figures

Bibliography

Chapter 1

Introduction

One of the goals of computer graphics is photorealistic image synthesis. Many
applications require photorealistic visualisation of three-dimensional (3D)
environments which often do not actually exist. Film production, computer game
production and architectural design are industrial areas in which photorealism
is very important, and it is thanks to these areas that computer graphics is
flourishing today.

The requirements on the quality and speed of visualisation of 3D environments
strongly depend on the application. An engineer observing a flow of air around a
wing is more interested in the visualisation of air speed and air pressure than
in a photorealistic image of the wing. A mechanical engineer designing a screw
is, most of the time, more interested in a simple flat-shaded projection of the
screw on the computer screen than in a visualisation with reflections and
shadows.

Even when photorealism is desirable (e.g. in movies or in computer games),

there are several levels of it. The choice of an appropriate rendering engine is

always a matter of compromise. There is no “best” method that is suitable for

everyone. This thesis concentrates on photorealistic image synthesis. Throughout

this thesis, rendering denotes the process of creating images and rendering system

(or rendering engine) denotes a system which implements this process. Images

rendered by a rendering system are visualisations of a 3D environment. The

process of rendering solves the global illumination problem.

Although all rendering algorithms used in computer graphics compute only an
approximation of real illumination, they are all computationally expensive. The
simplest ones (which compute either no illumination or only a very crude
approximation of it) are implemented in the hardware of graphics cards. More
sophisticated rendering algorithms are implemented in software and use the
hardware of graphics cards only in their final visualisation phases (if at
all). This thesis deals with the parallelisation of photorealistic rendering
algorithms, which is one way of speeding them up.

The following section sketches the problems of efficient programming of
parallel applications using the available standards. The next section
informally explains how photorealism is measured and how photorealistic images
are computed. Formal definitions are given later. An outline of this thesis can
be found in the last section of this chapter.

1.1 Current parallel programming standards

Two standards for the support of programming of parallel applications are avail-

able today: Parallel Virtual Machine (PVM) [GBD+94] and Message Passing In-

terface (MPI) [MPI94], [MPI97], [Gro02]. PVM and MPI provide an application

programmer with an abstract layer of message passing, which hides differences

between various parallel system architectures. Both PVM and MPI assume that

a parallel application consists of (sequential) processes that communicate using

message passing. The processes do not share any memory.

The performance of the PVM and MPI implementations is measured using

a set of benchmarks, by reporting statistics on the measured throughput and

latency of message passing, speed of message assembly and decoding etc. The

performance of PVM and MPI can be compared to the performance of lower-

level communication libraries on some benchmarks. The loss of performance due

to the use of PVM or MPI on top of a lower-level communication library

is usually not very high. Unlike the lower-level libraries, PVM and MPI can

be (and have been) ported to practically all architectures without changes in

their application programming interfaces. The porting of applications that build

on PVM or MPI is therefore much easier (only compiling and linking is usually

needed). Portability at the source code level outweighs the slight performance

loss. However, the set of benchmarks is not complete. A benchmark is missing

which would reflect the needs of non-trivial parallel applications.

1.1.1 Polling in non-trivial parallel applications

Contemporary implementations of PVM and MPI do not allow for the efficient
implementation of many important parallel applications: shared-memory
simulation libraries, parallel databases, media servers, parallel simulations
of global illumination and others. Although these applications look very
different, two independent activities can always be observed when focusing on
one process:

• T1: a computation with an occasional communication

• T2: servicing of requests coming from other processes

We call applications with this pattern of behaviour non-trivial.[1] When
explaining to non-experts what non-trivial applications are, we use the
following analogy:

[1] All non-trivial parallel applications belong to the class of irregular
parallel applications. Communication patterns in irregular applications cannot
be predicted.

Imagine a large building with many windows. All the windows must

be cleaned. There are, say, four workers to complete the task. If

cleaning the windows was their only responsibility, each worker would

simply choose a dirty window, clean it and then look for another dirty

window.

In addition to cleaning the windows, the four workers must build a

brick wall around the corner. The wall must be four layers high and

each worker is responsible for building one whole horizontal layer—

worker one lays bricks on the ground (layer one), worker two lays

bricks on top of the bricks in layer one etc. Notice the dependencies

between the workers. For instance, worker two cannot begin until

worker one has first laid at least two bricks on the ground.

Both tasks, cleaning the windows and building the brick wall, must be

accomplished as soon as possible. The laying of the bricks clearly has

a higher priority than cleaning the windows because a worker laying

bricks creates work for the workers who are responsible for the higher

layers.

There are several ways in which cleaning the windows can be interleaved

with building the brick wall. Here is an example of an inefficient one.

All workers begin with cleaning the windows. From time to time,

each worker interrupts the cleaning of the windows and walks to the

wall to see whether he can add a brick to his layer. If so, he adds

a brick and returns to cleaning the windows. If not, he immediately

returns to cleaning the windows. This is called active polling (or busy

waiting). Obviously, if “from time to time” means often (say, every

minute), then all the workers waste time and energy by walking

to the wall (even when there is no work for them there) and back

to the windows. If “from time to time” means seldom (say, once an

hour), then a worker who has just returned to cleaning the windows

will continue cleaning them for the next hour—even though in the

meantime there may be more important work to be done on the

brick wall.

A more efficient work organisation would keep the workers busy with

laying the bricks whenever possible. When a worker who is laying

bricks creates work for some other worker, he should shout at the other

worker to stop cleaning the windows and come lay bricks instead.

When a worker who is laying bricks cannot continue laying them, he

should return to cleaning the windows and should keep cleaning the

windows until he is called to the wall again. Such a work organisation

is called demand-driven.
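
The trade-off described in the analogy can be made concrete with a little arithmetic. The following sketch is only an illustration under assumed numbers: the function name polling_cost and the half-minute cost of one check (walking to the wall and back) are hypothetical and do not come from any measurement. It computes an upper bound on the fraction of time lost to checks that find nothing, and the longest time a request may wait for the next check.

```python
def polling_cost(poll_interval, check_cost):
    """Return (overhead_fraction, worst_case_delay) for periodic polling.

    poll_interval -- time of useful work between two checks (assumed constant)
    check_cost    -- time spent on one check, e.g. walking to the wall and back
    """
    if poll_interval <= 0 or check_cost < 0:
        raise ValueError("poll_interval must be positive, check_cost non-negative")
    # One cycle consists of useful work followed by one check; this is the
    # share of the cycle that is wasted when the check finds nothing to do.
    overhead_fraction = check_cost / (poll_interval + check_cost)
    # A request arriving just after a check waits until the next check starts.
    worst_case_delay = poll_interval
    return overhead_fraction, worst_case_delay


# Hypothetical numbers in minutes: a check every minute versus once an hour.
for interval in (1.0, 60.0):
    overhead, delay = polling_cost(interval, check_cost=0.5)
    print(f"interval={interval:5.1f}  overhead={overhead:.3f}  worst delay={delay:.1f}")
```

Under these assumptions, checking every minute wastes up to a third of the working time, whereas checking once an hour wastes less than one percent but lets a brick wait for up to an hour. No choice of interval removes both costs, which is why a demand-driven organisation is preferable.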

PVM and MPI are highly optimised for the performance of a single task

(cleaning windows or building a brick wall) by several workers. However, active

polling must be used when two (or more) tasks are combined. Our goal was to

understand where active polling originates and how it can be avoided.
The problem is related to a lack of thread-safety in existing PVM and MPI
implementations and to the imprecise semantics of asynchronous communication in
the MPI Standard.
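
A demand-driven organisation of the two activities T1 and T2 can be sketched with two threads inside one process. The sketch below is a minimal illustration in Python and is not PVM, MPI or TPL code: the queue stands in for the message passing system, a helper thread stands in for the remote processes, and all names (run_process, inbox, the doubling of a request as its "handling") are assumptions made for this example. The point is only that the servicing thread blocks until a request arrives instead of being polled from within the computation.

```python
import queue
import threading


def run_process(work_items, incoming_requests):
    """Run T1 (computation) and T2 (request servicing) without active polling."""
    inbox = queue.Queue()   # stands in for the message passing system
    served = []

    def service_requests():
        # T2: sleeps inside get() until a request arrives, so no time is
        # spent asking "is there anything for me?".
        while True:
            request = inbox.get()
            if request is None:           # sentinel: no more requests
                break
            served.append(request * 2)    # hypothetical request handling

    def remote_processes():
        # Stand-in for other processes sending requests to this one.
        for request in incoming_requests:
            inbox.put(request)

    server = threading.Thread(target=service_requests)
    sender = threading.Thread(target=remote_processes)
    server.start()
    sender.start()

    # T1: the main computation never interrupts itself to check for requests.
    total = sum(item * item for item in work_items)

    sender.join()
    inbox.put(None)
    server.join()
    return total, served


if __name__ == "__main__":
    print(run_process([1, 2, 3], [10, 20]))
```

This structure relies on the communication layer being usable from more than one thread at a time, which is exactly the property whose absence in PVM and MPI implementations forces active polling.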

1.2 Photorealistic image synthesis

This section gives a brief introduction to the computer synthesis of photorealis-

tic images. Before arriving at a design of photorealistic rendering systems and

algorithms implemented in the systems, several philosophical questions must be

answered.

1.2.1 Measures of photorealism

Why do we say that some images look realistic whereas others do not? An

important aspect of measuring the degree of realism involves human perception.

A healthy human eye together with the nature around it makes a perfect rendering

system. The images produced by this system are real. If we replace the human

eye with a good photographic camera, we get another rendering system which is

similar to the original one in the sense that the images produced by this system

are indistinguishable (by a human eye) from the real images. We might replace

both nature and the human eye with a piece of software and compute the images

on a computer, and the images might still be indistinguishable from the real

ones! A photorealistic rendering system, no matter how it works, should produce

images that a human eye cannot distinguish from real (or photographic) images

of a real environment. The harder it is for a human eye to distinguish an image

from a real one, the more photorealistic the image is.

This intuitive measure of realism has a flaw. If we compute an image of a non-

existing environment on a computer, we have nothing to compare the picture to.

The computed image usually corresponds to a photograph of some environment

which might exist and which we can recognise. However, even in this case we

must use our imagination and experience instead of a direct visual comparison

to judge the quality of the computed picture.

Another flaw of the above definition of realism is that a human eye (or rather
the human visual perception system) is easy to fool. Nice unreal pictures
sometimes appear “realistic”. Synthetic beasts in movies can appear real at
first glance even though a closer study of the images reveals that they are
not. An untrained eye is very tolerant even of very obvious mistakes,
especially in movies. For instance, the movie Indecent Proposal (with Robert
Redford and Demi Moore) [Lyn93] contains at least two relatively long image
sequences in which a forgotten microphone hangs from above. (None of the people
we have asked noticed the hanging microphone.) It is interesting to note that
it also works the other way around: ugly “non-realistic-looking” pictures may
be real! There is a strange shadow above the staircase in Café Central in the
German city of Paderborn. People who work with computer graphics must be saying
to themselves: “This shadow is too sharp to be realistic.” (None of the people
we have asked noticed the surprising appearance of the shadow.) Further
examples of where our visual perception is likely to fail are the wonderful
images of M. C. Escher’s work [EBe92], stereograms [Ent93] and other optical
illusions [Per85].

The previous discussion suggests that measuring photorealism by visual
impression is unreliable and should be replaced by a more formal model. Such
mathematical models do exist, but their practical use requires simplifying
assumptions about the behaviour of light. In other words, the quality of images
can be evaluated with respect to a chosen mathematical model but not against
the reality which they represent. Also, a comparison of software
implementations of different rendering systems is usually impossible due to the
different simplifying assumptions made during the implementation of the
systems. This is why mathematical quality measures are sometimes combined with
a subjective perception of images, even if it means a retreat to the imprecise
“psychological” definition of realism [McN00].

1.2.2 Photorealistic rendering systems

The final result of the rendering of a 3D environment is an image. The image

can be created using a number of methods. It can be taken by a camera, painted

by an artist or computed by a program. These different rendering systems may

produce the same or similar images—however, they differ internally and their use

depends on the purpose of the images.

When we see nice scenery, we might want to take a picture of it because we

are either not likely to be there again or the scenery is perhaps likely to change. A

camera or a painter can put the image into a form which can be stored practically

forever (a photographic film, a painting).

A painter is more flexible than a photographic camera. A painter does not

need to see the scenery while painting if he or she remembers what the scenery

looks like. The painter’s imagination can also produce images of a scenery as

seen from different perspectives or viewed under different lighting conditions.

Moreover, a painter can paint images of environments which do not exist. The

paintings can still be very realistic, comparable to photographic pictures.
(However, realism is usually not what a painter tries to achieve because artistic

perfection is different from the measures discussed in the previous section and

cannot be very well formalised.)

It is even more flexible to create and store a computer model of a 3D
environment and to postpone the rendering of images until later. The model consists

of surfaces, surface materials, atmosphere and light sources. Later on, (virtual)

cameras can be added into the model and a rendering program can be used to

render images of what the cameras “see”.

The task of a realistic 3D artist is to create a realistic 3D model. When we have

a perfect model of a 3D environment, we would also like the images taken by the

virtual cameras to be perfect. We do not want the artist to retouch the computed

images, to manually paint shadows on the floor or reflections on a glass table, or

to touch up a brick wall to make it look like a brick wall. A 3D artist should

concentrate on the modeling, not on making images. A photorealistic rendering

system should correctly simulate all light phenomena (or at least phenomena which

are relevant to the given model).

All of the above examples of rendering systems (a passive observer, a camera,

a realistic painter, a rendering program) solve instances of the global illumination

problem. The task of a 3D artist is to set up an instance of the global illumination

problem.

1.2.3 Light phenomena and their simulation

The nature of light has been studied for many centuries. The most comprehensive

theory of the behaviour of light is quantum electrodynamics [Fey88]. The basic

assumption of this theory is that light is carried by particles called photons.

Photons are emitted from light sources and transport energy through a medium

(air, for instance) until they hit a surface. The interaction of a photon with a

surface results in the absorption of the photon and in its scattering (reflection,

refraction) which produces a new photon or photons. Photons do not travel along

straight lines from their origin to their destination—they travel along arbitrary

“crooked” paths in space. The intensity of light as measured at a point in space

is an integral of contributions of many photons traveling along all possible paths

between the light source and the destination.

Even though quantum electrodynamics is the best existing theory which ex-

plains the behaviour of light, it cannot be directly used in macro-scale computer

graphics—it is too expensive to simulate the transport of light on a subatomic

level. The modeling paradigms and algorithms used in computer graphics are

based on geometric optics (developed by Sir Isaac Newton [New52]) which makes

several simplifying assumptions.

The basic assumption of geometric optics is that photons only travel along

straight lines. Scattering is only allowed on surfaces. In other words, the inter-

action of photons with participating media is either ignored or only certain types

are allowed (in most algorithms it is easy to include media that only absorb light).

Another assumption is that light is monochromatic. This means that all emitted
photons have a single frequency. This assumption is made in order to simplify
the representation of “colour” (computation with vectors is more convenient
than computation with continuous spectra).

Many light phenomena are completely ignored in computer graphics or are

handled in a special way (when they are important to the application): diffraction

(“bending” of light around obstacles), interference (an effect that can be observed

on thin surfaces such as oil films or soap bubbles), polarisation (the scattering of

light on surfaces such as water or glass depends on the orientation of the electric

vector of the incident light beam), fluorescence (molecules of some materials

absorb photons and then emit new photons at a different frequency, which makes

clothes “glow in the dark” under certain lighting conditions), etc.

Some computer graphics algorithms (e.g. the basic radiosity algorithm) only

assume an ideal diffuse reflection and no transmission, whereas others (e.g. the

basic ray tracing algorithm) only assume an ideal (indirect) specular reflection

and an ideal specular transmission.

Nevertheless, most of the light phenomena which are ignored are not rele-

vant to applications of computer graphics (the relevant phenomena can usually

be incorporated into the chosen rendering algorithm). Two important rendering

algorithms are ray tracing and radiosity. Ray tracing is a view-dependent algo-

rithm which traces rays (photons) from the observer’s eye to the light sources.

Radiosity is a view-independent algorithm which solves a linear equation system

in order to compute the distribution of light on a finite number of patches which

approximate surfaces of the 3D model. Even though ray tracing and radiosity are

incomparable in almost all respects, they both compute solutions of the global

illumination problem. The global illumination problem was formally defined by

James T. Kajiya [Kaj86] as a Fredholm integral equation of the second kind.

Ray tracing and radiosity algorithms solve special cases of Kajiya’s rendering
equation, using approximations which simplify the equation. Modern rendering
algorithms generally try to avoid such approximations and solve Kajiya’s
equation directly, using probabilistic methods.
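
For orientation, the rendering equation can be written in the form commonly attributed to [Kaj86]. The notation below is an assumption made for this sketch and may differ from the notation used in the formal treatment later in the thesis. Here I(x, x') is the intensity of light passing from point x' to point x, g(x, x') is a geometry term (visibility and distance), ε(x, x') is the light emitted from x' towards x, ρ(x, x', x'') describes the scattering of light arriving from x'' at x' towards x, and S is the union of all surfaces. The unknown I appears both outside and inside the integral, which is what makes this a Fredholm integral equation of the second kind. The second line shows the discrete system solved by the basic radiosity algorithm under the assumption of ideal diffuse reflection, with B_i the radiosity, E_i the emission and ρ_i the reflectance of patch i, and F_ij the form factor between patches i and j.

```latex
\begin{align}
  I(x, x') &= g(x, x') \left[ \varepsilon(x, x')
              + \int_{S} \rho(x, x', x'')\, I(x', x'')\, \mathrm{d}x'' \right] \\
  B_i &= E_i + \rho_i \sum_{j=1}^{n} F_{ij}\, B_j , \qquad i = 1, \dots, n
\end{align}
```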

1.3 Outline of this thesis

This thesis is organised as follows.

Chapter 2 deals with efficient parallel programming using message passing. The
current standard communication libraries, PVM (Parallel Virtual Machine) and
MPI (Message Passing Interface), do not allow an efficient implementation of
large parallel algorithms. The problem is not specific to parallel global
illumination: practically all larger parallel applications are forced to use
active polling (also known as busy waiting). Active polling not only diminishes
performance and destroys the natural structure of parallel programs but also
leads to a non-deterministic message passing latency. The reason why active
polling cannot be avoided is that the PVM and MPI implementations which are
currently available are either thread-unsafe or (even worse) thread-safe but
with active polling hidden inside the libraries. We extended both the PVM and
MPI libraries with a new interrupt mechanism which allows parallel programs to
be written without active polling. The interrupt mechanism does not make the
PVM and MPI libraries completely thread-safe; nevertheless, it makes it
possible to write multi-threaded applications without active polling (we call
this property quasi-thread-safety). Moreover, we developed a communication
library called TPL (Thread Parallel Library) which builds upon the extended PVM
and MPI standards and which is thread-safe (on the set of its communication
functions).

Thread-safety alone is not enough. We define a formal framework for message
passing which builds on well-accepted parallel processing models and which
helps to explain the semantic drawbacks of existing message passing models such
as PVM, MPI or CORBA. In particular, we explain how asynchronous communication
can be viewed formally and why the specification of asynchronous communication
in the above systems does not match the semantics defined in fundamental
abstract message-passing models. TPL is not just another communication library;
it is a straightforward implementation of our framework. TPL offers (unlike
PVM, MPI or CORBA) asynchronous communication on parallel systems using
standard hardware and software components.

Chapter 3 presents and explains a formal definition of the global illumination

problem. Approaches to the global illumination problem are presented and their

advantages and limitations are discussed.

Chapter 4 is devoted to the ray tracing method and its parallelisation. A

novel perfect demand-driven load-balancing algorithm is presented, its optimal-

ity is proved and its exact message complexity is given. One disadvantage of a

straightforward demand-driven parallelisation of ray tracing is that the 3D model

must be replicated in the memories of all processors. We solve this problem by

using a distributed object database. Each processor “owns” a subset of objects

of the 3D model and, in addition, maintains a cache for storing other
objects. If a processor needs an object which is currently not stored in its
memory, it interrupts the computation, makes room for the object in its cache and

then sends an object-request to the object’s owner. The caching policy tries to

minimise the number of object-requests in order to reduce the communication

between processors. We compare the efficiency of several caching policies in or-

der to choose the most appropriate one. Most importantly, we experimentally

show that the performance of parallel ray tracing is strongly influenced by the

choice of the communication library. This result is not specific to parallel ray

tracing. An empirical comparison of the performance of parallel algorithms may

be very biased if polling is used in the implementation of the algorithms or the

communication library.

Chapter 5 deals with the radiosity method. We concentrate on an integration

of a two-pass algorithm which consists of a radiosity pass (which computes a

view-independent diffuse illumination) and a ray tracing pass (the second pass


adds some of the view-dependent illumination effects to the precomputed radios-

ity solution). For the radiosity pass we use the shooting algorithm (Southwell

relaxation) with a Monte Carlo form factor computation. The main goals behind

this choice are a correct handling of materials during the radiosity pass and a

seamless integration with the ray tracing pass. These ideas are not entirely new.

Our novel contribution is a combination of the energy transfer with a

state-of-the-art Monte Carlo form factor computation in one step. This combined step is

always performed on the top level of the subdivision hierarchy without a loss of

accuracy of the radiosity solution. An important positive aspect is an automati-

sation of the algorithm (minimisation of the number of parameters which control

the algorithm).

Chapter 6 gives a summary of the results presented.


Chapter 2

Event-driven message passing

Parallel processing is a very common practice nowadays. Much of it is hidden

inside chips (general-purpose processors and graphics cards), inside schedulers

of operating systems which run processes on shared memory multiprocessor ma-

chines etc. This chapter focuses on efficient message passing parallel program-

ming on a larger scale. Most parallel applications on this scale use either PVM

(Parallel Virtual Machine) or MPI (Message Passing Interface).

One problem of the current implementations of the PVM and MPI standards is that many of them are not thread-safe (there is no thread-safe implementation of PVM and there is no freely available thread-safe implementation of MPI). This forces application programmers to use unnatural and inefficient active polling (also known as polling or busy waiting) in many important applications which we call non-trivial. (More precisely, there are several implementations of PVM and MPI which are thread-safe, but the active polling is hidden inside the libraries, which is even worse than active polling inside applications!¹) We explain where the active polling comes from and present the previous approaches to the problem.

Then we present an interrupt mechanism which makes both PVM and MPI im-

plementations quasi-thread-safe. Quasi-thread-safety allows the programming of

multi-threaded parallel non-trivial applications without active polling. We also

sketch how PVM and MPI implementations can be made completely thread-safe,

without active polling.

The second (perhaps less apparent) problem of the above standards is a lack

of asynchronous communication (PVM) or its imprecise semantics (MPI), respec-

tively. This also leads to polling in the applications which require asynchronous

communication, although the form of this polling is slightly different. We explain

and solve this problem as well.

¹ MPI/Pro by MPI Software Technology, Inc. seems to be an exception [DS02].


2.1 Non-trivial parallel applications

There are surprisingly many parallel applications (including simulation of global

illumination) which cannot be implemented in an efficient way using current

PVM or MPI implementations. These applications can be clearly characterised

and form a class of non-trivial parallel applications.

Definition. A non-trivial parallel application is a parallel application which consists of parallel processes which do not share any memory and communicate via message passing. Each process performs two independent activities, T1 and T2, as depicted in Fig. 2.1. T1 and T2 operate on the same data (shared memory).

T1: CPU-intensive computation, communicating occasionally with other processes.

T2: Fast servicing of requests coming from other processes.

Figure 2.1: Two independent activities in one process of a non-trivial application

Practically all the larger parallel applications belong to this class. (There is

no more exaggeration in the previous sentence than in the sentence “Every larger

sequential application needs dynamic memory management.”) Examples of

non-trivial parallel applications are distributed databases (or applications which

use a distributed database), media servers, shared memory simulation libraries,

parallel scientific computations which use application-independent load balancing

libraries etc.
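The structure of one process of a non-trivial application can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from this thesis: the two activities are modelled as threads, a `queue.Queue` stands in for the message passing layer, and the names `run_nontrivial_process`, `inbox` and `replies` are hypothetical. The point of the sketch is that T2 sleeps inside a blocking receive instead of polling.

```python
import queue
import threading


def run_nontrivial_process(requests, n_steps=1000):
    """Model one process of a non-trivial application with two threads.

    `requests` lists the keys that other (hypothetical) processes ask for.
    """
    data = {"progress": 0}      # data shared by T1 and T2
    lock = threading.Lock()     # protects `data`
    inbox = queue.Queue()       # stands in for the message passing layer
    replies = []

    def t1():
        # T1: CPU-intensive computation which updates the shared data.
        for _ in range(n_steps):
            with lock:
                data["progress"] += 1

    def t2():
        # T2: sleeps inside inbox.get() until a request arrives (no polling).
        while True:
            key = inbox.get()
            if key is None:     # sentinel: shut down
                return
            with lock:
                replies.append((key, data[key]))

    worker = threading.Thread(target=t1)
    server = threading.Thread(target=t2)
    worker.start()
    server.start()
    for key in requests:        # requests arriving "from other processes"
        inbox.put(key)
    worker.join()
    inbox.put(None)
    server.join()
    return data["progress"], replies
```

A lock is used here for brevity; whether requests are answered quickly while T1 is computing depends on the thread scheduler, which is exactly the issue discussed in the rest of this chapter.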

2.2 Development of parallel programming

In order to explain what is missing in the modern parallel programming standards

(PVM, MPI) as regards active polling in non-trivial applications, we will briefly

look at the roots of parallel programming. Out of a large number of parallel

computers, programming paradigms and programming languages, we chose Inmos

Transputer and Occam as a representative sample. Even though Transputers

have not survived the technological progress of the last twenty years, many ideas

which were originally implemented in Transputers are still valid. In particular,

the problem of non-trivial applications is solved in Transputers.

Many parallel computer architectures, parallel programming paradigms, par-

allel computer languages, parallel algorithms etc. have been developed over the

past 20 years. The demand came from the need to solve problems which re-

quired more computing power than single-processor computers could offer (e.g.


weather prediction, fluid dynamics simulation, simulation of global illumination).

These problems can usually be divided into subproblems which can be solved in-

dependently of one another and their results are then combined together. The

subproblems can be mapped onto independent processing elements (which can

exchange data). All processing elements simultaneously compute their subprob-

lems, thus reducing the total computational time required.

Long discussions on the choice of parallel programming model resulted in

questions such as:

• Should the parallelism be expressed explicitly (by a programmer) or should a machine (a compiler or an operating system or a processor) automatically generate a parallel program from a sequential one?

• Should the underlying hardware be specialised hardware or just a collection of standard computers connected in a network?

• Should message passing or shared memory be used for communication?

• How fine grained should the parallelism be?

• Should the processing be synchronous or asynchronous?

• Should the communication be synchronous or asynchronous?

• What should the interconnection network look like?

• Should there be a special programming language for expressing parallel algorithms or should existing ones be used?

• ...

Some of these questions are important to developers of parallel computers,

operating systems and compilers, some to parallel application programmers and

some to theoreticians working on parallel algorithms. Almost any combination

of answers to the questions above is correct. The resulting models are equivalent

in the sense that they can be mapped onto each other. But who should do the

mapping?

The diversity of ways in which parallel processing can be viewed may have been one of the reasons why parallel applications are still so rare. Clearly, what was missing was a standard which would allow the writing of parallel programs that outlast the hardware and system software they run on.

Transputers [GK90] and the Occam programming language [Gal96] were among the first attempts to establish a standard which seamlessly connects hardware, programming paradigm, programming language and parallel algorithms. Transputers have not survived the technological progress, but in our opinion they significantly influenced the development of parallel computing.


2.2.1 Occam programming language

Sir William of Occam (1284–1347) was an English philosopher. He advocated a

principle known as Occam’s Razor: “Pluralitas non est ponenda sine necessitate”.

(“Entities must not be multiplied beyond what is necessary.”) In other words,

“If there are several solutions or approaches to a problem then the simplest one

should be used.” The Occam programming language (named after William of

Occam) is indeed minimalistic. It is not our intention to formally or completely

describe the Occam language or the Transputer. We will only define a small sub-

set of Occam in order to demonstrate in an example how simple and at the same

time powerful the language is. We will also show how non-trivial applications are

supported by Occam and the Transputer.

Occam requires the programmer to explicitly express the parallelism. An

Occam program consists of a set of processes which can run either sequentially or

in parallel. Each of the processes can further consist of processes which can again

run either sequentially or in parallel. This nesting is potentially unlimited but

must end up with atomic processes. Processes running in parallel are not allowed

to share variables—they are only allowed to share communication channels and

explicitly exchange data using message passing.

The idea of process nesting was introduced by Hoare [Hoa85]. Occam is a

“materialisation” of Hoare’s CSP model (Communicating Sequential Processes).

It is easy to schematically visualise every Occam program, thanks to the process

nesting and channel declarations. The visualisation can only include the first

level of the program structure (or a few first levels) which is usually sufficient

to understand the idea of the parallelisation without going into implementation

details. (INMOS developer’s toolsets included an integrated editor which allowed

the programmer to browse the structure of the program and edit its code.)

The basic Occam concepts are:

• Data and data types. Occam provides the programmer with the usual basic data types BOOL, BYTE, INT, REAL. Data can be organised in arrays (this corresponds to replication in Occam’s terminology). Array indices are integers and begin with 0. A variable is local to the process in which it was declared (and to its subprocesses). Processes running in parallel are not allowed to share any variables; an exception to this is a variable which is not altered by any of the parallel processes. Variables must be declared and are strongly typed.

• Processes. Processes are organised in a hierarchical manner. An Occam program is a single process. This single process consists of either one atomic process (assignment, input, output, SKIP, STOP) or a constructor (SEQ, PAR, IF, CASE, WHILE, ALT) which combines processes that are themselves either atomic processes or constructors. The structure is thus always a tree. Indentation denotes process nesting in an Occam program. There is no dynamic memory management in Occam (or in the Transputer). Recursion cannot be expressed in Occam (it must be simulated).

• Channels. A channel is the only means of communication between two parallel processes (each channel must be shared by exactly two parallel processes). Occam channels are uni-directional (a process is allowed to either read from or write to a channel but not both) and synchronous (a process which tries to read from or write to a channel becomes blocked until the process on the other end of the channel is ready to communicate). Channels, like variables, must be declared and are strongly typed. The “type” of a channel is the protocol used on that channel (a description of the sequence of data types passed through the channel). Timers are treated as special channels with only one end, which can only be read. Timers are used either to read the clock value or to block the reading process for a specified time.

Occam defines 5 atomic processes:

Assignment. Assigns a value to a variable and terminates.

variable := value

Input (receive). Reads a value from a channel, stores it into a variable

and terminates. This process blocks until the process at the other side of

the channel is ready to write to the channel.

channel ? variable

Output (send). Writes a value to a channel and terminates. This process

blocks until the process at the other side of the channel is ready to read

from the channel.

channel ! value

No action. Does nothing and terminates. This process is used in some

constructors to explicitly express that no action should be taken.

SKIP

Stop. Does nothing and never terminates (blocks forever). This process is

usually used to indicate an error condition.

STOP
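The blocking behaviour of the input and output processes can be imitated in Python. The following is a minimal sketch which assumes exactly one sender and one receiver per channel, as Occam requires; the class `Channel` and its methods `send` and `recv` are hypothetical names chosen for this illustration. A send only returns once the receiver has taken the value, which mirrors the synchronous semantics of `channel ! value` and `channel ? variable`.

```python
import queue
import threading


class Channel:
    """Unbuffered channel for exactly one sender and one receiver."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)   # carries the value
        self._ack = queue.Queue(maxsize=1)    # carries the acknowledgement

    def send(self, value):
        # Counterpart of "channel ! value".
        self._slot.put(value)
        self._ack.get()          # block until the receiver has taken the value

    def recv(self):
        # Counterpart of "channel ? variable".
        value = self._slot.get() # block until the sender offers a value
        self._ack.put(None)      # release the blocked sender
        return value


def demo():
    ch = Channel()
    log = []

    def sender():
        ch.send(42)
        log.append("send completed")

    t = threading.Thread(target=sender)
    t.start()
    log.append("about to receive")
    value = ch.recv()
    t.join()
    log.append("received %d" % value)
    return log
```

In `demo`, the entry "send completed" can never precede "about to receive", because the sender stays blocked until the receiver arrives.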

The atomic processes can be combined into more complex processes using constructors. Two of the most important constructors are:

SEQ, a chain of sequential processes. Each process of this constructor begins after the previous one has terminated. The SEQ constructor terminates when its last process has terminated.


SEQ
  process1
  process2
  ...
  processN

PAR, a collection of parallel processes. All processes run in paral-

lel. The PAR constructor terminates when all its processes have terminated.

PAR
  process1
  process2
  ...
  processN
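For readers who are more familiar with thread libraries, the termination rules of the two constructors can be sketched in Python. This is an analogy only, under the assumption that a process is modelled as a callable without arguments; the helper names `seq` and `par` are hypothetical.

```python
import threading


def seq(*processes):
    """Run the processes one after another (analogue of SEQ)."""
    for process in processes:
        process()


def par(*processes):
    """Run the processes in parallel; return when all have terminated (analogue of PAR)."""
    threads = [threading.Thread(target=process) for process in processes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def demo():
    log = []

    def emit(name):
        return lambda: log.append(name)

    # Nesting works as in Occam: a PAR is itself one process inside the SEQ.
    seq(emit("a"), lambda: par(emit("b"), emit("c")), emit("d"))
    return log
```

In `demo`, "a" is always logged first and "d" last, while the order of "b" and "c" is not determined.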

The remaining constructors are IF, CASE, WHILE and ALT, whereby the first three correspond to the constructors which are known from other programming languages. ALT is a special constructor which acts as an input multiplexer. It allows a process to wait for a number of events and execute an action when an event happens. When data can be received from several channels, one of the events is triggered (an arbitrary one). The following example involves a process which blocks until it receives data either from channel1 (executing action1 in this case) or from channel2 (executing action2 in this case):

ALT
  channel1 ? variable1
    action1
  channel2 ? variable2
    action2

Furthermore, each channel input can be accompanied by a boolean condition

(a guard). An event is triggered when a channel becomes readable and at the same

time the boolean condition guarding the channel is satisfied. Guarded inputs can

be combined with non-guarded ones. An ALT may also contain one special event,

TRUE, which is triggered when none of the other events is triggered.
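An ALT with guards resembles the `select` system call on a set of descriptors. The sketch below illustrates this in Python using `select.select` on local socket pairs. It rests on several assumptions: sockets are buffered, so the communication is not synchronous as in Occam; the special TRUE event is not modelled; and the helper name `alt` and the triple layout `(sock, guard, action)` are hypothetical.

```python
import select
import socket


def alt(guarded_inputs):
    """Wait for a readable input whose guard is true and run its action.

    `guarded_inputs` is a list of (sock, guard, action) triples.
    """
    enabled = [sock for sock, guard, _ in guarded_inputs if guard]
    # The calling thread sleeps here until an enabled input becomes readable.
    readable, _, _ = select.select(enabled, [], [])
    chosen = readable[0]        # if several inputs are ready, take one of them
    for sock, _, action in guarded_inputs:
        if sock is chosen:
            return action(sock.recv(4096))


def demo():
    in1, out1 = socket.socketpair()
    in2, out2 = socket.socketpair()
    try:
        out1.sendall(b"one")
        out2.sendall(b"two")
        # Input 1 holds data but its guard is false, so only input 2 can fire.
        first = alt([(in1, False, lambda m: ("action1", m)),
                     (in2, True, lambda m: ("action2", m))])
        # Input 2 has been drained; input 1 still holds its message.
        second = alt([(in1, True, lambda m: ("action1", m)),
                      (in2, True, lambda m: ("action2", m))])
        return first, second
    finally:
        for s in (in1, out1, in2, out2):
            s.close()
```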

All constructors can be replicated in Occam. A replicated SEQ corresponds to a sequential loop. For instance, the program in Fig. 2.2 computes j = 2^10 (the inner SEQ is replicated).

A replicated PAR starts a collection of parallel processes. The processes run the same code. The program in Fig. 2.3 starts 10 parallel processes which are connected into a pipeline. The value 1 is passed to the pipeline by a special source process which is connected to the input end of the pipeline. The pipeline computes 2^10, which is passed as a result to another special process, the sink, which is connected to the output end of the pipeline (the two special processes run in parallel with the 10 pipeline processes).


INT j:
SEQ
  j := 1
  SEQ i = 1 FOR 10
    j := 2 * j

Figure 2.2: An example of a replicated SEQ. The program computes j = 2^10

[11]CHAN OF INT chpipe:
PAR
  chpipe[0] ! 1
  PAR i = 0 FOR 10
    INT value:
    SEQ
      chpipe[i] ? value
      chpipe[i+1] ! 2 * value
  INT j:
  chpipe[10] ? j

Figure 2.3: An example of a replicated PAR. The program computes j = 2^10
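The same pipeline can be written in Python with one thread per stage. The sketch is an approximation under stated assumptions: each channel is modelled by a `queue.Queue` with one slot, which is buffered rather than synchronous, and the function name `pipeline_power_of_two` is hypothetical.

```python
import queue
import threading


def pipeline_power_of_two(stages=10):
    """Pass the value 1 through `stages` doubling processes and return the result."""
    # One channel more than stages, as in "[11]CHAN OF INT chpipe".
    chpipe = [queue.Queue(maxsize=1) for _ in range(stages + 1)]

    def stage(i):
        value = chpipe[i].get()          # chpipe[i] ? value
        chpipe[i + 1].put(2 * value)     # chpipe[i+1] ! 2 * value

    threads = [threading.Thread(target=stage, args=(i,)) for i in range(stages)]
    for t in threads:
        t.start()
    chpipe[0].put(1)                     # the source process
    j = chpipe[stages].get()             # the sink process
    for t in threads:
        t.join()
    return j
```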

A replicated ALT is usually used in the multiplexing of input from an array of

channels. Fig. 2.4 shows a very artificial example of a replicated ALT.

[10]CHAN OF INT ch:
PAR
  PAR i = 0 FOR 10
    ch[i] ! 2
  INT j, value:
  SEQ
    j := 1
    SEQ i = 0 FOR 10
      ALT k = 0 FOR 10
        ch[k] ? value
          j := j * value

Figure 2.4: An example of a replicated ALT. The program computes j = 2^10

Processes can be assigned names in Occam. This improves the readability of a program. Parameters can be passed to named processes. The pipeline which computes 2^10 (Fig. 2.3) can be written as shown in Fig. 2.5.

An Occam program is independent of the underlying hardware. The hardware

is usually a Transputer or a network of Transputers. There are special language

extensions for the specification of the mapping of processes onto processors and


PROC pipe(CHAN OF INT in, CHAN OF INT out)
  INT value:
  SEQ
    in ? value
    out ! 2 * value

PROC source(CHAN OF INT out)
  out ! 1

PROC sink(CHAN OF INT in)
  INT j:
  in ? j

[11]CHAN OF INT chpipe:
PAR
  source(chpipe[0])
  sink(chpipe[10])
  PAR i = 0 FOR 10
    pipe(chpipe[i], chpipe[i+1])

Figure 2.5: An example of named processes in Occam. The program computes j = 2^10 in PROC sink

mapping of channels onto physical links, whereby parallel processes of the top-

most PAR constructor are typically mapped onto different processors. Mapping

a program onto several processors only influences the efficiency of the program,

not its semantics.

Example of an Occam program: Dining philosophers

The problem of dining philosophers [Dij71] is used in many textbooks as a mo-

tivating example of resource sharing. It helps in the understanding of the syn-

chronisation of parallel processes and solution strategies to the deadlock problem.

We chose this problem to show the power which is hidden in the simple Occam

constructors:

Problem of dining philosophers

Five philosophers are sitting by a round table. They spend time

alternately philosophing and eating spaghetti. Each philosopher has

a plate in front of him. There is one fork between each pair of plates.


Each fork is shared by two neighbouring philosophers. In order to eat,

a philosopher needs both neighbouring forks, the left one and the right

one. A philosopher is not allowed to take both forks simultaneously,

he must decide which fork to acquire first.

“Philosophing” should be interpreted (in the context of this problem)

as doing nothing for a random length of time, while not holding any

fork in one’s hand. “Eating” should be interpreted as doing nothing

for a random length of time, while holding both forks.

The task is to write a program which simulates the life of the philosophers.

The program must guarantee fairness: a hungry philosopher

must eventually get to eat. A problem which must be solved

is the prevention of deadlock and the consequent starvation of the

philosophers. A deadlock occurs for instance when all the philoso-

phers acquire their left forks, in which case no fork remains on the

table and all the philosophers will starve (unless at least one of them

voluntarily returns the fork he is holding).

One possible solution to the deadlock problem involves the breaking of the

symmetry among philosophers. For instance, if the philosophers are assigned

numbers from 1 to 5 and if odd philosophers acquire forks in a reverse order

to even ones, then deadlock will never occur. (There are also other solutions

as to how deadlock can be prevented but we shall continue with this particular

solution.)

It is very simple under these assumptions to write an Occam program which

simulates the situation at the table. Philosophers, as well as forks, are processes

which are connected in a ring (each philosopher communicates with the two forks

next to him and each fork communicates with the two philosophers next to the

fork). The program is shown in Fig. 2.6.

Note how the fork process is implemented. At the beginning a fork waits for

a signal from either of its two neighbouring philosophers. Once a signal arrives,

the fork is assigned to the philosopher who sent this signal and the fork begins

listening only to that philosopher. Another signal from the same philosopher

means that the philosopher returns the fork, and the fork begins listening to

both philosophers again.

The philosopher process which is trying to acquire a fork simply sends a

signal to the fork (in the IF constructor). The send will block when the fork is

not listening to the philosopher (all communication in Occam is synchronous). A

blocked send will unblock after the message has been received by the fork. Note

that the sends after eat() (used for returning forks) never block.
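The symmetry-breaking idea can also be tried out in Python. The following sketch is not a translation of the Occam program: forks are modelled as locks rather than processes, thinking and eating take no time, philosophers are numbered from 0, and the function name `dine` is hypothetical. Even philosophers pick up the left fork first, odd philosophers the right one, so with five philosophers two of the forks are only ever picked up second and the simulation always terminates.

```python
import threading


def dine(meals=100, n=5):
    """Let n philosophers eat `meals` times each; return the meal counters."""
    forks = [threading.Lock() for _ in range(n)]
    eaten = [0] * n

    def philosopher(i):
        left, right = forks[(i - 1) % n], forks[i]
        # Break the symmetry: even and odd philosophers acquire the forks
        # in opposite orders.
        first, second = (left, right) if i % 2 == 0 else (right, left)
        for _ in range(meals):
            with first:
                with second:
                    eaten[i] += 1        # "eating" while holding both forks

    threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return eaten
```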

The main program starts five instances of the philosopher process and five

instances of the fork process in parallel, and connects the processes into a ring

using channels. Fig. 2.7 shows a schematic visualisation (a process diagram) of


PROC fork(CHAN OF INT left, right)
  INT signal:
  WHILE TRUE
    ALT
      right ? signal
        right ? signal
      left ? signal
        left ? signal

PROC philosopher(INT id, CHAN OF INT left, right)
  WHILE TRUE
    SEQ
      think()
      IF
        (id REM 2) = 0
          SEQ
            left ! 0
            right ! 0
        (id REM 2) = 1
          SEQ
            right ! 0
            left ! 0
      eat()
      right ! 0
      left ! 0

[10]CHAN OF INT ph2fork:
PAR
  PAR i = 0 FOR 5
    philosopher(i, ph2fork[(2 * i + 9) REM 10], ph2fork[2 * i])
  PAR i = 0 FOR 5
    fork(ph2fork[2 * i], ph2fork[(2 * i) + 1])

Figure 2.6: Simulation of dining philosophers in Occam

the program in Fig. 2.6. Only the outermost PAR and the outermost channel

declarations must be followed in order to draw a process diagram of any Occam

program.

The Occam program is extremely short. (We did not go into any details con-

cerning an output to a terminal or the generation of a random delay in PROC

think as this is not important.) An equivalent program written in a conventional

(sequential) programming language such as C or Pascal would be much

longer. Why? The reason is that the simulation of dining philosophers is


Figure 2.7: Simulation of dining philosophers, a process diagram

an inherently parallel problem. A sequential program written in a conventional

programming language must implement the scheduling hidden behind the Occam

PAR constructor. A table with the states of all processes must be maintained and the

independent executions of the processes must be simulated.

2.2.2 Transputer

The Transputer was a processor (more precisely, a family of processors) introduced by the British company INMOS Ltd (now SGS-Thomson Microelectronics Ltd) in 1986. The processor (T414) was optimised to support parallel programming, particularly programs written in the Occam language (all of Occam’s constructors and atomic processes have their counterparts in the Transputer’s instructions). In the three years which followed there were 10 processors in the Transputer family, out of which the T805 and T9000 were the most successful ones.

Fig. 2.8 shows a block diagram of the T805 Transputer. The features that make this architecture special are the four links used to connect several Transputers into a network (pins LinkIn0 ... LinkIn3 and LinkOut0 ... LinkOut3) and the 4 kB on-chip RAM. Transputers were often used in embedded systems: all they required was power (pins GND and VCC) and an external clock (ClockIn).


Figure 2.8: Hardware block diagram of the T805 Transputer

The full technical description of Transputer processors can be found in [GK90].

In the following section we will only explain the process scheduling in the Trans-

puter, which is related to our work.

The basic concept of low-level Transputer programming is a process. Processes in the Transputer correspond to Occam’s parallel processes which are created in the outermost PAR constructor. Each process is assigned a priority (high or low) and a fixed amount of memory (workspace) when it is created. A process can be either running or blocked (waiting on a channel or timer).

Only one process can run at a time in one Transputer. The Transputer maintains

two priority queues, a high priority queue (HPQ) and a low priority queue (LPQ).

Each process which is neither running nor blocked is stored in one of these queues,

depending on the priority of the process. The process scheduling is micro-coded

in hardware and the process switching is extremely fast. The scheduling works

as follows:

• A running high priority process is never preempted. It runs until it termi-

nates or blocks. When a high priority process blocks, it is placed at the

end of the HPQ.


• When a blocked high priority process unblocks (becomes ready to run),

three cases can arise:

1. No process is running. In this case the unblocked high priority process

is scheduled to run.

2. Another high priority process is running. In this case the unblocked

high priority process is placed at the end of the HPQ.

3. A low priority process is running. In this case the running low priority

process is preempted and placed at the end of the LPQ. The unblocked

high priority process is scheduled to run. (This is the only case where

the context of the running process must be saved.)

• A running low priority process runs until it is either preempted by a high

priority process or until it runs longer than 50 ms. The former case has al-

ready been discussed above. In the latter case, the scheduler waits until the

running process executes an instruction which leaves the process’ context

undefined (in doing so no context must be saved). In this situation there

are two possibilities:

1. The LPQ is empty. In this case the running process keeps running.

2. The LPQ is not empty. In this case the running process is placed at

the end of the LPQ and the first process of the LPQ is scheduled to

run.
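The scheduling rules above can be captured in a small state machine. The following Python sketch is a simplified model, not a description of the Transputer microcode. It makes several assumptions which are not stated in the text: blocked processes are kept outside both queues until they unblock, an unblocked low priority process is treated analogously to a high priority one (without preemption), a ready high priority process is always dispatched before a low priority one, and the wait for an instruction which leaves the context undefined is ignored. The class and method names are hypothetical.

```python
from collections import deque

HIGH, LOW = "high", "low"


class TransputerScheduler:
    """Simplified model of the two-queue scheduler described above."""

    def __init__(self):
        self.queues = {HIGH: deque(), LOW: deque()}   # HPQ and LPQ
        self.running = None                           # (name, priority) or None

    def unblock(self, name, priority):
        """A blocked process becomes ready to run."""
        proc = (name, priority)
        if self.running is None:
            self.running = proc                       # case 1: nothing is running
        elif priority == HIGH and self.running[1] == LOW:
            self.queues[LOW].append(self.running)     # case 3: preempt low priority
            self.running = proc
        else:
            self.queues[priority].append(proc)        # case 2: wait in own queue

    def block_or_terminate(self):
        """The running process blocks or terminates; dispatch the next one."""
        self.running = None
        for priority in (HIGH, LOW):                  # assumption: HPQ is served first
            if self.queues[priority]:
                self.running = self.queues[priority].popleft()
                return

    def timeslice_expired(self):
        """The time slice of a running low priority process has run out."""
        if self.running and self.running[1] == LOW and self.queues[LOW]:
            self.queues[LOW].append(self.running)
            self.running = self.queues[LOW].popleft()
```

For example, unblocking a high priority process while a low priority process is running makes the high priority process run at once and moves the low priority one to the end of the LPQ.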

The following section shows that this scheduling model corresponds to the

one needed for an efficient implementation of non-trivial applications which were

introduced in Section 2.1.

2.2.3 Occam, Transputer and non-trivial parallel applications

Each of the parallel processes of a non-trivial application (see Section 2.1) needs

to perform a local computation and at the same time quickly service requests from

all other processes. The design of the Transputer allows for an efficient imple-

mentation of this scenario if each top-most process of the non-trivial application

is mapped onto one Transputer.

A non-trivial application can be implemented in Occam as follows. Each top-most process is a PAR which contains two subprocesses, T1 and T2. T1 does the computation, while T2 carries out the servicing of incoming requests (Fig. 2.9).

A subtle problem must be solved when T1 and T2 share variables and when

T1 or T2 alters a shared variable (it must be recalled here that Occam forbids the

altering of variables shared by parallel processes). In such a case a third process,


PROC NONTRIVIAL([NR.IN] CHAN OF INT in, [NR.OUT] CHAN OF INT out)
  PROC T1()
    SEQ
      ...compute and occasionally communicate...
  PROC T2()
    INT request:
    ALT i = 0 FOR NR.IN
      in[i] ? request
        ...service request ...
  PAR
    T1(in, out)
    T2(in, out)

Figure 2.9: Implementation of one process of a non-trivial application in Occam

DM (Data Manager) must run in parallel with T1 and T2. DM acts as a passive server which listens to T1 and T2. Only DM manipulates the “shared” data. T1 and T2 are only allowed to access the “shared” data indirectly (to read or write them) by passing messages to DM; otherwise they work with local copies.
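The data manager pattern can be sketched in Python as follows. This is a minimal illustration under stated assumptions: threads stand in for the Occam processes, queues stand in for the channels, and the names `data_manager`, `call` and the message layout `(op, key, value, reply)` are hypothetical. Only the data manager thread ever touches the store; T1 and T2 see the data exclusively through messages.

```python
import queue
import threading


def data_manager(requests):
    """Passive server: the only code which touches the "shared" data."""
    store = {}
    while True:
        op, key, value, reply = requests.get()   # sleeps until a message arrives
        if op == "stop":
            return
        if op == "write":
            store[key] = value
            reply.put(None)
        elif op == "read":
            reply.put(store.get(key))


def call(requests, op, key, value=None):
    """Send one request to the data manager and wait for the answer."""
    reply = queue.Queue(maxsize=1)
    requests.put((op, key, value, reply))
    return reply.get()


def demo():
    requests = queue.Queue()
    dm = threading.Thread(target=data_manager, args=(requests,))
    dm.start()
    seen = []

    def t1():                                    # computation: writes results
        for i in range(100):
            call(requests, "write", "x", i)

    def t2():                                    # servicing: reads current state
        for _ in range(10):
            seen.append(call(requests, "read", "x"))

    workers = [threading.Thread(target=t1), threading.Thread(target=t2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    final = call(requests, "read", "x")
    requests.put(("stop", None, None, None))
    dm.join()
    return final, seen
```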

Presumably the incoming requests are sent by processes which are blocked until the requests are answered. In order to increase the efficiency of the whole parallel program, the processes which are waiting for a response should be serviced as fast as possible. T2 should therefore be prioritised over T1. If there are no incoming requests, T2 sleeps (it is blocked in ALT) and thus it does not prevent T1 from running. If there is a request, we want T2 to run immediately, possibly preempting T1. T1 should never preempt T2. Note that all these requirements are guaranteed by the Transputer hardware scheduling semantics if T2 runs with high priority and T1 with low priority (the priority of DM is not important).

2.3 Current message passing standards: PVM and MPI

There is a subtle difference between parallel and distributed computing. The

first term is used for computing on dedicated multi-processor machines which

fit into a single box (e.g. Transputer-based systems). Distributed computing


usually denotes computing on a network of “ordinary” computers connected via

an “ordinary” network.

The current trend involves writing portable parallel applications. The differ-

ence between parallel and distributed computing disappears—at least from the

point of view of an application’s programmer. It is more important to minimise

the effort needed for the porting of a parallel application onto another architec-

ture than to maximise the application’s performance on a specialised hardware

platform. This goal is achieved by developing standards which provide appli-

cations with abstract message passing functions and which can be efficiently

mapped onto practically all the available parallel systems. PVM (Parallel Vir-

tual Machine) [GBD+94] and MPI (Message Passing Interface) [MPI94], [MPI98],

[MPI97] are parallel programming libraries (more precisely, specifications of ap-

plication programming interfaces) which have been established as standards for

parallel programming. The vast majority of parallel applications use either PVM

or MPI.

The porting of a PVM or MPI application onto a new system usually simply

means the compilation of the application on the new system. Most vendors of

parallel machines and operating systems have a tailored implementation of PVM

or MPI, optimised for their systems. The supported systems range from pro-

prietary parallel hardware through shared memory multiprocessors or (possibly

heterogeneous) workstation clusters to virtual computers which consist of loosely

coupled computers in wide-area networks (Internet).

Even though PVM and MPI significantly differ (in their implementations as

well as in their interfaces), they use common programming paradigms:

• The parallel machine which runs the program, regardless of what its architecture looks like, is viewed as a virtual parallel machine with message passing capabilities.

• A parallel application consists of parallel processes that do not share any memory. Explicit communication must be used to exchange data between processes.

• The actual implementation of the inter-process communication is transparent to the programmer. PVM and MPI provide the programmer with a set of communication functions, the most important of which are the point-to-point send and recv functions.

• PVM and MPI are libraries. A process of a parallel application is a sequential program written in a host language (e.g. C, C++, Fortran) and linked to PVM or MPI.

• Neither PVM nor MPI tries to be minimalistic (as opposed to Occam). They provide the programmer with many functions which are not necessary in the sense that they can be assembled from other functions. The additional functions (arguably) make application programming more comfortable. Also, some of the functions can be implemented in a more efficient way inside the library than in the application.

2.4 Point-to-point message passing in PVM and MPI

PVM and MPI differ in their specifications of point-to-point communication.

Some differences are obvious, e.g. a function that sends a message is called

pvm_send in PVM whereas in MPI it is called MPI_Send. The functions also differ

in the numbers and types of arguments, in their return values, etc. This section

does not deal with such differences; instead it focuses on differences concerning

the semantics of point-to-point communication.

Point-to-point message passing involves a message delivery between two parallel processes, a sender and a receiver. Each process is assigned a rank (called task identifier in PVM terminology), which is an integer unique to the process. Each message is accompanied by a message tag, also an integer. The sender’s and the recipient’s ranks together with the message tag form a message header.² A message can contain data. The data is stored in a message body. The body may be empty (messages with an empty body are called zero-length messages).
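The header and body described above can be pictured as a small C structure. This is only a minimal sketch: the type and function names (msg_header, msg, msg_make, msg_is_zero_length) are assumptions chosen for illustration and belong to neither PVM nor MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message header: the two ranks and the tag. */
typedef struct {
    int sender;     /* rank (task identifier) of the sending process */
    int recipient;  /* rank of the receiving process */
    int tag;        /* message tag */
} msg_header;

/* Hypothetical message: a header plus an optional body. */
typedef struct {
    msg_header header;
    const void *body;   /* packed data, or NULL when there is none */
    size_t body_len;    /* length of the body in bytes */
} msg;

/* Build a message; a NULL body forces the length to zero. */
static msg msg_make(int sender, int recipient, int tag,
                    const void *body, size_t body_len)
{
    msg m;
    m.header.sender = sender;
    m.header.recipient = recipient;
    m.header.tag = tag;
    m.body = body;
    m.body_len = (body != NULL) ? body_len : 0;
    return m;
}

/* A zero-length message carries a header only. */
static int msg_is_zero_length(const msg *m)
{
    assert(m != NULL);
    return m->body_len == 0;
}
```

A zero-length message is still a complete message in this picture: its header alone can carry information, for instance through the tag.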

The delivery of a message is regarded independently from the point of view

of the sender and the receiver processes. The sender initiates a send operation

(initiation of a send operation is sometimes called “posting a send”), which states

the receiver to which the message is to be delivered and the message tag which

will be used. The receiver initiates a receive operation (initiation of a receive

operation is sometimes called “posting a receive”), which states the sender (or

set of senders) from which a message is to be received and the message tag (or

tags) which the message must have. A message can be delivered when an initiated (and not yet completed) send operation and an initiated (and not yet completed) receive operation exist and match each other. The delivery of a message completes both the send and receive operations associated with the message.

Note that the existence of matching receive and send operations R and S does not guarantee that a message will be passed from the process which initiated S to the process which initiated R. There may be a third process in the system which initiated a send operation S′ which also matches R. Any of the send operations S and S′ may complete.

² An implementation of a communication library may store additional information in the header, for example the size of the message in bytes.

The communication library provides a somewhat weaker guarantee on message delivery. This guarantee is formulated as the progress rule in the Message Passing Standard [MPI94], [MPI98], [MPI97]:

Progress rule. If a pair of matching send and receives have

been initiated on two processes, then at least one of these two opera-

tions will complete, independently of other actions in the system: the

send operation will complete, unless the receive is satisfied by another

message, and completes; the receive operation will complete, unless

the message sent is consumed by another matching receive that was

posted at the same destination process.

Remark. The notions operation, initiation of an operation and completion of an operation are not precisely defined in the MPI standard (e.g. a send operation is defined as MPI_Send, a receive operation is defined as MPI_Recv). This leads to misunderstandings among MPI implementors and MPI users. The progress rule in particular is often interpreted incorrectly. It applies not only to MPI_Send and MPI_Recv, as the text of the standard might suggest, but to all communication functions (blocking as well as nonblocking ones). We give a more formal definition of these notions in Section 2.5. •

The reference book on MPI [SOHL+95] uses a slightly different formulation

of the progress rule:

Progress rule: blocking communication. If a pair of match-

ing send and receives have been initiated on two processes, then at

least one of these two operations will complete, independently of other

actions in the system. The send operation will complete, unless the

receive is satisfied by another message. The receive operation will

complete, unless the message sent is consumed by another matching

receive posted at the same destination process.

Progress rule: nonblocking communication. A communication is enabled once a send and a matching receive have been posted by two processes. The progress rule requires that once a communication is enabled, then either the send or the receive will proceed to completion (they might not both complete as the send might be matched by another receive or the receive might be matched by another send). Thus, a call to MPI_Wait that completes a receive will eventually return if a matching send has been started, unless the send is satisfied by another receive. In particular, if the matching send is nonblocking, then the receive completes even if no complete-send call is made on the sender side.

Similarly, a call to MPI_Wait that completes a send eventually returns if a matching receive has been started, unless the receive is satisfied by another send, and even if no complete-receive call is made on the receiving side.

If a call to MPI_Test that completes a receive is repeatedly made with the same arguments, and a matching send has been started, then the call will eventually return flag=true,³ unless the send is satisfied by another receive. If a call to MPI_Test that completes a send is repeatedly made with the same arguments, and a matching receive has been started, then the call will eventually return flag=true, unless the receive is satisfied by another send.

2.4.1 Message assembling and sending

Prior to the initiation of a send operation, the sending process must tell the communication library which data should be sent in the message body (unless the message body is empty). This phase, which we call message assembling, can consist of several function calls when using PVM or MPI. We will only explain the so-called packing method here, the result of which is a contiguous memory buffer which contains the data to be sent (the message body).⁴

Remark. When the communication takes place in a heterogeneous system, the packing functions do slightly more work than simply copying data from the user’s memory space into the send buffer, because the representations of data (e.g. the byte order or the length of basic types) may differ between processes. There is a canonical encoding of all basic types defined in the POSIX standard [ISO90], called XDR encoding. The data in the send buffer is stored in the XDR encoding, which guarantees that the sender and receiver processes interpret it identically. •
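To illustrate why a canonical encoding removes the dependence on the byte order of the host, the following sketch stores a 32-bit integer most significant byte first, which is the byte order XDR uses for integers. The function names put_be32 and get_be32 are hypothetical; the code is not taken from PVM or MPI.

```c
#include <assert.h>
#include <stdint.h>

/* Store v in buf[0..3], most significant byte first.
 * The result does not depend on the byte order of the host. */
static void put_be32(unsigned char *buf, uint32_t v)
{
    assert(buf != 0);
    buf[0] = (unsigned char)(v >> 24);
    buf[1] = (unsigned char)(v >> 16);
    buf[2] = (unsigned char)(v >> 8);
    buf[3] = (unsigned char)v;
}

/* Read back a value written by put_be32. */
static uint32_t get_be32(const unsigned char *buf)
{
    assert(buf != 0);
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}
```

Because both functions work with shifts on values rather than with the memory layout of the integer, a sender and a receiver with different native byte orders agree on the four bytes in the buffer.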

After a message has been assembled, it can be sent to a recipient. The initia-

tion of the send operation is implemented in send functions in the communication

library. If a call to a send function does not return until the send operation has

been completed, the function is called a blocking send (or a synchronous send).

If the function is allowed to return before the completion of the send operation,

it is called a nonblocking send (or an asynchronous send).

The crucial questions here are:

• Is the send buffer allocated and freed by the application or by the communication library?

• How large must the send buffer be in order to store the packed (XDR) data?

• When is it safe to reuse or free the send buffer?

³ The argument flag is called completed in the next section.

⁴ The packing method requires additional data copying. There are other methods of message assembly provided by both PVM and MPI that do not require the data to be stored in a contiguous buffer. However, our performance tests show that copying the data is not very costly compared to the other actions involved in sending a message on commodity systems such as the hpcLine by Fujitsu-Siemens.

PVM

The PVM library internally provides the send buffers. PVM can maintain several

send buffers at a time. There is always one active send buffer and all packing

functions pack (append) the data into the active send buffer. The application can

switch between buffers (by saving the current buffer and making another buffer

the active buffer).

The process which wants to pack a message either calls the function

int pvm_initsend(int encoding)

which destroys the current active buffer and creates a new one, or

int pvm_mkbuf(int encoding)

which creates a new buffer without destroying the active one.

The application does not need to specify the size of the buffer—the PVM

library is responsible for the provision of enough space to store the packed data.

After a buffer has been created, the message is assembled by calling one of

the packing functions (there is one packing function for each basic type). For

instance, this is a function that packs an integer (or an integer array) into the

active send buffer:

int pvm_pkint(int *integer, int count, int stride)

After the message has been assembled in the active send buffer, it can be sent

using the function

int pvm_send(int recipient, int tag)

This function sends the contents of the active message buffer to the recipient; more exactly, it initiates a send operation. The PVM library does not specify whether this function is a blocking or nonblocking send (it can safely be regarded as a nonblocking send). The call may return before the send operation has been completed. However, PVM guarantees that the active send buffer may be reused or freed after the return from a pvm_send() call, and it also guarantees the progress formulated in the progress rule.

MPI

The MPI library requires the application to allocate and free send buffers. No

buffer is necessary in the case of zero-length messages but if the message body is

not empty, the application must decide on the size of the buffer to be allocated

before packing data into it. The application does not know how large the packed representation of the data is. Therefore MPI provides a function

int MPI_Pack_size(int n, MPI_Datatype t, MPI_Comm com, int *size)

which returns in size an upper bound on the number of bytes necessary to store n items of type t sent via the communicator com (a communicator is a “channel” between the sender and receiver processes).

There is only one packing function in MPI:

int MPI_Pack(void *data, int n, MPI_Datatype t, void *buf,
             int size, int *offset, MPI_Comm com)

which packs n items of type t from the memory location data into the buffer buf of size size at offset offset. The buffer is intended to be sent via the communicator com. Note that the value of offset is updated by MPI_Pack. The new value of offset is used by the next call to MPI_Pack (offset must be set to 0 by the application when packing data into an empty buffer).
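The role of the offset argument can be illustrated without the MPI library. The following sketch is a hypothetical analogue of the packing call for 32-bit integers only: the name pack_i32, the return convention and the big-endian layout are assumptions of this example, not part of MPI. It shows how a cursor that starts at 0 is advanced by each call so that successive calls append to the same buffer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical analogue of a packing call for 32-bit integers.
 * Appends n values to buf (capacity: size bytes) starting at *offset,
 * most significant byte first, and advances *offset past the new data.
 * Returns 0 on success and -1 if the data would not fit; on failure
 * neither the buffer nor *offset is modified. */
static int pack_i32(const int32_t *data, int n,
                    unsigned char *buf, int size, int *offset)
{
    assert(data != 0 && buf != 0 && offset != 0);
    if (n < 0 || *offset < 0 || *offset > size)
        return -1;
    if (n > (size - *offset) / 4)
        return -1;
    for (int i = 0; i < n; i++) {
        uint32_t v = (uint32_t)data[i];
        unsigned char *p = buf + *offset;
        p[0] = (unsigned char)(v >> 24);
        p[1] = (unsigned char)(v >> 16);
        p[2] = (unsigned char)(v >> 8);
        p[3] = (unsigned char)v;
        *offset += 4;   /* the next call continues from here */
    }
    return 0;
}
```

After the last packing call, the value of the cursor is the number of bytes actually used in the buffer, which is the quantity a subsequent send call needs.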

There is a variety of send functions in MPI and any of them can be used to send the data which is packed in a buffer; more precisely, to initiate a send operation. The functions differ only in the conditions guaranteed when they return.

int MPI_Send(void *buf, int offset, MPI_PACKED, int recipient,
             int tag, MPI_Comm com)

is a default send. Its semantics corresponds to the semantics of pvm_send in PVM. It is not specified whether the send operation is blocking or nonblocking, and so the call to MPI_Send may return before the send operation has been completed. However, MPI guarantees that the send buffer can be freed or reused by the application after the call has returned, and it also guarantees the progress formulated in the progress rule.

int MPI_Ssend(void *buf, int offset, MPI_PACKED, int recipient,
              int tag, MPI_Comm com)

is a synchronous send. This function does not return until the initiated send operation has been completed. The completion of the send operation implies that the send buffer can be freed or reused.

int MPI_Isend(void *buf, int offset, MPI_PACKED, int recipient,
              int tag, MPI_Comm com, MPI_Request *req)

is a nonblocking send which initiates the send operation and returns immediately, without waiting for its completion. The application must not free or reuse the send buffer until the send operation has been completed. In order to let the application detect the completion, MPI_Isend returns a handle to the send operation (a request) in req, and MPI provides functions that wait for or test the completion of the operation:

int MPI_Wait(MPI_Request *req, MPI_Status *status)

blocks until the operation identified by req has been completed. Hence a return from the call implies that it is safe to free or reuse the buffer.

int MPI_Test(MPI_Request *req, int *completed, MPI_Status *status)

returns immediately, indicating in completed whether the operation identified by req has been completed or not.


int MPI_Bsend(void *buf, int offset, MPI_PACKED, int recipient,
              int tag, MPI_Comm com)

is a buffered send. Note that its interface is the same as the interface of the default MPI_Send. The same applies to its semantics, with one difference: MPI_Bsend first copies the message data from buf to an extra buffer space and then immediately returns. It is therefore safe for the application to reuse or free the buffer buf after the return from an MPI_Bsend() call. The extra buffer space is provided by the application which (prior to an MPI_Bsend() call) calls the function

int MPI_Buffer_attach(void *extra_buf, int size)

where extra_buf points to a block of memory of size bytes which belongs to the application. The complementary function

int MPI_Buffer_detach(void *extra_buf, int *size)

is used to reclaim the extra buffer space from MPI (so that the application can free or reuse it); it returns the address and the size of the detached buffer in its arguments. The extra buffer space must be large enough to store the message data of all buffered send operations that have not been completed at any one time. The function MPI_Buffer_detach() blocks until all buffered sends that use the buffer have completed.

Remark. There are other send functions in MPI which we have omitted here—

we just presented the most important and representative ones. Note that MPI is

much richer than PVM as regards the choice of send functions. However, as we

show in Section 2.6.3, the seemingly richer set of send functions does not mean a

richer functionality. The use of MPI functions which are related to asynchronous

or buffered communication (that means the use of all send functions except the

default and synchronous send) is in fact very limited. •

2.4.2 Message receiving and disassembling

Receiving a message differs from sending a message, even though there are certain

similarities. One difference is that a receive operation can, but does not have to

specify the sender from which it wants to receive a message. Similarly, a receive

operation can, but does not have to, specify the tag a message must have in

order to match the receive operation. This wildcard matching only applies to the

receive operations, not to the send ones.
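The matching rule, including the asymmetry of wildcards, can be written down as a short predicate. This is a sketch under stated assumptions: the types send_op and recv_op, the function ops_match and the wildcard value ANY are hypothetical names for this example, and the sketch ignores MPI communicators.

```c
#include <assert.h>

#define ANY (-1)   /* hypothetical wildcard value for this sketch */

/* An initiated send operation: always fully specified. */
typedef struct {
    int sender;      /* rank of the process which initiated the send */
    int recipient;   /* rank of the destination process */
    int tag;         /* message tag */
} send_op;

/* An initiated receive operation: from and tag may be ANY. */
typedef struct {
    int receiver;    /* rank of the process which initiated the receive */
    int from;        /* required sender rank, or ANY */
    int tag;         /* required message tag, or ANY */
} recv_op;

/* Returns 1 if the send operation s matches the receive operation r. */
static int ops_match(const send_op *s, const recv_op *r)
{
    assert(s != 0 && r != 0);
    if (s->recipient != r->receiver)
        return 0;                          /* addressed to another process */
    if (r->from != ANY && r->from != s->sender)
        return 0;                          /* wrong sender */
    if (r->tag != ANY && r->tag != s->tag)
        return 0;                          /* wrong tag */
    return 1;
}
```

Only the fields of recv_op are compared against ANY; a send operation has no wildcard fields, which mirrors the statement that wildcard matching applies to receive operations only.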

Prior to initiating a receive operation, a buffer must be allocated to the re-

ceiving process. However, how does the receiving process know the size of the

buffer which must be allocated when it decides to receive any message? This is

another difference between sending and receiving: the sender can compute the size of the buffer, whereas the receiver cannot. It is also unclear at first glance whether the application or the communication library should allocate the buffer.

After a message has been received, the receiver usually wants to disassemble


the data stored in the message body. This is one more asymmetry between send-

ing and receiving—the sender knows what the message body contains, whereas

the receiver must rely on the fact that the sender assembled the message using

an agreed method. In other words, the receiver relies on the fact that the sender

obeys the agreed protocol.

As with sending, the question arises as to when the buffer can be freed or reused. The answer is easy: the application (not the communication library) knows when it has disassembled all the data which it needs from the receive buffer.

PVM

The PVM library internally provides the receive buffer and automatically adjusts its size to the size of the incoming message. PVM provides several receive functions. The application can basically decide whether it wants to block until a matching message arrives (this corresponds to the initiation of a receive operation and waiting for its completion) or only to probe whether there is a matching message (this corresponds to a temporary initiation of a receive operation, checking whether it can be matched against a send operation and canceling the receive operation after the check).

The blocking receive function

int pvm_recv(int from, int tag)

initiates a receive operation and blocks until a message which matches from and

tag arrives. If the parameter from is set to −1, then a message coming from any

process is matched (ranks of all processes are positive, the value of −1 serves as

a wildcard). Similarly, tag set to −1 matches any message tag. The arrival of a

matching message completes the receive operation and the function call returns

to the application.

The nonblocking receive function

int pvm_nrecv(int from, int tag)

initiates a receive operation and checks for a matching message. If there is one,

the message is delivered and the receive operation is completed. If there is none,

the receive operation is canceled. In both cases, the function call returns without

being blocked.

The probing function

int pvm_probe(int from, int tag)

does not create a receive operation; it only checks whether there is a message which matches from and tag. The function call returns without blocking. Note that if pvm_probe detects a matching message then a subsequent call to pvm_recv or pvm_nrecv (with the same parameters) completes the receive operation initiated by that call.
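The difference between probing and nonblocking receiving can be modelled with a queue of arrived message headers. The sketch below is a toy model, not PVM code: the mailbox type, the mbox_* functions, the capacity and the boolean return values are all assumptions made for illustration (the PVM functions themselves return buffer identifiers).

```c
#include <assert.h>

#define ANY (-1)        /* wildcard for sender rank or tag */
#define MBOX_CAP 16     /* arbitrary capacity of the toy mailbox */

typedef struct { int from; int tag; } arrived;   /* header of an arrived message */
typedef struct { arrived q[MBOX_CAP]; int n; } mailbox;   /* kept in arrival order */

/* Record the arrival of a message; returns 0 if the mailbox is full. */
static int mbox_put(mailbox *mb, int from, int tag)
{
    if (mb->n == MBOX_CAP)
        return 0;
    mb->q[mb->n].from = from;
    mb->q[mb->n].tag = tag;
    mb->n++;
    return 1;
}

/* Index of the earliest arrived message matching from and tag, or -1. */
static int mbox_find(const mailbox *mb, int from, int tag)
{
    for (int i = 0; i < mb->n; i++)
        if ((from == ANY || mb->q[i].from == from) &&
            (tag == ANY || mb->q[i].tag == tag))
            return i;
    return -1;
}

/* Probe: reports whether a matching message exists; removes nothing. */
static int mbox_probe(const mailbox *mb, int from, int tag)
{
    return mbox_find(mb, from, tag) >= 0;
}

/* Nonblocking receive: delivers the earliest match into *out and
 * returns 1, or returns 0 at once (receive canceled) if none exists. */
static int mbox_nrecv(mailbox *mb, int from, int tag, arrived *out)
{
    int i = mbox_find(mb, from, tag);
    if (i < 0)
        return 0;
    *out = mb->q[i];
    for (int j = i + 1; j < mb->n; j++)
        mb->q[j - 1] = mb->q[j];
    mb->n--;
    return 1;
}
```

In this model a successful probe leaves the mailbox unchanged, so an immediately following receive with the same arguments is certain to find the message, provided no other receive runs in between.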

The functions pvm_recv and pvm_nrecv return an integer which identifies the buffer where the delivered message is stored on the completion of the receive operation. Note that if a wildcard was used in a call to these functions, the application still does not know which process sent the message and with which tag. This information (the message header) can be obtained using the function

int pvm_bufinfo(int bufid, int *bytes, int *tag, int *from)

This function takes the buffer identifier bufid which is returned by pvm_recv or pvm_nrecv and returns information on the message header: the length of the message (in bytes), the message tag and the rank of the sender (in from).

After a message which matches pvm_recv or pvm_nrecv has been delivered, the buffer storing the message becomes the active message buffer. The data stored in the active message buffer can be disassembled using unpacking functions (there is one unpacking function for each basic type). For instance, this is a function that unpacks an integer (or an integer array) from the active receive buffer:

int pvm_upkint(int *integer, int count, int stride)

The application has the option to save the active message buffer and free it later. If it does not, the contents of the active message buffer are simply overwritten by a new message when a subsequent receive operation completes.

MPI

The MPI library does not automatically provide receive buffers. The applica-

tion is responsible for the allocation of a sufficiently large buffer to the incoming

message when initiating a receive operation. It is silently assumed that the ap-

plication already knows the contents of the message (types of the data stored

in the message) that can arrive before initiating a receive operation. To sim-

plify matters, we assume that the arriving messages have been assembled using

the packing method. The receiver can determine the size of the buffer using the MPI_Pack_size function which we described when talking about message packing.

The default receive function is synchronous:⁵

int MPI_Recv(void *buf, int count, MPI_Datatype t, int from,
             int tag, MPI_Comm com, MPI_Status *status)

initiates a receive operation and then blocks until the receive operation is completed. buf is the receiving buffer where the message data are stored, count is the maximum number of data elements of type t that may be received and stored in buf, and t is the type of the data elements. The from and tag parameters specify the messages which are allowed to be received. Wildcards can be used in both from and tag: MPI_ANY_SOURCE in from accepts messages from all processes, MPI_ANY_TAG in tag accepts all message tags. A return from an MPI_Recv call guarantees that a matching message has been delivered and that the buffer buf can be unpacked, freed or reused by the application. The header of the message received is stored in status.

⁵ MPI has no receiving counterpart to the synchronous send function MPI_Ssend. There is no receiving counterpart to the buffered send function MPI_Bsend, either.

The nonblocking receive function

int MPI_Irecv(void *buf, int count, MPI_Datatype t, int from,
              int tag, MPI_Comm com, MPI_Request *req)

initiates a receive operation and immediately returns, without blocking and without canceling the receive operation. A return from the MPI_Irecv() call does not guarantee the completion of the receive operation. The parameters have the same meaning as in MPI_Recv, except that there is no status parameter: the header of the received message is reported by the call that completes the operation. The additional req parameter is a handle of the receive operation. The handle can be passed to the MPI_Wait or MPI_Test functions, which respectively wait for the completion of the receive operation (blocking) or test whether the receive operation is completed (nonblocking).

The blocking probe function

int MPI_Probe(int from, int tag, MPI_Comm com, MPI_Status *status)

blocks until a send operation that matches the parameters from and tag is found. The return from this function guarantees the completion of a subsequent receive operation with the same parameters. The MPI probe functions do not receive the message.

The nonblocking probe function

int MPI_Iprobe(int from, int tag, MPI_Comm com, int *flag,
               MPI_Status *status)

does not create a receive operation; it only checks whether there is a message which matches from and tag. The function call returns without being blocked and the information as to whether a matching message is available is stored in flag. Note that if MPI_Iprobe detects a matching message then the completion of a subsequent receive operation with the same parameters is guaranteed.

2.5 Unifying framework for message passing

In this section we introduce a new framework for message passing. Our main

intention is not to introduce another formal model but to unify the existing ones.

Section 2.3 discussed the similarities and differences between the PVM and MPI

standards but it is perhaps not yet obvious as to whether a program which uses

PVM functions can also be written using MPI functions or vice versa.

Well-known models of parallel processing include PRAM (Parallel Random

Access Machine) [FW78], ATM (Alternating Turing Machine) [MS87], Cellu-

lar Automata [vNe66] etc. Widely accepted models of message passing are

CSP (Communicating Sequential Processes) by Hoare [Hoa85] and the “chan-

nel model” by Andrews [And91]. The formal message passing models as well as

the real-life message passing standards are apparently different even though they

all deal with communication between parallel processes. However, the models

are equivalent in the sense that a program in one model can be simulated in any

other model. We believe that the framework we propose covers all the existing


message passing models, formal ones as well as the real-life needs.

Another reason for the introduction of a unifying framework is that we know of no formal model which defines message passing, which can be directly mapped onto contemporary networks and which reflects the notions used in the definitions of real-life message passing systems such as PVM or MPI. For instance, Hoare’s original model lacks asynchronous communication, which is used in all contemporary real-life message passing systems. The “channel model” by Andrews defines the semantics of synchronous and asynchronous communication but the underlying mechanism used for communication between two processes is a channel shared between the two processes. A channel is an abstract FIFO structure similar to a pipe in the UNIX operating system. A channel has a capacity: it stores messages sent by the sender process. The receiver process removes the messages from the channel. However, the wires in real-life computer networks do not have a capacity; either the receiver or the sender processes store messages, not the wire itself. Therefore the channel abstraction cannot be directly applied to real-life networks. The insertion of a third process between the sender and the receiver does not directly help because the question remains as to how the sender and the receiver can communicate with the third process.

Our framework covers fundamental message passing concepts: synchronous

and asynchronous communication, buffering, flow control. The major novelty of

the framework is a strict separation of the interface between a parallel application

and the message passing system from the implementation of the message passing

system. This allows us to define minimal semantical requirements for the imple-

mentation of a message passing system which are independent of the hardware,

operating system, means of communication, programming language and other

similar factors used in the implementation of the communication system.

From a software engineering point of view, our message passing framework

defines an interface between application programmers and implementors of com-

munication systems in terms of basic message passing operations. This is what

existing message passing models also do but at the same time they bind the

semantics of the operations to mechanisms used in the implementations of the

message passing systems. This binding reduces the set of architectures onto which

the models can be directly mapped. Our framework avoids such a binding.

The binding of operations to mechanisms also makes it difficult to compare message passing models which are based on different mechanisms. For instance, reasoning in a model bound to shared memory communication differs considerably from reasoning in a model bound to channel communication. These two apparently different models describe the same concept, even though it is not obvious how to express, say, the proof of correctness of a program written in one model in terms of the other model. This is where our model helps. An abstraction from a particular mechanism makes the reasoning valid for a whole class of models which adhere to the semantics defined in our framework.

Our framework for message passing systems is in many ways similar to the well-known framework for database systems used by academic researchers as well as by the implementors of database systems [BHG87], [BL93], [Bac98]. It defines an interface between the database application and a database system. The interface consists of basic operations which work on database records: read and write,

insert and delete. (The last two operations are often omitted in database text-

books that silently assume that the database is non-empty and its cardinality does

not change.) The semantics of these basic operations is defined independently of

the actual binding of the primitives to a database programming language and in-

dependently of the implementation of the operations in the database system. On

the one hand, this allows the writing of database applications without any knowl-

edge as to how these operations are implemented—this means, independently of

other applications running in the system, independently of whether the database

system is a centralised or a distributed one and independently of the hardware or

the operating system. On the other hand, the clean interface definition gives rise

to development of important abstract theories such as serialisability and recovery

which help the implementors of database systems to optimise their systems while

adhering to the semantics of the basic operations. This all holds for the message

passing framework proposed in this paper, only the set of operations and their

semantics are different to those defined in database systems. The database op-

erations work on database records, whereas the message passing operations work

on messages.

A small step towards a similar separation of message passing operations from

their implementation can be observed in the definition of the Message Passing

Interface, MPI. However, the semantics of MPI is described quite informally on more than 300 pages of text and the notions used throughout the text are not used consistently. This often causes confusion among both application programmers and implementors of MPI. In particular, we show that the MPI language binding does not include asynchronous communication even though the MPI standard claims to support it. The same holds for another standard, CORBA (Common Object Request Broker Architecture) [Gro98], [HV99]. In Section 2.6.3, we

present code fragments crucial to irregular applications which need asynchronous

communication and which cannot be expressed in MPI or CORBA. The “asyn-

chronous communication” defined in MPI and CORBA is not equivalent to the

asynchronous communication defined in fundamental abstract models.

2.5.1 Components of the message passing framework

This section gives a formal definition of the message passing framework. Firstly

we introduce the components used in the framework and their roles. The com-

ponents and their relationships are depicted in Fig. 2.10.

[Figure 2.10: Components of the message passing framework. Labels in the figure: application process, language binding, basic message passing operations, operation binding, message passing system.]

• Application process is a component that needs to communicate with other similar components. The application process is typically a process in the POSIX sense but this framework does not require that. It is not important whether the process is single- or multi-threaded, in which programming language it is written, or which communication primitives it uses. What is important from the point of view of this framework is that the communication primitives of the application process can be expressed using the four basic message passing operations.

• Basic message passing operations are four abstract operations which are provided by the message passing system: create, destroy, recv and send. These operations are the interface between the application process and the message passing system.

• Message passing system is a system that implements the semantics of the basic message passing operations. The implementation does not need to be hardware or software. An abstract theory defining protocols which implement the basic operations on shared memory architectures can also be seen as a message passing system, and so can a theory defining protocols which implement the basic operations on distributed memory systems.

In the database analogy, an application process corresponds to a transaction

which passes basic database operations to a database system. The process can

be written in an arbitrary programming language, for instance C or OCCAM

(similarly, a database transaction can be written for instance in SQL or embedded

C).

The language binding defines how constructions of the programming language which is used for the implementation of the application process translate into sequences of basic message passing operations. The language binding can be implemented as a precompiler which generates function calls that generate the basic operations and pass them to the message passing system. The choice between synchronous and asynchronous communication falls within the competence of the language binding; this choice does not influence the semantics of the basic message passing operations or the message passing system.

The set of the basic message passing operations is relatively small. We suggest

the set that consists of only four operations which cover the fundamental needs of

the exchange of information between processes. The four basic message passing

operations have their counterparts in database systems as shown in Table 2.1.

Message passing    Databases
---------------    ---------
recv               read
send               write
create             insert
destroy            delete

Table 2.1: Basic message passing operations and their counterparts in database systems

The operation binding specifies how the basic message passing operations are

mapped onto a specific architecture of the message passing system. A somewhat

artificial example of a message passing system architecture is a single process

which reads the basic operations from a file and interprets them. Another example

of an architecture is a ring which directly connects application processes. In this

case the message passing system must use an internal protocol which implements

the routing of messages in the ring. Such a protocol is called a mechanism in our

framework.

The message passing system processes the operations which arrive from appli-

cation processes. The processing of the operations in the message passing system

implicitly defines the semantics of the operations.

2.5.2 Application process

The application process passes basic message passing operations to the message

passing system. Each application process has a unique identifier. The set of

application processes can be either static or dynamic. We will consider the dy-

namic model in which the message passing system assigns the unique identifiers

to processes on-the-fly as the processes sign on and off. A static model is only a

special case of the dynamic model.

An application process does not need to be a single process in the POSIX

sense. It can for example be a collection of POSIX processes or a single thread of

control or even a group of people. However, from the message passing system’s

point of view one application process is regarded as one entity. The message

passing system does not need to know what an application process does or what

it looks like. The system only obtains basic message passing operations generated

by the application processes and performs them.

An application process can directly pass the basic message passing opera-

tions to the system. However, it can also use higher communication primitives

(e.g. a barrier synchronisation instruction) from which the basic operations are

generated. The message passing system does not see the higher communication

primitives. We call the mapping from higher communication primitives to basic

operations a language binding. Language binding does not affect the semantics

of the basic message passing operations.

2.5.3 Basic message passing operations

The basic message passing operations are the interface between the application

and a message passing system. The semantics of these operations is independent

of the higher communication primitives or of the implementation of the system.

Firstly we introduce some notions needed to informally explain the semantics of

the basic operations. An important notion is the scope of an application process.

Any object can only be manipulated by the application process in whose scope

the object exists.

Application processes use messages as the only means of communication. The

create operation creates a new message in the application process which issues

the create operation. An application process can only access a message (read or

write the contents of the message) that exists in its scope. Our framework does

not specify how messages are represented or what they contain. The language

primitives for reading or writing a message are not part of the interface between

the application process and the message passing system. The destroy operation

removes a message from the scope of the application process which issues the

destroy operation.

There are two basic operations in relation to point-to-point message passing between two processes. The send operation creates a new send request in the scope of the application process which issues the send operation. A send request is a tuple <S, x, y, M>, where x is the identifier of the process which issues the send operation, y is the identifier of the destination process (y can be equal to x) and M is the identifier of a message which exists in the scope of the process which issues the send operation. The completion of the send request means that the message M has been removed from the scope of the process x and an exact copy of M has been created in the scope of the process y.

Similarly, the recv operation creates a new receive request in the scope of the application process which issues the recv operation. A receive request is a tuple <R, x, y, >, where x is the identifier of the process which issues the recv operation, y is either an identifier of the source process (y can be equal to x) or ∗ (∗ denotes any source process), and the empty fourth component means that no message is associated with the receive request.⁶ The completion of the receive request means that a new message has been created in the scope of the process x. The message identifier is returned upon the completion of the receive request. Right before the completion, an exact copy of the message exists in the scope of the source process y if y ≢ ∗, or in some source process if y ≡ ∗.

A request is an abstract entity which is used in the formal definition of send

and recv operations. Some systems do not even have to explicitly represent

requests—provided that the formal semantics of recv and send operations are

correctly implemented.

Note that there are no operations removing requests from the system. They

are not needed. The message passing system takes care of the removal of requests

after their completion (if requests are explicitly represented in the implementation

of the system).

2.5.4 Message passing system

Basic message passing operations are the input of the message passing system. The message passing system processes the operations and looks for matching request pairs. The matching algorithm implicitly defines the semantics of the basic message passing operations.

Requests R_1 ≡ <S, x_1, y_1, M> and R_2 ≡ <R, x_2, y_2, > are a matching pair iff y_1 ≡ x_2, and either y_2 ≡ x_1 or y_2 ≡ ∗.

When the system finds a matching request pair

<S, x_1, y_1, M>, <R, x_2, y_2, >

it completes the request pair by performing the following sequence of actions (the message M is passed from the process x_1 to the process y_1):

• A new message M' is created in the scope of the process x_2. The contents of M' are identical to the contents of the message M.

• Message M is removed from the scope of the process x_1.

• Both the send and the receive requests are completed (removed from the system).

A request that has been created remains in the system until it is matched with some other request. The system guarantees that matching request pairs are not ignored forever (weak progress): If there is a matching request pair at any one time then the system eventually finds and completes some matching pair. This guarantee can be strengthened (strong progress): If there is a matching request pair at any one time then the system eventually completes at least one of the requests of this matching pair.

⁶ The semantics of the send and recv operations can be easily extended so that any set of destination and source process identifiers is stored in send and receive requests. We only present the one-destination and one-or-all-source semantics in order to simplify the notation.

The message passing system can work sequentially in discrete time steps. In each time step it either reads a new message passing operation or completes a matching request pair (if there is one). In this case the completion of request pairs must be prioritised over the reading of new operations in order to guarantee strong progress. If the rate of incoming operations is higher than the rate of completed request pairs, some request pairs can remain in the system forever.⁷

The system can complete several request pairs at the same time (if the system

works in discrete time steps, several request pairs can be completed in one time

step). However, the effect of the parallel completion must be equivalent to the

effect of some sequential completion.
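The matching rule and the completion steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an implementation from this work: the class name MessagePassingSystem, the sign_on helper, the received bookkeeping and the string "*" standing for the wildcard ∗ are all hypothetical, and requests are represented explicitly although the framework does not require that.

```python
from collections import deque
from itertools import count

ANY = "*"  # assumed stand-in for the wildcard source (written as ∗ in the text)


class MessagePassingSystem:
    """Sequential sketch: each call to step() completes at most one matching pair."""

    def __init__(self):
        self.scopes = {}       # process id -> {message id: contents}
        self.sends = deque()   # pending send requests <S, x, y, M>, stored as (x, y, M)
        self.recvs = deque()   # pending receive requests <R, x, y, >, stored as (x, y)
        self.received = {}     # process id -> ids of messages delivered so far
        self._ids = count(1)

    def sign_on(self, pid):
        # Hypothetical helper: registers a process (dynamic process model).
        self.scopes[pid] = {}
        self.received[pid] = []

    def create(self, pid, contents=None):
        mid = next(self._ids)
        self.scopes[pid][mid] = contents
        return mid

    def destroy(self, pid, mid):
        del self.scopes[pid][mid]

    def send(self, x, y, mid):
        self.sends.append((x, y, mid))

    def recv(self, x, y=ANY):
        self.recvs.append((x, y))

    def step(self):
        """Find one matching pair and complete it; return True on progress."""
        for s in self.sends:
            x1, y1, mid = s
            for r in self.recvs:
                x2, y2 = r
                if y1 == x2 and y2 in (x1, ANY):           # the matching rule
                    new_id = self.create(x2, self.scopes[x1][mid])  # M' in scope of x2
                    del self.scopes[x1][mid]               # M leaves the scope of x1
                    self.received[x2].append(new_id)
                    self.sends.remove(s)                   # both requests are completed
                    self.recvs.remove(r)
                    return True
        return False


mps = MessagePassingSystem()
for pid in ("a", "b"):
    mps.sign_on(pid)
m = mps.create("a", "hello")
mps.send("a", "b", m)
mps.recv("b", "c")   # only accepts a message from "c": does not match the send from "a"
mps.recv("b")        # wildcard source: matches
while mps.step():
    pass
```

After the loop, the message has moved from the scope of "a" to the scope of "b", and the receive request naming "c" as its source is still pending because no matching send request exists.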

2.5.5 Language binding

The application process does not need to know how to pass basic message passing operations to the message passing system. It does not even need to use the basic operations explicitly: it can use collective message passing operations, broadcasting and similar operations that are not included among the basic ones. The process can even be written in a non-imperative programming language, e.g. LISP or PROLOG. The language binding defines how sequences of the basic operations are generated from the higher-level programming constructs used by the application process and how the operations are passed to the message passing system.

Synchronous and asynchronous message passing constructs

A particularly interesting issue as regards the definition of higher-level languages for message passing is the support of synchronous and asynchronous message passing constructs. Note that the semantics of the basic message passing operations is asynchronous in the sense that the completion of a request which is associated with a basic operation (see Section 2.5.3 and Section 2.5.4) is independent of the intervention of the application process. Below we sketch what the message passing constructs might look like in an imperative programming language such as C.

An asynchronous send, Isend (the “I” stands for “immediate”), can be implemented as a single function call which in turn produces a basic send operation that is passed to the system. The application process does not need to know whether or when the request is completed. Note that this implies that the message passing system is responsible for freeing the message buffer after the completion of the request. The MPI standard [MPI94], [MPI98], [MPI97] specifies that the application is responsible for freeing the buffer, see Section 2.4. The consequence of this is polling in the MPI applications, as we show in Section 2.6.3.

⁷ Note that the pending requests and messages associated with them consume memory. In order to keep the concepts clean, our framework assumes a potentially infinite amount of memory available in processes. An incorporation of a flow control mechanism (which bounds the amount of memory used by the pending requests) is possible in real-world implementations of the framework.

A synchronous send, Ssend (the “S” stands for “synchronous”), can be im-

plemented as a single function call which produces a basic send operation that

is passed to the system. This function call blocks until the request which is

associated with the operation is completed. The application is not responsible

for freeing the message buffer after the completion of the request. (The MPI

standard states otherwise, see Section 2.4.)

An asynchronous receive, Irecv, can be implemented as a single function call which in turn produces a basic recv operation that is passed to the system. An asynchronous receive means that the application process is willing to receive a message in the future but does not want to wait for it now. It is the responsibility of the application to wait for the requested message when it needs it (the application is not able to ask the message passing system whether or not the message has already arrived, but the system can notify the application when the request has been completed). The message passing system is responsible for the allocation of memory for the incoming message; the application is responsible for freeing the memory when the contents of the message are no longer needed. (The MPI standard states that the application is responsible for the allocation of memory for the incoming message, see Section 2.4.)

A synchronous receive, Recv, can be implemented as a single function call

which produces a basic recv operation that is passed to the system. The function

call blocks until the request which is associated with the operation is completed.

The message passing system is responsible for the allocation of memory for the

incoming message; the application is responsible for freeing the memory when the

contents of the message are no longer needed. (The MPI standard states that the

application is responsible for the allocation of memory for the incoming message,

see Section 2.4.)

Remark. The set of the four point-to-point functions can be reduced to two.

The relevant functions are the blocking Recv and the nonblocking Isend. This

choice is natural. A process (or a thread) wants to receive a message when it

cannot proceed without the message. A process (or a thread) which sends a

message does not usually need to know when or whether the recipient decided to

receive the message. •
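The four constructs described above can be illustrated with a small event-driven sketch in Python (chosen here only for brevity; the text itself has C in mind). Everything in it is an assumption made for illustration: the names Request and System, the per-receiver request queues, and the simplification that every receive uses the wildcard source. Blocking is realised with threading.Event, so no construct polls.

```python
import threading
from collections import deque


class Request:
    """Explicit request with a completion event (hypothetical representation)."""

    def __init__(self, message=None):
        self.message = message
        self.completed = threading.Event()

    def wait(self):
        self.completed.wait()      # blocks without polling
        return self.message


class System:
    def __init__(self):
        self.lock = threading.Lock()
        self.sends = {}            # destination id -> pending send requests
        self.recvs = {}            # receiver id -> pending receive requests

    def _match(self, pid):
        sends = self.sends.setdefault(pid, deque())
        recvs = self.recvs.setdefault(pid, deque())
        while sends and recvs:     # complete matching pairs in FIFO order
            s, r = sends.popleft(), recvs.popleft()
            r.message = s.message
            s.completed.set()
            r.completed.set()

    def isend(self, dst, message):
        """Asynchronous send: issue the basic send operation and return at once."""
        request = Request(message)
        with self.lock:
            self.sends.setdefault(dst, deque()).append(request)
            self._match(dst)
        return request             # handle used only by ssend below

    def ssend(self, dst, message):
        """Synchronous send: block until the send request is completed."""
        self.isend(dst, message).wait()

    def irecv(self, me):
        """Asynchronous receive: issue the basic recv operation, wait later."""
        request = Request()
        with self.lock:
            self.recvs.setdefault(me, deque()).append(request)
            self._match(me)
        return request

    def recv(self, me):
        """Synchronous receive: block until the receive request is completed."""
        return self.irecv(me).wait()


system = System()
pending = system.irecv("b")        # "b" posts a receive before any message exists
system.isend("b", "first")         # completes the pending receive
system.isend("b", "second")        # stays pending until "b" receives again
got = [pending.wait(), system.recv("b")]

posted = system.irecv("c")
system.ssend("c", "sync")          # returns because a matching receive is pending
got.append(posted.wait())
```

In this sketch Recv and Ssend are thin wrappers around Irecv and Isend plus a wait on the completion event, which is one way to read the remark that Recv and Isend are the natural pair to keep.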

Remark. The message passing primitives defined in MPI cannot be mapped to

our message passing framework. More precisely, it is impossible to extract the

basic message passing operations from the MPI functions so that their semantics

would correspond to the semantics defined in our message passing framework.

The reason is the wrong buffer allocation and deallocation policy defined in the

MPI standard. Roughly expressed, the MPI standard does not define message

passing.

The automatic buffering policy and the language primitives of PVM match our framework. However, the lack of thread-safety in the current PVM implementations imposes restrictions on the use of the language primitives (i.e. on the generation of sequences of message requests in a process). •

2.5.6 Operation binding

Operation binding specifies how the semantics of the basic message passing op-

erations is implemented on a specific architecture of a message passing system.

In other words, operation binding states how the system performs its steps, how

it finds the matching request pairs and how it completes them. An architecture

does not necessarily mean hardware. Examples of abstract architectures include

processors connected to a ring or a torus with channels, processors which share

memory, processors connected to an Ethernet network etc. A mechanism (a pro-

tocol) must be found for every architecture that at least simulates a sequential

message passing system with the weak progress guarantee.

Examples of protocols which may be useful for the definition of operation

binding for distributed memory architectures can be found in [Ray88]. The pro-

tocols include election algorithms for various topologies, decentralised deadlock

detection, termination detection, distributed data management, fault tolerance

algorithms etc.

Equivalence of our framework to Andrews’ message passing model

The channel model by Andrews is described in [And91] in Chapter 7, “Asyn-

chronous Message Passing”. Synchronous message passing is defined in Andrews’

book as asynchronous message passing with some additional constraints. We

will not repeat the whole formal semantics of Andrews’ model here (otherwise

we would have to repeat the definition of the programming logic which is used

throughout the book), we will only focus on the key concepts of the model.

Andrews’ model is based on the concept of channels. A channel is basically

a FIFO queue in which messages are stored. The queue can be extended by

appending a message to the tail of the queue and shortened by removing a message

from the head of the queue. The queue has a potentially unlimited capacity—that

means, a message can be appended to it at any one time.

Andrews defines the semantics of send and receive:

“The effect of executing send ch(expr_1, ..., expr_n) is to evaluate the expressions expr_1, ..., expr_n, then append a message containing these values to the end of the queue associated with channel ch.”

“The effect of executing receive ch(var_1, ..., var_n) is to delay the receiver until there is at least one message on the channel’s queue. Then the message at the front of the queue is removed, and its fields are assigned to the var_i. Thus, in contrast to send, receive is a blocking primitive since it might cause delay. The receive primitive has blocking semantics so that the receiving process does not have to busy-wait polling the channel if it has nothing else to do until a message arrives.”

Andrews further defines a property which ensures that if a receiver is blocked

on a channel and a message arrives on the channel, then the receiver will even-

tually consume the message.

Claim. If Andrews’ model is restricted so that each channel is read by exactly

one process (but several processes may write into the same channel), then the

semantics of an abstract message passing system is equivalent to the semantics

of the channel model defined by Andrews.

Proof. The only technical difficulty in the comparison of the two models is that Andrews’ model silently assumes a dynamic allocation and deallocation in send and receive (this is the memory which stores the elements of the channel FIFO queue). In our model we assume an explicit dynamic allocation and deallocation of the memory which stores the messages. This allocation and deallocation is carried out in the create and destroy operations, separately from the recv and send operations.

We will first show that Andrews’ channel model can simulate our model. Consider a program which contains message passing operations create, destroy, recv, send. Replace each recv in the program with receive ch(vars) where ch is the only channel which is read by the receive. (Here vars denotes the contents of the message which has been allocated by the corresponding create operation. The semantics of the create operation remains unchanged: this operation performs a dynamic allocation of memory for the message. Similarly, the semantics of the destroy operation remains unchanged: this operation performs a dynamic deallocation of the memory block which has been created by some create operation.) Replace each send in the program with send ch(expr) where ch is the channel which is read by the recipient. It is easy to observe that the replacements above do not change the semantics of the program in Andrews’ model.

We will now show that our message passing model can simulate the (slightly restricted) Andrews’ channel model. Consider a program which contains send and receive statements with the channel semantics. Replace each channel send ch(exprs) statement with the sequence of two message passing operations create; send. The former creates a message which is large enough to store the values of exprs. The latter creates a send request which addresses this message to the process which reads the channel ch. The program which issues these operations is further extended with the filling of the newly created message with exprs in the time period after it has issued the create operation and before it has issued the send operation. Replace each channel receive ch(vars) statement with the sequence of two message passing operations recv; destroy. The first creates a receive request which matches all messages (a wildcard is used in this request). The second frees the memory used by the incoming message. The program which issues these operations is extended with the filling of its local variables vars with the contents of the received message before it has issued the destroy operation. It is easy to observe that the eventual consumption of the message by the receiver is guaranteed by our model (by the progress rule). •
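The second direction of the proof (channel statements expressed through the four basic operations) can be made concrete with a short Python sketch. It is a simplified illustration, not the construction used in any real system: the names Process, READER, channel_send and channel_receive are hypothetical, each process is assumed to read exactly one channel, and a send request is completed immediately by handing the contents to a queue owned by the reader.

```python
import queue
from itertools import count

_ids = count(1)
processes = {}                      # process id -> Process


class Process:
    def __init__(self, pid):
        self.pid = pid
        self.scope = {}             # message id -> contents
        self.incoming = queue.Queue()
        processes[pid] = self


# The four basic operations (simplified stand-ins).
def create(p):
    mid = next(_ids)
    p.scope[mid] = None
    return mid


def destroy(p, mid):
    del p.scope[mid]


def send(p, dest_pid, mid):
    contents = p.scope.pop(mid)     # the message leaves the scope of the sender
    processes[dest_pid].incoming.put(contents)


def recv(p):
    contents = p.incoming.get()     # wildcard source; blocks until a message exists
    mid = next(_ids)
    p.scope[mid] = contents         # a copy appears in the scope of the receiver
    return mid


READER = {"ch": "consumer"}         # restriction: one reading process per channel


def channel_send(p, ch, *exprs):
    mid = create(p)                 # create;
    p.scope[mid] = exprs            # fill the message between create and send
    send(p, READER[ch], mid)        # send


def channel_receive(p, ch):
    assert READER[ch] == p.pid
    mid = recv(p)                   # recv;
    values = p.scope[mid]           # copy the contents into local variables
    destroy(p, mid)                 # destroy
    return values


producer, consumer = Process("producer"), Process("consumer")
channel_send(producer, "ch", 1, 2)
channel_send(producer, "ch", 3, 4)
first = channel_receive(consumer, "ch")
second = channel_receive(consumer, "ch")
```

The FIFO order of the channel is preserved here because the queue owned by the reader is itself FIFO; after both receives, neither process holds a message in its scope.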

Remark. Note that the restriction in the Andrews’ model (each channel is read

by exactly one process) is only made in order to simplify the operation binding for

distributed memory architectures. If we allow wildcards in send requests in our

model then this restriction is not necessary. (Only the definition of the request

matching must be extended in our model in this case. The rest of the model

remains unchanged.) •

2.6 Threaded non-trivial PVM and MPI applications

2.6.1 Threads and thread-safety

The process model implemented in the Transputer is—as expressed in contem-

porary terms—a restricted concurrent thread model. An Occam program which

is executed in a Transputer can be seen as a process that consists of several

independent threads of control.

All modern operating systems support the concept of threads. A process is

a program which can consist of several threads [ISO90]. The threads share the

same memory space (except for the stack—each thread has its own stack) but

each has its own flow of control. An “ordinary” sequential program runs as a

process with one thread.

Definition. A thread is an encapsulation of the flow of control in a program. •

Particular care must be taken when writing multi-threaded programs. There

is only one running thread at any one time (on a single-processor system), but the

running thread can be descheduled and replaced with some other thread anytime

(unless the thread scheduling policy states otherwise). Any thread which is not

blocked at any one time can be scheduled to run. The interleaving of threads is

equivalent to a quasi-parallel execution of the threads.

An explicit synchronisation is needed when threads share resources (e.g. variables).⁸

There are many examples of so-called racing conditions (an unwanted

phenomenon caused by an unforeseen order of the execution of running threads)

in concurrent programming textbooks [And91]. A popular example is a linked

list which is manipulated by two threads, whereby one thread adds an item to

the list and the other thread removes an item. The linked list becomes corrupted

when the two threads are interleaved in a certain way. Processors and operating

systems provide mechanisms for the synchronisation of threads, such as mutexes

and semaphores.

Several threads often need to call the same function. I/O functions such as

open, read, write are very common examples. A problem may occur when a

function needs to maintain a global state. For instance, the implementation of

the open function may add the newly opened file to a global linked list. In such

a case a racing condition can occur when a thread calls open() while another

thread is manipulating the linked list in its open() call. A function is regarded

as thread-safe when its implementation avoids racing conditions.

Definition. Thread-safe function (reentrant function) is a function that can be

concurrently called from several threads, providing the same semantics to each

individual thread. •

A thread-safe function can thus be called from several threads without corrupting memory (without destroying internal memory structures used by that function) and without thread-interference (the semantics of the function does not change from the point of view of the calling thread if at the same time the function is concurrently called from another thread).

Functions are usually collected in libraries. Even if each of the functions is

thread-safe, a racing condition may occur when different functions of a library are

concurrently called from different threads. A library is regarded as thread-safe if

such a racing condition cannot occur.

Definition. Thread-safe library is a library in which all functions can be con-

currently called from several threads without memory corruption and without

thread-interference. •

Practice shows that the programming of thread-safe libraries is difficult. However, thread-safety is so important for many applications that vendors of operating systems invest the effort into making at least the low-level system libraries [ISO90] thread-safe, and carefully document those which remain thread-unsafe. System calls which may eventually block deserve special attention: a blocked system call must only block the calling thread, not the entire process.

⁸ Note that no synchronisation is needed when multiple threads only read a shared variable.

A natural scenario of the implementation of a non-trivial application (Section 2.1) involves running two threads in each process as shown in Fig. 2.11. The heavy computation is hidden in the function compute() of the thread T1. We restrict the occasional communication of the thread T1 to only sending messages to other processes (later it becomes clear why this restriction is not important). The thread T2 services the incoming requests. The recv function is a blocking receive (pvm_recv in PVM, MPI_Recv in MPI). This means that if there are no requests to be serviced, the thread T2 is blocked in the recv() call, leaving the full CPU power for the computation in T1.

thread T1()
{ while (not done)
  { compute();
    send();
  }
}

thread T2()
{ while (not done)
  { recv();
    service_request();
  }
}

Figure 2.11: Natural threaded implementation of one process of a non-trivial application: it only works if the communication library is thread-safe

The natural scenario of Fig. 2.11 only works if the communication library is thread-safe (because of the concurrent send() and recv() calls). We are aware of no thread-safe implementation of PVM. There are some MPI implementations that are thread-safe but internally use active polling, see Section 2.6.3. The only two exceptions we know about are the IBM implementation for IBM RS/6000 SP [Tre97] and the MPI/Pro implementation by Software Technologies, Inc. [DS02].

Observation. Most existing implementations of PVM and MPI are not thread-safe. This means that (most) PVM/MPI functions cannot be concurrently called from several threads. The natural scenario of Fig. 2.11 does not work for any of these implementations. •
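For comparison, the natural scenario of Fig. 2.11 can be run as-is when the communication layer is thread-safe and its receive blocks without consuming CPU time. The following Python sketch assumes exactly that by using queue.Queue as a stand-in for such a library; the Process class, the peer attribute and the None end marker are hypothetical and serve only to make the example terminate.

```python
import queue
import threading


class Process:
    def __init__(self, name):
        self.name = name
        self.inbox = queue.Queue()   # thread-safe; get() blocks without polling
        self.peer = None
        self.serviced = []

    def t1(self, rounds):
        """Thread T1: compute and occasionally send."""
        for i in range(rounds):
            sum(range(1000))                      # compute()
            self.peer.inbox.put((self.name, i))   # send()
        self.peer.inbox.put(None)                 # end marker for the peer's T2

    def t2(self):
        """Thread T2: service incoming requests."""
        while True:
            request = self.inbox.get()            # recv(): blocked while idle
            if request is None:
                break
            self.serviced.append(request)         # service_request()


a, b = Process("a"), Process("b")
a.peer, b.peer = b, a
threads = [
    threading.Thread(target=a.t1, args=(3,)),
    threading.Thread(target=a.t2),
    threading.Thread(target=b.t1, args=(3,)),
    threading.Thread(target=b.t2),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Neither thread needs a sleep() call or a probe: T2 wakes up only when a request arrives, and T1 can send at any time.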

2.6.2 Polling in threaded non-trivial PVM and MPI applications

When the communication library is not thread-safe, thread T1 in Fig. 2.11 must not call send() when the thread T2 is inside the recv() call. The easiest way of guaranteeing the mutual exclusion of calls to the communication library involves the protection of each call to the library with a mutex (this protection can also be hidden in the library). This approach is suggested in [HR03b], [HR03a] in order to overcome thread-unsafety of PVM or MPI implementations. It avoids racing conditions but it does not solve the problem of non-trivial (irregular) applications (although the cited works aim to become a framework for development of irregular applications). If the recv() and send() calls are mutually excluded and the thread T2 is blocked in the recv() call then the send() call in the thread T1 will block until a message from some other process arrives, which would unblock T2. However, if no message arrives, then the whole process will remain blocked forever.

A general approach which guarantees the mutual exclusion of calls to a thread-unsafe communication library and at the same time progress in a non-trivial application is active polling (also known as busy waiting, or polling for short). Fig. 2.12 shows a polling implementation of one process of a non-trivial application. The shared mutex comm is used for the protection of the communication library calls in T1 and T2. This mutex can only be locked (acquired) by one thread at any one time. A thread which attempts to lock an already locked mutex becomes blocked until the thread which holds the mutex unlocks it. The thread T2, after having locked the mutex comm, calls a nonblocking probe() to check whether there is an incoming message. If there is a message, it is received in the recv() call and processed. If there is no message, T2 unlocks the mutex and falls asleep for some time. The sleep() call before re-entering the loop is necessary in order to give some CPU time to the computation in T1. Before falling asleep, T2 unlocks the mutex in order to allow T1 to call send().

The pseudo-code of Fig. 2.12 can be expressed in different ways when using

a (thread-unsafe) PVM or MPI library. Some of the alternatives include:

1. The nonblocking recv (pvm_nrecv in PVM, MPI_Irecv followed by MPI_Test in MPI) can be used in the thread T2 instead of the nonblocking probe (pvm_probe in PVM, MPI_Iprobe in MPI) and the blocking recv (pvm_recv in PVM, MPI_Recv in MPI).

2. All actual communication with other processes can be moved to the thread T2. The send() in the thread T1 can be replaced with an insertion of the message into a send queue (together with a message header which stores the recipient’s address, message tag, ...). Inside the polling loop, the thread T2 checks whether the send queue is empty. If the queue is not empty, T2 sends the messages in the send queue to the recipients.

3. If the structure of the function compute in the thread T1 is simple and if the application itself does not require multiple threads, then the polling from the thread T2 can be moved to the thread T1. An example of a simple structure of the compute function is a loop which is repeated many times. In this case the thread T2 can be eliminated, which reduces the program to the single thread T1 running the sequential loop (this loop contains a computation which is mixed with nonblocking send() calls). In addition, after every few executions of the loop a nonblocking probe is called to check for incoming requests. If there are requests, they are serviced; otherwise the computation continues. The problem with this solution is that it assumes a regular application structure. Moreover, the tuning of the number of loop executions after which the probe is executed is strongly machine-dependent.

thread T1()
{ while (not done)
  { compute();
    lock(comm);
    send();
    unlock(comm);
  }
}

thread T2()
{ while (not done)
  { lock(comm);
    arrived = probe();
    if (arrived)
    { recv();
      service_request();
    }
    unlock(comm);
    sleep(time);
  }
}

Figure 2.12: Polling implementation of one process of a non-trivial application: a thread-safe communication library is not required for the application to work correctly. On the other hand, polling makes the application inefficient and non-portable

The choice of the right alternative can improve the efficiency of a certain

application on a certain system. However, when the application is ported onto a

new system, the tuning procedure must be repeated in order to find the optimal

polling parameters. Even worse than that, the optimal setting of the polling

parameters can also depend on the inputs to the application, in which case an

empirical tuning does not help.
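The polling scheme of Fig. 2.12 can also be written down in runnable form, which makes the tuning parameter visible. The Python sketch below is an illustration under assumptions: queue.Queue merely stands in for a thread-unsafe communication library (so every access is wrapped in the mutex comm), the process sends its messages to itself for brevity, and POLL_INTERVAL is a hypothetical name for the sleep time.

```python
import queue
import threading
import time

comm = threading.Lock()      # mutex protecting every call into the "library"
inbox = queue.Queue()        # stand-in for the thread-unsafe communication library
done = threading.Event()
serviced = []
POLL_INTERVAL = 0.01         # seconds; the machine-dependent tuning parameter


def t1_compute(rounds):
    """Thread T1: compute, then send under the mutex."""
    for i in range(rounds):
        value = i * i                    # compute()
        with comm:
            inbox.put(value)             # send(), here addressed to this process


def t2_poll():
    """Thread T2: probe, receive and service under the mutex, then sleep."""
    while not done.is_set():
        with comm:
            if not inbox.empty():        # probe()
                request = inbox.get()    # recv(); cannot block, T2 is the only reader
                serviced.append(request) # service_request()
        time.sleep(POLL_INTERVAL)        # mutex is released while sleeping


poller = threading.Thread(target=t2_poll)
poller.start()
t1_compute(5)
deadline = time.monotonic() + 5.0
while len(serviced) < 5 and time.monotonic() < deadline:
    time.sleep(POLL_INTERVAL)
done.set()
poller.join()
```

Each request waits up to one POLL_INTERVAL before it is noticed, and T2 wakes up periodically even when nothing arrives. A smaller interval lowers the latency but takes more CPU time from T1, which is the trade-off that has to be re-tuned on every system.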

Claim. Every non-trivial application which builds on a thread-unsafe communi-

cation library is forced to use active polling (busy waiting). •

2.6.3 Polling in communication libraries

When avoiding polling in an application, one must ensure that none of the li-

braries used by the application uses polling. If a polling library function is called

from a thread, then all other application threads are slowed down by the polling

thread.

Section 2.6 shows that the problem with non-trivial applications is a lack of
thread-safety of libraries which are used for inter-process communication. To the
best of our knowledge, there is no thread-safe implementation of the PVM
library. The MPI-2 standard [MPI97] defines four levels of thread-safety (whereby
the highest one, MPI_THREAD_MULTIPLE, is needed for efficient non-trivial
applications). Some MPI implementations are thread-safe (MPI_THREAD_MULTIPLE).
However, the MPI developers did not keep in mind the reason why thread-safety
is important: they “solve” the problem using active polling inside the
library! The original problem with non-trivial applications is thus only pushed
one level deeper. Such an approach can be found in MPICH2 [Gro02].

Claim. If active polling is used to solve thread-safety problems inside a com-

munication library, it is impossible for an application using the library to avoid

active polling. Even worse than this, the application does not even have the free-

dom of choice between the alternative polling implementations as described in

Section 2.6.2—it is forced to use the one implemented in the library. •

A thread-safe communication library that does not internally use threads but

uses polling instead must keep a table of pending send and receive requests.

The polling thread-safe scenario may work as follows (similarly to the program

in Fig. 2.12):

• A mutex is used inside the library to ensure mutual exclusion of send()
  and recv() calls.

• Blocking send() and recv() calls do not internally block. The library polls
  on behalf of all pending requests in the implementation of each “blocking”
  send or receive. The global mutex is locked during the time needed to
  detect progress and it is unlocked during the sleep() call used in the polling
  loop.
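The scenario above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the code of any real library: the class `PollingLibrary`, its `deliver` method (standing in for the network) and the `polls` counter are hypothetical, and only pending receives are modelled.

```python
import collections
import threading
import time


class PollingLibrary:
    """Toy 'thread-safe' library whose blocking recv() is a polling loop."""

    def __init__(self, poll_interval=0.01):
        self._mutex = threading.Lock()         # the global library mutex
        self._incoming = collections.deque()   # messages that have arrived
        self._poll_interval = poll_interval
        self.polls = 0                         # how often progress was checked

    def deliver(self, message):
        """Called by the simulated network when a message arrives."""
        with self._mutex:
            self._incoming.append(message)

    def recv(self):
        """A 'blocking' receive that never blocks inside the kernel."""
        while True:
            with self._mutex:                  # locked only while detecting progress
                self.polls += 1
                if self._incoming:
                    return self._incoming.popleft()
            time.sleep(self._poll_interval)    # mutex is released during the sleep


lib = PollingLibrary()
timer = threading.Timer(0.1, lib.deliver, args=("hello",))
timer.start()
message = lib.recv()
timer.join()
print(message, lib.polls)
```

The caller of `recv()` sees a blocking call, but `polls` shows that the library woke up repeatedly while nothing had arrived.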

Some research papers on communication libraries [Fer98], [KHS96] do not

mention their implicit use of active polling, thus hiding the source of a significant

performance loss.

Polling related to asynchronous communication

Another potential source of active polling in communication libraries involves
using optimisation techniques for single-threaded communicating processes. We
will focus on the implementation of the asynchronous (nonblocking) MPI_Isend in
MPICH [GL96]. As we will show, the implementation is wrong because it violates
the progress rule defined in the MPI standard [MPI94], [MPI98], [MPI97]. An
MPI program which demonstrates this weakness is given in Appendix A. We
will also show that even if the implementation of MPI_Isend were correct, the
application using the nonblocking send would suffer from polling.

An MPI_Isend() call returns immediately to the caller. The MPI tutorial
book [GLS95] (Chapter 4, Section “Using Nonblocking Communication”, page
81) says: “The buffer containing the message to be sent using MPI_Isend must
not be modified until the message has been delivered (more precisely, until the
operation is complete, as indicated by one of the MPI_Wait or MPI_Test
routines).” (MPI_Wait is a blocking MPI function which blocks until the
corresponding MPI_Isend has been completed. MPI_Test is a nonblocking MPI
function which returns immediately with the information on the completion of the
corresponding MPI_Isend.)

A question arises: Under what circumstances can an MPI_Isend complete
before MPI_Wait or MPI_Test has been called? (If no such circumstances exist,
then MPI_Isend loses its reason for existence.)

In order to complete an MPI_Isend, the sender’s side must push the message
to the network (this pushing corresponds to writing to a socket). The pushing
itself can either be blocking (blocks until the recipient’s side is ready to receive) or
nonblocking (tests whether the recipient’s side is ready to receive). The pushing
must be hidden in the MPI library. However, MPI_Isend returns the control to
the application after some initial nonblocking pushing at the latest.

If the message cannot be delivered to the network during the initial pushing,
the next opportunity to retry the pushing is when the MPI library regains
control. This happens when the application calls an MPI function. The pushing of
pending MPI_Isend messages can be (and usually is) implemented as a side effect
of most MPI functions. An eventual delivery of pending MPI_Isend messages
is only guaranteed if the application either regularly calls MPI functions or the
control in the application gets to the MPI_Wait() call paired with the MPI_Isend.

The hidden pushing of pending MPI_Isend messages which is implemented as
a side effect of any MPI function works well with single-threaded applications but
causes problems when either the application is multi-threaded (not necessarily
non-trivial in the sense of Section 2.1) or when the application’s processes run on
multi-user operating systems. Consider the following sequence of MPI calls in one
process: MPI_Isend(); MPI_Recv(), and assume that the initial pushing during
the MPI_Isend() was not successful (the receiver’s side was not ready to receive
the message at that time). The MPI_Recv() call blocks until a matching message
arrives. But the arrival of the matching message may depend on the delivery
of the pending MPI_Isend() message. For this reason the MPI library must not
passively block in the MPI_Recv() call. Instead, the implementation of
MPI_Recv() must either block while waiting for both events (the pushing of the
pending outgoing message which has been sent using MPI_Isend() and the arrival
of a message in MPI_Recv()) or the polling for the matching message must be
interleaved with the pushing of the pending MPI_Isend() message. The second
option, interleaved polling, can be found e.g. in MPICH [GL96]. The use of
the interleaved polling may be advantageous for a single-threaded application or
an application which runs on a single-user operating system. However, a
multi-threaded application which uses the above mechanism in one of its threads will
spin until a matching message arrives and the control returns from the MPI_Recv
call. This spinning means that the thread which is inside the interleaved polling
unnecessarily consumes up to 100% of the CPU time, blocking the other threads
or other users’ processes (if the process is running on a multi-user operating
system) which may make better use of the wasted CPU time.

Also, the interleaved polling only guarantees progress when the control in the
application process regularly reaches an MPI call. In a general case, no progress
is guaranteed.

Claim. MPICH violates the progress rule defined in the MPI standard. In other
words, MPICH does not comply with the MPI standard.

Proof. Consider a program that consists of two parallel processes, P1 and P2.
The program performs the following actions (in this order):

1. P1 calls MPI_Isend(), sending a message to P2.

2. P1 runs an infinite loop (followed by MPI_Wait()).

3. P2 calls MPI_Recv() which matches the message sent by P1.

The progress rule states that the message sent by P1 will eventually be
delivered once it is matched by the MPI_Recv() in P2 (the MPI standard states that
this must occur independently of the other actions in P1). The message sent by
P1 will never be delivered to P2 in MPICH. •

A common workaround involves the insertion of an MPI_Test() call (or some
other MPI call) into the infinite loop in the process P1. However, sometimes this
workaround cannot be directly applied. For instance, the text “P1 runs an infinite
loop” in step 2 can be replaced with “P1 makes an I/O call which can block”. This
causes application programmers to look for other workarounds, which generally
destroy the natural organisation of the code.
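The argument of the proof and the effect of the workaround can be illustrated with a toy model. The sketch below is written in Python and makes no claim about the internals of MPICH: the class `ToyLibrary` and the flag `receiver_ready` are hypothetical, and the only property modelled is that pending sends are pushed exclusively inside library calls.

```python
class ToyLibrary:
    """Toy model: pending sends make progress only inside library calls."""

    def __init__(self):
        self.pending = []            # messages accepted by isend() but not pushed
        self.delivered = []          # messages that reached the receiver
        self.receiver_ready = False  # set when the peer posts a matching receive

    def _progress(self):
        """Push pending messages if the receiver can take them."""
        if self.receiver_ready:
            self.delivered.extend(self.pending)
            self.pending.clear()

    def isend(self, message):
        self.pending.append(message)
        self._progress()             # initial nonblocking push

    def test(self):
        """Completion check; pushing happens as a side effect."""
        self._progress()
        return not self.pending


lib = ToyLibrary()
lib.isend("m")                 # step 1: the receiver is not ready yet
lib.receiver_ready = True      # step 3: P2 posts its matching receive
for _ in range(1000):          # step 2: P1 computes without calling the library
    pass
stuck = list(lib.pending)      # the message is still waiting in P1
completed = lib.test()         # workaround: any library call pushes the message
print(stuck, completed, lib.delivered)
```

In this model the message stays in `pending` for as long as the loop runs, regardless of the receiver being ready, and it is delivered only once the application re-enters the library.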

Remark. The default MPI_Send can be safely implemented as a synchronous
MPI_Ssend. However, MPICH does not implement MPI_Send as MPI_Ssend:
MPI_Send in MPICH can return before the message has been delivered to the
receiver. This means that the progress rule is also violated for the default MPI_Send
in MPICH. •


The MPI standard requires the application to free the send buffer used in
MPI_Isend. The application can free the buffer after it has made sure that an
MPI_Isend() call has completed. The functions MPI_Wait or MPI_Test can (or
rather, must) be used to detect the completion of the MPI_Isend. This is another
source of polling, this time in the application, not inside the MPI library:

Claim. The semantics of asynchronous (nonblocking) communication in MPI
forces the application to use polling in order to avoid using an unbounded amount
of memory.

Proof. Consider a program with an unpredictable flow of control. The
program calls MPI_Isend() unpredictably many times. Each of these calls must
be paired with an MPI_Wait() at some point in order to free the buffer used
in MPI_Isend(). However, the unpredictability of the flow of control makes it
impossible to place the MPI_Wait() anywhere but at the end of the program
(right before MPI_Finalize()). This is undesirable because the buffers of all
MPI_Isend() calls are not freed until the end of the program (even though they
are no longer needed). •

A general workaround assumes running a separate thread which polls for
completion on behalf of all pending MPI_Isend() calls. Another workaround involves
preceding each MPI_Isend() call with an MPI_Wait() call. However, this reduces
the maximum number of pending MPI_Isend() calls to one, which can lead to
unexpected deadlocks in applications which rely on a higher number of
concurrent MPI_Isends. A more general solution of this kind involves preceding each
MPI_Isend() call with an MPI_Testany() call with the intention to control the
number of pending MPI_Isends, but this is equivalent to polling.
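The last workaround, testing for completed sends before each new send, can be sketched as follows. The sketch is a Python toy and does not call MPI: `ToyRequest` and `bounded_isend` are hypothetical names, and completion is simulated by a countdown that advances on every test, which is an assumption made purely for illustration.

```python
class ToyRequest:
    """Simulated nonblocking send that completes on its third test."""

    def __init__(self, buffer, tests_to_complete=3):
        self.buffer = buffer
        self._remaining = tests_to_complete

    def test(self):
        """Nonblocking completion check (each call advances the toy clock)."""
        self._remaining -= 1
        return self._remaining <= 0


def bounded_isend(pending, buffer, max_pending):
    """Start a send only after the number of pending sends is below the bound."""
    polls = 0
    while True:
        # Drop completed requests so that their buffers can be freed.
        pending[:] = [request for request in pending if not request.test()]
        polls += 1
        if len(pending) < max_pending:
            break                      # otherwise keep polling
    pending.append(ToyRequest(buffer))
    return polls


pending, poll_counts, max_seen = [], [], 0
for i in range(5):
    poll_counts.append(bounded_isend(pending, "buffer-%d" % i, max_pending=2))
    max_seen = max(max_seen, len(pending))
print(poll_counts, max_seen)
```

The memory used by send buffers is bounded by `max_pending`, but the price is the polling loop inside `bounded_isend`.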

The consequence is that an efficient use of the nonblocking communication

defined in the MPI standard is restricted to the applications in which communi-

cation and computation phases alternate and are strictly separated. The commu-

nication in one phase can then overlap with the computation in the next phase.

Such a model is studied in [LAB93].

The CORBA standard [Gro98] explicitly prescribes polling in applications
that use asynchronous communication. This is a quotation from the tutorial
book [OPR96] on using CORBA (Section 8.2.2, Deferred Synchronous
Communication):

    “What CORBA refers to as the deferred synchronous communication
    style is a form what the non-CORBA world calls asynchronous
    communication. When an application invokes a deferred synchronous
    request, the application does not wait for the request to complete
    before it continues with other work. However, the application must
    periodically check to see if the request has completed by polling using
    the CORBA_Request_get_response operation on the
    CORBA_get_next_response routine.

    The deferred synchronous style of communication is most
    appropriate when you do not want your application to have to wait for
    the current request to complete before sending the next request. For
    example, if you do not know how long the operation may take, you
    may not want your application to wait for the request to complete.
    This will almost always be the case when you write a CORBA client
    to connect to someone else’s framework or server.”

2.6.4 Limits of active polling

This section explains why active polling diminishes the performance of non-trivial

parallel applications and to what extent. Consider the polling implementation

of a non-trivial application (Fig. 2.12, Section 2.6.2). The only parameter which

allows the tuning of the implementation is the time argument of the sleep(time)

call in the thread T2. The setting of time determines the tradeoff between the

latency of the servicing of incoming requests and the wasted CPU cycles:

• If time is a short time period, say, 1 millisecond, then the latency of request
  servicing is 0.5 milliseconds on average. On the other hand, CPU cycles
  are unnecessarily wasted every 1 millisecond if there are no requests to be
  serviced. The computation in the thread T1 is slowed down.

• If time is a long time period, say, 1 second, then the CPU cycles are wasted
  only once a second. However, the average latency of request servicing is
  0.5 seconds.
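The two cases follow from simple arithmetic, sketched below in Python. The function name `polling_tradeoff` is hypothetical, and the model assumes that requests arrive at uniformly random moments within the sleep interval, so that the average wait is half of the interval.

```python
def polling_tradeoff(sleep_time_s):
    """Return (average request latency, idle wake-ups per second)."""
    average_latency_s = sleep_time_s / 2.0   # request arrives mid-interval on average
    wakeups_per_second = 1.0 / sleep_time_s  # probes executed even when idle
    return average_latency_s, wakeups_per_second


for sleep_time_s in (0.001, 1.0):
    latency, wakeups = polling_tradeoff(sleep_time_s)
    print("sleep %.3f s: latency %.4f s, %.0f wake-ups/s"
          % (sleep_time_s, latency, wakeups))
```

Shortening the interval lowers the latency and raises the number of wasted wake-ups by the same factor, so no setting improves both.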

The setting of the time constant is dependent on many factors: the application’s
computation/communication ratio, input data, operating system, network speed,
network latency, the buffering scheme in the communication library, etc. The time
constant must therefore be experimentally tuned. The need for tuning makes the
polling approach non-portable, but a closer look at the implementation of sleep
in operating systems reveals an even more serious polling deficiency.

The POSIX standard [ISO90] defines a high resolution version of the sleep

system call, nanosleep(time), which is supported by all contemporary operating

systems. nanosleep(time) puts the caller to sleep for the specified time whereby

time is given in nanoseconds. A typical implementation of nanosleep(time) in

the kernel of an operating system is shown in Fig. 2.13.

Sleeping for a time shorter than the threshold of the kernel does not cause
the descheduling of the calling thread: the calling thread is “actively” sleeping.
In other words, other threads or processes do not get scheduled while the
current thread is sleeping (in fact, the current thread is burning CPU cycles). For
this reason a time which is shorter than the threshold must be avoided in polling
implementations of non-trivial applications (Fig. 2.12).

nanosleep(time)
{
    if (time < VERY_SHORT_TIME_PERIOD)
        ...run idle CPU cycles for the given time...
    else
        ...set an alarm and deschedule the current process/thread...
}

Figure 2.13: Implementation of nanosleep in the kernel of an operating system

Sleeping for a time longer than the threshold causes the calling thread to be
descheduled and rescheduled again after time has elapsed. However, it turns
out that without a special tuning of the kernel (which leads to many unwanted
consequences) the minimal duration of nanosleep(time) is 0.02 seconds in all
systems available to us (Solaris, Linux, Ultrix). We measured this value by
calling nanosleep(time) many times in a loop. (Alternative system calls such
as usleep or select behave similarly. This behaviour is caused by the clock
granularity in the operating systems, which is set to 10 ms. As the clock ticks are
discrete, this value is often increased by an additional 10 ms; see footnote 9.)
In other words, nanosleep(time) can be called at most 50 times per second. This
has a worrying implication for all non-trivial applications using the polling
scheme of Fig. 2.12:

Claim. The upper bound on the number of serviced requests in a polling process

of a non-trivial application of Fig. 2.12 is 50 per second. This number does not

depend on the speed of the processors or the network connecting them. •
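The kind of measurement described above can be repeated with a short script. The Python sketch below is not the original experiment: it uses `time.sleep` instead of a direct nanosleep call, the name `measure_sleep` is hypothetical, and the numbers it prints depend entirely on the operating system and its timer configuration, so they need not match the values reported in the text.

```python
import time


def measure_sleep(requested_s, repetitions=50):
    """Average real duration of a sleep call of the requested length."""
    start = time.perf_counter()
    for _ in range(repetitions):
        time.sleep(requested_s)
    return (time.perf_counter() - start) / repetitions


requested = 0.001                        # ask for 1 ms
average = measure_sleep(requested)
max_polls_per_second = 1.0 / average     # bound on serviced requests per second
print("requested %.4f s, measured %.4f s, at most %.0f polls/s"
      % (requested, average, max_polls_per_second))
```

Whatever the measured minimum is on a given system, its reciprocal bounds the number of requests that the polling scheme of Fig. 2.12 can service per second.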

Remark. The loss of performance is not the only negative consequence of polling.
Another negative consequence (sometimes even worse than performance loss) is
non-determinism. A request that needs attention can be created at any time
during the sleep() call of the thread T2 (see Fig. 2.12). The process which sent
the request usually waits for a reply; in other words, it is idle. The length of its
idling interval depends on the length of time the recipient’s thread T2 will sleep
from the moment at which the request was created. As the need for a request is
randomly created, the request servicing times are random. The expected servicing
time is a half of the polling time interval (0.01 seconds or more, as we have shown
above). If a process of a communication-intensive non-trivial application sends
1000000 requests, the expected polling overhead is 10000 seconds (which adds to
the parallel application time). However, the actually measured polling overhead
of one run can theoretically be anything between 0 seconds and 20000 seconds
(ca. 5.5 hours). •

9 We experimentally found that the select system call adds the additional 10 ms
penalty least frequently on Linux 2.4.20-xfs i686.
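The figures in the remark can be checked with a few lines of Python. The function name `polling_overhead` is hypothetical; the model assumes, as the remark does, that each request waits between zero and one full polling interval, with half an interval on average.

```python
def polling_overhead(requests, polling_interval_s):
    """Return (expected, worst-case) total waiting time in seconds."""
    expected_s = requests * polling_interval_s / 2.0   # half an interval each
    worst_s = requests * polling_interval_s            # a full interval each
    return expected_s, worst_s


expected_s, worst_s = polling_overhead(1000000, 0.02)
print("expected %.0f s, worst case %.0f s (%.2f hours)"
      % (expected_s, worst_s, worst_s / 3600.0))
```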

2.6.5 Previous work related to thread-safety of PVM and

MPI

Thread-safety of PVM and MPI has been studied for many years but the
problem remains unsolved. We know of no implementation of thread-safe PVM
and of only two successful implementations of thread-safe MPI which do not
use polling (an unofficial IBM implementation and the implementation by MPI
Software Technology, Inc.). This section presents a few key research works in the
field.

PVM

The LPVM (Lightweight-process PVM) system was introduced in [ZG98]. LPVM

is designed for shared-memory machines. PVM tasks are implemented as threads

in LPVM. The authors recognise two main issues that have to be dealt with

in order to make PVM multi-thread-safe: global state and reentrancy. LPVM

removes global states from the PVM library by assigning receive and send buffers

(and other resources) to each task. The user interface of PVM 3.3 is only slightly

modified but a major redesign of libpvm is needed. Our implementation is simple

and does not require the removal of global states.

TPVM [FS98] takes a different approach which assumes a thread to be the

basic unit of parallelism in a distributed system. There is a thread server which

registers all threads running in the system. This fine-grained model is mapped

onto the coarse-grained process model of PVM for the purpose of message pass-

ing. Rather than going into technical details, we will explain the problem of

TPVM from the point of view of a non-trivial application. There is a global

message queue accessible to all threads in each process. A thread which wants to

receive a message follows the following protocol. Firstly it looks for the message

in the global message queue. If the message is there, the thread continues; if not,

the thread attempts to receive a message from another TPVM task using a non-

blocking receive. If a message is there, but it is addressed to another thread, the

thread stores the message in the global message queue and retries the nonblock-

ing receive (and later wakes up threads which are waiting for those messages);

otherwise it falls asleep. The following is the weakness of the protocol: When a

thread falls asleep, then another thread must attempt to receive a message in

order to wake up the sleeping thread. If no such thread exists, the sleeping thread


will sleep forever—even though there may be a message addressed to the sleep-

ing thread which was sent by some other TPVM process. This situation can be

resolved by running a special thread that regularly polls for incoming messages

and wakes other threads up. Our mechanism does not require the running of a

polling thread.

A misleading attempt by PVM developers to support non-trivial applications
was the introduction of message handlers in the version PVM 3.4. A function

pvm_addmhf(int src, int tag, int ctx, int (*func)(int mid))

was added to the PVM library. This function registers a user’s message handler
func. The handler is fired (the function func is called) when a matching message
arrives (the message header must match the parameters src, tag and ctx). An
elegant implementation of a non-trivial application would involve registering a set
of message handlers instead of running the thread T2 of Fig. 2.11. The program
would remain single-threaded, executing only the thread T1. However, the manual
to pvm_addmhf says (note the marked text): “. . . pvm_addmhf specifies a function
that will be called whenever libpvm copies in a message whose header fields
of src, tag and ctx match those provided to pvm_addmhf.” In other words, the
message handlers are only called when the application requests it: a thread which
regularly calls e.g. pvm_probe() must be running, which forces libpvm to copy in
the arriving message. This is equivalent to active polling and contrasts with what
is stated in [GKPS97]: “. . . these message handlers are invoked internally without
any user intervention.”
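The consequence of firing handlers only on copy-in can be shown with a toy model. The Python sketch below does not use PVM: the class `ToyPvm` and its methods are hypothetical, matching is reduced to the tag alone, and the only behaviour modelled is that a handler runs when the library copies a message in, never at the moment of arrival.

```python
class ToyPvm:
    """Toy model of message handlers that fire only when messages are copied in."""

    def __init__(self):
        self._arrived = []     # messages that arrived but were not yet copied in
        self._handlers = {}    # tag -> handler function

    def addmhf(self, tag, func):
        """Register a handler (loosely modelled on the idea of pvm_addmhf)."""
        self._handlers[tag] = func

    def network_delivers(self, tag, body):
        """The message reaches the host; the library is not running."""
        self._arrived.append((tag, body))

    def probe(self):
        """Copy in all arrived messages and fire the matching handlers."""
        fired = 0
        while self._arrived:
            tag, body = self._arrived.pop(0)
            handler = self._handlers.get(tag)
            if handler is not None:
                handler(body)
                fired += 1
        return fired


handled = []
pvm = ToyPvm()
pvm.addmhf(7, handled.append)
pvm.network_delivers(7, "request")
before_probe = list(handled)   # nothing has been handled yet
pvm.probe()                    # the application has to ask for it
after_probe = list(handled)
print(before_probe, after_probe)
```

Unless some thread calls `probe()` regularly, the request waits indefinitely, which is the polling that the handlers were meant to remove.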

PVM allows the delivery of signals between tasks. Signals, in combination
with message handlers (see previous paragraph), may be used to get rid of the
polling thread in a non-trivial application. A non-trivial application would be
implemented as a single-threaded application which executes the code of T1 of
Fig. 2.11. The thread T2 of Fig. 2.11 would be implemented as a set of message
handlers. The message passing protocol would be extended as follows (the actual
implementation can be very complex):

1. Process A sends a message to process B.

2. Meanwhile, process B runs T1, without taking the incoming message into
   account.

3. Process A sends a signal to process B, saying “Get up, you have a message!”

4. Process B receives the signal and fires a signal handler. The signal handler
   calls pvm_probe which fires a message handler which receives the message
   and performs an appropriate user’s action handle_message.

A similar idea was used in [SKH96] for the implementation of active messaging

in PVM. No polling occurs in the above scenario. However, technical issues make

this approach inefficient and non-portable and restrict its use to homogeneous

parallel machines.


MPI

There are only two existing implementations of MPI which are thread-safe and

do not use polling:

• An experimental (unofficial) implementation by IBM [Tre97] was developed
  for IBM RS/6000 SP machines and has never been ported to other systems.
  The official IBM implementation of MPI uses polling to solve the problem
  of thread-safety.

• MPI/Pro, a commercial MPI implementation by MPI Software Technology,
  Inc. (MSTI), which claims to be fully thread-safe [DS02]. It also claims
  to implement asynchronous communication without polling. However, the
  product description states that MPI/Pro implements the MPI 2.0 standard
  [MPI97], therefore it is unclear how the problem of completion described in
  Section 2.6.3 is tackled in MPI/Pro.

MiMPI [GCC99] claims to be fully thread-safe but the authors do not explain

how they solve this problem in their paper. According to our brief personal

communication with the authors, MiMPI uses a socket pair between all pairs of

processes and avoids sharing any other resources through threads. This socket

replication only makes this approach scalable up to a certain number of processes

because of the memory and other system limitations of processors.

ScaMPI [HOB+99] is a commercial implementation of MPI by Scali. ScaMPI

uses polling to implement thread-safety.

A multi-threaded implementation of MPI is proposed in [PS98]. This pro-

posal has never been implemented. The authors address efficiency issues in this

paper (note the marked text): “. . . It should be mentioned that switching be-

tween threads and using synchronisation primitives also incurs a finite overhead.

Hence, communication benchmark results obtained with a multi-threaded commu-

nication software will probably be not as good as the results of a single-threaded

implementation that burns CPU cycles in busy waiting for incoming data and

delivers low communication latency. However, this latency can be hidden with

the overlap of communication and computation, so in a real-life situation, well-

designed applications that use multi-threaded communication software will reveal

better overall performance than their single-threaded counterparts even though

the communication benchmarks show the opposite.” The reasoning behind the

marked text is wrong. In fact just the opposite is true for many important appli-

cations (all non-trivial applications)—a very high communication latency is the

main flaw of the polling approach!

Some works assume a so-called active device support such as hardware
interrupts triggered by a message arrival [BMR02], [GRS97], [LRBB96]. However,
this assumption makes the approach non-portable to other architectures which do
not support active devices. Moreover, even when active devices are available, the
coupling of the devices, operating system and the communication library remains
unclear.

Another concept of thread-safe message passing using P4 and MPI is proposed

in [CSD94]. The authors identify non-reentrant functions in P4 and sketch a

threaded implementation of MPI which may lead to the solving of the problem

of thread-safety of MPI. These ideas have never been implemented.

A possible impact of threads on new generation MPI implementations is stud-

ied in [HSS+98]. The MPI implementations based on polling (such as MPICH)

are referred to as “first generation” implementations, the implementations based

on threads are referred to as “second generation” implementations. Arguments

are given which support both the polling and threaded approaches. An impor-

tant issue throughout the paper is the interpretation of the progress rule, see

Section 2.4. The authors recognise two interpretations, a “liberal” one (progress

depends on whether the application periodically calls certain MPI functions) and

a “strict” one (progress does not depend on actions in the system). In our opinion,

the “liberal” interpretation is simply wrong. Also statements such as “Perhaps

the strongest argument for polling is that it minimizes latency.” are very dis-

putable because they may only apply to a certain class of MPI programs (the

trivial ones).

2.6.6 Quasi-thread-safe PVM and MPI

It is not particularly frustrating that PVM and MPI implementations are not

thread-safe. What is frustrating is that non-trivial applications cannot be effi-

ciently implemented using these libraries. This section describes a mechanism of

an interruptable blocking recv which is missing in the libraries [Pla02b]. This

mechanism does not make the libraries thread-safe (see Section 2.6.1) but it al-

lows an efficient and portable implementation of non-trivial applications. We

concentrate on a socket-based inter-process communication in the sequel but the

mechanism can also be applied to the shared memory implementation of message

passing (or even to a mix of shared memory and socket communicators).

The reasons why the natural threaded implementation of a non-trivial
application of Fig. 2.11 cannot be implemented without active polling are:

1. Tasks T1 and T2 must be implemented as threads.

2. The thread T2 must run a blocking recv.

3. The thread T1 cannot send any messages while the thread T2 is blocked
   in the blocking recv because the communication library (PVM or MPI) is
   not thread-safe (the recv and send functions cannot be concurrently called
   from two threads).

4. PVM and MPI are not thread-safe because it seems to be very difficult
   from a software engineering point of view to implement all their interface
   functions in a reentrant and portable way without active polling.

The mechanism of the interruptable blocking recv addresses point 3. In terms
of concurrent programming, the thread T2 is blocked in the critical section of the
blocking recv when at the same time T1 needs to call the send function.

The reason why the concurrent send() call leads to problems strongly depends

on the implementation of the communication library. For instance:

• PVM 3.4 uses a single buffer for incoming and outgoing messages. The
  buffer state is manipulated by both send and recv functions. (More
  precisely, the buffer state is manipulated by the mxfer function which is used
  for both sending and receiving messages. The mxfer function is called from
  both send and recv functions.) The send() call in the thread T1 destroys
  the buffer state which was set up for the thread T2 executing the recv()
  call.

• MPICH 1.2.4 with the ch_p4 driver uses distinct buffers for the send() and
  recv() calls. The memory corruption results from the implementation of the
  internal communication protocols used by the ch_p4 driver. For instance, the
  implementation of send involves a bidirectional communication for the
  so-called rendez-vous protocol. The rendez-vous protocol is used for transfer of
  large messages which are sent in chunks, not as a whole. In the rendez-vous
  protocol the recipient sends an acknowledging message to the sender after
  having received a chunk, telling the sender “I am ready to receive another
  chunk”. These acknowledging messages either get mixed with the regular
  messages which are being awaited by the thread T2 executing the blocking
  recv (which puts them into the unexpected queue and these never reach
  the thread T1) or they are received (and processed) by both threads T1 and
  T2, which also leads to an inconsistent state of the library. Upon an arrival
  of a message, the MPI library is not able to decide which thread is supposed
  to process the message: the MPI library does not even know that there are
  several threads.

Interruptable blocking receive

Implementations of the PVM and MPI libraries are very different (the
implementations of the MPI standard itself are very different). Nevertheless, a more
detailed study of the low-level implementations of the recv function reveals a
certain regularity. On the very lowest level of the implementation of a blocking
recv there is a select() call (see footnote 10). This is a system call defined by
the POSIX standard [ISO90] which waits on a number of file descriptors to change
status (see footnote 11). It is important to note that select can be (and is)
efficiently implemented in kernels of operating systems: the calling process or
thread sleeps until an event occurs, therefore not consuming any CPU cycles. The
file descriptors in the case of the blocking recv are reading file descriptors bound
to the sockets on which a message may arrive. Fig. 2.14 shows the situation.

10 A polling loop can be used instead of the select() call. However, such an
implementation would be poor for the reasons discussed in Section 2.6.3 and
Section 2.6.4.
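The behaviour of select() at the bottom of a blocking recv can be tried out directly. The following Python sketch is an illustration only and is unrelated to the PVM or MPI sources: a local socket pair stands in for a network connection, a timer thread plays the remote sender, and the name `blocking_recv` and the timeout are hypothetical.

```python
import select
import socket
import threading


def blocking_recv(sockets, timeout_s=5.0):
    """Sleep in the kernel until one of the sockets becomes readable."""
    readable, _, _ = select.select(sockets, [], [], timeout_s)
    if not readable:
        return None                      # timed out, nothing arrived
    return readable[0].recv(4096)


remote_end, local_end = socket.socketpair()   # stand-in for a network connection
sender = threading.Timer(0.1, remote_end.sendall, args=(b"ping",))
sender.start()
received = blocking_recv([local_end])    # no CPU time is used while waiting
sender.join()
remote_end.close()
local_end.close()
print(received)
```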

[Figure 2.14 diagram: the thread T2 calls recv() in the PVM/MPI library, which
blocks in select(fd1, fd2, ..., fdN) in the kernel; the file descriptors are bound
to the sockets connected to the network.]

Figure 2.14: Implementation of the blocking recv in a socket-based communication
library

The thread T1 wants to send a message while the thread T2 is blocked in the
blocking recv (that means, inside a select() system call which is transparent to
the user). T2 remains blocked until a message from some other process arrives. T1
cannot wait until the recv() call in T2 unblocks (not for efficiency reasons: the
arrival of a message at T2 may depend on sending the message from T1). The
basic idea of the interruptable blocking recv involves sending a “fake message”
from the thread T1 to the thread T2 whenever T1 needs to send a message to some
other process. The function of this “fake message” is to interrupt the recv in T2
for the amount of time needed for T1 to send the message to the other process.
The interrupt mechanism must ensure that it is safe for T1 to make the send()
call during the interrupt.

However, we assumed that the communication library is not thread-safe—it

does not allow a thread to send a message while another thread is blocked in

the recv. How does T1send the “fake message” to T2? In order to solve this

problem, we exploit the fact that the communication between the threads T1and

¹¹ The select() call is similar to Occam’s ALT constructor (see Section 2.2.1). The functions select() and poll() are the only POSIX system calls which implement an efficient (non-polling) multiplexing on several file descriptors. (The name of the poll() function is a little misleading: poll() and select() are almost identical in their semantics and efficiency.)

62 CHAPTER 2. EVENT-DRIVEN MESSAGE PASSING

T2 takes place on one processor, more precisely, in the scope of one operating system’s kernel. The “fake message” bypasses the network and directly fires the select in the kernel, making the thread T2 think that it has received a message from the network.

If the thread T1 writes into a file descriptor which is being used for the purpose of communication with some other process, it must emulate the message passing protocol used by the communication library in order for the message to be correctly received by T2. Furthermore, a message can arrive from the network on the file descriptor which is being written by T1. In order to avoid these problems, the set of reading file descriptors in the select in T2 is extended with a special file descriptor, intr_fd. The intr_fd is the reading end of a POSIX pipe. The thread T1 writes to the writing end of the pipe, intr_wfd, in order to interrupt the blocking receive in T2. There is no need to emulate the message passing protocol on the pipe because intr_fd is exclusively used for the purpose of interrupting the thread T2. The only information T2 needs is that it has been interrupted. The code of the blocking recv is therefore extended so that it can detect which of the file descriptors has been fired: if intr_fd has been fired, T2 knows that it is being interrupted by some other thread. Fig. 2.15 depicts the scenario.

[Figure 2.15 (diagram): as in Fig. 2.14, thread T2 calls recv() in PVM/MPI, but the kernel call is now select(fd1, fd2, ..., fdN, intr_fd); thread T1 calls write(intr_wfd).]

Figure 2.15: Scenario of the interruption of a blocked recv. The intr_fd file descriptor is the reading end of a POSIX pipe. The thread T1 writes to the writing end of the pipe (intr_wfd), firing the blocked select in T2
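The detection step can be illustrated with a small C sketch. It is only an illustration under stated assumptions, not the code of PVM or MPI: a plain pipe (data_fd) stands in for the sockets of the library, and the names wake_reason, wait_message_or_interrupt and demo_wake_reasons are hypothetical. As in the extended recv, a real message takes precedence over the interrupt when both descriptors are ready.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

enum wake_reason { WOKEN_BY_MESSAGE, WOKEN_BY_INTERRUPT, WAIT_FAILED };

/* Sleep in select() until data_fd (stand-in for a library socket) or
   intr_fd (reading end of the interrupt pipe) becomes readable. */
enum wake_reason wait_message_or_interrupt(int data_fd, int intr_fd)
{
    fd_set readable;
    int max_fd = data_fd > intr_fd ? data_fd : intr_fd;

    assert(data_fd >= 0 && intr_fd >= 0);
    FD_ZERO(&readable);
    FD_SET(data_fd, &readable);
    FD_SET(intr_fd, &readable);
    if (select(max_fd + 1, &readable, NULL, NULL, NULL) < 0)
        return WAIT_FAILED;
    /* A real message takes precedence over the "fake message". */
    return FD_ISSET(data_fd, &readable) ? WOKEN_BY_MESSAGE
                                        : WOKEN_BY_INTERRUPT;
}

/* Hypothetical demo: returns 0 if both wake-up paths are told apart. */
int demo_wake_reasons(void)
{
    int data[2], intr[2];
    char byte = 'x';

    if (pipe(data) != 0 || pipe(intr) != 0)
        return -1;
    if (write(intr[1], &byte, 1) != 1)          /* the "fake message" */
        return -1;
    if (wait_message_or_interrupt(data[0], intr[0]) != WOKEN_BY_INTERRUPT)
        return 1;
    if (write(data[1], &byte, 1) != 1)          /* a real message */
        return -1;
    if (wait_message_or_interrupt(data[0], intr[0]) != WOKEN_BY_MESSAGE)
        return 2;
    close(data[0]); close(data[1]);
    close(intr[0]); close(intr[1]);
    return 0;
}
```

Because the waiting thread sleeps inside select(), neither wake-up path involves polling.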

Encapsulation of the interruptable blocking receive mechanism

There are several ways in which the interruptable blocking receive can be integrated into the implementations of PVM and MPI. The cleanest way would involve

hiding the mechanism in the communication library in order to make the library completely thread-safe. However, a performance-optimised implementation of complete thread-safety would probably require a complete rewrite of some library modules, which would be extensive and uncreative work. We chose another way: we extended the interfaces of the PVM and MPI libraries with two new functions. The following are the C interface and semantics of the new functions:

void interrupt_recv(void)

Blocks until the thread which had been blocked in the critical recv’s section has been blocked outside of the critical recv’s section. Following this the interrupt_recv() call returns. (If there is no thread blocked in the critical recv’s section, the interrupt_recv() call never returns.)

void resume_recv(void)

Makes the thread which was blocked outside of the critical recv’s section reenter the critical recv’s section and block there. After this resume_recv() returns. (If there is no thread blocked outside of the critical recv’s section, the resume_recv() call never returns.)

The sole addition of these new functions does not, of course, make the PVM or MPI library thread-safe: an arbitrary sequence of concurrent library calls from a multi-threaded application results in undefined behaviour of the application. However, it is safe for a multi-threaded application to concurrently call some of the library functions in a certain defined order. Most importantly, a sequence of concurrent calls which is needed for a polling-less implementation of non-trivial applications is safe.

Definition. A library is called thread-safe on a set of sequences of concurrent calls when none of the sequences of the set leads to memory corruption or to thread interference. We call such sequences safe.

A communication library is called quasi-thread-safe when there exists a safe sequence of concurrent calls which allows a thread to send a message while another thread is blocked in a blocking recv(). •

Claim. Both PVM and MPI libraries extended with the interrupt_recv and resume_recv functions are quasi-thread-safe.

Proof. The sequences of concurrent calls to a communication library that must be safe in order to allow a thread T1 to send a message while a thread T2 is blocked in a blocking recv() are:

{T2: recv(); T1: interrupt_recv(); T1: send(); T1: resume_recv()}

{T1: interrupt_recv(); T2: recv(); T1: send(); T1: resume_recv()}

A process of a non-trivial application which uses a quasi-thread-safe com-

munication library (thread-safe on the above set of sequences) is depicted in

Fig. 2.16. It is obvious that the program only produces safe sequences of calls to

the communication library. •

thread T1()
{
    while (not done)
    {
        compute();
        interrupt_recv();
        send();
        resume_recv();
    }
}

thread T2()
{
    while (not done)
    {
        recv();
        service_request();
    }
}

Figure 2.16: Threaded event-driven implementation of one process of a non-trivial

application using a quasi-thread-safe communication library. This program does

not contain any polling

Remark. A technical detail is the termination of the process in Fig. 2.16 (the setting of the not done conditions in the while loops). In order for the process in Fig. 2.16 to work correctly, it is important that there is a matching recv() call in T2 for each send() call in T1 (otherwise the thread T1 would block forever in the interrupt_recv() preceding the send() call). On the other hand, a correct termination requires that the number of recv() calls in T2 is equal to the number of send() calls in T1 (in other words, the number of executions of the while loop in T1 is equal to the number of executions of the while loop in T2). One way of guaranteeing this involves sending a special finalising message from T1 to T2 (the process addresses the finalising message to itself using the usual sequence {interrupt_recv(); send(); resume_recv()} in the thread T1) when T1 is sure that it will not send any more messages. •

Implementation of the interruptable blocking receive mechanism in a

(thread-unsafe) communication library

The implementation of the two newly introduced functions interrupt_recv and resume_recv requires changes to the original implementation of a (thread-unsafe) communication library. However, the extent of these changes is not very large.

Only the implementation of the blocking recv is affected (in the case of PVM only minor changes in pvmlib are needed; the pvmd remains unchanged). Fig. 2.17 shows the inner workings of the communication library functions. We use a synchronous POSIX pipe [intr_wfd, intr_fd] for the orchestration inside the communication library. This means that a write to the writing end of the pipe intr_wfd becomes blocked until the reading end intr_fd has been read; and vice versa, a read from intr_fd becomes blocked until some data has been written to intr_wfd. A synchronous pipe is not the only synchronisation mechanism that can be used in this scenario (it is also perhaps not the most efficient one), but it is portable (pipes are defined in the POSIX standard [ISO90]) and it can later be replaced with any equivalent mechanism.

A look at Fig. 2.16 and Fig. 2.17 reveals the idea behind the implementation described in the previous paragraph. The thread T2 eventually becomes blocked in the recv() call. The blocking is caused by the select() system call in recv. Now the thread T1 needs to call send(). It cannot do so immediately because the thread T2 is blocked in a critical section of recv. T1 therefore first calls interrupt_recv(), which in turn writes to the pipe whose reading end is one of the file descriptors in T2’s select(). The select() in T2 is fired. (Note that T1 is blocked at this moment in the write() and remains blocked until T2 reads the intr_fd.) The thread T2 takes over, bails out of the critical section of the recv¹² and reads intr_fd, which unblocks the write() in T1 (T1 can now safely call send()). Then T2 becomes blocked in the second read(). After T1 has returned from its send() call, it calls resume_recv(), which writes to intr_wfd. This unblocks the second read() in T2, which reenters the select() in the critical recv’s section. T1 returns from resume_recv() to the application code.
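The handshake described above can be tried out with the following C sketch. It is a hedged illustration, not the modified PVM or MPI code. Ordinary POSIX pipes usually buffer small writes, so a write() alone does not wait for the reader; the sketch therefore assumes a second pipe (ack_pipe, our addition) through which T2 acknowledges that it has left the critical section. A third pipe (data_pipe) stands in for a library socket, and all helper and demo names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

static int intr_pipe[2]; /* [0] = intr_fd, [1] = intr_wfd */
static int ack_pipe[2];  /* assumption: emulates the synchronous rendezvous */
static int data_pipe[2]; /* stands in for a socket of the library */

static void put_byte(int fd) { char c = 0; ssize_t n = write(fd, &c, 1); assert(n == 1); (void)n; }
static void get_byte(int fd) { char c; ssize_t n = read(fd, &c, 1); assert(n == 1); (void)n; }

void interrupt_recv(void)
{
    put_byte(intr_pipe[1]); /* fires the select() in T2 */
    get_byte(ack_pipe[0]);  /* blocks until T2 has left the critical section */
}

void resume_recv(void)
{
    put_byte(intr_pipe[1]); /* releases the second read in T2 */
}

/* Interruptible recv: returns the number of bytes read from data_pipe. */
ssize_t recv_interruptible(char *buf, size_t len)
{
    assert(len > 0);
    for (;;) {
        fd_set readable;
        int max_fd = data_pipe[0] > intr_pipe[0] ? data_pipe[0] : intr_pipe[0];

        FD_ZERO(&readable);
        FD_SET(data_pipe[0], &readable);
        FD_SET(intr_pipe[0], &readable);
        /* critical section: only this thread may use the library here */
        if (select(max_fd + 1, &readable, NULL, NULL, NULL) < 0)
            return -1;
        if (FD_ISSET(data_pipe[0], &readable))
            return read(data_pipe[0], buf, len);
        /* interrupted: we are now outside the critical section */
        get_byte(intr_pipe[0]); /* first read: consume the interrupt */
        put_byte(ack_pipe[1]);  /* tell T1 that it may send */
        get_byte(intr_pipe[0]); /* second read: wait for resume_recv() */
    }
}

static void *receiver_t2(void *arg) /* plays the role of T2 */
{
    char *buf = arg;
    ssize_t n = recv_interruptible(buf, 15);
    buf[n > 0 ? n : 0] = '\0';
    return NULL;
}

/* Plays the role of T1; the "send" is a write into data_pipe.
   Returns 0 if T2 received the message after being resumed. */
int demo_interrupt_send_resume(void)
{
    static char received[16];
    pthread_t t2;

    if (pipe(intr_pipe) != 0 || pipe(ack_pipe) != 0 || pipe(data_pipe) != 0)
        return -1;
    if (pthread_create(&t2, NULL, receiver_t2, received) != 0)
        return -1;
    interrupt_recv();
    if (write(data_pipe[1], "hello", 5) != 5)
        return -1;
    resume_recv();
    pthread_join(t2, NULL);
    return strcmp(received, "hello") == 0 ? 0 : 1;
}
```

In this sketch resume_recv() does not wait for an acknowledgement: a later interrupt byte simply stays in the pipe until T2 is back in select().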

2.6.7 Towards a complete thread-safety of PVM and MPI

The mechanism of the interruptable blocking receive can make any communica-

tion library completely thread-safe without any loss of efficiency caused by active

polling. The communication library should be structured as follows:

• Each process runs a thread (let us call it the main thread) which is exclusively used for receiving all messages addressed to the process. When a message arrives, it is stored by the main thread in a global message queue. The main thread is automatically started during the initialisation of the library.

• Access to the message queue is protected by a mutex.

¹² The implementation of “bailing out” of the critical recv’s section may be tricky. The code shown in Fig. 2.17 is only an abstraction of actual solutions. At the time of writing we have solutions for socket-based PVM 3.4 and MPI 1.2.4 (driver ch_p4).

interrupt_recv()
{
    write(intr_wfd);
}

resume_recv()
{
    write(intr_wfd);
}

recv()
{
    do
    {
        /* BEGINNING OF CRITICAL SECTION */
        ...original recv code preceding select() is inserted here...
        select(original set of file descriptors, intr_fd);
        if (some of the original file descriptors was fired)
        {
            interrupted = FALSE;
            ...original recv code following select() is inserted here...
        }
        else
        {
            interrupted = TRUE;
        }
        /* END OF CRITICAL SECTION */
        if (interrupted)
        {
            read(intr_fd);
            read(intr_fd);
        }
    } while (interrupted);
}

Figure 2.17: Implementation of the interrupt mechanism inside a communication library. intr_fd is the reading end of a synchronous pipe, intr_wfd is the writing end of the pipe

• All send() calls are mutually excluded using a mutex. (The acquiring and releasing of the mutex are hidden in the implementation of send.) The send function calls interrupt_recv() at the beginning and resume_recv() at the end.

• The user’s code runs as a thread (or several threads if the user’s code is multi-threaded).

• The blocking recv function (as well as all other functions accessing the network) is implemented in such a way that it only accesses the global message queue, not the network. If there is no matching message in the queue, the recv function becomes blocked on a condition variable. The main thread is responsible for waking up a blocked thread when a message for the blocked thread arrives.

• Thread addressing should not be part of the communication library’s interface. This means that messages can only be addressed to processes, and not to threads inside a process. It is up to the user to develop a thread addressing scheme if it is needed in the application.
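The queue-based structure listed above can be sketched in C with POSIX threads. This is a minimal sketch under our own assumptions, not the design of an existing library: messages carry only an integer tag and an integer payload, matching is done on the tag alone, and the names queue_put, queue_get and demo_queue are hypothetical. A stub thread plays the role of the main thread that would normally take messages off the network.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct message { int tag; int payload; struct message *next; };

static struct message *queue_head = NULL;
static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_changed = PTHREAD_COND_INITIALIZER;

/* Called by the main (receiving) thread for each arrived message. */
int queue_put(int tag, int payload)
{
    struct message *m = malloc(sizeof *m);
    struct message **p;

    assert(tag >= 0); /* assumption of this sketch: tags are non-negative */
    if (m == NULL)
        return -1;
    m->tag = tag;
    m->payload = payload;
    m->next = NULL;
    pthread_mutex_lock(&queue_mutex);
    for (p = &queue_head; *p != NULL; p = &(*p)->next)
        ;                                   /* append to keep arrival order */
    *p = m;
    pthread_cond_broadcast(&queue_changed); /* wake up blocked receivers */
    pthread_mutex_unlock(&queue_mutex);
    return 0;
}

/* Blocking recv as seen by a user thread: touches only the queue and
   sleeps on the condition variable until a matching message exists. */
int queue_get(int tag)
{
    int payload;

    pthread_mutex_lock(&queue_mutex);
    for (;;) {
        struct message **p = &queue_head;
        while (*p != NULL && (*p)->tag != tag)
            p = &(*p)->next;
        if (*p != NULL) {
            struct message *m = *p;
            *p = m->next;
            payload = m->payload;
            free(m);
            break;
        }
        pthread_cond_wait(&queue_changed, &queue_mutex);
    }
    pthread_mutex_unlock(&queue_mutex);
    return payload;
}

static void *main_thread_stub(void *arg)
{
    (void)arg;
    queue_put(7, 1);  /* a message nobody is waiting for yet */
    queue_put(3, 42); /* the message the user thread waits for */
    return NULL;
}

/* Hypothetical demo: the caller acts as a user thread waiting for tag 3. */
int demo_queue(void)
{
    pthread_t t;
    int value;

    if (pthread_create(&t, NULL, main_thread_stub, NULL) != 0)
        return -1;
    value = queue_get(3);
    pthread_join(t, NULL);
    return value;
}
```

The message with tag 7 stays in the queue after the demo, so a later queue_get(7) returns at once. This is the latency-hiding effect of receiving ahead of the recv() call.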

There are two main objections against implementing the above scenario inside communication libraries such as PVM or MPI [HSS+98]. The first objection is that developers of the communication libraries generally try to avoid using threads inside their libraries. This objection is not fully justified because most contemporary operating systems do support threads. The porting of communication libraries onto systems which are not POSIX compliant is more a political than a technical issue. (Moreover, there may be two versions of a communication library contained in a distribution: one for systems which support threads and another for systems which do not.)

The second objection is related to the efficiency of existing single-threaded applications (which are not non-trivial in the context of Section 2.6). The latency of a recv in such applications may increase in the above scenario. The reception of a message in the application involves a thread switch between the main thread and the application thread. If thread switching is slow on certain systems then this additional overhead cannot be neglected. However, there is also an argument which supports the use of the proposed scenario even for single-threaded applications: the message matching the recv() call may already be available in the message queue before the call to recv() (if the other process has already issued the corresponding send() call and the message has been received by the main thread). This latency hiding can compensate for the overhead incurred by the thread switching.

Even if the second objection were justified, a communication library should

also support non-trivial applications. If the use of threads inside the communi-

cation library is not feasible (the first objection), users of the library must be

given the possibility to implement non-trivial applications without polling. The extension of the libraries’ interfaces (for instance as described in Section 2.6.6) is the least intrusive way of achieving quasi-thread-safety.

The next section shows how the above scenario can be implemented on top of (outside of) quasi-thread-safe PVM and MPI, in a library which sits between PVM/MPI and the user’s application. We call this library TPL, Thread Parallel Library.

2.7 TPL: Event-Driven Thread Parallel Library

TPL is a communication library which provides an application programmer with

functions which are needed for efficient programming of non-trivial parallel appli-

cations (as usual, distributed memory and message passing are assumed). TPL

is thread-safe (more precisely, thread-safe on the set of function sequences that

perform message passing). The ideas described in Section 2.6.6 and Section 2.6.7

are implemented in TPL in a way which is transparent to the user. An applica-

tion which builds on TPL can be multi-threaded (a single-threaded application

is just a special case of a multi-threaded one) and communication functions can

be safely called from multiple threads. Threads can be dynamically created and

destroyed in each process of the application.

TPL is not just another communication library: it is a rigorous implementation of the message passing framework introduced in Section 2.5 on systems which are based on commodity hardware and system software. TPL efficiently implements active messages and message handlers.

Unlike PVM or MPI, TPL tries to be minimalistic. There are two good

reasons for keeping the design minimal (in this order of importance): portability

and efficiency.

There are two degrees of portability. The first degree is an adherence to the

POSIX standard [ISO90] in the implementation of the TPL library. (An essential

part of the POSIX standard which the library depends upon is the concept of

threads.) The second degree is the portability of TPL applications. PVM and

MPI are two message passing standards available today. A mapping onto both

PVM and MPI is implemented in the TPL library. A positive consequence is

that an application written in TPL can be linked and run with either PVM or

MPI without changing a single line of the application’s code.

The minimalistic design of TPL simplifies the tuning of the library for a

specific system. We did not invest much effort into the tuning. Our main goal

was to prove that the event-driven mechanism outperforms the polling one.

Claim. To the best of our knowledge, TPL is the first thread-safe communication

library which is portable and at the same time allows the efficient implementation

of non-trivial parallel applications (without active polling in the application or in

the library itself). The system running the application can be heterogeneous.¹³ •

Claim. To the best of our knowledge, TPL is the first communication library

which allows an application to link and run with an arbitrary (quasi-) thread-safe

implementation of PVM or MPI without changing the application’s code. •

2.7.1 Concept

A TPL application consists of processes which communicate via explicit message passing. The processes do not share memory (the message passing can, but need not, be implemented as shared memory communication on a lower level). Each process is assigned a unique rank. All processes run identical code and differ only in their ranks.¹⁴ The rank is usually used at the very beginning of each process to assign the process its role. A role is the code executed by a process depending on the rank of the process. (For example, in a process farm there are two kinds of processes: one master and several workers. MASTER and WORKER are roles.)
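As a small illustration of role assignment, the following C sketch maps a rank to a role in a process farm. The convention that rank 0 is the master, as well as the names role and role_of_rank, are assumptions made for this example; TPL itself does not prescribe them.

```c
#include <assert.h>

enum role { ROLE_MASTER, ROLE_WORKER };

/* Assumed convention: rank 0 is the master, all other ranks are workers. */
enum role role_of_rank(int rank, int nr_tasks)
{
    assert(rank >= 0 && rank < nr_tasks);
    return rank == 0 ? ROLE_MASTER : ROLE_WORKER;
}
```

A process would typically evaluate such a function once, right after it has learned its rank, and then branch into the code of its role.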

TPL uses an underlying communication library for message passing. This underlying communication library can be any implementation of PVM or MPI or any other communication library which is at least quasi-thread-safe (or completely thread-safe).¹⁵ The choice of the underlying communication library does not influence the functionality of the application¹⁶ and does not require any change in the application (the interface of TPL remains intact). The layered software architecture is shown in Fig. 2.18.

TPL implements the bindings of its interface to both the PVM and the MPI interfaces. We implemented two versions of TPL. The major differences between the two versions are the message queueing model and the thread management. The first version, TPL 1.0, contains a simple thread management and a queueing model based on so-called message subscription. The second version, TPL 2.0, fixes some problems of TPL 1.0 and directly implements the message passing framework of Section 2.5. There is no internal thread management in TPL 2.0 and the message queueing model is simplified.

¹³ The commercial implementation MPI/Pro claims to have the desired properties; however, we could not confirm this. We asked the technical support team at MPI Software Technology, Inc. for more information but the reply we received did not answer our questions.

¹⁴ The concept of all processes having an identical code is known for example from MPI or PARIX [Par94]. Most PVM implementations do not require all processes to run an identical code.

¹⁵ A lack of thread-safety (or quasi-thread-safety) in the underlying communication library would result in a retreat to active polling in TPL, which we decided not to support.

¹⁶ The choice of the underlying communication library influences the efficiency of the application.

[Figure 2.18 (diagram): three software layers, from top to bottom: multi-threaded application; thread-safe TPL; (quasi-) thread-safe PVM, MPICH, ScaMPI, ...]

Figure 2.18: TPL layered software architecture

2.7.2 Process startup and termination

All processes are started once at the beginning and they cannot be dynamically created later.¹⁷ Even with this simplification, hiding the different startup mechanisms in TPL and making them transparent to a TPL application is not easy. Neither MPI nor PVM specifies how the processes are initially created. For instance, MPICH uses a script mpirun to start the processes. The manual starting of processes from the console is allowed in some PVM implementations whereas in other PVM implementations only the first process is started manually (this first process is then responsible for starting the remaining processes using pvm_spawn()).¹⁸ The process startup also remains system specific in TPL. However, from an application programmer’s point of view the startup procedure always looks the same.

Each process’ entry point is the function main in TPL. The command line

arguments argc and argv are passed to the main function in all processes:

int main(int argc, char *argv[]);

As the command line arguments argc and argv can be used by the system specific startup mechanisms, each TPL process must call the function

void tpl_initialize(int *argc, char **argv[])

at the very beginning. This function initialises TPL’s internal memory structures, makes initial calls which are required by the underlying communication library and restores the original contents of argc and argv in the first process (if all processes have been spawned at this point then all processes obtain the same argc and argv).

It is still unclear whether all processes are running at this point. In the case

¹⁷ The static process spawning was a design decision which was made to conform to the available implementations of PVM and MPI.

¹⁸ The startup mechanism can be “hard-wired” in higher-level environments like cluster management systems (e.g. CCS [KRR94], [KR98]).

of e.g. MPICH they are (the mpirun script spawns all processes). However, in some implementations of PVM there may be only one process running whose responsibility is the spawning of the remaining processes. TPL takes care of this difference in the implementation of the additional two functions that must follow tpl_initialize():

void tpl_world_info(int *nr_procs, int *nr_tasks)

returns information on the parallel machine running the program. The number of nodes of the parallel machine is returned in nr_procs and the recommended number of processes which should be started on the machine is returned in nr_tasks.¹⁹ Note that if the processes have already been started, TPL returns the actual number of running processes.

The last function related to startup is tpl_spawn, which must also be called when the processes have already been started by the underlying system:

void tpl_spawn(int *nr_tasks, int *rank, int *argc, char **argv[])

The function tpl_spawn has several responsibilities. Firstly, it spawns all processes if they have not yet been spawned and it initialises the underlying communication library. nr_tasks specifies the number of processes that should be started. tpl_spawn modifies nr_tasks so that it contains the actual number of tasks that have been successfully spawned. Secondly, it restores the original contents of argc and argv (if this has not yet been done) and it also sets the working path of all processes to the working path of the process which was spawned first (PVM implementations do not do this).²⁰ The rank of the calling process is returned in rank.²¹

After the sequence

{tpl_initialize(); tpl_world_info(); tpl_spawn();}

has been called as explained above, TPL guarantees the following:

• All nr_procs processes are running (one process per processor).

• All processes have identical argc and argv and working directories. (The environment of all processes is identical to the environment of the first process on systems where only the first process is started manually.)

• Each process is assigned a unique rank which is an integer from 0 to nr_procs − 1.

¹⁹ The recommended number of processes is usually equal to the number of nodes. However, if TPL detects that the nodes are e.g. double-processors, it recommends starting two processes per node (that means one process per processor).

²⁰ The underlying communication library is already actively involved in this phase: the initial process of synchronisation requires message passing.

²¹ The ranks used in TPL applications are integers from 0 to nr_tasks − 1. These ranks may differ from the actual process identifiers which are used by the underlying communication library. TPL takes care of the translation of the ranks to actual process identifiers and vice versa. This translation is transparent to the application.

Remark. It is not possible to control the mapping of processes onto processors

in TPL. TPL relies on the default mappings of underlying libraries. If the number

of processes is equal to the number of processors, the default mappings always

map one process to each processor. It is very seldom that an application needs

to map more than one process onto one processor or to place a certain process

onto a certain processor. Therefore this is not regarded as a real limitation. •

Each process must call

void tpl_deinitialize()

before it terminates. tpl_deinitialize blocks until all running threads which are registered to TPL (tpl_add_thread, see Section 2.7.3) have terminated (more precisely, the call blocks until all threads which are registered to TPL have been unregistered using tpl_del_thread, see Section 2.7.3).²²

Remark. All the functions above must be called once, from the main thread.

•

2.7.3 Thread management

A process of a TPL application consists of threads. One of these threads is special

and is called the main thread. In a typical application the main thread spawns

other threads at the beginning and then serves as a message dispatcher until the

process terminates. The main thread is identical to the function main which is

started by the operating system.

There are many implementations of the thread concept: Solaris threads, POSIX threads (pthreads), OpenMP, GNU threads, ... TPL internally uses the POSIX pthreads library because it is available on most systems. It is not really important which of the libraries is used as they all support the same mechanisms; only their interfaces are different.

Threads are not separately addressable entities in the TPL message passing

model (a message can only be addressed to a process, using the process’ rank).

TPL provides a means of delivering a message to a specific thread in a process—

however, the thread addressing scheme must be implemented in the application.

Remark. The rest of this section only applies to TPL 1.0. TPL 2.0 does not

implement any bookkeeping for the running threads. •

TPL carries out a basic bookkeeping of the running threads. It does not keep

a record of each running thread—instead it maintains an internal thread counter

²² The blocking of tpl_deinitialize only applies to TPL 1.0.

in order to ensure the correct termination of the process (TPL must be sure that

there are no running threads when it decides to clean up its memory structures).

TPL also stores the ID of the main thread for internal purposes.

The application can freely use any functions provided by the thread library but it must assist TPL in its bookkeeping task. TPL must be informed of events such as the creation or termination of a thread. In order to ensure the correct synchronisation of pthreads and TPL, a few rules must be obeyed in the application:²³

• Each thread (with the exception of the main thread) must call

tpl_add_thread(pthread_self())

after it has been created (prior to any further calls to TPL). TPL creates a message queue for this thread at this point (see Section 2.7.4).

• Each thread (with the exception of the main thread) should call

tpl_signal_new_thread()

after it has been created. This unblocks the thread that created the current thread. tpl_signal_new_thread() does not need to immediately follow the call to tpl_add_thread(). In the meantime, the newly created thread usually subscribes to messages that it wants to receive in the future (see Section 2.7.4).

• Each thread (with the exception of the main thread) must call

tpl_del_thread(pthread_self())

once, before it terminates (the thread must not make any other calls to TPL after this).²⁴ TPL destroys the thread’s message queue and decreases the internal thread counter at this point.

• Each thread that creates another thread must call

tpl_prepare_new_thread()

before the call to pthread_create(). TPL increases the internal thread counter at this point.²⁵

²³ All these rules can be hidden inside TPL. However, this may limit the application in the use of other functions of the pthreads library. The approach proposed here is more flexible.

²⁴ A thread can terminate another thread, in which case the destroyed thread does not have an opportunity to call tpl_del_thread(). In such a case tpl_del_thread() must be called from the other thread, with the ID of the destroyed thread (instead of pthread_self()).

²⁵ Increasing the internal thread counter in tpl_add_thread would lead to a race condition as regards tpl_deinitialize, which takes care of the correct process termination. (Recall that tpl_deinitialize should block until all started threads have called tpl_del_thread.) There may be a thread that has been created but has not yet called tpl_add_thread. In this case TPL would not know about that thread and a call to tpl_deinitialize would return although it should be blocked.

• Each thread that creates another thread should call

tpl_wait_new_thread()

after the call to pthread_create(). This call blocks until tpl_signal_new_thread() has been called by the created thread.
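The counter-based bookkeeping behind these rules can be sketched as follows. This is not the TPL source: it is a hypothetical re-implementation of the idea with a mutex, a condition variable and two counters, and the sketch_ prefix marks every function as our own. Message queue creation (the task of tpl_add_thread) is left out.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t book_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t book_changed = PTHREAD_COND_INITIALIZER;
static int thread_counter = 0; /* threads announced but not yet deleted */
static int ready_signals = 0;  /* new threads that have signalled readiness */

void sketch_prepare_new_thread(void) /* creator, before pthread_create() */
{
    pthread_mutex_lock(&book_mutex);
    thread_counter++;
    pthread_mutex_unlock(&book_mutex);
}

void sketch_signal_new_thread(void) /* new thread, once it is set up */
{
    pthread_mutex_lock(&book_mutex);
    ready_signals++;
    pthread_cond_broadcast(&book_changed);
    pthread_mutex_unlock(&book_mutex);
}

void sketch_wait_new_thread(void) /* creator, after pthread_create() */
{
    pthread_mutex_lock(&book_mutex);
    while (ready_signals == 0)
        pthread_cond_wait(&book_changed, &book_mutex);
    ready_signals--;
    pthread_mutex_unlock(&book_mutex);
}

void sketch_del_thread(void) /* thread, just before it terminates */
{
    pthread_mutex_lock(&book_mutex);
    assert(thread_counter > 0);
    thread_counter--;
    pthread_cond_broadcast(&book_changed);
    pthread_mutex_unlock(&book_mutex);
}

void sketch_deinitialize(void) /* main thread: wait for all threads */
{
    pthread_mutex_lock(&book_mutex);
    while (thread_counter > 0)
        pthread_cond_wait(&book_changed, &book_mutex);
    pthread_mutex_unlock(&book_mutex);
}

static void *worker(void *arg)
{
    (void)arg;
    sketch_signal_new_thread();
    /* ...compute and communicate... */
    sketch_del_thread();
    return NULL;
}

/* Hypothetical demo: returns the counter value after termination. */
int demo_bookkeeping(int nr_threads)
{
    int i;

    for (i = 0; i < nr_threads; i++) {
        pthread_t id;
        sketch_prepare_new_thread();
        if (pthread_create(&id, NULL, worker, NULL) != 0) {
            sketch_del_thread(); /* undo the announcement */
            return -1;
        }
        pthread_detach(id);
        sketch_wait_new_thread();
    }
    sketch_deinitialize();
    return thread_counter;
}
```

Incrementing the counter in the creating thread, before pthread_create(), means that the termination routine cannot miss a thread that exists but has not yet run.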

A generic structure of a multi-threaded TPL process is depicted in Fig. 2.19.

2.7.4 Message passing

There are two layers of message passing in TPL. The coarse layer is message pass-

ing between processes. This coarse layer is covered by PVM and MPI and TPL

implements the mappings of abstract send and recv functions onto PVM and

MPI functions. The fine layer is message passing involving threads in processes.

In TPL, each process (or rather, each thread of the process) can send a mes-

sage to any other process or to a set of processes. Each process (or rather, each

thread of the process) can receive a message from any other process (or a thread

of that process). A message cannot be addressed to a specific thread of a process.

However, any thread can receive an incoming message (even several threads can

receive the same incoming message).

A message consists of a header and a message body. The header consists of

the sender’s rank, recipient’s rank and a message tag (an integer defined by the

user). The message body is a contiguous buffer containing data. (TPL provides

functions for packing and unpacking the data, see Section 2.7.6. These functions

take care of different representations of basic data types in different systems.)

The message body can be empty. A process can address a message to itself.

Remark. This section focuses on sending and receiving messages in any thread except the main thread. The main thread can also send and receive messages but it acts as a message dispatcher for all other threads. The mappings of TPL communication functions to PVM or MPI are different for the main thread. The role of the main thread is explained in Section 2.7.5. •

Sending

Sending a message to a process involves a sequence of calls

{tpl_begin_send(); ...packing...; tpl_send(); tpl_end_send();}

The ...packing... part is used for assembling the message and is explained in Section 2.7.6.

The following sending sequence implements the mutual exclusion of the PVM’s or MPI’s send calls sketched in Section 2.6.7:

void tpl_begin_send()

locks a mutex which protects a PVM’s or MPI’s send() and then calls

void *thread_1(void *arg)
{
    tpl_add_thread(pthread_self());
    /* ...subscribe messages... */
    tpl_signal_new_thread();
    /* ...compute and communicate... */
    tpl_del_thread(pthread_self());
    return NULL;
}

/* ...definition of other threads... */

int main(int argc, char *argv[])
{
    int nr_procs;
    int nr_tasks;
    int my_rank;
    pthread_t thread_id;

    /* PROCESS STARTUP */
    tpl_initialize(&argc, &argv);
    tpl_world_info(&nr_procs, &nr_tasks);
    tpl_spawn(&nr_tasks, &my_rank, &argc, &argv);

    /* THREAD STARTUP */
    /* Start thread 1 */
    tpl_prepare_new_thread();
    if (pthread_create(&thread_id, NULL, thread_1, NULL) == 0)
        tpl_wait_new_thread();
    else
        tpl_error("Could not start thread 1\n");
    /* ...start other threads... */

    /* THREAD AND PROCESS TERMINATION */
    tpl_deinitialize();
    return(0);
}

Figure 2.19: Generic structure of a multi-threaded TPL process (TPL 1.0)

76 CHAPTER 2. EVENT-DRIVEN MESSAGE PASSING

interrupt recv() which interrupts the blocking recv of the main thread. Note

that it is safe for the current thread to perform a send after this call has returned.

Other threads can continue in their computations but become blocked on the

locked mutex when they attempt to send messages.26

void tpl_send(int *recipients, int nr_recipients, int tag,
              void *message, int offset)

performs the actual send. When nr_recipients is equal to one, TPL uses
pvm_send or MPI_Send in the implementation of tpl_send. Multiple recipients
can also be specified in recipients and nr_recipients, in which case TPL uses
the multicast functions of PVM or MPI.

void tpl_end_send()

unlocks the mutex acquired in tpl_begin_send, allowing other threads to send
messages.
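The bracketing described above can be sketched with a plain POSIX mutex. This is only an illustration of the locking discipline, not TPL's code: the names sketch_begin_send, sketch_end_send, send_one and library_send_stub are hypothetical, the stub stands in for pvm_send or MPI_Send, and the interruption of the main thread's blocking receive is left out.

```c
#include <assert.h>
#include <pthread.h>

/* One process-wide lock serialises every use of the library's send call. */
static pthread_mutex_t send_lock = PTHREAD_MUTEX_INITIALIZER;
static int messages_sent = 0;   /* protected by send_lock */

/* Hypothetical stand-in for the underlying library's send function. */
static void library_send_stub(int recipient, int tag)
{
    (void)recipient;
    (void)tag;
    messages_sent++;
}

void sketch_begin_send(void)
{
    /* TPL would additionally interrupt the main thread's blocking receive. */
    pthread_mutex_lock(&send_lock);
}

void sketch_end_send(void)
{
    pthread_mutex_unlock(&send_lock);
}

/* Sends one (empty) message and returns the number of messages sent so far. */
int send_one(int recipient, int tag)
{
    int count;
    assert(recipient >= 0);
    sketch_begin_send();
    /* ...packing would happen here, still under the lock... */
    library_send_stub(recipient, tag);
    count = messages_sent;
    sketch_end_send();
    return count;
}
```

Because packing happens between the two calls, a thread that packs a large message holds the lock for the whole packing phase, which is the source of inefficiency mentioned in Section 2.7.6.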

Remark. A call to tpl_send must eventually return; otherwise the process
would not be able to receive or send any further messages. Therefore the send
function of the underlying communication library must be asynchronous
(nonblocking). In the case of PVM, the semantics of pvm_send fulfils this requirement.
In the case of MPI, the semantics of MPI_Isend also fulfils this requirement,
but the buffer used by the message must unfortunately be freed after the
completion of the MPI_Isend (in other words, an asynchronous send is missing in MPI,
see Section 2.6.3). In order to overcome this problem, we extended the MPICH
library with a new function. This function, MPI_Asend, is almost identical to
MPI_Isend but the MPICH library frees the message buffer automatically after
the request has been completed. •

Receiving

TPL only supports a blocking receive even though it would be easy to implement a
nonblocking receive as well. In a multi-threaded program any nonblocking receive
can be replaced with an additional thread running a blocking receive.

Receiving a message involves a sequence of calls

{tpl_begin_recv(); tpl_recv(); ...unpacking...; tpl_end_recv();}

The ...unpacking... part is used for disassembling the message and is
explained in Section 2.7.6.

void tpl_begin_recv()

does nothing. It is reserved for underlying communication libraries other than
PVM and MPI.

26 An attempt to send a message blocks when another thread is sending at the same time.
However, a thread is allowed to receive a message while another thread is performing a send
because receiving a message in a thread is the same as accessing a local message queue in TPL,
which is not in conflict with the send. (This holds for all threads except for the main thread.)

void tpl_recv(MATCHING_FUNC *match, int *sender, int *tag,
              void **message)

blocks until a message matching the specified criteria is available. match is a
user-defined function which obtains the sender’s rank and the message tag on
input and decides whether there is a match. If there is (if the function match
returns TRUE), tpl_recv unblocks and returns the sender’s rank, message tag and
the packed message body. The message body must be unpacked before a call to
tpl_end_recv().

The following is the interface of the function match:

int match(pthread_t thread_id, int sender, int tag,
          void *message)

The function match is called internally by TPL to determine whether the header
of an incoming message matches the user-specified header. The function returns
TRUE if there is a match and FALSE otherwise.

void tpl_end_recv(void *message)

frees the memory used by the packed message.

Remark. The use of a matching function is a generalisation of PVM’s or MPI’s
matching. PVM and MPI allow either testing whether the rank and tag
of a message are equal to a given sender’s rank and tag, or specifying a
wildcard for the rank, the tag or both. TPL can additionally test for a range of
senders’ ranks (for instance). •
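A matching function that accepts a range of sender ranks might look like the following sketch. The parameter list follows the match interface quoted above; the rank bounds, the tag value and the function name are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#ifndef TRUE
#define TRUE 1
#define FALSE 0
#endif

/* Hypothetical bounds of the accepted sender range (not part of TPL). */
enum { FIRST_WORKER = 4, LAST_WORKER = 7 };
/* Hypothetical user-defined message tag. */
enum { TAG_RESULT = 42 };

/* Accepts TAG_RESULT messages whose sender rank lies in
   [FIRST_WORKER, LAST_WORKER]; thread_id and message are not inspected. */
int match_worker_range(pthread_t thread_id, int sender, int tag, void *message)
{
    (void)thread_id;
    (void)message;
    assert(FIRST_WORKER <= LAST_WORKER);
    if (tag != TAG_RESULT)
        return FALSE;
    return (sender >= FIRST_WORKER && sender <= LAST_WORKER) ? TRUE : FALSE;
}
```

A test of this kind cannot be expressed with a single PVM or MPI receive, which accept either one exact rank or a wildcard.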

2.7.5 Message handling and message callbacks

TPL relies on the existence of a main thread in an application which serves as
a message handler and dispatcher. The main thread is the thread which runs the
function main and is started by the operating system.

After the main thread has initialised and started other threads, it calls
(usually right before tpl_deinitialize())

void tpl_handle_messages(HANDLING_FUNC *message_handler)

Under normal circumstances the termination of this function results in the
termination of the application. This function serves several purposes:

• It receives all messages from other processes. The main thread is the only
  thread which physically receives the messages from the network, using a
  blocking receive function of PVM or MPI that matches all incoming messages.

• Upon the arrival of a message, it unpacks the message, performs a
  user-defined action (a callback) and inserts the unpacked message into other
  threads’ (message subscribers’) message queues.

• It controls the termination of the whole process. Upon the arrival of a
  user-defined termination message, tpl_handle_messages stops listening to
  the network and returns.

Message queueing model, message subscription and message callbacks in TPL 1.0

Fig. 2.20 shows the message queueing model of TPL 1.0. Understanding
this model is crucial to understanding the purpose of the function
message_handler, which is a user-defined function passed as a parameter to
tpl_handle_messages.

Each thread except for the main thread has its own message queue. None of
the threads except for the main thread receives messages from the network; they
only look for messages in their local message queues. All local message queues
are protected by a mutex because they are also accessed by the main thread, which
inserts messages into them as they arrive. There is also a condition variable
associated with each queue. If a thread tries to receive a message which is not in
its queue, it blocks on the condition variable until it is woken up by the main
thread (this happens when a thread is blocked on the condition variable and a
matching message has just been inserted into the queue).
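The interplay of the mutex and the condition variable can be sketched as follows. This is a minimal model under stated assumptions, not TPL's implementation: the type and function names are hypothetical, the queue has a fixed capacity, matching is reduced to a tag comparison, and the inserting side signals on every insertion (the receiver re-checks the queue) rather than only when a matching message has been inserted.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAPACITY 64   /* hypothetical fixed capacity for the sketch */

typedef struct {
    int sender;
    int tag;
} header_t;

typedef struct {
    pthread_mutex_t lock;     /* protects items[] and count */
    pthread_cond_t  arrived;  /* signalled after every insertion */
    header_t items[QUEUE_CAPACITY];
    size_t count;
} thread_queue_t;

void queue_init(thread_queue_t *q)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->arrived, NULL);
    q->count = 0;
}

/* Called by the main thread when a message for this queue has arrived. */
void queue_insert(thread_queue_t *q, header_t h)
{
    pthread_mutex_lock(&q->lock);
    assert(q->count < QUEUE_CAPACITY);
    q->items[q->count++] = h;
    pthread_cond_signal(&q->arrived);   /* wake the owner if it is blocked */
    pthread_mutex_unlock(&q->lock);
}

/* Called by the owning thread: blocks until a header with the wanted tag
   is in the queue, then removes and returns it. */
header_t queue_receive(thread_queue_t *q, int wanted_tag)
{
    header_t found;
    pthread_mutex_lock(&q->lock);
    for (;;) {
        size_t i;
        for (i = 0; i < q->count; i++) {
            if (q->items[i].tag == wanted_tag) {
                found = q->items[i];
                /* Close the gap while preserving arrival order. */
                for (; i + 1 < q->count; i++)
                    q->items[i] = q->items[i + 1];
                q->count--;
                pthread_mutex_unlock(&q->lock);
                return found;
            }
        }
        /* No match yet: release the lock and sleep until an insertion. */
        pthread_cond_wait(&q->arrived, &q->lock);
    }
}
```

The loop around pthread_cond_wait is required in any case, because POSIX allows spurious wakeups.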

Remark. Note the difference between the sending and receiving of messages in
threads (not the main thread). tpl_send sends a message directly to the network,
whereas tpl_recv only accesses a local message queue. •

The thread message queues only contain pointers to messages. One message
can appear in several message queues, therefore several pointers can point to the
same message data structure. Each message data structure contains a reference
counter. After a thread (not the main thread) has read a message from its local
queue (tpl_recv) and unpacked the message data, the corresponding message
pointer is deleted from the thread’s message queue and the reference counter in
the message data structure is decreased (tpl_end_recv takes care of this). The
message data structure is freed when the reference counter reaches zero.
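The reference counting described above can be illustrated with a small sketch. The structure layout and the names message_t, message_create and message_release are hypothetical; the text does not show TPL's actual MESSAGE structure. The sketch assumes the counter is initialised to the number of queues into which the message was inserted.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t lock;   /* protects refs */
    int refs;               /* number of queue entries pointing here */
    void *data;             /* message body, owned by this structure */
} message_t;

/* Creates a message shared by `subscribers` queues; takes ownership of data. */
message_t *message_create(void *data, int subscribers)
{
    message_t *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    pthread_mutex_init(&m->lock, NULL);
    m->refs = subscribers;
    m->data = data;
    return m;
}

/* Drops one reference. Returns 1 if this call freed the message, else 0. */
int message_release(message_t *m)
{
    int remaining;
    pthread_mutex_lock(&m->lock);
    assert(m->refs > 0);
    remaining = --m->refs;
    pthread_mutex_unlock(&m->lock);
    if (remaining == 0) {
        /* Last reader: nobody else can reach m any more. */
        pthread_mutex_destroy(&m->lock);
        free(m->data);
        free(m);
        return 1;
    }
    return 0;
}
```

With two subscribers, the first release only decrements the counter and the second one frees the body, so a large message body is stored once regardless of the number of readers.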

Each thread must subscribe to the messages that it wants to receive in the
future. This usually happens immediately after the thread has been started (see
Fig. 2.19), but subscriptions can be added and removed at any time.

void tpl_subscribe(pthread_t thread_id, MATCHING_FUNC *match)

adds a subscription of the thread thread_id to a list of subscriptions.27 Whenever
a message arrives, the main thread matches the message against all subscriptions
in the list and when it finds a match, it inserts the message into the message queue
of the corresponding thread. If there is no match, the message is discarded.

void tpl_unsubscribe(pthread_t thread_id, MATCHING_FUNC *match)

removes a subscription from the list. (If match is a NULL pointer, all subscriptions
of the thread thread_id are removed from the list.)

27 The functions tpl_wait_new_thread and tpl_signal_new_thread (see Section 2.7.3) are
used to ensure that a newly created thread has made its subscriptions before the main thread
continues. Without this synchronisation a message “addressed” to the newly created thread
might get lost before the new thread can subscribe to it.

[Diagram: the main thread (message handler) calls recv() on the network;
Thread1, Thread2, ..., ThreadN each call tpl_recv() on their own queue; the
queue entries (msg) refer to shared message data items (1) and (2).]

Figure 2.20: Message queueing model of TPL 1.0. Upon the arrival of a message,
the main thread inserts messages into the queues of the threads which subscribed
to the message. In order to avoid a replication of the (possibly large) data stored
in the message bodies, only the message headers are inserted into the message
queues. The message data is stored only once and referenced by the message
headers. The main thread also signals the semaphore associated with the message
queue into which it is inserting a message (in order to wake up the thread which
may already be waiting for the message).

Remark. The termination of a process is an example of a situation where several
threads want to subscribe to the same message (more precisely, a message with a
particular tag is reserved for termination). Upon the arrival of a termination message,
all threads are notified. •

Message callbacks are user-defined actions which are triggered when a message

from the network arrives. The idea of callbacks is not to disturb computing

threads with events that can be serviced by the main thread.

A typical example of a callback is the servicing of a request from some other
process for data stored in the recipient’s memory. The recipient process can be
running a computation in one of its threads while the request for data is received
by its main thread. The request does not need to be inserted into the computation
thread’s message queue. Instead, the main thread replies with the data
and discards the message afterwards.

Callbacks are implemented in the function message_handler which is passed
to the main thread as an argument of tpl_handle_messages. This function is
called every time a message arrives. tpl_handle_messages terminates when the
user-defined function message_handler returns TRUE. The implementation of the
function tpl_handle_messages which is called from the main thread is sketched
in Fig. 2.21.

Message queueing model, message subscription and message callbacks in TPL 2.0

The queueing model in TPL 2.0 is depicted in Fig. 2.22. There is one global queue
of incoming messages (similar to the unexpected queue in the implementation of
MPICH and in other MPI implementations). Threads do not need to subscribe to
and unsubscribe from messages in TPL 2.0. Any thread can decide to receive or send
a message at any time. Unlike in TPL 1.0, in TPL 2.0 each message arriving
from the network is delivered to only one thread (or consumed by the message
handler).

Another difference between TPL 1.0 and TPL 2.0 is that there is no thread

management in TPL 2.0. The TPL 2.0 library does not know which threads

are running (the library only knows which thread is the main thread—the main

thread acts as the message handler). The thread synchronisation is left to the

application.

The function

tpl_handle_messages(HANDLING_FUNC *message_handler)

internally calls the user-supplied function message_handler upon the arrival of a
message. The function message_handler processes the message and returns one
of the following values:

• TPL_ACTION_ENQUEUE indicates that the message should be enqueued in the
  message queue.

• TPL_ACTION_DROP indicates that the message has already been processed by
  the message handler and it should be forgotten without being enqueued.

• TPL_ACTION_EXIT indicates that the function tpl_handle_messages should
  terminate (without the insertion of the message into the message queue).

void tpl_handle_messages(HANDLING_FUNC *message_handler)
{
    int sender;
    int tag;
    MESSAGE *msg;
    void *packed_data, *unpacked_data;
    int quit;

    quit = FALSE;
    while (!quit)
    {
        tpl_begin_recv();
        /* Note: tpl_recv receives from the network when called
           from the main thread */
        tpl_recv(pthread_self(), &sender, &tag, &packed_data);
        /* Run the user-defined callback; TRUE means "terminate" */
        quit = message_handler(sender, tag, packed_data, &unpacked_data);
        tpl_end_recv();
        /* ...initialise the message structure msg... */
        msg->sender = sender;
        msg->tag = tag;
        msg->data = unpacked_data;
        /* ...match msg against all subscriptions and
           insert msg into message queues of subscribers... */
        /* ...if no subscriber found then discard msg... */
    }
}

Figure 2.21: Implementation of tpl_handle_messages in TPL 1.0

[Diagram: the main thread (message handler) calls recv() on the network and
either matches an arriving message against the waiting thread queue (Thread1 in
tpl_recv(match1), Thread2 in tpl_recv(match2), ...) or inserts it into the
message queue; the current thread in tpl_recv(match_current) matches and removes
a message from the message queue, or inserts itself into the waiting thread
queue if there is no match.]

Figure 2.22: Message queueing model of TPL 2.0. Upon the arrival of a message,
the main thread first looks for a match among the threads waiting in the thread
queue. If there is a match, the message is passed to the waiting thread (and the
thread is woken up). If there is no match, the message is inserted into the message
queue. A thread (which is not the main thread) which is calling tpl_recv() first
looks into the message queue. If it finds a matching message in the message
queue, it removes it from the queue. If there is no matching message in the
message queue, the thread inserts itself into the waiting thread queue.

The return value TPL_ACTION_EXIT is used to terminate the main thread.
The application must take care of the correct termination of the remaining
threads before the main thread terminates. Note that after the termination
of the main thread no messages can be sent or received. A useful trick in the
implementation of a termination protocol is to run a communication round which
ensures that all processes are ready to terminate. After this round every process
sends a termination message to itself.

In TPL 2.0, the user-supplied function message_handler is not responsible for
unpacking the message body (unless the handler processes the message
itself and drops it afterwards).
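The three return values of a TPL 2.0 message handler can be illustrated with a self-contained model. The action names are taken from the text; everything else is an assumption made for the sketch: the numeric values of the actions, the handler's parameter list (the real HANDLING_FUNC type is not shown here), the tags, and the dispatch function that stands in for tpl_handle_messages by reading tags from an array instead of the network.

```c
#include <assert.h>

/* Action names come from the text; the numeric values are assumptions. */
typedef enum {
    TPL_ACTION_ENQUEUE,
    TPL_ACTION_DROP,
    TPL_ACTION_EXIT
} tpl_action_t;

/* Hypothetical user-defined tags. */
enum { TAG_DATA_REQUEST = 1, TAG_WORK = 2, TAG_TERMINATE = 99 };

static int requests_served = 0;

/* Hypothetical handler signature, simplified to sender and tag. */
tpl_action_t message_handler(int sender, int tag)
{
    (void)sender;
    switch (tag) {
    case TAG_DATA_REQUEST:
        requests_served++;           /* callback: serviced by the main thread */
        return TPL_ACTION_DROP;
    case TAG_TERMINATE:
        return TPL_ACTION_EXIT;
    default:
        return TPL_ACTION_ENQUEUE;   /* left for a tpl_recv in some thread */
    }
}

/* Stand-in for the dispatch loop: feeds tags[0..n-1] to the handler and
   returns how many messages were enqueued before EXIT was returned. */
int dispatch(const int *tags, int n)
{
    int i, enqueued = 0;
    assert(tags != 0 && n >= 0);
    for (i = 0; i < n; i++) {
        tpl_action_t action = message_handler(0, tags[i]);
        if (action == TPL_ACTION_EXIT)
            break;
        if (action == TPL_ACTION_ENQUEUE)
            enqueued++;
    }
    return enqueued;
}
```

In this model a message that follows the termination message is never looked at, which mirrors the statement that no messages can be received after the main thread has terminated.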

2.7.6 Message packing and unpacking

TPL supports heterogeneous systems to the extent that the underlying
communication library does. The processors that run the processes may be different or run different
operating systems. Data types (e.g. int, float) can have different lengths on
two processors or their binary representation can differ.

This has some implications for message passing. It is desirable that if a

process 0 sends a message containing an integer 7 to process 1, then the process

1 should see the same value 7 in its local copy of the integer, despite the different

representations of 7 in both processes.

Both PVM and MPI solve this problem using datatypes. A datatype
corresponds to a simple data type of a programming language: char, int, float, long
etc.28 The communication library must have information on the datatypes stored in
messages. It can then encode the data from the sender’s format into the XDR
format (External Data Representation, a machine-independent encoding published
as an Internet RFC) on the sender side and decode the XDR representation into
the receiver’s format on the receiver side.
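The idea of a machine-independent wire format can be shown for a single 32-bit integer. The sketch below is not PVM, MPI or TPL code; the function names are hypothetical. It writes the value most significant byte first, which is the byte order XDR uses for 32-bit integers, so the receiver reconstructs the same value whatever its native byte order is.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Writes v as 4 bytes, most significant byte first.
   Returns the number of bytes written. */
size_t pack_int32(unsigned char *buf, int32_t v)
{
    uint32_t u = (uint32_t)v;
    assert(buf != NULL);
    buf[0] = (unsigned char)(u >> 24);
    buf[1] = (unsigned char)(u >> 16);
    buf[2] = (unsigned char)(u >> 8);
    buf[3] = (unsigned char)u;
    return 4;
}

/* Rebuilds the value from 4 bytes produced by pack_int32.
   The final conversion assumes two's complement int32_t, which holds on
   all common platforms. */
int32_t unpack_int32(const unsigned char *buf)
{
    uint32_t u;
    assert(buf != NULL);
    u = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
        ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
    return (int32_t)u;
}
```

Packing the integer 7 yields the bytes 00 00 00 07 on every sender, and unpacking them yields 7 on every receiver, which is the property required in the example with process 0 and process 1.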

PVM and MPI offer packing and unpacking functions for all simple datatypes.
The assembling of a message in a contiguous buffer in the sender is a sequence
of calls to packing functions. The disassembling of a message in the receiver is
a sequence of corresponding calls to unpacking functions.29 Finding a common
interface which maps well to both PVM and MPI is not easy because of the
differences between these two libraries. The main difference is that PVM uses
an internal global buffer for storing the packed data, whereas MPI uses buffers
allocated by the application. A consequence of this is that in MPI the
application has to determine in advance (before packing a message on the sender side
or before receiving a message on the receiver side) how much space the
sending and receiving buffers will require. PVM adjusts the buffer sizes on the fly,
transparently to the application.

28 MPI also uses more complex datatypes (derived datatypes) which correspond to higher
level memory structures in programming languages (struct, array, ...). They are convenient
but not necessary.

29 PVM and MPI also offer other possibilities for message assembly which do not require the
copying of data into a contiguous buffer (“data in place”). This is not supported in TPL.

TPL must be able to emulate the packing and unpacking of both PVM and

MPI without changing the interfaces of the libraries. The interfaces of TPL’s

packing/unpacking functions correspond to the interface of MPI. However, TPL

internally uses a global sending buffer and a global receiving buffer. The use

of global buffers, which is dictated by the need to map onto PVM, is the reason

why threads in TPL must already be mutually excluded during the packing phase

(see Section 2.7.4)30. This can be a potential source of inefficiency in TPL in the

situation when several threads attempt to pack and send messages at the same

time.

MPI requires the receiving process to allocate the buffer for an incoming
message. TPL allocates buffers of a default size during the initialisation phase of
each process. If the application sends a message that is larger than the size of the
default buffer, the buffer in the receiver must be adjusted. TPL uses an internal
protocol to ensure that there is enough space in the receiver. When the sender
detects that it is sending a message that might exceed the receiver’s buffer size,
it first sends a controlling message telling the receiver to increase its buffer size.31
This leads to an additional communication overhead for large messages, but the
additional overhead can be avoided by setting the default buffer size sufficiently
large.
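The sender-side decision in such a protocol can be sketched as a pure function. This is an illustration under assumptions, not TPL's protocol: the function name is hypothetical, and the pessimistic ratio of 2 is an assumed value (it would cover, for example, a 4-byte float that needs 8 bytes in the receiver).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pessimistic bound: every packed byte may need up to this
   many bytes in the receiver's representation. */
#define WORST_CASE_RATIO 2

/* Returns the buffer size the receiver should be asked to provide, or 0 if
   no control message is needed because the known buffer is large enough. */
size_t required_resize(size_t packed_bytes, size_t receiver_buffer_bytes)
{
    size_t worst_case;
    assert(packed_bytes <= (size_t)-1 / WORST_CASE_RATIO);  /* no overflow */
    worst_case = packed_bytes * WORST_CASE_RATIO;
    return (worst_case > receiver_buffer_bytes) ? worst_case : 0;
}
```

A non-zero result corresponds to the case in which a controlling message precedes the data message; with a sufficiently large default buffer the function always returns 0 and the extra message is never sent.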

2.7.7 Error handling and debugging

Error handling in TPL is very strict. It is assumed that the application is correct

(for instance, it is assumed that a message is always addressed to an existing

process)—there are no sanity checks inside TPL which would significantly influ-

ence efficiency. Internal sources of problems such as an unsuccessful attempt to

allocate memory lead to the termination of the process.32

Debugging facilities provided by the underlying communication libraries can
be used with TPL. TPL does not provide any additional debugging tools.

30 Although PVM 3.4 can work with multiple buffers, its packing and unpacking
functions are not thread-safe.

31 Note that the representation of the data can require more memory in the receiver than
in the sender (for instance, a float can be coded using 4 bytes in the sender but 8 bytes in
the receiver). TPL uses a pessimistic upper bound to estimate the ratio of different lengths of
simple types.

32 At the time of writing, the implementation of TPL does not attempt to “correctly”
terminate all the processes of the application. A future version of the TPL library may implement
a global handling of internal fatal errors (if the underlying communication library provides
error handling).

2.7.8 Flow control

Some implementations of MPI (e.g. MPICH) use a so-called flow control
mechanism. This mechanism prevents a process from being flooded with messages which
arrive from other processes. There is a certain fixed amount of memory allocated
for the buffering of unexpected incoming messages (an unexpected message is a
message for which no receive operation has been posted). Once this buffer is
full, the process only accepts messages for which a receive operation has been
posted.

Buffering at the receiver can increase the efficiency of message passing
applications in some scenarios. Consider the following sequence of message passing
operations which involves three processes, A, B and C:

1. A posts a receive operation which matches any message from B.

2. C sends a message to A.

3. B sends a message to A.

4. A posts a receive operation which matches any message from C.

What happens in step 2? The message being sent by C either remains in C or
it is passed to A even though there is no matching receive operation in A. In the
latter case the message is stored in the unexpected message queue of A (without
the completion of the send operation) and it is retrieved from the unexpected
queue in step 4. The retrieval from the unexpected message queue is usually
much faster than the transfer of the message from C to A. Hence, the costs of
the transfer are amortised in the latter case.

However, if the message being sent in step 2 was larger than the buffering space
available in the process A, then the flow control mechanism would not allow the
transfer of the message to A in step 2; the transfer would be postponed to step 4.
Note that the scenario above works correctly in both cases. Its result does not
depend on whether the process A buffers the message sent in step 2 or not. The
following scenario is more dangerous as it may sometimes result in a deadlock:

1. A sends a message to B.

2. B sends a message to A.

The MPI standard states that this scenario will result in a deadlock if the
synchronous send (MPI_Ssend) is used in either A or B. The MPI standard also
states that the scenario will not result in a deadlock if the nonblocking send
(MPI_Isend) is used in both A and B. Finally, the standard states that the
scenario may result in a deadlock (depending on the implementation of the MPI
library) if the default send (MPI_Send) is used in both A and B (the same holds
for the combination of MPI_Send and MPI_Isend). The reality is even
worse: the second scenario may sometimes deadlock and sometimes not with the
same MPI implementation if MPI_Send is used in both A and B! The buffer
spaces in A and B may be large enough to store one message. In this case both
A and B will make progress if their buffers are empty. However, a third process
C may already have sent an unexpected message to both A and B, and the
unexpected message may already have consumed the entire available buffering
space in A and B. In this case the scenario will result in a deadlock. No deadlock
occurs if the third process does not participate in this communication scenario.
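The dependence of the second scenario on the free buffer space can be captured in a deliberately simplified model. The model is an assumption made for illustration and does not describe any particular MPI implementation: a default send completes at once if the receiver can buffer the whole message and otherwise blocks until the receiver posts a matching receive, and each process posts its receive only after its send has completed. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Model of a default send: completes immediately only if the receiver has
   enough free buffer space for the whole message. */
static int send_completes_eagerly(size_t msg_bytes, size_t free_at_receiver)
{
    return msg_bytes <= free_at_receiver;
}

/* A and B each send msg_bytes to the other and only then post a receive.
   Returns 1 if the exchange deadlocks in this model, 0 otherwise. */
int symmetric_exchange_deadlocks(size_t msg_bytes,
                                 size_t free_in_a, size_t free_in_b)
{
    int a_send_done = send_completes_eagerly(msg_bytes, free_in_b);
    int b_send_done = send_completes_eagerly(msg_bytes, free_in_a);
    /* If either send completes, that process reaches its receive, which
       unblocks the other send. Only two blocked sends form a cycle. */
    return !a_send_done && !b_send_done;
}
```

With empty buffers the exchange succeeds, whereas the same program deadlocks once an earlier unexpected message from a third process has consumed the buffer space in both A and B; this is the nondeterminism described above.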

Remark. The second scenario can often be found in practice, especially in
parallel finite-element methods. For instance, consider an application which consists
of parallel processes connected in a ring. All the processes run identical code.
The parallel computation runs in rounds. In one round each process sends a
value to its two neighbours and receives a value from its two neighbours. The use
of the synchronous send results in a deadlock unless the symmetry in the ring is
broken (see the example solution of the Problem of Dining Philosophers in
Section 2.2.1). As the breaking of the symmetry usually involves structural changes
in the application code, a solution which preserves the symmetry is preferred. A
solution which involves the use of the nonblocking send MPI_Isend is given in the
MPI tutorial book [GLS95]. This solution is independent of the amount of
buffering space at the receiver. •

The previous discussion suggests that a flow control mechanism in combination
with buffering at the receiver is very useful because it speeds up applications
if there is enough buffer space at the receiver and at the same time ensures that
the available buffer space in the processes is not exceeded. As flow control does
not apply to the nonblocking send (MPI_Isend), the problem of deadlock in
symmetrical scenarios seems to be solved. However, the amount of memory at the sender
is also limited. If there is no throttle on the amount of memory taken by the
pending nonblocking sends, then a process which issues too many nonblocking
sends runs out of memory.

Flow control is a pessimistic solution to the problem of finite memories of the

processes. Flow control assigns a fixed amount of the buffering space to every

process. There are scenarios which result in a deadlock even though there may be

more available memory in the processes involved than the fixed amount of buffer

space.


Active messages

TPL uses an optimistic approach to the problem above. All the incoming
messages are buffered at the receiver. This approach attempts to amortise as much
latency and message transfer overhead as possible. The disadvantage is that one
of the processes can be overflooded with messages for which no matching receive
has been posted at that process.33 However, this disadvantage is only illusory for
the following reasons:

• The communication scenarios of many applications guarantee that no
  process is overflooded with unexpected messages.

• The applications in which a process may be overflooded with unexpected
  messages result in a deadlock if flow control is used.

• The implementation of flow control is fairly easy in TPL. Hence, TPL gives
  the application the possibility to decide whether flow control should be used.
  Unlike in MPI implementations, the buffer spaces taken by the incoming
  and outgoing messages can be separately controlled in TPL.

Remark. The flow control mechanism which is implemented in MPICH is a
source of inefficiency in TPL if MPICH is used as the underlying communication
library. The flow control mechanism of MPICH must be circumvented in order to
implement the optimistic approach in TPL. A part of this workaround is hidden in
the implementation of the function MPI_Asend (see Section 2.7.4). Circumventing
MPICH’s flow control mechanism is only tackled in
TPL 2.0 (TPL 1.0 does not work correctly with an underlying MPI library which
uses flow control).

PVM 3.4 uses buffering at the receiver but no flow control. This simplifies
the design of the binding of TPL’s operations to PVM. •

2.8 Efficiency benchmarks

This section compares the efficiency of the original PVM and MPI
implementations with that of TPL based on the same libraries. We used an “out of the
box” build of PVM 3.4 and MPICH 1.2.4 with the quasi-thread-safe extensions
described in Section 2.6.6 (the extensions have no impact on the efficiency of
applications that do not use them).

33 The matching receive must be either posted by a thread which is different from the main
thread or the message must be consumed by a callback in the main thread. Otherwise the
message is inserted into the “unexpected queue” of the receiving process. The growth of this
“unexpected queue” beyond the available memory is called overflooding.

All the measurements were run on the Fujitsu-Siemens hpcLine cluster in the
Paderborn Center for Parallel Computing (PC²) at the University of Paderborn,
Germany. The cluster consists of 96 Siemens Primergy double-processor nodes.
The nodes have two independent network interfaces: SCI (500 MBit/sec Scalable
Coherent Interface by Scali/Dolphin) and Fast Ethernet (100 MBit/sec).
We used the Fast Ethernet network, which is supported by both “out-of-the-box”
PVM 3.4 and MPICH 1.2.4 libraries. Each node of hpcLine is a double-processor
850 MHz Intel Pentium III with 512 MBytes RAM, running Red Hat Linux.

We observed various aspects of polling and event-driven approaches on two
benchmarks:

• ONE-SIDED THREADED PINGPONG and

• SYMMETRICAL THREADED PINGPONG.

These benchmarks are new. They cannot be found among standard
benchmark programs for PVM and MPI although their structure, especially the
structure of SYMMETRICAL THREADED PINGPONG, directly corresponds to the
structure of all irregular parallel programs. The traditional PINGPONG
benchmark involves two parallel processes, whereby one of the processes acts as a server
which awaits “PING” messages and responds with “PONG” and the other
process generates the “PING” messages and waits for the “PONG” replies. In our benchmarks,
especially in the SYMMETRICAL THREADED PINGPONG, both processes act
as clients and servers at the same time.

Both benchmarks involve two parallel processes, PING and PONG, whereby
the process PING consists of two threads, T1 and T2. The thread T1 sends a
number of messages to the process PONG and the thread T2 only receives the same
number of messages from the process PONG. The two benchmarks only
differ in the implementation of the PONG process. In ONE-SIDED THREADED
PINGPONG, the PONG process is single-threaded. It first waits for a
message arriving from the process PING and after the reception of the message it
sends an answer back to the process PING. In SYMMETRICAL THREADED
PINGPONG, the PONG process also consists of two threads, T1 and T2. The
thread T1 of the process PONG does nothing. The messages are only received
and sent in the thread T2. However, T2 must be aware that T1 may also want
to communicate. This implies that T2 must not use a blocking recv unless the
communication library is thread-safe.

Polling and event-driven versions of both benchmarks (Fig. 2.11 and Fig. 2.12)
were implemented and compared in the measurements. The functions compute
and service_request are empty. We did not use the generic polling scheme
from Fig. 2.12 for the measurements but an optimised one. Fig. 2.23 shows the
optimised pseudo-code of PING. The difference between this one and the generic
version from Fig. 2.12 is that the optimised version avoids calling sleep(time)
when there is a continuous flow of messages arriving in PING. (Each sleep()
call costs at least 0.02 sec, see Section 2.6.4.)

thread T1()
{
    while (not done)
    {
        lock(comm);
        send();
        unlock(comm);
    }
}

thread T2()
{
    while (not done)
    {
        lock(comm);
        arrived = probe();
        while (arrived)
        {
            recv();
            arrived = probe();
        }
        unlock(comm);
        sleep(time);
    }
}

Figure 2.23: An optimised polling implementation of the PING process. The
optimal setting of time in the sleep(time) call is 50 milliseconds (see Section 2.6.4).
This optimal setting was used in the measurements.

All event-driven measurements with ONE-SIDED THREADED PINGPONG
were performed using TPL 1.0 and all measurements with SYMMETRICAL
THREADED PINGPONG were performed using TPL 2.0. However, the
influence of the differences between TPL 1.0 and TPL 2.0 on the measurements is
negligible.

In the following, MPICH/TPL refers to benchmarks which use the event-driven
version of TPL based on quasi-thread-safe MPICH. Similarly, PVM/TPL denotes
the event-driven version of TPL based on quasi-thread-safe PVM. PVM/polling
and MPICH/polling denote the polling versions of the benchmarks based on the
official PVM 3.4 and MPICH 1.2.4 libraries.

2.8.1 ONE-SIDED THREADED PINGPONG

In each run of the ONE-SIDED THREADED PINGPONG program, 100000

messages were sent from PING to PONG and 100000 messages were sent in the

opposite direction. Each run was repeated using message sizes from 1 Byte to

1 MByte (steps of powers of 2). Moreover, each of these rounds was repeated

10 times in order to exclude external factors related to the operating system and

the network.


The absolute running time of one experiment was measured. The timer

was started in the process PING right before the while (not done) loop and

stopped right after the loop. In order to exclude a possible initial message pass-

ing latency of PVM and MPICH, the loop was preceded by a “warmup” phase

during which the PING and PONG processes exchanged a few messages. We

also measured the number of sleep() calls in each run of the polling version of

the benchmark.

1 hpcLine node

The graph in Fig. 2.24 shows the average measured throughput over the 10

rounds. Throughput is the total size of all messages sent during a run, divided

by the duration of the run (messages in both directions count). Throughput

is a measure which is strongly correlated to the efficiency of communication-

intensive parallel applications. In this set of measurements, the processes PING

and PONG were run on the same node (a loopback was used for the physical

communication, avoiding the networking overhead).
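The throughput definition and the set of message sizes can be written down as a short Python sketch. The function names are hypothetical, and the convention 1 MByte = 2^20 bytes is an assumption; the text does not state which convention the plots use.

```python
MBYTE = 2 ** 20  # assumption: 1 MByte = 2^20 bytes


def message_sizes():
    """Message sizes from 1 Byte to 1 MByte in steps of powers of 2."""
    return [2 ** k for k in range(21)]


def throughput_mbyte_per_sec(message_size, messages_per_direction, seconds):
    """Total payload sent in both directions divided by the run duration."""
    total_bytes = 2 * messages_per_direction * message_size
    return total_bytes / MBYTE / seconds
```

For example, a run with 100000 messages of 1024 bytes in each direction that takes 10 seconds (a made-up duration, not a measured one) corresponds to about 19.5 MByte/sec.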

[Plot: throughput (MByte/sec) versus message length (Bytes, logarithmic axis from 1 to 1048576) on 1 hpcLine node; curves: PVM/TPL, MPICH/TPL, PVM/polling, MPICH/polling]

Figure 2.24: Average throughput, 1 node hpcLine

The PVM/polling graph must be compared to the PVM/TPL graph. The polling

version is on average only slightly better than the event-driven version for message

sizes up to 1 kByte. For larger message sizes PVM/TPL is not only clearly better


but also much more stable.³⁴ Note the falloff of PVM/polling at the message size of 32 kByte: this is the message size where the message flow arriving in PING is not continuous and so the thread T2 calls sleep(time) very frequently (see Fig. 2.23). The danger of such falloffs is that they can neither be predicted nor systematically eliminated.

The same holds for the comparison of MPICH/polling to MPICH/TPL. MPICH/TPL is slightly worse for small message sizes but clearly better for message sizes from 8 kByte upwards. Most importantly, MPICH/TPL is much more stable than MPICH/polling.

Another measure of stability is the standard deviation of the absolute times, or

(after scaling by the total message size) the standard deviation of the throughput

over the same 10 runs. The standard deviation of the throughput is shown in

Fig. 2.25.
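Since the deviation is plotted in percent, it is presumably taken relative to the mean over the 10 runs. A minimal Python sketch of such a measure follows; the function name is hypothetical, and the choice of the population standard deviation (rather than the sample standard deviation) is an assumption, as the text does not say which one was used.

```python
import statistics


def relative_std_percent(values):
    """Population standard deviation expressed as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        return 0.0                     # avoid dividing by zero for all-zero input
    return 100.0 * statistics.pstdev(values) / mean
```

Ten identical throughput values give 0 %, while two runs with 9 and 11 MByte/sec (illustrative numbers only) give 10 %.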

[Plot: standard deviation of throughput (%) versus message length (Bytes, logarithmic axis from 1 to 1048576) on 1 hpcLine node; curves: PVM/TPL, MPICH/TPL, PVM/polling, MPICH/polling]

Figure 2.25: Standard deviation of throughput, 1 node hpcLine

The standard deviations of the event-driven PVM/TPL and MPICH/TPL should

theoretically be 0 because the overhead of the interrupt mechanism is constant

for every message arriving in PING. These deviations are indeed very low but

range from 0 to 8%. The non-zero values can be explained by external system

factors (e.g. thread scheduling brings certain irregularities to the measurements).

³⁴ The characteristics of the PVM/TPL graph are similar to the characteristics of a simple pingpong PVM benchmark. In a simple pingpong benchmark the process PING is single-threaded. It runs a loop in which it first sends a message and then receives the reply.


MPICH/polling is surprisingly stable on this measure. The same experiment always took approximately the same time when it was repeated 10 times. The difference in the number of sleep() calls in the runs was also negligible. This does not correspond to our expectations: the number of sleep() calls should be very random. Only a detailed study of MPICH’s complex buffering model may lead to a satisfactory explanation. MPICH also exhibits surprising regularities in specially constructed irregular communication scenarios [KS02].

PVM/polling is, as expected, very unstable. Different runs of the same experiment produce very different times. There is an extremely high correlation between the standard deviation of the number of sleep() calls in the runs and the standard deviation of throughput (or absolute time). The graph in Fig. 2.26 shows the standard deviation of the number of sleep() calls. This graph and the PVM/polling graph in Fig. 2.25 are almost identical!

[Plot: standard deviation of the number of sleep() calls (%) versus message length (Bytes, logarithmic axis from 1 to 1048576) on 1 hpcLine node; curves: PVM/polling, MPICH/polling]

Figure 2.26: Standard deviation of the number of sleep() calls in the polling

versions of the benchmark, 1 node hpcLine

The polling overhead further depends—when the optimised polling version is

used—on the continuity of the message flow. The continuity of the message flow

(which is measured as the number of sleep() calls) depends on the application

that generates the messages, on the buffering model used by the communica-

tion library and by the network, and on the polling inside the communication

library (see Section 2.6.3). The buffering has a “smoothing” effect—messages

that have been sent by PONG in short time intervals (there is a short pause


between each two messages sent by the process PONG) will be received in the

process PING as a continuous flow (see Fig. 2.27). Note that the PINGPONG

benchmark (whether PING is multi-threaded or single-threaded is irrelevant)

has the potential to generate continuous message flows. However, a continuous

stream of large messages leads to buffer overflows which cause “cuts” in the flow.

Each “cut” results in a sleep() call. Certain settings of message sizes trigger

the generation of the “cuts” in message flows, which reverts the “smoothing”

effect of the buffering. This leads to falloffs that can be seen in PVM/polling and

MPICH/polling graphs in Fig. 2.24.

[Diagram: the process PONG sends messages through the TCP network to the process PING]

Figure 2.27: Smoothing effect of the TCP protocol (Nagle’s algorithm) [Ste94], [WS95]. A non-continuous message flow generated by the process PONG is received as a continuous message flow in the process PING.

The continuity of the message flow is expressed by burstiness. Burstiness

is the average size of the data transferred between two “cuts” (the size of the

data between two successive sleep() calls). The graph in Fig. 2.28 shows the

average burstiness over the 10 runs (the total transferred data size divided by

the measured number of sleep() calls) for the polling version of the benchmark.

Note the extreme similarity between the graphs in Fig. 2.28 and in Fig. 2.24!
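The burstiness measure defined above amounts to a single division. The following Python sketch uses a hypothetical function name, assumes 1 MByte = 2^20 bytes, and treats a run without any sleep() call as infinitely bursty, which is a convention chosen here and not taken from the text.

```python
import math

MBYTE = 2 ** 20  # assumption: 1 MByte = 2^20 bytes


def burstiness_mbyte(total_bytes, sleep_calls):
    """Average amount of data (MByte) transferred per sleep() call."""
    if sleep_calls == 0:
        return math.inf                # perfectly continuous flow, no "cuts"
    return total_bytes / MBYTE / sleep_calls
```

For instance, 2 * 100000 messages of 1024 bytes received with 100 sleep() calls (an illustrative count) give a burstiness of roughly 1.95 MByte per sleep() call.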

2 hpcLine nodes

We repeated all the experiments on 2 hpcLine nodes, by mapping the processes

PING and PONG onto different nodes. The only difference between this and

previous scenarios is that the network overhead is added to this scenario.

The graph in Fig. 2.29 shows the average measured throughput over the 10

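rounds. (Note that Fast Ethernet is physically limited to 100 MBit/second, i.e. 12.5 MByte/second, in each direction, or 25 MByte/second when messages in both directions are counted.)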

The average throughputs of PVM/polling and event-driven PVM/TPL are similar.

MPICH/polling is even better than MPICH/TPL—however, there is a falloff at a

message size of 128 kByte.

The graph in Fig. 2.30 depicts the deviation of the throughput. The deviation

of PVM/polling’s throughput is extremely high for messages up to 1 kByte. This

was expected. However, PVM/polling behaves very deterministically for large

messages. MPICH/polling is also deterministic, as on 1 node. We are not able

to explain this phenomenon. Without a detailed study of the buffering models of

PVM and MPICH and their synchronisation with TCP buffering [Ste94], [WS95]

a satisfactory explanation cannot be given. We can only point out that we did not change the “out of the box” distributions of PVM and MPICH which may not have been optimal for the TPL implementation. (TPL uses these distributions as the underlying communication libraries, see Fig. 2.18.)

[Plot: average burstiness (MByte per sleep() call) versus message length (Bytes, logarithmic axis from 1 to 1048576) on 1 hpcLine node; curves: PVM/polling, MPICH/polling]

Figure 2.28: Average burstiness (amount of transferred data per sleep() call) for the polling versions of the benchmark, 1 node hpcLine

A comparison of the graphs in Fig. 2.30 and Fig. 2.31 again reveals an ex-

tremely strong correlation between the standard deviation of throughput and

the standard deviation of the number of sleep() calls for PVM/polling and

MPICH/polling.

Similarly, the polling graphs in Fig. 2.32 (average burstiness) and Fig. 2.29

(average throughput) are almost identical. This means that the throughput of

the polling version strongly depends on the continuity of the data flow. Bursts in

the data flow cause the more frequent calling of sleep() inside the polling loop,

which is the source of the efficiency loss.

2.8.2 SYMMETRICAL THREADED PINGPONG

The difference between ONE-SIDED THREADED PINGPONG and SYMMET-

RICAL THREADED PINGPONG is that in the latter benchmark both the pro-

cesses PING and PONG use the same mechanism in order to receive messages.

SYMMETRICAL THREADED PINGPONG therefore corresponds to a generic

implementation of a non-trivial application in which all processes are acting as

clients and servers at the same time.

[Plot: throughput (MByte/sec) versus message length (Bytes, logarithmic axis from 1 to 1048576) on 2 hpcLine nodes; curves: PVM/TPL, MPICH/TPL, PVM/polling, MPICH/polling]

Figure 2.29: Average throughput, 2 nodes hpcLine

It is often believed that a high latency is the main drawback of the event-driven approach [HSS+98], [PS98]. This may be true for applications which do

not require thread-safety or asynchronous communication. All other applications

must use polling and suffer from a high latency caused by polling. The goal of the

measurements with the SYMMETRICAL THREADED PINGPONG benchmark

was to compare the latencies of the event-driven and the polling versions. The

source code of the programs which were used for this comparison is in Appendix B.

Average roundtrip times measured between 2 nodes of hpcLine are shown in

Fig. 2.33 as a function of the number of roundtrips. The roundtrips are initiated

by the process PING and they are independent of one another (a roundtrip begins without waiting for the completion of the previous one). The message size was set to 1 Byte in all experiments; each experiment was repeated ten times.

Latency is usually defined as a single roundtrip time divided by two. Hence,

latency corresponds to the first column in Fig. 2.33. Note that the latency of the

event-driven PVM/TPL and MPICH/TPL is very close to zero at this scale. Fig. 2.34

is the same graph, only 100 times magnified. The latencies of the polling versions

of this benchmark are 0.17 seconds for MPICH/polling and 0.3 seconds for PVM/polling, about 500 times higher than the latencies of the event-driven versions (ca. 0.0006 seconds)!³⁵ The polling latencies amortise with a growing number of roundtrips. However, a usual communication scenario in each process of a non-trivial application is “send a request; wait for the reply”, in which case there is only one roundtrip performed at any one time (from the point of view of the process which initiated the request). Even if there were an application which would make use of performing many roundtrips by one process in parallel without waiting for the replies, the process would eventually run out of memory. The reason is the missing flow control mechanism for outgoing messages in PVM and MPICH, see Section 2.7.8.

³⁵ For a comparison, a single roundtrip time of the raw TCP protocol (with no higher communication library involved) is ca. 0.00015 seconds.

[Plot: standard deviation of throughput (%) versus message length (Bytes, logarithmic axis from 1 to 1048576) on 2 hpcLine nodes; curves: PVM/TPL, MPICH/TPL, PVM/polling, MPICH/polling]

Figure 2.30: Standard deviation of throughput, 2 nodes hpcLine

Theoretically, the latency of the event-driven version should be constant. However, both the PVM/TPL and MPICH/TPL graphs in Fig. 2.34 contain two unexpected peaks. We are not able to explain these peaks. We only suspect that they are related to the residual amount of polling in the underlying quasi-thread-safe communication libraries, or to Nagle’s algorithm of the TCP protocol, or to the non-optimal thread-switching policy in Linux.

2.8.3 Summary of benchmarking results

Our theoretical expectations from Section 2.6.4 are fully justified by the experiments. The throughput of the polling implementations of the benchmarks is a function of the number of sleep() calls in the polling loop. The overhead of every sleep() call which adds to the parallel time is 0 to 0.02 seconds (or more).³⁶ Thus the total polling overhead in our benchmarking scenarios ranges between 0 and 2000 seconds (= 0.02 seconds * 100000) in each run. This randomness can clearly be observed in Fig. 2.25 and Fig. 2.26 for PVM/polling. The absolute times of the runs without this overhead (a simple single-threaded pingpong benchmark) are between ca. 5 seconds and 6000 seconds, depending (only) on the size of the messages used in the benchmark. It is also evident from the MPICH/polling graphs that the performance depends solely on the number of sleep() calls in the polling loop.

[Plot: standard deviation of the number of sleep() calls (%) versus message length (Bytes, logarithmic axis from 1 to 1048576) on 2 hpcLine nodes; curves: PVM/polling, MPICH/polling]

Figure 2.31: Standard deviation of the number of sleep() calls in the polling versions of the benchmark, 2 nodes hpcLine
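The range of the total polling overhead quoted in this summary is plain arithmetic, reproduced in the Python sketch below. The function name and the parameter polling_sides are hypothetical; the upper bound is only nominal, because a single sleep() call may cost more than 0.02 seconds.

```python
def polling_overhead_bounds(sleep_cost, messages, polling_sides=1):
    """Return (lower, nominal upper) bound of the accumulated sleep() overhead.

    sleep_cost: nominal cost of one sleep() call in seconds
    messages: number of messages that may each trigger one sleep() call
    polling_sides: 1 if only the receiver polls, 2 if both processes poll
    """
    upper = sleep_cost * messages * polling_sides
    return 0.0, upper
```

With a cost of 0.02 seconds per call and 100000 messages the nominal upper bound is 2000 seconds per run. With two polling sides and a single request it is 0.04 seconds.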

The overhead of the event-driven mechanism is constant in the benchmarks

because the mechanism is triggered by every message that arrives in the process

PING. It does not depend on the continuity of the message flow. This overhead

is slightly larger than the polling overhead for continuous message flows, and much

smaller than the polling overhead when the smoothing effect of TCP buffering

(Fig. 2.27) does not help.

³⁶ This is only a one-sided overhead (when only the PING process runs the polling loop). However, it is likely in a non-trivial application that all processes are symmetric in this respect, that is, the process sending a request also runs a polling loop when it is waiting for a reply. In such a case, the overhead of servicing one request ranges from 0 to 0.04 seconds (or more).

[Plot: average burstiness (MByte per sleep() call) versus message length (Bytes, logarithmic axis from 1 to 1048576) on 2 hpcLine nodes; curves: PVM/polling, MPICH/polling]

Figure 2.32: Average burstiness (amount of transferred data per sleep() call)

for the polling versions of the benchmark, 2 nodes hpcLine

The event-driven PVM/TPL clearly outperformed PVM/polling as regards sta-

bility. PVM/polling behaved very non-deterministically on 1 as well as on 2 nodes

of hpcLine.

MPICH/polling was surprisingly very deterministic. However, for certain

message sizes there is a falloff in MPICH/polling’s performance. Even though

MPICH/TPL was outperformed by MPICH/polling for small message sizes, it did

not exhibit any falloffs. Also, MPICH/TPL was handicapped in this benchmark be-

cause of the polling inside the MPICH library (see Section 2.6.3) which negatively

influences the performance of the event-driven mechanism.

We must stress that both benchmarks (one-sided as well as symmetrical) are the best possible representatives of non-trivial applications for the polling communication libraries, as they produce continuous message flows that minimise the number of expensive sleep() calls in the polling loop. An application which does not produce continuous message flows leads to the calling of sleep() after each received message in the polling loop, which drastically decreases performance. In such a case the latency of servicing a request will always be at least 0.02 seconds if polling is used (the stream of requests is likely to be non-continuous). The latency of 0.02 seconds is much higher than the average latency of the event-driven versions of the benchmarks. The latency of the event-driven mechanism does not depend on the rate with which requests are generated. Therefore, if the event-driven mechanism is not worse than polling on these benchmarks, it can only perform better in a real non-trivial application which produces many non-continuous message flows.

[Plot: average roundtrip time (sec) versus number of roundtrips (logarithmic axis from 1 to 262144) on 2 hpcLine nodes; curves: PVM/TPL, MPICH/TPL, PVM/polling, MPICH/polling]

Figure 2.33: Average roundtrip time, 2 nodes hpcLine

The event-driven mechanism outperformed polling in terms of message passing

latency by two orders of magnitude. The minimisation of latency is essential for

decreasing idle periods in processes of many non-trivial applications.

Remark. There is one additional factor which determines the latency of request

servicing of the event-driven implementation: the thread scheduling policy. The following thread scheduling policy is optimal (yielding minimum latency) for the

concept depicted in Fig. 2.20. This thread scheduling policy is almost identical

to the Transputer’s scheduling policy described in Section 2.2.2:

• The main thread cannot be preempted by any other thread.

• The remaining threads Thread1 ... ThreadN are scheduled using any policy, for example round-robin. It is not very important whether the scheduling of the threads Thread1 ... ThreadN is preemptive or not, but non-preemptive scheduling is a more rigorous choice because it avoids unnecessary context switching.

• The main thread has a higher priority than Thread1 ... ThreadN. This means that the threads Thread1 ... ThreadN are only scheduled when the main thread is blocked. The main thread is immediately scheduled when it unblocks, possibly preempting another running thread (if there is one).

[Plot: average roundtrip time (sec) versus number of roundtrips (logarithmic axis from 1 to 262144) on 2 hpcLine nodes, vertical axis from 0 to 0.003 sec; curves: PVM/TPL, MPICH/TPL, PVM/polling, MPICH/polling]

Figure 2.34: Average roundtrip time, 2 nodes hpcLine (a 100x magnification of the graph from Fig. 2.33)
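The policy listed above can be illustrated with a toy scheduler in Python. This is a simplified model under stated assumptions and not the scheduler of any real operating system: time is divided into slices, the caller states for each slice whether the main thread is runnable, and the function name schedule is hypothetical.

```python
from collections import deque


def schedule(main_runnable, workers):
    """Return which thread runs in each time slice.

    main_runnable: iterable of booleans, True when the main thread is not blocked
    workers: names of Thread1 ... ThreadN, served round-robin
    """
    ready = deque(workers)
    trace = []
    for runnable in main_runnable:
        if runnable:
            trace.append("main")       # main always wins when it is unblocked
        elif ready:
            worker = ready.popleft()   # round-robin among the remaining threads
            trace.append(worker)
            ready.append(worker)
        else:
            trace.append(None)         # nothing to run in this slice
    return trace
```

In this model the worker threads only ever fill the slices in which the main thread is blocked, so a message that unblocks the main thread is serviced in the very next slice.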

The POSIX standard defines thread scheduling policies which should be supported by operating systems: SCHED_FIFO (run to completion), SCHED_RR (round-robin), ... However, many operating systems (Solaris, Linux, Ultrix) restrict the choice of the scheduling policy to SCHED_RR. Other thread scheduling policies are

either not implemented or require super-user privileges. The reason for this is

that thread scheduling is mixed with process scheduling in the implementations

of the operating systems which are designed for shared-memory multiprocessors.

The round-robin thread scheduling policy is a source of increased message

servicing latency in TPL. Unfortunately, this is a problem that cannot be solved

without a change to the operating system’s scheduler. •
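Whether a given system exposes the POSIX policies, and whether an unprivileged user may select them, can be probed with Python's os module. Note the assumptions: the os.sched_* functions act on whole processes rather than on individual threads (in C, per-thread settings would go through pthread_setschedparam), they are only available on some Unix platforms, and the helper names below are hypothetical.

```python
import os


def describe_policies():
    """Map each available policy name to its (min, max) static priority."""
    info = {}
    if not hasattr(os, "sched_get_priority_min"):
        return info                    # scheduler interface not exposed here
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR"):
        policy = getattr(os, name, None)
        if policy is None:
            continue                   # policy constant missing on this platform
        try:
            info[name] = (os.sched_get_priority_min(policy),
                          os.sched_get_priority_max(policy))
        except OSError:
            continue                   # policy known but not supported
    return info


def try_set_fifo(priority=1):
    """Try to switch the calling process to SCHED_FIFO; False if not possible."""
    if not hasattr(os, "sched_setscheduler") or not hasattr(os, "SCHED_FIFO"):
        return False
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except OSError:                    # includes PermissionError without privileges
        return False
```

On a typical Linux system, calling try_set_fifo() as an ordinary user is expected to return False, which matches the privilege restriction described in the remark.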

Open questions remain:

• Under what circumstances does the network (more precisely the transport protocol, TCP in this case) produce a continuous stream of messages?

• What are the optimal settings of TCP or other transport protocols for the event-driven mechanism of TPL?

• Would the use of other existing transport protocols significantly influence the answer to the first question?

2.9 Conclusions

We showed that the existing message passing standards such as PVM, MPI and CORBA do not allow an efficient implementation of a large class of important

applications which we refer to as non-trivial. All irregular applications belong to

this class, including so-called grand-challenge problems. (Practically every larger

parallel application belongs to this class.) The source of inefficiency is polling.

Inefficiency is not the only drawback of polling. Further drawbacks include a

destruction of the natural structure of the program code, a high execution time

variation of the same program on the same input on the same system, limited

flow control etc.

We proposed a formal message passing framework which is compatible with

existing fundamental abstract models for parallel processing. This framework

defines the structure and behaviour of any system which implements message

passing but it is very flexible at the same time—it does not dictate whether the

physical system architecture is shared or distributed memory; it does not dictate

whether the programming language is functional, logical or imperative; it does

not exclude fault-tolerance; etc. A similar framework exists for database systems

and it is well accepted by all implementors of database systems.

To the best of our knowledge, this is the first formal framework which adheres

to the existing abstract message passing models and which also covers practical

issues of parallel processing. The reason why non-trivial applications cannot be

implemented efficiently in MPI and CORBA is that these standards do not fit

into our framework. This makes these standards incompatible with formal mes-

sage passing models such as Hoare’s CSP or Andrews’ channel model. Moreover,

MPI and CORBA do not define asynchronous message passing, even though they

claim that they do. Interestingly, PVM does fit into our framework—however,

its operation binding (the current implementation of PVM) does not cover asyn-

chronous communication.

TPL, our message passing library, is a straightforward materialisation of our

message passing framework. We implemented the operation binding for clus-

ters with distributed-memory nodes. However, the language primitives defined

in TPL are system-independent. This means that a program written using the

TPL library will run without a change in the source code on a shared-memory

architecture as well (only the operation binding must be added to the implemen-

tation of the library). The TPL library is thread-safe and portable on the POSIX


level which is supported by all contemporary platforms. The interface of TPL is

very small—it only consists of 11 functions (plus 24 functions for message assem-

bling and disassembling). However, it efficiently (without polling) implements

asynchronous communication, one-sided communication, message handlers, ac-

tive messages, flow control, support for heterogeneous systems etc. We are aware

of no communication library which would cover all these features without polling

or without a loss of portability.

The current implementation of TPL is based on quasi-thread-safe PVM 3.4 and MPICH 1.2.4-ch_p4. These quasi-thread-safe libraries are the original libraries extended with a novel interrupt mechanism. Quasi-thread-safety is already sufficient for the implementation of non-trivial applications without polling.

The choice of the underlying library is not very important—a direct use of a

socket library instead of PVM or MPICH would be even more efficient. However,

our intention was to demonstrate the flexibility of TPL. An application written

in TPL can be linked with the quasi-thread-safe PVM or the quasi-thread-safe

MPICH (or any lower level library which is thread-safe or at least quasi-thread-

safe) without a change in the application’s source code. We are aware of no other

communication library which can do this.

TPL outperforms the standard PVM and MPI implementations (PVM 3.4 and

MPICH 1.2.4-ch_p4) on a standard cluster platform by two orders of magnitude

on an irregular benchmark. This benchmark, a threaded pingpong, is derived

from our definition of a non-trivial application—in other words, the structure

of this benchmark directly corresponds e.g. to the structure of grand-challenge

problems. All implementations of this benchmark using the standard PVM 3.4

and MPICH 1.2.4-ch_p4 libraries are forced to use polling.

To illustrate the importance of the results above, the following section presents a short case study of the mechanisms which the MPI model provides for overlapping of communication and computation. We compare these mechanisms to the mechanisms of TPL.

2.9.1 Overlapping of Communication and Computation

The MPI model for overlapping of communication and computation is described

in the PhD thesis by R. P. Dimitrov, “Overlapping of Communication and Com-

putation and Early Binding: Fundamental Mechanisms for Improving Parallel

Performance on Clusters of Workstations” [Dim01] in Section 3.2.2 “Statement

of Model and Definition of Parameters”. One of the goals of the cited work

is the minimisation of the effective overhead of “asynchronous communication”

defined in the MPI standard by a proper delaying of completion synchronisa-

tion. (We claim that MPI does not define message passing, unless MPI’s “asyn-

chronous communication” is removed from the specification, therefore the quo-

tation marks.)

Dimitrov’s work is technically correct with respect to the MPI model. However,

2.9. CONCLUSIONS 103

some of the assumptions used throughout the work either do not hold or are

irrelevant in models which define message passing.

Figure 2.35: Overhead and transmission time in the MPI model. $t_{sp}$ denotes the moment at which the sender posts a send request. $t_1$ denotes the moment when the first byte of the message is placed on the network. $t_2$ denotes the moment when the last byte of the message is placed on the network. $t_{sc}$ denotes the moment when the application is notified about the completion of the send request. $t_{rp}$ denotes the moment when the application posts a receive request (which matches the send request). $t_3$ denotes the moment when the first byte of the message arrives. $t_4$ denotes the moment when the last byte of the message arrives. $t_{rc}$ denotes the moment when the receiver is notified about the completion of the receive request.

In Section 3.2.2, pp. 93–94, Dimitrov explains the idea behind the non-blocking receive, MPI_Irecv (the notation is explained in Fig. 2.35):

“Once the non-blocking receive request is posted, the message-

passing middleware typically does not perform any processing on this

request until a matching message arrives (i.e., until t3). According

to this scenario, $T_{rcv1}$ will simply be shifted in time, and, as a result,

the application can perform other computation or communication in

the period between the moment when the request is posted and t3.


Therefore, this shift of the moment of request posting will not result

in an effective overhead increase. In fact, this behavior is encouraged

by most MPI implementations because the first component of the re-

ceive overhead can be shifted to earlier parts of the parallel algorithm

which are not overhead sensitive (e.g., in the initialization phase of

the algorithm). . . . This will decrease the effective receive overhead

incurred by the user process and improve the opportunities for over-

lapping of communication and computation. These opportunities are

further enhanced by message-passing middleware that supports in-

dependent message progress. If such a service is available, the user

process may delay the completion synchronization (e.g., MPI_Wait)

until a moment after t4, which would enable this process to effectively

overlap the entire transition time with other activities.”

No implementation of MPI may violate the progress rule which is defined in the MPI standard. For instance, MPICH does not comply with the MPI standard in this respect (see Section 2.6.3). More generally, if the message passing middleware does not support independent message progress, then no MPI implementation exists which builds on such a middleware (see footnote 37). (Nevertheless, these remarks are purely academic because practically all contemporary message passing middlewares do support threading and therefore independent message progress. A similar middleware was already provided by INMOS Transputers.)

The kind of overlapping of communication and computation which is described in the quoted text only applies to regular (trivial) applications. The reason is that in a program with an unpredictable flow of control, the completion synchronisation (MPI_Wait()) cannot be inserted anywhere but at the program's end. This is not desirable because then the lower bound of the latency of the receive perceived by the program will be equal to the execution time of the whole program.

Even if the application is regular (that is, it consists of communication and computation phases which are strictly separated in the program), it does not know when the matching message actually arrives. Therefore it can only be guessed how much computation can be inserted between the posting of a non-blocking request (MPI_Irecv()) and the completion synchronisation (MPI_Wait()). This guess is system-dependent, which means that the application must be tuned after it has been ported to another system or when certain parameters of the system change. This tuning may sometimes require structural changes in the application, which is very undesirable.

Footnote 37: The interpretation of the progress rule has been a very hot topic in the MPI Forum for

many years. According to statements of MPI developers in public forums, the interpretation

of the progress rule is still unclear. In order to clarify this, we defined a formal framework

which adheres to well-accepted abstract message passing models such as the channel model by

Andrews and which clearly defines point-to-point communication.


TPL does not offer non-blocking receive for a simple reason—it is not needed.

If the application can proceed in the computation without receiving a message, it

does not post any receive. If it cannot proceed, it blocks and waits for that mes-

sage. It is possible that the message has already arrived at the time of posting a

blocking receive, in which case the waiting is not needed. We recall that TPL does

make use of the middleware which supports independent message progress and

buffering at receiver (see Section 2.7.8). Therefore, if the matching message has

already been sent to this process (if the send request has already been posted by

the sender), it has either been received or is being received by this process. TPL

provides an automatic overlapping of computation and communication. An appli-

cation written in TPL needs no tuning after it has been ported to another system.
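The behaviour described above can be illustrated with a small sketch. It is not TPL code: the `Mailbox` class and the `run_receiver` helper are hypothetical names, and a Python thread with a queue merely stands in for a middleware with independent message progress and buffering at the receiver. Under these assumptions, the application only ever issues a blocking receive, and the time it spends blocked shrinks to almost nothing whenever the message arrived during the preceding computation.

```python
import queue
import threading
import time


class Mailbox:
    """Hypothetical receiver-side buffer; a stand-in for the middleware."""

    def __init__(self):
        self._buffer = queue.Queue()

    def deliver(self, message):
        # Called by the background "network" thread when a message arrives.
        self._buffer.put(message)

    def recv(self):
        # Blocking receive: returns at once if the message is already buffered.
        return self._buffer.get()


def run_receiver(compute_time, send_delay):
    """Return the received message and the time spent blocked in recv()."""
    mailbox = Mailbox()

    def sender():
        time.sleep(send_delay)           # simulated transmission time
        mailbox.deliver("result")

    network = threading.Thread(target=sender)
    network.start()

    time.sleep(compute_time)             # computation which needs no message
    start = time.perf_counter()
    message = mailbox.recv()             # the only communication call
    blocked = time.perf_counter() - start
    network.join()
    return message, blocked


if __name__ == "__main__":
    print(run_receiver(compute_time=0.2, send_delay=0.05))
    print(run_receiver(compute_time=0.0, send_delay=0.05))
```

No completion synchronisation point has to be guessed in this sketch: the same program text overlaps as much communication as the timing of the run allows.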

The strategy which is described in the quotation can only outperform TPL in the case when the guess of the completion synchronisation point (the insertion of MPI_Wait() into the program) was correct. However, if this synchronisation point can be precisely guessed, then a blocking receive (MPI_Recv()) at the synchronisation point can be used instead of the MPI_Irecv() and MPI_Wait() pair without a significant loss of efficiency. This implies that TPL can only be outperformed in the case when the application exhibits absolutely predictable communication patterns and therefore does not need asynchronous communication at all. (This situation is very rare because even theoretically regular communication patterns are perturbed by external factors in real systems, e.g. by I/O operations, process scheduling etc.)

Dimitrov further explains the idea behind the non-blocking MPI_Isend and defines an objective of his study (p. 94):

“The send process can achieve effective overlapping of communica-

tion and computation, similarly to the receive process, by shifting the

synchronisation procedure of the send request to a later moment, and

scheduling computation activities immediately after the send request

is registered with the MPI library (i.e., after t1). This can effectively

hide the transmission time (assuming sufficient memory bandwidth)

and also move the notification overhead to a non-time-critical segment

of the algorithm. The actual benefit of overlapping depends on the

capabilities of the computer platform, on the network infrastructure,

and the communication software. A main objective of this study is to

reveal the factors that affect overlapping, how overlapping efficiency

can be improved, and how parallel algorithms can take advantage of

overlapping.”

No abstract message passing model (except for modern “message passing systems” such as MPI or CORBA) requires the synchronisation procedure of the send request (e.g. MPI_Wait()), see Section 2.5.5. This “feature” alone makes the

modern systems incompatible with fundamental abstract message passing mod-

els. This “feature” obviously does not mean an improvement of the fundamental


models as it forces polling in all irregular applications which use non-blocking

send. A further study of this semantics is therefore irrelevant.

TPL does not require any synchronisation procedure. It makes no sense to dis-

tinguish between blocking or non-blocking send in TPL because TPL’s buffering

at receiver guarantees that no send will block forever.

Chapter 3

Global illumination

The goal of rendering is to provide an observer who is watching a computer screen

with the same sensation as if the observer was watching a real 3D scene on the

screen. The image on the screen is computed from a 3D model. The 3D model

consists of the geometry of all 3D objects, the material properties of the 3D

objects and the properties of the light sources which illuminate the scene. The

model also contains a description of the virtual camera which takes a picture of

the scene. The picture of the 3D scene as seen by the camera appears on the

screen. In informal terms, the global illumination problem involves computing

this picture from the information stored in the 3D model.

The light distribution in the scene does not depend on the camera. Even if

the observer is missing, the light is distributed in the scene. The solution to

the global illumination problem can therefore be divided into two independent

phases:

1. Computation of the light distribution in the 3D scene

2. Measurement of the light distribution by the virtual camera

The first phase may simulate the laws of physics [FLS70] in order to compute

the distribution of light in the scene. This involves the simulation of the physics

of light—interactions of light with the objects and also with the media between

them. Note that if this phase is separated from the second phase, then the

computed illumination must also be stored. On the other hand, it can be assumed

during the computation of the first phase that the picture computed in the second

phase will be viewed by a human eye. We already know that the human eye is sensitive to radiance; a sufficient product of the first phase is therefore the knowledge of radiances for all surface points and all directions in the 3D scene.

The second phase deals mostly with the human perception of colour. The human eye is a device which measures the spectral energy distribution of impinging light. This energy distribution is a function of wavelength, defined over the visible range from ca. 400 nm to ca. 700 nm. The impinging light seen by


the virtual camera must be reproduced by a physical device (computer monitor,

glasses, paper) in order to create the impression of “looking at the 3D scene”.

The contemporary display devices are not able to display all energy distributions.

However, the human eye is also not able to distinguish between many different

energy distributions and therefore the simplified models used for the colour syn-

thesis in physical devices (tone mapping) are usually sufficient.

3.1 Physics of light

At a macro-level, light behaves as an electromagnetic wave. Light has all the

usual properties of waves, such as bending around obstacles and interference.

Unlike sound, light also propagates in the vacuum (sound is not an electromag-

netic wave). However, not all phenomena can be explained by the wave theory. A

simple example is the refraction of light. [Fey88] Newton was not able to explain

refraction, even though he attempted to model light with particles. It turned

out that some phenomena can be explained using the wave theory, whereas other

phenomena must be explained using the particle theory (wave-particle duality).

However, it remained unclear in which cases and why light sometimes behaves as a wave and sometimes as particles. It took several centuries

until the quantum electrodynamics theory was developed by Maxwell, Einstein,

Feynman and others. [FLS70] Quantum electrodynamics is to date the best ex-

isting theory which satisfactorily explains all light phenomena. At a micro-level,

a quantum of energy is transported by a particle called a photon. This is the

reasoning from [Fey88]:

We know that light is made of particles because we can take a

very sensitive instrument that makes clicks when light shines on it,

and if the light gets dimmer, the clicks remain just as loud—there are

just fewer of them. Thus light is something like raindrops—each little

lump of light is called a photon—and if the light is all one color, all

the “raindrops” are the same size.

A single photon is itself a “wave” with some frequency. The wavelength $\lambda$ of a single photon can be computed if its frequency $f$ is known:

$$\lambda = \frac{c}{f} \tag{3.1}$$

where $c$ is the speed of light, which is constant ($c \approx 2.998 \cdot 10^{8}$ m/s). The energy $E$ transported by a single photon with the frequency $f$ is equal to

$$E = h \cdot f \tag{3.2}$$

where $h$ is Planck's constant ($h \approx 6.63 \cdot 10^{-34}$ J s).
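As a quick numerical illustration of Equations 3.1 and 3.2, the following sketch evaluates the wavelength and the energy of a photon. The function names and the sample frequency are chosen here for illustration only; the constants are rounded.

```python
SPEED_OF_LIGHT = 2.998e8       # c in m/s (rounded)
PLANCK_CONSTANT = 6.626e-34    # h in J s (rounded)


def wavelength(frequency_hz):
    """Equation 3.1: wavelength in metres for a given frequency."""
    return SPEED_OF_LIGHT / frequency_hz


def photon_energy(frequency_hz):
    """Equation 3.2: energy in joules carried by one photon."""
    return PLANCK_CONSTANT * frequency_hz


if __name__ == "__main__":
    f = 5.0e14  # illustrative frequency inside the visible range
    print(wavelength(f) * 1e9, "nm")
    print(photon_energy(f), "J")
```

For the sample frequency the wavelength is close to 600 nm, which lies inside the visible range.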


Heisenberg's uncertainty principle states that it is impossible to precisely measure both the photon's location and momentum (hence, its energy) at the same time. The photon's momentum $p$ (note that photons have no mass) is defined as

$$p = \frac{E}{c} \tag{3.3}$$

More precisely, Heisenberg's principle states that if one makes a large number of “identical” measurements of the photon's location and momentum under the same experimental conditions, then the measurements will show surprising differences. If $\Delta x$ denotes the standard deviation of the location and $\Delta p$ denotes the standard deviation of the momentum measured over the set of “identical” experiments, then the following tradeoff holds:

$$\Delta x \, \Delta p \ge \frac{h}{4\pi} \tag{3.4}$$
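Equations 3.3 and 3.4 can be evaluated in the same way. The sketch below is only an illustration with hypothetical function names and rounded constants: it computes the momentum of a photon from its energy and the smallest location spread which Equation 3.4 permits for a given momentum spread.

```python
import math

SPEED_OF_LIGHT = 2.998e8       # c in m/s (rounded)
PLANCK_CONSTANT = 6.626e-34    # h in J s (rounded)


def photon_momentum(energy_joule):
    """Equation 3.3: momentum of a photon with the given energy."""
    return energy_joule / SPEED_OF_LIGHT


def min_location_spread(momentum_spread):
    """Equation 3.4 solved for the smallest admissible delta x."""
    return PLANCK_CONSTANT / (4.0 * math.pi * momentum_spread)


if __name__ == "__main__":
    p = photon_momentum(3.313e-19)      # energy of a photon at 5e14 Hz
    print(p, "kg m/s")
    print(min_location_spread(0.1 * p), "m")
```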

The consequence of Heisenberg’s principle is that photons do not (only) travel

along straight lines. A photon which moves between two points, $A$ and $B$ (and which is heading towards $B$), can theoretically choose any path from $A$ to $B$ (even if there are no obstacles between the two points). The length of the path is only limited by the speed of light $c$ if the photon is supposed to get to $B$ within a limited period of time. However, the probabilistic distribution of the paths is not uniform. A vast majority of the photons will follow a path which is close to the straight line connecting $A$ and $B$. This fact is widely used in computer

graphics which assumes that photons only travel along straight lines unless they

interact with an obstacle. The participating medium (such as air or water) is

often ignored in computer graphics—in other words, it is assumed that the space

between object surfaces is filled with a vacuum.

Photons do not interact with each other. However, they interact with object

surfaces. When a photon hits an obstacle (an object surface), one of the following

events happens:

• The photon is absorbed.

• The photon is reflected. More precisely, the photon is absorbed and a new photon is emitted from the point of incidence into the half-space above the surface (pointed to by the surface normal). The new photon carries less energy than the absorbed photon.

• The photon is refracted. More precisely, the photon is absorbed and a new photon is emitted from the point of incidence into the half-space below the surface.

Reflection and refraction differ in the direction of the reemitted photon. The

term scattering covers both reflection and refraction—a photon is scattered if it

is either reflected or refracted.


Most transparent materials cause a partial reflection and a partial refraction (and a partial absorption) of photons. For instance, if a narrow beam of photons

(a ray) is shot at a glass surface under a certain angle, then one part of the pho-

tons will be reflected, another part of the photons will be refracted and some part

of the photons will be absorbed. The same experiment can also be continually

repeated using individual photons with the same statistics. Only quantum elec-

trodynamics can explain how an individual photon “makes up its mind” whether

to go through the glass or whether to reflect off it (and in which direction). The

photon randomly “chooses” one of the three events. The probabilities of the

events generally depend on the surface material, on the photon’s incoming angle

and on the photon’s frequency.
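The random choice between the three events can be sketched as follows. The probabilities used here are made-up numbers for a hypothetical glass-like surface; in a real simulation they would depend on the material, on the incoming angle and on the frequency, as stated above. The function names are illustrative.

```python
import random


def scatter_event(p_absorb, p_reflect, p_refract, rng):
    """Pick one of the three events according to the given probabilities."""
    if abs(p_absorb + p_reflect + p_refract - 1.0) > 1e-9:
        raise ValueError("the three probabilities must sum to 1")
    u = rng.random()                      # uniform sample in [0, 1)
    if u < p_absorb:
        return "absorbed"
    if u < p_absorb + p_reflect:
        return "reflected"
    return "refracted"


def simulate(photon_count, probabilities, seed=0):
    """Shoot photon_count photons at the surface and count the outcomes."""
    rng = random.Random(seed)
    counts = {"absorbed": 0, "reflected": 0, "refracted": 0}
    for _ in range(photon_count):
        counts[scatter_event(*probabilities, rng)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical probabilities: 10% absorbed, 30% reflected, 60% refracted.
    print(simulate(100_000, (0.1, 0.3, 0.6)))
```

With many photons the observed fractions approach the probabilities, which mirrors the statement that individual photons reproduce the statistics of the beam.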

A simulation of a large number of photons and their interactions with surfaces

and media on a micro-level is very expensive. Computer simulations of lighting

bundle many photons into a beam. This bundling is already a simplification

of real-world physics but allows for the design of algorithms which are more

appropriate for lighting simulation in larger environments in a reasonable amount

of time. In the following text we will define several notions which are useful in

the simulations.

Remark. Whereas a single photon has a single frequency, a beam can contain

photons with many different frequencies. Therefore it makes sense to refer to

the light spectrum of a beam, which is an energy histogram over an interval of

(visible) frequencies. •

Definition. The solid angle subtended by a 3D surface, viewed from a point $x$, is the area of the projection of the surface onto the unit sphere centered at $x$. •

The solid angle is measured in steradians (sr) and ranges from 0 to $4\pi$ sr. The solid angle is a 3D analogy of the angle in 2D subtended by a curve from a point $x$ (which is the length of the arc of the projection of the curve onto the unit circle centered at this point).

The differential solid angle $d\omega$ subtended by a differential surface with area $dA$ and viewed from a point at the distance $r$ is equal to

$$d\omega = \frac{dA \cos\theta}{r^2} \tag{3.5}$$

where $\theta$ is the angle between the surface normal and the direction from the point to the surface.

A direction in 3D can be expressed using two angles $(\theta, \phi)$. A differential solid angle $d\omega$ around a direction $(\theta, \phi)$ is equal to

$$d\omega = \sin\theta \, d\theta \, d\phi \tag{3.6}$$
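Both formulas are easy to check numerically. The sketch below (function names are illustrative) evaluates Equation 3.5 for a small patch and integrates Equation 3.6 with a midpoint rule; integrating over all directions should give the full sphere, $4\pi$ sr.

```python
import math


def patch_solid_angle(area, distance, theta):
    """Equation 3.5 for a small patch seen from the given distance."""
    return area * math.cos(theta) / distance ** 2


def cone_solid_angle(theta_max, steps=1000):
    """Integrate Equation 3.6 for theta in [0, theta_max], phi in [0, 2*pi]."""
    d_theta = theta_max / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta        # midpoint of the i-th slice
        total += math.sin(theta) * d_theta
    return 2.0 * math.pi * total           # the phi integral is 2*pi


if __name__ == "__main__":
    print(patch_solid_angle(1.0e-4, 2.0, 0.0))   # 1 cm^2 seen from 2 m
    print(cone_solid_angle(math.pi))             # whole sphere
    print(cone_solid_angle(math.pi / 2.0))       # hemisphere
```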


Definition. The light flux Φ is the amount of energy which passes through a

boundary per unit time over a given range of spectrum. •

The spectrum range in the flux definition is usually an interval of wavelengths

[λmin, λmax]. As it is more convenient to talk about energy radiated around a

direction rather than through a boundary, another measure is defined:

Definition. Radiance (intensity) $L$ is the amount of energy which travels at a

given point in a given direction, per unit time, per unit area perpendicular to the

direction of travel, per unit solid angle (over a given spectrum range). •

From its definition, radiance is the flux leaving a differential area around a given point, which leaves the point in a differential solid angle around a given direction:

$$L(x, \omega) = \frac{d\Phi(x, dA, \omega, d\omega)}{dA \, d\omega \cos\theta} \tag{3.7}$$

where $x$ is the given point, $dA$ is the differential area around $x$, $\omega$ is the given direction, $d\omega$ is a differential solid angle around the given direction, and $\theta$ is the angle between the surface normal at the given point and the given direction.

Definition. We denote $L_i(x, \omega')$ the incoming radiance which impinges at the point $x$ from the direction $\omega'$ (see footnote 1). •

Remark. The physical measures such as radiance, incoming radiance, . . . are defined for a wavelength or a range of wavelengths. Equation 3.7 should therefore be written as

$$L^{\Lambda}(x, \omega) = \frac{d\Phi^{\Lambda}(x, dA, \omega, d\omega)}{dA \, d\omega \cos\theta} \tag{3.8}$$

where $\Lambda$ denotes a range of wavelengths. We shall omit the superscript $\Lambda$ for the remainder of the text.

The usual practice in computer graphics is to write similar equations for three

representative wavelengths (such as $R$, $G$ and $B$, see Section 3.2.1) and to work

with the three equations independently of each other. This means that some

phenomena such as fluorescence (which occurs when a photon hits a surface and a

new photon is reemitted at a different wavelength) cannot be correctly simulated.

•

Footnote 1: In order to simplify the notation, we refer to differential solid angles as directions.


Definition. The Ray-Trace function $RT(x, \omega)$ is a function which returns the surface point nearest to $x$ in the direction $\omega$. (If there is no surface point in the direction $\omega$ from the point $x$, $RT(x, \omega)$ returns an arbitrary point along the direction $\omega$ from the point $x$.) •
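A minimal sketch of such a function is given below, assuming a scene which consists only of spheres given as (center, radius) pairs and a direction of unit length. The name `ray_trace`, the scene representation and the `far` distance used for rays which hit nothing are assumptions made for this illustration.

```python
import math


def ray_trace(x, omega, spheres, far=1.0e6, eps=1.0e-9):
    """Return the nearest surface point to x in the direction omega.

    spheres is a list of (center, radius) pairs and omega is a unit vector.
    If no surface is hit, the point at the arbitrary distance `far` along
    the direction is returned, as the definition allows.
    """
    nearest = None
    for center, radius in spheres:
        oc = [xi - ci for xi, ci in zip(x, center)]
        # Solve |x + t*omega - center|^2 = radius^2 for t (with |omega| = 1).
        b = 2.0 * sum(d * o for d, o in zip(omega, oc))
        c = sum(o * o for o in oc) - radius * radius
        disc = b * b - 4.0 * c
        if disc < 0.0:
            continue                        # the line misses this sphere
        root = math.sqrt(disc)
        for t in ((-b - root) / 2.0, (-b + root) / 2.0):
            if t > eps and (nearest is None or t < nearest):
                nearest = t                 # keep the closest hit in front of x
    t = far if nearest is None else nearest
    return tuple(xi + t * di for xi, di in zip(x, omega))


if __name__ == "__main__":
    scene = [((0.0, 0.0, 5.0), 1.0), ((0.0, 0.0, 10.0), 1.0)]
    print(ray_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene))
```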

Radiance has a reciprocal property. For any two mutually visible points $x$ and $y$, the radiance leaving the point $x$ in the direction of $y$ is the same as the incoming radiance at the point $y$ from the direction of the point $x$. This property can be directly proven from the above definitions: Let us denote $dA$ the differential surface around the point $x$ and let us denote $dA'$ the differential surface around the point $y$. Let us denote $\omega$ the differential solid angle around the direction from $x$ to $y$ and let us denote $\omega'$ the differential solid angle around the direction from $y$ to $x$. Let us denote $\theta$ the angle between the surface normal at the point $x$ and the direction $\omega$. Similarly, let us denote $\theta'$ the angle between the surface normal at the point $y$ and the direction $\omega'$. Equation 3.7 can then be rearranged as

$$d\Phi(x, dA, \omega, d\omega) = L(x, \omega) \, dA \, d\omega \cos\theta \tag{3.9}$$

The substitution of $d\omega$ into Equation 3.9 using Equation 3.5 yields

$$d\Phi(x, dA, \omega, d\omega) = L(x, \omega) \, \frac{dA \cos\theta \; dA' \cos\theta'}{r^2} \tag{3.10}$$

Equation 3.10 is called the fundamental law of photometry.
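Equation 3.10 can be evaluated directly for two small patches. The sketch below uses a hypothetical function name and made-up patch sizes; it shows that the flux falls with the square of the distance and that the geometric term is symmetric in the two patches.

```python
def differential_flux(radiance, area, cos_theta,
                      area_prime, cos_theta_prime, distance):
    """Equation 3.10: flux exchanged between two small patches."""
    geometry = (area * cos_theta * area_prime * cos_theta_prime
                / distance ** 2)
    return radiance * geometry


if __name__ == "__main__":
    near = differential_flux(10.0, 1.0e-4, 1.0, 2.0e-4, 0.5, 2.0)
    far = differential_flux(10.0, 1.0e-4, 1.0, 2.0e-4, 0.5, 4.0)
    print(near, far, near / far)     # doubling the distance quarters the flux
```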

The flux which passes from $x$ to $y$ through any area which is a cross section of the solid angle between $x$ and $y$ and a plane perpendicular to the direction from $x$ to $y$ is the same as the flux which arrives at the point $y$ from the direction of the point $x$ (as both fluxes pass through the same boundary):

$$d\Phi(x, dA, \omega, d\omega) = d\Phi(y, dA', \omega', d\omega') \tag{3.11}$$

As the direction of the flux which passes between the differential areas around the points $x$ and $y$ is not important, an equation similar to Equation 3.10 can be derived for the reciprocal flux:

$$d\Phi(y, dA', \omega', d\omega') = L_i(y, \omega') \, \frac{dA \cos\theta \; dA' \cos\theta'}{r^2} \tag{3.12}$$

From Equations 3.10, 3.11 and 3.12 it follows

$$L(x, \omega) = L_i(y, \omega') = L(RT(x, -\omega'), \omega') \tag{3.13}$$

where $-\omega'$ is the direction opposite to $\omega'$ (in this case $-\omega' = \omega$). Consequently, radiance $L(x, \omega)$ does not depend on the distance between the points $x$ and $y$. This is the reason why the human eye and photographic cameras (which are


sensitive to radiance) perceive the same colour at all viewing distances when

they observe a point from the same angle.

Definition. Radiosity $B$ is the total energy which leaves a differential area around a given point, per unit area, per unit time (over a given range of spectrum). •

Hence,

$$B(x) = \int_{\Omega} L(x, \omega) \cos\theta \, d\omega \tag{3.14}$$

where $x$ is the given point, $\Omega$ is the space of all directions leaving $x$, and $\theta$ is the angle between a direction $\omega$ and the surface normal at $x$.
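Equation 3.14 can be approximated numerically by summing over the hemisphere above the surface, using $d\omega = \sin\theta \, d\theta \, d\phi$ from Equation 3.6. The sketch below is an illustration with a hypothetical function name; it assumes that the outgoing radiance is supplied as a function of $(\theta, \phi)$. For a radiance which is the same in all directions, the result should be $\pi L$.

```python
import math


def radiosity(radiance, n_theta=200, n_phi=400):
    """Midpoint-rule approximation of Equation 3.14 over the hemisphere."""
    d_theta = (math.pi / 2.0) / n_theta
    d_phi = (2.0 * math.pi) / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        weight = math.cos(theta) * math.sin(theta) * d_theta * d_phi
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            total += radiance(theta, phi) * weight
    return total


if __name__ == "__main__":
    constant = lambda theta, phi: 2.0      # same radiance in all directions
    print(radiosity(constant), 2.0 * math.pi)
```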

3.2 3D modeling

The modeling of real 3D scenes usually makes further simplifying assumptions.

Further approximations are required by the algorithms which compute the illu-

mination in the scenes. Some of these approximations are only necessary for the

computation of a reasonable illumination in a reasonable amount of time (some

applications require real-time) on the current hardware. As the computing power

grows, it is important to avoid using the approximations which are “hard-wired”

in an algorithm and which cannot be eliminated later unless the algorithm is

changed.

3.2.1 Modeling of colour spectrum

Spectrum is an energy histogram over a wavelength interval. A discrete rep-

resentation of real functions usually involves a sampling of the interval. The

sampling used in computer graphics applies the fact that the human eye is an

imperfect spectrometer. The whole visible spectrum can be represented using

three numbers, $R$, $G$ and $B$ (red, green, blue). [FvDFH90] Based on physiological experiments, three basis functions (colour matching functions), $r(\lambda)$, $g(\lambda)$ and

b(λ) are defined on the entire visible range of wavelengths. Any spectral function

C(λ) is approximated as a linear combination of these basis functions:

C(λ) = R r(λ) + G g(λ) + B b(λ) (3.15)

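When given a spectrum $L(\lambda)$, the coefficients $R$, $G$ and $B$ can be computed as

$$R = \int_{\Lambda} L(\lambda) \, r(\lambda) \, d\lambda, \quad G = \int_{\Lambda} L(\lambda) \, g(\lambda) \, d\lambda, \quad B = \int_{\Lambda} L(\lambda) \, b(\lambda) \, d\lambda \tag{3.16}$$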


where Λ is the range of visible wavelengths.
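The integrals in Equation 3.16 can be approximated by a sum over wavelength samples. In the sketch below the three basis functions are hypothetical bell curves, chosen only to make the example runnable; they are not the measured colour matching functions. The visible range is taken as 400 nm to 700 nm.

```python
import math


def gaussian(center_nm, width_nm):
    """Return a bell-shaped function of the wavelength (in nm)."""
    def curve(wavelength_nm):
        return math.exp(-0.5 * ((wavelength_nm - center_nm) / width_nm) ** 2)
    return curve


# Hypothetical basis functions; NOT the measured colour matching functions.
r_basis = gaussian(600.0, 40.0)
g_basis = gaussian(550.0, 40.0)
b_basis = gaussian(450.0, 40.0)


def integrate(function, lower=400.0, upper=700.0, steps=3000):
    """Midpoint rule over the visible range of wavelengths."""
    width = (upper - lower) / steps
    return sum(function(lower + (i + 0.5) * width) for i in range(steps)) * width


def spectrum_to_rgb(spectrum):
    """Equation 3.16: project a spectrum onto the three basis functions."""
    return tuple(
        integrate(lambda lam, basis=basis: spectrum(lam) * basis(lam))
        for basis in (r_basis, g_basis, b_basis)
    )


if __name__ == "__main__":
    reddish = gaussian(600.0, 10.0)        # narrow spectrum around 600 nm
    print(spectrum_to_rgb(reddish))
```

A narrow spectrum around 600 nm yields the largest coefficient for the basis function centred there, as expected.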

Furthermore, the human eye cannot distinguish between colours which have

only slightly different RGB coordinates. This allows a relatively coarse sampling

of the RGB coordinates. The values $R$, $G$, $B$ are usually represented as integers from 0 to 255 (1 byte each).

Remark. There are other colour models such as CIE XYZ, CMYK, HSV etc.

[FvDFH90] However, a conversion exists between any two of these. •

The use of any of the above colour models in a lighting simulation algorithm

introduces the assumption that light is monochromatic. The polarisation of light

is also ignored.

3.2.2 Modeling of surface geometry

The surfaces in a 3D scene are usually divided into disjoint parts, called objects.

This division can be hierarchical—an object can consist of smaller objects etc.

The objects at the bottom of this hierarchy are surfaces which are of the same

material.

The surface geometry describes the shape of a 3D surface. There are two

principal approaches to the description of surface geometry:

1. Polygonal representation (triangle mesh).

2. Constructive solid geometry (CSG).

Polygonal representation

In the polygonal representation, a surface is modeled as a union of polygons, usually triangles. The majority of the triangles border on three other triangles. A triangle mesh is a data structure which uses this fact in order to save the space which is required for storing the triangles:

$$V = \{v_1, \ldots, v_{maxv}\}, \quad v_i \in \mathbb{R}^3 \tag{3.17}$$

is a set (array) of vertices,

$$T = \{t_1, \ldots, t_{maxt}\}, \quad t_i = \langle v_a, v_b, v_c \rangle_i, \quad a, b, c \in \{1, \ldots, maxv\} \tag{3.18}$$

is a set (array) of triangles. The triangle “vertices” in the set $T$ are indices to

the vertex set V. A triangle mesh can be extended for the storing of additional

information. In particular, surface normals are often stored in the vertices in


order to smooth the discontinuities of the normals on the edges of neighbouring

triangles (see footnote 2). This extension only requires storing an array

$$N = \{n_1, \ldots, n_{maxn}\}, \quad n_i \in \mathbb{R}^3 \tag{3.19}$$

where $n_i$ is a surface normal in the vertex $v_i$. The surface normal at a point

inside a triangle can be linearly interpolated using the surface normals stored in

the triangle vertices. [Pho75], [Bli78] It is generally agreed that a surface normal

always points to the half-space which is outside the surface.
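The arrays $V$, $T$ and $N$ and the linear interpolation of normals can be sketched as follows. The sample data and the function names are made up for this illustration, the indices are 0-based (unlike the 1-based indices in Equation 3.18), and the position inside the triangle is assumed to be given by barycentric weights which sum to 1.

```python
import math

# V: vertex positions, T: triangles as index triples into V,
# N: one normal per vertex (all values are illustrative sample data).
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
triangles = [(0, 1, 2)]
normals = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]


def normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in vector))
    return tuple(c / length for c in vector)


def interpolated_normal(triangle, weights, vertex_normals):
    """Blend the three vertex normals with barycentric weights."""
    blended = [0.0, 0.0, 0.0]
    for index, weight in zip(triangle, weights):
        for axis in range(3):
            blended[axis] += weight * vertex_normals[index][axis]
    return normalize(blended)


if __name__ == "__main__":
    centre = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
    print(interpolated_normal(triangles[0], centre, normals))
```

Sharing vertices through indices is what saves space: a vertex used by several triangles is stored once in `vertices`.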

Remark. Some 3D formats or programs which export surface geometry to a 3D format do not allow for the storing of the surface normal information. A common problem then is to distinguish which side of a triangle faces the inside and which side faces the outside of the surface; in other words, the orientation of the surface normals is unclear. A common solution to this problem involves using the order of the vertices in $T$ to implicitly determine the direction of the normal for a given triangle $t_i$: unless the normals are explicitly provided in the array $N$, the direction of the surface normal corresponds to the direction of the vector product of two edge vectors of the triangle, taken in the order of the vertices indexed by $t_i$.

Note that a triangle is independently illuminated from its front and back

sides. If the mesh structure is extended so that it stores the illumination of the

triangles in the triangle vertices, then this information must be stored separately

for the front and back sides of the triangle. However, a common practice is not

to store the illumination for the inner surfaces of objects such as balls as it is

not assumed that the inside of a ball may be of any importance. While this is

true, the simplified modeling sometimes leads to unexpected problems during the

computation of the illumination and during the visualisation of the illuminated

scene. •
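The implicit orientation from the vertex order can be sketched as follows. The convention used here, the vector product of the edges from the first vertex to the second and from the first to the third, is one common choice and an assumption of this illustration; the function names are hypothetical.

```python
def subtract(p, q):
    """Component-wise difference p - q of two 3D points."""
    return tuple(pi - qi for pi, qi in zip(p, q))


def cross(u, v):
    """Vector product of two 3D vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def face_normal(vertices, triangle):
    """Unnormalised normal implied by the order of the triangle's vertices."""
    a, b, c = (vertices[index] for index in triangle)
    return cross(subtract(b, a), subtract(c, a))


if __name__ == "__main__":
    points = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
    print(face_normal(points, (0, 1, 2)))    # one orientation
    print(face_normal(points, (0, 2, 1)))    # reversed order flips the normal
```

Reversing the order of two vertices flips the normal, which is why exporters and importers have to agree on the vertex order.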

One advantage of this representation is that the surface can be parameterised,

which is useful for texture mapping (the so-called uv-mapping requires a parame-

terisation of the surface). [FvDFH90] Another advantage is that this representa-

tion is supported by the hardware of the contemporary graphics cards, 3D scan-

ners and other physical devices. A practically arbitrary surface representation

can be converted into a triangle mesh—such a conversion is called a tessella-

tion. The polygonal representation is also supported in practically all existing

3D formats.

The most serious disadvantage of the polygonal representation is that triangle

meshes are only approximations of curved surfaces. The more triangles are stored

Footnote 2: This smoothing is desirable for the modeling of curved surfaces such as spheres. However,

the smoothing must be avoided for box-like surfaces which actually contain sharp edges, such

as a table or a cigarette box. A common workaround involves assigning a so-called crease-angle

to an object (or to the entire scene). Normals of neighbouring triangles are then only smoothed

if the angle between the natural normals of the triangles does not exceed the given crease-angle.


in a mesh, the better the approximation—however, the resolution of the mesh

must be fixed when the scene is being stored in a file at the latest. The chosen

resolution may not be sufficient later (for instance, when the surface is viewed

from a small distance, the discontinuities which were negligible before may

become visible). However, the resolution cannot be increased further once it has

been fixed, see Fig. 3.1.

Figure 3.1: An example of a triangle mesh. Note the discontinuities on the top

and on the bottom of the cone

Constructive solid geometry

Constructive solid geometry (CSG) is a modeling methodology which allows us to

combine basic 3D surfaces (geometric primitives) in order to create more complex

ones using boolean set operations. The following binary operations are used to

combine two geometric primitives: union, intersection and difference. Unary

operations which can be applied to any surface are inverse (which is usually used

together with intersection in order to avoid the definition of infinite surfaces) and

transformation (rotation, scaling and translation or any combination of these).

A CSG object can be stored as a tree. The leaves of the tree store the

geometric primitives (e.g. spheres, cones, boxes, . . . ), other nodes store the

operations. An example of a CSG tree is depicted in Fig. 3.2.

It is very important to note that an algorithm which computes all intersections

of a line with a CSG object exists (provided that the line-object intersections can

be computed for all geometric primitives used in the CSG tree which defines the

object). The surface normal of a CSG object can also be computed at any surface

point (if it can be computed for all the geometric primitives). [GN71], [Jan86]

The two methods which compute the intersections with a line and the surface

normals are known for many object primitives. These object primitives include

planes, quadrics, blobs, Bézier surfaces, sweep surfaces, polygons, height fields

etc. [Gla89]
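One common way to organise the line-CSG intersection is to let every primitive return the parameter intervals in which the line runs inside it, and to combine these intervals in the inner nodes of the tree. The sketch below illustrates this idea; it is not the algorithm of [GN71] or [Jan86] verbatim. The tuple-based tree encoding, the function names and the restriction to spheres are our assumptions, and the unary operations (inverse, transformation) are omitted.

```python
import math

def sphere_spans(origin, direction, center, radius):
    """Return [(t_in, t_out)]: where the line origin + t*direction is inside the sphere."""
    oc = [o - c for o, c in zip(origin, center)]
    a = sum(d * d for d in direction)
    b = 2.0 * sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4.0 * a * c
    if disc <= 0.0:
        return []  # the line misses (or only touches) the sphere
    root = math.sqrt(disc)
    return [((-b - root) / (2.0 * a), (-b + root) / (2.0 * a))]

def combine(spans_a, spans_b, keep):
    """Combine two span lists; keep(in_a, in_b) decides membership of each piece."""
    events = sorted({t for span in spans_a + spans_b for t in span})
    result = []
    for t0, t1 in zip(events, events[1:]):
        mid = 0.5 * (t0 + t1)
        in_a = any(lo < mid < hi for lo, hi in spans_a)
        in_b = any(lo < mid < hi for lo, hi in spans_b)
        if keep(in_a, in_b):
            if result and result[-1][1] == t0:
                result[-1] = (result[-1][0], t1)  # glue adjacent pieces
            else:
                result.append((t0, t1))
    return result

OPS = {
    "union": lambda a, b: a or b,
    "intersection": lambda a, b: a and b,
    "difference": lambda a, b: a and not b,
}

def csg_spans(node, origin, direction):
    """Leaves are ("sphere", center, radius); inner nodes are (operation, left, right)."""
    if node[0] == "sphere":
        _, center, radius = node
        return sphere_spans(origin, direction, center, radius)
    op, left, right = node
    return combine(csg_spans(left, origin, direction),
                   csg_spans(right, origin, direction), OPS[op])
```

The end points of the returned intervals are the intersections of the line with the CSG object. Any other primitive can be added as long as it provides its own span function, which mirrors the condition stated above.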

3.2. 3D MODELING 117

Figure 3.2: An example of a CSG tree. The object shown in the root node of the

tree is a result of the union and difference operations. The unary transformation

operations are not depicted in the figure (a transformation is applied to each node

of the tree)

Remark. The problem with surface normals as mentioned for the polygonal

representation must also be solved for the CSG representation. Hence, the com-

putation of normals for the CSG primitives must include the computation of the

orientations of the normal vectors using an agreed method. •

3.2.3 Modeling of surface materials

Surface material describes the scattering properties of the surface. Light scatter-

ing depends on the microstructure of the surface which is usually not included in

the model of the surface geometry. The scattering properties are described using

a so-called material, which is assigned to the surface geometry.

Generally, a ray of light which hits a surface enters the surface and then

leaves the surface from a different location. This so-called sub-surface scattering

is usually ignored and replaced by a model which assumes that the incident ray

of light leaves the surface at the point of incidence:

Definition. The bidirectional scattering distribution function (BSDF) is defined
as the ratio of the scattered radiance and incoming radiance:

$$\mathrm{BSDF}(x, \omega', \omega) = \frac{dL_s(x, \omega)}{dE_i(x)} = \frac{dL_s(x, \omega)}{L_i(x, \omega') \cos\theta' \, d\omega'} \tag{3.20}$$


where x is the point of incidence, ω is a differential solid angle around the
outgoing direction, ω′ is a differential solid angle around the incoming direction and
θ′ is the angle between the surface normal and the incoming direction. The
subscript of the outgoing radiance L_s underlines the fact that L_s is only the part
of the outgoing radiance due to the scattering of the incoming light (the surface
at the point x can also emit light, in which case the outgoing radiance for a given
direction is the sum of the emitted and scattered radiances). •

Remark. Note that L_s(x, ω) depends on the incoming radiances L_i(x, ω′) from
all directions ω′. The derivative dL_s(x, ω)/dω′ fixes the incoming direction to one
particular direction of interest ω′. •

Remark. BSDF covers both the reflection and refraction of light. BSDF is
defined for all incoming directions ω′ and outgoing directions ω around the point x. •

The following equation, called the scattering equation, describes the local
illumination model. [CW93] If the incoming radiance L_i(x, ω′) is known for all
incoming directions ω′, then the scattered radiance in the direction of interest ω
can be computed as (this follows from Equation 3.20)

$$L_s(x, \omega) = \int_{\Omega} \mathrm{BSDF}(x, \omega', \omega) \, L_i(x, \omega') \cos\theta' \, d\omega' \tag{3.21}$$
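The scattering integral can be estimated numerically by sampling incoming directions. The following is a minimal Monte Carlo sketch, not a method taken from the text. We assume that only the hemisphere of reflection is sampled, that directions are drawn uniformly in solid angle, and that the callables `bsdf` and `incoming_radiance` (hypothetical names) take cos θ′ and the azimuth φ of the incoming direction for a fixed x and ω.

```python
import math
import random

def estimate_scattered_radiance(bsdf, incoming_radiance, samples=100_000, seed=1):
    """Monte Carlo estimate of the scattering integral over the reflection hemisphere."""
    rng = random.Random(seed)
    pdf = 1.0 / (2.0 * math.pi)  # uniform density over the hemisphere
    total = 0.0
    for _ in range(samples):
        # Uniform sampling in solid angle: cos(theta') is uniform in [0, 1).
        cos_theta = rng.random()
        phi = 2.0 * math.pi * rng.random()
        integrand = bsdf(cos_theta, phi) * incoming_radiance(cos_theta, phi) * cos_theta
        total += integrand / pdf
    return total / samples
```

For a constant BSDF k/π and a constant incoming radiance of 1, the integral equals k, so `estimate_scattered_radiance(lambda c, p: 0.6 / math.pi, lambda c, p: 1.0)` should come out close to 0.6, up to stochastic error.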

There are physical constraints on BSDF. A surface cannot reflect more light

than it receives (Equation 3.22 and Equation 3.23). Furthermore, the reciprocal

property also applies to BSDF (Helmholtz’s principle, Equation 3.24):

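$$\int_{\Omega} \mathrm{BSDF}(x, \omega', \omega) \cos\theta' \, d\omega' \le 1, \quad \forall \omega \in \Omega \tag{3.22}$$

$$\int_{\Omega} \int_{\Omega} \mathrm{BSDF}(x, \omega', \omega) \, L_i(x, \omega') \cos\theta' \, d\omega' \, d\omega \le \int_{\Omega} L_i(x, \omega') \, d\omega' \tag{3.23}$$

$$\mathrm{BSDF}(x, \omega', \omega) = \mathrm{BSDF}(x, \omega, \omega') \tag{3.24}$$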

The BSDF function can be directly represented as a set of values defined for

sampled surface points and incoming and outgoing directions. However, such a

representation would probably consume a lot of memory. In practice, BSDF is

described using a set of parameters of a chosen reflection model. Commonly used

reflection models were proposed by Gouraud [Gou71], Phong [Pho75], Torrance-

Sparrow [TS67], Blinn [Bli77], Schlick [Sch93] and others. We will briefly intro-

duce the Phong model.


Phong reflection model

In the Phong reflection model, the reflective material properties are described
by four scalars: k_d (diffuse coefficient), k_s (specular coefficient), k_a (ambient
coefficient) and s (shininess). The model does not actually define the BSDF; it
replaces Equation 3.21 with another one. [Pho75] The scattered radiance L_s(x, ω)
is expressed as (we generalise the original Phong formula slightly)

$$L_s(x, \omega) = k_a \int_{\Omega \setminus \Omega_L} L_i(x, \omega') \cos\theta' \, d\omega' + \int_{\Omega_L} L_i(x, \omega') \left( k_d \cos\theta' + k_s \cos^s\alpha \right) d\omega' \tag{3.25}$$

The integration domain is split into two parts. The part Ω_L denotes all
incoming directions from a light source to the point x which are not blocked by
any other object. In other words, Ω_L is the set of directions from which the point
x is directly illuminated.

The Phong model is purely empirical. Its parameters have no physical mean-

ing. The splitting of the integration is already wrong—the real BSDF function

makes no distinction between direct and indirect incoming light (the surface mate-

rial has no means of distinguishing between direct and indirect incoming light—it

reflects both in the same way).

The term k_s cos^s α in Equation 3.25 depends on the position of the camera, as
α is the angle between the perfectly mirrored direction of ω′ around the normal at
x and the viewing direction. (This is another flaw of the model: the real BSDF
function does not depend on the camera.) This term simulates so-called specular
highlights, which are caused by a direct reflection of light from a metallic surface
onto the camera. The parameter k_s controls the intensity of the highlight and
the parameter s controls its “tightness”.³

The indirect illumination term ∫_{Ω\Ω_L} L_i(x, ω′) cos θ′ dω′ is approximated
with a constant in some illumination algorithms (local illumination algorithms).

The reason why we deal with the Phong model is that it is assumed in the
majority of the existing 3D formats, in which the material description consists
of the four scalars k_d, k_s, k_a and s. The description of materials is a very serious
problem of contemporary computer graphics.

³ We assume here silently that the surface of each object consists of the same material and
that each object is assigned its own BSDF. It would be possible to describe the surfaces of all
objects using one global BSDF, but in such a case the parameters k_d, k_s, k_a and s would be
functions of x.

Modified (more realistic) Phong reflection model

Fortunately, the four parameters used in the Phong model can be given a more
realistic interpretation than that of Equation 3.25. [LW94] The modified Phong
model obeys Equation 3.21 and defines the BSDF function as a sum of specular
and diffuse components:

$$\mathrm{BSDF}(x, \omega', \omega) = \mathrm{BSDF}_d(x, \omega', \omega) + \mathrm{BSDF}_s(x, \omega', \omega) = \frac{k_d}{\pi} + k_s \, \frac{s + 2}{2\pi} \cos^s\alpha \tag{3.26}$$

where α is the angle between the perfect specular reflective direction and the
outgoing direction. (The parameter k_a is ignored.)
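The normalisation factor (s + 2)/(2π) can be checked numerically against the constraint of Equation 3.22. The sketch below evaluates Equation 3.26 and integrates it for the special case in which the outgoing direction ω coincides with the surface normal, so that α equals θ′. The function names, the clamping of negative cosines to zero and the midpoint rule are our assumptions, not part of the model description.

```python
import math

def modified_phong_bsdf(k_d, k_s, s, cos_alpha):
    """Evaluate Equation 3.26; negative cos(alpha) is clamped to zero (our assumption)."""
    specular = k_s * (s + 2.0) / (2.0 * math.pi) * max(cos_alpha, 0.0) ** s
    return k_d / math.pi + specular

def albedo_for_normal_exitance(k_d, k_s, s, steps=20_000):
    """Midpoint-rule estimate of the integral in Equation 3.22 for omega = N."""
    d_theta = 0.5 * math.pi / steps
    total = 0.0
    for j in range(steps):
        theta = (j + 0.5) * d_theta
        # For omega = N the angle alpha equals theta'; d(omega') = sin(theta') d(theta') d(phi).
        f = modified_phong_bsdf(k_d, k_s, s, math.cos(theta))
        total += f * math.cos(theta) * math.sin(theta) * d_theta
    return 2.0 * math.pi * total  # the integrand does not depend on phi
```

In this special case the integral evaluates analytically to k_d + k_s, so the constraint of Equation 3.22 holds whenever k_d + k_s ≤ 1.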

This BSDF model does not include light transmission. Adding light
transmission usually requires additional parameters, such as the index of
refraction (IOR) and the transparency coefficients of the model, as well as an
additive term analogous to Equation 3.26.

3.2.4 Modeling of light sources

A light source is an area which emits light without being illuminated from the

outside. A light source can be characterised by the placement and geometry of

the area and by its directional radiant properties. It is usually assumed that these

properties do not change over time. A light source i is characterised by its radiant
emittance l_e^i(x, ω). There is a finite set of light sources in the 3D model. The
whole set of light sources is described using the function L_e(x, ω) = Σ_i l_e^i(x, ω).⁴

The idealised light source types widely used in computer graphics are a point

light source and an area diffuse light source. The area of a point light source is a

differential area around a point and the energy emitted in all directions is equal.

(A simple modification of a point light source is a spot light source which is a

differential area around a point which emits energy in a cone around the point.

The emitted radiance is maximal in the direction of the cone axis and zero for the

directions outside the cone.) The area of an area diffuse light source is a non-zero

(usually planar) area, each point of which emits energy equally in all directions.

A more realistic description of light sources is provided by the ANSI/IESNA

standard “IES Recommended Standard File Format for Electronic Transfer of

Photometric Data”. [LM-02] Characteristics of many real luminaires (light fix-

tures) by various manufacturers are stored in the IES format (the description of

a luminaire is essentially the radiance function sampled in many points of the

luminaire in many directions). This format is being adopted by the computer

graphics community. For instance, the RADIANCE rendering system [War94]

can import light sources which are described using the IES format and work with

them.

⁴ We can assume that the areas of light sources do not overlap. (The area of a light source i
is the set of points x for which l_e^i(x, ω) ≠ 0 for some ω.)

3.3. THE GLOBAL ILLUMINATION PROBLEM 121

3.2.5 Modeling of camera

A camera is a device which is sensitive to radiance. The radiance is measured
using a finite set of sensors. A sensor i is characterised by its sensor
responsiveness function w_e^i(x, ω′), which returns 1 if the radiance impinging at the point
x in the direction ω′ directly reaches the measuring device (e.g. a film or an
eye). Otherwise it returns 0. The total response measured by the sensor i in a
differential area around a point x is

$$\int_{\Omega} w_e^i(x, \omega') \, L_i(x, \omega') \cos\theta' \, d\omega' \tag{3.27}$$

where θ′ is the angle between the incoming direction ω′ and the surface normal
at the point x.

The set of all sensors is characterised by the function W_e(x, ω′) = Σ_i w_e^i(x, ω′).⁵
The picture seen by the camera consists of a finite number of pixels which are
organised in a rectangular 2D grid. Each pixel is covered by one sensor. The
total response measured over a pixel is thus

$$\int_{S} \int_{\Omega} W_e(x, \omega') \, L_i(x, \omega') \cos\theta' \, d\omega' \, dA = \int_{S} \int_{\Omega} W_e(x, \omega') \, L(\mathrm{RT}(x, -\omega'), \omega') \cos\theta' \, d\omega' \, dA \tag{3.28}$$

where dA is a differential area around the point x, θ′ is the angle between the
incoming direction ω′ and the surface normal at the point x and S is the area
covered by the pixel. The interpretation of this equation is: “The total response
of a sensor is the radiance which directly reaches the measuring device.”

3.3 The global illumination problem

An instance of the global illumination problem is a tuple

$$\langle G, \mathrm{BSDF}, L_e, C \rangle \tag{3.29}$$

where G is a description of the surfaces, BSDF is a description of the material
properties of the surfaces, L_e is a description of the light sources and C is a
description of the camera. The problem is to compute the values measured by
the camera sensors.

⁵ We can assume that the areas of sensors do not overlap. (The area of a sensor i is the set
of points x for which w_e^i(x, ω′) ≠ 0 for some ω′.)


3.3.1 Rendering equations

Radiance equation

A mathematical definition of the global illumination problem is contained in
Equation 3.28. The calculation of the total response of a camera sensor (colour
of a pixel) depends on the knowledge of the function L_i over the set of points x
and directions ω′ of interest (for which W_e(x, ω′) ≠ 0).

The unknown incoming radiance L_i (or radiance L, see Equation 3.13 which
relates L and L_i) can be calculated using the scattering equation 3.21. We recall
that L_s(x, ω) on the left side of Equation 3.21 is the scattered radiance at the
point x. If the point x also emits light, then the total outgoing radiance L(x, ω)
which leaves the point x due to emission and scattering is equal to

$$\begin{aligned} L(x, \omega) &= L_e(x, \omega) + L_s(x, \omega) \\ &= L_e(x, \omega) + \int_{\Omega} \mathrm{BSDF}(x, \omega', \omega) \, L(\mathrm{RT}(x, -\omega'), \omega') \cos\theta' \, d\omega' \end{aligned} \tag{3.30}$$

Equation 3.30 is called the radiance equation. In order to solve the global
illumination problem, Equation 3.30 must be solved (the function L must be
calculated) at least for those x and ω which contribute to the integration in
Equation 3.28.

Potential equation

The global illumination problem can also be looked at from another point of view,

using an abstract measure which expresses the visual importance.

Definition. (Visual) potential (also called visual importance) W(x, ω′) is defined
as the percentage of the incoming radiance L_i(x, ω′) at the point x in the direction
ω′ which reaches the measuring device. •

Remark. The percentage of the incoming radiance L_i(x, ω′) at the point x in
the direction ω′ which directly reaches the measuring device is equal to W_e(x, ω′). •

Let us denote by W_s(x, ω′) the percentage of the incoming radiance L_i(x, ω′) at
the point x in the direction ω′ which leaves the point x and indirectly reaches the
measuring device, that is, after one or more scatterings. From the definitions
of BSDF (Equation 3.20) and the potential it follows that

$$W(\mathrm{RT}(x, \omega), \omega) \, \mathrm{BSDF}(x, \omega', \omega) \cos\theta$$

3.4. APPROACHES TO THE GLOBAL ILLUMINATION PROBLEM 123

is the percentage of L_i(x, ω′) which leaves the point x in the direction ω due to
scattering and then (directly or indirectly) reaches the measuring device. θ is the
angle between the surface normal at the point x and the direction ω. The total
percentage of L_i(x, ω′) which reaches the measuring device is thus equal to

$$\begin{aligned} W(x, \omega') &= W_e(x, \omega') + W_s(x, \omega') \\ &= W_e(x, \omega') + \int_{\Omega} \mathrm{BSDF}(x, \omega', \omega) \, W(\mathrm{RT}(x, \omega), \omega) \cos\theta \, d\omega \end{aligned} \tag{3.31}$$

Equation 3.31 is called the potential equation. There is a strong structural
similarity between the potential equation and the radiance equation. Indeed,
solving the potential equation also solves the global illumination problem.
If the function W(x, ω′) is known, then the total response of a sensor can be
calculated as

$$\int_{S} \int_{\Omega} W(\mathrm{RT}(x, \omega), \omega) \, L_e(x, \omega) \cos\theta \, d\omega \, dA \tag{3.32}$$

where dA is a differential area around the point x, θ is the angle between the
direction ω and the surface normal at the point x and S is the area of all the light
sources. The interpretation of this equation is: “The total response of a sensor
is the percentage of the emitted radiance times the percentage of this emitted
radiance which eventually reaches the measuring device”.

3.4 Approaches to the global illumination problem

Both the radiance equation 3.30 and the potential equation 3.31 are Fredholm

integral equations of the second kind and cannot be (except for a few special cases)

solved analytically. [Atk76] There are two classes of methods which numerically

solve the equations:

• Direct methods which directly solve the integral equations. These include
Monte Carlo and Quasi Monte Carlo integrations in higher dimensions.

• Approximation methods which make additional simplifying assumptions and
solve simplified equations. These include (eye-) ray tracing and finite
element methods such as radiosity.

The main advantage of the direct integration methods is that they work with-

out approximations with the original model. This means for instance that if the

integration method guarantees a certain error bound, then this error bound also

applies to the computed images.


An important practical question when designing a global illumination algorithm
is whether the algorithm requires an explicit storage of the radiance or the
potential function over the space of all surface points and directions. As this space
is infinite, the explicit representation of the radiance function or the potential
function in finite memory is already an approximation. The error bound of such

an approximation is usually difficult to predict. This means that direct methods

should not rely on the explicit storing of the radiance or the potential functions

in order to guarantee the error bound provided by the underlying integration

method.

3.4.1 Direct methods

Gathering path integration

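The radiance equation 3.30 can be regarded as a recurrent definition of the
unknown function L. Let us denote by R the radiance transport operator

$$(RL)(x, \omega) = \int_{\Omega} \mathrm{BSDF}(x, \omega', \omega) \, L(\mathrm{RT}(x, -\omega'), \omega') \cos\theta' \, d\omega' \tag{3.33}$$

Using this operator, the radiance equation 3.30 can be written as

$$\begin{aligned} L &= L_e + RL \\ &= L_e + R(L_e + RL) = L_e + RL_e + R^2 L \\ &\;\;\vdots \\ &= \left( \sum_{i=0}^{n} R^i L_e \right) + R^{n+1} L \end{aligned} \tag{3.34}$$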

Fig. 3.3 and Fig. 3.4 depict the geometry of the integrands of (RL)(x, ω) and
(R²L)(x, ω), respectively. If the operator R is a contraction (and it is, thanks to
the underlying physics, see Equation 3.22), then lim_{n→∞} R^{n+1} L = 0. Therefore

$$L = \lim_{n \to \infty} \sum_{i=0}^{n} R^i L_e = \sum_{i=0}^{\infty} R^i L_e \tag{3.35}$$

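The terms R^i L_e have the following structure:

$$\begin{aligned} (R^0 L_e)(x, \omega) &= L_e(x, \omega) \\ (R^1 L_e)(x, \omega) &= \int_{\Omega_1} \mathrm{BSDF}(x, \omega'_1, \omega) \, L_e(\mathrm{RT}(x, -\omega'_1), \omega'_1) \cos\theta'_1 \, d\omega'_1 \\ (R^2 L_e)(x, \omega) &= \int_{\Omega_2} \int_{\Omega_1} \mathrm{BSDF}(\mathrm{RT}(x, -\omega'_1), \omega'_2, \omega'_1) \, \mathrm{BSDF}(x, \omega'_1, \omega) \\ &\qquad L_e(\mathrm{RT}(\mathrm{RT}(x, -\omega'_1), -\omega'_2), \omega'_2) \cos\theta'_1 \cos\theta'_2 \, d\omega'_1 \, d\omega'_2 \\ &\;\;\vdots \end{aligned} \tag{3.36}$$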

Figure 3.3: Gathering path integration: The geometry of the integrand of the
term (RL)(x, ω)









Figure 3.4: Gathering path integration: The geometry of the integrand of the
term (R²L)(x, ω)

These terms have a physical interpretation: “(R^i L_e)(x, ω) is the radiance at
the point x in the direction ω which has been scattered exactly i times after it
had left a light source (including the scattering at the point x).”

As the operator R is contractive (a part of the transported radiance is
absorbed in each scattering), the values of (R^i L_e)(x, ω) get smaller as i increases:

$$(R^0 L_e)(x, \omega) \ge (R^1 L_e)(x, \omega) \ge (R^2 L_e)(x, \omega) \ge (R^3 L_e)(x, \omega) \ge \ldots \tag{3.37}$$

The most rigorous algorithms which use Equation 3.35 in order to solve the

global illumination problem are path tracing [Kaj86] and distributed ray tracing

(also called Monte Carlo Ray Tracing) [CPC84]. These algorithms do not ex-

plicitly store the computed radiance function—instead they directly compute the

integral 3.28 using Monte Carlo integration. Both path tracing and distributed


ray tracing generate paths which consist of line segments (rays). The first ray

starts at the measuring device and goes through a pixel (its direction is
randomly generated). If a ray does not hit a surface, the path will be terminated.⁶
Otherwise a decision is made whether the path will be prolonged by another ray
(the direction of this scattered ray is generated randomly) or whether the path
is terminated.⁷ The probability of the prolongation of a path is proportional to
an estimated contribution of the new ray to the integral 3.28 (as the radiance
transport operator R is a contraction, this contribution decreases as the length of
the path increases). When a path is terminated, its contribution to the integral
3.28 is added and the path is discarded. The difference between path tracing
(Fig. 3.5) and distributed ray tracing (Fig. 3.6) is that path tracing only collects
the direct lighting L_e from the light sources in the last point of the path which

is being terminated, whereas distributed ray tracing collects the direct lighting

in the last point of every segment of the path. (Note that the collection of the

direct lighting is a necessity if the scene contains point light sources, because the

probability of hitting a point light with a randomly generated ray is zero.)
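The random termination of paths can be illustrated on a deliberately simplified toy model; it is not path tracing or distributed ray tracing as such. We assume a closed environment in which every ray hits a surface, every surface point emits the same radiance `emitted` and scatters a fraction `albedo` of the incoming radiance, so that the series 3.35 sums to emitted / (1 - albedo). A path is prolonged with the probability `survival`; all names are hypothetical.

```python
import random

def furnace_path_estimate(emitted, albedo, survival, rng):
    """Follow one random path and return its estimate of emitted / (1 - albedo)."""
    radiance = 0.0
    weight = 1.0
    while True:
        radiance += weight * emitted     # contribution of the term R^i L_e
        if rng.random() >= survival:     # random decision: terminate the path
            return radiance
        # One more scattering: the radiance is attenuated by the albedo and
        # divided by the survival probability to compensate for terminated paths.
        weight *= albedo / survival

def furnace_radiance(emitted, albedo, survival, paths=200_000, seed=7):
    """Average the estimates of many independent paths."""
    rng = random.Random(seed)
    total = sum(furnace_path_estimate(emitted, albedo, survival, rng)
                for _ in range(paths))
    return total / paths
```

Dividing the weight by the survival probability keeps the estimate correct on average for any survival probability in (0, 1). Choosing it equal to the albedo makes all weights equal to 1, which corresponds to prolonging a path with a probability proportional to its expected contribution.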



Figure 3.5: Camera tracing with a single collection of the direct radiance (path

tracing)

Figure 3.6: Camera tracing with a multiple collection of the direct radiance
(distributed ray tracing)

⁶ Instead of the termination of the path in this case, the algorithm may shorten the path
and recursively trace other rays.

⁷ If a ray hits a surface of a light source, the path is terminated in order to avoid a multiple
addition of the direct lighting. Another possibility involves excluding the surfaces of light
sources from the computations of these intersections.

Shooting path integration

The potential equation 3.31 can be expanded in a similar way to that of
Equation 3.34. Let us denote by P the potential transport operator

$$(PW)(x, \omega') = \int_{\Omega} \mathrm{BSDF}(x, \omega', \omega) \, W(\mathrm{RT}(x, \omega), \omega) \cos\theta \, d\omega \tag{3.38}$$

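The potential equation 3.31 can be written as (note that the operator P is a
contraction)

$$\begin{aligned} W &= W_e + PW \\ &= W_e + P(W_e + PW) = W_e + PW_e + P^2 W \\ &\;\;\vdots \\ &= \left( \sum_{i=0}^{n} P^i W_e \right) + P^{n+1} W \\ &= \sum_{i=0}^{\infty} P^i W_e \end{aligned} \tag{3.39}$$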

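Fig. 3.7 and Fig. 3.8 depict the geometry of the integrands of (PW)(x, ω′)
and (P²W)(x, ω′), respectively.

The terms P^i W_e of the infinite sum have the following structure:

$$\begin{aligned} (P^0 W_e)(x, \omega') &= W_e(x, \omega') \\ (P^1 W_e)(x, \omega') &= \int_{\Omega_1} \mathrm{BSDF}(x, \omega', \omega_1) \, W_e(\mathrm{RT}(x, \omega_1), \omega_1) \cos\theta_1 \, d\omega_1 \\ (P^2 W_e)(x, \omega') &= \int_{\Omega_2} \int_{\Omega_1} \mathrm{BSDF}(\mathrm{RT}(x, \omega_1), \omega_1, \omega_2) \, \mathrm{BSDF}(x, \omega', \omega_1) \\ &\qquad W_e(\mathrm{RT}(\mathrm{RT}(x, \omega_1), \omega_2), \omega_2) \cos\theta_1 \cos\theta_2 \, d\omega_1 \, d\omega_2 \\ &\;\;\vdots \end{aligned} \tag{3.40}$$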

Figure 3.7: Shooting path integration: The geometry of the integrand of the term
(PW)(x, ω′)











 





























Figure 3.8: Shooting path integration: The geometry of the integrand of the term
(P²W)(x, ω′)

The physical interpretation of these terms is: “(P^i W_e)(x, ω′) is the percentage
of the incoming radiance impinging at the point x in the direction ω′ which reaches
the measuring device after exactly i scatterings (including the scattering at the
point x).”

The most rigorous algorithm which uses Equation 3.39 in order to solve the

global illumination problem is light tracing. [DLW93] Light tracing does not ex-

plicitly store the computed potential function—instead it directly computes the

integral 3.32 using Monte Carlo integration. The algorithm generates paths which

consist of line segments (rays). The first ray starts at a randomly chosen point of

a randomly chosen light source in a randomly chosen direction. If a ray does not

hit a surface, the path is terminated. Otherwise a decision is made as to whether

the path will be prolonged by another ray (the direction of this scattered ray is

randomly generated) or whether the path will be terminated.⁸ The probability
of the prolongation of a path is proportional to an estimated contribution of the
new ray to the integral 3.32 (as the potential transport operator P is a
contraction, this contribution decreases as the length of the path increases). When a
path is terminated, its contribution to the integral 3.32 is added and the path is
discarded. The direct potential W_e from the camera can be collected either at
the last point of the path which is being terminated (Fig. 3.9) or at the last point
of every segment of the path (Fig. 3.10).

⁸ Instead of the termination of the path in this case, the algorithm may shorten the path
and recursively trace other rays.



Figure 3.9: Light tracing with a single collection of the direct potential



Figure 3.10: Light tracing with a multiple collection of the direct potential

Bidirectional path integration

A disadvantage of the gathering and shooting path integrations is their slow

convergence. In practice, this slow convergence means that certain light phenom-

ena, e.g. caustics, take a long time to compute—however, we must stress that if


the computational time is unlimited then any of the direct methods can correctly

solve the global illumination problem (the expected solution is equal to the exact

solution with probability 1). Bidirectional path integration combines the gather-

ing and shooting paths into global paths which connect the measuring device with

a light source. Note that the gathering paths generated by distributed ray trac-

ing also connect the measuring device with a light source—however, the length

of the “shooting path segment” (which collects the direct lighting) is limited to

the length one. Similarly, the shooting paths generated by light tracing also con-

nect the measuring device with a light source, but in this case the “gathering

path segment” (which collects the direct potential) is limited to the length one.

Bidirectional path integration connects shooting and gathering paths of arbitrary

lengths, see Fig. 3.11 and Fig. 3.12. [LW93], [VG94], [Vea97]



Figure 3.11: Bidirectional path tracing

The above path integration methods as well as other direct methods (such

as stochastic iteration, [SK99b], [SK99a]) are stochastic methods. As such they

suffer from stochastic errors which are perceived as noise in the computed images.

However, if approximations are avoided in the algorithms which are used within

the methods, then probabilistic guarantees can be given on the computed results.

The most important guarantee is that the computed results are on average correct

and that the stochastic errors can be eliminated by using more random samples

(e.g. more paths or more iterations).

Figure 3.12: Bidirectional path tracing with multiple connections of the gathering
and shooting paths

3.4.2 Approximation methods

Most approximation methods restrict the modeling of surface material
properties to perfect diffuse or perfect specular reflectors. The idea is to simplify the
structure of one of the rendering equations and to apply a deterministic method
which solves the modified equation. The disadvantage of this is that practically
no surface in nature is perfectly diffuse or perfectly specular (or a linear
combination of both). Therefore the solution to the modified problem usually differs from
the solution to the original (physically more correct) problem and this difference
cannot usually be bounded. Two well-known methods of this kind are (eye-) ray
tracing and radiosity. The next two chapters are devoted to these two methods.
We sketch the simplifying assumptions which they make below.

(Eye-) Ray tracing

(Eye-) ray tracing solves a simplified version of Equations 3.28 and 3.30 using the
expansion 3.35. It is assumed that all object surfaces are perfect specular
reflectors (or perfect specular transmitters, or both) for the purpose of the computation
of the indirect illumination (the terms (R^i L_e)(x, ω), i ≥ 2).

A perfect specular reflector is a surface which is characterised by the following
BSDF:

$$\mathrm{BSDF}_r(x, \omega', \omega) = k_s \, \Delta(\omega, \omega' - 2 (N \cdot \omega') N) \tag{3.41}$$

where x is a surface point, ω′ is the incoming direction (a normalised 3D
vector), ω is the outgoing direction (a normalised 3D vector), k_s ∈ ⟨0, 1) is a
specular coefficient, N is the surface normal at the point x and Δ is a slightly
modified Dirac function (for directions in 3D):

$$\forall \omega_2 \ne \omega_1 : \Delta(\omega_1, \omega_2) = 0$$

$$\forall \omega_1 : \int_{\Omega} \Delta(\omega_1, \omega_2) \, d\omega_2 = 1$$


The direction ω′ − 2(N·ω′)N in Equation 3.41 is the mirror direction, which
lies in the same plane as ω′ and N; the angle between this mirror direction
and the surface normal is equal to the incoming angle, see Fig. 3.13.
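The mirror direction can be computed directly from this expression. The sketch below assumes that directions and the normal are given as 3-tuples and that ω′ points towards the surface, as in Equation 3.41; the function name `reflect` is our choice.

```python
def reflect(omega_in, normal):
    """Return the mirror direction omega' - 2 (N . omega') N."""
    d = sum(w * n for w, n in zip(omega_in, normal))  # N . omega'
    return tuple(w - 2.0 * d * n for w, n in zip(omega_in, normal))
```

For example, a ray travelling in the direction (1, −1, 0) which hits a surface with the normal (0, 1, 0) is mirrored into the direction (1, 1, 0). The length of the vector is preserved, so a normalised incoming direction yields a normalised outgoing direction.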









Figure 3.13: Perfect specular reflection. θ = θ′

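A perfect specular transmitter is a surface which is characterised by the
following BSDF:

$$\mathrm{BSDF}_t(x, \omega', \omega) = \begin{cases} k_t \, \Delta\!\left( \omega, \; k_{ior} \, \omega' + \left( k_{ior} \cos\theta - \sqrt{1 + k_{ior}^2 (\cos^2\theta - 1)} \right) N \right) & \text{if } 1 + k_{ior}^2 (\cos^2\theta - 1) \ge 0 \\ k_s \, \Delta(\omega, \omega' - 2 (N \cdot \omega') N) & \text{if } 1 + k_{ior}^2 (\cos^2\theta - 1) < 0 \end{cases} \tag{3.42}$$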

where x is a surface point, ω′ is the incoming direction (a normalised 3D
vector), ω is the outgoing direction (a normalised 3D vector), k_t ∈ ⟨0, 1) is a
specular transmission coefficient, k_ior ∈ (0, 1⟩ is the index of refraction⁹ between
the surrounding medium and the surface material at x, N is the surface normal
at x, Δ is the modified Dirac function and θ is the angle between the incoming
direction and the surface normal at x (hence, cos θ = N·(−ω′)), see Fig. 3.14.
The direction k_ior ω′ + (k_ior cos θ − √(1 + k_ior²(cos²θ − 1))) N is the perfect
direction of refraction (Snell's law). The first split term of BSDF_t represents
the perfect specular refraction, the second split term represents the total internal
reflection, which occurs when the incoming angle is so large that the expression
under the square root becomes negative.
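The two branches of Equation 3.42 translate into a short routine which returns the outgoing direction together with a flag telling whether total internal reflection occurred. This is a sketch under our own conventions: the name `refract` and the returned pair are assumptions, and directions are 3-tuples with ω′ pointing towards the surface.

```python
import math

def refract(omega_in, normal, k_ior):
    """Return (direction, total_internal_reflection) following Equation 3.42."""
    cos_theta = -sum(w * n for w, n in zip(omega_in, normal))  # cos(theta) = N . (-omega')
    radicand = 1.0 + k_ior * k_ior * (cos_theta * cos_theta - 1.0)
    if radicand < 0.0:
        # Second branch: total internal reflection, mirror direction of Equation 3.41.
        d = -cos_theta  # N . omega'
        mirrored = tuple(w - 2.0 * d * n for w, n in zip(omega_in, normal))
        return mirrored, True
    # First branch: perfect direction of refraction.
    scale = k_ior * cos_theta - math.sqrt(radicand)
    refracted = tuple(k_ior * w + scale * n for w, n in zip(omega_in, normal))
    return refracted, False
```

For an incoming angle of 45 degrees and k_ior = 0.5, the sine of the refracted angle is 0.5 · sin 45°, in agreement with Snell's law. Note that for k_ior in the range (0, 1⟩ stated above the radicand is never negative; the reflection branch is only reached if the routine is called with k_ior > 1, which we include here only to exercise that branch.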

The resulting BSDF is the sum of the perfect specular reflection and the
perfect specular transmission:

$$\mathrm{BSDF}(x, \omega', \omega) = \mathrm{BSDF}_r(x, \omega', \omega) + \mathrm{BSDF}_t(x, \omega', \omega) \tag{3.43}$$

Figure 3.14: Perfect specular refraction. sin θ = k_ior sin θ′

⁹ The index of refraction may depend on the wavelength of the incoming light. This is why
a glass prism divides the refracted white light into a rainbow. [New52] A varying index of
refraction can be included in this model.

The BSDF above is used in the computation of the terms (R^i L_e)(x, ω), i ≥ 2,
of the expansion 3.35 of the radiance equation 3.30. For the computation of the
direct lighting (the terms (R^0 L_e)(x, ω) and (R^1 L_e)(x, ω)), either the Phong model
(Equation 3.25) or the modified Phong model (Equation 3.26) is used.¹⁰ This is not quite correct

but it saves computational time. Eye ray tracing is very similar to distributed ray

tracing which is schematically depicted in Fig. 3.6. The difference between the

two is that eye ray tracing only computes the integral which corresponds to the

direct camera rays (the computation of this integral is called anti-aliasing) and

the integral which corresponds to the dashed direct light rays (the computation of

this integral is called shading). All the integrals in-between are only approximated

by sampling the radiance function in two principal directions (the direction of the

perfect reflection and the direction of the perfect transmission).

Radiosity

The radiosity method consists of two steps in order to compute the picture viewed
by the camera. In the first step (the so-called view-independent step), the
unknown radiance function is computed using a simplified version of Equation 3.30.
Unlike (eye-) ray tracing, the radiosity method explicitly represents the radiance
function. In the second step (the view-dependent step) the picture viewed by the
camera is computed using Equation 3.28.

¹⁰ The term (R^0 L_e)(x, ω) is often ignored. This term corresponds to the direct lighting which
impinges at the camera and it is responsible for the effect known as “lens flare”. This effect
can be observed when a picture is taken against the sun. The sunlight which directly hits the
camera lens creates colourful circles.

The first of the simplifying assumptions which are made by radiosity is that

all object surfaces are perfect diffuse reflectors. A perfect diffuse reflector is a

surface which is characterised by the following BSDF:

$$\mathrm{BSDF}(x, \omega', \omega) = \begin{cases} \dfrac{k_d(x)}{\pi} & \text{if } \omega \text{ lies in the half-space of reflection } (\omega' \text{ and } N) \\ 0 & \text{otherwise} \end{cases} \tag{3.44}$$

where k_d(x) ∈ ⟨0, 1) is a diffuse reflection coefficient, see Fig. 3.15. Note that
this BSDF does not depend on ω′ or ω (as the basic radiosity method ignores
the transmission of light, we will in the following text restrict the set of incoming
and outgoing directions Ω to the directions in the half-space of reflection). This
means that the incoming radiance is equally reflected in all outgoing directions.
This assumption allows us to write Equation 3.30 as

$$L(x, \omega) = L_e(x, \omega) + \frac{k_d(x)}{\pi} \int_{\Omega} L(\mathrm{RT}(x, -\omega'), \omega') \cos\theta' \, d\omega' \tag{3.45}$$







Figure 3.15: Perfect diffuse reflection. The incoming radiance is equally scattered
in all outgoing directions in the half-space of reflection, independently of the
incoming direction ω′

The second simplifying assumption is that all surfaces are modeled as planar

areas (so-called patches) and that the 3D model only contains a finite number

of these patches. Moreover, the radiance over all points of one patch is assumed

to be constant. Let $P_1, \ldots, P_n$ denote the patches. These patches include light
sources. The radiant emittance of all light sources is assumed to be perfectly
diffuse ($L_e(x, \omega)$ is only a function of $x$) and constant at every point of one patch.
In other words, all light sources are area light sources. We define exitance at a
point of a light source (which is characterised by its radiant emittance $l_e^i(x, \omega)$, see
Section 3.2.4) as


$$E(x) = \int_{\Omega} l_e^i(x, \omega) \cos\theta \, d\omega \qquad (3.46)$$

If all light sources are area light sources (perfect diffuse emitters) and if their

areas do not overlap, then the above equation can be simplified by using the fact

that $L_e(x, \omega)$ does not depend on $\omega$ (Equation 3.6 is used to express the direction
$\omega$ as polar angles):

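$$E(x) = \int_{\Omega} L_e(x, \omega) \cos\theta \, d\omega = L_e(x, \omega) \int_{\Omega} \cos\theta \, d\omega = L_e(x, \omega) \int_0^{2\pi} \int_0^{\pi/2} \cos\theta \sin\theta \, d\theta \, d\varphi = \pi L_e(x, \omega) \qquad (3.47)$$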
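As a quick numerical sanity check of the factor $\pi$ in Equation 3.47, the following Python sketch integrates $\cos\theta$ over the half-space of reflection with a midpoint rule. It assumes that the polar angle $\theta$ runs from 0 to $\pi/2$ and the azimuth $\varphi$ from 0 to $2\pi$; the function name and the number of subintervals are illustrative choices and do not come from the original text.

```python
import math

def hemisphere_cosine_integral(n=2000):
    """Midpoint-rule estimate of the integral of cos(theta) over the hemisphere.

    The solid angle element is sin(theta) dtheta dphi. The integrand does not
    depend on phi, so the azimuthal integral contributes a factor of 2*pi.
    """
    h = (math.pi / 2.0) / n          # width of one theta subinterval
    total = 0.0
    for k in range(n):
        theta = (k + 0.5) * h        # midpoint of the k-th subinterval
        total += math.cos(theta) * math.sin(theta) * h
    return 2.0 * math.pi * total

if __name__ == "__main__":
    print(hemisphere_cosine_integral(), math.pi)
```

The same factor appears in Equation 3.48, which is why radiosity and radiance of a perfect diffuse surface differ only by the constant $\pi$.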

We recall here the definition of radiosity (Equation 3.14). If a point $x$ lies on
a surface of a perfect diffuse reflector (or a perfect diffuse emitter), then $L(x, \omega)$
does not depend on $\omega$. Hence, radiosity at the point $x$ is equal to

$$B(x) = \int_{\Omega} L(x, \omega) \cos\theta \, d\omega = L(x, \omega) \int_{\Omega} \cos\theta \, d\omega = \pi L(x, \omega) \qquad (3.48)$$

The multiplication of Equation 3.45 by $\pi$ yields

$$B(x) = E(x) + k_d(x) \int_{\Omega} L(RT(x, -\omega'), \omega') \cos\theta' \, d\omega' \qquad (3.49)$$

Note that not all incoming directions $\omega'$ contribute equally to the integral in
the above equation. Those directions for which the function $RT(x, -\omega')$ does not
find a surface point do not contribute at all. For all other directions the point
$y = RT(x, -\omega')$ lies on a surface of a perfect diffuse reflector or emitter. For the
point $y$ it holds that

$$L(y, \omega') = L(RT(x, -\omega'), \omega') = \frac{B(y)}{\pi} \qquad (3.50)$$

The above equation and the substitution of $d\omega'$ into Equation 3.49 using
Equation 3.5 yield

$$B(x) = E(x) + \frac{k_d(x)}{\pi} \int_{S} B(y) \, \frac{\cos\theta \cos\theta'}{r(x, y)^2} \, V(x, y) \, dA \qquad (3.51)$$

where $\theta$ is the angle between the surface normal at the point $y$ and the
direction from $y$ to $x$, $r(x, y)$ is the distance between the points $x$ and $y$, and $dA$ is
a differential area around the point $y$. The integration domain $S$ consists of all surface
points. The function $V(x, y)$ is a visibility function which is needed in order to
avoid the collection of multiple contributions to the integral after the change of
the integration domain. Note that in the integral of Equation 3.49, the function
$RT(x, -\omega)$ returns the nearest point to $x$ in the direction of $-\omega$, whereas
the integration over all surface points also includes further points. The visibility
function $V(x, y)$ returns 1 for the nearest point and 0 for all further points:

$$V(x, y) = \begin{cases} 1 & \text{if the points } x \text{ and } y \text{ are mutually visible} \\ 0 & \text{otherwise} \end{cases} \qquad (3.52)$$

Radiosity $B(y)$ in Equation 3.51 depends on the patch on which the point
$y$ lies. However, we assume that the radiosity in each point of one patch is
constant. The integration domain $S$ is the union of all patches, $S = \bigcup_{i=1}^{n} P_i$, and
so the integral of Equation 3.51 can be divided into a sum of integrals:

$$B(x) = E(x) + \frac{k_d(x)}{\pi} \sum_{i=1}^{n} \int_{y \in P_i} B(y) \, \frac{\cos\theta \cos\theta'}{r(x, y)^2} \, V(x, y) \, dA_i \qquad (3.53)$$

Equation 3.53 is not quite correct because it violates the assumption of
constant radiosity over all points of one patch: the values of $B(x)$ which are
computed for different points $x$ of one patch using Equation 3.53 are not necessarily
equal. To overcome this problem, we define patch radiosity $B_i$, $i = 1, \ldots, n$, as
an area-weighted average of the point radiosities $B(x)$, $x \in P_i$:

$$B_i = \frac{1}{A_i} \int_{x \in P_i} B(x) \, dx \qquad (3.54)$$

where $A_i$ is the area of the patch $P_i$. Similarly, we define patch exitance
$E_i$, $i = 1, \ldots, n$, as an area-weighted average of the point exitances $E(x)$, $x \in P_i$:

$$E_i = \frac{1}{A_i} \int_{x \in P_i} E(x) \, dx \qquad (3.55)$$

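A correct version of Equation 3.53 is therefore

$$B_i = E_i + \rho_i \sum_{j=1}^{n} B_j \, \frac{1}{A_i} \int_{x \in P_i} \int_{y \in P_j} \frac{\cos\theta \cos\theta'}{\pi \, r(x, y)^2} \, V(x, y) \, dA_j \, dA_i \qquad (3.56)$$

where $\rho_i = k_d(x)$ at any point $x \in P_i$ (the diffuse reflection coefficient $k_d(x)$
is constant for all points of a patch).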

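If we denote

$$F_{ij} = \frac{1}{A_i} \int_{x \in P_i} \int_{y \in P_j} \frac{\cos\theta \cos\theta'}{\pi \, r(x, y)^2} \, V(x, y) \, dA_j \, dA_i \qquad (3.57)$$

then Equation 3.56 can be written as

$$B_i = E_i + \rho_i \sum_{j=1}^{n} F_{ij} B_j \qquad (3.58)$$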


Equation 3.58 is called the radiosity equation. Note that the terms $F_{ij}$ only
depend on the geometry of the patches in the 3D model and can therefore be
computed independently of the illumination. The terms $F_{ij}$ are called form factors.
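To make the double integral of Equation 3.57 concrete, the following Python sketch approximates the form factor between two parallel, directly opposed square patches with a midpoint rule. It is only an illustration under stated assumptions: the patches see each other completely ($V(x, y) = 1$), the patch geometry, the function name and the grid resolution are chosen for this example, and practical radiosity systems use more elaborate form factor algorithms.

```python
import math

def form_factor_parallel_squares(side=1.0, distance=1.0, n=16):
    """Midpoint-rule estimate of F_ij for two opposed parallel squares.

    Both patches are squares of edge length `side`, aligned on top of each
    other at the given `distance`. Each patch is cut into n x n cells.
    """
    h = side / n
    cell_area = h * h
    centres = [(k + 0.5) * h for k in range(n)]
    total = 0.0
    for xi in centres:
        for yi in centres:
            for xj in centres:
                for yj in centres:
                    r2 = (xi - xj) ** 2 + (yi - yj) ** 2 + distance ** 2
                    # For parallel patches cos(theta) = cos(theta') = distance / r.
                    cos_cos = distance * distance / r2
                    total += cos_cos / (math.pi * r2) * cell_area * cell_area
    return total / (side * side)     # division by the area A_i

if __name__ == "__main__":
    print(form_factor_parallel_squares())
```

For two unit squares at unit distance the estimate is close to 0.2, that is, roughly one fifth of the radiosity leaving one patch arrives at the other.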

The radiosity equation is a linear equation system with the unknowns $B_i$, $i = 1, \ldots, n$.
The system can be written in an equivalent matrix form as







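$$\begin{pmatrix} 1 - \rho_1 F_{11} & -\rho_1 F_{12} & \cdots & -\rho_1 F_{1n} \\ -\rho_2 F_{21} & 1 - \rho_2 F_{22} & \cdots & -\rho_2 F_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ -\rho_n F_{n1} & -\rho_n F_{n2} & \cdots & 1 - \rho_n F_{nn} \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{pmatrix} = \begin{pmatrix} E_1 \\ E_2 \\ \vdots \\ E_n \end{pmatrix} \qquad (3.59)$$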

Any known method for solving linear equation systems (e.g. Gaussian elimination)
can theoretically be used to solve this equation system. The main practical
difficulties are the size of the system (very large models consist of millions of
patches) and the computation of the matrix elements (the computation of the
form factors $F_{ij}$).
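The following Python sketch solves a tiny instance of the radiosity equation (Equation 3.58). Instead of Gaussian elimination it uses a simple fixed-point (Jacobi-style) iteration, which is a substitution made here for brevity; the iteration converges when every $\rho_i < 1$ and the form factors of each row sum to at most 1. The function name, the two-patch scene and the iteration count are assumptions made for this illustration.

```python
def solve_radiosity(E, rho, F, iterations=100):
    """Iterate B <- E + rho * (F B) for the patch radiosities B.

    E   : list of patch exitances E_i
    rho : list of diffuse reflection coefficients rho_i
    F   : matrix (list of rows) of form factors F_ij
    """
    n = len(E)
    B = list(E)                      # start from the emitted light only
    for _ in range(iterations):
        B = [E[i] + rho[i] * sum(F[i][j] * B[j] for j in range(n))
             for i in range(n)]
    return B

if __name__ == "__main__":
    # Two patches facing each other; only the first one emits light.
    print(solve_radiosity([1.0, 0.0], [0.5, 0.5], [[0.0, 0.5], [0.5, 0.0]]))
```

In this toy scene the exact solution is $B_1 = 16/15$ and $B_2 = 4/15$, which can be verified by substituting into Equation 3.58.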

Once the radiosity equation has been solved, Equation 3.28 is used to compute the
picture. The patch radiosities are stored, therefore the model can quickly be
rendered for many different cameras without the need to recompute the illumination.

Remark. A simplification of the potential equation leads to a linear equation
system similar to that of Equation 3.58. The unknowns in this case are “diffuse
patch potentials” (a “diffuse patch potential” is a counterpart of radiosity). After

the modified equation system has been solved, pictures of the model viewed by

one camera can quickly be rendered for different lighting conditions (for different

sets of light sources) using Equation 3.32.

The knowledge of both the patch radiosities and the “diffuse patch potentials”

allows for a quick rendering of the model for different cameras under different

lighting conditions. The preprocessing phase in this case consists of the solving

of two (similar) linear equation systems.

Furthermore, the two equation systems (the radiosity equation and the “diffuse
potential equation”) can be combined in order to speed up the rendering for
one camera and for one set of light sources. [SAS92], [SP94], [BW95] •

3.5 Conclusions

A realistic simulation of global illumination is a challenging problem. An accurate

simulation of the underlying physics on an atomic level is computationally not

feasible for the purpose of the illumination of large-scale scenes. The state of the


art mathematical formulation of the global illumination problem involves two

adjoint Fredholm integral equations of the second kind: radiance and potential

equations. These equations can be extended in order to include further light

phenomena without destroying their structure.

The methods which solve the global illumination equations can be divided

into two categories. The methods in the first category attempt to directly solve

the equations, without further approximations. The essence of these methods is a

Monte Carlo integration in high dimensions which at least provides probabilistic

guarantees of the accuracy of the computed images. The methods in the second
category make additional approximations. Two popular examples of such methods
are ray tracing and radiosity. Ray tracing is more flexible than radiosity as it
does not make approximations on the modeling level and can be extended in order
to compute the full global illumination with no approximations. Also, some light
phenomena which are not covered in the radiance and potential equations (for
instance participating media) can be incorporated into a ray tracing algorithm
much more easily than into a radiosity algorithm.

Some textbooks on computer graphics make a distinction between “view-dependent”
and “view-independent” methods. (Note that the illumination never
depends on the camera, hence the quotation marks.) From a theoretical point
of view, it is not very important whether a method explicitly stores the

computed illumination (“view-independent”) or only computes the information

which is necessary in order to render a picture viewed by the camera (“view-

dependent”). However, the explicit storage of illumination may consume a large

amount of memory and may invalidate error bounds given by the numerical meth-

ods which are used to solve a rendering equation.

Contemporary 3D standards do not reflect the requirements of global illumination
algorithms. Commercial modeling programs use proprietary 3D formats
which are incompatible with other modeling programs. The 3D models can be
exported into one of the existing open formats such as VRML [ISO97] but this
leads to a loss of information in the 3D model. For instance, it is paradoxical
that although practically no modeling system internally works with triangles (human
3D artists do not model a surface using non-overlapping triangles, they use
Constructive Solid Geometry instead), the only representation which can be
exported is a triangle mesh. The lack of a portable 3D standard is in our opinion a
very serious problem because imperfections in input data cause imperfections in
rendered images. We sketch a solution to this problem in Section 6.1.

Chapter 4

Ray tracing

The ray tracing method (also referred to as eye-ray tracing) computes an image
of a 3D scene by recursively tracing rays from the eye through the pixels of
a virtual screen into the scene, summing the light path contributions to pixels’
colors. The main idea behind this method is to only follow those photon paths
which contribute to the image seen by the camera. The basic ray tracing algorithm
was proposed by Whitted in 1980.¹ [Whi80] This algorithm is still very
popular in real-world rendering systems. It can serve as a basis for all direct
methods which are introduced in Section 3.4 and also, perhaps less apparently,
for the radiosity method which we discuss in Chapter 5.

This chapter focuses on the basic ray tracing algorithm which only traces
rays in the directions of perfect specular reflection and perfect specular refraction
in order to evaluate the higher-order integrals of the radiance equation (see
Section 3.4.2). We present the existing optimisation techniques which accelerate
the computation of the function $RT(x, \omega)$ (which is defined in Section 3.1).
[Gla89] In spite of these optimisation techniques, sequential computation times
can range from minutes to hours. A parallelisation of the algorithm is therefore
very desirable.

Many research papers on parallel ray tracing are only interested in the performance
of the algorithms. Software engineering issues are often ignored. These
include questions such as:

• Can the parallel algorithm be easily integrated into an existing sequential
code?

• Will it be possible to continue the development of the sequential code without
the need to reimplement the parallel version?

• Which existing sequential optimisation techniques can be reused in the
parallel version (and which cannot)?

¹ The idea of tracing rays from the camera to the 3D scene was first described in [App68] in
the context of hidden-surface removal.

We keep these questions in mind throughout this chapter. The parallelisation
method which we propose is based on the method of Green and Paddon. [GP89],
[Gre91] We propose a screen subdivision algorithm which improves on the
existing screen subdivision algorithms. We show that false conclusions may be
drawn if the performance of the parallel algorithms is only compared empirically
and if active polling is used in the underlying communication library.

4.1 The basic ray tracing algorithm

The basic ray tracing algorithm (sometimes referred to as backwards ray tracing
or eye ray tracing) solves the radiance equation 3.30. [Whi80], [Gla89] Instead
of computing the radiance $L(x, \omega)$ at all surface points $x$ and all directions
$\omega$, ray tracing only computes the radiance function at points and directions
which contribute to the integral of Equation 3.28 (which corresponds to the image
viewed by the camera). The computed radiance values are typically not
permanently stored. Ray tracing traces rays through the camera pixels in order
to compute the integral of Equation 3.28. These rays are called primary rays.
The direct lighting contribution to the radiance function is computed by the ray
tracing shader at the closest intersection points between these rays and the 3D
surfaces in the direction to the camera (the computation of the closest intersection
points is equivalent to the evaluation of the function RT which is defined in
Section 3.1). At these intersection points new rays are generated. The basic ray
tracing algorithm only generates two secondary rays, in the directions of perfect
specular reflection and perfect specular transmission. These secondary rays are
recursively traced until the direct lighting in the new intersection points has no
significant contribution to the integral of Equation 3.28 (or until a user-defined
recursion limit is reached).
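The recursion described above can be sketched in a few lines of Python. This is a deliberately reduced illustration and not Whitted's full algorithm: the scene consists of spheres only, the shader's direct lighting term is replaced by a constant `local` value per sphere, only the reflected secondary ray is traced (the transmitted ray is omitted for brevity), and all names and the dictionary layout are assumptions made for this example.

```python
import math

def intersect_sphere(origin, direction, centre, radius):
    """Smallest t > 0 at which the ray hits the sphere, or None.

    `direction` is assumed to have unit length.
    """
    oc = [o - c for o, c in zip(origin, centre)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > 1e-6:                 # ignore hits at the ray origin itself
            return t
    return None

def trace(origin, direction, spheres, depth=0, max_depth=5):
    """Radiance along one ray: local term plus the recursively traced reflection."""
    if depth >= max_depth:           # user-defined recursion limit
        return 0.0
    hit = None
    for s in spheres:                # closest intersection, i.e. the function RT
        t = intersect_sphere(origin, direction, s["centre"], s["radius"])
        if t is not None and (hit is None or t < hit[0]):
            hit = (t, s)
    if hit is None:
        return 0.0                   # the ray leaves the scene
    t, s = hit
    point = [o + t * d for o, d in zip(origin, direction)]
    normal = [(p - c) / s["radius"] for p, c in zip(point, s["centre"])]
    radiance = s["local"]            # stand-in for the direct lighting term
    if s["mirror"] > 0.0:            # secondary ray in the mirror direction
        dn = sum(d * n for d, n in zip(direction, normal))
        reflected = [d - 2.0 * dn * n for d, n in zip(direction, normal)]
        radiance += s["mirror"] * trace(point, reflected, spheres, depth + 1, max_depth)
    return radiance
```

A primary ray would be generated for each pixel and passed to `trace` with `depth=0`; the recursion stops either at the depth limit or when a ray hits a surface with no specular component.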

The computation of the direct illumination contributions in the intersection
points involves the generation of so-called shadow rays (the dashed rays in
Fig. 3.6). The shadow rays sample the directions from the points on the surfaces
of light sources to the intersection point in order to compute the direct illumination
terms of the integrals of Equation 3.25 (the Phong illumination model). The
basic ray tracing algorithm usually assumes the use of point light sources only,
which reduces the number of the shadow rays to the number of light sources for
each intersection point.


4.2 Sequential optimisation techniques

Whitted already identified the repeated evaluation of the function $RT(x, \omega)$
(which returns the closest intersection of the ray starting at point $x$ in direction
$\omega$ with the 3D scene) as by far the most expensive activity (ca. 90% of the
processing time) in the ray tracing algorithm. [Whi80] This evaluation involves
the computation of a number of ray-object intersections. We will refer to these
intersection computations as ray tracing operations (RTOPs).

Although not all secondary rays must be generated (not all materials are
specular reflectors or transmitters), the number of RTOPs is still very high.
The following estimate can be found in [DS84]. Let $N$ denote the average number of
secondary rays spawned at an intersection point and assume that the average
depth of the ray tracing recursion tree over all primary rays is $D$. (One ray tracing
recursion tree is assigned to one pixel. The root of the ray tracing recursion tree
corresponds to the primary ray generated for that pixel. The remaining nodes
of the tree correspond to the rays generated by the ray tracing recursion.) The
average number of nodes in one tree is then equal to

$$\sum_{i=1}^{D} N^{i-1} = \frac{N^D - 1}{N - 1}$$

Let $L$ denote the average number of shadow rays over all intersection points (if
we assume that the scene only contains point light sources, then $L$ is equal to
the number of point light sources). Let $W$ denote the number of primary rays ($W$
is equal to the number of camera pixels) and $X$ the number of objects in the
scene (we assume here that objects are geometric primitives). Then the total
number of RTOPs is equal to

$$\#\mathrm{RTOP} = W \cdot X \left( \frac{N^D - 1}{N - 1} \right) (1 + L)$$

For a realistic setting $W = 720 \times 576$, $X = 1000$, $L = 2$, $N = 1.2$, $D = 5$
the total number of RTOPs is approximately equal to $9.26 \cdot 10^9$. This section
describes some of the techniques which reduce this number.
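The estimate above is easy to reproduce. The following Python sketch evaluates the formula for the number of RTOPs with the parameter values quoted in the text; the function name is an illustrative choice.

```python
def rtop_estimate(W, X, L, N, D):
    """Estimated number of ray-object intersection tests (RTOPs).

    W: primary rays, X: objects, L: shadow rays per intersection point,
    N: average number of secondary rays per intersection point,
    D: average depth of the recursion tree.
    """
    # Average number of nodes in one recursion tree: 1 + N + ... + N^(D-1).
    nodes = sum(N ** (i - 1) for i in range(1, D + 1))
    return W * X * nodes * (1 + L)

if __name__ == "__main__":
    total = rtop_estimate(W=720 * 576, X=1000, L=2, N=1.2, D=5)
    print(f"{total:.3e}")
```

With these values one recursion tree has about 7.44 nodes on average, so each pixel costs roughly 22 rays, each of which is tested against all 1000 objects.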

4.2.1 Bounding volumes

Bounding volumes are a simple optimisation technique. They do not
actually reduce the number of ray tracing operations. The idea behind bounding
volumes is to replace many expensive ray tracing operations with less expensive
ones. Each finite object is enclosed in a volume for which the computation of
the intersections with a ray is much cheaper than the computation
of the intersections of the ray with the enclosed object. The intersections with
the object must only be computed for those rays which intersect the object’s
bounding volume.

Good candidates for bounding volumes are boxes and spheres, for which the
intersection calculations are very fast. [RW80] Bounding volumes can be computed
automatically in the preprocessing phase of the ray tracing algorithm. The more
tightly the bounding objects enclose the original objects, the more computational
effort is saved during the ray tracing algorithm.
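A minimal Python sketch of the idea, assuming bounding spheres: the cheap sphere test is run first and the (possibly expensive) exact intersection routine of an object is only called when the ray hits the object's bounding sphere. The representation of objects as `(centre, radius, exact_test)` triples and all function names are assumptions made for this illustration.

```python
def hits_bounding_sphere(origin, direction, centre, radius):
    """True if the ray origin + t * direction, t >= 0, meets the sphere.

    `direction` is assumed to have unit length.
    """
    oc = [c - o for o, c in zip(origin, centre)]
    oc2 = sum(v * v for v in oc)
    t_closest = sum(d * v for d, v in zip(direction, oc))
    dist2 = oc2 - t_closest * t_closest   # squared distance from centre to the line
    if dist2 > radius * radius:
        return False
    # Closest approach lies ahead of the origin, or the origin is inside the sphere.
    return t_closest >= 0.0 or oc2 <= radius * radius

def intersect_with_bounds(origin, direction, objects):
    """Run the exact test only for objects whose bounding sphere is hit.

    Returns the list of exact results and the number of exact tests performed.
    """
    hits, exact_tests = [], 0
    for centre, radius, exact_test in objects:
        if hits_bounding_sphere(origin, direction, centre, radius):
            exact_tests += 1
            result = exact_test(origin, direction)
            if result is not None:
                hits.append(result)
    return hits, exact_tests
```

The total number of tests does not decrease (every object still costs one sphere test), but most of the expensive exact tests are replaced by cheap ones.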

4.2.2 Bounding slabs

A very important technique for the reduction of ray tracing operations is the use of
bounding slabs, a space subdivision hierarchy. The idea behind this technique is the
construction of a tree (or, more generally, a DAG) which subdivides the 3D space
containing the scene into a hierarchy of non-overlapping volumes. The leaves of
this tree are either objects or their bounding volumes. [RW80]

When intersections of a ray with the scene are to be computed, the ray is first

tested for an intersection with the root of the tree (the volume which contains all

the objects). If the ray intersects this volume, the successors of the root node are

recursively tested for intersections. The recursion is terminated when either the

ray does not intersect any of the current node’s successors or the current node

is a leaf and the intersections for all objects comprised in the volume have been

computed.
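The traversal described above can be sketched as follows in Python. The sketch assumes axis-aligned boxes as node volumes and represents scene objects as callables which return the hit distance or `None`; the `Node` class, the function names and this object interface are hypothetical and serve only as an illustration.

```python
import math

def hits_box(origin, direction, box_min, box_max):
    """Slab test: does the ray origin + t * direction, t >= 0, pass through the box?"""
    t_near, t_far = 0.0, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:                 # ray parallel to this pair of planes
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

class Node:
    """A volume of the hierarchy with child volumes and/or objects."""
    def __init__(self, box_min, box_max, children=(), objects=()):
        self.box_min, self.box_max = box_min, box_max
        self.children, self.objects = children, objects

def closest_hit(node, origin, direction):
    """Smallest hit distance found in the subtree of `node`, or None."""
    if not hits_box(origin, direction, node.box_min, node.box_max):
        return None                  # the whole subtree is skipped
    candidates = [obj(origin, direction) for obj in node.objects]
    candidates += [closest_hit(child, origin, direction) for child in node.children]
    candidates = [t for t in candidates if t is not None]
    return min(candidates) if candidates else None
```

Objects in subtrees whose volume is missed by the ray are never tested, which is where the reduction of ray tracing operations comes from.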

The theoretical upper bound on the cost of the tree traversal is $O(\sqrt[3]{N})$ for any
balanced subdivision tree with the number of leaves equal (or proportional) to
the number of objects. [RKJ98] Even though the worst case rarely occurs in
practice, the use of bounding slabs involves a tradeoff. A bad situation
occurs when a node of the tree corresponds to a large volume which contains
several small objects (or one small object). In this case the cost of the tree
traversal may be higher than the cost of directly intersecting the objects’
bounding boxes.

It is important that the bounding slabs can be constructed automatically in

the preprocessing of the ray tracing algorithm. Non-uniform space subdivision

based on BSP trees [Kap85] or octrees [Gla84] adapts better to general scenes

than uniform space subdivision [FTI86]. A hybrid spatial subdivision is described

in [CDP95].

Remark. It makes sense for some object types (e.g. triangle meshes) to build a
local subdivision tree for the object’s volume. This speeds up the local intersection
calculations and does not influence the traversal of the main tree. •


4.2.3 Light buffers

Light buffers are a technique which is specific to the reduction of ray tracing

operations for shadow rays. [HG86] In order to determine whether an intersection

point is in a full shadow in a given direction with respect to a light source, it is not

necessary to compute all the intersections along the direction. The light source

is surrounded by a cube the surface of which is discretised into cells. (The choice

of the number of the cells only influences the performance of this technique, not

its correctness.) Bounding boxes of all objects are projected onto this cube in the
preprocessing step and a list is created for each cell which contains the pointers
to the bounding boxes which hit the cell. This list of objects’ bounding boxes is
sorted by their distances from the light source.

In order to determine whether a given point lies in a full shadow with respect

to the light source, the intersections of the ray from the light source to the given

point are only computed with the objects which are stored in the list of the

cell through which the ray passes. This computation is terminated when either

an intersection of the ray with an opaque object is found, or the distance of

the next stored object is greater than the distance from the light source to the

given point. (Note that if the given point is not in a full shadow with respect to

the light source, the algorithm returns the list of intersections between the light

source and the point. This information is needed by the ray tracing shader in

order to approximate the attenuation of the emitted radiance with respect to the

given point.)
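The shadow test with early termination can be sketched as follows in Python. The sketch assumes that the list of the cell through which the shadow ray passes has already been looked up and that it holds `(distance, object, opaque)` triples sorted by distance from the light source; this data layout, the function name and the `ray_hits` callback are hypothetical simplifications of the technique from [HG86].

```python
def in_full_shadow(cell_list, ray_hits, point_distance):
    """Shadow test for one light source using a light buffer cell list.

    cell_list      : (distance_from_light, object, opaque) triples, sorted by distance
    ray_hits       : callable deciding whether the shadow ray hits the given object
    point_distance : distance from the light source to the shaded point

    Returns (True, []) if an opaque object blocks the light, otherwise
    (False, hits) where hits lists the non-opaque objects on the way.
    """
    hits = []
    for distance, obj, opaque in cell_list:
        if distance > point_distance:
            break                    # all remaining objects lie behind the point
        if ray_hits(obj):
            if opaque:
                return True, []      # full shadow, no further tests needed
            hits.append(obj)         # needed later to attenuate the radiance
    return False, hits
```

Both termination conditions from the text appear in the loop: the first opaque hit and the first stored object which is farther away than the shaded point.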

4.3 Persistence of Vision Ray Tracer

It is not particularly difficult to implement a sequential ray tracer. However, it is

not easy to implement a good ray tracer because of technical pitfalls which arise

when several optimisation techniques are combined together and when further

extensions have to be built in. The freeware (sequential) Persistence of Vision

Ray Tracer [PT] is state of the art for several reasons:

• POV-Ray includes all important existing ray tracing optimisation techniques:
bounding volumes, bounding slabs (a space subdivision hierarchy), light buffers
and a vista buffer.

• POV-Ray supports a variety of geometry primitives, light source types,
cameras and materials. Constructive Solid Geometry is used in order to
create more complex objects (see Section 3.2.2).

• The scene description language is a macro language which adds power to
the CSG modeling.

• The implementation is portable. POV-Ray has been ported to practically
all existing platforms. It does not rely on any graphical interface (although
it is possible to use one) or external libraries which may cause a loss of
portability.

• Although the program is relatively large (ca. 100000 lines of ANSI C code),
the object-oriented way of coding makes it robust and extensible. The source
code is freely available.

• POV-Ray implements several extensions to the basic ray tracing algorithm.
One of these extensions is the computation of the indirect diffuse illumination
using distributed ray tracing (see Section 3.4.1). The implementation is
based on the algorithm proposed by Ward, Rubinstein and Clear. [WRC88]

• POV-Ray is used and supported by many people. Most of the contributions
to the Internet Ray Tracing Competition (IRTC) use POV-Ray as the final
rendering system. [IRT]

POV-Ray has been developed by the POV-Team, a group of volunteer programmers.
The original implementation of POV-Ray (version 0.5, released in 1991)
was based on DKBTrace by David Kirk Buck. The current official version is 3.5.
However, version 3.5 departs from the original principles of POV-Ray. In particular,
the implementation of caustics in version 3.5 is not only far from perfect
but also enormously increases the complexity of the implementation. [Lüc03]
In our opinion, the last extensible official version of POV-Ray is 3.1g, which we
chose for our parallelisation and experiments.

4.4 Parallel ray tracing

Unless stated otherwise, we assume throughout this section that message passing

is used for the communication between parallel processes.

4.4.1 Existing approaches

Parallel ray tracing algorithms can be roughly divided into three classes [Gre91]:

• Image space subdivision (or screen space subdivision) algorithms exploit the
fact that the primary rays sent from the camera through the pixels of the
virtual screen are independent of each other. Tracing of primary rays can
run in parallel without any communication between processes. The problem
of an unequal workload in processes must be considered. Another problem
arises when rendering large scenes: a straightforward parallelisation requires
a copy of the whole 3D scene to be stored in the memory of each process.
On the other hand, these algorithms are usually easy to implement.


• Object space subdivision algorithms geometrically divide the 3D scene into
disjoint regions which are distributed in the processes’ (processors’) memories.
The computation begins with the passing of the primary rays to the processes
storing the regions through which the primary rays pass first. The rays are
then recursively traced by the processes. If a ray leaves a process’s region,
it is passed to the process which stores the adjacent region (or discarded if
there is no adjacent region in the ray’s direction). An advantage of object
space subdivision algorithms is that the maximum size of the rendered 3D
scene is theoretically unlimited because it depends only on the total memory
available in all processes. Potential problems are an unequal workload and
heavy communication between processes. Moreover, an implementation
of these algorithms may be laborious.

• Functional decomposition and hybrid algorithms extend the data-driven
approach of object space subdivision algorithms with additional demand-driven
tasks in order to achieve a better load balance.

We will briefly present the works which belong to the last two classes. (We

recommend [RCJ98] and [CDR02] for further reading.) Then we will return to

image space subdivision which is the base of our parallelisation.

Object space subdivision algorithms

One of the first parallel algorithms was proposed in [DS84]. The algorithm is
based on a geometrical subdivision of 3D space into convex 3D regions. Each
region is assigned to one process. Rays are traced by processes and when a ray
leaves the region assigned to the process, it is passed to the neighbour which
maintains the region in the direction of the ray. The authors give a theoretical
estimate of the speedup, $O(\sqrt[3]{S^2})$, where $S$ is the number of regions. (An analysis
of the object space subdivision is given in [CWBV85] for an empty scene.) The
authors also propose to adjust the boundaries of the regions at run time in order
to achieve a better balance. This algorithm has never been implemented.

The cost prediction given in [RKC98] assumes an octree spatial
subdivision (see Section 4.2.2) and predicts the costs of parallel ray tracing for voxels
of the octree. This prediction can be useful for a balanced static assignment of
objects to processors, e.g. in image space subdivision algorithms which work with
a distributed object database (see Section 4.4.4).

The use of pyramidal-shaped regions is proposed in [BP88] and [PB89]. The

pyramidal regions begin in the eye point and ensure that primary rays never leave

their initial regions, which saves some communication between processes.

A mapping of 3D regions onto a hypercube is given in [KNS87]. The idea is
to achieve an efficient implementation of object space subdivision on a hypercube
multiprocessor architecture. In [KNK+88], a load balancing strategy is described
in which each 3D region is maintained by a cluster of several processes. This
allows for parallel intersection calculations inside one region.

The idea behind Jevans’ work [Jev89] is to immediately send a prolonged ray
to the neighbouring process which maintains the next region, before computing
the intersections of the ray with the current region. If the ray has an intersection
in the current region, the results of the neighbour’s computations are cancelled.

In [Pit93], an implementation of an object space algorithm is described. The
experiments showed that the fine-grain strategy leads to a low (56%) efficiency
and that the regular 3D space subdivision which was used in the implementation
leads to a work imbalance. Light sources cause a high load in the regions
which contain them.

Functional decomposition and hybrid algorithms

The algorithm described in [SC88] uses a functional decomposition in order to
parallelise ray tracing. Each process stores a copy of the entire scene together with
a bounding volume hierarchy (see Section 4.2.2). The tree of bounding volumes
is divided into an upper and a lower part. All processes perform the intersection
computations with the upper part of the tree for different rays in parallel. However,
an overloaded process does not perform the intersection computations with
the lower part of the tree. Instead, it sends this work to an underloaded
process. Experimentally measured efficiencies range from 68% to 88%.

The algorithm presented in [RC97] is based on object space subdivision (the
scene is distributed in the processes’ memories). This algorithm distinguishes between
several task types (such as shading tasks and intersection calculation tasks). Some
of these tasks are generated on the fly and, as not every task can be performed by
any process, it is impossible to predict the loads in the processes. Other tasks are
generated by a master process. Whenever a process is idle (that is, its queue
of external requests is empty), it sends a work request to the master process. A
similar approach is described in [NL96].

An algorithm which uses functional decomposition in order to make use of

programmable hardware is proposed in [PBMH02]. (This work assumes a triangle

representation of objects.)

4.4.2 Image space subdivision

The computations on the primary rays (pixels of the virtual screen) are
independent of each other, which suggests an assignment of screen areas to parallel
processes. The processes which perform the computations on non-overlapping
screen areas therefore do not need to communicate at all. However, there are
several additional issues which must be considered in order to make this approach
efficient and general:


1. The computational times for different primary rays are not equal and they
cannot be reliably predicted before the computations have actually been
performed.² It must also be said that even though the computations on
primary rays are independent of each other, there is a coherence between
primary rays which are close to each other in the image space.

2. The sum of computational times for all primary rays is much greater than
the computational time for any single primary ray.

3. The communication between processes cannot be neglected even if the
messages exchanged are very short.

4. Some sequential optimisation techniques such as saving of the primary rays
in anti-aliasing schemes do lead to dependencies between pixels.

5. The parallelisation should assume that the complete scene description does
not fit into the memory of each process.

The first three issues are dealt with in this section. The problem of memory
limitations is addressed separately, as suggested in [GP89], [Gre91].
Section 4.4.4 is devoted to the design of a distributed database.

A static subdivision of the image space is proposed in [Woo84]. This static
subdivision scheme assigns the same number of pixels to each process. The pixels
assigned to one process are spaced at regular intervals across the screen.
Woodwark’s work assumes that the whole scene description fits into the memory of
each process.

Theoretical as well as experimental arguments for why any static subdivision
scheme is inadequate for achieving a good efficiency (over 90%) are given
in [HA98]. The granularity of screen subdivision is not fine enough to yield the
desired efficiency when the law of large numbers is applied.

Several works use chunking (which is sometimes referred to as tiling) in order
to distribute the work among processes. Chunks are screen regions of a constant
size which are assigned to idle processes on demand by a central master process.
[CT96], [GP89], [Gre91], [BBP94], [FFB99], [FHK97], [KH95]³
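The difference between a static interleaved assignment and demand-driven chunking can be illustrated by a small simulation in Python. The sketch is a simplification under stated assumptions: the cost of every pixel is known to the simulation (but not to the scheduler), communication costs are ignored, and the function names and the example costs are made up for this illustration.

```python
import heapq

def static_makespan(costs, workers):
    """Parallel time when pixel i is statically assigned to worker i mod workers."""
    loads = [0.0] * workers
    for i, cost in enumerate(costs):
        loads[i % workers] += cost
    return max(loads)

def chunking_makespan(costs, workers, chunk_size):
    """Parallel time when a master hands the next chunk to the first idle worker."""
    free_at = [0.0] * workers        # a list of equal values is a valid heap
    for start in range(0, len(costs), chunk_size):
        t = heapq.heappop(free_at)   # the worker which becomes idle first
        heapq.heappush(free_at, t + sum(costs[start:start + chunk_size]))
    return max(free_at)

if __name__ == "__main__":
    costs = [9, 1, 9, 1, 9, 1, 1, 1]  # expensive pixels at regular intervals
    print(static_makespan(costs, 2), chunking_makespan(costs, 2, 1))
```

In this example the static scheme happens to place all expensive pixels on one worker, whereas on-demand assignment keeps both workers busy for most of the time. Smaller chunks balance the load better but require more messages between the master and the workers, which the simulation does not account for.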

We found only one paper on screen space subdivision which reports experience
with work stealing. [BBP94] The processes are connected to a logical ring.
Initially, each process is assigned an equal amount of work. When a process finishes
a job, it sends a job request to the ring. If the same job request returns from the
other side of the ring without having been satisfied, then the process knows that
it can terminate. Otherwise the process gets a new screen part to compute.

² Some results concerning cost control rather than cost estimation for primary rays are given
in [CC02]. This cost control is based on the psychological observation that different pixels in
the computed image have a different visual importance when the image is perceived by a human
observer.

³ The work by Keates makes use of hardware-supported shared memory in order to address
the scene storage problems in the processes.

The abstract model for scheduling parallel loops [KW85], [FHSF91], [FHSF92], [FHBWW95], [FHSUW96] is very similar to screen space subdivision of parallel ray tracing (assuming enough memory to store the whole scene in the memories of all processes) in the sense that there is a static pool of tasks which can be computed in parallel but for which the computational times cannot be predicted. The cited papers give an analysis of factoring (also referred to as fractiling), which is similar to chunking except that the chunk size is gradually decreased. The latter four papers propose a halving of the chunk size and compare such a factoring with a uniform size chunking. Assuming a normal probability distribution of the tasks' computational times, an analysis of the expected imbalance is given. The problem with these results is that the halving of the chunk size does not lead to a perfect balance of load. The knowledge of the expected imbalance does not improve the actual parallel time.

The algorithm which we propose below is a generalisation of the previous ap-

proaches. The uniform size chunking and the factoring into halves are special

cases of our algorithm. The parameters of our algorithm are intuitive. We can

characterise the setting of the parameters which yields a perfect load balance (while

minimising the communication costs). The only remaining question is how to find

this setting.

Perfect load balancing algorithm

We will assume a farming model which consists of one master process and N

worker processes (see Fig. 4.1). The master process assigns non-overlapping

screen areas to workers, collects the results from the workers and updates the

frame buffer (the image which is being computed). A worker is initially waiting

until it receives a job from the master. Then the worker traces the primary rays

for the given area, returns the computed subimage to the master and waits for

another job.

The farming scheme can be extended with a load balancing process (load-

balancer), which takes over one of the master’s responsibilities—the distribution

of jobs to workers (see Fig. 4.1). The only information which is needed by the

loadbalancer process is the size of the image (this information is passed from

the master to the loadbalancer at the very beginning of the computation). The

introduction of the loadbalancer process effectively reduces the idle times in the

worker processes. When a worker becomes idle (or shortly before it becomes idle), it sends a job request to the loadbalancer and the computed subimage to the master [Pla98].

The problem is to determine an appropriate granularity of the assigned parts. The choice of granularity influences the load balance and the number of exchanged messages. There are two extreme cases:

1. The assigned parts are minimal (pixels). In this case the load is balanced perfectly but the number of work requests is large (equal to the number of pixels, which is usually much greater than the number of workers).

2. The assigned parts are maximal (the whole image is partitioned into as many parts as the number of workers). In this case the number of work requests is low but the load imbalance may be great.

Fig. 4.2 depicts these two extreme cases.

Figure 4.1: Left: A process farm. Right: A process farm extended with a load balancing process (diagram: a MASTER process connected to WORKER 1, WORKER 2, ..., WORKER N)

Figure 4.2: The extreme cases of chunking. Left: Minimal chunks. Right: Maximal chunks

Let W denote the total number of atomic parts (e.g. image pixels or image columns), let N denote the number of workers (a homogeneous parallel machine is assumed). The task of a load balancing algorithm is to compute the W atomic parts on N workers in the shortest possible parallel time. Let us assume that a minimal constant T is known which bounds the maximal ratio of the computational times on any two atomic parts (T ≥ 1.0):

\[ \frac{\text{processing time on part 1}}{\text{processing time on part 2}} \le T \qquad (4.1) \]

If this assumption holds, then the algorithm in Fig. 4.3 is perfect in the sense that it guarantees a perfect load balance (the maximal imbalance is not larger than the processing time of the atomic job with the largest processing time). At the same time, the number of work requests is minimal [Pla02a].⁴

loadbalancer(float T, int W, int N)
    int part_size;
    int work = W;
    while (work > 0)
        part_size = max(1, ⌊work/(1 + T·(N−1))⌋);
        for (counter = 0; counter < N; counter++)
            wait for a work request from an idle worker;
            if (work > 0)
                send job of size part_size to the worker;
                work = work − part_size;
    collect work requests from all workers;
    send termination messages to all workers;

Figure 4.3: The perfect load balancing algorithm (used in the loadbalancer process)
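The sequence of job sizes produced by the loop of Fig. 4.3 can be reproduced with a short Python sketch. This is an illustration only, not the implementation used in this work: the generator name `part_sizes` is hypothetical, message passing is left out, and clamping the last job to the remaining work is an assumption of the sketch.

```python
import math


def part_sizes(T, W, N):
    """Yield the job sizes handed out by the loop of Fig. 4.3.

    T: assumed bound on the ratio of processing times of two atomic parts
    W: total number of atomic parts
    N: number of workers
    """
    work = W
    while work > 0:
        # One round: the part size is fixed at the start of the while-loop.
        part_size = max(1, math.floor(work / (1 + T * (N - 1))))
        for _ in range(N):
            if work <= 0:
                break
            # Clamping to the remaining work is an assumption of this sketch.
            size = min(part_size, work)
            yield size
            work -= size


# Example: W = 8 atomic parts, N = 2 workers, T = 3.
example_sizes = list(part_sizes(3, 8, 2))
```

In this example the first round hands out two jobs of size 2 and the remaining rounds hand out jobs of size 1, so the large jobs are assigned first and the small ones are left for the end.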

Claim. The algorithm in Fig. 4.3 always assigns as much work as possible to

idle workers, while still ensuring the best possible load balance.

Proof. The algorithm works in rounds, one round being one execution of the while-loop. In the first round the algorithm assigns image parts of size

\[ s_{max} = \max\left(1, \left\lfloor W/(1 + T\cdot(N-1)) \right\rfloor\right) \]

(measured in the number of atomic parts). In each of the following rounds the parts are smaller than in the previous round. Obviously, the greatest imbalance is obtained when a processor p_max computes a part of the size s_max from the first round as long as possible (whereby the case of s_max = 1 is trivial and will not be considered here) and all the remaining N−1 processors compute the rest of the image as quickly as possible (in other words, the load of the remaining N−1 processors is perfectly balanced). The number of parts computed in parallel by all processors except p_max is W − s_max. The ratio of the total workload (in terms of the number of processed atomic parts) of one of the N−1 processors (let p_other denote the processor and let s_other denote its total workload) and s_max is then

\[ \frac{s_{other}}{s_{max}} = \frac{\;\frac{W - s_{max}}{N-1}\;}{s_{max}} = \frac{\;\frac{W - \lfloor W/(1 + T\cdot(N-1)) \rfloor}{N-1}\;}{\lfloor W/(1 + T\cdot(N-1)) \rfloor} \]

⁴ A similar algorithm was independently published in [PMTR95].

This ratio is greater than or equal to T, because s_max ≤ W/(1 + T·(N−1)) implies W − s_max ≥ T·(N−1)·s_max. This means that the processor p_other does at least T times as much work (in atomic parts) as the processor p_max in this scenario. From this and from our assumptions about T and about the homogeneity of processors it follows that the processor p_max must finish computing its part from the first round at the latest when p_other finishes its part from the last round. Hence, a perfect load balance is achieved even in the worst case scenario.

It follows directly from the previous reasoning that the part sizes s_max assigned in the first round cannot be increased without affecting the perfect load balance. (For part sizes assigned in the following rounds a similar reasoning can be used, with a reduced image size.) This proves the optimality of the above algorithm. •

Claim. The number of work requests (including final work requests that are not

going to be fulfilled) in the algorithm in Fig. 4.3 is equal to

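\[ N\cdot(r+1) + \left\lceil W\cdot\left(1 - \frac{N}{1 + T\cdot(N-1)}\right)^{r} \right\rceil \]

where

\[ r = \max\left(0, \left\lfloor \log_{1 - \frac{N}{1 + T\cdot(N-1)}} (N/W) \right\rfloor\right) \]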

Proof. It is easy to observe that

\[ \frac{N\cdot W}{1 + T\cdot(N-1)} \cdot \left(1 - \frac{N}{1 + T\cdot(N-1)}\right)^{i-1} \]

atomic parts get assigned to workers during the i-th execution of the while-loop and that

\[ W\cdot\left(1 - \frac{N}{1 + T\cdot(N-1)}\right)^{i} \]

atomic parts remain unassigned after the i-th execution of the while-loop.

r is the total number of executions of the while-loop minus 1. The round r is the last round at the beginning of which the number of yet unassigned atomic parts is greater than the number of workers N. r can be determined from the fact that the number of yet unassigned atomic parts after r executions of the while-loop is at most N:

\[ W\cdot\left(1 - \frac{N}{1 + T\cdot(N-1)}\right)^{r} \le N \]

which yields (r is an integer greater than or equal to 0)

\[ r = \max\left(0, \left\lfloor \log_{1 - \frac{N}{1 + T\cdot(N-1)}} (N/W) \right\rfloor\right) \]


There are N work requests received during each of the first r executions of the while-loop, yielding a total of N·r work requests. These do not include the work requests received during the last execution of the while-loop. The number of work requests received during the last execution of the while-loop is equal to

\[ \left\lceil W\cdot\left(1 - \frac{N}{1 + T\cdot(N-1)}\right)^{r} \right\rceil \]

Finally, each of the workers sends one work request which cannot be satisfied. Summed up,

\[ N\cdot(r+1) + \left\lceil W\cdot\left(1 - \frac{N}{1 + T\cdot(N-1)}\right)^{r} \right\rceil \]

is the total number of work requests. •
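The closed form can be evaluated and compared against a direct count with a small Python sketch. The function names are hypothetical, T > 1 is assumed, and exact rational arithmetic is used so that the floor of the logarithm is not affected by floating-point rounding. The direct count replays the loop of Fig. 4.3 without any message passing.

```python
import math
from fractions import Fraction


def closed_form_requests(T, W, N):
    """Evaluate N*(r+1) + ceil(W * q**r) with q = 1 - N/(1 + T*(N-1))."""
    q = 1 - Fraction(N) / (1 + Fraction(T) * (N - 1))
    # For 0 < q < 1, floor(log_q(N/W)) is the largest r with W * q**r >= N;
    # the loop leaves r = 0 when W <= N, matching the max(0, ...) in the claim.
    r = 0
    while W * q ** (r + 1) >= N:
        r += 1
    return N * (r + 1) + math.ceil(W * q ** r)


def simulated_requests(T, W, N):
    """Count satisfied requests in the loop of Fig. 4.3, plus N unsatisfied ones."""
    work, requests = W, 0
    while work > 0:
        part_size = max(1, math.floor(work / (1 + Fraction(T) * (N - 1))))
        for _ in range(N):
            if work > 0:
                requests += 1
                work -= min(part_size, work)
    return requests + N


# Example with N = 2 workers and T = 3, where all divisions are exact.
closed_8, simulated_8 = closed_form_requests(3, 8, 2), simulated_requests(3, 8, 2)
closed_16, simulated_16 = closed_form_requests(3, 16, 2), simulated_requests(3, 16, 2)
```

In this sketch the two counts coincide for inputs where every division is exact, such as W = 8 or W = 16 with N = 2 and T = 3. For other inputs the floor in the part-size computation can shift the simulated count slightly.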

The perfect load balancing algorithm in Fig. 4.3 is a compromise between

the two extreme chunking cases in Fig. 4.2. The two extremes are obtained

when T→ ∞ (minimal chunks), or when T= 1 (maximal chunks), respectively.

Fig. 4.4 illustrates the work assignment for N= 2 and T= 3.





















     

6FUHHQUHVROXWLRQ

QXPEHURIDWRPLFSDUWV

1XPEHURIZRUNUHTXHVWV

ZRUNHUV

ZRUNHUV

ZRUNHUV

ZRUNHUV

7 

Figure 4.4: Left: Illustration of the work assignment in the perfect load balancing

algorithm for N= 2 and T= 3. Right: The exact number of work requests in

the perfect load balancing algorithm as a function of the number of workers and

the number of atomic parts

4.4.3 Setting of parameters in the perfect load balancing

algorithm

Two parameters must be tuned in the perfect load balancing algorithm. The first parameter is the job time ratio parameter T. The second parameter has not been introduced yet. It may be useful to pack more pixels into a single job towards the end of the algorithm because a computation on several pixels may cost much less than sending several messages instead of one. We will define M as the size of the minimal job which should be assigned to a worker. M is measured as the number of the original atomic work parts (M ≥ 1). One line in the load balancing algorithm in Fig. 4.3 will be modified:

part_size = max(1, ⌊work/(1 + T·(N−1))⌋)

will be replaced with

part_size = max(M, ⌊work/(1 + T·(N−1))⌋)

It is obvious that the chunking approaches are special cases of our algorithm. Indeed, if M is equal to the chunk size and T → ∞, then chunks of the size M will be distributed among workers on demand. The algorithm is also a generalisation of factoring, which assigns a half of the still unassigned work equally to all workers: factoring in halves is obtained by setting T = (2N−1)/(N−1) and M = 1.
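The two special cases can be checked numerically. The following Python sketch evaluates the modified part-size line for both settings. The helper name `part_size` and the example values N = 4, W = 64 and M = 5 are assumptions chosen for illustration.

```python
import math
from fractions import Fraction


def part_size(work, T, N, M=1):
    """Job size for the current round: max(M, floor(work / (1 + T*(N-1))))."""
    return max(M, math.floor(work / (1 + T * (N - 1))))


N, W = 4, 64

# Uniform chunking: with T -> infinity every job has the minimal size M.
chunk = part_size(W, math.inf, N, M=5)

# Factoring into halves: T = (2N-1)/(N-1) gives each worker work/(2N),
# so the N jobs of one round cover half of the unassigned work.
T_half = Fraction(2 * N - 1, N - 1)
half_round = N * part_size(W, T_half, N)
```

With these values `chunk` equals M and `half_round` equals W/2, which corresponds to the two special cases described above.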

The tuning of the parameters T and M should be fully automatic. The parameters can be tuned independently of each other. Unfortunately, both must be set before the computation begins and they both depend on the amount of computation, which is unknown before the computation finishes.

Setting of the atomic job size M

The parameter M controls the sizes of the smallest jobs, which are distributed in the last round of the algorithm. An overestimation of M results in a potential imbalance. The extreme setting of M = W/N yields (independently of T) a static distribution of load which is obviously the worst case in terms of load balance.

The optimal setting of M depends on the communication costs and computational times for the original jobs. If M is too small, then communication costs can dominate the computation of jobs of the size M. Moreover, the loadbalancer process can become a bottleneck when many workers send their job requests frequently. This also influences the optimal setting of M.

Our suggestion is to run the load balancing algorithm with M = 1 and adapt M according to measurements performed at run time. This involves an extension of the protocol between the loadbalancer and worker processes. The worker process can measure the computational time T_job spent on the last job and report this time to the loadbalancer together with a job request. The loadbalancer process measures the time from the moment t_start when a job was assigned to the worker to the moment t_finish when it receives another job request from that worker. The communication time T_comm (for that particular job) is then equal to T_comm = t_finish − t_start − T_job. T_comm ≥ T_job indicates that the job size in the load balancing algorithm should not be further decreased in the next rounds. (The fixation of M can be postponed e.g. until T_comm ≥ T_job is measured for all jobs which were assigned in the same round.)
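The bookkeeping on the loadbalancer side can be sketched in Python as follows. The function names and the tuple layout are hypothetical, and the sketch covers only the arithmetic of the test, not the protocol between the processes.

```python
def communication_time(t_start, t_finish, t_job):
    """T_comm = t_finish - t_start - T_job for one job, as seen by the loadbalancer."""
    return t_finish - t_start - t_job


def should_fix_M(round_jobs):
    """Decide whether the job size should stop decreasing.

    round_jobs: list of (t_start, t_finish, t_job) tuples, one per job of a round.
    Returns True when T_comm >= T_job holds for every job of that round.
    """
    return bool(round_jobs) and all(
        communication_time(s, f, j) >= j for s, f, j in round_jobs
    )


# Example: both jobs spend at least as long in communication as in computation.
fix_now = should_fix_M([(0.0, 3.0, 1.0), (0.0, 5.0, 2.0)])
```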

Remark. The constant M can also be used in the worker process in order to prefetch another job. If the worker is computing a job and detects that the number of parts remaining does not exceed M at some moment, it can send a work request to the loadbalancer before its current job is finished. This prefetching can hide a part of the communication overhead. •

Setting of the job time ratio T

The parameter T controls the sizes of the largest jobs, which are distributed already in the first round of the algorithm. An underestimation of T results in a potential imbalance. The extreme setting of T = 1 yields (independently of M) a static distribution of load. The problem is that if T was underestimated at the beginning of the algorithm, then an adjustment of T at run time does not help (unlike a run-time adjustment of M) unless jobs which have already been assigned can be taken away from workers.

The optimal setting of T depends on the ratio between the computational times on the longest and shortest job assigned. If T is too great, then an unnecessary communication overhead will add to the parallel time.

A conservative approach to the tuning of T is to assign jobs of the size M among the N workers in the first round and estimate T from the statistics which are collected after a worker finishes. T is then set to the maximum job time ratio measured on previous jobs. This alone does not prevent an underestimation of T: the small jobs from the first round only cover a small part of the image and therefore they are not a representative sample of the computational times across the whole image. However, the statistics collected during the first round allow for setting a limit on the computational times of jobs which are assigned in the next round. If a worker which computes a job detects during the computation that it has spent more time on the job than the limit allows (this detects an underestimation of T), then it stops the computation, sends the partially computed part of the job to the master and returns the part which has not been computed back to the loadbalancer. The loadbalancer updates its estimate of T and at an appropriate time it notifies the workers for which this update is relevant.
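The estimation step of the conservative approach can be sketched in Python. The text does not specify how the time limit is derived, so the rule used in `time_limit` is purely an assumption of this sketch, and both function names are hypothetical.

```python
def estimate_T(job_times):
    """Largest ratio between the measured times of two equally sized jobs."""
    positive = [t for t in job_times if t > 0]
    if len(positive) < 2:
        return 1.0
    return max(positive) / min(positive)


def time_limit(job_size, T_est, fastest_time_per_part):
    """Hypothetical limit for a job of job_size atomic parts.

    Assumes that no atomic part costs more than T_est times the fastest
    atomic part observed so far, in the spirit of assumption (4.1).
    """
    return job_size * T_est * fastest_time_per_part


# Example: first-round timings (in seconds) of three jobs of the size M.
T_est = estimate_T([2.0, 4.0, 1.0])
limit = time_limit(10, T_est, 0.5)
```

A worker which exceeds such a limit would report an underestimation of T, as described above.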

We suggest an optimistic approach which allows an underestimation of T by using an empirical value for T which remains constant during the algorithm. The potential imbalance can be eliminated by the use of work stealing as an additional phase which begins immediately after the loadbalancer has no more parts to distribute. The work stealing phase involves a higher overhead than the farming, but it can overlap with the farming phase and it is only initiated when an imbalance is detected, that is, when a worker sends a work request to the loadbalancer and receives a “no more work” reply. The constant T controls the amount of work stealing which is needed to balance the load at the end. T → ∞ yields a pure work stealing. The closer the estimation of T is to the optimal T, the shorter the work stealing phase will be.

4.4.4 Distributed object database

It is desirable that the screen subdivision algorithm proposed in the previous

section also works if the entire scene description does not fit into the memory of

each processor. This problem can be overcome by the use of a database which is

capable of storing all the scene data. The access to an external database (such

as a standard client-server database system or a disk storage in general) may be

prohibitively expensive because the frequency of queries is very high. A careful

reordering of operations in the ray tracing algorithm may overcome the problem

of the slow communication with an external database [PKGH02], but this may result in a special implementation of a single-purpose ray tracer.

Green and Paddon suggest a different approach [GP89], [Gre91] which we

decided to follow. It is assumed that the sum of the memories of all the processes

involved in the parallel computation (the worker processes) is sufficient to hold

the entire scene description. The function of each worker process is twofold: 1. it performs the recursive computations on the primary rays; 2. it serves as a database

server for all the remaining workers, which means that it accepts data requests

from the other workers and provides them with the requested data. This makes

parallel ray tracing a non-trivial application, see Section 2.1.

Before coming to a design of the distributed database, we will make some

observations. Ray tracers can support a variety of object types most of which

are very small in memory. Polygon meshes are one of a few exceptions. If a

scene does not fit into memory of one process, it is usually because it consists

of several large polygon meshes. We therefore distribute mesh objects in our

implementation but the implementation does not exclude a distribution of other

object types.

We assume that any single object does fit into memory of each process. (It

should also be said that a majority of scenes do fit into memory of each worker.)

Some amount of memory is required by the program code, stack, image buffer,

acceleration ray tracing structures (such as bounding boxes, bounding slabs, light

buffers) etc. Most of these acceleration structures can be switched off so that they

do not consume any memory.

The scene description stored in the database does not usually change during

the computations of one image. This makes it possible to decide during the preprocessing whether the amount of memory is sufficient to store the scene or not. Initially, the scene is stored in a file. The scene description in the file does not necessarily correspond to the storage of the scene in the (main) memory; we recall that e.g. POV-Ray uses a macro language which allows a procedural creation of the

objects. During the parsing of the scene file by all worker processes, objects

are created in the memories of the worker processes. After an object has been

created in memory, the workers synchronise and decide whether the object will be

replicated in memory of all workers or whether the object will only be stored in the

memory of one worker. In the latter case the worker with minimal current memory

load is selected to become the owner of that object. All the remaining workers

delete the data which belong to that object (e.g. vertex coordinates, vertex

normals, triangle indices etc. belong to the data of a mesh object). However,

they do not delete the object’s envelope. This envelope contains a global object

identifier, the identifier of the object’s owner (e.g. the rank of the process which

stores the object’s data), the object’s bounding box, a flag whether the object

is currently present in the memory etc. After this, the workers continue parsing the scene file. Note that none of the workers consumes more memory than is necessary at any one time.
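The ownership decision made during parsing can be sketched in Python. The criterion for replicating an object is not spelled out in the text, so the size threshold `replicate_below` is an assumption of this sketch. The `Envelope` fields are a reduced, illustrative subset of those listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Envelope:
    """Per-object record which every worker keeps (illustrative fields only)."""
    object_id: int
    owner: Optional[int]  # rank of the owning worker, None if replicated
    in_memory: bool       # whether this worker currently holds the data


def distribute(object_sizes, num_workers, replicate_below, my_rank):
    """Build the envelopes seen by worker my_rank after parsing.

    Objects smaller than replicate_below stay on all workers; every other
    object goes to the worker with the minimal current memory load.
    """
    loads = [0] * num_workers
    envelopes = []
    for object_id, size in enumerate(object_sizes):
        if size < replicate_below:
            loads = [load + size for load in loads]
            envelopes.append(Envelope(object_id, None, True))
        else:
            owner = min(range(num_workers), key=lambda rank: loads[rank])
            loads[owner] += size
            envelopes.append(Envelope(object_id, owner, owner == my_rank))
    return envelopes


# Example: four objects, two workers, seen from the worker with rank 0.
envelopes = distribute([100, 5, 80, 60], 2, replicate_below=10, my_rank=0)
```

Because all workers parse the same file in the same order, each of them can compute the same owners locally once the memory loads have been exchanged. The sketch models that exchange with a shared `loads` list.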

If the parsing phase was successful, then each worker is able to store the

objects which it owns and it has at least as much free memory as is needed

to store the data of the largest object in the distributed database. The worker’s

memory which is unused after the parsing stage will be used as a cache for the

objects which are not owned by the worker. A worker never deletes the data of

the objects which it owns.

An object's data are needed at two places in the ray tracing algorithm: in the intersection computations and in the shading. It must be ensured that the data of the object are in the memory before they are referenced; if the data are not in the memory, a data request is sent to the owner. It is likely that the object which is being referenced at the moment will also be referenced in the near future because of the coherence of the primary rays.⁵ The sole fact that a cached object is being referenced is useful for the bookkeeping of the cache policy, which uses this information in order to decide which objects will be released from the cache when a new object's data are to be inserted into the cache. Fig. 4.5 depicts the pseudo-code of the function Fetch_Object_Data, which is inserted in the ray tracing code immediately before the object's data are referenced.
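The fetch-before-reference logic can be mimicked by a small single-process Python sketch. The class name `ObjectCache`, the `request_data` callback which stands in for the data request to the owner, and the eviction rule (lowest importance first) are all assumptions of the sketch rather than the policy used in this work.

```python
class ObjectCache:
    """Cache for the data of objects owned by other workers (illustrative)."""

    def __init__(self, capacity, my_rank, request_data):
        self.capacity = capacity          # memory left unused after parsing
        self.my_rank = my_rank
        self.request_data = request_data  # callable(owner, object_id) -> bytes
        self.data = {}                    # object_id -> cached data
        self.importance = {}              # object_id -> reference counter
        self.requests = 0                 # number of data requests sent

    def fetch(self, object_id, owner, owned_data=None):
        """Return the object's data, asking the owner for them on a miss."""
        if owner == self.my_rank:
            return owned_data             # owned objects are never evicted
        if object_id not in self.data:
            payload = self.request_data(owner, object_id)
            self.requests += 1
            self._make_space(len(payload))
            self.data[object_id] = payload
        # Both a miss and a hit increase the object's importance.
        self.importance[object_id] = self.importance.get(object_id, 0) + 1
        return self.data[object_id]

    def _make_space(self, needed):
        # Evict the least important objects until the new data fit.
        while self.data and sum(map(len, self.data.values())) + needed > self.capacity:
            victim = min(self.data, key=lambda oid: self.importance[oid])
            del self.data[victim]
            del self.importance[victim]


# Example: two remote objects of 6 bytes each and a cache of 10 bytes.
store = {1: b"a" * 6, 2: b"b" * 6}
cache = ObjectCache(10, my_rank=0, request_data=lambda owner, oid: store[oid])
cache.fetch(1, owner=1)
cache.fetch(1, owner=1)
cache.fetch(2, owner=1)
```

In the example the second reference to object 1 is a hit, and the reference to object 2 evicts object 1 because both do not fit into the cache at the same time.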

An object's owner acts as a server for all other workers which may need

the object’s data. A worker must run a separate thread which reacts to mesh

requests by sending the data independently of the worker’s computations.

⁵ A more sophisticated method can be used to predict the future object references [RKC98].

Fetch_Object_Data(object)
{
    if (!is_in_memory(object))
    {
        send_data_request(object->owner, object->id);
        insert_into_cache(object);
        wait_for_data(object->owner, object);
    }
    else
    {
        if (object->owner != my_rank)
            cache_hit(object);
    }
}

Figure 4.5: The pseudo-code of the function Fetch_Object_Data. The function insert_into_cache makes space for the requested object data by removing other objects' data according to the cache policy, and then increases the requested object's importance. The function cache_hit increases the object's importance

Remark. The shading computations always follow the intersection computations for the same object, although not immediately. A large number of other objects may be referenced meanwhile. The object which was referenced in the intersection computations may be released from the cache before it is referenced again in the shading. In order to save the latter data request, all the information which is needed for the shading can be precomputed at the moment when the object is referenced for the first time. This information is stored in the object's envelope, which always remains in the memory. By doing so a possible expensive communication can be avoided for the price of a much less expensive unnecessary computation (not all intersected objects are going to be shaded).

Another efficiency improvement involves a prepacking of the object’s data to

ready-to-send buffers. However, there are several reasons why this optimisation should not be used. One reason is that the prepacked buffers consume

memory which can otherwise be used for the cache. Another reason is that

this technique may limit the implementation to homogeneous parallel machines,

unless the data encoding used for the prepacking is platform-independent. Fur-

thermore, the time which is needed for the packing of the data is usually much

shorter than the communication overhead. •

4.4.5 Experiments

This section first illustrates the tuning of the constants M and T of the load balancing algorithm in Fig. 4.3. Our experiments simulate the automatic tuning which is described in Section 4.4.3. Then we present results of experiments with a distributed object database. We compare three caching policies which attempt to exploit coherence in object references in the ray tracing algorithm in order to reduce the number of expensive data requests. We also present efficiency measurements of the parallel ray tracing implementation which uses a distributed object database.

Throughout this section, the efficiency of a run of a parallel program will be measured as

\[ \frac{\text{sequential time}}{\text{parallel time with } N \text{ workers} \cdot N} \]

Unless stated otherwise, the resolution of all the images computed during the

experiments of this section was 720x576 (PAL). All images were computed with

default POV-Ray 3.1g settings (with no anti-aliasing).

Setting of the atomic job size M

In order to show how the efficiency of a parallel implementation depends on the

atomic job size, we used a chunking job assignment in the loadbalancer process

which always assigns a constant chunk of the screen to an idle worker on demand.

Our goal was to (manually) find the optimal chunk size for two given scenes, BLOB

and HAUS6. The BLOB scene is extremely simple—it only consists of one object

and one point light source. The HAUS6 scene is fairly complex: it consists of ca. 600

objects and 8 point light sources.

For these experiments, we used a configuration with 90 workers which were

running on a partition of 92 processors of the hpcLine (see Section 2.8 for the

machine description). This is the maximal number of workers which can be

mapped onto the allocated partition so that each process is mapped onto a sin-

gle node of the machine (the remaining two nodes are used by the loadbalancer

and the master processes). The optimal chunk size for this configuration deter-

mines the upper limit on the efficiency of any parallel computation which uses 90

worker processes. We recall that the chunk size which is smaller than the optimal

chunk size will result in the domination of the communication overhead over the

computation times of the smallest jobs which are assigned in the load balancing

algorithm in Fig. 4.3 extended with the constant M, see Section 4.4.3. However,

the smaller the optimal chunk size is, the better balance of load can be expected.

We compared two programs in these experiments, an event-driven one and

a polling one. These programs are identical on the binary level. They both use

the very same implementation of POV||Ray. The only difference is in the im-

plementation of the TPL library. The event-driven program is POV||Ray linked

with the TPL implementation which uses the interrupt mechanism described in

Section 2.6.6. The polling program is (the same) POV||Ray linked with the TPL

implementation which uses the polling mechanism described in Section 2.6.2. The

two TPL implementations are based on the same source code and the difference between the two is only a few lines of code which are conditionally selected using #ifdef directives. In order to make the comparison fair, neither the polling

nor the event-driven version uses special optimisations—they both are generic

implementations of the polling and event-driven mechanisms from Chapter 2.

The results of the experiments are shown in Fig. 4.6. The optimal chunk size

for the BLOB scene is 720 pixels for both the polling and event-driven versions.

The optimal chunk size for the HAUS6 scene is 720 pixels for the polling program

and 72–720 pixels for the event-driven program.

Figure 4.6: Absolute parallel times for 90 workers for a varying chunk size. Left: BLOB scene. Right: HAUS6 scene (plots: time in seconds against chunk size in pixels, 0 to 800; curves: event-driven TPL/PVM and polling PVM)

Fig. 4.7 shows the efficiencies of event-driven and polling programs for a pure

chunking job assignment. The optimal constant chunk sizes (720 pixels for the

BLOB scene and 360 pixels for the HAUS6 scene) were used for these measurements.

Figure 4.7: Efficiency of the chunking algorithm for a constant chunk size and varying number of worker processes. Left: BLOB scene (chunk size 720 pixels). Right: HAUS6 scene (chunk size 360 pixels) (plots: efficiency against number of workers, 0 to 140; curves: event-driven TPL/PVM and polling PVM)


Setting of the job time ratio T

The optimal settings of M from Fig. 4.8 were used in order to determine the optimal settings of T for the same two scenes. We only performed these measurements for the event-driven program. The loadbalancer process used the algorithm of Fig. 4.3 extended with the constant M defined in Section 4.4.3. (No work stealing was used.)

Figure 4.8: Efficiency of the perfect load balancing algorithm with the optimal chunk size and varying number of worker processes. Top: BLOB scene (M = 720 pixels). Bottom: HAUS6 scene (M = 360 pixels) (plots: efficiency against number of workers, 0 to 140; curves for T = 1, T = 2, T = 3 and T = inf, all event-driven TPL/PVM)

These measurements show that, for these two scenes, the maximal efficiency is obtained for $T \to \infty$. However, the setting $T = 3$ already yields the maximal efficiency.⁶ Note the poor efficiency for the setting $T = 1$ (a static load assignment).

Choice of the cache policy

Together with P. Gramblička, our student, we compared the efficiencies of several

cache policies on several scenes. [Gra98] These policies differ in how they react

when an object is being referenced (cache hit in Fig. 4.5) and in how they select

the objects for the removal from the cache (so-called victims) in order to make

space for a requested object (insert into cache in Fig. 4.5). An object request

is also called a cache miss. The efficiency of a caching policy is measured using the miss ratio:

$$\text{miss ratio} = \frac{\#\,\text{object requests}}{\#\,\text{object references}}$$

The lower the miss ratio, the more efficient the policy. The object references in the definition of the miss ratio include references to objects which are owned by the process. We will only present the policies which are fully automatic and do not require any tuning:

• RANDOM does nothing when an object is being referenced. On an object request, victims are selected randomly until there is enough space in the cache to store the requested object.

• LRU (Least Recently Used) maintains a linked list of the cached objects. When an object is being referenced, it is moved to the beginning of the list. On an object request, the last object in the list is repeatedly removed from the cache until there is enough space in the cache to store the requested object.

• LRU-COUNTER assigns a counter to each object. This counter is initially zero and it is increased when the object is being referenced. On an object request, the object with the lowest counter is always removed from the cache.
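The three policies can be compared on a synthetic reference trace with a sketch such as the following. It is only an illustration under stated assumptions: the class `Cache`, the helper `miss_ratio` and the trace are hypothetical names, the counters of LRU-COUNTER are kept for all objects (also for those currently not cached), objects owned by the process are not modelled, and every object is assumed to fit into the cache.

```python
import random
from collections import OrderedDict


class Cache:
    """Object cache with a size limit and one of three victim policies."""

    def __init__(self, capacity, policy, seed=0):
        self.capacity = capacity
        self.policy = policy            # "RANDOM", "LRU" or "LRU-COUNTER"
        self.objects = OrderedDict()    # object id -> size; first entry = most recent
        self.counters = {}              # object id -> number of references (assumption:
                                        # kept even while the object is not cached)
        self.used = 0
        self.references = 0
        self.requests = 0
        self.rng = random.Random(seed)

    def _select_victim(self):
        if self.policy == "RANDOM":
            return self.rng.choice(list(self.objects))
        if self.policy == "LRU":
            return next(reversed(self.objects))      # last object in the list
        return min(self.objects, key=lambda o: self.counters[o])

    def reference(self, obj_id, size):
        """Reference an object; returns True on a cache hit."""
        self.references += 1
        self.counters[obj_id] = self.counters.get(obj_id, 0) + 1
        if obj_id in self.objects:
            if self.policy == "LRU":
                self.objects.move_to_end(obj_id, last=False)  # move to the beginning
            return True
        self.requests += 1                           # cache miss = object request
        while self.objects and self.used + size > self.capacity:
            victim = self._select_victim()
            self.used -= self.objects.pop(victim)
        self.objects[obj_id] = size
        self.objects.move_to_end(obj_id, last=False)
        self.used += size
        return False


def miss_ratio(trace, capacity, policy):
    """#object requests / #object references for a trace of (id, size) pairs."""
    cache = Cache(capacity, policy)
    for obj_id, size in trace:
        cache.reference(obj_id, size)
    return cache.requests / cache.references


trace = [(1, 1), (1, 1), (1, 1), (2, 1), (3, 1), (1, 1)]
for policy in ("RANDOM", "LRU", "LRU-COUNTER"):
    print(policy, miss_ratio(trace, capacity=2, policy=policy))
```

On this particular trace LRU evicts the frequently used object 1 when object 3 arrives, whereas LRU-COUNTER keeps it; real traces from a ray tracer may of course behave differently.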

Fig. 4.9 shows the results of the measurements on three scenes which were rendered sequentially in the resolution 640x480, using a simulated memory limit. The graphs may suggest that the RANDOM policy is approximately as good as LRU. This is only because the total number of references is relatively high. In fact, the absolute number of cache misses of LRU is ca. 25% lower than the absolute number of cache misses of RANDOM in all the graphs. LRU performs very well although its obvious disadvantage is that objects which are only referenced a few times remain in the cache for a long time. LRU-COUNTER uses counters in order to prevent this situation. The weakness of LRU-COUNTER is that objects which are referenced often during a short time interval remain very long in the cache.

⁶ The efficiency greater than 1 for the BLOB scene is caused by the short sequential time which does not allow an amortisation of I/O calls in the sequential program.

Distributed object database

Fig. 4.10 shows the efficiencies of parallel ray tracing with a distributed object database which were measured with our old implementation of POV||Ray. The old implementation of POV||Ray was based on the GOLEM communication library. [Ree97] The GOLEM library uses PVM for inter-process communication and it uses polling in order to allow for an implementation of non-trivial applications.

The LRU cache policy was used in these measurements. The cache miss

ratio was under 1% in the 20% case (only 20% of all objects are relevant for the

rendering of this image) and the total number of data requests was ca. 3000. The

cache miss ratio increased only by 0.1% in the 10% case but the total number of

data requests increased to ca. 500000. In the 5% case, the cache miss ratio was

ca. 15% and the total number of data requests was ca. 7000000.

Our recent POV||Ray implementation differs from the old one in many details

which make a direct comparison impossible. We compared efficiencies of two

programs which only differ in the mechanism which is used for the communication

(as in the subsection above on the settings of the constants M and T).

One program uses the TPL library with an event-driven mechanism, the other one

uses the TPL library with polling. The programs are otherwise identical (even

on the binary level). We only made the measurements for 90 workers where the

memory limit was set to ca. 5%. We used a slightly modified HAUS6 scene in this

experiment (with fewer light sources and a different camera). We used the setting

of M = 5 (pixels) and T → ∞ (and no work stealing). This setting (chunking)

excludes a significant imbalance and a bottleneck at the load balancer. We measured

an efficiency of ca. 0.0178 for the event-driven version and ca. 0.0015 for the

polling version.

4.4.6 Further extensions and improvements

It may take a long time to read the scene description from a large file to the

memories of the worker processes, especially if the objects are modeled as triangle

meshes. This issue becomes critical when the parallel program is running on a

disk-less machine which is connected to a disk server via a slow communication

link. For instance, if 100 processes read the same file of size 100 MB, then

100 × 100 MB = 10 GB must be transferred via the slow link. This transfer can take

longer than the parallel computation itself. The solution involves the reading of

the file in one of the workers only. This worker broadcasts the data which it reads

among other workers. Broadcasting is usually much faster than disk operations

involving the same volume of data. [Pla98]

[Figure 4.9 plots, one panel per scene: miss ratio (vertical axis) against cache size relative to the total size of cached objects (horizontal axis, 0.1 to 1), with one curve each for the LRU-COUNTER, LRU and RANDOM policies.]

Figure 4.9: Cache miss ratios. Top: BATH, 353 objects. Centre: ROSENTHALERHOF,

2215 objects. Bottom: HELICOPTER, 167 objects

[Figure 4.10 plot: efficiency (vertical axis, 0.2 to 1.2) against the number of workers (horizontal axis, 0 to 50), with one curve each for 100%, 20%, 10% and 5% memory.]

Figure 4.10: Efficiency of POV||Ray (an old polling version) with a distributed

object database. The memory percentage states how large a part of the sum of all

object data sizes is allowed to be stored in the memory of each worker. A worker

is only allowed to use this amount of memory for the storage of objects which it

owns and for the object cache. The missing data in the graph indicate the cases

where this simulated memory limit was exceeded in some worker

The sequential anti-aliasing optimisation technique which measures the dif-

ferences between already computed pixel colours in order to reduce the number

of supersampling primary rays cannot be directly used in parallel screen subdivi-

sion without a loss of efficiency. If this technique is applied directly, then primary

rays for the pixels on the job borders will be computed twice. We use additional

communication in order to avoid this double computation. [Pla02a]

An interesting practical issue is persistence of data. The workers can remain

running after they have finished a computation of one image, and keep the scene

description in their memories. It is possible to define a protocol between an

external frontend program (which implements the user interface) and the backend

(the parallel program itself) which allows the user to interact with the parallel

program. We used this scenario in order to render camera animations without

the need of re-reading the entire scene from a file (only a short camera description

must be parsed by the workers in order to render the next frame). The use of

persistent data in the context of animation allows for further optimisations of

the parallel ray tracing algorithm which exploit the temporal coherence between

subsequent frames. [FHK97]

Our current implementation of POV||Ray does not address the diffuse in-

terreflection [WRC88] which is implemented in the sequential POV-Ray. The

mechanism which is needed in order to share the irradiance database distributed

in the memories of workers is similar to the sharing of the distributed object data.

The difference between the two is that the irradiance database quickly changes

during the parallel computation and a change in one worker must be passed to

other workers as soon as possible. [Rei96], [RCJ99]

4.5 Conclusions

We presented a simple and robust parallelisation of ray tracing. The paralleli-

sation is based on screen space subdivision. We proposed a demand-driven load

balancing algorithm which is a generalisation of chunking approaches. We proved

that this algorithm guarantees a perfect load balance while minimising the num-

ber of work requests if its two parameters are optimally set. The optimal setting

of these two parameters is generally unknown before the actual computation finishes. However, both parameters have an intuitive interpretation and their optimal setting can be characterised. We proposed an automatic tuning procedure for these parameters and illustrated the procedure with experiments.

We addressed the problem of large scenes which cannot be copied into the memories of the parallel processes. The parallel program uses a distributed object database which is maintained by the same processes which use the data. (The use of a distributed object database makes the parallel ray tracing application non-trivial: each worker process performs the ray tracing computations and independently acts as a database server for other worker processes.) We compared several cache strategies, out of which we presented three which do not require any tuning. The LRU (Least Recently Used) strategy performed best.

We showed that the choice of the communication library strongly influences the performance of the parallel program. We compared two programs which are identical on the binary level; the only difference between the two is the choice of a mechanism inside the communication library. One program uses polling in order to achieve thread-safety and an independent message progress, the other program uses an event-driven mechanism. Neither of these mechanisms uses special optimisation techniques; the implementation of the mechanisms is practically identical to the code snippets from Chapter 2. The event-driven version outperformed the polling version in all experiments (for all problem instances and in all runs).

Chapter 5

Radiosity

The radiosity method solves a discretised version of the radiance equation (Equation 3.30), a discretised version of the potential equation (Equation 3.31), or a combination of both as sketched in Section 3.4.2. The discretised radiance and potential equations are linear equation systems. We will focus on the discretised radiosity equation (Equation 3.59), which can be written as

$$MB = E \tag{5.1}$$

where $M_{ij} = \delta_{ij} - \rho_i F_{ij}$ are the elements of the radiosity matrix $M$. The symbol $\delta_{ij}$ denotes the Kronecker delta which is 1 for $i = j$ and 0 otherwise.¹

There are several approaches to solving a linear equation system. An example of a simple approach is Gauss elimination which was used e.g. in [GTGB84] (the first paper which describes the use of radiosity for image synthesis). One disadvantage of Gauss elimination (or other full-matrix methods) is that the matrix of the system must be completely computed, which means that the form factors between all pairs of patches must be computed. This approach is not feasible for large matrices because the computation of each form factor involves a double integration (Equation 3.57).

Most radiosity algorithms use Southwell relaxation as the underlying linear equation solver.² [CCWG88], [SP94] The idea is to compute an acceptable approximation of the radiosity solution without having to compute the whole radiosity matrix. Southwell relaxation is an iterative method which computes a series of approximations of the radiosity solution. The more iterations are performed, the better the final approximation is. As a result of the physical constraints, the radiosity matrix $M$ is diagonally dominant, which ensures the convergence of Southwell relaxation.

¹ The form factor $F_{ii} = 0$, therefore $M_{ii} = 1$ for $i = 1, \ldots, n$.

² An interesting alternative approach is presented in [Ash01]. The idea is to express the (slightly modified) radiosity matrix $M$ using the spectral decomposition theorem as $M = \sum_{i=1}^{n} \lambda_i v_i v_i^T$, where $\lambda_i$ are the eigenvalues of the matrix $M$ and $v_i$ are the corresponding eigenvectors. The radiosity matrix can be approximated as $M \approx \sum_{i=1}^{p} \lambda_i v_i v_i^T$, $p < n$, where the eigenvalues have been sorted in decreasing order: $|\lambda_1| \ge |\lambda_2| \ge \ldots \ge |\lambda_n|$. It has already been shown that only the first few largest eigenvalues carry significant information for typical radiosity matrices. Hence, even if $p$ is much less than $n$, the approximated radiosity solution is close to the exact radiosity solution. From a practical point of view, it is important that the computation of the first few largest eigenvalues (and the corresponding eigenvectors) does not require full knowledge of the matrix $M$ if e.g. the Block Lanczos algorithm is used. [GU77]

5.1 Southwell relaxation

In order to give a convenient physical interpretation of Southwell relaxation, Equation 5.1 can be slightly reformulated. We denote the total energy radiated by the patch $P_i$ as $\beta_i = B_i A_i$, where $A_i$ is the area of the patch $P_i$. Similarly, we denote the total energy emitted by the patch $P_i$ as $\epsilon_i = E_i A_i$. In this notation, the radiosity equation can be written as

$$K\beta = \epsilon \tag{5.2}$$

where $K_{ij} = \frac{A_i}{A_j} M_{ij}$. The vector of the unknowns $\beta_i$ after the $k$-th iteration of Southwell relaxation is denoted as $\beta^{(k)}$. The residual vector after the $k$-th iteration is denoted as $r^{(k)} = \epsilon - K\beta^{(k)}$. (The matrix $K$ is diagonally dominant, therefore the iteration process converges: $\lim_{k \to \infty} \|r^{(k)}\| = 0$.) In each iteration, Southwell relaxation chooses the maximal residuum $r_s^{(k-1)}$ from the residual vector of the previous iteration and sets the new residuum to zero: $r_s^{(k)} = 0$. This yields the following recurrent formulas for $\beta$ and $r$: [SP94]

$$\beta_s^{(k)} = \beta_s^{(k-1)} + r_s^{(k-1)} \tag{5.3}$$

$$r_i^{(k)} = r_i^{(k-1)} - K_{is}\, r_s^{(k-1)} = \begin{cases} 0 & \text{for } i = s \\ r_i^{(k-1)} + \rho_i F_{si}\, r_s^{(k-1)} & \text{for } i \ne s \end{cases} \tag{5.4}$$

The natural initial settings of $\beta$ and $r$ are $\beta^{(0)} = 0$ and $r^{(0)} = \epsilon$. Note that after the $k$-th iteration, $(\beta_i^{(k)} + r_i^{(k)}) / A_i$ is a good estimate of the patch radiosity $B_i$. These radiosity estimates come close to the exact radiosity solution after only a few iterations. [CCWG88]
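The recurrences of Equations 5.3 and 5.4 can be illustrated with a minimal sketch. The function names `build_K` and `southwell`, the tolerance `tol`, the iteration cap `max_iter` and the two-patch example data are assumptions made for this illustration only; the sketch also assumes a unit diagonal of $K$ (which follows from $F_{ii} = 0$) and uses the fraction of the yet unshot energy as the stopping test.

```python
def build_K(areas, rho, F):
    """K_ij = (A_i / A_j) * (delta_ij - rho_i * F_ij)."""
    n = len(areas)
    return [[areas[i] / areas[j] * ((1.0 if i == j else 0.0) - rho[i] * F[i][j])
             for j in range(n)]
            for i in range(n)]


def southwell(K, eps, tol=1e-12, max_iter=10_000):
    """Solve K beta = eps by Southwell relaxation; K must have a unit diagonal."""
    n = len(eps)
    beta = [0.0] * n                 # beta^(0) = 0
    r = list(eps)                    # r^(0) = eps
    total = sum(abs(e) for e in eps)
    iterations = 0
    while iterations < max_iter and sum(abs(x) for x in r) > tol * total:
        s = max(range(n), key=lambda i: abs(r[i]))   # index of the maximal residuum
        shot = r[s]
        beta[s] += shot                              # Equation 5.3
        for i in range(n):
            r[i] -= K[i][s] * shot                   # Equation 5.4
        r[s] = 0.0                                   # exactly zero by construction
        iterations += 1
    return beta, r, iterations


# Hypothetical two-patch example; reciprocity A_0 F_01 = A_1 F_10 holds.
areas = [1.0, 2.0]
rho = [0.5, 0.8]
F = [[0.0, 0.4], [0.2, 0.0]]
E = [1.0, 0.0]

K = build_K(areas, rho, F)
eps = [E[i] * areas[i] for i in range(len(E))]
beta, r, iterations = southwell(K, eps)
B = [(beta[i] + r[i]) / areas[i] for i in range(len(areas))]
print(iterations, B)
```

For this example the system can also be solved by hand ($\beta_1 = 1/0.968$, $\beta_2 = 0.32/0.968$), which gives a simple check of the iteration.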

Several points are worth mentioning:

• The residua $r_i^{(k)}$ correspond to the yet unshot patch energies (the energies to be distributed in the future iterations $k+1, k+2, \ldots$).

• The values $\beta_i^{(k)}$ correspond to the accumulated patch energies.

• In one iteration $k$, the unshot patch energy $r_s^{(k-1)}$ is "shot" to all other $n-1$ patches and the contributions of this "shot" are added to the residua of the $n-1$ patches. The patch $P_s$ whose patch energy is being shot is called a shooting patch.

• In each iteration $k$, the radiosity estimates $(\beta_i^{(k)} + r_i^{(k)}) / A_i$ are updated for all patches $P_i$ except for the shooting patch $P_s$ with the maximal residuum $r_s^{(k-1)}$.

• The same patch $P_s$ can have the maximum residua $r_s^{(k_1)}$ and $r_s^{(k_2)}$ in different iterations $k_1$ and $k_2$. Hence, the same patch can be selected several times to be the shooting patch during the iteration process.

• The total yet unshot energy $\|r^{(k)}\| = \sum_{i=1}^{n} r_i^{(k)}$ is a good estimate of the global error after $k$ iterations of Southwell relaxation. The fraction $\|r^{(k)}\| / \|\epsilon\|$ is the percentage of the initial total unshot energy which has not yet been shot. This percentage can serve as a convenient termination criterion of the iteration process.

• One iteration only involves the computation of $n-1$ form factors (the form factors between the shooting patch $P_s$ and all other $n-1$ patches). As the number of iterations performed is usually much smaller than $n$, the majority of the form factors do not need to be computed in order to obtain the desired approximation of the radiosity solution.

5.1.1 Shooting radiosity algorithm

The algorithm in Fig. 5.1 is based on Southwell relaxation which has been described above. This algorithm is referred to as the shooting radiosity algorithm. The variables $B_i$ represent patch radiosities ($B_i = (\beta_i^{(k)} + r_i^{(k)}) / A_i$ after $k$ executions of the while loop). The variables $\Delta B_i$ represent patch unshot radiosities ($\Delta B_i = r_i^{(k)} / A_i$ after $k$ executions of the while loop).

The most expensive part of the algorithm is the computation of the form factors between the shooting patch $P_s$ and the receiving patch $P_r$ (the function compute_form_factor). The larger the equation system, the more iteration steps are required in order to reach the termination criterion and the more form factors must be computed in each iteration step. It is therefore desirable to use as few patches as possible for the scene representation (for instance, a rectangular wall should be modeled as two triangles, not more).

The radiosity algorithm in Fig. 5.1 explicitly enumerates the receiving patches (this happens in the for loop which follows the shooting patch selection). A different approach, direction sampling (an implicit form factor computation), can be found e.g. in [Kel94] and [Kel96].
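The loop structure of the algorithm in Fig. 5.1 can be sketched in runnable form as follows. This is an illustration under assumptions rather than the implementation used in this work: the form factors are supplied by a hypothetical callback `form_factor(s, r)`, the iteration cap `max_iter` is added as a safeguard, the two-patch data are invented, and the radiosities are initialised to the emittances so that $B_i = (\beta_i^{(k)} + r_i^{(k)}) / A_i$ holds from the start.

```python
def shooting_radiosity(areas, rho, E, form_factor, p, max_iter=100_000):
    """Shooting radiosity with explicitly enumerated receiving patches.

    areas, rho, E  -- patch areas, reflectances and emittances
    form_factor    -- callback returning F_sr for patch indices (s, r)
    p              -- stop when the total unshot energy drops to p or below
    """
    n = len(areas)
    B = list(E)      # radiosity estimates
    dB = list(E)     # unshot radiosities

    for _ in range(max_iter):
        if sum(areas[i] * dB[i] for i in range(n)) <= p:
            break
        # Shooting patch: the one with the largest unshot energy A_s * dB_s.
        s = max(range(n), key=lambda i: areas[i] * dB[i])
        for r in range(n):
            if r == s:
                continue                      # F_ss = 0, nothing to receive
            inc = rho[r] * form_factor(s, r) * areas[s] * dB[s] / areas[r]
            B[r] += inc
            dB[r] += inc
        dB[s] = 0.0
    return B


# Hypothetical two-patch scene; reciprocity A_0 F_01 = A_1 F_10 holds.
F = [[0.0, 0.4], [0.2, 0.0]]
radiosities = shooting_radiosity(
    areas=[1.0, 2.0],
    rho=[0.5, 0.8],
    E=[1.0, 0.0],
    form_factor=lambda s, r: F[s][r],
    p=1e-12,
)
print(radiosities)
```

In a real scene the callback would be the expensive part, which is why the number of patches and the number of iterations matter so much.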

5.2 Form factor computation

    /* Input:
         patches P_1,...,P_n
         patch reflectances ρ_1,...,ρ_n
         patch emittances E_1,...,E_n
         desired accuracy p
    */
    shooting_radiosity()
    {
        /* Initialisation. */
        for (i = 1; i <= n; i++)
        {
            B_i = E_i;                /* (β_i + r_i) / A_i for β = 0, r = ε */
            ∆B_i = E_i;
            A_i = patch_area(P_i);
        }

        /* Southwell iterations. */
        while (Σ_{i=1..n} A_i ∆B_i > p)
        {
            P_s = select_shooting_patch();   /* A_s ∆B_s = max_i A_i ∆B_i */
            for (r = 1; r <= n; r++)
            {
                F = compute_form_factor(P_s, P_r);
                inc = ρ_r F A_s ∆B_s / A_r;
                B_r += inc;
                ∆B_r += inc;
            }
            ∆B_s = 0;
        }
    }
    /* Output:
         patch radiosities B_1,...,B_n
    */

Figure 5.1: The basic shooting radiosity algorithm

The computation of form factors between pairs of patches in Fig. 5.1 (Equation 3.57) is the basis of the shooting algorithm. The approaches to the computation of form factors can be divided into four classes: [SP94]

1. Direct analytical integration. Exact form factors are known for many

special configurations of the two patches. [How82] The known configura-

tions in Howell’s catalogue are divided into three groups: differential area to

differential area (e.g. two differential areas in an arbitrary configuration),

differential area to finite area (e.g. a differential planar element to finite

parallel rectangle) and finite area to finite area (e.g. two identical, paral-

lel, directly opposed rectangles). Howell’s catalogue assumes that the two

patches are not occluded by any other patch. Some radiative heat transfer

applications work with 3D models without occlusions. Unfortunately, in

most models relevant to computer graphics applications it is even impos-

sible to predict whether there is an occlusion between two patches or not

(there generally is).

2. Contour integration. Using Stokes' theorem, the area integral of Equation 3.57 can be transformed into a contour integral:

$$F_{sr} = \frac{1}{2\pi A_s} \oint_{C_s} \oint_{C_r} \ln r \; dc_r \, dc_s \tag{5.5}$$

where $C_s$ and $C_r$ are the contours of the patches $P_s$ and $P_r$.

A form factor between a point $x$ (a differential area) and a polygon $P = [v_1, v_2, \ldots, v_{maxp}]$ can be computed analytically using contour integration if no occlusion exists between the point and the polygon: [HS67], [SH93]

$$F_{xP} = \frac{1}{2\pi} \sum_{i=1}^{maxp} N \cdot \frac{\angle(R_i, R_{i \oplus 1})}{|R_i \times R_{i \oplus 1}|} \, (R_i \times R_{i \oplus 1}) \tag{5.6}$$

where $N$ is the normal at the point $x$ and $\oplus$ is the "circular next" operator on $1, \ldots, maxp$. $\angle$ denotes the (signed) angle between two vectors. $R_i$ is the vector from $x$ to $v_i$.

3. Projection. Projection methods are based on Nusselt's analogy which provides an alternative definition of a form factor between a differential area around a point $x$ and a finite surface patch $P$: [Nus28], [SP94]

    The form factor $F_{xP}$ is the fraction of the area in the base plane³ which is obtained by projecting the patch $P$ onto the unit hemisphere centered at the point $x$, and then orthogonally down onto the base plane.

A popular algorithm which is based on the above projection is the hemicube algorithm (a hemicube or an arbitrary surface around the point $x$ can be used instead of the hemisphere in the above definition). [CG85] The advantages of the hemicube algorithm are that it can deal with an occlusion between two patches, it is relatively simple and it can use the hardware of contemporary graphics cards (the z-buffer algorithm is implemented in the hardware). One disadvantage is that the precision of the form factor approximation depends on the discretisation of the hemicube. An insufficient discretisation leads to severe inaccuracies.

³ The base plane is the plane which contains the point $x$ and which is perpendicular to the normal at the point $x$.

4. Monte Carlo integration (ray casting). Monte Carlo integration [ES00]

is the most straightforward method of computing the form factor between

two patches $P_s$ and $P_r$ which may be occluded by other patches. The advantage of Monte Carlo integration is generally its flexibility. Care must

be taken when choosing the estimator. Estimators which are unbiased and

which have a low variance are preferred. Estimators which have both these

properties are proposed in [Pie93] and [Bek99]. Unlike the projection meth-

ods, Monte Carlo integration is not confronted with aliasing problems. The

only parameter which must be tuned is the number of samples.

We will focus on the methods of Monte Carlo integration in the next section.
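As an illustration of the analytical point-to-polygon form factor (Equation 5.6), the following sketch evaluates the sum over the polygon edges and compares it with the closed-form value for a parallel rectangle whose corner lies on the normal through the point. The function names are hypothetical. The sketch assumes that the polygon lies entirely above the base plane (no clipping against the horizon) and that the vertices are ordered so that the result is positive; it uses the unsigned angle between $R_i$ and $R_{i \oplus 1}$ and lets the cross product carry the orientation.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def point_to_polygon_form_factor(x, normal, vertices):
    """Unoccluded form factor from a differential area at x to a polygon."""
    R = [tuple(v[k] - x[k] for k in range(3)) for v in vertices]
    m = len(R)
    total = 0.0
    for i in range(m):
        a, b = R[i], R[(i + 1) % m]          # "circular next" vertex
        c = cross(a, b)
        c_len = math.sqrt(dot(c, c))
        if c_len == 0.0:
            continue                         # edge is collinear with x
        angle = math.atan2(c_len, dot(a, b)) # angle between R_i and R_i+1
        total += angle * dot(normal, c) / c_len
    return total / (2.0 * math.pi)


def corner_rectangle_form_factor(a, b, c):
    """Closed form: element below one corner of a parallel a-by-b rectangle
    at distance c (the normal of the element passes through that corner)."""
    X, Y = a / c, b / c
    return (X / math.sqrt(1 + X * X) * math.atan(Y / math.sqrt(1 + X * X))
            + Y / math.sqrt(1 + Y * Y) * math.atan(X / math.sqrt(1 + Y * Y))
            ) / (2.0 * math.pi)


rect = [(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]
f_contour = point_to_polygon_form_factor((0, 0, 0), (0, 0, 1), rect)
f_closed = corner_rectangle_form_factor(1.0, 1.0, 1.0)
print(f_contour, f_closed)
```

Both values agree to machine precision for this configuration, which is a convenient sanity check before such a routine is combined with visibility sampling.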

5.2.1 Monte Carlo form factor computation

Direct area estimator

A direct approach to the computation of the form factor $F_{sr}$ uniformly generates $N_s$ points $x_i$, $i = 1, \ldots, N_s$ on the patch $P_s$ and $N_r$ points $y_j$, $j = 1, \ldots, N_r$ on the patch $P_r$ and it uses the estimator

$$\hat{F}_{sr} = \frac{1}{N_s N_r \pi A_s} \sum_{i=1}^{N_s} \sum_{j=1}^{N_r} \frac{\cos\theta_{ij} \cos\theta'_{ij}}{r(x_i, y_j)^2} \, V(x_i, y_j) \tag{5.7}$$

where $\theta_{ij}$ is the angle between the surface normal of the patch $P_s$ and the direction from $x_i$ to $y_j$, $\theta'_{ij}$ is the angle between the surface normal of the patch $P_r$ and the direction from $y_j$ to $x_i$, $r(x_i, y_j)$ is the distance between the points $x_i$ and $y_j$, and $V$ is the visibility function defined in Equation 3.52.

The estimator in Equation 5.7 is unbiased. This means that its expected value is correct ($E[\hat{F}_{sr}] = F_{sr}$). However, its variance can be very high and even unbounded: [Bek99]

$$\mathrm{var}[\hat{F}_{sr}] = \frac{A_r}{\pi^2 A_s} \int_{x \in P_s} \int_{y \in P_r} \left( \frac{\cos\theta \cos\theta'}{r(x, y)^2} \, V(x, y) \right)^2 dA_r \, dA_s - F_{sr}^2 \tag{5.8}$$

The problem with the unbounded variance is caused by the factor $r(x, y)^4$ in the denominator of Equation 5.8 and it shows up for abutting patches. In this case, increasing the number of samples does not necessarily improve the form factor estimate.
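A uniform two-point Monte Carlo estimate of the form factor integral can be tried out on a configuration with a known answer. The sketch below uses two identical, parallel, directly opposed squares without occlusion ($V = 1$) and compares the estimate with the closed-form value for this configuration. All names are hypothetical. The weight $A_r / (\pi N_s N_r)$ used in the code is an assumption of this sketch: it is obtained by dividing the integrand of the form factor integral, $\cos\theta \cos\theta' V / (\pi A_s r^2)$, by the uniform sampling densities $1/A_s$ and $1/A_r$.

```python
import math
import random


def direct_area_estimate(n_s, n_r, side=1.0, dist=1.0, seed=1):
    """Monte Carlo form factor between two opposed parallel squares (V = 1)."""
    rng = random.Random(seed)
    xs = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n_s)]
    ys = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n_r)]
    area_r = side * side
    total = 0.0
    for x1, x2 in xs:
        for y1, y2 in ys:
            r2 = (x1 - y1) ** 2 + (x2 - y2) ** 2 + dist * dist
            cos_s = dist / math.sqrt(r2)   # normals face each other along z
            cos_r = cos_s
            total += cos_s * cos_r / r2
    # Weight from the uniform sampling densities (assumption of this sketch).
    return area_r * total / (math.pi * n_s * n_r)


def parallel_rectangles_form_factor(a, b, c):
    """Closed form for identical, parallel, directly opposed a-by-b rectangles
    at distance c."""
    X, Y = a / c, b / c
    s = (0.5 * math.log((1 + X * X) * (1 + Y * Y) / (1 + X * X + Y * Y))
         + X * math.sqrt(1 + Y * Y) * math.atan(X / math.sqrt(1 + Y * Y))
         + Y * math.sqrt(1 + X * X) * math.atan(Y / math.sqrt(1 + X * X))
         - X * math.atan(X) - Y * math.atan(Y))
    return 2.0 * s / (math.pi * X * Y)


estimate = direct_area_estimate(100, 100)
exact = parallel_rectangles_form_factor(1.0, 1.0, 1.0)
print(estimate, exact)
```

The squares in this example are well separated, so the integrand is bounded and the estimate behaves well; the difficulties described above arise when the distance between sample points can become arbitrarily small.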

Delta area estimator

The approach of [WEH89] uniformly generates $N_s$ points $x_i$, $i = 1, \ldots, N_s$ on the patch $P_s$ and $N_r$ points $y_j$, $j = 1, \ldots, N_r$ on the patch $P_r$. For a pair of points $x_i$ and $y_j$, an analytical form factor from the differential area around the point $x_i$ to a disk around the point $y_j$ is calculated in order to approximate the inner form factor integral (the sum of the $N_r$ disk areas is equal to $A_r$). This yields the estimator

$$\hat{F}_{sr} = \frac{1}{N_s N_r \pi A_s} \sum_{i=1}^{N_s} \sum_{j=1}^{N_r} \frac{\cos\theta_{ij} \cos\theta'_{ij}}{r(x_i, y_j)^2 + A_r} \, V(x_i, y_j) \tag{5.9}$$

This estimator is biased ($E[\hat{F}_{sr}] \ne F_{sr}$) but it is consistent ($E[\hat{F}_{sr}] = F_{sr}$ for $N_r \to \infty$). The bias can only be neglected if $N_r$ is large enough.

Directional estimator

The inner form factor integral (Equation 3.57) can be written as a directional integral, which yields an equivalent form factor formula

$$F_{sr} = \frac{1}{\pi A_s} \int_{x \in P_s} \int_{\omega \in \Omega_r(x)} \cos\theta \; d\omega \, dA_s \tag{5.10}$$

where $\Omega_r(x)$ denotes the solid angle subtended by the visible part of the patch $P_r$ as seen from the point $x$, $\theta$ is the angle between the surface normal at $x$ and the direction $\omega$, and $dA_s$ is a differential area around the point $x$.

The determination of the integration domain $\Omega_r(x)$ involves solving the visibility problem. The visibility can be analytically solved in this case but this analytical computation is very expensive. [BRW89] Equation 5.10 can be reformulated as

$$F_{sr} = \frac{1}{\pi A_s} \int_{x \in P_s} \int_{\omega \in \Omega'_r(x)} \cos\theta \; V'(x, \omega, P_r) \; d\omega \, dA_s \tag{5.11}$$

where $\Omega'_r(x)$ denotes the solid angle subtended by the patch $P_r$ as seen from the point $x$. The function $V'(x, \omega, P_r)$ returns 1 if $RT(x, \omega) \in P_r$ and 0 otherwise.

The directional integration uniformly generates $N_s$ points $x_i$, $i = 1, \ldots, N_s$ on the patch $P_s$. For each point $x_i$, directions $\omega_j$, $j = 1, \ldots, N_r$ are uniformly generated over the solid angle $\Omega'_r(x_i)$ subtended by the patch $P_r$ as seen from the point $x_i$. The resulting form factor estimator is

$$\hat{F}_{sr} = \frac{1}{N_s N_r \pi A_s} \sum_{i=1}^{N_s} \sum_{j=1}^{N_r} \cos\theta_{ij} \, V'(x_i, \omega_j, P_r) \tag{5.12}$$

This estimator is unbiased and its variance is always bounded: [Bek99]

$$\mathrm{var}[\hat{F}_{sr}] = \frac{1}{\pi^2 A_s} \int_{x \in P_s} \Omega'_r(x) \int_{\omega \in \Omega'_r(x)} \left( \cos\theta \, V'(x, \omega, P_r) \right)^2 d\omega \, dA_s - F_{sr}^2 \le 4 \tag{5.13}$$

A technical problem remains, namely how to uniformly sample the directions $\omega$ in the solid angle $\Omega'_r(x)$. If the patch $P_r$ is a triangle, then $P_r$ can be projected onto a hemisphere around the point $x$ and the sampling technique for spherical triangles can be applied. [Arv95]

Weighted analytical estimators

The idea behind the following estimators is to uniformly generate $N_s$ points $x_i$, $i = 1, \ldots, N_s$ on the patch $P_s$ and to analytically compute the unoccluded point-to-patch form factor $F_{x_i P_r}$ (Equation 5.6) for each point $x_i$. This form factor is weighted by the visibility sampling. For each point $x_i$, $N_r$ points $y_{ij}$, $j = 1, \ldots, N_r$ are uniformly generated on the patch $P_r$. The visibility function $V(x_i, y_{ij})$ is computed for the point $x_i$ and the points $y_{ij}$, $j = 1, \ldots, N_r$.

The resulting estimator which is proposed in [Pie93] is

$$\hat{F}_{sr} = \frac{1}{N_s N_r} \sum_{i=1}^{N_s} F_{x_i P_r} \sum_{j=1}^{N_r} V(x_i, y_{ij}) \tag{5.14}$$

A similar estimator is proposed in [Bek99] (weighted area sampling):

$$\hat{F}_{sr} = \frac{1}{N_s} \sum_{i=1}^{N_s} F_{x_i P_r} \, \frac{\sum_{j=1}^{N_r} \left( \frac{\cos\theta_{ij} \cos\theta'_{ij}}{\pi r(x_i, y_{ij})^2} \, V(x_i, y_{ij}) \right)}{\sum_{j=1}^{N_r} \frac{\cos\theta_{ij} \cos\theta'_{ij}}{\pi r(x_i, y_{ij})^2}} \tag{5.15}$$

Both of the estimators above are unbiased and their variance is bounded by $N_s \int_{x \in P_s} F_{x P_r}^2 \, dA_s$. [Bek99] In our opinion these two estimators are the best known ones in connection with the shooting radiosity method.

5.3 Discretisation of surface geometry

The surface discretisation (also known as meshing) is a conversion of the surface geometry of a 3D model into a polygonal representation. This polygonal representation is usually a set of triangle meshes. The illumination computed by a radiosity algorithm is stored in this mesh (usually in the vertices of the triangles). The initial mesh is created in the preprocessing stage and it is dynamically refined by the radiosity algorithm in order to accurately represent the stored illumination.

Mesh generation has been the subject of many research papers, especially in

relation to the finite element methods. Specific requirements of mesh generation

for the purpose of radiosity computations were formulated in [BMSW91]. These

specific requirements place constraints on the size of the triangles and on the

topology of the input mesh. We will show that none of these constraints are

important if suitable techniques are used in the radiosity algorithm. We propose a radiosity algorithm which works well with an arbitrary input mesh. The only requirement is that the triangles are numerically interpreted as triangles (that is, the numerically computed area of any triangle must not be 0). The initial size of the triangles is also not important: the larger they are, the better. Some radiosity algorithms (e.g. [Sch00]) work with triangles and quadrangles. We only allow the use of triangles in order to keep the algorithm simple (this only influences the computational time, not the quality of the radiosity solution).

5.4 Illumination storage and reconstruction

The assumption of the radiosity method is constant radiosity over a patch. How-

ever, as the diffuse illumination usually smoothly varies over a surface, the storage

of a single RGB value per patch leads to unpleasant visual artifacts. These arti-

facts are caused by the discontinuities of the illumination on the borders between

the patches. This problem can be solved by making the patches very small—

however, this would increase the computational time. A usual practice is to store

different RGB values in the vertices of the patches.⁴ These RGB values are called

vertex radiosities. The illumination at a point of the patch is reconstructed using

the interpolation of the vertex radiosities of the patch. As patches are two-sided,

the illumination must be independently stored for the front side and the back

side of each patch (see Section 3.2.2).

The illumination over a patch is not always smooth. It can vary rapidly, in

particular at shadow boundaries. Linear interpolation is not able to capture such

sudden changes. This problem can also be solved by using smaller patches, but it

is desirable to keep the number of patches as small as possible. A commonly used

compromise is using an adaptive hierarchical subdivision. The initial patches are

as large as possible (e.g. a planar wall is represented as two triangles, regardless

of its size). A patch is only subdivided when a significant discontinuity of the

illumination is detected on its surface. The resulting subpatches can be further

subdivided in a similar manner.

⁴ An alternative is to represent the illumination over a patch as a linear combination of a

finite number of basis functions (the so-called Galerkin method).

A patch only needs to be subdivided at the moment when its vertex radiosities

are being updated. This only happens when the patch is acting as the receiving

patch during the radiosity algorithm and when its current level of subdivision is

not able to capture the illumination.

The push-pull algorithm [SP94] can be used to maintain the patch radiosities and

patch unshot radiosities at each subdivision level. The radiosity stored in

a node of the tree is equal to the average radiosity stored in the node's children.

This is important for algorithms which perform the energy exchange between the

shooting patch and the receiving patch at different subdivision levels.
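
The "pull" half of this bookkeeping can be sketched as follows. This is only an illustration under stated assumptions: radiosities are scalars rather than RGB values, all children of a node have equal area (so that the plain average applies), and the names `Node` and `pull` are hypothetical and not taken from [SP94].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Node of a patch subdivision tree (hypothetical, scalar radiosity)."""
    radiosity: float = 0.0
    children: List["Node"] = field(default_factory=list)


def pull(node: Node) -> float:
    """Recompute inner-node radiosities bottom-up and return the node's value."""
    if node.children:
        # Equal-area children are assumed, so the parent is the plain average.
        node.radiosity = sum(pull(child) for child in node.children) / len(node.children)
    return node.radiosity
```

Calling `pull` on the root after the leaves have been updated restores the averaging property at every level of the tree.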

Our algorithm also uses the adaptive hierarchical subdivision. However, the

energy exchange always takes place on the top level of the hierarchy—that means,

the shooting patch and the receiving patch are the original (large) patches. The

tree data structure provides two operations for the storage and retrieval of radiosity

and unshot radiosity (these are both represented as RGB values):⁵

• retrieve(P, side, x, c) retrieves the RGB value at the point x on the
side side of the patch P. This retrieved value is stored into c. The retrieval
procedure first traverses the hierarchy of the patch P on the side side in
order to find the smallest triangle (the smallest triangle is always a leaf of
the subdivision tree) which contains the point x. Then the RGB values
which are stored in the vertices of this triangle are interpolated in order to
compute the value c.

• store(P, side, x, c) stores the RGB value c at the point x on the side
side of the patch P. The storing procedure first traverses the hierarchy of
the patch P in order to find the smallest triangle which contains the point
x. Then it updates the RGB values which are stored in the vertices of this
smallest triangle and checks whether the interpolation of the new vertex
values at the point x differs from c. If the difference is larger than a
user-specified threshold, the triangle is subdivided into smaller non-overlapping
triangles⁶ and the update is recursively repeated for the new triangle
which contains the point x. When this recursion returns, the vertex
values are "pulled" from the smallest triangle up to the root of the tree.

The (point-based) retrieve(P, side, x, c) operation always retrieves the
best level of detail of the illumination at the point x. The point-based store
operation is more general than a patch-based store operation. Indeed, a patch-based
store(P, side, c) operation can be simulated using the point-based operation
store(P, side, x_center, c), where x_center is the center of the patch P. The
patch-based retrieve operation (this operation is needed for the selection of
the shooting patch) simply returns the average of the vertex radiosities of the
top-level patch.

⁵ Similar operations are proposed in [Bek99] (per-ray refinement).

⁶ All the smaller triangles must still numerically be interpreted as triangles (with non-zero
areas). If this assertion fails then the subdivision of the current node is discarded.
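
The interpolation step of the retrieve operation can be illustrated by barycentric interpolation inside a leaf triangle. The sketch below makes several assumptions that are not stated in the text: the point and the triangle are given in 2D coordinates within the plane of the patch, the stored values are scalars rather than RGB triples, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def barycentric(p: Point, a: Point, b: Point, c: Point) -> Tuple[float, float, float]:
    """Barycentric coordinates of p with respect to the triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0.0:
        # The triangle is not "numerically a triangle" (zero area).
        raise ValueError("degenerate triangle")
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb


def interpolate(p: Point, triangle: Sequence[Point], values: Sequence[float]) -> float:
    """Linearly interpolate the three vertex values at the point p."""
    wa, wb, wc = barycentric(p, triangle[0], triangle[1], triangle[2])
    return wa * values[0] + wb * values[1] + wc * values[2]
```

A store operation could compare `interpolate(x, triangle, values)` with the stored value c and subdivide when the difference exceeds the threshold.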

Remark. The illumination can be stored in any data structure which provides
the store and retrieve operations. The data structure must additionally
allow for an efficient generation of sample points on the shooting patch, which is
described in Section 5.5.1 (Fig. 5.4). •

5.5 Energy transfer

The transfer of energy from the shooting patch $P_s$ to the receiving patch $P_r$ in
the basic radiosity algorithm in Fig. 5.1 is contained in the while loop which
follows the shooting patch selection. The actual energy transfer is preceded by
the form factor computation. Our algorithm combines these two steps (and it
retains the explicit enumeration of the receiving patches).
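
For orientation, a shooting loop of this kind (shooting patch selection, form factor lookup, energy transfer) can be sketched on a closed unit cube. The sketch assumes the usual progressive refinement form of the basic algorithm; it uses scalar radiosities, one patch per wall, and the rounded analytical form factors quoted in Section 5.7.1 instead of a Monte Carlo estimate. All names are hypothetical.

```python
F_PERP = 0.200044  # floor <-> adjacent wall (Equation 5.19, rounded)
F_PAR = 0.199825   # floor <-> opposite wall (Equation 5.20, rounded)

# Walls of the cube: 0 floor, 1 ceiling, 2..5 vertical walls.
OPPOSITE = {0: 1, 1: 0, 2: 3, 3: 2, 4: 5, 5: 4}


def form_factor(s, r):
    """Form factor between two walls of the unit cube."""
    if s == r:
        return 0.0
    return F_PAR if OPPOSITE[s] == r else F_PERP


def shoot(emittance, rho, accuracy):
    """Shoot unshot radiosity until the total unshot energy drops to `accuracy`."""
    n = len(emittance)
    area = [1.0] * n              # all walls have unit area
    radiosity = list(emittance)
    unshot = list(emittance)
    while sum(a * u for a, u in zip(area, unshot)) > accuracy:
        # Southwell selection: the patch with the largest unshot energy.
        s = max(range(n), key=lambda i: area[i] * unshot[i])
        for r in range(n):
            # Equal areas are assumed, so no area ratio appears here.
            inc = rho[r] * unshot[s] * form_factor(s, r)
            radiosity[r] += inc
            unshot[r] += inc
        unshot[s] = 0.0
    return radiosity
```

With an emitting floor, `shoot([1, 0, 0, 0, 0, 0], [0.5] * 6, 1e-9)` returns radiosities whose sum is close to 2, because each reflection keeps half of the energy in the closed box.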

We will focus on the Monte Carlo form factor computation (see Section 5.2.1).

In order to explain the proposed energy transfer mechanism, we will first con-

sider the simple direct area form factor estimator (Equation 5.7). This method

uniformly generates $N_s$ points $x_i$, $i = 1, \ldots, N_s$ on the patch $P_s$ and $N_r$ points
$y_j$, $j = 1, \ldots, N_r$ on the patch $P_r$. Equation 5.7 sums up the contributions of
all point pairs $[x_i, y_j]$ to the estimated form factor $\hat{F}_{sr}$. We can merge this step
with the computation of inc in Fig. 5.1: instead of working with the form factor
estimator $\hat{F}_{sr}$ we can directly work with the added radiosity estimator

\[ \mathit{inc} = \frac{\rho_r \Delta B_s}{N_s N_r \pi} A_r \sum_{i=1}^{N_s} \sum_{j=1}^{N_r} \frac{\cos\theta_{ij} \cos\theta'_{ij}}{r(x_i, y_j)^2} V(x_i, y_j) \tag{5.16} \]

Let us assume that there are $N_s$ point light sources located at the points
$x_i$, $i = 1, \ldots, N_s$. Their emittances are

\[ E(x_i, \omega) = \begin{cases} \dfrac{\cos\theta}{r(x_i, y)^2} & \text{if } \omega \text{ lies in the half-space of the surface normal at } x_i \\ 0 & \text{otherwise} \end{cases} \tag{5.17} \]

where $\theta$ is the angle between the surface normal at the point $x_i$ and the
direction $\omega$. The attenuation term $r(x_i, y)^2$ is artificial. It expresses that the
emittance is inversely proportional to the square of the distance to the point
illuminated by the light source. If the $N_s$ point light sources are the only light
sources in the model, then the sum

\[ \rho_r \sum_{i=1}^{N_s} \frac{\cos\theta_{ij} \cos\theta'_{ij}}{r(x_i, y_j)^2} V(x_i, y_j) \tag{5.18} \]

is equal⁷ to the direct illumination term which is computed by the ray tracing
shader at the point $y_j$. Equation 5.18 is the inner sum of Equation 5.16 multiplied
by $\rho_r$ (if the sums in Equation 5.16 are swapped). If the ray tracing shader is
applied to all $N_r$ points on the patch $P_r$, then the sum of the resulting "colours"
scaled by $\frac{\Delta B_s}{N_s N_r \pi} A_r$ is equal to the right side of Equation 5.16. This final
summation can be expressed using the store operation (see Section 5.4). The "colours"
$c_j$ computed by the ray tracing shader for the receiving samples $y_j$, $j = 1, \ldots, N_r$
are scaled and added to vertex radiosities of the patch $P_r$:
store($P_r$, side, $y_j$, $c_j \cdot \frac{\Delta B_s}{N_s N_r \pi} A_r$). The idea behind using the ray tracing shader for the energy
exchange in the radiosity algorithm is illustrated in Fig. 5.2.
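
Written out, the regrouping described above reads as follows. This is only a restatement of Equations 5.16 and 5.18 in the notation used here, with $c_j$ denoting the shader output at $y_j$; it assumes the simplifying conditions under which the shader output equals Equation 5.18, and the `align*` environment assumes the amsmath package.

```latex
\begin{align*}
\mathit{inc}
  &= \frac{\rho_r \,\Delta B_s}{N_s N_r \pi}\, A_r
     \sum_{i=1}^{N_s} \sum_{j=1}^{N_r}
     \frac{\cos\theta_{ij} \cos\theta'_{ij}}{r(x_i, y_j)^2}\, V(x_i, y_j) \\
  &= \sum_{j=1}^{N_r} \frac{\Delta B_s}{N_s N_r \pi}\, A_r\,
     \underbrace{\rho_r \sum_{i=1}^{N_s}
     \frac{\cos\theta_{ij} \cos\theta'_{ij}}{r(x_i, y_j)^2}\, V(x_i, y_j)}_{c_j \text{ (Equation 5.18)}}
\end{align*}
```

Each term of the outer sum corresponds to one store call for the receiving sample $y_j$.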

Figure 5.2: Radiosity energy exchange using the ray tracing shader. Left: shadow

rays are traced by the ray tracing shader from the light sources in order to

illuminate a surface point Y. Right: shadow rays are traced by the ray tracing

shader in order to transfer energy from the shooting patch to the receiving patch.

The temporary point light sources are randomly generated on the shooting patch

Remark. The construction above can be easily adapted to use a better form

factor estimator of Section 5.2.1, for instance the weighted analytical estimator

(Equation 5.14). We show in Section 5.7.1 that the weighted analytical esti-

mator outperforms the direct area form factor estimator which we used in the

construction above for explanatory purposes. •

Remark. The idea of using artificially generated light sources for the purpose of
the computation of the indirect diffuse illumination can also be found in [Kel97].
Keller's method does not use any permanent illumination storage: it is
"view-dependent". The method shoots the original light sources onto randomly
generated points on a patch and creates a new temporary point light source in each
of these points. An image of the scene is then rendered, whereby the scene is
illuminated by the temporary light sources. This process is repeated for each
patch and the final image is obtained as a sum of the single images. The
rendering step makes use of graphics hardware. The algorithm only computes diffuse
scatterings of level one (the term $RL$ of Equation 3.35). Scatterings of level two
and higher (the terms $R^i L$, $i \geq 2$) can be added by extending the lengths of the
light paths using a direction sampling of the temporary light sources. (Rays are
traced in randomly generated directions and additional temporary light sources
are generated at the surface points which are hit by the rays.) •

⁷ More precisely, Equation 5.18 is equal to the direct illumination term which is computed by
the ray tracing shader only if two simplifying assumptions hold in the 3D model: 1. all patches
are opaque, 2. all patches are pure diffuse reflectors described by the scalar $\rho_r$. Ray tracing
shaders usually work with more general material descriptions. If the materials in the model
do not satisfy the previous assumptions then the "colour" returned by the ray tracing shader
may differ from Equation 5.18 because the shader takes into account all material properties,
not only the scalar $\rho_r$.

5.5.1 Shooting radiosity algorithm using the ray tracing shader

The above reformulation of the energy transfer in terms of the ray tracing shader

leads to the algorithm in Fig. 5.3. We will discuss a few algorithmic aspects

hidden in the pseudo-code:

• Patch radiosities $B_i$ and patch unshot radiosities $\Delta B_i$ do not need to be
explicitly stored in the patches (and subpatches) because they can be
reconstructed from the vertex radiosities and vertex unshot radiosities.

• The basic radiosity algorithm only works with patch light sources. The
function set_vertex_unshot_radiosities in Fig. 5.3 may include the so-called
first shot during which all other point light types are shot onto the
patches $P_i$. This first shot can be implemented similarly to the shooting
in one Southwell iteration: points are randomly generated on the receiving
patches and the ray tracing shader is called to illuminate them. (It is
important that the illumination acquired during the first shot is compatible
with the direct ray tracing illumination.)

• The patch selected by the function select_shooting_patch is always the
top-level patch (not a subpatch of the subdivision tree). Note that
illumination is stored independently for front and back sides of every patch. The
shooting patch selection routine must search both the front and back sides
of each patch.

• An appropriate choice of the sampling rates (computed in the function
choose_sampling_rates) requires a heuristic which should depend on the
amount of the transferred energy $A_s \Delta B_s$ and on a rough estimate of the
form factor $F_{sr}$. A discussion on sampling rates for the Monte Carlo form
factor computations can be found in [Bek99].

/* Input:
     patches P_1, ..., P_n
     patch emittances E_1, ..., E_n
     desired accuracy p
*/
shooting_radiosity()
{
  /* Initialisation. */
  for (i = 1; i <= n; i++)
  {
    zero_vertex_radiosities(B_i);
    set_vertex_unshot_radiosities(B_i, E_i);
  }
  /* Southwell iterations. */
  while (sum_{i=1..n} A_i ΔB_i > p)
  {
    P_s = select_shooting_patch();   /* A_s ΔB_s = max_i A_i ΔB_i */
    for (r = 1; r <= n; r++)
    {
      choose_sampling_rates(P_s, P_r, N_s, N_r);
      generate_shooter_random_samples(P_s, x_1, ..., x_Ns);
      store_light_sources();
      set_light_sources(x_1, ..., x_Ns);
      generate_receiver_random_samples(P_r, y_1, ..., y_Nr);
      for (j = 1; j <= N_r; j++)
      {
        ray_tracing_shader(y_j);
      }
      restore_light_sources();
    }
    zero_vertex_unshot_radiosities(B_s);
  }
}
/* Output:
     patch radiosities B_1, ..., B_n
     vertex radiosities stored in the trees of patches B_1, ..., B_n
*/

Figure 5.3: Shooting radiosity algorithm using the ray tracing shader

• An efficient algorithm for uniform generation of random points inside a
triangle can be found in [Gla90].

• The use of stratified sampling (also known as Latin square sampling or
jittering, see [ES00]) in the generate_*_random_samples functions may reduce
the variance of the Monte Carlo integration.

• The first three function calls in the loop over the receiving patches (the
for loop over r) set up the temporary point light sources on the shooting patch.
These three lines of code can be moved before the for loop, which significantly
speeds up the energy transfer. The idea behind this rearrangement is the reuse of the
shooting samples (more precisely, the point lights) in shootings to different
receiving patches. This leads to a significant increase of the efficiency,
especially if the light buffer technique is applied. It is much more efficient
to compute the light buffers once per iteration than once per receiving
patch.

• The function ray_tracing_shader is actually called twice: once for the
front side of the receiving patch $P_r$ and once for the back side. This function
computes the direct illumination $c_j$ at the points $y_j$ and calls the function
store in order to store the obtained "colours" $c_j$ in the subdivision tree
(after scaling).

• The function set_light_sources and the store() call in the function
ray_tracing_shader can be adapted to (implicitly) compute the form
factor estimate 5.14 or 5.15 instead of 5.7. (Our implementation uses the
form factor estimate of Equation 5.14.)
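
As an illustration of uniform point generation inside a triangle, the following sketch uses the widely known square-root warping of two uniform random numbers. Whether this is exactly the method meant by the reference [Gla90] is an assumption, and the function name is hypothetical.

```python
import math
import random


def random_point_in_triangle(a, b, c, rng=random):
    """Return a uniformly distributed random point in the triangle (a, b, c).

    a, b, c are coordinate tuples of equal length; rng is any object with a
    random() method returning floats in [0, 1).
    """
    u = rng.random()
    v = rng.random()
    s = math.sqrt(u)
    # Barycentric weights; the square root makes the density uniform in area.
    wa = 1.0 - s
    wb = s * (1.0 - v)
    wc = s * v
    return tuple(wa * pa + wb * pb + wc * pc for pa, pb, pc in zip(a, b, c))
```

Stratification could be layered on top of this by jittering u and v within the cells of a regular grid instead of drawing them from the whole unit square.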

Random generation of sampling points

A particularly interesting issue is the generation of the sampling points on the
shooting patch (generate_shooter_random_samples) and the receiving patch
(generate_receiver_random_samples). The computation of form factors using
Equation 5.14 requires a uniform sampling of both the shooting patch and the
receiving patch. However, the following example shows that it is not desirable to
generate the samples uniformly on the shooting patch from the point of view of
the energy transfer (we recall that both the shooting patch and the receiving patch
are top-level patches). Let us assume that the unshot radiosity in one part of the
shooting patch is much greater than in the other parts. In an extreme case the
whole unshot radiosity is accumulated in one leaf of the subdivision tree, while
the unshot radiosity stored in the other leaves is zero. If the sampling points
are generated uniformly on the shooting patch, then it may happen that none
of the sampling points lies inside the leaf with the non-zero unshot radiosity.
The retrieve operation would return 0 for all the generated points, therefore no
energy would actually be shot at the receiving patches.

There are two solutions to this problem. (Note that the problem is only related

to the shooting patch—the points on the receiving patch should be generated

uniformly.) The first solution involves shooting smaller patches than the top-

level patches—however, this would lead to a slow convergence of the shooting

algorithm. The second solution involves a weighted sampling of the shooting

patch. The idea behind this weighted sampling is to control the density of the

generated points on the shooting patch according to the unshot radiosity over

the shooting patch. The pseudo-code in Fig. 5.4 describes the point generation

process. This process first chooses a leaf of the subdivision tree (a triangle) with

a probability proportional to the unshot radiosity of the leaf and then generates

the random point inside this leaf.

generate_shooter_random_samples(P_s, x_1, ..., x_Ns)
{
  for (i = 1; i <= N_s; i++)
  {
    node = P_s;
    while (node is not a leaf)
    {
      child = a successor of node, selected with probability
              proportional to the patch unshot radiosity of the successor;
      node = child;
    }
    x_i = (uniform) random point on the patch of the node node;
  }
}

Figure 5.4: Generation of sample points on the shooting patch

It may be argued that the non-uniformity of the random sampling of the

shooting patch violates the assumption of the form factor computation of Equa-

tion 5.14. However, the proposed process should be looked at as shooting of

the group of leaf patches of the subdivision tree. Let us return to the previous

example—let us assume that only one leaf patch stores the whole unshot radiosity

of the shooting patch. In this case the random points will only be generated on

this one leaf patch and the implicitly computed form factor will correspond to

the form factor between this one leaf patch and the receiving patch. The effect

of the subsequent shooting is therefore the same as if that one leaf patch was

selected for the shooting.
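
The leaf selection of Fig. 5.4 can be sketched as a weighted random descent of the subdivision tree. The sketch is illustrative only: the `Node` class is hypothetical, unshot radiosities are scalars, children are assumed to have equal areas, and the uniform fallback for a node whose children all store zero is an added assumption.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Node of the subdivision tree with its patch unshot radiosity."""
    unshot: float = 0.0
    children: List["Node"] = field(default_factory=list)


def choose_leaf(root: Node, rng: random.Random) -> Node:
    """Descend from the root, picking children proportionally to their unshot radiosity."""
    node = root
    while node.children:
        weights = [child.unshot for child in node.children]
        if sum(weights) <= 0.0:
            # Assumption: nothing to shoot below this node, choose uniformly.
            node = rng.choice(node.children)
        else:
            node = rng.choices(node.children, weights=weights, k=1)[0]
    return node
```

In the extreme case discussed above, where a single leaf stores the whole unshot radiosity, every call returns that leaf, so all sample points fall inside it.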

Practical consequences

The proposed algorithm has a number of positive practical consequences (the

only drawback is the mesh representation of the surface geometry):

• The visibility function $V(x_i, y_j)$ computed by the ray tracing shader is more
general than its original definition in Equation 3.52 as the ray tracing shader
automatically handles (non-scattering) transparency. Hence, the visibility
function is no longer a function which returns either 0 or 1. It becomes a
real function instead: $V(x_i, y_j) \in \langle 0, 1 \rangle$ (the closed interval from 0 to 1).

• Ray tracers can usually work with complex material descriptions which
include layered textures, alpha channel, bump mapping etc. As the energy
transfer in the radiosity algorithm invokes the ray tracing shader, the
radiosity algorithm can work with the same material description.

• Only a few parameters must be supplied by the user: 1. the desired accuracy of
the radiosity solution (the constant p in Fig. 5.3), 2. the threshold which
controls the adaptive refinement (see Section 5.4), 3. the parameter which controls
the sampling density in the function choose_sampling_rates. All these
parameters are intuitive as they are expressed as percentages.

• The radiosity computation can be incorporated into any ray tracer as a
preprocessing step. The combination of radiosity and ray tracing yields the
so-called two-pass solution of the global illumination problem [SP89].

5.6 Visualisation

Ray tracing is used for the visualisation of the radiosity solution. The computa-

tion of the ambient and the direct illumination terms in the ray tracing shader is

replaced by the retrieve() function call (this function is defined in Section 5.4).

The quality of the rendered two-pass images can be further improved if the

direct illumination is subtracted from the radiosity solution and computed by

the ray tracing shader (in this case only the ambient term is replaced by the

retrieve() function call). The direct illumination computed by ray tracing is

more accurate unless the leaf triangles of the radiosity mesh, after projection

onto the camera image plane, are comparable in size to the camera pixels.

The computed images include the direct illumination, indirect perfect specular

illumination (multiple perfect specular scatterings) and indirect perfect diffuse il-

lumination (multiple perfect diffuse scatterings). Transparency is also accounted

for. Some light phenomena, e.g. caustics, are not included in the two-pass solu-

tion. These missing phenomena are related to photon paths along which indirect

specular and diffuse scatterings are mixed (or photon paths with imperfect scat-

terings).

5.7 Experiments

We implemented the shooting radiosity algorithm in Fig. 5.3 as a module of the

Persistence of Vision Ray Tracer (POV-Ray). The radiosity computation runs as

a preprocessing step which is followed by ray tracing. The scene must consist of

triangle meshes (more precisely, other object types may be used as well but the

radiosity algorithm only stores the illumination in triangle meshes).

The implementation is fully functional even though not yet complete. An

additional programming effort would be needed in order to implement extensions

such as a heuristic for the dynamic control of the number of samples on the

shooting patch and the receiving patch (a fixed number of samples is currently

used), the adaptive substructuring, the storing of the computed illumination in a

file etc. However, all major algorithmic issues have already been addressed. The

missing details only influence the efficiency of the implementation.

5.7.1 Form factors

In order to verify the correctness of the implementation we first created a special

3D model for which the radiosity solution can be calculated analytically. The

model consists of an empty box (a cube with 6 square walls) and a camera which

views the inside of the box. As our implementation only uses triangles, each

wall is modeled as two triangles. In the following experiments we also used a

refined version of the box in which each wall is modeled as eight triangles. (We

implemented a simple mesher which refines the model until all triangle edges

are shorter than a given threshold.) These two versions, BOX12 and BOX48, are

equivalent in the sense that they describe the same surfaces—the only difference

is the number of triangles, see Fig. 5.5.

Figure 5.5: Left: BOX12, the box scene modeled with 12 top-level triangle patches.
Right: BOX48, the same box scene modeled with 48 top-level triangle patches
(level-one refinement)

There are two kinds of form factors in the scene: 1. the form factor between
the floor and one of the vertical walls, $F_{perp}$; 2. the form factor between the floor
and the ceiling, $F_{par}$. The analytical formulas for both these form factors can be
found in Howell's catalogue [How82]:

\[ F_{perp} = \frac{1}{2} + \frac{\frac{1}{4}\ln\frac{3}{4} - \sqrt{2}\arctan\frac{1}{\sqrt{2}}}{\pi} \approx 0.200044 \tag{5.19} \]

\[ F_{par} = \frac{4\sqrt{2}\arctan\frac{1}{\sqrt{2}} - \ln\frac{3}{4}}{\pi} - 1 \approx 0.199825 \tag{5.20} \]

Remark. $F_{perp}$ and $F_{par}$ are not equal. This is an interesting asymmetry which
can rarely be observed in nature for a symmetric configuration such as a
cube. The difference between the two form factors is very small, which yields an
uncommon approximation of $\pi$ (if we set $F_{perp} = F_{par}$):

\[ \pi \approx \frac{10\sqrt{2}}{3}\arctan\frac{1}{\sqrt{2}} - \frac{5}{6}\ln\frac{3}{4} \approx 3.141134 \tag{5.21} \]

•
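
The closed forms for the unit cube can be checked numerically. The short sketch below evaluates the perpendicular and parallel unit-square form factors as well as the resulting approximation of π; the variable names are hypothetical. It also checks the closure property that the floor sees four vertical walls and the ceiling, so the five form factors must sum to one.

```python
import math

# Shared sub-expressions of the closed forms for the unit cube.
a = math.sqrt(2.0) * math.atan(1.0 / math.sqrt(2.0))
log34 = math.log(3.0 / 4.0)

f_perp = 0.5 + (0.25 * log34 - a) / math.pi    # floor to a vertical wall
f_par = (4.0 * a - log34) / math.pi - 1.0      # floor to the ceiling
pi_approx = (10.0 / 3.0) * a - (5.0 / 6.0) * log34

# Energy leaving the floor reaches four walls and the ceiling.
closure = 4.0 * f_perp + f_par

print(f"F_perp = {f_perp:.6f}, F_par = {f_par:.6f}")
print(f"pi approximation = {pi_approx:.6f}, closure = {closure:.12f}")
```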

It is not straightforward to verify the implementation of the form factor com-

putation as the value of a form factor does not appear explicitly anywhere in

the algorithm in Fig. 5.3. However, it is possible to modify the selections of the

shooting patch and the receiving patch in Fig. 5.3 so that only the triangles of

the emitting floor shoot energy at a particular receiving patch (the particular

patch is one of the vertical walls or the ceiling, respectively). We also modified

the algorithm so that it terminates after the energy of the floor has been shot.

Even though we cannot directly measure the value of the form factor, we can

read the unshot patch radiosity values of the receiving patches after the shot.

The unshot radiosity of a wall is equal to the sum of unshot radiosities of all its

triangles. If we set the reflectivity ρof the receiving patch to 1 and the unshot

radiosity of the shooting patch to 1, then the unshot radiosity of the receiving

patch after the shooting will be equal to the form factor between the shooting

patch and the receiving patch. The side-effect of this methodology is the testing

of the implementation of the function store and of the energy transfer.

We compare the direct area estimator of Equation 5.7 to the weighted ana-

lytical estimator of Equation 5.14. Using the procedure above we computed the

form factors for the scenes BOX12 and BOX48 for a given number of samples per

triangle in 100 independent runs (the number of samples on the shooting triangle

was equal to the number of samples on the receiving triangle). We measured the

minimum, maximum and average values over the 100 runs.
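
The structure of such an experiment can be sketched for the floor-to-ceiling form factor of the unit cube. The estimator below is a direct area estimator in the form suggested by Equation 5.16 (the exact form of Equation 5.7 is assumed here); the walls are sampled as whole squares rather than per triangle, visibility is always 1 in the empty box, and all names are hypothetical.

```python
import math
import random


def estimate_f_par(n_s, n_r, rng):
    """Direct area estimate of the form factor between floor (z=0) and ceiling (z=1)."""
    shooters = [(rng.random(), rng.random()) for _ in range(n_s)]
    receivers = [(rng.random(), rng.random()) for _ in range(n_r)]
    total = 0.0
    for xs, ys in shooters:
        for xr, yr in receivers:
            r2 = (xr - xs) ** 2 + (yr - ys) ** 2 + 1.0
            # Both cosines equal 1/r for parallel walls at distance 1, and V = 1,
            # so cos*cos / (pi * r^2) reduces to 1 / (pi * r^4).
            total += 1.0 / (math.pi * r2 * r2)
    area_r = 1.0
    return area_r * total / (n_s * n_r)


def run_experiment(runs, n, seed=0):
    """Return (min, max, average) of the estimate over independent runs."""
    rng = random.Random(seed)
    results = [estimate_f_par(n, n, rng) for _ in range(runs)]
    return min(results), max(results), sum(results) / runs
```

For example, `run_experiment(100, 16)` reports the spread of the estimate for 16 samples on each wall; the average should lie close to the exact value of Equation 5.20.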

The results of these experiments are shown in Fig. 5.6. Note that the average

values of the computed form factors match the exact values of the form factors

in all graphs. However, the average is not relevant for the reliability of the imple-

mentation (unless the same shooting is repeated many times and the results are

averaged—but this is not a common practice as the visibility computation is the

most expensive part of the radiosity algorithm). The variance is more relevant.

Even more relevant than the variance are the minimum and the maximum values

of the computed form factors over the 100 runs. (Note that a wrongly computed

form factor value in an early iteration of the radiosity algorithm may invalidate

the resulting radiosity solution.)

The weighted analytical estimator always outperforms the direct area estima-

tor, especially in the computation of Fperp. The weighted analytical estimation of

the form factor always converges to the exact value with the increasing number

of samples. As we mentioned in Section 5.2.1, the increasing number of samples

does not necessarily improve the precision of the form factor direct area estimator.

This can clearly be observed in the Fperp graphs in Fig. 5.6.

The number of triangles only influences the computational time, not the qual-

ity of the radiosity solution if the weighted analytical estimator is used. (We recall

that the number of samples in Fig. 5.6 is given per triangle, not per wall. This

means that the number of samples per wall for the scene BOX48 is four times

higher than the number of samples per wall for the scene BOX12.) This is not

true for the direct area estimator in the computation of Fperp. The probable

reason is that the use of more triangles on the shooting and receiving

patches forces the generation of sample pairs which cause an overestimation of

the form factor Fperp.

5.7.2 Experiments with the box scene

Fig. 5.7 is a box scene which is illuminated by a single point light source which

is located in the center of the box (the white spot). This scene demonstrates an

effect known as colour bleeding which is missing in (eye-) ray tracing.

Fig. 5.8 is another box scene with a textured right wall. This scene is illu-

minated by two spot light sources which are both located in the center of the

box. One spot light source illuminates the left wall, the other one illuminates

the right wall. The floor is a perfect mirror. The texture on the right wall in the

two-pass image is distorted because the direct lighting is reconstructed from the

illumination stored in the triangle mesh. The resolution of the mesh is not as

fine as the resolution of the screen, therefore the texture appears distorted. This

model consists of 13000 triangles. The radiosity computation took more than 3

hours (3 samples on the shooting patch and 3 samples on the receiving patch

were used for the energy transfer).

The direct illumination can be computed by the ray tracing algorithm more

accurately than by the radiosity algorithm. In order not to compute the direct

illumination term twice, the illumination from the first shot must be subtracted

from the radiosity solution before the ray tracing phase. This technique was used

for the computation of the right image in Fig. 5.9. This model consists of ca. 3000

triangles. The radiosity computation took approximately 10 minutes (4 samples

[Plots of the computed form factors (cf. Fig. 5.6). Panels: "BOX12, computation of Fpar", "BOX48, computation of Fpar", "BOX12, computation of Fperp". Horizontal axis: square root of the number of samples per patch (0 to 14). Vertical axis: form factor. Plotted series: average, min, max, exact.]

Form factor

Square root of the number of samples per patch

average

min

0.1

0.15

0.2

0.25

0.3

0.35

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

0.1

0.15

0.2

0.25

0.3

0.35

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

0.1

0.15

0.2

0.25

0.3

0.35

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

exact

0.1

0.15

0.2

0.25

0.3

0.35

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

exact

BOX48, computation of Fperp

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

exact

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

exact

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

exact

0.14

0.16

0.18

0.2

0.22

0.24

0 2 4 6 8 10 12 14

Form factor

Square root of the number of samples per patch

average

min

max

exact

Figure 5.6: Monte-Carlo computation of the form factors Fpar and Fperp for the

box scenes BOX12 and BOX48. Left column: direct area estimator (Equation 5.7).

Right column: weighted analytical estimator (Equation 5.14)


Figure 5.7: Left: ray traced box scene. Right: the radiosity solution (90% con-

verged)

Figure 5.8: Left: ray traced box scene. Right: a two-pass solution (radiosity

and ray tracing). The radiosity solution is 90% converged and it is stored in the

vertices of ca. 13000 triangles. The texture in the right image is a little distorted

because the direct illumination was reconstructed from the radiosity solution.


on the shooting patch and 4 samples on the receiving patch were used for the

energy transfer). Note that the texture is not distorted as in Fig. 5.8 although fewer triangles are used for the illumination storage.

Figure 5.9: Left: ray traced box scene. Right: a two-pass solution (radiosity

and ray tracing). The radiosity solution is 95% converged and it is stored in the

vertices of ca. 3000 triangles. The direct illumination was subtracted from the

radiosity solution and computed by the ray tracing algorithm

Fig. 5.10 shows images of the box scene which contains a blocking object.

This scene is illuminated by one spot light source which is directed towards the

textured wall. The blocking object is only illuminated indirectly—it is therefore

invisible in the ray traced image.

5.7.3 Experiments with large scenes

The main goal of the following experiments was to test the stability of the radiosity implementation on large scenes rather than to compute converged radiosity solutions (which would require more programming work). The first problem we encountered was the selection of large scenes suitable for the radiosity (two-pass) computation. Our radiosity implementation is based on POV-Ray, but it requires the objects to be modeled as triangle meshes, and most POV-Ray scenes use CSG objects instead of meshes. We chose two scenes in order to test the stability of the implementation, HOUSE and JAGDSCHLOSS. Both scenes were created during the project HiQoS [ACH+99], [PSA01], [ABB+01]. The HOUSE scene was modeled in Arcon by M3B and converted into POV-Ray using a special-purpose converter which was developed during the project HiQoS. The original HOUSE scene contains ca. 86000 triangles and 27 point light sources. The JAGDSCHLOSS scene was modeled in 3DS Max by Kinetix. The model was created by Upstart! and converted into POV-Ray using the 3DWin converter by TB Software. The original scene contains ca. 190000 triangles and 25 point light sources.

Figure 5.10: Left: ray traced box scene with a blocker. Right: a two-pass solution (radiosity and ray tracing). The radiosity solution is 95% converged and it is stored in the vertices of ca. 3000 triangles. The direct illumination was subtracted from the radiosity solution and computed by the ray tracing algorithm. Note the soft shadow, which is typical for secondary diffuse reflections

We adjusted the mesh resolution⁸ and observed the absolute times during the first few runs. Another problem which we had to solve was that the light sources in both models also illuminated the outside of the buildings, while the camera viewed the inside of the buildings. This enormously increases the computation time: even though the outside patches do not usually illuminate any patches of interest, they are selected for shooting in the early iterations. In order to save this work, we manually deleted the light sources which were positioned outside of the rooms of interest.

Jagdschloss

The adjusted JAGDSCHLOSS scene contains ca. 350000 triangles and 25 light sources. It took ca. 11 hours to compute the first 200 shooting iterations for the scene JAGDSCHLOSS (the ray tracing phase took an additional 25 minutes). Three samples on the shooting patch and three samples on the receiving patch were used for the energy transfer. The radiosity solution which is shown in Fig. 5.11 is only ca. 3% converged. Additional modeling work is needed to make the materials and the lighting more realistic.

⁸ This initial mesh refinement is not necessary when the adaptive substructuring is implemented.


Figure 5.11: Left: ray traced scene JAGDSCHLOSS. Right: the radiosity (two-pass)

solution. The radiosity solution is 3% converged and it is stored in the vertices

of ca. 350000 triangles.

House

The adjusted HOUSE scene contains ca. 170000 triangles and 6 point light sources. It took ca. 35 hours to compute the first 100 shooting iterations for the scene HOUSE. Three samples on the shooting patch and three samples on the receiving patch were used for the energy transfer. The radiosity solution which is shown in Fig. 5.12 is only ca. 4% converged.

Figure 5.12: Left: ray traced scene HOUSE. Right: the radiosity (two-pass) solu-

tion. The radiosity solution is 4% converged and it is stored in the vertices of ca.

170000 triangles.


5.8 Conclusions

We presented a shooting radiosity algorithm which combines the Monte-Carlo form factor computation with the energy exchange using the ray tracing shader. The main advantages of the algorithm are its simplicity and flexibility. No constraints are placed on the topology or the resolution of the input mesh, and arbitrary materials are dealt with correctly. The algorithm can be incorporated into any ray tracer as a preprocessing step, and ray tracing acceleration techniques such as light buffers and hierarchies of bounding boxes can be reused in the radiosity computations. Our implementation is based on the state-of-the-art Persistence of Vision Ray Tracer (POV-Ray).

We experimentally verified the correctness of the implementation on a spe-

cially constructed scene for which the exact radiosity solution is known. We also

compared two Monte-Carlo estimators during these experiments: the direct area

estimator and the weighted analytical estimator. The weighted analytical esti-

mator clearly outperformed the direct area estimator and was used in all the

following experiments.
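The idea of estimating a form factor by area sampling can be illustrated with a minimal sketch. It is not the implementation used in the experiments and may differ in detail from the estimators of Equations 5.7 and 5.14. It assumes two fully visible, parallel unit squares at distance 1 (the floor and the ceiling of a unit cube), for which the analytical form factor is approximately 0.1998. The function names and the small pseudo-random generator are hypothetical helpers introduced only for this illustration.

```c
#include <math.h>
#include <stdint.h>

/* Small deterministic pseudo-random generator (64-bit LCG), used only to
   keep the sketch reproducible. Returns a value in [0, 1). */
static double next_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0;   /* 2^53 */
}

/* Monte Carlo estimate of the form factor between the floor (z = 0) and
   the ceiling (z = 1) of a unit cube. For each sample, one point is drawn
   uniformly on each face and the kernel
       cos(theta_x) * cos(theta_y) / (pi * r^2)
   is accumulated. The mean kernel value times the area of the receiving
   face is the estimate. The faces see each other completely, so no
   visibility test is needed in this special case. */
double estimate_parallel_form_factor(unsigned long n_samples, uint64_t seed)
{
    const double pi = 3.14159265358979323846;
    const double receiver_area = 1.0;
    uint64_t state = seed;
    double sum = 0.0;
    unsigned long i;

    for (i = 0; i < n_samples; i++) {
        double x1 = next_uniform(&state);   /* point on the floor */
        double x2 = next_uniform(&state);
        double y1 = next_uniform(&state);   /* point on the ceiling */
        double y2 = next_uniform(&state);
        double dx = x1 - y1;
        double dy = x2 - y2;
        double r2 = dx * dx + dy * dy + 1.0;   /* squared distance, dz = 1 */

        /* Both normals are parallel to the z axis, so cos(theta) = 1 / r
           on both faces and the kernel reduces to 1 / (pi * r^4). */
        sum += 1.0 / (pi * r2 * r2);
    }
    return receiver_area * sum / (double)n_samples;
}
```

Calling estimate_parallel_form_factor() with a growing number of samples shows the kind of convergence towards the exact value that the plots of Figure 5.6 report for the box scenes.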

Another set of experiments included radiosity computations for variants of the

box scene and two larger scenes. The implementation needs further optimisations

in order to be suitable for large scenes—however, it is reliable. We did not observe

any problem although some computations took more than one day.

An inherent disadvantage of the proposed algorithm is that it requires a mesh representation of the model. The efficiency depends on the resolution of the input mesh. An interesting possibility may be a decoupling of the geometry and the illumination storage. For example, the energy exchange may only be carried out between object bounding boxes or between the nodes of the hierarchy of bounding boxes (bounding slabs). An obvious disadvantage of this approach is that the computed illumination is only a rough approximation of the actual illumination. On the other hand, this approach allows for arbitrary object types, not only triangle meshes. Moreover, some of the tricks which are used in order to reduce the time complexity of the shooting radiosity algorithm lead to approximations of the actual illumination anyway. The use of bounding boxes for the energy exchange may be a viable alternative.

Chapter 6

Summary

Parallel photorealistic image synthesis is a challenging problem. Parallel ren-

dering systems which compute photorealistic images are very rare. This thesis

identifies some of the obstacles which hamper the development of such systems.

The target architectures for the deployment of parallel rendering systems are distributed-memory systems such as computing clusters. The contemporary middleware for the development of parallel applications for these systems consists of message passing standards such as PVM and MPI. An obvious problem of some of the implementations of PVM and MPI is a lack of thread-safety. This forces a large class of parallel applications to use polling. This class can be characterised and includes, for instance, all applications which compute on distributed data. We refer to the applications which form this class as non-trivial. Parallel ray tracing using a distributed object database is a non-trivial application.

Polling generally diminishes performance and causes non-deterministic effects in applications. We identified several sources of polling in message passing programs which build on PVM or MPI. Apart from issues such as the lack of thread-safety or the violation of independent message progress in the implementations of the standards, the specification of the MPI standard differs from the specification of well-accepted abstract message passing models. This difference is not an improvement on the abstract models; it is another source of problems in parallel programs which require asynchronous message passing.
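The difference between a polling receiver and an event-driven receiver can be sketched inside a single process with POSIX threads. This is only an analogy under stated assumptions: the Mailbox type, the function names and the 1 ms polling interval are hypothetical and have nothing to do with the internals of PVM, MPI or TPL. The polling receiver may leave a message unnoticed for up to one polling interval, whereas the event-driven receiver is woken up by the arrival itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Hypothetical in-process mailbox standing in for a message queue. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  arrived;   /* signalled whenever a message is delivered */
    int pending;               /* delivered but not yet consumed */
    int to_send;               /* number of messages the sender delivers */
} Mailbox;

static void *sender(void *arg)
{
    Mailbox *mb = (Mailbox *)arg;
    int i;

    for (i = 0; i < mb->to_send; i++) {
        pthread_mutex_lock(&mb->lock);
        mb->pending++;
        pthread_cond_signal(&mb->arrived);   /* the "event" */
        pthread_mutex_unlock(&mb->lock);
    }
    return NULL;
}

/* Polling receiver: looks into the mailbox and sleeps for a fixed interval
   whenever it has not yet received everything. */
static int receive_polling(Mailbox *mb, int expected)
{
    struct timespec interval = {0, 1000000};   /* 1 ms */
    int consumed = 0;

    while (consumed < expected) {
        pthread_mutex_lock(&mb->lock);
        consumed += mb->pending;
        mb->pending = 0;
        pthread_mutex_unlock(&mb->lock);
        if (consumed < expected)
            nanosleep(&interval, NULL);
    }
    return consumed;
}

/* Event-driven receiver: blocks until the sender signals an arrival. */
static int receive_event_driven(Mailbox *mb, int expected)
{
    int consumed = 0;

    pthread_mutex_lock(&mb->lock);
    while (consumed < expected) {
        while (mb->pending == 0)
            pthread_cond_wait(&mb->arrived, &mb->lock);
        consumed += mb->pending;
        mb->pending = 0;
    }
    pthread_mutex_unlock(&mb->lock);
    return consumed;
}

/* Runs one sender thread against the chosen receiver. Returns the number
   of messages received, or -1 if the sender thread could not be started. */
int run_exchange(int use_polling, int n_messages)
{
    Mailbox mb;
    pthread_t tid;
    int received;

    pthread_mutex_init(&mb.lock, NULL);
    pthread_cond_init(&mb.arrived, NULL);
    mb.pending = 0;
    mb.to_send = n_messages;
    if (pthread_create(&tid, NULL, sender, &mb) != 0)
        return -1;
    received = use_polling ? receive_polling(&mb, n_messages)
                           : receive_event_driven(&mb, n_messages);
    pthread_join(tid, NULL);
    pthread_mutex_destroy(&mb.lock);
    pthread_cond_destroy(&mb.arrived);
    return received;
}
```

Both receivers deliver the same messages; they differ in when the receiver notices them and in how much processor time is spent on checking an empty mailbox.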

We proposed a formal framework which adheres to existing formal message passing models and which addresses practical issues at the same time. There is a strong similarity between our framework and the framework for database systems which is accepted as a standard by both the developers and the users of database systems. We developed a message passing library, TPL, which is a direct implementation of our framework. TPL is thread-safe, does not internally use polling and, unlike PVM or MPI, defines asynchronous message passing. We showed that TPL can be implemented on top of slightly extended PVM or MPI implementations. These extensions include the introduction of an interrupt mechanism which is invoked on demand. The event-driven TPL library outperformed both PVM 3.4 and MPICH 1.2.4 by two orders of magnitude on a simple threaded pingpong benchmark running on standard computing cluster hardware. This benchmark is an abstraction of a non-trivial application and it must use polling when the communication library is not thread-safe. We do not claim that TPL is the best possible implementation of message passing; on the contrary, there is ample room for improvement in its current implementation. Interestingly, some of the optimisations were addressed in the hardware of INMOS Transputers.

We defined the global illumination problem and gave a brief overview of methods which solve the problem. Direct solution methods have recently been attracting the attention of the computer graphics community but approximation methods such as ray tracing and radiosity are still widely used in applications. We focused on ray tracing because the techniques of the basic ray tracing algorithm can be used in direct methods which solve the global illumination problem without approximations.

We described a parallelisation of ray tracing which builds on the ideas of Green and Paddon who developed a parallel ray tracer on Transputers more than 10 years ago. The parallelisation is relatively simple and robust and does not place constraints on the size of the 3D model, as the model can be distributed in the memories of the processes. We presented a load balancing algorithm for parallel ray tracing which uses demand-driven screen space subdivision. Our algorithm is perfect in the sense that if its parameters are optimally set, then it guarantees a perfect balance of load (hence, the shortest parallel time) and it minimises the communication at the same time. The optimal setting of the parameters is unknown but the parameters are intuitive and can be automatically tuned at run time. We suggested a tuning procedure and demonstrated it on a set of experiments. In these experiments, we compared two parallel ray tracing programs which are identical in all respects but one: the first program uses the interrupt mechanism inside the communication library, whereas the second program uses polling inside the communication library. The first program outperformed the second one in all experiments (for all problem instances and in all runs).
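Why a demand-driven assignment of screen tiles tends to balance the load better than a fixed assignment can be seen in a toy model. The sketch below is a generic illustration, not the load balancing algorithm of this thesis: it ignores communication costs and the tunable parameters, it assumes that the cost of every tile is known in advance, and all names in it are hypothetical.

```c
#include <stddef.h>

#define MAX_WORKERS 64

/* Static assignment: the tiles are split in advance into contiguous blocks
   of equal size, one block per worker. Returns the parallel time (the time
   of the busiest worker), or -1.0 for an invalid number of workers. */
double static_parallel_time(const double *tile_cost, size_t n_tiles,
                            size_t n_workers)
{
    size_t block, w, t;
    double longest = 0.0;

    if (n_workers == 0)
        return -1.0;
    block = (n_tiles + n_workers - 1) / n_workers;
    for (w = 0; w < n_workers; w++) {
        double busy = 0.0;
        for (t = w * block; t < (w + 1) * block && t < n_tiles; t++)
            busy += tile_cost[t];
        if (busy > longest)
            longest = busy;
    }
    return longest;
}

/* Demand-driven assignment: the next unassigned tile always goes to the
   worker which becomes idle first (that worker "asks for more work"). */
double demand_driven_parallel_time(const double *tile_cost, size_t n_tiles,
                                   size_t n_workers)
{
    double busy_until[MAX_WORKERS] = {0.0};
    double longest = 0.0;
    size_t w, t;

    if (n_workers == 0 || n_workers > MAX_WORKERS)
        return -1.0;
    for (t = 0; t < n_tiles; t++) {
        size_t first_idle = 0;
        for (w = 1; w < n_workers; w++)
            if (busy_until[w] < busy_until[first_idle])
                first_idle = w;
        busy_until[first_idle] += tile_cost[t];
    }
    for (w = 0; w < n_workers; w++)
        if (busy_until[w] > longest)
            longest = busy_until[w];
    return longest;
}
```

For the tile costs {9, 1, 1, 1, 1, 1, 1, 1} and two workers, the static split yields a parallel time of 12, whereas the demand-driven assignment yields 9, because the second worker takes over all cheap tiles while the first one is busy with the expensive tile.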

We introduced a practical shooting radiosity algorithm which can be incorporated into any ray tracer. The algorithm uses Monte Carlo form factor computation which is combined with energy transfer using the ray tracing shader. This combined step keeps the radiosity algorithm simple and general at the same time. We devoted special attention to the choice of the Monte Carlo form factor estimator, which is essential for the reliability of the radiosity algorithm.

The state of the art of global illumination algorithms is in our opinion much

more advanced than the state of the art of 3D standards which define the form

in which 3D models are stored. Rendering algorithms, as well as human 3D

artists, are often either forced to work with insufficient or incorrect data or to

develop their own standards. The following short section identifies a source of

this problem.


6.1 Towards portable 3D standards

Computer graphics has been evolving very fast in the past few decades. It has found applications in computer games, films, architecture, etc. However, the process of producing photorealistic images is far from being automatic. The human factor is required in order to achieve the desired level of perfection. The 3D artist must sometimes “help” the rendering system to make the image look realistic. This is not acceptable in certain applications. An example of such an application is the conservation of cultural heritage. We would like to store models of real 3D objects such as buildings, cars, furniture, vases, statues etc. in a form from which very realistic images can be reconstructed later. This form must not be tied to any particular modeling or rendering software product or to a particular rendering algorithm. Future improvements of the rendering systems should increase the level of photorealism. Several hundreds of 3D formats exist but none provides this level of portability. We will return to the modeling of surface geometry in order to explain what is missing. However, the following reasoning also applies to the modeling of materials, light sources etc. (The introduction of procedural shaders, the IES luminaire format and the MGF format indicates a movement toward the standardisation of materials and light sources.)

Almost all modeling programs can internally work with spheres. Why cannot a sphere be passed to some other modeling or rendering program as a sphere? Why must an object as simple as a sphere be converted into a triangle mesh instead? The first reason lies in the “almost all”: not all modeling or rendering programs work with spheres. Another reason is that the representations of spheres differ. One program may work with the representation ⟨center, radius⟩ whereas another program may work with the representation ⟨x1, x2, x3, x4⟩ (where x1, x2, x3, x4 are points in 3D space). The conversion between these two representations is simple in this particular case, but it may be difficult or impossible for representations of other geometric primitives.

Let us assume that a standard 3D format exists which can store spheres

(in some representation). Which other geometric primitives must the format

support? Cones? Tori? Cylinders? The current state of the art reduces the set

of geometric primitives to a triangle mesh because triangle meshes are currently

supported in practically all software and hardware systems. We claim that a

portable 3D standard must support all geometric primitives. However, the set of

all geometric primitives is infinite and there is no unique representation for all of

them.

A portable 3D standard should not attempt to define the set of representations which must be supported by modeling and rendering programs. Instead, it must define methods (operations on the geometric primitives) on which the programs can rely. These methods constitute the interface between the representations of geometric primitives and the algorithms which work with the primitives. Good candidates for such methods are finding all intersections of a ray and a primitive and computing the surface normal at an intersection point. The set of methods must be chosen carefully so that they can also be applied to compound objects which are created using the CSG operations (see Section 3.2.2). The implementation of the methods for a geometric primitive (program code) must be stored in a file together with the primitive. The storage of program code in a platform-independent form was a problem in the past but this has changed. At present, Java code is portable across practically all existing platforms [Sunb]. The methods can therefore be implemented in Java.

The hiding of the implementation of the methods allows for the insertion of a new geometric primitive at any time without the need to reimplement existing modeling or rendering software. This is actually the idea behind the Java technology. Java3D is in our opinion a step backwards (it is based on the mesh-only technology) and we believe that the introduction of Java3D by the original Java developers (Sun Microsystems) is a tactical rather than a strategic decision [Suna].

Triangle meshes have been around for several decades and will certainly remain in use for a long time. However, the concept which we sketch does not exclude the further use of triangle meshes; it only allows other geometric primitives to be used as well. Rendering algorithms are prepared for the introduction of such a concept and so are 3D artists. A great part of contemporary computer graphics is a collection of one-purpose tricks. A unifying concept would help to distinguish one-purpose tricks from techniques which apply generally.

Appendix A

MPI progress rule tester

The MPI program below verifies whether an implementation of the MPI library violates the progress rule defined in the MPI standard. The sleep(5) in process 0 should ensure that process 1 performs its actions prior to the MPI_Ssend() call in process 0. This program, linked with an MPI library which obeys the progress rule, eventually (usually immediately after 5 seconds have passed) prints out the following:

0: Sleeping for 5 seconds...

1: posting Irecv

1: Irecv posted, blocking

0: Ssend...

0: Ssend completed (test passed)

An incorrect MPI implementation (an implementation which violates the

progress rule) does not print out the last line. None of the following MPI imple-

mentations passed this test:

• MPICH 1.2.4 (Intel Pentium, Fast-Ethernet/TCP)

• MPICH 1.2.4 (Intel Itanium, Myrinet/GM)

• ScaMPI 1.13.7 (Intel Pentium, Dolphin PCI/SCI)

(We are aware of no MPI implementation which passes this test.)

/* mpi_progress.c */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>   /* sleep() */
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    MPI_Request req;
    int tmp_int = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        printf("0: Sleeping for 5 seconds...\n"); fflush(stdout);
        sleep(5);
        printf("0: Ssend...\n"); fflush(stdout);
        /* Returns only if the library matches this send with the receive
           posted by process 1 while process 1 spins outside of MPI. */
        MPI_Ssend(&tmp_int, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("0: Ssend completed (test passed)\n"); fflush(stdout);
    }
    else
    {
        printf("1: posting Irecv\n"); fflush(stdout);
        MPI_Irecv(&tmp_int, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        printf("1: Irecv posted, blocking\n"); fflush(stdout);
        for (;;)   /* busy loop which never calls the MPI library again */
            ;
    }

    MPI_Finalize();
    return(0);
}

Appendix B

Threaded pingpong benchmark

The three programs below implement the SYMMETRICAL THREADED PINGPONG benchmark which is described in Section 2.8.2. The TPL 2.0 program is event-driven, whereas the PVM 3.4 and MPI programs use polling.

B.1 TPL 2.0

/* pingpong_tpl.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include "tpl.h"

#define RANK_PING 0
#define RANK_PONG 1

enum
{
    MSG_INIT = TPL_MSGTAG_LAST,
    MSG_PING,
    MSG_PONG,
    MSG_QUIT
};

static int nr_msgs;
static int msg_size;
static char *databuf;
static int rank_ping = RANK_PING;
static int rank_pong = RANK_PONG;
static int my_rank;
static struct timeval tv1, tv2;
static struct timezone tz1, tz2;
static float elapsed_time, mbytes;

static int match_pong(int sender, int tag, void *message);
static void *send_thread(void *arg);
static void *recv_thread(void *arg);
static TPL_ACTION message_handler(int sender, int tag, void *message);

/* Matching function: accepts only PONG messages. */
static int match_pong(int sender, int tag, void *message)
{
    return(tag == MSG_PONG);
}

/* Sends nr_msgs PING messages to the PONG process. */
static void *send_thread(void *arg)
{
    int counter;
    void *sendbuf;

    /* To avoid startup bias. */
    sleep(3);

    /* Start timer. */
    gettimeofday(&tv1, &tz1);

    for (counter = 0; counter < nr_msgs; counter++)
    {
        tpl_begin_send(&sendbuf);
        tpl_pkchar(sendbuf, databuf, msg_size);
        tpl_send(&rank_pong, 1, MSG_PING, sendbuf);
        tpl_end_send(sendbuf);
    }
    return NULL;
}

/* Receives nr_msgs PONG messages, stops the timer and prints the result. */
static void *recv_thread(void *arg)
{
    int counter;
    void *message;
    int tag;
    int sender;
    char *buf;

    if (msg_size > 0)
        buf = (char *) tpl_malloc(msg_size * sizeof(char));

    for (counter = 0; counter < nr_msgs; counter++)
    {
        tpl_begin_recv();
        tpl_recv(match_pong, &sender, &tag, &message);
        tpl_upkchar(message, buf, msg_size);
        tpl_end_recv(message);
    }

    /* Stop timer. */
    gettimeofday(&tv2, &tz2);

    tpl_begin_send(&message);
    tpl_send(&rank_pong, 1, MSG_QUIT, message);
    tpl_end_send(message);
    tpl_begin_send(&message);
    tpl_send(&rank_ping, 1, MSG_QUIT, message);
    tpl_end_send(message);

    /* The signed microsecond difference handles the carry in both cases. */
    elapsed_time = (tv2.tv_sec - tv1.tv_sec) +
                   (tv2.tv_usec - tv1.tv_usec) / 1e6;
    mbytes = 2.0 * (float) nr_msgs * (float) msg_size / 1048576.0;
    printf("%d x %d Bytes x 2 = %f MBytes, %f s, %f MB/s, %f msg/s, "
           "%f s/msg, %d / %d sleeps\n",
           nr_msgs, msg_size, mbytes, elapsed_time, mbytes / elapsed_time,
           nr_msgs / elapsed_time, elapsed_time / nr_msgs, 0, 0);
    return NULL;
}

/* Invoked by TPL for every incoming message. */
static TPL_ACTION message_handler(int sender, int tag, void *message)
{
    void *sendbuf = NULL;

    switch (tag)
    {
    case MSG_PING: /* Only for PONG */
        tpl_begin_send(&sendbuf);
        tpl_pkchar(sendbuf, databuf, msg_size);
        tpl_send(&rank_ping, 1, MSG_PONG, sendbuf);
        tpl_end_send(sendbuf);
        return(TPL_ACTION_DROP);
        break;
    case MSG_PONG: /* Only for PING */
        return(TPL_ACTION_ENQUEUE);
        break;
    case MSG_QUIT: /* For both PING and PONG */
        return(TPL_ACTION_EXIT);
        break;
    default:
        printf("%d: ", my_rank);
        tpl_error("Unknown message\n");
        break;
    }
    return(FALSE);
}

int main(int argc, char **argv)
{
    pthread_t send_thread_id, recv_thread_id;
    int nr_tasks;

    tpl_initialize(&argc, &argv);

    /* Spawn processes. */
    nr_tasks = 2;
    tpl_spawn(&nr_tasks, &my_rank, &argc, &argv);

    if (argc != 3)
    {
        tpl_error("Usage: ping <nr_msgs> <msg_size>\n");
    }
    nr_msgs = atoi(argv[1]);
    msg_size = atoi(argv[2]);

    if (msg_size > 0)
        databuf = (char *) tpl_malloc(msg_size * sizeof(char));

    switch(my_rank)
    {
    case RANK_PING:
        /* Start sending thread. */
        if (pthread_create(&send_thread_id, NULL, send_thread,
                           NULL) != 0)
        {
            tpl_error("Could not start sending thread\n");
        }
        /* Start receiving thread. */
        if (pthread_create(&recv_thread_id, NULL, recv_thread,
                           NULL) != 0)
        {
            tpl_error("Could not start receiving thread\n");
        }
        tpl_handle_messages(message_handler);
        pthread_join(send_thread_id, NULL);
        pthread_join(recv_thread_id, NULL);
        break;
    case RANK_PONG:
        tpl_handle_messages(message_handler);
        break;
    default:
        tpl_error("Unknown role\n");
        break;
    }

    /* Terminate the program. */
    tpl_deinitialize();
    if (msg_size > 0)
        tpl_free(databuf);
    return(0);
}


B.2 PVM 3.4

/* ping_pvm.c */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
#include "pvm3.h"

#define ERROR(a) {printf(a); fflush(stdout); exit(1);}
#define PONG_LOCATION "pvm_pong"   /* no trailing semicolon in the macro */
#define MAX_MSG_LENGTH 2000000

enum
{
    MSG_INIT = 0,
    MSG_PING,
    MSG_PONG,
    MSG_QUIT
};

static int pong_tid;
static int nr_msgs;
static int msg_size;
static struct timeval tv1, tv2;
static struct timezone tz1, tz2;
static float elapsed_time, mbytes;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *pong_send(void *arg);

/* Sending thread: sends nr_msgs PING messages to the pong task.
   The mutex serialises all calls into the PVM library. */
static void *pong_send(void *arg)
{
    int counter;
    char msg[MAX_MSG_LENGTH];

    /* To avoid startup bias. */
    sleep(3);

    /* Start timer. */
    gettimeofday(&tv1, &tz1);

    for (counter = 0; counter < nr_msgs; counter++)
    {
        pthread_mutex_lock(&mutex);
        if (pvm_initsend(PvmDataDefault) < 0)
            ERROR("error pvm_initsend\n");
        pvm_pkbyte(msg, msg_size, 1);
        if (pvm_send(pong_tid, MSG_PING) < 0)
            ERROR("error pvm_send\n");
        pthread_mutex_unlock(&mutex);
    }
    return(NULL);
}

int main(argc, argv)

int argc;

char *argv[];

{

int counter;

char msg[MAX MSG LENGTH];

pthread t pth id;

int bufid;

struct timespec pause, rem;

struct pvmhostinfo *pvm hosts;

int pvm nr hosts;

int pvm nr archs;

int mytid = pvm mytid();

int nr sleep calls = 0;

int nr sleep calls2 = 0;

if (argc != 3)

{

printf("Usage: ping <nr msgs> <msg size>\n");

exit(1);

}

nr msgs = atoi(argv[1]);

msg size = atoi(argv[2]);

pvm setopt(PvmRoute, PvmRouteDirect);

pvm config(&pvm nr hosts, &pvm nr archs, &pvm hosts);

206 APPENDIX B. THREADED PINGPONG BENCHMARK

if (pvm spawn(PONG LOCATION, NULL, PvmTaskHost,

pvm hosts[1].hi name, 1, &pong tid) < 1)

ERROR("Could not spawn pong\n");

if (pvm initsend(PvmDataDefault) < 0)

ERROR("error pvm initsend\n");

pvm pkint(&mytid, 1, 1);

pvm pkint(&nr msgs, 1, 1);

pvm pkint(&msg size, 1, 1);

if (pvm send(pong tid, MSG INIT) < 0)

ERROR("error pvm send\n");

pvm recv(-1, MSG INIT);

pthread create(&pth id, NULL, pong send, NULL);

for (counter = 0; counter < nr msgs;)

{

pthread mutex lock(&mutex);

while ((bufid = pvm probe(-1, MSG PONG)) > 0)

{

pvm recv(-1, MSG PONG);

pvm upkbyte(msg, msg size, 1);

pvm freebuf(bufid);

counter++;

}

pthread mutex unlock(&mutex);

if (counter)

nr sleep calls++;

if (counter < nr msgs)

{

pause.tv sec = 0;

pause.tv nsec = 50000000;

while (nanosleep(&pause, &rem) == -1)

nanosleep(&rem, &rem);

}

pvm recv(-1, MSG QUIT);

pvm upkint(&nr sleep calls2, 1, 1);

/* Stop timer. */

gettimeofday(&tv2, &tz2);

pthread join(pth id, NULL);

B.2. PVM 3.4 207

    /* Elapsed time in seconds. If the microsecond field of tv2 is not
       larger than that of tv1, one second is borrowed. */
    elapsed_time = tv1.tv_usec < tv2.tv_usec ?
        tv2.tv_sec - tv1.tv_sec + (tv2.tv_usec - tv1.tv_usec) / 1e6 :
        tv2.tv_sec - tv1.tv_sec - 1 +
            (1000000 + tv2.tv_usec - tv1.tv_usec) / 1e6;
    mbytes = 2.0 * (float) nr_msgs * (float) msg_size / 1048576.0;
    printf("%d x %d Bytes x 2 = %f MBytes, %f s, %f MB/s, %f msg/s, "
           "%f s/msg, %d / %d sleeps\n",
           nr_msgs, msg_size, mbytes, elapsed_time, mbytes / elapsed_time,
           nr_msgs / elapsed_time, elapsed_time / nr_msgs,
           nr_sleep_calls, nr_sleep_calls2);
    pvm_exit();
    return(0);
}
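The timer arithmetic at the end of the ping process can be checked in isolation. The following is a minimal sketch and not part of the original benchmark: the helper name elapsed_seconds is hypothetical, and it assumes that both timestamps were obtained with gettimeofday() and that the stop time is not earlier than the start time.

```c
#include <assert.h>
#include <sys/time.h>

/* Hypothetical helper (not in the original listing): returns the time
   from *start to *stop in seconds. */
static double elapsed_seconds(const struct timeval *start,
                              const struct timeval *stop)
{
    long sec = (long) (stop->tv_sec - start->tv_sec);
    long usec = (long) (stop->tv_usec - start->tv_usec);

    /* Borrow one second if the microsecond difference is negative. */
    if (usec < 0)
    {
        sec -= 1;
        usec += 1000000L;
    }
    /* Assumption: the stop time is not earlier than the start time. */
    assert(sec >= 0);
    return (double) sec + (double) usec / 1e6;
}
```

With tv1 and tv2 as in the listing, a call such as elapsed_seconds(&tv1, &tv2) would yield the value stored in elapsed_time.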

/* pong_pvm.c */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
#include "pvm3.h"

#define MAX_MSG_LENGTH 2000000

enum
{
    MSG_INIT = 0,
    MSG_PING,
    MSG_PONG,
    MSG_QUIT
};

static int ping_tid;
static int nr_msgs;
static int msg_size;
static char sendbuf[MAX_MSG_LENGTH];
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

int main()

{
    int counter;
    int bufid;
    struct timespec pause, rem;
    int nr_sleep_calls = 0;

    pvm_mytid();
    pvm_setopt(PvmRoute, PvmRouteDirect);
    /* Receive the parameters of the run from the ping process. */
    bufid = pvm_recv(-1, MSG_INIT);
    pvm_upkint(&ping_tid, 1, 1);
    pvm_upkint(&nr_msgs, 1, 1);
    pvm_upkint(&msg_size, 1, 1);
    pvm_initsend(PvmDataDefault);
    pvm_send(ping_tid, MSG_INIT);
    for (counter = 0; counter < nr_msgs;)
    {
        /* Echo every PING message which has already arrived. */
        pthread_mutex_lock(&mutex);
        while ((bufid = pvm_probe(-1, MSG_PING)) > 0)
        {
            pvm_recv(-1, MSG_PING);
            pvm_upkbyte(sendbuf, msg_size, 1);
            pvm_freebuf(bufid);
            pvm_initsend(PvmDataDefault);
            pvm_pkbyte(sendbuf, msg_size, 1);
            pvm_send(ping_tid, MSG_PONG);
            counter++;
        }
        pthread_mutex_unlock(&mutex);
        if (counter)
            nr_sleep_calls++;
        if (counter < nr_msgs)
        {
            /* Polling pause of 50 milliseconds. */
            pause.tv_sec = 0;
            pause.tv_nsec = 50000000;
            while (nanosleep(&pause, &rem) == -1)
                nanosleep(&rem, &rem);
        }
    }
    /* Report the number of sleep calls to the ping process. */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&nr_sleep_calls, 1, 1);
    pvm_send(ping_tid, MSG_QUIT);

    pvm_exit();
    return(0);
}
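The polling loops above pause with nanosleep() and retry when the call is interrupted. One possible variant, shown here only as a sketch, resumes with the remaining time after an interruption by a signal, so that one pause lasts approximately the requested interval. The helper name sleep_full_interval is hypothetical and does not appear in the benchmark.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Hypothetical helper (not in the original listing): sleeps for nsec
   nanoseconds (0 <= nsec < 1000000000) and resumes the sleep when it is
   interrupted by a signal. Returns 0 on success and -1 on any error
   other than EINTR. */
static int sleep_full_interval(long nsec)
{
    struct timespec req, rem;

    assert(nsec >= 0 && nsec < 1000000000L);
    req.tv_sec = 0;
    req.tv_nsec = nsec;
    while (nanosleep(&req, &rem) == -1)
    {
        if (errno != EINTR)
            return -1;
        /* Continue with the time which is still left. */
        req = rem;
    }
    return 0;
}
```

A call sleep_full_interval(50000000L) would correspond to the 50 millisecond pause used in the listings.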

B.3 MPI (MPI 1, MPI 2)

/* pingpong_mpi.c */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <sys/time.h>
#include <unistd.h>
#include "mpi.h"

#define MAX_MSG_LENGTH 2000000
#define ERROR(a) {printf(a); fflush(stdout); exit(1);}

enum
{
    MSG_INIT = 0,
    MSG_PING,
    MSG_PONG,
    MSG_QUIT
};

static int ping_tid = 0;
static int pong_tid = 1;
static int nr_msgs;
static int msg_size;
static struct timeval tv1, tv2;
static struct timezone tz1, tz2;
static float elapsed_time, mbytes;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

static void *pong_send(void *arg);

static void *pong_send(arg)

void *arg;
{
    int counter;
    char msg[MAX_MSG_LENGTH];
    char msg2[MAX_MSG_LENGTH];
    int offset;
    MPI_Request send_req;
    MPI_Status sts;
    int send_finished;
    struct timespec pause, rem;

    /* To avoid startup bias. */
    sleep(3);
    /* Start timer. */
    gettimeofday(&tv1, &tz1);
    pthread_mutex_lock(&mutex);
    for (counter = 0; counter < nr_msgs; counter++)
    {
        offset = 0;
        MPI_Pack(msg2, msg_size, MPI_BYTE, msg, MAX_MSG_LENGTH, &offset,
                 MPI_COMM_WORLD);
        /* We must not use synchronous send here, otherwise the data flow
           mechanism will block if this thread gets to sending
           many times in a row. */
        MPI_Isend(msg, offset, MPI_PACKED, pong_tid, MSG_PING,
                  MPI_COMM_WORLD, &send_req);
        /* We must make sure that the send finishes before we attempt to
           start another one. But at the same time we must allow the recv
           thread to proceed. */
        do
        {
            send_finished = 0;
            MPI_Test(&send_req, &send_finished, &sts);
            if (! send_finished)
            {
                pthread_mutex_unlock(&mutex);
                pause.tv_sec = 0;
                pause.tv_nsec = 50000000;
                while (nanosleep(&pause, &rem) == -1)
                    nanosleep(&rem, &rem);
                pthread_mutex_lock(&mutex);
            }
        } while (! send_finished);

    }
    pthread_mutex_unlock(&mutex);
    return(NULL);
}

int main(argc, argv)
int argc;
char *argv[];
{
    int counter;
    char msg[MAX_MSG_LENGTH];
    char msg2[MAX_MSG_LENGTH];
    pthread_t pth_id;
    int offset;
    struct timespec pause, rem;
    int rank;
    MPI_Status sts;
    int flag;
    int nr_sleep_calls = 0;
    int nr_sleep_calls2 = 0;

    MPI_Init(&argc, &argv);
    if (argc != 3)
    {
        printf("Usage: ping <nr_msgs> <msg_size>\n");
        exit(1);
    }
    nr_msgs = atoi(argv[1]);
    msg_size = atoi(argv[2]);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == ping_tid)
    {
        /* Ping process: send the parameters of the run to pong. */
        offset = 0;
        MPI_Pack(&pong_tid, 1, MPI_INT, msg, MAX_MSG_LENGTH, &offset,
                 MPI_COMM_WORLD);
        MPI_Pack(&nr_msgs, 1, MPI_INT, msg, MAX_MSG_LENGTH, &offset,
                 MPI_COMM_WORLD);
        MPI_Pack(&msg_size, 1, MPI_INT, msg, MAX_MSG_LENGTH, &offset,
                 MPI_COMM_WORLD);
        MPI_Send(msg, offset, MPI_PACKED, pong_tid, MSG_INIT,
                 MPI_COMM_WORLD);
        MPI_Recv((void *) msg, MAX_MSG_LENGTH, MPI_PACKED, pong_tid,

                 MSG_INIT, MPI_COMM_WORLD, &sts);
        pthread_create(&pth_id, NULL, pong_send, NULL);
        pthread_mutex_lock(&mutex);
        for (counter = 0; counter < nr_msgs;)
        {
            /* Drain all PONG messages which have already arrived. */
            do
            {
                flag = 0;
                MPI_Iprobe(pong_tid, MSG_PONG, MPI_COMM_WORLD, &flag, &sts);
                if (flag)
                {
                    MPI_Recv((void *) msg, MAX_MSG_LENGTH, MPI_PACKED,
                             pong_tid, MSG_PONG, MPI_COMM_WORLD, &sts);
                    offset = 0;
                    MPI_Unpack((void *) msg, MAX_MSG_LENGTH, &offset,
                               msg2, msg_size, MPI_BYTE, MPI_COMM_WORLD);
                    counter++;
                }
            }
            while (flag);
            if (counter)
                nr_sleep_calls++;
            if (counter < nr_msgs)
            {
                /* Let the sending thread proceed during the pause. */
                pthread_mutex_unlock(&mutex);
                pause.tv_sec = 0;
                pause.tv_nsec = 50000000;
                while (nanosleep(&pause, &rem) == -1)
                    nanosleep(&rem, &rem);
                pthread_mutex_lock(&mutex);
            }
        }
        pthread_mutex_unlock(&mutex);
        /* Stop timer. */
        gettimeofday(&tv2, &tz2);
        MPI_Recv((void *) msg, MAX_MSG_LENGTH, MPI_PACKED,
                 pong_tid, MSG_QUIT, MPI_COMM_WORLD, &sts);
        offset = 0;
        MPI_Unpack((void *) msg, sizeof(int), &offset,

                   &nr_sleep_calls2, 1, MPI_INT, MPI_COMM_WORLD);
        pthread_join(pth_id, NULL);
        /* Elapsed time in seconds. If the microsecond field of tv2 is not
           larger than that of tv1, one second is borrowed. */
        elapsed_time = tv1.tv_usec < tv2.tv_usec ?
            tv2.tv_sec - tv1.tv_sec + (tv2.tv_usec - tv1.tv_usec) / 1e6 :
            tv2.tv_sec - tv1.tv_sec - 1 +
                (1000000 + tv2.tv_usec - tv1.tv_usec) / 1e6;
        mbytes = 2.0 * (float) nr_msgs * (float) msg_size / 1048576.0;
        printf("%d x %d Bytes x 2 = %f MBytes, %f s, %f MB/s, %f msg/s, "
               "%f s/msg, %d / %d sleeps\n",
               nr_msgs, msg_size, mbytes, elapsed_time, mbytes / elapsed_time,
               nr_msgs / elapsed_time, elapsed_time / nr_msgs,
               nr_sleep_calls, nr_sleep_calls2);
    }
    else
    {
        /* Pong process. */
        MPI_Recv((void *) msg, MAX_MSG_LENGTH, MPI_PACKED, ping_tid,
                 MSG_INIT, MPI_COMM_WORLD, &sts);
        offset = 0;
        MPI_Unpack((void *) msg, MAX_MSG_LENGTH, &offset, &pong_tid, 1,
                   MPI_INT, MPI_COMM_WORLD);
        MPI_Unpack((void *) msg, MAX_MSG_LENGTH, &offset, &nr_msgs, 1,
                   MPI_INT, MPI_COMM_WORLD);
        MPI_Unpack((void *) msg, MAX_MSG_LENGTH, &offset, &msg_size, 1,
                   MPI_INT, MPI_COMM_WORLD);
        MPI_Send(msg, offset, MPI_PACKED, ping_tid, MSG_INIT,
                 MPI_COMM_WORLD);
        pthread_mutex_lock(&mutex);

        for (counter = 0; counter < nr_msgs;)
        {
            /* Echo every PING message which has already arrived. */
            do
            {
                MPI_Iprobe(ping_tid, MSG_PING, MPI_COMM_WORLD, &flag, &sts);
                if (flag)
                {
                    MPI_Recv((void *) msg, MAX_MSG_LENGTH, MPI_PACKED,
                             ping_tid, MSG_PING, MPI_COMM_WORLD, &sts);
                    /* Unpack from the beginning of the received buffer. */
                    offset = 0;
                    MPI_Unpack((void *) msg, MAX_MSG_LENGTH, &offset,
                               msg2, msg_size, MPI_BYTE, MPI_COMM_WORLD);
                    counter++;
                    offset = 0;
                    MPI_Pack(msg2, msg_size, MPI_BYTE, msg, MAX_MSG_LENGTH,
                             &offset, MPI_COMM_WORLD);
                    MPI_Send(msg, offset, MPI_PACKED, ping_tid, MSG_PONG,
                             MPI_COMM_WORLD);
                }
            }
            while (flag);
            if (counter)
                nr_sleep_calls++;
            if (counter < nr_msgs)
            {
                pthread_mutex_unlock(&mutex);
                pause.tv_sec = 0;
                pause.tv_nsec = 50000000;
                while (nanosleep(&pause, &rem) == -1)
                    nanosleep(&rem, &rem);
                pthread_mutex_lock(&mutex);
            }
        }
        pthread_mutex_unlock(&mutex);

        /* Report the number of sleep calls to the ping process. */
        offset = 0;
        MPI_Pack(&nr_sleep_calls, 1, MPI_INT, msg, MAX_MSG_LENGTH, &offset,
                 MPI_COMM_WORLD);
        MPI_Send(msg, offset, MPI_PACKED, ping_tid, MSG_QUIT,
                 MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return(0);
}
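The figures printed at the end of each benchmark run follow from three inputs: the number of messages, the message size and the elapsed time. The sketch below recomputes them in one place. It is an illustration under stated assumptions, not code from the benchmark: the structure pingpong_stats and the function pingpong_stats_compute are hypothetical names, and double is used where the listings use float.

```c
#include <assert.h>

/* Hypothetical container for the values printed at the end of a run. */
struct pingpong_stats
{
    double mbytes;        /* data moved in both directions, in MBytes */
    double mbytes_per_s;  /* throughput */
    double msgs_per_s;    /* ping messages per second */
    double s_per_msg;     /* seconds per ping message */
};

/* Hypothetical helper: derives the printed values from the inputs. */
static struct pingpong_stats pingpong_stats_compute(int nr_msgs,
                                                    int msg_size,
                                                    double elapsed_time)
{
    struct pingpong_stats s;

    assert(nr_msgs > 0 && msg_size >= 0 && elapsed_time > 0.0);
    /* Each message travels from ping to pong and back, hence the
       factor 2. One MByte is taken as 1048576 bytes, as in the
       listings. */
    s.mbytes = 2.0 * (double) nr_msgs * (double) msg_size / 1048576.0;
    s.mbytes_per_s = s.mbytes / elapsed_time;
    s.msgs_per_s = (double) nr_msgs / elapsed_time;
    s.s_per_msg = elapsed_time / (double) nr_msgs;
    return s;
}
```

For example, 1000 messages of 1048576 bytes exchanged in 4 seconds correspond to 2000 MBytes and 500 MB/s.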

List of Figures

2.1 Two independent activities in one process of a non-trivial application

2.2 An example of a replicated SEQ. The program computes j = 210.

2.3 An example of a replicated PAR. The program computes j = 210.

2.4 An example of a replicated ALT. The program computes j = 210.

2.5 An example of named processes in Occam. The program computes j = 210 in PROC sink

2.6 Simulation of dining philosophers in Occam

2.7 Simulation of dining philosophers, a process diagram

2.8 Hardware block diagram of the T805 Transputer

2.9 Implementation of one process of a non-trivial application in Occam

2.10 Components of the message passing framework

2.11 Natural threaded implementation of one process of a non-trivial application: it only works if the communication library is thread-safe

2.12 Polling implementation of one process of a non-trivial application: a thread-safe communication library is not required for the application to work correctly; on the other hand, polling makes the application inefficient and non-portable

2.13 Implementation of nanosleep in the kernel of an operating system

2.14 Implementation of the blocking recv in a socket-based communication library

2.15 Scenario of the interruption of a blocked recv. The intr_fd file descriptor is the reading end of a POSIX pipe. The thread T1 writes to the writing end of the pipe (intr_wfd), firing the blocked select in T2.

2.16 Threaded event-driven implementation of one process of a non-trivial application using a quasi-thread-safe communication library. This program does not contain any polling

2.17 Implementation of the interrupt mechanism inside a communication library. intr_fd is the reading end of a synchronous pipe, intr_wfd is the writing end of the pipe

2.18 TPL layered software architecture

2.19 Generic structure of a multi-threaded TPL process (TPL 1.0)

2.20 Message queueing model of TPL 1.0. Upon the arrival of a message, the main thread inserts messages into the queues of the threads which subscribed to the message. In order to avoid a replication of the (possibly large) data stored in the message bodies, only the message headers are inserted into the message queues. The message data is stored only once and referenced by the message headers. The main thread also signals the semaphore associated with the message queue into which it is inserting a message (in order to wake up the thread which may already be waiting for the message)

2.21 Implementation of tpl_handle_messages in TPL 1.0

2.22 Message queueing model of TPL 2.0. Upon the arrival of a message, the main thread first looks for a match among the threads waiting in the thread queue. If there is a match, the message is passed to the waiting thread (and the thread is woken up). If there is no match, the message is inserted into the message queue. A thread (which is not the main thread) which is calling tpl_recv() first looks into the message queue. If it finds a matching message in the message queue, it removes it from the queue. If there is no matching message in the message queue, the thread inserts itself into the waiting thread queue

2.23 An optimised polling implementation of the PING process. The optimal setting of time in the sleep(time) call is 50 milliseconds (see Section 2.6.4). This optimal setting was used in the measurements

2.24 Average throughput, 1 node hpcLine

2.25 Standard deviation of throughput, 1 node hpcLine

2.26 Standard deviation of the number of sleep() calls in the polling versions of the benchmark, 1 node hpcLine

2.27 Smoothing effect of the TCP protocol (Nagle's algorithm) [Ste94], [WS95]. A non-continuous message flow generated by the process PONG is received as a continuous message flow in the process PING

2.28 Average burstiness (amount of transferred data per sleep() call) for the polling versions of the benchmark, 1 node hpcLine

2.29 Average throughput, 2 nodes hpcLine

2.30 Standard deviation of throughput, 2 nodes hpcLine

2.31 Standard deviation of the number of calls in the polling versions of the benchmark, 2 nodes hpcLine

2.32 Average burstiness (amount of transferred data per sleep() call) for the polling versions of the benchmark, 2 nodes hpcLine

2.33 Average roundtrip time, 2 nodes hpcLine

2.34 Average roundtrip time, 2 nodes hpcLine (a 100x magnification of the graph from Fig. 2.33)

2.35 Overhead and transmission time in the MPI model. $t_{sp}$ denotes the moment at which the sender posts a send request. $t_1$ denotes the moment when the first byte of the message is placed on the network. $t_2$ denotes the moment when the last byte of the message is placed on the network. $t_{sc}$ denotes the moment when the application is notified about the completion of the send request. $t_{rp}$ denotes the moment when the application posts a receive request (which matches the send request). $t_3$ denotes the moment when the first byte of the message arrives. $t_4$ denotes the moment when the last byte of the message arrives. $t_{rc}$ denotes the moment when the receiver is notified about the completion of the receive request

3.1 An example of a triangle mesh. Note the discontinuities on the top and on the bottom of the cone

3.2 An example of a CSG tree. The object shown in the root node of the tree is a result of the union and difference operations. The unary transformation operations are not depicted in the figure (a transformation is applied to each node of the tree)

3.3 Gathering path integration: The geometry of the integrand of the term $(RL)(x, \omega)$.

3.4 Gathering path integration: The geometry of the integrand of the term $(R^2 L)(x, \omega)$.

3.5 Camera tracing with a single collection of the direct radiance (path tracing)

3.6 Camera tracing with a multiple collection of the direct radiance (distributed ray tracing)

3.7 Shooting path integration: The geometry of the integrand of the term $(PW)(x, \omega')$.

3.8 Shooting path integration: The geometry of the integrand of the term $(P^2 W)(x, \omega')$.

3.9 Light tracing with a single collection of the direct potential

3.10 Light tracing with a multiple collection of the direct potential

3.11 Bidirectional path tracing

3.12 Bidirectional path tracing with multiple connections of the gathering and shooting paths

3.13 Perfect specular reflection. $\theta = \theta'$.

3.14 Perfect specular refraction. $\sin\theta = k_{ior} \sin\theta'$.

3.15 Perfect diffuse reflection. The incoming radiance is equally scattered in all outgoing directions in the half-space of reflection, independently of the incoming direction $\omega'$.

4.1 Left: A process farm. Right: A process farm extended with a load balancing process

4.2 The extreme cases of chunking. Left: Minimal chunks. Right: Maximal chunks

4.3 The perfect load balancing algorithm (used in the loadbalancer process)

4.4 Left: Illustration of the work assignment in the perfect load balancing algorithm for N = 2 and T = 3. Right: The exact number of work requests in the perfect load balancing algorithm as a function of the number of workers and the number of atomic parts

4.5 The pseudo-code of the function Fetch_Object_Data. The function insert_into_cache makes space for the requested object data by removing other objects' data according to the cache policy, and then increases the requested object's importance. The function cache_hit increases the object's importance

4.6 Absolute parallel times for 90 workers for a varying chunk size. Left: BLOB scene. Right: HAUS6 scene

4.7 Efficiency of the chunking algorithm for a constant chunk size and varying number of worker processes. Left: BLOB scene (chunk size 720 pixels). Right: HAUS6 scene (chunk size 360 pixels)

4.8 Efficiency of the perfect load balancing algorithm with the optimal chunk size and varying number of worker processes. Top: BLOB scene (M = 720 pixels). Bottom: HAUS6 scene (M = 360 pixels)

4.9 Cache miss ratios. Top: BATH, 353 objects. Centre: ROSENTHALERHOF, 2215 objects. Bottom: HELICOPTER, 167 objects

4.10 Efficiency of POV||Ray (an old polling version) with a distributed object database. The memory percentage states how large a part of the sum of all object data sizes is allowed to be stored in the memory of each worker. A worker is only allowed to use this amount of memory for the storage of objects which it owns and for the object cache. The missing data in the graph indicate the cases where this simulated memory limit was exceeded in some worker

5.1 The basic shooting radiosity algorithm

5.2 Radiosity energy exchange using the ray tracing shader. Left: shadow rays are traced by the ray tracing shader from the light sources in order to illuminate a surface point Y. Right: shadow rays are traced by the ray tracing shader in order to transfer energy from the shooting patch to the receiving patch. The temporary point light sources are randomly generated on the shooting patch

5.3 Shooting radiosity algorithm using the ray tracing shader

5.4 Generation of sample points on the shooting patch

5.5 Left: BOX12, the box scene modeled with 12 top-level triangle patches. Right: BOX48, the same box scene modeled with 48 top-level triangle patches (level-one refinement)

5.6 Monte-Carlo computation of the form factors $F_{par}$ and $F_{perp}$ for the box scenes BOX12 and BOX48. Left column: direct area estimator (Equation 5.7). Right column: weighted analytical estimator (Equation 5.14)

5.7 Left: ray traced box scene. Right: the radiosity solution (90% converged)

5.8 Left: ray traced box scene. Right: a two-pass solution (radiosity and ray tracing). The radiosity solution is 90% converged and it is stored in the vertices of ca. 13000 triangles. The texture in the right image is a little distorted because the direct illumination was reconstructed from the radiosity solution.

5.9 Left: ray traced box scene. Right: a two-pass solution (radiosity and ray tracing). The radiosity solution is 95% converged and it is stored in the vertices of ca. 3000 triangles. The direct illumination was subtracted from the radiosity solution and computed by the ray tracing algorithm

5.10 Left: ray traced box scene with a blocker. Right: a two-pass solution (radiosity and ray tracing). The radiosity solution is 95% converged and it is stored in the vertices of ca. 3000 triangles. The direct illumination was subtracted from the radiosity solution and computed by the ray tracing algorithm. Note the soft shadow, which is typical for secondary diffuse reflections

5.11 Left: ray traced scene JAGDSCHLOSS. Right: the radiosity (two-pass) solution. The radiosity solution is 3% converged and it is stored in the vertices of ca. 350000 triangles.

5.12 Left: ray traced scene HOUSE. Right: the radiosity (two-pass) solution. The radiosity solution is 4% converged and it is stored in the vertices of ca. 170000 triangles.

Bibliography

[ABB+01] P. Altenbernd, A. Bartels, S. Bicskey, L. O. Burchard, M. Holch, J. Jensch, M. Oestreicher, I. Neumann, T. Plachetka, T. Prill, J. Seiler, and A. Schmitt. HiQoS: High Performance Multimedia-Dienste mit Quality-of-Service-Garantien, 2001. Projekt HiQoS, Abschlussbericht.

[ACH+99] P. Altenbernd, F. Cortes, M. Holch, J. Jensch, O. Michel, C. Moar, T. Prill, R. Lüling, K. Morisse, I. Neumann, T. Plachetka, M. Reith, O. Schmidt, A. Schmitt, and A. Wabro. BMBF-Projekt HiQoS: High-Performance-Multimedia-Dienste mit Quality-of-Service-Garantien. In R. Krahl, editor, Statustagung des BMBF, HPSC ’99, Höchstleistungsrechnen in der Bundesrepublik Deutschland, pages 29–32. BMBF, Bundesministerium für Bildung und Forschung, 1999.

[And91] G. A. Andrews. Concurrent Programming, Principles and Practice. Benjamin/Cummings Publishing Company, 1991.

[App68] A. Appel. Some techniques for shading machine renderings of solids. In Proceedings of AFIPS 1968 Joint Computer Conference, volume 32, pages 37–45, 1968.

[Arv95] J. Arvo. Stratified sampling of spherical triangles. Computer Graphics, pages 437–438, 1995.

[Ash01] I. Ashdown. Eigenvector radiosity. Diploma thesis, Department of Computer Science, Faculty of Graduate Studies, University of British Columbia, 2001.

[Atk76] K. E. Atkinson. A Survey of Numerical Methods for the Solution of Fredholm Integral Equations of the Second Kind. Society for Industrial Mathematics (SIAM), 1976.

[Bac98] J. Bacon. Concurrent Systems (Operating Systems, Database and Distributed Systems: An Integrated Approach). Addison-Wesley-Longman, 1998.

[BBP94] D. Badouel, K. Bouatouch, and T. Priol. Distributing data and control for ray tracing in parallel. IEEE Computer Graphics and Applications, 14(4):69–77, 1994.

[Bek99] P. Bekaert. Hierarchical and Stochastic Algorithms for Radiosity. PhD thesis, Department of Computer Science, Katholieke Universiteit Leuven, 1999.

[BHG87] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[BL93] A. J. Bernstein and P. M. Lewis. Concurrency in Programming and Database Systems. Jones and Bartlett Publishers, 1993.

[Bli77] J. F. Blinn. Models of light reflection for computer synthesized pictures. Computer Graphics, pages 192–198, 1977.

[Bli78] J. F. Blinn. Simulation of wrinkled surfaces. Computer Graphics, 12:286–292, 1978.

[BMR02] R. Brightwell, A. B. Maccabe, and R. Riesen. Design and implementation of MPI on Portals 3.0. In D. Kranzlmüller, P. Kacsuk, J. Dongarra, and J. Volkert, editors, Proc. of the 9th EuroPVM/MPI User’s Group Conference (Recent Advances in Parallel Virtual Machine and Message Passing Interface), volume 2474 of Lecture Notes in Computer Science, pages 331–340. Springer-Verlag, 2002.

[BMSW91] D. R. Baum, S. Mann, K. P. Smith, and J. M. Winget. Making radiosity usable: Automatic preprocessing and meshing techniques for the generation of accurate radiosity solutions. Computer Graphics, 25(4):51–60, 1991.

[BP88] K. Bouatouch and T. Priol. Parallel space tracing: An experience on an iPSC hypercube. In Proc. of Computer Graphics International’88 (New Trends in Computer Graphics), pages 170–188. Computer Graphics Society, 1988.

[BRW89] D. R. Baum, H. E. Rushmeier, and J. M. Winget. Improving radiosity solutions through the use of analytically determined form factors. Computer Graphics, 23:325–334, 1989.

[BW95] P. Bekaert and Y. D. Willems. Importance-driven progressive refinement radiosity. In P. Hanrahan and W. Purgathofer, editors, Rendering Techniques ’95, Proceedings of the Eurographics Workshop on Rendering. Springer, 1995.

[CC02] A. Chalmers and K. Cater. Realistic rendering in real-time. In B. Monien and R. Feldmann, editors, Proceedings of Euro-Par 2002 (Parallel Processing), volume 2400, pages 21–28. Springer, 2002.

[CCWG88] M. F. Cohen, S. E. Chen, J. R. Wallace, and D. P. Greenberg. A progressive refinement approach to fast radiosity image generation. Computer Graphics, 22:75–84, 1988.

[CDP95] F. Cazals, G. Drettakis, and C. Puech. Filtering, clustering and hierarchy construction: a new solution for ray-tracing complex scenes. Computer Graphics Forum, 14(3):371–382, 1995.

[CDR02] A. Chalmers, T. Davis, and E. Reinhard. Practical Parallel Rendering. A K Peters, 2002.

[CG85] M. F. Cohen and D. P. Greenberg. The hemi-cube: A radiosity solution for complex environments. Computer Graphics, 19:31–40, 1985.

[CPC84] R. L. Cook, T. Porter, and L. Carpenter. Distributed ray tracing. Computer Graphics, 18(3):137–145, 1984.

[CSD94] A. K. Chowdappa, A. Skjellum, and N. E. Doss. Thread-safe message passing with P4 and MPI. Technical report, Mississippi State University, Dept. of Computer Science, 1994.

[CT96] A. Chalmers and J. Tidmus. Practical Parallel Processing. International Thomson Publishing, 1996.

[CW93] M. F. Cohen and J. R. Wallace. Radiosity and Realistic Image Synthesis. Academic Press Professional, 1993.

[CWBV85] J. G. Cleary, B. M. Wyvill, G. M. Birtwistle, and R. Vatti. Multiprocessor ray tracing. Computer Graphics Forum, 5:3–12, 1985.

[Dij71] E. W. Dijkstra. Hierarchical ordering of sequential processes. Acta Informatica, 1:115–138, 1971.

[Dim01] R. P. Dimitrov. Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations. PhD thesis, Mississippi State University, 2001.

[DLW93] P. Dutre, E. P. Lafortune, and Y. D. Willems. Monte Carlo light tracing with direct computation of pixel intensities. In Proceedings of Compugraphics, pages 128–137. Alvor, 1993.

[DS84] M. Dippé and J. Swensen. An adaptive subdivision algorithm and parallel architecture for realistic image synthesis. Computer Graphics, 18(3), 1984.

[DS02] R. Dimitrov and A. Skjellum. Software Architecture and Performance Comparison of MPI/Pro and MPICH. MPI Software Technology, Inc., 2002. (White paper).

[EBe92] M. C. Escher, F. Bool, and J. L. Locher (editor). M. C. Escher: His Life and Complete Graphic Work. Abradale Press, 1992.

[Ent93] N. E. Things Enterprises. Magic Eye I: A New Way of Looking at the World. Andrews and McMeel, 1993.

[ES00] M. Evans and T. Swartz. Approximating Integrals via Monte Carlo and Deterministic Methods. Oxford University Press, 2000.

[Fer98] A. Ferrari. JPVM: Network parallel computing in Java. Concurrency: Practice and Experience, 10(11–13):985–992, 1998.

[Fey88] R. Feynman. QED: The Strange Theory of Light and Matter. Princeton University, 1988.

[FFB99] A. Fava, E. Fava, and M. Bertozzi. MPIPOV: A parallel implementation of POV-Ray based on MPI. In J. Dongarra, E. Luque, and T. Margalef, editors, Proc. of the 6th EuroPVM/MPI User’s Group Conference (Recent Advances in Parallel Virtual Machine and Message Passing Interface), volume 1697 of Lecture Notes in Computer Science, pages 426–433. Springer-Verlag, 1999.

[FHBWW95] S. Flynn-Hummel, I. Banicescu, C. T. Wang, and J. Wein. Load balancing and data locality via fractiling: An experimental study. In B. K. Szymanski and B. Sinharoy, editors, Proc. of the 3rd Workshop on Languages, Compilers, and RunTime Systems for Scalable Computers, pages 85–89. Kluwer Academic Publishers, 1995.

[FHK97] B. Freisleben, D. Hartmann, and T. Kielmann. Parallel raytracing: A case study on partitioning and scheduling on workstation clusters. In Proceedings of Hawaii International Conference on System Sciences (HICSS-30), volume 1, pages 596–605. IEEE Computer Society Press, 1997.

[FHSF91] S. Flynn-Hummel, E. Schonberg, and L. E. Flynn. Factoring: A practical and robust method for scheduling parallel loops. In Proc. of Supercomputing ’91, pages 610–619. IEEE Computer Society / ACM, 1991.

[FHSF92] S. Flynn-Hummel, E. Schonberg, and L. E. Flynn. Factoring: A method for scheduling parallel loops. Communications of the ACM (CACM), 35(8):90–101, 1992.

[FHSUW96] S. Flynn-Hummel, J. P. Schmidt, R. N. Uma, and J. Wein. Load-sharing in heterogeneous systems via weighted factoring. In Proc. of the 8th Symposium on Parallel Algorithms and Architectures (SPAA ’96), pages 318–328. ACM Press, 1996.

[FLS70] R. P. Feynman, R. B. Leighton, and M. Sands. The Feynman Lectures on Physics. Addison Wesley Longman, 1970.

[FS98] A. Ferrari and V. S. Sunderam. TPVM: Distributed concurrent computing with lightweight processes. Concurrency: Practice and Experience, 10(3):199–228, 1998.

[FTI86] A. Fujimoto, T. Tanaka, and K. Iwata. ARTS: Accelerated ray tracing system. IEEE Computer Graphics and Applications, 6(4):16–26, 1986.

[FvDFH90] J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes. Computer Graphics: Principles and Practice. Addison-Wesley, second edition, 1990.

[FW78] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proceedings of the 10th Annual ACM Symposium on Theory of Computing (San Diego, CA), pages 114–118. ACM Press, 1978.

[Gal96] J. Galletly. OCCAM 2. Including OCCAM 2.1. UCL (University College London) Press Ltd, second edition, 1996.

[GBD+94] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine (A Users’ Guide and Tutorial for Networked Parallel Computing). The MIT Press, 1994.

[GCC99] F. García, A. Calderón, and J. Carretero. MiMPI: A multithread-safe implementation of MPI. In J. Dongarra, E. Luque, and T. Margalef, editors, Proceedings of EuroPVM/MPI 99, Recent Advances in Parallel Virtual Machine and Message Passing Interface, 6th European PVM/MPI Users’ Group Meeting, volume 1697 of Lecture Notes in Computer Science, pages 207–214. Springer Verlag, 1999.

[GK90] I. Graham and T. King. The Transputer Handbook. Prentice Hall, 1990.

[GKPS97] A. Geist, J. A. Kohl, P. M. Papadopoulos, and S. L. Scott. Beyond

PVM 3.4: What we’ve learned, what’s next and why. In Proceed-

ings of EuroPVM/MPI 97, 4th European PVM/MPI Users’ Group

Meeting, pages 3–5. Springer-Verlag, 1997.

[GL96] W. Gropp and E. Lusk. MPICH working note: The second-

generation ADI for the MPICH implementation of MPI. Technical

report, Argonne National Laboratory, USA, 1996.

[Gla84] A. S. Glassner. Space subdivision for fast ray tracing. IEEE Com-

puter Graphics and Applications, 4(10):15–22, 1984.

[Gla89] A. S. Glassner. An Introduction to Ray Tracing. Academic Press,

1989.

[Gla90] A. S. Glassner. Graphics Gems I. Academic Press, 1990.

[GLS95] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel

Programming with the Message-Passing Interface. Scientific and

Engineering Computation Series. The MIT Press, 1995.

[GN71] R. A. Goldstein and R. Nagel. 3D visual simulation. Simulation,

16(1):25–31, 1971.

[Gou71] H. Gouraud. Computer display of curved surfaces. IEEE Transac-

tions on Computers, 20(6):623–629, 1971.

[GP89] S. A. Green and D. J. Paddon. Exploiting coherence for multi-

processor ray tracing. IEEE Computer Graphics and Applications,

1989.

[Gra98] P. Gramblička. Cache techniky pre paralelný raytracing. Diploma

thesis, Department of Informatics, Faculty of Mathematics and

Physics, Comenius University, Bratislava, Slovakia, 1998.

[Gre91] S. Green. Parallel Processing for Computer Graphics. Research

Monographs in Parallel and Distributed Computing. Pitman Pub-

lishing, 1991.

[Gro98] Object Management Group. The Common Object Request Broker:

Architecture and Specification (Revision 2.3), 1998.

[Gro02] W. Gropp. MPICH2: A new start for MPI implementations. In

D. Kranzlmüller, P. Kacsuk, J. Dongarra, and J. Volkert, editors,

Proc. of the 9th EuroPVM/MPI User’s Group Conference (Recent

Advances in Parallel Virtual Machine and Message Passing Interface),

volume 2474 of Lecture Notes in Computer Science, page 7.

Springer-Verlag, 2002.

[GRS97] J. C. Gomez, V. Rego, and V. S. Sunderam. Efficient multithreaded

user-space transport for network computing: Design and test of the

TRAP protocol. Journal of Parallel and Distributed Computing,

40(1):103–117, 1997.

[GTGB84] C. M. Goral, K. E. Torrance, D. P. Greenberg, and G. Battaile.

Modeling the interaction of light between diffuse surfaces. Com-

puter Graphics, 18:213–222, 1984.

[GU77] G. H. Golub and R. Underwood. The Block Lanczos method for

computing eigenvalues. In J. R. Rice, editor, Mathematical Soft-

ware III, pages 361–377. Academic Press, 1977.

[HA98] A. Heirich and J. Arvo. A competitive analysis of load balancing

strategies for parallel ray tracing. The Journal of Supercomputing,

12(1/2):57–68, 1998.

[HG86] E. A. Haines and D. P. Greenberg. The light buffer: a shadow

testing accelerator. IEEE Computer Graphics and Applications,

6(9):6–16, 1986.

[Hoa85] C. A. R. Hoare. Communicating Sequential Processes. Prentice

Hall, 1985.

[HOB+99] L. P. Huse, K. Omang, H. Bugge, H. Ry, A. T. Haugsdal, and

E. Rustad. ScaMPI—design and implementation. In H. Hellwag-

ner and A. Reinefeld, editors, SCI: Scalable Coherent Interface,

volume 1734 of Lecture Notes in Computer Science, pages 249–

261. Springer-Verlag, 1999.

[How82] J. R. Howell. A Catalog of Radiation Configuration Factors.

McGraw-Hill, 1982.

[HR03a] J. Hippold and G. Rünger. A communication API for implementing

irregular algorithms on SMP clusters. In Proc. of the 10th EuroPVM/MPI

User’s Group Conference (Recent Advances in Parallel

Virtual Machine and Message Passing Interface), Lecture Notes

in Computer Science. Springer-Verlag, 2003. (To appear).

[HR03b] J. Hippold and G. Rünger. Task pool teams for implementing irregular

algorithms on clusters of SMPs. In Proceedings of 17th International

Parallel and Distributed Processing Symposium (IPDPS

2003), page 54, 2003.

[HS67] H. C. Hottel and A. F. Sarofim. Radiative Transfer. McGraw-Hill,

1967.

[HSS+98] L. S. Hebert, W. G. Seefeld, A. Skjellum, C. D. Taylor, and R. Dim-

itrov. MPI for Windows NT: Two generations of implementa-

tions and experience with the message passing interface for clus-

ters and SMP environments. In H. Arabnia, editor, Proceedings

of the International Conference on Parallel and Distributed Pro-

cessing Techniques and Applications (PDPTA’98), pages 309–316.

CSREA Press, 1998.

[HV99] M. Henning and S. Vinoski. Advanced CORBA Programming with

C++. Addison-Wesley Longman, 1999.

[IRT] Internet Ray Tracing Competition. http://www.irtc.org.

[ISO90] ISO/IEC 9945-1:1990 Information Technology. Portable Operating

System Interface (POSIX), Part 1: System Application Program

Interface (API) [C Language], 1990.

[ISO97] ISO/IEC 14772-1:1997 Information Technology. Computer graph-

ics and image processing—The Virtual Reality Modeling Language

(VRML), 1997.

[Jan86] F. Jansen. Data structures for ray tracing. In L. R. A. Kessener,

F. J. Peters, and M. L. P. van Lierop, editors, Proc. of the Euro-

graphics Seminar, pages 57–73. Springer-Verlag, 1986.

[Jev89] D. A. J. Jevans. Optimistic multi-processor ray tracing. In Proc.

of Computer Graphics International’89 (New Trends in Computer

Graphics), pages 507–522. Computer Graphics Society, 1989.

[Kaj86] J. T. Kajiya. The rendering equation. Computer Graphics,

20(4):143–150, 1986.

[Kap85] M. R. Kaplan. Space tracing: A constant time ray tracer. SIG-

GRAPH ’85 Tutorial, 1985.

[Kel94] A. Keller. A quasi-Monte Carlo algorithm for the global illumi-

nation problem in the radiosity setting. Technical Report 260/94,

Universität Kaiserslautern, Fachbereich Informatik, 1994.

[Kel96] A. Keller. Quasi-Monte Carlo radiosity. Technical Report 279/96,

Universität Kaiserslautern, Fachbereich Informatik, 1996.

[Kel97] A. Keller. Instant radiosity. Computer Graphics, pages 49–56,

1997.

[KH95] M. J. Keates and R. J. Hubbold. Interactive ray tracing on a vir-

tual shared-memory parallel computer. Computer Graphics Forum,

14(4):189–202, 1995.

[KHS96] O. Krone, B. Hirsbrunner, and V. S. Sunderam. PT-PVM+: A

portable platform for multithreaded coordination languages.

Calculateurs Parallèles, 8(2):167–182, 1996.

[KNK+88] H. Kobayashi, S. Nishimura, H. Kubota, T. Nakamura, and

Y. Shigei. Load balancing strategies for a parallel ray-tracing sys-

tem based on constant subdivision. The Visual Computer, 4:197–

209, 1988.

[KNS87] H. Kobayashi, T. Nakamura, and Y. Shigei. Parallel processing of

an object space for image synthesis using ray tracing. The Visual

Computer, 3(1):13–22, 1987.

[KR98] A. Keller and A. Reinefeld. CCS resource management in networked

HPC systems. In Proc. Heterogeneous Computing Workshop

HCW’98 at IPPS, pages 44–56, Orlando, Florida, 1998. IEEE

Computer Society Press.

[KRR94] K. Kremer, T. Römke, and F. Ramme. A distributed

computing center software for the efficient use of parallel computer

systems. In HPCN 94, volume 797-II of Lecture Notes in Computer

Science, pages 129–136. Springer-Verlag, 1994.

[KS02] D. Kranzlmüller and M. Schulz. Notes on nondeterminism in message

passing programs. In D. Kranzlmüller, P. Kacsuk, J. Dongarra,

and J. Volkert, editors, Proc. of the 9th EuroPVM/MPI

User’s Group Conference (Recent Advances in Parallel Virtual Ma-

chine and Message Passing Interface), volume 2474 of Lecture

Notes in Computer Science, pages 357–367. Springer-Verlag, 2002.

[KW85] C. Kruskal and A. Weiss. Allocating independent subtasks on

parallel processors. IEEE Transactions on Software Engineering,

11(10):1001–1016, 1985.

[LAB93] P. Liu, W. Aiello, and S. Bhatt. An atomic model for message-

passing. In The 5th Annual ACM Symposium on Parallel Archi-

tectures and Algorithms (SPAA’93), pages 154–163. ACM Press,

1993.

[LM-02] LM-63-1986 Illuminating Engineering Society of North America

(IESNA) and American National Standards Institute (ANSI). IES

Recommended Standard File Format for Electronic Transfer of

Photometric Data, 2002.

[LRBB96] K. Langendoen, J. Romein, R. Bhoedjang, and H. Bal. Integrat-

ing polling, interrupts, and thread management. In Proc. of the

6th Symposium on the Frontiers of Massively Parallel Computa-

tion (Frontiers ’96), pages 13–22. IEEE, 1996.

[Lüc03] S. Lücking. Berechnung von Caustics in 3D-Rendering-Programmen.

Studienarbeit, University of Paderborn, Department

of Computer Science, 2003.

[LW93] E. P. Lafortune and Y. D. Willems. Bi-directional path-tracing. In

Proceedings of Compugraphics, pages 143–153. Alvor, 1993.

[LW94] E. P. Lafortune and Y. D. Willems. Using the modified Phong

reflectance model for physically based rendering. Technical Report

CW 197, Department of Computer Science, Katholieke Universiteit

Leuven, 1994.

[Lyn93] A. Lyne. Indecent proposal. (Film), 1993.

[McN00] A. McNamara. Comparing Real and Synthetic Scenes using Human

Judgements of Lightness. PhD thesis, University of Bristol, 2000.

[MPI94] MPI Forum. MPI: Message Passing Interface, 1994. Version 1.0.

[MPI97] MPI Forum. MPI-2: Extensions to the Message Passing Interface,

1997. Version 2.0.

[MPI98] MPI Forum. MPI: Message Passing Interface, 1998. Version 1.1.

[MS87] D. E. Muller and P. E. Schupp. Alternating automata on infinite

trees. Theoretical Computer Science, 54:267–276, 1987.

[New52] I. Newton. Opticks: Or a Treatise of the Reflections, Inflections

and Colours of Light. Dover Pubns, 1952. Preface by B. Cohen.

[NL96] M. L. Netto and B. Lange. Exploiting multiple partitioning strate-

gies for an evolutionary ray tracer supported by DOMAIN. In

First Eurographics Workshop on Parallel Graphics and Visualisa-

tion, 1996.

[Nus28] W. Nusselt. Graphische Bestimmung des Winkelverhältnisses

bei der Wärmestrahlung. Zeitschrift des Vereines Deutscher Ingenieure,

19(3):72–673, 1928.

[OPR96] R. Otte, P. Patrick, and M. Roy. Understanding CORBA: the

common object request broker architecture. Prentice Hall, 1996.

[Par94] Parsytec GmbH. Parix V1.3 PowerPC Software Documentation,

1994.

[PB89] T. Priol and K. Bouatouch. Static loadbalancing for parallel ray

tracing on MIMD hypercube. The Visual Computer, 5:109–119,

1989.

[PBMH02] T. J. Purcell, I. Buck, W. R. Mark, and P. Hanrahan. Ray trac-

ing on programmable graphics hardware. Computer Graphics,

21(3):703–712, 2002.

[Per85] Y. Perelman. Fun With Maths and Physics. Firebird Publications,

Inc, 1985.

[Pho75] B. T. Phong. Illumination for computer generated pictures. Com-

munications of the ACM, 18(6):311–317, 1975.

[Pie93] G. Pietrek. Fast calculation of accurate formfactors. In Proceedings

of the 4th Eurographics Workshop on Rendering, pages 201–220,

1993.

[Pit93] P. Pitot. The Voxar project. IEEE Computer Graphics and Appli-

cations, pages 27–33, 1993.

[PKGH02] M. Pharr, C. Kolb, R. Gershbein, and P. Hanrahan. Render-

ing complex scenes with memory-coherent ray tracing. Computer

Graphics, 21(3):703–712, 2002.

[Pla98] T. Plachetka. POV||Ray: Persistence of Vision parallel raytracer.

In L. Szirmay-Kalos, editor, Proceedings of Spring Conference on

Computer Graphics. Comenius University, Bratislava, 1998.

[Pla02a] T. Plachetka. Perfect load balancing for demand-driven parallel

ray tracing. In B. Monien and R. Feldmann, editors, Proceedings of

Euro-Par 2002 (Parallel Processing), volume 2400, pages 410–419.

Springer, 2002.

[Pla02b] T. Plachetka. (Quasi-) thread-safe PVM and (Quasi-) thread-safe

MPI without active polling. In D. Kranzlmüller, P. Kacsuk, J. Dongarra,

and J. Volkert, editors, Proc. of the 9th EuroPVM/MPI

User’s Group Conference (Recent Advances in Parallel Virtual Ma-

chine and Message Passing Interface), volume 2474 of Lecture

Notes in Computer Science, pages 296–305. Springer-Verlag, 2002.

[PMTR95] I. S. Pandzic, N. Magnenat-Thalmann, and M. Roethlisberger.

Parallel raytracing on the IBM SP2 and T3D. In EPFL Super-

computing Review (Proceedings of First European T3D Workshop

in Lausanne), volume 7, 1995.

[PS98] B. V. Protopopov and A. Skjellum. A multi-threaded message pass-

ing interface (MPI) architecture: performance and program issues.

Technical report, Computer Science Department, Mississippi State

University, 1998.

[PSA01] T. Plachetka, O. Schmidt, and F. Albracht. The HiQoS rendering

system. In L. Pacholski and P. Ružička, editors, Proc. of the

28th Annual Conf. on Current Trends in Theory and Practice of

Informatics (SOFSEM 2001: Theory and Practice of Informatics),

volume 2234 of Lecture Notes in Computer Science, pages 304–315.

Springer-Verlag, 2001.

[PT] POV-Team. Persistence of Vision Ray Tracer (POV-Ray).

http://www.povray.org.

[Ray88] M. Raynal. Distributed Algorithms and Protocols. J. Wiley & Sons,

1988.

[RC97] E. Reinhard and A. Chalmers. Message handling in parallel RA-

DIANCE. In Recent Advances in Parallel Virtual Machine and

Message Passing Interface, pages 486–493. Springer-Verlag, 1997.

[RCJ98] E. Reinhard, A. G. Chalmers, and F. W. Jansen. Overview of parallel

photo-realistic graphics. In A. de Sousa and B. Hopgood, editors,

State-of-the-Art reports, Eurographics ’98 Conference, pages

1–25. Springer, 1998.

[RCJ99] E. Reinhard, A. Chalmers, and F. W. Jansen. Hybrid scheduling

for parallel rendering using coherent ray tasks. In IEEE Parallel

Visualisation and Graphics Symposium, pages 21–28, 1999.

[Ree97] L. Reeker. GOLEM Dokumentation. University of Paderborn,

1997.

[Rei96] E. Reinhard. A parallelisation of ray tracing with diffuse interreflec-

tion. In Advanced School for Computing and Imaging (ASCI ’96),

pages 367–372, 1996.

[RKC98] E. Reinhard, A. J. F. Kok, and A. Chalmers. Cost distribution

prediction for parallel ray tracing. In K. Bouatouch, A. Chalmers,

and T. Priol, editors, Rendering Techniques ’96, Proceedings of

the Eurographics Workshop on Parallel Graphics and Visualisation,

pages 77–90. Springer, 1998.

[RKJ98] E. Reinhard, A. J. F. Kok, and F. W. Jansen. Cost distribution

prediction for ray tracing. In X. Pueyo and P. Schröder, editors,

Rendering Techniques ’98, Proceedings of the Eurographics Work-

shop on Rendering, pages 41–50. Springer, 1998.

[RW80] S. Rubin and T. Whitted. A three-dimensional representation for

fast rendering of complex scenes. Computer Graphics, 14(3):110–

116, 1980.

[SAS92] B. E. Smits, J. R. Arvo, and D. H. Salesin. An importance-driven

radiosity algorithm. Computer Graphics, 26(2):273–282, 1992.

[SC88] I. D. Scherson and E. Caspary. Multiprocessing for ray tracing: A

hierarchical self-balancing approach. Visual Computer, (4), 1988.

[Sch93] C. Schlick. A customizable reflectance model for everyday render-

ing. In Proceedings of the 4th Eurographics Workshop on Render-

ing, pages 73–84, 1993.

[Sch00] O. Schmidt. Parallele Simulation der globalen Beleuchtung in kom-

plexen Architekturmodellen. PhD thesis, Department of Mathemat-

ics and Informatics, University of Paderborn, 2000.

[SH93] P. Schröder and P. Hanrahan. On the form factor between two

polygons. Computer Graphics, pages 163–164, 1993.

[SK99a] L. Szirmay-Kalos. Monte-Carlo global illumination methods—state

of the art and new developments. In J. Žára, editor, Proceedings of

Spring Conference on Computer Graphics, pages 3–21. Comenius

University, Bratislava, 1999.

[SK99b] L. Szirmay-Kalos. Monte-Carlo Methods in Global Illumination.

Institute of Computer Graphics, Vienna University of Technology,

1999.

[SKH96] K. R. Subramaniam, S. C. Kothari, and D. E. Heller. A commu-

nication library using active messages to improve performance of

PVM. Journal of Parallel and Distributed Computing, 39(2):146–

152, 1996.

[SOHL+95] M. Snir, S. W. Otto, S. Huss-Lederman, D. W. Walker, and J. Don-

garra. MPI: The Complete Reference. MIT Press, 1995.

[SP89] F. Sillion and C. Puech. A general two-pass method integrating

specular and diffuse reflection. Computer Graphics, 23(3), 1989.

[SP94] F. Sillion and C. Puech. Radiosity and Global Illumination. Morgan

Kaufmann Publishers, 1994.

[Ste94] W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols.

Professional Computing Series. Addison-Wesley, 1994.

[Suna] Sun Microsystems. Java 3D. http://java.sun.com/products/java-

media/3D/index.html.

[Sunb] Sun Microsystems. The source for Java technology.

http://java.sun.com.

[Tre97] R. Treumann. Experiences in the implementation of a thread safe,

threads based MPI for the IBM RS/6000 SP. Technical report,

IBM, T. J. Watson Research Center, 1997.

[TS67] K. E. Torrance and E. M. Sparrow. Theory for off-specular re-

flection from roughened surfaces. Journal of Optical Society of

America, 57(9):1105–1114, 1967.

[Vea97] E. Veach. Robust Monte Carlo Methods for Light Transport Simu-

lation. PhD thesis, Stanford University, 1997.

[VG94] E. Veach and L. Guibas. Bidirectional estimators for light trans-

port. In Proceedings of the Eurographics Workshop on Rendering,

pages 147–162, 1994. Also in G. Sakas, P. Shirley and S. Müller

(editors), Photorealistic Rendering Techniques, Springer-Verlag, 1995.

[vNe66] J. von Neumann and A. W. Burks (editor). Theory of Self-

Reproducing Automata. University of Illinois Press, 1966.

[War94] G. J. Ward. The RADIANCE lighting simulation and rendering

system. Computer Graphics, pages 459–472, 1994.

[WEH89] J. R. Wallace, K. A. Elmquist, and E. A. Haines. A ray tracing

algorithm for progressive radiosity. Computer Graphics, 23:315–

324, 1989.

[Whi80] T. Whitted. An improved illumination model for shaded display.

Communications of the ACM, 23(6):343–349, 1980.

[Woo84] J. R. Woodwark. A multiprocessor architecture for viewing solid

models. Display Technology and Applications, 5(2), 1984.

[WRC88] G. J. Ward, F. Rubinstein, and R. Clear. A ray tracing solution

for diffuse interreflection. Computer Graphics, 22(4):85–92, 1988.

[WS95] G. R. Wright and W. R. Stevens. TCP/IP Illustrated, Volume

2: The Implementation. Professional Computing Series. Addison-

Wesley, 1995.

[ZG98] H. Zhou and A. Geist. LPVM: A step towards multithread PVM.

Concurrency—Practice and Experience, 10(5):407–416, 1998.