Parallel real-time rendering using heterogeneous PC clusters [original]

Parallel Real-Time Rendering using

Heterogeneous PC Clusters

Tim S¨

uß

Zusammenfassung

3-D-Szenen aus CAD-Systemen besitzen zumeist eine hohe geometri-

sche Komplexit¨

at. Es gibt eine Reihe von Konzepten, welche sich mit

der Schwierigkeit auseinandersetzen, solche Szenen in Echtzeit zu ren-

dern; beispielsweise daten- und rechenparallele Techniken oder Out-

of-Core Rendering Mechanismen. Diese Dissertationsschrift behandelt

die Nutzung von heterogenen PC-Clustern zur parallelen Bildberech-

nung von hochkomplexen Szenen. F¨

ur drei unterschiedliche Szenarien

wurden jeweils verschiedene Verfahren entwickelt bei deren Berech-

nungen wenige Highend-Rechner von vielen schwachen Rechnern un-

terst¨

utzt werden. Zun¨

achst sind dies statische Szenen, die vollst¨

andig

in den Hauptspeicher eines Rechners geladen werden k¨

onnen. Im zwei-

ten Szenario werden statische Szenen betrachtet, deren Komplexit¨

den Speicher eines einzelnen Rechners ¨

ubersteigt. Zuletzt werden Sze-

nen betrachtet, die neben statischen Objekten auch dynamische Ob-

jekte beinhalten.

Abstract

Often 3D scenes created with CAD applications have a high geo-

metric complexity. There are several concepts (like out-of-core render-

ing, levels of detail, parallel rendering) to render these complex scenes

in real-time. This dissertation focuses on the usage of heterogeneous

PC clusters for parallel real-time rendering of highly complex scenes.

For three different scene types specific rendering approaches were de-

veloped, where a small group of high-end computers is supported by a

large number of weaker PC cluster nodes. The first scene type consists

of static scenes that can be stored completely in a single computer’s

main memory, while the scenes of the second type exceed this memory

limitations. The scenes of the last type contain not only static but

also dynamic objects.

HEINZ NIXDORF INSTITUT

Universität Paderborn

Parallel Real-Time Rendering using

Heterogeneous PC Clusters

Dissertation

Tim S¨

uß

Heinz Nixdorf Institute and Department of Computer Science

University of Paderborn

November 2011

Reviewers:

•Prof. Dr. Friedhelm Meyer auf der Heide, University of Paderborn

•Prof. Dr. Andr´e Brinkmann, University of Mainz

For Anja and Noel

III

Contents

1 Introduction 1

2 Rendering 9

2.1 PC Clusters for Parallel Rendering . . . . . . . . . . . . . . . . . . . . . 9

2.2 Overview of Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.1 Parallel Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.2 Out-of-core Rendering . . . . . . . . . . . . . . . . . . . . . . . . 13

2.2.3 OcclusionCulling........................... 15

2.2.4 Mesh Simplification and Impostors . . . . . . . . . . . . . . . . . 16

3 Preliminaries and Definitions 17

3.1 Objects, Models, and Scenes . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 PCCluster .................................. 17

Scenario I: Static and Sparse Occluded Scenes Fitting into

Processor’s Primary Memory 21

4 Reliefboards 23

4.1 Overview and Summary of Results . . . . . . . . . . . . . . . . . . . . . 23

4.2 RelatedWork................................. 25

4.3 Distributed Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.4 Reliefboard Structure and Creation . . . . . . . . . . . . . . . . . . . . . 29

4.5 Identifying Objects by Clustering . . . . . . . . . . . . . . . . . . . . . . 33

4.5.1 Clustering Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 34

4.6 Evaluation................................... 37

4.7 Contribution.................................. 44

Scenario II: Static Scenes Which do not fit into a Processor’s

Primary Memory 49

5 Load-Balancing using the c-Load-Collision Protocol 51

5.1 Overview and Summary of Results . . . . . . . . . . . . . . . . . . . . . 52

5.2 RelatedWork................................. 53

5.3 The Parallel Rendering System . . . . . . . . . . . . . . . . . . . . . . . 53

Contents

5.4 Load-Balancing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.5 Evaluation................................... 59

5.6 Contribution.................................. 67

6 Hull Tree 69

6.1 Overview and Summary of Results . . . . . . . . . . . . . . . . . . . . . 71

6.2 RelatedWork................................. 72

6.3 BuildingtheHullTree ............................ 73

6.4 Determining the Interior Approximations . . . . . . . . . . . . . . . . . . 74

6.5 RenderingAlgorithm............................. 76

6.6 Evaluation................................... 79

6.7 Contribution.................................. 88

7 Parallel Out-of-Core Occlusion Culling using the Hull Tree 89

7.1 Overview and Summary of Results . . . . . . . . . . . . . . . . . . . . . 90

7.2 RelatedWork................................. 91

7.3 ScenePreparation............................... 92

7.4 Data Distribution and Rendering . . . . . . . . . . . . . . . . . . . . . . 93

7.5 Evaluation................................... 97

7.6 Contribution.................................. 104

Scenario III: Large Dynamic Scenes 109

8 Visualization of Multiple Synchronous Simulations 111

8.1 Overview and Summary of Results . . . . . . . . . . . . . . . . . . . . . 112

8.2 RelatedWork ................................. 115

8.3 SystemArchitecture ............................. 116

8.4 Joining the Partial Images . . . . . . . . . . . . . . . . . . . . . . . . . . 121

8.5 Contribution.................................. 122

9 Conclusions 125

9.1 Contribution.................................. 125

9.2 Open Questions and Future Work . . . . . . . . . . . . . . . . . . . . . . 127

1 Introduction

Complex polygonal 3D models may consist of hundreds of millions of triangles and re-

quire multiple gigabytes of memory. Rendering such Massive Models in real-time is one

of the most challenging problems in modern computer graphics [KBF05]. A user should

be able to navigate through a scene or models interactively while at least six to ten

frames per second are computed. The parallelization of the rendering process is a com-

mon approach to face this problem [KDG+08]. Many real-time rendering algorithms can

improve performance by distributing the load among multiple computers. For each frame

that is displayed an image for the current camera position must be rendered. To produce

images of polygonal 3D models, usually their geometric primitives are sent through a ren-

dering pipeline (see Figure 1.1), where they are transformed into pixels [AMHH08]. The

parallelization of such real-time, pipeline rendering algorithms is done rarely because PC

clusters completely equipped with modern graphic adapters are still rare. Usually, PC

clusters are intended to be used for other applications, such as scientific computations.

For example, the PADS and TeraPort PC clusters at the University of Chicago do not

provide any OpenGL accelerating hardware nor does the JUGENE supercomputer of the

J¨ulich Supercomputing Center. Some systems are equipped with weak graphics adapters,

like the Paderborn Center of Parallel Computing’s (PC2) Arminius PC cluster. On the

other hand, many compute centers offer a small group of so-called visualization nodes

that are equipped with high end graphics hardware. The performance improvements

and the GPGPU programmability of modern graphic adapters make these components

more and more valuable for PC clusters. However, this kind of PC clusters can hardly

be used for standard parallel pipeline rendering techniques as proposed by Molnar et

al. [MCEF94, MCEF08]. Typically, the performance of these methods depend on the

slowest node. Due to these reasons, we put the focus of this thesis on the development

of new parallel pipeline rendering algorithms for such heterogeneous PC clusters. We

require that these heterogeneous systems include a small group of powerful visualization

nodes and a large group of weaker back-end nodes. While the visualization nodes should

be equipped with high end graphics adapters, the back-end nodes require only weak

graphics performance. The back-end nodes should be equipped with common hardware

and a network must connect the different nodes.

Vertex

Data

Vertex

operations

Raster

operations

Frame

Buffer

Figure 1.1: Simplified representation of a rendering pipeline.

Introduction

The objective is to render complex 3D scenes in real-time using such heterogeneous

environments. Thus, the following questions are addressed in this thesis:

•How can we utilize heterogeneous PC clusters for parallel pipeline rendering?

•How can the powerful, high end nodes benefit from the additional computational

power of the weaker nodes?

•How can we cope with or even exploit asynchronous communication?

•How can the load be distributed fairly among the different nodes?

Different scene classes require different rendering algorithms. Many levels of detail

systems require objects (for example chairs, balls or even planes) that consist of con-

nected components (often, these methods are unsuitable for triangles in arbitrary order,

usually referred to as polygon-soups) [LWC+02], and the rendering algorithm introduced

by Chamberlain et al. is best applied on spacious scenes that provide an even distri-

bution of their geometry [CDL+96]. Several occlusion culling systems are only suitable

for scenes consisting of architectural models [AMHH08]. These systems discard objects,

which are covered by other objects. Scenes that exceed primary memory (in our case the

main and video memory) often must be processed differently than scenes that fit within

it. Real-time rendering is challenging even for large static scenes; including dynamic

objects further increases the difficulty.

We developed different pipeline rendering algorithms and data structures for three

different scenarios. Each of our developed parallel rendering algorithms is suitable for

one of these. Below we denote pipeline rendering simply by “rendering”, unless otherwise

stated.

Scenario I Static scenes (scene’s objects never change their position) that fit into the

primary memory of a visualization node and have sparse occlusion (objects are

rarely covered completely). The challenge of rendering such scenes is the large

number of objects participating in an image, because it is rare that objects occlude

others completely.

Scenario II Static scenes that do not fit into the primary memory of a visualization node.

Rendering images of such scenes with a suitable frame rate requires fast access to

data items which are not stored in the primary memory and a fast detection of the

visible items.

Scenario III Large dynamic scenes (scene’s objects can change their position). In such

scenes objects’ visibility can change constantly. Furthermore, additional compu-

tations are necessary for the moving objects, which influence the rendering speed.

We developed several data structures, rendering algorithms, and object simplifications

to render such scenes on heterogeneous PC clusters. We also tested different hardware

configurations to accelerate the parallel rendering process of dynamic scenes. Below we

will describe briefly our approaches for each scenario. Additionally, we state the issues,

which must be addressed to realize these approaches.

Introduction

Method for Scenario I

For the first scenario we introduce reliefboards [SJF10]. A reliefboard is an approxima-

tion of a complex scene object. These approximations are generated asynchronously on

the weak back-end nodes while the displayed images are rendered on a single powerful

visualization node. Since the reliefboard’s computation takes several seconds on the

weak back-end nodes, the reliefboard must be suitable for multiple camera positions and

for many frames. Reliefboards appear like a scene’s original objects for many camera

positions, and provide other positive properties. If the image errors that occur through

camera movements exceed a threshold, the reliefboard must be exchanged with a new

one. In this scenario the following problems have been solved:

1. Computation of simplifications that are suitable for many camera positions.

2. Determination of objects that can be replaced by reliefboards.

3. Even distribution of the computational load.

4. Reduction of the payload that is sent across the network.

To 1: Reliefboards make it possible to render complex scenes with high frame rates with

acceptable image errors.

To 2: We employed a clustering algorithm to determine objects which are suitable for

replacement by reliefboards.

To 3: For the load balancing we modified a data management protocol, the so-called

c-collision protocol, for our asynchronous communication scheme.

To 4: Reliefboards have little memory requirements in comparison to the original geom-

etry that leads to small messages sent across the network.

Furthermore, using our developed object simplifications reduces the image noise and

frame rate’s fluctuations during the walk-through. In contrast to other parallel rendering

systems, our approach scales in the image quality rather than scaling in the frame rate if

the number of back-end nodes is increased. Our rendering system do not wait until the

back-end nodes have finished their tasks. If the number of back-end nodes is increased,

we can decrease the number of approximations each of these nodes has to process or the

number of objects it has to test. Thus, we reduce the time spent between the initiation

and response of a job, whereby the required results of these jobs arrive earlier.

Methods for Scenario II

For the second scenario we introduce two different parallel out-of-core rendering systems

[SWF10a, SWF10b, SKJ+11] and one sequential rendering technique [SKJF11]. The

rendering method of all rendering systems is approximative, which leads to pixel errors

in the final image. In the parallel out-of-core systems, the weak back-end nodes serve as

secondary memory of the visualization nodes. The complete scene is distributed among

these weak nodes and stored in their primary memory, to allow for fast data access.

When scene objects are requested, the back-end nodes test the visibility of these objects

instead of sending them blindly.

Introduction

The parallel out-of-core rendering system that we will introduce first uses a version

of the c-load-collision protocol to balance rendering load and nodes’ contention. The

back-end nodes perform visibility tests, while they have only access to a subset of all

objects and global, but aged, distance information of the other objects. Due to the

aged distance information the back-end nodes cannot guarantee to determine all visible

objects until the information are updated. The positive tested objects are sent to the

visualization node, where they are rendered and displayed.

The sequential rendering system and its data structure is the groundwork for the sec-

ond parallel out-of-core render system. Our developed hull tree is a spatial, hierarchical

data structure. It covers scene’s objects more tightly than other commonly used data

structures. Additionally, we store for each object an approximation to improve and ac-

celerate the visibility test. Our associated approximative rendering algorithm exploits

this structure.

The second parallel out-of-core rendering system combines the hull tree with another

spatial data structure, the so-called randomized sample tree, to improve its properties

for parallel rendering. Each back-end node stores a small subset of the original objects

and approximations for the other objects. The back-end nodes perform visibility tests

with this mix of originals and approximations which are organized in a hull tree.

In these parallel rendering systems the following challenges arise:

1. Distribution of the objects to achieve fast and balanced data access.

2. Providing information to perform suitable visibility tests.

3. Even distribution of the computational load and the data load.

4. Reducing the number of objects sent across the network.

To 1: Due to a randomized distribution we achieve good load balancing.

To 2: We achieve suitable visibility tests if we use global, but aged, distance information

of scene’s objects. The hull tree in combination with the used approximations reduces

the complexity of the visibility tests.

To 3: The data management protocol of our first parallel out-of-core rendering system

achieves a good balancing of the load. Due to the hull tree’s combination with a ran-

domized sample tree, sending the visible objects across the network is distributed among

multiple frames.

To 4: Testing objects visibility a priori on the back-end nodes instead sending them

blindly reduces the network load. Thus, we could reduce the network requirements for

the second parallel out-of-core rendering system.

Like the reliefboard approach the parallel out-of-core rendering systems scales in the

image quality if the number of back-end nodes is increased. Furthermore, an increased

number of back-end nodes reduces the delay between a request initiation and its answer.

Introduction

Method to Scenario III

In the third scenario the focus is put on large dynamic scenes [DFH+08, SFH+08,

SFH+09]. In this scenario we separate the dynamic and static objects. In this way

it is possible to apply the techniques for static scenes and process the dynamic objects

separately. The computations of objects’ movements and their interactions is mostly

isolated from the rendering of the static parts. We render the different parts on differ-

ent nodes, and combine the resulting images into a final image. Combining the images

is a time consuming operation. For this reason we searched for alternative techniques

to accelerate the conflation. Here, hardware and software based solutions were devel-

oped [SSPP09, SSPP11]. For our developed systems for dynamic scenes the following

questions must be answered:

1. How do we combine the different images?

2. What kind of hardware can be used to accelerate the combining of the different

images?

To 1: The developed rendering approach allows to visualize multiple dynamic scenes

simultaneously in only one displaying window. We combined the images of the static

and the different dynamic scene parts by comparing the depth values of the different

corresponding pixels.

To 2: To accelerate this merging process we tested different techniques. We utilized

different CPU features as well as GPU features and an FPGA.

Structure of this thesis

In this thesis we focus on the utilization of PC clusters that consist of few visualization

nodes equipped with high performance hardware and a large amount of back-end nodes

with poor graphics performance for parallel pipeline rendering.

In Chapter 2, we introduce the relevance of this work’s topic followed by an overview

of work related to this thesis. In Chapter 3, we will introduce preliminaries and defini-

tions used in this work. Afterwards we will present our different rendering algorithms as

well as techniques for load balancing, data management, and visibility testing for hetero-

geneous PC clusters. These techniques provide object-approximations, data structures,

and different algorithms to accelerate the rendering of massive models.

Chapter 4 introduces the reliefboard technique where an object-approximation is cre-

ated on-the-fly on the weak back-end-nodes. The technique presented in Chapter 5

handles the usability of an adjusted data management protocol for parallel out-of-core

rendering. To accelerate visibility tests, we developed a culling technique using the hull

tree, presented in Chapter 6. We integrated this technique with its associated data

structure in a parallel out-of-core rendering system in Chapter 7. For dynamic scenes

we evaluated, for one technique, the benefit of FPGAs for parallel rendering in hetero-

geneous PC clusters, in Chapter 8. The thesis ends with a conclusion and an outlook to

future work.

Personal Publications

[DFH+08] Wilhelm Dangelmaier, Matthias Fischer, Daniel Huber, Christoph Laroque,

and Tim S¨uß. Aggregated 3d-visualization of a distributed simulation exper-

iment of a queuing system. In S. J. Mason, R. Hill, L. Moench, and O. Rose,

editors, Proceedings of the Winter Simulation Conference, WSC´ 08, pages

2012 – 2020. IEEE, Omnipress, 2008.

[DS10] Dominic Dumrauf and Tim S¨uß. On the complexity of local search for

weighted standard set problems. In Proceedings of the 6th Conference on

Computability in Europe, pages 132–140, 30 June - 4 July 2010.

[SFH+08] Tim S¨uß, Matthias Fischer, Daniel Huber, Christoph Laroque, and Wilhelm

Dangelmaier. A system for aggregated visualization of multiple parallel dis-

crete event simulations. In Proceedings of the International Symposium on

Advances in Parallel and Distributed Computing Techniques, APDCT ’08,

pages 587–593. IEEE, IEEE Computer Society Press, December 2008.

[SFH+09] Tim S¨uß, Matthias Fischer, Daniel Huber, Christoph Laroque, and Wilhelm

Dangelmaier. Ein System zur aggregierten Visualisierung verteilter Materi-

alflusssimulationen. In J¨urgen Gausemeier and Michael Grafe, editors, Aug-

mented & Virtual Reality in der Produktentstehung, volume 252, pages 111–

126. Heinz Nixdorf Institut, Universit¨at Paderborn, May 2009.

[SJF10] Tim S¨uß, Claudius J¨ahn, and Matthias Fischer. Asynchronous parallel re-

liefboard computation for scene object approximation. In Proceedings of the

Eurographics Symposium on Parallel Graphics and Visualization, EGPGV

’10, pages 43–51, Norrk¨oping, Sweden, May 2010. Eurographics Association,

Eurographics Association.

[SKJ+11] Tim S¨uß, Clemens Koch, Claudius J¨ahn, Matthias Fischer, and Fried-

helm Meyer auf der Heide. Ein paralleles Out-of-Core Renderingsystem f¨ur

Standard-Rechnernetze. In J¨urgen Gausemeier, Michael Grafe, and Fried-

helm Meyer auf der Heide, editors, Augmented & Virtual Reality in der Pro-

duktentstehung, volume 295 of HNI-Verlagsschriftenreihe, Paderborn, pages

185–197. Heinz Nixdorf Institut, Universit¨at Paderborn, May 2011.

[SKJF11] Tim S¨uß, Clemens Koch, Claudius J¨ahn, and Matthias Fischer. Approxi-

mative occlusion culling using the hull tree. In Proceedings of the Graphics

Interface 2011, pages 79–86. Canadian Human-Computer Communications

Society, May 2011.

Personal Publications

[SSPP09] Tobias Schumacher, Tim S¨uß, Christian Plessl, and Marco Platzner. Commu-

nication performance characterization for reconfigurable accelerator design on

the XD1000. In Proceedings of the International Conference on ReConFig-

urable Computing and FPGAs (ReConFig), 9 - 11 December 2009.

[SSPP11] Tobias Schumacher, Tim S¨uß, Christian Plessl, and Marco Platzner. Fpga

acceleration of communication-bound streaming applications: Architecture

modeling and a 3d image compositing case study. International Journal of

Reconfigurable Computing, 2011:1–11, 2011. Article ID 760954.

[SWF10a] Tim S¨uß, Timo Wiesemann, and Matthias Fischer. Evaluation of a c-load-

collision-protocol for load-balancing in interactive environments. In Proceed-

ings of the 5th IEEE International Conference on Networking, Architecture,

and Storage, pages 448 – 456. IEEE Computer Society, IEEE Press, 15 -

17 July 2010.

[SWF10b] Tim S¨uß, Timo Wiesemann, and Matthias Fischer. Gewichtetes c-Collision-

Protokoll zur Balancierung eines parallelen Out-of-Core-Renderingsystems.

In J¨urgen Gausemeier and Michael Grafe, editors, Augmented & Virtual Re-

ality in der Produktentstehung, HNI-Verlagsschriftenreihe, Paderborn, pages

39–52. Universit¨at Paderborn, HNI Verlagsschriftenreihe, Paderborn, 2010.

2 Rendering

Computer graphics are becoming more and more important in various areas of our

life. Architects, designers, physicians and many more gain information from the ability

to create realistic images of their objects of interest. One area in the huge field of

computer-graphics is the real-time rendering of three-dimensional scenes: the aim of

3D real-time rendering is to produce images of a given scene with at least six frames

per second [AMHH08]. Thus, a user is able to navigate through the scene without

losing his orientation. Usually these scenes are composed of two-dimensional polygons.

The process of transforming these polygon data sets into 2D raster images is called

rendering. At pipeline rendering this process is supported by specialized hardware of

common graphics adapters (typically programmed with OpenGL or DirectXTM). In order

to improve the realistic appearance of images produced by such scenes, often the scene

objects’ polygon count is increased. Thus the virtual objects become more detailed.

Incrementing the polygon count also results in more precise object surfaces which allows

for more accurate measurements. This is highly relevant for simulations of industrial

complexes, production environments, machinery, etc. where users want to check, for

example, whether one gear fits to another or if they can be replaced without removing

any other parts of the machinery. To increase realism even more, the different objects

and the scene itself include several attributes like materials, shaders, textures, or light

sources. There is a great need to render scenes that consist of hundreds of millions

or more triangles [KDG+08], usually qualified as massive models, in real-time. If the

geometry in such scenes is never modified, they are referred to as static scenes. Otherwise

they are called dynamic scenes. The complexity of such massive models can increase so

much that a single computer is unable to render them in real-time, even if the scenes are

static. When a scene contains dynamic objects, which change their position, orientation

or appearance during runtime, the problem becomes even worse.

2.1 PC Clusters for Parallel Rendering

A common approach to computing images of such massive models is to use PC clusters,

distributing the rendering load among multiple computers. PC clusters offer a lot of

computational power and memory, and in many cases fast networks to exchange data

between different nodes. If the PC cluster nodes are equipped with graphics adapters, it

is possible to distribute the rendering among these cards, instead using only one. Here

for example, each graphics adapter could render a different subimage which are finally

combined to a complete one [SFL01, SFLS00, MCEF08]. However, the usage of many

PCs instead of only one for rendering introduces new problems. For example, to achieve

2 Rendering

good performance the computational load should be distributed evenly among the cluster

nodes. This introduces a need for load balancing algorithms. Another problem passed

by the parallelization of the rendering process is the data organization. Fast data access

and a good data distribution is necessary to render massive models. If the data amount

is so large that it does not fit into the primary memory of a single PC, data access

or data distribution methods are also required. In this thesis the primary memory of

a computer refers to its processor memory, its main memory, and its video memory.

Slower storage elements used (for example, local hard disks, network attached storage

systems) are referred to as secondary memory.

Apart from algorithmic problems, hardware related challenges also occur. While up-

grading a single computer can be done with little effort, the renewal of a whole PC

cluster rises bigger problems. Beside the acquisition costs for the computer hardware,

other costs arise. In many cases, advanced hardware consumes more power which leads

to the requirement of a better power supply. The increasing power requirements usu-

ally result in a higher heat production, which, in turn, requires an improved cooling

system. This then leads to an even higher power consumption. PC clusters’ energy

consumption and temperature control challenge their providers: For example Green-IT

refers to the concept of reducing compute centers’ energy consumption, while increasing

their computational power. Another problem when upgrading PC clusters is hardware’s

short improvement circles. Usually, the influences of new components on the rest of the

system cannot be tested sufficiently. So the upgrade of single hardware components of

all PC cluster nodes can lead to malfunctions and the instability of the entire system.

One method to combat these problems and to raise computational power is to allow

heterogeneous PC clusters, where different nodes are equipped with different hardware

components. The nodes within such systems are equipped with different hardware, which

is specialized for different tasks. While upgrading an entire PC cluster with state of the

art components is a big challenge, a few cluster nodes can be upgraded with signifi-

cantly less effort. The different components of these nodes (CPUs, memory, graphics

cards, etc.) can be replaced in shorter periods of time. While the configuration of most

cluster nodes is unmodified, a minor group of dedicated nodes can be kept up-to-date.

If these nodes are organized beside the main system, their influence on the stability of

the main system is marginal. These nodes can be equipped with high end hardware that

offers all features that a user or developer would like to use.

2.2 Overview of Related Work

This section introduces work related to the algorithms presented in this thesis. Here,

concepts and techniques are introduced that build the basics for all parts of this thesis.

We will give a brief overview of parallel rendering in §2.2.1 and out-of-core rendering in

§2.2.2. Additionally, we will introduce different techniques for occlusion culling in §2.2.3

and mesh replacement in §2.2.4 that are related to methods we have developed for our

rendering methods.

2.2 Overview of Related Work

2.2.1 Parallel Rendering

Parallel rendering has been categorized into three basic approaches by Molnar et al.

[MCEF94, MCEF08]: sort-first, sort-middle, and sort-last. These parallelization ap-

proaches specify in which stage of the rendering pipeline the geometric primitives are

distributed among PC cluster nodes (see Figure 1.1).

Sort-first

In the sort-first approach, the display is subdivided into multiple tiles. Usually the

geometric primitives in each of these tiles are rendered by a different processor. Before

rendering, it is estimated in which tile each primitive must be displayed. In other words,

the primitives are assigned/sorted to the different nodes before they are sent into the

rendering pipeline. When each node has finalized its partial image, the different tiles are

composed to the final image (see Figure 2.1, left).

geometry database

(Pre-transformation)

Redistribute “raw”

primitives

...

Geometry

transformation

Rasterization

Display

(arbitarily partitioned)

geometry database

Redistribute screen-

space primitives

...

Geometry

transformation

Rasterization

Display

()

(arbitarily partitioned)

geometry database

Redistribute pixels,

samples, or fragments

...

Geometry

transformation

Rasterization

Display

...

(Compositing)

(arbitarily partitioned)

Figure 2.1: In sort-first rendering the geometric primitives are redistributed during the

geometry transformation. In sort-middle, the screen-space elements are re-

distributed in between geometry transformation and rasterization. In sort-

last, the different picture elements are redistributed after rasterization and

composed afterwards.

The drawback of this method is that usually the different primitives cannot be as-

signed to the rendering nodes uniquely. Primitives can overlap into multiple tiles, which

influence the scalability of that approach negatively.

Sort-middle

In sort-middle, the primitives’ arrangement is made in the middle of the rendering

pipeline, in between geometry transformation and rasterization. Similar to the sort-first

approach, the display is subdivided into tiles. Additionally, the geometric primitives are

distributed arbitrarily exclusively to the nodes in a preprocessing step. This distribution

2 Rendering

does not change during runtime. Each node performs the geometry transformation for its

assigned geometric primitives and determines its corresponding tiles. The transformed

display coordinates are transferred to the responsible nodes, where the primitives’ ras-

terization are performed. When all the primitives have been transformed, the nodes’

partial images are composed to a final image (see Figure 2.1, middle).

In the sort-middle approach the primitives must be assigned to the different render-

ing nodes while passing the rendering pipeline. Typical common computer hardware

does not provide an interface for interfering with the rendering process at this stage,

without causing high delays. Thus, this parallel rendering approach is impracticable

for parallelization on PC clusters. However, most graphics adapters internally use the

sort-middle approach in their rendering pipeline [CDR02].

Sort-last

In the sort-last approach the arrangement is made after the geometry primitives have

been sent through the rendering-pipeline. Before rendering, the different primitives are

assigned to different nodes. For each frame, the nodes render their subset of primitives

into separate frame and depth buffers. Afterwards all nodes read back both buffers into

their main memory. The partial images are combined into a single image by comparing

the depth values of the different depth buffers. For the final image, the pixel from the

frame buffer whose depth value is the nearest to the view plane is chosen (see Figure

2.1, right).

Load balancing strategies

To accelerate the sort-first rendering process, the computational load must be distributed

evenly among the cluster nodes. It is easy to produce situations where most processors

are idle while only a single node has to render the complete scene. Hence, load balancing

techniques are required. Usually the load is balanced by adapting rendering nodes’ tile

sizes [SZF+99, ACCC04, OU06, RRR06, Paj08] using heuristics, for example.

Using the sort-last approach, the composition of the different partial images requires

a significant amount of time. While the rendering load usually decreases for a rising

number of cluster nodes, the merging time increases. To distribute the load for composi-

tion evenly the binary-swap algorithm and its improvements can be utilized [MPHK94,

MPHK08, TIH03, EP08]. Additional run-length encoding of the images reduce the load

of the network and accelerate the image composition even more [HWC10, KPH+10].

Hybrid techniques

Sort-first and sort-last bring their own advantages and disadvantages. Using sort-first,

the different partial images can be easily composed, while many elements are rendered

redundantly in multiple tiles. Sort-last renders each element at most once, but the

composition of the partial images needs much time. Both approaches can be combined

to reduce the drawbacks [SFL01, SFLS00].

2.2 Overview of Related Work

Hardware acceleration

Special purpose hardware has also been developed for parallel rendering [Whi92]. As

defined in Flynn’s taxonomy, there are Single Instruction, Multiple Data (SIMD) as well

as Multiple Instruction, Multiple Data (MIMD) approaches. In SIMD system a single

instruction stream is performed in parallel on multiple data streams. In contrast, a

MIMD system performs a different instruction stream for each data stream. There are

complete hardware systems to render 3D scenes fast [MBDM97, WLLB97, Ada05], in

the form of special purpose components (for example the VoodooTMgraphics accelerator)

that can be applied to off the shelf computers [DY08, MOM+01, SEP+01].

Many modern graphics adapters provide techniques to combine the computational

power of multiple graphics adapters [Res09, AMHH08]. There are different solutions

such as SLI or Crossfire where homogeneous graphics adapters are combined. Other tech-

niques, like LucudLogix HydraLogix Engine allow the usage of heterogeneous adapters

[Lud10].

Next to Molnar’s parallelization approaches there are further methods to render 3D

scenes in parallel. One of these methods is often offered by the previously mentioned

multiple GPU solutions and is referred to as alternate frame. Here, multiple graphics

adapters or rendering nodes are used to compute consecutive images. Each frame is

assigned to a render unit in a round-robin manner. While a GPU is displaying its

frame, the following GPUs use this time to render their frame. The problem here is

the method’s scalability. When the number of render units increases, the latencies of

the rendering system also increases. Another problem of this approach is the possible

dependency of a frame on the previous one. In this case the required data must be send

from one graphics adapter to another over the bus.

2.2.2 Out-of-core Rendering

Out-of-core rendering methods are necessary if a 3D scene is too large to fit into the

primary memory of a single computer. In this case, secondary memory is used to store

invisible scene parts. In general, the hard drive is used for this purpose. In comparison

to the primary memory this storage has massive capacities, but at the costs of much

greater latency. Thus, it is necessary to store all objects that contribute to the final

image in local main memory to render the images with good performance.

Aliga et al. presented an interactive massive model rendering system to process scenes

that exceed primary memory [ACW+99]. Their approach combines many different ren-

dering techniques to handle such large scenes (for example, hierarchical data structures,

levels of detail, visibility culling, etc.). This was the first rendering system that was

able to process scenes with more than ten million triangles [CKS03]. They employed a

from-region pre-fetching, to load required data from the secondary memory.

To render models that exceed the primary memory pre-fetching and caching mecha-

nisms are required. Extending the viewing frustum is a simple technique to load objects

(see Figure 2.2), which are potentially visible in one of the next frames [VM02]. The

iWalk rendering system [CKS02b, CKS03, CKS02a] stores a given 3D scene in an octree

[AMHH08]. Using this spatial data structure, selective object loading is possible.

2 Rendering

Figure 2.2: Pre-fetching in an out-of-core rendering system.

Sajadi et al. present an out-of-core rendering system that subdivides a given 3D scene

using a k-d tree [SHDG+09]. The k-d tree is also used to find the data required by

their rendering system. Their system reduces the data and cache-management costs by

storing all objects within equal-sized memory blocks. Although this approach uses more

memory than the total object size, it prevents the fragmentation of memory.

Goswami et al. present a parallel out-of-core rendering system for very large terrain

data sets [GMBP10]. Their system provides a sort-first and a sort-last rendering. For

the sort-first portion of the renderer they evaluate the behavior of the system when the

display is segmented only by equally sized horizontal or vertical stripes. They show that

the load is distributed more evenly if the stripes are orientated vertically. Due to terrain

scenes’ properties the triangle distribution is usually densest at the horizon, while in the

sky the polygon count is usually very low. Using equally sized tiles is reasoned by a

reduction of accesses to the data, stored on a network-attached storage.

Br¨uderlin et al. presented Interview3D, an out-of-core rendering system that is able

to render CAD scenes of nearly arbitrary size [BHP07]. The system loads only that

data into the primary memory that is potentially visible. Thus they use visibility-guided

rendering which is an output-sensitive approach to reduce the rendering load. Required

data is incremental updated from the secondary memory to prevent the renderer from

stalling (small details can be delayed).

All rendering nodes of the parallel rendering approaches presented in this section

require high capacity graphics adapters to achieve good performance. Typically, the

performance of the entire system depends on the slowest node. The described out-of-

core rendering approaches utilize the hard drive for storing the data-items. Our methods

for primary memory exceeding scenes do not require - unlike other approaches - that the

objects are stored in equal-sized blocks. In our methods we use a randomized sample

tree for the spatial scene partition, where elements from lower levels in the octree are

randomly lifted into higher levels. For this reason, each sample tree cell can have a

different number of objects attached to it.

The data transfer rates of hard drives are very low compared to the rates that can

be achieved using high speed networks. In other approaches, the data items are loaded

2.2 Overview of Related Work

without testing the object for occlusion by other objects. In our rendering system, we

use the nodes’ RAM as secondary memory instead of hard drives. To ensure fast data

access all nodes are connected via an Infiniband network. On the back-end nodes, all

requested data items have to pass additional tests before they are transmitted to the

rendering node’s main memory.

2.2.3 Occlusion Culling

Occlusion culling is a common approach to reduce the number of objects sent into the

rendering pipeline. Many graphics adapters provide an interface to retrieve occlusion

information directly from the GPU – the so-called hardware accelerated occlusion queries.

The queries return the number of potentially visible pixels for the queried geometry (i.e.

if the query returns zero the geometry is invisible, otherwise it contributes to the final

image).

The different algorithms can be categorized into three groups: conservative, approxi-

mative, and aggressive culling algorithms. While conservative algorithms usually over-

estimate the set of visible objects, approximate algorithms can terminate their visibility

tests for various reasons, even before they determined all visible objects. Aggressive

algorithms only determine objects that are visible, and in addition discard objects even

if they are definitely visible.

All three approaches can be used with from-cell and from-camera position occlusion

culling. Algorithms related to the first group determine sets of visible objects within a re-

gion in a preprocessing step. The visibility within a cell can be determined conservatively

[TS91, NBG02] or approximatively [DDTP00, NB04, Lai05, LSCO03]. Other algorithms

determine in a preprocessing, which regions are invisible for a given cell [SDDS00]. It

is not necessary to determine the visible objects in advanced. There are methods that

compute the from-cell visibility during runtime [RP05, MBWW07, BMW+09]. When

the camera is placed within a region, only objects from pre-determined set of potentially

visible objects are processed. These approaches exploit spatial coherency.

From-camera culling algorithms are typically processed at runtime. Many of these

runtime occlusion culling algorithms profit from the hardware accelerated visibility

tests. Exploiting spatial and time coherency can reduce the number of performed tests.

The conservative techniques determine all currently visible objects [HSLM02, BWPP04,

GBK06, SBS06]. To accelerate the visibility tests object approximations can be used

[DMS01, SKJF11]. Other techniques uses multiple graphics adapters to determine visi-

ble objects [GSYM03, SWF10a, SKJ+11]. In contrast to these conservative approaches,

the approximative techniques do not guarantee that all visible objects are displayed in

the final image [KS00, CKS03, GBSF05, GM05]. Some of these approximative methods

draw objects until a budget is reached, other techniques determine the visible objects in-

crementally. Several methods allow for choosing a conservative or approximative culling

by parametrization [MBW08].

2 Rendering

2.2.4 Mesh Simplification and Impostors

Level of detail and impostor techniques are one of the most common approaches to

process massive models. The idea to replace objects by less complex approximations

organized in a hierarchy has been introduced by Clark [Cla76]. Polygonal simplifica-

tion techniques usually require objects of connected components, like chairs, balls, etc.

Levels of detail (LOD) can be classified into three groups: discrete, continuous, and

view-dependent [Lue01, LWC+02]. Discrete LODs are multiple versions of the same

model with variable complexities [SZL92, Tur92, HDD+93, Kle97]. The different ver-

sions can be used for different objects-to-camera distances, for example. A continuous

LOD has no individual version. Utilizing a data structure, an object’s complexity can

be refined or coarsened seamlessly [Hop96, SGG+00]. View-dependent LODs are based

on continuous LODs that additionally use the current camera position as criteria for the

chosen simplifications [LE97, EMB01].

Impostors replace scene parts as well, but they do not manipulate the original objects.

Billboards replace scene elements with a texture that is placed on a simple geometric

object (e. g. rectangle, cone, cube, etc.). Some techniques replace composited scene

parts [SLS+96], others combine multiple billboards to replace a single object like a tree

[YSK+02, DDSD03, DN09]. Particle systems, to visualize smoke or water, can also be

realized with billboards [USKS06]. Other techniques, like our developed reliefboards,

produce simple meshes that are not based on the original geometry to replace objects

[ABB+07, GFB10, SJF10].

3 Preliminaries and Definitions

In this chapter we will introduce and define some expressions we require throughout this

thesis. Additionally, we describe the PC cluster system used for testing. One of the PC

cluster used has been modified over the time, resulting in different settings for the same

PC cluster environment.

3.1 Objects, Models, and Scenes

For most of the algorithms we need the central concept of objects which are rendered.

In this setting, an object is a set of triangles that builds a semantic group irrespective of

the object’s form, complexity, and spatial dimensions. An arbitrary, single polyhedron

is defined as geometric primitive. Usually these groups of objects are given as input

for our rendering algorithms. They are created using CAD applications or 3D scanners.

The composition of multiple objects is called a model or a scene.

If the input is given as set of independent geometric primitives, or if the given objects

are not suitable for our rendering algorithm, the input must be reorganized to fit our

specific needs. For example, connected components can be found or geometric clustering

techniques can be utilized.

To place objects’ vertices, we use the Cartesian coordinate system. Every vertex is

given as a triple (x,y,z)∈R3, representing a point in the three dimensional Euclidean

space. We cannot use elements of the set of real numbers for our computations because

of hardware’s limitations. The computations are limited by hardware supported floating

point numbers. Nevertheless, in this thesis we will use Rfor the used floating point

numbers, unless otherwise stated. The coordinate system used in this thesis is right

handed [Wat99] (see Figure 3.1). A box containing an object, whose sides are paral-

lel/orthogonal to coordinate system axis is defined as axis aligned bounding box, short

AABB.

3.2 PC Cluster

In the following section we will formally introduce the terms and definition related to a

PC cluster, followed by the detailed PC cluster configuration used in the tests.

Definitions

In this thesis we use the terms processor, rendering node, and cluster node synonymously

as atomic computation units of a PC cluster. Each processing element in a PC cluster

3 Preliminaries and Definitions

Figure 3.1: Right-handed coordinate system with a single vertex. Virtually, when a

hand grabs the z-axis from the top and the arm is placed on the positive

x-axis, the thumb points along the positive z-axis.

consisting of Nnodes is identifiable via a unique id PEi∈ {PE0,PE1,. . . ,PEN−1}.

The used PC cluster is a distributed memory MIMD system, as defined in Flynn’s

taxonomy [Fly72], which means that the different nodes have no access to a shared

memory space [Pac96]. For this restriction, each piece of information or data item must

be sent and received over a network. The different algorithms developed for this thesis

use message passing for communication. The MIMD system is programmed via SPMD

(single-program, multiple-data). That means that each node executes the same program,

but can process different data.

Configuration

The PC cluster we used for the evaluation was the Paderborn Center of Parallel Com-

puting’s (PC2) Arminius cluster. Over time the cluster was modified so we have two

different configurations: a large and a small configuration.

In the large configuration, the strongest visualization nodes provide an NVidia

Geforce 9800 GX2 graphics adapter with 512 MiB memory, 16×PCI-e 1.0, two AMD

Opteron 250 processors and 4 GiB RAM. The installed 64-bit OS was Fedora Linux 9. In

this configuration 200 back-end nodes were available. Each back-end node provides an

NVidia Quadro NVS280 graphics adapter with 64 MiB memory, PCI-e 1.0 (single lane),

two Intel Xeon 3.2 GHz processors, and 4 GiB RAM. The installed 64-bit OS was Red

Hat Enterprise Linux AS, release 4. All nodes are connected via Infiniband (using 16×

PCI-e, about 1.8 GiB/sec at full-duplex transmission).

In the small configuration, neither the hardware configuration of the visualiza-

tion node nor the hardware configuration of the back-end nodes has been changed, but

the number of available back-end nodes is decreased to 16. The operating system was

changed to the 64-bit version of Cent-OS 5.5.

Scenario I:

Static and Sparse Occluded Scenes

Fitting into Processor’s Primary

Memory

Reliefboards

Figure 3.2: If trees are discarded too early it is possible to see the sunset behind them.

This strong change in the illumination can influence shaders negatively.

Here, we will describe the challenges which inspired the development of reliefboards.

Real-time rendering of large scenes is a challenge, especially if the objects’ complete,

mutual occlusion is sparse. In such scenes, camera positions exist where all objects must

be rendered to produce a correct image. Examples for such scenes are forests where it

is possible to see through gaps between tree’s leaves. Here it is difficult to decide which

objects can be discarded and which objects appear in the final image. This is reasoned

by small gaps between leafs, branches, trunks, etc. where an observer is able to look

through (see Figure 3.2). The discarding of objects which allows the background to be

seen can lead to disturbing image errors. This can occur, for example, when shaders are

used which use information of previous frames to determine the illumination.

Additionally, rendering highly tessellated scenes like woods can produce many aliasing

artifacts. This is because of the small projection sizes of distant objects and the difficulty

of determining the pixels’ color (see Figure 3.3). A pixel’s appearance depends on the

primitives which are projected on its center or multiple other sample points included in

this pixel [Ros05, AMHH08]. If the projection size of the primitives in a display region

is smaller than a pixel, small changes in the camera position can result in noise.

Figure 3.3: Minor changes in camera position can result in significant image changes.

If these changes appear permanently during walk-through, it is called noise.

Such effects can easily be produced by small structures in a forest scene.

Scenario I

Image noise during a walk-through can disturb users and reduce the overall image

quality. There are many techniques to avoid this problem, however these techniques

often need significant additional computation time.

One common approach to face this problem is the use of textures that are placed on

planes for distant objects. These billboards are simple and achieve good results, but they

do not cover other objects correctly. Additionally, if only one billboard is used per object

the parallax effect is missing, where elements change their distance in screen space if the

camera is moved sideways (see Figure 3.4).

Figure 3.4: Schematic representation of the parallax effect. Due to the motion parallel

to the viewing plane, objects change their distances.

To process scenes as described before and to avoid the occurring problems, we de-

veloped an object simplification, the so-called reliefboards. We developed a parallel

rendering system that uses these approximations. In contrast to other parallel render-

ing systems, the presented system scales in image quality instead of frame rate. For

load balancing, we used a randomized data management protocol. To achieve an evenly

distribution of the load we distribute the objects randomly and redundant among the

PC cluster’s back-end nodes. This distribution yields good results and is easy to com-

pute. A deterministic computation of the object distribution is hard. With the needed

properties this problem is similar to the set-packing optimization problem. We analyzed

if a polynomial-time local search algorithms could achieve suitable results. We have

shown that the localized version of set-packing is already PLScomplete for a two-differ-

neighborhood [DS10]. That means, starting with an feasible solution we try to optimize

the packing by a sequence of real improvements, in which we exchange at most two

elements in each step. The PLS completeness of set-packing shows that the length of

this sequence could be exponential, until we reach a local optimal solution.

4 Reliefboards

Often complex scenes are composed of many individual objects. Each of these objects

can have a moderate complexity but the composition of all of the objects is too complex

to be rendered by a single computer, even if they fit into its main memory. Some

examples for this are trees or skylines. When a user observes the scene through an

aerial perspective, objects barely occlude each other. In worst case, all objects must be

rendered.

To render such scenes in real-time we developed reliefboards (see Figure 4.1). This

approximation technique is similar to regular billboards, but they have improved prop-

erties (e.g., they support the parallax effect). To generate reliefboards in our parallel

rendering system the PC cluster’s back-end nodes do not even need graphics adapters, a

software rendering system can be used. Our requisite to the back-end nodes is that the

sum of their memory is large enough to store the scene’s objects redundantly [SJF10].

4.1 Overview and Summary of Results

Our system architecture uses a single visualization node equipped with a powerful graph-

ics adapter and a PC cluster consisting of a large number of back-end nodes not neces-

sarily offering accelerated 3D graphic performance. The displayed images of the scene

are only rendered on the visualization node. We assume that a 3D scene consists of a

certain number of atomic objects. The visualization node renders a subset of the objects

with their original geometry. A large number of elementary objects are replaced by re-

liefboards (see Figure 4.1). A reliefboard is a mesh consisting of colored vertexes placed

Figure 4.1: Approximating objects by reproducing their surface by reliefboards.

4 Reliefboards

on a regular grid and shifted along the grid’s normal vector. Reliefboards approximate

the original objects and their quality depends on the viewer’s position. Therefore the

reliefboards have to be updated periodically. This is done by the PC cluster’s nodes,

which compute the reliefboards asynchronously.

Figure 4.2: Objects which are near the viewer, too large or too simple are rendered with

the original polygons. Objects which are far away and suitable are removed

and rendered with reliefboards.

Rendering cycle: The visualization node’s rendering cycle consists of three steps.

First, all unsuitable objects are rendered. Unsuitable objects are rendered with original

geometry because their diameter is too large or the number of triangles is too low and

thus a replacement with a reliefboard would not gain any speedup (see Section 4.5).

Replacing objects which are too large could lead to significant image errors. Second,

all suitable objects are rendered in a front-to-back manner. An object is suitable for

replacement with a reliefboard if the number of triangles and the diameter of the object

is adequate (see Section 4.5). We call the area where suitable objects are rendered with

original triangles the near zone (see Figure 4.2). Third, distant, suitable objects behind

the near zone are replaced and rendered with a reliefboard. We call this area the far

zone (see Figure 4.2).

We developed reliefboards and a related parallel rendering algorithm. Reliefboards are

easy to compute and yield good results (see Figure 4.3). Using these object impostors

allows us to accelerate the rendering process of complex scenes. Additionally, the fluc-

tuations of the required rendering times are reduced by reducing the strong fluctuations

in the number of polygons to render.

Before our rendering algorithm starts to render a preprocessing step is necessary (see

Figure 4.4). For our parallel rendering method we need suitable objects. These objects

4.2 Related Work

Figure 4.3: Reliefboards applied on the UNC Power Plant scene. The left image shows

the scene rendered with the original geometry. For the right image relief-

boards were applied. The red frames shows the highlighted areas.

are groups of many triangle that are pooled in a small space. If the given input does

not fulfill these properties, we have to compute the required objects during preprocess-

ing. To determine suitable objects we can use an agglomerative clustering algorithm.

These objects are needed for the computation of the reliefboards as well as for the load

balancing. For this purpose we use a data management protocol. During runtime we

balance the rendering load of the different back-end nodes with the c-collision protocol

(see Section 5.2 for a brief overview).

optional:

determine

near zone

optional:

compute

objects

distribute

objects

determine

reliefboards

to update

balance load

and issue

async. requests

update arriving

reliefboards

render and

display image

preprocessing image computation

suitable objects

unsuitable

input

Figure 4.4: Our system transforms unsuitable inputs to objects that meet our needs.

These objects are distributed among the back-end nodes. In every loop,

reliefboard updates could be requested or received before a new image is

rendered.

4.2 Related Work

Here, an overview of related work for mesh replacement and displacement mapping will

be presented.

Mesh Replacement substitutes objects’ original geometry with a simplified repre-

sentation, which typically reduces rendering costs (e.g., impostors or LODs). Simple

4 Reliefboards

billboards replace original geometry by single textures placed on a rectangle or another

simple geometric object. A disadvantage of this technique is that billboards can inter-

sect at most once (see Figure 4.5). To reduce this negative effect, it is possible to use

multiple billboards instead of one [YSK+02, DDSD03].

Jeschke and Wimmer introduced Textured Depth Meshes to accelerate the render-

ing process [JW02]. To create these impostors a new mesh and texture are generated

from a voxelizated scene part. These simplifications can be used in fixed view cells as

simplifications for complex objects.

And´ujar et al. presented omni-directional relief impostors to improve rendering per-

formance [ABB+07]. In a preprocessing step they compute relief maps from different

directions for a given object. During runtime, depending on the camera position, some

of these maps are combined and rendered. Thus, the number of polygons sent into the

rendering pipeline is reduced. In contrast to the reliefboards in our rendering system all

these impostors are generated in a preprocessing step and not on-the-fly during runtime.

Shade et al. [SLS+96] use textures to reduce scene complexity. They compute a k-d

tree that contains the scene’s geometry. By utilizing this k-d tree, the cell’s geometry

can be substituted by a rendered image of it, as seen from the observer’s position. The

main drawback of this technique is that it is sometimes possible to see through a gap

between textures and real geometry. Furthermore, if the position is changed, there is no

parallax effect.

LODs, as for example Hoppe’s Mesh optimization [HDD+93] or his developed Pro-

gressive Meshes [Hop96] create simplifications based on an object’s mesh. Usually, these

LODs must be generated during preprocessing and the quality must be checked manu-

ally. Reliefboards are computed in parallel, automatically, and on-the-fly. They can be

used for groups of different objects and generate a less complex representation for all

of them at once. Detailed descriptions of other LOD techniques have been pooled by

Luebke et al. [LWC+02].

Displacement mapping is used to add depth to simple surfaces. In contrast to tech-

niques like bump or normal mapping [AMHH08], which only simulate the visual effects

of depth (e.g. by lighting or shadowing), displacement mapping changes the geometry

of a surface [Coo84]. These techniques support the parallax effect and intersection of

objects, such as parallax mapping [KTI+01, Wel04, AMHH08]. Typically, a simple sur-

face (usually a grid) is transformed to a more complex one by using several maps. By

using these meshes, correct shadows and valid occlusion can be computed. Furthermore,

objects’ silhouettes appear realistic [WWT+03]. Due to not respecting the distance be-

tween an object and the camera the underlying geometries are often overtessellated, so

that a retessellation is needed [GH99, DH00]. Typically, the displacement of the vertices

must be computed in every frame. If the number of vertices to displace is high, this

technique can influence the rendering time negatively. For this reason, reliefboards do

not necessarily displace the vertices in every frame. The vertices of the grid are shifted

only once and used for many frames.

4.3 Distributed Rendering

Original

Reliefboard Billboard

Figure 4.5: The bunny is rendered with original geometry. The lower left picture shows

a picture detail using reliefboards. The upper right picture shows the origi-

nal picture detail and the lower right shows the picture detail using simple

billboards. These simple billboards neither support multiple intersections

nor correct lighting.

4.3 Distributed Rendering

The parallel rendering system uses two different node types: a single visualization node

and a large amount of back-end nodes. The back-end nodes are utilized for the dis-

tributed creation of the reliefboards. The visualization node requests and uses the re-

liefboards to render the actual scene to the screen and provides the interface for the

interactive movement of the observer.

Rendering on the visualization node

The rendering of a frame on the visualization node is performed in three steps.

1. In the first step, all unsuitable objects inside the viewing frustum which were iden-

tified as too large or not complex enough are rendered to the screen. Their polygon

count is thereby accumulated. In neither case would it be practical to replace these

objects by reliefboards. If an object is too large, even a slight change of the ob-

server’s position is likely to cause an unacceptable change in the perspective. If

the object consists of too few polygons, the original object’s rendering time could

actually be better than the approximation’s.

2. In the second step, the other objects are rendered in front-to-back order from the

observer, as long as the overall polygon count does not exceed the value of a given

4 Reliefboards

parameter tnz. If the number of objects is small enough, we can sort the objects

in every frame using a standard sorting algorithm. Otherwise, one should utilize

a spatial data structure like a loose octree for storing the objects and traverse it

in front-to-back order. This implicitly results in the subdivision of the frustum

into the near zone, where only the original geometry is rendered and the far zone,

where reliefboards can be used (compare to Figure 4.2).

3. In the last step, a decision is made for each of the remaining objects in the view-

ing frustum, whether the original object or the corresponding reliefboard should

be rendered. If no reliefboard has yet been received for an object, the original

object is rendered and, if no reliefboard request for this object is pending, a new

request is issued. Each request consists of an unique identifier for the object, the

direction vector from the observer’s position to the object’s center and the object’s

projected size in pixels for the reliefboard’s resolution. Requests are processed

asynchronously and the resulting data arrives after a few frames.

point of view

Reliefboard

normal vector

error

Figure 4.6: When the angle error exceeds a new reliefboard is requested by the visual-

ization node. If the angle error exceeds E, the visual error is considered too

large and the original geometry is rendered.

If a reliefboard has already been received from the visualization node, the current

angle error is calculated. The angle error is defined as the angle between the direction

vector that was used as parameter for creating the reliefboard and the current direction

vector from the observer to the center of the board (see Figure 4.6). This is taken as an

indicator for the visual similarity of the reliefboard’s projection and the original object.

If the error is higher than a fixed parameter , and no request for this object is pending,

then a request for a new reliefboard incorporating the current direction is issued. If the

error is unacceptably large (defined by another fixed parameter E), then again the orig-

inal object is rendered until an updated version is received. Another source for artifacts

produced by a reliefboard is the difference in the current projected size to the relief-

board’s original resolution. If the observer approaches a low resolution mesh, blocking

artifacts occur. If the observer moves away from a high resolution mesh, aliasing arti-

facts occur. Hence, we use the relative difference in reliefboard’s current projected size

4.4 Reliefboard Structure and Creation

and the projected size when generated as an additional indicator to trigger an update

request.

If a reliefboard with an acceptable error is available, the corresponding relief mesh is

rendered at the original object’s position, oriented towards the direction it was created

in. This should be the most common case, as a reliefboard can normally be used for

many frames.

Also, in every frame a check is performed whether newly created reliefboards have been

received from the back-end nodes. If so, the corresponding old ones are replaced. This

results in the occurrence of a slight popping effect in the next frame. This effect is more

visible as the old and new direction vector differ. Nevertheless, these effects disappear

when applying blending. If the new reliefboard replaces an original object, the relief-

board’s direction vector differs from the actual direction in which the object was seen

last.

Distributed creation and load balancing

One goal of the distributed creation of the reliefboards is to keep the latency for their

requests as low as possible. When a request is delayed, it is possible that the out-dated,

old version of a reliefboard is used for a longer period resulting in a increased visual error.

If the request for an object, which is currently rendered with its original geometry, is

delayed the rendering time may increase.

Initially we distribute the data of all objects randomly and redundantly, with a fixed

number of replicas over the back-end nodes. In order to assign the requests to the nodes

at runtime, we make use of a modified version of the c-collision protocol [Ste96a], which

aims at having a low contention of every network node and spreading the requests evenly

over the nodes (see Figure 4.7). This protocol is implemented on the visualization node

and is executed after each frame if new requests were made.

The assignment of requests is performed in rounds. In every round, a node handles at

most crequests. If more than crequests arrives during a round, the node answers none

of them. We begin with 1 as value of c. If not all requests can be handled with that

value, cis increased until one node can fulfill all its requests. Then cis reset to 1 and

the distribution is continued until all requests are assigned to a node. The assignment

of a request to a node persists, until the reliefboard is received.

The dynamic handling of the value of chas the disadvantage of not being able to

ensure a maximal contention as with the original protocol, but guarantees the handling

of all requests, prevents the protocol from deadlocking and allows asynchronous commu-

nication.

4.4 Reliefboard Structure and Creation

A reliefboard’s foundation is a mesh consisting of colored vertices placed on a regular

grid which are shifted along the grid’s normal vector (see Figure 4.8). The general

orientation of the grid’s base plane is orthogonal to the vector from the center of the

4 Reliefboards

Obj5Obj2Obj5Obj2

Obj5Obj2Obj4 Obj2Obj4

Obj4

round i

Obj1

Obj2

Obj6

Obj4

c=1

Obj3

Obj4

Obj5

Obj2

cont=0 cont=0

Obj1

Obj3

Obj6

Obj5

cont=0

Obj1

Obj2

Obj6

Obj4

c=1

Obj3

Obj4

Obj5

Obj2

cont=1 cont=1

Obj1

Obj3

Obj6

Obj5

cont=0

round i+1

Obj1

Obj2

Obj6

c=1

Obj4

Obj3

Obj4

Obj5

cont=1 cont=1

Obj1

Obj3

Obj6

cont=0

Obj1

Obj2

Obj6

c=2

Obj3

Obj4

Obj5

cont=1 cont=1

Obj1

Obj3

Obj6

Obj5

cont=0

Obj1 Obj3 Obj4

Obj1

Obj2

Obj6

c=2

Obj4

Obj3

Obj4

Obj5

cont=2 cont=1

Obj1

Obj3

Obj6

cont=0

Obj1

Obj2

Obj6

Obj4

c=1

Obj3

Obj4

Obj5

cont=2 cont=0

Obj1

Obj3

Obj6

cont=0

Obj1 Obj3 Obj4

Figure 4.7: Circles represent requests and hexagons represent handled requests. The

data items are distributed redundantly and randomly among the different

nodes

reliefboard to the observer’s position at the time when the reliefboard is created. This

absolute orientation stays fixed as long as the reliefboard is valid. It does not change

when the observer’s position changes, unlike some simple billboards that always face the

observer.

The displacement of the vertices on the grid reproduces the surface of the approxi-

mated object as seen from observer’s original position in orthographic projection (i.e.

without perspective distortion). If the position of a vertex does not lie in the area that

is covered by the original object’s projection onto the reliefboard’s plane, this vertex is

removed from the mesh. As a result, the reliefboard’s silhouette reproduces the original

object’s silhouette (again, as seen from the observer’s original position). The vertex

colors and vertex normals are also taken from the projection of the unlighted original

object. The combination of unlighted color and the normal vector allows exploiting the

normal (possibly dynamic) lightning techniques, used during the reliefboard’s rendering

on the visualization node. If the perspective distortion or the lightning would be applied

during reliefboard’s creation these effects would be applied a second time when we render

the reliefboards on the visualization node.

4.4 Reliefboard Structure and Creation

Camera Position

View Frustum

(a)

View Frustum

Camera Position

Grid

(b)

Point of

creation

Camera Position

View Frustum

(c)

Figure 4.8: Image (a) shows from which position the reliefboard is created. Image (b)

visualizes how the vertexes of a grid are shifted. Image (c) shows how the

reliefboard holds its original alignment for multiple camera positions.

Creation

The data and parameters necessary for the creation of a reliefboard are:

•The geometry (including colors, normals, etc.) and the position of the original

object.

•The directional vector from the observer’s position to the original object’s center

(which also becomes the center of the reliefboard).

•The resolution of the grid. As the reliefboard should reproduce the original object

as well as possible, at least at the time it is created, the resolution is set to the

current projected size of the object (in pixels). In this way every pixel is initially

represented by one colored vertex.

For the computation of a reliefboard, the original object’s orthogonal projection needs

to fit into an area on the given screen’s resolution, the desired side facing towards the

camera. Hence the object is initially translated and scaled accordingly (see Figure 4.9).

Then four intermediate, two-dimensional color maps of the given resolution are created

by rendering the original object onto the screen (or to an off-screen buffer). As we want

to exploit a cluster’s nodes (with possibly not up-to-date graphics hardware) for the

generation of the reliefboards, we do not use multi target rendering to render to multiple

buffers at once. Instead, we use multiple rendering passes to obtain the data for the

different target maps separately. Each of the four rendering passes exploits a simple

shader to support the efficient processing of the data into the target map:

4 Reliefboards

Figure 4.9: To compute the textures for reliefboard generation the camera is placed on

the unit-sphere. The spheres center is placed at coordinate systems origin.

Each object is scaled and translated such that its AABB fits into the sphere.

The distance-map encodes the displacement by extracting the

depth information of the pixels. This map is used to shift the

vertices of the grid. The higher a pixel’s value the more the corre-

sponding vertex is pulled towards the camera position, the lower

a value the more the vertex is pushed away.

The stencil-map marks which pixels are covered by the projection

and which vertices should be removed. The stencil is needed to

determine which vertices are needed for the approximation and

which have to be discarded.

The color-map reproduces the unlighted (but possibly textured)

color of the object. Each pixel indicates the unshaded color of the

corresponding vertex.

The normal-map encodes the surface normals in order to support

correct lighting.

Finally, these maps are combined with the grid to form the mesh of the reliefboard

at the given position. The created reliefboard replaces the original mesh and does not

change its alignment during the walk-through.

4.5 Identifying Objects by Clustering

To achieve good results with reliefboards triangle groups are required instead of single

triangles. The triangles in such groups should be pooled as close as possible. Below,

these triangle groups are called objects. These objects are not limited to semantic groups

like chairs. Instead, these objects can consist of arbitrary elements.

For our parallel rendering system we require that, on the one hand, the objects consist

of a sufficient amount of triangles and, on the other hand, that the objects have a

relatively small diameter. Reliefboards replace the triangles of objects by another groups

of triangles, whose complexity has an upper limit. Thus, a sufficient triangle count of

the replaced object is necessary to benefit from a replacement. As shown in the previous

section, reliefboards are generated from several maps. The maximal resolution of those

maps is an input parameter p, which defines the maximal complexity of the reliefboards.

This resolution also defines the granularity of the simplifications. Our tests have shown

that maps with a side length of 256 pixels yield good results. The smaller the diameter

of the replaced object the better reliefboard’s quality. If the diameter of an object which

is to replace is too large, the generated simplification is to coarse.

For entered input there are two options: the entered objects meets the needs for our

rendering system or they do not. In the first case, the entered objects can be used

without any additional computation. In the second case, we need to compute suitable

objects.

cluster cs with

tnz triangles

near zone

radius dnz

(a)

dnz

p pixels

slc in world space

(b)

dnz

slc

(c)

Figure 4.10: At first we determine the expansion of the near zone. Afterwards we

calculate the size of suitable objects, followed by the determination of the

objects for our rendering algorithm.

For a good image quality we render the objects close the the camera position only

with their original geometry. In this near zone we limit the number of rendered triangles

of the original objects. Our tests have shown that a third of the maximal amount of

triangles that can rendered per frame is a good triangle limit for this zone. Additionally,

beyond this near zone the projected size of the replaced objects must not be bigger

than the side length of the textures used to generate the reliefboards. Thus, we have to

4 Reliefboards

determine the expansion of the near zone. We compute the smallest cluster of triangles

that consist of this triangle limit. The diameter of this cluster defines the size of the

near zone (see Figure 4.10a).

Using the expansion of the near zone, display’s resolution, and the size of the textures

to generate reliefboards allows for determination of the size of our clusters (see Figure

4.10b).

In the last step we compute the final object clusters for our parallel rendering system

(see Figure 4.10b). The size of these clusters can be smaller than the calculated maximal

size, but they can be larger. Another property is that the clusters can insect each other

which does not influence our rendering method.

4.5.1 Clustering Algorithm

Suitable objects are needed for our parallel use of reliefboards. The objects must have

a reasonably high triangle count and must be compact in space. If the objects are

given in a suitable form, reliefboards can be applied directly. Otherwise, if the objects

are unsuitable or the geometry consists of triangles in an arbitrary order, we have to

segment the scene into several, new objects. Therefore, we perform a preprocessing

stage consisting of three phases. In the last two phases we use a heuristic, based on

agglomerative clustering [DE84].

Figure 4.11: Clustering based on the smallest joined bounding box. The two objects in

the dashed black bounding box are joined next.

In the preprocessing’s first phase we have to process the input. First we remove

redundant vertexes and faces from the given input. Afterwards, we determine connected

components and their axis aligned bounding boxes. These partial meshes are the base-

clusters for our agglomerative clustering algorithm.

In the next phase we choose the near zone range dnz. In this zone it is guaranteed

that an object’s original geometry is always rendered. Using the radius of this zone, we

calculate the maximal side length slcof the clusters. These clusters can be substituted

by reliefboards. For this calculation, we need the maximum amount of triangles tnz that

4.5 Identifying Objects by Clustering

can be rendered in the near zone, the reliefboards’ maximal projected side length in

pixels p, and the display resolution ras parameters.

Finally, we determine the required clusters with the previously determined side length

slc. During rendering, these clusters can be replaced by reliefboards. Below, phase two

and three are described in detail.

Phase II: Choosing near zone

In the beginning of this phase we calculate the cluster cswith tnz triangles which has

the “smallest” diameter. To do this we use our agglomerative clustering algorithm as

described later in this section. The diameter of csis the range of our near zone dnz.

Using the determined dnz and the given parameters pand r, we can calculate the needed

side length for the clusters slc.

Figure 4.12: Clustering of the UNC Power Plant. Green (brighter) objects exceed the

maximal side length of clusters and red (darker) ones are suitable clusters.

Different colored parts are clusters, whose complexity is not large enough

to replace them with reliefboards.

Phase III: Determining suitable clusters

In the algorithm’s third phase we use the base-clusters and slcin the agglomerative

clustering algorithm. At first we remove all base-clusters whose maximal side length is

bigger than slc. A replacement of such a base-cluster would result in massive pixel errors

because of reliefboard’s limited granularity, given as parameter p. Hence these base-

clusters should not be substituted by reliefboards (these objects are unsuitable). The

4 Reliefboards

clustering algorithm combines the other clusters as long as their longest side is smaller

than slc. When a cluster exceeds this limit it is completed and removed from further

clustering. The resulting clusters are the objects which can potentially be replaced by

reliefboards. In Figure 4.12 these different types of objects are visualized. Green objects

are too large to be replaced by reliefboards (they are unsuitable) while red objects are

suitable. Objects in Figure 4.12 which are rendered in a different color are also unsuitable

because they are too simple to be replaced.

Agglomerative cluster algorithm

Our agglomerative clustering algorithm’s inputs are the base-clusters as described before.

Naive agglomerative clustering algorithms compute the joined axis aligned bounding box

(AABB) for all potential smallest cluster pairs. Here we combine two clusters whose

longest side of their combined AABB is minimal with respect to all other cluster pairs

(see Figure 4.11). For optimization purposes we do not compute all possible joined

AABBs. Instead we compute only a constant number of AABBs for each cluster, we

use only those clusters that are nearby as described later in this section. By this we

reduce the clustering time from O(n3) of the naive algorithm to O(nlog n), where nis

the number of connected components at the beginning.

To determine the cluster efficiently we use a loose octree [Del00] and a priority queue

of cluster pairs (clustera,clusterb),a 6=b. The queue’s first element is always that pair

with the smallest joined AABB. While building the loose octree, its nodes are subdivided

as long as they include more than a defined, constant number of base-clusters. Every

loose octree node is identifiable by a unique ID i∈N.

Afterwards we determine for all loose octree nodes cellithe set of adjacent loose

octree cells on the same octree level: adji={celli,cellx,celly,. . . ,cellz},1≤ |adji| ≤ 27.

The union of the clusters in the loose octree cells in adjiis stored in another set uaci.

While the clusters within the different celliare disjoint, different uacican include equal

clusters. For each uaciwe compute the smallest cluster pair scpiand push this pair into

the priority queue.

The following steps are performed to determine a new cluster and to reduce their total

number: As defined, both clusters (clusteri,clusterj) of the queue’s first element build

the smallest cluster. Clustering this pair requires an update of the priority queue and

the loose octree. For that we determine the octree nodes celliand celljthat contain

the clusters. From these loose octree nodes we get the corresponding sets of adjacent

cells adjiand adjj. For all octree nodes cellk∈adji∪adjj, we remove the clusters

clusteri,clusterjfrom cellkand from its associated uack. Additionally, we remove scpk

from the priority queue.

After all copies of clusteriand clusterjare removed from the loose octree we join

these clusters to clustern. The new cluster is inserted into the loose octree at cell celln.

Next, for all octree nodes cellk∈adji∪adjj∪ {celln}we update the uack, determine the

new scpk, and push it into the priority queue.

While repeating this algorithm the cardinality of at least one uacidecreases. If the

number of included clusters in a loose octree node is below a threshold, we lift all clusters

4.6 Evaluation

from that node celliinto its parent node and remove its scpifrom the priority queue.

Afterwards we update celli’s parent node and delete cellifrom the tree.

4.6 Evaluation

In this section, we present empirical results showing the practicality and the main char-

acteristics of reliefboards. To evaluate system’s scalability we analyze the response time

for reliefboard requests and the effects if increase the number of copies of each suitable

object. Additionally, we analyze how a larger number of back-end nodes influence the

image quality. To gain information about systems rendering performance we compare

the rendering times of our system with the rendering times of a system that does not

use reliefboards.

Evaluation system

We implemented the presented techniques using the programming language C++ (GCC-

3.4.0), OpenGL for rendering and MPI (ScaliMPI-3.12.0-1) for the communication be-

tween the nodes. For the evaluation we use the large configuration of the PC2Arminius

cluster. The frame’s resolution is set to 1,024 ×768 pixels.

Scenes and parameters

Figure 4.13: Camera path in the UNC Power Plant scene. The path is traversed in

6,000 steps at a uniform speed.

We used two different test scenes: A forest composed of many (not instantiated) mod-

els of the Stanford bunny and different kinds of trees (≈25 M triangles in 250 total

objects see Figure 4.5) and the well known UNC Power Plant model (≈12.4 M triangles

in 150 total objects). As the forest scene consists of relatively compact objects with

almost uniform complexity, it is possible to use these objects directly without any clus-

tering. The power plant’s original objects are very complex and lack locality (e.g. long

4 Reliefboards

pipes or a chimney). Hence, to allow the proper application of reliefboards, we defined

150 new objects by clustering the scene (see section 4.5).

The camera path used for the power plant is shown in Figure 4.13. The Forest scene is

circled by the camera in 6,000 camera positions.

In our first tests we started an update when an angle error of = 11.5◦was exceeded

or when the projected side length of the reliefboard differed more than 30% from the

original. The original geometry was rendered when the angle error exceeded E= 45◦.

The value for tnz is set to 3.5 M triangles.

The number of replicas for the objects placed on the back-end nodes was set to three,

as a compromise between a sufficient distribution of the requests over the nodes and the

increased demand for memory for additional copies.

Scalability

In the first scalability test we evaluate the influence of the number of back-end nodes used

on the requests’ average response time. Due to the need of storing the data redundantly

on different nodes we use at least four back-end nodes in all our tests. This is the average

time that passes from the initiation of a request until the corresponding reliefboard has

been received by the visualization node. This includes the time for completing all prior

requests on a node, rendering the original geometry, reading the frame buffer in the

desired resolution several times (see section 4.4), and finally composing and sending the

created reliefboard to the visualization node.

Our experiments show that increasing the number of nodes decreases this response

time, as the average congestion (the average size of the queue of open requests) on every

node decreases (see Figure 4.14). If many back-end nodes are available, the response

time converges to the time needed for the construction and transmission of a single

reliefboard. This implies that, for a given scene and an amount of requests (implied by

the speed of the observer’s movement), increasing the number of nodes past a certain

point does not improve the performance.

The time needed for updating all reliefboards in the scene (e.g. when the observer

turns around very quickly) is not much higher than the average time needed for a single

request, due to the parallel execution. For the UNC Power Plant scene, the time for a

complete update is nearly the same as the average response time; for the forest scene,

the time is about twice the average response time.

In our our second scalability test we evaluate the influence on the c-collision protocol

when the number of nodes is increased. We use the c-collision protocol to distribute the

rendering load among the back-end nodes. Due to the asynchronous communication and

the permanent changing situation on the visualization node, requests for new reliefboards

can be delayed. Instead of using an unchangeable cin our implementation of the protocol,

we increase cto assign requests (described in Section 4.3). The number of back-end

nodes and the number of copies should influence the maximum creached during the

walk-through.

For this test we discard the maximum contention of the nodes, because our algorithm

computes new reliefboards for all suitable objects at the start of execution. When

4.6 Evaluation

5,000

1,0000

15,000

20,000

25,000

30,000

4 back-end nodes

8 back-end nodes

12 back-end nodes

16 back-end nodes

20 back-end nodes

24 back-end nodes

28 back-end nodes

32 back-end nodes

avrg. response time in milliseconds

Response Time

Power Plant

Forest

Figure 4.14: Average response times for requested reliefboards. The more nodes, the

faster is the response – to the point where all requests can be processed

immediately.

reliefboards’ initial computation is finalized, and the contention of all back-end nodes is

zero, we start the measurement. The resulting maximal c’s are displayed in the diagram

of Figure 4.15.

The diagram shows that the maximal value of cdecreases if the number of back-end is

increased. The higher the number of back-end nodes the lower the number of reliefboards

each back-end node has to compute. In the UNC Power Plant scene we were not able to

measure a significant effect when the number of distributed copies is increased, unlike

in the forest scene. Noticeable is the significant increased cwhen we used four back-end

nodes and we distributed each object three times. In this case the cis about 2.35 times

larger than that when we distributed only two copies. This is due to the amount of

data every back-end node has to store and process. Memory space that is required to

store the objects exceed the GPU’s available memory on the back-end nodes. Due to

this objects must be rendered from nodes’ main memory, which is less efficient. Thus

processing a request requires more time and more new requests for other objects appear.

If more than four back-end nodes are used, the system takes advantages of the redun-

dant placement. In the forest scene the maximum creached decreases when the number

of copies is increased.

4 Reliefboards

4 back-end nodes

8 back-end nodes

12 back-end nodes

16 back-end nodes

20 back-end nodes

24 back-end nodes

28 back-end nodes

32 back-end nodes

maximal c

Contention

Power Plant 1 copy

Power Plant 2 copies

Power Plant 3 copies

Forest 1 copy

Forest 2 copies

Forest 3 copies

Figure 4.15: Maximal reached contention during the walk-through. With increasing the

number of processors, the maximal contention decreases.

Rendering time

One property of our method is that the frame rate is mainly determined by the visual-

ization node’s computational power and the properties of the scene (i.e. the number of

objects in the viewing frustum, the complexity of the unsuitable objects and the reso-

lution of the reliefboards). The number of back-end nodes only slightly influences the

rendering time. Due to an overreaching latency resulting in a high error rate, the original

objects are rendered instead of a reliefboard. Figure 4.16 shows that the rendering time

using reliefboards can be reduced substantially if many objects are in the frustum.

The average rendering time along the chosen camera path using only frustum culling

and the original objects is 43 ms, whereas the average rendering time with applied re-

liefboards is only about 16 ms, nearly independent from the number of back-end nodes

used (the observed difference is below the error threshold of the measurements). Thus,

the frame rate for a given scene can be influenced by the hardware configuration of only

a single PC, which can be improved by relatively cheap updates without modifying any

other node of the cluster.

While testing the Forest scene the frame rate fluctuates widely if only original geom-

etry is rendered. Using reliefboards accelerates the rendering process and reduces these

fluctuations.

4.6 Evaluation

0 1,000 2,000 3,000 4,000 5,000 6,000

time in ms

frame

Rendering Time per frame (Power Plant)

4 back-end nodes

8 back-end nodes

12 back-end nodes

16 back-end nodes

20 back-end nodes

24 back-end nodes

28 back-end nodes

32 back-end nodes

No reliefboards

100

120

140

160

180

200

220

240

0 1,000 2,000 3,000 4,000 5,000 6,000

time in ms

frame

Rendering Time per frame (Forest)

4 back-end nodes

20 back-end nodes 32 back-end nodes

No reliefboards

Figure 4.16: The diagrams show the rendering acceleration through reliefboards – top

for the UNC Power Plant scene, bottom for the forest scene . The speedup

is nearly independent from the number of nodes. The highest lines show

the rendering time without reliefboards.

4 Reliefboards

Scene sizes

The complexity of the scene for which the presented method is suitable is determined

by several parameters:

•The overall amount of available memory on the back-end nodes needs to be large

enough to store the scene redundantly with the chosen number of replicas. In our

experimental setting, this was far from becoming a problem.

•The computational power of the graphics adapter of the visualization node mainly

determines the achieved rendering time, which is composed of the time needed for

rendering the original geometry and for the reliefboards. If the number of objects

in the scene gets too large, the frame rate may decrease because too many triangles

need to be rendered. In this case the amount of original geometry rendered can be

adjusted by lowering the parameter tnz. In this way the total number of rendered

triangles is decreased. This reduces the quality but results in additional capacity

for the rendering of reliefboards.

In our prototypical implementation, the size of the scene is also limited by the amount

of available memory of the visualization nodes’ graphics adapters. One should be able

to lift this restriction by adding a simple memory management mechanism, which stores

the objects that are located near the observer’s position or which are likely to get into

the near zone within a few frames, in the graphics cards memory.

Figure 4.17: The left image shows the large forest scene that consists of ≈60 Mtriangles.

The diagram in the right image show the rendering acceleration reached

when reliefboards are applied.

Using another hardware setup consisting of nine desktop PCs, we could render another

forest scene (see right image in Figure 4.17) with about 60 M triangles with 18 fps by

applying reliefboards (tnz = 3.5 M, 550 objects displayed as reliefboard). In this con-

figuration the visualization node is equipped with an Intel Quad Core 2.8 GHz, 8 GiB

4.6 Evaluation

RAM, a NVidia 260 GTX GeForce with 896 MiB graphics memory. Additionally we

use multiple standard office PCs as back-end nodes. When solely rendering the original

scene on the visualization node without reliefboards we only achieve about 4 fps (see left

image in Figure 4.17).

Visual quality

Although the frame rate is nearly independent of the number of back-end nodes, the

visual quality is heavily influenced by the response time and therefore by the number

of nodes. Generally speaking, a lower response time results in less visual error which

can be caused by an outdated reliefboard. Figure 4.18 shows the average distribution of

the angle error for the UNC Power Plant scene. The error angle is a good indicator for

the image error. The difference in the projected size is not considered as the influence

is insignificant due to the chosen camera path. As expected, more nodes decrease the

occurrence of errors. For the chosen setting, about 28 nodes seem to be sufficient to

achieve a mean error of only 16 degrees (compare to Figure 4.1). Additional nodes are

not necessary as almost every request can be processed immediately. The high peaks

occur when new objects appear in the viewing frustum. In this case, the angle between

an object’s normal when its last reliefboard was created and the current viewing vector

towards the object can be arbitrarily large.

100

120

140

160

180

4 back-end nodes

8 back-end nodes

12 back-end nodes

16 back-end nodes

20 back-end nodes

24 back-end nodes

28 back-end nodes

32 back-end nodes

angle error in degree

Image Error

Quartiles Median Average

Figure 4.18: Accumulated angle error in the UNC Power Plant scene (with min, max,

quartiles, median and average values): The visual quality scales with the

number of available nodes.

4 Reliefboards

During the real-time walk-through, a direct pixel based comparison between images

where only original objects are used and images where reliefboards are used is not re-

alizable. This is because of the time needed for this comparison (the required time for

reading from the frame buffer and storing the images influence the results). Less images

are produced in the same time when no comparison is made, so more time available to

generate the approximations. As a result the angle in between two updates decreases

which influences the image’s quality positively.

Differences between images with and without reliefboards, can be seen in Table 4.1.

4.7 Contribution

Reliefboards allow the use of a wide range of PC clusters with only limited graphic

capabilities for parallel rendering, so that only one powerful visualization node is needed.

In contrast to other parallel rendering approaches, this method scales in the quality of

the rendered picture instead of rendering speed, when the number of PC cluster nodes

is increased.

Using reliefboards allows us to render complex three-dimensional scenes that fit into

the visualization node’s memory. To apply these approximations, we have to compute the

suitable objects for replacement by reliefboards. In order to achieve this we use a heuris-

tic, agglomerative clustering algorithm that builds the required objects in O(nlog n)

time.

Because of the scene’s size, we were required to apply approximations. These approx-

imations should be easy to compute, low in complexity, and suitable for many camera

positions. Reliefboards meet all these needs. These approximations can be computed

by computers with weak graphic performance in a few seconds. Their complexity is

controllable and limited by the granularity of the grid from which they are generated.

The relief properties makes them suitable for multiple camera positions, in contrast to

billboards.

To get updates of outdated reliefboards as quick as possible we had to distribute the

rendering load evenly among the PC cluster nodes. Here we used the c-collision protocol

to achieve this goal. Randomized object distribution results in each node having to

process complex objects as well as less complex objects. The redundant distribution can

reduce the nodes’ maximum contention.

4.7 Contribution

Table 4.1: The images in the left column shows images made with only original objects.

Middle column’s images are made using reliefboards. The images in the right

column displays the differences between the images in the previous columns.

Scenario II:

Static Scenes Which do not fit into a

Processor’s Primary Memory

Out-of-Core

Figure 4.19: Massive Models consist of so many triangles that they do not fit into node’s

primary memory. This model of a Boeing 777 consists of approximately

350,000,000 triangles and requires more than 8 GiB memory.

Massive Models have such high memory requirements that these scenes do not fit into

the primary memory of a single PC (see Figure 4.19). To process such scenes, out-of-

core rendering mechanisms are required. In such systems, scene parts are loaded from

a secondary memory device to a computer’s primary memory when they are needed.

Usually the computer’s hard disk is utilized to store the scene parts that do not fit

into primary memory. The usage of hard disks as secondary memory has advantages

for our needs: this memory is cheap, scalable, modifiable, and easy to exchange. The

acquisition of new drives can be done easily, in contrast to the primary memory. Usually,

if no further memory is available new drives can be added or old disks can be replaced by

new ones with higher capacities. The disadvantage of this memory is its speed. For our

needs, hard disk latencies are high in comparison to the machine’s main memory, not to

mention GPU’s memory. Distributed data placement, like different RAID systems, can

reduce these latencies.

For such Massive Models we remove the requirement of the previous scenario that the

complete scene must fit into the computer’s primary memory. Here we do not require

that all visible objects participate on the final rendered image. Thus, this approach can

lead to image errors, but it allows us to render the images in real-time.

We developed two different parallel and approximative rendering algorithms for such

large scenes. The first algorithm uses a modified version of the c-load-collision protocol

to balance the rendering load and to reduce node’s contention. Here we analyze the

balancing if we increase the number of copies, distributed among the back-end nodes.

Furthermore we tested the influences of randomized and deterministic tiebreaker algo-

rithms.

As preliminary work for our second parallel out-of-core rendering system we developed

the hull tree, a special purpose data structure. The hull tree is a spatial data structure

that covers scene objects tighter than common data structures. Additionally, we de-

Scenario II

veloped a sequential rendering algorithm that exploits this data structure. To speedup

visibility tests we also use approximations. In this way we reduce the number of triangles

sent through the rendering pipeline.

To use the hull tree for the parallel out-of-core rendering system some minor modifica-

tions were needed. We combined the hull tree with another hierarchical data structure

to improve its properties for our needs. For these systems we evaluate the time re-

quired to produce images. Additionally, we analyze the influence of different hardware

environments on the rendering algorithm.

Our parallel out-of-core rendering approaches use the main memory of the back-end

nodes in a PC cluster as secondary memory. The network adapters in PC2’s Arminius

cluster enable faster access to the different data items than regular hard drives. Addi-

tionally we can use back-end nodes’ computational power to filter the items sent across

the network. Nevertheless, due to back-end nodes’ weak computational power, we do

not request the items synchronously. Similar to our approach for reliefboards we send a

request and receive the answers several frames later (see Figure 4.20). At any time only

a small amount of the complete scene is stored on a small group of visualization nodes,

where the images for the user are produced.

Figure 4.20: Example of the communication scheme for our parallel out-of-core rendering

systems. The visualization node sends a request and continues its rendering

process, while back-end nodes perform their tests and answer the requests

several frames later.

5 Load-Balancing using the

c-Load-Collision Protocol

In this chapter we will present a workload balancing technique for heterogeneous clusters,

based on the c-load-collision protocol. In the previous part we developed a rendering

technique for scenes that can be stored completely in the primary memory. The parallel

out-of-core rendering systems presented next do not have this limitation. The system

we present in this chapter was designed, implemented, and evaluated in Wiesemann’s

diploma thesis as a collaboration [Wie10]. The presented approach can be applied when

a fast network is available but the data access or the computations on the different

nodes is slow. In this work, we focus on the load-balancing of large groups of back-end

nodes. The rendering speed itself depends mostly on the rendering performance of the

small group of rendering nodes. To evaluate this protocol we implemented a parallel

out-of-core rendering system with dynamic load-balancing for huge data sets (larger

than 8 GiB). In this system, nodes with graphics adapters that provide a high triangle

throughput are responsible for image generation, whereas the back-end nodes serve as

secondary storage and for other smaller tasks [SWF10a, SWF10b, Wie10].

The nodes of PC2’s Arminius Cluster are connected via Infiniband. The nodes of

this PC cluster provide weak rendering performance, but they can be used for other

important tasks to support the rendering process. Their main memory can be used as

secondary storage in an out-of-core rendering system, instead of hard drives, which are

typically used if the main memory resources are exceeded. Unlike the storage systems in

other out-of-core rendering systems, these nodes can also perform additional tests and

calculations largely autonomously. In order to reduce network’s load, we aim to keep

the amount of data that is received by these nodes small. To utilize the PC cluster for

our out-of-core rendering the system has to fulfill the following requirements:

•The scene should be distributed evenly among the main memories of the back-end

nodes.

•The back-end nodes’ tests and computations should require little information from

the other nodes.

•The visibility calculations should be balanced for most camera positions.

•The back-end nodes’ workload should be balanced.

•The amount of data to be sent to the visualization node must be kept small.

5 Load-Balancing using the c-Load-Collision Protocol

Figure 5.1: Layout of our rendering system, consisting of a visualization node (VN), four

render nodes (RN) and several back-end nodes (BN). Each of the four render

nodes is assigned to a portion of the screen (tile).

5.1 Overview and Summary of Results

Our system architecture requirements a single visualization node, a small group of power-

ful render nodes, and a large group of back-end nodes. In this approach, the visualization

node does not need a powerful graphics adapter. This node is used to accept user inputs

and to send them to the render nodes. Additionally, it receives the partial images from

the render nodes, combines them, and displays the final image.

Each render node renders a part of the final frame. Therefore, a node stores only the

necessary objects to its piece of the frame. Newly appearing objects which are requested

from the back-end nodes are determined using a spatial data structure.

Instead of sending the requested objects directly, the back-end nodes test the objects

for visibility. To do this, these nodes use a depth buffer they receive periodically from

the render nodes. If an object it tests is visible, it is sent to the inquiring render node.

In this system we distributed the different objects redundantly and randomly. This

provides good load balancing for most camera positions. To benefit from this distribution

(i.e., to keep contention low and to distribute the load evenly) we use a version of the

c-load-collision protocol.

To decrease the number of rendered objects, the render nodes need only a sample tree as

spatial data structure. To perform the visibility test, the back-end nodes exploit render

nodes’ depth buffers, which are updated periodically.

5.2 Related Work

To minimize the delivery time in our parallel out-of-core rendering system fast data

access is required. There are various data management protocols for networks, which

balance the occurring data requests among multiple data modules. In our sense, a data

module is equal to a back-end node. Dietzfelbinger and Meyer auf der Heide present

a deterministic approach to simulate a PRAM on different types of distributed memory

machines [DMadH93]. Using a c-arbitrary DMM, a data module processes at most cof

the arriving requests in one step. Using a c-collision DMM, a data module processes all

requests if there are at most cof them; otherwise it rejects all. The expected delay to

produce an access schedule is bounded by O(log n) with high probability.

Stegmann developed the first version of the randomized c-collision protocol which

reduces the expected maximal contention of a data module to a constant cwith high

probability [Ste96a]. A known issue for this protocol is that there exist situations where

it cannot terminate for a given c. A rule of the protocol says that each node answers at

most crequests. If all nodes receives more than crequests, none will answer. To resolve

that problem, ccould be increased until the protocol terminates.

The c-collision protocol can be interpreted as balls into bins game. In the previously

described versions, the balls’ weight is uniform. In contrast, Berenbrink et al. introduced

a protocol for weighted balls [BMadHS97]. In their c-load collision protocol, for every

request, three data modules are randomly chosen. Each request is sent to its chosen

data module, which answers all incoming requests if and only if their sum is at most c.

5.3 The Parallel Rendering System

In the following subsection we will describe the developed rendering system in more

detail. Our system requires a visualization node, render nodes, and back-end nodes as

described in the previous section (see Figure 5.1). To allow for sufficiently fast data

transfers we require a fast network. In our setting all nodes are connected through

Infiniband network adapters. Instead of the usually used hard drives as secondary mem-

ory, our rendering system uses the main memory of the different back-end nodes. The

secondary memory stores the currently unused scene parts that cannot be stored in the

primary memory.

The visualization node has four different jobs: displaying the rendered images,

balancing the render nodes’ load, organizing the render nodes’ data requests, and initi-

ating the updates of the back-end nodes’ depth buffers. The visualization node receives,

combines, and displays the different frame parts from the render nodes for each frame

(see Figure 5.3). To balance the individual render nodes’ load, the visualization node

receives the time it took to generate partial images from the render nodes. The task

sizes of the render nodes are modified based on these times, similar to the approach of

Abraham et al. [ACCC04]. To organize the data requests, the visualization node receives

these requests from the render nodes and calculates an assignment using a variant of the

c-load-collision protocol (see Section 5.4). We choose the c-load-collision protocol for

5 Load-Balancing using the c-Load-Collision Protocol

Figure 5.2: The images show the tail of a Boeing 777 from the same position, but different

view angles. When the camera is moved rapidly, image-errors can occur.

Both images show a tiled frame produced by render nodes 1 and 2. The

right image shows missing parts, caused by delayed objects.

load balancing because it assigns requests to nodes while considering all other requests.

Additionally, in general this protocol can be executed via network without keeping a cen-

tralized data structure. To allow the back-end nodes to perform a worthwhile occlusion

culling, the visualization node periodically initiates depth buffer updates.

Figure 5.3: Communication flow to produce an image on the visualization node.

The render nodes’ job is to render partial images of the scene. The dimensions and

positions of their tiles are received from the visualization node, as are the current camera

settings. After receiving the needed data, the nodes align their tile and camera settings

(see Figure 5.1). Next they perform a frustum culling on their extended viewing frustum

(see Figure 2.2) and reorganize their object cache. If the frustum test shows the potential

requirement of objects, which are not stored in their memory, a node sends a request to

5.3 The Parallel Rendering System

the visualization node. Due to this asynchronous communication objects arrive delayed,

which can result in temporal image errors (see Figure 5.2). The render nodes check

if there are new objects available before they render a frame. During rendering, the

back-end nodes perform occlusion culling to determine invisible objects. These objects

are stored in the object cache for potential reuse. After rendering their partial image,

each render node copies it from the GPU-memory to its main memory and sends it for

displaying to the visualization node. If initiated by the visualization node, the render

nodes transmit their local depth buffer to the different back-end nodes.

Figure 5.4: Communication flow to update the render nodes’ tile dimensions and back-

end nodes’ depth buffers.

The back-end nodes serve as smart secondary memory. Instead of sending objects

untested to a render node, they test objects’ visibility first. Each node stores only a

subset of the scene’s objects. It is difficult to perform occlusion culling with only a

subset of the entire scene. Due to the missing occluders, objects might be classified as

visible, while they are actually hidden in the final image. For this reason, the back-end

nodes periodically receive render nodes’ depth buffers and combine them (see Figure

5.4). During their own visibility tests, back-end nodes update the received depth buffers

successively. If they test an object and it is visible they send its data to the requesting

node. Otherwise, they send nothing.

The load-balancing protocol used requires that all data items are distributed re-

dundantly and randomly among the back-end nodes. Because the scene is too large to

fit into the main memory and it is not possible to render it in real time, it has to be

approximated and distributed.

For this purpose we use a sample tree data structure [KKF+02]. The sample tree parti-

tions the scene into smaller subsets that are distributed across the network (see Figure

5.1). In the course of this chapter, an object refers to all of the geometry contained within

a single sample tree cell. For fast data access, we use a modified version of the the c-

load-collision protocol. A request passed into this protocol is equivalent to a request for

one of these objects. Based on this protocol, the back-end nodes’ load is balanced. Al-

though there are other approaches (e.g., Levels of Detail [LWC+02]) the disadvantage of

these is that surface properties usually are altered. In this way, unwanted gaps between

5 Load-Balancing using the c-Load-Collision Protocol

objects can appear and influence image quality and collision-detection algorithms. To

produce a frame, the view-plane is separated in multiple tiles and each tile is rendered by

a different render node (see top, right in Figure 5.1). These nodes are the high-capacity

cluster nodes, which are equipped with modern graphic adapters. If needed data-items

are not stored in a render node’s local memory, a data request is send to the visualization

node. Instead of waiting for the requested data-items, the render node renders its frame

without the missing parts. In this case, errors can occur in the final image (see Figure

5.2). After rendering a tile, it is send to the visualization node.

In our environment each render node is equipped with 4 GiB of RAM, but our bench-

marking model of a Boeing 777 consists of roughly 350 M triangles (≈8.5 GiB), which is

more than twice the size of the primary memory.

The back-end nodes are utilized in two ways: due to their lack of graphics adapters,

they serve as secondary memory and assist the render nodes by performing occlusion

tests. If the tested object is visible, its data will be sent to the requesting render node.

Otherwise, the request is discarded without notifying the requesting render node.

5.4 Load-Balancing Algorithm

The c-collision protocol allows for fast data access in a network, where the different data-

items are distributed randomly and redundantly. Thereby the contention of different

back-end nodes is limited by the constant cwith high probability [Ste96b]. The protocol

works as follows: each request is assigned to all back-end nodes that store a copy of

this item. If a back-end node has at most c assigned requests, it replies to all of them

and removes the answered requests from the other back-end nodes. If a back-end node

gets more than c requests, it answers none of them. This scheme is repeated for several

rounds until all requests are served (see Figure 5.5). It is assumed that all requests have

equal weight. For example, weights can be the required processing time or the message

size.

There are different situations in which the requests are weighted; these weights do

not affect the c-collision protocol. Berenbrink et al. present in [BMadHS97] the c-load-

collision protocol. This protocol is a modified version of the c-collision protocol that

respects the requests’ weights by involving them in the request assignment. The given

load limit ccan only be held if all tasks have been finished before new requests arise. It

is impossible for this condition to hold if the protocol is used to balance the load of a

parallel out-of-core rendering system, which is supposed to offer sufficient frame rates.

Waiting until all requests have been answered would cause serious speed reductions in the

rendering process. Thus, an asynchronous communication scheme is needed that allows

for overlapping rounds. In our system, we investigate how two different strategies for

choosing responsible nodes affect the load-balancing situation. To do so, we developed

a randomized and a deterministic function for node picking. Furthermore, we evaluate

how overlapping rounds affect our version of the c-load-collision protocol.

Similar to the regular c-collision protocol, all data-items are distributed redundantly

and randomly across the different back-end nodes. This assignment does not change

5.4 Load-Balancing Algorithm

Figure 5.5: Three rounds of the c-collision protocol. A node with assigned requests will

not answer again.

during runtime. The render nodes send their requests to the visualization node instead of

sending them directly to the responsible back-end nodes. In every frame the visualization

node receives all new requests of the render nodes and calculates an assignment of tasks

to the back-end nodes. By using the protocol, we try to distribute the network load

evenly among the back-end nodes. Instead performing the c-load-collision protocol on

the real network it is simulated on the visualization node. This way no additional

communication is needed and the balancing-time is reduced.

The pseudo-code for our implementation of the c-collision protocol is shown in algo-

rithm 5.1. In every frame, the set of requests is divided into subsets that contain at most

the same amount of elements as the number of available back-end nodes. The weight

of a request is equal to the requested object’s number of triangles. Using triangles for

request weighting is convenient because the triangle amount influences the needed time

for the occlusion tests as well as the size of the data-item to be sent.

Figure 5.6: The two magenta colored back-end nodes can answer the same request. It

has to be decided which one is responsible.

5 Load-Balancing using the c-Load-Collision Protocol

If a data request rcan be answered by more than one back-end node, it has to be

decided which back-end node will be responsible for the request (see Figure 5.6). In

order to balance the work-load evenly across the nodes, we first calculate the sums of

the weights of all active jobs on all nodes. The function to calculate the sum of the

weights of a node PE is defined as s(PEi). The list of nodes is sorted in ascending order

of these weighted sums and is processed in that order.

Figure 5.7: The selected strategy chose the node with the lower load to process the

request.

We analyze two strategies, a deterministic one and a randomized one. Both strategies

choose the responsible back-end node with respect to their load (see Figure 5.7).

•Our deterministic strategy works as follows: It always picks the first node with the

smallest s(PEi), which can handle all its requests (see Algorithm 5.2).

•Our randomized strategy works as follows: We pick a node randomly from all

nodes that could handle the request r. To influence the probability at which a

node is chosen, we utilize the weight sums of the nodes in question (see Algorithm

5.3).

Let ˆ

Pbe the set of back-end nodes able to handle request r. Then

T=X

P Ei∈ˆ

s(PEi)

is the total sum of the accumulated weights. The probability to choose node PEiis

defined as:

Pr(r is handled by node PEi) = T−s(PEi)

(|ˆ

P| − 1) ·T

The probabilities of each node equate to a unique area in the unit interval. To assign

a data request to a back-end node, a pseudo-random-number between zero and one

is generated. The generated number always lies within an interval associated with a

back-end node. This node is chosen to handle the data request.

5.5 Evaluation

Algorithm 5.1 c-collision protocol

Variables: nodes P,|P|=n,

request sequence S= (s0,s1,. . . ,sl−1),l =i·n.

1: c= 2

2: S= (r0⊕. . . ⊕ri−1), rj= (sj·n,. . . ,s(j+1)·n−1),

0≤j < i

3: for all rj∈Sdo

4: while ∃sk∈rjthat is not answered do

5: if at least one request of rjcan be answered then

6: for all smthat can be answered do

7: pick node PEa∈Nthat handles sm

8: increase load of node paby the weight of sm

9: mark request smas answered

10: end for

11: c= 2

12: else

13: c=c+ 1

14: end if

15: end while

16: end for

Algorithm 5.2 Deterministic picking strategy

Variables: request suto handle

1: determine ˆ

P={ˆ

PE0,. . . , ˆ

PEm}for suand T

2: for k= 0 to m do

3: determine wk=Pr(ˆpk)

4: end for

5: assign suto ˆpkwith smallest wk

Algorithm 5.3 Probabilistic picking strategy

Variables: request suto handle

1: w0= 0

2: determine ˆ

P={ˆ

PE1,. . . , ˆ

PEm}for suand T

3: for k= 1 to m do

4: determine wk=Pr(ˆ

PEk) + wk−1

5: end for

6: Choose a random number 0 < R ≤1,R ∈R

7: assign suto ˆpkwith wk−1< R ≤wk

5.5 Evaluation

For our tests we use a model of a Boeing 777. The model’s size ≈8.5 GiBs, thus it does

not fit into the main memory of the render nodes. As test environment we used four

5 Load-Balancing using the c-Load-Collision Protocol

render nodes and up to 32 back-end nodes of PC2’s Arminius cluster. Here we run our

parallel out-of-core rendering system.

Figure 5.8: Walk-through for our tests.

Walk-throughs and load balancing tests in a real rendering system are limited to at

most 32 back-end nodes. To perform load balancing tests with an arbitrary number of

back-end nodes, we implemented a simulation system for the c-load-collision protocol.

The input for the simulation system is a sequence of data requests, generated during a

walk-through in the real rendering system using 32 back-end nodes. In Figure 5.8 the

test path is visualized.

In our tests, we evaluate the level of load-balancing that can be achieved using the

c-collision protocol with our modifications. Due to the randomness of the c-collision

protocol, we perform multiple test runs on the system, while using different seeds for the

pseudo-random-number generator. In the start-configuration of the c-collision protocol

we set c= 2. We averaged the results of these different runs. For each position of the

walk-through, we measure the load on the individual back-end nodes. The accumulated

amount of triangles of processed objects serves as load indicator for the back-end nodes.

To achieve comparable results at the individual walk-through positions, the results are

normalized. Three different numbers of back-end nodes were used to accomplish the

tests: 32, 80 and 120 back-end nodes (see Figure 5.9 - Figure 5.11).

For our tests, at each position in the walk-through, we measured how many triangles

each back-end node has to process and normalized the results. We determined the

median (red marks) and the 0.1/0.9 quantiles (blue marks) of the measured values.

For each test, our measurements are displayed in three diagrams: the first one shows

the load-balancing of the back-end nodes at every position in the walk-through; the sec-

ond one shows the relation between a given work-load and the number of data requests.

The third diagram combines the previous diagrams in a 3D diagram.

While all values are sorted by the walk-through position in the first diagram, in the

middle diagram all values are sorted by the number of request per frame. The last

5.5 Evaluation

diagram combines both views. The 3D diagrams’ y-axes are equally scaled to visualize

the benefit of an increased number of back-end nodes.

This sorting gives a hint to the origin of the strong peaks in the first diagram. If we

only have a few requests to distribute among many back-end nodes, stronger deviations

occur. In this situation, some nodes do not get any assignments at all, while others

still get a few. Here, even the median value is close to zero. The requests we use are

atomic and can not be distributed to multiple nodes. These peaks do not corroborate the

conclusion that there are overloaded nodes, there are just not enough requests available

to employ every single node.

In all the diagrams the deviation of the load is generally low. As mentioned above, the

difference of the load is larger if only a few requests are generated. The more requests we

have at a given position, the smaller the difference between the median and the quantiles.

If the number of back-end nodes is increased, this difference is even less significant. The

median value is generally close to 1/p, where pis equal to the number of back-end nodes,

which leads to the conclusion that the load is distributed evenly.

Next, we investigate how the redundancy influences the load-balancing. To accomplish

this, we increased the number of redundant copies of the objects in the network from

one to two. Our measurements show that the load can be distributed more evenly if the

number of copies is increased. Thus, the quantiles lie closer to the median within the

diagrams (see Figure 5.9 and Figure 5.12).

Next we distribute three copies of the elements among the back-end nodes and measure

how this influences the load balancing 5.13. The results of these tests show that there is

no significant difference between two and three copies of the elements distributed across

the PC cluster.

Altogether our tests show that our version of the c-load-collision protocol can achieve

good load-balancing if a sufficient number of requests are generated. This can be ob-

served in all diagrams: the strongest deviations occur whenever the number of requests

per position is very low, which leads to the conclusion that the amount of visible geom-

etry in these positions is marginal.

5 Load-Balancing using the c-Load-Collision Protocol

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0 500 1,000 1,500 2,000 2,500 3,000 3,500

load (normalized)

frame

Load per Frame (32 nodes 2 copies)

0.1 / 0.9-quantile

median

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0 1,000 2,000 3,000 4,000

load (normalized)

frame

Load per Request Count (32 nodes 2 copies)

0.1 / 0.9-quantile

median

1,0002,0003,000 01,000 2,000 3,000 4,000

0.04

0.08

0.12

0.16

load (normalized)

Load per Frame (32 nodes 2 copies)

0.1 / 0.9-quantiles

median

frame request count

load (normalized)

Figure 5.9: Walk-through with 32 back-end nodes and two copies of each element.

5.5 Evaluation

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0 500 1,000 1,500 2,000 2,500 3,000 3,500

load (normalized)

frame

Load per Frame (80 nodes 2 copies)

0.1 / 0.9-quantile

median

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0 1,000 2,000 3,000 4,000

load (normalized)

frame

Load per Request Count (80 nodes 2 copies)

0.1 / 0.9-quantile

median

1,0002,0003,000 01,000 2,000 3,000 4,000

0.04

0.08

0.12

0.16

load (normalized)

Load per Frame (80 nodes 2 copies)

0.1 / 0.9-quantiles

median

frame request count

load (normalized)

Figure 5.10: Walk-through with 80 back-end nodes and two copies of each element.

5 Load-Balancing using the c-Load-Collision Protocol

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

0 500 1,000 1,500 2,000 2,500 3,000 3,500

load (normalized)

frame

Load per Frame (120 nodes 2 copies)

0.1 / 0.9-quantile

median

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

0 1,000 2,000 3,000 4,000

load (normalized)

frame

Load per Request Count (120 nodes 2 copies)

0.1 / 0.9-quantile

median

1,0002,0003,000 01,000 2,000 3,000 4,000

0.04

0.08

0.12

0.16

load (normalized)

Load per Frame (120 nodes 2 copies)

0.1 / 0.9-quantiles

median

frame request count

load (normalized)

Figure 5.11: Walk-through with 120 back-end nodes and two copies of each element.

5.5 Evaluation

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0 500 1,000 1,500 2,000 2,500 3,000 3,500

load (normalized)

frame

Load per Frame (32 nodes 1 copy)

0.1 / 0.9-quantile

median

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0 1,000 2,000 3,000 4,000

load (normalized)

frame

Load per Request Count (32 nodes 1 copy)

0.1 / 0.9-quantile

median

1,0002,0003,000 01,000 2,000 3,000 4,000

0.04

0.08

0.12

0.16

load (normalized)

Load per Frame (32 nodes 1 copy)

0.1 / 0.9-quantiles

median

frame request count

load (normalized)

Figure 5.12: Walk-through with 32 back-end nodes and one copy of each element.

5 Load-Balancing using the c-Load-Collision Protocol

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

0 500 1000 1500 2000 2500 3000 3500

load (normalized)

frame

Load per Frame (32 nodes 3 copies)

0.1 / 0.9-quantile

median

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0 500 1000 1500 2000 2500 3000 3500

load (normalized)

frame

Load per Frame (80 nodes 3 copies)

0.1 / 0.9-quantile

median

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

0 500 1000 1500 2000 2500 3000 3500

load (normalized)

frame

Load per Frame (120 nodes 3 copies)

0.1 / 0.9-quantile

median

Figure 5.13: Walk-through with 32, 80, and 120 back-end nodes and three copy of each

element.

5.6 Contribution

Finally, we studied the differences in the balancing situation when using the random-

ized picking strategy and when using the deterministic picking strategy for the c-load-

collision protocol, described in Section 5.4. Generally we could not find any significant

differences when using either strategy. Only marginal differences could be found when

reducing the amount of back-end nodes to four. In Figure 5.14, a comparison between

two measurements was made: for every position of the walk-through we calculated the

difference of the results in deterministic and probabilistic methods, and plotted the ab-

solute value of this difference into the diagram. Most of these results are close to zero.

In the test series spikes occur when there are only a few requests. Whenever stronger

deviations occur, the differences of the two approaches get stronger as well.

0 500 1,000 1,500 2,000 2,500 3,000 3,500

2·10−2

4·10−2

6·10−2

8·10−2

0.1

0.12

0.14

Walkthrough position

Difference

0.9 Quantiles

0.1 Quantiles

Median

Figure 5.14: Differences between randomized and deterministic load-balancing.

5.6 Contribution

For its effectiveness in an interactive environment we tested a variation of the c-load-

collision protocol. We implemented a parallel out-of-core rendering system with the

c-load-collision protocol as a work load balancer. In the course of our tests we were able

to show that the protocol’s balancing capabilities are sufficient for this kind of applica-

tion, although there were no significant differences between two and three copies of the

elements distributed over the back-end nodes.

Through the occlusion culling performed on the back-end nodes, we reduced the amount

of data sent to the visualization node. This increased the speed of transmitting of re-

quired objects.

5 Load-Balancing using the c-Load-Collision Protocol

The occlusion culling on the back-end nodes uses the older depth buffers of the render

nodes. Due to the older depth buffers the occlusion culling is not always correct, but by

exploiting spatial coherency, the tests yields good results.

In the version we presented, there are only three node types available. Slight modifica-

tions to the c-load-collision protocol used would allow a wider variety of back-end nodes

to be handled and balanced. Therefore the different nodes’ computational power could

be included in the distribution computations of the data management protocol.

6 Hull Tree

In this chapter we present a data structure and a sequential rendering system, which is a

preparatory work for our second parallel out-of-core rendering system. To render highly

complex scenes in real-time, techniques are necessary to prevent objects from being sent

into the rendering pipeline if they have no effect on the final image. Invisible objects must

be recognized as early as possible to reduce the load. Culling, especially occlusion culling,

is a common approach to reduce the number of triangles and/or objects that are sent

into the rendering pipeline. However, during these visibility tests many occluded objects

are sent into the rendering pipeline. We developed the hull tree to reduce the number of

those objects, the so-called false positives. Additionally we designed a rendering method

that exploits this data structure.

Hardware vendors offer occlusion queries to accelerate visibility tests using the GPU.

The bounding volumes of all potentially visible objects or object groups are tested for

visibility in front-to-back order. Only if a fraction of the bounding volume, typically an

axis-aligned bounding box, is identified as visible, the associated objects are rendered.

To reduce the time spent on these tests, hierarchical data structures (like an octree) are

often used to test the visibility of whole object groups located in the same subtree.

Using axis aligned bounding boxes (AABBs) for occlusion tests have mixed results.

On the one hand these representations have a very low complexity and are therefore easy

to calculate and quick to test. On the other hand – depending on an object’s projected

shape – these approximations can be too conservative. Due to large empty spaces in the

bounding volumes, even complex objects can be classified as visible although they are

occluded. So the chosen outer approximation of an object (or a group of objects) can

influence the efficiency of an occlusion culling algorithm.

To improve and accelerate the occlusion culling our rendering approach should fulfill

the following requirements:

•Objects’ bounding volumes must be tight to reduce the number of false-positives.

•Bounding volumes should be organized in a hierarchical data structure to test

objects’ visibility quickly.

•The rendering algorithm’s occlusion culling must exploit the data structure to

accelerate rendering process.

The hull tree is a hierarchical data structure of simple approximated exterior object

hulls. This hull tree offers a much tighter coverage of the objects in the scene, while intro-

ducing only a small increase in the geometrical complexity compared to a corresponding

6 Hull Tree

Figure 6.1: Exterior approximations of the UNC Power Plant scene. The transparent

red cubes visualize an octree’s cells at one level. The included dark blue

polyhedrons show the much tighter shapes from our hull tree at the same

level.

ordinary octree with AABBs (see Figure 6.1). In our tests the total complexity increase

was limited by the factor 1.02 [SKJF11].

An additional challenge for current occlusion culling techniques is that the order in

which the objects are rendered is determined by the traversal order of the data struc-

ture. For high visual quality, objects can have numerous different shader, materials,

textures, etc., attached. Even if only one of these properties is changed in between the

rendering of subsequent objects, a state-change in the GPU is induced, which can lead

to a deceleration of the rendering process. One occlusion culling algorithm in which this

problem is addressed is the CHC++ algorithm by Mattausch et al. [MBW08]. There,

the objects are rendered in multiple batches that can each be sorted individually by

the objects’ properties. This can already reduce the number of necessary state changes.

We propose an approximative occlusion culling algorithm (as defined by Nirenstein et

al. [NBG02]), which not only sorts small batches of objects, but sorts all rendered ob-

jects at once according to their state. Consequently, the number of state changes can

be reduced massively. The presented approximative occlusion culling algorithm uses the

hull tree for the determination of visibility at a fine granularity. The occlusion informa-

tion is created using the original objects and, additionally, an automatically generated

inner approximation of the scene’s objects that can be rendered more efficiently. While

it can take a few frames until the complete visibility is determined (which results in the

possibility of temporal image error) our rendering technique makes very efficient use of

the graphics pipeline with only few state changes and pipeline stalls.

6.1 Overview and Summary of Results

For our rendering algorithm, we first preprocess the scene consisting of predefined objects

(meshes) (Note that reasonable objects should not be too big, in relation to scene’s overall

complexity, for our technique to work properly.). We compute two kinds of geometry

approximations: a hierarchical exterior approximation for the whole scene (later used

for our occlusion queries) and an interior approximation for single objects (which can be

used as an efficient occluder replacement during rendering).

For the construction of the outer hulls in the hull tree the scene is first subdivided by

a regular object-based octree, in which the scene’s objects are referenced redundantly

by all leaf nodes that intersect with the object. By construction, an octree’s node’s

bounding box completely encloses all children’s bounding boxes and can therefore serve

as an outer hull for the corresponding subtree. The main idea of the hull tree is not

to directly use this AABB of a node as a hull, but to automatically select a simple

volumetric hull shape out of a predefined set of shapes for all nodes in the tree. For

a shape to be selected, it has to completely cover all shapes of the node’s child nodes

and associated scene objects. These shapes can have a significantly lower volume and

projected size compared to the underlying box while offering a comparable complexity.

For interior approximation generation we use the edge-collapse algorithm introduced

by Hoppe [HDD+93], with a special condition that guarantees the approximation to

always be inside the original geometry (similar to progressive hulls [SGG+00]). Due

to the additional condition, the degree in which an object can be simplified strongly

depends on the topology of the mesh. Although one can easily construct meshes that

cannot be simplified at all without violating the condition, many objects (for example

originated from technical CAD-data) can be reduced to only a small fraction of their

initial complexity.

Rendering cycle: Our approximative occlusion culling algorithm works in two main

phases. First, all objects that were classified as visible in the previous frame are sorted

by their properties and rendered while tested for occlusion using hardware accelerated

occlusion queries (without waiting for the results). Second, the other scene elements’

visibility classification is updated for the next frame’s rendering.

This second phase exploits the prefilled depth buffer on the GPU of the first phase. The

hull tree is traversed and occlusion queries based on the simple volumetric hull shapes

are initiated to determining subtrees’ visibility. To update the visibility of objects which

have not been rendered in phase one, the objects’ interior approximations are used and

rendered to the depth buffer. These two phases provide a minimal number of state

changes and virtually no pipeline stalls.

For this rendering method we developed the hull tree. This data structure is based

on a regular octree and inherits its hierarchical properties. Instead of simple rectangu-

lar bounding boxes, this data structure consists of multiple shaped bounding volumes.

Thus these hulls surfaces and volumes are significantly smaller which lead to faster vis-

ibility tests. We present an approximative render method that exploits the hull tree in

combination with the used interior approximations.

6 Hull Tree

6.2 Related Work

In this section we present work related to occlusion culling, mesh simplification and

hull-approximations.

Occlusion Culling, in general, describes techniques which try to identify and leave

out occluded parts of a scene’s geometry to increase the rendering’s performance. Ac-

cording to the classification of Cohen-Or et al. [COCSD03], the presented occlusion

culling method is a point based (visibility is determined from the current observer posi-

tion) technique with an image-precision definition of visibility, (i. e. an object is visible

if it contributes at least one pixel to the final rendered image). Due to the special sup-

port for so-called hardware assisted occlusion queries in current graphics adapters, these

kinds of techniques are used in a wide range of applications for real-time rendering (for

example games). An occlusion query returns the number of pixels which pass the depth

test during an object’s rasterization. If this object is only a simple and cheap-to-render

proxy of the original geometry, the visibility of the proxy can be used to estimate whether

the original object is visible or not. One problem which arises is that the occlusion query

itself can be performed quite efficiently, but it may take up to several milliseconds be-

fore a query result is available from the graphics card. This problem is addressed by

several techniques. The CHC algorithm (Coherent Hierarchical Culling) by Bittner et

al. [BWPP04] tries to reduce possible pipeline stalls by an asynchronous use of occlusion

queries, in which the goal is to avoid idle waiting for a query’s result. Several techniques

try to further improve this idea. The Near Optimal Hierarchical Culling algorithm by

Guthe et al. [GBK06] tries to reduce the number of issued queries to a minimum by

estimating the outcome of the queries. Another approach is the CHC++ algorithm

by Mattausch et al. [MBW08], which reduces the number of queries and the possible

pipeline stalls, next to other improvements, by grouping multiple queries into batches (in

which, as mentioned, objects can be sorted by their state). By allowing our algorithm to

temporarily make errors in the visibility determination and therefore in the final image

(approximative culling), we can benefit from a wide temporal separation from the start

of an occlusion query to the reading of its result. This makes such queries very efficient

to use compared to other, more conservative algorithms.

One important aspect for a culling algorithm’s efficiency is the hierarchical data struc-

ture for the scene that determines the object’s grouping and the bounding volumes used

for the occlusion queries (for example Meißner et al. [MBM+01] examine this effect). For

this purpose we use the presented hull tree, being optimized to provide tight bounding

volumes for reliable occlusion queries.

Mesh simplification is another possibility to reduce the number of triangles sent

into the rendering pipeline. Instead of rendering the original object, a simpler version

is generated and rendered which should retain the original object’s appearance as close

as possible. For our rendering algorithm, we do not use simplified objects to reduce

the visible object’s complexity, instead simple objects are used as occluders to speed

up the visibility tests. Therefore it is important for our method that an object’s shape

is preserved as well as possible and that the object is not enlarged, whereas the visual

appearance is of no importance.

6.3 Building the Hull Tree

Hoppe et al. [HDD+93] present techniques to optimize and simplify meshes that can

be used as LODs (levels of detail). For continuous LOD Hoppe presents Progressive

meshes [Hop96]. These approximations cannot be used for occlusion culling directly

because they provide no information on whether the surface of these approximations is

outside or inside the original object’s surface.

Cohen et al. [CVM+96] propose a method for mesh approximations called simplified

envelopes where they also generate an inner and outer hull. Those hulls are only used

as intermediates to confine the actual approximation (which is generated in between) to

restrict the variance to the original geometry. The hulls are generated by moving the

original geometry’s points along their normals, thus the hulls have the same complexity

as the original model. As for the approximation, there is no prediction whether it is

inside or outside the original geometry. Therefore it cannot be used for our algorithm.

Sander et al. [SGG+00] present progressive hulls. Progressive hulls are a special case

of progressive meshes. This technique allows the computation of continuous approxi-

mations, which are always inside or always outside of the original mesh. Due to the

progressive property, the amount of memory is even higher than the memory needed to

store the original mesh. For our purposes, we need a mesh simplification with reduced

memory requirements. Our approximation is not continuous, it is one fixed state of this

sequence.

Hull-approximations can be used to approximate objects. Alt et al. present a

method to approximate polygons with less complex polygons or circles [ABW90]. To

apply their method, the polygon must have a convex shape in order to be simplified.

Held and Eibl present an approach to compute a simplified representation that is gen-

erated within a defined area, by tangent-continuous approximation [HE05]. In contrast

to the presented techniques, our approach works for arbitrarily shaped polygons and

meshes.

6.3 Building the Hull Tree

The hull tree’s aim is to automatically create a non-overlapping, hierarchical spatial

data structure in which a simple convex hull is attached to every node which covers

the whole subtree tightly. In order to generate these hulls efficiently, we restrict the

possible geometrical hull shapes to a predefined set of shapes which meet the following

requirements: each shape must consist of only a few triangles, be completely coverable

by other shapes on the next lower level, and allow an efficient decision on which shape

is covering all corresponding children’s shapes tightest. Our manually chosen set of

possible shapes consists of three-sided prisms, cuboids, and different five- and six-sided

prisms. These shapes are the distinction of our exterior hulls. Figure 6.2 shows some

of these available shapes; the complete set also includes the corresponding rotated and

mirrored shapes. Figure 6.3 gives an impression of two different levels of hull tree shapes

compared to the corresponding octree boxes, both covering the same scene.

The construction of the hull tree begins with the construction of an object based

octree covering the whole scene. The maximal depth is given as a parameter. The

6 Hull Tree

Figure 6.2: Set of possible hull shapes. The complete set also contains all reasonable

rotated and mirrored versions.

objects are referenced redundantly in all leaf nodes they intersect with. Beginning with

the lowest level containing the objects, the hull shapes are created for each node from

the leaves up to the root of the tree. A simple bounding box is chosen as a shape for all

leaf nodes. Experiments implied that the additional computational effort for choosing

tighter hull shapes for leaf nodes marginally influences the later rendering process if only

the chosen depth of the octree is large enough. An inner node’s shape is then chosen

based on its children’s shapes. Because of the shapes’ simple structure, it is sufficient to

base the decision on the coverage of several fixed points inside of the node’s bounding

volume (for example if all eight corners of the node’s bounding box are covered by

shapes of the children, the only possible choice for the node is the complete box). In

this manner, a decision tree is used to efficiently identify the smallest shape (by volume)

which completely covers all child nodes’ shapes (see Figure 6.4). If several shapes have

the same volume, a shape with the lowest polygon count is chosen. When the root node

is reached, the construction of the hull tree is complete.

6.4 Determining the Interior Approximations

The second part of the preprocessing stage is the calculation of the scene objects’ interior

approximations. The approach used does not require convex or closed objects. To be

able to use these approximations as occluders (without creating errors by occluding more

than the original object), the volume of these approximations must be smaller than the

original object’s volume and the approximation’s surface may not intersect with the

original object’s geometry. These interior approximations are generated with Hoppe’s

edge-collapse algorithm, with a special cost function. This works in a similar fashion

to the calculation of the interior volume of progressive hulls [SGG+00], but we compute

only a single fixed state of the progressive version.

6.4 Determining the Interior Approximations

(a) (b)

Figure 6.3: Levels 4 and 5 of the Boeing 777 scene’s hull tree. The accuracy increases

with each level. The transparent red cubes visualize the cells of an octree

at one level. The included blue polyhedrons show our much tighter shapes

from the hull tree at the same level.

Figure 6.4: Schematic 2D illustration of a hull tree: the left square shows the root node

(Level 0), and the middle and right square show Level 1 and Level 2. The

hull shapes (see Fig. 6.2) are drawn with blue lines.

We use a special cost function to preserve holes and to ensure that the interior approx-

imation is smaller than the original model. The desired goal is to collapse cedges (cis

given as a parameter) while maximizing the total profit as close as possible to c·d. For

the simplification, we need the diameter dof an object’s bounding box. If the vertices

on an edge are not both vertices of exactly two triangles, the profit for collapsing this

edge is −∞. In other word, an edge that should be collapsed must connect the complete

sides of two triangles. When an edge-collapse operation increases the object’s volume

or leads to an intersection with other triangles, the profit is also −∞. Otherwise the

profit is d−v, where vis the height of the cut off volume. Following this, the algorithm

tries to keep the lost volume small (see Figure 6.5). The algorithm terminates, when c

edges are joined or the only possible collapses brings a profit of −∞ (see Figure 6.5).

We discard normals and colors of our approximations because we do not need them for

our culling algorithm.

6 Hull Tree

Figure 6.5: By collapsing edges using the proposed cost function, the approximation will

always be located inside the original geometry.

(a) (b)

Figure 6.6: The left image shows the original geometry and the right image shows our

interior approximation - both rendered as wire-frame.

These interior approximations then can be used as occluders for the visibility tests.

Their projected size may be smaller than the related original objects’ sizes, but if the

interior approximation occludes another object, the original object would do this as well,

except for rare cases (see Figure 6.6). It is easy to handle the case of a camera placed

in-between the original and approximations surface: If the camera’s position is inside a

hull tree cell, all of its included original geometry is rendered.

6.5 Rendering Algorithm

The rendering algorithm can be subdivided into five steps (see Algorithm 6.1).

i. At the beginning of a new frame, the list of objects that were identified as visible

in the previous frame are grouped by their state (shader, texture, etc.). The state-

grouping can reduce the number of necessary state changes of the rendering pipeline

to the overall number of different states in the scene. This can be essential for the

rendering performance, especially if the rendered scene contains a huge number of objects

6.5 Rendering Algorithm

having many different states. Then, each group’s objects are sorted by distance to the

observer in front-to-back order so that a possible occlusion between objects inside a

group can be detected during rendering. A sorting of the groups themselves may not

reveal the occlusion between objects from different groups because of the possibly large

intersections. Therefore, we now randomly permute the order of the groups themselves

so that an occlusion is not guaranteed to be identified immediately but is likely to be

detected after a few frames. At the end of the first step all objects are rendered into the

frame buffer in the specified order while, for each object, an occlusion query is initiated.

These tests return the number of object’s pixels which actually passed the depth tests

during rendering. These tests introduce only a minor runtime overhead since the results

are read back only at a later step of the algorithm where no active waiting for results

is expected. Due to the usage of possibly outdated visibility information from the last

frame, visible objects can be missing in the rendered image for a few frames. This effect

is mostly noticeable in the first frames (where no visibility information is yet available)

and when the observer performs quick rotations or walks through a wall for example.

These image errors are visible for at most dframes, where dis the depth of the hull tree.

ii. In the second step, all hull tree nodes and objects for which the visibility informa-

tion should be updated are collected into lists by traversing the tree. Nodes which are

collected must be located inside the viewing frustum and either be marked as visible

with at least one invisible child or be marked as invisible, but having a visible parent.

All objects that are referenced by the latter kind of nodes (marked invisible, but whose

parents are marked visible) are added to another list. These are objects that are likely to

appear when the observer moves (or may even actually be visible, but not yet identified),

making them good candidates for further examination.

iii. When the observer moves through the scene, it is likely that many new objects

enter the viewing frustum at once. Therefore, the interior approximations of the objects

collected in the second step are rendered to the depth buffer (in front-to-back order from

the observer) while an occlusion query is initialized for each rendered approximation.

This improves the information in the depth buffer for further tests, gives fine granularity

visibility results for the corresponding objects and is quite efficient since most nodes’

inner approximation complexity is only a fraction from the original mesh’s complexity.

iv. Then, exploiting the completed depth buffer, occlusion tests are initialized for all

the nodes’ hull shapes collected in the second step (without altering the depth buffer).

v. In the final step, the results of all occlusion queries initialized in this frame are

read back from the GPU. Those objects whose corresponding test (or the test of the

associated interior approximation) returns that at least one pixel has passed the depth

test, are added to a list of visible objects. This list is used in the first algorithm step of

the next frame. According to the results for the tested hull shapes, the corresponding

nodes’ visibility markings are updated.

6 Hull Tree

Algorithm 6.1 Rendering-Loop

Variables: HullTreeRoot, visibleObjectList, possiblyVisibleObjectsList, testedEle-

mentsQueue, nodeTestList

// i. Sort and render visible objects

1: enable writing to the frame- and depth buffer

2: group objects in visibleObjectList by their states

3: for all groups in random order do

4: change state according to group

5: sort objects in group by increasing distance to observer

6: for all objects in the group do

7: render original object with occlusion query

8: add object to testedElementsQueue

9: mark object as rendered

10: end for

11: end for

// ii. Collect internal nodes and possibly visible objects

12: TraverseHullTree(HullTreeRoot)

// iii. Handle possibly visible objects

13: disable writing to the frame-buffer

14: sort objects in possiblyVisibleObjectsList by increasing distance to observer

15: for all objects in possiblyVisibleObjectsList do

16: if object not marked as rendered then

17: render object’s interior approximation with occlusion query

18: add object to testedElementsQueue

19: mark object as rendered

20: end if

21: end for

// iv. Init tests for inner nodes

22: disable writing to the frame- and depth buffer

23: for all objects in nodeTestList do

24: render nodes’s hull shape with occlusion query

25: add node to testedElementsQueue

26: end for

// v. Update visibility

27: clear visibleObjectList

28: for all elements in testedElementsQueue do

29: if occlusion query result is visible then

30: mark element as visible

31: if element is object then

32: add object to visibleObjectList

33: end if

34: else

35: mark element as invisible

36: end if

37: end for

6.6 Evaluation

Algorithm 6.2 TraverseHullTree(node)

Variables: testedElementsQueue, cameraPosition

1: if node outside Viewing-Frustum then

2: return

3: else if cameraPosition outside node’s bounding box then

4: if node marked as visible then

5: if at least one child is marked invisible then

6: add node to nodeTestList

7: end if

8: for all node’s child nodes do

9: TraverseHullTree(child node)

10: end for

11: else if node is leaf then

12: add node to nodeTestList

13: for all objects stored in the node do

14: add object to possiblyVisibleObjectsList

15: end for

16: end if

17: else

18: // Special case if the camera is located inside the node

19: mark node as visible

20: if node is leaf then

21: mark all node’s objects as visible

22: else

23: for all node’s child nodes do

24: TraverseHullTree(child node)

25: end for

26: end if

27: end if

On the one hand, our algorithm changes the detected visibility usually only for one

level of the hull tree per frame, which can lead to temporary image errors. On the other

hand, even complex scenes can be rendered with only few state changes of the graphics

pipeline and the occlusion tests are performed very efficiently because, typically, no

active waiting for the results is required.

6.6 Evaluation

We performed different tests to evaluate our approximations and our rendering algo-

rithm. We first started with tests related to the properties of our approximation com-

pared with standard bounding boxes. Tests were made with different scenes: the model

of the UNC Power Plant, a model of a Boeing 777, and two versions of a container port.

After that we performed runtime tests where we compared our rendering method with

6 Hull Tree

the CHC and the CHC++ algorithm. We measured the differences in the number of exe-

cuted occlusion tests, the differences in the count of the rendered objects, and pixel error

that occurs if our approximate rendering algorithm is used. To expose the advantage of

our algorithm with non-axis-aligned scenes, we used the container port scenes Container

Port and Container Port 45, each consisting of 5,322 objects and having approximately

6.5 M triangles. Each object has its own shader. There are five different shader pro-

grams, randomly attached to the different objects. The objects in the Container Port

scene are mostly axis-aligned, so most of the objects can be inserted in axis-aligned

bounding boxes without wasting too much space. The Container Port 45 is identical

to the Container Port scene, except that the scene is rotated by 45 degrees around the

y-axis. In that configuration, occlusion tests with AABBs lead to more false-positives.

For the tests, we implemented our rendering system in C++. The tests are performed

on a standard PC, equipped with a NVIDIA GeForce 260 GTX, a Quad-Core Intel CPU,

8 GiB RAM and Ubuntu 9.04 GNU/Linux as operating system.

Surface area, volume, and memory consumption

In our first tests we evaluated how much of the surface and volume is saved if we use

our exterior approximation instead of bounding boxes (see Figure 6.7). For both port

scenes we build a hull tree with a depth of seven, which proved to be a good trade-off

between accuracy and necessary traversal time. The hull trees for the UNC Power Plant

and the Boeing 777 scene have a depth of ten.

Figure 6.7: On every level of the hull tree, the surface area and the volume of the

approximations is smaller than the surface area and the volume of the octree

cells.

The values inside the diagrams are normalized, so that the root box of the octree

has surface area 1 and volume 1. As expected, the surface and the volume of the

exterior approximation was smaller than the surface and the volume of the octree cells’

bounding boxes. Furthermore, the savings increased on lower levels. Through this

6.6 Evaluation

Table 6.1: Comparison of the triangle count of the AABBs and our approximations.

Container Port Container Port 45

Lev. ∆ AABB ∆ Approx. fac. ∆ AABB ∆ Approx. fac.

0 12 12 1.0 12 12 1.0

1 48 40 0.83 48 48 1.0

2 144 128 0.89 168 152 0.9

3 480 460 0.96 504 500 0.99

4 1,872 1,824 0.97 1,836 1,768 0.96

5 8,208 8,232 1.0 7,740 7,848 1.01

6 37,632 37,484 1.0 35,352 35,220 1.0

UNC Power Plant Boeing 777

Lev. ∆ AABB ∆ Approx. fac. ∆ AABB ∆ Approx. fac.

0 12 8 0.67 12 12 1.0

1 24 24 1.0 48 52 1.08

2 84 72 0.86 168 184 1.1

3 276 236 0.86 720 632 0.88

4 900 860 0.96 2,292 2,184 0.95

5 3,732 3,600 0.96 9,060 8,968 0.99

6 17,988 17,344 0.96 43,212 43,800 1.01

7 92,904 94,860 1.02 222,528 228,332 1.03

8 492,096 503,608 1.02 1,234,884 1,274,460 1.03

9 2,303,244 2,302,204 1.0 6,305,292 6,417,264 1.02

increased precision, the number of false-positive occlusion queries decreases, thus the

number of initiated tests on lower levels is also reduced.

However, saving surface and volume for the visibility tests is not helpful when the

proxy geometry is too complex. Hence, we evaluated how many triangles are needed for

our approximations compared to the standard bounding boxes, for every octree level (see

Table 6.1). The measurements show that the number of triangles needed for our exterior

approximation is close to the number of triangles needed for standard bounding boxes.

In some situations the approximation needs less triangles, in some it needs only a few

more. These first tests show that using our exterior approximation saves a significant

amount of surface area and volume, and that the approximation’s complexity does not

differ too much from the complexity of the standard AABBs. The construction of the

hull tree took only a few seconds longer than the octree’s construction. The overhead

results from the additional operations required to determine the cell’s tighter shape by

examine its children nodes.

In the container port scenes, the interior approximation’s vertex data requires only

86 MiB, while the original scenes need 276 MiB of memory. Our simplification algorithm

was configured to create approximations with about 60 vertices per object, if possible.

Settings of the runtime tests

For the runtime tests of our rendering algorithm and the approximations, we used the

different container port scenes (see Figure 6.9). For comparison, we chose the CHC algo-

rithm as presented in [PF05], using a loose octree for storing the scene. This established

algorithm allows for quick occlusion culling. Although this algorithm is conservative, it

6 Hull Tree

Figure 6.8: Visualization of the camera path with the viewing directions used for the

tests.

basically uses the same building blocks as our algorithm and works without any addi-

tional parameters. In our opinion, this makes it a good candidate for the comparison.

For the rendering time, we additionally compared the results to the CHC++ algorithm,

using the parametrization as given in [MBW08].

We perform a walk-through as shown in Figure 6.8 for all of our runtime tests. We start

outside the port where we can see most of the environment. We quickly move the camera

to the outer left side of the container area. Then we turn right and move between the

containers, until we reach the cranes. We move left beyond the cranes, outside the port,

and direct our view to the center of the scene. After that we turn right again and move

back to the start position of the walk-through, while we still focus on the center of the

scene. The camera never holds its position, which results in a permanently changing set

of visible objects. The whole walk-through consists of 3,000 frames. The walk through

the Container Port 45 scene is similar, but not equal to the described walk-through. In

the middle section (frame 1,000-1,500) it differs. Here the camera is moved through a

channel built by many containers.

Image quality

In our first runtime test, we evaluate the rendering errors that occur if we use our

rendering method. For these tests we render each frame on the camera path twice with

a resolution of 1,024 ×768 pixels. Here a direct comparison is possible because the

additional needed computation time does not influence the results of this test (unlike

6.6 Evaluation

Figure 6.9: The left image shows a segment of the Container Port scene. In the right

image the blue hulls are our approximations, and the transparent red boxes

are the corresponding AABBs.

the parallel rendering system that communicate asynchronously).

First we render an image by using our rendering method and copy the resulting frame

into a buffer. In a second step, we do not change the camera position or orientation

and render an image with all objects, and copy the resulting frame into a second buffer.

Afterwards, we compare the buffers pixel by pixel and count the differences (see Fig-

ure 6.10).

100

0 500 1,000 1,500 2,000 2,500 3,000

pixels in %

frame

Pixel errors per frame

CP00

CP45

Figure 6.10: Our rendering method produces images with a typically low pixel error,

with only a few extreme exceptions (thin peaks in the diagram above).

For most of the camera positions, the pixel error is close to 1%. There are a few

camera positions in which the pixel error is close to 100%. These peeks occur when the

camera is moved through a wall. In this situation it needs at most dframes (dbeing

6 Hull Tree

the depth of the hull tree) to determine the set of visible objects. After those frames the

image is mostly correct again.

Usually, when the visibility changes only gradually, errors occur only at small areas

near the border of the image and disappear almost immediately after only a few frames.

Due to this short time period, this is only perceived as a short flickering.

Tested objects in comparison with the CHC

In the comparison of our rendering algorithm with the CHC algorithm we also evaluated

the number of occlusion queries per frame. Figure 6.11 shows the results of these tests.

In most cases the CHC algorithm needs less tests than our rendering algorithm. We

have implemented the CHC algorithm as it is given in [PF05]. The algorithm does not

perform occlusion queries for each object. It stops testing when it reaches one of the

utilized data structure’s leaves and renders all included objects. Our algorithm also tests

the different objects in the leaves, which is why it initiates more tests.

100

200

300

400

500

600

0 500 1,000 1,500 2,000 2,500 3,000

queries

frame

Occlusion queries per frame

HT-CP00

CHC-CP00

HT-CP45

CHC-CP45

Figure 6.11: Comparison of the initiated occlusion queries using our rendering algorithm

and the CHC algorithm. (Values for the CHC are floating averages over a

range of eight frames to mask out the high frame to frame fluctuations.)

However, by performing more occlusion queries, we can reduce the number of objects

sent into the rendering pipeline. The additional tests do not influence the frame time

negatively. Our testing routine prevents the rendering pipeline from stalling, while

CHC’s occlusion queries can stall it. For nearly every camera position on our walk-

through our rendering algorithm renders less objects, as shown in Figure 6.12.

6.6 Evaluation

1,000

2,000

3,000

4,000

5,000

6,000

0 500 1,000 1,500 2,000 2,500 3,000

objects

frame

Drawn objects per frame

HT-CP00

CHC-CP00

HT-CP45

CHC-CP45

Figure 6.12: Comparison of the drawn objects’ count using our rendering algorithm

compared with the CHC algorithm.

Rendering performance

Next, we evaluate the performance of our rendering technique. For this, we compare the

time needed to render a frame using our algorithm with the rendering time using the

CHC and the CHC++ algorithms. For this test we measure the rendering time for each

camera position with each algorithm. The results are plotted in Figure 6.13.

0 500 1,000 1,500 2,000 2,500 3,000

time in ms

frame

Rendering Time per frame

HT-CP00

CHC-CP00

CHC++-CP00

HT-CP45

CHC-CP45

Figure 6.13: Comparison of the rendering times using our rendering algorithm and the

CHC algorithm. Our algorithm is faster at nearly every camera position.

6 Hull Tree

The result shows that our algorithm provides a higher frame rate than the CHC

because our algorithm renders less objects and initiates fewer state changes. As expected,

the CHC++ outperforms the CHC algorithm at almost all positions. Compared with

our algorithm, the CHC++ produces quite similar results. However, our approximative

culling is faster than the CHC++ most of the time. In the Container Port scene the sum

of the rendering times is ≈61.752 seconds when the CHC is used, ≈46.839 seconds

when the CHC++ is used, and ≈35.750 seconds when we use our rendering algorithm

with the hull tree.

Number of state-changes

This test evaluates how often the shader is exchanged using our rendering algorithm and

the CHC algorithm. Figure 6.14 shows that the number of state-changes is limited by

the number of different shader when our rendering algorithm is used. When the CHC

algorithm is used, the state of the rendering pipeline must be changed for nearly every

object that is rendered.

100

1,000

10,000

0 500 1,000 1,500 2,000 2,500 3,000

state-changes

frame

State-changes per frame

Average HT-CP45

Average CHC-CP45

Figure 6.14: Comparison of the shader changes using our rendering algorithm compared

to the CHC algorithm. The sorting of the objects by their shaders limits

the number of changes to the number of different shaders.

The CHC++ algorithm can reduce the number of state-changes. To reduce waiting

times, this algorithm sends a batch of objects into the rendering pipeline, and receives

the occlusion query results afterwards. The objects of each batch could be sorted by their

state, which influences occlusion tests’ quality negatively. To achieve best results from

the visibility tests, the objects should be sent in front-to-back order into the rendering

pipeline, just like it is done by the CHC algorithm.

6.6 Evaluation

Improvements by the interior approximations and exterior hulls

Our rendering system uses the interior approximations to test the visibility of scene

objects. By using these approximations instead of the original meshes we sent on average

only 68% of the triangles into the rendering pipeline to perform the visibility tests (see

Section 6.5, Step iii). When a user moves through the scene, looking around freely, the

number of initiated occlusion queries is lower when the hull tree is used in comparison to

a regular octree. In our tests the overall number of performed occlusion tests is reduced

by up to 5% compared to the usage of AABBs, especially while the camera moves quickly

and is distanced from the scene. Here the algorithm benefits from the tighter cells. The

culling algorithm must test additional octree levels while it can skip tests when the hull

tree is used.

In our last test we evaluated how many pixels were written into the depth buffer

during the visibility tests. Here we compare the number of drawn pixels while using the

hull tree with the number of drawn pixels while using an octree. If the hull tree is used

for the occlusion tests only 74% of the pixels are drawn into the depth buffer on average

(see Figure 6.15).

1e+07

2e+07

3e+07

4e+07

5e+07

6e+07

7e+07

8e+07

9e+07

1e+08

0 500 1,000 1,500 2,000 2,500 3,000

drawn pixels

frame

Drawn pixels into the depth buffer

hull tree

octree

Figure 6.15: Comparison of the number of pixels drawn to the depth buffer using the

hull tree and an octree.

This reduces the load on the graphics card’s raster operation pipeline. This is an

additional positive effect using the hull tree instead of an octree.

6 Hull Tree

6.7 Contribution

In this chapter we introduced a combination of approximations, organized in a hierar-

chical data structure, which allows a reduction in the number of false-positives, thus

reducing the overall number of performed occlusion tests. The bounding volumes of

the hull tree covers scenes’ objects tighter than other spatial hierarchical data struc-

tures. Additionally, the algorithm incorporates knowledge about the different states in

the draw order of the objects. Therefore it can limit the maximum number of state

changes to the number of different states, whereas other rendering algorithms do not

utilize this information which can lead to one state change per drawn object. Due to

these properties of our rendering algorithm and our approximations we can improve the

rendering performance of complex scenes without any scene specific configurations.

7 Parallel Out-of-Core Occlusion

Culling using the Hull Tree

In the previous chapter we introduced the hull tree data structure and a related rendering

algorithm. In this chapter we modify this data structure to make it suitable for a parallel

out-of-core rendering algorithm we developed [SKJ+11]. Here, we again look at a single

visualization node equipped with a high-end graphics adapter, that is supported by a

group of back-end nodes with weak graphics performance. The scenes to be visualized

are so large that they neither can be stored in the visualization node’s memory nor be

displayed in real-time with most standard techniques. We must therefore consider how

the slower back-end nodes can be used to accelerate the faster visualization node and

simultaneously how the objects are stored in the cluster. In comparison to the back-end

nodes in the approach presented in Chapter 5, these nodes receive no information besides

the current camera settings.

Our approach is as follows: the back-end nodes serve two purposes. First, their main

memory provides an out-of-core memory system for the visualization node. Second,

they assist the visualization node’s rendering by performing visibility calculations and

sending only visible objects to the visualization node. In order to obtain fast rendering

with our system, we have to distribute the objects among the back-end nodes in a way

that not only guarantees an even distribution of the objects, but also evenly distributed

visibility calculations. To achieve scalability and efficiency, our system has to meet the

following goals:

•The scene must be distributed evenly (without redundancy) among the combined

main memory of the back-end nodes.

•The computational load of visibility calculations should be balanced for most cam-

era positions.

•The back-end nodes’ response times (visibility calculations) must be kept as short

as possible.

•The amount of data to be sent to the visualization node must be kept small and

evenly distributed among the back-end nodes.

To fulfill these requirements, we use a pseudo-random distribution of the scene objects

across the back-end nodes’ main memory. To reduce the number of objects sent to the

visualization node at once, we employ an approximate hierarchical occlusion culling on

each back-end node. For this, they are equipped, in addition to the objects assigned to

7 Parallel Out-of-Core Occlusion Culling using the Hull Tree

them, with simplified versions of the remaining objects of the 3D scene. In the previous

chapters, we have used only PC clusters whose nodes are connected via a high speed

network. For the rendering system we additionally analyze the effect of only using a

standard network but with increased rendering performance of the back-end nodes.

7.1 Overview and Summary of Results

The single, powerful visualization node, connected to a number of back-end nodes via

network, renders the scene and allows the user to move interactively throughout the 3D

scene. Scene’s objects are distributed among the back-end nodes.

Object’s random distribution and hull tree modifications: Usually scalability

in scene size is achieved by external storage on disk, but with the drawback of slow

memory accesses. Fast visibility tests are easily achieved by the redundant storage of

the whole scene on each node, but with the drawback of low scalability. We solve these

conflicting requirements for our system as follows: We distribute all objects randomly

and without any redundancy among the back-end nodes and store them in processors’

main memory. We employ an approximate hierarchical occlusion culling algorithm on

each back-end node. For this, they are equipped with simplified versions of the scene

objects in addition to the objects assigned to them. In other words, each back-end node

contains a randomly chosen part of the scene’s objects. For each object that it does not

store, it maintains an interior approximation as described in the previous Section 6.4.

The random distribution of the objects results in the even distribution of final image

artifacts, while also ensuring that the amount of data to be sent is uniformly distributed

among the back-end nodes.

General rendering loop: The main rendering loop works as follows: The visu-

alization node maintains a set of visible objects in its GPU memory and renders them

in successive images (visibility computations are not performed). The back-end nodes

compute the visibility of the objects and send a list of objects to be added to or re-

moved from rendering to the visualization node (see Figure 7.1). In case of objects to be

deleted, only unique object identifiers are sent. Due to slow response times of the back-

end nodes, the visualization node receives at most one update per rendered frame. After

receiving updates from a given back-end node, the visualization node sends the current

camera position back to that node. Now, the given back-end nodes perform occlusion

tests for different camera positions. Parallelization is realized via parallel computation

of visibility tests for different camera positions, unlike other methods which perform

parallel computation for a single camera position.

Visibility computations: For our parallel rendering algorithm, we store either the

original objects or their interior approximation in the hull tree at each back-end node.

This differs from the sequentially used hull tree where we store both in tree’s leaves. This

tree is used to compute a hierarchical occlusion culling, whereby the objects, as well as

the simplified objects, are used as occluders. The result of a hull tree traversal is a list of

potentially visible objects valid for the camera position received from the visualization

node (see Section 7.3). For those original objects whose visibility has changed, the back-

7.2 Related Work

Figure 7.1: Communication scheme between visualization node and back-end nodes.

end node sends an update to the visualization node. If an object becomes visible then

its geometry is sent, if it becomes invisible then its unique identifier is transferred.

In this system each of scene’s original objects is stored on exactly one PC cluster

node. Additionally, every node has approximations for the other objects. Due to the

randomized distribution of the objects, the load through the visibility tests and the

data sent across the network is evenly distributed among the back-end nodes. The

utilization of the hull tree accelerates the back-end nodes’ visibility tests, which results

in a reduction in time between initiating a request and its receiving answer. Through

the combination of a hull tree with a randomized sample tree, we keep the network load

low. Tests in different hardware environments have shown that the graphics adapters’

performance influences images’ quality more than the network speed. This is reasoned

by obtaining faster visibility updates for more reasoned camera positions.

For this system, we did not face the problem where the set of visible objects do not

fit into the visualization node’s primary memory. In this case, one could exploit the hull

tree’s combination with a randomized sample tree to keep the memory requirements low.

It should be simple to combine the rendering algorithm for the randomized sample tree

with our algorithm. Instead of loading all visible objects, we would discard objects on

lower levels of the hierarchical data structure that are far away from camera position.

7.2 Related Work

One Parallel occlusion culling technique was suggested by Naga K. Govindaraju et al.

[GSYM03]. Their system consists of three nodes: two occlusion-test nodes to perform

the occlusion tests, and one visualization node to render the visible objects of the scene.

7 Parallel Out-of-Core Occlusion Culling using the Hull Tree

(a) (b)

Figure 7.2: Both images show approximations for the model of a Boeing 777 used by our

rendering algorithm. The left image shows the exterior approximations in

the sixth level of a hull tree (red shaded areas are related octree cells) and

the right picture shows the composition of the interior approximations.

The objects are organized in a scene-graph, the nodes of which are addressable by a

unique ID. In every frame one of the occlusion-test nodes performs an occlusion culling

and sends the ids of the visible scene-graph nodes to the other computers. At the same

time, the visualization node uses the previously received IDs to render the image visible

for the user. The second occlusion-test node renders these objects as well. This image is

the base for the node’s next occlusion test. When all nodes have finalized their procedure,

the two occlusion-test nodes switch their behavior. By sending only the ids of objects

that must be rendered the network load is reduced. However, every node must also store

the complete scene. In contrast, our system sends more data across the network but

scene-data is distributed over the different back-end nodes. In this way scenes can be

rendered that do not fit into node’s memory.

Hua Xiong et al. presented different parallel occlusion culling methods [XPQS06].

These techniques are designed for a CAVE (partially or completely surrounding dis-

play walls) equipped with 15 projectors, each associated to a single computer. For

their proposed methods, the scene-data is spatially partitioned in voxels. These voxels

are distributed over the different nodes. In the Sort-First Occlusion Culling approach,

every computer computes the occlusion culling on its own (the culling process is not

distributed). In the Sort-Last Occlusion Culling approach, each node computes the vis-

ibility for all voxels in its local memory. Afterwards, the geometry of the visible voxels

is sent to the nodes assigned to the tile where it has to be displayed.

7.3 Scene Preparation

To prepare a virtual scene for the presented technique, two different preprocessing steps

have to be performed. In order to exploit spatial coherence during the occlusion tests,

the scene objects are stored in a hull tree. The hull tree used in the sequential rendering

7.4 Data Distribution and Rendering

system does not meet all our needs. To achieve better performance, we combine the

hull tree with the ideas of the randomized sample tree [KKF+02] . Next we describe the

necessary preprocessing steps.

Bounding volume hierarchy

A loose octree serves as the basis for the construction of the bounding volume hierarchy

of the scene’s objects. In contrast to an ordinary octree, the boxes of the child nodes

do not split up the parent’s box disjointly, but they overlap. This allows objects to

be stored uniquely in a node at a tree level corresponding to its size (see [Del00]). To

combine our hull tree with the randomized sample tree we need to store each object only

once in the data structure. The hull tree for sequential rendering does not provide this.

One important property of a bounding volume hierarchy is that, if the bounding

volume of an inner node is evaluated as fully occluded (and the observer’s position is

outside the volume), all objects stored in the corresponding subtree can also immediately

be discarded as invisible. If only a fraction of the bounding volume is visible (but no

object), the visibility of the nodes in the subtree can not be predetermined and have to

be evaluated through additional tests. To increase the accuracy of the occlusion tests,

we use a modified version of the hull tree (see Chapter 6). A result can be seen in Figure

7.2a, where the tighter volumes can be compared to the larger bounding boxes.

The hull tree’s tighter bounding volumes are employed to increase the accuracy of the

occlusion tests (decreasing the number of necessary tests) and thereby increasing the

efficiency of the occlusion culling. Additionally we apply another technique to slightly

increase the number of objects that are incorrectly classified as visible, but without

increasing the number of necessary occlusion tests. We randomly choose some objects

(weighted by their surface area) to be stored at a higher level in the tree than they would

normally belong to, due to their size. This is comparable to the technique used by the

randomized sample tree. Experiments showed that, during the walk-through, intention-

ally adding these false-positives results is like a simple pre-fetching mechanism which

may increase the image quality while only marginally influencing the overall perfor-

mance of the system. For example, if the observer is standing in front of a wall covering

the whole screen, the issued occlusion queries may evaluate some objects as visible al-

though they are occluded by the wall. When the observer steps through the wall, the

visualization node can immediately render the former wrongly classified objects, while

all other newly visible objects only appear later (after the next test results have arrived).

In the meantime, the images can contain significant errors, but despite this a user can

retain orientation. The interior approximations are determined as described in Section

6.4.

7.4 Data Distribution and Rendering

As mentioned, the system aims at visualizing scenes with a complexity exceeding the

capacities of a single node’s main memory. The data is therefore distributed over the

nodes in the following manner:

7 Parallel Out-of-Core Occlusion Culling using the Hull Tree

The visualization node stores the data of those objects which are currently classi-

fied as visible in the graphics memory. As soon as an object is identified as invisible, its

data is deleted.

Each back-end node stores a subset of the scene’s objects, the precomputed interior

approximations of the corresponding remaining objects and the hierarchical data struc-

ture in their main memory. The assignment of the objects to the back-end nodes is done

by using a pseudo-random hash function, which leads to a uniform distribution where

each object is stored on exactly one back-end node. The scene is thereby completely

stored by the back-end nodes and can be efficiently transferred to the visualization node

without the need of any low latency hard disk access. Furthermore, every back-end node

has a simplified representation of the whole scene in order to be able to perform the

occlusion culling.

Algorithm 7.1 rendering-loop - Visualization Node

Variables: objectMap (objectId 7→ objectData)

1: loop

2: if data is available from a back-end node then

3: receive invisible objects’ IDs and delete them from objectMap

4: receive and add new objects to objectMap

5: send current camera position to back-end node

6: end if

7: clear buffers, set camera, ...

8: render all objects in objectMap

9: handle user input to update camera position

10: end loop

Rendering on the visualization node

The rendering loop of the visualization node is simple and straightforward (see Algorithm

7.1). In every frame, it checks if a back-end node has finished its current visibility cal-

culations. If new data is available, the IDs of the objects that were classified as invisible

are received and the corresponding objects are removed. Then the mesh data of newly

appearing objects is received and stored in the graphics card’s memory. Afterwards,

the back-end node is informed of the observer’s new position (and viewing direction) to

begin with the next visibility calculation. If the data of more than one back-end node is

available, its processing is postponed to the next frame to reduce the fluctuations of the

frame rate. Finally, all stored objects are rendered and the camera position is updated

according to the user’s input before the rendering loop starts over.

In the proposed setting, in which a cluster with a high network bandwidth is given,

the actual time needed for receiving the updated mesh data in most situations does not

influence the overall frame rate. If the network between the visualization node and the

back-end nodes becomes a bottleneck, or if a more stable frame rate is required, this

7.4 Data Distribution and Rendering

problem can be reduced by receiving the data on the visualization node asynchronously

to the rendering process. In system’s evaluation, we show that network’s load is so low

that a standard 1 GBit Ethernet, using TCP/IP, is quick enough to meet our needs.

Occlusion culling on the back-end nodes

Due to the relatively low graphic performance of the back-end nodes and scene’s high

complexity, we chose to use an occlusion culling algorithm which determines the visibility

of the whole scene adaptively over several frames instead of in one frame, similar to the

occlusion culling algorithm for the hull tree (see Section 6.5). During the culling process,

the visibility of the hull tree’s exterior approximations and of their corresponding objects

are determined. The depth in the tree at which elements may be classified as visible

is thereby adaptively changed by only one level towards the leaves per frame. On one

hand, this leads to a very efficient culling process with almost no pipeline stalls. On the

other hand, it may take several frames until all objects (especially the smallest ones at

the leaf nodes) are classified as visible after they actually became in view.

The rendering algorithm works in four phases: Initializing the depth buffer and testing

formerly visible nodes, testing potentially visible objects, fetching and evaluating the

test results and then finally sending the data to the visualization node. The algorithm’s

pseudo-code is given in Algorithm 7.2 and 7.3.

In the first phase the depth buffer is filled by rendering the objects which were visible

in the last frame. This is done in a front-to-back order and a hardware assisted occlusion

query is initiated for each rendered object to determine if it became occluded by other

objects. The rendered objects can be both original meshes stored on this back-end

node or the corresponding interior approximations. To prevent a possible pipeline stall

caused by retrieving the results of occlusion queries before they passed through the entire

rendering pipeline, the retrieval of the results is postponed to a later phase.

The second phase is used to test objects inside visible hull tree nodes and those

nodes whose parent in the tree is visible or which have an invisible child by recursively

traversing the tree (see Algorithm 7.3). The set of visible marked nodes can thereby be

extended or reduced by one level of the tree per frame. For the occlusion test of the

nodes, the precomputed bounding volumes are used.

In the third phase the results of all issued occlusion queries are evaluated. For tested

tree nodes, just their visibility flag is updated. The visibility flag of every tested object

is also updated and all visible objects are collected into the list of visible objects used

for the next frame. Those objects whose original mesh is stored on this specific back-end

node are processed further. If they are found visible, but their data is not present on

the visualization node, their data is prepared for sending. If they are found invisible and

their data is present on the visualization node, their ID is added to the list of disappeared

objects.

In the last phase, the back-end node sends the IDs of disappearing objects to the vi-

sualization node, followed by the mesh-data of the appearing objects. Finally, it receives

the new camera parameters before the visibility testing loop starts over again.

7 Parallel Out-of-Core Occlusion Culling using the Hull Tree

Algorithm 7.2 testing-loop - Back-end Node

Variables: visList (list of visible objects), delList (list of object-ids to delete), sendList

(list of objects to send), testQueue (queue of tested nodes and objects)

1: loop

2: clear buffer, set camera, ...

// i. Init depth buffer & test previously visible objects

3: for all objects in visList in front-to-back order do

4: init occlusion-query and render orig. mesh or approx. to depth buffer

5: add object to testQueue

6: mark object as rendered

7: end for

8: clear visList

// ii. Test potentially visible nodes and objects

9: TraverseTreeFrontToBack(RootNode)

// iii. Fetch and evaluate test results

10: for all elements in testQueue do

11: if occlusion-query result is visible then

12: mark element as visible

13: if element is interior approx. then

14: add object to visList

15: else if element is original mesh then

16: add object to visList

17: if not stored on visuNode then

18: add object to sendList

19: mark object as stored on visuNode

20: end if

21: end if

22: else

23: mark element as invisible

24: if is orig. mesh and is stored on visuNode then

25: add object-id to delList

26: mark object as not stored on visuNode

27: end if

28: end if

29: end for

// iv. Send the data to the visualization node

30: send ids in delList

31: send complete data in sendList

32: receive new camera position (and direction)

33: end loop

7.5 Evaluation

Algorithm 7.3 TraverseTreeFrontToBack(node)

Variables: testQueue (queue of tested nodes and objects)

1: if cameraPosition inside node then

2: mark node and all node’s object as visible

3: for all node’s children in front-to-back order do

4: TraverseTreeFrontToBack(child)

5: end for

6: else if node in viewing frustum then

7: if node marked as visible then

8: for all node’s objects not marked as rendered do

9: init occlusion-query and render orig. mesh or approx. to depth buffer

10: add object to testQueue

11: end for

12: if at least one child is invisible or isLeaf then

13: init occlusion-query for node’s b. volume

14: add node to testQueue

15: end if

16: for all node’s children in front-to-back order do

17: TraverseTreeFrontToBack(child)

18: end for

19: else

20: init occlusion-query for node’s b. volume

21: add node to testQueue

22: end if

23: end if

7.5 Evaluation

In this evaluation we analyze our system’s properties and performance. First, we intro-

duce our system environment, the used scene, the properties of the precomputed object

approximations, as well as the used camera path. We show that the chosen pseudo-

randomized distribution of the data leads to evenly distributed memory consumption

and work load on the back-end nodes. Next we analyze the achieved rendering perfor-

mance and network load of our system. Here we present the measured times for rendering

the images and receiving data, as well as the amounts of data sent across the network.

We show that our rendering performance allows for real-time interaction and that the

network is not saturated. Finally we evaluate the visibility delay, where we measure the

time until all visible objects have been send to the visualization node. We compare the

visibility delay when the camera is teleported to random scene positions to the delay

during walk-through.

7 Parallel Out-of-Core Occlusion Culling using the Hull Tree

Benchmark

For our evaluation we used the small PC cluster configuration of the PC2as described in

Section 3.2. Additionally, we used a PC cluster consisting of 17 nodes. Below we refer to

this configuration as the standard network. In this configuration the visualization node is

equipped with an Intel i7 (4×2.8 GHz), 12 GiB DDR3 RAM, and a NVidia GeForce 480

GTX. The 16 back-end nodes are equipped with an Intel Core 2 CPU (2×2.66 GHz),

4 GiB DDR2 RAM, and a NVidia Quadro FX 3500. All nodes are connected via 1 GBit

Ethernet. The installed operating system is 64-bit OpenSuSE-11.2.

Our application is written in C++, using the GCC-4.4, OpenMPI, and OpenGL. For

our walk-through tests we use the model of a Boeing 777 whose storage requirements

is ≈8.5 GiB. The total size of the interior approximations is ≈2.2 GiB, the size of the

exterior approximation is ≈320 MiB.

Figure 7.3: Visualization of the camera path through the Boeing 777 used for our tests.

For our tests we perform a walk-through as shown in Figure 7.3 using 15,000 fixed

camera positions and alignments. The walk-through starts in front of the model, focusing

the cockpit. We move the camera straight forward through the cockpit, into the first

cabin. Here we reduce movement’s speed and move into the third cabin where we keep

the camera position unchanged for a while. Afterwards, we move through the airplane’s

tail, while rotating the camera. After we leave the airplane’s model, the camera is looking

at the turbine of the left wing. We move forward while rotating the camera until we

pass the turbine and point the camera directly into the engine. We then move along

the shortest route back to the starting point of our walk-through, completing the cycle.

As rapid changes of the visibility are the most challenging situations for the algorithm,

we try not to use portals, like doors, windows, gateways etc., between different model

sections during the walk-through, but move the camera through walls and the airplane’s

inventory.

The speed of the camera movement during the walk-through has a strong influence

on the overall behavior of the system. For the evaluation, we chose to limit the frame

rate to 30 fps, while always stepping one fixed step forward on the path each frame. On

average, this corresponds to the speed of a slowly walking person.

7.5 Evaluation

Memory distribution and culling performance

In our first tests we evaluate the memory distribution and the culling performance on

the back-end nodes.

To evaluate the memory distribution among the back-end nodes, we tested our system

using 9 to 15 back-end nodes, measuring the minimal and the maximal amount of data

stored by the nodes (see Figure 7.4). As expected, the measurements show that the

pseudo-randomized data assignment produces a well-balanced data distribution, with

a difference between the minimum and the maximum amount stored of only ≈75 MiB

(≈2 % of the measured median).

3,500

3,600

3,700

3,800

3,900

4,000

9 10 11 12 13 14 15

memory in MiB

number of nodes

used memory per node

Quartiles

Figure 7.4: Required memory of 9 to 15 back-end nodes.

It was not possible to run this test in the standard network configuration using a

comparable amount of nodes to the PC cluster configuration. The system software on

the back-end nodes required much more memory than the software on the back-end

nodes in the PC cluster. For this reason, we were not able to run our rendering system

with less than 16 back-end nodes on the standard network configuration.

Figure 7.5 shows the time spent for the occlusion culling on the different PC cluster

back-end nodes, measured during the walk-through. At each point in time the workload

for all 15 back-end nodes is almost equal, although the nodes perform their tests at

slightly different camera positions. The time needed for one pass ranges from some

milliseconds up to over 10 seconds. It is important that the different back-end nodes

work asynchronously, especially at those extreme points, so that the visualization node

receives a continuous stream of new objects. Although it takes some time until the

created image shows all details, the image quality rapidly achieves a suitable level, even

if the user steps through a wall. These aspects are further evaluated in Section 7.5.

7 Parallel Out-of-Core Occlusion Culling using the Hull Tree

2,000

4,000

6,000

8,000

10,000

0 20 40 60 80 100 120 140 160 180

testing time in ms

program runtime in sec

Testing times of the different back-end nodes

Testing time IB

Figure 7.5: Required testing times of 15 back-end nodes during program’s execution.

At any time the load of all back-end nodes is almost equal.

Rendering performance

In our next tests we analyzed the performance of the visualization node. The time

needed on the visualization node to receive the data from the back-end nodes and to

render an image is shown in Figure 7.6. As the visualization nodes in the different

configurations do not perform any additional operations for the rendering itself and the

test scene’s objects are all of similar complexity, the rendering time strongly correlates

with the number of drawn objects (see the 30 fps plot in Figure 7.8). The additional

time spent on the visualization nodes correlates with the amount of received data from

the back-end nodes (see Figure 7.7). Other measurements showed that the number of

deleted objects which have to be updated on the visualization node does not significantly

influence the overall performance.

An important indicator for the image quality is the frequency of incoming messages at

the visualization node. The highest update frequency is reached in the standard network

configuration. Here, the red peaks in Figure 7.6 diagram are densest. The number of

drawn objects per frame is in this configuration the highest as well. On the other hand,

the number of drawn objects does not differ much if we use Infiniband or 1 GBit Ethernet

in the PC cluster.

In all configurations the number of drawn objects per frame is small in relation to

the complete number of objects (about 720,848 in total). In the standard network

configuration, the visualization node renders more than 8,000 objects for only few frames.

In this exterior view, the visualization node in this configuration renders significant more

objects than in the PC cluster configuration. On our walk-through with 30 fps, the

number of drawn objects never exceeds 5,700 in both PC cluster configurations, which

is the reason why the frame rate is high enough to allow real-time interaction with the

system.

100

120

140

0 2,000 4,000 6,000 8,000 10,000 12,000 14,000 16,000

rendering + receive time in ms

frame

Median of rendering and receive time (PC cluster, ethernet)

receive time

rendering time

0 2,000 4,000 6,000 8,000 10,000 12,000 14,000 16,000

rendering + receive time in ms

frame

Median of rendering and receive time (PC cluster, Infiniband)

receive time

rendering time

0 2,000 4,000 6,000 8,000 10,000 12,000 14,000 16,000

rendering + receive time in ms

frame

Median of rendering and receive time (standard network)

receive time

rendering time

Figure 7.6: Required rendering time in the different PC cluster configurations. The

measured times include the reception of updates as well as the pure rendering.

The black base lines illustrate the net rendering time, where as the red peaks

show the time spent to receive and process updates from the back-end nodes.

7 Parallel Out-of-Core Occlusion Culling using the Hull Tree

2,000

4,000

6,000

8,000

10,000

12,000

0 2,000 4,000 6,000 8,000 10,00012,00014,00016,000

Data in KiB

frame

Max. received data per frame

Infiniband

standard network

Figure 7.7: Received amount of data at the different camera positions in KiB.

(a) (b)

Figure 7.8: (a) Drawn objects per frame using the PC cluster with Infiniband.

(b): Drawn objects per frame in the different PC cluster configurations (using

30 fps). The black line shows the number of visible objects.

Visibility delay

As the occlusion culling on the back-end nodes is significantly slower than the rendering

on the visualization node, the outdated visibility information leads to temporary image

errors. Depending on the speed that the user moves through the scene, the number of

drawn objects changes (see Figure 7.8). In our system the frame rate limits the number

of fixed-sized steps that can be done per second. The slower the user moves (e.g., with

10 fps), the more objects are rendered on the visualization node as the back-end nodes

have more time to adapt their own visibility information closer to the actual position. If

the user moves faster (e.g., with a limit of 100 fps), the visibility tests on the back-end

102

7.5 Evaluation

nodes do not have time to reach deeper levels of the scene tree and only the larger objects

(and those which were randomly lifted up) are identified as visible.

Figure 7.9: The images in the upper row show screenshots with a gap of two seconds of

different camera positions, during the walk-through. The lower row shows

the corresponding positions in the standard network configuration.

The evaluation setting, with a frame rate limitation to 30 fps, provides a good trade-

off between real-time properties and visibility delay. The images of Figure 7.9 shows

snapshots made during a walk-through with a gap of two seconds between each other.

These images give an impression of the image quality achieved during runtime using

the PC cluster with Infiniband. Even if a larger number of actual visible objects have

not yet been identified, two effects result in a relatively high image quality during the

walk-through: First, the asynchronously working back-end nodes produce an almost

continuous stream of new objects. Although a single culling pass may take several

seconds to complete, every few frames one node has new objects available, so the image

quality continuously improves (if current visibility situation does not change too much).

Second, the presented data structure in combination with the occlusion culling algorithm

used prioritizes the bigger, and thereby more important, objects. The first objects to

arrive from a back-end node after a severe change of the visibility are thus probably

those which actually contribute most to the rendered image.

A worst-case scenario for the image quality is when the observer does not walk to a

position in the scene but immediately teleports to a new position. Figure 7.10 shows

the positions and measurements for such teleports. The different camera settings were

chosen due to their different properties. The loading times using the PC cluster with

Infiniband depends greatly on the target position. The time needed until the image is

complete and no further objects are transmitted ranges from 6 seconds (in the Boeing’s

cockpit) to 50 seconds (when the complete Boeing is in sight). Using the PC cluster

with 1 GBit Ethernet, the processing time increases slightly. In the standard network

configuration, using standard 1 GBit Ethernet but with better graphics adapters better

results were achieved. The time interval for a complete load is strongly decreased.

103

7 Parallel Out-of-Core Occlusion Culling using the Hull Tree

A user in general moves more steadily through the scene, allowing us to exploit spatial

coherence. While the complete reload of the turbine needed 31 seconds after a teleport,

it took only 8 seconds during the walk-through to load the complete engine. Another

observation is that in those experiments the different back-end nodes still work in a

synchronized manner and their updates arrive closely together. This results in larger

popping artifacts where many objects become visible at once. During the walk-through,

the nodes soon start to send their results more evenly distributed over time.

In general, the experiments have shown that the system allows a fluid navigation through

a complex scene with a high frame-rate, while achieving a reasonable image quality.

7.6 Contribution

In the presented parallel out-of-core rendering approach we use weak back-end nodes to

compute visibility tests and their main memory as secondary storage. In order to en-

able back-end nodes to perform occlusion tests for complex scenes despite their limited

memory, we use object approximations. We achieved a fairly uniform data distribu-

tion and load balancing for most camera positions by distributing the scene’s objects

pseudo-randomly on the back-end nodes. Due to the exploitation of spatial coherence,

our rendering system copes with large delays of the visibility tests. Because of these

properties, our rendering system allows real-time interaction with large scenes while

producing images whose error depend on the user’s movement speed.

104

7.6 Contribution

Figure 7.10: The left column shows screenshots of different camera positions, while the

right column shows the required update times on these different scene’s

locations when the camera is teleported to those positions. The plots’ y-

axis represents the amount of received data and is plotted in KiB. The

x-axis shows the time since the teleport in seconds.

105

Scenario III:

Large Dynamic Scenes

107

Large Dynamic Scenes

Figure 7.11: Rendering static scenes is already challenging. Dynamic scenes make this

challenge even more difficult.

The rendering of large, static, three-dimensional scenes is already a well-known chal-

lenge. The previous chapters introduced techniques to processes such scenes in real-time.

If a scene also includes many moving objects, the problem become even worse. To place

the dynamic objects correctly, their positions must be updated in every frame. In our

case the objects are controlled by an external simulation, written in Java, that sends the

updates via a network.

The case-study where we render large scenes with dynamic objects in parallel is the

simulation of material flows [DFH+08, SFH+09]. This is one well-established method

for planning, safeguarding and improving production processes. This is an application

of discrete event simulation which has been introduced by Law and Kelton [LK00].

Problems studied are, for instance, lot size planning, layout improvement, queue size

evaluation, human resource planning, and so forth.

During the last years, the 3D-visualization of material flow simulations has had a

growing importance. It helps the user in understanding the underlying behavior of the

modeled system better and more quickly than other methods. Moreover, the 3D-view

on the simulation model offers helps in communicating results and simulation based

decisions to persons who are not experts in simulation. As the use of material flow

simulations in industry increases, the size and the complexity of models is also growing.

The complexity of not only the simulation model, but also the 3D representation, is high

because they are often imported from CAD-applications.

109

8 Visualization of Multiple

Synchronous Simulations

In the previous parts our techniques were focused on large static scenes, without any

dynamics. In this chapter we present a rendering technique that allows to render multiple

dynamic discrete event simulations simultaneously using a PC cluster. One main aspect

for our application of discrete event simulation is the intensive use of random variables

to model, for instance, processing times or break down intervals. This randomness

leads to the following problem: a single simulation run must not be regarded as the

representation of model behavior because it is possible to view the best case, the worst

case, or anywhere in-between. In classic simulation study, one would repeat simulation

runs in order to allow the assessment of each simulation run in the context of a set

of simulation runs. These replications only differ in their random variables, in such a

manner that the flow of jobs is affected but not the type of machines. An average over

the whole set is created and it is assumed that this mean is a valid representation of the

real system and can be used to make reasonable decisions.

The need for studies with a set of simulations and the benefit of simulation visualiza-

tion cannot be satisfied by current simulation tools because just one single simulation

run can be visualized. The system presented in this paper was designed to offer a solu-

tion to this problem: It enables the easily and fast analyzing of model behavior and it

supports the user to concentrate on significant representations.

The general idea of the developed approach is to visualize not only one simulation

run, but to animate all simulation runs in one 3D-visualization interface by using special

techniques such as transparency or tinting of each dynamic object of each simulation run

[DHL+06, FLH+07]. By aggregating transparent dynamic objects a “visual average” is

created (see Figure 8.1). It will then be possible to visualize the stochastic derivations

of the simulation study in one user interface. Moreover, the identification of extreme

simulation runs can be done faster and the direct analysis for reasons of extreme behavior

is supported. The identification of special simulation runs can be assisted by tinting

dynamic objects in different colors, depending of their simulation. To compare specific

simulations, multiple simulations can be visualized next to each other (see Figure 8.2).

Another feature of our system is that the number of simulation replications can be

altered during runtime. This is useful when the user wants to analyze system parameters,

which he has not thought of at the begin of the simulation. To increase the number of

simulations, a cloning method was implemented.

Our system has to simulate and visualize a large, and possibly growing, number of

complex models simultaneously. Each task cannot be accomplished by a single computer.

111

8 Visualization of Multiple Synchronous Simulations

Figure 8.1: Merging Dynamic Objects.

The execution of the simulation and the rendering is done on a PC cluster and a thin-

client is used for visualization and interaction. By using a PC cluster, a good scalability

of the development process is possible. The thin-client makes mobile application possible.

In order to do this we must solve the following problems:

•We must render large scenes that contain more static objects than a single node

is able to process.

•We must be able to display simulations that contain more dynamic objects than

one single node can process.

•We must join the images of static and dynamic scene parts quickly and efficiently.

8.1 Overview and Summary of Results

For the parallel simulations and the rendering system we use the large configuration of

PC2’s Arminius PC cluster. For the controlling and interaction, an additional external

visualization client is used. The communication between PC cluster and the external

visualization client is realized by a TCP/IP connection. The nodes in the cluster com-

municate over Infiniband interfaces. The communication-flow is shown in Figure 8.3.

For this rendering system, the PC cluster nodes are separated into three groups: a mas-

ter node, a large amount of static-rendering nodes, and a number of dynamic-rendering

nodes. The rendering performance of all utilized nodes is low. For each simulation there

is exactly one dynamic-rendering node. The external client is connected to the master

node. On each static-rendering node, a static-scene renderer is started and there are

112

8.1 Overview and Summary of Results

Figure 8.2: The visualization client.

no other processes started by the program on these nodes. On each dynamic-rendering

node, a dynamic-scene renderer and simulation kernel is started (see Figure 8.4).

The master node receives commands from the client and sends them to the static-

rendering nodes. Load-balancing of the static-rendering nodes, composing of the pictures

from the static and the dynamic-rendering nodes, and sending of composed pictures to

the client are also done by the master node.

The static and the dynamic parts of a 3D-scene are strictly separated. It is assumed

that the overall complexity of the static objects is higher than that of the dynamic

objects. The static objects do not differ from one simulation to another, so it is possible

to render only one picture of these objects for all simulations. In contrast, each dynamic

object is associated to exactly one simulation and typically simulations have different

behavior.

After receiving a user command, the static-rendering nodes render the static parts of

a simulation. This is repeated for every frame. The frame and the depth buffer are sent

to the master node. Additionally, the depth buffer and the last user command are sent

to the dynamic-rendering nodes. The nodes wait for new commands after their message

has been sent.

Every dynamic-scene renderer is connected to the simulation kernel which has been

started on the same node. There is no need for a simulation to be running on this

kernel as it is possible that a kernel is waiting for a simulation as input. This happens

when, for example, a user wants to clone a simulation. Cloning means to create a copy

of a simulation which acts exactly like the original. If a simulation shall be cloned to

another simulation kernel on another dynamic-rendering node, all simulation kernels in

the PC cluster have to be suspended at the same time. The simultaneous suspension

of all kernels keeps the simulation synchronous. When all kernels are suspended, the

selected simulation will be serialized to a byte-stream. This stream will be send to

113

8 Visualization of Multiple Synchronous Simulations

Figure 8.3: Communication Diagram.

a dynamic-rendering node whose simulation kernel is not running a simulation. The

stream will be deserialized in the simulation kernel and all simulation kernels will be

resumed simultaneously.

If a simulation is running on a dynamic-rendering node, the dynamic-scene renderer re-

ceives the signals of the simulation kernel, the depth buffers and the user commands from

the static-rendering nodes. After composing the incoming depth buffers the dynamic-

rendering nodes evaluate the received user command and compute a picture of the dy-

namic objects of the simulation it is associated to. After finalizing the rendering process,

the depth buffer of the rendered frame is compared with the composed depth buffer of

the static-rendering nodes. By doing this, occluded parts will be erased. The pictures of

the dynamic rending nodes are composed hierarchically by MPI-mechanisms. The final

picture of the visible dynamic parts will be sent to the master node. Here the static and

the dynamic pictures will be merged and sent to the client (see Figures 8.5 and 8.6 also

showing occlusion of dynamic objects by static objects).

A user is able to interact with and modify the simulations. All of these interactions

have to be lead through the PC cluster to the dynamic-rendering nodes where they are

transmitted to the simulation kernels.

To render large, static scenes we rendered these scene parts in sort-first manner. Thus

we do not use all available PC cluster nodes. We balance their rendering load using

a heuristic that determines the load of the different rendering nodes by the previously

needed rendering time. The dynamic scene parts are rendered by a second group of

back-end nodes. Each node in this second group renders the dynamic objects of exactly

one simulation. The produced images are combined in sort-last manner. Tests have

shown that the fastest merging of the images is done by FPGAs.

114

8.2 Related Work

Visualization

Client

Static Rendering

Node

Static

Renderer

Master

Node

DynamicRendering

Node

namic

Renderer

Simulation

Kernel

Figure 8.4: Simplified visualization of the system configuration. The master node is the

communication center which handles all data into and out of the PC cluster.

8.2 Related Work

In this section, a short overview of the related work in the areas of simulation visualiza-

tion and parallel and distributed simulation shall be given.

The technique of Interactive simulations with 3D visualization has success-

fully found its way into discrete simulation technology over the last decade. Before that

the discrete simulators rendered the model in a 2D fashion with icons and lines. Com-

mon arguments for 3D visualization of simulation models are that they provide a better

understanding of the regarded problem by all stakeholders and the easier identification

of errors [KM00] (for example the detection of accumulations of packets on conveyor

belts). By allowing to change simulation’s parameters during runtime the number of

repeated simulation runs is decreased. Thus, the total evaluation time decreases as well

[DDLF09]. The changes to the simulation model can be done with a 3D editor.

Recent results of experimental studies that tested the impacts of Virtual Reality (VR)

on Discrete-Event Simulation show that it is easier and faster to spot errors in a 3D/VR

model than in 2D [AB05]. Despite the advantages of 3D visualization, many remain

cautious because of a shallow learning curve for 3D software. However, Renken et al.

have shown that 3D models can be used for motion planning in production facilities

[FRL+10].

Most of today’s discrete event simulation systems (for example Quest, Automod, and

others) already provide integrated 3D visualizations. In these systems, a simulation and

the rendering are running on the same machine. For performance reasons, simulation

and rendering can be executed on different machines and bidirectionally coupled with

each other [SSLR05]. When doing so, the synchronization of both tools (i.e. the time

advance in both tools) must be coordinated.

115

8 Visualization of Multiple Synchronous Simulations

Figure 8.5: Showcase of two static-scene renderers’ frame parts.

Parallel and distributed simulation is used to speed up the computation, in other

words, to reduce the time until reliable results are available. There are two methods to

exploit parallel processors for simulation. First, one simulation can be distributed on

a multiprocessor cluster to speed-up the simulation [Fuj98]. To obtain reliable results,

the simulation runs are executed sequentially. Second, as implemented in our system,

on every processor of a cluster one replicated simulation can be executed [GH91] (i.e.,

simulation runs are executed in parallel).

8.3 System Architecture

The system consists of program parts written in C/C++ and JavaTM. The different

components communicate via network. The static and dynamic scene parts are strictly

separated.

The simulation kernel

The simulation kernel used is a Java based discrete event simulation tool named d3FACT

insight [DMMH05]. This tool has be developed in-house by members of Prof. Dangel-

maier’s group. d3FACT insight is specialized in modeling and simulating material flow

systems, especially job shop problems. A model consists of a graph of entities represent-

ing machines, conveyor belts, buffers, etc. Every entity is a static object with respect to

rendering and an instant of an entity-class. The jobs are represented by token objects,

dynamic objects with respect to rendering. Token allocate the entities of the graph

and are sent from entity to entity, creating the material flow. This sending, receiv-

ing and alteration of tokens is done by event routines, which are executed at discrete

116

8.3 System Architecture

times when according events occur. To visualize a model, each entity and token has

a 3D-representation and a position in R3. The logic data of models (i.e., entities de-

scription and simulation input data) is separated from the 3D-data. Data of both data

types are stored in a central database and are linked by referencing the DB-key of the

3D-representations in the logic entity description.

Figure 8.6: Frame of the dynamic-scene renderer (on black) and the complete frame.

Like in every other common simulation tool, there is a set of random distributions

(e.g., normal, exponential, triangular) to model processing times, breakdown behavior,

etc. To each random variable a specific seed is assigned, allowing the generation of

different random values in each variable. This is used in simulation studies when a set

of simulation runs is used to analyze system behavior. In a study, every variable gets

a different seed in every run, thus resulting in different model behavior. To capture

model behavior, every entity has several statistic variables, like utilization, idle time,

breakdown time.

To enable online visualization of a simulation, there is a simulation-to-real-time ratio

parameter in the simulation kernel. This parameter, though not typical in discrete

event simulation, is necessary to create realistic animations. Normally a next event time

advance function is used, which advances the simulation clock directly from one event

to the next [LK00].

To connect to a simulation for online control, user interaction, visualization and data

collection (or, in more general terms, message reception and transmission) a well de-

fined message protocol is used. To receive these messages, a local socket connection is

necessary.

One special kernel module enables the cloning of simulations. The state of a simulation

can be saved (i.e., the variable values, the list of scheduled events, the position of dynamic

objects). This saved simulation state can be transferred to another simulation kernel,

117

8 Visualization of Multiple Synchronous Simulations

which is not running a simulation yet. This simulation kernel can load the saved state

and initialize a simulation equal to the original. After cloning, both simulation kernels

can run the “same” simulation. If complete equality is not desired, the cloned simulation

can be altered by changing the seeds of the random variables. If the original simulation

is suspended until the clone is ready, both simulations run nearly synchronously because

of the time ratio parameter of the simulation kernel.

The master node

The master node is the main communication center in our system. It sends the rendered

pictures to an external client and receives the inputs of a user (movement) and propagates

them to the rendering nodes. Additionally, the rendering job sizes of the static-scene

renderer are also organized by this special and unique node (load balancing). The

composition of dynamic and static subpictures is also performed on this node. The view

space is divided by a recursive algorithm. The splitting is related to the technique that

Abraham et. al. [ACCC04] present in their work. The number of static rendering nodes

is N= 2n. For the first frame of the visualization, the tiles are equally sized for all

nodes. Every node measures for every frame the time it needs to render its tile. The

needed rendering time is sent to the master node to determine the new tile sizes. The

Nnodes are split in two fixed groups, g1and g2, where the nodes of g1have rendered

the upper half of the screen and the nodes of g2the lower half.

Figure 8.7: Partition of tiles.

This arrangement in groups is persistent for the whole simulation. In Figure 8.7, an

example partition after several balancing steps is shown. The differently colored areas

visualize the different groups where every group has 2n−1members. For both groups,

the average of the last frames’ required rendering times is calculated. The calculated

average of g1will be called m1and for g2will be called m2. If m1is bigger than m2,

118

8.3 System Architecture

typically the nodes of g1have had to render more geometry than the nodes of g2. In this

case the border between this groups will be shifted in relation to the difference of the

averages. The area of g1will be shrunk and the area of g2will expand. If the average

rendering time of g2is bigger the behavior is reversed. If the averages are (nearly) equal,

the border will not be shifted. This harmonization will be done recursively until a group

only consists of one node. Computing the load-balancing in this way is very fast and

tests show that the load of all nodes is well balanced after a few frames [ACCC04].

The dynamic-scene renderer

Every dynamic-scene renderer has its own exclusive simulation. On every dynamic-

rendering node a native MPI process runs for the rendering and a simulation process in

a Java-VM. Both processes communicate locally over a TCP/IP socket using a compact

protocol. The rendering and the simulation processes exchange binary-coded message

over this channel. After starting the MPI process it will instantiate a simulation kernel

in a Java-VM immediately. The dynamic-scene renderer connects a threaded socket-

client to a socket-server threaded by the simulation kernel. Between visualization and

simulation, all commands and events are transmitted over this channel.

First the simulation kernel, which is running on the dynamic-rendering node with

the lowest ID, loads all meshes which are needed to visualize the running simulation

from a database and stores them on the hard disk. When storing has been finished,

a message is sent from the simulation kernel to the connected dynamic-scene renderer.

This dynamic-scene renderer notifies all other renderes – static and dynamic – that the

loading of the scene data can be initiated.

When all renderers have finished loading and organizing their data, they wait for user

commands and the separated depth buffers of frame’s static parts. The itemized depth

buffers have to be combined into a single buffer for the whole frame. The depth buffer

is needed to erase dynamic objects that are occluded by static objects (the frame buffer

is not needed). The dynamic-scene renderer handles the user commands and renders

a picture of the dynamic objects of its corresponding simulation in front of a black

background. When the rendering process has been finished, a dynamic-scene renderer

reads back the frame and the depth buffer from the memory of the graphics adapter

to its local memory. The merging of the separated frames can be done in several ways

depending on the chosen visualization technique.

If all simulations will be shown in only one frame, a MPI-Reduce function is called

where the depth buffer values of all frames of the dynamic-scene renderer are compared

with one another. The comparison is organized hierarchically in a tree. In this way a

pixel will get the color of the object that is closest to the view-plane. The root of the

reducing function (the node where the final result is stored) sends the merged frame and

depth buffer to the master node and all nodes wait for new instructions.

The behavior of the system is similar if all simulations shall be visualized with trans-

parencies in one frame. Instead of merging the dynamic objects opaquely (occluding

each other) the several frames are merged proportionally with transparencies. If there

are nsimulations that should be visualized in one frame, the alpha value of all pixels

119

8 Visualization of Multiple Synchronous Simulations

is set to 1

n, where 0 is transparent and 1 is opaque. For the computation of the color

values of all of the pixels, the Porter Duff algorithm is used in a hierarchical tree as well.

Similar to opaque merging, the root of the reduction sends the final image to the master

node (see Figure 8.1).

The transparent visualization is reasonable up to 16 simulations. Our system allows

merging of more than 16 transparent frames, but for a user it is nearly impossible to

distinguish between one or two tokens at the same position.

If a lot of simulations are visualized in only one frame, a user can lose the overview.

However, sometimes a user wants to know how a specific simulation differs from all the

others. This rendering system offers an mechanism to tint the dynamic objects of one

simulation in an eye-catching color. This way it is possible for a user to see whether the

behavior of a simulation corresponds with the others.

The merging of the different dynamic-scene renderers’ images consumes a considerable

amount time. For this reason we looked into alternative methods to accelerate this

subroutine. These results are presented in Section 8.4.

The static-scene renderer

The static-scene renderer processes only the static parts of a scene (machines, racks,

etc.). These objects are rendered by a sort-first rendering approach. This means that

the viewing plane is divided in separated tiles. In this system, every static-scene renderer

is responsible for exactly one tile, whose relative position never changes.

After receiving a load-ready message from a dynamic-rendering node, all static-rendering

nodes load the data for the whole scene into their memory and organize them in a hier-

archical data structure which can be chosen by the user (octree or k-d tree). This data

structure is needed for frustum culling of every tile of a picture. Every static-rendering

node receives a user command and a tile position with its dimension from the master

node. The view-port will be trimmed and the geometry in this area of the view-plane

will be specified (by doing a frustum intersection test) traversing the built data struc-

ture. Every static-scene renderer computes its subframe, reads the image data from

the graphics adapter’s memory to its main memory. It then sends the frame buffer,

the depth buffer and the needed rendering time to the master node. The master node

receives all tiles from the static-render nodes and combines them to a complete image

of the static parts of the scene. However, the dynamic objects are still needed for dis-

playing a correct frame. Thus, the depth buffer is required to add the images of the

dynamic-rendering nodes. The pure rendering times are needed to balance the load of

the static-scene renderer.

The depth buffer is also sent to all dynamic-rendering nodes, just the same as the last

user command. When sending is completed, the static-render node waits for new input

from the master node.

The visualization client

The external control application is a thin-client written in Java. This decision has

been made to reduce the platform dependency. Additionally we decided that the client

does not need any 3D-graphic-accelerator. This decision makes it possible to develop

120

8.4 Joining the Partial Images

control-interfaces that run on weak computers, such as PDAs or cell-phones. The client

only provides an user-interface to control the cluster, the visualization and the itemized

simulations (see Figure 8.2). All computations are done by the processors in the PC

cluster. The client does not even know about the properties (utilization, speeds, ...) of

the several simulations. If there is the need for any information from the cluster or the

simulations, it has to be requested from the master node.

All images to be displayed on the client application are transmitted from the master

node as raw RGB-images, we discard pixels’ alpha values to reduce the network load.

After receiving an image, it will be visualized and a new user command is sent to the

master node. If there was no user interaction, a new image is requested anyway.

Using the client application it is also possible to select which simulations shall be shown

or modified. The properties of machines can be requested and, after this, a user can

modify them. For example a user can change the distribution ratio of a switch or the

processing time of a machine. After subscribing to machine properties, the user will be

informed about all changes until he unsubscribes from this information.

8.4 Joining the Partial Images

The rendering system renders the static scene parts in a sort-first manner. This way,

composing the different tiles can be done with less effort. In contrast, composing the

images (including the dynamic part) of the image consumes a significant amount of

time. In order to accelerate this image processing step, we test if different hardware

components can be utilized [SSPP09, SSPP11]. Here we aim to analyze the composing

speed and the rendering speed. As benchmark we render different sections of the UNC

Power Plant. We decrease the hierarchy tree’s depth by composing many images on one

node. We analyze the acceleration that can be achieved if images are merged via a pixel’s

depth value comparison. These tests are made separately in a different environment. In

one step we try to combine two up to eight images on one node, while we usually

combine two images on one node per step. In order to test a wide range of hardware

components we use the XtremeData’s XD1000 Architecture as visualization node. This

node is equipped with a 2.2 GHz AMD Opteron CPU and 4 GiB main memory. An Altera

Stratix II EP2S180-3 FPGA and additional 4 GiB external memory is placed in a second

Opteron socket. The CPU and FPGA communicate via a 16-bit-wide HyperTransport

link. Additionally, this node is equipped with a NVidia GeForce 8800 GTS (using 16×

PCI-e). In this configuration, only one back-end node is available. It is equipped with an

Intel Clovertown featuring 2 quad core CPUs running at 2.66 GHz and 8 GiB RAM. The

nodes are connected via Infiniband with a peak bandwidth of 10 GBit/s. We perform

the tests with a resolution of 800×600, 1,024×768, and 1,280×1,024 using 32 bit/pixel.

For the composition we used a simple algorithm (see Algorithm 8.1).

We evaluate six different configurations to accelerate the composition: CPU without

its streaming unit, CPU using SSE2, GPU using GLSL-shader, GPU using CUDA, and

FPGA. We achieve the best results with the FGPA. The second best results are achieved

with the CPU without the streaming unit.

121

8 Visualization of Multiple Synchronous Simulations

Algorithm 8.1 Simple sort-last image composing algorithm.

Variables: color array SrcC1, SrcC2 depth array SrcD1, SrcD2

1: for i = 0; i <SrcC1.length; i++ do

2: if SrcD1[i] >SrcD2[i] then

3: SrcC1[i] = SrcC2[i];

4: SrcD1[i] = SrcD2[i];

5: end if

6: end for

The usage of processor’s streaming units does not pay off. This is due to the number of

CPU instructions which are performed in every loop. To perform the composition using

SSE2 required more than twice the processor instructions than the comparison and copy

operations in the merge algorithm. If the chosen pixels are equally distributed over both

partial images, without SSE2 we can expect that half of the copy instructions can be

skipped. On the other hand, no operations can be skipped because of SSE2’s SIMD

characteristic.

The composition time using the GPU does not differ much when CUDA or GLSL is

used. The problem here is not the composition itself (the junction of several images on

the GPU can be performed rapidly). The problem is the host-to-device data transfer.

Composing four and seven 800 ×600 pixel images, we achieve roughly 11.4−17.1fps

using the CPU, while we achieve only round about 9.7−15.5fps if we use CUDA. Four

images with a resolution of 1,280 ×1,024 can be composed and display using the CPU

with a rate of 6.4fps, while we achieve only 5.8fps if the GPU is used.

The usage of the FPGA outperforms the CPU. The benefit achieved with the FPGA

is displayed in the diagram of Figure 8.8.

Using the FPGA, the rendering speed increases linearly with the number of composed

images until the network is saturated. If the CPU is used, the speed-up collapses ear-

lier. By increasing the number of processors, every node has to process fewer objects.

Additionally, the overhead to produce an image (like copying buffers) does not change.

8.5 Contribution

In this chapter we have presented a system for the simultaneous visualization of several

parallelly executed simulations. By using multiple processors for simulation execution

and parallel rendering, huge models can be simulated and visualized. Because of this,

new techniques to display aggregated data of many simulation runs had to be developed.

We separated the static scene parts from the dynamic. While we render the static parts

in sort-first manner, we render each simulation’s dynamic objects independently. To

accelerate the composition of the different images, we determined that the fastest way

is the usage of a FPGA in comparison to other techniques.

122

8.5 Contribution

2345678

fps

number of rendering nodes

Composing CPU vs. FPGA

CPU, 800 ×600

CPU, 1,024 ×768

CPU, 1,280 ×1,024

FPGA, 800 ×600

FPGA, 1,024 ×768

FPGA, 1,280 ×1,024

sat. network 800 ×600

sat. network 1,024 ×768

sat. network 1,280 ×1,024

Figure 8.8: Comparison of the composition speed using the CPU without SSE2 and the

FPGA. The speedup of FPGA is stopped by the network’s saturation.

123

9 Conclusions

In this thesis we have shown how heterogeneous PC clusters can be utilized for parallel

image rendering. In this context we do not require that all nodes are equipped with

high end hardware. In the developed approaches, only a small group of powerful nodes

is necessary, the rest of the PC cluster nodes can offer weak graphic performance. Due

to this support of mixed hardware configurations it is possible to upgrade only a few

processors instead of the complete PC cluster system. This allows the users to use

hardware’s latest features like GPU-extensions, shaders, streaming units. In this way,

the upgraded group can be tested sufficiently while the other nodes run unmodified.

These properties can influence a system’s overall stability and performance positively.

Additionally, these configurations save energy and costs, which is an important topic for

PC cluster providers and their users. The providers can also use a PC cluster for a longer

time because the system improves constantly. Based on this it is easier for providers and

users to adjust their software for future PC cluster system configurations.

9.1 Contribution

Collectively, we developed four parallel rendering systems, one sequential rendering sys-

tem, and a spatial hierarchical data structure. Additionally, we evaluated how sort-last’s

image combination can be accelerated through the usage of different hardware.

With the developed parallel rendering systems we answered the following questions:

•Can heterogeneous PC clusters be utilized for parallel real-time rendering?

•What tasks can be handled by the weak nodes?

•How can asynchronous communication protocols be applied?

•How can the rendering, data, and network load be distributed among the different

nodes?

In this thesis we have shown that heterogeneous PC clusters, consisting of a few

powerful nodes and many weak nodes, can be used for parallel real-time image rendering.

We have introduced different methods where many weak nodes support a few powerful

nodes. These methods include techniques as simplification generation, visibility test, and

storing data. Additionally, we have shown that few nodes equipped with special purpose

hardware can influence the rendering performance positively. Our parallel rendering

systems for static scenes improve the image quality if the number of back-end nodes is

125

9 Conclusions

increased. This is different to other systems that increase the frame rate if the number

of nodes is increased.

The weak nodes can be utilized for tasks which are not time-critical. In the reliefboard

approach the weak back-end nodes compute the simplifications asynchronously, while

the display images are computer on the powerful visualization nodes. In the out-of-core

rendering systems, the back-end nodes serve as “smart” secondary memory that filter

requested objects by performing visibility tests. Our first introduced parallel out-of-core

rendering systems used older depth buffers to determine visible objects. To accelerate

the visibility test of our secondly introduced parallel out-of-core rendering system we

developed a spatial, hierarchical data structure. The hull tree reduces the amount of

data that each back-end node has to store in its primary memory.

Due to the back-end nodes’ weak performance, the suggested computations require

much processing time. As a result, the back-end nodes have to perform tasks whose

results are suitable for the displaying visualization nodes even when they arrive several

seconds later. Because of slight fluctuations in the various task sizes, it is unusual that

results arrive at the visualization nodes on time. For that reason, the communication

does not have much influence on the rendering process for the visualization nodes. In the

reliefboard approach the back-end nodes compute the simplification which can be used

for many frames on the visualization node. The parallel out-of-core rendering systems

reduce the load of the network by determining visible objects on the back-end nodes and

sending only visible objects.

In the parallel rendering system to visualize multiple simulations we separated the static

scene parts from the dynamic parts. While the static scene is rendered in a balancing

sort-first manner, the dynamic scene parts of each simulation is processed on a separate

back-end node. The different images are combined in sort-last manner. To accelerate

the merging process we use a node equipped with an FPGA, while other promising

techniques (like streaming units or GPUs) were not found to be beneficial, during our

tests.

A randomized distribution of the tasks and the data leads to good load balancing

in the presented systems. In the reliefboard approach and the first presented parallel

out-of-core approach, we distributed the different objects randomized and redundant

among the back-end nodes. In the parallel out-of-core rendering system that uses the

hull tree we renounced redundant storing of the data items. Due to the huge amount

of data items and the relative small number of requests, the load was sufficiently evenly

distributed.

In addition to these parallel rendering methods, we also developed a sequential occlu-

sion culling technique with a associated spatial hierarchical data structure (which was

later modified for parallel rendering purposes). The data structure covers its included

objects more tightly than other data structures, such as regular octrees. Thus it is more

suitable for occlusion culling.

126

9.2 Open Questions and Future Work

We tested the developed rendering techniques with CAD-models without the use of

image manipulating procedures, such as a different post-processing shaders. It would be

interesting to analyze how our techniques can be combined with these techniques. Image

manipulating shaders can improve the image quality, which can lead to a higher level of

realism for the observer. Additionally, it would be interesting to evaluate how the user

perceives the approximations. On the one hand, massive popping effects can disturb

users. On the other hand, a low frame rate is also disturbing and, in some situations,

worse than the spontaneous appearance of objects.

The developed techniques are tested on two different types of nodes. The load-

balancing algorithms of reliefboards and the first out-of-core, rendering system presented

should be expandable to support more types of nodes. The out-of-core renderer using

the hull tree would need a new data access protocol.

In the future, multi-core CPUs will have a huge amount of independently working

cores. This computation parallelism should allow transferring the reliefboards technique

directly on a single computer. A transfer of the other parallel rendering techniques would

probably require more efforts.

For the static scene renderer, we always distributed the different objects randomly.

It would be interesting if there are other easily computable distributions (especially for

large scenes streaming algorithms for clustering) that could be used.

For two of our rendering techniques we use interior approximations. The computation

of these simplifications needs a significant amount of time. Additionally, for certain mesh

configurations it is not possible to determinate this approximation or to reach the aimed

triangle count. Faster and more robust methods are required here.

127

Bibliography

[AB05] Justice I. Akpan and Roger J. Brooks. Experimental investigation of the

impact of virtual reality on discrete-event simulation. In Proceeding of the

2005 Winter Simulation Conference, WSC ’05, pages 1968–1975. Winter

Simulation Conference, 2005.

[ABB+07] Carlos And´ujar, Javier Boo, Pere Brunet, Marta Fair´en, Isabel Navazo,

Pere Pau V´azquez, and Alvar Vinacua. Omni-directional relief impostors.

Computer Graphics Forum, 26(3):553–560, September 2007.

[ABW90] Helmut Alt, Johannes Bl¨omer, and Hubert Wagener. Approximation of

convex polygons. In Proceedings of the 17th International Colloquium on

Automata, Languages and Programming, volume 443 of ICALP ’90, pages

703–716. Springer-Verlag, 1990.

[ACCC04] Frederico Abraham, Waldemar Celes, Renato Cerqueira, and Joao Luiz

Campos. A load-balancing strategy for sort-first distributed rendering. In

Proceedings of the XVII Brazilian Symposium on Computer Graphics and

Image Processing, SIBGRAPI ’04, pages 292–299, Washington, DC, USA,

2004. IEEE Computer Society.

[ACW+99] Daniel Aliaga, Jon Cohen, Andrew Wilson, Eric Baker, Hansong Zhang,

Carl Erikson, Kenny Hoff, Tom Hudson, Wolfgang Stuerzlinger, Rui Bas-

tos, Mary Whitton, Fred Brooks, and Dinesh Manocha. Mmr: an inter-

active massive model rendering system using geometric and image-based

acceleration. In Proceedings of the 1999 symposium on Interactive 3D

graphics, I3D ’99, pages 199–206, New York, NY, USA, 1999. ACM.

[Ada05] Paul Adams. Performance comparisons of visualization architectures. In

Proceedings of the Users Group Conference 2005, DOD-UGC ’05, pages

388–393, Washington, DC, USA, 2005. IEEE Computer Society.

[AMHH08] Tomas Akenine-M¨oller, Eric Haines, and Natty Hoffman. Real-Time Ren-

dering 3rd Edition. A. K. Peters, Ltd., Natick, MA, USA, 2008.

[BHP07] Beat Br¨uderlin, Mathias Heyer, and Sebastian Pf¨utzner. Interviews3d:

A platform for interactive handling of massive data sets. IEEE Comput.

Graph. Appl., 27:48–59, November 2007.

129

Bibliography

[BMadHS97] Petra Berenbrink, Friedhelm Meyer auf der Heide, and Klaus Schr¨oder.

Allocating weighted jobs in parallel. In Proceedings of the 9th Annual

ACM Symposium on Parallel Algorithms and Architectures, SPAA ’97,

pages 302–310, New York, NY, USA, 1997. ACM.

[BMW+09] Jiˇr´ı Bittner, Oliver Mattausch, Peter Wonka, Vlastimil Havran, and

Michael Wimmer. Adaptive global visibility sampling. In ACM Trans-

actions on Graphics, volume 28, pages 94:1–94:10, New York, NY, USA,

August 2009. ACM.

[BWPP04] Jiˇr´ı Bittner, Michael Wimmer, Harald Piringer, and Werner Purgathofer.

Coherent hierarchical culling: Hardware occlusion queries made useful.

Computer Graphics Forum, 23(3):615–624, 2004.

[CDL+96] Bradford Chamberlain, Tony DeRose, Dani Lischinski, David Salesin, and

John Snyder. Fast rendering of complex environments using a spatial hi-

erarchy. In Proceedings of the Conference on Graphics Interface ’96, pages

132–141, Toronto, Ont., Canada, Canada, 1996. Canadian Information

Processing Society.

[CDR02] Alan Chalmers, Timothy Davis, and Erik Reinhard, editors. Practical

parallel rendering. A. K. Peters, Ltd., Natick, MA, USA, 2002.

[CKS02a] Wagner T. Corrˆea, James T. Klosowski, and Cl´audio T. Silva. iwalk:

Interactive out-of-core rendering of large models. Technical Report

TR-653-02, Princeton University, http://www.cs.princeton.edu/˜wtcorrea

/papers/iwalk.pdf, 2002.

[CKS02b] Wagner T. Corrˆea, James T. Klosowski, and Cl´audio T. Silva. Out-of-core

sort-first parallel rendering for cluster-based tiled displays. In Proceed-

ings of the Eurographics Workshop on Parallel Graphics and Visualiza-

tion, EGPGV ’02, pages 89–96, Aire-la-Ville, Switzerland, Switzerland,

2002. Eurographics Association.

[CKS03] Wagner T. Corrˆea, James T. Klosowski, and Claudio T. Silva. Visibility-

based prefetching for interactive out-of-core rendering. In Proceedings of

the 2003 IEEE Symposium on Parallel and Large-Data Visualization and

Graphics, PVG ’03, pages 1–8, Washington, DC, USA, 2003. IEEE Com-

puter Society.

[Cla76] James H. Clark. Hierarchical geometric models for visible surface algo-

rithms. Communications of the ACM, 19:547–554, October 1976.

[COCSD03] Daniel Cohen-Or, Yiorgos Chrysanthou, Cl´audio T. Silva, and Fr´edo Du-

rand. A survey of visibility for walkthrough applications. IEEE Transac-

tions on Visualization and Computer Graphics, 9(3):412–431, 2003.

130

Bibliography

[Coo84] Robert L. Cook. Shade trees. In Proceedings of the 11th Annual Conference

on Computer Graphics and Interactive Techniques, SIGGRAPH ’84, pages

223–231, New York, NY, USA, 1984. ACM.

[CVM+96] Jonathan Cohen, Amitabh Varshney, Dinesh Manocha, Greg Turk, Hans

Weber, Pankaj Agarwal, Frederick Brooks, and William Wright. Sim-

plification envelopes. In Proceedings of the 23rd Annual Conference on

Computer Graphics and Interactive Techniques, SIGGRAPH ’96, pages

119–128, New York, NY, USA, 1996. ACM.

[DDLF09] Wilhelm Dangelmaier, Robin Delius, Christoph Laroque, and Matthias

Fischer. Concepts for model verification and validation during simulation

runtime. In European Simulation and Modelling Conference, ESM ’09,

pages 49–53. EUROSIS, EUROSIS-ETI, 26 - 28 October 2009.

[DDSD03] Xavier D´ecoret, Fr´edo Durand, Fran¸cois X. Sillion, and Julie Dorsey. Bill-

board clouds for extreme model simplification. ACM Transactions on

Graphics, 22:689–696, July 2003.

[DDTP00] Fr´edo Durand, George Drettakis, Jo¨elle Thollot, and Claude Puech. Con-

servative visibility preprocessing using extended projections. In Proceed-

ings of the 27th Annual Conference on Computer Graphics and Interactive

Techniques, SIGGRAPH ’00, pages 239–248, New York, NY, USA, 2000.

ACM Press/Addison-Wesley Publishing Co.

[DE84] William H. E. Day and Herbert Edelsbrunner. Efficient algorithms for

agglomerative hierarchical clustering methods. Journal of Classification,

1:7–24, 1984.

[Del00] Mark. Deloura. Game Programming Gems. Charles River Media, Inc.,

Rockland, MA, USA, 2000.

[DFH+08] Wilhelm Dangelmaier, Matthias Fischer, Daniel Huber, Christoph

Laroque, and Tim S¨uß. Aggregated 3d-visualization of a distributed simu-

lation experiment of a queuing system. In S. J. Mason, R. Hill, L. Moench,

and O. Rose, editors, Proceedings of the 2008 Winter Simulation Confer-

ence, WSC´ 08, pages 2012 – 2020. IEEE, Omnipress, 2008.

[DH00] Michael Doggett and Johannes Hirche. Adaptive view dependent tes-

sellation of displacement maps. In Proceedings of the ACM SIG-

GRAPH/Eurographics Workshop on Graphics Hardware, HWWS ’00,

pages 59–66, New York, NY, USA, 2000. ACM.

[DHL+06] Wilhelm Dangelmaier, Daniel Huber, Christoph Laroque, Mark Aufe-

nanger, Matthias Fischer, Jens Krokowski, and Michael Kortenjan. d3fact

insight goes parallel - aggregation of multiple simulations. In Thomas

131

Bibliography

Schulze, Graham Horton, Bernhard Preim, and Stefan Schlechtweg, edi-

tors, Proceedings of the Simulation und Visualisierung 2006, SimVis ’06,

pages 79–88. SCS European Publishing House, 2006.

[DMadH93] Martin Dietzfelbinger and Friedhelm Meyer auf der Heide. Simple, effi-

cient shared memory simulations. In Proceedings of the 5th Annual ACM

Symposium on Parallel Algorithms and Architectures, SPAA ’93, pages

110–119, New York, NY, USA, 1993. ACM.

[DMMH05] Wilhelm Dangelmaier, Kiran Mahajan, Bengt Mueck, and Daniel Huber.

d3fact insight - simulation of huge scale models in cooperative work. In J¨org

Kr¨uger, Alexei Lisounkin, and Gerhard Schreck, editors, Proceedings of the

Industrial Simulation Conference, ISC ’05, pages 150–158. EUROSIS-ETI,

9 - 11 June 2005.

[DMS01] Laura Downs, Tomas M¨oller, and Carlo H. S´equin. Occlusion horizons for

driving through urban scenery. In Proceedings of the 2001 symposium on

Interactive 3D Graphics, I3D ’01, pages 121–124, New York, NY, USA,

2001. ACM.

[DN09] Philippe Decaudin and Fabrice Neyret. Volumetric billboards. Computer

Graphics Forum, 28(8):2079–2089, 2009.

[DS10] Dominic Dumrauf and Tim S¨uß. On the complexity of local search for

weighted standard set problems. In Proceedings of the 6th Conference on

Computability in Europe, pages 132–140, 30 June - 4 July 2010.

[DY08] Steve Dominick and Ruigang Yang. Anywhere pixel router. In Proceed-

ings of the 5th ACM/IEEE International Workshop on Projector Camera

Systems, PROCAMS ’08, pages 7:1–7:2, New York, NY, USA, 2008. ACM.

[EMB01] Carl Erikson, Dinesh Manocha, and William V. Baxter, III. Hlods for

faster display of large static and dynamic environments. In Proceedings of

the 2001 Symposium on Interactive 3D graphics, I3D ’01, pages 111–120,

New York, NY, USA, 2001. ACM.

[EP08] Stefan Eilemann and Renato Pajarola. Direct send compositing for parallel

sort-last rendering. In ACM SIGGRAPH ASIA 2008 courses, SIGGRAPH

Asia ’08, pages 39:1–39:8, New York, NY, USA, 2008. ACM.

[FLH+07] Matthias Fischer, Christoph Laroque, Daniel Huber, Jens Krokowski,

Bengt Mueck, Michael Kortenjan, Mark Aufenanger, and Wilhelm Dan-

gelmaier. Interactive refinement of a material flow simulation model by

comparing multiple simulation runs in one 3d environment. In Proceedings

of the European Simulation and Modelling Conference, ESM ’07, pages

499–505. EUROSIS, October 2007.

132

Bibliography

[Fly72] Michael J. Flynn. Some computer organizations and their effectiveness.

IEEE Transactions on Computers, 21(9):948–960, September 1972.

[FRL+10] Matthias Fischer, Hendrik Renken, Christoph Laroque, Guido Schaumann,

and Wilhelm Dangelmaier. Automated 3d-motion planning for ramps and

stairs in intra-logistics material flow simulations. In Proceedings of the

2010 Winter Simulation Conference, WSC ’10, pages 1648 – 1660. IEEE,

Omnipress, 5 - 8 December 2010.

[Fuj98] Richard M. Fujimoto. Parallel and distributed simulation. In Jerry Banks,

editor, Handbook of Simulation, pages 429–464. John Wiley & Sons, 1998.

[GBK06] Michael Guthe, ´

Akos Bal´azs, and Reinhard Klein. Near optimal hierar-

chical culling: Performance driven use of hardware occlusion queries. In

T. Akenine-M¨oller and W. Heidrich, editors, Proceeding of the Eurograph-

ics Symposium on Rendering, pages 207–214. The Eurographics Associa-

tion, June 2006.

[GBSF05] Anselm Grundh¨ofer, Benjamin Brombach, Robert Scheibe, and Bernd

Fr¨ohlich. Level of detail based occlusion culling for dynamic scenes. In

Proceedings of the 3rd International conference on Computer Graphics and

Interactive Techniques in Australasia and South East Asia, GRAPHITE

’05, pages 37–45, New York, NY, USA, 2005. ACM.

[GFB10] Johannes Ghiletiuc, Markus F¨arber, and Beat Br´uderlin. A Highly Scal-

able Image-Based Remote Rendering Framework. In J¨urgen Gausemeier

and Michael Grafe, editors, Augmented & Virtual Reality in der Produk-

tentstehung, HNI-Verlagsschriftenreihe, Paderborn, pages 317–330. Uni-

versit¨at Paderborn, HNI Verlagsschriftenreihe, Paderborn, 2010.

[GH91] Peter W. Glynn and Philip Heidelberger. Analysis of parallel replicated

simulations under a completion time constraint. ACM Transactions on

Modeling and Computer Simulations, 1:3–23, January 1991.

[GH99] Stefan Gumhold and Tobias H¨uttner. Multiresolution rendering with dis-

placement mapping. In Proceedings of the ACM SIGGRAPH/Eurographics

Workshop on Graphics Hardware, HWWS ’99, pages 55–66, New York,

NY, USA, 1999. ACM.

[GM05] Enrico Gobbetti and Fabio Marton. Far voxels: a multiresolution frame-

work for interactive rendering of huge complex 3d models on commodity

graphics platforms. ACM Transactions on Graphics, 24(3):878–885, July

2005.

[GMBP10] Prashant Goswami, Maxim Makhinya, Jonas B¨osch, and Renato Pajarola.

Scalable parallel out-of-core terrain rendering. In Proceedings of the Euro-

graphics Symposium on Parallel Graphics and Visualization, EGPGV ’10,

133

Bibliography

pages 63–71, Norrk¨oping, Sweden, May 2010. Eurographics Association,

Eurographics Association.

[GSYM03] Naga K. Govindaraju, Avneesh Sud, Sung-Eui Yoon, and Dinesh Manocha.

Interactive visibility culling in complex environments using occlusion-

switches. In Proceedings of the 2003 Symposium on Interactive 3D graphics,

I3D ’03, pages 103–112, New York, NY, USA, 2003. ACM.

[HDD+93] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and

Werner Stuetzle. Mesh optimization. In Proceedings of the 20th An-

nual Conference on Computer Graphics and Interactive Techniques, SIG-

GRAPH ’93, pages 19–26, New York, NY, USA, 1993. ACM.

[HE05] Martin Held and Johannes Eibl. Biarc approximations of polygons within

asymmetric tolerance bands. In Computer-Aided Design, volume 37, pages

357–371, 2005.

[Hop96] Hugues Hoppe. Progressive meshes. In Proceedings of the 23rd An-

nual Conference on Computer Graphics and Interactive Techniques, SIG-

GRAPH ’96, pages 99–108, New York, NY, USA, 1996. ACM.

[HSLM02] Karl Hillesland, Brian Salomon, Anselmo Lastra, and Dinesh Manocha.

Fast and simple occlusion culling using hardware-based depth queries.

Technical report, Department of Computer Science, University of North

Carolina at Chapel Hill, 2002.

[HWC10] Mark Howison, Bethel E. Wes, and Hank Childs. MPI-hybrid Parallelism

for Volume Rendering on Large, Multi-core Systems. In Proceedings of the

Eurographics Symposium on Parallel Graphics and Visualization, EGPGV

’10, Norrk¨oping, Sweden, May 2010. Eurographics Association, Eurograph-

ics Association.

[JW02] Stefan Jeschke and Michael Wimmer. Textured depth meshes for real-time

rendering of arbitrary scenes. In Proceedings of the 13th Eurographics work-

shop on Rendering, EGRW ’02, pages 181–190, Aire-la-Ville, Switzerland,

Switzerland, 2002. Eurographics Association.

[KBF05] David J. Kasik, William Buxton, and David R. Ferguson. Ten CAD chal-

lenges. IEEE Computer Graphics and Applications, 25:81–92, March 2005.

[KDG+08] David Kasik, Andreas Dietrich, Enrico Gobbetti, Fabio Marton, Dinesh

Manocha, Philipp Slusallek, Abe Stephens, and Sung-Eui Yoon. Massive

model visualization techniques: course notes. In ACM SIGGRAPH 2008

classes, SIGGRAPH ’08, pages 40:1–40:188, New York, NY, USA, 2008.

ACM.

134

Bibliography

[KKF+02] Jan Klein, Jens Krokowski, Matthias Fischer, Michael Wand, Rolf Wanka,

and Friedhelm Meyer auf der Heide. The randomized sample tree: a data

structure for interactive walkthroughs in externally stored virtual environ-

ments. In Proceedings of the ACM symposium on Virtual Reality Software

and Technology, VRST ’02, pages 137–146, New York, NY, USA, 2002.

ACM.

[Kle97] Reinhard Klein. Multiresolution representations for surfaces meshes. In

Proceedings of the Spring Conference on Computer Graphics 1997, SCCG

’97, pages 57–66, 1997.

[KM00] Vineet R. Kamat and Julio C. Martinez. 3d visualization of simulated

construction operations. In Proceedings of the 2000 Winter Simulation

Conference, WSC ’00, pages 1933–1937, San Diego, CA, USA, 2000. Soci-

ety for Computer Simulation International.

[KPH+10] Wes Kendall, Tom Peterka, Jian Huang, Han-Wei Shen, and Robert Ross.

Accelerating and benchmarking radix-k image compositing at large scale.

In Proceedings of Eurographics Symposium on Parallel Graphics and Vi-

sualization, EGPGV ’10, pages 101–110. Eurographics Association, Euro-

graphics Association, May 2010.

[KS00] James T. Klosowski and Cl´audio T. Silva. The prioritized-layered projec-

tion algorithm for visible set estimation. IEEE Transactions on Visualiza-

tion and Computer Graphics, 6:108–123, April 2000.

[KTI+01] Tomomichi Kaneko, Toshiyuki Takhei, Masahiko Inami, Naoki Kawakami,

Yasuyuki Yanagida, Taro Maeda, and Susumu Tachi. Detailed shape rep-

resentation with parallax mapping. In Proceedings of the 11th Interna-

tional Conference on Artificial Reality and Telexistence, ICAT ’01, pages

205–208, 2001.

[Lai05] Samuli Laine. A general algorithm for output-sensitive visibility prepro-

cessing. In Proceedings of the 2005 Symposium on Interactive 3D Graphics

and Games, I3D ’05, pages 31–40, New York, NY, USA, 2005. ACM.

[LE97] David Luebke and Carl Erikson. View-dependent simplification of arbi-

trary polygonal environments. In Proceedings of the 24th Annual Confer-

ence on Computer Graphics and Interactive Techniques, SIGGRAPH ’97,

pages 199–208, New York, NY, USA, 1997. ACM Press/Addison-Wesley

Publishing Co.

[LK00] Averill M. Law and W. David Kelton. Simulation Modeling and Analysis.

McGraw-Hill, 2000.

135

Bibliography

[LSCO03] Tommer Leyvand, Olga Sorkine, and Daniel Cohen-Or. Ray space fac-

torization for from-region visibility. ACM Transactions on Graphics,

22(3):595–604, 2003.

[Lud10] Ludic. Graphics task distribution performance scaling without compro-

mise. Technical report, Ludic, September 2010. whitepaper.

[Lue01] David P. Luebke. A developer’s survey of polygonal simplification algo-

rithms. IEEE Computer Graphics and Applications, 21:24–35, May 2001.

[LWC+02] David Luebke, Benjamin Watson, Jonathan D. Cohen, Martin Reddy, and

Amitabh Varshney. Level of Detail for 3D Graphics. Elsevier Science Inc.,

New York, NY, USA, 2002.

[MBDM97] John S. Montrym, Daniel R. Baum, David L. Dignam, and Christopher J.

Migdal. Infinitereality: a real-time graphics system. In Proceedings of

the 24th Annual Conference on Computer Graphics and Interactive Tech-

niques, SIGGRAPH ’97, pages 293–302, New York, NY, USA, 1997. ACM

Press/Addison-Wesley Publishing Co.

[MBM+01] Michael Meißner, Dirk Bartz, Gordon M¨uller, Tobias H¨uttner, and Jens

Einighammer. Generation of subdivision hierarchies for efficient occlusion

culling of large polygonal models. Technical Report TUBSCG-1999-01,

Technical University of Braunschweig, 2001.

[MBW08] Oliver Mattausch, Jiˇr´ı Bittner, and Michael Wimmer. CHC++: Coherent

hierarchical culling revisited. Computer Graphics Forum, 27(2):221–230,

April 2008.

[MBWW07] Oliver Mattausch, Jiˇr´ı Bittner, Peter Wonka, and Michael Wimmer. Opti-

mized subdivisions for preprocessed visibility. In Proceedings of the Graph-

ics Interface, GI ’07, pages 335–342, New York, NY, USA, 2007. ACM.

[MCEF94] Steve Molnar, Michael Cox, David Ellsworth, and Henry Fuchs. A sorting

classification of parallel rendering. IEEE Computer Graphics and Appli-

cations, 14:23–32, July 1994.

[MCEF08] Steve Molnar, Michael Cox, David Ellsworth, and Henry Fuchs. A sort-

ing classification of parallel rendering. In ACM SIGGRAPH ASIA 2008

courses, SIGGRAPH Asia ’08, pages 35:1–35:11, New York, NY, USA,

2008. ACM.

[MOM+01] Shigeru Muraki, Masato Ogata, Kwan-Liu Ma, Kenji Koshizuka, Kagenori

Kajihara, Xuezhen Liu, Yasutada Nagano, and Kazuro Shimokawa. Next-

generation visual supercomputing using pc clusters with volume graphics

hardware devices. In Proceedings of the 2001 ACM/IEEE Conference on

Supercomputing (CDROM), Supercomputing ’01, pages 51–51, New York,

NY, USA, 2001. ACM.

136

Bibliography

[MPHK94] Kwan-Liu Ma, James S. Painter, Charles D. Hansen, and Michael F.

Krogh. Parallel volume rendering using binary-swap compositing. IEEE

Computer Graphics and Applications, 14:59–68, July 1994.

[MPHK08] Kwan-Liu Ma, James S. Painter, Charles D. Hansen, and Michael F.

Krogh. Parallel volume rendering using binary-swap image composition.

In ACM SIGGRAPH ASIA 2008 courses, SIGGRAPH Asia ’08, pages

38:1–38:9, New York, NY, USA, 2008. ACM.

[NB04] Shaun Nirenstein and Edwin H. Blake. Hardware accelerated visibility

preprocessing using adaptive sampling. In Proceedings of the 15th Eu-

rographics Symposium on Rendering, Rendering Techniques 2004, pages

207–216. The Eurographics Association, 2004.

[NBG02] S. Nirenstein, E. Blake, and J. Gain. Exact from-region visibility culling.

In Proceedings of the 13th Eurographics Workshop on Rendering, EGRW

’02, pages 191–202. Eurographics Association, 2002.

[OU06] Tetsuro Ogi and Takaya Uchino. Dynamic load-balanced rendering for a

cave system. In Proceedings of the ACM Symposium on Virtual Reality

Software and Technology, VRST ’06, pages 189–192, New York, NY, USA,

2006. ACM.

[Pac96] Peter S. Pacheco. Parallel programming with MPI. Morgan Kaufmann

Publishers Inc., San Francisco, CA, USA, 1996.

[Paj08] Renato Pajarola. Cluster parallel rendering. In ACM SIGGRAPH ASIA

2008 courses, SIGGRAPH Asia ’08, pages 34:1–34:12, New York, NY,

USA, 2008. ACM.

[PF05] Matt Pharr and Randima Fernando. GPU Gems 2: Programming tech-

niques for high-performance graphics and general-purpose computation.

Addison-Wesley Professional, first edition, 2005.

[Res09] Jon Peddie Research. Multi gpus: Needs, issues, and opportunities. Tech-

nical report, Jon Peddie Research, 2009. whitepaper.

[Ros05] Randi J. Rost. OpenGL(R) Shading Language (2nd Edition). Addison-

Wesley Professional, 2005.

[RP05] Timothy Roden and Ian Parberry. Portholes and planes: faster dynamic

evaluation of potentially visible sets. Computers in Entertainment, 3:3–3,

April 2005.

[RRR06] Marcus Roth, Patrick Riess, and Dirk Reiners. Load balancing on cluster-

based multi projector display systems. In Proceedings of the International

Conference in Central Europe on Computer Graphics, Visualization and

Computer Vision, WSCG ’06, pages 55–62, 2006.

137

Bibliography

[SBS06] Dirk Staneker, Dirk Bartz, and Wolfgang Straßer. Occlusion-driven scene

sorting for efficient culling. In Proceedings of the 4th International Confer-

ence on Computer Graphics, Virtual Reality, Visualisation and Interaction

in Africa, AFRIGRAPH ’06, pages 99–106, New York, NY, USA, 2006.

ACM.

[SDDS00] Gernot Schaufler, Julie Dorsey, Xavier Decoret, and Fran¸cois X. Sillion.

Conservative volumetric visibility with occluder fusion. In Proceedings of

the 27th Annual Conference on Computer Graphics and Interactive Tech-

niques, SIGGRAPH ’00, pages 229–238, New York, NY, USA, 2000. ACM

Press/Addison-Wesley Publishing Co.

[SEP+01] Gordon Stoll, Matthew Eldridge, Dan Patterson, Art Webb, Steven

Berman, Richard Levy, Chris Caywood, Milton Taveira, Stephen Hunt,

and Pat Hanrahan. Lightning-2: a high-performance display subsystem

for pc clusters. In Proceedings of the 28th Annual Conference on Computer

Graphics and Interactive Techniques, SIGGRAPH ’01, pages 141–148, New

York, NY, USA, 2001. ACM.

[SFH+09] Tim S¨uß, Matthias Fischer, Daniel Huber, Christoph Laroque, and Wil-

helm Dangelmaier. Ein System zur aggregierten Visualisierung verteilter

Materialflusssimulationen. In J¨urgen Gausemeier and Michael Grafe, edi-

tors, Augmented & Virtual Reality in der Produktentstehung, volume 252,

pages 111–126. Heinz Nixdorf Institut, Universit¨at Paderborn, May 2009.

[SFL01] Rudrajit Samanta, Thomas Funkhouser, and Kai Li. Parallel rendering

with k-way replication. In Proceedings of the IEEE 2001 Symposium on

Parallel and Large-Data Visualization and Graphics, PVG ’01, pages 75 –

84, Piscataway, NJ, USA, October 2001. IEEE Press.

[SFLS00] Rudrajit Samanta, Thomas Funkhouser, Kai Li, and Jaswinder Pal Singh.

Hybrid sort-first and sort-last parallel rendering with a cluster of pcs. In

Proceedings of the ACM SIGGRAPH/Eurographics Workshop on Graphics

Hardware, HWWS ’00, pages 97–108, New York, NY, USA, August 2000.

ACM.

[SGG+00] Pedro V. Sander, Xianfeng Gu, Steven J. Gortler, Hugues Hoppe, and John

Snyder. Silhouette clipping. In Proceedings of the 27th Annual Confer-

ence on Computer Graphics and Interactive Techniques, SIGGRAPH ’00,

pages 327–334, New York, NY, USA, 2000. ACM Press/Addison-Wesley

Publishing Co.

[SHDG+09] Behzad Sajadi, Yan Huang, Pablo Diaz-Gutierrez, Sung-Eui Yoon, and

M. Gopi. A novel page-based data structure for interactive walkthroughs.

In Proceedings of the 2009 Symposium on Interactive 3D Graphics and

Games, I3D ’09, pages 23–29, New York, NY, USA, 2009. ACM.

138

Bibliography

[SJF10] Tim S¨uß, Claudius J¨ahn, and Matthias Fischer. Asynchronous parallel

reliefboard computation for scene object approximation. In Proceedings

of the Eurographics Symposium on Parallel Graphics and Visualization,

EGPGV ’10, pages 43–51, Norrk¨oping, Sweden, May 2010. Eurographics

Association, Eurographics Association.

[SKJ+11] Tim S¨uß, Clemens Koch, Claudius J¨ahn, Matthias Fischer, and Fried-

helm Meyer auf der Heide. Ein paralleles Out-of-Core Renderingsystem

f¨ur Standard-Rechnernetze. In J¨urgen Gausemeier, Michael Grafe, and

Friedhelm Meyer auf der Heide, editors, Augmented & Virtual Reality in

der Produktentstehung, volume 295 of HNI-Verlagsschriftenreihe, Pader-

born, pages 185–197. Heinz Nixdorf Institut, Universit¨at Paderborn, May

2011.

[SKJF11] Tim S¨uß, Clemens Koch, Claudius J¨ahn, and Matthias Fischer. Approxi-

mative occlusion culling using the hull tree. In Proceedings of the Graphics

Interface 2011, pages 79–86. Canadian Human-Computer Communications

Society, May 2011.

[SLS+96] Jonathan Shade, Dani Lischinski, David H. Salesin, Tony DeRose, and

John Snyder. Hierarchical image caching for accelerated walkthroughs of

complex environments. In Proceedings of the 23rd Annual Conference on

Computer Graphics and Interactive Techniques, SIGGRAPH ’96, pages

75–82, New York, NY, USA, 1996. ACM.

[SSLR05] Steffen Strassburger, Thomas Schulze, Marco Lemessi, and Gordon D.

Rehn. Temporally parallel coupling of discrete simulation systems with

virtual reality systems. In Proceedings of the 2005 Winter Simulation

Conference, WSC ’05, pages 1949–1957. Winter Simulation Conference,

2005.

[SSPP09] Tobias Schumacher, Tim S¨uß, Christian Plessl, and Marco Platzner. Com-

munication performance characterization for reconfigurable accelerator de-

sign on the XD1000. In Proceedings of the International Conference on Re-

ConFigurable Computing and FPGAs (ReConFig), 9 - 11 December 2009.

[SSPP11] Tobias Schumacher, Tim S¨uß, Christian Plessl, and Marco Platzner. Fpga

acceleration of communication-bound streaming applications: Architec-

ture modeling and a 3d image compositing case study. International Jour-

nal of Reconfigurable Computing, 2011:1–11, 2011. Article ID 760954.

[Ste96a] Volker Stemann. Parallel balanced allocations. In Proceedings of the eighth

annual ACM symposium on Parallel algorithms and architectures, SPAA

’96, pages 261–269, New York, NY, USA, 1996. ACM.

139

Bibliography

[Ste96b] Volker Stemann. Parallel balanced allocations. In Proceedings of the 8th

Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA

’96, pages 261–269, New York, NY, USA, 1996. ACM.

[SWF10a] Tim S¨uß, Timo Wiesemann, and Matthias Fischer. Evaluation of a c-

load-collision-protocol for load-balancing in interactive environments. In

Proceedings of the 5th IEEE International Conference on Networking, Ar-

chitecture, and Storage, pages 448 – 456. IEEE Computer Society, IEEE

Press, 15 - 17 July 2010.

[SWF10b] Tim S¨uß, Timo Wiesemann, and Matthias Fischer. Gewichtetes

c-Collision-Protokoll zur Balancierung eines parallelen Out-of-Core-

Renderingsystems. In J¨urgen Gausemeier and Michael Grafe, edi-

tors, Augmented & Virtual Reality in der Produktentstehung, HNI-

Verlagsschriftenreihe, Paderborn, pages 39–52. Universit¨at Paderborn,

HNI Verlagsschriftenreihe, Paderborn, 2010.

[SZF+99] Rudrajit Samanta, Jiannan Zheng, Thomas Funkhouser, Kai Li, and

Jaswinder Pal Singh. Load balancing for multi-projector rendering sys-

tems. In Proceedings of the ACM SIGGRAPH/Eurographics Workshop

on Graphics Hardware, HWWS ’99, pages 107–116, New York, NY, USA,

1999. ACM.

[SZL92] William J. Schroeder, Jonathan A. Zarge, and William E. Lorensen. Dec-

imation of triangle meshes. In Proceedings of the 19th Annual Conference

on Computer Graphics and Interactive Techniques, SIGGRAPH ’92, pages

65–70, New York, NY, USA, 1992. ACM.

[TIH03] Akira Takeuchi, Fumihiko Ino, and Kenichi Hagihara. An improvement on

binary-swap compositing for sort-last parallel rendering. In Proceedings of

the 2003 ACM Symposium on Applied Computing, SAC ’03, pages 996–

1002, New York, NY, USA, 2003. ACM.

[TS91] Seth J. Teller and Carlo H. S´equin. Visibility preprocessing for interactive

walkthroughs. SIGGRAPH Computer Graphics, 25:61–70, July 1991.

[Tur92] Greg Turk. Re-tiling polygonal surfaces. SIGGRAPH Computer Graphics,

26:55–64, July 1992.

[USKS06] Tam´as Umenhoffer, L´aszl´o Szirmay-Kalos, and G´abor Szij´art´o. Spherical

billboards and their application to rendering explosions. In Proceedings of

Graphics Interface, GI ’06, pages 57–63, Toronto, Ont., Canada, Canada,

2006. Canadian Information Processing Society.

[VM02] Gokul Varadhan and Dinesh Manocha. Out-of-core rendering of massive

geometric environments. In Proceedings of the IEEE Visualization, VIS

’02, pages 69–76, Washington, DC, USA, 2002. IEEE Computer Society.

140

Bibliography

[Wat99] Alan H. Watt. 3D Computer Graphics. Addison-Wesley Longman Pub-

lishing Co., Inc., Boston, MA, USA, 3rd edition, 1999.

[Wel04] Terry Welsh. Parallax mapping with offset limiting: A per-pixel approxi-

mation of uneven surfaces, 2004.

[Whi92] Scott Whitman. Multiprocessor Methods for Computer Graphics Render-

ing. A. K. Peters, Ltd., Natick, MA, USA, 1992.

[Wie10] Timo Wiesemann. Effektivit¨atsanalyse des c-Collision Protokolls als

gewichteter Datenbalancierer in einem parallelen Out-of-Core Renderer.

Diploma thesis, University of Paderborn, 2010.

[WLLB97] Harvey J. Wassermann, Olaf M. Lubeck, Yong Luo, and Federico Bassetti.

Performance evaluation of the sgi origin2000: a memory-centric character-

ization of lanl asci applications. In Proceedings of the 1997 ACM/IEEE

Conference on Supercomputing (CDROM), Supercomputing ’97, pages 1–

11, New York, NY, USA, 1997. ACM.

[WWT+03] Lifeng Wang, Xi Wang, Xin Tong, Stephen Lin, Shimin Hu, Baining Guo,

and Heung-Yeung Shum. View-dependent displacement mapping. ACM

Transactions on Graphics, 22:334–339, 2003.

[XPQS06] Hua Xiong, Haoyu Peng, Aihong Qin, and Jiaoying Shi. Parallel occlusion

culling on gpus cluster. In Proceedings of the 2006 ACM International

Conference on Virtual reality Continuum and its Applications, VRCIA

’06, pages 19–26, New York, NY, USA, 2006. ACM.

[YSK+02] Shuntaro Yamazaki, Ryusuke Sagawa, Hiroshi Kawasaki, Katsushi Ikeuchi,

and Masao Sakauchi. Microfacet billboarding. In Proceedings of the 13th

Eurographics Workshop on Rendering, EGRW ’02, pages 169–180, Aire-

la-Ville, Switzerland, Switzerland, 2002. Eurographics Association.

141