A Novel Approach to Interactive, Distributed
Visualization and Simulation
on Hybrid Cluster Systems
Dissertation
by
Stefan Lietsch
Thesis submitted for the degree of
Doctor of Natural Sciences
Faculty of Electrical Engineering, Computer Science and Mathematics
of the Universität Paderborn
Paderborn, July 2008
Prepared at:
Paderborn Center for Parallel Computing - PC²
Fürstenallee 11
33102 Paderborn
Date of the oral examination:
13 October 2008
Reviewers:
Prof. Dr. Odej Kao, Technische Universität Berlin
Prof. Dr. Marco Platzner, Universität Paderborn
For my father and my family,
may this work give new strength.
For Anna,
who is always there for me.
Acknowledgment
I would like to thank:
my advisor Prof. Odej Kao for supporting and inspiring me for many years. I
have benefited a lot from his experience and always found a sympathetic ear for
my problems and questions.
Prof. Marco Platzner for giving me valuable advice and for agreeing without
hesitation to serve as co-examiner and head of the commission for my dissertation.
Dr. Jens Simon for helping me to find my place in the scientific world and for
many insightful discussions and technical advice.
Dr. Jan Berssenbrügge, Dr. Christoph Laroque, Henning Zabel and all other
members of the VisSim project group for being very inspiring partners and for
helping to develop and publish many ideas of this thesis.
Prof. Burckhard Monien, Prof. Holger Karl and my colleagues at the PC² for
providing me with an excellent research environment. Special thanks go to the
technical staff for helping me out even with the most short-notice and unorthodox
requests.
Dr. Oliver Marquardt for his invaluable support in technical and programming
related questions.
the many students that helped to realize our ideas.
my family, Anna and her family and all my friends for their continued support
and encouragement.
Abstract
The introduction of hybrid cluster systems, which consist of several heterogeneous
groups of homogeneous nodes (e.g. computing nodes, visualization nodes,
hardware-accelerated nodes), marks a paradigm shift in the usage of cluster systems. For the
first time, the powerful hardware is open to applications other than classical high
performance computing (HPC) applications, such as number crunching and massively
parallel simulations. Various new fields of science and industry could use it as a
powerful new tool for simulation and visualization. Interactive simulations and their
visualization, as well as Virtual Reality applications, are areas where HPC on hybrid
cluster systems can bring significant benefits in many respects, for example realism,
multiple comparative simulations and multi-user support. But to enable, and later
ease, the utilization of complex cluster systems for such simulations, tools to
orchestrate and access these resources are needed. Existing applications should be
able to run on and benefit from the new hardware quickly, with little effort and
few changes to the source code. Currently there are only a few systems that (partially)
fulfill these tasks. Most of the existing ones, built for traditional HPC applications,
lack important features needed to provide the desired interactivity and flexibility.
This thesis addresses two main aspects of this problem and introduces concepts for
the computational steering (CS) and the remote visualization (RV) of interactive simu-
lations and Virtual Reality (IS/VR) applications on hybrid cluster systems. The main
focus is to conserve the interactivity and user-integration of these applications and,
at the same time, dramatically extend their features by harvesting the power of the
hybrid cluster architecture. Thus, the main contribution of this thesis is the introduc-
tion of two new subsystems, one for the steering and orchestration of the application’s
distributed components (CS) and one for the remote access to the interactive graphical
applications running on the cluster (RV). The concept for the CS framework is based on
three new models for the steering of IS/VR applications that differ significantly from
the model used for traditional CS. Apart from the original idea that a running
simulation is observed and controlled by a connected visualization, many parts of traditional
CS needed to be redesigned and adapted to the needs of IS/VR. Interactivity and
flexibility in particular played a big role in the conceptual design of the system. The
proposed framework and its implementation are tested and evaluated by realizing
distributed versions of two existing IS/VR applications, which further shows that these
applications gain greatly in flexibility, performance and quality for the users.
Thereafter, the introduction of a framework for RV gives the users graphical access to the
applications running on the cluster system. Again, the developed system takes the
idea of traditional RV (e.g. for remote administration purposes) and transfers it to the
domain of IS/VR, where interactivity and high quality are of the highest importance. To
achieve these goals, the developed system is the first of its kind to make use of
powerful graphics cards for image compression. Until now, these GPUs (Graphics Processing
Units), built into most existing hybrid clusters, were used almost exclusively for
the rendering of the results of HPC simulations. Only recently have these processors
become available for general-purpose computing through the introduction of new
and universal APIs. Various benchmarks show that the performance of the
remote visualization is thereby improved in many cases. Additionally, the framework
is able to use the flexibility that clusters provide, to allow the remote access to multiple,
simultaneously running applications on the cluster by multiple users.
All in all, a novel approach to the effective use of hybrid clusters for interactive
simulations and VR applications is presented, and the findings are substantiated by
exemplary applications and various benchmarks.
Zusammenfassung (Summary)
The advent of hybrid cluster systems, which consist of heterogeneous groups of
homogeneous nodes (e.g. compute nodes, visualization nodes or hardware-accelerated
nodes), marks a turning point in the general applicability of cluster systems. For the
first time, these powerful resources are available not only to typical High Performance
Computing (HPC) applications such as number crunching and massively parallel
simulations, but also to new applications. A wide range of users from research and
industry can now use these systems as a powerful tool for simulating and visualizing
their problems. Interactive simulations and Virtual Reality applications are areas to
which High Performance Computing on hybrid cluster systems brings great advantages
and new possibilities. Examples are an improved degree of realism, the ability to run
and compare several simulation runs at the same time, or the realization of dynamic
multi-user scenarios. However, to enable and ease the use of such complex cluster
systems for these applications, tools are needed that realize the flexible coupling of
the distributed components and the access to the overall system. Existing applications
should run on the new hardware quickly and easily and benefit from it, if possible
without major changes to code and design. Currently there are only very few systems
that fulfill these tasks even partially. Existing tools for traditional HPC lack
important features needed to guarantee the required interactivity and flexibility.
This thesis therefore deals with the two most important aspects of this problem and
introduces concepts for the computational steering (CS) and the remote visualization
(RV) of interactive simulations and Virtual Reality (IS/VR) applications on hybrid
cluster systems. The main focus lies on preserving the interactivity and user
integration of the applications while extending and improving their capabilities by
exploiting the capacities of hybrid clusters. This gave rise to the main contributions
of this thesis: the development and implementation of two novel middleware systems,
one for the steering and instrumentation of the distributed components (CS) and one
for the remote access (RV) to the interactive graphical applications on the hybrid
cluster system. The concept for the CS framework is based on three new models for the
steering of IS/VR applications, which differ significantly from the model for steering
traditional HPC simulations. Apart from the original idea that a running simulation
can be observed and controlled through a connected visualization, many parts of
traditional CS have to be redesigned and adapted to the requirements of IS/VR.
Interactivity and flexibility in particular played a major role in the conceptual
design of the system. The concept of the developed framework and its implementation
are tested and evaluated by realizing distributed versions of two existing IS/VR
applications. Scenarios furthermore show that the example applications gain in
performance, flexibility and quality for the users. The second framework (RV for
IS/VR) gives the user graphical access to the applications running on the cluster.
Again, the traditional approaches to remote visualization (e.g. administration of
remote servers) were taken up and adapted to the requirements of the new applications,
for which interactivity and visualization quality play a major role. To meet these
requirements, the developed framework is the first of its kind to use the powerful
graphics cards built into many hybrid clusters for image compression. Until now, these
hardware resources were used almost exclusively for the graphical display of the
results of the HPC simulation. Only recently have these powerful processors become
usable for general purposes through the introduction of universal APIs. Various
measurements show that using the GPUs (Graphics Processing Units) for image
compression makes a performance gain possible in many areas. This in turn helps to
preserve the important interactivity even over longer distances and over networks with
low bandwidth. In addition, the developed framework uses the flexibility that hybrid
clusters offer, for example to give multiple users remote access to several
simultaneously running simulations.
In summary, this thesis describes a new approach to the practical use of hybrid
cluster systems for interactive simulations and Virtual Reality applications and
supports the results with measurements and the implementation of exemplary
applications.
Contents
1 Introduction
1.1 Problem Description
1.2 Areas of Interest
1.3 Goals and Contributions
1.4 Thesis Structure
2 Foundations
2.1 Clusters Now and Then
2.1.1 The Rise of Clusters
2.1.2 Hybrid Cluster Systems
2.1.3 The Arminius Hybrid Cluster at PC²
2.2 Software for Clusters
2.2.1 Application - Simulation
2.2.1.1 Systems, Models and Simulation
2.2.1.2 Distributed Simulations on Clusters
2.2.2 Application - Interactive Simulations and VR
2.2.2.1 Definitions
2.2.2.2 IS/VR on Hybrid Clusters
2.2.2.3 The VisSim Project
2.2.3 Subsystems for Traditional and Hybrid Clusters
2.3 Hardware APIs and NVIDIA CUDA
2.3.1 CUDA - Architecture
2.3.2 CUDA - Programming Model
2.3.3 CUDA - API and Interoperability
3 Related Work
3.1 Computational Steering
3.1.1 Existing Systems for Traditional CS
3.1.2 Approaches towards CS for Interactive Simulations
3.2 Remote Visualization
3.2.1 Classes of Remote Visualization
3.2.1.1 Client-Side Rendering
3.2.1.2 Server-Side Rendering for 2D and Administration
3.2.1.3 Server-Side Rendering for 3D Applications
3.2.2 Limitations and Problems
4 Two Essential Subsystems for IS/VR on Hybrid Cluster Systems
4.1 Traditional Usage of Hybrid Cluster Systems
4.2 Hybrid Cluster Systems for IS/VR
4.3 CS for IS/VR - Idea and Requirements
4.4 Advanced RV Techniques for IS/VR - Idea and Requirements
5 Computational Steering of IS/VR
5.1 A Concept for Extended Computational Steering of IS/VR
5.1.1 New CS Models
5.1.1.1 Collaborative Computational Steering
5.1.1.2 Synchronized Computational Steering
5.1.1.3 Concurrent Computational Steering
5.1.1.4 Combinations of the Models
5.1.2 Requirements
5.1.3 Conceptual Design and Classifications
5.1.3.1 Actors in Computational Steering
5.1.3.2 Classification of Simulation Data
5.1.3.3 Passing of Volatile State Data
5.1.3.4 Scheduling of Volatile State Data
5.1.3.5 Dynamic Mapping of Volatile State Data
5.1.3.6 Decoupled Handling of Shared Parameters
5.1.4 Architecture of the Framework
5.1.4.1 Communication Server
5.1.4.2 Publish/Subscribe Service
5.1.4.3 Exemplary Workflow
5.1.5 Consistency, Performance Estimation and Latency Optimization
5.1.5.1 Best Effort and Dropping of Latecomers
5.1.5.2 Clustering of Communication Servers
5.2 Prototype - The CSIS Framework
5.2.1 Commuvit
5.2.1.1 Architecture
5.2.1.2 Variable Mapping
5.2.2 D-Bus
5.2.3 The CSIS Server - Integrating Commuvit and D-Bus
5.2.4 The CSIS Steering Library
5.3 Computational Steering of a Distributed Driving Simulator
5.3.1 The Virtual Night Drive Simulator
5.3.2 Modularizing the VND Simulator
5.3.2.1 The Input Component
5.3.2.2 The Simulation Component
5.3.2.3 The Visualization Component
5.3.2.4 The Audio Component
5.3.3 Exemplary Setup of the VND with CS
5.3.3.1 Example for the Passing of Volatile State Data
5.3.3.2 Example for the Passing of Shared Parameters
5.3.4 CS-Enhanced Virtual Night Drive - Conclusion
5.3.5 Distributed Shader-Based Visualization through CS
5.4 Computational Steering for an Interactive Material Flow Simulation
5.4.1 The d3FACT insight Material Flow Simulation
5.4.1.1 Initialization and Execution of Simulations
5.4.1.2 Modeling with Buildingblocks and Subblocks
5.4.1.3 The Simulation, Tokens and Visualization
5.4.1.4 Communication Limitations
5.4.2 Computational Steering in d3FACT insight
5.4.2.1 Adapting the MFS Kernel and the Visualization
5.4.2.2 Interfaces
5.4.2.3 Data Exchange
5.4.3 Example Scenario
5.4.4 CS-Enhanced d3FACT insight - Conclusion
5.5 Conclusion - Computational Steering of IS/VR
6 Remote Visualization for IS/VR
6.1 Limitations of Remote Visualization for IS/VR
6.1.1 Slow Compression for High Resolutions
6.1.2 Reading Image Data From the Graphics Hardware
6.1.3 Rendering to Multiple Targets
6.1.4 Programmable Graphics Hardware and GPGPU
6.2 Invire - A Concept for an Interactive Remote Visualization System
6.2.1 The Architecture
6.2.2 Data Transfer
6.2.3 Image Readback
6.2.4 Compression
6.2.4.1 Run Length Encoding
6.2.4.2 Difference Compression with Index
6.2.4.3 Parallel Difference Compression with Index
6.2.4.4 JPEG
6.2.4.5 Parallel JPEG Compression
6.3 Remote Visualization on Hybrid Clusters
6.4 Prototype - The Invire Framework
6.4.1 Invire Plugin
6.4.1.1 Integration into Host Applications
6.4.1.2 Remote Control of the Application
6.4.2 Invire Client
6.4.2.1 Reception, Decompression and Displaying of the Frames
6.4.2.2 Controlling the Remote Application
6.4.3 Invire Library
6.4.3.1 The Frame Object
6.4.3.2 Compression/Decompression - General
6.4.3.3 Compression - CUDA-Based DIC
6.4.3.4 Decompression - CUDA-Based DIC
6.4.3.5 Compression - CUDA-Based JPEG
6.5 Benchmarking the RV Framework
6.5.1 The Reference Framework VirtualGL
6.5.2 Quality Assessment with the SSIM Index
6.5.3 The Benchmarked System and the Sample Applications
6.5.4 Benchmarking the Overall System Performance
6.5.4.1 Rotating Teapot
6.5.4.2 Virtual Night Driver
6.5.5 Benchmarking Three Different JPEG Implementations
6.5.5.1 JPEG - Compression Times
6.5.6 Quality Assessment of the Lossy Compression Methods
6.5.7 Benchmarking Conclusion
6.6 Conclusion - Remote Visualization for IS/VR
7 Conclusion
7.1 Contributions
7.2 Conclusion
7.3 Outlook
Bibliography
A Acronyms
List of Figures
1.1 Covered areas and contributions
2.1 Levels of a cluster architecture
2.2 CUDA software stack
2.3 CUDA thread model
3.1 Traditional CS
3.2 Classification of RV systems
4.1 Traditional subsystems for hybrid clusters
4.2 New subsystems for IS/VR on hybrid clusters
5.1 Components of traditional CS
5.2 Collaborative CS
5.3 Synchronized CS
5.4 Tiled Display
5.5 Concurrent CS
5.6 Combined CS models
5.7 General model of the CS framework
5.8 Scheduling in CS for IS/VR
5.9 Dynamic map for VSD
5.10 Architecture of the CS framework
5.11 Clustering of multiple communication servers
5.12 Connection handling by the Commuvit server
5.13 Exemplary Commuvit map
5.14 VND screenshot
5.15 VND on three channels
5.16 CS of the VND simulator
5.17 Composing synchronously rendered VND frames
5.18 MFS model built of subblocks
5.19 MFS without and with CS
5.20 CS of the MFS d3FACT insight
6.1 Problems of RV for IS/VR
6.2 Architecture of Invire
6.3 Run Length Encoding of RGB array
6.4 Sequential difference encoding with index
6.5 Parallel difference encoding with index
6.6 Parallel, local stream compaction
6.7 JPEG Compression
6.8 Replacing 2D DCT by combination of 1D DCTs
6.9 RV test cases
6.10 Global benchmarks: Rotating Teapot 640x480
6.11 Global benchmarks: Rotating Teapot 1024x768
6.12 Global benchmarks: Rotating Teapot 1680x1050
6.13 Global benchmarks: VND 640x480
6.14 Global benchmarks: VND 1024x768 and 1680x1050
6.15 JPEG benchmarks: Teapot and VND 640x480
6.16 JPEG benchmarks: Teapot and VND 1024x768
6.17 JPEG benchmarks: Teapot and VND 1680x1050
6.18 JPEG quality: SSIM index and compression ratio of Teapot
6.19 JPEG quality: SSIM index and compression ratio of VND
1 Introduction
For the last 40 years, Moore’s Law [55] has determined the performance of modern
computer systems. Since 1965, the prediction that the number of elements which can
be crammed on a chip of the same size, and thereby the achievable computational
power, doubles every 1-2 years, has come true. The performance gains were almost
completely achieved by shrinking the microprocessors and thus being able to increase
the internal clock frequency. Hence, it was not difficult for software developers to
increase the speed of their software: it simply ran faster on a faster machine. However,
in recent years it has become apparent that a natural limit on the clock frequency of
microprocessors has been reached: clock speeds beyond 4 GHz resulted in
unpredictable byte shifts, uncoolable chip surfaces and other major errors. That is why other
ways of increasing performance had to be found to keep up with Moore’s Law. The
most promising solution is to fit multiple homogeneous or heterogeneous cores into
one processor (so-called multi-cores or many-cores) and connect them through fast memory
or buses. Such architectures can easily fulfill Moore’s Law as long as
the number of included cores can be doubled every 18 months. That seems to be no
problem for the next few years. But from now on the software becomes the issue. For
decades, sequential programming was the standard for all kinds of software, and only
a few systems cared about parallelization, because it simply was not needed. Now that
the processor architecture is changing, the way one thinks about software has to change,
too.
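The doubling argument can be made concrete with a short calculation. The following is a minimal sketch under an idealized exponential model; the function name `projected_cores`, the starting value of 2 cores and the sampled time spans are illustrative assumptions, not figures from this thesis. Only the 18-month doubling period is taken from the text.

```python
def projected_cores(initial_cores: int, years: float,
                    doubling_period_years: float = 1.5) -> float:
    """Idealized core count after `years`, assuming the count doubles
    once per `doubling_period_years` (1.5 years = 18 months)."""
    return initial_cores * 2 ** (years / doubling_period_years)


# Hypothetical starting point: a dual-core processor.
for years in (0, 1.5, 3, 6):
    print(f"after {years:>4} years: {projected_cores(2, years):.0f} cores")
```

Under this model the core count grows by a factor of 16 within six years, which illustrates why software that cannot use additional cores stops benefiting from new hardware.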
Moore’s Law not only holds for single processors, but also for supercomputers and
clusters, where parallelism plays an even bigger role. Therefore it is important to come
up with middleware and systems that help developers and users to harvest the power
of all kinds of parallel and massively parallel hardware. This need mainly motivated
this thesis to deal with middleware for multilevel-parallel and distributed systems and
to present possible applications thereof.
1.1 Problem Description
“The problem is not hardware. It’s software!” (Gregory F. Pfister, “In Search of Clusters” [67])
This quote clearly emphasizes one of the main problems of today’s Computer Science:
powerful hardware is at hand, but the software lags behind. In some areas, such as
High Performance Computing (HPC), this is not tolerable, and thus the community
started early to develop software capable of harvesting most of the power
the hardware has to offer. Traditional applications in this field are highly parallel
simulations of physical or chemical phenomena. Besides optimizing the simulation
code itself, these applications benefit from subsystems that help them to run efficiently
in parallel on powerful cluster systems. Such subsystems, or middleware, encapsulate
special tasks like communication and data transfer over networks and implement
them in an efficient and transparent way. One famous example is the Message
Passing Interface (MPI) [50], which handles the passing of messages between the
components of a distributed or parallel application. Other examples are systems for
computational steering, which allow a user to interact with running simulations
through a visualization on sophisticated, hybrid visualization and simulation
clusters. However, nearly all of the existing subsystems are optimized for traditional
HPC applications and are poorly suited to new applications from different backgrounds.
Examples of such applications, which could also benefit from the power
of modern cluster systems, are interactive simulations and Virtual Reality applications
(IS/VR) or distributed interactive applications (DIAs) in general. The main problem
is that, in contrast to traditional HPC applications, the needs of DIAs and IS/VR
applications (e.g. (soft) real-time interactivity, multi-user input, high-end display
techniques) are not supported by common and optimized subsystems. Thus, a parallel and
distributed execution requires great effort to handle all communication and orchestration
of the components. This problem is the motivation for this thesis. In our opinion it is vital
to provide new subsystems for applications other than traditional HPC, to pave the
way for the efficient use of modern clusters’ capabilities by a broader audience.
1.2 Areas of Interest
As mentioned before, a new field of applications besides traditional HPC simulations
has turned out to need a massive amount of computational power: highly interactive
simulations and Virtual Reality (IS/VR) applications. These related groups of applications
are gaining important roles in both science and engineering and allow the simulation
of whole virtual environments instead of single phenomena. VR is used, for example,
to train people in all kinds of driving, operating and flying scenarios and is becoming
more and more realistic as new technologies emerge and existing technologies
evolve. However, these technologies often require more computational power than one
single computer can provide. A simple driving simulator can easily run on a
commodity personal computer, but if the scenario becomes more complex, e.g. by adding a
sophisticated traffic simulation or additional users in the same environment, the limits
of a single-node system are reached very quickly. Thus, the trend is towards
distributed systems and especially hybrid cluster architectures. They consist of
tightly coupled nodes that are specialized in visualization, simulation or other
tasks. These architectures allow several simulations and several visualizations
to run concurrently. In addition to the coarse-grained parallelism of hybrid clusters,
IS/VR applications can also benefit from multicore nodes or nodes with special
accelerator hardware. But to handle such complex, distributed systems, a
universal middleware is needed. This leads to the idea of reusing and extending the
techniques of computational steering and adapting them to the special needs of highly
interactive simulations and VR.
However, a second problem arises from using high performance,
hybrid cluster systems for interactive applications. Not everyone who uses such an
application can or wants to afford and operate a large hybrid cluster system. In most
cases there is only one central, universal cluster that serves all kinds of
users. This leads to the need for remote operation services. The remote
visualization of interactive applications in particular is a challenging area of research, since
vast amounts of visual data need to be transferred to a client as fast as possible.
New developments in graphics hardware make it possible to implement methods for the
compression and transmission of graphical data efficiently and without involving the
Central Processing Unit (CPU). These advances can be used to realize a fully remotely
operable platform for highly interactive simulations and VR.
1.3 Goals and Contributions
Thus, the main goal of this thesis is to design and prototypically implement software
subsystems that enable the efficient and comfortable usage of hybrid cluster systems
for interactive simulations and VR. The two main contributions of this thesis are:
1. A concept and prototypical implementation of a framework for computational
steering which is well adapted to the needs of IS/VR applications. The con-
cept evolves the idea of traditional computational steering and adds new models
and data types to fulfill the requirements of IS/VR applications. The proposed
framework, for the first time, allows the efficient coupling of multiple simula-
tions, visualizations and steering components, running on a hybrid cluster and
on multicore nodes.
2. A concept and prototypical implementation of a testing and benchmarking frame-
work for compression and grabbing techniques for the remote visualization of
IS/VR. A main focus lies on the optimization of the achievable FPS / quality ratio
under the assumption of fast but limited network bandwidth ( 10MBit/s).
The framework is used to design and implement new parallel compression al-
gorithms that make use of programmable GPUs for their computations. The
algorithms will be compared to others in this field and their advantages and
disadvantages for certain scenarios will be described.
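The trade-off between frame rate and quality under limited bandwidth mentioned in the second contribution can be made concrete with a short calculation. The following is only a sketch under stated assumptions: the frame size (1024x768 pixels, 24 bit colour), the target of 25 FPS and a link speed of exactly 10 MBit/s are illustrative values chosen here, and the function names are hypothetical and not part of the framework.

```python
def required_compression_ratio(width, height, bits_per_pixel, fps, bandwidth_bit_per_s):
    """Return how strongly frames must be compressed to fit the link."""
    raw_bits_per_second = width * height * bits_per_pixel * fps
    return raw_bits_per_second / bandwidth_bit_per_s


def achievable_fps(width, height, bits_per_pixel, compression_ratio, bandwidth_bit_per_s):
    """Return the frame rate the link can carry at a given compression ratio."""
    compressed_bits_per_frame = width * height * bits_per_pixel / compression_ratio
    return bandwidth_bit_per_s / compressed_bits_per_frame


# Assumed example values (not taken from the thesis): 1024x768, 24 bit, 10 MBit/s.
BANDWIDTH = 10_000_000
ratio = required_compression_ratio(1024, 768, 24, fps=25, bandwidth_bit_per_s=BANDWIDTH)
fps_raw = achievable_fps(1024, 768, 24, compression_ratio=1.0, bandwidth_bit_per_s=BANDWIDTH)

print(f"compression ratio needed for 25 FPS: {ratio:.1f}:1")
print(f"FPS without compression: {fps_raw:.2f}")
```

Under these assumptions an uncompressed stream stays well below one frame per second, which illustrates why compression is central to remote visualization over such links.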
Figure 1.1 depicts the different areas and topics addressed in this thesis in more detail.
The red box emphasizes the main contributions - the two new subsystems that enable
IS/VR applications to use the power of modern, hybrid cluster systems. The blue boxes
represent topics or areas that are improved by the new systems or make use of them.
This includes for example the two sample IS/VR applications Virtual Night Drive and
d3FACT insight which greatly benefit from the introduction of the CS and RV frame-
works, but also general improvements in the field of image compression and commu-
nication servers through the contributions of this thesis. The yellow boxes represent
the technical and conceptual foundations that, on the one hand, inspired the main
ideas of this thesis (CS and RV) and, on the other hand, helped to realize and implement
the new findings (e.g. CUDA/GPGPU and the Publish/Subscribe approach).
[Figure: layered diagram. Layers: Foundations from traditional HPC; Enabling technologies; Specialized subsystems for IS/VR; Applications and Benchmarks. Boxes: Computational Steering; Remote Visualization; Communication Server; Publish/Subscribe; CUDA; Image and video compression; Computational Steering and Remote Visualization for interactive simulations and Virtual Reality; Driving simulation; Material Flow simulation; Benchmarking. Legend: Foundation or utilized technology; Main Contribution / major improvement; Improvement or extension to new area.]
Figure 1.1: The areas covered in and the contributions of this thesis.
To evaluate the developed systems and to show the improvements over existing systems,
different usage scenarios and benchmarks are described. However, both frameworks
have a strong focus on interactive simulations on hybrid cluster systems. There are no
intentions to challenge existing systems for long-running simulations or homogeneous
clusters (see chapter 4).
All relevant parts of the thesis are published in the proceedings of reputable confer-
ences in the fields of computer graphics, visualization, signal processing and mechan-
ical engineering (see [42], [43], [44], [45], [46] and [47]).
1.4 Thesis Structure
The thesis is structured into seven parts. After this introduction, relevant foundations
on hybrid cluster systems and their applications are given. The main intention of this
chapter is to clarify the need for specialized middleware to make reasonable use of the
computational power of such hardware architectures. The second part of chapter 2 gives a
short statement on APIs for hardware accelerators and a deeper insight into the Compute
Unified Device Architecture (CUDA) by NVIDIA. Chapter 3 provides a brief
overview of existing frameworks in the broader field of Computational Steering and
Remote Visualization for HPC and shows their limitations. The subsequent
chapter 4 derives the need for two specialized systems for IS/VR on hybrid clusters
from those limitations and briefly introduces their requirements. In chapter 5 a new
model for the computational steering of Interactive Simulations and VR is presented
and the basic components, functionalities and data types are described. The second
part deals with the prototypical implementation of this model, which results in the
so-called CSIS (Computational Steering of IS/VR) framework. The focus lies on the
realization of the framework by combining existing and new software to provide the
desired functionality and flexibility. The last two sections of chapter 5 present two
practical IS/VR applications that greatly benefit from the introduction of the CSIS
framework and thereby help to show its potential. Chapter 6 introduces the
concept of a framework for remote visualization that is also specialized for IS/VR
running on hybrid clusters. It is the first framework of its kind that uses the computational
power of GPUs not only for rendering but also for the compression of frames.
After briefly introducing the prototypical implementation of the framework, the focus
lies on the description and implementation of the utilized compression algorithms, es-
pecially those that are GPU-supported. The last section of the chapter presents bench-
marking and quality assessment results for the compression and grabbing algorithms,
as well as for the overall system performance of the RV framework. Finally, chapter 7
lists the achieved contributions, draws an overall conclusion and gives an outlook on
possible extensions and fields of application for the developed concepts and software.
2 Foundations
This chapter establishes the foundations of the thesis. Section 2.1 deals with the devel-
opment of cluster systems over the last 10 years and the specification of hybrid cluster
systems with possible fields of usage and an exemplary setup. Section 2.2 gives a short
overview of software for clusters. After a general introduction, the focus lies
on applications and subsystems that could make use of the special features of hybrid
cluster systems. The last section of this chapter introduces a rather new technique that
allows the usage of graphics hardware as universal, parallel coprocessors. This will be
a new and innovative way to further utilize hybrid cluster systems.
2.1 Clusters Now and Then
To understand the origins and the development of clusters and hybrid clusters, this section
takes a brief look back at the world of HPC 10 years ago, followed by a description of current
developments in HPC.
2.1.1 The Rise of Clusters
The term cluster is highly unspecific and describes a lot of different systems with
a wide variety of hardware, applications and users. However, Pfister gives quite a
simple and fitting definition for a cluster in his book “In Search of Clusters” [67]:
A cluster is a type of parallel or distributed system that:
- consists of a collection of interconnected, whole computers,
- is used as a single unified computing resource.
Additionally, he lists the classic reasons for deploying clusters in the first place: performance,
availability, price/performance ratio, incremental growth and scaling. The
problem with these reasons, according to Pfister, is that they are quite generic and actually
apply to most parallel systems. So he asked why clusters were a
good solution for many problems at the time he wrote the book¹. The answers back
then were the following:
1. For the first time very high performance microprocessors are available and are
very cheap; that means they offer great price/performance ratio.
2. Also for the first time really high-speed communication techniques are available
and getting affordable. Especially optical links are very promising.
3. Since the first days of parallelism, some useful tools for distributed computing
have been developed and established to do the basic administration and handling
of clusters effectively. That is the starting point of writing sophisticated software
for clusters.
4. High availability becomes increasingly important since more and more users are
dependent on the inexpensive computing services.
But he also clearly describes the problems that still had to be solved to pave the way for
an unlimited usage of cluster systems. His two main issues are:
1. The lack of ”single system image” software. This means that all available systems
for the administration and management of open-system-based clusters are only
loosely coupled toolkits that still have a lot of compatibility and usage issues.
There is no out-of-the-box, easy-to-use software for everyone.
2. The limited exploitation of the possibilities a cluster has to offer. Only a few
subsystems of current operating systems exploit abilities such as scaling performance
and high availability. On the one hand, this stems from the difficulty of
writing parallel software at a low level; on the other hand, it raises new
problems for the design of sophisticated parallel programs.
By looking at the pros and the cons of clusters at that time, Pfister clearly pointed out:
The problem is not hardware. It’s software
Thus, what was primarily needed at that time was software that could make use of
the performance and the redundancy of the clusters. During the last ten years a lot of
such software was developed (see [6]). On the one hand the basis was created through
parallel programming frameworks such as MPI or PVM and specialized cluster oper-
ating systems, mostly derived from standard OSes such as Linux (e. g. openMosix [4])
or Windows (e. g. Windows HPC Server 2008 [53]). On the other hand many software
companies, especially in the simulation area, ported their basic software stacks to cluster
architectures. This ranges from simply starting multiple independent runs on many
nodes (capacity computing) up to massively parallel systems that rely heavily on internode
communication (capability computing). The availability of commercially developed
and maintained software enabled many large companies to use cluster hardware
for many aspects of their everyday work. Applications range from dynamics
simulation for product design and testing, through CFD simulation for aerodynamic
optimization, to geological or financial simulations for various purposes.
¹ The last edition appeared in 1998.
2.1.2 Hybrid Cluster Systems
Since Pfister wrote his book in 1998 a lot has changed, especially in the hardware
area. On the one hand cluster systems have nearly completely superseded any other
parallel computing architecture and are widely used for number crunching and high
availability applications. On the other hand the commodity hardware that clusters are
built from has evolved a lot. Not only have the CPUs and network interfaces gotten
faster by orders of magnitude, but also other components, which did not play a big role
at Pfister’s time, can now be used to enhance the functionality and performance of
modern cluster systems. One example of such a component is the Graphics Processing
Unit (GPU). Modern GPUs are highly parallel and programmable multiprocessors with
a strong focus on the processing of graphical data. But in some cases they also deliver
great performance for non-graphic computations, as shown, for example, in the GPGPU
forum [24]. The integration of inexpensive GPUs in (some of) the cluster nodes can
now serve two purposes:
1. To visualize the data that has been computed by the cluster system and thereby
to help users to better understand and explore the results.
2. To support the CPUs of the cluster nodes with the computation of appropriate
tasks.
The GPUs are not the only extension of commodity computers one can think of. An-
other interesting field for example is reconfigurable hardware (e. g. FPGAs) which can
be added to cluster nodes as flexible and affordable co-processors for special tasks.
That leads to the idea of creating hybrid cluster systems that contain nodes with spe-
cial abilities (e. g. visualizing data) or specifications (e. g. FPGA accelerated). Hybrid
cluster systems are therefore a subset of cluster systems and the extended definition
would be:
A hybrid cluster is a type of parallel or distributed system that:
- consists of a collection of interconnected, whole computers,
- has some of its nodes performing special tasks on special hardware,
- is used as a single unified computing resource.
Hybrid cluster systems have many fields of application and have many advantages
but also disadvantages compared to homogeneous cluster systems. For example, a system with
dedicated visualization nodes² allows simulation and data analysis to be done in one step
and on one cluster. Even direct interaction between simulation and visualization is
possible (as described in section 2.2.1). On the downside, the specialized nodes often
perform differently from homogeneous compute nodes and thereby could complicate task
scheduling and load balancing. But for many applications it makes sense to deploy
special tasks on special machines to either increase throughput or allow data analysis
on the fly.
So again, the main difficulty is that adequate software is needed, to efficiently make
use of the computational power the hardware provides. This is where this thesis will
have its focus. We analyze classes of software that can make use of hybrid cluster
systems and address two main issues that will allow the efficient usage of hybrid
cluster systems - computational steering and remote visualization.
2.1.3 The Arminius Hybrid Cluster at PC2
This section briefly describes one example of a modern hybrid cluster system, which
also was utilized for some of the tests and benchmarks described in this thesis. The
Arminius Cluster was installed at the Paderborn Center for Parallel Computing (PC2)
in 2005. It consists of 200 nodes for computation and 8 special nodes for visualization.
Additionally, all of the computing nodes contain basic GPUs and some of them are
equipped with FPGAs for reconfigurable computing tasks. The following gives a quick
overview of the hardware specifications:
System Configuration:
- 400 processors 64-bit INTEL Xeon
- 16 processors AMD Opteron
- 2.6 TFLOPS peak performance
- 900 GByte main memory
Compute Node Configuration (196 nodes):
- Dual INTEL Xeon 3.2 GHz EM64T
- 4 GByte main memory
- 80 GByte local disk
- NVIDIA Quadro NVS 280 PCI-e GPU
- InfiniBand Host Channel Adapter (HCA) PCI-e
Visualization Node Configuration (8 nodes):
- Dual AMD Opteron 2.2 GHz AMD64
- 8 GByte main memory
- NVIDIA Quadro FX 4500 PCI-e GPU
- 2 nodes equipped with CUDA-enabled NVIDIA GeForce 9800 GX2 PCI-e GPUs
- InfiniBand HCA PCI-e
FPGA Node Configuration (4 nodes):
- same as Compute Node Configuration
- AlphaData ADM-XP with Xilinx Virtex-II Pro FPGA
InfiniBand Switch Fabric Configuration:
- 216 port InfinIO 9200
Disk Storage Configuration:
- 5 TByte Fibre Channel RAID
- 5 TByte parallel file system
² E.g. a computer that has a powerful graphics card and is connected to a displaying device.
The Arminius cluster is used for many different tasks, which can be grouped under
the following categories:
- Batch jobs of non-parallel, standard software such as physical simulations or finite
  element methods for scientific or industrial purposes. Several jobs are started
  independently on a number of nodes and the results are collected and processed
  after the batch processing has finished. Examples: Gaussian [23] and Fluent [2]
- Traditional HPC and (massively) parallel applications such as computational
  fluid dynamics, finite element methods and molecular dynamics simulations that
  are computed in parallel on the computing nodes. The results are postprocessed
  manually and can be visualized through the visualization nodes which e.g. drive
  a stereoscopic power wall. Examples: padfem2 [9] and GROMACS [75]
- FPGA and GPU accelerated, parallel applications such as computer chess, computer
  Go or biometric searches in large databases. Those applications make use
  both of the parallel computing power and the possibility to accelerate certain
  algorithms on the special nodes. Examples: Hydra (computer chess) [15] and
  GOmputer (computer go) [68]
- Distributed, interactive simulations and VR applications such as driving simulators
  or material flow simulations, which make use of the new subsystems for
  IS/VR developed in this thesis. Examples: VND [46] and d3FACT insight [12].
This is just a quick overview of the specifications and the applications running on
the Arminius cluster. For more details on the system and its usage see [65].
2.2 Software for Clusters
As outlined before, the software that runs on clusters is often even more important and
expensive than the actual hardware. Therefore it is very important that especially for
new architectures such as hybrid clusters, software is available that allows users to uti-
lize these complex systems for their needs. There are, according to Pfister, mainly three
levels that help to generate the view and use of a cluster as a single unified computing
resource. These levels are shown in figure 2.1. The most extendable and interesting
level is the application and subsystem level. The other two levels (hardware level and
operating system level) are basically derived from the commodity hardware and oper-
ating systems of single user computers. Therefore the top level will be explored more
deeply. The levels toolkit and file system have been thoroughly studied and developed
over the last decade and there are no big differences between those for traditional
cluster systems and those for new systems like hybrid clusters. Thus the focus lies
on applications and subsystems especially for hybrid cluster systems. As a starting
point, a short overview of simulation in general is given. Afterwards we describe
possible applications for hybrid cluster systems with a special focus on interactive visualization
and simulation systems. The second part of this section briefly presents
existing subsystems for traditional clusters and shows why their usage is limited for
hybrid cluster systems.
[Figure: diagram of cluster levels. Levels: Application and Subsystem Levels; Operating System Kernel Levels; Hardware Levels. Components shown: Applications (batch, simulation); Subsystem (distributed DB, MPI); FileSystem (NFS, GPFS); Toolkit (OS extensions).]
Figure 2.1: Different levels of a cluster architecture. Focus on applications and subsystems.
2.2.1 Application - Simulation
Computer simulations date back to the Manhattan Project during World War II where,
for the first time, computers were used to simulate real world effects, in this case the
process of nuclear detonation. At that time the computational power hardly exceeded
that of a current pocket calculator. But the results helped the scientists to understand
processes that could not, or only barely, be investigated through experiments. Since
then computers have evolved steadily and helped to realize simulations of more and
more complex phenomena. With current technology it is possible to either focus on
very specific problems and come really close to reality (e.g. short sequences of com-
plex liquid or air flow simulations) or on approximating large scenes or even worlds to
study and train people in virtual surroundings (e.g. a driving or flight simulator). The
first kind of simulations is widely used in science and industry and has many appli-
cations. A few examples will be given in the following section. The second field is far
less developed and is currently undergoing a paradigm shift from needing specialized,
proprietary hardware to universal (hybrid) cluster systems. Therefore this thesis
focuses on the second field and the possibility to transfer well-established techniques
from the traditional field of simulation and visualization to this newer area.
2.2.1.1 Systems, Models and Simulation
To be able to classify both traditional and interactive simulations, the following section
gives a short introduction on system, model and simulation theory. One goal of scien-
tists is to gain knowledge through the structured study of facilities or processes in our
environment. A closed set of such facilities and processes of interest is called a system.
A definition of a system is given for example by [40]: ”A system is defined to be a
collection of entities that act and interact together toward the accomplishment of some
logical end.” Systems are classified as either discrete or continuous. In a discrete
system, state variables change instantly and are event triggered (e.g. a switch or the balance
of a bank account). In a continuous system, state variables change continuously over
time (e.g. a water level or the position of a car in motion). Systems are rarely purely
continuous or discrete. In fact, most systems have both types of state variables. These
hybrid systems are classified as continuous or discrete according to their predominant
type of state variable change.
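The difference between the two kinds of state variables can be sketched with the two examples named above. This is a minimal illustration only; the class and function names as well as all numeric values are hypothetical and chosen for this sketch.

```python
class BankAccount:
    """Discrete state: the balance changes instantly, and only when an event occurs."""

    def __init__(self, balance=0.0):
        self.balance = balance

    def handle_event(self, amount):
        # A deposit (positive) or withdrawal (negative) triggers the state change.
        self.balance += amount


def water_level(t, initial_level=2.0, inflow_rate=0.5):
    """Continuous state: the level (in m) is defined for every point in time t (in s)."""
    return initial_level + inflow_rate * t


account = BankAccount(100.0)
for amount in (50.0, -30.0):
    account.handle_event(amount)

# The continuous variable can be evaluated at arbitrary times, not only at events.
levels = [water_level(t) for t in (0.0, 0.5, 1.0)]
print(account.balance, levels)
```

Between two events the account balance does not change at all, whereas the water level has a different value at every instant.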
There are different approaches to the study of a system. The straightforward method
is to study a system directly through experiments. Though this is often possible, some
systems are simply not suited for experimental studies. In this case it becomes neces-
sary to conduct studies on a model of the system. Other than that, there are a number
of additional reasons to study the model of a system:
1. Cost reduction. For example, in the area of automotive engineering, models are
used for conceptual studies and to study the aerodynamics of a proposed design,
instead of building expensive prototypes.
2. Risk reduction and prevention of harm. An example is the model of how an in-
fectious disease would spread through a population, without the need for human
or animal test subjects.
3. Insight into not yet existing systems. One example would be the planned fusion
reactor ITER, which cannot yet be realized, but some processes can already be
simulated and the planning can be conducted accordingly.
4. Insight into inaccessible systems. This is the case for many studies in the field of
astrophysics and molecular biology.
Models of a system are either physical or mathematical. An example of a physical
model is a clay model used for the study of aerodynamics. Simple mathematical
models can be solved analytically, more complex ones can only be simulated on a
computer. A simple mathematical model is for example the modeling of an object in
motion by applying Newton’s second law of motion (F=ma). The object’s position
at any point in time can be easily determined through a simple calculation. Complex
mathematical models, such as a model for the description of fluid dynamics, often
require iterative computation of approximative solutions. An additional benefit of
computer simulation over analytical model evaluation is that, once a model has been
built and verified, it is very easy to examine variations of a system in order to find
optimizations or verify hypotheses of the model’s properties. A common definition of
what a simulation is has been given in a VDI³ directive:
Simulation is the recreation of a dynamic process in a model, in order to
gain valid insight on reality.⁴
Within this definition, a dynamic process is defined as a system’s change of state,
caused by some action on entities in the system. Though the simulation of a system is
beneficial in many cases, simulations also have some disadvantages. The identification
of a suitable type of simulation model, as well as its creation and verification, can
be complicated and expensive. Especially the verification of a model is an important
step in a simulation study, which, if done incorrectly, can lead to wrong decisions due
to erroneous simulation results.
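The distinction drawn above between analytically solvable models and models that require iterative approximation can be illustrated with the F = ma example. The sketch below assumes a constant force acting on a body that starts at rest; it uses the explicit Euler method as a stand-in for the iterative schemes needed by complex models, and all function names and values are hypothetical.

```python
def analytic_position(force, mass, t, x0=0.0, v0=0.0):
    """Closed-form solution for constant force: x(t) = x0 + v0*t + 0.5*(F/m)*t^2."""
    a = force / mass
    return x0 + v0 * t + 0.5 * a * t * t


def euler_position(force, mass, t_end, steps, x0=0.0, v0=0.0):
    """Approximate the same motion iteratively with the explicit Euler method."""
    dt = t_end / steps
    a = force / mass
    x, v = x0, v0
    for _ in range(steps):
        x += v * dt  # advance the position with the current velocity
        v += a * dt  # then advance the velocity with the constant acceleration
    return x


exact = analytic_position(force=10.0, mass=2.0, t=4.0)
coarse = euler_position(force=10.0, mass=2.0, t_end=4.0, steps=10)
fine = euler_position(force=10.0, mass=2.0, t_end=4.0, steps=1000)

print(f"analytic: {exact}, Euler (10 steps): {coarse:.2f}, Euler (1000 steps): {fine:.2f}")
```

For this simple model the iteration is unnecessary, since the closed form exists; the point is only that the approximation approaches the exact value as the step size shrinks, at the cost of more computation.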
The last element needed for an exact description of a system is its state. According
to [40], the state of a system ”is defined to be that collection of variables necessary to
describe a system at a particular time.”
This raises the question of what exactly should be the initial state of a system’s
model. Prior to the start of a simulation, its model must be initialized with a feasible
system state. Depending on the type of simulation in question, it is possible to ei-
ther initialize state variables to predefined values or generate randomized initial state
values. The latter is a common technique whenever stochastic simulation models are
used. Besides the stochastic models, several other kinds of simulation models exist,
which are discussed in the following section.
Depending on the objectives of a study, modeling the same system can result in
various different models with substantial differences in complexity. A good example
is the model for a computer simulation of ship traffic on a canal. In order to study
the impact of heavily increased traffic on the length of the queue at a lock, it would be
sufficient to include entities for every ship, the path of the canal and the lock.
If, however, the objective of the simulation is to examine water erosion at the canal
banks caused by ship traffic, the level of detail of the model would have to be several orders
of magnitude higher. For a water erosion simulation the water would need to be
modeled with methods of computational fluid dynamics. Moreover, the complexity of
a complete canal model would require high performance parallel computers. Besides
their complexity, simulation models can be classified along three axes: stiffness, state
transition and randomness. The different characteristics of these factors are discussed
below.
³ Verein Deutscher Ingenieure (Association of German Engineers).
⁴ Translation from German.
Static vs. Dynamic Simulation Models:
The stiffness of a model determines whether or not it can change over time. Models
which do not change over time are said to be static. Static simulation models could
also be called timeless models, as they represent a model at a particular time or
a model in which time is insignificant. The simulation of light in a room through
methods of ray tracing is an example for a static model. Dynamic models, on the
other hand, have properties which are a function of time. Such models evolve during
the course of the simulation (e.g. a traffic simulation model).
Continuous vs. Discrete Simulation Models:
State transitions in a model can be discrete or continuous. Continuous
and discrete simulation models are defined analogously to the way continuous
and discrete systems were defined above.
Deterministic vs. Stochastic Simulation Models:
Deterministic simulation models are models in which the output of a simulation de-
pends on its input only. Numerical simulations, as used for the computation of fluid
dynamics, are an example of deterministic simulations. On the other hand, stochastic
simulation models use probabilistic components to simulate the distribution of events
(e.g. arrival times) or create an initial state in a simulation model. Due to their nature,
the output of stochastic simulation models is itself random and multiple simulation
runs can only give an estimate of the model’s characteristics.
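The difference between deterministic and stochastic models can be sketched with a small single-server queue. This is an illustrative example only: the queue model, the exponential distributions, the parameter values and all names are assumptions made for this sketch and are not taken from the simulations discussed in this thesis.

```python
import random
import statistics


def mean_waiting_time(interarrival, service, customers):
    """Simulate a single-server FIFO queue and return the mean waiting time.

    `interarrival` and `service` are callables that return the next time span.
    """
    arrival = 0.0
    server_free_at = 0.0
    total_wait = 0.0
    for _ in range(customers):
        arrival += interarrival()
        start = max(arrival, server_free_at)  # wait if the server is still busy
        total_wait += start - arrival
        server_free_at = start + service()
    return total_wait / customers


def deterministic_run():
    # Fixed inputs: the output depends on the input only.
    return mean_waiting_time(lambda: 1.0, lambda: 0.8, customers=1000)


def stochastic_run(seed):
    # Probabilistic arrival and service times drawn from exponential distributions.
    rng = random.Random(seed)
    return mean_waiting_time(
        lambda: rng.expovariate(1.0), lambda: rng.expovariate(1.25), customers=1000
    )


estimates = [stochastic_run(seed) for seed in range(20)]
print("deterministic:", deterministic_run())
print("stochastic, mean of 20 runs:", statistics.mean(estimates))
print("stochastic, spread (stdev):", statistics.stdev(estimates))
```

Each stochastic run yields a different value, so only the mean and spread over several runs give an estimate of the model's behaviour, while the deterministic variant returns the same result every time.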
Most traditional HPC applications like CFD and FEM are based on dynamic, continuous
and deterministic models. However, there are other simulations with other models,
such as the material flow simulation, which uses a dynamic, discrete and stochastic
model. Subsystems that aim to support various simulations must therefore also support multiple
ways of data transfer and synchronization.
2.2.1.2 Distributed Simulations on Clusters
A distributed simulation is any kind of simulation executed on a distributed computer
system. There are many reasons for a distributed execution of simulations. Firstly, the
distribution of computation intensive high performance simulations can reduce the
total execution time of the simulation. Depending on its type, the particular speed-up
of an individual distributed simulation varies. Secondly, simulations can be distributed
geographically in order to create virtual worlds for multiple participants.
The challenge in the distribution of simulations, especially on heterogeneous (hybrid)
clusters is that in many cases a virtual global time must be maintained. Several solu-
tions for this problem have been devised. An intuitive solution for distributed time is
to employ a server to manage a central physical time base for simulations. The down-
side of this approach is that one node of the simulation is exposed to more network
traffic and processor load than others. Moreover, due to network latency, the distributed time always
contains a minimal error. In some cases, however, it is not necessary to implement
a physical time; it is sufficient to be able to define an order on events in the
distributed simulation. Such a time scale based only on the causal order of events is
called logical time.
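One well-known way to realize a logical time is a Lamport clock. The text above does not name a specific algorithm, so the following is an illustrative choice rather than the method used in this thesis; the class name and the two example components are hypothetical.

```python
class LogicalClock:
    """Lamport-style logical clock: a counter that respects the causal order of events."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Every local event advances the clock by one tick.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the returned timestamp travels with the message.
        self.time += 1
        return self.time

    def receive(self, message_timestamp):
        # The receiver jumps past both its own time and the sender's timestamp.
        self.time = max(self.time, message_timestamp) + 1
        return self.time


simulation = LogicalClock()
visualization = LogicalClock()

simulation.local_event()
simulation.local_event()
stamp = simulation.send()

visualization.local_event()
received_at = visualization.receive(stamp)

print(f"sent at logical time {stamp}, received at logical time {received_at}")
```

No physical clock or central time server is involved; the only guarantee is that a message is always received at a later logical time than it was sent.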
2.2.2 Application - Interactive Simulations and VR
As mentioned in the beginning of this chapter the simulation of virtual surroundings
for training or evaluation purposes plays a minor role in today’s scientific world. How-
ever, this area, which can be best described as highly interactive simulations or Virtual
Reality (IS/VR), is gaining more and more importance in the industry. One example
is car manufacturing. Through the extensive use of Computer Aided Design (CAD)
nearly all parts of modern cars exist as virtual 3D models. Therefore it is easy to take
these parts or whole cars, simplify them a bit and place them in realistic surroundings
to perform several test scenarios even before the first prototype has to be
built. This saves the companies a lot of money and has therefore aroused a lot of interest
in the past few years. Additionally those interactive simulations, in many cases,
show their real potential when they are distributed over several nodes. Only through
multi-user scenarios or the execution of several simulation runs in parallel and the vi-
sualization on advanced displaying devices, these systems provide their users with the
additional information they need.
2.2.2.1 Definitions
There is no commonly agreed definition for IS/VR, because the applications are so
varied and differ in many aspects. Nevertheless, Jonathan Kaye and David Castillo
give quite a good description of the term in their newsletter fittingly named Interactive
Simulation Volume 1 [33].
While David and I certainly didn’t want to simply add a new buzzword,
we felt that we needed an umbrella term to represent a foundation of core
concepts and skills applicable to the evolving role of simulation to various
disciplines, from training and assessment/certification to rapid prototyp-
ing, predictive modeling, and marketing. The ”simulation” part was easy,
because we certainly wanted to capture the idea of modeling reality, or a
plausible reality, in some way. As you will likely hear by talking with us for
more than a few minutes, however, ”simulation” alone is never a solution;
it must always be seen and planned for in the context of careful consider-
ation for the types of interaction necessitated by the project goals. Because
we felt it was important to emphasize the study of interaction in the devel-
opment of useful simulations, we therefore arrived at the term ”Interactive
Simulation.”
The term interactive simulation also appeared years before in the IEEE standard 1278.1-
1995 ”IEEE Standard for Distributed Interactive Simulation - Application Protocols”
[18] where it was used to define a standard for military real-time battlefield simulations
and virtual training. They give another, more specific definition which is strongly
focused on their military scenario:
Distributed Interactive Simulation (DIS): A time and space coherent syn-
thetic representation of world environments designed for linking the in-
teractive, free-play activities of people in operational exercises. The syn-
thetic environment is created through real-time exchange of data units be-
tween distributed, computationally autonomous simulation applications in
the form of simulations, simulators, and instrumented equipment intercon-
nected through standard computer communicative services. The compu-
tational simulation entities may be present in one location or may be dis-
tributed geographically.
Both definitions show that interactive simulations differ in many respects from the
classical simulations mentioned above. They can only achieve interactivity if they run in
(weak) real time, so the main goal is to provide a good approximation of reality while still
meeting that (weak) real-time constraint. The second definition also includes the
possibility of distributing the simulations for different purposes, which leads directly
to the idea of using clusters for their computation.
Virtual reality is another form of simulation, and its definitions are also very vague.
Roy S. Kalawsky sums this up in his book The Science of Virtual Reality [32]:
. . . There are probably as many definitions of the term virtual reality as
there are people in this field! . . . While it is not important what we call ’the
subject’, it is important that we understand the limitations of the technique
we are dealing with. Personally, I prefer the term ’Virtual Environments’
but I recognize that ’Virtual Reality’ is a term that is here to stay because
that is the description used by international press coverage. In a virtual en-
vironment, the human is immersed in a computer simulation, that imparts
visual, auditory and force sensations. The computer simulation can present
conventional real-world environments without modification or entirely new
environments where different (or no) physical law exists. The human op-
erator is allowed to interact with components of the virtual environment
through his/her responses being sensed appropriately and coupled into
the virtual environment simulation. . . .
When speaking of virtual reality, the focus lies more on the audio-visual and haptic
representation of an underlying simulation, whereas the focus for interactive simulations
lies on the realistic simulation of an environment. On closer inspection, however, the two
terms have a lot in common. Both let users become part of a virtual world and interact
with its entities. Additionally, both terms are only vaguely defined and therefore leave
room for all kinds of specializations. The next section discusses why a combination of
them is well suited for execution on hybrid cluster systems, and what is still
needed for practical and widespread use.
2.2.2.2 Interactive Simulations and VR on Hybrid Clusters
In contrast to traditional simulation areas like CFD, where the main advantage of
deploying clusters is better (timely) performance, or high-availability scenarios (e.g.
clustered web servers), where uninterrupted reachability is the main goal, interactive
simulations mostly do not profit from these advantages. Although in some cases a
performance increase can be achieved by distributing, for example, a real-time dynamics
simulation over several nodes, the problem remains that the parallel overhead eliminates
the performance gains for most (weak) real-time applications. Beyond that, however,
hybrid cluster systems allow simulation setups to be extended almost without limit. One
illustrative example is a driving simulator in an urban scenario. A single computer can
handle the graphics output, the dynamics simulation of the steered car and the simulation
of a limited number of computer-controlled cars, say 10. With 10 such nodes
interconnected, one could generate a scenario with 10 human users and 100
computer-controlled cars interacting in one common virtual world. This is only a simple
example, and in practice there will be more sophisticated setups that make use of the
different abilities of the groups of nodes in a hybrid cluster (e.g. special nodes for the
visualization on a tiled display wall, hardware-in-the-loop components etc.), but it shows
the main advantages hybrid clusters can provide for interactive simulations and VR:
flexibility and extensibility.
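The arithmetic behind the driving simulator example can be sketched in a few lines of
Python. The function name and the assumption that every node hosts exactly one human driver
and the same number of computer-controlled cars are illustrative simplifications, not part
of any concrete setup described here.

```python
def scenario_capacity(nodes, ai_cars_per_node=10, humans_per_node=1):
    """Return (human_drivers, ai_cars) for a cluster of identical simulator nodes.

    Assumption for illustration: each node serves its own human driver(s)
    and simulates a fixed number of computer-controlled cars.
    """
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * humans_per_node, nodes * ai_cars_per_node


if __name__ == "__main__":
    for n in (1, 10):
        humans, ai_cars = scenario_capacity(n)
        print(f"{n:2d} node(s): {humans} human driver(s), "
              f"{ai_cars} computer-controlled cars")
```

The sketch deliberately ignores the communication cost between nodes, which grows with the
number of shared entities and is exactly what the subsystems discussed next have to manage.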
However, orchestrating this much more complex system remains the critical point, as
already seen for traditional cluster systems. Good subsystems are needed that control the
data exchange, give users the ability to dynamically set up desired scenarios and allow
them to access the resources they need. There are good approaches originating in
traditional clustering, but they need to be adapted before hybrid clusters can be used for
interactive simulations and VR. Section 2.2.3 introduces two important subsystems of
traditional cluster computing and explains why they are also needed on hybrid clusters and
what needs to be modified.
2.2.2.3 Research on Distributed Visualization and Simulation under the VisSim
Project
The problems and limitations described in the previous section led to the founding
of the joint VisSim project group at the University of Paderborn. The group consists of
several researchers from different backgrounds (e.g. mechanical engineers, computer
scientists and business data processing specialists) and was funded by the Ministry of
Innovation, Science, Research and Technology of the State of North Rhine-Westphalia
for three years. One main goal was to allow various interactive applications from
different backgrounds to make use of the hybrid cluster system Arminius installed at
the PC2. This thesis partially includes generalizations and extensions of the findings
of the VisSim project.
2.2.3 Subsystems for Traditional and Hybrid Clusters
There are many subsystems (in other words, middleware) for many different tasks.
All of them share the fact that they are part of neither the actual application nor
the operating system, and that they provide services to ease the handling of distributed
or parallel computers. One of the most prominent middleware systems is the Message
Passing Interface (MPI) [50], which eases inter-node communication on any form
of distributed architecture. Especially for clusters, which by definition do not have a
common shared memory, efficient communication methods are vital. But there are many
more subsystems which serve very different tasks, ranging from distributed
databases to cluster-wide administration tools.
A very interesting class of middleware, especially for hybrid clusters, is computational
steering (CS). In contrast to many other subsystems, this class is aware of
the different capabilities of the heterogeneous nodes of hybrid clusters. Section 3.1
explains the idea behind computational steering and illustrates why there is a need to
extend this technique to other fields of application.
Another subsystem that is only needed for graphics or hybrid clusters is remote
visualization (RV). Unlike traditional clusters, which mostly produce textual or binary
output once per simulation run, graphics and hybrid clusters often produce a constant
stream of images, possibly on various nodes. If those images cannot be displayed
on a directly connected visualization device, they have to be transferred through the
network. This, however, mostly requires some kind of post-processing, since these data
streams are quite big and require fast transmission. This is where remote visualization
subsystems come into play. Section 3.2 introduces the most important existing systems
and explains why there is need and room for improvement when applying
them to IS/VR applications on hybrid clusters.
2.3 Hardware APIs and NVIDIA CUDA
Besides subsystems for the unified usage of the whole system, tools that harvest the
computational power of the single nodes are also needed in order to make efficient use
of hybrid cluster systems. In addition to optimized compilers and high-performance
mathematical and statistical libraries, APIs are needed to access special hardware like GPUs
or FPGAs. One focus of this thesis lies on the relatively new field of using graphics
hardware for computations other than graphics and visualization. Therefore a powerful
general-purpose API is needed. NVIDIA was the first graphics card vendor that
introduced such an API for their GPUs in 2006. Before that, there were efforts to use
graphics APIs such as OpenGL to map general computing problems to the graphics
card. The main problem of that approach was that general data structures and functions
needed to be artificially mapped to data structures (e.g. textures) and functions
(shader programs) of the graphics world, which led to complex and sometimes inefficient
code.
By introducing the Compute Unified Device Architecture (CUDA) [61], NVIDIA
offered a possibility to use standard C code, standard data types and many other
functionalities of general-purpose computing natively on the fast graphics cards. Other
companies (e.g. ATI, Intel) are following this trend with their own proprietary
implementations. Thus, there exists no universal API for general-purpose computing on GPUs at
the moment. However, NVIDIA's API is currently the de facto standard, since it
offers by far the best functionality, has a broad user and developer base, and NVIDIA
is actively pushing the utilization and development of the API. For that reason, CUDA
was chosen for the GPU-based general-purpose computing tasks in this work and is
briefly introduced in the following.
Other tools, for example APIs for the integration of FPGAs, follow a similar principle:
they offer a well-known interface to new hardware. However, this thesis focuses
on the utilization of GPUs for certain acceleration purposes. Therefore a detailed
description of other APIs and libraries is omitted at this point.
2.3.1 CUDA - Architecture
CUDA is a combination of software and hardware architecture (available for NVIDIA
G80 GPUs and above) which enables data-parallel general-purpose computing on the
graphics hardware. The CUDA software stack is depicted in figure 2.2. The graphics
hardware is accessed through a special CUDA-enabled driver, which can either be
addressed directly by the application or through the CUDA Runtime API or
optimized high-level libraries⁵. The CUDA API is an extension to the C programming
language, which helps to reduce the learning effort. It also offers read and write
access to all areas of graphics RAM, which graphics APIs did not permit before.
⁵ Currently available are libraries for Fast Fourier Transforms (CUFFT) and Basic Linear
Algebra Subprograms (CUBLAS).
Figure 2.2: The CUDA software stack. Source [61]
2.3.2 CUDA - Programming Model
Through CUDA, the GPU can be accessed as a coprocessor that is capable of executing
a large number of parallel threads. Thus, compute-intensive parts of an application
that are based on similar operations on varying data can be offloaded to the GPU very
efficiently. This is done by specifying one or more functions that will be executed by every
thread of the GPU, compiling these functions to kernels and uploading them to the GPU.
Input data can be copied from main/host memory to graphics/device memory and
back through optimized API calls.
Figure 2.3 depicts the thread hierarchy provided by CUDA. A kernel is executed by
a grid of thread blocks. A thread block consists of a limited number of threads that can
communicate and synchronize very efficiently through fast but limited shared memory.
Each thread inside a thread block can be addressed by a unique 1-, 2- or 3-dimensional
ID. This ID can be used to assign tasks or data to each thread. Threads in different
blocks cannot communicate through shared memory but only through significantly
slower device memory⁶. Splitting the grid into thread blocks lets a kernel scale across
devices: depending on the device's processing power, the blocks can be executed
sequentially or in parallel. However, it is vital to choose adequate parameters for the
number of blocks and the number of threads per block for a kernel. Thread blocks can be
addressed by 1- or 2-dimensional IDs, and combining the block ID and the thread ID
identifies a thread globally.
⁶ On current hardware, accessing shared memory costs 4-6 clock cycles, whereas accessing
device memory costs 200-300 clock cycles.
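The indexing scheme can be illustrated without GPU hardware. The following Python sketch
is an emulation, not CUDA code: it calls a "kernel" function once per (block, thread) pair
of a one-dimensional grid and derives the global thread ID as block ID times block size
plus thread ID, which is the usual way of combining the two IDs. The names launch_1d and
vector_add_kernel are hypothetical, and the blocks are executed sequentially here, which is
only one of the orders a device may choose.

```python
def launch_1d(kernel, grid_dim, block_dim, *args):
    """Emulate a 1D kernel launch: call `kernel` once per thread of every block.

    Blocks run one after another here; on a real device they may also run in
    parallel, so a kernel must not depend on the order of block execution.
    """
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Per-thread work: each thread handles the element given by its global ID."""
    global_id = block_idx * block_dim + thread_idx
    if global_id < len(out):  # guard: the last block may be only partly used
        out[global_id] = a[global_id] + b[global_id]


def vector_add(a, b, block_dim=4):
    """Choose enough blocks to cover all elements and launch the kernel."""
    n = len(a)
    grid_dim = (n + block_dim - 1) // block_dim  # ceiling division
    out = [0] * n
    launch_1d(vector_add_kernel, grid_dim, block_dim, a, b, out)
    return out


if __name__ == "__main__":
    print(vector_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60]))
```

With six elements and four threads per block, two blocks are launched and the last two
threads of the second block do nothing, which is why the bounds check in the kernel is
needed.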
Figure 2.3: The CUDA grid, block and thread. Source [61]
2.3.3 CUDA - C Language Extension and Graphics API
Interoperability
CUDA's main goal is to provide developers of all kinds with easy and comfortable access
to the GPU's computing power. Therefore, a minimal set of extensions to the C
language is introduced that allows them to offload portions of their applications to the
graphics hardware. The API is basically split into three components:
• A host component that runs on the host and provides functions to control and access
the GPU.
• A device component that runs on the device and provides device-specific functions.
• A common component which provides built-in vector types and a subset of the C
standard library that is supported in both host and device code.
The syntax of the API and its functions are thoroughly described in [61]. Another
feature of the CUDA architecture is the interoperability with graphics APIs (OpenGL
and Direct3D), which makes it possible to use, for example, rendered images as input to
CUDA kernels. Since this data already resides on the graphics device, it only needs to be
copied within device memory, rather than transferred from the host, to be processed by
CUDA. This offers great possibilities for e.g. online image compression, which is one
topic of this thesis (see section 6.2.4).
3
Related Work
This chapter gives an overview of related work in the area of middleware for traditional
and hybrid cluster systems. The focus lies on traditional computational steering
and on new approaches towards CS of applications other than classic simulations.
Furthermore, related projects in the field of remote visualization and their focuses are
presented and grouped into categories to allow a classification of the approach
described later in this thesis.
3.1 Computational Steering
Computer simulations help scientists to study the behavior of complex systems and to
speed up design and development in advanced fields of engineering. When simulations
run in environments where accessibility is limited in order to ensure the best
cost-performance ratio and a secure environment, controlling and monitoring them
requires more effort and thought than in conventional environments. Therefore a
collection of methods and techniques has been developed to accomplish what is known
as computational steering (CS). Mulder et al. give a good introduction to CS and
an overview of current frameworks in [76] and [56]. The basic idea is briefly
described in the following. Figure 3.1 shows the process of traditional computational
steering. The components User Interface / Visualization, Communication and Data Transfer
and Application / Simulation form the computational steering system. All components
have well-defined interfaces and may be run in a networked, distributed environment.
Additionally, the steered application itself might be distributed (e.g. a parallel CFD
simulation) or run on a parallel machine. However, the most important part of the
CS system is the component that handles Communication and Data Transfer, which is
often called the communication server. It has to monitor both the User Interface /
Visualization and the Application / Simulation and update information that has changed.
This data can be of different nature depending on the model of the simulation (see
section 2.2.1) and thereby influences the choice of methods for data transfer and
synchronization. For model exploration, for example, the output data of the Application
/ Simulation and the input parameters of the User Interface / Visualization need to be
exchanged. In distributed scenarios, additional information on the distribution of the
processes and the network load is of importance to the user. To be able to provide this
data, the communication server needs access to the input and output parameters of the
Application / Simulation, its execution code and its configuration. The access to
parameters and monitoring information might be synchronous or asynchronous, since
not every operation on that data is permitted or valid at every point of the simulation
process.
Figure 3.1: Components and data transfer paths of traditional computational steering.
(Figure labels: Simulation, Vis, Communication Server, Application Data, User Interaction.)
The other side of the system, namely the User Interface / Visualization, has two
tasks. The first is to present or visualize the data extracted from the Application /
Simulation to the user in an appropriate way. The second is to give the user
the ability to change the steerable items of the Application / Simulation. Depending on
the CS scenario, these changes range from setting parameter values through
exchanging source code to reconfiguring an application for optimization. Different
User Interfaces / Visualizations may offer different methods for setting the items
according to their nature, for example textual input fields for source code, or sliders
or buttons for parameters.
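The interplay of the three components can be sketched as follows. This is a minimal,
single-process illustration in Python, not the design of any of the systems discussed in
this chapter: the class names, the damping parameter and the toy simulation step are
hypothetical. It shows a communication server that queues parameter changes coming from the
user interface, applies them only between simulation steps (a point at which changing them
is valid) and records the output of every step for visualization.

```python
from collections import deque


class Simulation:
    """Toy application: a value that decays by a steerable damping factor."""

    def __init__(self):
        self.params = {"damping": 0.5}  # steerable items
        self.value = 100.0

    def step(self):
        self.value *= self.params["damping"]
        return self.value


class CommunicationServer:
    """Relays steering requests to the simulation and output data to the UI."""

    def __init__(self, simulation):
        self.simulation = simulation
        self.pending = deque()   # steering requests from the user interface
        self.output_log = []     # data made available for visualization

    def request_change(self, name, value):
        """Called by the user interface at any time (asynchronously)."""
        if name not in self.simulation.params:
            raise KeyError(f"{name!r} is not a steerable item")
        self.pending.append((name, value))

    def run_step(self):
        """Apply queued changes at a safe point, then advance the simulation."""
        while self.pending:
            name, value = self.pending.popleft()
            self.simulation.params[name] = value
        result = self.simulation.step()
        self.output_log.append(result)
        return result


if __name__ == "__main__":
    server = CommunicationServer(Simulation())
    print(server.run_step())                 # uses the initial damping
    server.request_change("damping", 0.25)   # e.g. the user moves a slider
    print(server.run_step())                 # the change is now in effect
```

In a distributed setting the queue and the output log would be replaced by network
connections, and the choice between synchronous and asynchronous access would determine
whether the user interface waits for the next safe point or continues immediately.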
3.1.1 Existing Systems for Traditional CS
CS systems can differ in many ways depending on the application that needs to be
steered. Therefore several different systems were designed and published in the past
that all have special characteristics. Three significant systems are:
SCIRun [64]: SCIRun is an open source system which integrates scientific
simulation and visualization and is actively developed by the Scientific Computing
and Imaging Institute (SCI Institute) at the University of Utah. Among the
computational steering environments discussed, SCIRun is unique in presenting
a consistent user interface for designing, executing, visualizing and
steering scientific simulations. The fundamental concept of SCIRun is the visual
programming of simulations and visualizations. SCIRun applications are
constructed as a network of modules. Each module in a SCIRun network acts as a
single-purpose unit whose parameters can be steered by the application's user.
The SCIRun framework has many advanced modules for scientific simulation
and visualization but is practically restricted to shared memory systems. This is
due to its limited support for remote steering on multi-computer systems.
Cumulvs [36]: Other frameworks are specialized for certain parallel computation
libraries. These frameworks usually allow the steering of simulations which deal with
large distributed data sets, but the provided methods are not generally applicable.
In particular, Cumulvs, which represents this sort of steering framework,
is restricted to simulations developed with the Parallel Virtual Machine (PVM)
library.
CavernSoft G2 [63]: The most interesting framework with regard to interactivity is
the CavernSoft virtual reality framework. It stands out from the other frameworks
with its two types of update methods (UDP, TCP), which can be chosen
as parameters depending on a volatility/reliability trade-off. CavernSoft
strictly distinguishes between volatile variables, which are (mostly) passed as TCP
streams, and the notification of events or the change of shared parameters, which
are passed as UDP packets. The whole framework was designed as a modular
system that supports multiple protocols (UDP, TCP, HTTP) as well as distributed
shared memory systems on the one hand and message passing architectures on
the other hand.
The computational steering methods of the discussed frameworks are mostly concerned
with the steering of long-running simulations/applications and, unfortunately,
none of the discussed frameworks satisfies the requirements of a flexible steering
framework for highly interactive simulations and virtual reality. The reasons are
manifold and range from missing support for networked systems and synchronization
mechanisms through high latencies to missing support for complex visualization scenarios.
3.1.2 Approaches towards CS for Interactive Simulations
In addition to the established systems for CS, there are approaches to port the technique
to other fields of application. The first one was undertaken by Kesavadas et al. in
2000 [34]. They transfer traditional methods of CS to discrete-event simulations of
manufacturing systems. This allows them to control and steer a running simulation
through a graphical user interface. Changes made in this GUI influence the simulation
immediately. To prove the validity of their work, the authors simulated a single-server
queuing system for a certain amount of time. They allowed a user to control three
significant parameters and showed that, by interactively adapting these parameters
to the output of the simulation, the final results of the simulation could be greatly
improved over those of an uncontrolled execution. This approach is only a first step in
the direction of using CS for interactive simulation. The authors did not, for example,
consider the possibilities the approach offers for distributed computing scenarios
or advanced visualization setups.
Another approach, which uses VR to steer simulations, is the CaveStudy project [69]. It
uses existing systems for distributed VR (CavernSoft [63] and Cavelib [49]) to build an
environment which is able to steer remote simulations. Its main contribution is that
the source code of the simulation to be steered does not have to be altered. This is
achieved by a code generator which uses description files to generate a wrapper for the
simulation. This wrapper acts as the server part of the framework. The generator also
generates a proxy component which is the counterpart of the server on the VR client. This
proxy communicates with the actual VR GUI. However, for this approach to be deployed
successfully, the steered simulation needs to fulfill certain requirements.
One of them is that the simulation generates output that is written to a file or standard
output and that there is only one output source per simulation. Thus, it is
not possible to directly integrate distributed simulations into the framework. Since the
framework is mainly based on CavernSoft (see above), it also inherits some of its
advantages and disadvantages. One main advantage is the distinction between events and
volatile variables, which makes it possible to handle both types of data separately and to
provide the best possible transmission conditions for each of them. A disadvantage is that
the framework only provides for point-to-point connections, and thus in scenarios with
a lot of m x n communication many redundant links have to be established. A central
communication server might solve this problem. A second issue is the lack of a global
synchronization method. Through their peer-to-peer oriented approach, CaveStudy
and CavernSoft allow very flexible setups but do not offer a way to globally
synchronize all of the connected components. Therefore it takes additional effort if, for
example, several simulations are to be compared step by step. Nevertheless, this approach
comes very close to what will be presented in the following. It has its strengths in
dynamic and collaborative VR environments, whereas the approach proposed in this
work aims for a well-scaling and synchronization-optimized system for hybrid cluster
systems with connected VR / visualization devices.
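The cost of pure point-to-point connections can be made concrete with a small calculation.
The Python sketch below assumes, for illustration only, that each of m simulation processes
has to exchange data with each of n visualization clients: with direct links this requires
m x n connections, whereas a central communication server requires one connection per
participant. The function names are hypothetical.

```python
def point_to_point_links(m, n):
    """Direct links when each of m senders connects to each of n receivers."""
    return m * n


def central_server_links(m, n):
    """Links when every participant connects only to one central server."""
    return m + n


if __name__ == "__main__":
    for m, n in [(2, 2), (4, 8), (16, 16)]:
        print(f"m={m:2d}, n={n:2d}: "
              f"{point_to_point_links(m, n):3d} direct links vs. "
              f"{central_server_links(m, n):2d} via a central server")
```

The central server trades fewer links for an additional hop and a possible bottleneck, so
the comparison only counts connections and says nothing about latency.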
This topic still sits at the border between several areas (HPC, mechanical engineering,
visualization, simulation) and is only just emerging, so there are not many more systems
to compare to. To our knowledge, there is currently no system (besides those described
above) that aims to allow the generic computational steering of IS/VR on hybrid
cluster systems.
3.2 Remote Visualization
Chapter 2 focused on the foundations that are needed to understand the applications
that are run on a hybrid cluster system in a distributed way. However, one problem
arises when moving those simulations and their visualization away from the users'
machines to big universal clusters: in most cases, access to those machines is physically
limited. Mostly, there is one single point of operation (often a power wall or CAVE
device) which is expensive to use and difficult to handle. If a user wants to benefit
from the graphics and visualization power of a high-end, distributed system, this is
only possible at the main operation site. A solution to this problem is software
for remote visualization, which transfers the rendered images from a remote server to
a local client that does not need to have significant computational power. There have
been different approaches to such software over the past few years, with different
focuses. Some were designed primarily for remote administration purposes, others
for interactive 3D applications (as shown in the following). But all have in common
that the sheer size of the image data is a limiting factor. The following sections give a
quick overview of existing classes of remote visualization and the problems that may occur
when dealing with these systems.
3.2.1 Classes of Remote Visualization
There are different classes of remote visualization for different tasks. In the following
we introduce three classes that group the most common existing systems. This
should help to understand what the problems of the different classes are, and to which
class the approach proposed in this thesis belongs. The table in figure 3.2 gives a rough
overview of the three classes and their main differences.
3.2.1.1 Client-Side Rendering
Systems in which the application runs on a server but the graphics data (polygon
meshes, textures, volumes) is sent to the client and rendered there belong to this class.
The best-known system in this class is a remote X-Server [20], which makes it possible to
start X-based applications on a remote server as if they were local. The main disadvantage,
however, is that the rendering power of the server system is not used at all, since all
rendering work is done by the client. This is fine for small desktop applications but
insufficient for complex visualization applications that require substantial rendering
power.
Another popular representative of this class is the Chromium [28] framework, which is
able to transfer the whole OpenGL stream from a server to a client. This is only one
feature of Chromium, but it has the same problem as the remote X-Server, namely that it
does not use the server's rendering power. Chromium is a very flexible framework and is
not only designed to do remote rendering. It also offers a lot of features to support all
kinds of distributed and parallel rendering tasks. But for this work we only consider
the remote rendering capabilities and classify them as Client-Side Rendering.
Class | Client needs advanced GPU | Optimized for slow internet connections | Optimized for high FPS | Compression capabilities
Client-side rendering | yes | partially | partially | none
Server-side rendering for 2D / administration | no | yes | no | high
Server-side rendering for 3D / interactive applications | no | no | yes | medium
Figure 3.2: Classification of Remote Visualization Systems
3.2.1.2 Server-Side Rendering for 2D and Administration
This class contains all systems that are mainly used for server administration and
remote control. They render the images on the server side, compress them and send
them to the client, where they can be viewed with simple viewers. They also work
over low-bandwidth connections and mostly show the whole desktop of the remote
server. They perform well with simple 2D applications that require little interaction
(e.g. administration and settings dialogs), but they are insufficient for highly
interactive 3D applications such as a driving simulator or an interactive 3D viewer. The
most popular representatives of this class are the Virtual Network Computing framework (VNC)
[70] and Microsoft's Terminal Services architecture [51]. Both offer possibilities to adapt
the remote visualization to the available bandwidth and client device. But they are not
yet optimized to achieve high frame rates and low latency for interaction.
3.2.1.3 Server-Side Rendering for 3D Applications (with and without Transparent
Integration)
The third class of systems specializes in displaying interactive 3D applications on
remote clients. The server (or servers) with a lot of graphics processing power renders
a 3D application (e.g. OpenGL-based). Every frame is read back, compressed (lossy
or lossless) and sent to a client. In most cases, only the window of the 3D application
is processed, to save bandwidth and processing power. Popular systems of this class
are the SGI Viz-Server [72] and VirtualGL [78]. All of these systems are optimized to
provide high frame rates with the available bandwidth. This is supported by the
possibility to choose different resolutions and compression methods as well as certain
level-of-detail mechanisms. Most systems allow the transparent usage of existing
applications (mostly OpenGL-based). This guarantees comfortable and universal utilization.
In contrast, there are systems that are tightly coupled to a certain application in order
to further increase performance by adjusting the rendering process dynamically to fit the
needs of remote visualization. The subsystem presented in this thesis belongs to this class
and will support both transparent and non-transparent integration. Furthermore, it
utilizes advanced hardware and software mechanisms to achieve high-performance remote
visualization.
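The per-frame pipeline of this class (read back, compress, send, decompress, display) can
be sketched in Python. The sketch rests on several assumptions made purely for
illustration: the frame is a synthetic RGB buffer rather than a real read-back from the
GPU, lossless zlib compression from the standard library stands in for whatever codec a
real system uses, and the network transfer itself is omitted.

```python
import zlib


def render_frame(width, height, frame_no):
    """Stand-in for a rendered frame: a horizontal RGB gradient, 3 bytes per pixel."""
    row = bytearray()
    for x in range(width):
        shade = (x + frame_no) % 256
        row += bytes((shade, shade, 255 - shade))
    return bytes(row) * height


def encode_frame(frame, level=6):
    """Server side: compress the raw pixels before sending them."""
    return zlib.compress(frame, level)


def decode_frame(payload):
    """Client side: restore the raw pixels for display."""
    return zlib.decompress(payload)


if __name__ == "__main__":
    frame = render_frame(640, 480, frame_no=0)
    payload = encode_frame(frame)
    assert decode_frame(payload) == frame  # lossless round trip
    print(f"raw: {len(frame)} bytes, compressed: {len(payload)} bytes")
```

A synthetic gradient compresses far better than typical rendered scenes, so the printed
sizes only demonstrate the mechanism and should not be read as a realistic compression
ratio.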
3.2.2 Limitations and Problems
All of the systems mentioned above have to deal with a vast amount of data that needs
to be transferred. This is because visual data is mostly pixel- or voxel-based. Therefore
even single images, which are displayed for only a fraction of a second, can consist of
one or more megabytes of data. To provide a smooth animation, 30 or more frames
are needed every second. All of them have to get from the GPU of the server, over the
network, to the GPU of the client and finally to the display of the user. There are already
many approaches to minimize the amount of data that needs to be transferred, for example by
pre-fetching certain data to the client or by introducing level-of-detail mechanisms and
compression of all kinds. But there is still no solution that enables users to interact
with a remote system as if it were local.
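A rough estimate shows the scale of the problem. The following Python sketch computes the
uncompressed data rate of a frame stream. The resolution of 1024 x 768 pixels and 3 bytes
per pixel are assumed example values, while the frame rate of 30 frames per second follows
the text.

```python
def raw_stream_rate(width, height, bytes_per_pixel=3, fps=30):
    """Return (bytes_per_frame, bits_per_second) for an uncompressed stream."""
    bytes_per_frame = width * height * bytes_per_pixel
    bits_per_second = bytes_per_frame * fps * 8
    return bytes_per_frame, bits_per_second


if __name__ == "__main__":
    frame_bytes, bps = raw_stream_rate(1024, 768)
    print(f"{frame_bytes / 1e6:.2f} MB per frame, "
          f"{bps / 1e6:.0f} Mbit/s uncompressed")
```

For these example values a single frame occupies about 2.36 MB and the stream needs roughly
566 Mbit/s before any compression, which is more than half the nominal capacity of a
Gigabit Ethernet link.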
4
Two Essential Subsystems for IS/VR on
Hybrid Cluster Systems
This chapter clarifies the goal of this thesis and states what its contribution is and
also what it is not. The reasons why this thesis deals with software (middleware) for
hybrid cluster systems were described in chapters 1, 2 and 3. In short, the main points
are:
• There are powerful hybrid clusters and various new applications that could make
use of them, but there is no practical and universal way to get those applications
to run on the cluster.
• Clusters are still quite expensive and therefore have to be shared between different
facilities. Hybrid clusters add a new dimension of accessibility issues since
they can, in most cases, generate graphical output which needs to get to its users.
There currently exists no good solution for this task.
These are, in our opinion, the two main reasons why hybrid clusters are still only rarely
used in the field of interactive simulation and VR, despite the fact that applications in
this field could make good use of the power of clusters. Beyond these problems,
there is still a lot of general work to be done until clusters (whether traditional or
hybrid) can be used by the everyday user without limitations, but that is
not the focus of this thesis. This thesis focuses on designing, implementing and testing
two systems, each of which addresses one of the problems described above.
4.1 Traditional Usage of Hybrid Cluster Systems
In order to analyze which software systems are needed for the usage of hybrid clusters
for IS/VR, it greatly helps to take a look at existing systems for traditional HPC on
hybrid clusters and, in a second step (see section 4.2), derive the new systems from
them. In figure 4.1, the major existing subsystems for traditional HPC are shown. It
is observable that all of them address only parts of the cluster system and therefore
hardly allow the simultaneous utilization of all resources of the whole system. How-
ever, this is only rarely required for traditional HPC applications since most of them
simply do not need the advanced features of a hybrid cluster system (e.g. sophisti-
cated, distributed visualization or the option to use a multi-user input). For traditional
HPC applications, the focus lies on maximizing the utilization of the computing re-
sources. Thus, most effort is put into the subsystems for parallel simulation (such as
MPI and PVM, but also application-level parallelism). Another important point for
traditional HPC is the visualization of the computed data to help users understand
and analyze the results of the simulation. In most cases this is done by post-processing
the data, saving a visual model and displaying it off-line to the users. Therefore no
special subsystems are needed. With the advent of hybrid clusters, new possibilities arose
and subsystems were developed to enable the utilization of these resources.
Computational Steering, as described before, allows for the (limited) interactive steering of
running HPC simulations through a visualization. Additionally, distributed visual-
ization or remote visualization systems allow for the usage of the powerful hardware
for graphics processing for elaborate visualization tasks. However, all the subsystems
depicted in figure 4.1 are independent and hardly any of them is ready for interac-
tive usage, which is one of the main requirements of IS/VR applications. Therefore
this thesis introduces two new subsystems that allow interactive simulations and VR
applications to efficiently run on hybrid cluster hardware.
Figure 4.1: Different subsystems in existing hybrid cluster environments.
(Diagram labels: Remote Rendering; Traditional Computational Steering; (Massive) Parallel
Simulation; Distributed Visualization; hybrid cluster nodes Sim1, Sim2, Sim3 and Vis1, Vis2, Vis3.)
4.2 Hybrid Cluster Systems for IS/VR
Figure 4.2 depicts the vision of how hybrid clusters can be used efficiently for
interactive simulations and VR. This vision is presented and elaborated in the rest of the
thesis. The first subsystem, Computational Steering for IS/VR, offers for the first time
the possibility to compose setups with arbitrary numbers of simulation, visualization
and input components, and therefore allows the creation of flexible and scalable systems
for the execution of IS/VR applications. This helps, for example, to realize multi-user
scenarios, comparative parallel simulation runs or the integration of multiple autonomous
simulations into one system. The second subsystem, a framework for remote rendering,
addresses the problem of accessibility of the cluster’s visualization resources. It is meant
to be a framework for testing, evaluation and productive use in which methods for the
remote visualization of rendered simulation results can be developed, optimized,
benchmarked and used. Again, the focus is on optimizing the subsystem and its methods
for best interactivity. The following two sections briefly introduce the main new ideas of
both subsystems, whereas chapters 5 and 6 describe and evaluate them in detail.
Figure 4.2: Two new systems to allow for the efficient usage of hybrid clusters for
interactive simulations and VR.
(Diagram labels: Computational Steering for Interactive Simulations and VR; Invire Framework
for Remote Rendering; hybrid cluster nodes Sim1, Sim2, Sim3 and Vis1, Vis2, Vis3.)
4.3 CS for IS/VR - Idea and Requirements
The first system is based on the idea of computational steering and extends it for the
special purpose of interactive simulations on hybrid clusters. After an analysis of
existing computational steering frameworks for traditional clusters and number-crunching
applications, a new approach for computational steering of interactive simulations on
hybrid clusters is defined. This includes a theoretical approach supported by several
models, as well as a working prototype which is used and evaluated in two sample
applications.
The system will cover the following requirements:
• Provide a flexible basis to orchestrate distributed, interactive simulations and VR
applications.
• Encapsulate all communication effort and hide it from the user in order to provide,
to a certain extent, a single system image of the cluster.
• Offer interfaces for all relevant components (simulation, visualization, input).
• Be deployable on different platforms and consist only of freely available software.
In chapter 5, section 5.1 introduces a concept for such a system and section 5.2
describes its prototypical implementation. Afterwards, section 5.3 shows the integration
of the system into an exemplary driving simulator and into an interactive material flow
simulation.
4.4 Advanced RV Techniques for IS/VR - Idea and Requirements
The second system is a platform to evaluate different remote rendering techniques,
especially in the field of grabbing and compressing the rendered frames. The main point
of this system is to utilize the power of hybrid clusters, namely the high-performance
graphics hardware of the visualization nodes, not only to render the frames but also to
compress them in several ways. In this way, better compression rates can be achieved in
shorter times than with comparable existing systems. This in turn allows users to
interact transparently with interactive simulations on hybrid clusters.
The system will cover the following requirements:
• Provide a framework to test and evaluate different methods for remote visualization.
• Enable the usage of the GPU for compression tasks.
• Offer new compression methods that allow sufficiently high frame rates for the
remote visualization of interactive simulations.
• Offer an interface for all kinds of graphical applications, possibly also a transparent
interface.
• Prepare for the handling of distributed visualizations.
• Be deployable on different platforms and consist only of freely available software.
In chapter 6, section 6.2 introduces a concept for such a system and section 6.4 describes
its prototypical implementation. A selection of benchmarks is given in section 6.5.
5 Computational Steering of IS/VR
As discussed before, computational steering (CS) is a valuable mechanism for scientific
investigation by which the parameters of a running application or simulation can be
altered and the results are visualized immediately (see e.g. [77]). It helps scientists to
interact with their simulation and gives them more control than merely reacting to the
results of a long-running computation. Several CS frameworks and systems for different
tasks and applications in High Performance Computing (HPC) have been developed.
The most important architectures are presented and briefly described in section 3.1.
They are suitable for many applications in HPC such as Computational Fluid Dynamics (CFD),
Molecular Dynamics (MD) or crash simulation. All of them have in common that they
lack support for real-time interactivity and synchronization. Interaction is basically
limited to changing simulation parameters, such as the position of an object in a flow
channel. In most cases this is done on a best-effort basis over time.
The focus of this thesis lies on the research of highly interactive, distributed simu-
lation and distributed Virtual Reality systems (IS/VR). This is quite a new but very
promising field in HPC. For a long time those systems were limited in terms of scala-
bility, performance and flexibility. Proprietary systems on specialized hardware were
used to simulate fixed use cases, such as driving or flight simulators or material simula-
tion for rapid prototyping. With a flexible framework for IS/VR on a hybrid computing
and visualization cluster, a basis for highly flexible and scalable applications could be
provided that can be customized, compared and extended by arranging various modules.
One of these applications is, for example, the distributed driving simulator
Virtual Night Drive (see [7], [22] and [46]). It serves as one demonstrator for the
proposed framework and is briefly described in section 5.3.1. However, the computational
steering methods of the existing frameworks are mostly concerned with the steering
of long running simulations and applications (as shown in figure 5.1), and none of the
discussed frameworks satisfies the requirements of a flexible steering framework for
highly interactive simulations. First steps have been taken toward using CS for other
kinds of simulation, such as discrete event simulations (see [34]). Still, the methods
of traditional CS are used there, only in other scenarios. Therefore, the following
sections introduce a concept for a framework especially designed for computational
steering of IS/VR. Details are published in [45] and [42], and implementation details
in [17].
Figure 5.1: Traditional Computational Steering with one user, one visualization and
one simulation.
(Diagram labels: Simulation; Vis; Application Data; User Interaction.)
5.1 A Concept for Extended Computational Steering of IS/VR
As a starting point, the next section describes in which respects the traditional
approach of CS needs to be extended to serve the needs of IS/VR. Thereafter,
general requirements are specified and a conceptual design is presented.
5.1.1 New CS Models
Traditional computational steering is mostly limited to the steering and visualization
of a single simulation (see figure 5.1). This basic definition is very strict and leaves
only limited flexibility for applying computational steering in areas other than
traditional HPC simulations. Usually, there is only one simulation, one visualization
and one site of user interaction. By extending each of these domains, three additional
models can be identified that are needed to successfully transfer Computational
Steering to IS/VR.
5.1.1.1 Collaborative Computational Steering
In order to collaborate on an interactive, distributed simulation, it is necessary to
extend the traditional paradigm of computational steering. Any user involved should be
able to access and steer the running simulation independently of his or her location.
In this scenario it is possible for one researcher to work on the level of detail of a
simulation model and for another to study its particular performance figures. Figure 5.2
shows an example of a collaborative computational steering setup. Multiple sites connect
to one high-performance simulation, each with its own visualization and methods for
user interaction. The simulation offers several parameters to be steered by the users.
With multiple users steering the same application, care must be taken to guarantee
consistency of the steering parameters. In order to avoid conflicts during access to a
parameter, it must be possible to lock individual parameters exclusively.
Figure 5.2: Collaborative Computational Steering.
(Diagram labels: Simulation; Vis, Vis, Vis; Application Data; User Interaction.)
Collaborative Computational Steering especially makes sense in IS/VR since many
of the existing applications include multi-user scenarios by nature. In a driving
simulator, for example, several users can control cars in the same virtual world. It is
also possible to introduce different roles for steering and visualization instances. One
steering/visualization pair could, for example, act as an administrator for the simulation
while the other clients are spectators or have predefined interaction privileges.
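The parameter locking mentioned above can be illustrated with a minimal sketch. It assumes a single in-process parameter store and string client identifiers; the class name `ParameterStore`, its methods and the parameter name `fog_density` are hypothetical and not taken from the framework described in this thesis.

```python
import threading


class ParameterStore:
    """Steerable parameters with per-parameter exclusive locks (illustrative only)."""

    def __init__(self):
        self._values = {}
        self._owners = {}                 # parameter name -> client holding the lock
        self._guard = threading.Lock()    # protects both dictionaries

    def lock(self, name, client):
        """Try to lock `name` for `client`; return True on success."""
        with self._guard:
            owner = self._owners.get(name)
            if owner is None or owner == client:
                self._owners[name] = client
                return True
            return False

    def unlock(self, name, client):
        """Release the lock, but only if `client` actually holds it."""
        with self._guard:
            if self._owners.get(name) == client:
                del self._owners[name]

    def write(self, name, value, client):
        """Write succeeds if the parameter is unlocked or locked by `client`."""
        with self._guard:
            owner = self._owners.get(name)
            if owner is not None and owner != client:
                return False
            self._values[name] = value
            return True

    def read(self, name):
        with self._guard:
            return self._values.get(name)


store = ParameterStore()
store.lock("fog_density", "admin")
print(store.write("fog_density", 0.8, "driver_1"))  # rejected while locked by admin
print(store.write("fog_density", 0.3, "admin"))
```

A rejected write is simply reported to the caller here; queuing of simultaneous write requests would be an alternative design.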
5.1.1.2 Synchronized Computational Steering
Figure 5.3: Synchronized Computational Steering.
(Diagram labels: Simulation; two synchronized Vis components; Application Data; User Interaction.)
Synchronized Computational Steering opens up new possibilities in the quality of
visualization. This variation of traditional computational steering is a direct requirement
of two recent and widespread visualization technologies: very high-resolution displays
and 3D data visualization (see e.g. figure 5.4). Both cases require reliable frame
synchronization across multiple visualization nodes. High-resolution visualization is
required by users working with data that is too extensive and detailed to fit onto a
conventional screen, e.g. chip designers or city planners. Since the data throughput of
high-resolution or tiled displays often exceeds the data rates offered by high-end
graphics cards, data is fed to a tiled display by multiple graphics nodes, each displaying
a magnified partial view of the original scene. The side-by-side positioning of displays
in a tiled wall makes a possible time offset between displays disturbing for the viewer.
This especially holds for rapidly changing visualizations with large objects crossing
multiple tiles of a display. Thus the exact synchronization of the visualization
components is vital to provide a seamless visual impression. The second emerging
visualization technology in which time offsets in visualization are more than a minor
annoyance is 3D visualization. For a three-dimensional illusion of a data visualization
it is necessary to create a pair of 2D images, each of which represents a slightly
different perspective of the same scene. This slight difference is set up in such a way
that the separate presentation of each image to one eye of the viewer creates an
artificial depth perception. Figure 5.3 shows an example of a computational steering
environment where a synchronized presentation of two visualizations is required. In this
case, two nodes form a synchronization group in order to minimize time offsets. Uneven
processor loads and scheduling differences on the visualization nodes are avoided by the
introduction of a third node, which handles the user input to the steered application.
Figure 5.4: Synchronized CS to render on a Tiled Display.
In addition to these two cases (high-resolution and stereoscopic displays), even more
sophisticated scenarios like CAVE environments or specialized setups for e.g. driving
simulators can be realized through Synchronized Computational Steering by setting
up and synchronizing the desired number of visualization components.
There are already systems (e.g. Chromium [28] or VRJuggler [8]) that are able to
distribute graphics output over several computers to drive tiled walls or stereoscopic
displays. However, there are two main advantages of using synchronized CS for that
task:
• In contrast to systems like Chromium, no graphical data (e.g. OpenGL calls,
textures and scene models) has to be sent over the network since all of this data
already resides on the client system. Only simulation data, such as positions of
moving objects or the current position of a global camera, needs to be transferred
to realize a distributed visualization.
• With Synchronized CS it is possible to add and remove visualization instances
during the runtime of the simulation, whereas in the established systems all
components need to be known a priori.
On the other hand, those systems are very good in terms of transparent integration.
Chromium, for example, is able to distribute almost all OpenGL-based software and
offers various so-called SPUs (stream processing units) that allow for different
post-processing operations directly on the graphics stream. For systems that already
provide interfaces for CS, synchronized CS can achieve even better performance and
higher flexibility.
5.1.1.3 Concurrent Computational Steering
Figure 5.5: Concurrent Computational Steering.
(Diagram labels: two synchronized Simulation components; Vis; Application Data; User Interaction.)
In many scenarios one simulation run is not enough to confirm the simulation results,
or one wants to compare the results that are generated by different initial parameters.
Therefore several simulation runs are needed. Traditionally, multiple simulations were
started sequentially with several parameter variations in order to study their influence
on the simulation results. Although traditional computational steering greatly helped
to reduce the turnaround time of the loop of parameter variation, simulation and
visualization, the conclusions of a simulation are still drawn based on the comparison of
two or more successive simulation runs. Furthermore, it is quite difficult to perceive
small effects of a changed parameter in a complex simulation. Researchers then have to
rely on a quantitative analysis of the simulation results. In such cases, employing
traditional computational steering adds no value to a simulation study at all. In order
to make computational steering more useful in comparative simulation scenarios, it is
important to allow the concurrent execution of two or more simulations and an
aggregated visualization of the simulations’ outputs. Figure 5.5 shows an exemplary
setup of a comparative simulation scenario and exposes two challenges which arise
from concurrent computational steering. On the one hand, the user must be able
to selectively change parameter values in any of the concurrent simulations, or it
must be defined which user steers which simulation. On the other hand, to improve
the usability of concurrent computational steering, certain parameters can be treated
as virtual global parameters. Setting the value of such a parameter causes it to be
set in all steered applications instantly, without further user interaction. Besides
methods for directing user input, concurrent computational steering requires
synchronization of time among the concurrently running simulations in order
to make their results interactively comparable.
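The distinction between selectively steered parameters and virtual global parameters can be sketched as follows. This is only an illustration under assumed names: `SteeredSimulation`, `ConcurrentSteering`, `set_parameter` and the parameter names are hypothetical, and time synchronization between the runs is not modeled.

```python
class SteeredSimulation:
    """Stand-in for one concurrently running simulation with local parameters."""

    def __init__(self, name, **params):
        self.name = name
        self.params = dict(params)


class ConcurrentSteering:
    """Routes a parameter change to one simulation or, for global parameters, to all."""

    def __init__(self, simulations, global_params):
        self.simulations = {sim.name: sim for sim in simulations}
        self.global_params = set(global_params)

    def set_parameter(self, param, value, target=None):
        """Apply the change and return the names of the simulations that were updated."""
        if param in self.global_params:
            # Virtual global parameter: set in all steered simulations at once.
            targets = list(self.simulations.values())
        elif target is not None:
            # Selective steering: only the addressed simulation is changed.
            targets = [self.simulations[target]]
        else:
            raise ValueError(f"'{param}' is not global; a target simulation is required")
        for sim in targets:
            sim.params[param] = value
        return [sim.name for sim in targets]


run_a = SteeredSimulation("run_a", friction=0.8, time_scale=1.0)
run_b = SteeredSimulation("run_b", friction=0.5, time_scale=1.0)
steering = ConcurrentSteering([run_a, run_b], global_params={"time_scale"})
print(steering.set_parameter("time_scale", 2.0))               # both runs
print(steering.set_parameter("friction", 0.6, target="run_b"))  # one run only
```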
5.1.1.4 Combinations of the Models
Figure 5.6: Combined, Extended Models of Computational Steering.
In order to achieve maximum flexibility, another goal of computational steering for
IS/VR is to allow arbitrary combinations of the models described before. Figure 5.6, for
example, shows a scenario that consists of multiple users (collaborative CS), multiple
simulations (concurrent CS) and a complex visualization device (synchronized CS).
Additionally, the users are separated into groups with different privileges (operator,
test user). A real application of such a scenario is described in section 5.3.1.
5.1.2 Requirements
In order to realize the models described before and thereby create a framework for
computational steering of IS/VR, some special requirements (besides general ones in
software engineering, such as maintainability, extensibility, and portability) need to be
considered. These are the following:
Dynamic setup: Especially in IS/VR, the configuration of the simulation, input and
visualization components is very diverse and not necessarily fixed beforehand.
Therefore, one important requirement for a specialized framework is the possibility
to set up and arrange these components in a flexible and dynamic way.
Furthermore, the system should not be fixed to a certain configuration after the
initialization phase but should remain extensible and scalable during its runtime.
Locking: Dedicated methods to exclusively lock parameters of a component are another
main requirement for the realization of concurrent CS and collaborative CS.
These models can only work correctly if there is a mechanism that takes care
of data consistency and the queuing of simultaneous write requests.
Low latency: Low latency is a direct requirement of IS/VR since interaction is a crucial
component of these systems. Lack of immediate interactivity is the main
obstacle to user acceptance of a computational steering framework for highly
interactive simulations. Therefore techniques need to be considered which ensure
the fast and possibly prioritized transfer of latency-relevant information.
Synchronization: The requirement for synchronization among simulations or
visualizations in the framework follows directly from the synchronized and concurrent
computational steering models.
Classes of data: This last requirement is less obvious but will turn out to be very
important. Especially in IS/VR there are different types of data that need to be
treated differently by the CS framework. Therefore it is very important to offer
special mechanisms to, for example, publish parameter changes once to a special
group of receivers or update simulation data in every step of the simulation in
each connected visualization.
These requirements form the basis of the concept for computational steering of IS/VR
and their realization is described in the following sections.
5.1.3 Conceptual Design and Classifications
In the following sections, the concept for a flexible computational steering framework
is explained. In general, it is based on the idea of a centralized communication server.
This server and the attached simulation, visualization and steering clients should be
coupled as dynamically and loosely as possible to provide extensive flexibility. To design
the framework, the actors need to be identified and the data to be transferred has to
be classified.
5.1.3.1 Actors in Computational Steering
The idea of the extended models of computational steering is to allow multiple
simulations, visualizations and steering sites to interact as one seamless application.
Figure 5.7 combines the preceding specific scenarios of computational steering into one
general model of a steering framework. It allows any number of actors from each of the
three classes of actors in computational steering. The definition of an actor used here
is that of the Unified Modeling Language (UML) standard: “an actor is something or
someone who supplies a stimulus to the system” [62]. From the framework’s point of view,
the actors are simulation, visualization and steering components. The steering component
is an active actor since it can “initiate interactions with a system”, the visualization
components are passive actors since they “act as targets of requests or are activated by
the system”, and the simulation components are both active and passive.
Figure 5.7: General model of the steering framework.
(Diagram labels: Computational Steering Framework connected to Simulation, Visualization
and Steering, each with multiplicity 0..n.)
In the following, the functional role of each type of actor is described.
Visualization: The purpose of visualization in computational steering is to provide
a visual representation of the state of a simulation, where the state itself can
be divided into simulation parameters and the results of a simulation. A
visualization can have several parameters which define the way in which the
state of a simulation is presented. The visualization parameters are local to the
visualization and have no effect on the computation of the visualized data.
Simulation: The simulation is where the results of the simulated process are computed.
A simulation is considered a single component; a potential distribution
of the simulation among multiple processing nodes is neglected from the
framework’s point of view. A distributed simulation usually provides a single
point for data access and can therefore be handled as a single component in
most cases. All data provided by a simulation is regarded as its output, which
itself is considered volatile, as it is recomputed in every iteration of the
simulation. In the special case of IS/VR, the duration of one iteration needs to be
short enough for the whole application to be interactive, i.e. less than 100 ms (see
section 5.1.5). Since the simulation needs to run in (soft) real time, all results need
to be ready before the end of an iteration. Otherwise a mechanism is needed
which aborts computations that take too long. In order not to run faster than
wall-clock time, the simulation idles for the remainder of the iteration. Like
the visualization, the simulation also has local parameters. As they are used in
the computation of a simulation step, altering/steering these parameters directly
influences the computation of the output data.
Steering: Although steering, which is the actual user interaction, is often integrated
into visualization, it is useful to regard it as an independent component. This is
especially the case for scenarios where autonomous steering devices are used to
interact with a simulation. The steering component is used to change parameters
in visualization and simulation components. Where steering is done by input
devices delivering a continuous stream of data (e.g. analog devices), it is either
necessary to adequately quantize this stream of data and frequently update a
parameter or to deliver a continuous stream of data.
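The soft real-time iteration described for the simulation actor (compute, then idle for the remainder of the iteration so as not to outrun wall-clock time) can be sketched as a fixed-period loop. This is a minimal illustration, not the thesis implementation: the function name `run_soft_realtime`, the injectable `clock` and `sleep` arguments, and the choice to count an overrun and resynchronize rather than abort the computation are all assumptions.

```python
import time


def run_soft_realtime(step, iterations, period=0.1,
                      clock=time.monotonic, sleep=time.sleep):
    """Call step(i) once per `period` seconds and return the number of overruns.

    `period` defaults to 0.1 s, matching the 100 ms interactivity bound.
    """
    overruns = 0
    next_deadline = clock() + period
    for i in range(iterations):
        step(i)
        remaining = next_deadline - clock()
        if remaining > 0:
            # Finished early: idle so the simulation does not outrun wall-clock time.
            sleep(remaining)
            next_deadline += period
        else:
            # Iteration took too long. Here it is only counted and the schedule is
            # resynchronized; aborting the computation would be an alternative.
            overruns += 1
            next_deadline = clock() + period
    return overruns


if __name__ == "__main__":
    late = run_soft_realtime(lambda i: time.sleep(0.01), iterations=5, period=0.05)
    print("overruns:", late)
```

Passing `clock` and `sleep` as arguments is a design choice made here only so that the loop can be exercised deterministically without real waiting.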
5.1.3.2 Classification of Simulation Data
Until now no distinction has been made between the types of data accessed and ex-
changed through the computational steering framework. Especially in highly interac-
tive computational steering of IS/VR, it is essential to distinguish between two types
of data, namely volatile state data and static parameters.
Volatile state data: changes in every time step of the application. The main charac-
teristics of volatile state data are, firstly, that it has to be transferred in every pass
of the simulation process and, secondly, that all clients have to get it simultane-
ously. Especially for the synchronization-based models it is very important to
ensure on-time delivery of the data.
Static parameters: change only through steering actions by a user or another actor
(timer or event triggered) of the application. If no action is taken, the values of
the parameters are constant over the application’s iterations and do not need to be
transferred. Only when a new actor joins the system do all necessary parameters
need to be sent to it for initialization.
The differentiation into volatile state data and static simulation/visualization
parameters makes it possible to apply different methods for their distribution across all
actors. The following two sections describe how volatile state data and parameters can be
passed in a flexible and dynamic way.
5.1.3.3 Passing of Volatile State Data
There are two requirements for the passing of volatile state data by the computational
steering framework. On the one hand, the volatile data needs to be updated in every
time step. On the other hand, data passing needs to be synchronized. The second
requirement is easily fulfilled if the framework passes an item of input data to all its
respective subscribers prior to their internal iteration, and fetches their output
immediately afterwards. Thus, the main loop for passing volatile data between the actors of
a computational steering application is the following (from the actor’s point of view):
1. For any variable which has been computed in the previous iteration (output data):
Copy its value to all interested actors.
2. Iterate all actors.
3. Collect output data values from actors for use as input in the next iteration.
The association of producers and consumers, which is necessary for this approach, can
be stored in a simple array and is essentially a graph data structure (map). This data
structure determines the routing of data between actors.
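The three steps of this loop can be sketched as follows. The sketch makes several assumptions for illustration: actors are plain in-process objects, the routing structure is a dictionary from (producer, variable) to a list of consumer names, and the names `Actor` and `run_iteration` are hypothetical.

```python
class Actor:
    """Minimal actor: reads `inputs` and produces `outputs` in each iteration."""

    def __init__(self, name, step):
        self.name = name
        self.step = step      # callable: dict of inputs -> dict of outputs
        self.inputs = {}
        self.outputs = {}

    def iterate(self):
        self.outputs = self.step(dict(self.inputs))


def run_iteration(actors, routing, pending):
    """One pass of the loop; `pending` holds the outputs of the previous pass."""
    # 1. Copy every value computed in the previous iteration to all interested actors.
    for (producer, var), value in pending.items():
        for consumer in routing.get((producer, var), []):
            actors[consumer].inputs[var] = value
    # 2. Iterate all actors.
    for actor in actors.values():
        actor.iterate()
    # 3. Collect output values for use as input in the next iteration.
    return {(a.name, var): val
            for a in actors.values() for var, val in a.outputs.items()}


actors = {
    "input": Actor("input", lambda inp: {"x": 5}),
    "sim": Actor("sim", lambda inp: {"y": inp.get("x", 0) * 2}),
    "vis": Actor("vis", lambda inp: {}),
}
routing = {("input", "x"): ["sim"], ("sim", "y"): ["vis"]}

pending = {}
for _ in range(3):
    pending = run_iteration(actors, routing, pending)
print(actors["vis"].inputs)
```

In this example the value derived from `x` only arrives at the visualization in the third pass, because each hop along the chain input, simulation, visualization costs one iteration.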
5.1.3.4 Scheduling of Volatile State Data
The scheduling of the data is done in a dependency-agnostic way. That means that
the relations between consumer and producer are only used to determine which volatile
state variables to pass. Data dependencies between the actors are ignored in terms of
scheduling. This might lead to delayed data as shown in figure 5.8 a): the variable
x from the input component first reaches the visualization after the second iteration,
since its computation depends on the simulation component. In other words, data
that reaches the visualization is already outdated due to the scheduling that is used.
This fact is inevitable since the dependency exists independently of the scheduling
method used. Thus, it is important to reduce the iteration time so that the longest
dependency chain can still be computed in an interactive time frame (about 100 ms).
If there are multiple dependencies, as for example shown in figure 5.8 b), where the
visualization depends on input from the simulation as well as on input directly from
the input component, more sophisticated scheduling methods need to be considered.
One approach is graph scheduling, where the knowledge of the data dependencies
is used to influence scheduling decisions. Thus, for a directed acyclic graph of data
dependencies it is possible to sequentially order the iteration of actors in such a way
that all data on which an actor depends is computed before that actor’s iteration.
Sequential iteration is particularly useful in a scenario as depicted in figure 5.8 b). In
this example, the visualization actor depends on data from both the input and the
simulation actors. Through the dependency chain input → simulation, a dependency-agnostic
scheduling algorithm would supply the visualization component with data which has a time
offset of two times the duration of a single iteration. An algorithm adhering to a
dependency graph would sequentially iterate the actors input, simulation and
visualization. Thus, the data input to an actor would always be up to date. However, the
disadvantage of these algorithms is that one iteration of the scheduling algorithm then
requires the time needed to sequentially iterate the longest dependency chain.
Figure 5.8: a) Data delay through dependency-agnostic scheduling. b) Graph-based
scheduling introduces longer iteration times.
(Diagram labels: actors Input, Simulation, Visualization; variables x, y, z; iterations 1, 2, 3.)
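Graph scheduling as described above amounts to a topological ordering of the actors. The following sketch uses Python’s standard `graphlib` module (Python 3.9 or later) to derive such an order and the length of the longest dependency chain; the function names and the dictionary-based description of dependencies are assumptions made for this illustration.

```python
from graphlib import TopologicalSorter


def iteration_order(dependencies):
    """Order actors so that every actor runs after all actors it depends on.

    `dependencies` maps an actor name to the set of actors whose output it needs.
    Raises graphlib.CycleError if the dependency graph is not acyclic.
    """
    return list(TopologicalSorter(dependencies).static_order())


def longest_chain(dependencies):
    """Number of actors on the longest dependency chain.

    With sequential iteration, one scheduling pass needs at least the time to
    iterate this many actors one after another.
    """
    depth = {}
    for actor in iteration_order(dependencies):
        preds = dependencies.get(actor, ())
        depth[actor] = 1 + max((depth[p] for p in preds), default=0)
    return max(depth.values(), default=0)


# Scenario of figure 5.8 b): the visualization needs data from input and simulation.
deps = {
    "simulation": {"input"},
    "visualization": {"input", "simulation"},
}
print(iteration_order(deps))
print(longest_chain(deps))
```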
5.1.3.5 Dynamic Mapping of Volatile State Data
Figure 5.9: Dynamic map for volatile state data.
(Diagram labels: Producer P and Consumer, each holding P.V, connected via publish/subscribe
through the CS framework; message sequence 1. ADD(V), 2. VAR_ADDED(P.V) or QUERY(),
3. SUBSCRIBE(P.V).)
To ensure the flexibility of the framework, the mapping of volatile data has to be
dynamic, which means it must be possible to add consumers and producers of volatile
data to a running application. On the one hand, consumers of volatile simulation data
must be able to notice when a new producer provides an additional set of volatile
variables. This of course implies that producers must be able to add new data sets on
joining an application. On the other hand, consumers which join an already
running application (e.g. when a new visualization is registered) must be able to
determine the available sets of volatile state variables. Figure 5.9 shows how these
requirements are met by using publish/subscribe communication and adding a dedicated
remote procedure call (RPC) interface to the communication server. Through this interface,
the communication server provides access and notification methods for its internal
mapping data structure. These methods are:
ADD / REMOVE (var_name)
SUBSCRIBE / UNSUBSCRIBE (var_name)
QUERY ()
The first two methods are used by a producer to add volatile state variables to the
mapping or to remove them from it. When a variable is added to or removed from the
mapping, the communication server sends a notification of the form:
VAR_ADDED / VAR_REMOVED (var_name)
to inform clients which might be interested in subscribing to it. To request a particular
variable, a consumer can use the SUBSCRIBE method. If a producer has already
registered a variable with this name, the mapping is established immediately. It is
not necessary for the consumer to wait for the notification of the addition. If the
variable the consumer wishes to subscribe to has not yet been registered, the mapping
is delayed until a producer registers the matching variable. The last method of the
remote interface, QUERY, allows a connecting client to determine what volatile data is
already available. More sophisticated querying mechanisms can be implemented but
are not relevant for the conceptual design.
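The interface described above can be illustrated with a minimal in-memory sketch. It is not the framework's implementation: the class name VariableMap, the listener callbacks and the handling of consumers on REMOVE (their links are simply dropped) are assumptions made for illustration. What it shows is the deferred mapping: a SUBSCRIBE for a variable that is not yet registered is queued and established as soon as a producer calls ADD.

```python
from collections import defaultdict


class VariableMap:
    """Hypothetical in-memory stand-in for the server's mapping interface."""

    def __init__(self):
        self.variables = set()          # variables registered by producers
        self.links = defaultdict(set)   # var_name -> consumers with a mapping
        self.pending = defaultdict(set) # var_name -> consumers still waiting
        self.listeners = []             # callbacks that receive notifications

    def _notify(self, event, var_name):
        for listener in self.listeners:
            listener(event, var_name)

    def add(self, var_name):
        self.variables.add(var_name)
        # Establish every mapping that was delayed until this registration.
        self.links[var_name] |= self.pending.pop(var_name, set())
        self._notify("VAR_ADDED", var_name)

    def remove(self, var_name):
        # Assumption: existing links are dropped when the variable goes away.
        self.variables.discard(var_name)
        self.links.pop(var_name, None)
        self._notify("VAR_REMOVED", var_name)

    def subscribe(self, consumer, var_name):
        """Return True if the mapping was established immediately."""
        if var_name in self.variables:
            self.links[var_name].add(consumer)
            return True
        self.pending[var_name].add(consumer)
        return False

    def unsubscribe(self, consumer, var_name):
        self.links[var_name].discard(consumer)
        self.pending[var_name].discard(consumer)

    def query(self):
        return sorted(self.variables)
```

A consumer that joins late would call query() first and then subscribe() to the variables it needs; a consumer that is already connected would react to the VAR_ADDED notification instead.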
5.1.3.6 Decoupled Handling of Shared Parameters
In contrast to volatile state data, parameters change less frequently and only as a con-
sequence of user action. Thus, instead of updating parameters every time before an
actor’s iteration, a parameter should only be updated when necessary. Since the num-
ber of actors in the steering application is not assumed to be fixed during runtime,
the decision was to base the concept for steering of shared parameters on the flexible
publish/subscribe method. As publish/subscribe can implement a group-based com-
munication, in which nodes are unaware of group structure and size, this approach
is expected to lead to the most flexible framework for computational steering. Due to
the use of publish/subscribe, a steering client can be implemented with a minimum of
knowledge about the actual steered application and its components. Just knowing the
name and type of a steerable parameter is sufficient to compose a message which sets
this parameter in one or multiple components.
As with the use of publish/subscribe for volatile state data, the connected actors
exchange publish/subscribe messages for the static parameters. However, they do
not subscribe to the parameters via the communication server, but directly to the
updates of the parameter. That means that whenever a producer changes the value of
a parameter, it publishes the new value and all subscribers receive it directly. All
parameters ever published by a producer are listed in a separate parameter map and
can be queried by potential consumers, analogous to the handling of volatile state
data.
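The decoupled parameter handling can be sketched as follows. This is a simplified, in-process illustration under stated assumptions: the class ParameterService and its method names are hypothetical, and message transport is replaced by direct callback invocation. It shows the two properties named in the text: subscribers only receive a value when it changes, and every parameter ever published stays listed in a parameter map that can be queried.

```python
from collections import defaultdict


class ParameterService:
    """Hypothetical in-process model of publish/subscribe for parameters."""

    def __init__(self):
        self.values = {}                      # parameter map: name -> last value
        self.subscribers = defaultdict(list)  # name -> registered callbacks

    def subscribe(self, name, callback):
        self.subscribers[name].append(callback)

    def unsubscribe(self, name, callback):
        if callback in self.subscribers[name]:
            self.subscribers[name].remove(callback)

    def publish(self, name, value):
        # A producer calls this only when the parameter actually changes.
        self.values[name] = value
        for callback in list(self.subscribers[name]):
            callback(name, value)

    def query(self):
        # All parameters ever published, for consumers that join later.
        return sorted(self.values)
```

A steering client needs nothing beyond the name and type of a parameter: calling publish("headlight_type", "xenon") reaches every component that subscribed to that name, however many there are.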
5.1.4 Architecture, Components and Exemplary Workflow of the
Framework
Figure 5.10 shows the basic architecture of the CS framework for IS/VR. Several
actors are connected through a steering library that encapsulates all functionalities of
the Steering API (adding/removing of variables, subscribing/unsubscribing to
parameters/variables, querying etc.). An actor can also be on both sides of the system
if it offers input and output variables. An example of such an actor is a simulation
instance which can be steered through a steering instance and displays its output
through a visualization instance.
In the following the two main server components of the framework are presented
and a sample workflow is described to understand the interaction between the actors
and the framework. Implementation details can be found in section 5.2.
5.1.4.1 Communication Server
The communication server is the central part of the CS framework. Its task is to map
the volatile state data output of the producers to the input of the consumers. This is
done by a dynamic variable map which links one or more variables of each side. It
also serves as a synchronization instance as it has an internal clock, which provides
a tick after which the values of all variables are transferred from the producers (left)
to the consumers (right). This clock limits the time the components have for their
iterations and offers several modes. It can be set to strictly adhere to the time limit
and resend old values if an actor did not compute its outputs in time, or it can be set
to be less strict and allow certain delays, which can cause the whole system to stutter.
Thus, the clock settings need to be chosen carefully and adapted to the application's
needs and capabilities.
5.1.4.2 Publish/Subscribe Service
The Publish/Subscribe service has two main functionalities. The first is to allow
producers to announce their volatile state data output. This information can be queried
by the consumers or is pushed to them if they subscribed to receive announcements
of that producer. The P/S service is tightly coupled to the communication server and
invokes the dynamic map generation if a consumer subscribes to or unsubscribes from
a producer's volatile state data variable. Additionally, the P/S service is responsible
for the handling of the static parameters. Those can be announced and subscribed to
by all actors of the system. Once an actor has subscribed to such a parameter, it
receives all further updates until it unsubscribes from it or the producer ceases to
offer the parameter.
5.1.4.3 Exemplary Workflow
In the following, an exemplary workflow is described for the actor on the left. The
depicted actor computes a couple of variables in each of its iterations. Thereby, a set
of parameters is incorporated into the computation. A user who, in direct interaction
with the actor, changes a parameter, implicitly notifies other users of the particular
parameter’s change. For that purpose, the steering library employs the P/S service.
Additionally, the volatile variables are concurrently transferred to other potential actors
through the mapping and synchronization component of the system. Not shown in the
diagram is the actual synchronization which is imposed on the actor's iterations.
Figure 5.10: Architecture and collaboration of the framework's components. (Diagram:
several actors, each attached through its steering library to the CS framework, which
contains the publish/subscribe service and the communication server. An actor changes
and uses static parameters, whose changes are notified through the publish/subscribe
service, and iterates volatile state data (VSD), which is transferred through the
communication server.)
The components of the computational steering system (visualization, simulation, etc.)
need to be implemented against the steering library. Through the framework, an actor
can achieve the following objectives: 1. receive or transmit volatile state data. 2. share
static parameters among a group of actors. 3. decide on the subscription to newly
added volatile variables. 4. initiate internal function calls on a parameter change.
In order to develop an actor for computational steering, the general approach is to
declare all produced volatile variables, as well as subscriptions to volatile variables
and parameters to the framework. Furthermore, the developer is able to associate
callback functions with certain parameters.
Section 5.2 describes the implementation of the presented concept for Computational
Steering of IS/VR, while sections 5.3.1 and 5.4 show two scenarios in which it is used
in practical applications.
5.1.5 Consistency, Performance Estimation and Latency Optimization
One very important aspect of the proposed framework is the achievable performance
for the distributed, interactive simulations. In contrast to traditional HPC applications,
where in most cases the network throughput is the limiting factor, interactive
simulations depend heavily on very low latencies. In [14], Delaney et al. surveyed the
effects of consistency and latency on distributed interactive applications (DIAs). They
found that the best state a DIA could achieve is absolute consistency. That means that
“at any point in time, all players (users) should ideally see the same information at
the same time independent of the network”. However, this absolute consistency cannot
be attained by a DIA, since network latency and the time to pass information between
two actors cannot be ignored when compared to the time between events. Consistency
mainly refers to three aspects in DIAs: synchronization, causality of ordering and
concurrency. All these aspects influence the system's consistency in one way or
another. Linked to those aspects are the issues of responsiveness (“the time taken for
the system to register and respond to a user event”) and fidelity (“the degree to which
the representation within a simulation is similar to a real-world object, feature, or
condition in a measurable or perceivable manner” [18]). These three issues,
consistency, responsiveness and fidelity, need to be balanced individually for each
application to give the users of the system the best possible result. In the following,
we focus on the problem of latency, since it contributes to all three issues: it is a
property of the network that actually facilitates the DIA. The authors of [14] claim
that there is no commonly agreed definition of latency since it has so many variants.
However, the term network latency (from now on latency) narrows down the problem
to the following definition:
Network latency is the time taken from the start of exchange an application
protocol data unit (APDU) at the application layer of one participating node
to the end of the exchange of the same APDU with the application layer of
a second participating node. [13]
If this latency is higher than a certain threshold, real-time interaction is at stake.
According to Delaney, an application is still perceived as interactive by general users
when its latency stays below a threshold of 40 to 300 ms, depending on the application.
In practical applications, however, 100 ms proved to be a realistic value for the maximal
tolerable round trip latency, that is, the time for transferring input data to a simulation,
processing it and transferring the result back to the visualization. 100 ms seems to be
a lot of time in terms of network latency in HPC clusters, where current Infiniband
networks have latencies between 2 and 10 µs (see for example [74]). However, those
100 ms are the upper bound for all components, including transfer and computation of
the simulation. Especially for applications with large numbers of components, the
central communication server instance might become a bottleneck, but components
that do not work correctly or simply take too much time to compute their results may
also slow down the whole system. Thus it is important to think about mechanisms that
prevent the whole system from halting and that improve the scalability to ensure
adherence to the latency threshold. The following sections present two approaches to
tackle this problem.
5.1.5.1 Best Effort and Dropping of Latecomers
The simplest approach for eliminating sources of extraordinary latency is the rigorous
dropping of latecomers. Since there exists a global synchronization instance with a
global clock, it is possible to just send the last available result to the receivers instead
of the updated computation which arrives after a certain time limit. However, this
might lead to incorrect visualization and erroneous analysis of the simulated data. A
more failsafe approach is to handle latency on a best-effort basis and to ban producers
or notify the users if a certain component exceeds the threshold more often than a
specified number of times. However, neither method helps if the communication server
itself is the source of the latency, e.g. because it is not able to handle all the connected
components in time.
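The two policies can be combined in a small sketch. Everything here is an assumption made for illustration (the class name LatecomerPolicy, measuring lateness as elapsed milliseconds per clock tick, and forwarding the last on-time value for a banned producer); the text only prescribes the general behaviour. A late result is dropped and the last available value is forwarded instead, and a producer that is late more often than max_strikes times is banned.

```python
class LatecomerPolicy:
    """Hypothetical per-tick policy: resend old values, ban repeat offenders."""

    def __init__(self, deadline_ms, max_strikes):
        self.deadline_ms = deadline_ms
        self.max_strikes = max_strikes
        self.last_value = {}  # producer -> last result that arrived in time
        self.strikes = {}     # producer -> number of missed deadlines
        self.banned = set()

    def collect(self, producer, value, elapsed_ms):
        """Return the value that is forwarded to consumers for this tick."""
        if producer in self.banned:
            return self.last_value.get(producer)
        if elapsed_ms <= self.deadline_ms:
            self.last_value[producer] = value
            return value
        # Latecomer: drop the late result and fall back to the old value.
        self.strikes[producer] = self.strikes.get(producer, 0) + 1
        if self.strikes[producer] > self.max_strikes:
            self.banned.add(producer)
        return self.last_value.get(producer)
```

Instead of banning, a deployment could just as well raise a user notification at the point where the producer is added to the banned set; which reaction is appropriate depends on the application.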
5.1.5.2 Clustering of Communication Servers
Figure 5.11: Clustering of multiple communication servers according to master slave
approach. (Diagram: the master, i.e. the main communication server with the global
mapping and the P/S service, above the slave communication servers CS1 to CSn,
which carry links such as Sim1.Obj1 to Vis1.Obj1, Sim2.Obj2 to Vis2.Obj2 and
SimN.ObjM to VisO.ObjP.)
To eliminate the bottleneck of a central communication instance, a concept for the
clustering of communication servers was developed. Figure 5.11 depicts the basic idea
behind the clustering. There still exists a master communication server which holds
the global map and is interfaced with the P/S service. Additionally, there are n slave
communication servers to which the master evenly assigns communication links. This
method can be compared to the striping of data in RAID systems. The master keeps
track of the assignment of communication links and informs the connected components
to which and from which of the slaves they send and receive their data. The master
tries to assign 1×n links to the same slave communication server to avoid segmentation
of those links. However, if this is not possible, a component has to send outgoing data
for the same object to two or more slave communication servers. Clustering the
communication servers makes it possible to scale the system much better. Even
dynamic scaling is an option, where the master launches new servers if it recognizes
that a certain load threshold is reached. As the actual communication goes through
the slaves and no longer through the master, the master can fully concentrate on
assigning communication links and synchronizing the slaves. To avoid allocating new
nodes for the slave servers, it is also possible to start them on simulation or
visualization nodes. In particular, if one simulation sends a lot of information to one
communication server slave, it makes sense to put that slave directly on the
simulation's machine. The actual computing load of a communication server is
insignificant in comparison to the communication load. All this functionality needs to
be transparently encapsulated by the steering library.
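One possible assignment strategy for the master is sketched below. The text does not specify the algorithm, so the function assign_links, the per-slave capacity limit and the "largest group first, least-loaded slave" heuristic are all assumptions. The sketch keeps all links that share a source on one slave where capacity allows and splits a group across slaves only when it has to.

```python
from collections import defaultdict


def assign_links(links, n_slaves, capacity):
    """Map each (source, sink) link to a slave index (hypothetical heuristic)."""
    groups = defaultdict(list)
    for source, sink in links:
        groups[source].append(sink)

    load = [0] * n_slaves
    assignment = {}
    # Place large 1-to-n groups first so they are less likely to be split.
    for source in sorted(groups, key=lambda s: (-len(groups[s]), s)):
        slave = min(range(n_slaves), key=lambda i: load[i])
        for sink in groups[source]:
            if load[slave] >= capacity:
                # Group does not fit: continue on the least-loaded slave.
                slave = min(range(n_slaves), key=lambda i: load[i])
            if load[slave] >= capacity:
                raise ValueError("not enough slave capacity for all links")
            assignment[(source, sink)] = slave
            load[slave] += 1
    return assignment
```

With two slaves and enough capacity, the links of Sim1.Obj1 end up on one slave and those of Sim2.Obj2 on the other; a dynamic variant would start an additional slave at the point where this sketch raises an error.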
5.2 Prototype - The CSIS Framework
In the following the implementation of the framework for computational steering of
interactive simulations (CSIS) is briefly described on the basis of the three main com-
ponents:
1. A communication server for the distribution of volatile state data.
2. A publish/subscribe server for the dynamic mapping of the volatile state data
and the passing of static parameters.
3. An API which offers the framework's functionality to applications.
The following sections will introduce the two supporting technologies used for im-
plementation of the CSIS framework: the Commuvit communication server and the
D-Bus for publish/subscribe services, and describe the necessary adaptations. After-
wards, the focus lies on the actual steering library and the introduction of the API
used to implement actors of a steering application. For further implementation details
please refer to [17].
5.2.1 Commuvit
The transfer of volatile state data between actors in the steered application, according
to the communication map, is realized with the Commuvit communication server (see
[5]). It was initially designed for the real-time execution of distributed simulations of
mechatronic systems and fulfills the requirements for the transfer of volatile state data:
the simultaneous and synchronized transfer of data from sources to drains. After
a brief architectural overview, the adaptations that were made for the CSIS framework
are described.
5.2.1.1 Architecture
Figure 5.12: Connection handling by the Commuvit server. (Diagram: the Commuvit
master repeats two steps with time step ∆t, 1. iterate slaves and 2. update map; each
Commuvit slave performs 1. receive data, 2. compute, 3. wait, 4. transmit data.)
The core of the Commuvit communication server is its centrally managed simulation
time and a global variable map. The main loop of the Commuvit server steps the
time of the simulation with a configurable time ∆t in each iteration. The map stores
the volatile state variables for each producer and assigns them to the subscribed con-
sumers. As shown in figure 5.12, Commuvit employs a master thread and one slave
thread for each connected component; these are employed to synchronize the compo-
nents of a distributed application. During a single iteration, the master thread instructs
its slaves to process the following steps:
1. transmit the component’s input data from the map to the component.
2. instruct the component to start its computation.
3. wait for the computation to finish.
4. copy the output data (results of the computation) into the internal map.
The mapping, which specifies how data is transferred from and to the different tools,
originally was loaded from a static configuration file at startup. Through the integra-
tion of the publish/subscribe service it is now possible to alter this map dynamically.
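The iteration scheme can be illustrated with a simplified sketch. It is not Commuvit code: the names Component, slave_iteration and run are hypothetical, components are plain Python callables instead of networked tools, and for brevity the master loop gathers the inputs and stores the outputs itself while the worker threads only run the computation. All inputs of a step are read before any slave starts, so every component sees the values of the previous step.

```python
from concurrent.futures import ThreadPoolExecutor


class Component:
    """Hypothetical component: a name plus a function inputs -> outputs."""

    def __init__(self, name, compute):
        self.name = name
        self.compute = compute


def slave_iteration(component, inputs):
    # Steps 1 to 3: hand over the inputs, run the computation, wait for it.
    outputs = component.compute(inputs)
    return component.name, outputs


def run(components, links, steps, dt):
    values = {}     # global variable map: "component.variable" -> value
    sim_time = 0.0  # centrally managed simulation time
    with ThreadPoolExecutor(max_workers=len(components)) as pool:
        for _ in range(steps):
            futures = []
            for comp in components:
                prefix = comp.name + "."
                inputs = {
                    sink[len(prefix):]: values.get(source)
                    for source, sink in links
                    if sink.startswith(prefix)
                }
                futures.append(pool.submit(slave_iteration, comp, inputs))
            for future in futures:
                # Step 4: copy the results into the internal map.
                name, outputs = future.result()
                for var, value in outputs.items():
                    values[name + "." + var] = value
            sim_time += dt
    return sim_time, values
```

Because the map is only updated after all slaves have finished, a consumer lags exactly one step behind its producer in this sketch.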
5.2.1.2 Variable Mapping
The variable mapping is the heart of the Commuvit server. It is stored in a
configuration file, an example of which is shown in figure 5.13. The server uses this
initial map to create all variables and their bindings. The variables have unique names
and can be grouped hierarchically. In the example, simulation sim1 offers one car car1
which has a pos object with 3 volatile state variables x, y and z. Those variables are
mapped to two visualization instances vis1 and vis2, which also have interfaces for a
car object with a pos object consisting of x, y and z variables. From now on, in every
global time step, the values in x, y and z are copied synchronously from sim1 to vis1
and vis2.
#vis1
#car1
Map=sim1.car1.pos.x,vis1.car1.pos.x
Map=sim1.car1.pos.y,vis1.car1.pos.y
Map=sim1.car1.pos.z,vis1.car1.pos.z
#vis2
#car1
Map=sim1.car1.pos.x,vis2.car1.pos.x
Map=sim1.car1.pos.y,vis2.car1.pos.y
Map=sim1.car1.pos.z,vis2.car1.pos.z
Figure 5.13: Commuvit map for one simulation and two visualization instances.
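How such a map drives the per-step copy can be sketched as follows. The parser is an assumption based only on the lines shown in the figure (comment lines starting with #, entries of the form Map=source,sink); the real file format may have further rules, and the function names parse_map and step are hypothetical. Whitespace inside an entry is ignored so that differently spaced listings parse the same way.

```python
def parse_map(text):
    """Parse 'Map=source,sink' lines into a list of (source, sink) links."""
    links = []
    for raw in text.splitlines():
        line = "".join(raw.split())  # drop all whitespace within the line
        if not line or line.startswith("#"):
            continue
        key, _, rest = line.partition("=")
        if key != "Map":
            raise ValueError("unexpected entry: " + raw)
        source, sink = rest.split(",")
        links.append((source, sink))
    return links


def step(links, values):
    """One global time step: copy every source value to its sinks."""
    snapshot = dict(values)  # all sinks see the same, consistent state
    for source, sink in links:
        if source in snapshot:
            values[sink] = snapshot[source]
```

Since one source may appear in several entries, the same value of sim1.car1.pos.x reaches both vis1 and vis2 within a single step.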
5.2.2 D-Bus
Where Commuvit is ideal for fast and low-overhead transfer of volatile state data, D-
Bus [21] is used to implement group-shared state parameters and the dynamic genera-
tion of Commuvit’s variable map. D-Bus is used by many Linux applications, such as
the Hardware Abstraction Layer (HAL), CUPS, the Gnome Power Manager, or Beagle.
Support for the Windows platform is currently under development. It is a very pow-
erful architecture which can be used for several purposes. A good basic description is
given in [21]. The D-Bus framework itself actually consists of two major components.
On the one hand, at the protocol level, the D-Bus library can be used to implement
peer-to-peer remote procedure calls. On the other hand, the framework provides the
dbus-daemon, a message bus daemon, which acts as a router for messages. Moreover,
the message bus allows the monitoring of registered objects and their methods and
signals. An object in D-Bus is identified by an object path, e.g. /de/upb/commuvit/map.
One important property of D-Bus is that it sends messages to objects, not to applica-
tions. Thus, applications in D-Bus can register multiple objects. The functionality of
the major part of the D-Bus framework can also be found in many other RPC frame-
works. There are, however, two reasons for the application of D-Bus for distributed
communication in the steering library. First of all, D-Bus is lightweight and widely
available as part of the major Linux distributions. Secondly, the D-Bus framework can
be employed as an easy-to-use publish/subscribe system. As any object can subscribe
to signals with a lightweight subscription language (string matching), it is possible to
implement publish/subscribe communication with D-Bus. The signals in D-Bus make
it possible to implement the required group-based communication mechanism for the
framework. If, for example, there are several nodes driving a tiled visualization wall,
all of these nodes subscribe to an object /de/upb/steering/tiled-wall and receive all
signals that are addressed to that group. However, the best scalability and performance
could likely be achieved by implementing the publish/subscribe module completely
from scratch and designing it for the special requirements of CS for IS/VR. This,
however, could not be done in the limited scope of this prototype and is scheduled for
a more advanced implementation of the framework.
5.2.3 The CSIS Server - Integrating Commuvit and D-Bus
The CSIS server is in fact an extended Commuvit server, which combines the existing
system with the messaging functionalities of the D-Bus system. All the changes
were implemented according to the concepts described in Section 5.1.3.3. In order to
facilitate dynamic addition and removal of producers of volatile variables, the variable
map has been extended by two remote procedure calls: add(string var_name)
and subscribe(string var_name). The first method is used by the D-Bus library to
add a new volatile state variable to Commuvit's internal mapping whenever a producer
announces new volatile variables. The latter method is used to register a consumer's
interest in some variable. If this variable is not yet available, the request will be queued;
otherwise a mapping is created between the producing actor and the consumer.
The second adaptation that had to be made was a notification and monitoring interface.
Upon a call to the add_output(string name) method, the CSIS server has to notify
existing actors of the new variable to allow them to subscribe to this variable. This is
realized through a notification interface between the communication server and the
publish/subscribe service. In the opposite case, i.e. if the consumer of a variable joins
after the producer, a method for initially receiving all available volatile state variables
has been added. With these implemented changes, the CSIS server can now be started
with an initially empty map. Through the connection of producing and consuming
actors, started by the user, the map will successively be filled.
The passing of shared parameters is handled by the D-Bus system and a dedicated
map is stored in the CSIS server. In contrast to volatile state variables, which are
transferred in every step of the global clock between all subscribed actors, shared
parameters are only transferred when they are changed. This is a basic feature of
publish/subscribe systems: whenever an actor subscribes to a shared parameter, it
receives every update until it unsubscribes from it. This functionality, as well as the
adding of and subscription to volatile state variables, is offered through the CSIS
steering library.
5.2.4 The CSIS Steering Library
Instead of going into details of the library's internal implementation, a more practical
description of the framework's API will be given. The actors of a computational steer-
ing application (e.g. visualization, simulation) will need to be implemented against the
steering API. Through the framework, an actor can achieve the following objectives: 1.
receive or transmit volatile state data. 2. share parameters among a group of actors.
3. decide on the subscription to newly added volatile variables. 4. initiate internal
function calls on a parameter change. In order to develop an actor for computational
steering, the general approach is to declare all produced volatile variables, as well as
subscriptions to volatile variables and parameters to the framework. Furthermore, the
developer is able to associate callback functions with certain parameters. All in all, the
framework’s API provides the following basic functions:
connect(char* host, int c_port, int d_port):
Connects to a host with the Commuvit server and the D-Bus service on ports c_port
and d_port (client-side).
iterate():
Must be called in every iteration of the component/actor. Calling this function will
send and receive outstanding notifications through the D-Bus and synchronize the
component with the Commuvit server.
reg_var(char* name, double* var):
Registers a new volatile variable in the Commuvit server. In every iterate() call to
the framework, the value of var will be transferred to consumers of this variable via
Commuvit.
subscribe_var(char* name, double* var):
Subscribes to the volatile variable name. The location var will be updated with the
variable's new value in every iteration.
set_var_filter(void (*filter_func)(char* var_name)):
Sets a callback function, which will handle notifications of newly available volatile
state variables. Thus, a component can decide whether or not to subscribe to such a
new variable.
reg_parameter(char* name, int (*cb_func)(char* var_name, GValue* val)):
Registers a parameter with the framework. Updating the value of the local copy of
this parameter is a task of the specified callback function. Thus, it is possible to
implement calls to a component's internal functions on the change of a parameter.
reg_parameter(char* name, GValue* value):
Registers a parameter with the framework. The parameter is automatically updated
by the framework.
update_parameter(char* name, GValue* value):
Updates the shared parameter name with the specified GValue.
query():
Returns a list of all available producers of volatile state data and static parameters
which can be used to manually or automatically subscribe to the desired variables.
disconnect():
Disconnects the actor from Commuvit and the D-Bus service.
By integrating these functions into an arbitrary IS/VR application, it can now be started
and executed in a distributed fashion and make use of the extended models of CS for
IS/VR presented in the sections above. Two examples and the benefits thereof are
presented in the following.
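The typical call sequence of an actor can be illustrated with an in-process stand-in. The sketch below is not the CSIS library: Hub, Actor and Ref are hypothetical Python classes that merely mirror the function names listed above, the pointer arguments are replaced by small Ref objects, and connect() and disconnect() are omitted because no network is involved.

```python
class Ref:
    """Mutable cell standing in for the double* arguments of the API."""

    def __init__(self, value=0.0):
        self.value = value


class Hub:
    """Hypothetical in-process replacement for the CSIS server."""

    def __init__(self):
        self.volatile = {}         # variable name -> latest value
        self.param_callbacks = {}  # parameter name -> list of callbacks


class Actor:
    def __init__(self, hub):
        self.hub = hub
        self.produced = {}  # name -> Ref read on every iterate()
        self.consumed = {}  # name -> Ref written on every iterate()

    def reg_var(self, name, ref):
        self.produced[name] = ref
        self.hub.volatile.setdefault(name, ref.value)

    def subscribe_var(self, name, ref):
        self.consumed[name] = ref

    def reg_parameter(self, name, callback):
        self.hub.param_callbacks.setdefault(name, []).append(callback)

    def update_parameter(self, name, value):
        for callback in self.hub.param_callbacks.get(name, []):
            callback(name, value)

    def iterate(self):
        # Push produced values, then pull the values subscribed to.
        for name, ref in self.produced.items():
            self.hub.volatile[name] = ref.value
        for name, ref in self.consumed.items():
            if name in self.hub.volatile:
                ref.value = self.hub.volatile[name]

    def query(self):
        return sorted(self.hub.volatile)
```

A simulation actor would call reg_var() once per output and then iterate() in its main loop, while a visualization actor would call subscribe_var() for the same names and register callbacks for parameters such as a day/night switch.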
5.3 Computational Steering of a Distributed Driving
Simulator
Since the goal of CSIS was not primarily to improve the performance of certain
applications, but to provide the possibility to run them on hybrid cluster systems, the
evaluation of this system focuses on describing successful deployments. The first use
case of the CSIS framework and of the application of computational steering paradigms
to an IS/VR application is a distributed driving simulator. Its main focus lies on the
realistic simulation of automotive headlights, but it also demonstrates the possibilities
of a distributed and modular VR application. After a short introduction of the original
application, the modularization and integration of the CSIS framework is described.
Afterwards, examples of new functionalities and new scenarios are presented.
5.3.1 The Virtual Night Drive Simulator
The Virtual Night Drive (VND) simulator was introduced by Berssenbruegge et al.
in [7] and in [22] and was developed at the Heinz Nixdorf Institute, Paderborn, in
cooperation with the Hella KGaA Hueck & Co. Its main focus is the realistic simulation
of headlights on real-world tracks to do physiological tests and to let engineers evaluate
prototypes of new headlights. There are several characteristics of headlights that need
to be considered when designing a simulator. It is nearly impossible to visualize the
complex light distribution with traditional computer graphics lighting and shading
methods like point light sources or Phong shading. Instead, a system was developed
that takes advantage of programmable pixel and vertex shaders to project the light of
the headlight onto the scene per pixel. This provides a very realistic impression for
test subjects and engineers and allows the differentiation of several types of headlights.
It is also possible to check the compliance of the headlights with strict standards at a
very early stage of development (e.g. dimmed headlights may not beam over a certain
horizon). Figure 5.14 shows a screen shot of the original VND simulator running on
one single computer.
Figure 5.14: Screen shot of the Virtual Night Drive simulator.
5.3.2 Modularizing the VND Simulator
The original version of the simulator works quite well as long as only one car with
headlights is simulated. For more cars, the light simulation requires more computational
power than a single CPU/GPU workstation can provide. Additionally, it is not
trivial to realize effects like blending with the existing shader approach since no real
light sources are utilized. But especially for the psychological testing it is extremely
important to have more than one car with realistic headlights to simulate effects like
oncoming traffic or convoys of vehicles, which are quite important in everyday life. This
led to the idea of distributing the whole system onto several computers and bundling
their computational power to provide an interactive system for complex scenarios.
Additionally, the need for various display and computing scenarios arose, for example
to be able to drive a stereoscopic wall or tiled displays. Thus, a modularization
and a flexible basis for coupling the components was needed. This basis was the CSIS
framework. Before the integration of application and framework is described, a quick
introduction of the four VND modules simulation, visualization, audio and input is
given.
5.3.2.1 The Input Component
The VND simulator supports different types of input classes and devices. For the
steering of a simulated car in the virtual world, there currently exist three different
input devices:
Standard keyboard input, which serves only for testing and supervising purposes
since it does not simulate a realistic driving experience.
A steering wheel for gaming, which is cheap and can be easily connected to
standard PCs. It provides a basic driving experience, but lacks the haptic feedback
of a real driving wheel.
Different professional force feedback wheels which are amongst others used to
evaluate steer-by-wire approaches. They are connected through a standard CAN-
Bus interface (see for example [19]) and can simulate the haptic feedback of a real
driving wheel.
For all of these devices an input component was developed which implements a special
interface. This interface consists of three output volatile state variables: throttle,
brake and steering angle. Whenever an input component connects to the CSIS
framework, it publishes these three variables and updates them in every global time step.
Additionally, there is a second input component (which can also be integrated in
the first component if needed) to launch special events. Examples of such events are
the switching between day and night view, the selection of different headlight types
or the resetting of the car's position. All events of this fixed set are published to the
system as shared parameters and can be subscribed to by interested consumers. The
input component reacts to user inputs (e.g. pressing a specific key on the keyboard to
trigger a special event) by updating and publishing the appropriate parameter.
5.3.2.2 The Simulation Component
The simulation component of the VND simulator deals with the dynamic simulation
of the cars in the virtual world. It uses the volatile state variables throttle, brake
and steering angle from an arbitrary input component to calculate the position and
direction of one car object in the three-dimensional scene. There can be different forms
of simulation, ranging from simple and not very realistic models to fully detailed
rigid body dynamics simulations. The only thing they must provide are the following
volatile state output variables: pos with the subvariables x, y and z for the
three-dimensional location of the car and dir with the subvariables x, y, z and phi for
the direction and pitch angle of the car.
In addition to simulation components that accept input from an input component,
there also exist input-less simulation components. Those can steer cars autonomously,
possibly by following predefined tracks or implementing scripted or simulated behavior.
However, they also have to provide the output variables pos and dir and their
subvariables.
A simulation either subscribes to the volatile state variables of one input component
or, if it is autonomous, to none, and publishes the hierarchical state variables
pos and dir. Additionally, it can subscribe to simulation-related events such as
resetting the position of a car.
5.3.2.3 The Visualization Component
Figure 5.15: Three VND visualization components used to drive a 3-channel projection.
The visualization component is responsible for the graphical output of the simulation
data. In the case of the VND simulator it renders the car objects in the virtual world.
Furthermore, it is responsible for the headlight simulation of all displayed cars. A
visualization can have an arbitrary number of car objects and for every car object it
accepts one dir and one pos volatile state variable. Those variables determine the
actual position and direction of the car as they were computed by a simulation.
Additionally, the visualization component can subscribe to several events, e.g. car-
dependent ones such as switching headlights or car-independent ones such as switch-
ing day and night. Each visualization component also has one fixed field of view and
camera angle. This can be used to create sophisticated displaying setups such as tiled
display walls or multiple-channel-projections as for example shown in figure 5.15.
A special form of visualization which tries to overcome the limit on the number of
simultaneously simulated headlights is presented in section 5.3.5.
5.3.2.4 The Audio Component
For the acoustic output of the driving simulation there exists an audio component
which can be bound to any car in the simulation. The audio component is also de-
signed as an independent application which receives the same information about the
car objects as the visualization component. Additionally, it subscribes to the position of
a camera in an arbitrary visualization to be able to generate 3D sound for that specific
generation. This allows for the realization of surround sound setups which help the
users to further immerse into the virtual reality.
5.3.3 An Exemplary Setup of the Computationally Steered VND
Simulator
[Figure 5.16 diagram: VND-VIS, VND-SIM, AI-SIM and VND-AUDIO components (cars car1, car2, car3 and user) connected through the CS framework with its publish/subscribe service and communication server; regions labeled Collaborative CS, Concurrent CS and Synchronized CS.]
Figure 5.16: Computational Steering of the distributed simulator VND.
Figure 5.16 shows an exemplary scenario of the Virtual Night Drive application where
the extended models of computational steering for IS/VR are applied and combined.
The scenario that was realized is the following: A user steers one car in the scene which
is displayed on a high-resolution tiled wall. To render the images for the 3x2 tiled wall,
three different graphics nodes running six instances of the visualization component
(VND-VIS) are used. To ensure that every node renders the right view, they are syn-
chronized by using Synchronized CS. Additionally, an audio component (VND-AUDIO),
which receives updates of the camera position of the VND-VIS instances, is responsible
for the generation of sounds for all cars in the scene. The input signal from the user’s
steering wheel is processed by an instance of the simulation component (VND-SIM)
which computes the position, direction and speed of the steered vehicle. Furthermore,
three additional cars are introduced to simulate a populated scenery. They are con-
trolled by simple AI simulations (AI-SIM). These four concurrent simulation threads
(one user, three AI) are coordinated by Concurrent CS. Finally, the whole scenario is
observed by a supervisor who has a general overview of the scene (also driven by a
synchronized VND-VIS component). He can control the environment of the test person
and the AI simulated cars. Thus the supervisor and the user interact with the system
through Collaborative CS.
5.3.3.1 Exemplary Usage of the Framework for the Passing of Volatile State Data
Both the steering data that comes from the input devices and the vehicle-related data
(position, direction etc.) of the simulation are obviously volatile. Imagine a new actor
(e.g. a spectator) joins the system in the example in Figure 5.16 with his own visual-
ization instance (VND-VIS) that enables him to watch the scenario from a prespecified
point. In order to be able to see the vehicles in his instance of visualization, he must
subscribe to receive the volatile simulation data from all active simulation instances. If
this is successful, the spectator’s VND-VIS receives the volatile data of all cars in the
scene (user- and AI-controlled) and is able to display them at their current positions and
with the correct directions.
5.3.3.2 Exemplary Usage of the Framework for the Handling of Shared Parameters
Analogous to the passing of volatile data, a short example for the handling of shared
parameters is given. An exemplary shared parameter is the current type of headlights
(regular, xenon, dimmed etc.). The user or the supervisor can choose it. In order to visu-
alize the right headlights, every visualization instance that displays the corresponding
car needs to be informed about changes of that parameter. Hence, a publish/subscribe
group is created and all visualization instances as well as the user’s and the super-
visor’s input instances subscribe to the group. If the supervisor now changes this
parameter, all subscribers are informed and can update their local parameters and, if
necessary, use them to display the new headlights. If another instance of visualiza-
tion (e.g. the spectator) joins the system later, it also subscribes to the group and
automatically gets the current parameter status and all future changes.
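The behavior of such a publish/subscribe group, including the late-joining spectator, can be sketched in a few lines. This is an in-process illustration only and not the API of the steering library: the class names ParameterGroup and Visualization, the callback signature and the parameter name car1.headlight_type are hypothetical, and the real system exchanges these messages over the network.

```python
class ParameterGroup:
    """A publish/subscribe group that holds one shared parameter."""

    def __init__(self, name, initial):
        self.name = name
        self.value = initial
        self.subscribers = []

    def subscribe(self, callback):
        # A new subscriber immediately receives the current parameter status.
        self.subscribers.append(callback)
        callback(self.name, self.value)

    def publish(self, value):
        # Every change is forwarded to all current subscribers.
        self.value = value
        for callback in self.subscribers:
            callback(self.name, value)


class Visualization:
    """Keeps a local copy of the shared parameters it subscribed to."""

    def __init__(self, label):
        self.label = label
        self.params = {}

    def on_update(self, name, value):
        self.params[name] = value


headlights = ParameterGroup("car1.headlight_type", "regular")

wall = Visualization("VND-VIS wall")
headlights.subscribe(wall.on_update)

headlights.publish("xenon")                 # the supervisor changes the parameter

spectator = Visualization("VND-VIS spectator")
headlights.subscribe(spectator.on_update)   # joins later, gets "xenon" at once
joined_with = spectator.params["car1.headlight_type"]

headlights.publish("dimmed")                # both instances are informed
```

Sending the current value on subscription is what frees a late joiner from asking each publisher for its state.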
5.3.4 CS-Enhanced Virtual Night Drive - Conclusion
This sample scenario shows some of the potential of the computational steering frame-
work for a Virtual Reality application. Before the introduction of CS for IS/VR it
was hardly possible to realize this scenario without coding the whole application from
scratch. Now it is sufficient to integrate the steering-library into the relevant com-
ponents and declare input and output variables. With that done it is now possible
to dynamically arrange various setups and allow for the creation of complex virtual
worlds. New modules, for example sophisticated dynamic simulations or advanced
steering devices, can easily be integrated by defining their interface and dynamically
connecting them to the system. In addition, various displaying setups can be realized
without making changes to the source code of the visualization. The whole setup can
now run on a shared cluster resource such as the Arminius Cluster (see section 2.1.3)
and utilize its computational and graphical potential.
5.3.5 Distributed Shader-based Visualization through Computational
Steering
[Figure 5.17 diagram labels: Synchronized CS; Shader 1 ... Shader n; Renderer; Combiner; Resulting Image.]
Figure 5.17: Composing synchronously rendered frames.
The framework also enabled new research possibilities in the field of distributed vi-
sualization (see [42]). By using multiple nodes for the simulation of headlights and
coupling them by synchronized CS it was possible to overcome given limitations of
current graphics hardware. In more detail, it is now possible to simulate a nearly un-
limited number of cars with headlights, depending only on the number of visualization
nodes available. All of them render as many lit cars as possible and send these images
to one (or more) composer nodes, which merge all results into one single view that is
shown to the user. Figure 5.17 shows the idea of composing the synchronously ren-
dered frames. Here, the Renderer renders the unlit scene and all Shaders perform the
compute-intensive, shader-based light simulation. The Renderer and all Shaders have
the same view on the scene. Since they are all synchronized through CS, all nodes
have the same state of the simulations and thus, a composition of all rendered frames
is possible. This composition is done by the Composer node (cascaded Composer se-
tups are planned to avoid the network-related bottleneck). Benchmarks showed that
with 8 nodes, speedups of 4x to 5x could be achieved when more than 20 headlights
are simulated. For more details please refer to [42].
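The composition step can be illustrated with a small sketch. The text does not state which blending operator the Composer uses, so the additive accumulation of light with clamping shown here is an assumption. Frames are reduced to flat lists of 8-bit grayscale values to keep the example self-contained. The sketch relies on the property stated above: all frames show the same view and the same simulation state, so they can be combined pixel by pixel.

```python
def compose(unlit, light_layers):
    """Add the per-pixel light contributions to the unlit frame.

    unlit:        list of pixel values rendered by the Renderer
    light_layers: one list per Shader, same length and same view as unlit
    Values are clamped to the 8-bit maximum of 255 (assumed blending rule).
    """
    result = []
    for i, base in enumerate(unlit):
        total = base + sum(layer[i] for layer in light_layers)
        result.append(min(255, total))
    return result


unlit_frame = [10, 10, 10, 10]        # from the Renderer
shader_frames = [
    [0, 100, 0, 0],                   # from Shader 1
    [0, 200, 50, 0],                  # from Shader 2
]
frame = compose(unlit_frame, shader_frames)
```

The Composer touches every pixel of every incoming frame, so its cost grows with the number of Shader nodes. This matches the bottleneck that cascaded Composer setups are meant to relieve.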
5.4 Computational Steering for an Interactive Material
Flow Simulation
A second IS/VR system that was adapted to the CSIS framework is the interactive
material flow simulation (MFS) d3FACT insight. It was developed at the Fraunhofer
ALB institute in Paderborn and was first presented in [39]. In addition to the traditional
tasks of MFS such as planning, safeguarding and improving production processes, its
main goal is to provide better usability and new methods of exploring the simulated
data. The system has a strong focus on the three-dimensional visualization of the
simulated material flow in order to give the process and factory designers an insight
into what the future systems may look like. After a quick introduction to d3FACT insight,
this section describes the adaptation of the application to the CSIS framework and the
resulting benefits. More details can be found in [47].
5.4.1 The d3FACT insight Material Flow Simulation
The d3FACT insight MFS system is split into two parts. The first part offers modeling
functionalities and allows the process designer to arrange building blocks (e.g. ma-
chines, conveyors etc.) in a two- or three-dimensional environment. Each of these
building blocks has a set of variables which determine its throughput, failure rate,
processing time etc. These can also be set initially in the modeling phase. The model
itself is stored in an XML file and checked against a DTD for completeness and cor-
rectness. Thereafter, the model is translated into a Java program by an XSL preprocessor.
This Java program serves as input for the actual simulation kernel (the second part),
which is also directly connected to the three-dimensional visualization. The whole
d3FACT insight system is a very sophisticated and complex application and it is beyond
the scope of this thesis to describe all of its features (including motion planning and
various economic models). However, to understand the concept of introducing CS
to d3FACT insight and the benefits thereof, the basic components and concepts of this
system are briefly introduced in the following.
5.4.1.1 Initialization and Execution of the Simulation Experiments
Before the actual computation of the simulation experiment, an initialization is exe-
cuted where the simulation is filled with the input data from a simulation database or
external sources. In the original version, an experiment manager can manage several
variants of the experiment, so that multiple simulation runs can be computed sequen-
tially or in a distributed, but not interactive mode on different computers. Selected
variables and their parameter changes are recorded in each element of the simulation
model or are stamped on the tokens, which run through the system. This collected
data is saved in the database and analyzed subsequently. Because of the possibility of
user interactions during the execution of one simulation model, the parameterizations
made by the user are recorded as well. During an experiment, most variables can be
viewed and in some cases changed. The analysis of the simulation experiment can be
adjusted individually. Some standard analyses and statistics are presented by standard
building blocks, available in a modeling library. However, the current approach is
limited in flexibility and expandability since, for example, a comparative visualization
of multiple simulation runs is not possible, because the simulation and visualization
instances are not connected interactively.
5.4.1.2 Modeling with Building Blocks and Subblocks
Figure 5.18: Example of an executable simulation model consisting of 7 subblocks.
To better understand how the modeling in d3FACT insight works, figure 5.18 shows
an exemplary model in a two-dimensional representation. The model consists of four
different building blocks: Source, Conv, Saw and Stock. Those building blocks can be
grouped in libraries (e.g. one library for machines, one for conveyors etc.). Through the
definition of variables and event routines, the building block’s behavior is described.
In order to allow hierarchical modeling, objects are derived from these building blocks.
These objects are called subblocks and inherit all variables, parameters and behavioral
descriptions in the event routines from their building block. For each subblock, it is
possible to change its actual parameters, but not to change its basic behavior. Subblocks
also have several input and output channels, so that they can be linked together. In the
example in Figure 5.18 the subblocks source 1, conv 1, saw 1, conv 2, conv 3, stock 1
and stock 2 are linked through those channels and thereby define the actual material
flow.
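The relation between building blocks and subblocks can be sketched as a small data model. This is not d3FACT insight code: the class layout, the variable names (interval, processing_time, failure_rate, capacity), their values and the topology of the links are assumptions for illustration, since the text names the seven subblocks of figure 5.18 but does not spell out how they are connected.

```python
from dataclasses import dataclass, field


@dataclass
class BuildingBlock:
    """Library element that defines the variables (and behavior) of a block."""
    name: str
    defaults: dict

    def instantiate(self, sub_name, **overrides):
        # A subblock inherits all variables; only parameter values may differ.
        unknown = set(overrides) - set(self.defaults)
        if unknown:
            raise KeyError(f"not defined by {self.name}: {sorted(unknown)}")
        return Subblock(sub_name, self, {**self.defaults, **overrides})


@dataclass
class Subblock:
    name: str
    block: BuildingBlock
    params: dict
    outputs: list = field(default_factory=list)

    def link(self, other):
        """Connect an output channel of this subblock to another subblock."""
        self.outputs.append(other)
        return other


# Hypothetical library with the four building blocks named in the text.
source = BuildingBlock("Source", {"interval": 5.0})
conv = BuildingBlock("Conv", {"processing_time": 2.0})
saw = BuildingBlock("Saw", {"processing_time": 4.0, "failure_rate": 0.01})
stock = BuildingBlock("Stock", {"capacity": 100})

source_1 = source.instantiate("source_1")
conv_1 = conv.instantiate("conv_1")
saw_1 = saw.instantiate("saw_1", processing_time=3.0)  # changed parameter value
conv_2 = conv.instantiate("conv_2")
conv_3 = conv.instantiate("conv_3")
stock_1 = stock.instantiate("stock_1")
stock_2 = stock.instantiate("stock_2")

# Assumed material flow: one line that splits into two branches after the saw.
source_1.link(conv_1).link(saw_1)
saw_1.link(conv_2).link(stock_1)
saw_1.link(conv_3).link(stock_2)
```

Rejecting overrides for variables that the building block does not define mirrors the rule that a subblock may change parameter values but not the basic behavior.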
5.4.1.3 The Simulation, Tokens and Visualization
The actual simulation follows the principle of a discrete, event-based simulation. That
means that one or more sources input tokens to the simulation model, which initiate
certain events in every subblock they reach. Each of the subblocks also has a processing
time, after which a token is passed to an output channel over which it reaches the next
subblock and triggers the next event. This process can be simulated in real-time or
in accelerated/slowed-down time. The simulation finishes when all tokens have reached a
sink. The simulation also includes stochastic models to simulate machine failure or the
production of scrap.
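The principle can be made concrete with a minimal event loop. The sketch below is not the d3FACT insight kernel. It assumes a single linear chain of subblocks with unlimited capacity, deterministic processing times and no stochastic failures, and it uses a priority queue ordered by event time. All names and values are hypothetical.

```python
import heapq


def simulate(chain, processing_time, arrivals):
    """Run a discrete event simulation over a linear chain of subblocks.

    chain:           ordered subblock names, the last one is the sink
    processing_time: subblock name -> time a token spends in that subblock
    arrivals:        token name -> time at which it enters chain[0]
    Returns a log of (time, token, subblock) entries in event order.
    """
    events = [(t, token, 0) for token, t in arrivals.items()]
    heapq.heapify(events)
    log = []
    while events:                      # finished once every token is in the sink
        time, token, idx = heapq.heappop(events)
        name = chain[idx]
        log.append((time, token, name))            # event: token enters subblock
        if idx + 1 < len(chain):
            leave = time + processing_time[name]   # token leaves after processing
            heapq.heappush(events, (leave, token, idx + 1))
    return log


chain = ["source_1", "conv_1", "saw_1", "stock_1"]
processing_time = {"source_1": 0.0, "conv_1": 2.0, "saw_1": 4.0}
log = simulate(chain, processing_time, {"t1": 0.0, "t2": 1.0})
```

Each log entry corresponds to a token entering a subblock. Running in real time or in accelerated time only changes how fast the loop is allowed to consume the queue, not the order of the events.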
In addition to the simulation-related events, there are also visualization-related ones.
For example, every time a token enters a new subblock, an animation event is sent out
to the connected visualization. This event triggers the graphical representation of the
token (e.g. a work piece) to move through the current subblock on a specified path.
It also holds the computed processing time and other information about the token
(e.g. which path inside the subblock it takes). These animation events are the main
interface between simulation and visualization. Additionally, the visualization can
change certain parameters of the subblocks through events which it sends out to the
simulation kernel.
5.4.1.4 Communication Limitations
A general multi-user approach for d3FACT insight should allow the users to collabo-
rate and interact cooperatively on one or more simulation runs at the same time. This
directly leads to the demand of an efficient communication mechanism for the message
transmission from the simulation kernel to the connected visualization modules and
back. The transmission of the events between simulation and visualization needs to
be regulated by a central instance. In the original implementation, the communication
is processed by the simulation kernel itself. Especially for complex simulation models
and multiple connected users, this leads to a significant slow-down of the simulation
run, since a direct connection has to be established for every user. If the user, more-
over, wants to switch between several distributed simulation kernels, this can only
be done with an unacceptable overhead for the initial information processing, which
can lead to a temporary stop of the simulation run. This can be caused, for ex-
ample, by a newly connected user who demands all current data from a running MFS
kernel. In order to avoid such peaks of communication and to reduce the growing
number of messages, new ways to orchestrate the complex system needed to be found.
Computational steering of IS/VR offers powerful techniques to fulfill this task.
5.4.2 Computational Steering in d3FACT insight
Figure 5.19 shows the basic idea behind the integration of CS into the distributed MFS
d3FACT insight. Instead of a direct connection between every visualization and every
MFS kernel, a central CS instance is established which orchestrates the whole message
transmission. Thereby, virtually every component can exchange data with any other
component. Events are exchanged by changing the values of external parameters.
These are initially made public through the publish/subscribe system, where clients
can subscribe to selected parameters. The following describes the functions of the main
components (MFS kernel and visualization), the interfaces and the data exchange itself.
[Figure 5.19 diagram: "Before", MFS Kernel1 to Kernel3 and Vis1 to Vis3 are connected directly; "After", they are connected through CSIS.]
Figure 5.19: Replacing direct communication links by a centralized CS component.
5.4.2.1 Adapting the MFS Kernel and the Visualization
Instead of sending the initialization events directly to the connected visualizations,
the MFS kernel publishes them once through the P/S service of the CSIS framework.
Thereby, all basic information about the used subblocks, e.g. their 3D representation,
their position and their attributes including all current parameter values, is announced.
Afterwards, the simulation kernel waits for an external signal to start the simulation
run. Visualizations that connect to the CSIS framework can now request a list of all
connected simulation kernels and subscribe to their events.
During the actual simulation the kernel sends animation events for each subblock to
the CSIS framework in every step of the global clock. The organization of this event
exchange is described in section 5.4.2.3. The visualizations, in turn, can subscribe to
these events and receive their updates synchronously. These messages are then used to
animate the representations of the tokens in the three dimensional model. In addition
to these volatile variables the visualizations can also subscribe to the parameters of an
MFS kernel or selected subblocks. They are only updated through publish/subscribe
messages when they are changed.
The visualization itself provides several interaction mechanisms. For example, by
selecting a special machine or element from the 3D-environment, all of its properties
are subscribed to automatically and presented to the user in an additional panel of
the visualization client. From then on, every update is propagated to the client, so
that the user always has all current data about this selected element, e.g. a machine or
a forklift. If permitted, the user is able to change parameters in the properties-panel,
which are transmitted back to the kernel through the CS framework and might change
the behavior and thereby the animation in the 3D-view.
The 3D-client also allows tabbed displaying of multiple properties-panels (e.g. for
multiple MFS kernels). Additionally, the client allows switching between different sim-
ulation runs, which are computed simultaneously on a computer cluster. When switching
the simulation run, the corresponding data is transferred to the client, including all
parameters and parameter updates. Further details about the assignment of simula-
tions and visualization are given in section 5.4.2.3. Besides switching between several
simulations, an aggregated view of two or more simulations can be displayed to di-
rectly compare the computed results. To support the user in distinguishing between
the different simulations, corresponding elements of each simulation, e.g. work pieces
that travel through the model, can be colored uniquely.
5.4.2.2 Interfaces
To publicly announce its building blocks, an MFS kernel uses the P/S interface of the
CSIS steering library. For every block, one message is sent to the CSIS server con-
taining details like available input and output parameters. With this information the
CSIS service generates an internal list of all connected kernels and their correspond-
ing building blocks. This list can be acquired by potential clients and allows them to
choose to which kernel and which of its subblocks a connection should be established.
This is done by the client’s P/S interface which subscribes to every parameter accord-
ing to the user’s selection. After this step the CSIS server invokes an update of the
global map and connects input and output parameters according to the subscriptions
of the newly connected visualization.
Currently the system only works demand-driven. That means a visualization has
to query for newly added kernels or for the whole list of available kernels. Another
option would be to announce the new arrival of a kernel to all clients and offer the user
possibilities to integrate that kernel into the running visualization. Before the actual
visualization is started, the client offers a GUI to select which MFS kernel(s) and
which of their building blocks it should connect to. This information is acquired through
the P/S interface of the CSIS system. This interface is also used to query for new MFS
kernels when the visualization is already running. The result of this query is again
displayed in the client’s GUI and shows all currently available MFS kernels and their
subblocks.
5.4.2.3 Data Exchange
The actual data exchange between a subblock in the simulation and its representation
in the visualization is handled by instances of an extension of the CSIS steering li-
brary called moderators. This extension was introduced to represent the hierarchical
structure of the building blocks and subblocks in d3FACT insight. There is one central
moderator for each simulation and visualization component. Each moderator gets a
unique name for identification. Additionally, one sub-moderator is assigned to each
subblock of the model. The sub-moderator offers the local data of the subblock to the
central moderator. For unique identification, each sub-moderator also has a unique
name - the name of the subblock. The moderators at the simulation collect all anima-
tion messages which trigger the animation of the tokens (e.g. packets on an assembly
line) in the visualizations. These messages contain a processing time-interval, simu-
lation outputs and configurable parameters (properties) and are always bound to one
specific instance of a subblock. The sub-moderator at the visualization (of one sub-
block) offers the complementary interface for receiving simulation data and for changing
simulation parameters.
The inputs and outputs of a subblock within a simulation look like this:
Inputs:
<simulator-name>.<subblock-name>.property1
<simulator-name>.<subblock-name>.property2
...
Outputs:
<simulator-name>.<subblock-name>.animate_token
<simulator-name>.<subblock-name>.animate_start
<simulator-name>.<subblock-name>.animate_stop
<simulator-name>.<subblock-name>.output1
<simulator-name>.<subblock-name>.output2
...
The labels property and output are only placeholders and are replaced by subblock
specific names like throughput (output) or manufacturing-time (input). This depends
on the functionality of the subblock.
The moderator and its sub-moderators of the visualization are able to receive the
output of multiple simulations. They also allow the parameterization of a single sub-
block or a set of different instances of the same block in different simulation kernels.
To support multiple inputs from different kernels, the input names are extended by an
identifier as follows:
Inputs:
<vis-name>.<subblock-name>.<instance-id>.animate_token
<vis-name>.<subblock-name>.<instance-id>.animate_start
<vis-name>.<subblock-name>.<instance-id>.animate_stop
<vis-name>.<subblock-name>.<instance-id>.output1
<vis-name>.<subblock-name>.<instance-id>.output2
...
This set of inputs is defined with different values for the instance-id, according to
the number of simulation instances that are observed. The values of outputs cor-
respond to the simulation inputs, but can be mapped to one or multiple inputs of
different simulation kernels:
Output:
<vis-name>.<subblock-name>.property1
<vis-name>.<subblock-name>.property2
This naming is for example used to evaluate one material flow model under differ-
ent conditions, by starting several runs with different initial parameters (e.g. variable
processing or failure rates for selected machines). Each of these different scenarios is
executed in a separate simulation kernel. Through the utilization of the CSIS frame-
work, the kernels can be executed in a distributed fashion for example on a hybrid
cluster. The simulations run on the cluster’s compute nodes and send their output
directly to the connected visualization nodes where the user can watch and steer the
simulations. This allows for the efficient analysis of material flow simulations by aggre-
gation of simulation results from different kernels and direct comparison between two
(or more) simulation runs with different initial parameters.
For the initial generation of the moderators (the extensions of the CSIS steering li-
brary), the components (simulation and visualization) parse the XML description of the
MFS model. Each instantiation of a subblock creates a corresponding sub-moderator.
The sub-moderator assigns the input- and output data to the central moderator. For
the scenario depicted in figure 5.18, the list of prefixes for moderated subblocks of the
simulation kernel KernelScenario1 will look like this:
KernelScenario1.Source_1.<variables>
KernelScenario1.Conv_1.<variables>
KernelScenario1.Saw_1.<variables>
KernelScenario1.Conv_2.<variables>
KernelScenario1.Conv_3.<variables>
KernelScenario1.Stock_1.<variables>
KernelScenario1.Stock_2.<variables>
The connection between a subblock in simulation and its representation in visualiza-
tion is established through the CSIS steering library itself. If the visualization of a
subblock VisA1.Saw_1.1 (including the ID for the kernel) subscribes to KernelScenario1.Saw_1,
the CSIS server will generate the map entries in order to pair the corresponding inputs
and outputs of both components and establish the continuous data exchange.
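The pairing of names that such map entries express can be sketched as follows. The function map_entries and the representation of an entry as a (source, destination) tuple are hypothetical and do not reflect the internal data structures of the CSIS server. The naming scheme itself, the three animate_* variables and the example names throughput and manufacturing-time are taken from the listings and the text above.

```python
ANIMATION = ("animate_token", "animate_start", "animate_stop")


def map_entries(kernel, vis, subblock, instance_id, outputs, properties):
    """Return (source, destination) name pairs for one subscribed subblock."""
    entries = []
    # Simulation outputs feed the instance-specific inputs of the visualization.
    for name in (*ANIMATION, *outputs):
        entries.append((f"{kernel}.{subblock}.{name}",
                        f"{vis}.{subblock}.{instance_id}.{name}"))
    # Visualization outputs (properties) feed the inputs of the simulation.
    for name in properties:
        entries.append((f"{vis}.{subblock}.{name}",
                        f"{kernel}.{subblock}.{name}"))
    return entries


entries = map_entries("KernelScenario1", "VisA1", "Saw_1", 1,
                      outputs=["throughput"],
                      properties=["manufacturing-time"])
for src, dst in entries:
    print(f"{src} -> {dst}")
```

Observing a second kernel would only require a second call with another kernel name and instance-id. The property entries of both calls share the same visualization output, which is how one output can be mapped to inputs of several kernels.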
5.4.3 Example Scenario
A small example scenario shall help to understand the potential benefits. Imagine a
discussion about the acquisition of a new machine for sawing within an existing factory
layout. Figure 5.18 shows an exemplary setup of a production line including a sawing
machine. With existing MFS systems it was only possible to simulate models like that
with different initial parameters (such as conveyor speeds, capacities, failure rate etc.)
sequentially, by starting new simulation runs with new parameters for every possible
permutation. With the proposed CS-enhanced system one can start several simulation
runs simultaneously and see the results in one combined visualization on a hybrid
cluster system. This is achieved by replacing the fixed mappings of the parameters
between simulation and visualization in traditional MFS systems with a dynamic
mapping via a communication server as described in section 5.4.2.3. The new solu-
tion also offers the possibility to couple the material flow simulation with advanced
visualization setups such as a stereoscopic projection, in order to allow an immersive
view on several variants. Additionally, a 2D frontend for a simulation expert’s laptop
can be connected to the same simulation runs, in order to allow interactive variation
of the simulation’s parameters and by that, the iterative refinement of the simulation
model itself. Through the flexibility of the CSIS system, it is possible to start addi-
tional simulation scenarios, for example for alternative machines, and thus allow fast
switching between the simulation runs for an efficient discussion of which machine is
best to buy. Figure 5.20 shows the schematic setup of such a scenario: The parameters
of two different MFS kernels are dynamically mapped to three visualization instances,
of which two are connected to a stereoscopic display wall and the other one to a super-
visor’s display. Again, the extended models of computational steering for IS/VR can
be found in this example.
[Figure 5.20 diagram: MFS-VIS instances and the MFS kernels Scenario1 and Scenario2 connected through the CS framework with its publish/subscribe service and communication server; regions labeled Collaborative CS, Concurrent CS and Synchronized CS.]
Figure 5.20: Computational Steering of the interactive material flow simulation
d3FACT insight.
5.4.4 CS-Enhanced d3FACT insight - Conclusion
The example scenario above shows only the basic new features of the CS-enhanced
system, but helps to understand what has now become possible. For everyday use, the
system offers good flexibility and scalability and allows more extensive simula-
tions than before. In particular, the possibility to do comparative simulations in real-time is a
big benefit for the users of such simulations, since it allows them to compare different
scenarios online. For the practical usage of the CS-enhanced material flow simula-
tion, the system needs to be adapted to the user’s needs and enhanced by methods for
security and reliability. By extending the CSIS framework with the hierarchical moder-
ators, a one-to-one mapping of the data model of d3FACT insight could be achieved and
thereby only few changes in code had to be made to make use of the hybrid cluster as
a powerful computing resource.
5.5 Conclusion - Computational Steering of IS/VR
As described in chapter 4 the computational steering of IS/VR on hybrid clusters dif-
fers in many ways from traditional CS of HPC simulations. Thus, it was necessary to
introduce a novel approach that specifically addresses the main needs of IS/VR appli-
cations: interactivity and flexibility. The framework that was presented in this chapter
fulfills the requirements which were outlined in section 4.3.
• It serves as a flexible basis to connect and orchestrate arbitrary numbers of dis-
tributed components of IS/VR applications. This was achieved by developing
models for different usage scenarios and designing a client-server based system
that combines a centralized communication server and a flexible Publish/Sub-
scribe system for the data exchange.
• Transparency was achieved by providing the developers with a slim API which
encapsulates the basic functionalities of the framework and allows for a comfort-
able integration into existing applications. The developers and the users do not
need to care about data transfer and the connection between the components;
they just have to declare the external data and dynamically define the connections
between their components to create the scenario they desire.
• All components of the IS/VR application (input, visualization, simulation and
additionally audio) can be integrated the same way, by using the framework’s API.
This even allows the coupling of different IS/VR applications if exchangeable
data is available.
• The whole system is based on freely available software and utilizes well-known and
approved mechanisms for data exchange, synchronization and group formation
(e.g. TCP-based communication, global clock synchronization and publish/sub-
scribe message passing). The prototypical implementation proved the concept
and will be freely available over the web site of the PC2 [65].
Since performance measurement or benchmarking is hardly possible for systems that
aim to allow the connection of arbitrary, distributed components, it stands to reason to
evaluate the framework with practical examples. This was done in sections 5.3 and 5.4
and the results show that by applying CS to IS/VR applications it is, for the first time,
possible to flexibly and dynamically build various distributed scenarios of an existing
application on hybrid clusters. This enables the users and developers to broadly extend
the functionalities of their application (for example through comparative simulation
runs or multiple user simulations) without the need to completely rewrite or redesign
their application.
6 Remote Visualization for IS/VR
With today’s ever-increasing computational power and the possibilities to simulate and
visualize more and more complex systems, it is obvious that efficient technologies are
needed to make the results of such simulations and visualizations available to a broad
audience. Remote rendering and remote visualization (RV) are techniques to fulfill this
need for visual data. Since the early 90s, researchers have worked on the topic of transporting
the rendered images or visual data to the clients of remote users to let them analyze
and work with this data. There are many systems that focus on several different aspects
of remote rendering / visualization. Some allow comfortable administration of remote
servers and others focus on interactively showing remote users images that were
rendered on powerful workstations (for a classification and exemplary systems see
section 3.2). But all of those systems have to deal with vast amounts of data that have
to be transferred. This is because visual data is mostly pixel- or voxel-based. Therefore
even single images which are displayed for only a fraction of a second consist of one
or more megabytes of data (e.g. 5.76 MB for UXGA 1600 x 1200 RGB with 3 bytes per
pixel). To provide a smooth animation, 20 or more frames¹ are needed every second.
All have to get from the GPU of the server over the network to the client and finally to
the display of the user. There are already many approaches to minimize the amount of
data that needs to be transferred, for example through prefetching certain data to the
client or by introducing level-of-detail mechanisms and compression of all kinds (see
for example [37]). But there is still no solution for enabling users to interact with
a remote system as if it were local.
As described before, remote access to universal hybrid cluster systems is of great
importance since they often reside in surroundings that are inaccessible to the users.
In order to utilize remote visualization for IS/VR on hybrid clusters it is extremely
important that a certain level of interactivity and responsiveness (see section 5.1.5) is
guaranteed. Additionally, the quality of the remote frames needs to be close to the
original, and thus heavy compression is not an option. The available bandwidth of
typical external cluster connections is high but not unlimited (about 1-5 Mbit/s per
user), which is why the compression methods need to be adapted to this prerequisite.
And, last but not least, a framework is needed that is flexible enough to adapt to the
needs of a hybrid cluster system, for example by supporting the aggregation
of multiple frames from multiple visualization nodes or by supporting multiple users at a
time. None of the existing systems presented in section 3.2 completely fulfills these
needs. Many of the systems simply are not designed for strict interactivity and are
excluded because of high latencies (e.g. 200+ ms for most of the systems for remote
administration). In contrast, the systems optimized for remote 3D applications mostly
are not flexible enough or require too much bandwidth. Thus, the following section
points out three main problems that need to be solved to efficiently make use of remote
visualization in combination with IS/VR. Thereafter, a platform is introduced which
allows for the testing and implementation of solutions to the problems described in
section 6.1, and several key technologies and their application are described. Amongst
others, new parallel compression techniques that make use of the computing power of
modern GPUs to achieve a fast and high quality image compression are presented in
section 6.2. Section 6.4 describes the prototypical realization of those methods
and the framework itself. Finally, section 6.5 substantiates the findings with benchmarks
and a comparison to an existing system.

¹ The European PAL standard specifies 25 FPS, the American NTSC standard even 30 FPS.

                                   | slow compression     | reading data from | multiple render
                                   | for high resolutions | the GPU fast      | targets
-----------------------------------+----------------------+-------------------+----------------
faster host to graphics hardware   | no                   | ++                | +
interfaces                         |                      |                   |
programmable graphics hardware     | ++                   | +                 | no
new framebuffer concepts           | no                   | +                 | ++

Figure 6.1: Problems of RV for IS/VR, possible solutions and their impact on the problems.
6.1 Limitations of Remote Visualization for IS/VR
The table in figure 6.1 shows the three main limitations that hinder the effective uti-
lization of RV for IS/VR. Additionally, possible solutions and their impact on the lim-
itations are marked in the table. The following section describes the problems and
possible solutions in detail. The main aim is to minimize the overhead in time that is
needed to compress, transfer and decompress the rendered frames in order to allow low
latencies and undisturbed interaction. Additionally, the quality of the images should not
suffer from high compression rates. We assume fast Internet or Ethernet connections
with a bandwidth of at least 1 MB/s to achieve a decent quality of service also
for high resolutions (XGA and above).
6.1.1 Slow Compression for High Resolutions
The main problem of RV is that the amount of image data grows rapidly with the resolution
(e.g. 5.76 MB for one frame in UXGA RGB). Some kind of compression is therefore
needed to transfer more than 20 frames per second over the network. The
difficult tasks are to find the right compression technique (there are many for
still and moving images) and to find the right balance between quality, compression
time and size of the compressed frames. When choosing the
compression technique, the main factor is compression speed, which directly depends on
the complexity of the algorithm. A very complex algorithm that produces great compression
rates but takes a lot of time to compute is useless for
RV. Therefore, simple and effective compression methods are needed. The simplest
approaches are lossless counting or comparison algorithms such as Run Length
Coding² (see [71]). For certain scenarios (frames with only few colors, big unicolored
areas, uniform background as for example in CAD), these algorithms generate great
results in unbeatable time. But for other scenarios of RV, such as the transmission
of real videos, they are not well suited and can even produce negative compression
rates, i.e. output that is larger than the input. This is where lossy compression
techniques like the popular JPEG compression
have their advantages. They achieve quite high compression rates independent of the
input, among other things by filtering information which is not or only hardly visible to the human
eye. Additional techniques like downsampling and dictionary based compression
optimize the compression rates. But the more complex the algorithms get, the more time
is consumed. For example, the JPEG algorithm in the popular libjpeg implementation
takes about 40 ms to compress an image of XGA resolution (1024x768 pixels) on a
modern dual core CPU (see section 6.5 for details). That already limits the achievable
frame rate of an RV system to 25 FPS without taking grabbing, transmission and
decompression of the frame into account. For higher resolutions this rate drops dramatically
(about 90 ms for UXGA, which translates to 11 FPS).
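The figures quoted above follow from simple arithmetic. The short sketch below recomputes them; the function names are illustrative and not part of any framework described in this chapter.

```python
def raw_frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def max_fps(stage_time_ms):
    """Upper bound on the frame rate if a single stage needs stage_time_ms per frame."""
    return 1000.0 / stage_time_ms


uxga = raw_frame_bytes(1600, 1200)  # 5,760,000 bytes, i.e. 5.76 MB per frame
print(uxga / 1e6, "MB per UXGA RGB frame")
print(uxga * 20 / 1e6, "MB/s uncompressed at 20 frames per second")
print(max_fps(40), "FPS bound for 40 ms compression time (XGA)")
print(round(max_fps(90), 1), "FPS bound for 90 ms compression time (UXGA)")
```

The bound only covers the compression stage; grabbing, transmission and decompression lower the achievable rate further.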
Since it is not likely that new algorithms for still image compression will improve the
compression/complexity ratio dramatically, the only chance to realize interactive frame
rates for high resolutions is to speed up the computation of the compression algorithms.
At first glance, the CPU seems to be the right choice for that kind of
computation since it is supposedly the most powerful component in a computer. But
since graphics hardware has become more and more sophisticated and universally
programmable through its shader units, outsourcing the task of compression to the GPU
might be a solution to this problem. This approach is discussed in section 6.2.4.

² Consecutive, similar symbols are encoded as the symbol followed by the length of the run.
Another alternative would be video compression methods like the well-known MPEG
coding [54]. They produce very good compression rates for image streams and videos,
but have one major disadvantage for the utilization in RV for IS/VR: nearly all of them
are based on JPEG or similar still-image compression and need information from the previous
and in some cases from the following frames to achieve compression. This introduces
latency, which is in almost all cases higher than that of plain JPEG compression. Since
the main goal of the RV framework for IS/VR is interactivity, the video compression
methods are left out in the following. However, if approaches become available that are fast
enough to be applied to this problem in the near future, they can easily be integrated
into the framework presented below.
6.1.2 Reading Image Data From the Graphics Hardware
Another issue that for a long time limited the speed of RV for IS/VR was
that the rendered images could only be read back from the graphics hardware very
slowly. That originated from the initial, asymmetric design of the AGP bus, which
was the standard interface between host system and graphics hardware for nearly a
decade. It was optimized to transfer data like textures and meshes to the graphics
hardware but not the other way round. However, the introduction of PCI Express
for Graphics (PEG) cleared the way to efficiently download rendered images from the
graphics card. Additionally, through several extensions to the graphics APIs (OpenGL
[35] and Direct3D [52]), which allow access to dedicated memory areas on the graphics
card, the grabbing of graphical data from the cards has become easier and more flexible.
6.1.3 Rendering to Multiple Targets
This limitation is important for the efficient usage of RV on hybrid clusters. Since cluster
computers are only rarely used exclusively by one user, one has to make sure that they
can be shared among several users. The same holds for the visualization components
in a hybrid cluster. Thus, the RV system needs to provide its services to more than
one client, too. This can be done by enabling multiple render targets, which has been possible
since the introduction of the aforementioned extensions to the graphics API (frame and
pixel buffer objects in OpenGL [31]). With these extensions and the programmability of
the graphics hardware one can achieve a flexible and scalable system for RV on hybrid
cluster systems.
6.1.4 Programmable Graphics Hardware and GPGPU
The description of the limitations of RV for IS/VR showed that one possible solution
for some problems could be programmable GPUs. However, this is a highly unspecific
term and needs some clarification and explanation. Programmable GPUs for professional
graphics solutions were first introduced in the early 2000s by companies like SGI
[73]. Those GPUs were very powerful at that time, but also very expensive and rather
proprietary. In 2001 the first programmable consumer graphics hardware (NVIDIA's
GeForce 3 [58]) appeared and loosened the strict pipelining concept that had been used for
consumer-grade graphics processing until that time. Through programmable vertex
and fragment processors it became possible to manipulate geometries and even single
pixels during their processing in the graphics pipeline. This made it possible, for example, to
enhance the visual output of VR environments without adding more geometry information
(e.g. surface generation through bump mapping [10]). As the graphics hardware
became more and more powerful over the years (mainly driven by demanding and
complex computer games), developers started to use it for purposes other than graphics
processing. This development is known as GPGPU (General-Purpose computation
on GPUs, see [24]) and aims to use commodity graphics hardware as powerful coprocessors
for certain applications. In the early stages of GPGPU one had to use standard
graphics APIs such as OpenGL to describe the general purpose problem to be solved.
Data had to be encoded in textures and the programs were written in special shader
languages (Cg [57] or GLSL [38]). This was often complicated and sometimes impossible,
since those APIs and data formats are highly optimized for graphics processing.
But the graphics hardware manufacturers reacted and designed APIs that offered access
to the powerful graphics hardware in a more universal way. The two driving
companies (ATI [1] and NVIDIA [59]) both introduced such (proprietary) APIs around
the same time. They share the goal of allowing the usage of a GPU as a massive multiprocessor
following the SIMD (Single Instruction Multiple Data) principle. That means that
a large amount of data is processed in parallel by one instruction, or in other words,
many threads with identical instructions process different data simultaneously.
For applications that perform many similar and independent operations on large data sets
this approach can significantly increase the performance. One example of such an
application is image processing, where operations are very often performed on single
pixels or small groups of pixels. In section 6.2.4 two GPU-based image compression
methods are described that make use of the GPGPU API CUDA [60] by NVIDIA,
which is also briefly introduced in section 2.3.
6.2 Invire - A Concept for an Interactive Remote
Visualization System
In this section, the concept for a new RV system called Invire (INteractive REmote
VIsualization) is introduced. It belongs (according to the classification in 3.2) to the
third class of remote rendering systems and focuses on high interactivity, maximized
performance, quality visualization results and scalability. The system will be a frame-
work to implement, test, evaluate and improve techniques that help to overcome the
problems and limitations described before and will allow the successful and flexible
usage of remote visualization for IS/VR applications on hybrid cluster systems. Invire
is designed to be the basis for several compression, grabbing and transfer modules and
offers the basic functionalities for data transfer and remote interaction. On this platform,
enhanced compression and grabbing modules are developed and benchmarked
against each other and against existing systems. This helps to find out which module is
best suited for which task and to prove the need for new technologies in this field.
The following sections introduce the architecture and basic components of Invire
as well as different approaches to the grabbing, compression and transfer of the ren-
dered frames. The prototypical implementation as well as the benchmarking results
are presented in the last two sections of this chapter.
6.2.1 The Architecture
The overall architecture of Invire (as shown in figure 6.2) is kept simple to maximize
the remote visualization performance. The system follows the client-server concept,
which means that the server side handles the grabbing, compression and transmission
of the rendered frames and the client side receives, decompresses and displays the
data. The server side offers a so-called Invire Plugin which can easily be integrated into
existing OpenGL applications (non-transparent integration). This allows the passing
of parameters between the host application and Invire, for example to allow a dynamic
adaptation of the application's resolution or advanced remote interaction features.
A transparent integration of the framework is possible through mechanisms such as
preloaded libraries. Chromium [28] for example uses this mechanism to replace the
standard OpenGL library by a stub library which intercepts all OpenGL calls. These
calls can be used, amongst other things, to get access to the rendered frames. In both
cases of integration, the rendered content of the current OpenGL context is grabbed into
host or graphics card memory and then passed to the Compression facility. This part of the
software is implemented as a modular collection of compression algorithms that can be exchanged
arbitrarily. This makes it possible to compare the different methods and eventually combine them
to achieve higher compression rates. After the compression of a rendered frame, it is
passed to the TCPServer where a header is generated. The header contains information
about the image size, compression, resolution, etc. Afterwards, the header followed by
the compressed data is sent to the Invire Client. It receives the compressed frame and
passes it to the Decompression facility together with the information from the header.
After its decompression, the frame is displayed by the client. A separate interaction
component registers remote input from the client, serializes and transmits it, and passes it to
the remote application through the server.
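To illustrate the idea of a header that precedes the compressed data, the following minimal sketch packs and unpacks one frame message. The field layout (width, height, compression id, payload size) and all names are assumptions made for this example; the actual Invire header format is not specified in the text.

```python
import struct

# Hypothetical header layout (an assumption, not the Invire wire format):
# width (uint16), height (uint16), compression id (uint8), payload size (uint32),
# all in network byte order.
HEADER_FORMAT = "!HHBI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def pack_frame(width, height, compression_id, payload):
    """Return the header followed by the compressed payload."""
    header = struct.pack(HEADER_FORMAT, width, height, compression_id, len(payload))
    return header + payload


def unpack_frame(message):
    """Split a message into its header fields and the payload."""
    width, height, compression_id, size = struct.unpack(
        HEADER_FORMAT, message[:HEADER_SIZE])
    payload = message[HEADER_SIZE:HEADER_SIZE + size]
    return width, height, compression_id, payload


message = pack_frame(1024, 768, 2, b"compressed-bytes")
print(unpack_frame(message))
```

Because the payload size is part of the header, a receiver on a TCP stream knows how many bytes to read before handing the frame to the selected decompression method.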
6.2.2 Data Transfer
In order to transfer the data (compressed frames) a TCP socket connection is established
between server and client. This socket remains active until either client or server
cancels the transfer. The advantage of TCP sockets is that the correct order and the
integrity of the image data are guaranteed. After a socket connection is established, the
protocol overhead is minimal and allows a good utilization of the available bandwidth.
Additionally, a second socket is established to transmit the events of the interaction
component. On the one hand, this helps to prevent additional delay caused by the
shared usage of only one socket, and on the other hand it logically separates
frame transmission and event handling.
The Invire server is designed in a way that it can open multiple sockets for multiple
users. This is realized by dynamic socket allocation and the initialization of a new
thread for each new client. Additionally, the client is prepared to receive data from
multiple servers, which allows the composition of multiple frames from different servers,
for example to generate one high resolution image.

[Figure 6.2 shows the Invire Server (OpenGL application, Invire Plugin, Invire Library,
FrameGrab, Compression, TCPServer, Interaction) connected via LAN to the Invire Client
(TCPClient, Decompression, Interaction Client).]
Figure 6.2: The basic architecture of the Invire framework
6.2.3 Image Readback
To be able to process and transfer the rendered images, they need to be read back
from the graphics card to system memory. This can be done with standard calls of the
graphics API such as OpenGL's glReadPixels(). This function copies the rendered
image to host memory pixel by pixel and stores it in a predefined format (e.g. RGBA
with 4 bytes per pixel). This method is quite fast and optimized for the transfer of
data from the graphics hardware to the host system. However, it is not flexible enough
to support multiple render targets (only the main framebuffer can be read back) or
copying data to other areas in graphics memory (e.g. for postprocessing by the GPU).
These requirements of RV for IS/VR can only be fulfilled by using a more generic
approach. With OpenGL 2.1, an extension called Pixel Buffer Objects [31] was
introduced, which allows easy and universal access to graphics memory for pixel
based objects (e.g. frames or textures). By using this extension for reading back the
rendered frame, one can achieve the desired flexibility and still benefit from the
optimized code an official extension provides. More implementation details are presented
in section 6.4.
6.2.4 Compression
As described in section 6.1.1 it is important to find the right compression method
for different applications. Since Invire is designed to be a framework that allows the
remote usage of arbitrary OpenGL applications and has the goal of providing a basis
for improving techniques for RV, a flexible and extensible model for the integration
of compression algorithms was needed. This flexibility is achieved by encapsulating
all compression related classes in one library and defining universal interfaces for all
compression methods. That means, among other things, that it is possible to execute
compression in two ways:
CPU-based: If the user (or the system) chooses a CPU-based compression, the rendered
image is grabbed to the memory of the host system in RGB format. This
raw image data (i.e. a pointer to its memory location) is passed to the selected
compression algorithm. The CPU then runs the data through the algorithm
and stores the result in host memory again.
GPU-based: If a GPU-based compression is selected, the frame is grabbed to a memory
location on the graphics card. Then the compression method is invoked by
the CPU with a reference to that memory location. The compute intensive parts
of the algorithms are encapsulated in so-called kernels (for implementation details
see section 6.4), which are launched on the GPU and process the data in
graphics memory. After that step, the compressed data is read back from the
graphics card and is ready to be sent over the network.
Both methods can also be used in a double (or more) buffered fashion. That means
that there are two (or more) dedicated memory areas either in host or graphics memory
that hold previous uncompressed frames for comparative compression methods.
The decompression works accordingly, i.e. the compressed data is received and
stored either in host or graphics memory. Thereafter, either the CPU processes the data and
passes it to the graphics hardware for display, or the compressed data is directly
passed to the GPU, where it is decompressed and directly displayed.
Through common base classes, the library ensures that every compression method
offers a compression and a decompression function and follows the interface definitions.
A special frame object encapsulates all information (header) and data of one frame. It
is also possible to implement compression methods both CPU-based and GPU-based, for
example to compare compression times or to allow the decompression on hardware that
does not support GPU-based computations. In the next sections, three established
compression methods (Run Length Encoding, Difference Compression and JPEG still
image compression) are briefly described and two of them (Difference and JPEG) are
presented as GPU-based, parallel compression methods. The selected algorithms represent
groups of compression algorithms with comparable efficiency and complexity.
RLE and Difference compression are lossless methods whereas JPEG is a sophisticated,
lossy algorithm. This selection was made to cover as wide a range of compression
techniques as possible within the scope of this thesis. As described in chapter 7 there are
many more possible and promising compression techniques which can be integrated into
Invire in further projects.

[Figure 6.3 shows an input array of RGB pixels and the run-length encoded output with
the run lengths 2, 2, 2 and 10.]
Figure 6.3: Example for Run Length Encoding of an RGB array.
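The idea of common base classes and a frame object can be sketched as follows. This is a minimal illustration under assumed names (Frame, CompressionMethod, METHODS); it does not reproduce the actual Invire class design, and zlib only serves as a stand-in codec that is not among the methods discussed in this chapter.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import zlib


@dataclass
class Frame:
    """Header information and data of one frame (field names are illustrative)."""
    width: int
    height: int
    method: str
    data: bytes


class CompressionMethod(ABC):
    """Common base class: every method offers compress() and decompress()."""
    name = "abstract"

    @abstractmethod
    def compress(self, frame: Frame) -> Frame:
        ...

    @abstractmethod
    def decompress(self, frame: Frame) -> Frame:
        ...


class RawMethod(CompressionMethod):
    """Pass-through method, useful as a baseline for benchmarks."""
    name = "raw"

    def compress(self, frame):
        return Frame(frame.width, frame.height, self.name, frame.data)

    def decompress(self, frame):
        return Frame(frame.width, frame.height, "raw", frame.data)


class ZlibMethod(CompressionMethod):
    """Stand-in codec (assumption for this sketch, not a method from the text)."""
    name = "zlib"

    def compress(self, frame):
        return Frame(frame.width, frame.height, self.name, zlib.compress(frame.data))

    def decompress(self, frame):
        return Frame(frame.width, frame.height, "raw", zlib.decompress(frame.data))


# Methods are looked up by the name stored in the frame header.
METHODS = {m.name: m for m in (RawMethod(), ZlibMethod())}

frame = Frame(4, 4, "raw", bytes(4 * 4 * 3))  # black 4x4 RGB frame
packed = METHODS["zlib"].compress(frame)
restored = METHODS[packed.method].decompress(packed)
print(len(frame.data), "->", len(packed.data), "bytes")
```

Since every method shares the same interface, the receiving side only needs the method name from the header to select the matching decompression routine.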
6.2.4.1 Run Length Encoding
Run Length Encoding (RLE) is a very simple but in some cases quite effective
method to compress arbitrary data. Its basic functionality is shown in figure 6.3. The
run-length algorithm goes over the input array (in this case a sequence of RGB coded
pixel values), counts consecutive pixels with the same color values and writes the count
followed by the actual color information into a new array. The decompression is done
accordingly, by writing the given number of pixels with the same color consecutively
into a new array. The RLE algorithm can encode and decode $n$ pixels in $O(n)$ time. It is
hardly suited for parallel execution since each run depends on the preceding pixels, and
partitioning the input might cancel out the compression gains.
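A minimal sequential sketch of the run-length scheme described above is given below. The pixel colors are made up for the example; the run lengths mirror those of figure 6.3.

```python
def rle_encode(pixels):
    """Encode a list of (r, g, b) tuples as a list of (count, (r, g, b)) runs."""
    runs = []
    for pixel in pixels:
        if runs and runs[-1][1] == pixel:
            runs[-1] = (runs[-1][0] + 1, pixel)  # extend the current run
        else:
            runs.append((1, pixel))              # start a new run
    return runs


def rle_decode(runs):
    """Expand (count, pixel) runs back into the pixel list."""
    pixels = []
    for count, pixel in runs:
        pixels.extend([pixel] * count)
    return pixels


RED, GREEN, BLUE, WHITE = (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)
pixels = [RED] * 2 + [GREEN] * 2 + [BLUE] * 2 + [WHITE] * 10
runs = rle_encode(pixels)
print(runs)

# Worst case: no two neighbouring pixels are equal, so every pixel becomes its
# own run and the encoded form is larger than the input.
noisy = rle_encode([RED, GREEN] * 4)
print(len(noisy), "runs for 8 pixels")
```

The second call shows the negative compression mentioned in section 6.1.1: for inputs without repeated neighbours each pixel is stored together with an additional count.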
6.2.4.2 Difference Compression with Index
Another basic technique for lossless image compression is the difference compression
shown in figure 6.4. The current and the last frame are compared pixel by pixel and
only the pixels that differ are stored. Additionally, the positions of the pixels that
changed are needed to decompress the current frame. This can be done most efficiently
by an index which maps one bit to every single pixel of a frame. If the bit is 1, the pixel
has changed, and the stored pixel value at the position given by the number of preceding
1s in the index is used to update the corresponding pixel of the last frame. This requires
$O(n)$ time to compress and decompress $n$ pixels.
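The sequential variant can be sketched in a few lines. For readability the index is kept as a list of 0/1 values here; a real implementation would pack it into bits. The example frame is constructed so that it yields the index pattern 0000 0011 1001 0110 that appears in the figures of this section; the pixel values themselves are arbitrary.

```python
def diff_encode(last, current):
    """Return (index, changed): one index bit per pixel plus the changed pixel values."""
    index = [1 if old != new else 0 for old, new in zip(last, current)]
    changed = [new for old, new in zip(last, current) if old != new]
    return index, changed


def diff_decode(last, index, changed):
    """Rebuild the current frame from the last frame, the index and the changed pixels."""
    frame = []
    position = 0  # number of preceding 1s in the index
    for old, bit in zip(last, index):
        if bit:
            frame.append(changed[position])
            position += 1
        else:
            frame.append(old)
    return frame


BLACK, WHITE = (0, 0, 0), (255, 255, 255)
index_bits = "0000001110010110"
last = [BLACK] * 16
current = [WHITE if bit == "1" else BLACK for bit in index_bits]

index, changed = diff_encode(last, current)
print("".join(str(bit) for bit in index), len(changed), "changed pixels")
```

Only 6 of the 16 pixels have to be transmitted in this example, together with the 16-bit index.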
[Figure 6.4 shows the last frame and the current frame as input and the compressed data
as output: the index 0000 0011 1001 0110 followed by the changed pixels.]
Figure 6.4: Sequential Difference encoding algorithm with index generation.
6.2.4.3 Parallel Difference Compression with Index
The difference compression with index method described before is very suitable for
parallel (SIMD) execution. In particular, creating the index and copying the pixel
data are independent for every pixel and can be computed simultaneously. The only
problem is that the number of pixels ($n$) is most likely higher than the number of available
threads ($k$) of a multiprocessor. Therefore each image must be split into $m = n/k$ blocks which
can be computed on a multiprocessor.
Figure 6.5 schematically shows the basic sequence of the algorithm. The compression
algorithm is divided into three main steps. In the first step, the index is generated by
simply comparing corresponding pixels of the last and the current frame. Each thread
processes one pixel of a block and writes only the changed pixels to a new memory
location (array) with the size of a block (4 pixels in the example). It takes $O(n/k) = O(m)$
time to compute the index and copy the pixel data to this array. When all threads
have finished, a parallel compaction method (the second step) is invoked. The basic
functionality of this compaction method is shown in figure 6.6. This algorithm requires
$O(\log k)$ time to compute, for every item of the local result array, the number of empty
spaces (i.e. memory locations that do not hold a changed pixel value) to its left. This is
done in $\log k$ steps, where every step $s$ ($s = 0, \dots, \log k - 1$) adds the numbers of
empty spaces in $c_i$ and $c_{i-2^s}$ in parallel. After $\log k$ steps the pixels can be stored
in a locally compacted array by calculating their new index as $x_i - c_i$. This algorithm needs
to run for all $m$ blocks. Its overall runtime is $O(m \log k)$. With this information, the pixels
can now be stored in a local block without empty spaces. The number of changed pixels is also
stored for each block. After all blocks have finished the second step, the information
about the changed pixels is used to compute the absolute position in global memory
for each block's partial result (third step). This is also done in parallel, based on a scan
algorithm which sums up all items to the left of the current value, in $O(\log m)$ time.
Basically, it works similarly to the algorithm used for local stream compaction. However,
instead of adding the numbers of empty spaces it adds the numbers of changed pixels
in iterative steps. Finally, the partial results of the blocks are copied to the calculated
memory locations and, together with the index and optionally an array of the numbers
of changed pixels in each block, form the final compressed frame. The whole algorithm
can be run in

$$O(m + m \log k + \log m) = O(m \log k) = O\left(\frac{n \log k}{k}\right)$$

time in parallel with $n$ = number of pixels, $k$ = number of threads, $m$ = number of
blocks and $n = mk$. That is, depending on the number of available threads, the parallel
algorithm in the worst case performs equally to the sequential one ($O(n)$) with $k = 1$
thread and in the best case achieves $O(\log n)$ with $k = n$ threads.

[Figure 6.5 shows the last frame and the current frame as input and the three steps
(1. index generation and pixel copy, 2. local stream compaction, 3. global stream
compaction) with one block per row. In the example the index is 0000 0011 1001 0110,
the blocks contain 0, 2, 2 and 2 changed pixels (CP), and the compressed data consists of
the index, the changed pixels and optionally the numbers of changed pixels per block.]
Figure 6.5: The steps of the parallel difference encoding algorithm with index generation,
using 4 blocks with 4 threads and one block per line of the 4x4 frame.

Positions x_i:                         1  2  3  4  5  6  7  8
Input (X = changed pixel):             X  -  X  -  -  X  X  -
Count empty spaces 1 to left:          0  1  1  1  2  1  0  1
Count empty spaces 2 to left:          0  1  1  2  3  2  2  2
Count empty spaces 4 to left (c_i):    0  1  1  2  3  3  3  4
Store pixels at position x_i - c_i:    1-0, 3-1, 6-3, 7-3
Figure 6.6: Parallel, local stream compaction on an array with 8 fields.
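The local and global compaction steps can be illustrated with a sequential simulation. The sketch below emulates the lock-step behaviour of the threads by copying the count array before every step; it is an illustration in Python under that assumption, not the CUDA kernel of the prototype, and the function names are hypothetical. The input reproduces the 8-field example of figure 6.6.

```python
from itertools import accumulate


def compact_block(block):
    """Local stream compaction of one block; None marks an empty slot.

    Returns the compacted list and the final counts c_i of empty slots.
    """
    k = len(block)
    counts = [1 if item is None else 0 for item in block]  # empty-slot flags
    step = 1
    while step < k:                     # log2(k) steps with distance 1, 2, 4, ...
        previous = counts[:]            # all "threads" read before any of them writes
        for i in range(step, k):        # each iteration stands for one thread
            counts[i] = previous[i] + previous[i - step]
        step *= 2
    compacted = [None] * (k - counts[-1])
    for i, item in enumerate(block):
        if item is not None:
            compacted[i - counts[i]] = item  # new index x_i - c_i
    return compacted, counts


def block_offsets(changed_per_block):
    """Global step: start offset of each block = sum of all counts to its left."""
    return [0] + list(accumulate(changed_per_block))[:-1]


block = ["a", None, "b", None, None, "c", "d", None]
compacted, counts = compact_block(block)
print(counts)
print(compacted)
print(block_offsets([0, 2, 2, 2]))
```

In the simulation the inner loop runs sequentially, so it does not show the speedup; on a SIMD device each iteration of the inner loop would be executed by its own thread, which leaves only the logarithmic number of outer steps.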
6.2.4.4 JPEG
The most common format for still image compression is the JPEG standard [66]. It
exploits several properties of human vision to eliminate unnecessary or minor information
from the images and combines this with traditional compression algorithms. Since
it delivers high compression rates with only moderately complex algorithms, it is
very well suited for application in RV. Additionally, many of its compute intensive parts
are well suited for parallelization since they can be calculated independently for single
pixels or small groups of pixels. Figure 6.7 shows the main steps of a JPEG-conformant
compression. The first step is the color conversion of the commonly used RGB format
into the YCbCr format. This allows for efficient downsampling³ of the raw image
data. In the following step, the Discrete Cosine Transformation is used to transform
the data of 8x8 pixel blocks from the spatial domain to the frequency domain. Since
nearly all pictures have some kind of patterns or areas of similar colors, low frequencies
are dominant in the frequency domain, whereas many of the higher frequencies will be
zero. This effect is amplified by the next compression step, the Quantization, where
the computed DCT coefficients are divided by a quantization value. Different quantization
values are specified for different frequencies, making use of the fact that the
human eye can distinguish changes in the lower frequencies a lot better than changes
in high frequencies. Through Quantization (i.e. dividing all coefficients by constants
and rounding the results) many of the high frequencies in the 8x8 blocks turn zero. The
subsequent Huffman encoding (a basic dictionary based compression method) makes use of the
many resulting zeros to achieve the actual compression. For more insight into the
algorithms JPEG uses, please see [66]. A widely used open source implementation of the
JPEG standard is available as libjpeg [29].

³ Downsampling in this case means to reduce the resolution of the chrominance components to achieve
an initial compression. Human vision is much more sensitive to luminance than to chrominance
variations; therefore one can achieve high compression rates with only minimal loss of quality,
e.g. 4:1 for the Cb and Cr components (YCbCr 4:2:0).

RGB RAW -> Color Conversion -> Y, Cb, Cr -> Downsampling (Cb, Cr) -> Discrete Cosine
Transformation -> Quantization -> Huffman Encoding -> JPEG image
Figure 6.7: Steps of the JPEG compression.
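The first two steps, color conversion and chroma downsampling, operate on single pixels or small groups of pixels. The following sketch shows them in plain Python. The conversion constants are those commonly used for JFIF files (full-range ITU-R BT.601), and averaging each 2x2 group is only one possible way to downsample; both choices are assumptions for this example, since the text leaves the exact variant open.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel (0..255 per channel) to YCbCr with the JFIF constants."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def downsample_420(plane):
    """Average every 2x2 group of a chroma plane (list of rows, even dimensions)."""
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1]) / 4.0
             for x in range(0, len(plane[0]), 2)]
            for y in range(0, len(plane), 2)]


# Each pixel is converted independently of all others, and each 2x2 group is
# reduced independently of all other groups.
image = [[(255, 255, 255), (255, 0, 0)],
         [(0, 255, 0), (0, 0, 255)]]
ycbcr = [[rgb_to_ycbcr(*pixel) for pixel in row] for row in image]
cb_plane = [[pixel[1] for pixel in row] for row in ycbcr]
cr_plane = [[pixel[2] for pixel in row] for row in ycbcr]
print(downsample_420(cb_plane), downsample_420(cr_plane))
```

After this step the two chroma planes hold a quarter of their original samples while the luminance plane keeps its full resolution.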
6.2.4.5 Parallel JPEG Compression
The parallelization of the JPEG compression is done in steps according to the model
in figure 6.7. The color conversion is the first step of the algorithm and can be
computed independently for every single pixel by a fixed formula (constants may vary
depending on which YCbCr standard is used). With $k$ = number of threads available,
this step can be computed in $O(n/k)$ time. The second step, which downsamples the Cb
and Cr components of 4 pixels, can also be computed independently for each group of 4
values. Thus, with $k$ threads available, this computation can be done in $O(n/k)$. The
third and most complex step of the algorithm is the Discrete Cosine Transformation
(DCT). The JPEG standard specifies the following two-dimensional DCT formula for
the transformation of the 8x8 pixel blocks into DCT coefficients $G_{ij}$:
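$$G_{ij} = \frac{1}{4} C_i C_j \sum_{x=0}^{7} \sum_{y=0}^{7} p_{xy} \cos\frac{(2x+1)\,i\,\pi}{16} \cos\frac{(2y+1)\,j\,\pi}{16}$$

where $C_f = \frac{1}{\sqrt{2}}$ for $f = 0$, $C_f = 1$ for $f > 0$, and $0 \le i, j \le 7$.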
Most JPEG implementations use a simplified method that is based on the computation
and combination of one-dimensional DCTs. In a first substep the 1D DCTs are computed
for the rows of a pixel block, and in the second substep these results are used to
compute the 1D DCTs of the columns. Both computations use the following formula
to compute the 1D DCT coefficients $S_i$:
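\[
S_i = \frac{C_i}{2} \sum_{x=0}^{7} p_x \cos\frac{(2x+1)\,i\pi}{16},
\qquad
C_i = \begin{cases} \frac{1}{\sqrt{2}} & \text{for } i = 0 \\ 1 & \text{for } i > 0 \end{cases}
\qquad \text{and } 0 \le i \le 7.
\]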
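The relation between the two formulas can be checked numerically. The following sketch is a plain sequential reference, not the CUDA implementation; it assumes that the block is indexed as block[x][y] and that the 1D transform is applied first along y and then along x. All function names are chosen for illustration only.

```python
import math

def c(f):
    """Normalization constant C_f of the DCT formulas."""
    return 1 / math.sqrt(2) if f == 0 else 1.0

def dct_1d(p):
    """1D DCT of 8 values: S_i = C_i / 2 * sum_x p_x * cos((2x + 1) * i * pi / 16)."""
    return [c(i) / 2 * sum(p[x] * math.cos((2 * x + 1) * i * math.pi / 16)
                           for x in range(8))
            for i in range(8)]

def dct_2d_direct(block):
    """2D DCT evaluated directly from the double sum."""
    return [[0.25 * c(i) * c(j) * sum(block[x][y]
                                      * math.cos((2 * x + 1) * i * math.pi / 16)
                                      * math.cos((2 * y + 1) * j * math.pi / 16)
                                      for x in range(8) for y in range(8))
             for j in range(8)] for i in range(8)]

def dct_2d_separable(block):
    """2D DCT computed as eight 1D DCTs along y followed by eight 1D DCTs along x."""
    first = [dct_1d(line) for line in block]                  # first[x][j]
    second = [dct_1d([first[x][j] for x in range(8)])         # second[j][i]
              for j in range(8)]
    return [[second[j][i] for j in range(8)] for i in range(8)]

block = [[(x * 8 + y) % 17 for y in range(8)] for x in range(8)]
direct = dct_2d_direct(block)
separable = dct_2d_separable(block)
max_diff = max(abs(direct[i][j] - separable[i][j]) for i in range(8) for j in range(8))
print(max_diff)
```

Both variants agree up to floating-point rounding, while the separable variant evaluates 16 short sums instead of 64 double sums.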
By splitting up the 2D DCT one achieves better parallelizability and less complex
computations. Through optimization it is possible to further reduce the number of necessary
operations, as shown in [3]. Since the rows can be computed independently of each other,
and likewise the columns, but the computation of the columns is based on the results of the
row computation, a maximum of 8 threads can be used to compute the DCT for one block. This
step takes O((n/b^2) · b) = O(n/b) time with b = k modulo c and c ≠ 0 as block dimension
(b = 8 in this case). The next step, quantization, performs a simple division on each
coefficient in the 8x8 block, thus every coefficient can be processed by one thread. With k
threads available, this step can be computed in O(n/k) time. The quantization tables for
the pixel blocks in JPEG are constant, and additional scaling factors can be applied
beforehand. The Huffman coding of the resulting DCT coefficients cannot be executed in
parallel, since a dynamic tree needs to be generated in which all results depend on the
results of their predecessors. There are approaches to parallelize Huffman encoding, but
they work only under certain assumptions which contradict the JPEG standard (see for
example [27] and [11] pp. 263f). Therefore this step of the algorithm is performed
sequentially in O(n) time. All in all, the first four steps of the JPEG algorithm can now
be computed in O(n/k) time in parallel; only the Huffman encoding takes O(n) time. In
practical application the compute-intensive steps (3 and 4) benefit well from the
parallelization, whereas the Huffman encoding is already optimized for fast compression of
the DCT coefficients through lookup tables in libjpeg. Therefore the overall speed
improvements are very good for this combination of sequential and parallel execution, as
shown in section 6.5.
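The complexity argument can be summarized in a simple runtime model. The constants $a$ and $h$ are hypothetical per-pixel costs of the parallel steps and of the Huffman encoding; they are introduced here only for illustration and are not measured values.

```latex
\[
T(n, k) \approx \underbrace{a \cdot \frac{n}{k}}_{\text{steps 1--4, parallel}}
              + \underbrace{h \cdot n}_{\text{Huffman, sequential}},
\qquad
\frac{T(n, 1)}{T(n, k)} = \frac{a + h}{a/k + h}
\;\xrightarrow{\;k \to \infty\;}\; 1 + \frac{a}{h}.
\]
```

Under these assumptions the achievable speedup is bounded by $1 + a/h$, so a cheap sequential Huffman stage (small $h$) leaves most of the benefit of the parallel steps intact.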
6.3 Remote Visualization on Hybrid Clusters
In addition to the remote visualization scenarios described in section 3.2, hybrid cluster
systems place additional demands on RV systems. The most obvious one is that clusters use
distributed computing to increase performance. The same holds for clusters or parts of
clusters that are specialized in visualization. In order to use RV efficiently on those
clusters, the RV systems need to be capable of processing and composing subimages from
different nodes of the cluster. For example, if a complex scene is split up in a sort-last
manner so that each node of a 4-node visualization cluster processes a tile of an image,
the RV server runs on each node and sends its result to one client, where the results are
composed into the final frame. Such functionality requires open and flexible platforms like
the one proposed in this thesis. Invire can easily be extended to serve further special
needs of hybrid clusters, as it was designed as a modular system that only uses open source
software. All parts were designed with a focus on IS/VR applications on hybrid cluster
systems.
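The composition step on the client can be sketched as follows. This is a minimal illustration that assumes equally sized tiles arriving in row-major order; the function name and the tile layout are hypothetical and not part of Invire.

```python
def compose_tiles(tiles, tile_w, tile_h, cols, rows):
    """Assemble row-major tiles (each a list of pixel rows) into one frame."""
    frame = [[None] * (tile_w * cols) for _ in range(tile_h * rows)]
    for idx, tile in enumerate(tiles):
        ox = (idx % cols) * tile_w   # horizontal offset of this tile
        oy = (idx // cols) * tile_h  # vertical offset of this tile
        for y in range(tile_h):
            frame[oy + y][ox:ox + tile_w] = tile[y]
    return frame

# Four nodes each deliver a 2x2 tile filled with their node id.
tiles = [[[node] * 2 for _ in range(2)] for node in range(4)]
frame = compose_tiles(tiles, tile_w=2, tile_h=2, cols=2, rows=2)
for row in frame:
    print(row)
```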
6.4 Prototype - The Invire Framework
The remote visualization framework was implemented from scratch and offers a basic platform
for grabbing, compressing and transferring rendered images, optimized for use in hybrid
cluster environments. The implementation of the framework as well as the realization of
selected modules are described in the following. For a better understanding, a brief
introduction to the GPGPU API CUDA was given in section 2.3; it is used intensively for the
realization of the parallel compression algorithms.
The prototypical implementation is structured into three components: the Invire plugin,
which encapsulates all server functionalities and offers different interfaces to the OpenGL
application that is remotely visualized; the Invire client, which offers all client-side
functionalities such as displaying the remotely rendered images and accepting input from
the remote user; and the Invire library, which encapsulates all common classes such as
those for compression and those containing file formats. The realization of these three
components is described in the following. The main focus lies on the implementation of the
classes in the Invire library, since they also contain the newly developed compression
methods.
6.4.1 Invire Plugin
The Invire Plugin encapsulates the server functionalities of the framework. It integrates
the Invire framework into host OpenGL applications, grabs the rendered frames, initi-
ates potential compression, sends the frames over the network and receives and passes
remote input events to the host application. The involved components and their func-
tionalities are described in the following.
6.4.1.1 Integration into Host Applications
The Invire Plugin is not a self-contained application. It is best described as an interface
between the Invire system and the OpenGL application. To fulfill this task it offers one
major function which needs to be integrated into the main rendering loop of an OpenGL
application: grabFrame(int width, int height, int bpp, int compression). By integrating
this function into the host application, the developer allows Invire to grab and process
every rendered frame. By setting the parameters, the developer can also influence the
functionality of Invire. This integration is not transparent for the user or developer and
requires a small amount of code adaptation. However, for testing and benchmarking purposes
it has two advantages over a transparent integration as offered, for example, by Chromium
[28] or VirtualGL [78]:
• The host application can control and influence the functionality of Invire directly. For
example, if it can predict which compression method is best for the frames it currently
renders, it can automatically pass this information to Invire and thereby set the optimal
compression method.
• Especially for the testing and benchmarking of new techniques for remote visualization,
it is important to be able to clearly distinguish between the different time consumers
(e.g. rendering, grabbing, compressing). This can only be achieved by explicitly
integrating the remote rendering functionality into the host application and setting
several measurement points.
For the practical usage of RV systems, however, it is vital to also provide a transparent
integration. This is a major task as soon as the framework leaves the prototypical stage,
but it is not a focus of this thesis.
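The explicit integration with measurement points might look like the following sketch. It only mirrors the parameter list of grabFrame(); render_scene, grab_frame and COMPRESSION_NONE are hypothetical stand-ins, and no OpenGL context is involved.

```python
import time

COMPRESSION_NONE = 0  # hypothetical compression id

def render_scene():
    """Stand-in for the OpenGL rendering of the host application."""
    return bytes(4 * 4 * 3)

def grab_frame(width, height, bpp, compression):
    """Stand-in with the parameter list of grabFrame(); bpp is assumed to be bytes per pixel."""
    return bytes(width * height * bpp)

def render_loop(num_frames, width=4, height=4, bpp=3):
    """Main loop of a host application with one measurement point per stage."""
    timings = {"render": 0.0, "grab": 0.0}
    for _ in range(num_frames):
        t0 = time.perf_counter()
        render_scene()
        t1 = time.perf_counter()   # measurement point after rendering
        grab_frame(width, height, bpp, COMPRESSION_NONE)
        t2 = time.perf_counter()   # measurement point after grabbing
        timings["render"] += t1 - t0
        timings["grab"] += t2 - t1
    return timings

timings = render_loop(3)
print(timings)
```

Because the call sits inside the host loop, each stage can be timed separately, which a transparent interception layer would make harder.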
The grabFrame() function invokes, depending on the selected compression technique, one of
two grabbing functions:
Direct: This method grabs the frame by calling the glReadPixels() function, which copies
the current content of the graphics card's framebuffer to a specified location in host
memory. This method is used for CPU-based compression algorithms and is offered by the
OpenGL API.
CUDA: This method allocates memory on the graphics hardware by generating so-called
PixelBufferObjects. These can be used to universally address graphics memory and to store
arbitrary data. After the allocation, the glReadPixels() function is used to read the
pixels into the PixelBufferObject. The CUDA method is used for GPU-based compression.
Grabbing can be invoked more than once to store subsequent frames for comparison-based
compression or multiple render targets. The result of a call to one of the grabbing
functions is a Frame object which contains several properties of the grabbed frame (size,
bits per pixel, color model etc.) and a pointer to the location in memory or the
PixelBufferObject where the actual frame is stored. This object is passed to the
appropriate compression module in the Invire library (see section 6.4.3), where it is
compressed and a pointer to the result is set in the Frame object. The Frame object is then
passed to the TCPServer, where the properties (including the total size of the compressed
frame in bytes) are serialized and written into a proprietary header. Finally, the header
followed by the compressed frame is sent over a TCP connection to the client.
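The header-plus-payload message can be sketched as follows. The actual layout of the proprietary header is not given here, so the field order and the fixed-size encoding below are assumptions made purely for illustration.

```python
import struct

# Hypothetical header layout (network byte order, six unsigned 32-bit integers):
# compression id, raw size, compressed size, width, height, bytes per pixel.
HEADER = struct.Struct("!6I")

def pack_frame(compression, width, height, bpp, payload):
    """Prefix the compressed payload with a fixed-size header."""
    header = HEADER.pack(compression, width * height * bpp, len(payload),
                         width, height, bpp)
    return header + payload

def unpack_frame(message):
    """Split a message into its header fields and the compressed payload."""
    names = ("compression", "filesize", "transsize", "width", "height", "bpp")
    fields = dict(zip(names, HEADER.unpack_from(message)))
    payload = message[HEADER.size:HEADER.size + fields["transsize"]]
    return fields, payload

message = pack_frame(compression=2, width=640, height=480, bpp=3,
                     payload=b"\x01\x02\x03")
fields, payload = unpack_frame(message)
print(fields, payload)
```

Carrying the compressed size in the header lets the receiver know exactly how many bytes belong to the frame before it starts reading the payload.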
6.4.1.2 Remote Control of the Application
The third main functionality of the Invire Plugin is the reception and forwarding of remote
input events. These are generated through inputs to the Invire client and passed over the
network as serialized events. The events are received by an InteractionServer and converted
into standard X events for the X server of the X Window System [20]. The X events are
passed to the application, which interprets them as if they were inputs from a locally
connected mouse and keyboard. Thereby it is possible to seamlessly steer the application
from a remote client.
6.4.2 Invire Client
The Invire Client is the part of the framework which is executed on the user's local
computer. Its main functionality is to display the remotely rendered frames. To do so, it
has to receive and decompress them. Additionally, it offers a GUI to control the properties
of Invire (select compression method, toggle input etc.) and offers the possibility to
transfer local input to the remote application. The main functions are described in the
following.
6.4.2.1 Reception, Decompression and Displaying of the Frames
Upon the reception of a frame header through the TCPClient, a new Frame object is generated
and enough memory is allocated to store the compressed pixel data. Depending on the kind of
decompression, either CPU-based or GPU-based, the allocated memory area is located in host
or graphics memory. Thereafter, a pointer to the Frame object is passed to the appropriate
decompression class in the Invire library. After decompression, the raw pixel data lies
either in host or in graphics memory. If it resides in host memory, the OpenGL function
glDrawPixels() is used to display the received frame. If a GPU-based decompression method
is applied, the raw pixel data already resides in graphics memory. Thus, this memory area
can be marked as a texture, which can then be mapped to a blank rectangle of the size of
the output window. This is the most efficient way of displaying raw pixel data, since
graphics hardware is highly optimized for the mapping and rendering of textures.
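The reception logic (read the header first, then exactly as many bytes as the header announces) can be sketched with plain sockets. The header layout is a hypothetical one with six 32-bit fields whose third field is the compressed size; it is not the actual Invire format.

```python
import socket
import struct

# Hypothetical header: compression id, raw size, compressed size, width, height, bpp.
HEADER = struct.Struct("!6I")

def recv_exact(sock, count):
    """Read exactly count bytes; a single TCP recv() may return fewer."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("connection closed before the frame was complete")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)

def receive_frame(sock):
    """Receive one header and the compressed payload it announces."""
    fields = HEADER.unpack(recv_exact(sock, HEADER.size))
    payload = recv_exact(sock, fields[2])  # fields[2] is the compressed size
    return fields, payload

# Local demonstration with a connected socket pair instead of a real network.
sender, receiver = socket.socketpair()
sender.sendall(HEADER.pack(2, 921600, 5, 640, 480, 3) + b"hello")
fields, payload = receive_frame(receiver)
sender.close()
receiver.close()
print(fields, payload)
```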
6.4.2.2 Controlling the Remote Application
An optional setting allows mouse and keyboard input from within the client window to be
captured and sent to the remote application, which treats this input as local keyboard and
mouse commands. This is done by so-called GLUT (OpenGL Utility Toolkit) callbacks. Every
time the mouse is moved, a button is pressed or a key on the keyboard is hit, one of the
callback functions is called. These functions generate a proprietary event message
containing status information, such as the position of the mouse inside the window, which
mouse button is pushed or held, or which key was hit. These events are serialized and sent
via the TCPClient to the Invire plugin, where they are received and interpreted as
described above.
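A serialized input event could be encoded as in the following sketch. The event types and the binary layout are hypothetical, since the proprietary message format is not specified in the text.

```python
import struct

# Hypothetical event layout: type (1 byte), x and y (signed 16 bit), button, key code.
EVENT = struct.Struct("!B2hBB")
MOUSE_MOVE, MOUSE_BUTTON, KEY_PRESS = 0, 1, 2

def serialize_event(kind, x=0, y=0, button=0, key=0):
    """Pack one input event into a fixed-size message."""
    return EVENT.pack(kind, x, y, button, key)

def deserialize_event(data):
    """Unpack a message produced by serialize_event()."""
    kind, x, y, button, key = EVENT.unpack(data)
    return {"kind": kind, "x": x, "y": y, "button": button, "key": key}

message = serialize_event(KEY_PRESS, x=120, y=45, key=ord("a"))
event = deserialize_event(message)
print(len(message), event)
```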
6.4.3 Invire Library
The Invire library is the component of the framework where all common classes of
Invire are encapsulated. Those are the common formats such as the Frame objects,
all compression and decompression algorithms as well as tools for benchmarking and
testing. The following describes the implementation of the major component classes.
6.4.3.1 The Frame Object
The Frame object holds all relevant information for a frame as well as pointers to the
memory locations in host or graphics memory where the compressed or uncompressed pixels are
stored. The following listing shows all of its variables including short descriptions:
bool m_newFrame: is set to true if this frame has not been displayed yet
int m_compression: the ID of the selected compression method
int m_filesize: size in bytes before compression
int m_transsize: size in bytes after compression
int m_width: width of the frame
int m_height: height of the frame
int m_bpp: bytes per pixel
char* m_glFormat: format of the raw pixel data (e.g. RGB)
unsigned char* m_pixels: pointer to the memory location where the uncompressed pixel data
is stored
unsigned char* m_compressedPixels: pointer to the memory location where the compressed
pixel data is stored
unsigned int m_pixelBufferObject: ID of the Pixel Buffer Object where the pixel data is
stored for GPU-based compression and decompression
A Frame object is first generated when a frame is grabbed and is then used for all further
operations on the frame, i.e. compression, transmission and decompression. Depending on the
current compression method, one or more Frame objects reside in memory. For frame-to-frame
coherent compression methods the last frame(s) are also stored so that they can be used for
compression.
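For illustration, the listed members can be mirrored in a small data class. This is a sketch rather than the actual C++ class: the field names are adapted to Python conventions, and deriving the two sizes from the other fields is an assumption made here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Frame:
    width: int
    height: int
    bpp: int                                   # bytes per pixel
    compression: int = 0                       # id of the compression method
    gl_format: str = "RGB"                     # format of the raw pixel data
    new_frame: bool = True                     # True until the frame is displayed
    pixels: Optional[bytes] = None             # uncompressed pixel data
    compressed_pixels: Optional[bytes] = None  # compressed pixel data
    pixel_buffer_object: int = 0               # id of the buffer for GPU processing

    @property
    def filesize(self) -> int:
        """Size in bytes before compression."""
        return self.width * self.height * self.bpp

    @property
    def transsize(self) -> int:
        """Size in bytes after compression (0 if not compressed yet)."""
        return 0 if self.compressed_pixels is None else len(self.compressed_pixels)

frame = Frame(width=640, height=480, bpp=3)
frame.compressed_pixels = b"\x00" * 1000
print(frame.filesize, frame.transsize)
```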
6.4.3.2 Compression/Decompression - General
All compression/decompression algorithms are implemented in the Invire library. To
guarantee interoperability and exchangeability, all classes for compression are derived
from a common Compression class. This ensures that all methods provide a compress()
function which compresses a frame that is passed as an argument and a decompress()
function which decompresses the given frame. Furthermore, all available compression
methods are listed and assigned to unique IDs in this class. These IDs determine the
compression/decompression methods system wide and are stored in the frame object
after compression.
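The common interface can be sketched as an abstract base class with a registry of method IDs. The two concrete methods and their IDs are hypothetical examples chosen for brevity; they are not the methods implemented in Invire.

```python
from abc import ABC, abstractmethod

class Compression(ABC):
    """Common base class: every method offers compress() and decompress()."""
    NONE, RLE = 0, 1  # hypothetical system-wide method ids

    @abstractmethod
    def compress(self, pixels: bytes) -> bytes: ...

    @abstractmethod
    def decompress(self, data: bytes) -> bytes: ...

class NoCompression(Compression):
    method_id = Compression.NONE

    def compress(self, pixels):
        return pixels

    def decompress(self, data):
        return data

class RunLength(Compression):
    """Byte-wise run-length coding as (count, value) pairs with count <= 255."""
    method_id = Compression.RLE

    def compress(self, pixels):
        out, i = bytearray(), 0
        while i < len(pixels):
            run = 1
            while i + run < len(pixels) and run < 255 and pixels[i + run] == pixels[i]:
                run += 1
            out += bytes((run, pixels[i]))
            i += run
        return bytes(out)

    def decompress(self, data):
        out = bytearray()
        for i in range(0, len(data), 2):
            out += bytes([data[i + 1]]) * data[i]
        return bytes(out)

# The id stored with a frame selects the matching class on the receiving side.
REGISTRY = {cls.method_id: cls for cls in (NoCompression, RunLength)}

raw = b"\x00" * 300 + b"\x07\x07\x09"
packed = REGISTRY[Compression.RLE]().compress(raw)
print(len(raw), len(packed))
```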
The implementation of the CPU-based compression algorithms is quite straight-
forward and follows the description given in section 6.2.4. For further details please
consult the source code and the doxygen-generated documentation. The main inno-
vation is in the implementation of the CUDA-based compression methods which are
described in the following.
6.4.3.3 Compression - CUDA-Based Difference with Index Compression (DIC)
The first CUDA-based method that was implemented is the parallel difference with index
method described in section 6.2.4.3. After grabbing two consecutive frames by the
CUDA-based grabbing method and storing them in the memory of the graphics card, the actual
compression method for the CUDA-based DIC is called. It allocates new arrays for the index,
the changed pixels, the numbers of changed pixels per block and the memory indices at which
the changed pixels of each block are written to the result array. After that, a first CUDA
kernel is called with a block size of 256 threads (see footnote 4). The first kernel does
the following:
1. Load 256 corresponding pixels of the last and the current frame into shared memory.
2. 8 threads in parallel each write an index bit by comparing a pixel of the last frame to
a pixel of the current frame: 1 stands for "pixel changed", 0 stands for "pixel did not
change". Only 8 threads can do this step in parallel, because writing simultaneously to one
char by more than one thread results in inconsistent data and unpredictable behavior. This
step is repeated 8 times. The resulting 8 bytes that form the partial index are copied from
shared to global memory by 8 threads.
3. To determine the memory position of each changed pixel in the local result, a local
stream compaction algorithm based on the stream compaction implementation described in [26]
is invoked. In the first step, 256 threads determine whether the pixel at their ID i and
the pixel at ID i-1 have changed and store the sum of this information in c(i), an array in
shared memory that holds the number of changed pixels. In each of the following log2(256)
steps s, the threads add the numbers of changed pixels at c(i) and c(i - 2^s) and store the
result in c(i). When all steps are computed, the resulting information is used to compute
the position in local memory where the changed pixels are stored. This is done by the
threads whose IDs belong to the changed pixels.
4. The total number of changed pixels in that block is copied to global memory.
Footnote 4: 256 threads turned out to be the most effective block size in this scenario.
More or fewer threads resulted in a performance decrease.
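The work of the first kernel on one block can be sketched sequentially. The following is a CPU reference, not the CUDA kernel: each list comprehension stands for one synchronized step of all threads, the bit order within an index byte (most significant bit first) is an assumption, and the scan is the textbook Hillis-Steele variant rather than the exact code of [26].

```python
def build_index(prev, curr):
    """Pack one changed/unchanged bit per pixel into bytes, most significant bit first."""
    index = bytearray((len(curr) + 7) // 8)
    for i, (old, new) in enumerate(zip(prev, curr)):
        if old != new:
            index[i // 8] |= 0x80 >> (i % 8)
    return bytes(index)

def inclusive_scan(flags):
    """Hillis-Steele scan: in every step, element i adds the element offset places before it."""
    c = list(flags)
    offset = 1
    while offset < len(c):
        # All "threads" read the old array, which mimics one synchronized parallel step.
        c = [c[i] + (c[i - offset] if i >= offset else 0) for i in range(len(c))]
        offset *= 2
    return c

def compact_block(prev, curr):
    """Return the index bits, the compacted changed pixels and their count for one block."""
    flags = [int(old != new) for old, new in zip(prev, curr)]
    c = inclusive_scan(flags)
    count = c[-1] if c else 0
    changed = [None] * count
    for i, flag in enumerate(flags):
        if flag:
            changed[c[i] - 1] = curr[i]  # inclusive sum minus one is the output slot
    return build_index(prev, curr), changed, count

prev = [1, 2, 3, 4, 5, 6, 7, 8, 9]
curr = [1, 9, 3, 9, 5, 6, 7, 9, 0]
index, changed, count = compact_block(prev, curr)
print(index.hex(), changed, count)
```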
After the first kernel has processed all imagesize/256 blocks, the index generation is
complete and the result array consists of locally compacted pixel rows, which still need to
be compacted globally to achieve the actual compression. This is done with the help of a
second kernel, following the implementation described in [25]. It uses the array containing
the numbers of changed pixels per block to generate a memory index for the final position
of each pixel row in the result array. This is done in parallel by summing all previous
values up to each position in the array and storing this sum in a new array.
This array, which now contains the absolute memory locations in the result array, is used
to copy the changed pixels in parallel from their current positions to the new positions in
the final result array. The compressed frame, which consists of the index and the array of
the changed pixels in consecutive order (see footnote 5), is copied back to host memory and
sent to the Invire client, where it is decompressed either CPU-based or GPU-based.
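The global compaction can be sketched as an exclusive prefix sum over the per-block counts followed by a copy. This is a sequential stand-in for the parallel scan of [25]; the function names and the padded row layout are illustrative assumptions.

```python
def exclusive_scan(counts):
    """Start offset of every block: the sum of all counts before it."""
    offsets, total = [], 0
    for count in counts:
        offsets.append(total)
        total += count
    return offsets, total

def compact_globally(local_rows, counts):
    """Copy the locally compacted rows to their final positions in one result array."""
    offsets, total = exclusive_scan(counts)
    result = [None] * total
    for row, count, start in zip(local_rows, counts, offsets):
        result[start:start + count] = row[:count]  # independent per block
    return result, offsets

# Each row holds the changed pixels of one block, padded with None.
local_rows = [[5, 6, None, None], [None] * 4, [7, None, None, None], [1, 2, 3, None]]
counts = [2, 0, 1, 3]
result, offsets = compact_globally(local_rows, counts)
print(offsets, result)
```

Since every block writes to a disjoint range of the result, the copy step needs no synchronization between blocks.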
6.4.3.4 Decompression - CUDA-Based Difference with Index Compression (DIC)
If the computer running the Invire client also supports the CUDA architecture, it is
possible to use parallel, GPU-based decompression instead of the standard, sequential,
CPU-based decompression. The parallel decompression method requires the index, the changed
pixels and the array that stores the number of changed pixels per block to reconstruct the
original frame. The first step is to calculate the memory position of each pixel row
belonging to a thread block of 256 threads. This is done similarly to the computation for
compression, by using a scan algorithm that sums the numbers of changed pixels per block.
After that, a second kernel is started with a block size of 256 threads. It uses the index
to determine the absolute local position of its portion of the changed pixels, and writes a
changed pixel where the index is 1 and a pixel from the last frame where the index is 0.
When all blocks are done, the final decompressed frame is displayed to the user through the
Invire Client.
Footnote 5: Optionally, the array that stores the number of changed pixels per block can be
included to ease the parallel decoding of the frame.
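A sequential reference for the decompression shows the rule applied per pixel. It assumes one index bit per pixel with the most significant bit first; the GPU variant additionally uses the per-block counts to find its starting position in the array of changed pixels.

```python
def decompress_dic(prev, index, changed):
    """Rebuild the current frame from the last frame, the index bits and the changed pixels."""
    frame, next_changed = [], 0
    for i, old in enumerate(prev):
        bit = (index[i // 8] >> (7 - i % 8)) & 1
        if bit:
            frame.append(changed[next_changed])  # index 1: take the next changed pixel
            next_changed += 1
        else:
            frame.append(old)                    # index 0: keep the pixel of the last frame
    return frame

prev = [1, 2, 3, 4, 5, 6, 7, 8, 9]
restored = decompress_dic(prev, b"\x51\x80", [9, 9, 9, 0])
print(restored)
```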
6.4.3.5 Compression - CUDA-Based JPEG
The CUDA-based JPEG compression is logically separated into two CUDA kernels. The first
kernel computes the color conversion and the downsampling, and the second kernel is
responsible for the Discrete Cosine Transformation (DCT) and the quantization of the DCT
coefficients. This partitioning is the best compromise between maximizing the number of
threads per block and minimizing expensive read and write operations from and to global
memory. The maximum number of threads per block is determined by the number of independent
operations on a certain amount of data. Color conversion can be done independently for
every pixel, thus it is best to use the maximum number of threads per block. The CUDA
Programming Guide [61] recommends 64-256 threads per block as the best value for current
hardware (see footnote 6). To keep to the JPEG standard, a block size of 8x8 pixels was
chosen, which results in 64 threads for the color conversion step. The downsampling step
uses 4 pixels to compute their mean Cb and Cr values, thus it would be best to also use 64
threads per block and assign each thread to 4 pixels. However, it is important to reduce
global memory accesses in CUDA kernels, since each of them consumes 200-300 clock cycles in
contrast to 4 cycles for an access to shared memory. After reading and storing 64 pixels in
shared memory for color conversion, it is better to reuse the results for downsampling with
just 32 threads instead of writing the results to global memory and starting a new kernel
which reads them in again. Downsampling is performed for both the Cb and the Cr components,
therefore 32 threads can compute that step in parallel. The following enumeration briefly
describes the computation steps of the first kernel:
1. At first the 64 RGB pixel values are loaded into shared memory in parallel by 64
threads. The RGB values are stored in a one-dimensional array. To be able to compute the
downsampling, two-dimensional 4x4 pixel blocks are needed, and DCT and quantization require
8x8 pixel blocks. Thus, the pixels are stored in 8x8 blocks in shared memory to allow the
subsequent computations on the same data structure.
2. The color conversion is implemented through three equations which compute the YCbCr
values from any given RGB pixel. The equations are: Y = 0.29900 * R + 0.58700 * G +
0.11400 * B - 128; Cb = -0.16874 * R - 0.33126 * G + 0.50000 * B; Cr = 0.50000 * R -
0.41869 * G - 0.08131 * B. The constants used in these equations are those used in the
libjpeg implementation. All three equations are computed sequentially by one thread for
each pixel.
3. The downsampling of the Cb and Cr components is done by 32 threads in parallel. 16
threads compute the means of the 2x2 pixel blocks for the Cb values and the other 16
threads compute them on the same pixel blocks for the Cr values. The mean is computed by
adding all 4 values and dividing the result by 4. The division is replaced by a shift
operation to optimize performance.
4. The last step is to store the downsampled YCbCr values in global memory. Since DCT and
quantization are done independently on each component, the components are already stored
separately. The Y values are stored in consecutive 8x8 pixel blocks which are mapped to a
one-dimensional array. The same holds for the Cb and the Cr components. However, the
resulting 4x4 blocks of the downsampling need to be grouped into 8x8 blocks to prepare the
data for the DCT.
Footnote 6: This may change for new revisions of GPUs since they may have more parallel
execution units.
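Steps 2 and 3 of the first kernel can be sketched per pixel and per 2x2 group. This is a sequential illustration using the coefficients of the usual libjpeg RGB to YCbCr conversion; the assumption that the chroma values are integers before the shift is made here only so that the shift is well defined.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one RGB pixel; Y is shifted by -128 so all components are centered at 0."""
    y = 0.29900 * r + 0.58700 * g + 0.11400 * b - 128
    cb = -0.16874 * r - 0.33126 * g + 0.50000 * b
    cr = 0.50000 * r - 0.41869 * g - 0.08131 * b
    return y, cb, cr

def downsample_2x2(plane):
    """Mean of every 2x2 group of an integer plane; the division by 4 is a shift by 2."""
    height, width = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1]) >> 2
             for x in range(0, width, 2)]
            for y in range(0, height, 2)]

white = rgb_to_ycbcr(255, 255, 255)
chroma = [[(x + y) % 16 for x in range(8)] for y in range(8)]  # hypothetical 8x8 Cb plane
small = downsample_2x2(chroma)
print(white, len(small), len(small[0]))
```

An 8x8 chroma block shrinks to 4x4, which is why four downsampled blocks are grouped before the DCT.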
After kernel 1 has finished for all blocks the second kernel is started. Each component
(Y, Cb and Cr) is processed separately by one call to kernel 2. It has 8 threads per block
and does the following:
1. Loading the data is straightforward since it is already stored in the required 8x8 pixel
blocks. Each of the 8 threads loads one row of a block into shared memory.
2. To avoid the complex and sequential computation of a 2D DCT, it can be split into the
computation of several 1D DCTs. The computation of the DCT for each row is independent and
can be done by 8 threads in parallel. Thereafter, those 8 threads compute the DCTs of the
columns on the results of the preceding step. Figure 6.8 depicts the two passes of these
steps. The direct computation of the equations shown in section 6.2.4.5, which would cost
896 additions and 1024 multiplications per 8x8 pixel block, can be substituted by the
following computations. This algorithm was developed by Arai et al. and is described in [3]:
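Define:
Input: $i$ = array of 8 values
Sums: $s_{jk} = i(j) + i(k)$
Differences: $d_{jk} = i(j) - i(k)$
for $0 \le j, k \le 7$.
\[
\begin{aligned}
m_1 &= 2[(s_{07} + s_{34}) + (s_{25} + s_{16})] \\
m_2 &= (s_{07} + s_{34}) - (s_{25} + s_{16}) \\
m_3 &= (s_{07} - s_{34}) \\
m_4 &= d_{07} \\
m_5 &= \cos(\pi/4)[(s_{16} - s_{34}) - (s_{07} - s_{34})] \\
m_6 &= \cos(\pi/4)(d_{25} + d_{16}) \\
m_7 &= \cos(3\pi/8)[(d_{16} + d_{07}) - (d_{25} + d_{34})] \\
m_8 &= [\cos(\pi/8) + \cos(3\pi/8)](d_{16} + d_{07}) \\
m_9 &= [\cos(\pi/8) + \cos(3\pi/8)](d_{25} + d_{34}) \\
z_1 &= m_4 + m_6 \\
z_2 &= m_4 - m_6 \\
z_3 &= m_8 - m_7 \\
z_4 &= m_9 - m_7
\end{aligned}
\]
These computations lead to the following equations for the 8 DCT coefficients:
\[
\begin{aligned}
DCT(0) &= m_1 \\
DCT(1) &= m_2 \\
DCT(2) &= m_3 + m_5 \\
DCT(3) &= m_3 - m_5 \\
DCT(4) &= z_2 - z_4 \\
DCT(5) &= z_2 + z_4 \\
DCT(6) &= z_1 + z_3 \\
DCT(7) &= z_1 - z_3
\end{aligned}
\]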
The coefficients computed by this algorithm need to be scaled by a constant factor to
obtain the final coefficients. Since this factor is constant, it can easily be integrated
into the constant quantization tables in the next step. By using this optimized 1D DCT
algorithm, the 2D DCT of an 8x8 pixel block can be computed with 464 additions and 80
multiplications.
3. Quantization is a simple division by a constant for every coefficient in the 8x8 block.
The quantization tables are the same for all blocks of a component. Before they are applied
in kernel 2, the tables are multiplied by the scaling values of the DCT step as described
before. Additionally, the selected compression quality influences the values of the
quantization tables. The scaled and adapted quantization tables are applied to the
coefficients by 8 threads in parallel. This could also be done by 64 threads, but in order
to avoid copying data to and from global memory, the 8 threads of the DCT step are reused.
Again, each thread processes one line of 8 DCT coefficients and multiplies them by the
inverse of the corresponding value in the quantization table.
4. Finally, the quantized DCT coefficients are written back to global memory in 8x8 blocks
for each component.
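A scaled 1D DCT of this kind can be sketched and checked against the direct formula. The sketch follows the structure of the floating-point fast DCT in libjpeg (jfdctflt.c), which is based on the algorithm of Arai et al.; its intermediate names and its output ordering are chosen here and do not map one-to-one onto m1 to m9 above. The scale factors in scale_factor() were derived for this sketch and relate its output to the coefficients S_k of section 6.2.4.5.

```python
import math

C4 = math.cos(math.pi / 4)
C6 = math.cos(3 * math.pi / 8)
C2_MINUS_C6 = math.cos(math.pi / 8) - C6
C2_PLUS_C6 = math.cos(math.pi / 8) + C6

def scaled_dct_1d(v):
    """Scaled 1D DCT of 8 values using 5 multiplications and 29 additions."""
    s07, d07 = v[0] + v[7], v[0] - v[7]
    s16, d16 = v[1] + v[6], v[1] - v[6]
    s25, d25 = v[2] + v[5], v[2] - v[5]
    s34, d34 = v[3] + v[4], v[3] - v[4]
    out = [0.0] * 8
    # Even part
    e0, e3 = s07 + s34, s07 - s34
    e1, e2 = s16 + s25, s16 - s25
    out[0] = e0 + e1
    out[4] = e0 - e1
    z1 = (e2 + e3) * C4
    out[2] = e3 + z1
    out[6] = e3 - z1
    # Odd part
    t10, t11, t12 = d34 + d25, d25 + d16, d16 + d07
    z5 = (t10 - t12) * C6
    z2 = C2_MINUS_C6 * t10 + z5
    z4 = C2_PLUS_C6 * t12 + z5
    z3 = t11 * C4
    z11, z13 = d07 + z3, d07 - z3
    out[5] = z13 + z2
    out[3] = z13 - z2
    out[1] = z11 + z4
    out[7] = z11 - z4
    return out

def reference_dct_1d(p):
    """Direct evaluation of S_k = C_k / 2 * sum_x p_x * cos((2x + 1) * k * pi / 16)."""
    return [(1 / math.sqrt(2) if k == 0 else 1.0) / 2
            * sum(p[x] * math.cos((2 * x + 1) * k * math.pi / 16) for x in range(8))
            for k in range(8)]

def scale_factor(k):
    """Constant factor by which output k of scaled_dct_1d() exceeds S_k."""
    return 2 * math.sqrt(2) if k == 0 else 4 * math.cos(k * math.pi / 16)

values = [52, 55, 61, 66, 70, 61, 64, 73]
fast = scaled_dct_1d(values)
exact = reference_dct_1d(values)
max_error = max(abs(fast[k] / scale_factor(k) - exact[k]) for k in range(8))
print(max_error)
```

Because the factors are constant per coefficient, they can be folded into the quantization tables, so the scaling costs no extra operations at run time.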
Figure 6.8: 2D DCT through separate 1D DCT computations for an 8x8 pixel block.
The last step of the JPEG compression is the Huffman encoding, which makes use of the many
zeros in the higher frequencies resulting from DCT and quantization. This step, as
described in section 6.2.4.5, cannot be computed efficiently in parallel. Thus, the highly
optimized sequential version of the libjpeg implementation is used to compute this step. As
a side effect, it is possible to use the libjpeg data structures to automatically generate
a JPEG-conformant header. After initializing the library and allocating a
jpeg_compress_struct, the function jpeg_write_coefficients is used to pass the precomputed
coefficients to the Huffman encoding facility of libjpeg. The result after calling
jpeg_finish_compress is a standard-conformant JPEG image in a specific memory location. For
more implementation details please refer to [41].
The decompression on the client is implemented by using standard libjpeg func-
tions. A CUDA-based decompression is also available and mainly consists of the in-
verse steps of the encoding (i.e. Huffman decoding, inverse DCT and color conversion
from YCbCr to RGB). However, the main focus of this implementation was to show the
feasibility of a fast parallel JPEG encoding to achieve high frame rates for RV. Decod-
ing is quite fast in the sequential case already since Huffman Decoding is faster and
the quantization step is omitted, thus the parallel decompression does not achieve a
significant performance increase (as shown in section 6.5).
6.5 Benchmarking the Remote Visualization Framework
with Practical Examples
The Invire framework was designed to evaluate, test and compare algorithms for image
compression and frame grabbing for remote rendering. Thus, after introducing and
implementing new grabbing and compression techniques, this section deals with various
comparative benchmarks of the compression algorithms and performance measurements of the
whole system. The Invire framework is also compared to VirtualGL, the most commonly used
system in the class of server-side remote rendering for 3D applications, which is briefly
introduced in section 6.5.1. In addition to the pure performance benchmarking, a quality
assessment for the JPEG-based compression methods was conducted using the SSIM index, which
is briefly introduced in section 6.5.2; the assessment is described in section 6.5.6. All
benchmarks were carried out on two different sample applications and with varying
parameters (e.g. frame resolution, JPEG quality, bandwidth limitations) that influence the
performance and the quality of the remotely rendered frames. The hardware that was used to
perform the benchmarks, as well as the sample applications, which represent two classes of
applications, are described in section 6.5.3.
6.5.1 The Reference Framework VirtualGL
VirtualGL [78] is a very sophisticated and renowned framework for remote rendering.
It offers transparent integration, fast compression as well as transmission techniques
and mechanisms for the remote interaction. It can also be combined with other frame-
works, such as VNC [70], to speed up the remote administration or Chromium [28]
to allow for remote rendering with distributed graphics applications. VirtualGL offers
different modes for compression and transmission: Besides two modes for uncom-
pressed transmission of the rendered data (as raw RGB stream or as X11 stream), the
standard compression mode is based on the JPEG still image compression. For JPEG
compression VirtualGL offers two implementations. The first is the pseudo-standard
libjpeg [29] implementation which is completely open source and allows decent perfor-
mance. The second is called TurboJPEG [80] and is based on the Intel(R) Integrated Per-
formance Primitives [30], a set of libraries which contain highly-optimized multimedia
functions for x86 processors. According to the developers of VirtualGL, it outperforms
the libjpeg implementation by a factor of 2-4. This implementation, however, is not
free and cannot be compiled without the licenses for the Intel Performance Primitives.
Thus, it was not considered in the overall performance benchmarks. It is included in
the pure JPEG comparison benchmarks in section 6.5.5 to allow a comparison between
the GPU-based and the fastest available CPU-based method. VirtualGL also offers various
optimization options for all kinds of applications. For better comparability, all settings
were left at their default values for benchmarking.
6.5.2 Quality Assessment with the SSIM Index
In addition to benchmarking the performance of the proposed system and comparing it to
other systems, it is also vital to make statements about the quality of the lossily
compressed images. There are several metrics for the objective measurement of image
quality, and most of them aim to simulate selected characteristics of the ultimate judge,
the human visual system (HVS). One of the first metrics, introduced by Mannos et al. in
[48], is the mean squared error (MSE), which is computed by averaging the squared intensity
differences of distorted (e.g. through compression) and reference image pixels, along with
the related quantity of peak signal-to-noise ratio (PSNR).
These methods and many successors, which tried to enhance the original metric to better
simulate the characteristics of the HVS, are quite appealing because they are simple to
compute, have clear physical meanings and could be reused from other signal processing
applications. However, there are many reports which find that the methods based on the MSE
are not very well matched to perceived quality (see for example [16] and [81]). Thus, Wang
et al. introduced a novel approach in 2004 [82], which takes into account that natural
images are highly structured. That means that their pixels have strong dependencies,
especially when they are spatially proximate. These dependencies carry important
information about the structure and the objects in a visual scene. In contrast to other
approaches, which independently compare the channels of the distorted and the original
image pixel-wise, Wang et al. propose a combined measurement of luminance, contrast and
structure components. The combination of these three components results in the so-called
SSIM index, which indicates the objective image quality by a value between 0 and 100%. They
showed that their results were much closer to results generated by subjective quality
assessments with human users. This is why the SSIM metric is also used in this thesis to
evaluate the quality of the remotely rendered and lossily compressed images. The goal was
to achieve SSIM indices above 90%, which translate to nearly invisible quality loss for the
compressed images.
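The metrics can be illustrated with a minimal sketch on flat lists of intensity values. The SSIM variant below evaluates the statistics once over the whole signal, which is a simplification: the published method averages the index over local windows. The stabilizing constants use the commonly cited defaults K1 = 0.01 and K2 = 0.03.

```python
import math

def mse(x, y):
    """Mean squared error between two equally long intensity lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB, derived from the MSE."""
    error = mse(x, y)
    return math.inf if error == 0 else 10 * math.log10(peak ** 2 / error)

def ssim_global(x, y, peak=255.0):
    """SSIM computed over a single window that covers the whole signal."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2))

reference = [10, 50, 90, 130, 170, 210]
distorted = [value + 20 for value in reference]
print(mse(reference, distorted), psnr(reference, distorted),
      ssim_global(reference, distorted))
```

A result of 1.0 corresponds to the 100% of the text and is reached only for identical inputs.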
In addition, the SSIM index is used to classify the heterogeneity between two consecutive
frames of the two sample applications. This helps to explain
certain differences in the benchmarking results. If the mean SSIM index of two consecutive
frames is high, there is only little heterogeneity between them. That means
that difference-based, lossless compression methods are likely to achieve high
compression rates for high SSIM indices. Conversely, if the mean SSIM index
of two consecutive frames is low (below 60%), it is likely that those compression
techniques do not produce reasonable compression rates.
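The classification described above could look roughly as follows. This is only a sketch: the function name and labels are hypothetical, the lower threshold of 60% is taken from the text, and the upper threshold of 85% is an assumption chosen for illustration.

```python
def classify_heterogeneity(pair_ssim, low=0.60, high=0.85):
    """Classify a frame sequence by the mean SSIM of consecutive frame pairs.

    pair_ssim holds one SSIM value (0..1) per pair of consecutive frames.
    `low` follows the 60% bound named in the text; `high` is an assumption.
    """
    if not pair_ssim:
        raise ValueError("at least one SSIM value is required")
    mean_ssim = sum(pair_ssim) / len(pair_ssim)
    if mean_ssim < low:
        label = "high heterogeneity"   # difference coding unlikely to pay off
    elif mean_ssim >= high:
        label = "low heterogeneity"    # difference coding looks promising
    else:
        label = "intermediate"
    return mean_ssim, label


# Made-up SSIM values for two short frame sequences.
print(classify_heterogeneity([0.90, 0.88, 0.91]))
print(classify_heterogeneity([0.50, 0.55]))
```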
6.5.3 The Benchmarked System and the Sample Applications
In order to prove the theoretical concept and to compare the described compression
algorithms, we prototypically implemented the Invire system and tested it in a sandbox
environment. The server and the client part each run on a computer equipped with
a CUDA-ready GeForce 8800 GTS (G92) graphics card by NVIDIA. It has 12 multiprocessors
and a warp size⁷ of 32. That leads to k = 12 · 32 = 384 threads that can
run concurrently. Both computers are equipped with an Intel Core 2 Duo E4400 CPU
running at 2.00 GHz and 2 GB of RAM. The network connection is a 100 Mbit Ethernet
connected through a switch. The peak bandwidth that could be achieved
over this network was roughly 11.8 MB/s. Two sample applications were chosen to
represent groups of applications. The following criteria played a role in the selection:
⁷ Number of threads that are executed in parallel on one multiprocessor.
6.5 Benchmarking the RV Framework
Figure 6.9: Test cases: a) simple teapot, b) Virtual Night Drive without headlight
simulation.
The mean heterogeneity of two consecutive frames:
This parameter determines if an application is highly dynamic or rather static. In
the field of IS/VR there exist both kinds of applications, for example rather static
CAD visualizations or highly dynamic virtual worlds with rich visual effects. Thus,
it was important to choose one application of each class. The heterogeneity could be
measured with the help of the SSIM index.
Multicolored or grayscale frames:
Again, there are applications in IS/VR that have either one or the other attribute,
e.g. grayscale medical imaging visualizations vs. high dynamic range VR environments.
Large uniformly colored areas vs. very heterogeneous multicolored areas:
As found, for example, in CAD applications vs. highly dynamic virtual worlds with rich
visual effects.
Following these criteria, two sample applications were selected:
A rotating teapot (see figure 6.9a) as a representative of the area of rather static object
visualization applications. It simulates a steady rotation of a grayscale 3D
model with one fixed light source. The background is uniformly black and not
lit. This sample application has quite a low heterogeneity between two consecutive
frames (mean SSIM index between two consecutive frames 89.35%), grayscale
frames and at least one large uniformly colored area (the black background).
These attributes describe a rather typical CAD or 3D object visualization application.
The Virtual Night Drive Simulator (VND) (see section 5.3.1 and figure 6.9b) as a
representative of the area of highly dynamic virtual worlds. The VND specializes
in simulating automotive headlights at night and uses the shaders of
the graphics card to calculate the luminance intensity per pixel. It also features
a daylight view on the scene which is very detailed and feature-rich and offers
dynamic lighting features. This sample application has quite a high heterogeneity
between two consecutive frames (mean SSIM index between two consecutive
frames 53.89%), multicolored frames and very heterogeneous multicolored areas
because of the extensive use of textures. These attributes describe a rather typical
VR application.
By selecting those two representatives of different application areas, the benchmarking
results can give information about how well each of the compression algorithms and
the overall system might perform depending on the selected application. Possibly there
are algorithms that are more suited for one or the other area. There certainly are more
groups of applications in IS/VR, but the ones selected represent the biggest groups
with the most obvious differences. The specifications of most of the other application
groups lie in between these two extremes.
6.5.4 Benchmarking and Comparing the Overall System Performance
In the first round of benchmarks the overall system performance (i.e. grabbing, com-
pression, network transmission, decompression and displaying) is calculated and com-
pared to the performance of the reference system VirtualGL. Both sample applications
are tested in three different resolutions (640x480, 1024x768 and 1680x1050 pixels) to
study the effect of the higher load on CPU/GPU and network. [79] describes performance
measurements with the VirtualGL framework and gives a good description
of the metrics to use: it is pointed out that simply measuring the achievable frames
per second is not enough since one cannot determine if the limitation to the measured
value is due to processing limitation or bandwidth limitation. Thus, in the following
benchmarking diagrams both the achievable frame rate and the consumed bandwidth
are shown in order to allow the comparison between network and processor limitations
of each compression technique. To make the results statistically meaningful, the measurements
were conducted for 60 seconds and a confidence interval is given for each mean
value of the measured FPS. The confidence interval includes 95% of all measured values.
For both RV systems, Invire and VirtualGL, all available compression methods
are tested and the overall performance in frames per second as well as the required
bandwidth in MB/s is depicted in the diagrams. The Invire framework includes run
length, difference, difference with index, CUDA-based difference with index, JPEG
and CUDA-based JPEG compression methods. Additionally, the CUDA-based methods
can optionally make use of CUDA-based decompression on the client. VirtualGL
offers uncompressed RGB transmission as well as libjpeg-based JPEG compression.
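How a mean frame rate with an interval covering 95% of the measured values can be derived from a series of readings is sketched below. This is one plausible reading of the procedure described above, not the evaluation code of this thesis; the function name, the sampling of one reading per second and the sample values are assumptions.

```python
def summarize_fps(samples, coverage=0.95):
    """Return (mean, lower, upper); [lower, upper] is a central interval
    that contains at least `coverage` of the measured values."""
    if not samples:
        raise ValueError("no measurements")
    ordered = sorted(samples)
    n = len(ordered)
    mean = sum(ordered) / n
    # Number of values cut off at each end; rounding down keeps >= coverage.
    cut = int((1.0 - coverage) / 2.0 * (n - 1))
    return mean, ordered[cut], ordered[n - 1 - cut]


# Hypothetical run: one FPS reading per second for 60 seconds, two outliers.
readings = [40.0] * 58 + [10.0, 70.0]
mean, lower, upper = summarize_fps(readings)
print(f"{mean:.1f} Fps, 95% of readings within [{lower:.1f}, {upper:.1f}]")
```

Reporting the interval next to the mean makes it visible when a method delivers an acceptable average but strongly fluctuating frame rates.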
6.5.4.1 Rotating Teapot
[Bar chart "Teapot 640 x 480": mean frames per second (left axis) and bandwidth in MB/s (right axis) per compression method. Mean FPS labels in plotted order: uncompressed 9.565; RLE 36.587; Difference 38.622; DIC 39.339; CUDA DIC 35.612; CUDA DIC & decomp 39.699; JPEG 37.229; CUDA JPEG 41.169; CUDA JPEG & decomp 41.036; VirtualGL RGB 12.763; VirtualGL JPEG 37.732.]
Figure 6.10: Rotating Teapot in 640 x 480 Pixels.
Figure 6.10 shows the benchmarking results for the rotating teapot application in
640x480 resolution. Both uncompressed methods (Invire uncompressed and VirtualGL
RGB) as well as the run length encoding algorithm are already limited by the available
bandwidth. All other techniques are limited by the processor capabilities and achieve
frame rates around 40 Fps. Especially the JPEG-based methods achieve very good compression
results, however, at the cost of slightly visible block artifacts (JPEG quality⁸
set to 75 in all benchmarks). The lossless methods consume more bandwidth, but can
achieve similar frame rates for this resolution. The CUDA-based algorithms slightly
outperform their non-parallel equivalents, but initialization overhead consumes most
of the parallel speed-up for small resolutions like this.
Figure 6.11 shows the benchmarking
results for the rotating teapot application in 1024x768 resolution. Again, both
uncompressed methods and the run length encoding algorithm are already limited by
the available bandwidth. Additionally, all difference-based algorithms are limited by
the bandwidth. However, it is visible that the simple RLE algorithm achieves a better
compression and thereby results in a higher frame rate for the available bandwidth.
The JPEG-based methods are again limited by the processing power and reach values
between 21 and 29 Fps, where the CUDA-based algorithms perform about 20% better
than the CPU-based methods. Again, bandwidth consumption is very low for all
JPEG methods (around 1 MB/s). Overall, the best result is achieved by the simple RLE
method (35.93 Fps) for this scenario.
⁸ The quality value determines the compression rate of the JPEG algorithm. It is incorporated into the
values of the quantization tables, which influence the number of zeros in the high frequencies. If a
low quality is set, these values are higher and eliminate more coefficients than for a higher quality
value. 75 represents a good compression/quality ratio.
[Bar chart "Teapot 1024 x 768": mean frames per second (left axis) and bandwidth in MB/s (right axis) per compression method. Mean FPS labels in plotted order: uncompressed 3.731; RLE 35.93; Difference 17.518; DIC 29.028; CUDA DIC 29.344; CUDA DIC & decomp 28.379; JPEG 21.419; CUDA JPEG 28.35; CUDA JPEG & decomp 27.047; VirtualGL RGB 4.964; VirtualGL JPEG 22.616.]
Figure 6.11: Rotating Teapot in 1024 x 768 Pixels
Figure 6.12 shows the benchmarking results for the rotating teapot application in
1680x1050 resolution. At this resolution only the uncompressed as well as the RLE
and the CUDA-based Difference with Index algorithm are limited by the bandwidth.
Thus, the CUDA-based lossless methods slightly outperform the CPU-based methods
with more potential for higher available bandwidths. The RLE method outperforms
all other methods again, because of its great combination of speed and compression
ratio for this sample application. All JPEG methods are again clearly limited by
processing power and the CUDA-based JPEG implementation outperforms the
CPU-based version by roughly 30%.
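Why the uncompressed transmissions are bandwidth-limited can be checked with a short calculation. The sketch below assumes that 1 MB equals 10^6 bytes, that Invire sends 4 bytes per raw pixel (the raw size named in section 6.5.6) and that VirtualGL RGB sends 3 bytes per pixel; the last point is an assumption derived from the name of the method. The function name is hypothetical.

```python
def bandwidth_limited_fps(width, height, bytes_per_pixel, bandwidth_mb_s=11.8):
    """Upper bound on the frame rate when every frame is sent uncompressed."""
    frame_bytes = width * height * bytes_per_pixel
    return bandwidth_mb_s * 1_000_000 / frame_bytes


for width, height in [(640, 480), (1024, 768)]:
    four_bytes = bandwidth_limited_fps(width, height, 4)   # assumed Invire raw format
    three_bytes = bandwidth_limited_fps(width, height, 3)  # assumed VirtualGL RGB format
    print(f"{width}x{height}: {four_bytes:.2f} Fps at 4 B/px, "
          f"{three_bytes:.2f} Fps at 3 B/px")
```

For 640x480 this gives bounds of roughly 9.6 and 12.8 Fps, which is close to the 9.565 and 12.763 Fps labelled for the two uncompressed methods in Figure 6.10.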
Rotating teapot - Conclusion
[Bar chart "Teapot 1680 x 1050": mean frames per second (left axis) and bandwidth in MB/s (right axis) per compression method. Mean FPS labels in plotted order: uncompressed 1.799; RLE 25.056; Difference 9.165; DIC 13.928; CUDA DIC 15.832; CUDA DIC & decomp 14.53; JPEG 9.894; CUDA JPEG 13.73; CUDA JPEG & decomp 14.297; VirtualGL RGB 2.031; VirtualGL JPEG 9.499.]
Figure 6.12: Rotating Teapot in 1680 x 1050 Pixels
The rotating teapot application is a special case in terms of remote rendering. Since it
has large unicolored areas (the whole background, regions of the teapot) and very predictable
and homogeneous object movements, it is well suited for lossless compression
methods. Especially the very simple run length encoding performs very well in all
evaluated resolutions since it can achieve great compression results with just one pass
over the input data. JPEG-based methods achieve great compression results, but only
at the cost of worse performance and lossy image compression. The CUDA-enhanced
algorithms improve the performance for higher resolutions (1024x768 or higher) by
20-30% with a rising tendency. The rotating teapot scenario was chosen because es-
pecially CAD and many scientific visualization applications have similar displaying
patterns. Therefore it is very promising to use simple lossless compression methods
for the remote rendering of such applications.
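The reason why run length encoding copes so well with large uniformly colored areas can be illustrated with a toy encoder. This is a minimal sketch and not the RLE format used by Invire; the (count, value) layout and the run limit of 255, which corresponds to a one-byte counter, are assumptions.

```python
def rle_encode(pixels, max_run=255):
    """Encode a pixel sequence as (count, value) runs in a single pass."""
    runs = []
    for value in pixels:
        if runs and runs[-1][1] == value and runs[-1][0] < max_run:
            runs[-1][0] += 1           # extend the current run
        else:
            runs.append([1, value])    # start a new run
    return [(count, value) for count, value in runs]


def rle_decode(runs):
    """Expand (count, value) runs back into the original pixel sequence."""
    pixels = []
    for count, value in runs:
        pixels.extend([value] * count)
    return pixels


# Made-up scanline: black background, a short gray object, black background.
scanline = [0] * 300 + [128] * 20 + [0] * 300
encoded = rle_encode(scanline)
print(len(scanline), "pixels ->", len(encoded), "runs")

# Made-up worst case: no two neighboring pixels are equal.
noisy = list(range(100))
print(len(noisy), "pixels ->", len(rle_encode(noisy)), "runs")
```

In the second case every pixel becomes its own run, so the encoded data is larger than the input. This matches the observation that such methods only pay off for visually homogeneous frames.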
6.5.4.2 Virtual Night Driver
Figure 6.13 shows the benchmarking results for the VND application in 640x480 resolution.
It becomes very clear that, like the uncompressed transmission, the
lossless compression methods are also limited by the available bandwidth. This is a result
of the very heterogeneous color distribution of the application. All lossless compression
methods in Invire rely on similar pixel values either in the proximity or in the
following frame. However, if, for example, a car moves through a textured scene, neither
of these conditions holds. Thus, the compressed frames are likely to be only slightly
smaller or even bigger than the original frames. This is why the lossless compression
methods were not considered in the benchmarks of the higher resolutions for the VND
application. Besides this, the JPEG-based methods result in frame rates between 29
and 37 FPS. For Invire, the CUDA-based JPEG methods perform around 20% better
than the CPU-based ones. In comparison to VirtualGL’s JPEG implementation there is no
significant performance increase at this resolution. Compression rates are better for
Invire’s JPEG algorithms (1 MB/s vs. 2.5 MB/s in bandwidth usage).
[Bar chart "VND 640 x 480": mean frames per second (left axis) and bandwidth in MB/s (right axis) per compression method. Mean FPS labels in plotted order: uncompressed 9.559; RLE 10.899; Difference 9.095; DIC 11.166; CUDA DIC 11.058; CUDA DIC & decomp 11.093; JPEG 29.228; CUDA JPEG 35.056; CUDA JPEG & decomp 37.033; VirtualGL RGB 12.757; VirtualGL JPEG 35.235.]
Figure 6.13: Virtual Night Drive in 640 x 480 Pixels.
Figure 6.14a) shows the benchmarking results for the VND application in 1024x768 resolution. Only
JPEG compression methods are considered, and the CPU-based methods of Invire and
VirtualGL achieve about the same results (17 Fps). Now the CUDA-based methods of
Invire outperform the CPU-based one by roughly 30%. The compression is still better
for the Invire implementation. Figure 6.14b) shows the benchmarking results for the
VND application in 1680x1050 resolution. For this high resolution the gap between
CPU-based and CUDA-based methods widens further. Now the GPU-accelerated methods are
about 60% faster than the CPU-based methods. Bandwidth usage is nearly identical for
all JPEG methods.
Virtual Night Driver - Conclusion
Applications like the VND, which make extensive use of textures, lighting and 3D
models, for example to simulate a very realistic surrounding, are not well suited for the
lossless compression methods. There is simply too much heterogeneity in the frames
for the coding algorithms to efficiently compress differences or runs of
uniform pixels. Thus, the JPEG algorithms prove to work a lot better since the JPEG
standard was initially designed to compress such images. Especially the CUDA-based
implementation shows good results, mainly for high resolutions. The bandwidth limitation
of the 100 Mbit network is far from being reached, and thus there might be even more
potential for optimized CUDA-based approaches.
[Bar charts: mean frames per second (left axis) and bandwidth in MB/s (right axis) per compression method. Mean FPS labels in plotted order, a) VND 1024x768: uncompressed 3.681; JPEG 17.062; CUDA JPEG 23.249; CUDA JPEG & decomp 23.987; VirtualGL RGB 4.894; VirtualGL JPEG 16.259. b) VND 1680x1050: uncompressed 1.749; JPEG 8.597; CUDA JPEG 13.259; CUDA JPEG & decomp 12.632; VirtualGL RGB 2.042; VirtualGL JPEG 8.099.]
Figure 6.14: a) Virtual Night Drive in 1024 x 768 Pixels and b) Virtual Night Drive in
1680 x 1050 Pixels.
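The failure of difference-based coding for heterogeneous frames can be made concrete with a toy coder. The sketch below is not the Difference with Index algorithm of Invire; it assumes a hypothetical encoding that stores an (index, new value) pair for every pixel that changed between two consecutive frames.

```python
def diff_with_index(previous, current):
    """Return (index, new_value) pairs for all pixels that changed."""
    if len(previous) != len(current):
        raise ValueError("frames must have the same size")
    return [(i, c) for i, (p, c) in enumerate(zip(previous, current)) if p != c]


def apply_diff(previous, changes):
    """Reconstruct the current frame from the previous frame and the changes."""
    frame = list(previous)
    for index, value in changes:
        frame[index] = value
    return frame


# Nearly static scene (made up): one pixel changes between the frames.
static_prev = [0] * 1000
static_cur = list(static_prev)
static_cur[500] = 255

# Highly dynamic scene (made up): every pixel changes between the frames.
dynamic_prev = [i % 256 for i in range(1000)]
dynamic_cur = [(v + 1) % 256 for v in dynamic_prev]

for name, prev, cur in [("static", static_prev, static_cur),
                        ("dynamic", dynamic_prev, dynamic_cur)]:
    changes = diff_with_index(prev, cur)
    print(f"{name}: {len(cur)} pixels, {2 * len(changes)} stored numbers")
```

For the dynamic scene the coder stores two numbers per pixel and thus produces more data than the raw frame, which corresponds to the behavior reported for the VND application.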
6.5.5 Benchmarking Three Different JPEG Implementations
The results of the benchmarking of the overall system performance show that the JPEG-based
compression algorithms are likely to be the most universal choice for different
kinds of applications. To get a deeper insight into the raw performance of the different
JPEG implementations of Invire, the following benchmarks compare the pure compression
times of the libjpeg [29], CUDA-based JPEG (see section 6.2.4.5) and turboJPEG
[80] implementations. All three implementations were tested in three different resolutions
with the two sample applications (VND and rotating teapot) described above.
For each measurement 1000 samples were taken, and the mean of these samples as well
as the minimum and maximum of 98% of all measurements are displayed in the diagrams.
For the CUDA-based method the overall mean time is composed of the mean
CPU and GPU compression times. All times are measured in milliseconds.
6.5.5.1 JPEG - Compression Times
Figure 6.15 depicts the comparison of the three JPEG implementations for both applications
in 640 x 480 pixels. The libjpeg implementation takes the most time for
compression (about 14 ms for teapot and 18 ms for VND), the CUDA-based version
is about 60% faster (9 ms teapot and 10 ms VND) and turboJPEG outperforms CUDA-based
JPEG by a factor of 2 (4 ms teapot and 5 ms VND). For CUDA-based compression
an interesting shift between GPU and CPU times for the applications can be observed:
The grayscale teapot application produces DCT coefficients that are much easier to
compress since most of the color values are 0; thus the Huffman coding, which is performed
on the CPU, executes faster for this application than for the more elaborate
compression of the coefficients of the VND application. Therefore, the time needed
for GPU calculation is nearly constant for both applications, but the time spent on the
CPU varies heavily. The time for GPU computation is still higher than the total time
for turboJPEG compression. Theoretically achievable frame rates are about 70/56 Fps
(Teapot/VND) for libjpeg, 115/95 Fps for CUDA-based JPEG and 280/221 Fps for
turboJPEG.
[Bar charts of mean compression time in ms per implementation; the CUDA JPEG bar is split into a GPU and a CPU part. a) Teapot 640 x 480: JPEG 14.433; CUDA JPEG 4.795 (GPU) + 3.909 (CPU); TurboJPEG 3.570. b) VND 640 x 480: JPEG 17.843; CUDA JPEG 4.863 (GPU) + 5.671 (CPU); TurboJPEG 4.520.]
Figure 6.15: Comparison of the JPEG implementations for Teapot (a) and Virtual Night
Drive (b) applications in 640 x 480 Pixels.
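The theoretically achievable frame rates follow directly from the mean compression times, since a compressor that needs t milliseconds per frame can deliver at most 1000 / t frames per second. The sketch below uses the bar labels of Figure 6.15; the assignment of the labels to the implementations and the summation of the two CUDA parts are read from the chart and should be treated as assumptions.

```python
def max_fps(compression_ms):
    """Frame rate bound if compression were the only cost per frame."""
    return 1000.0 / compression_ms


# Mean compression times in ms at 640 x 480 as (teapot, VND).
times_ms = {
    "libjpeg": (14.433, 17.843),
    "CUDA JPEG": (4.795 + 3.909, 4.863 + 5.671),  # two partial times summed
    "turboJPEG": (3.570, 4.520),
}

for name, (teapot_ms, vnd_ms) in times_ms.items():
    print(f"{name}: {max_fps(teapot_ms):.0f} / {max_fps(vnd_ms):.0f} Fps")
```

These bounds ignore grabbing, transmission and decompression, so the frame rates measured for the overall system in section 6.5.4 are necessarily lower.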
Figure 6.16 depicts the comparison of the three JPEG implementations for both applications
in 1024 x 768 pixels. The overall result is quite similar to the lower resolution:
The libjpeg implementation is now outperformed by a factor of 2 by CUDA-based JPEG
and a factor of 4 by turboJPEG. Again, CPU computation time for CUDA-based JPEG is
lower than GPU time for the teapot application and higher for the VND application.
The time for GPU computation is now lower than the total time for turboJPEG compression.
Theoretically achievable frame rates are about 28/22 Fps (Teapot/VND) for
libjpeg, 50/43 Fps for CUDA-based JPEG and 108/84 Fps for turboJPEG.
Figure 6.17 depicts the comparison of the three JPEG implementations for both applications
in 1680 x 1050 pixels. For this high resolution, a similar performance ratio
as for the 1024x768 resolution (1:2:4) can be observed. However, libjpeg has very high
variations in the VND application. This might be caused by memory or cache limitations
for the high number of colored pixels (1.76 MPixels) to process. Theoretically
achievable frame rates are about 14/9 Fps (Teapot/VND) for libjpeg, 25/21 Fps for
CUDA-based JPEG and 57/42 Fps for turboJPEG. Thus, the libjpeg implementation
already is the limiting factor for reaching more than 20 Fps with this resolution.
[Bar charts of mean compression time in ms per implementation; the CUDA JPEG bar is split into a GPU and a CPU part. a) Teapot 1024 x 768: JPEG 35.741; CUDA JPEG 10.855 (GPU) + 9.344 (CPU); TurboJPEG 9.263. b) VND 1024 x 768: JPEG 44.848; CUDA JPEG 10.553 (GPU) + 12.718 (CPU); TurboJPEG 11.959.]
Figure 6.16: Comparison of the JPEG implementations for Teapot (a) and Virtual Night
Drive (b) applications in 1024 x 768 Pixels.
JPEG - Conclusion
The comparison of the three JPEG variants shows that the highly optimized turboJPEG
implementation is currently the fastest way to compress JPEG-conformant images. It uses
the Intel Performance Primitives library which "... is an extensive library of multi-core-ready,
highly optimized software functions for multimedia data processing, and
communications applications..." [80]. Thus it can fully harvest the power of the dual-core
processor built into the testing machine. The CUDA-based JPEG implementation introduced
in this thesis, however, partially relies on the unoptimized code of the libjpeg
implementation. Especially the sequential Huffman encoding which is used by this
implementation seems to be limiting the performance for high resolutions and multicolored
frames. The time spent on the GPU for doing the compute-intensive steps
(color conversion, downsampling, DCT and quantization) is less than the overall time
used by the fast turboJPEG implementation for resolutions of 1024x768 and higher.
This is a promising starting point for further optimizing the CUDA-based implementation and
designing a CUDA-supported Huffman encoding in order to outperform the highly optimized
turboJPEG implementation. In addition, the GPU-based methods strongly relieve the CPU
and leave it available for other tasks. The GPU-based compression can either be done
on the same card that is responsible for rendering, or on a second graphics card in the
visualization node in order to share the computation and rendering load. The libjpeg
implementation is by far the slowest but most feature-rich and universal implementation.
It also offers extensive documentation and greatly helps to understand the
details of the JPEG standard. Nevertheless, it is not well suited for the application in
remote visualization of IS/VR since it limits the overall system performance for high
resolutions by its slow compression times.
[Bar charts of mean compression time in ms per implementation; the CUDA JPEG bar is split into a GPU and a CPU part. a) Teapot 1680 x 1050: JPEG 72.949; CUDA JPEG 21.477 (GPU) + 17.742 (CPU); TurboJPEG 17.408. b) VND 1680 x 1050: JPEG 106.788; CUDA JPEG 20.953 (GPU) + 25.913 (CPU); TurboJPEG 23.688.]
Figure 6.17: Comparison of the JPEG implementations for Teapot (a) and Virtual Night
Drive (b) applications in 1680 x 1050 Pixels.
6.5.6 Quality Assessment of the Lossy Compression Methods
As described before, besides achieving reasonable performance, it is extremely important
for the practical usage of RV to ensure a certain level of quality of the lossily
compressed images. Thus, the following section describes quality assessments of the
three lossy (JPEG) compression methods, each conducted with the two sample applications.
All JPEG implementations were benchmarked with different quality settings
q⁹ (q = 25, q = 75 and q = 100), and the SSIM index as well as the compression ratio
in bytes per pixel were recorded.
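How a quality setting q translates into quantization table entries can be sketched with the scaling rule known from the IJG libjpeg (its `jpeg_quality_scaling` routine). Whether the CUDA-based and turboJPEG implementations apply exactly the same mapping is an assumption here. The base values are the first row of the example luminance table from Annex K of the JPEG standard; the Python function names are hypothetical.

```python
# First row of the example luminance quantization table (JPEG standard, Annex K).
BASE_LUMA_ROW = [16, 11, 10, 16, 24, 40, 51, 61]


def quality_to_scale(quality):
    """Map a quality setting (1..100) to a percentage scale factor."""
    quality = max(1, min(100, quality))
    if quality < 50:
        return 5000 // quality
    return 200 - 2 * quality


def scale_table(base, quality):
    """Scale base quantization values and clamp them to the range 1..255."""
    scale = quality_to_scale(quality)
    return [max(1, min(255, (value * scale + 50) // 100)) for value in base]


for q in (25, 75, 100):
    print(q, scale_table(BASE_LUMA_ROW, q))
```

Larger table entries quantize more DCT coefficients to zero, which explains why a low q leads to smaller frames and more visible artifacts.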
Figure 6.18 shows those results in two diagrams for the teapot application. First
of all, it is visible that all three implementations produce nearly identical SSIM indices
and compression ratios for equal JPEG quality settings. This is to be expected since all three
versions implement the same basic algorithms (variations only in parallelization and
hardware acceleration). Small deviations may result from error corrections or rounding
differences between the utilized APIs / compilers (GNU/IPP/CUDA) and hardware
(CPU/GPU). For the teapot application it is observable that even for very low JPEG
quality settings (q = 25) the SSIM index is around 95% while achieving a great compression
ratio (0.03 bytes per pixel in contrast to 4 bytes per pixel of the raw image).
Higher quality settings result in even higher SSIM indices; the best trade-off between SSIM index
and compression ratio is achieved for q = 75, as shown in the charts.
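The compression ratio in bytes per pixel can be translated into frame sizes and bandwidth demands with simple arithmetic. The sketch below uses the 0.03 bytes per pixel and the raw size of 4 bytes per pixel named above; the frame rate of 28 Fps is a hypothetical value chosen for illustration, and the function names are assumptions.

```python
def frame_size_bytes(width, height, bytes_per_pixel):
    """Size of one frame for a given average number of bytes per pixel."""
    return width * height * bytes_per_pixel


def compression_factor(compressed_bpp, raw_bpp=4.0):
    """How many times smaller the compressed frame is than the raw frame."""
    return raw_bpp / compressed_bpp


def required_bandwidth_mb_s(width, height, bytes_per_pixel, fps):
    """Bandwidth in MB/s (1 MB = 10^6 bytes) needed to stream at `fps`."""
    return frame_size_bytes(width, height, bytes_per_pixel) * fps / 1_000_000


width, height, bpp, fps = 1024, 768, 0.03, 28  # fps is a hypothetical value
print(f"frame size : {frame_size_bytes(width, height, bpp) / 1000:.1f} kB")
print(f"factor     : {compression_factor(bpp):.0f}x smaller than raw")
print(f"bandwidth  : {required_bandwidth_mb_s(width, height, bpp, fps):.2f} MB/s")
```

Under these assumptions the stream stays below 1 MB/s, which is in line with the low bandwidth consumption reported for the JPEG methods in the teapot benchmarks.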
⁹ These settings directly influence compression quality and ratio by altering the quantization table and
thereby determining the number of DCT coefficients that are eliminated. q = 0: highest compression
with strongly visible artifacts, q = 100: lowest compression, best quality, worst compression ratio.
[Bar charts for the Teapot application in 1024x768; bar labels per JPEG quality setting:
a) SSIM index in %               JPEG     CUDA_JPEG   TURBO_JPEG
   q = 25                        95.48    95.48       95.49
   q = 75                        98.04    98.11       98.08
   q = 100                       99.8     99.89       99.91
b) Compression ratio in bytes per pixel
   q = 25                        0.032    0.033       0.033
   q = 75                        0.056    0.055       0.058
   q = 100                       0.168    0.164       0.163]
Figure 6.18: SSIM index (a) and compression ratio (b) of the three JPEG implementations
for the Teapot application. The color of the bars represents the selected
JPEG quality settings: Blue q = 25, Red q = 75 and White q = 100.
Figure 6.19 shows the quality and compression ratio results for the VND application.
They are analogous to those for the teapot application. However, it is observable that the
lowest JPEG quality setting (q = 25) produces frames with an SSIM index below 90%.
That means that there are visible distortions in the frames that might disturb remote
users of the application. For higher values of q the SSIM index improves and reaches about
92% for q = 75. However, the compression ratio (in bytes per pixel) also rises by a factor
of about 4 between q = 25 and q = 75. Setting q = 100 further improves the SSIM index a bit, but this
improvement is hardly noticeable to the human eye.
[Bar charts for the VND application in 1024x768; bar labels per JPEG quality setting:
a) SSIM index in %               JPEG     CUDA_JPEG   TURBO_JPEG
   q = 25                        84.88    84.73       84.32
   q = 75                        92.43    92.33       91.27
   q = 100                       95.28    95.34       95.19
b) Compression ratio in bytes per pixel
   q = 25                        0.117    0.115       0.114
   q = 75                        0.531    0.518       0.523
   q = 100                       0.679    0.684       0.677]
Figure 6.19: SSIM index (a) and compression ratio (b) of the three JPEG implementations
for the VND application. The color of the bars represents the selected
JPEG quality settings: Blue q = 25, Red q = 75 and White q = 100.
The main finding of the quality measurement is that all three JPEG implementations
produce very high quality compressed images, even with low JPEG quality settings.
However, to ensure that the resulting frames permanently have SSIM indices above
90%, q needs to be set to 75 or higher. It is planned to extend the Invire system by an
automatic quality assessment component, which dynamically measures the quality of
the compressed frames and adapts q to the results of this measurement. This helps
either to save bandwidth or to provide the users with optimal quality.
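One conceivable control rule for such a quality assessment component is sketched below. Since the component is only planned, everything here is hypothetical: the function name, the step size, the margin and the bounds for q are illustrative assumptions, and only the target SSIM of 90% is taken from the text.

```python
def adapt_quality(q, measured_ssim, target=0.90, margin=0.03,
                  step=5, q_min=25, q_max=100):
    """Return the JPEG quality for the next frame.

    q is raised while the measured SSIM is below the target and lowered
    when the SSIM exceeds the target by more than `margin`, in order to
    save bandwidth. All parameters except `target` are assumptions.
    """
    if measured_ssim < target:
        return min(q_max, q + step)
    if measured_ssim > target + margin:
        return max(q_min, q - step)
    return q


# Made-up sequence of SSIM measurements for consecutive frames.
q = 75
for ssim in [0.85, 0.88, 0.92, 0.97, 0.98]:
    q = adapt_quality(q, ssim)
    print(f"SSIM {ssim:.2f} -> next q = {q}")
```

The margin keeps q from oscillating between two values when the measured quality is close to the target.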
6.5.7 Benchmarking Conclusion
The benchmarks described above surveyed several performance and quality aspects of
remote rendering and showed their heavy dependency on the underlying compression
algorithms. The first set of benchmarks made clear that for different applications different
compression techniques (lossy and lossless) produce very different results. Lossless
techniques, for example, are well suited to high-quality remote rendering of visually
homogeneous and slow-changing applications (e.g. CAD). However, they are not
suited for visually heterogeneous applications such as fast-changing VR applications.
Lossy methods, such as the JPEG still image compression, are deployable more universally
since they are mostly independent of the homogeneity of the input image. JPEG
achieves great compression in interactive time at the cost of a more or less visible loss of
image quality, which, however, is acceptable for most IS/VR applications (as shown in
section 6.5.6). This is why the second set of benchmarks analyzes the performance of
JPEG compression for different implementations of the JPEG standard. The compression
speed is extremely important for remote visualization since it mainly determines
the achievable overall performance for a remotely rendered application. The proposed
GPU-based algorithms perform well in comparison to existing approaches and at the
same time relieve the CPU. The other limiting factor is the available network bandwidth,
which was taken into account in the first set of benchmarks. Again, the JPEG-based
algorithms provide good compression rates, so that even for low-bandwidth networks
frames of reasonable quality can be compressed and delivered in time.
6.6 Conclusion - Remote Visualization for IS/VR
Existing systems for Remote Visualization have two major disadvantages regarding
their utilization for IS/VR applications: Either they lack the required interactivity by
taking too long for compression or they cannot provide a decent image quality for high
resolutions. Furthermore, most of them cannot handle distributed input or multiple
client connections. Thus, the previous chapter presented a framework which helps to
find solutions for those problems and allows them to be deployed in a practical environment.
The framework fulfills the requirements that were pointed out in section 4.4:
The framework was designed in a modular and open way to allow the integration
of arbitrary grabbing and compression methods.
By designing and implementing methods to access rendered frames directly on
the GPU and by integrating the GPGPU API CUDA into the framework, the
usage of the GPU for general purpose computing tasks was realized. Thereby,
new and fast methods for image compression could be developed, benchmarked
and compared to existing methods.
A plugin component allows for the comfortable integration of the RV functionality
into existing applications. Transparent integration is possible through preloaded
libraries (under certain circumstances: e.g. Linux OS, correct X and
OpenGL versions).
Server and client are prepared for distributed input and output. This is realized
through the capability of handling several TCP connections and allowing the
composition of a frame by multiple inputs.
The whole system is based on freely available software and utilizes well-known and
proven mechanisms for data transfer, frame grabbing and image/video compression
(e.g. TCP-based communication, Pixel Buffer Objects and the JPEG standard).
The prototypical implementation proved the concept and will be freely
available on the web site of the PC² [65].
Besides the design and development of the framework for RV, the main contributions
of this chapter are the parallel GPU-based image compression methods. For the first time,
they allow the powerful GPUs of the visualization nodes to be used for image compression,
and they outperform most of the standard methods. However, in some cases there are
still traditional methods that are slightly faster than the GPU-based ones but also consume
almost all of the CPU’s processing power. The methods presented in this thesis
greatly relieve the CPU and bundle all graphics-relevant functionalities on the graphics
card. The CPU only handles the transfer of the compressed images and the connection
between client(s) and server(s).
7 Conclusion
This chapter briefly summarizes the contributions of this thesis, draws a conclusion
and outlines directions for future work.
7.1 Contributions
This thesis adds the following main contributions to the state of the art in the area of
middleware for hybrid cluster systems:
New models for Computational Steering of IS/VR:
The new models, which reflect the requirements of distributed IS/VR applications,
allow for the classification and realization of arbitrary IS/VR application scenarios on
hybrid cluster systems. They form the basis for the design of a CS framework and
specify the required data types and communication mechanisms.
A flexible framework for CS of IS/VR:
It enables the dynamic coupling of arbitrary components to realize pure or hybrid im-
plementations of the extended CS models described before. The framework focuses
on the special requirements of this field of applications: interactivity and scalability.
Interactivity is achieved by the communication server’s global synchronization mech-
anisms and the methods for the elimination or punishment of producers of latency.
Scalability is provided through the Publish/Subscribe service, which allows for the
dynamic and flexible creation of communication links.
A platform for Remote Visualization on Hybrid Clusters:
The platform makes use of the specific characteristics of a hybrid cluster, i.e. its visual-
ization component, by introducing methods to use the graphics hardware for general
purpose computing tasks and by utilizing the implicit parallelism for distributed vi-
sualization scenarios. It forms the basis to develop new grabbing, composition and
compression techniques to allow for the efficient, remote usage of hybrid cluster sys-
tems for interactive applications.
New Parallel Image Compression Methods:
Both presented methods use the powerful GPUs of the cluster's visualization nodes
to provide the fast, high-quality image compression that remote visualization
requires. By parallelizing well-known image compression algorithms (Difference
Coding (lossless) and JPEG (lossy)) and adapting them to the GPU hardware,
substantial performance increases and a significant reduction of the CPU load were
achieved. In addition, moving the image compression to the GPU avoids the need to
read back large uncompressed frames through the limited host interface, which
eliminates a further source of latency. All these improvements help to achieve the
goal of providing a seamless graphical remote interface without restrictions in
latency or quality.
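To make the idea behind the lossless method more tangible, the following minimal Python sketch shows frame-to-frame difference coding on the CPU. It is not the GPU implementation developed in this thesis. The function names `encode_difference` and `decode_difference` and the representation of a frame as a flat list of pixel values are assumptions chosen purely for illustration. Only pixels that changed are stored, together with their index, which is why largely static scenes compress well.

```python
def encode_difference(previous, current):
    """Return (index, value) pairs for every pixel that differs from the previous frame."""
    if len(previous) != len(current):
        raise ValueError("frames must have the same number of pixels")
    return [
        (index, new)
        for index, (old, new) in enumerate(zip(previous, current))
        if old != new
    ]


def decode_difference(previous, changes):
    """Rebuild the current frame from the previous frame and the list of changes."""
    frame = list(previous)
    for index, value in changes:
        frame[index] = value
    return frame


# Hypothetical 8-pixel grayscale frames; only one pixel changes between them.
previous_frame = [10, 10, 10, 200, 200, 10, 10, 10]
current_frame = [10, 10, 10, 200, 210, 10, 10, 10]

changes = encode_difference(previous_frame, current_frame)
restored = decode_difference(previous_frame, changes)
```

Because the receiver applies the changes to its own copy of the previous frame, the reconstruction is exact, which is the defining property of a lossless scheme.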
7.2 Conclusion
The focus of this thesis was to enable a special class of applications (IS/VR) to utilize
the power of upcoming hybrid cluster systems. Interactive simulations and Virtual Re-
ality applications were chosen because they can greatly benefit from the computational
and graphical power of hybrid clusters as shown in the scenarios in chapter 5. After
introducing the problem and describing the foundations, including the terminology of
hybrid cluster systems and IS/VR applications, a comparison between traditional High
Performance Computing and IS/VR applications was conducted and the need for new
subsystems was pointed out. Those subsystems were derived from existing ones for
traditional HPC applications, but were thoroughly adapted to the requirements of the
new field. The most important requirements, which do not play a big role in traditional
HPC, are interactivity and (soft) real-time execution. Thus, the presented subsystems
strongly focus on techniques to fulfill these needs.
The first subsystem, computational steering of IS/VR, allows for the synchronization
and coupling of an arbitrary number of components to create variable distributed
IS/VR applications on the fly. The models that underlie this subsystem can be applied
to most existing IS/VR applications and enable new functionality (such as high-resolution
visualization, comparison of simulations, or multi-user scenarios) with very
little programming effort. Without computational steering for IS/VR, those features
could only be realized by completely redesigning the whole application and handling all
communication and synchronization tasks manually. The system is on the one hand
flexible enough to support a broad spectrum of applications and on the other hand
offers specialized functions that help to fulfill the requirements of IS/VR in terms of
interactivity and immersivity. To illustrate the potential of this approach, two very
different applications were ported, with only minimal changes to their original source
code, from static, sequential applications to flexible, distributed systems by utilizing
the models of computational steering for IS/VR.
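The dynamic coupling of components described above relies on a Publish/Subscribe service. As a rough illustration of that communication pattern only, the following Python sketch implements a minimal in-process, topic-based broker. The class `Broker`, its methods, and the topic name `vehicle/position` are hypothetical and do not reflect the API of the framework developed in this thesis, which is distributed and adds global synchronization.

```python
from collections import defaultdict


class Broker:
    """Hypothetical in-process stand-in for a publish/subscribe service."""

    def __init__(self):
        # Maps a topic name to the list of callbacks subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callback that is invoked for every message on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to all current subscribers and return their count."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = Broker()
received_by_view_a = []
received_by_view_b = []

# Two visualization components attach to the same simulation output at run time.
broker.subscribe("vehicle/position", received_by_view_a.append)
broker.subscribe("vehicle/position", received_by_view_b.append)

delivered = broker.publish("vehicle/position", (12.5, 0.0, 3.2))
```

The publisher never refers to its receivers directly, so further components can be attached without changing the code of the simulation that produces the data.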
The second subsystem, remote visualization for IS/VR, addresses another problem
that those applications face when executed on a universal cluster system:
accessibility. Since IS/VR applications depend heavily on interactivity and the visualization
of the simulated worlds, direct access to the running simulation is essential.
However, clusters often reside in highly inaccessible areas and are in many cases only
reachable over the network (which is sufficient for most traditional HPC applications).
Newer hybrid cluster systems (especially those which include visualization capabil-
ities) in many cases feature one or more sophisticated output devices (stereoscopic
display, tiled wall or a CAVE) as main sites of visualization and interaction. Those
are fixed to one location and cannot be used simultaneously by multiple users. Thus,
systems for remote visual access are needed to successfully transfer IS/VR applications
from desktop PCs to high performance clusters. The system designed and
developed in this thesis aims to optimize latency and quality as well as scalability of the
remote visual access. None of the currently existing systems for remote visualization
shares this combination of goals. Most remote access frameworks focus on remote
administration tasks, where latency and performance play a minor role and the consumed
bandwidth is the most important criterion. Systems that focus on high-quality remote
visualization of interactive applications either introduce high latencies by applying
sophisticated compression algorithms that consume a lot of computation time, or lack the
flexibility needed to allow, for example, multiple users to access the system remotely
at the same time. The system introduced in this thesis is a framework to test, evaluate and
utilize methods that help to achieve the goal of low latency, high quality and flexible
remote visualization with adequate bandwidth consumption. One promising approach
is to use the powerful graphics cards of the hybrid cluster not only for rendering but
also for compression and other computing tasks, which significantly speeds up the process
of grabbing and encoding the rendered frames. Two exemplary methods were developed
and implemented inside the presented framework. Benchmarks showed that,
even in a prototypical stage, the methods (especially GPU-based JPEG compression)
could compete with, and in some cases even outperform, the standard methods used in
this field.
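A short calculation illustrates why compression is unavoidable for remote visualization. The numbers used below are assumptions chosen for illustration (UXGA resolution of 1600 x 1200 pixels, 3 bytes per RGB pixel, 25 frames per second, and a 100 Mbit/s link); they are not measurements from this thesis, and the function names are hypothetical.

```python
def raw_bandwidth_mbit(width, height, bytes_per_pixel, fps):
    """Bandwidth of an uncompressed frame stream in Mbit/s (1 Mbit = 10**6 bit)."""
    bits_per_frame = width * height * bytes_per_pixel * 8
    return bits_per_frame * fps / 1e6


def required_compression_ratio(raw_mbit, link_mbit):
    """Factor by which the stream must shrink to fit into the given link."""
    return raw_mbit / link_mbit


# Assumed example: UXGA frames, RGB, 25 frames per second, 100 Mbit/s link.
raw = raw_bandwidth_mbit(1600, 1200, 3, 25)
ratio = required_compression_ratio(raw, 100.0)
print(f"uncompressed stream: {raw:.0f} Mbit/s, required ratio: {ratio:.2f}:1")
```

Under these assumptions the uncompressed stream needs 1152 Mbit/s, which exceeds even a 1 Gbit/s link, so the frames have to be compressed by more than a factor of eleven before they fit into a 100 Mbit/s connection.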
All in all, the two developed systems form a basis to successfully deploy IS/VR
applications on hybrid clusters. As pointed out by Pfister and in the introduction
of this thesis, powerful hardware is currently at hand, but the problem is that
it is still too complicated and complex. Thus, the conceptual design as well as the
development of middleware and subsystems that help the developers and users to
harvest the raw computational and graphical power of such systems are vital. This not
only holds for software for clusters but for most of the software that will be developed
in the near future. The paradigm shift from sequential to parallel and distributed
execution confronts developers with the same tasks as in massively parallel software
development: load balancing, communication, consistency, etc. These issues can only
be handled efficiently with the right tools (subsystems/middleware) at hand.
7.3 Outlook
The subsystems presented in this thesis can be seen as a starting point for further
developments. They allow the generic execution of arbitrary IS/VR applications on
hybrid clusters for the first time. However, there is still work to be done before
all potential users can make practical use of the powerful features of this combination.
First, both subsystems need to be permanently deployed on the cluster system and
integrated into the cluster management software, so that reservation and planning of usage
are possible. The prototypes need to be turned into stable programs, which also take
care of e.g. security and authorization issues. In particular, the simultaneous execution of
various applications on both systems needs to be tested and safeguarded. In addition,
universal interfaces for the components connected through computational steering are
conceivable. They would make it possible to couple components of (nearly) arbitrary
applications in the same field and could ease the integration of new components. Finally,
further compression methods for remote visualization need to be integrated and tested
through the Invire framework to provide a broad selection of techniques for all possible
applications.
Through this thesis and the work done within the scope of the VisSim project¹, new
topics and tasks have evolved. The systems presented above form the
basis for several bachelor's and master's theses. One thesis, for example, deals with the
development of a generic benchmark for the computational steering framework in
order to evaluate the scalability of the system and its potential for further areas of
application. In addition, several planned projects will make use of
the hybrid cluster systems through the developed subsystems. They range from
soccer-playing robots to medical image reconstruction. Finally, the software implementations
of both systems will be released as alpha versions under the GPL after this thesis
has been finished.
¹ A project funded by the German government to do cooperative research in the field of distributed
visualization and simulation.
Bibliography
[1] Advanced Micro Devices. Homepage of AMD/ATI. Website: http://ati.amd.
com, 2008.
[2] ANSYS. FLUENT - CFD Flow Modeling Software and Solutions. Website: http:
//www.fluent.com, 2008.
[3] Yukihiro Arai, Takeshi Agui, and Masayuki Nakajima. A Fast DCT-SQ Scheme
for Images. Transactions of IEICE, E71(11):1095–1097, 1988.
[4] Moshe Bar. The openMosix Project. Website: http://openmosix.sourceforge.
net, 2008.
[5] Jochen Bauch, Rafael Radkowski, and Henning Zabel. An Explorative Approach
to the Virtual Prototyping of Self-optimizing Mechatronic Systems. In Proceedings
of ProSTEP iViP Science Days 2005 - Cross Domain Engineering, Darmstadt, 2005.
[6] Heiko Bauke and Stephan Mertens. Cluster Computing: Praktische Einführung in das
Hochleistungsrechnen auf Linux-Clustern (german). Springer-Verlag New York, Inc.,
Secaucus, NJ, USA, 2005.
[7] Jan Berssenbrügge. Virtual Nightdrive Ein Verfahren zur Darstellung der komplexen
Lichtverteilungen moderner Scheinwerfersysteme im Rahmen einer virtuellen Nachtfahrt
(german). PhD thesis, Universität Paderborn, December 2005.
[8] A. Bierbaum, C. Just, P. Hartling, K. Meinert, A. Baker, and C. Cruz-Neira. VR
Juggler: A Virtual Platform for VR Application Development. In VR01: Proceedings
of the IEEE Virtual Reality conference, pages 89–96. IEEE, 2001.
[9] Stephan Blazy, Odej Kao, and Oliver Marquardt. padfem2 - An Efficient, Com-
fortable Framework for Massively Parallel FEM-Applications. In Recent Advances
in Parallel Virtual Machine and Message Passing Interface, pages 681–685. Springer,
2003.
[10] James F. Blinn. Simulation of wrinkled surfaces. SIGGRAPH Computer Graphics,
12(3):286–292, 1978.
[11] Maxime Crochemore and Wojciech Rytter. Jewels of stringology. World
Scientific Publishing Co. Inc., River Edge, NJ, 2003.
[12] Wilhelm Dangelmaier, Daniel Huber, Christoph Laroque, Mark Aufenanger,
Matthias Fischer, Jens Krokowski, and Michael Kortenjahn. d3FACT insight goes
parallel - Aggregation of multiple simulations. In SimVis 07: Proceedings of the
17th Simulation and Visualization Conference, pages 79–88. SCS European Publishing
House, 2006.
[13] Jauvane C. de Oliveira, Shervin Shirmohammadi, and Nicolas D. Georganas. Col-
laborative Virtual Environment Standards: A Performance Evaluation. In DIS-
RT99: Proceedings of the 3rd International Workshop on Distributed Interactive Simula-
tion and Real-Time Applications, page 14, Washington, DC, USA, 1999. IEEE Com-
puter Society.
[14] Declan Delaney, Tomás Ward, and Seamus McLoone. On consistency and network
latency in distributed interactive applications: a survey part I. Presence:
Teleoperators and Virtual Environments, 15(2):218–234, 2006.
[15] Chrilly Donninger and Ulf Lorenz. The Chess Monster Hydra. In FPL04: Proceed-
ings of the 14th International Conference on Field Programmable Logic and Application,
pages 927–932. Springer, 2004.
[16] M. P. Eckert and A. P. Bradley. Perceptual quality metrics applied to still image
compression. Signal Processing, 70:177–200, 1998.
[17] Martin Eikermann. A Flexible Framework for Computational Steering of Distributed
High Performance Systems. Master’s thesis, Universität Paderborn,
September 2006.
[18] J. Joseph Brann et al. IEEE standard for distributed interactive simulation - appli-
cation protocols. IEEE Standard 1278.1-1995, 26 March 1996.
[19] Konrad Etschberger. Controller Area Network. IXXAT Automation GmbH, August
2001.
[20] X.Org Foundation. The X.Org project. Website: http://www.x.org/wiki, 2008.
[21] Freedesktop.org. Introduction to D-Bus. http://www.freedesktop.org/wiki/
IntroductionToDBus, 2008.
[22] Juergen Gausemeier, Jan Berssenbruegge, and Jochen Bauch. A Virtual Reality-
based Night Drive Simulator for the Evaluation of a Predictive Advanced Front
Lighting System. In ASME CIE06: Proceedings of the ASME 2006 International De-
sign Engineering Technical Conference and Computers and Information in Engineering
Conference. ASME, 2006.
[23] Gaussian Inc. The Official Gaussian Website. http://www.gaussian.com, 2008.
[24] GPGPU. General-Purpose Computation Using Graphics Hardware. Website:
http://www.gpgpu.org, 2008.
[25] M. Harris. Parallel Prefix Sum (Scan) with CUDA. Website: http:
//developer.download.nvidia.com/compute/cuda/sdk/website/projects/
scan/doc/scan.pdf, 2007.
[26] D. Horn. GPU Gems 2: Programming Techniques for High-Performance Graphics and
General-Purpose Computation, chapter Stream Reduction Operations for GPGPU
Applications, pages 573–583. Addison-Wesley Professional, 2005.
[27] Paul G. Howard and Jeffrey Scott Vitter. Parallel lossless image compression using
Huffman and arithmetic coding. Information Processing Letters, 59(2):65–73, 1996.
[28] Greg Humphreys, Mike Houston, Ren Ng, Randall Frank, Sean Ahern, Peter D.
Kirchner, and James T. Klosowski. Chromium: A Stream-Processing Framework
for Interactive Rendering on Clusters. ACM Transactions on Graphics, 21(3):693–702,
2002.
[29] Independent JPEG Group. libjpeg (Open Source JPEG library). Website: http:
//www.ijg.org, 2008.
[30] Intel. Intel(R) Integrated Performance Primitives 5.3. Website: http://www.intel.
com/cd/software/products/asmo-na/eng/302910.htm, 2008.
[31] Jeff Juliano and Jeremy Sandmel. OpenGL Extension - Frame Buffer Objects
and Pixel Buffer Objects. Website: http://oss.sgi.com/projects/ogl-sample/
registry/EXT/framebuffer_object.txt, 2005.
[32] Roy Kalawsky. The Science of Virtual Reality and Virtual Environments. Addison-
Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1993.
[33] Jonathan Kaye and David Castillo. Interactive Simulation Newsletter Vol. 1, No. 1
(April, 2003). http://www.flashsim.com/newsletter/v1n1.html, 2003.
[34] T. Kesavadas and Abhishek Sudhir. Computational Steering in Simulation of Man-
ufacturing Systems. In ICRA00: Proceedings of the IEEE International Conference on
Robotics and Automation, pages 2654–2658, 2000.
[35] Khronos Group. The OpenGL Standard. Website: http://www.opengl.org/, 2008.
[36] James Arthur Kohl, Philip M. Papadopoulos, and G. A. Geist II. CUMULVS:
Collaborative Infrastructure for Developing Distributed Simulations. In PPSC97:
Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing.
SIAM, 1997.
[37] David Koller, Michael Turitzin, Marc Levoy, Marco Tarini, Giuseppe Croccia, Paolo
Cignoni, and Roberto Scopigno. Protected interactive 3D graphics via remote
rendering. In SIGGRAPH ’04: Proceedings of the ACM SIGGRAPH conference, pages
695–703, New York, NY, USA, 2004. ACM.
[38] Khronos Group. GLSL - The OpenGL Shading Language. Website: http://www.
opengl.org/documentation/glsl, 2008.
[39] Christoph Laroque. Ein mehrbenutzerfähiges Werkzeug zur Modellierung und
richtungsoffenen Simulation von wahlweise objekt- und funktionsorientiert gegliederten
Fertigungssystemen (german). PhD thesis, Universität Paderborn, June 2007.
[40] Averill M. Law and W. David Kelton. Simulation Modeling and Analysis. McGraw
Hill Higher Education, 2000.
[41] Paul Hermann Lensing. GPU-basierte, verlustbehaftete Bildkompression für
Remote Rendering (german). Master’s thesis, Universität Paderborn, March 2008.
[42] Stefan Lietsch and Jan Berssenbruegge. Parallel, Shader-Based Visualization of
Automotive Headlights. In EGPGV07: Proceedings of the 7th Eurographics Sympo-
sium on Parallel Graphics and Visualization. ACM, 2007.
[43] Stefan Lietsch and Paul Hermann Lensing. CUDA-based, parallel JPEG Compres-
sion for Remote Rendering. In ISIVC08: Proceedings of the 4th International Sympo-
sium on Image/Video Communications over fixed and mobile networks. IEEE, 2008.
[44] Stefan Lietsch and Oliver Marquardt. A CUDA-Supported Approach to Remote
Rendering. In ISVC07: Proceedings of the International Symposium on Visual Com-
puting, volume 4841 of Lecture Notes in Computer Science, pages 724–733. Springer,
2007.
[45] Stefan Lietsch, Henning Zabel, and Jan Berssenbruegge. Computational Steer-
ing of Interactive and Distributed Virtual Reality Applications. In ASME CIE07:
Proceedings of the 27th ASME Computers and Information in Engineering Conference.
ASME, 2007.
[46] Stefan Lietsch, Henning Zabel, Jan Berssenbruegge, Veit Wittenberg, and Martin
Eikermann. Light Simulation in a Distributed Driving Simulator. In ISVC06: Pro-
ceedings of the International Symposium on Visual Computing, volume 4291 of Lecture
Notes in Computer Science, pages 343–353. Springer, 2006.
[47] Stefan Lietsch, Henning Zabel, and Christoph Laroque. Computational Steering
Of Interactive Material Flow Simulations. In ASME CIE08: Proceedings of the 28th
ASME Computers and Information in Engineering Conference. ASME, 2008.
[48] J. L. Mannos and D. J. Sakrison. The effects of a visual fidelity criterion on the
encoding of images. IEEE Transactions on Information Theory, IT-4:525–536, 1974.
[49] Mechdyne. CAVElib - Build a better reality. Website: http://www.mechdyne.com/
integratedSolutions/software/products/CAVELib/CAVELib.htm, 2008.
[50] Message Passing Interface Forum. MPI: A message passing interface standard.
[51] Microsoft. Microsoft Windows Server 2003 Terminal Services. Web-
site: http://www.microsoft.com/windowsserver2003/technologies/
terminalservices/default.mspx, 2007.
[52] Microsoft. The Microsoft DirectX Collection. Website: http://msdn.microsoft.
com/directX, 2008.
[53] Microsoft. Windows Compute Cluster Server 2003. Website: http://www.
microsoft.com/windowsserver2003/ccs, 2008.
[54] Joan L. Mitchell, William B. Pennebaker, Chad E. Fogg, and Didier J. Legall, editors.
MPEG Video Compression Standard. Chapman & Hall, Ltd., London, UK,
1996.
[55] Gordon E. Moore. Cramming more components onto integrated circuits. Electron-
ics, 38(8):114–117, 1965.
[56] Jurriaan D. Mulder, Jarke J. van Wijk, and Robert van Liere. A survey of compu-
tational steering environments. Future Generation Computer Systems, 15(1):119–129,
1999.
[57] NVIDIA. CG - The C for Graphics Language. Website: http://developer.
nvidia.com/page/cg_main.html, 2008.
[58] NVIDIA. GeForce3 - The Infinite Effects GPU. Website: http://www.nvidia.com/
page/geforce3.html, 2008.
[59] NVIDIA. Homepage of NVIDIA. Website: http://www.nvidia.com, 2008.
[60] NVIDIA. NVIDIA CUDA - Compute Unified Device Architecture. Website: http:
//www.nvidia.com/object/cuda_home.html, 2008.
[61] NVIDIA. NVIDIA CUDA - Compute Unified Device Architecture - Programming
Guide v1.1. PDF: http://developer.download.nvidia.com/compute/cuda/1_1/
NVIDIA_CUDA_Programming_Guide_1.1.pdf, 2008.
[62] Object Management Group. Unified Modeling Language Specification, Version
1.5, March 2003.
[63] Kyoung S. Park, Yong J. Cho, Naveen K. Krishnaprasad, Chris Scharver, Michael J.
Lewis, Jason Leigh, and Andrew E. Johnson. CAVERNsoft G2: a toolkit for high
performance tele-immersive collaboration. In VRST00: Proceedings of the ACM
symposium on Virtual reality software and technology, pages 8–15, New York, NY,
USA, 2000. ACM Press.
[64] Steven G. Parker and Christopher R. Johnson. SCIRun: A Scientific Programming
Environment for Computational Steering. In SUPCOM95: Proceedings of Supercom-
puting, San Diego, CA, December 1995. ACM/IEEE.
[65] PC2. Paderborn Center for Parallel Computing. Website: http://wwwcs.
uni-paderborn.de/pc2, 2008.
[66] William B. Pennebaker and Joan L. Mitchell. JPEG Still Image Data Compression
Standard. Kluwer Academic Publishers, Norwell, MA, USA, 1992.
[67] Gregory F. Pfister. In search of clusters (2nd ed.). Prentice-Hall, Inc., Upper Saddle
River, NJ, USA, 1998.
[68] Marco Platzner, Sven Döhre, Markus Happe, Tobias Kenter, Ulf Lorenz, Tobias
Schumacher, Andre Send, and Alexander Warkentin. The GOmputer: Accelerat-
ing GO with FPGAs. In ERSA08: Proceedings of the 8th International Conference on
Engineering of Reconfigurable Systems and Algorithms, pages –, Las Vegas, Nevada,
USA, 2008. CSREA Press.
[69] Luc Renambot, Henri E. Bal, Desmond Germans, and Hans J. W. Spoelder.
CAVEStudy: An Infrastructure for Computational Steering and Measuring in Vir-
tual Reality Environments. Cluster Computing, 4(1):79–87, 2001.
[70] T. Richardson, Q. Stafford-Fraser, K. Wood, and A. Hopper. Virtual Network
Computing. IEEE Internet Computing, 2(1):33–38, 1998.
[71] D. Salomon. Data Compression: The Complete Reference, 3rd Edition. Springer, 2004.
[72] SGI. SGI OpenGL Vizserver - Visual Area Networking. Website: http://www.sgi.
com/products/software/vizserver/, 2007.
[73] Silicon Graphics, Inc. Homepage of Silicon Graphics, Inc. Website: www.sgi.com,
2008.
[74] Jens Simon. Paderborn Center for Parallel Computing - Benchmark-
ing Center. Website: http://wwwcs.uni-paderborn.de/pc2/about-us/staff/
jens-simons-pages/benchmarkingcenter.html, 2008.
[75] D. Van Der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark, and H. J.
Berendsen. GROMACS: fast, flexible, and free. Journal of Computational Chemistry,
26(16):1701–1718, December 2005.
[76] Jarke J. van Wijk, Robert van Liere, and Jurriaan D. Mulder. Bringing Computa-
tional Steering to the User. In Proceedings of the Scientific Visualization Conference
1997, pages 304–313, 1997.
[77] Jeffrey Vetter and Karsten Schwan. High Performance Computational Steering
of Physical Simulations. In IPPS97: Proceedings of the 11th International Parallel
Processing Symposium, Geneva, Switzerland, April 1997. The Institute of Electrical
and Electronics Engineers.
[78] VirtualGL. The VirtualGL Project. Website: http://www.virtualgl.org/, 2007.
[79] VirtualGL. A Study of the Performance of VirtualGL 2.1 and TurboVNC 0.4. PDF:
http://www.virtualgl.org/pmwiki/uploads/About/vglperf21.pdf, 2008.
[80] VirtualGL. TurboJPEG 1.10 - Intel IPP accelerated JPEG compression.
Website: http://sourceforge.net/project/showfiles.php?group_id=117509&package_id=166100, 2008.
[81] Z. Wang, A. Bovik, and L. Lu. Why is image quality assessment so difficult. In
ICASSP02: Proceedings of the 27th IEEE International Conference on Acoustics, Speech,
and Signal Processing, pages 3313–3316. IEEE, 2002.
[82] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image Quality Assessment:
From Error Visibility to Structural Similarity. IEEE Transactions on Image Processing,
13:600–612, 2004.
A
Acronyms
AGP Accelerated Graphics Port
APDU Application Protocol Data Unit
API Application Programming Interface
BPP Bytes Per Pixel
CAD Computer Aided Design
CFD Computational Fluid Dynamics
CG C for Graphics
CPU Central Processing Unit
CUPS Common Unix Printing System
CS Computational Steering
CSIS Computational Steering of Interactive Simulations
CUDA Compute Unified Device Architecture
DCT Discrete Cosine Transformation
DIA Distributed Interactive Application
DIC Difference with Index Compression
DIS Distributed Interactive Simulation
DTD Document Type Definition (XML)
FEM Finite Element Methods
FPGA Field Programmable Gate Arrays
FPS Frames Per Second
GLSL OpenGL Shading Language
GLUT OpenGL Utility Toolkit
GPGPU General Purpose Graphics Processing Unit
GPU Graphics Processing Unit
HAL Hardware Abstraction Layer
HCA Host Channel Adapter
HPC High Performance Computing
HVS Human Visual System
IPP Intel(R) Integrated Performance Primitives
Invire Interactive Remote Visualization
IS/VR Interactive Simulation and Virtual Reality
JPEG Joint Photographic Experts Group
MD Molecular Dynamics
MFS Material Flow Simulation
MPEG Moving Picture Experts Group
MPI Message Passing Interface
MSE Mean Squared Error
NTSC National Television System Committee
OS Operating System
PAL Phase Alternating Line
PC Personal Computer
PC2 Paderborn Center for Parallel Computing
PCI Peripheral Component Interconnect
P/S Publish/Subscribe
PSNR Peak Signal-to-Noise Ratio
PVM Parallel Virtual Machine
RAID Redundant Array of Inexpensive/Independent Disks
RGB(A) Red Green Blue (Alpha)
RLE Run Length Encoding
RPC Remote Procedure Call
RV Remote Visualization
SIMD Single Instruction Multiple Data
SSIM Structural Similarity
TCP Transmission Control Protocol
UDP User Datagram Protocol
UML Unified Modeling Language
(U)XGA (Ultra) Extended Graphics Array
VND Virtual Night Drive
VSD Volatile State Data
VR Virtual Reality
XML Extensible Markup Language
XSL Extensible Stylesheet Language (XML)
YCbCr Luma Chroma Blue Chroma Red Color Space