Document [original]

Dissertation

Versatility of Bulk Synchronous Parallel Computing:

From the Heterogeneous Cluster to the System on Chip

Olaf Bonorden

Fakultät für Elektrotechnik, Informatik und Mathematik

Institut für Informatik und Heinz Nixdorf Institut

February 2008

Zusammenfassung

Der Bedarf an Rechenleistung in den Wissenschaften und der Industrie steigt ständig

an. Bei der Entwicklung schnellerer Prozessoren gelangt man allerdings an physi-

kalische Grenzen, so dass zur weiteren Leistungssteigerung sehr viele Prozessoren

zusammen als Parallelcomputer benutzt werden müssen. Zur Nutzung dieser Super-

computer ist es wichtig, allgemeine Modelle und Werkzeuge zu haben, die es erlau-

ben, unabhängig von der speziellen Hardware effiziente Algorithmen zu entwickeln

und zu implementieren.

In dieser Dissertation werden Modelle für parallele Systeme vorgestellt, ein Über-

blick über Algorithmen für diese Modelle gegeben und effiziente Implementierungen

für verschiedene Architekturen entwickelt. Der Schwerpunkt liegt dabei auf der Fami-

lie der Bulk Synchronous Parallel Modelle, da diese die Entwicklung portabler, aber

trotzdem effizienter paralleler Programme erlauben.

Für die Implementierungen werden zwei Architekturen betrachtet: ein On-Chip-

Parallelcomputer und Workstation-Cluster. Im Bereich des On-Chip-Systems zeigt die

Arbeit, wie das benutzte Modell die Entwicklung applikationsunabhängiger, effizien-

ter, paralleler Systeme unterstützen kann.

Die Workstation-Cluster, auf denen nur freie Rechenkapazitäten genutzt werden

dürfen, stehen auf der anderen Seite des Spektrums paralleler Systeme. Sie unter-

scheiden sich vom On-Chip-System nicht nur durch viel größere Latenzen, geringe-

re Kommunikationsbandbreite und größeren Arbeitsspeicher, sie können auch hete-

rogen sein, d. h., verschiedene Computertypen enthalten. Hierdurch und durch die

variable, sich ständig ändernde, nutzbare Rechenkapazität der einzelnen Knoten er-

geben sich besondere Herausforderungen, z.B. Lastbalancierung. Hierfür stellt die

Arbeit eine Implementierung vor, welche mittels virtueller Prozessoren und deren Mi-

gration die Last gleichmäßig im Netzwerk verteilt.

Exemplarische Implementierungen zeigen, dass die Idee eines allgemeinen Mo-

dells funktioniert, d. h., dass ein Algorithmus für dieses Modell, sogar ein Programm-

quelltext, zu effizienten Implementierungen auf unterschiedlichen Systemen führen

kann.

Reviewers: Prof. Dr. Friedhelm Meyer auf der Heide, University of Paderborn

Prof. Dr.-Ing. Ulrich Rückert, University of Paderborn

Acknowledgements

First and foremost, I wish to thank my advisor, Prof. Dr. Friedhelm Meyer auf der Heide,

for his great support over all the years. I had a pretty good time and a very nice atmosphere

in his research group Algorithms and Complexity. I always had the freedom to choose the

topics and direction of my research.

Furthermore, I would like to thank all (former) members of the research group Algo-

rithms and Complexity and all participants of the projects DFG-Sonderforschungsbereich

376 and GigaNetIC, especially the persons with whom I closely collaborated in research,

teaching, or just in daily conversations. I am grateful to Ingo Rieping, who introduced me

to bulk synchronous parallel computing.

For the technical support, my thanks go to the members of the IRB, especially Ulrich

Ahlers and Heinz-Georg Wassing, who always tried to satisfy my needs concerning hard-

ware and software, and to the Paderborn Center for Parallel Computing (PC2), namely

Axel Keller, who always reanimated nodes of the system killed by my experiments.

Special thanks I owe to Bastian Degener and Marc Rautenhaus, who proof-read parts

of this thesis.

Last but not least, I would like to thank Stefanie for her patience, empathy, love, and

all the other things that make it so worthwhile to know her. She turned even the most

stressful days into a wonderful time, nowadays assisted by Emely’s and Nele’s enthusias-

tic welcome each day after work.

Olaf Bonorden

iii

To those who cannot read these lines anymore:

Heide Bonorden

Contents

1 Introduction 1

1.1 High Performance Computing . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 RelatedWork ................................ 4

1.4 Organization................................. 5

2 Parallel Models 7

2.1 Parallel Random Access Machine models . . . . . . . . . . . . . . . . . 8

2.2 Bulk Synchronous Parallel Models . . . . . . . . . . . . . . . . . . . . . 10

2.3 OtherParallelModels............................ 15

2.4 Conclusion ................................. 16

3 Algorithms 19

3.1 BSP-Algorithms............................... 19

3.2 Composition of Efficient Nested BSP Algorithms . . . . . . . . . . . . . 26

4 Implementations 35

4.1 BSPEnvironments ............................. 35

4.2 Implementations of BSP Algorithms . . . . . . . . . . . . . . . . . . . . 43

5 BSP for Parallel System on Chip 49

5.1 ProjectGigaNetIC ............................. 49

5.2 RelatedWork ................................ 50

5.3 Design of the Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.4 Toolchain .................................. 56

5.5 Evaluation.................................. 62

5.6 Conclusions................................. 71

6 BSP for Heterogeneous Workstation Clusters 73

6.1 Heterogeneous Workstation Cluster . . . . . . . . . . . . . . . . . . . . 73

6.2 RelatedWork ................................ 74

vii

viii Contents

6.3 Migration of Virtual Processors . . . . . . . . . . . . . . . . . . . . . . . 76

6.4 LoadBalancing............................... 82

6.5 Conclusions................................. 94

7 Summary and Outlook 95

7.1 Summary .................................. 95

7.2 Open Questions and Further Research . . . . . . . . . . . . . . . . . . . 96

Bibliography 99

Nomenclature 111

Index 113

Chapter 1

Introduction

1.1 High Performance Computing

High performance computing has become an ubiquitous component of scientific work in

natural and engineering sciences. Due to the well-known physical limitations of increas-

ing the computational power of individual processing elements, parallel systems have

become particularly important in recent years.

1.1.1 Technological Progress

Figure 1.1: Moore’s Law

Processing elements are becoming increasingly faster.

Moore’s Law [Moo65, Moo98] is well-known for its

statement that the complexity of microprocessors dou-

bles every two years, a time interval nowadays adapted

to 18 months. In the early days of processor design per-

formance of a system could easily be improved by either

increasing the clock frequency or by using more transis-

tors. Since 1971 the number of transistors of the central

processing unit (CPU2) has been increased from 2,250 to

more than 299 millions in 2006 (Table 1.1). During the

same time, arithmetic units were enlarged from 8 bits to

64 bits word size and support for the direct execution of

increasingly complex instructions (e. g., multiplication)

and floating point numbers was added. Later, sophisticated techniques were introduced to

1“In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later,

Moore’s Law remains true, driven largely by Intel’s unparalleled silicon expertise.”[Int06]

2In this thesis we will denote CPU by processor because we are not dealing with other processing units

like e. g. graphic processing units (GPU).

2 Introduction

Table 1.1: Die Sizes

Year Processor Transistors

1971 Intel 4004 2,250

1978 Intel 8086 29,000

1979 Motorola 68000 68,000

1989 Intel 80486 1,180,000

2001 IBM-Motorola Power4 174,000,000

2002 Intel Itanium II (McKinley) 410,000,000

2005 AMD Athlon 64 200,000,000

2006 Intel Pentium Xeon 291,000,000

2007 Intel Itanium II (Montecito) 1,720,000,000

Figure 1.2: AMD Opteron

further increase the clock speed and to hide latencies. Examples are out-of-order execu-

tion, branch prediction, or speculative execution.

Another approach to utilize transistors is caching. Indeed, most transistors of modern

processors are used for caches, for instance, the Intel Itanium II Montecito uses 1.55·109

of its 1.72 ·109transistors for caches (24 MB). Figure 1.2 shows the die of an Opteron

processor, although it has only 1 MB cache, the caches occupy more than half of the chip

area.

1.1.2 Parallelism

The demand of computational power grows even faster than the increase in processor

speed. To accommodate this demand, many processing units are combined to build large

parallel machines. Parallel computers have been on the market for more than 20 years

now, and much experience has been gained in designing such systems. Various design

ideas have leed to a wide range of parallel architectures, which differ in the type of the

processing unit (standard CPUs, vector processors, et cetera) and use quite different com-

munication structures like shared memory or various network topologies.

Although parallel systems are widely used in high performance computing, for in-

stance, for complex simulations in science and engineering (e. g., fluid mechanics or crash

tests). Modern office computers often have multicore processors, i. e., they have two or

more processor cores inside the CPU. They are thus parallel computers, but the num-

ber of processing elements is very small and most standard software does not utilize the

parallelism.

Software is the major reason why massive parallel systems are not widely used, al-

though such systems could be made available at low cost. In the beginning of parallel

computing, most algorithms were developed and analyzed for specific parallel comput-

ers, for example, sorting algorithms for computers with a hypercube network topology

1.2. Contributions of this Thesis 3

[CP93]. When a new topology was developed, one had to adapt all algorithms and rewrite

the software. In contrast, for sequential machines widely used model of the system (von-

Neumann-model [BGvN89]) and architecture independent programming languages (e. g.,

ISO-C99 [ISO05]) exists, so it is easy to develop software that runs efficiently on all

computers that follow the model.

Several parallel computer models try to play a similar role for parallel systems. Some

are mainly used for theoretical analysis, but others are supported by libraries for the effi-

cient implementation of algorithms on different target architectures.

For the success of parallelism on the market, a general parallel model is needed that is

easy to use and allows the development of portable but yet efficient algorithms. Further-

more, a toolchain supporting the implementation is needed, including languages, compil-

ers, libraries, and development tools like parallel debuggers and profilers.

1.2 Contributions of this Thesis

Skillicorn and Talia conclude their survey of parallel algorithms by the following perspec-

tive:

“Thus we can hope that, within a few years, there will be models that are easy

to program, providing at least moderate abstraction, that can be used with a

wide range of parallel computers, making portability a standard feature of

parallel programming, that are easy to understand, and that can be executed

with predictably good performance.”

Skillicorn and Talia, 1998[ST98]

In our view models which have these properties already exist. The main objective of

this thesis is to demonstrate that the bulk synchronous parallel (BSP) models are flexible

enough to allow portable, yet efficient parallel programs for a wide range of machines:

from tightly coupled parallel computers on a single chip to workstation clusters using

traditional local area networks.

This thesis covers the entire work flow: parallel computation models, algorithms, and

implementations on different parallel architectures. We first explore several of the existing

models to find candidates for a general parallel model. We then concentrate on the family

of bulk synchronous models, as we think they are good candidates for parallel models for

general use like the von-Neumann model for sequential computers. Next, we investigate

at algorithms for the BSP model. The models should be in use, i. e., there should be a lot

of algorithms in the literature for them. Furthermore, we want practical usable models, so

we investigate universal implementations for them.

Our main contributions are the design of a configurable, efficient parallel system on

a chip suitable for BSP — demonstrating how the model can guide the development of

efficient hardware for general use — and an efficient implementation of the BSP model

4 Introduction

for workstation clusters providing only unpredictable usable idle times. Our contributions

demonstrate that the BSP model can be used for a wide range of parallel systems. It is

thus well suited as a candidate for a commonly used parallel model.

BSP for Parallel System on Chip. We design a hierarchical parallel architecture, with

clusters of processors using shared memory and an on-chip interconnection network be-

tween the clusters. We develop communication protocols, a BSP library, and a set of sim-

ulators to evaluate the system and the influence of different parameters like cache sizes,

organizations, et cetera. We evaluate the system using different benchmarks, from sim-

ple communication benchmarks to applications like sorting and solving 3-satisfiability,

covering a wide range of limiting factors for parallel programs.

BSP for Heterogenous Workstation Cluster. We implement and evaluate a library for

efficient execution of BSP programs for heterogenous workstation cluster. As we are

allowed to use only the idle times of the computers, and the behavior of other users is not

known in advance, we have to adapt to changes of the free capacity online.

For this purpose we implement virtual processors, which can be migrated to idle nodes

in the network. To the best of our knowledge, this is the first realization of migrating

Linux processes in the userspace, without modifications of the operation system. This

is very important, because normally users are not allowed to change the systems of other

users. Furthermore, patching the Linux needs updates to the software for every new Linux

version. These migration technique is also of independent interest.

We examine the usage of a typical Linux workstation to create models for the load

and the idle times of workstation clusters. Based on the load models we implement and

evaluate load balancing strategies.

1.3 Related Work

Much research has been done in the field of parallel models and algorithms. We give a

survey of some models that are related to the BSP model, and a survey of algorithms and

implementations in the following chapters.

In the field of implementations there are different related publications: other BSP

implementations (e. g., Oxford BSPlib), other systems for process migration and load

balancing, and other approaches for parallel systems on a single chip. We discuss this

work in the corresponding chapters.

1.4. Organization 5

1.4 Organization

In Chapter 2 we describe parallel models in the field of bulk synchronous parallel com-

puting. We start with the famous PRAM model and show the development to the bulk

synchronous parallel model and its extentions.

Chapter 3 gives a survey over algorithms for the bulk synchronous model to show

that it is widely used for algorithm design. The algorithms described in the literature

are mostly analyzed using the Onotation (Landau notation), which shows the asymptotic

behavior for very large input sizes.

In Chapter 4 we show implementations, i. e., practical usable software, to confirm that

the model is not only useful for analyzing algorithms in the Onotation but also leads to

fast algorithms in practice. Furthermore, we give a survey of libraries for implementation

of BSP algorithms. There are many more algorithms than implementations, mainly be-

cause the model is widely used in theory but still unknown to many programmers, which

will hopefully change in the future. The last section in that chapter introduces our sys-

tem to build libraries of algorithms, and shows a way to automatically configure optimal

algorithms for a given machine and a given problem size using a collection of annotated

algorithm implementations.

The next two chapters deal with our implementation of libraries for BSP computing

for two very different architectures: Chapter 5 describes a parallel system on a single

chip, with low latencies, high bandwidth, but small local on-chip memories. It shows the

flexibility of the model, as the same algorithms, in our implementations even nearly with

the same source can be used for quite different systems. It also demonstrate how BSP

can support the development of hardware architectures, e. g. designing communication

protocols suitable for BSP.

In Chapter 6we propose an efficient implementation for a parallel architecture at

the other end of the range of parallel systems: a lossly coupled workstation cluster with

some external load, where we can only use unpredictable idle times. We show how the

migration of virtual processors can help to cope with load changes and faulty machines.

Publications

Parts of this work have been published on conferences and in journals. A fast BSP

algorithm for factorization of polynoms is evaluated in [BvzGG+01]. The combina-

tion of algorithms and implementation and its automatical configuration is introduced

in [BMW02]. Some issues of the design and implementation of our BSP implementation

(PUB library) are described in [BJvR03], the migration and load balancing functionality

for heterogenous workstation clusters in [Bon07]. A model for the utilization of worksta-

tions (constant-usage-interval load model) is published in [BGM05, BGM06]. An intro-

duction to the System-On-Chip architecture is [BSR+03].

6 Introduction

Chapter 2

Parallel Models

In this chapter we will give a survey of parallel computing models. We will start by

introducing a family of parallel random access machine models [SS63], which are natural

extensions to the sequential random access machine. Next bulk synchronous models will

be described, and we conclude by reviewing some other related models. We concentrate

on these models, because we think they can act as general parallel models, which are

suitable for theoretical analysis as well as for portable and efficient implementations.

Figure 2.1 gives an overview of the parallel models described in the following para-

graphs and their relations. In [LMR95], Li, Mills, and Reif present a framework of using

resource metrics to characterize models of parallel computation. They concentrate on the

resources that are accounted for by the model, for example, memory organization, latency,

or block transfers. The paper describes a number of parallel models and classifies them

with respect to their resources.

PRAM

1978

H-BSP

2000

(d,x) BSP

1997

LogP

1993

LogPG

1995

dBSP

1999

E-BSP

1996

QSM

1996

H-PRAM

1991

Y-PRAM

1992

Obl-BSP

2000

BSP*

1998

QRQW-PRAM

1993

CGM

1993

BPRAM

1989

APRAM

1989

BSP

1990

LPRAM

1989

CREW-BSP

2000

impact

Figure 2.1: A Map of some Models for Parallel Computing

8 Parallel Models

P0P1P2Pp−1

...

Random Access Memory

(a) PRAM

local

RAM

local

RAM

local

RAM

local

RAM

P0P1P2Pp−1

...

Global Random Access Memory

(b) LPRAM, BPRAM

local

RAM

local

RAM

local

RAM

local

RAM

P0P1P2Pp−1

...

Global Random Access Memory

Queue

Figure 2.2: Parallel Random Access Machine

Another survey including parallel languages and systems with implicit parallelism

as well, has been written by Skillicorn and Talia [ST98]. The authors use a set of six

criteria that in their view an ideal model should satisfy. Four criteria are related to the

needs of software developers, two address the needs for actual execution of the models on

real parallel machines. The main focus of the paper is the classification of the models in

categories based on these criteria.

2.1 Parallel Random Access Machine models

In this section we will describe a family of parallel computing models that use a global

shared memory. We will start by introducing the standard Parallel Random Access Ma-

chine (PRAM) model, then continue by adding a number of extensions including addi-

tional local memory and block transfers.

2.1.1 Parallel random access machine (PRAM) model

The parallel random access machine [FW78] is an extension of the sequential random-

access-machine (RAM). A RAM consists of one processing element (CPU) and a random

access memory. The CPU can read and write to arbitrary cells of the memory.

In a PRAM, there are pprocessors that can read and write arbitrary cells of a global

shared memory. The processors have a central clock and execute each instruction of a

given program synchronously. All communication between the processors is done by

writing to and reading from the memory (Figure 2.2). Many variants of PRAMs exist

in the literature. They differ in how simultaneous accesses of different processors to

the same memory cell are handled: Exclusive read and write (EREW) does not allow

any conflicts. Concurrent read/write (CRCW), as the most powerful version, can handle

such accesses. For CRCW PRAMs, there are different rules for resolving write conflicts,

including common (all processors have to write the same), priority (processor with lowest

id writes), and arbitrary (the winning processor is not defined).

2.1. Parallel Random Access Machine models 9

Communication costs are not accounted for by the model. Thus, it is suited to analyze

the maximal degree of parallelism which is possible for a given problem while neglecting

problems occurring in practical applications, for instance, limited bandwidth or conges-

tion.

2.1.2 Local-Memory PRAM (LPRAM)

The Local Memory PRAM introduced by Alok Aggarwal, Ashok K. Chandra and Marc

Snir [ACS90] includes local memory. In addition to the global memory, each processor

has its own unlimited private local memory. At each time step, a processor can either read

or write a word to the global memory, or perform a local computation step. The access to

the global memory is modelled by concurrent read, exclusive write.

The authors provide proves for upper and lower bounds for the LPRAM model for

three problems: matrix multiplication, sorting, and computing an n-point FFT graph.

2.1.3 Block-PRAM (BPRAM)

The Block-PRAM model [ACS89] is an extension of the LPRAM, it includes block by

block transfers. All accesses to the shared global memory are blockwise. This means that

a processor can transfer a block of consecutive memory cells from the global memory

to the local memory. The same is possible in the opposite direction. Copying of data is

charged by costs of b+l, where bis the size of the block and lis the latency (a parameter

of the machine). In the BPRAM model, different accesses are not allowed to overlap,

i. e., the shared memory is modelled by exclusive read, exclusive write (EREW). Minimal

running time and the total work performed by the processors are important measures of

the algorithm performance in this model.

The authors demonstrate the capabilities of their model by analyzing sorting, global

sum, matrix transposition and multiplication, rational permutations, and Fast Fourier

Transformation. Furthermore, they summarize the relationship of execution time of stan-

dard PRAM and BPRAM with the following equation, since one step of a EREW-PRAM

can be simulated with lsteps of the Block-PRAM:

TEREW-PRAM ≤TBPRAM ≤l·TEREW-PRAM

2.1.4 Queue-Read Queue-Write PRAM

The Queue-Read Queue-Write (QRQW) PRAM was introduced by Gibbons, Matias, and

Ramachandran [GMR98] and is a model situated between the EREW- and the CRCW

PRAM. Similar to the LPRAM, each processor has its own local memory in addition

to the global memory. Concurrent access to the global memory is allowed, but the cost

10 Parallel Models

of a step is increased by the number of processors that access the global memory. The

motivation of this cost function is that all requests are stored in a queue and processed

sequentially. In [GMR98] the relationship to other models is also discussed: a pprocessor

QRQW-PRAM can be simulated on a p/log pprocessor BSP computer (cf. Section 2.2)

with slowdown O(log p)with high probability.

2.1.5 Asynchronous PRAM (APRAM)

The Asynchronous PRAM by Phillip B. Gibbons [Gib89] is a PRAM that also incorpo-

rates both local and global memory. However, in contrast to the other PRAM models,

each processor of the APRAM has its own local clock and executes instructions of its

own program independently of the timing of other processors. In addition to the global

and local read and write operations, there is an instruction to synchronize arbitrary sub-

groups of the processors. All processors in the subgroup stop until the last has reached a

synchronization barrier.

To prevent race conditions, i. e., the output is depended on the order of events, proces-

sors are prevented from reading a global memory cell that has been modified by another

processor until a synchronization event has occurred. The cost model assumes a global

clock, i. e., all instructions on all processors are equally expensive. However, an algorithm

for the APRAM model is considered correct only if it works regardless of any delay that

may occur.

2.2 Bulk Synchronous Parallel Models

In this section we will describe the Bulk Synchronous Parallel Model and its extensions,

as we think they are the right candidates for a general parallel model, which is usable for

theoretical analysis as well as for portable and efficient implementations.

The models are more restrictive concerning communication patterns than the PRAM

models in the previous section, but this simplifies the analysis of algorithms as well as the

implementation of the model in real hardware systems.

2.2.1 The Bulk Synchronous Parallel Model

The Bulk Synchronous Parallel Model (BSP) was introduced by Leslie G. Valiant as a

bridge between the hardware and the software to simplify the development of parallel

algorithms.

“A major purpose of such a model is simply to act as a standard on which

people can agree.”

Leslie Valiant, 1990 [Val90]

2.2. Bulk Synchronous Parallel Models 11

Pp−1

Interconnection Network

(a) BSP Computer

L

h·g

P0P1P2P3P4

t+1

wmax









communication

local computation

barrier

Superstep t

Superstep

(b) Superstep

Figure 2.3: BSP Model

On the one hand it provides an abstract view of the technical structure and the com-

munication features of the hardware to use (e. g., a parallel computer, a cluster of work-

stations, or a set of personal computers (PCs) interconnected by the internet) for the de-

veloper. On the other hand it helps hardware designer to build efficient parallel machines

without knowing the applications running on it in advance. There are three different parts

of the model: a machine model, a programming model and a cost model. There are several

views of this model with slightly different descriptions, but they are all equivalent up to

constant factors in the cost model. We start with our view, based on our experiences with

implementations. Afterwards, we describe the differences.

Machine model. ABSP computer is defined as a set of processors with local memory

and an arbitrary communication mechanism (e. g., a network or shared memory) that sup-

ports point-to-point communication between the processors, and barrier synchronization.

The model does not assume a specific topology or exploit locality (cf. Figure 2.3(a)).

Programming Model. ABSP program consists of a set of BSP processes and a se-

quence of supersteps – time intervals bounded by the barrier synchronization. Within

a superstep each process performs local computations and sends messages to other pro-

cesses; afterwards it indicates by calling the synchronization method that it is ready for

the barrier synchronization. When all processes have invoked the synchronization method

and all messages are delivered, the next superstep begins. Then the messages sent during

the previous superstep can be accessed by its recipients.

12 Parallel Models

Cost Model. The cost of a superstep is the sum of wmax, the maximal local work of

a processor, h, the maximal amount of data that is sent or received by one processor

multiplied with the gap g of the network, and the time L(latency) for one barrier synchro-

nization (Figure 2.3(b)):

Cost of a superstep =wmax +hg +L

A communication pattern where each processor sends or receives at most hbytes is

called h-relation.

In the original description of Valiant [Val90] the synchronization hardware checks

every Ltime steps whether all processors have finished the superstep. Thus, the length of

a superstep is always a multiple of L. The difference in the cost function compared to our

definition is at most Ltimes the number of supersteps, thus at most a factor of 2.

Computational power. The computational power of BSP is analyzed by Martin Beran

in [Ber98]. He assumes an unlimited number of processors, and functions g(p)and L(p)

describing the bandwidth and the latency with respect to the number of active processors.

If g(p)and L(p)are polylogarithmic in p, a BSP computer is asymptotically as power-

ful as a PRAM. They both belong to a class C2, which are all machine models that are

polynomial-time equivalent to the space complexity of Deterministic Turing Machines

(DTM). All machines in C2can solve all problems of the class PSPACE in polynomial

time (with unbounded number of processors).

For g(p) = Ω(pa), or L(p) = Ω(pa),a>0, or pconstant, a BSP-Computer belongs to

the Class C1(all machines that are polynomial-time equivalent and linearly space equiva-

lent to DTM).

More details on the complexity classes C1and C2can be found in the Handbook of

Theoretical Computer Science [vL90].

Extensions to BSP. A lot of extensions to the standard BSP model have been proposed

since 1990, for example, E-BSP, BSP∗, dBSP, Obl-BSP, or H-BSP. Some of them try to

improve the accuracy of running time predictions by using more parameters to describe

the machine, for example by introducing a minimal block size for efficient communication

in the BSP∗model. Others improve the power of the machine by introducing new features,

an example is oblivious synchronization.

2.2.2 BSP*

Most real machines need some time to prepare messages, for example, for creating mes-

sage headers or starting the transmission. So the achieved bandwidth for the communica-

tion depends on the size of the messages. In the BSP model the cost depends only on the

sum of all transmitted data (hrelation), not on the size of the messages. The BSP∗model

2.2. Bulk Synchronous Parallel Models 13

by Armin Bäumker, Wolfgang Dittrich, and Friedhelm Meyer auf der Heide [BDM98] in-

troduces an additional machine parameter B. Informally, Bis the minimal size a message

must have to achieve a “good” network bandwidth. More formally, the communication

cost of a superstep is h·g, where his the maximal amount of data hi,i=0,...,p−1 send

or received by a processor. In opposite to the normal BSP model, these hiare the sum of

all message sizes where each message smaller than Bbytes is charged for B.

The authors use multisearch as an example to show the improvement over BSP.

2.2.3 Oblivious BSP

One often noted objection to the use of BSP is the global synchronization. Especially

for large parallel machines having many processors, the cost of a global synchronization

seems to be quite big (although in most implementations it is of order O(log p)for ppro-

cessors). So there are some proposals to reduce this cost. The Oblivious-BSP model by

Jesus A. Gonzalez, Coromoto Leon, Fabiana Piccoli, Marcela Printista, José Luis Roda,

Casiano Rodríguez, and Francisco de Sande [GLP+00] formalized Oblivious Synchro-

nization, which was first implemented in the BSPk system (s. Section 4.1.1).

In the BSP model all processors are in the same superstep, guaranteed by the barrier

synchronizations between supersteps. But for a correct execution it is only needed that

messages are received at the right time, i. e., two processors that do not communicate with

each other do not have to synchronize. In the Oblivious BSP model processors may be in

different supersteps, the only limitation is that a processor is not allowed to enter a new

superstep until all messages for it have been received. As it is not possible for the system

to know in advance if other processors, which are still in earlier supersteps, will send

messages, the user has to specify the number of messages each processor will receive.

The processor waits until the messages have been received and continues with the next

superstep. This can reduce the idle time waiting for other processors in barrier synchro-

nizations, especially if the amount of work in a superstep is not equal on all processors.

Furthermore, the time for the barrier synchronization is saved.

The authors demonstrate the accuracy of running time predictions in their model by

evaluating a nested parallel Discrete Fast Fourier Transform algorithm with the PUB li-

brary (Section 4.1.2) on a Cray T3E supercomputer. The performance gain of the obliv-

ious synchronizations in practice can be seen in [BJvR03], where the running time of a

multi search program is evaluated using different advanced features of the PUB library.

2.2.4 E-BSP

In 1996 Ben H. H. Juurlink and Harry A. G. Wijshoff proposed the Extended BSP (E-

BSP) model [JW96]. It includes unbalanced communication and locality. The authors

extend the cost model by replacing the term h-relation in the BSP cost model by an

14 Parallel Models

(M,k1,k2)-relation, where Mis the total number of messages, k1,k2is the maximal num-

ber of message a single processor sends respectively receives, thus a (full) k-relation is a

(kp,k,k)-relation in the E-BSP model.

The locality of a superstep is defined as the maximal difference of the identifiers of

any sender/receiver pair. The implementation should use a numbering of the processors,

such that two processors with a small difference in their identifiers have a short distance

in the underlying physical network, for instance, using a peano embedding for a mesh of

dimension two. The PUB library uses such embeddings for some mesh architectures, for

instance, the Parix GCel/GCPP.

A comparison with other models can be found in Ben Juurlink’s PhD thesis [Juu97],

including comparisons of communication times of sorting algorithms for the E-BSP and

for the standard BSP model.

2.2.5 Decomposable BSP (dBSP)

The Decomposable BSP model by Martin Beran [Ber99] is another approach to include

locality. A dBSP computer can dynamically partition itself into smaller parts which be-

have as dBSP computers with smaller p. Communication between processors of different

partitions is impossible. The parameters gand Lof the original BSP model are replaced by

functions of p. In most implementations, g(p)and L(p)increase with p. If the synchro-

nization is implemented using message passing and a tree as communication structure,

L(p)is usually of order O(log p).

The paper shows a slowdown of O(max{g(p),L(p)})for a superstep of a dBSP com-

puter simulated on a BSP computer, and an additional cost of O((1+L(p)+g(p))·log p)

for the partition and join operations.

Besides the performance advantage, the model is useful in practice, if different al-

gorithms should run simultaneously on parts of a parallel computer. The PUB library

supports this model: it is possible to partition the machine into independent BSP com-

puters; furthermore, BSP parameters of the machine are given for the actual size of the

partition.

2.2.6 H-BSP

The Heterogeneous Bulk Synchronous Parallel (HBSP) model by Tiffani L. Williams and

Rebecca J. Parsons [WP00] is a generalization of the BSP model. The gap gis substituted

by the gap gi,i=0,...,p−1 of each processor. The parameter gidescribes the rate at

with a processor can send data to the network. Additional parameters ci,i=0,...,p−1,

define the computation power of the processors, relative to the slowest one.

The cost function is adapted, such that the local work is weighted by the cis. The

term g·h=g·max0≤i<p{hi}for the communication cost is replaced by max0≤i<p{gihi},

2.3. Other Parallel Models 15

where hiis the amount of data sent to the network by processor i.

How to use the model for designing algorithms is shown by some examples: prefix

sums, matrix multiplication, and randomized sample sort.

2.2.7 (d,x)-BSP

The (d,x)-BSP model by Guy E. Blelloch et al. [BGMZ97] is an extension of the BSP

model designed for high performance parallel computer with global shared memory. The

additional parameters d(delay) and x(expansion) describe the timing of the memory,

i. e., the time for an access (typically in the order of 10 clock cycles for machines in 1996)

and the ratio of the number of processors and the number of parallel accessible memory

banks. The cost functions is expanded by a term for the maximal number of accesses to a

memory bank, scaled with the delay d.

Typical example machines for this model were NEC SX3 (p=4, x=256), Cray C90

(p=16, x=64,d=6), or Cray J90 (p=16, x=64,d=14).

2.2.8 CREW-BSP

The standard BSP model allows point-to-point messages and barrier synchronizations

only. Sometimes the hardware supports other communication patterns, for example,

broadcasts are easy to realize if the communication medium is shared, for example a

bus.

The CREW-BSP-model (concurrent read exclusive write) by Michael T. Goodrich

[Goo00] extends the standard BSP-model by arbitrary multicast operations, i. e., the des-

tination of a message is a set of processors. The same paper defines also a CRCW-BSP

computer, which is able to combine data from more than one processor using an arbitrary

relation. This computer is more powerful than a standard BSP-computer, for instance, it

can compute the maximum of pnumbers in constant time.

2.3 Other Parallel Models

In this section we describe two parallel models, which are similar to the BSP model, but

does not divide the time into supersteps, thus, they are not synchronous models.

2.3.1 Coarse grained multicomputer (CGM)

The Coarse Grained Multicomputer (CGM) was introduced 1993 by Dehne, Fabri, and

Rau-Chaplin [DFRC93, DFRC96]. It is the first model of a family of parallel computa-

tion models called coarse grained parallel computing (CGP) and is similar to the LogP

model (which we will describe in the next section) and the BSP model. A CGM(n,p)

16 Parallel Models

consists of pprocessors with O(n

p)local memory each, and an arbitrary interconnection

network. Is is said to be “coarse grained” because the size of the memory is defined to be

“considerably larger than O(1)”. Dehne et al. assume that n

p≥p.

The basic idea behind this model is the following: The algorithms partition the given

problem into psubproblems of size O(n

p). These subproblems are solved sequentially

by the processors; afterwards the results are combined using a global communication

operation. This scheme is iterated; for many problems a constant number of iterations

is sufficient. For the global operation sorting is used. Many parallel architectures have

global sorting as a system call, or a highly optimized implementation can be obtained

as public domain software. Other communication operations, for example, broadcast

(segmented, multinode), total exchange, or partial sum (scan) can be implemented using

a constant number of global sort operations. So in the CGM sorting is the only way to

communicate, and the cost for an algorithm is based on TS(n,p), the time needed to sort

ndata items on pprocessors.

2.3.2 LogP

The LogP model was introduced by Culler, Karp, Patterson, Sahay, Schauser, Santos,

Subramonian, and von Eicken [CKP+93] in 1993 and tries to model a processor network

by four parameters: Lis the latency, i. e, the delay for sending a very small message

from a source to a destination, ois the overhead, the time a processor is busy starting a

transmission or reception of a message, including the time needed for computing message

headers. Parameter gis the gap, the minimal time interval between consecutive message

transmissions or consecutive message receptions, the reciprocal is the per processor band-

width. Pis the number of processors. Thus, LogP does not consider the topology of the

network, similar to the CGP model. Furthermore, the capacity of the network is limited,

i. e., at most dL/gedata can be in the network at the same time.

On the one hand this model is more general than CGM or BSP as it allows arbitrary

asynchronous communication and does not restrict the algorithms to global communica-

tions or supersteps, on the other hand proofs of correctness or running time analysis of

algorithms may be much complexer.

2.4 Conclusion

We described parallel models, from the PRAM to BSP and its extensions. We think

that the BSP model and its extensions are a good way to model parallel systems such

that algorithm engineers can develop parallel algorithms that run fast on a wide area of

parallel systems, because: On the one hand the PRAM model is hard to realize, because

multiple accesses to a global shared memory require an expensive cross bar system. On

the other hand the plain BSP model is too restricted, i. e., there are algorithms that does

2.4. Conclusion 17

not fit into BSP very well, but some relaxations, for instance, oblivious synchronization

or a decomposable machine increase the number of possible applications.

In some areas it may even be needed to soften the model a little, to allow asynchronous

messages in some parts, but at the cost of loosing some nice features of the synchronous

model, for example, simplicity of verification. Our implementation, the PUB library,

which is described in Section 4.1.2, follows that direction.

18 Parallel Models

Chapter 3

Algorithms

A suitable model for parallel computing should support the development of efficient al-

gorithms. This chapter starts with an overview of algorithms for the bulk synchronous

setting. In the second part we will propose a system which can be used to build libraries

of algorithms for more complex problems by combining basic algorithms. It configures

an optimal algorithm based on information of the input (e. g., the size of the input), and

the architecture, i. e., the BSP parameters, using a library of BSP algorithms and their

running time descriptions.

3.1 BSP-Algorithms

In this section we will give a short overview of algorithms for the BSP model to show that

it is widely used in algorithm design. Here were focus on algorithms and their theoretical

analysis in the BSP model. In Section 4.2 we will show some implementations of BSP

algorithms.

There are many BSP algorithms for various different application areas. Simple ex-

amples are basic communication functions (e. g., broadcast or parallel prefix operations,

which are already mentioned in the paper introducing the BSP model [Val90]) and sort-

ing. Examples for more complex applications are graph algorithms, optimizations, simu-

lations, and algebra.

3.1.1 Basic Communication Algorithms

The BSP model supports point-to-point messages and synchronizations only. All other

communication pattern, for instance broadcast (one processor sends a message to all oth-

ers), multicast (one message to a set of destination), total exchange (every processor has

data for all other processors), or the parallel prefix operations scan and reduce have to be

realized on top of these point-to-point messages.

20 Algorithms

(a) TREEd=3(b) SCATTERGATHERk=3

Long Edges

Figure 3.1: Broadcast Algorithms

Some implementations of BSP are more powerful and support such advanced commu-

nication functions directly, for example, the PUB library (Section 4.1.2), which is some-

times more efficient [Rie00], and more comfortable for the user. There are also models

that allow this, like CREW-BSP (Section 2.2.8), but they are not as widely used as other

BSP models.

Broadcast

In the Broadcast problem one processor has one message of size mthat has to be sent to

all other processors. A basic algorithm TREE is given by the author of the BSP model

[Val90]: in logdpsupersteps each processor that knows the message sends it to dothers.

As dmessages of size mare sent in each superstep, a superstep takes time O(dm ·g+L),

thus the total time is O((dm ·g+L)logdp).

Other broadcast algorithms can benefit from locality, for example in the dBSP model.

An example for such an algorithm is ROOT [JKMR03]: the message is sent to √psub-

groups in the first step, and distributed inside these groups by a recursive broadcast call.

In [Rie00] other broadcast algorithms are analyzed: PTREE (pipelined version of the

standard algorithm TREE), and SCATTERGATHER, an algorithm that splits the message

in kparts, broadcasts the parts to subgroups, and then sends the parts from a subgroup to

all other nodes (similar to an all-to-all communication). These algorithms are illustrated

in Figure 3.1, their running time in the dBSP model can be found in Table 3.1.

3.1.2 Parallel Prefix Operations

In the Parallel Prefix problem every processor ihas some data xi, and for a given asso-

ciative operation ⊗one needs to compute x0⊗x1⊗···⊗xifor all i=0,...,p−1. This

problem is sometimes referred to as scan. A simpler problem is called reduce: instead of

computing the results for all i, only

p−1

i=0

xiis needed. A simple tree based algorithm with

3.1. BSP-Algorithms 21

Table 3.1: Running Times of Broadcast Algorithms in the dBSP Model [Rie00]

Algorithm Result

Tree dlogdpe−1

∑

i=0(d−1)·(m·g(lp

dim)+L(lp

dim)

PTree k·L(p)+2dn

ke·g(p)+dlog pe−1

∑

i=1

Lp

2i+2lm

km·gp

2i

ScatterGather 2(m·g(p)+L(p))+ time for broadcast of m

kto p

kprocessors

Root dloglog pe

∑

i=0

2im·g(lp1/2im)+L(lp1/2im)

Table 3.2: Running Times of Sorting Algorithms

Algorithm Result Reference

CommunEfficientParSort Onlogn

p+logn

log(n/p)n

p·g+L [Goo00]

Randomized SampleSort Onlogn

p+(pε+n

p)·g+Lw. h. p. [GV94]

Regular Sampling Onlogn

p+n

p·g+Lfor n≥p3[SS92]

XY-Sort n=√p,O(√p·g+L)+Tred [Rie00]

Tred time for reduce on √pproc.

a tree of degree darchives the same order of running time than the broadcast algorithm

TREE.

3.1.3 Sorting Algorithms

Sorting is one of the basic problems in computer science. It is well understood and there

are sorting algorithms for almost all parallel computing models. Some results for BSP

sorting algorithms are summarized in Table 3.2.

Sample Sort. Parallel SAMPLESORT is an extention of the sequential QUICKSORT al-

gorithm. QUICKSORT splits the input into two parts, elements smaller than a splitting

element and those that are greater. Next these two groups are sorted recursively. In

SAMPLESORT, the data is partitioned into pgroups according to p−1splitters. Each

processor sends the keys to the right destination, next the received keys are sorted locally.

The main problem is to find good splitters such that the sizes of the partitions are nearly

equal. For this purpose samples of the input are selected, sorted, and used as splitting ele-

ments. The main part of the analysis is about the number of samples. An algorithm based

22 Algorithms

Table 3.3: All-pairs Shortest Paths Algorithms [Tis01]

Algorithm Result

Dijkstra / Path Doubling O(n3

p+n2

p2/3+(log p)L),

O(n3

p+n2+L)

Floyd-Warshall-algorithm O(n3

p+n2

pαg+pαL),1

2≤α≤2

on this approach, proposed by Gerbessiotis and Valiant [GV94], needs a running time

of Onlogn

p+(pε+n

p)·g+Lfor sorting nkeys, with high probability for any constant

0<ε<1, if p≤n1−δ(ε)where δ(ε)is a small constant depending only upon ε.

The randomized sampling can be replaced by regular sampling. This technique is

described by Hanmao Shi and Jonathan Schaeffer [SS92].

XY-Sort. XY-SORT is a sorting algorithm used as an example in [Rie00]. It sorts √p

keys on pprocessors with a constant number of communication operations: the keys are

distributed using two multicast operation, compared in constant time, and the total rank

of each key is calculated with √pparallel reduce operations, each on a subgroup of size

√p.

3.1.4 Graphalgorithms

There are many algorithms for problems based on graphs in the literature. The challenge

for the algorithm designer is that the input is not as regular as in other application areas

like matrix multiplication, thus the algorithms are almost not oblivious, i. e., the commu-

nication pattern is dependent on the input.

Shortest Path. There are two main sequential algorithms for shortest path problems:

Dijkstra’s algorithm and the algorithm of Floyd-Warshall. For the All-Pairs-shortest Paths

problem one can execute DIJKSTRA algorithm in parallel for all sources, or use a parallel

variant of FLOYDWARSHALL, which is based on matrix potentialization. This algorithm

is more communication efficient, but needs more than a constant number of supersteps.

Table 3.3 summarizes the results of Alexandre Tiskin [Tis01].

Minimal Spanning Tree. We give a formal definition of the problem in Section 3.2.2.

There are various parallel algorithms for the calculation of a minimal spanning tree. A

communication lower bound and an overview of BSP algorithms can be found in the PhD

thesis of Ingo Rieping [Rie00]. The algorithms are made up of some modules: BSP-

MERGE, which merges two minimal spanning trees,and BSPBORUVKA, which executes

3.1. BSP-Algorithms 23

Table 3.4: Running Times of MST Algorithms

Algorithm Result Reference

MSTMerge OTseq(n,m

p)+ m

p1+ε/2+Lfor p1+ε≤m/n[Rie00]

MSTDense Onloglog p+m

p+n·g+log p·Lfor p≤m

1-optimal if ploglog p=o(m

[ADJ+98]

MSTSparse Onlog p+m

p+m

p·g+(log p)2

log((m+n)/p)Lwith prob. ≥1−n−c

for any c>0, p≤m+n, log p≤m

n≤p,m≥nlogn

[ADJ+98]

BSP-AS Om+n

plog p+m+n

plog p·g+(log p)2

log((m+n)/p)Lfor n+m≥p[ADJ+98]

so-called Bor˚uvka steps. In each step a (super-)vertex selects a cheapest outgoing edge

and joins the two (super-)vertices to a new supervertex. A third module is BSP-AS, a

BSP version of the algorithm due to Awerbuch and Shiloach, which was originally devel-

oped for a CRCW-PRAM. By combining these modules one get efficient algorithms for

sparse as well as for dense graphs.

The minimal spanning tree problem is also used as an example for combining several

BSP algorithms together to an optimal algorithm, adaptive to the BSP computer and to

the input size. The modules BSPMERGE and BSPBORUVKA and their combinations are

described in more details in Section 3.2.2.

3.1.5 Optimization

One kind of optimization is searching for an optimal solution in a given space of feasible

solutions. Starting from a solution the algorithms perform small legal changes to the

instance and try to improve it concerning a given cost function. As the changes are usually

chosen by an algorithm with some random choices, a natural parallelization is to start the

search in parallel on different processors. After some time the processors exchange their

locally best solutions and start searching at the globally best solution found so far. This

scheme fits perfect to the BSP model: in each superstep the local search is done, at the

end of the superstep the best solution is computed by a parallel prefix operation using

max (or min) as combine function. This approach is used by Klaus Brockmann and Ingo

Rieping for a parallel tabu search algorithm (unpublished), based on parallel tabu search

implementations from Stefan Bock and Otto Rosenberg [BR00].

3.1.6 Simulations

Discrete Event Simulation. In a discrete event simulation there are events at certain

points of time. These events have to be processed in non decreasing time order. Each

24 Algorithms

event can change the state of the simulation, and introduce new events.

One approach for a parallel algorithm is to distribute parts of the simulation to the

processors and send events concerning parts simulated on other processors by messages.

This is implemented by an event queue locally on each processor. For high efficiency

these queues cannot be synchronized with each other, i. e., the events processed in paral-

lel may not be in the right order concerning global time. One solution is called Optimistic

Discrete Event Simulation. Each processor has its own local virtual time (lvt) and pro-

cesses events in its local queue in non decreasing time order. If it receives an event with a

timestamp earlier than the own lvt, it rollbacks the simulation (time warp) and restart it at

that time, sending anti-messages for each wrong message it has sent in the past.

Several bulk synchronous parallel algorithms for optimistic discrete event simulation

are proposed by Radu Calinescu [Cal96].

N-Body. The N-body problem is the problem of simulating the movement of a set of

bodies (or particles) under the influence of some type of force, for example gravitational

or electrostatic. A basic approach is to simulate the system by advancing the bodies in

discrete time steps. In each time step, the algorithm computes (or approximates) the force

exerted on each body due to all other bodies; this determines the acceleration and speed

of that body during the next time step.

Efficient parallel implementation of O(N)adaptive tree codes for the BSP model are

given by David Blackston and Torsten Suel [BS97].

3.1.7 Algebraic Problems

Algebraic problems have been addressed by parallel algorithm designers since long ago.

Operations on matrices, for instance, multiplication of two matrices, are often used as

an introduction to parallel algorithms. Many algorithms are oblivious, they have a static

communication behavior that simplifies things like load balancing. The results for the

following algorithms are summarized in Table 3.5.

Matrix Multiplication. There are different algorithms for matrix multiplication. A BSP

version of the standard algorithm is given as a simple example of a tightly synchronized

algorithm by Valiant [Val90], a faster algorithm using the idea of Strassen to reduce the

work is given by William F. McColl [McC95a].

McColl and Tiskin propose other algorithms optimizing the memory use. In [MT99]

they analyse memory efficient BSP algorithms for both the standard and the fast matrix

multiplication.

Solving Linear Equations. A system of linear equations over a domain Dis given by a

matrix A∈Dn×mand a vector b∈Dn. The problem is to find one or all solutions x∈Dn

3.1. BSP-Algorithms 25

Table 3.5: Algebraic BSP algorithms

Problem Algorithm Result Reference

Matrix Multiplication naive O(n3

p+n2

√p·g+L)[Val90]

standard On3

p+n2

p2/3g+L,p≤n2[McC95b]

Strassen Onlog27

p+n2

p2/log27·g+L,

p≤n2log27/(2+log27)

[McC95a]

Dense Matr.×Vec. standard On2

p+n

√pg+L,p≤n[McC95b]

Sparse Matr.×Vec.1Ocn

p+n+1−1

pn·g+L[Bis04]

Gaussian Elimination block rec. On3

p+n2

pαg+pαL,1

2≤α≤2

3[Tis07]

with pairwise elem. On3

p+n2

pαg+pαL,1

2≤α≤2

3[Tis07]

LU Decomposition On3

p+n2

√p·g+n·L[Bis04]

On3

p+n2

√p·g+√p·L[McC95b]

On3

p+n2

p5/8·g+nlog p

log(L/g)·L,

p≤n8/5

[Juu97]

Solution Triangular

Lin. System

back sub-

stitution On2

p+n·g+p·L[McC95b]

Fourier Transform. FFT Onlogn

p+n

p·g+L,n≥p2[Bis04]

Exponentiation in

Finite Fields

addition

chains O(log p·(g+L)) [Nöc01]

Poly. Factorization [BvzGG+01]

1cdenotes the number of non-zero elements of the matrix

that satisfy the equation A×x=b. There are different algorithms for solving this equa-

tions with different objectives: A parallel version of the standard gaussian elimination is

analyzed by Alexandre Tiskin [Tis03, Tis07]. If there is more than one such system with

the same matrix A, one could split Ainto a product of two triangle matrices A=L×U.

Knowing Land U, the solution can be computed efficiently for various instances of b.

This so called LU-decomposition is one of the examples in [Bis04].

Other Computer Algebraic Problems. Michael Nöcker analyses the problem of ex-

ponentiation in finite fields in his PhD thesis [Nöc01] and propose data structures for fast

algorithms. As a computation model for the parallel algorithms he uses amongst others

26 Algorithms

the BSP model. Together with others he also uses BSP for fast factorization of large

polynoms [BvzGG+01].

3.2 Composition of Efficient Nested BSP Algorithms

The results in this section are based on work done for my diploma thesis [Bon02]. The

actual running time of implementations of parallel algorithms depends on two groups of

parameters, namely “software” parameters, for example, the number of memory accesses

or the degree of a broadcast tree, and “hardware” parameters, for instance, the time nec-

essary to set up a communication or the number of available processors. This leads to the

observation that for different parameter constellations different “plain” algorithms are the

fastest ones. Furthermore, sophisticated parallel algorithms often introduce subproblems

or make recursive calls. For example, efficient parallel algorithms for minimum spanning

tree computation use parallel sorting algorithms, broadcast methods, and make recursive

calls. For each subproblem, one can choose between different algorithms.

Hence, in order to have efficient parallel programs for an actual machine, they have to

be configured. This means that one has to decide which parallel algorithm has to be taken

and which subroutines have to be used on what portion of the parallel machine depending

on hardware parameters as well as on the input instance for the algorithm.

For the BSP model and its extensions, a large variety of efficient parallel algorithms

for many problems has been developed and quite accurately analyzed (see Section 3.1).

This enables the programmer to choose from a pool of available algorithms for the

composition of the final program on his or her parallel machine. However, configuring a

parallel program becomes quickly very complex due to the various parameters and (pos-

sibly mutually influencing) dependencies. Therefore, it should be done automatically by

a program we call configurator [BMW02].

3.2.1 The Configurator: Input/Output Specification and Algorithm

In what follows, we define how to describe algorithms and introduce the term schedule

for a given problem. This is the necessary adaptation of the BSP model. A schedule fixes

the algorithms and all free parameters to be used to solve a problem and all occurring

subproblems. The input of the configurator is the problem Π, the BSP computer (i. e.,

which BSP parameters apply), and a specification of the input of Π. The configurator

works on a library of algorithm descriptions and outputs the schedule. This schedule is

used during the execution to determine the real (sub-)program that will be executed. It is

the equivalent to the actual parallel program.

Algorithm Description. An algorithm description A consists of the following five com-

ponents:

3.2. Composition of Efficient Nested BSP Algorithms 27

BinTreeBroadcast:= (Broadcast,pcount,vcount,t,scount,s)

•pcount(n,p,pmax):={p}, use all processors

•vcount(n,p,c):=1, only one variant

•t(n,p,c,v):=dlog2pe(2·max(n,B(p))·g(p)+L(p)),

one superstep per tree level, nbytes data send to each of the 2 children

•scount(n,p,c,v):=0, no subproblems

•sundefined function

Figure 3.2: Example for an Algorithm Description for BINARYTREE Broadcast

1. The problem Πthat the algorithm solves.

2. The set pcount of feasible machine sizes, i. e., the machine sizes for which the algo-

rithm Acan work depending on the input for A.

3. The number vcount of different possible choices for fixing the free algorithmic pa-

rameters of A. For example, in a broadcast algorithm this might be the number of

feasible tree degrees.

4. The function tthat computes, for a given variant of the algorithm, the running

time of Awithout the time Awill spend in subcalls. Note that in the algorithm

description, it is not known in advance which algorithms for the subproblems will

be used in the schedule.

5. The function sthat computes a list of subproblems for all possible variants of A.

Each list item contains the subproblem, the description of the input of the subprob-

lem, and the processors involved.

Figure 3.3 shows an example for the Broadcast algorithm TREE. A more formal

specification of the components and a more detailed description can be found in [Bon02].

We say that A solves problem Π.

Schedule. Given a set Pof problems and a set Aof algorithm descriptions, a valid

schedule S for a problem Π∈ P and its input description is defined recursively. An al-

gorithm A∈A solving Π∈P is fixed, as well as all free parameters, and there are valid

schedules for all occurring subproblems Π0

i∈P. The recursion terminates when there are

28 Algorithms

Let dv≥2 be the v-th integer number, s. t. dv|p(dvis the tree degree of variant v).

TreeBroadcast:= (Broadcast,pcount,vcount,t,scount,s)

•pcount(n,p,pmax):={p}, use all processors

•vcount(n,p,c):=|{d∈{2,...,p};d|p}|,

variants are all possible tree degrees d,dis valid tree degree ⇔ddivides p

•t(n,p,c,v):=max(n,B(p))·(d−1)·g(p)+L(p),

time for the first level of the tree

•scount(n,p,c,v):=0 if p=dv

dvotherwise

If the degree dvis smaller than p, there is one subproblem (Broadcast)

•s(n,p,c,j,v):= (Broadcast,n,p/dv,1,dv),

input for subproblem Broadcast has input size nfor p/dvprocessors, and is exe-

cuted dvtimes in parallel

Figure 3.3: Example for an Algorithm Description for TREE Broadcast

no further subproblems. So a valid schedule Scan be viewed as a schedule tree directed

from the root to the leaves.

Let Sbe a valid schedule for a problem Πthat has to be executed on a pprocessor

BSP machine given by its machine parameters. The cost of S, i. e., its predicted running

time on the BSP machine, is defined along the schedule tree. The cost of the root is the

running time of algorithm A(without the time spend in solving subproblems) given by

the function t(see point (4) above) plus the sum of the cost of all children of the root.

Configurator. We have implemented a prototypical configurator that computes a sched-

ule tree (in a bottom-up way) and, hence, a valid schedule with minimum cost by a brute

force search testing all possible valid schedules. Note that this computation is offline,

i. e., it is done only once before the schedule is used for many same-sized inputs. The

configurator works for arbitrary problems Πand algorithm descriptions A. Although the

execution time of the configurator may be exponential in the number of algorithms, for

our MST experiments the configuration was finished in less than five minutes.

3.2.2 An Example: Parallel Minimum Spanning Tree Algorithms

As example for a complex problem with a rich combinatorial structure, a usually irregular

communication pattern, and a variety of sophisticated algorithms that in turn use clever

3.2. Composition of Efficient Nested BSP Algorithms 29

subalgorithms, we use the problem of computing a minimum spanning tree (MST) for

an undirected weighted graph. For the algorithmic background of the MST problem,

see [CLR90, Chap. 24].

Minimal Spanning Tree (MST). Given an undirected graph G= (V,E)with edge

weights w:E→Qa tree T= (V,F)with F⊆Eand |F|=|V|−1 is called a spanning

tree. The weight w(T)of a spanning tree is w(T):=∑

e∈F

w(e).Tis a minimal spanning

tree, if and only if w(T)≤w(T0)for all spanning trees T0of G.

3.2.3 Algorithms for the MST problem

Each of our implemented algorithms is based on the following three basic operations:

1. Algorithm KRUSKAL: Kruskal’s sequential algorithm ([CLR90, Sec. 24.2]) tests

for every edge, in order of increasing weights, whether it can be included in the min-

imum spanning tree, i. e., whether it connects two connected components (called

supervertices) created by the edges chosen so far.

2. Operation “Bor˚uvka step”: In a Bor˚uvka step (for a nice and detailed description,

see [Göt98]) each supervertex selects its cheapest outgoing edge. These edges are

added to the MST edges, avoiding cycles (by construction, these cycles are not

longer than 2). After that, the new supervertices, i. e., the connected components,

are calculated, the edges are relabeled according to the new supervertices, and all

edges belonging to the same supervertex are removed. This step reduces the number

of vertices at least by a factor of two.

We have implemented two different algorithms for this problem, namely DENSE-

BORUVKASTEP which is step (2) of algorithm MST-DENSE in [ADJ+98], and

BORUVKASTEP which is in essence from [DG98, Göt98].

DENSEBORUVKASTEP is specially designed for dense graphs. It calculates the

lightest edge of all locally stored edges and then executes a parallel prefix operation

to determine the edges with global minimum weights, for every supervertex.

BORUVKASTEP creates adjacency lists for all vertices by grouping edges of the

same vertices by integer sorting. Then the minima of each group are calculated by

a parallel segmented prefix operation (see [Göt98], Section 4.1.4). Our implemen-

tation uses a sequential algorithm for computing the connected components.

3. Operation “MSTMERGE”: This operation (Step (2) of MSTMERGE in [ADJ+98])

merges local MSTs. It uses a d-ary communication tree. dis a free parameter to

be set by the configurator. Each tree node sends its MST to its predecessor, the

30 Algorithms

MSTWITHMERGE

MST

Sort

KRUSKAL

MSTMerge

BoruvkaStep

MERGESORT

SegmentMinimum

IntegerSort

INTEGERSORTWITHSORT

BORUVKASTEP

EDGEMINIMUMTREEPREFIX

Broadcast

TREEBROADCAST

SCATTERGATHERBROADCAST

ROOTBROADCAST

SEQRADIXSORT

INTEGERWITHMASTERSORT RADIXSORT

MSTBORUVKAANDMERGE

EdgeMinimum

INTEGERADDTREEPREFIX

MSTMERGE

DENSEBORUVKASTEP

MSTBORUVKA

problem

ALGORITHM

solved by

creates

subproblem

DISTRIBUTEDSORTWITHMASTER

ScanAllAddInteger

ScanAddInteger

Figure 3.4: Map of Algorithms used for Configurator Test

predecessors merge the MSTs by calculating an MST of all edges received. In the

end, the root of the communication tree knows the global minimum spanning tree.

Figure 3.4 shows a full map of all used algorithms, the subproblems, and their rela-

tionships. Note that the map even contains cycles that are caused by recursive calls. Of

course, the configuration terminates because the parameters usually decrease.

In the following, we give short remarks on the three parallel MST algorithms that can

be found in the map. The algorithmic description of our general solution of the MST

problem has three main variants: MSTWITHMERGE, MSTBORUVKA, and MSTBORU-

VKAANDMERGE.

MSTWITHMERGE: This algorithm solves the MST problem by calculating the local

3.2. Composition of Efficient Nested BSP Algorithms 31

0000000000000011111111111111

0000000000

1111111111

0000

1111

1 2 4 8 16 32

WITHMERGE

BORUVKA

1024 edges

1024 vertices

(a) p=2

0000000000000011111111111111

0000000000

1111111111

1 2 4 8 16 32

BORUVKA

ANDMERGE

BORUVKA-

1024 edges

1024 vertices

k=3

(b) p=4

0000000000000011111111111111

00000000

11111111

00000000

11111111

1 2 4 8 16 32

ANDMERGE

BORUVKA

BORUVKA-

1024 edges

1024 vertices

k=2

k=3

0000000000000011111111111111

00000000

11111111

000000

111111

1 2 4 8 16 32

BORUVKA

ANDMERGE

BORUVKA-

MERGE

WITH-

1024 edges

1024 vertices

(d) p=16

Figure 3.5: Results of the Configuration on a Pentium III Workstation Cluster with SCI

spanning tree and merges all these trees with operation MSTMERGE. Then the

result is distributed from the root to all nodes.

MSTBORUVKA: This algorithm executes Bor˚uvka steps until the number of superver-

tices is 1. Since each step reduces the number of supervertices by at least a factor

of 2, at most dlog2neof these steps have to be executed (ndenotes the number of

vertices).

MSTBORUVKAANDMERGE: This algorithm combines Bor˚uvka steps and merging. At

the beginning it executes some number kof Bor˚uvka steps in order to reduce the

number of vertices, and then it calculates the minimum spanning tree using MST-

MERGE. The number kof Bor˚uvka steps is a free parameter of the algorithm and

has to be set by the configurator.

3.2.4 Experimental Evaluation of the MST Implementations

Configured Programs: the Schedules

The output of the configurator is a schedule that defines for the problem and each occurred

subproblem which algorithm on which processors should be used. Figure 3.6 shows an

example of a schedule for the minimal spanning tree problem on a graph with 4096 ver-

tices and 32768 edges. Horizontal there are the processors, whereas the vertical direction

marks the time. Each box denotes an algorithm and the needed resources.

Figure 3.5 shows as the result of the configurator the selected algorithms and the

fixing of the free parameters for different sizes of the input graphs and different number

of processors for the parallel machine interconnected as a 2-dimensional torus of SCI

links.

On a computer with a fast network, the algorithm MSTWITHMERGE is chosen on

small graphs only. This algorithm has a small number of supersteps, namely logdp, but

calculates a minimum spanning tree sequentially on (n−1)dedges for nvertices in every

node of the d-ary communication tree. Also, most of the processors (p−p/di−1) are idle

32 Algorithms

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

MergeSort

20, 32768

16, 4096

MergeSort

20, 4096 \alg{Kruskal}

2048, 4096 20, 4096

MergeSort

20, 4096

MergeSort

20, 4096 20, 4096

MergeSort

20, 4096

16, 4096

MergeSort

20, 4096 \alg{Kruskal}

2048, 4096 20, 4096

MergeSort

20, 4096

MergeSort

20, 4096 20, 4096

MergeSort

20, 4096

16, 4096

MergeSort

P1P2P4P5P6P7P8P9P10 P11 P12 P13 P14 P15

4096

EDGETREEREDUCEALL

4096

SCATTERGATHERBROADKASTk=16

INTEGERADDTREEPREFIX

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

2048, 32768

KRUSKAL

4096, 4096

KRUSKAL

2048, 4096

MSTMERGE

2048, 4094

KRUSKAL

2048, 4096

MSTMERGE

2048, 4094

MSTMERGEd=2

2048, 8188

KRUSKAL

2048, 4096

KRUSKAL

2048, 4096

MSTMERGE

2048, 4094

KRUSKAL

2048, 4096

MSTMERGE

2048, 4094

MSTMERGEd=2

2048, 8188

KRUSKAL

4096, 4096

KRUSKAL

2048, 4096

MSTMERGE

2048, 4094

KRUSKAL

2048, 4096

MSTMERGE

2048, 4094

MSTMERGEd=2

2048, 8188

KRUSKAL

2048, 4096

KRUSKAL

2048, 4096

MSTMERGE

2048, 4094

KRUSKAL

2048, 4096

MSTMERGE

2048, 4094

MSTMERGEd=2

2048, 8188

KRUSKAL

4096, 4096

4096, 32768, 4096, 0

DENSEBORUVKASTEP

4096, 32768

MSTBORUVKAANDMERGE

MSTMERGEd=2

MSTMERGEd=2MSTMERGEd=2

2048, 16376 2048, 16376

2048, 32752

Figure 3.6: Schedule for the Minimal Spanning Tree Problem of a Graph with 4096

Vertices, 32768 Edges, for a 16 Processor BSP Computer

3.2. Composition of Efficient Nested BSP Algorithms 33

00000000000000000001111111111111111111

000000

111111

00000000000000

11111111111111

128

1 2 4 8 16 32 64 128 256

WITHMERGE

BORUVKAANDMERGE

BORUVKA

1024 edges

1024 vertices

(a) p=2

00000000000000000001111111111111111111

0000

1111

0000000000

1111111111

0000

1111

00000000

11111111

0000

1111

128

1 2 4 8 16 32 64 128 256

WITHMERGE

BORUVKAANDMERGE

k=1

k=2

k=3

BORUVKA

1024 edges

1024 vertices

(b) p=4

00000000000000000001111111111111111111

00000000

11111111

00000000

11111111

00000000

11111111

128

1 2 4 8 16 32 64 128 256

WITHMERGE

BORUVKAANDMERGE

BORUVKA

1024 edges

1024 vertices

00000000000000000001111111111111111111

000000

111111

0000000000

1111111111

00000000

11111111

128

1 2 4 8 16 32 64 128 256

WITHMERGE

BORUVKAANDMERGE

BORUVKA

1024 vertices

1024 edges

(d) p=16

Figure 3.7: Results of the Configuration on a Pentium III Workstation Cluster with Fast

Ethernet

in round i. If the network is slow (as it is the case with Fast Ethernet), the change from

one chosen algorithm to another occurs for larger graph sizes, as can be seen in Figure 3.7

that presents the results of the automatic configuration for the cluster with Fast Ethernet.

An evaluation of the schedules, i. e., measurements of the running time of config-

ured algorithms, and more details of the used computers and their BSP parameters are

presented in Section 4.2.1 in the next chapter.

34 Algorithms

Chapter 4

Implementations

The BSP model was proposed as a bridging model between the hardware and the soft-

ware, i. e., it should support the development of portable algorithms and lead to efficient

implementation. In the previous chapter we gave an overview of the algorithms, in this

chapter we look at implementations. We start with programming environments which

support the development of BSP algorithms, including our contribution, the PUB library,

and give an overview of implementations of BSP algorithm in the second part.

4.1 BSP Environments

There are several environments to support the implementation of BSP programs, devel-

oped by various people with different objectives. To improve the compatibility, a small set

of core functions is defined as a standard called “BSP Worldwide Standard” [GHL+96],

and some libraries implement these functions.

The libraries were implemented with different motivations: some focus on small, sim-

ple, and portable code (e. g., BSPonMPI), some have adapted implementations for varied

parallel architectures (e. g., Oxford, PUB library); others allow to use the internet for dis-

tributed BSP calculations (e. g., Bayanihan, PUBWCL) or have a total different aim, for

example, Virtual BSP for debugging BSP programs.

In the following we give a short overview of libraries (see Table 4.1 for a list) for

bulk synchronous parallel computing. We focus on libraries for high performance com-

puting on systems with a high performance network and describe our implementation, the

Paderborn University BSP library, in more detail.

4.1.1 Related Work

In this section we describe four libraries for bulk synchronous parallel computing: Oxford

BSP Toolset, BSPonMPI, BSPK, and GreenBSP.

36 Implementations

Table 4.1: Overview of BSP Environments

Implementation Language Focus Reference

Oxford BSP Toolset C, C++, Fortran [HMS+98]

BSPonMPI C, C++ MPI based, portable [Sui07]

GreenBSP C, C++ basic functions [GLR+99]

PUB library C, C++ performance, functionalities [BJvR03]

BSPk C, C++ [FH96]

JBSP Java [GLC01]

Bayanihan Java master-worker with central

server

[Sar98]

PUBWCL Java Web-Computing [BGM06]

Dynamic BSP Grid Computing [MT04]

InteGrade Grid BSP C middleware [GGHK05]

NestStep Java, NestStep programming language [Keß00]

BSMLlib Objective CAML functional BSP [HL02]

Virtual BSP simulator debugging [MW00]

Oxford BSP Toolset

The Oxford BSP Toolset [HMS+98, Hil98] consists of implementations of the BSPlib

standard for various parallel architectures and some profiling tools. It was the first imple-

mentation of BSPlib and is sometimes referred to as Oxford BSPlib. It supports remote

memory access and message passing. Focussing on these basic functions there are only

20 primitives, all other features, including broadcasting and parallel prefix operations, are

implemented in a first level library on top of BSPlib.

There are many publications that deal with this implementation. Some describe the

usage [HMS+98], answer questions about BSP in general and the implementation of the

toolset [SHM96], others study implementation techniques, for instance, how to implement

BSP over a TCP/IP network [DHS97] or the Cray T3E interconnection network [HDS97],

or extend the library by load balancing and fault tolerance [HDL98].

BSPonMPI

BSPonMPI (cf. project webpage [Sui07]) is an implementation of the BSPlib standard

from Wijnand J. Suijlen under the guidance of Rob H. Bisseling, the author of the textbook

about BSP [Bis04] and maintainer of the BSP worldwide website. It is easy to install

(using the GNU autoconf tool) and should be runable on all POSIX architectures that

supports the message passing standard MPI [MPI95]. It is optimized for MPI, so the

author states it should be used wherever native implementations are not available, as it is

4.1. BSP Environments 37

faster than other implementations (PUB, Oxford BSP Toolset) using MPI.

Green BSP Library

The Green BSP Library [GLR+99] was designed to be as simple and portable as possible.

It can only transmit packets of fixed size (e. g., 16 bytes, can be configured at compile

time), so the user has to split the messages into packets and recombine them at the des-

tination. The only receive function bspGetPkt returns a pointer to one packet, i. e., to

receive a large message one have to call this function quite often, and all parts have to be

copied together using a lot of memory copy operations, each of small size. This is neither

comfortable for the user nor very efficient for large messages.

BSPk

BSPk [FH96] is a library for message passing and distributed shared memory for BSP.

The authors introduced two new aspects. First, they implement lazy barriers, i. e., the

supersteps are not separated by global barriers, it is just guaranteed that messages sent in

superstep iwill be received in superstep i+1. If a processor will not receive a message, it

does not have to wait for other processors at the end of a superstep. To use this optimiza-

tion, the user has to inform the library about all expected incoming messages (the same

type of synchronization is implemented in our PUB library and called bsp_oblsync

there). This is formalized in the Oblivious BSP model (Section 2.2.3).

The second new idea is called Communicable Memory by the authors. They try to

tackle the problem of buffering and copying. If user data is sent, it normally has to be

packed in some kind of frame, for instance, a header containing its size and sender. So

the data is copied into a buffer after the header. If the communication network is fast

(compared to the processor and memory speed), the copying could decrease the commu-

nication bandwidth significantly, for example, a remote memory copy on the Cray T3E

computer is faster than a local copy, due to the limited memory bandwidth. In BSPk the

library can manage the memory, so it can be allocated it in proper frames. For the same

reason the PUB library can allocate memory for a message (bsp_createmsg).

Furthermore, BSPk was the first library that supports message passing and distributed

shared memory, other earlier implementations supported only one type of communication.

4.1.2 Paderborn University BSP (PUB) library

In this section we will introduce our Paderborn University BSP (PUB) library [BJvR03,

pub08]. It was developed to support high performance implementations of BSP programs

for monolithic parallel computers, and later extended to workstation clusters. The most

important design goal was efficiency, so we focus on avoiding buffering and copying

38 Implementations

MPI, TCP/IP, InfiniBand

Parix, shared memory

Linux, Solaris, AIX,

Unicos, Parix, Win32

Pthreads, Win32,

System V semaphores

BSP

Comm

Threads CL

System

architecture independend

Util

Pthreads, Win32,

System V semaphores

Linux, Solaris, AIX,

Unicos, Parix, Win32

MPI, TCP/IP, InfiniBand

Parix, shared memory

Figure 4.1: Modules of PUB

wherever possible. Furthermore, we keep in mind software engineering techniques. Al-

though developed using the language C, PUB is object oriented, i. e., there are objects

(e. g., messages), and methods for using them. PUB is available for many different ar-

chitectures, i. e., different hardware (e. g., CPUs, network devices) and various software

(operating system, communication libraries). To achieve this portability, PUB consists of

different modules (Figure 4.1), including one module for the operating system (System)

and one for the communication (CL).

Although developed for BSP implementations, PUB offers much more functionalities.

The BSP module is just one use of the more flexible communications functions. More

details about the BSP part of PUB can be found in [Rie00].

Basic BSP Functionalities

There are two different groups of functions for communication: message passing and

direct remote memory access. These two kinds of communication are widely used in all

parallel system, for example, the Message Passing Interface (MPI [MPI95]) is a common

standard for message passing, whereas the Shmem library from Cray supports remote

memory accesses. As PUB is a BSP library, all these functions have a bulk-synchronous

semantic: messages sent in one superstep are received in the next synchronization and

remote memory operations are also performed at the synchronization. Besides normal

barrier synchronization PUB supports the oblivious synchronization of the Oblivious BSP

model (cf. Section 2.2.3), where the user has to specify the number of received messages

and the synchronization does not need any communication at all.

Collective Communication Operations

Several global operations are provided: Broadcast and multicast algorithms, which au-

tomatically adapt their parameters like block size and tree degree to the hardware archi-

4.1. BSP Environments 39

tecture, parallel reduce and scan operations with arbitrary associative and commutative

operators.

BSP Objects

Another feature is the possibility to dynamically partition the processors into independent

subsets. After a partition operation, each subsystem acts as an autonomous BSP com-

puter, i. e., subsequent barrier synchronizations involve only the processors in the subsys-

tem. Thus, the PUB library supports nested parallelism and subset synchronization. This

feature is especially useful if different subalgorithms have to be performed independently

in parallel. Without the possibility to create subsystems, the algorithms have to be inter-

leaved, which complicates the implementation. Furthermore, the implementation might

be inefficient if the algorithms have different synchronization requirements.

With these BSP objects it is also possible to create totally independent BSP computers,

which can be used in different threads for executing more than one parallel algorithm at

once. Virtual processors (used for load balancing on heterogenous workstation clusters,

Chapter 6) are also realized with BSP objects.

Communication Layer of PUB

The communication functionality of PUB is based on active messages. This is similar

to event based system, for example, graphical user interfaces. Each received message

triggers the execution of a message handler. To describe the message processing in more

detail, we look at two communication objects: commlinks and commcontexts. A comm-

context is a kind of virtualization of the parallel message passing system. In each context

messages can be sent and received independently of messages in other contexts. This is

especially useful when calling sub programs and is the basis for the BSP partition func-

tionality, where a context covers only a part of the processors. The creation of such a

context should be very fast, in particular it should not need any communication at all.

Each context has an identifier used to route messages to it. Every processor needs to

know all identifiers of the contexts on all other hosts. As we do not want to communi-

cate, these identifiers are the same on all hosts. This means that the order of creating and

destroying contexts have to be the same on all participating processors.

To support really independent communication, which can even be used simultaneously

in different threads, PUB uses commlinks. Commlinks are similar to commcontexts, but

can be used independently of each other. This freedom leads to a performance drawback:

the creation of a commlink needs a global total exchange communication. Commlinks

can not be used standalone, but they are the containers for the commcontexts, i. e., a

commlink can contain one or more commcontexts, whereas each commcontext belongs

to exactly one commlink. Commlinks are also used to implement virtual processors,

virtual memory, and process migration (see Section 6.3).

40 Implementations

link

context

msg-list

drma

reduce

sync

msg handler

(a) normal

link

context

msg-list

drma

reduce

sync

msg handler

context

msg-list

drma

reduce

sync

msg handler

(b) bsp_dup

link

context

msg-list

drma

reduce

sync

msg handler

link

context

msg-list

drma

reduce

sync

msg handler

link

context

msg-list

drma

reduce

sync

msg handler

link

link is migrated

to host 5

(d) bsp_migratable_dup

with migrated processor

Figure 4.2: Comm Objects in PUB

Each message contains three identifiers in its header: one for the link, one for the con-

text, and one for the message handler. When a message is received, the receive thread calls

the message handler of the link, which calls the message handler of the context, which

executes the right message handler of the BSP layer. If a context or message handler is

not known yet, the message is buffered in a queue until the destination is registered. This

occurs, for example, in a BSP program, if a message of the next superstep is received.

The library creates one commlink with one commcontext at startup. The functions

bsp_dup and bsp_partion create additional contexts, bsp_threadsafe_dup,

bsp_exclusive_dup, and bsp_migratable_dup use extra links. There are sev-

eral message handlers registered by PUB in each context: one for normal messages, one

for remote memory operations, one for global operations like reduce, and one for synchro-

nization messages. Figure 4.2 shows the objects for (a) normal BSP programs, (b) with

BSP subgroups, (c) with independent BSP objects, and (d) with two virtual processors

(one has left the host and is executed on the host with id 5).

The message handler for BSP messages is registered every superstep, i. e., it will get

a new identifier in each round, so messages from different supersteps are not mixed.

Debugging

In general it is a hard task to find mistakes in implementations, thus a lot of tools were

developed to support the debugging, for example, debugger, which allows to set break-

points, step through a program step by step, and watch the values of variables. Most of

these tools support sequential programs only, some can handle programs using threads,

similar to the PRAM model, but there are only a few tools that support watching a paral-

lel program (e. g., TotalView). Thus a standard approach to find errors is to put a lot of

debug output into the program by hand. The PUB library supports different techniques

for debugging.

4.1. BSP Environments 41

Automatic Execution of a Debugger. If a fault occurs, for instance, a wrong memory

access (segmentation fault) or an assertion failed, PUB can automatically start a debugger

and attach it to the affected process. This feature is known from Microsoft’s developer

tools, there you are asked if you want to debug a program if an access violation occurs.

This is especially usefull for parallel programs running on a workstation cluster, because

it is an annoying job to find the host where the faulty process is being executed by hand.

Pointer Type Check. To each structure of PUB, for example, a message, an additional

field id is added, which stores the size of the structure. Each time a pointer to a structure

is given, the library checks this id. Thus pointers to wrong memory positions or not

initialized data structures are detected quickly.

Range Check. One class of errors are illegal memory accesses, i. e., a program writes

to a memory cell with an invalid address. Too small memory allocations are one reason

for this. Thus we support a kind of range checking of memory regions allocated by

the function malloc. The memory is organized in pages of size 4096 bytes for x86

processors. Each time the processor wants to access a memory cell, the virtual address

of the cell is translated into a real memory page and an offset inside the page by the

Memory Management Unit (MMU) inside the processor. The MMU checks, if there is

physical memory mapped to this address, and if it is accessable by the process, i. e., one

can have different access flags (read, write, execute) attached to memory pages. If there is

no memory, or the access is not allowed, an exception occurs and the operation system is

called, which can swap in the memory, or generate a segmentation fault signal or similar.

We use this feature to find accesses to the memory direct after an allocated block. We

will explain this by a simple example. If the program allocates 512 bytes, we allocate 2

memory pages of size 4096 each, and return a pointer to the offset 4096-512 inside this

block. Furthermore we change the access permissions of the last page to no access.

waste no access

0 4096

data

malloc's result

All memory allocations and freeings can be monitored, such that memory leaks can

be detected.

Message CRC check. To verify the communication system, PUB can calculate a check-

sum for all messages and check, if they are transmitted correctly. We use the standard

Cyclic Redundancy Check (CRC) [PW72] algorithm for the checksum.

42 Implementations

Table 4.2: Comparison of Oxford BSP Toolset and PUB Library

Oxford BSP Toolset PUB Library

Point-To-Point Messages X X

Direct Remote Memory op. X X

High performance op. X X

Oblivious synchronization – X

Broadcast/multicast library on top of BSP automatically optimized

for the machine

Collective communication

Operations

library on top of BSP X

Subgroups – Partitioning, independent

groups

Thread-safety – on most architectures

Virtual Processors – X

Fault tolerance extension [HDL98],

source not available

Checkpointing of virtual

processors

Load balancing for

workstation clusters

extension [HDL98],

source not available

Processmigration, various

load balancing

algorithms, easy to extend

Asynchronous message

passing

– possible, active messages

model

Debugging tools – execution of debugger,

pointer type checking,

range checking, memory

leak detection, message

CRC checking

Profiling call-graph profiling tool

[HJSV98],

BSP Pro [ZKX99]

show idle and compute

times, size of messages

for each superstep

Comparison with other BSP Libraries

Comparing with other implementations, our PUB libraries offers much more functionali-

ties as any other system. The support for decomposable BSP, oblivious synchronization,

support for virtual processors and memory, and fault tolerance are some of its features.

For a comparison with the Oxford BSP Toolset see Table 4.2.

Furthermore, the possibility to use asynchronous communication is very useful for

some algorithms, which does not fit well into the BSP model. Using this functionality, one

losses the advantages of BSP (e. g., simplicity, easy cost model), but for some applications

4.2. Implementations of BSP Algorithms 43

Table 4.3: Implementations of BSP algorithms

Problem Environment Platform Reference

Inner product BSPlib [Bis04, Bis07]

LU decomposition BSPlib, MPI Cray T3E [Bis04, Bis07]

FFT BSPlib, MPI SGI Origin 3800 [Bis04, Bis07]

Matrix

multiplication

Green BSP SGI Challenge, NEC Cenju,

Linux workstation cluster

[GLR+99]

Sparse-matrix-

vector

multiplication

BSPlib, MPI Linux workstation cluster [Bis04, Bis07]

Ocean simulation Green BSP SGI Challenge, NEC Cenju,

Linux workstation cluster

[GLR+99]

N-Body simulation

using Barnes-Hut

Green BSP SGI Challenge, NEC Cenju,

Linux workstation cluster

[GLR+99]

Minimal spanning

tree

Green BSP SGI Challenge, NEC Cenju,

Linux workstation cluster

[GLR+99]

PUB Linux workstation cluster,

SCI and Ethernet

[BMW02]

PUB Parsytec results on CC-48 [DG98]

Shortest paths Green BSP SGI Challenge, NEC Cenju,

Linux workstation cluster

[GLR+99]

Multiple shortest

paths

Green BSP SGI Challenge, NEC Cenju,

Linux workstation cluster

[GLR+99]

Volume rendering SGI PowerChallenge, IBM

SP/2, workstation cluster

[XL97]

this seems to be necessary.

4.2 Implementations of BSP Algorithms

For the success of parallel algorithms in general, and the BSP model in particular, it is

quite important that there are easy to use implementations of the algorithms. The main

reason for the success of Java is the availability of large, powerful class libraries with

implementations of algorithms for many standard problems.

There are some implementations in the bulk synchronous settings, but most algorithms

listed in Chapter 3 are analyzed only theoretically. Furthermore, there is no a central

instance providing a list or a library of implementations. Rob Bisseling, maintainer of the

BSP worldwide website [BSP07], has collected a list of researchers working in the BSP

44 Implementations

area, but since now there is no list of software or anything similar. Table 4.3 contains

examples of BSP implementations of algorithms. It shows the solved problems, the used

toolkits or libraries, the BSP computers for the evaluation, and the references.

4.2.1 Combining BSP Algorithms

Often implementations of algorithms by different providers cannot be combined to one

program because of missing standards for BSP implementations. There is the BSP world-

wide standard, which defines communication functions for BSP programs, but many im-

plementations use other libraries or are implemented directly on top of simple message

passing libraries like MPI. Another non trivial part is the specification, including the for-

mat and the distribution of the input, or the number of supported processors.

It is crucial for an efficient combining of different algorithms that the data is in the

appropriate format and does not need to be rearranged. For example, sometimes the input

of a subproblem is stored in a non consecutive way in the memory, but the subalgorithm

needs it as one memory block, thus one have to copy all data before calling the subalgo-

rithm. In many applications these memory copying leads to inefficient algorithms.

Furthermore, an efficient way of calling subalgorithms is needed, for instance, the

partition feature of the PUB library. Using this technique, it is possible to separate the

communication of the subalgorithms, and to run several subalgorithms in parallel. The

BSP worldwide standard does not support such functionality.

A proposal of a system to build libraries with BSP algorithm can be found in Sec-

tion 3.2.

Evaluation

We have implemented all algorithms of the map in Figure 3.4 on page 30 to evaluate the

schedules made by the configurator. First, we want to know if the running time of the

algorithm is predicted accurate enough to find the optimal schedule, furthermore, we are

evaluating the performance of the resulting algorithm.

The Used Parallel Machines. For the experiments, we used a cluster of 96 Linux work-

stations with two different communication mechanisms. The cluster was operated by the

Paderborn Center for Parallel Computer (PC2). Each node is a Dual-Pentium III worksta-

tion operating at 850 MHz and includes 512 MB memory. For the communication, there

are two alternatives: One can use a 2-dimensional torus of SCI links as interconnection

network, or Fast Ethernet with a Cisco Catalyst 5509 switch (and, hence, a complete

graph as interconnection network). For the communication, MPI and TCP/IP are used,

respectively.

4.2. Implementations of BSP Algorithms 45

0.02

0.04

0.06

0.08

0.1

0.12

0.14

10000 100000 106

time [s]

edges

MSTWITHMERGE

MSTBORUVKAANDMERGE

MSTBORUVKA

configured program

Figure 4.3: Running Time, Random Graphs, 2048 Vertices, 16 Processors, SCI

Table 4.4 presents the BSP* parameters of these two communication systems of the

parallel computer. They were used by our configurator for the prediction of the cost of

our implemented algorithms. A compilation of the BSP* parameters of further machines

(e. g., Cray T3E, Parsytec CC) can be found in [Rie00].

Measured Running Times. In the following we will use the parallel minimum span-

ning tree problem as an example and present the results of two significant series of mea-

surements, one for each communication mechanism. More measurements can be found

in [Bon02].

Table 4.4: The BSP* Parameters of the Pentium III Workstation Cluster

(a) MPI with SCI (2D torus)

p B [B] L[µs] g[ns−1]

2 316 2.10 32.4

4 509 4.49 37.5

8 389 6.28 43.9

16 21 9.08 194.5

(b) TCP/IP with Fast Ethernet (complete graph)

p B [B] L[µs] g[ns−1]

2 3092 98.9 117.1

4 1467 145.5 121.8

8 264 140.7 629.2

16 239 152.7 617.9

46 Implementations

0.5

1.5

2.5

10000 100000 106

ratio measurement/prediction

edges

MSTWITHMERGE

MSTBORUVKAANDMERGE

MSTBORUVKA

configured program

Figure 4.4: Accuracy of the Predictions, 2048 Vertices, 16 Processors, SCI

Figure 4.3 shows the results on the workstation cluster with p=16 processors. Our

graphs consist of n=2048 vertices and randomly chosen edges. Figure 4.3 shows the

running times, Figure 4.4 the ratio of the measurements and the BSP cost, i. e., the pre-

dicted times, for the SCI case. The divergence for the algorithms that use the operation

MSTMERGE is due to inaccurate predictions of the local work. In algorithms based

on merging, the sequentially merging of minimum spanning trees dominates the running

time. The model counts the local memory accesses of these operations. However, the

algorithms run much faster than predicted due to cache effects.

The configured program results, as Figure 4.3 shows the best running time, if the input

graphs are dense. Otherwise, it comes close to the fastest algorithm. The peaks that can

be observed for the configured program in both parts of the figure at 8192 edges is due

to the fact that the configurator prefers MSTBORUVKA to MSTBORUVKAANDMERGE

too early.

Figure 4.5 presents the measurements and the prediction accuracy ratio for the Fast

Ethernet communication on p=16 processors and graphs with n=8192 vertices. The

configurator does not choose the best variant which is mostly MSTWITHMERGE because

using Fast Ethernet the network load being a dominating parameter in the running time

is not sufficiently covered by BSP. More specifically, if the input graph is dense, during

a broadcast only one processor sends data, all other processors receive data. So there

4.2. Implementations of BSP Algorithms 47

0.2

0.4

0.6

0.8

1.2

1.4

10000 100000 106

time [s]

edges

MSTWITHMERGE

MSTBORUVKAANDMERGE

MSTBORUVKA

configured program

(a) running time

0.2

0.4

0.6

0.8

1.2

1.4

1.6

1.8

10000 100000 106

ratio measurement/prediction

edges

MSTWITHMERGE

MSTBORUVKAANDMERGE

MSTBORUVKA

configured program

(b) accuracy of the predictions

Figure 4.5: Results for Random Graphs with 8192 Vertices on 16 Processors, TCP/IP

48 Implementations

is no high network load, and the gap g, listed in Table 4.4 and measured under high

load, is inappropriately large. So the running time is considerably overestimated by the

configurator.

Concluding Remarks. In this section, we have evaluated the configuration approach to

obtain fast parallel BSP programs for solving algorithmically complex problems. As a

case study, we have presented the results of applying the configuration approach to the

minimum spanning tree problem.

The configured algorithm performs well in practice, although the predictions of the

running time is sometimes quite inaccurate. The main reason for this is the fact that the

algorithm description has to provide information about subproblems, depending only on

the description of the input. In our example, we are given the number of vertices and the

number of edges in the graph, and we have to predict the number of supervertices created

by executing a Bor˚uvka step, which is half the number of vertices in the worst case, but

normally much smaller. So the prediction gives an upper bound on the execution time,

but is not always accurate enough to compare the average running time of algorithms.

In [Bon02], a case study for the broadcast problem can be found. For broadcasting the

running time does not depend on the input, thus the predictions are more accurate.

Chapter 5

BSP for Parallel System on Chip

The bulk synchronous parallel model was designed to be a bridging model between the

development of algorithms and the hardware. In this chapter we study how BSP can help

to develop efficient parallel architectures by designing a parallel system on a single chip.

Furthermore, this shows that BSP is also feasible for on-chip parallelism, which has some

special characteristics, for instance, high bandwidth and small memory sizes.

Some of the work has been done in the project GigaNetIC1. We will start with a short

introduction of the project and continue to present related work in the area of parallel

systems on a single chip. The main part of the chapter deals with the architecture and the

evaluation of our system.

5.1 Project GigaNetIC

The GigaNetIC project aimed to develop high-speed components for networking applica-

tions based on massively parallel architectures [BSR+03]. A central part of this project

is the design, evaluation, and realization of a parameterizable parallel processing unit. A

particular attraction of the GigaNetIC project grounded on the interdisciplinary coupling

of the different research groups that marked the project right from the beginning.

There are members of three research groups involved. We from the Research Group

of Algorithms and Complexity develop and analyze models for the architecture and the

on-chip interconnection network. The Research Group of Programming Languages and

Compilers develops a parallelizing compiler, and the Research Group of System and Cir-

cuit Technology is intrusted with the realization of a resource-efficient system design

and with the development of hardware accelerators for special embedded applications to

achieve a higher throughput and to reduce energy consumption. As an industrial partner,

1GigaNetIC was founded by the Federal Ministry of Education and Research (Bundesministerium für

Bildung und Forschung)

50 BSP for Parallel System on Chip

Infineon Technologies provides state of the art chip production technologies needed for

such a complex System-on-Chip (SoC) design.

5.2 Related Work

There are different projects that tries to utilize the huge amount of transistors that can be

integrated into a modern chip. The first approach was increasing the sizes of the caches

and the number of execution units (instruction level parallelism, execution of more than

one instruction per clock cycle). This leads to very huge and complicated processors

like Intel’s Itanium2 (including up to 24 MB cache). But to improve the performance

significantly, one needs more processor cores. One possibility is to integrate standard

(huge) processors cores into a single chip, examples are the Power4 processor (IBM, first

dual core processor, 2001), Intel Core2Duo series, AMD Athlon 64 X2, Intel Itanium2

“Montecito” (2 Cores, 24 MB cache, largest processors with 1,720 million transistors),

and SUN UltraSPARC T2 “Niagara” (8 cores, 64 threads).

Another approach is implementing many more smaller processor cores, for example,

the Single-Chip Message-Passing Parallel Computer (SCMP) [BGB+04] developed at

Virginia Tech. The system consists of a 2-D mesh of tiles, each tile contains CPU, mem-

ory, and network router. The communication of the system is based on active messages.

There are thread messages and data messages. Thread messages start a thread at the desti-

nation, data messages contain an address and store the data at that address (direct remote

memory access). The network connections on the edge of the chip are used for input and

output.

Detailed models of the system were being developed for VHDL (Very High Speed In-

tegrated Circuit Hardware Description Language) and SystemC (C++ library for system

descriptions), and there seems to be a simulator for the system, because the authors pro-

vide several benchmark data: Floyd-Warshall all pairs-shortest-path (transitive closure of

the adjacency matrix), iterative conjugate gradient sparse-matrix solver, and the neighbor-

hood stressmark which estimates the gray-level co-occurrence matrix energy and entropy

by calculating histograms of the sums and differences of pairs of pixels in an image.

The Cell processor (Cell Broadband Engine Architecture, developed by Sony, IBM,

and Toshiba [KDH+05]) is a kind of mixture of both ideas. It consists of one standard pro-

cessor core (Power Processing Engine, PPE), eight fully-functional co-processors (Syner-

gistic Processing Elements, SPE), and a specialized high-bandwidth circular data bus con-

necting PPE and SPEs. The PPE runs the operation system, has control over the SPEs, and

can start, stop, interrupt, and schedule processes running on the SPEs. The SPEs are not

fully autonomous and require the PPE to do any useful work. They work as co-processors

to support the main core in certain applications, for example, decoding/encoding MPEG

streams, generating or transforming three-dimensional data, or undertaking Fourier anal-

ysis of data. The chip was designed for the video game console PlayStation 3, but there

5.3. Design of the Architecture 51

are other system using Cell available as well, including blade servers.

Due to the heterogeneous architecture of the system, it is not a a straightforward task to

develop algorithms for Cell. Ohara et al. introduced an MPI microtask model [OIS+06],

based on the message passing standard MPI. The programs have to be partitioned into a

collection of small microtasks that fit into the local memory of the SPEs. These microtasks

are scheduled statically to get an implementation for the streaming model of the Cell.

5.3 Design of the Architecture

In the first section we will describe the architecture of our parallel system on chip. We

will start with an overview of the components, specify the communication protocol, and

provide more details on the parts in the last subsection.

5.3.1 Overview

Looking at existing parallel computers, there are two different architectures: shared mem-

ory and distributed memory. A global shared memory is often considered to be very

comfortable for the users, as they can use threads to design parallel algorithms. The im-

plementation of shared memory is costly, because a memory bank can only be accessed

by a small constant number of processors at the same time. So the performance of these

architectures does not scale with the number of processors, or a big and thus expensive

crossbar is needed, which connects the processors to a huge number of memory banks.

Another approach is distributed shared memory, which leads to Non-Uniform Mem-

ory Architectures (NUMA). In NUMA, for example, SGI’s Altix, the access time to the

memory varies, for instance, some part of the memory may be local and thus faster.

Distributed memory systems are widely used, all cluster systems built of standard

components and a high speed network are of this kind, and most machines at the top of

the Top500 list of supercomputers [Str06] are distributed memory architectures. In our

architecture we try to combine the advantages of shared and distributed memory. We use

small clusters of processors including shared memory connected by a switched bus sys-

tem. These clusters are connected by a fast on chip interconnection network with routers

and communication controllers. The controllers have the capability to send messages au-

tonomously from one cluster to another without slowing down the computation of the

processors. The part used for the communication of a cluster is called Switchbox.

Figure 5.1(a) shows an overview of the architecture, the blue boxes are the processors

with local caches, the shared memory and the communication controllers are inside the

Switchbox. The configuration (3×4×4) shown in the picture is just an example and can

be adjusted to the needs.

52 BSP for Parallel System on Chip

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCache

Cache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

Cache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

(a) mesh of clusters

CacheCacheCache

CPU

Cache

Shared Memory

Snoop

FIFO

CPUCPU

Crossbar

CPU CPU CPU

(b) single processor cluster

Figure 5.1: Architecture

5.3.2 Communication Protocol

The communication on the chip between clusters of processors is based on message pass-

ing. To avoid the problem that a large message can block a lot of small ones, we do

not allocate the whole path from the source to the destination of a message as done in

wormhole routing. Instead, messages are split into very small parts called flits. Each flit

is transmitted to the next Switchbox in one cycle using parallel wires. To recreate the

message, each flit contains – in addition to the routing information – a flit id and a flow id.

The flit id is the number of the flit inside the message; the flow id is used to distinguish

between messages. The first flit of a message is marked as a packet header and contains

the size of the message. This is used to allocate enough buffer space for the flits, and to

notice when the last flit has been received.

The routing is oblivious, i. e., the path for a packet is dependent only on its source and

destination. Each flit is sent to the next switchbox again and again, until an acknowledg-

ment is received. Waiting for these acknowledgments leads to a reordering problem. As

we want to send a flit every cycle, we cannot wait for the acknowledgment, so we use a

ring buffer and try to send a flit every cycle. This can change the order of the flits, for

instance, if the first flit cannot be sent, the second flit is tried before the next try of the first

one. As the head of a message must arrive first, we do not allow a data flit to overtake a

header flit. Furthermore, we do not allow the other direction, a header flit cannot pass a

data flit, too. This is needed because we must avoid that messages from the same sender

are mixed.

5.3. Design of the Architecture 53

5.3.3 Components

In Figure 5.1 (b) the components of a single cluster are shown in more detail. The Switch-

box contains a crossbar with buffers, the communication controller (CC), which handles

the message passing, the shared memory, and the snooping-slave (Snoop) for the cache

control.

CPU. As a central processor unit we use S-Core, a RISC processor core that is binary

compatible with Motorola’s M-Core M200 architecture [Mot98a, Mot98b]. The processor

has been developed at the University of Paderborn by the Department of System and

Circuit Technology as a soft core using the hardware description language VHDL. This

gives us the opportunity to reuse the core for different target technologies. The S-Core is

a 32 bit RISC two-address machine with a straightforward load/store architecture. It has

two banks of sixteen 32-bit registers, which can be alternatively used in user mode. Each

instruction has a fixed length of 16 bits. This results in a high code density and therefore

reduces memory demands for the application code. Beyond that, each instruction, except

for the load and store instructions, works only on the registers. The execution of most

instructions takes one clock cycle. The S-Core features a short three-stage pipeline and

an addressing interface that supports byte granular access.

The architecture can easily be expanded by adding application-specific instructions or

coprocessors to the core, examples include application specific optimizations for packet

processing for a network processor [NPPR07, NPSR05]. Our implementation of the CPU

is resource-efficient and delivers reasonable performance for embedded systems. In a

0.13 µm 1.2 V Infineon standard cell process the S-Core needs less than 0.25 mm2, and

clock frequencies of more than 200 MHz can be achieved in this technology.

We use this processor as a demonstrator, because it is available and we are free to

modify it as we want, for example, implement new instructions for the synchronization.

As the processor is very simple and does not have a floating point unit, it is not suitable

for high performance scientific computing. But it can be replaced by other, more complex

processing elements. In the following, we will focus on the communication architecture,

which is useful for all kind of parallel on chip systems and is independent of the type of

processors.

Cache System. Each processor owns a small cache for instructions and data to reduce

the congestion at the shared memory. The caches are organized with the MOESI cache

coherence protocol [AMD06]. MOESI supports the following states of a cache line:

Modified: Cache line holds the most recent copy of the data, data in memory is incorrect,

and no other cache contains a copy.

Owned: Cache line holds the most recent copy like in state modified, but other caches

may also have copies. Only one cache may own the data, all other caches having

54 BSP for Parallel System on Chip

the same data have state shared. The owner is responsible for writing the data back

to the memory.

Exclusive: Cache contains the only copy of the data, and the data in the memory is up to

date.

Shared: Cache line holds the most recent copy, but other cache may have copies as well.

The data in the memory is correct, or one other cache contains the data and has state

owned.

Invalid: Cache line does not contain valid data.

As the bus connecting the caches with the shared memory (RAM) is a switched one

(ARM Advanced Microcontroller Bus Architecture (AMBA) with a Multi-Layer-AHB-

Interconnect-Matrix [ARM99, ARM01]), the caches are not able to monitor the bus by

itself. Thus, we use a snooping slave to control the caches. Each write access of a proces-

sor that changes a cache line state is noticed to the snooping slave and broadcasted to all

other caches. Also, in case of cache miss, the snooping slave is asked if the data is stored

in any other cache. This is needed to get the newest copy, and also reduces the needed

memory bandwidth, as the data is fetched from another cache if possible. Details on the

hardware implementation can be found in [NLPR07].

Communication Controller. The communication controller (CC) manages the com-

munication of a processor cluster with the world outside. It is able so send and receive

messages autonomously. The CC uses memory mapped input/output (I/O), i. e., it is ad-

dressable by the normal memory bus. It supports asynchronous sending, blocking re-

ceiving, and probing for messages or acknowledgments. In the following we present the

process of sending a message to another cluster.

Sending. To send a message, a processor executes a write access to the bus. Different

from normal I/O-registers, the send functionality does not use a single address, but a larger

range of the address space. The address bits are used to transmit additional information

to the CC in a single write operation on the processor bus. The start address of the data

is transmitted by the data, the information about the destination and the length is encoded

in the address of the write access. The following data is encoded in the address:

Bits 31–28 27–25 24–21 20–17 16–6 5–0

Data fixed to 01112z x y size id

zis the output port at the destination cluster, x,yare the relative coordinates of the desti-

nation cluster, size is the size of the data (in flits), id a user defined identifier.

5.3. Design of the Architecture 55

message 1 message 2 message 3 free memory

Figure 5.2: Receive Buffer

The communication controller stores the send request in an outgoing queue. This

queue is processed asynchronously. Until it is empty, the CC dequeues one request, reads

the data from the shared memory, and sends it through the on-chip network. As the mem-

ory is a dual port memory, memory accesses of processors and communication controller

can be executed concurrently in our design.

Receiving. Incoming flits are combined to messages by the communication con-

troller. It uses a part of the shared memory as a ring buffer for received messages in

the following way. One registers stores the address of the free memory. For each header

flit introducing a new message, the suitable space in the memory is reserved by advanc-

ing the free-memory-pointer. Next the start of the packet is stored in another register.

For each following data flit, the controller searches these registers for the corresponding

message and stores the data of the flit at the right memory position. Thus, the commu-

nication controller needs one register for each simultaneously receiving message. Our

implementation uses at least as many registers as we have clusters in the system, so we

can guarantee that we will always find a free register if a new message arrives.

Figure 5.2 shows an example of a receive buffer with three incoming messages, mes-

sage 3 is completely received, for the other messages some parts are missing. In message 1

two flits have overtaken the second flit which is still missing.

To receive a message, each processor can read from a special memory address. The

result is either the address of a received message in the shared memory or zero if there is

no message available and the address for the non blocking mode was used. One can also

wait for either a message or an acknowledgment.

A message should be freed as soon as possible to allow the reuse of the buffer space

for other incoming messages. If there is not enough space in the buffer, incoming header

flits are rejected (i. e., they are not acknowledged, so the other node will try so resend

it every second cycle) and the network could be overflowed by blocked flits, especially

since no other flit is allowed to overtake the header flit.

More details on the communication network and its implementation in hardware is

given in [PNPR07].

56 BSP for Parallel System on Chip

Table 5.1: Performance of Simulators

Simulator System Size Application Sim. Speed

Cluster-Simulator21 S-Core Dhrystone 7 231 kHz

Cluster-Simulator24 S-Core Dhrystone 2 290 kHz

SBArray-Sim21 Cluster, 1 S-Cores Dhrystone 315 kHz

SBArray-Sim21 Cluster, 4 S-Cores Dhrystone 173 kHz

SBArray-Sim, Amba Matrix21 Cluster, 4 S-Cores Dhrystone 167 kHz

SBArray-Sim, Amba Matrix24×4 Cluster, 64 S-Cores Dhrystone 10 kHz

SBArray-Sim, Amba Matrix24×4 Cluster, 64 S-Cores Total Exch. 11 kHz

VHDL simulation32×4 Cluster, 32 S-Cores 100 Hz

FPGA implem. independent 20 MHz

2Intel Core2 CPU 6700, 2.66 GHz

3Pentium IV, Hyperthreading, 3 GHz

5.4 Toolchain

This section shows the toolchain for the parallel system on chip. To ease the software

development and testing for the system, we have developed a couple of tools: cross com-

piler, shell scripts for compiling and linking, standard libraries, communication libraries,

and several simulators.

5.4.1 Simulators

For the evaluation of the architecture a couple of simulators have been developed. They

differ in the parts of the system they can simulate, the needed resources (computer,

FPGA), and the accuracy. Tabular 5.1 gives an overview over the simulators and their

performance, the software simulators are described in more details in the following.

The fastest simulator simulates a cluster of up to four processors with shared memory

and a simple multiprocessor cache. It is very fast and gives a rough impression of the

computation performance of the system. But as it does not simulate the bus system and

the caches in whole details, some effects like bus congestion cannot be examined. The

instruction decoder of the simulator is generated automatically from a processor descrip-

tion in UPSLA (unified processor specification language [KLST04]), so it is very useful

to evaluate changes to the instruction set of the processor quickly. The simulator outputs

a lot of statistics, for instance, cycles, wait cycles, occurrences of instructions, and pairs

of instructions that can are candidates for creating a new superinstruction, for example a

combined load-and-xor (Figure 5.3).

The second simulator provides a cycle accurate simulation of the whole system. It

uses SystemC, a C++ library to describe and simulate concurrent processes. SystemC

5.4. Toolchain 57

(a) cycles and waiting cycles (b) occurrences of instructions

Figure 5.3: Visualization of the Profiling Output of the Cluster Simulator

width 2

height 2

number_of_pes 4

0 0 0xd0000000 dhrystone.startup

0 0 0xd000002c dhrystone.0

0 0 0xd01050dc dhrystone.1

0 0 0xd020a18c dhrystone.2

0 0 0xd030f23c dhrystone.3

stack 0x008242e0 0x008342e0 0x008442e0 0x008542e0

comm_buffer 0xd08642e0 0xd08742df

cache_size 16384

line_size 16

associativity 2

mem 0xd0000000 0xd08742df

debug dhrystone.debug

ambamatrix yes

Figure 5.4: Configuration File for SystemC based Simulator SBArray-Sim

was originally developed by Synopsys, Inc., an electronic design automation company,

but later (1999) turned into an open source project [sys07].In 2005, IEEE approves the

IEEE 1666 -2005 standard for SystemC [IEE05]. The simulator is based on the source

code from Norman Nagel for his diploma thesis [Nag04], but extended to simulate our

architecture more precisely (communication, bus system, caches), and optimized.

SystemC provides C++ classes and templates for modules and connections (ports and

signals) between them. It is possible to specify action methods and event handlers, which

are called if a port state changed. In our implementation, components (e. g., CPU, mem-

ory, and cache) are modules, but also busses, the arithmetic logic unit (ALU), and contain-

ers (e. g., CPU-cluster) are realized as SystemC modules. SystemC contains a scheduler

which executes all action methods and calls the appropriate event handlers. The sim-

ulation must not be dependent on the order of the execution, events that happen at the

same time are processed sequentially in arbitrary order. Although the simulated system is

parallel, the free SystemC implementation provided by the SystemC community [sys07]

does not support parallel computers and does not profit from multi core architectures, for

instance, the Intel Core2Duo used for our benchmarks.

The simulator is configured by a configuration file. An example is shown in Fig-

ure 5.4. This file contains the size of the system (width×height×number_of_pes), the

memory map, i. e., files mapped into the memory (code, initialized data segments), ad-

58 BSP for Parallel System on Chip

1SIGINT: cancel simulation

2Simulation time: 41273

3Cluster (1,1)

4communication controller

5input: recv 0, count 0, ptr 0, used 0

6output: write 0, count 0, ptr 0, used 0

7ncore 0

8PC: 0x00000648, opcode ?

9PC-history 0x000005fa 0x00000290 0x00000286 0x000006ee 0x000006de

10 Backtrace: strcmp Proc_8 main _main

11 REG[00]: 0x00824258 REG[01]: 0x00000008 REG[02]: 0x00004fd0 REG[03]: 0x000028a0

12 REG[04]: 0x00000009 REG[05]: 0x00002f00 REG[06]: 0x00000640 REG[07]: 0x00005068

13 REG[08]: 0x00004fb8 REG[09]: 0x00005098 REG[10]: 0x00000020 REG[11]: 0x00004fbd

14 REG[12]: 0x00002ee0 REG[13]: 0x00000003 REG[14]: 0x00004ff0 REG[15]: 0x00000020

15 memory dump of stack (from 0xd0824258)

16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

18 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

19 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[...]

Figure 5.5: Debug-Output canceling a Simulation using CTRL-c

dresses of stacks and communication buffer, and parameters for the stack and the bus

system. Furthermore, a file with symbol information for debugging can be specified. Us-

ing this information, the simulator can output function names instead of just addresses, in

case of an error, or if the user interrupts the simulation. The output for one cluster (com-

munication controller) and one of the processors is shown in Figure 5.5. The symbols are

used for the backtrace (functions on the stack) output. The PC-history output gives the

last five addresses where the program counter changes by a branch instruction. The con-

figuration file is generated by the compile script scc, which is described in Section 5.4.2.

5.4.2 Compiler

For the comfortable development of software, we want to program the system using a high

level programming language instead of assembler. As the used processor core (S-Core)

is compatible with Motorola’s M-Core processor (with some additional features), we can

use the standard GNU C compiler4, to generate the object files. For access of our exten-

sion of the CPU (e. g., new instructions), we use inline assembler.

For the generation of the code for the parallel system, we provide a compile script,

which calls the compiler several times and generates the object files for each processor

with adapted addresses. It also hides the differences of the targets (simulators, FPGA)

and generates the appropriate output format (e. g., plain binaries, s-rec-files). It creates a

short startup code to set the stack pointer and to jump to the right code. This is needed,

because all processors will start the execution after the reset signal at the same address

4version pre 4.0.0 from the cvs repository in January, 2005, later versions contain a bug and fail to

compile some files

5.4. Toolchain 59

scc [architecture args] [additional compiler args]

--help this text

--target=<target>

clustersim simulator WG Kastens

sbarraysim systemc simulator

wbcluster wishbone VHDL cluster

wbcluster-fpga wishbone FPGA cluster

rcos wishbone cluster without shared memory

ambacluster ambacluster with cache

--output-format=<format>

--width=<w>--height=<h>size of the array

--no-pes=<n>number of PEs per switchbox

--pcscore use pcscorec instead of gcc

--pcscore=<ext>use special version of pcscorec

--stack-size=<size>stack size in bytes

--shared-stack stack in shared memory (wbcluster)

--heap-size=<size>size of heap per PE (default 1048576)

--shared-size=<size>size of __sharedMem

--mem-size=<size>size of memory per PE (rcos, default 32k)

--cache-size=<size>cache size in bytes

--line-size=<size>size of a cache line in bytes

--associativity=<n>associativity of the cache

--cache-stats output statistics for cache use

--verbose output compiler calls

Figure 5.6: Syntax of the Compile Script

(0x00000000). Furthermore, it produces the configuration files for the simulators and a

script to start them. Thus compiling and testing is almost as easy as for a native processor.

Figure 5.6 shows the syntax of the compile script scc. The script also support the use

of another compiler, pcscore, instead of gcc. This is a compiler developed by one of the

project partners at the University of Paderborn, the research group of Professor Kastens.

Although it does not always optimize the code as good as gcc, it has the advantage that

changes at the instruction set of the processor can be realized and evaluated quite simple

and fast. Another version of this compiler that can automatically generate parallel code

for all processors in the cluster is being developed. Using this compiler, the user will

see a cluster as a single processing unit. In our work, we would like to utilize the whole

parallelism of the machine to execute parallel algorithms, so we do not use any technique

to automatically generate parallel code.

5.4.3 System Libraries

We use a small standard C library with some adaptations and optimizations. The output

functions (e. g., printf) use the architecture dependent output method, either a proces-

sor trap or writing the output to a special memory address.

As copying of memory blocks is a crucial operation for our implementation of BSP, we

make some effort on optimizations of this function. The S-Core processor allows reading

and writing multiple registers with a single instruction, so we save the registers, use them

60 BSP for Parallel System on Chip

Table 5.2: Library Functions for the Hardware Access

gn_get_processor_id, gn_pid, gn_pes gives the processor switchbox, processor ids and the

number of processors in a cluster

gn_write_comm_buffer,

gn_read_comm_buffer

write to resp. read from communication buffer,

a 32 bit shared register in each cluster

gn_barrier barrier synchronization inside a cluster

gn_send_data send data from shared memory to another switch box

gn_get_next return pointer of next ack or packet

gn_wait_for_packet wait until a packet is received and return pointer

gn_get_ack return pointer to an acknowledged packet

gn_free_packet free queue at place with pointer

gn_get_out_counter return numbers of packets in send queue

gn_cache_invalidate,

gn_cache_write_back, gn_cache_lock,

gn_cache_unlock, gn_cache_prefetch,

gn_cache_prefetch_exclusive

control of the cache (not all functions are supported

on all architecture)

to copy the data, and restore their content afterwards. This overhead is worthwhile only

for blocks bigger than a threshold, thus our implementation of the memory copy function

(memcpy) contains several different copy loops copying 1, 8 or 11 data words at once.

The following lines of S-Core assembler code show the body of the loop for copying

blocks of 11 words:

r0 source

r2 destination-source

r3 destination-source - 44

r4 length in 11 word-blocks

1decne r4

2.block11CopyLoop:

3ldm r5-r15,(r0)

// r0 ←destination

4addu r0, r2

5stm r5-r15,(r0)

// r0 ←source

6subu r0, r3

7loopt r4, .block11CopyLoop

Some additional functions for accessing the hardware are also provided, see Table 5.2.

An example using the communication functions for sending and receiving messages can

be found in Figure 5.8 in Section 5.5.1.

5.4.4 BSP Library

For the implementation message passing in the BSP style, we have to deal with the two

levels of communication. Inside the clusters, we have shared memory, thus sending of a

message is just a memory copy operation, or, if we do not change the data until it is read,

it is sufficient to send a pointer to the data (high performance operations).

5.4. Toolchain 61

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCache

Cache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

Cache

CPU

Cache

CPU

Switchbox

(a) phase 1, degree 3

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCache

Cache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

Cache

CPU

Cache

CPU

Switchbox

(b) phase 1, degree O(√p)

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCache

Cache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

CacheCacheCache

CPU

Cache

CPU

Switchbox

Cache

CacheCache

CPU

Cache

CPU

Switchbox

Cache

CPU

Cache

CPU

Switchbox

Figure 5.7: Synchronization

Shared Memory Cluster. The communication inside the cluster is realized by writing

into the shared memory, similar to the PUB implementation for parallel shared memory

machines like the Cray T3. PUB uses preceive queues, one for each processor, and a send

is done by appending the message to the queue of the destination. For the synchronization

of the accesses of different processors to the queue, an atomic operation like compare-

and-swap is needed, but our architecture does not provide such operations. So we use p2

queues, one for each sender-receiver pair.

Sending a message, i. e., a block of memory, is done in two steps: first, the data is

copied into a buffer, second the address of the message in the buffer is appended to the

appropriate queue. We need to copy the message to meet the BSP specification, because

the data may be changed between the send and the next synchronization. For the high

performance send, we do not need to copy the message and just appended a pointer to the

data.

All queues exist twice. One for the messages received in the actual superstep and

one for the messages of the last superstep. In the synchronization the two groups of

queues are swapped (by changing a pointer to the queues to use), and the queues for

receiving messages are cleared. All processors have to change the queues at the same

time; otherwise messages of different supersteps could be mixed. This is realized by

synchronization barriers of the processors inside a cluster. For this purpose we extend

the instruction set of the processor by a barrier instruction, which can synchronize an

arbitrary subgroup of processors in hardware. The execution time without idle time is

only one cycle, thus, synchronization of the shared memory cluster is very fast.

On-Chip Communication. For the synchronization of the whole system we use the

same technique as in the PUB library for parallel computers: each processor counts the

62 BSP for Parallel System on Chip

number of messages sent to each other processor, and in the synchronization the global

sums of all these counters are computed, so every processors knows the number of its

incoming messages. Next all processors wait until the number of expected messages is

received. With this algorithm it is possible that processors are in different supersteps,

but a processor cannot start a superstep until all other processors have at least started the

synchronization procedure.

For our architecture, we do the global computation in two phases. First, the counters

of the processors inside the clusters are added using shared memory and the barrier in-

struction for the synchronization, second these numbers are added globally. The clusters

are connected by a mesh network. Each cluster, more precisely the first processor of each

cluster, receives the results from the right neighbor, adds its own numbers, and sends the

calculated partial sum to its left neighbor (Figure 5.7(a)). The processors of the first col-

umn receive the data from the right and the bottom and send their results to processors in

the top direction. After this calculation, processor 0 of the cluster in the top left corner

knows all data and broadcasts it the same way back to all the other clusters (Figure 5.7(c)).

To decrease the latency, we implement another variant of this algorithm. All proces-

sors in a row send the data direct to the first column, the processors in that column add

all the numbers and send them to the top (Figure 5.7(b)). This communication uses a tree

with a higher degree (O(√p)for quadratic meshes) and is faster for a small number of

clusters. For a larger systems a combination (fixed degree, but larger than 3) should be

used.

5.5 Evaluation

We use several benchmarks for the evaluation of our architecture. In the GigaNetIC

project, some low level network benchmarks have been performed, for example, IP header

classification and cyclic redundancy checks (CRC).

For parallel computers and workstation clusters, we use an LU-decomposition algo-

rithm as a benchmark for scientific calculation. The algorithm uses not only a lot of

memory, it is also based on floating point operations, whereas our processor core S-Core

does not have a floating point unit. In the following, we use benchmarks based on integer

computations.

We did two kinds of benchmarks: first, low level communication benchmarks to de-

termine the communication performance of the system without using BSP. Next, we use

application benchmarks to find good parameters for the system (e. g., cache size) in a

more realistic scenario. Furthermore, we want to compare the performance with a stan-

dard sequential processing unit.

5.5. Evaluation 63

1if( pid == 0 ) { // processor 0

2gn_print_time(); // print start time

3dy = (dest/4) / (int) _cluster_width;

4dx = (dest/4) % (int) _cluster_width;

5// send message

6gn_send_data( dx, dy, 0, id , sizeof(msg), &msg );

7gn_get_ack( id ); // wait for acknowledgment

8p = gn_get_next( id, 0, 1, 1 ); // wait for answer

9gn_print_time();

10 gn_free_packet( p );

11 }else {

12 if( pid == dest ) {

13 p = gn_get_next( id, 0, 1, 1 ); // wait for packet

14 gn_free_packet( p ); // free data

15 // send answer

16 gn_send_data( dx, dy, 0, id, sizeof(msg), &msg );

17 gn_get_ack( id ); // wait for acknowledgment

18 }

19 }

Figure 5.8: Source Code Ping-Pong Test

5.5.1 Communication Benchmarks

The determination of the BSP parameters gand Lis the standard benchmark for bulk

synchronous parallel computing. As we consider worst case running time estimations, we

stress the network as much as possible. In the BSP model, the total amount of transmitted

data for a hrelation does not affect the time, but for many implementations this is not ac-

curate (cf. E-BSP, Section 2.2.4). Thus, we vary the network congestion and measure the

two extremes: only one message (cheapest h-relation) and a total exchange, where each

processor sends a message to each other. As we expect better performance for messages

inside the clusters and communication time for a message is a function of the number of

switch boxes to cross, we also measure the time needed for messages of different path

lengths.

Ping-Pong. The first benchmark shows the native communication performance without

using the BSP library. It shows the time needed for a small message from processor P0to

processor Pi(ping) and the answer from Pito P0(pong).

The main part of the source code for the benchmark is shown in Figure 5.8. Figure 5.9

shows the result for the Ping-Pong benchmark for two different configurations of the

system: a linear array of 8 clusters and a mesh of 4×4 clusters. The time needed for ping-

pong communication increases linearly with every hop, as expected. The additional cost

for passing a switch box is 10–12 cycles. Surprisingly, the test is significantly faster if the

64 BSP for Parallel System on Chip

250

300

350

400

450

500

550

0 10 20 30 40 50 60 70

time for ping-pong [cycles]

destination

8×1×4 processors, sender 0

8×1×4 processors, sender 1

4×4×4 processors, sender 0

4×4×4 processors, sender 1

Figure 5.9: Ping-Pong Benchmark: Latency

receiver is the first processor of a cluster. We could not figure out the reason for this. We

do not prioritize processor 0. All processors that are not involved in the communication

wait using the barrier instruction, so they do not use any shared resource, for instance,

the bus or the shared memory. Furthermore, the bus arbiter schedules the bus accesses

with a round robin strategy, so no processor is preferred. The only difference between

the processing elements is the following. When a cache miss occurs and the cache line

is contained in more than one other cache, it is fetched from the cache with lowest id.

However, this does not explain the different running times for the tests as we do not use

the shared memory concurrently in this benchmark.

The same effect occurs for the sender: for small systems it seems to be faster if one

but the first processor sends the message. For the larger system, the running time is not

dependent on the sender processor.

The object codes for the different processors are generated in different runs of the

compiler, so it could differ. For instance, code using the processor identifier, a constant

known at compile time, could be optimized diverse, but the only differences in the gener-

ated assembler codes are the different addresses in the memory.

Total-Exchange. Using the BSP library for communication tests, we have two different

types of costs: time needed for sending the messages and the time for the synchronization.

5.5. Evaluation 65

1000

10000

100000

106

107

1 10 100 1000 10000

time for superstep [cycles]

message size [bytes]

time for superstep without communication

2×2×4 processors

2×4×4 processors

4×2×4 processors

4×4×4 processors

Figure 5.10: Total Exchange, Length of Supersteps

Figure 5.10 shows the length of the supersteps in clock cycles for different message

and cluster sizes. Note that the amount of data each processor is sending and receiving (h

in the BSP-model) is dependent on the system size, so the time should increase linearly

with the number of processors if the system scales well. In the same figure, the time

needed for the synchronizations without sending messages is shown (parameter Lof the

BSP-model). The figure shows two curves for each test, the minimal and the maximal

needed cycles over all processors.

The parameter g(p)corresponds to the slopes of the graphs in the figure. We got the

following values for g(p):

p16 32 64

g(p)6.3 7.4 7.5

Figure 5.11 shows the bandwidth per processor, which is the inverse of g. The over-

head for the synchronization decreases as the message size increases, and the achievable

bandwidth is almost independent of the system size for our simulated systems.

66 BSP for Parallel System on Chip

10−4

0.001

0.01

0.1

1 10 100 1000 10000

total bandwith per processor [bytes/cycle]

message size [bytes]

2×2×4 processors

2×4×4 processors

4×2×4 processors

4×4×4 processors

Figure 5.11: Total Exchange, Bandwidth

5.5.2 Application Benchmark

We have run two application benchmarks to evaluate the influence of the parameters of

the architecture to the performance for realistic applications: Solving 3-Satisfiability and

sorting.

3-Satisfiability. As first application benchmark we use a BSP implementation for the 3-

Satisfiability (3-SAT) problem. Satisfiability is the problem of determining if the variables

of a given boolean formula can be assigned, such that the formula is satisfied, i, e., it

evaluates to true. The formula consists of a conjunction of clauses, each clause is a

disjunction of literals. A literal is a variable or its negation. For the 3-SAT problem,

all clauses contains at most three literals. We use 3-SAT as benchmark, because it is

a well studied problem in the computer science. It is known to be NP-hard and thus

a computational challenge. Furthermore, it is based on boolean data types and can be

solved with small memory space.

Our algorithm is a BSP implementation of a divide and conquer algorithm ([HO01],

pp. 169–173). In each step a variable of the shortest clause is assigned and the remaining

formula is checked recursively. If the assignment does not fulfill the clause, one of the

remaining literals have to evaluate to true. Thus, not all possible combinations of the

5.5. Evaluation 67

(¯x7∨x17 ∨x28)∧(¯x15 ∨x2∨x17)∧(¯x35 ∨¯x10 ∨¯x29)∧(¯x49 ∨x19 ∨x31)∧(¯x5∨x8∨x34)

∧(¯x37 ∨x41 ∨x28)∧(¯x14 ∨x42 ∨x9)∧(¯x47 ∨¯x46 ∨x0)∧(¯x44 ∨¯x15 ∨x32)∧(x37 ∨x41 ∨x21)

∧(¯x48 ∨¯x33 ∨x40)∧(x11 ∨x16 ∨x34)∧(¯x48 ∨x21 ∨x6)∧(¯x9∨¯x10 ∨¯x38)∧(¯x27 ∨¯x35 ∨¯x36)

∧(¯x12 ∨¯x45 ∨¯x40)∧(¯x3∨x20 ∨x8)∧(x11 ∨x47 ∨x9)∧(x23 ∨x22 ∨x14)∧(¯x7∨¯x40 ∨¯x42)

∧(¯x43 ∨¯x1∨¯x34)∧(¯x26 ∨x17 ∨x30)∧(x46 ∨x34 ∨x5)∧(¯x10 ∨¯x26 ∨x40)∧(¯x32 ∨¯x46 ∨¯x44)

∧(¯x15 ∨¯x36 ∨x35)∧(¯x45 ∨x26 ∨x1)∧(¯x27 ∨x14 ∨x9)∧(¯x37 ∨¯x38 ∨x45)∧(¯x32 ∨¯x3∨x23)

∧(¯x11 ∨¯x44 ∨x49)∧(¯x31 ∨¯x20 ∨¯x14)∧(x7∨x41 ∨x23)∧(¯x48 ∨x29 ∨x3)∧(¯x8∨x44 ∨x27)

∧(¯x32 ∨¯x46 ∨¯x0)∧(¯x15 ∨x26 ∨x0)∧(¯x10 ∨¯x16 ∨¯x34)∧(¯x41 ∨¯x14 ∨x44)∧(¯x18 ∨¯x26 ∨x29)

∧(x2∨x27 ∨x11)∧(¯x10 ∨¯x32 ∨x47)∧(¯x5∨¯x8∨x36)∧(¯x36 ∨¯x6∨x12)∧(¯x1∨x25 ∨x15)

∧(¯x23 ∨¯x37 ∨x45)∧(¯x12 ∨¯x23 ∨¯x7)∧(¯x35 ∨¯x41 ∨¯x20)∧(¯x36 ∨¯x18 ∨x2)∧(¯x30 ∨¯x49 ∨x34)

∧(¯x6∨¯x25 ∨x28)∧(¯x41 ∨¯x44 ∨x28)∧(¯x5∨x24 ∨x32)∧(¯x44 ∨¯x4∨x6)∧(¯x6∨¯x5∨x27)

∧(¯x47 ∨¯x10 ∨x30)∧(¯x36 ∨x15 ∨x31)∧(¯x23 ∨x47 ∨x0)∧(¯x45 ∨x17 ∨x22)∧(¯x29 ∨¯x49 ∨x47)

∧(¯x20 ∨¯x1∨x38)∧(x23 ∨x46 ∨x41)∧(¯x35 ∨x29 ∨x3)∧(¯x4∨¯x0∨x27)∧(¯x46 ∨¯x41 ∨x31)

∧(¯x21 ∨x36 ∨x15)∧(¯x42 ∨¯x33 ∨x41)∧(¯x39 ∨¯x19 ∨x38)∧(¯x48 ∨x28 ∨x5)∧(¯x40 ∨¯x2∨x38)

∧(¯x15 ∨¯x11 ∨x42)∧(x23 ∨x21 ∨x2)∧(¯x44 ∨x46 ∨x42)∧(¯x36 ∨x44 ∨x45)∧(¯x8∨x25 ∨x4)

∧(¯x2∨¯x12 ∨x22)∧(¯x33 ∨x4∨x12)∧(x11 ∨x38 ∨x12)∧(x21 ∨x49 ∨x36)∧(x18 ∨x8∨x45)

∧(¯x23 ∨¯x26 ∨x7)∧(¯x27 ∨x6∨x20)∧(¯x24 ∨x7∨x49)∧(x19 ∨x49 ∨x3)∧(x26 ∨x35 ∨x12)

∧(¯x24 ∨x30 ∨x25)∧(¯x43 ∨¯x31 ∨x38)∧(¯x19 ∨¯x9∨x40)∧(¯x27 ∨x48 ∨x34)∧(x0∨x43 ∨x33)

∧(¯x10 ∨x34 ∨x38)∧(¯x49 ∨¯x41 ∨¯x6)∧(¯x23 ∨x6∨x46)∧(¯x12 ∨¯x47 ∨x4)∧(¯x8∨¯x19 ∨¯x22)

∧(¯x18 ∨x16 ∨x1)∧(x10 ∨x22 ∨x20)∧(¯x44 ∨x29 ∨x14)∧(¯x23 ∨x25 ∨x10)∧(¯x12 ∨x32 ∨x37)

∧(¯x26 ∨¯x6∨x43)∧(x40 ∨x48 ∨x1)∧(¯x17 ∨¯x36 ∨x11)∧(¯x1∨¯x25 ∨x11)∧(¯x18 ∨x6∨x31)

∧(¯x21 ∨x10 ∨x32)∧(¯x19 ∨x11 ∨x7)∧(¯x47 ∨x39 ∨x15)∧(¯x1∨¯x23 ∨¯x10)∧(¯x16 ∨x25 ∨x36)

∧(¯x13 ∨¯x18 ∨x45)∧(x4∨x46 ∨x35)∧(¯x28 ∨¯x8∨x18)∧(x31 ∨x3∨x27)∧(¯x33 ∨¯x45 ∨x19)

∧(¯x3∨¯x35 ∨¯x12)∧(¯x14 ∨¯x36 ∨x44)∧(¯x20 ∨x28 ∨x22)∧(¯x5∨¯x39 ∨x6)∧(¯x41 ∨¯x28 ∨x30)

∧(¯x35 ∨x23 ∨x30)∧(¯x44 ∨¯x36 ∨¯x0)∧(¯x5∨¯x28 ∨x2)∧(¯x27 ∨¯x49 ∨x26)∧(x43 ∨x25 ∨x4)

∧(¯x16 ∨¯x47 ∨x48)∧(¯x39 ∨¯x6∨x11)∧(¯x11 ∨¯x47 ∨x30)∧(¯x41 ∨x31 ∨x26)∧(¯x26 ∨¯x9∨x0)

∧(¯x48 ∨x5∨x9)∧(¯x23 ∨x7∨x42)∧(x22 ∨x30 ∨x0)∧(¯x46 ∨x10 ∨x37)∧(¯x27 ∨¯x12 ∨x25)

∧(¯x39 ∨¯x41 ∨x11)∧(¯x2∨x38 ∨x45)∧(x16 ∨x40 ∨x45)∧(x22 ∨x20 ∨x12)∧(¯x13 ∨¯x0∨¯x37)

∧(x19 ∨x17 ∨x5)∧(¯x49 ∨¯x8∨x19)∧(¯x31 ∨¯x17 ∨x9)∧(¯x20 ∨¯x33 ∨x48)∧(¯x34 ∨x22 ∨x43)

∧(¯x18 ∨x39 ∨x33)∧(¯x0∨¯x11 ∨x5)∧(¯x1∨¯x6∨x5)∧(¯x19 ∨x31 ∨x33)∧(¯x11 ∨¯x28 ∨x42)

∧(¯x48 ∨x1∨x23)∧(¯x3∨x9∨x39)∧(x10 ∨x4∨x11)∧(¯x2∨¯x30 ∨x46)∧(¯x22 ∨x42 ∨x20)

∧(¯x40 ∨¯x35 ∨¯x49)∧(¯x7∨¯x41 ∨¯x23)∧(x38 ∨x44 ∨x6)∧(¯x44 ∨x36 ∨x6)∧(x40 ∨x39 ∨x7)

∧(¯x49 ∨¯x9∨¯x7)∧(¯x4∨¯x38 ∨¯x13)∧(¯x21 ∨¯x23 ∨¯x42)∧(¯x35 ∨x39 ∨x34)∧(x16 ∨x48 ∨x40)

∧(¯x31 ∨x6∨x23)∧(¯x29 ∨¯x7∨¯x8)∧(¯x40 ∨¯x12 ∨¯x9)∧(¯x32 ∨x25 ∨x30)∧(¯x21 ∨¯x38 ∨x16)

∧(¯x20 ∨x27 ∨x2)∧(¯x13 ∨x45 ∨x22)∧(x28 ∨x15 ∨x18)∧(¯x31 ∨¯x43 ∨x41)∧(¯x23 ∨x9∨x22)

∧(¯x0∨¯x31 ∨¯x20)∧(¯x7∨¯x43 ∨¯x38)∧(x38 ∨x10 ∨x8)∧(¯x45 ∨x13 ∨x18)∧(¯x41 ∨x43 ∨x45)

∧(¯x28 ∨x22 ∨x36)∧(x31 ∨x24 ∨x19)∧(¯x42 ∨¯x11 ∨x13)∧(¯x35 ∨¯x17 ∨x45)∧(¯x25 ∨¯x9∨x13)

∧(¯x1∨¯x29 ∨x4)∧(¯x17 ∨x5∨x45)∧(¯x25 ∨¯x43 ∨x1)∧(¯x7∨¯x10 ∨x19)∧(¯x30 ∨x2∨x15)

∧(¯x21 ∨¯x8∨x38)∧(¯x48 ∨¯x41 ∨x43)∧(¯x44 ∨¯x43 ∨x30)∧(¯x30 ∨¯x10 ∨x49)∧(¯x31 ∨¯x45 ∨x1)

∧(¯x5∨¯x6∨x16)∧(¯x31 ∨x18 ∨x47)∧(¯x9∨x19 ∨x38)∧(¯x21 ∨¯x36 ∨x37)∧(¯x30 ∨¯x47 ∨x8)

∧(x39 ∨x11 ∨x6)∧(¯x23 ∨¯x3∨x8)∧(¯x21 ∨x48 ∨x32)∧(¯x11 ∨x42 ∨x9)∧(¯x29 ∨¯x9∨x24)

∧(x45 ∨x46 ∨x30)∧(¯x6∨x26 ∨x12)∧(¯x44 ∨¯x34 ∨x31)∧(¯x49 ∨x33 ∨x8)∧(x1∨x33 ∨x29)

∧(x2∨x15 ∨x1)∧(¯x17 ∨¯x11 ∨x44)∧(x32 ∨x36 ∨x9)∧(¯x17 ∨x6∨x42)∧(¯x21 ∨¯x18 ∨x43)

∧(¯x30 ∨¯x26 ∨¯x41)∧(¯x2∨¯x39 ∨x7)∧(¯x22 ∨¯x30 ∨x37)

Figure 5.12: 3-SAT Instance uuf050-218, 50 Variables, 218 Clauses, Unsatisfiable

68 BSP for Parallel System on Chip

2·106

4·106

6·106

8·106

107

1.2·107

4 8 16 32 64 128

cycles

cache per processor [kByte]

1 CPU, 1 cluster

2 CPUs, 1 cluster

2 CPUs, 2 clusters

4 CPUs, 1 cluster

4 CPUs, 2 clusters

4 CPUs, 4 cluster

8 CPUs, 2 clusters

8 CPUs, 4 clusters

16 CPUs, 4 cluster

Figure 5.13: 3-Satisfiability: Processor Cache

variable assignment have to be checked.

In the first step processor 0 executes the algorithm sequentially until c·psubprob-

lems are created for a constant c(we use c=4 in our experiments). These branches are

distributed evenly among the processors. Then all processors try to find a solution sequen-

tially for their subproblems. If a solution is found, it is broadcasted and the computation

stops at the end of the actual superstep.

Our objective was not to develop the fastest 3-SAT solver, or concentrate on load

balancing, we just want to compare our architecture with standard technologies. Further-

more, we want to evaluate the influence of system parameter, for instance, cache sizes,

for the running time.

For the benchmark we use 3-SAT instances from the SATLIB-benchmark. The Sat-

isfiability Library SATLIB [HS00] is a collection of benchmark problems, solvers, and

tools for SAT related research. Instances can be downloaded from the website of the

benchmark [SAT00]. In the following, we use file uuf050-218.cnf of the archive uuf50-

218.tar.gz. The instance (Figure 5.12) was generated uniformly at random and proven

to be unsatisfiable. It was converted from a format called DIMACS into our own data

structure, and linked directly into the program, as our simulators do not support input yet.

First, we evaluate the influence of the processor cache for different numbers of proces-

sors and topology (Figure 5.13). For a single processor, more than the default 16 kb cache

5.5. Evaluation 69

2·106

4·106

6·106

8·106

107

1.2·107

1 2 4 8 16

cycles

processors

linear speadup

Pentium IV, 1.7 GHz

1 cluster

2 clusters

1 cluster

2 clusters

4 clusters

2 clusters

4 clusters

8 clusters

4 clusters

8 clusters

16 clusters

Figure 5.14: 3-Satisfiability: Number of Processor

is not needed, the program runs only 0.053 % faster when increasing the cache to 128 kb.

The efficiency of the cache is the reason. We have 7,106,264 hits and only 1,367 misses

for reading (99.9808 % hit rate), and 222,787 hits and 1,908 misses for writing (99.1508 %

hit rate). Using only 8 kb (4 kb) cache, the number of cycles increases by 0.41 % (2.2 %)

due to the lower hit rate (99.5528 % read, 92.675 % write for a 4 kb cache).

If there are four processors sharing the bus, the cache becomes more important, be-

cause the memory bandwidth per processor decreases. However, the 3-Sat benchmark has

a small working set. The performance decreases by only 1 % (4 %) if the cache size is

decreased from 16 kb to 8 kb respectively 4 kb.

Next, we look at different system configurations of the processors. Our algorithm does

not scale well, because the initial work distribution is done by a single processor. If we

increase the number of processors, this part will dominate the total execution time (Fig-

ure 5.14). Scalability was not our intention for this test. We want to analyse the influence

of the topology, i. e., the differences between using shared memory and using message

passing communication in different switch boxes. As there is almost no communication

in this benchmark, we expected that it is faster to have more clusters, thus less processors

competing for the shared memory bus.

But there is almost no difference in the running time. The shared memory is not

the bottleneck in this application, because the cache is very efficient and most memory

70 BSP for Parallel System on Chip

accesses can be executed by the cache itself.

Sorting. The second application benchmark is sorting of integer numbers. We use an

implementation of SAMPLESORT, a parallel sorting algorithm similar to the sequential

QUICKSORT. QUICKSORT partitions the data into two parts, smaller and bigger numbers,

using a single splitting element. Then these parts are sorted recursively. The efficiency

of the algorithm is dependent on a good splitting element which creates two partitions of

approximately the same size.

SAMPLESORT for pprocessors uses p−1 splitting elements and partitions the data

into psets. Each processor is responsible for one set, thus each processor sends all keys

to the appropriate destination. Next, all processor sort their data locally. The efficiency is

dependent on the splitting elements, a good choice leads to a good load balancing. To get

the splitting elements, each processor chooses a number of random samples of the data

and sends these samples to processor 0. Processor 0 sorts them locally and broadcast the

perfect splitting elements for the samples (cf. Section 3.1.3).

In our benchmark we sort random 32 bit integer numbers. Each processor sends 64

samples to processor 0, and all data send to the destination processor is combined in

messages of size 1024 bytes each. These parameters are not optimal for the problem – in

fact we do not need samples at all as we sort numbers chosen uniformly at random – but

our intention is to evaluate the architecture and not to find the best sorting algorithm.

For the architecture we use all valid combinations of the following parameters: width

and height 1, 2, 3, or 4 clusters, 1, 2, or 4 of processors per cluster, 0 kb, 4 kb, 8 kb, 16 kb,

32 kb, 64 kb, or 128 kb of cache per processor, 16 bytes, 32 bytes, or 64 bytes cache line

size, with amba matrix or standard bus system.

The results of these 2,000 simulations (running several hours on up to 35 fast work-

stations in total) is shown in Figure 5.15. Each system size, i. e., each total number of

processors, is shown in another color. The different configurations (e. g., topology, cache

sizes) lead to small differences in the execution time, but the total number of processors

has much more impact on the execution speed. For small data size the overhead of the

sample sort algorithm compared to sequential QUICKSORT is the crucial factor. But for

data sizes larger than 2 kByte, the parallel SAMPLESORT outperforms QUICKSORT. The

larger the data set is, the more processors can be used efficiently. The results do not show

any unexpected effects, for example, network congestion or other problems.

We ran the same benchmark on a standard processor (Pentium IV, 1.7 GHz) to classify

the performance of the system. The parallel system on a chip is much faster (counting

cycles) even using only a single CPU, because it contains on chip memory, whereas the

Pentium has to load and store all data through the memory hierarchy using slow off chip

memory.

5.6. Conclusions 71

100

1000

10000

100000

106

107

108

1 10 100 1000 10000 100000 106107

cycles

data size

1 processor (QuickSort)

2 processors

4 processors

8 processors

16 processors

32 processors

Pentium IV, 1.7 GHz

Figure 5.15: Sorting: Samplesort

5.6 Conclusions

In this chapter we proposed a parallel system on a single chip suiteable for BSP com-

puting. Although our design is just a prototype for evaluation the potential of parallel on

chip systems, it outperforms classical CPUs in some benchmarks. Replacing the simple

processors cores by more powerfull ones will significantly increase the performance of

the total system.

We show two things: first, the BSP model can assist hardware developers in designing

parallel systems for general use. The fast synchronization instruction of the processing

elements, and the communication protocol of the interconnection network are just two

examples of optimizations for the requirements of an efficient BSP implementation.

Furthermore, we show that the BSP model is applicable for this kind of parallel sys-

tem, which differs from the classical parallel computers the model was designed for.

72 BSP for Parallel System on Chip

Chapter 6

BSP for Heterogeneous Workstation

Clusters

The Bulk Synchronous Parallel Model (BSP) was designed for classical monolithic par-

allel machines. In the last chapter we showed that the model is also applicable for parallel

systems on a chip. In this chapter we look at the other end of the spectrum of parallel

systems, classical workstations interconnected by a local area network. We show, how

BSP can be implemented efficiently on such architectures, even if the computers spend

different computation power, which even might change during the execution.

We use virtual processors, and balance the load online by migrating these processors

to less loaded computers if the load changes. We have developed a technique to migrate

processes during execution. This is used for load balancing, but have many other applica-

tions besides BSP.

In the following, we describe our view of heterogenous workstation cluster. Then we

show the techniques of migration a virtual processor to another computer without using

special adaption of the operating system. Next we present and evaluate load balancing

strategies to show the benefit of the system. Parts of this work has been published in

[Bon07].

6.1 Heterogeneous Workstation Cluster

A workstation cluster consists of computers, in the majority of cases personal computers

(PCs). These computers are interconnected by an arbitrary network, which supports the

TCP/IP protocol. In most cases a (Fast-/Gigabit-) Ethernet network is used, but any other

physical layer is possible, e. g., InfiniBand. For demonstration purpose, we restrict our-

selves to PCs with processors using the 32 bit x86-32 instruction set (e. g., Intel Pentium,

AMD Athlon).

The workstations are heterogeneous. This means they are different computers (e. g.,

74 BSP for Heterogeneous Workstation Clusters

manufactures, attached devices), providing different computational power. The usable

power is changing dynamically due to other users, some computer may even break down.

However, they are not complete heterogeneous, they have the same architecture (e. g.,

Intel 80x86), the same operating system (e. g., Linux), and sometimes a global filesystem

(NFS) and a centered administration.

Comparison with Monolithic Parallel Computers

Typically users of classical parallel systems have exclusive access to parts of resources,

i. e., they can allocate partitions of the system for a certain amount of time or there is a

batch system which allows submitting jobs that are executed when the needed resources

are available. There are several advantages of this approach for commercially used super-

computers: each job obtains all resources, e. g., can use the whole memory, less security

problems, and the accounting is easier.

Besides the exclusive access all processors of such system are very similar, in most

cases they are even the same. In the BSP cost model the length of a superstep is given by

the maximum of the work over all processors, regardless of the work of all other proces-

sors. Thus to get an efficient BSP algorithm the work should be distributed evenly among

all processor. If the computer have processors of different speed or, in the dynamic case

of changing available computation power like in our scenario, the standard BSP model is

not adequate. For the efficient execution of jobs on such dynamic system scheduling and

load balancing have to be considered.

Our Approach

The usage of virtual processors is one way to cope with different processor’s speeds: the

BSP program is executed on virtual processors, which are simulated by the real comput-

ers. The faster a computer is, the more virtual processors are assigned to it. For static

processor speeds this can be done by a simple scheduler, but if the available computation

power is changing, or the work is not distributed evenly among the virtual processors of

the BSP program, a more sophisticated online load balancing system is needed.

6.2 Related Work

There exist several approaches to use the idle time of workstation in offices. First, there

are programs for solving specialized problems, SETI@home (search for extraterrestrial

intelligence [set08]) and distributed.net (e. g., breaking cryptographic keys [dis08]) are

two of the most famous ones. These systems scan a large data space by dividing it into

small pieces of data, sending them as jobs to the clients, and waiting for the results. There

is no communication during the calculation, or the communication is combined with the

6.2. Related Work 75

job data and sent through the central server as in the BSP implementation by Bayanihan

[Sar99]. These loosely coupled systems can efficiently be executed on PCs connected by

the Internet.

Achieving fault tolerance is straight forward: If a client does not return the result, a

timeout occurs and the job is assigned to another client. Although only a small, fixed num-

ber of problems can be solved, these systems are very powerful in terms of computation

power because of the huge number of clients taking part.

Second, there are job scheduling and migration systems. Here you can start jobs,

either directly or using a batch system. These jobs are executed on idle nodes and may

migrate to other nodes if the load in the system changes. Condor [LTBL97] is one example

for such a system implemented in user space, MOSIX [BL98] is a module for the Linux

kernel which allows migrations of single processes. Kerrighed [MGLV04] goes beyond

this and provides a single system image of a workstation cluster. It supports migration of

processes and also communication via shared memory.

Checkpointing (of sequential programs) is used either for fault tolerance or to speed

up the startup of applications like the editor Emacs [Sta81] or the L

EX-system do.

Emacs implements a function unexec, which does the opposite of the normal exec

function1and stores the actual state of a process together with the program code into one

executable. If this executable is started later, the process is restored and continues.

Cao, Li, and Guo have implemented process migration for MPI applications based on

coordinated checkpoints [CLG05]. They use an MPI checkpointing system for the LAM

MPI implementation [SSB+05] based on Berkeley Lab’s Linux Checkpoint and Restart

[Due05], a kernel level checkpointing system. As MPI stores the location of the processes,

they have to modify the checkpoints before restarting.

An implementation for migration of single threads, even in a heterogeneous network

with different CPU types, is proposed by Dimitrov and Rego in [DR98]. They extend the

C language and implement a preprocessor which automatically inserts code for keeping

track of all local data. Additionally, the state in each function, i. e., the actual execution

point, is stored and a switch statement, which branches to the right position depending

on the state if a thread is recreated, is inserted at the beginning of each function. These

changes lead to an overhead of 61 % in their example, mainly due to the indirect address-

ing of local variables. A similar approach for Java is JavaGo [SSY00], used in the BSP

web computing library PUBWCL [BGM06].

Most of these systems support only independent jobs (Condor can schedule parallel

jobs using MPI or PVM, but these jobs do not take part in load balancing and migration)

or affect parts of the operation system kernel like Kerrighed, thus the user do not only need

administrative access to the machines, they furthermore have to install a Linux kernel with

special patches, which has some drawbacks (usable kernel versions are limited, e. g., not

up-to-date).

1The exec function replaces the current process image with a new process image read from a file.

76 BSP for Heterogeneous Workstation Clusters

host A

shared memory

virtual

processor

main process

load balancing thread

link id table

virtual

processor

host B

shared memory

virtual

processor

main process

load balancing thread

link id table

virtual

processor

network receive

thread

send

thread

send

thread

receive

thread

Figure 6.1: Structure of PUB with Migratable Processes

There exists also some work about checkpointing threaded processes [DL01], which

is useful to migrate parallel programs for symmetric multiprocessor (SMP) machines.

Hill, Donaldson, and Lanfear [HDL98] describe a system that supports migrations for

parallel programs written with the BSPlib library. They use the unexec function to store

checkpoints to disks. If the load should be balanced, all processors write a checkpoint

to disk, and all processes are terminated. Next they are restarted on the pleast loaded

computers. Such a migration requires a lot of disk accesses, thus it took a long time.

However, after a balancing, the schedule is optimal, whereas our implementation migrates

single jobs, which leads to a small improvement of the schedule, but at much lower costs.

6.3 Migration of Virtual Processors

Using PUB virtual processors can be created by the function bsp_migrate_dup (cf.

the user guide and reference of PUB [BGOW07]). This will start one new process for

each virtual processor. Figure 6.1 gives an overview of the processes, their threads and the

connections. The main process consists of three additional threads: The receive thread

reads all messages from the network and sends them to the right destination by either

calling the proper message-handler or putting the message into a shared memory queue

to a virtual processor. The send thread sends all messages from a shared memory send

queue to the network. The load balancing thread gathers information about the load of

the system and decides when to migrate a virtual processor. We have implemented several

algorithms for load balancing (cf. Section 6.4). The virtual processors are connected to

the main process by two shared memory message queues, one for each direction.

The link-id-table is the routing table. For each virtual processor of the system it stores

the actual execution host and the link id (used to distinguish the processes). This table

6.3. Migration of Virtual Processors 77

does not have to be up-to-date on all hosts, for example if a process was migrated, but

then the destination host stored in the table knows the new location and will forward

the message to it. Thus a message sent to the location stored in the table will reach its

destination, although it might take some additional hops. The maximal number of these

hops is unlimited, if a process does not migrate at the moment, the message may be

forwarded by all hosts in the worst case. If the process is migrating, it may escape the

message, such that the message is always following the process. However, this scenario

is very unlikely.

6.3.1 Procedure of the Migration of a Virtual Processor

When the load balancer decides to migrate one process to another host, the following

steps are executed:

1. The load balancer writes the destination host id to a migration queue.

2. The first process which enters the synchronization dequeues this id. It sends a

message to the main process.

3. The receive thread reads this message, marks the virtual processor as gone and

sends an acknowledgment to it. From now on it will not send any messages to the

process. All messages for this destination will be stored in a buffer in the main

process.

4. The process to migrate reads this acknowledgment, opens a TCP/IP server socket

and sends a message to the destination host.

5. The destination host creates two shared memory queues and a new child process.

6. This new process connects directly via TCP/IP to the old virtual processor.

7. The context of the process is transfered from the old host to the new destination.

The details of this step are described in Section 6.3.2. After this the old process

terminates.

8. The new process sends its new link id to the main process on the original host.

9. This host updates its link table and sends all messages arrived during the migration

to the new destination. If any further messages arrive for this virtual processor, they

will be forwarded to the new location. Optionally the migration can be broadcasted

to reduce the number of forwarded messages.

78 BSP for Heterogeneous Workstation Clusters

6.3.2 Migration of Linux Processes

This section deals with the technical aspects of migrating a Linux process from one com-

puter to another. We consider the parts of a process and look at its interaction with the

operating system kernel. The configuration of a process consists of a user space and a

kernel space part. As we do not extend the kernel, we cannot access the data in the kernel

space unless we use standard system functions; in particular we are not allowed to access

the kernel stack.

In the following we have a closer look to the migratable parts in the user space, and

the dependencies to the kernel, namely system calls and thread local storage.

Linux System Calls. The access to the operating system is done by syscalls. A syscall

is like a normal function call, but additionally the following steps are performed: the

CPU switches to supervisor mode (also known as kernel mode or ring 0) and sets the

stack pointer to the kernel stack. This allows the kernel to use all instructions and deal

with stack overflows of the user stack. There are two different ways for a system call

in Linux: Traditionally, Linux uses a software interrupt, i. e., the number of the system

function and its arguments are stored in processors registers, and an interrupt is generated

by the assembler instruction int 0x80. This interrupt call is very slow on Intel Pentium

IV processors, the time for a system call is much higher on a 2 GHz Pentium IV than on a

850 MHz Pentium III [Hay02] (see [SB05] for a detailed view on measuring Linux kernel

execution times). Intel has introduced a more efficient mechanism for privilege switches,

the sysenter instruction, first available in Pentium Pro processors (November, 1995).

There was a hardware bug in early CPUs, so it took a long time until operating systems

started using sysenter. Linux supports sysenter since version 2.5.53 from December, 2002.

To provide a portable interface to different CPUs, Linux uses a technique called

vsyscall. The kernel maps a page of memory (4096 bytes) into the address space of user

programs. This page contains the best implementation for system calls, thus programs

just do a simple subroutine call to this page. The page is called VDSO (virtual dynamic

shared object), its address can be found in the maps file in the proc-filesystem (cf. Table

6.1). The address is also noted in the ELF header of dynamically linked programs and

can be viewed with the command ldd, here the page looks like a dynamic library called

linux-gate.so.

Before Linux version 2.6.18 (September, 2006) the address of the page was fixed to

0xffffe000. This is the last usable page in the 32 bit address space; the page at 0xfffff000 is

not used because illegal memory accesses (pointer underruns) should lead to a page fault.

Recent Linux versions use a random address [KeN06] to complicate some kind of buffer

overflow attacks. The C library stores the address in a variable, _dl_sysinfo. The

kernel also stores the address because sysenter does not save a return address. Instead, the

sysreturn instruction used to return to the userspace needs this address as an argument.

This means in particular that no user process can change the address of the VDSO page,

6.3. Migration of Virtual Processors 79

Kernel

program

data

bss [heap]

thread stack

data

vdso

stack

_dl_sysinfo sysenter

shared mem.

00000000

a0000000

b0000000

c0000000

sysreturn

ffffffff

Kernel

program

data

bss [heap]

thread stack

data

vdso

stack

shared mem.

process A

process B

Figure 6.2: Virtual Address Space

because the kernel will always jump back to the old address.

If we migrate a process, we cannot guarantee that the source and the destination uses

the same address, but we have to keep the old VDSO page. So we do not migrate this

page and adapt the address in the C library, such that it uses the old page. If this is not

possible because of user data is mapped to the same address (cf. Figure 6.2), we free the

VDSO page, and change the _dl_sysinfo address to our implementation for system

calls using the old interrupt method.

Randomization of the address of the VDSO page can be disabled with the sysctl com-

mand (sysctl -w kernel.randomize_va_space=0) or directly using the proc

filesystem (echo 0 > /proc/sys/kernel/randomize_va_space).

Migratable Resources. When we migrate a process, we transfer all its resources to the

new destination, namely the state of the processor, the used memory, and some other data

stored inside the kernel, e. g., open files.

Processor state: The processor state contains all data stored in the processor, e. g., the

registers. This data has to be migrated. We save the registers in the data segment2of our

process. In [LTBL97], Litzkow and others suggest to use a signal handler for the migra-

tion. Inside a signal handler one can use all registers freely, i. e., the operating system has

to store and restore all registers by itself and the code can become more portable. Older

version of our implementation uses this technique, too. However, this is not compatible

with the handling of signals in actual Linux kernels. If a signal is sent to a process, the

kernel saves some data on the kernel stack, and writes a syscall to the kernel function

2we use the word “segment” for parts of memory, in the context of Intel processors segments are some-

thing different

80 BSP for Heterogeneous Workstation Clusters

Table 6.1: Example of a memory map read from maps of the proc-filesystem

virtual address perm offset mapped file

08048000-08108000 r-xp 00000000 03:05 4432424 /tmp/migtest

08108000-08109000 rw-p 000bf000 03:05 4432424 /tmp/migtest

08109000-081c9000 rw-p 08109000 00:00 0 [heap]

a0000000-a0100000 rwxp a0000000 00:00 0

b0000000-b2000000 rw-s 00000000 00:07 305790982 /SYSV00000000 (deleted)

b7e00000-b7e21000 rw-p b7e00000 00:00 0

b7e21000-b7f00000 --p b7e21000 00:00 0

b7ffc000-b7ffd000 r-xp b7ffc000 00:00 0 [vdso]

bffb7000-bffcd000 rw-p bffb7000 00:00 0 [stack]

sys_sigret onto the user stack. So, when the signal handler returns, the kernel func-

tion restores the data from the kernel stack. We cannot modify or migrate this data on the

kernel stack. See [Bar00] for more details.

Memory: The virtual address space of a Linux process contains different parts which

must be handled separately. Table 6.1 shows the memory mapping of one virtual proces-

sor. This mapping can be read from the file maps in the Linux proc filesystem. First

there is the program code segment, which is marked as read-only and is the execution file

mapped into the address space. We do not need to migrate this because the program is the

same on all nodes. Second, there are two data segments: One is for the initialized data;

it is mapped from the execution file, but only copied on first write access. Next, there is

the real data segment; this data segment grows when the application needs more memory

and calls malloc. The size of this segment can be adjusted directly by brk. If the user

allocates large blocks of data, the system creates new segments. We have to migrate the

data segments and all these new segments.

The last segment type is the stack. It used to start at address PAGE_OFFSET defined

in page.h of the Linux kernel, normally 0xc00000003, and grows downwards. For

security reasons, Linux uses address space randomization [LWN05] since kernel version

2.6.12, i. e., the stack starts at a random address in the 64 KB area below PAGE_OFFSET,

so we need to find the stack and restore it at the same address on the destination node.

During the migration we use a temporary stack inside the shared memory.

Data stored in the Linux kernel: There is some data stored in the kernel, examples are

information about open files and used semaphores. We have added support for regular

files which exists on all the hosts, for instance, on a shared file system like the network

file system (NFS). We gather information about all open files and try to reopen them on

the destination host.

All other information stored in the kernel will be lost during migration, so the not

use of semaphores, network connections, and other non migratable resources during the

3On 32 bit systems Linux splits the 4 GB address space into 3 GB user space from 0x00000000–

0xbfffffff and 1 GB reserved for the kernel

6.3. Migration of Virtual Processors 81

migration (i. e., during the synchronization) is not allowed. If such resources are needed

over more than one superstep, the migration of that virtual processor can temporarily be

disabled.

Thread local storage: Although virtual processors are not allowed to use threads,

we have to deal with thread local storage (TLS), because the ancestor of our processes,

the main process, uses threads, i. e., all the processes in PUB are forked of a threaded

process. TLS is used for global variables (accessible from all functions) that have different

values for each thread. The C library uses this feature among other things for the errno

variable: errno is a variable that is set to the error code of operations. Instead of using

the function return value, the POSIX IEEE Std 1003.1 [IEE96] uses a global variable.

As all threads share the same address space, these variables must have different ad-

dresses for different threads, so the linker cannot insert a normal memory reference here.

The Linux implementation uses a Thread Control Block (TCB), which stores all the in-

formation about TLS data. A pointer for the actual thread is accessible at the address 0

of the segment stored in the GS segment register (see [Dre05] for details). This segment

is mapped to a part of the stack of the actual running thread. We create new processes for

the migration by forking the receive thread, i. e., although we do not use threads inside

virtual processors and have our own stack, we cannot free the stack of the receive thread,

and furthermore we have to guarantee that the stack is at the same address on all hosts.

Thus, we allocate our own stack at a fixed address and assign it to the receive thread with

the function pthread_attr_setstack from the POSIX thread library.

6.3.3 Performance of the Migration

To measure the performance of migrations, we disable load balancing and force one pro-

cess to migrate. Due to different local clocks, we send the process via all hosts in the

network and back to the original location, i. e., a ping-pong test for two nodes and a round

trip for more hosts.

We use two different clusters: four PCs with Intel Pentium D 3 GHz CPUs connected

by a Gigabit Ethernet switch (D-Link DGS-1005D), and four PCs with Intel Core2Duo

2.6 GHz with a Cisco Switch respectively. The benchmark program allocates a block

of memory and migrates one virtual processor from host 0 to 1,2,...p−1 and back to

host 0. These migrations are repeated 20 times. Figure 6.3 shows the mean time for one

migration. The bandwidth grows up to 111 MB/s for large data sets, which is nearly opti-

mal for Gigabit Ethernet. Without network, we achieve 850 MB/s (Core2Duo, 412 MB/s

Pentium4 D) for migrations whereas a plain memory copy obtains 3.123 GB/s (respec-

tively 1.926 GB/s). Considering that we always use TCP/IP connections with all the over-

head of buffering and calls to the operating system, this is a very good performance. It

is not worthwhile to optimize the case of local migrations, for instance, by using shared

memory, because migrations to the same host are usefull only for this benchmark.

82 BSP for Heterogeneous Workstation Clusters

100

1000

1 10 100 1000 10000 100000 106107108109

migration time [ms]

size of user data [bytes]

local Core2Duo

2 hosts Core2Duo

4 hosts Core2Duo

local Pentium4 D

2 hosts Pentium4 D

4 hosts Pentium 4D

Figure 6.3: Time needed for Migrations

6.4 Load Balancing

In the last section, we have described a technique to migrate a virtual processor from

one host to another. In this section, we will use this to balance the load of a workstation

cluster. Before we describe the load balancing strategies, we will examine the typical load

on workstation clusters and show two models for the load.

6.4.1 Load Measurements

To figure out the typical utilization of workstation, we measure the load and processor

utilization of several workstation over weeks. On the one hand, we use this data for testing

the load models, on the other hand, we extract load profiles and use them to simulate the

load for our benchmarks.

First we have to define the term load of a computer. For Unix systems, the operation

system often provides a load value for a computer. Normally, this is said to be the average

number of processes with state ready-, running, or waiting for input/output.

Linux provides three load values, taking the average other 1 minute, 5 minutes, and 15

minutes. More precisely, the value is calculated by the following iterative formula every

5 seconds, given mthe reporting period in minutes (1, 5, or 15), n(t)the number of non

6.4. Load Balancing 83

2αλ

t0+2T t0+3T

t0+T

L(t)

Figure 6.4: Example for the Constant Usage Interval Model

idling processes at time t, load(t−1)the load in the period before t[Gun04]:

load(t):=load(t−1)e−5

60m+n(t)·(1−e−5

60m)

This way to tame highly variable data is called exponential smoothing.

There are other measurements for the utilization of computers. Thomas Kunz an-

alyzed the influence of different workload descriptions for load balancing [Kun91]. He

examines the following indicators: the actual number of tasks in the running queue, rate of

system calls, CPU context switch rate, percentage of idle CPU-time, size of free memory,

and one minute load average or a combination of them.

6.4.2 Models for the Load

There are different ways to model the load of workstations. The first model is motivated

by the observation that the CPU usage is typically nearly constant for larger time intervals.

Constant Usage Interval Model. The CPU usage typically shows a continuous pattern

for quite a time, then changes abruptly, then again shows a continuous pattern for some

time, and so on. The reason therefore is that many users often perform uniform activities

(e. g., word processing, programming) or no activities at all (e. g., at night or during lunch

break).

A given CPU usage graph (e. g., of the length of a week) can thus be split into blocks,

in which the CPU usage is somewhat steady or shows a continuous pattern. These blocks

typically have a duration of some hours, but also durations from only half an hour (e. g.,

lunch break) up to several days (e. g., a weekend) do occur.

84 BSP for Heterogeneous Workstation Clusters

Based on the above observations we have designed a model [BGM05] to describe

and classify the external work load. We describe the CPU usage in such a block by a

rather tight interval (with radius α∈R,α<1

2) around a median load value (λ∈Rwith

0≤λ−αand λ+α≤1) and rates for the upper and lower deviation (β+∈R,β+<1

resp. β−∈R,β−<1

2) as illustrated in Figure 6.4. We will refer to such a block as a

(λ,α,β+,β−,T)-load sequence in the following.

In order to describe the frequency and duration of the deviations, we subdivide the

load sequences into small sections of length T, called load periods. The values β+and

β−must be chosen such that the deviation rates never exceed them for an arbitrary starting

point of a load period within the load sequence.

If a superstep is executed completely within a (λ,α,β+,β−,T)-load sequence, the

factor between the minimal and maximal possible duration is at most q0∈R+with:

q0:=1−(1−β−)(λ−α)

1−(β++(1−β+)(λ+α))

For q∈R+,q≥q0we call a (λ,α,β+,β−,T)-load sequence q-bounded.

This result guarantees that the running times of BSP processes, optimally scheduled

based on the load of the previous superstep, differ at most by a factor q2within a load

sequence. All our load balancing strategies monitor the past and use this information to

predict the future.

Evaluating the Collected Data. When sectioning a given CPU usage sequence into

load sequences, our goal is to obtain load sequences with a q-boundedness as small as pos-

sible and a duration as long as possible, while the rate of unusable time intervals should be

as small as possible. Obviously, these three optimization targets depend on each other. We

have processed the data collected from the monitored PCs (over 6.8 million samples) with

a Perl program which yields an approximation for this non-trivial optimization problem.

The average idle time over a week ranged from approx. 35% up to 95%, so there is

obviously a huge amount of unused computation power. Time intervals of less than half

an hour and such where the CPU is nearly fully utilized by the user or its usage fluctuates

too heavily, are no candidates for a load sequence. The rate of wasted idle time in such

intervals is less than 3%.

Choosing suitable values for the parameters of the load sequences, it was possible to

partition the given CPU usage sequences into load sequences such that the predominant

part of the load sequences was 1.6-bounded. On most PCs, the average duration of a

load sequence was 4 hours or even much longer. Compared with the time needed for a

migration (Section 6.3.3) this is a very long duration.

Oscillation Overlap Model. We also monitored the load of Linux machines. First we

are looking at the total free capacity of the cluster. A computer with a load value smaller

6.4. Load Balancing 85

0 5 10 15 20

free capacity

time of day

day 1

day 2

Figure 6.5: Free Resources over the Day

than 1 is not fully utilized, thus it can execute processes of lower priority. So we define

the total free capacity as the sum of all 1 −lifor all machines iwith a load lismaller

than 1:

total free capacity :=∑

i=0,...,p−1

li<1

1−li

A sample of two days can be seen in Figure 6.5. The free capacity falls down every

night at four o’clock, when system jobs like checks for software updates are performed

on almost all hosts.

Looking at the measured data, one can see that there are some periodic fluctuations,

for example, the load depends on the time of the day or the day of the week. Thus, the load

of a computer can be seen as an overlapping of oscillations with different frequencies. To

analyze the load data, a fast fourier transformation was used to calculate the amplitudes

of the frequencies.

Figure 6.6 shows the spectrum of the free capacity. There are peaks at certain fre-

quencies, for instance, 10 min, 15 min, and multiple of them. System jobs are the main

reason for this. The operation system checks every 15 minutes if there are some jobs to

do, thus the load introduced by the system will always start at the beginning of 15 min

time intervals.

The spectrum for a single computer is not that regular, because normally system jobs

86 BSP for Heterogeneous Workstation Clusters

1 10 100 1000 10000

amplitude

period length [min]

5 min

10 min

15 min

30 min

3 hours

5 hours

1 day

2 days

Figure 6.6: Spectrum of Total Free Capacity

do not dominate the work of a computer. Only if one look at the total system, regular

parts can be seen, because load differences from the irregular normal work is smoothed

out.

6.4.3 Load Balancing Strategies

Load balancing consists of three decisions:

1. Do we need to migrate a virtual processor?

2. To which destination node?

3. Which virtual processor?

Our PUB library contains some loadbalancing strategies, but it is also possible to add new

strategies for the first and second question. The virtual processor for a migration is always

chosen by PUB: When the load balancer decides to migrate a virtual processor, it writes

the destination to a queue and PUB will migrate the first virtual processor that checks this

queue, i. e., the first executed on this computer that reaches the synchronization.

In the following, we describe the implemented strategies: one centralized one with

one node gathering the load of all nodes and making all migration decisions, and some

distributed strategies without global knowledge.

6.4. Load Balancing 87

The Global Strategy. All nodes send information about their CPU speed and their ac-

tual load to one master node. This node calculates the least (Pmin) and the most (Pmax)

loaded node and migrates one virtual processor from Pmax to Pmin if necessary.

Simple Distributed Strategy. Every node asks a constant number cother nodes chosen

uniformly at random for their load. If the minimal load of all the cnodes is smaller than

the own load minus a constant d≥1, one process is migrated to the least loaded node. The

waiting time between these load balancing messages is chosen uniformly at random in the

interval [1 s–40 s] to minimize the probability that two nodes choose the same destination

at exactly the same time. If a node is chosen, it will be notified by a message and increases

the load value that it reports by one to indicate that the actual load will increase soon.

Conservative Distributed Strategy. As in the simple distributed strategy, crandomly

chosen nodes are asked for their load liand for their computation power ci. Next we com-

pute the expected computation power per process ci

li+1if we would migrate one process to

the node ifor i=1,...,c. If the maximal ratio of all cnodes is greater than the local ratio,

the program would run faster after the migration. This strategy only migrates a process, if

the source is the most loaded node of the c+1 ones, i. e., ci

liis greater than the own ratio.

This conservative strategy gives other more loaded nodes the chance to migrate and thus

leads to fewer migrations than the simple strategy.

Global Prediction Strategy. Each node has an array of the loads of all other nodes but

these entries are not up to date all the time. Each node sends its load periodically to k

uniformly at random chosen nodes in the network. In [HDL98] Hill et al. analyze this

process using Marcov chains. The mean age of an entry in the array is O(p/k).

All these strategies use the one minute load average increased by one for each expected

task. This is needed because migrations are executed at the synchronization only, and an

unloaded machine should not be chosen by many other nodes without noticing that other

processes will migrate to the same destination.

6.4.4 Evaluation

Hardware

For our experiments, we used the PCs in our office and in the students’ pools. We ran

different experiments: During the first experiment, the computers were partially used by

their owners, for the second experiment, we choose unused computers and simulated the

external workload. For this, we generate artificial load using a higher scheduling priority

than our normal jobs. As the simulated work is the same for all tests, we get repeatable

results.

88 BSP for Heterogeneous Workstation Clusters

0.2

0.4

0.6

0.8

0 500 1000 1500 2000 2500 3000

utilization

time

total

Figure 6.7: Simulated Load Profile

There are computers of different speed:

1. For the first test we use Intel Pentium III with 933 MHz and one or two processors

(each processor 1867 bogomips4), Intel Pentium IV with 1.7 GHz (3395 bogomips),

and 3.0 GHz (5985 bogomips).

2. For the second group, we use 16 workstations powered by Intel Core2 6700 CPUs

with 2.66 GHz (each core 5320 bogomips), 4 Intel PentiumD 3 GHz (each core

5990 bogomips), 4 Intel Pentium IV 2.4 GHz (4790 bogomips), and 8 Intel Pen-

tium IV 1.7 GHz (3400 bogomips). It is obvious that bogomips is not an accurate

approximation of the performance, because for most applications a Core2 CPU is

much faster than a Pentium D. The bogomips dimension depends on clock speed.

It is nearly the clock speed or twice the clock speed, if the processor can execute

two instructions per cycle. Other important aspects, for example, the speed of the

memory hierarchy, are not included.

The load profile to simulate the external load is based on measurements of the utiliza-

tion of workstation computers. We used the same load profiles as in [BGM05]. One load

4bogomips is a simple measuring unit for the speed of CPUs provided by the Linux kernel. We show

this unit here because our loadbalancing algorithms use it for approximating the power of the processors.

6.4. Load Balancing 89

profile is shown in Figure 6.7. Each entry represents the load of one host, the red curve

aggregates the total utilization of the workstations.

Benchmarks

We use different benchmarks for the evaluation. The start with two synthetic microbench-

marks, one for the communication (total exchange, all processors send data to all others),

and one for computation (sequential multiplications).

The next two tests are applications: solving the 3-Satisfiability problem and compu-

tation of LU-decompositions. We choose these examples as they cover a wide range of

typical problems in the field of high performance computations.

Communication Test: Total Exchange. The first example is a simple all-to-all-com-

munication. A run of the benchmark consists of 64 supersteps. In each superstep, each

processor send 32 kByte to each other processor. Thus, the cost model of BSP is

64·(32768(p−1)·g+L) = 65,011,712 ·g+64·L.

In an all-to-all comunication, the network is the bottle-neck, so we have to minimize

the number of forwarded messages to reduce the network congestion. Figure 6.8 shows

the results for the communication benchmark all-to-all on 32 nodes. Updating the routing

tables by broadcasting migrations is necessary to reduce the running time significantly.

Artificial Computation Benchmark. This benchmark is a simple computation test.

Each host calculates an increasing number of multiplications (from 107up to 1011).

Figure 6.9 shows that migrations are profitable for large supersteps. The total execution

time of our benchmark decreases from ≈8 hours to ≈4 hours with migrations.

3-Satisfiability. Our first example application is the same BSP algorithm for the 3-Satis-

fiability(3-SAT) problem we used in Section 5.5.2 for the system on a chip. We can use

the same source code for both systems. This shows that it is easy to implement parallel

algorithms that run efficient for very different architectures. In fact, we have 888a few

target depending lines of codes, for the input (formula is either read from disk or hard

coded into the program) or for time measurements (using either operating system clock

or cycle counter).

For the PUB-library implementation the processors enter a new superstep every 10000

divide-steps. If a solution is found, it is broadcasted and the computation stops at the end

of the actual superstep. On the one hand these synchronizations are necessary to detect

whether a solution is found, on the other hand they are needed because migrations only

occur during synchronizations.

90 BSP for Heterogeneous Workstation Clusters

100

150

200

250

300

350

0 20 40 60 80 100 120 140 160 180 200

time [s]

run

average

no migration (threads for virt. proc.)

migration disabled

simple distributed migration

conservative distr. migration

global migration

global migration, broadcast

cons. distr. migration, broadcast

Figure 6.8: Execution Time of Total Exchange on 32 Workstations

0.1

1 10 100 1000 10000

time for 10 million multiplications [s]

superstep length [10 million multiplications]

no migrations (threads for virt.proc.)

conservative distributed

simple distributed

global

Figure 6.9: Artificial Computation Benchmark

6.4. Load Balancing 91

100

150

200

250

300

350

400

0 50 100 150 200 250 300 350

time [s]

run

average

no migration (threads for virt. proc.)

migration disabled

simple distributed migration

conservative distr. migration

global migration

global migration, broadcast

cons. distr. migration, broadcast

Figure 6.10: Execution Time of 3-SAT on 32 Workstations

200

400

600

800

1000

1200

1400

1600

1800

0 5 10 15 20 25 30 35 40 45 50

time [s]

run

average

no migration (threads for virt. proc.)

migration disabled

conservative distr. migration

global migration

global migration, broadcast

cons. distr. migration, broadcast

Figure 6.11: Execution Time of 3-SAT on 64 Workstations

92 BSP for Heterogeneous Workstation Clusters

2000

4000

6000

8000

10000

12000

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

time [s]

size

no migration (threads)

migration disabled

conservative distributed

simple distributed

global predicted

global

Figure 6.12: LU Decomposition Benchmark

Results. The running time for 3-SAT on 32 and 64 processors is shown in Fig-

ures 6.10 and 6.11. The 3-SAT solver benefits from migrations in our office cluster as

expected. There were some nodes in the system with some load of other users greater

than 1, so it was reasonable to migrate the processes from such nodes to other idle nodes.

The cost of the migrations itself is small due to the small memory contexts of 3-SAT. Up-

dating the routing tables is not of big importance, because there are only few messages.

LU Decomposition. A basic problem in many scientific computing applications is solv-

ing large systems of linear equations. Therefore, we use a benchmark that deals with ma-

trices and floating point operations. Consider a system of linear equations Ax =b, where

Ais a given n×nnonsingular matrix, ba given vector of length n, and xis the solution.

LU decomposition is one method for the calculation of x. LU decomposition is the de-

composition of the matrix Ainto a product of two triangular matrices Land U,A=L·U,

where Lis a unit lower triangular matrix (i. e., ones on the diagonal, zeros above) and Uis

an upper triangular matrix (i. e., zeros below the diagonal). The advantage of this method

over Gaussian elimination is that the factors Land Ucan be reused for different systems

with the same matrix A.

The algorithm and the implementation is described in [Bis04]. The BSP function calls

are adapted from the BSPlib syntax to the PUB library syntax, but no further optimization

6.4. Load Balancing 93

cons. distri.

simple distri.

Global

Pred

migrations

(a) maximal migrations per host

cons. distri.

simple distri.

Global

Pred

migrations

(b) total

Figure 6.13: Number of Migrations

is done, i. e., the benchmark code does not use advanced features of PUB, for example,

oblivious synchronization or the adaptive broadcast function.

Results. The LU benchmark benefits from migrations. All load balancing strategies

but the global predicted one reduces the total execution times. The distributed strategies

perform better than the global one. In our experiment, there was no bottleneck at proces-

sor 0, which gathers all load information, but the load information is not up to date all

the time. On the other side, the distributed strategies search for less loaded hosts and use

only the actual load of the machines. Furthermore, they have a high probability to find

imbalances, as the number of machines is only four times the number of hosts asked for

their load in a load balancing step.

The number of migrations is shown in Figure 6.13. The simple distributed strategy

leads to at most one migration per host and superstep (except the last superstep), whereas

94 BSP for Heterogeneous Workstation Clusters

all other strategies have at least 3 supersteps with a maximum of at least 2 migrations

for a host. Also the total number of migrations, i. e., the total amount of data sent is

significantly lower for the simple distribution strategy (Figure 6.13(b)).

6.5 Conclusions

In this chapter we have shown, how BSP programs can be executed efficiently on clusters

of workstations. As we do only use idle times which are wasted in many installations, this

is a cheap way to high performance computing. We have demonstrated the power of the

concept of virtual processors and the benefits of load balancing by migrations.

The technique of migrating Linux processes in userspace is of independent interest in

all areas of load balancing and fault tolerance.

Chapter 7

Summary and Outlook

In this chapter we will summarize the thesis and conclude with some open questions for

further research.

7.1 Summary

To provide more performance, computers have to utilize more and more processing units.

A general model for such systems is needed, which allows portable yet efficient parallel

programs. In this thesis, we evaluated a family of models — the bulk synchronous par-

allel models — and showed that these models can play the required role. We surveyed

algorithms for BSP, and examined implementations of algorithms as well as libraries sup-

porting the implementation of BSP algorithms on different architectures.

The infrequent use of BSP programming in practice is mainly due to the lack of in-

formation, hence we hope that this thesis can help to spread knowledge about BSP, and,

providing a powerful library, increase the use for real applications.

We proposed a general implementation of BSP, the Paderborn University BSP(PUB)

library, which allows the implementation of efficient BSP algorithms for a wide range

of architectures in a very comfortable way. In particular, we consider the techniques to

utilize unused computational power of existing workstation clusters, as we think much

resources could be used better.

We also investigated parallelism on a single chip, as multi core processors are the most

promising way to increase performance for every-day user, and in the near future most of

the standard computers will be multi core machines. We showed that implementing many

simple processors increases parallelism more effectively than implementing only one to

four more complex “classical” processors. The benchmarks, although using very small

and simple 32 bit RISC processor cores, give an impression of the potential performance

of such systems.

We showed that BSP can be useful in such a scenario as well. Thus, BSP algorithms

96 Summary and Outlook

can be implemented on a large range of parallel systems by using the appropriate compiler

and library, without modifying the source code.

7.2 Open Questions and Further Research

Although this thesis gives a complete overview over BSP, from the model to implementa-

tions, there are many open questions and links for further research.

Models. There are many extensions for the standard BSP model, but some fields are

worthwile to look at, for example, real time applications. BSP can be used to garantuee

worst case execution times, but it could likely be improved to suit the needs of real time

applications, for instance, by implementing forced synchronizations.

Algorithms. There are many algorithms designed and analysed for the BSP model, but

many more problems remain to be treated from the BSP point of view.

Implementations. Implementations of BSP for a large range of parallel machines exist,

but as new computer designs appear, one has to implement and optimize the libraries for

new technologies, for instance, multi core processors.

Furthermore, more implementations of algorithms, tools to ease the development (in-

cluding parallel debuggers and performance analysers) are needed. Also, standard li-

braries of often used data structures are important for the success of a system (e. g., com-

pare Java and its comfortable runtime environment). Hence, one should also consider data

structures and implement libraries of standard data structures with a bulk synchronous

semantic, for example, distributed containers, including arrays and sorted lists. These

containers should have methods to add and delete elements, but these actions should be

executed in the synchronization phase only.

Parallel System on Chip. To obtain a high performance system, more powerful proces-

sor cores should be used. It would be very interesting to see a hardware implementation

employing our interconnection network, fast processor cores (e. g., PowerPC core), and

on-chip memory.

We think such a system could outperform standard processors in many applications.

Heterogeneous Workstation Cluster. Several aspects can be considered: First, the

time required for migrations can be improved, by “migrating on demand”, i. e., only a

small part of the memory is transferred to the destination, then the process is continued,

and the migration takes part in the background. If the process accesses a memory page

that is not already transmitted, it is suspended until the page is ready.

7.2. Open Questions and Further Research 97

We have implemented some load balancing algorithms and evaluted their performance

in practice. However, there is almost no theoretical analysis of them. One result for load

balancing and scheduling in a network is [LMSM04]. Yet, the utilized model differs

strongly from our scenario. The authors assume no preemption, and machines are only

completely available or not at all. A machine that becomes unavailable terminates all jobs

running on it, hence, one has no chance to migrate and save the work done so far.

Our implementation supports fault tolerance. It can save checkpoints of processes to

disk and distribute them in the network. The techniques are ready to use, however, a good

strategy is missing. At which synchronizations should a checkpoint be created? On which

computers should it be stored? Furthermore, a distribution of checkpoint data using idle

times of the network should be considered.

98 Summary and Outlook

Bibliography

[ACS89] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. On communication

latency in PRAM computations. In SPAA ’89: Proceedings of the first

annual ACM symposium on Parallel algorithms and architectures, pages

11–21, Santa Fe, New Mexico, United States, 1989. ACM Press.

[ACS90] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication com-

plexity of PRAMs. Theor. Comput. Sci., 71(1):3–28, 1990.

[ADJ+98] Micah Adler, Wolfgang Dittrich, Ben H. H. Juurlink, Mirosław Ku-

tyłowski, and Ingo Rieping. Communication-optimal parallel minimum

spanning tree algorithms (extended abstract). In SPAA ’98: Proceedings of

the tenth annual ACM symposium on Parallel algorithms and architectures,

pages 27–36, Puerto Vallarta, Mexico, 1998. ACM Press.

[AMD06] AMD64 Architecture Programmer’s Manual, vol. 2: System Programming.

Advanced Micro Devices, Inc., 2006.

[ARM99] ARM Ltd. AMBA Specification (Rev. 2.0), 1999.

[ARM01] ARM Ltd. Multi-layer AHB, 2001.

[Bar00] Moshe Bar. The Linux Signals Handling Model. Linux Journal, 2000.

http://www.linuxjournal.com/article/3985.

[BDM98] Armin Bäumker, Wolfgang Dittrich, and Friedhelm Meyer auf der Heide.

Truly Efficient Parallel Algorithms: 1-optimal Multisearch for an Exten-

sion of the BSP Model. Theoretical Computer Science, 203(2):175–203,

1998.

[Ber98] Martin Beran. Computational power of BSP computers. In SOFSEM ’98:

Proceedings of the 25th Conference on Current Trends in Theory and Prac-

tice of Informatics, pages 285–293, London, UK, 1998. Springer-Verlag.

100 Bibliography

[Ber99] Martin Beran. Decomposable bulk synchronous parallel computers. In

SOFSEM ’99: Proceedings of the 26th Conference on Current Trends in

Theory and Practice of Informatics on Theory and Practice of Informatics,

pages 349–359, London, UK, 1999. Springer-Verlag.

[BGB+04] James M. Baker, Jr., Brian Gold, Mark Bucciero, Sidney Bennett, Ra-

jneesh Mahajan, Priyadarshini Ramachandran, and Jignesh Shah. SCMP:

a single-chip message-passing parallel computer. Journal of Supercomput-

ing, 30(2):133–149, 2004.

[BGM05] Olaf Bonorden, Joachim Gehweiler, and Friedhelm Meyer auf der Heide.

Load balancing strategies in a web computing environment. In Proceeed-

ings of International Conference on Parallel Processing and Applied Math-

ematics (PPAM), pages 839–846, Poznan, Poland, September 2005.

[BGM06] Olaf Bonorden, Joachim Gehweiler, and Friedhelm Meyer auf der Heide.

A web computing environment for parallel algorithms in Java. Scalable

Computing: Practice and Experience, 7(2):1–14, June 2006.

[BGMZ97] Guy E. Blelloch, Phillip B. Gibbons, Yossi Matias, and Marco Zagha. Ac-

counting for memory bank contention and delay in high-bandwidth mul-

tiprocessors. IEEE Transactions on Parallel and Distributed Systems,

08(9):943–958, 1997.

[BGOW07] Olaf Bonorden, Joachim Gehweiler, Pawel Olszta, and Rolf Wanka. PUB-

Library - User Guide and Function Reference.

Available at http://publibray.sourceforge.net, 2007.

[BGvN89] Arthur W. Burks, Herman H. Goldstine, and John von Neumann. Prelimi-

nary discussion of the logical design of an electronic computing instrument

(1946). Perspectives on the computer revolution, pages 39–48, 1989.

[Bis04] Rob H. Bisseling. Parallel Scientific Computation: A Structured Approach

using BSP and MPI. Oxford University Press, Oxford, UK, March 2004.

[Bis07] Rob H. Bisseling. BSPedupack, source of implementations from [Bis04].

http://www.math.uu.nl/people/bisseling/software.html, 2007.

[BJvR03] Olaf Bonorden, Ben H. H. Juurlink, Ingo von Otte, and Ingo Rieping. The

Paderborn University BSP (PUB) library. Parallel Computing, 29(2):187–

207, February 2003.

Bibliography 101

[BL98] A. Barak and O. La’adan. The MOSIX multicomputer operating system

for high performance cluster computing. Journal of Future Generation

Computer Systems, 13(4–5):361–372, March 1998.

[BMW02] Olaf Bonorden, Friedhelm Meyer auf der Heide, and Rolf Wanka. Compo-

sition of efficient nested BSP algorithms: Minimum spanning tree compu-

tation as an instructive example. In Int. Conf. on Parallel and Distributed

Processing Techniques and Applications (PDPTA), 2002.

[Bon02] Olaf Bonorden. Ein System zur automatischen Konfiguration effizienter

paralleler Algorithmen im BSP-Modell. Diplomarbeit (in German), Uni-

versität Paderborn, 2002.

[Bon07] Olaf Bonorden. Load balancing in the bulk-synchronous-parallel setting

using process migrations. In The Sixteenth International Heterogeneity in

Computing Workshop (HCW07), Proceedings of the International Parallel

& Distributed Processing Symposium (IPDPS07). IEEE Computer Society,

2007.

[BR00] Stefan Bock and Otto Rosenberg. A new parallel breadth first tabu search

technique for solving production planning problems. International Trans-

actions in Operational Research, 7(6):625–635, 2000.

[BS97] David Blackston and Torsten Suel. Highly portable and efficient implemen-

tations of parallel adaptive n-body methods. In SC ’97: High Performance

Networking and Computing (Supercomputing ’97), November 1997.

[BSP07] BSP Worldwide, Co-ordinating Bulk Synchronous Parallal Computation.

http://www.bsp-worldwide.org/, 2007. Rob H. Bisseling (maintainer).

[BSR+03] Olaf Bonorden, Adrian Slowik, Ulrich Rückert, Mario Porrmann, Jörg-

Christian Niemann, Friedhelm Meyer auf der Heide, Uwe Kastens, Dinh

Khoi Le, Nikolaus Brüls, and Michael Thies. A holistic methodology for

network processor design. In Proceedings of the Workshop on High-Speed

Local Networks held in conjunction with the 28th Annual IEEE Conference

on Local Computer Networks (LCN2003), pages 583–592, October 2003.

[BvzGG+01] Olaf Bonorden, Joachim von zur Gathen, Jürgen Gerhard, Olaf Müller, and

Michael Nöcker. Factoring a binary polynomial of degree over one million.

ACM SIGSAM Bulletin, 35(1):16–18, 2001.

[Cal96] Radu Calinescu. Bulk Synchronous Parallel Algorithms for Optimistic

Discrete Event Simulation. Technical Report PRG-TR-8-9, Oxford Uni-

versity Computing Laboratory, April 1996.

102 Bibliography

[CKP+93] David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay,

Klaus E. Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten

von Eicken. LogP: Towards a realistic model of parallel computation. In

Proceedings 4th ACM SIGPLAN Symposium on Principles and Practice of

Parallel Programming, pages 1–12, 1993.

[CLG05] Jiannong Cao, Yinghao Li, and Minyi Guo. Process migration for MPI

applications based on coordinated checkpoint. In Proceedings of the 11th

International Conference on Parallel and Distributed Systems, volume 1,

pages 306–312, 2005.

[CLR90] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduc-

tion to Algorithms. MIT Press, Cambridge, 1990.

[CP93] Robert Cypher and C. Greg Plaxton. Deterministic sorting in nearly loga-

rithmic time on the hypercube and related computers. Journal of Computer

and System Science, 47(3):501–548, 1993.

[DFRC93] Frank Dehne, Andreas Fabri, and Andrew Rau-Chaplin. Scalable paral-

lel geometric algorithms for coarse grained multicomputers. In SCG ’93:

Proceedings of the ninth annual symposium on Computational geometry,

pages 298–307, New York, NY, USA, 1993. ACM Press.

[DFRC96] Frank Dehne, Andreas Fabri, and Andrew Rau-Chaplin. Scalable paral-

lel computational geometry for coarse grained multicomputers. Interna-

tional Journal of Computational Geometry and Applications, 6(3):379–

400, 1996.

[DG98] Frank Dehne and Silvia Götz. Practical parallel algorithms for minimum

spanning trees. In Proceedings 17th IEEE Symposium on Reliable Dis-

tributed Systems, Workshop on Advances in Parallel and Distributed Sys-

tems, West Lafayette, IN, USA, pages 366–371, 1998.

[DHS97] Stephen R. Donaldson, Jonathan M. D. Hill, and David B. Skillicorn. Pre-

dictable communication on unpredictable networks: Implementing BSP

over TCP/IP. Technical Report PRG-TR-40-, Programming Research

Group, Oxford University Computing Laboratory, 1997.

[dis08] distributed.net project homepage. http://www.distributed.net, 2008.

[DL01] William R. Dieter and James E. Lumpp, Jr. User-level checkpointing for

linuxthreads programs. In Proceedings of the 2001 USENIX Technical

Conference, http://www.engr.uky.edu/ dieter/publications.html, June 2001.

Bibliography 103

[DR98] Bozhidar Dimitrov and Vernon Rego. Arachne: A portable threads sys-

tem supporting migrant threads on heterogeneous network farms. IEEE

Transactions on Parallel and Distributed Systems, 9(5):459–469, 1998.

[Dre05] Ulrich Drepper. ELF Handling For Thread-Local Storage.

http://people.redhat.com/drepper/tls.pdf, December 2005. Version 0.20.

[Due05] Jason Duell. The design and implementation of Berkeley Lab’s Linux

checkpoint/restart. Technical Report LBNL-54941, April 2005.

[FH96] Amr Fahmy and Abdelsalam Heddaya. Communicable memory and lazy

barriers for bulk synchronous parallelism in BSPk. Technical Report

BUCS-TR-1996-01, Computer Science Department, Boston University,

July 1996.

[FW78] Steven Fortune and James Wyllie. Parallelism in random access machines.

In STOC ’78: Proceedings of the tenth annual ACM symposium on Theory

of computing, pages 114–118, New York, NY, USA, 1978. ACM Press.

[GGHK05] Andrei Goldchleger, Alfredo Goldman, Ulisses Hayashida, and Fabio Kon.

The implementation of the BSP parallel computing model on the InteGrade

Grid middleware. In MGC ’05: Proceedings of the 3rd international work-

shop on Middleware for grid computing, pages 1–6, New York, NY, USA,

2005. ACM Press.

[GHL+96] Mark W. Goudreau, Jonathan M. D. Hill, Kevin Lang, Bill McColl,

Satish B. Rao, Dan C. Stefanescu, Torsten Suel, and Thanasis Tsantilas.

A proposal for the BSP Worldwide Standard Library. Technical report,

http://www.bsp-worldwide.org, April 1996.

[Gib89] Phillip B. Gibbons. A more practical PRAM model. In SPAA ’89: Pro-

ceedings of the first annual ACM symposium on Parallel algorithms and

architectures, pages 158–168, Santa Fe, New Mexico, United States, 1989.

ACM Press.

[GLC01] Yan Gu, Bu-Sung Lee, and Wentong Cai. JBSP: a BSP programming li-

brary in Java. Journal of Parallel and Distributed Computing, 61(8):1126–

1142, 2001.

[GLP+00] Jesus A. Gonzalez, Coromoto Leon, Fabiana Piccoli, Marcela Printista,

José Luis Roda, Casiano Rodríguez, and Francisco de Sande. Oblivious

bsp. In EUROPAR: Parallel Processing, 6nd International EURO-PAR

Conference, volume 1900, pages 682–685. Springer, 2000.

104 Bibliography

[GLR+99] Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, and Thanasis

Tsantilas. Portable and efficient parallel computing using the BSP model.

IEEE Transactions on Computers, 48(7):670–689, 1999.

[GMR98] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. The queue-

read queue-write PRAM model: Accounting for contention in parallel al-

gorithms. SIAM Journal on Computing, 28(2):733–769, 1998.

[Goo00] Michael T. Goodrich. Communication-efficient parallel sorting. SIAM

Journal on Computing, 29(2):416–432, 2000.

[Göt98] Silvia Götz. Communication-efficient parallel algorithms for minimal

spanning tree computations. Diploma Thesis, Universität Paderborn, 1998.

[Gun04] Neil J. Gunther. Linux load average revealed. In Proceedings of the 30th

International Computer Measurement Group (CMG) Conference, pages

149–160, December 2004.

[GV94] Alexandros V. Gerbessiotis and Leslie G. Valiant. Direct bulk-synchronous

parallel algorithms. Journal of Parallel Distributed Computing, 22(2):251–

267, 1994.

[Hay02] Mike Hayward. Intel P6 vs P7 system call performance.

http://lkml.org/lkml/2002/12/9/13, 2002. LWN.net.

[HDL98] Jonathan M. D. Hill, Stephen R. Donaldson, and Tim Lanfear. Process

migration and fault tolerance of BSPlib programs running on networks of

workstations. In Euro-Par ’98: Proceedings of the 4th International Euro-

Par Conference on Parallel Processing, pages 80–91, London, UK, 1998.

Springer-Verlag.

[HDS97] Jonathan M. D. Hill, Stephen R. Donaldson, and David B. Skillicorn. Sta-

bility of Communication Performance in Practice: from the Cray T3E to

Networks of Workstations. Technical Report PRG-TR-33-97, Program-

ming Research Group, Oxford University Computing Laboratory, 1997.

[Hil98] Jonathan M. D. Hill. Oxford BSP toolset.

http://www.bsp-worldwide.org/implmnts/oxtool/, 1998.

[HJSV98] Jonathan M.D. Hill, Stephen Jarvis, Constantinos J. Siniolakis, and Vasil P.

Vasilev. Portable and architecture independent parallel performance tuning

using a call-graph profiling tool. In 6th EuroMicro Workshop on Paral-

lel and Distributed Processing (PDP’98), page 0286, Los Alamitos, CA,

USA, 1998. IEEE Computer Society Press.

Bibliography 105

[HL02] Gaétan Hains and Frédéric Loulergue. Functional bulk synchronous par-

allel programming using the BSMLlib Library. In Sergei Gorlatch and

Christian Lengauer, editors, Constructive Methods for Parallel Program-

ming, Advances in Computation: Theory and Practice, pages 165–178.

Nova Science Books and Journals, 2002.

[HMS+98] Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau,

Kevin Lang, Satish B. Rao, Torsten Suel, Thanasis Tsantilas, and Rob H.

Bisseling. BSPlib: The BSP programming library. Parallel Computing,

24(14):1947–1980, 1998.

[HO01] Juraj Hromkoviˆ

c and Waldyr M. Oliva. Algorithmics for Hard Problems.

Springer-Verlag New York, Inc., 2001.

[HS00] Holger H. Hoos and Thomas Stützle. SATLIB: An online resource for

research on SAT. In Ian P. Gent, Hans Van Maaren, and Toby Walsh,

editors, Sat 2000: Highlights of Satisfiability Research in the Year 2000,

volume 63 of Frontiers in Artificial Intelligence and Applications, pages

283 – 292. IOS Press, 2000.

[IEE96] IEEE. 1996 (ISO/IEC) [IEEE/ANSI Std 1003.1, 1996 Edition] Information

Technology — Portable Operating System Interface (POSIX®) — Part 1:

System Application: Program Interface (API) [C Language]. IEEE, New

York, NY, USA, 1996.

[IEE05] IEEE. 1666-2005 IEEE Standard System C Language Reference Manual.

http://standards.ieee.org/getieee/1666/index.html, 2005.

[Int06] Intel Corporation. Press Kit — Moore’s Law 40th Anniversary.

http://www.intel.com/pressroom/kits/events/moores_law_40th/index.htm,

2006.

[ISO05] ISO/IEC 9899:1999: Programming languages – C. International Organi-

zation for Standardization, Geneva, Switzerland, 2005.

[JKMR03] Ben H. H. Juurlink, Petr Kolman, Friedhelm Meyer auf der Heide, and Ingo

Rieping. Optimal broadcast on parallel locality models. Journal of Discrete

Algorithms, 1(2):151–166, 2003.

[Juu97] Ben H. H. Juurlink. Computational Models for Parallel Computers. Dis-

sertation, Leiden University, Utrecht, 1997.

[JW96] Ben H. H. Juurlink and Harry A. G. Wijshoff. The E-BSP model: Incorpo-

rating general locality and unbalanced communication into the BSP model.

106 Bibliography

In Euro-Par ’96: Proceedings of the Second International Euro-Par Con-

ference on Parallel Processing-Volume II, pages 339–347, London, UK,

1996. Springer-Verlag.

[KDH+05] James A. Kahle, Michael N. Day, H. Peter Hofstee, Charles R. Johns,

Theodore R. Maeurer, and David Shippy. Introduction to the Cell mul-

tiprocessor. IBM Journal of Research and Development, 49(4-5):589–604,

2005.

[KeN06] Kernel Newbies: Linux 2.6.18. http://kernelnewbies.org/Linux_2_6_18,

2006.

[Keß00] Christoph W. Keßler. NestStep: Nested parallelism and virtual shared

memory for the BSP Model. The Journal of Supercomputing, 17(3):245–

262, November 2000.

[KLST04] Uwe Kastens, Dinh Khoi Le, Adrian Slowik, and Michael Thies. Feedback

driven instruction-set extension. In LCTES ’04: Proceedings of the 2004

ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools

for embedded systems, pages 126–135, New York, NY, USA, 2004. ACM

Press.

[Kun91] Thomas Kunz. The influence of different workload descriptions on a

heuristic load balancing scheme. IEEE Transactions on Software Engi-

neering, 17(7):725–730, 1991.

[LMR95] Zhiyong Li, Peter H. Mills, and John H. Reif. Models and resource metrics

for parallel and distributed computation. In Proceedings of the Twenty-

Eighth Annual Hawaii International Conference on System Sciences, pages

51–60, Los Alamitos, CA, USA, 1995. IEEE Computer Society.

[LMSM04] Stefano Leonardi, Alberto Marchetti-Spaccamela, and Friedhelm Meyer

auf der Heide. Scheduling against an adversarial network. In Proc. 16th

ACM Symposium on Parallelism in Algorithms and Architectures (SPAA

2004), 2004.

[LTBL97] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny.

Checkpoint and migration of unix processes in the condor distributed pro-

cessing system. Technical Report 1346, University of Wisconsin-Madison

Computer Sciences, April 1997.

[LWN05] Address space randomization in 2.6. http://lwn.net/Articles/121845/, 2005.

LWN.net.

Bibliography 107

[McC95a] William F. McColl. A BSP realisation of Strassen’s algorithm. Technical

report, Oxford University Computing Laboratory, May 1995.

[McC95b] William F. McColl. Scalable Computing. In J. van Leeuwen, editor,

Computer Science Today: Recent Trends and Developments, volume 1000,

pages 46–61. Springer-Verlag, 1995.

[MGLV04] Christine Morin, Pascal Gallard, Renaud Lottiaux, and Geoffroy Vallée.

Towards an efficient single system image cluster operating system. Future

Gener. Comput. Syst., 20(4):505–521, 2004.

[Moo65] Gordon E. Moore. Cramming more components onto integrated circuits.

Electronics, 38:114–117, April 1965.

[Moo98] Gordon E. Moore. Cramming more components onto integrated circuits.

Proceedings of the IEEE, 86(1):82–85, 1998. Reprint of Moo65.

[Mot98a] Motorola. M-Core Reference Manual, 1998.

[Mot98b] Motorola. MMC2001 Reference Manual, 1998.

[MPI95] MPI Forum. MPI: A Message Passing Interface Standard. http://www.mpi-

forum.org, June 1995.

[MT99] William F. McColl and Alexandre Tiskin. Memory-efficient matrix multi-

plication in the BSP model. Algorithmica, 24(3-4):287–297, August 1999.

[MT04] Jeremy Martin and Alexandre Tiskin. Dynamic BSP: Towards a Flexible

Approach to Parallel Computing over the Grid. In Ian R. East, David Duce,

Mark Green, Jeremy M. R. Martin, and Peter H. Welch, editors, Commu-

nicating Process Architectures 2004, pages 219–226, 2004.

[MW00] Jeremy Martin and Alex Wilson. A visual BSP programming environment

for distributed computing. In Communication, Architecture, and Applica-

tions for Network-Based Parallel Computing, pages 15–29, 2000.

[Nag04] Norman Nagel. Analyse eines massiv parallelen Multiprozessorsystems

mit SystemC. Diploma thesis, in german, University of Paderborn, 2004.

[NLPR07] Jörg-Christian Niemann, Christian Liß, Mario Porrmann, and Ulrich Rück-

ert. In ARCS’07: Architecture of Computing Systems, pages 83–97, Zurich,

Switzerland, 2007.

[Nöc01] Michael Nöcker. Data structures for parallel exponentiation in finite fields.

Doktorarbeit, Universität Paderborn, Germany, Juni 2001.

108 Bibliography

[NPPR07] Jörg-Christian Niemann, Christoph Puttmann, Mario Porrmann, and Ulrich

Rückert. Resource efficiency of the giganetic chip multiprocessor architec-

ture. Journal of Systems Architecture, 53(5-6):285–299, 2007.

[NPSR05] Jörg-Christian Niemann, Mario Porrmann, Christian Sauer, and Ulrich

Rückert. An evaluation of the scalable giganetic architecture for access

networks. In Advanced Networking and Communications Hardware Work-

shop (ANCHOR), held in conjunction with the 32nd Annual International

Symposium on Computer Architecture (ISCA 2005), 2005.

[OIS+06] Moriyoshi Ohara, Hiroshi Inoue, Yukihiko Sohda, Hideaki Komatsu, and

Toshio Nakatani. MPI microtask for programming the cell broadband

enginetm processor. IBM Systems Journal, 45(1):85–102, 2006.

[PNPR07] Christoph Puttmann, Jörg-Christian Niemann, Mario Porrmann, and Ul-

rich Rückert. Giganoc – a hierarchical network-on-chip for scalable chip-

multiprocessors. In Proceedings of the 10th EUROMICRO Conference on

Digital System Design (DSD), pages 495–502, 2007.

[pub08] Paderborn University BSP Library. http://publibrary.sourceforge.net, 2008.

[PW72] William Wesley Peterson and E. J. Weldon, Jr. Error-Correcting Codes.

MIT Press, Cambridge, Massachusetts, 2nd edition, 1972.

[Rie00] Ingo Rieping. Communication in Parallel Systems–Models, Algorithms

and Implementations. Dissertation, Universität Paderborn, Heinz Nixdorf

Institut, Theoretische Informatik, 2000.

[Sar98] Luis F. G. Sarmenta. Bayanihan: Web-based volunteer computing using

Java. In WWCA ’98: Proceedings of the Second International Conference

on Worldwide Computing and Its Applications, pages 444–461, London,

UK, 1998. Springer-Verlag.

[Sar99] Luis F. G. Sarmenta. An adaptive, fault-tolerant implementation of BSP

for JAVA-based volunteer computing systems. In Proceedings of the 11

IPPS/SPDP’99 Workshops Held in Conjunction with the 13th International

Parallel Processing Symposium and 10th Symposium on Parallel and Dis-

tributed Processing, pages 763–780, London, UK, 1999. Springer-Verlag.

[SAT00] SATLIB – Benchmark Problems.

http://www.cs.ubc.ca/ hoos/SATLIB/benchm.html, 2000. Holger H. Hoos,

maintainer, The University of British Columbia.

Bibliography 109

[SB05] Sven Schneider and Robert Baumgartl. Unintrusively measuring Linux

kernel execution times. In 7th RTL Workshop, pages 1–5, November 2005.

[set08] SETI@Home project homepage. http://setiathome.ssl.berkeley.edu, 2008.

[SHM96] David B. Skillicorn, Jonathan M. D. Hill, and William F. McColl. Ques-

tions and answers about BSP. Technical Report TR-15-96, Oxford Univer-

sity Computing Laboratory, 1996.

[SS63] John C. Shepherdson and Howard E. Sturgis. Computability of recursive

functions.Journal of the ACM, 10(2):217–255, 1963.

[SS92] Hanmao Shi and Jonathan Schaeffer. Parallel sorting by regular sampling.

Journal of Parallel Distributed Computing, 14(4):361–372, 1992.

[SSB+05] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine,

Jason Duell, Paul Hargrove, and Eric Roman. The LAM/MPI check-

point/restart framework: System-initiated checkpointing. International

Journal of High Performance Computing Applications, 19(4):479–493,

2005.

[SSY00] Takahiro Sakamoto, Tatsurou Sekiguchi, and Akinori Yonezawa. Bytecode

transformation for portable thread migration in Java. Technical report, Uni-

versity of Tokyo, 2000.

[ST98] David B. Skillicorn and Domenico Talia. Models and languages for parallel

computation. ACM Computing Surveys (CSUR), 30(2):123–169, 1998.

[Sta81] Richard M. Stallman. Emacs the extensible, customizable self-documen-

ting display editor. SIGPLAN Notices, 16(6):147–156, 1981.

[Str06] Erich Strohmaier. TOP500—TOP500 supercomputer. In SC ’06: Proceed-

ings of the 2006 ACM/IEEE conference on Supercomputing, page 18, New

York, NY, USA, 2006. ACM Press.

[Sui07] Wijnand J. Suijlen. BSPonMPI. http://bsponmpi.sourceforge.net, January

2007, 2007.

[sys07] Open SystemC Initiative. http://www.systemc.org, 2007.

[Tis01] Alexandre Tiskin. All-pairs shortest paths computation in the BSP model.

In P. G. Spirakis F. Orejas and J. van Leeuwen, editors, Proceedings of

ICALP, volume 2076, pages 178–189, 2001.

110 Nomenclature

[Tis03] Alexandre Tiskin. Communication-efficient parallel gaussian elimination.

In V. Malyshkin, editor, Proceedings of PaCT, volume 2763, pages 369–

383, 2003.

[Tis07] Alexandre Tiskin. Communication-efficient parallel generic pairwise elim-

ination. Future Generation Computer Systems, 23(2):179–188, 2007.

[Val90] Leslie G. Valiant. A bridging model for parallel computation. Communi-

cations of the ACM, 33(8):103–111, August 1990.

[vL90] Jan van Leeuwen, editor. Handbook of Theoretical Computer Science, Vol-

ume A: Algorithms and Complexity. Elsevier and MIT Press, 1990.

[WP00] Tiffani L. Williams and Rebecca J. Parsons. The heterogeneous bulk syn-

chronous parallel model. In Workshop on Advances in Parallel and Dis-

tributed Computational Models (APDCM’00), Lecture Notes in Computer

Science, pages 102–108, 2000.

[XL97] Hong Xie and Wanqing Li. Parallel volume rendering using the BSP

model. In Hongchi Shi and Patrick C. Coffield, editors, Proceedings of So-

ciety of Photo-Optical Instrumentation Engineers (SPIE) Conference, Par-

allel and Distributed Methods for Image Processing, volume 3166, pages

274–279, September 1997.

[ZKX99] Weiqun Zheng, Shamim Khan, and Hong Xie. BSP Pro: a Java-based BSP

performance profiling system. In Proceedings of the Fourth International

Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN),

pages 54–59, 1999.

Nomenclature

C1....................................................................... 12

class of machine models, machines that are polynomial-time equivalent and lin-

early space equivalent to DTM, cf. [vL90].

C2....................................................................... 12

class of machine models, machines that can all problems of the class PSPACE in

polynomial time, cf. [vL90].

3-SAT ...................................................................... 66

3-Satisfiability problem, problem of determining if the variables of a given boolean

formula can be assigned.

AMBA ...................................................................... 54

ARM advanced microcontroller bus architecture.

Broadcast ................................................................... 19

communication problem, one processor want to send the same data to all others.

BSP ....................................................................... 10

bulk synchronous parallel, a parallel computing model.

Cache ...................................................................... 53

temporary storage area where frequently accessed data can be stored for rapid

access.

CPU ........................................................................ 1

central processing unit.

CRC ....................................................................... 41

cyclic redundancy check, algorithm to compute a checksum of data.

CRCW ...................................................................... 8

concurrent read concurrent write, variant of PRAM allowing concurrent accesses

to a memory cell.

111

112 Nomenclature

EREW ....................................................................... 8

exclusive read exclusive write, variant of PRAM not allowing concurrent accesses

to a memory cell.

MMU ...................................................................... 41

memory management unit, part of the processor that manage the virtual address

space.

MPI ....................................................................... 36

message passing interface, a standard for communication in parallel systems using

messages [MPI95].

MST ....................................................................... 29

minimal spanning tree problem, problem of finding a tree connecting all nodes of

a graph with minimal weights of the tree edges.

PC ....................................................................... 11

personal computer, a computer intended to be operated directly be an end user.

PRAM ....................................................................... 8

parallel random access machine, a parallel computing model.

PUB ....................................................................... 37

Paderborn University BSP library, a C library for implementation of BSP algo-

rithm.

S-Core ...................................................................... 53

32 bit RISC processor core used for the parallel system on chip.

Shared Memoy .............................................................. 51

random access memory that is accessable by different units.

SystemC .................................................................... 57

C++ library to describe and simulate concurrent processes, cf. [sys07].

Total Exchange ............................................................. 19

communication problem, all processors have data for all others.

VHDL ...................................................................... 50

very high speed integrated circuit hardware description language.

Index

h-relation ........................... 12

(d,x)-BSPmodel .................... 15

NP-hard ........................... 66

3-Satisfiability ...................... 66

activemessages ..................... 39

addressspace ....................... 80

algorithm description . . . . . . . . . . . . . . . . 26

All-Pairs-shortest Paths problem . . . . . . 22

Asynchronous PRAM . . . . . . . . . . . . . . . 10

Block-PRAM ........................ 9

Broadcast .......................... 20

BSPcomputer ...................... 11

BSP Worldwide Standard . . . . . . . . . . . . 35

BSPk .............................. 37

BSPonMPI ......................... 36

Bulk Synchronous Parallel Model . . . . . 10

Cache

size ............................. 2

clause .............................. 66

Coarse Grained Multicomputer (CGM) 15

commcontext ....................... 39

commlink .......................... 39

Communicable Memory . . . . . . . . . . . . . 37

communication controller . . . . . . . . . . . . 54

CREW-BSP-model . . . . . . . . . . . . . . . . . . 15

cyclic redundancy check . . . . . . . . . . . . . 62

debugging .......................... 40

Decomposable BSP model . . . . . . . . . . . 14

discrete event simulation . . . . . . . . . . . . . 23

exponential smoothing . . . . . . . . . . . . . . . 83

Extended BSP (E-BSP) model . . . . . . . . 13

flit ................................. 52

gap

BSPmodel ..................... 12

LogPmodel ..................... 16

GigaNetIC .......................... 49

Green BSP Library . . . . . . . . . . . . . . . . . . 37

Heterogeneous BSP (HBSP) model . . . 14

kernelmode ........................ 78

latency

BSPmodel ..................... 12

LogPmodel ..................... 16

lazybarriers ........................ 37

literal .............................. 66

load balancing thread . . . . . . . . . . . . . . . . 76

Local Memory PRAM . . . . . . . . . . . . . . . . 9

113

114 Index

LogPmodel ........................ 16

LU decomposition . . . . . . . . . . . . . . . . . . . 92

matrix multiplication . . . . . . . . . . . . . . . . 24

Message Passing Interface . . . . . . . . . . . . 38

migration

Linux processes . . . . . . . . . . . . . . . . . 78

minimal spanning tree . . . . . . . . . . . . . . . 29

MOESI ............................ 53

Moore’sLaw ......................... 1

multicast ........................... 19

N-bodyproblem .................... 24

Non-Uniform Memory Arch. (NUMA) 51

oblivious routing . . . . . . . . . . . . . . . . . . . . 52

Oblivious Synchronization . . . . . . . . . . . 13

Optimistic Discrete Event Simulation . 24

overhead ........................... 16

Oxford BSP Toolset . . . . . . . . . . . . . . . . . 36

ParallelPrefix ....................... 20

parallel random access machine . . . . . . . . 8

Ping-Pong-Benchmark . . . . . . . . . . . . . . . 63

pointer type check . . . . . . . . . . . . . . . . . . . 41

Power Processing Engine . . . . . . . . . . . . 50

PRAM .............................. 8

processorstate ...................... 79

PUBlibrary ......................... 37

Queue-Read Q.-Write PRAM . . . . . . . . . . 9

rangecheck ......................... 41

receivethread ....................... 76

reduce .............................. 20

ring0 .............................. 78

SampleSort ......................... 21

scan ................................ 20

scheduletree ........................ 28

sendthread ......................... 76

snoopingslave ...................... 54

solving linear equations . . . . . . . . . . . . . . 24

sorting

system on a chip . . . . . . . . . . . . . . . . . 70

splitter ............................. 21

supervisormode .................... 78

Switchbox .......................... 51

Synergistic Processing Elements . . . . . . 50

syscall ............................. 78

sysenter ............................ 78

systemcalls ......................... 78

SystemC ........................... 56

thread local storage . . . . . . . . . . . . . . . . . . 81

total free capacity . . . . . . . . . . . . . . . . . . . 85

Total-Exchange Benchmark . . . . . . . . . . 64

VDSO ............................. 78

VHDL ............................. 50

virtual processor

migration ....................... 76

vsyscall ............................ 78

XY-Sort ............................ 22