Dynamic Load Balancing
in Peer-to-Peer Networks
Dissertation
by
Miroslaw Korzeniowski
Faculty for Electrical Engineering, Computer Science and Mathematics
Department of Computer Science and Heinz Nixdorf Institute
University of Paderborn, Germany
Paderborn, November 2005
Reviewers:
•Prof. Dr. Friedhelm Meyer auf der Heide, University of Paderborn, Germany
•Prof. Dr. Burkhard Monien, University of Paderborn, Germany
Acknowledgments
I am grateful to my advisor, Friedhelm Meyer auf der Heide, who let me work at my
own pace, steering the course of my work only slightly whenever it was needed. I really
appreciate his trust and the fact that my work was always evaluated by its results and
not by the time I spent in the office.
Christian Scheideler can be thought of as an unofficial co-advisor of this thesis. Coop-
eration with him significantly accelerated my progress and increased my will to work
and hope of success in the field of theoretical computer science.
I also would like to thank the former and current members or close friends of the
group “Algorithms and Complexity” not only for scientific support, but mainly for be-
ing so friendly and helpful in everyday life. Above all I would like to thank Andr´
e
Brinkmann, Marcin Bie´
nkowski, Valentina Damerow, Mirek Dynia, Gereon Frahling,
Jarek Kutyłowski, Peter Mahlmann, Harald R¨
acke, Kay Salzwedel, Christian Schindel-
hauer, Christian Sohler, and Martin Ziegler.
Special thanks I owe to Mirek Dynia and Jarek Kutyłowski who proof-read parts of
this thesis several times. I would also like to especially thank Marcin Bie´
nkowski who
was a co-author of the papers on which this thesis if based.
Being abroad is not easy even if the atmosphere at work is so friendly, and this is why
the piece of Poland we created here was so important for me. I would like to mention
the most important polish friends I had here, especially the ones who have been here
during the last several months of writing this thesis: Ania, Asia, Beata, Ewa, Iwonka,
Jarek, Marcin, Mirek, Ola, and Radek.
iii
Contents
1. Introduction 1
1.1. Notationandtools ................................ 3
1.2. Chord and Distance Halving . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3. Half-life....................................... 6
1.4. Ourcontribution ................................. 7
1.5. Bibliographicnotes................................ 8
I. Load Balancing in Distributed Hash Tables 9
2. Introduction to Load Balancing 11
2.1. Our results and organization of the first part . . . . . . . . . . . . . . . . . 12
2.2. Other work on load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3. Uniform distribution of nodes on the ring . . . . . . . . . . . . . . . . . . . 15
3. Approximating the Current Number of Nodes 19
3.1. The1-markerscheme............................... 19
3.2. The log n-markersscheme ............................ 23
4. Self Stabilizing Load Balancing 25
4.1. The load balancing algorithm AS ..................... 25
4.2. The analysis of AS ............................. 28
4.3. Experimental evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5. Bounding The Communication Cost 35
5.1. The load balancing algorithm AD ..................... 35
5.2. The performance and dynamics model . . . . . . . . . . . . . . . . . . . . . 38
5.3. Performance and communication cost of AD .............. 39
v
vi C
II. Transparent Data Structures 47
6. Introduction to Transparent Data Structures 49
6.1. Our contribution and organization of the second part . . . . . . . . . . . . 50
6.2. Linear addressable memory model in a distributed environment . . . . . . 51
6.3. Pointer model in a distributed environment . . . . . . . . . . . . . . . . . . 52
6.4. Transparent memory models . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.5. Other work on search structures . . . . . . . . . . . . . . . . . . . . . . . . 54
7. Hypertree Memory Model 55
7.1. A family of graphs for the hypertree model . . . . . . . . . . . . . . . . . . 56
7.2. Dynamic graphs for the hypertree model . . . . . . . . . . . . . . . . . . . 57
7.3. Joining and leaving the dynamic de Bruijn cube . . . . . . . . . . . . . . . 58
7.4. Robustness..................................... 59
8. Search Tree 61
8.1. Amortizedanalysis................................ 64
8.2. Recovery...................................... 66
8.3. Local balancing procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
9. Summary and Outlook 77
9.1. Loadbalancing .................................. 77
9.2. Datastructures .................................. 78
Bibliography 85
C 1
Introduction
In the last few years Peer-to-Peer networks have become very popular both in everyday
internet use and academic community. The term “Peer-to-Peer” literally means “equal
to equal” and stands for computers building a network on such a basis in opposition to a
server based approach, in which there is a central server to which all participants connect
and which provides services.
In a Peer-to-Peer network, all participants act as clients, servers and routers at the
same time. One of the advantages of such networks is their high robustness. They have
to be designed in such a way that even unexpected failure of a large fraction of peers
should not switch offthe system as a whole. Through replication even loss of data can
be avoided up to some extent. Another important advantage is the scalability of the
network. All of the participants contribute their resources such as bandwidth, storage
capacity and computational power. Thus, when more nodes arrive and the demand for
the resources increases, their availability increases, too. If the system is designed with
care, it can grow beyond the numbers that could be attended to by a central server. Yet
another advantage that we would like to mention is the lack of infrastructure. Dedicated
servers are usually expensive both in the initial cost and maintenance. Peer-to-Peer
systems are meant to be self organizing and the necessary resources are contributed by
the users themselves, so both the initial and maintenance cost are minimal.
The most popular application of Peer-to-Peer systems is file sharing. A user of such a
system can publish a file that he owns and from the moment of publication all other users
can download this file. The Peer-to-Peer structure is usually used only for the purposes
of publication and finding of files. The transfer of the found file between two users is
performed using a direct connection. Spreading of new releases of linux distributions
is an example of a useful application of file sharing. Regrettably, such applications are
used mainly for distribution of copyright protected files such as music or movies. This
thesis, however, omits the legal and moral issues related to the Peer-to-Peer networks
1
2 I
and concentrates on the algorithmic aspects of Peer-to-Peer networks as distributed
systems.
The system that is considered the precursor of Peer-to-Peer networks was Napster
which appeared in 1997. In fact, the system was based on a central server which was
used for indexing the published files. Each participant could connect to the server
and publish his own file or find a file previously published by someone else. After
receiving an information about the publisher, the seeker could connect to the publisher
and download the file without any further mediation from the server. Such an approach
is called a hybrid approach, as it contains the elements of both server-based and Peer-to-
Peer networks.
A hybrid approach deprives the system of one of its most important advantages,
namely the lack of a single point of failure. A successful attack on the server can break
the whole system and even if the environment is not hostile, the speed of the server
bounds the speed of the whole system. Gnutella was a pure Peer-to-Peer system meant
to deal with this problem. It connected the computers in a random fashion and if a peer
wanted to find a file it flooded its neighborhood of bounded diameter. Such approach,
however, has two other disadvantages. One of them is, that flooding is a very inefficient
communication method already in networks of middle size. The other is that even if
an item that we are looking for exists somewhere in the network, the bounded flooding
approach is not guaranteed to find it. It favors the popular files, neglecting the rare ones.
This is where the algorithmic approach comes into the scene. In 1999 four designs ap-
peared almost at the same time: Chord [SMK+01], CAN [RFH+01], Tapestry [HKRZ02]
and Pastry [RD01]. They were all based on consistent hashing, which works as follows.
There are two components of the approach: the keyspace partitioning and the overlay net-
work. The former means that the approach is based on a virtual space (the d-dimensional
cube [0,1]dfor CAN and the interval [0,1), where 0 and 1 are glued together to form a
ring, for all other approaches) into which all items are hashed and which is partitioned
among the participants called nodes. Each node chooses a point in the space and is
responsible for the piece of the space lying around it. For example, each point of the
space can be attributed to the closest node. If an item is to be published, a global hash
function his computed on its name yielding a point in the space. The information about
the item is stored on the node responsible for the point. When a node wants to find
an item, it computes the hash value on its name and contacts the node responsible for
this value. It can do it using key based routing, a greedy scheme that trying to go from a
node sto a point t, jumps to the neighbor of sthat is closest to t. The network is built in
such a way that the greedy scheme works efficiently, i.e. needs O(d·n1/d) hops in CAN
and O(log n) hops in other approaches, where nis the number of nodes present in the
network.
1.1 N 3
1.1. Notation and tools
Throughout the thesis we use the following terms, techniques, and tools.
First of all, we use two terms to denote the participants of the network. When we
want to stress, that we mean a computer participating in the network, we call it a peer.
If we do not talk about the computer but rather about the abstraction that represents it
in the virtual space, we call it a node.
We use the following terms to talk about important routing parameters. For a network
and a system of paths defined between each pair of nodes, Dilation (denoted as D) is
the length of the longest of these paths. It can obviously be as high as the diameter of the
network, but as it depends not only on the network but also on the system of the chosen
paths, it can be even higher. Another routing parameter, Congestion (denoted as C) is
defined for a set of messages, each of which has a source and a destination and travels
between them along a fixed path. It bounds from above the number of messages that go
through a single node (sometimes one defines separately edge- and node-congestion; in
this thesis, however, we use only the node-congestion). The time needed to perform a
routing task is Ω(D+C) and using for example the protocol from [MV99], routing can
be finished in time O(D+C+log n).
We need two important terms to talk about distributions of random variables. The
first one is well established in the algorithm design community. The second one is
introduced for clarity of notation.
We say that a random variable X(X1,...,Xn) being a function of nvariables (these
can be for example nresults of throwing a dice) is bounded by O(f(n)), with high
probability, if for any l≥0 there exists csuch that Pr[X≥c·f(n)] ≤1
nl. This means that
we can play with the constant in the O-notation in order to obtain any polynomially low
failure probability. The same definition with adequate changes applies to the Ωand Θ
notations. Sometimes we use the abbreviation w.h.p.
We say that a random variable X(X1,...,Xn) being a function of nvariables is bounded
by O(f(n)), with constant probability, if for any there exists csuch that Pr[X≥c·f(n)] ≤
. This means that we can play with the constant in the O-notation in order to obtain
arbitrarily small but constant failure probability. The same definition with adequate
changes applies to the Ωand Θnotations. Sometimes we use the abbreviation w.c.p.
A large part of this thesis is based on the Balls into Bins games. The randomized
game is very simple: we throw nballs into mbins uniformly at random, each one
independently from others. Then, one can examine the probability of different events
like at least one bin being empty, or distributions like the number of balls in the fullest
bin. We use three important tools that emerge in the context of balls into bins games.
The first one is the Birthday Paradox, the second one is Negative Dependence and the
last one is the Paradigm of Two Random Choices. The birthday paradox examines the
4 I
event that at least one bin contains two balls in the end in the case when m= Θ(n2).
The negative dependence [DR98] result proves among others such intuitive results as
Pr[1st bin non-empty] ≥Pr[1st bin non-empty|2nd bin non-empty]. The latter is clear,
since if there is a ball in the second bin, there are fewer balls that could fill the first
one. However, the strict proof of this and more general results is not straightforward.
The paradigm of two random choices consideres the balls-into-bins game with two
modifications. One of them is that balls are thrown one by one and another one is that
at least two bins are chosen for each ball when it should be thrown into a bin it is thrown
to the one that is least full at the moment of throwing. This paradigm shows that in a
game modified in such a way the load of the most loaded bin is much lower in the end.
In order to show that our results are concentrated, that is that they almost always run
in the time we claim they do and that they almost always yield the results we claim they
do, we use the following version of Chernoffbounds, which are proved for example
in [Sch00]. Let X=Pn
i=1Xibe a sum of nindependent indicator (taking only values 0 or 1)
random variables X1,...,Xn. Then for any 0 ≤δ≤1, Pr[X<(1−δ)·E[X]] ≤exp(−E[X]·δ2
2)
and for any δ≥0 Pr[X>(1 +δ)·E[X]] ≤exp(−E[X]·min{δ,δ2}
3). We use Chernoffbounds
mostly in case when X= Ω(log n), which yields that X= Θ(E[X]) with high probability.
1.2. Chord and Distance Halving
We elaborate on Chord [SMK+01] and an approach by Naor and Wieder [NW03b,
NW03a, Wie05] called Distance Halving or the Continuous-Discrete Approach, as these
are the main topologies on which we base.
Chord is based on a ring of integers in range 0 . . . 2b−1, which means that both the
addresses of the peers and of the stored items are b-bits numbers. The parameter bis
sufficiently large, so that the space can contain any reasonable number of nodes and
items. For each point in space its predecessor and successor are defined in a natural
way. The predecessor of a point is the node with the largest ID smaller than the point,
and if such a node does not exist, the node with the largest ID. The successor of a point
is the node with the smallest ID larger than the point, and if such a node does not exist,
the node with the smallest ID. Each point on the ring is attributed to its successor, which
means that all items hashed to this point are stored on its successor and all links from
other nodes pointing to the point are incoming edges of the successor.
We identify the nodes with their positions on the ring. The nodes on the ring form an
approximation of hypercubic connections, that is, each node vhas an edge to succ((v+2i)
mod 2b) for each 0 ≤i<b. Thanks to these edges, any point in the network can be
reached in O(b) hops using the following greedy routing. When a node vwants to route
a message to a point ait finds the largest isuch that (v+2i) mod 2blies between v
1.2 C D H 5
and aand sends the message to succ((v+2i) mod 2b). The intuition behind it is that in
each step the distance between vand ahalves. The authors of Chord prove also, that
if there are nnodes present in the system, routing needs at most O(log n) hops, which
is important for small n(much smaller than 2b). The degree of each node depends on
the placement of nodes in the space and is O(log n) in expectation and O(log2n), with
high probability, if nodes are placed uniformly at random. The first part of this thesis
shows extensions of the original Chord that assure (among other improvements) that
the degree in the network is logarithmic, with high probability. The second part of this
thesis uses a construction similar to Chord in order to efficiently perform rotations in a
balanced binary tree.
Even if the degree of the network is logarithmic, it is not optimal for logarithmic
distance. For an arbitrary network and an arbitrary node rthe number of nodes in
distance at most ifrom rcan be bounded by di+1−1
d−1, where dis the degree of the network.
If the network has nnodes, then at least one of them has to be in distance Ω(logdn)
from r. We say that a network of degree dtogether with a routing scheme is degree-
dilation optimal if its dilation is at most O(logdn). The design by Naor and Wieder is
an example of a degree-dilation optimal scheme. The construction starts from defining
the connections for a continuous graph consisting of all points in the interval [0,1). A
point x∈[0,1) is connected to x
d,x+1
d,..., and x+d−1
d, which is a construction derived from
the de Bruijn graph. The continuous construction is then transformed into a discrete
construction as follows. Each node vis responsible for the interval Iv=[v,succ(v)),
that is, the interval from itself to its successor. In the discrete version there is an edge
between nodes vand u, if and only if there exist x∈Ivand y∈Iusuch that there is an
edge between xand yin the continuous version.
The “routing” procedure is where the approach takes its name from. If a message is to
be routed from a node vto a place t, two messages travel virtually at the same time from
vand t. At each step the messages are routed along the edge x→x+i
d, with the same
random choice of ifor both of them. It is proved that with each such step the distance
between the two messages is shortened by a factor d(halved for d=2, hence ”distance
halving”). Since the smallest distance at which the messages can be, not being on the
same node, is at least inversely polynomial in the number of servers, after O(logdn) steps
the messages meet.
The implementation of the scheme first sends the message along Ω(logdn) edges,
each chosen randomly from the dpossibilities and all random choices are stored in the
message in form of a stack. Then, the message is moved slightly within the interval it
is currently in and sent backwards using the same types of edges it used before but in
inverse order. The distortion is such that in the end it lands exactly in t.
Similarly as with Chord, also in case of the Continuous-Discrete Approach one needs
the network to be balanced in order to prove some properties of the network (e. g.
6 I
constant degree for constant d). For this, one needs one of our balancing schemes from
the first part of this thesis or a similar algorithm. The second part of this thesis uses the
de Bruijn graph to simulate parent-to-child edges of a binary tree.
1.3. Half-life
In order to bound the communication cost of one of our algorithms we will need the
following notion from [LNBK02]. This is introduced, because the system cannot be
treated as a process that starts and ends but rather as something that runs forever. The
notion of half-life lets us measure the communication not as the total bandwidth used,
but rather as rate at which each node has to communicate.
Definition 1.1 ([LNBK02]).If there are n live nodes at time t, then:
1. the doubling time from time t is the time that elapses before n additional nodes arrive
2. the halving time from time t is the time required for half of the nodes alive at time t to
depart
3. the half-life τtfrom time t is the smaller of the doubling and halving times from time t
4. the half-life τof the entire system is the minimum half-life over all times t
We cited the whole definition but in our approach we only need a notion for half-life
from arbitrary time t, namely τt. In Chapter 5 we will analyze the behavior of a load
balancing algorithm during a time interval starting in a time step tand ending after
τt, that is lasting a half-life. We will also use the term half-life for naming such time
interval. The notion of half-life is especially useful because it has been proved that in
each half-life each node should get some number of messages in order for the network
not to fall apart. We elaborate on this below.
Since the system is dynamic and not supervised, some communication is needed in
order to keep the network connected. We assume the simplest model of handling arrivals
and departures. At any time a node can notify another node about its presence in the
network, incurring a communication cost of one message. We allow to send notifications
even if the two nodes are not connected. However, the goal of sending notifications is
to keep the system balanced using minimal number of messages. We cite the following
theorem which bounds from below the communication occurring in the system.
Theorem 1.2 ([LNBK02]).Any n-node Peer-to-Peer network that remains connected with high
probability for any sequence of joins and leaves with half-life τmust notify every node with an
average of Ω(log n)messages per τtime.
1.4 O 7
In the proof a specific scenario is considered. Nodes join the system according to Pois-
son process with rate λand stay in the system according to an exponential distribution
with rate parameter µ. Each node which gets on average fewer than kmessages, loses
all its neighbors with probability 1−1
e−1k, so Ω(log n) messages are necessary for the
node to have at least one connection to another node, with high probability.
If we prove, that our load balancing algorithm generates communication of O(log2n)
per node per half-life, we may say, that its communication cost is amortized against the
network-maintenance and that it is O(log n) competitive.
1.4. Our contribution
As mentioned above, Chord and the Continuous-Discrete approach are both based on a
ring into which nodes are inserted and each node is responsible for some part of the ring,
for example the interval from itself to its direct successor. In case when nodes choose the
places for themselves uniformly at random (and in a truly distributed environment this
seems the best choice) the sizes of the pieces of the space they are responsible for vary in
size. This causes overloading of some nodes not only with larger numbers of items they
have to store but also with more communication, as a node responsible for a large piece
of the space has to keep connections with many nodes responsible for smaller chunks.
The first part of this thesis presents algorithms that balance such load and guarantee
that the final ratio between the most and the least loaded nodes is constant.
First of all, we show two distributed algorithms which yield constant approximation
of the number of participants of the network. Such approximation is needed in both
load balancing algorithms to decide about the role a peer should play in the network.
Namely, they make it possible for a node to decide if its load is too high or too low.
The first load balancing algorithm, AS , works in a static environment and starts
with any distribution of nodes on the ring. It is able to migrate some nodes, so that in
the end each node is responsible for an interval of the same length (up to a constant
factor). The whole routine takes only constant number of rounds, where each round
lasts as long as it takes to deliver a message between any two peers. The price for such
low reaction time is high communication cost. Specifically, linear number of nodes are
in a process of sending a message at each point in time.
The second load balancing algorithm, AD , also starts with an arbitrary distri-
bution of nodes on the ring. It is meant to run in a distributed environment and the
emphasis is put on low communication cost. We prove that in a half-life, each node
sends at most logarithmic number of messages, and the total communication cost in
the network is at most by a logarithmic factor larger than the cost of maintaining the
system. Unfortunately, the reaction of the scheme is slower, that is, it takes logarithmic
8 I
number of rounds to repair the system and only the imbalance present in the beginning
of the half-life is repaired. The imbalance introduced in the present half-life is going to
be repaired in the next time period.
In the second part of this thesis, we introduce a class of structured memory models
that we call transparent memory models. Transparent memory models are memory models
in which memory accesses can be emulated in a scalable overlay network with constant
work. Peer-to-Peer networks are perfect candidates as a support for transparent data
structures.
As a specific example of a transparent data structure, we introduce a transparent
memory model called the hypertree memory model and design a scalable, dynamic overlay
network that can emulate memory accesses in this model with constant work. The
hypertree is a family of binary trees in which each node has additional shortcut edges
leading to all its “cousins”. The overlay network design we propose is a hybrid of the
Continuous-Discrete approach and Chord.
We show how to implement a binary search tree in the hypertree memory model
with the property that the amortized work for insert, delete and search operations is
asymptotically the same as for the best binary search trees in the pointer model, yet it can
efficiently recover all remaining information under arbitrary memory faults. Moreover,
in the emulation, every node in the dynamic overlay network only has to perform a
worst-case logarithmic amount of work for any insert, delete or search operation. The
recovery operations work on a reactive basis, that is, they generate communication cost
only if failures occur and the total cost is only by a logarithmic factor larger than the
number of failures.
1.5. Bibliographic notes
The results presented in this thesis have all been accepted for publication in various
international conference proceedings. The static scenario of the first part of the the-
sis was presented at the 4th International Workship on Peer-to-Peer Systems (IPTPS
2005) [BKM05]. The dynamic one was presented at the 9th International Conference
on Principles of Distributed Systems (OPODIS 2005) [BK05]. The result from the sec-
ond part of the thesis was presented at the 8th International Symposium on Parallel
Architectures, Algorithms, and Networks (I-SPAN 2005) [KS05].
Part I.
Load Balancing in Distributed Hash
Tables
9
C 2
Introduction to Load Balancing
In the first part of this thesis we tackle the problem of balancing the load in a Peer-to-Peer
network. We present algorithms which assure that each node is responsible for a piece
of the virtual space of the same size up to a constant factor. This should guarantee that
the responsibilities for maintaining the network are distributed in a fair way and that
no node is overloaded.
The Peer-to-Peer networks which we are considering are based on consistent hash-
ing[KLL+97] withringtopology likeChord[SMK+01], Tapestry [HKRZ02], Pastry[RD01],
and the Continuous-Discrete Approach [NW03a, NW03b]. The exact structure of the
topology is not relevant. It is, however, important that each peer has direct links to its
successor and predecessor on the ring and that there is a routine that lets any peer contact
the peer responsible for any given point in the network in time D, which depends on n,
the current number of peers present in the system. In the systems, that we are interested
in, D∈O(log n).
In the load balancing algorithms we assume synchronous network behavior. This
means that all nodes work with approximately the same speed and that messages are
guaranteed to reach their destinations in bounded time. In Chapter 4 we assume addi-
tionally that it is possible to synchronize each node with its successor and predecessor,
so that the part of code after such synchronization is run only if the part before it has
been finished by all nodes. Thus, we achieve the execution of the algorithm divided into
synchronized rounds.
Below, we introduce notation for the division of the [0,1) ring among nnodes partic-
ipating in the network, length of interval for which a node is responsible and a crucial
parameter of a network defined in this way, namely smoothness.
Definition 2.1. A placement of n nodes on the unit ring generates n intervals – one from each
node to its direct successor. A node i is responsible for the interval Iifrom itself to its successor.
For a node i, the absolute length of its interval Iiis denoted as l(Ii).
11
12 I L B
Definition 2.2. The smoothness of a network is the ratio of the length of the longest interval
to the length of the shortest interval.
The smoothness is a parameter, which informs about three aspects of load balance.
•Storage load of a peer. The longer its interval is, the more data has to be stored in
the peer. If we assure, that the smoothness is constant, then in a system in which
Ω(n·log n) items are distributed uniformly at random on the ring, each of the n
peers is responsible for the same number of items (up to constant factors), with
high probability.
•Degree of a node. A node responsible for a longer interval has a higher probability
of being contacted by many nodes responsible for short intervals which increases
its in-degree.
•Congestion and dilation. In the Continuous-Discrete Approach [NW03a, NW03b],
constant smoothness is assumed in order to prove that routing parameters like
congestion and dilation are kept low.
The algorithms we describe rearrange the positions of the nodes on the ring so that the
smoothness becomes constant, independently of the original situation. In the process
of designing the algorithms, we noticed that it is very hard if not impossible to devise a
truly distributed and local scheme which would on the one hand react instantly to system
changes and on the other hand would not send unnecessary messages. In our designs,
we propose two tradeoffs. The first algorithm reacts to all changes as quickly as possible
but has very high communication cost. The second one adjusts its communication to
the dynamics of the system but its reactions to the system changes are a little bit slower.
Both algorithms need to know the current number of nodes present in the system at
least with constant accuracy, so we start the description from showing two algorithms
which yield constant approximation of the number of network participants.
2.1. Our results and organization of the first part
In the rest of this chapter we shortly describe the work that others did in the area of load
balancing in Peer-to-Peer networks. It has been a very popular topic in the last few years
and about a dozen results concerning it have been published. We also show a result that
is considered folklore but the concrete statement and proof of which are hard to find.
Namely, we show that, if nnodes are placed on the ring independently and uniformly
at random, the smoothness is Ω(n·log n), with constant probability. This means that the
most obvious solution, i.e. choosing IDs of nodes randomly, yields too high differences
in load.
2.1 O 13
In Chapter 3 we show how to approximate the number of nodes present in the network.
The nodes work independently in a truly distributed fashion and each of them is able
to compute for itself a constant approximation of the current number nof nodes. We
present two schemes. One is due to Awerbuch and Scheideler [AS04a] and the other
is our independent design. Our algorithm needs an upper bound of log nas its input.
The correctness of our approximation and load balancing schemes depends on the
correctness of the bound, whereas their communication cost is proportional to the value
of the bound. Choosing this parameter to be 40, we err in the estimation at most by a
factor of 4 for networks of reasonable size (from 1000 to 240 nodes). Either of the two
approximation schemes can be used for the two load balancing algorithms we show in
the subsequent chapters.
In Chapter 4 we show a self-stabilizing distributed algorithm that rebalances the sys-
tem in the static case, that is when no nodes join or leave it. Such a strong assumption
is needed in the analysis, but the algorithm itself is designed to work in a dynamic
environment. The main advantage of the scheme is that it reacts to any changes im-
mediately and needs only logarithmic time to repair the system from an arbitrarily bad
situation. Its main drawback is its very high communication cost. Even if the system is
perfectly balanced, each node sends a message directly after its previous messsage has
been delivered, so at any point in time there are linear number of unnecessary messages
traveling through the system.
In Section 4.3 we present the results of the experimental evaluation that we performed.
As the constants emerging from the theoretical analysis are in the range of thousands,
it may seem that our algorithms are useless in practical applications. Our experiments
show that with proper selection of parameters, smoothness can be reduced from thou-
sands to 10 or below.
In Chapter 5 we extend and modify the scheme so that it can be analyzed as an ever-
running process. We show formally that in a half-life each node sends O(log n) messages
to logarithmic distance, which yields total neighbor-to-neighbor communication cost of
O(log2n) per node. The algorithm can play against an oblivious adversary but needs
some time to catch up with the dynamics of the system. The latter is modelled in the
following way. An adaptive adversary chooses a point in time in which the assessment
of the algorithm starts. In a half-life starting in this point in time an oblivious adversary
generates join and leave operations. We show that our algorithm smooths the imbalance
present at the beginning of the considered half-life. The imbalance generated by the
adversary during the half-life is going to be balanced later. The delay of the reaction to
system changes is the price for the bounded communication cost.
14 I L B
2.2. Other work on load balancing
Load balancing has been a crucial issue in the field of Peer-to-Peer networks since the
design of the first network topologies like Chord [SMK+01]. In the original Chord paper,
it was proposed that each real peer works as log nvirtual nodes, thus greatly decreasing
the probability that some peer will get a large part of the ring. Some extensions of this
method were proposed by Rao et al. [RLS+03] and Godfrey et al. [GLS+04], who introduce
more schemes based on virtual nodes and evaluate them experimentally. Unfortunately,
such an approach increases the out-degree of each peer by a factor of log n, because each
peer has to keep all the links of all its virtual nodes.
The paradigm of many random choices [MRS00], from the area of throwing balls into
bins games, was used by Byers et al. [BCM03] and by Naor and Wieder [NW03a, NW03b].
When a peer joins, it contacts log nrandom places in the network and chooses to cut the
longest of all the found intervals. This yields constant smoothness, with high probability.
A similar approach was proposed by Adler et al. [AHKV03]. They improve the
paradigm of many random choices in the following way. All bins build a hypercubic
structure, where each bin has direct connections to log nother bins. Such a structure can
be easily simulated using Chord and other networks with hypercubic shortcuts. When a
ball should be thrown into a random bin, instead of checking the load of log nrandomly
chosen bins, it chooses just one random bin and checks its neighbors. The authors prove,
that this one random choice generates log npseudorandom choices, which are sufficient
to spread the load evenly among all bins, with high probability.
The two approaches above have a certain drawback. They both assume that peers
join the network sequentially. What is more important, they do not provide analysis for
the problem of balancing the intervals afresh when peers leave the network.
Karger and Ruhl [KR04a, KR04b] propose a scheme, in which each node chooses
Θ(log n) places in the network and takes responsibility for only one of them. This can
change, if some nodes leave or join, but each node migrates only among the Θ(log n)
places it chose, and after each operation Θ(log log n) nodes have to migrate on expec-
tation. The algorithm uses only tiny messages for checking the network state, and the
communication cost in a half-life can be bounded to Θ(log n) messages sent to logarith-
mic distance per peer. The scheme is claimed to be resistant to attacks thanks to the fact
that each node can only join in logarithmically bounded number of places on the ring.
However, Awerbuch and Scheideler [AS04a] show, that such a scheme cannot be secure
and that more sophisticated algorithms are needed to provide provable security. The
reasoning for this is that with the IPv6 protocol the adversary has access to thousands
of IP numbers, and she can join the system with the ones falling into an interval that she
has chosen. She does not have to join the system with each possible IP, because the hash
functions are public and she can check usefulness of an IP offline.
2.3 U 15
Manku [Man04] presented a scheme based on a virtual binary tree that achieves
constant smoothness with low communication cost for peers joining or leaving the
network. It is also shown that the smoothness can be diminished to as low as (1 +)
with communication cost per operation increased to O(1/). All nodes form a binary
tree, where some of them (called active) are responsible for perfect balancing of subtrees
rooted at them.
Kenthapadi and Manku [KM05] present a scheme that extends [Man04] and general-
izes [AHKV03]. Nodes are grouped into clusters of size h. Each node checks kclusters
when it joins the network (In [AHKV03], h=log nand k=1). The authors prove that it
is sufficient, that h·k= Ω(log n) to guarantee that the resulting smoothness is constant
with high probability. Similarly to the other two approaches basing on the paradigm of
many random choices, this algorithm does not consider nodes leaving the network.
2.3. Uniform distribution of nodes on the ring
In this section we examine the smoothness in case when all peers choose the places for
themselves independently uniformly at random. We show that it is very likely that the
resulting smoothness is Θ(n·log n). In order to be able to treat the nodes thrown into the
[0,1) interval as balls thrown into bins we define a natural partition of the [0,1) interval
into identical sectors, each of which is later treated as a bin. The nodes placed on the
ring will be treated as balls thrown into these bins.
Definition 2.3. The canonical division of the unit ring into m sectors is a division into m
disjoint intervals (called sectors) of length 1
m, where one of the sectors is the interval [0,1
m]. The
latter is the first sector; the subsequent sectors are nubmered from 2to m.
The following result belongs to folklore but we were not able to actually find it in
literature. Therefore, we prove it here for completeness
Theorem 2.4. If n nodes are distributed uniformly and independently at random on the unit
ring the following hold:
1. The longest interval generated by the nodes has length Θlog n
n, with high probability
2. The shortest interval generated by the nodes has length Θ1
n2, with constant probability
Proof. For the first claim, consider the canonical division of the unit ring into m=c·n
ln n
sectors of length 1
m. First, we bound the probability that any sector is empty if c>0
is sufficiently small, i. e. sectors are sufficiently long. If all sectors are non-empty then
16 I L B
the longest interval generated by the nodes has length at most 2
m=2·ln n
c·n∈Olog n
n. For
sufficiently small cand large enough n:
Pr(there exists an empty sector) ≤m·Pr(the first sector is empty)
≤m·1−1
mn
=c·n
ln n·1−ln n
c·nc·n
ln n·ln n
c
≤c·n
ln n·1
eln n
c≤c·n
ln n·1
n1/c≤1
n2/c
On the other hand, if there is at least one empty sector, then the length of the longest
interval generated by the nodes, is at least 1
m=ln n
c·n∈Ωlog n
nand for sufficiently large
c>1 and large enough n:
Pr(all sectors are non-empty) =Pr(the first is sector non-empty)
·
m
Y
i=2
Pr(sector iis non-empty |
sectors 1 . . . i−1 are non-empty)
(1)
≤
m
Y
i=1
Pr(sector iis non-empty) =
m
Y
i=1 1−1−1
mn!
=
c·n
ln n
Y
i=1 1−1−ln n
c·nn!≤1−1
n2/cc·n
ln n≤1
e√n≤1
n2/c
In inequality (1) above we used the negative dependence of non-emptiness of partic-
ular sectors. The latter is easy to explain: when in a balls into bins game, for a concrete
bin we have the information that some other bins already used some balls, then the
probability, that the chosen bin is empty, decreases. This statement is formally proved
in [DR98].
For the second claim of the theorem, consider the canonical division of the unit ring
into m=c·n2sectors of length 1
m. Again, we treat the sectors as mbins into which we
throw nballs. First, we give an upper and lower bound for the probability that there
exists a sector containing at least two nodes. Actually, we consider the opposite event.
Pr(each sector contains ≤1 node) =1·1−1
m·1−2
m···1−n−1
m
Now, to give an upper bound for the event that there exists a sector containing at least
two nodes we estimate the calculated probability for large enough nand sufficiently
2.3 U 17
large c>0c≥1
2·ln(2/(2−))in the following way:
Pr(each sector contains ≤1 node) ≥1−n
mn
=1−1
c·nc·n·1
c≥e−1
2·c≥1−
2
so with arbitrarily large (but smaller than 1 by a constant) probability, each sector
contains at most one node.
The latter does not guarantee a lower bound for the length of the shortest interval
generated by the nodes. Even if each sector of length Θ1
n2of the canonical division
contains at most one node it may happen that two nodes lie very close. But if we perform
a second experiment in which all sectors are shifted by 1
2·m, that is the first sector starts in
the point 1
2·m(in the middle of the first sector from the first experiment), and we use the
random choices from the first experiment (the nodes are thrown exactly into the same
places on the ring), then we have the same bound for two nodes being in the same sector
for the second experiment. Both experiments are dependent but still the probability that
in any of them there exists a sector with two nodes is bounded by . Now, if two nodes
were in distance at most 1
4·m, they would be in the same sector either in the first or in
the second experiment. Thus, with arbitrarily large (but smaller than 1 by a constant)
probability the shortest interval generated by the nodes has length Ω1
n2.
In order to lower bound the probability that there exists a sector containing at least two
nodes, we choose sufficiently small c>0c≤1
4·ln 1
:
Pr(each sector contains ≤1 node) ≤1−n
2·mn
2=1−1
2·c·n2·c·n·1
4·c≤e−1
4·c≤
so with arbitrarily high probability (but smaller than 1 by a constant) there exists a
sector which contains two nodes and thus the shortest interval generated by the nodes
has length at most 1
m=1
c·n2∈O1
n2.
C 3
Approximating the Current Number of
Nodes
In this chapter we present two very similar schemes which are used to approximate
the number of nodes present in the network. The first one is based on the work of
Awerbuch and Scheideler [AS04a] and the second one was independently designed by
us [BK05]. In both schemes all nodes insert a fixed number of markers (markers are
special items or connections) into the [0,1) ring. In the first one each node inserts one
marker and has to communicate to its logarithmic neighborhood, and in the second one,
each node inserts logarithmic number of markers and draws conclusion basing only on
the interval for which it is responsible. Both schemes yield a constant approximation
of the number of nodes. Our scheme needs an upper bound on log nin order to work
properly. In Chapter 5 it will be shown that our scheme is more compatible with our
balancing algorithms.
The actual purpose of both approximation algorithms is to provide a tool to categorize
intervals according to their load. If we know whether a node’s interval is too long, too
short or just fine, we can decide if it needs help, can help, or is content with its load.
Both approximation schemes can work with both balancing algorithms. However,
we describe them in such a way that the details in the 1-marker scheme are adjusted
to the self-stabilizing algorithm AS presented in Chapter 4 and the details in
the log n-markers scheme are adjusted to the bounded-communication algorithm A-
D presented in Chapter 5.
3.1. The 1-marker scheme
We describe a scheme based on the approach of Awerbuch and Scheideler [AS04a],
who give an algorithm which yields a constant approximation of nin every node assum-
19
20 A C N N
ing that the nodes are distributed uniformly at random in the interval [0,1]. We cannot
make such an assumption, since our balancing algorithms heavily change the positions
of nodes on the ring. Besides, our balancing algorithms can start with any (even chosen
by an adversary) distribution of nodes on the ring. Assuming that they are distributed
uniformly would deprive our results of this advantage.
As a preparation for the actual approximation algorithm, each node keeps a connection
to one random position on the ring (instead of such special connections, special items
can be inserted into the virtual space). This position is called a marker. All markers
present in the system define weights of intervals in the following way.
Definition 3.1. For a node i responsible for interval Iiits weight w(Ii)is the number of markers,
which fall into Ii.
The positions of markers are not fixed all the time. After finding an initial random
position for its marker, each node performs the following infinite process. It keeps its
marker in the found position but at the same time chooses a new random position and
contacts a node responsible for it. After successfully contacting the node responsible
for the chosen place, it migrates its marker there and starts looking for another random
position. Thus, the position of each marker is fixed only for D rounds, the time it takes
for any node to contact any other node in the network.
The process of constantly changing the positions of markers is needed for the follow-
ing reason. We show that for a fixed random configuration of markers our balancing
algorithm works properly with high probability. However, since the process runs for-
ever, and nodes are allowed to leave and join (and thus change the positions of their
markers), a bad configuration may (and will) appear at some point in time. Thanks to
the continuous changing of the positions of markers, the probability of failure in time
step tis independent of the probability of failure in time step t+D, since in the time
interval [t,t+D] each node changes the position of its marker at least once. This enables
the process to recover even if a bad event occurs.
The Algorithm A-1 is meant to run on each node iresponsible for interval Ii
independently from other nodes. A node iapproximates the number of nodes present
in the network in the following way. The goal is to build an interval [a,b) of length
Θlog n
n. The beginning of interval [a,b) is fixed to ai, i.e. the beginning of the interval
Ii=[ai,bi) for which iis responsible (the first line of the algorithm). Initially, the right
end bof the constructed interval is bi(lines 2 −3).
The condition in line 4 is selected so, that the interval [a,b) grows (in lines 5 −6) as
long as it is too short. If it is shorter than Olog n
n, its weight is smaller than O(log n)
with high probability and it is compared to log(1/l([a,b)) which is Ω(log n). Thus, the
interval grows until its length is Ωlog n
n.
3.1 T 1- 21
Algorithm 1 A-1 (node iresponsible for interval Ii=[ai,bi))
1: a←ai
2: b←bi
3: j←i
4: while w([a,b)) <log 1
l([a,b)) do
5: j←j.succ
6: b←bj
7: while w([a,b)) >log 1
l([a,b)) do
8: decrease bto the position of the last marker in [a,b)
9: increase bso that w([a,b)) =log 1
l([a,b))
10: return l([a,b))
In lines 7 −9 it is shortened to Θlog n
n, which is done locally using only information
from the last node contributing to the interval [a,b). Awerbuch and Scheideler prove that
the interval [a,b) returned by the algorithm has length Θlog n
n, with high probability.
This is stated formally in the following lemma from [AS04a]. Besides, they prove that
in case when nodes are distributed uniformly at random, the computation time and
communication cost of A-1 are bounded by O(log n), with high probability.
Lemma 3.2. The interval [a,b) returned by the algorithm A-1 fulfills α·log n
n≤l([a,b)) ≤
β·log n
nfor constants αand β, with high probability.
We reformulate this lemma in order to get an approximation of the number of nodes
nfrom an approximation of log n
n. The following theorem states the exact result.
Theorem 3.3. Let l([a,b)) be the length of an interval found by node i. Let nibe the solution of
log x−log log x=log(1/l([a,b))). Then with high probability n
β2≤ni≤n
α2.
Similarly to constantly changing the positions of markers, the algorithm A-1 is
also executed time after time. This is meant to assure that the information about the
number of nodes present in the network is as fresh as possible. In case when nodes are
initially distributed uniformly at random, the information is at most O(log n) time steps
old, as it takes O(log n) time steps to execute A-1 . In the model in Chapter 4, the
situation is static, that is no nodes leave or join the system during the execution of load
balancing. We may assume that each node executes A-1 once before the actual
load balancing procedure. This contributes Θ(log n) time steps to the running time.
In the design of the balancing algorithm we assume that each peer has computed ni.
Additionally, there are global constants land usuch that l·ni≤n≤u·ni, for each i.
22 A C N N
We categorize intervals according to their lengths, whereas the categorization is
slightly influenced by the quality of the computed approximation. An interval can
move between categories without changing its absolute length. This can be caused by
the change of niwhen the distribution of markers in the neighborhood of ichanges,
especially when the size of the network grows or shrinks. An interval Iiis called:
very short if l(Ii)≤1
2·l·ni
short if 1
2·l·ni≤l(Ii)≤4
l·ni
medium if 4
l·ni≤l(Ii)≤8·u
l2·ni
long if l(Ii)≥8·u
l2·ni
From the definition and from the quality of approximation we can conclude that each
medium or long interval has length at least 4
l·ni≥4
n, since l·ni≤n. Thus, there are
at most n
4medium and long intervals present in the network at any point in time. On
the other hand, long intervals are defined so that by halving a long interval we never
obtain a short interval. This comes from the fact that for each short interval Iiit holds
that l(Ii)≤4
l·ni≤u·4
l·nand if a long interval is halved, the lengths of the halves are at least
4·u
l2·ni≥u·4
l·n. The following fact assembles the results of this section
Fact 3.4. The approximation scheme A-1 together with the classification of intervals’
lengths have the following properties:
•each node is able to find to which category it belongs,
•all but very short intervals have lengths Ω1
n,
•all but long intervals have lengths O 1
n,
•there are at most n
4medium and long intervals,
•after gluing two adjacent short intervals, one never gets a long interval,
•after halving a long interval, one never gets a short interval.
3.2 T log n- 23
3.2. The log n-markers scheme
In the second scheme, which is our independent design, we look at weights of intervals
instead of their lengths. The expected length of an interval is 1/nand each node would
like to be responsible for an interval of length equal to Θ(1/n). However, in order to be
able to decide if an interval is shorter or longer than 1/n, we need to know n. Instead of
looking at lengths of intervals, nodes look at their weights and we tinker with markers
so that for each interval it is known at first sight, if it is too light or too heavy. Besides,
we show how from the information about the interval’s weight and absolute length one
can compute an approximation of its relative length and thus also an approximation
of n.
Let ∆be an upper bound on log ncommon for all nodes present in the system. Each
node keeps connections to δ=α·∆random positions on the ring, for sufficiently large
global constant α. Similarly as in the previous section, these connections are called
markers and define weights in the same way as in the Definition 3.1.
It is clear that E[w(Ii)] =δ·l(Ii)·n, so ni=w(Ii)
δ·l(Ii)is a good estimate for n. In the following
theorem we state precisely the relation between the length and the weight of an interval
and prove, using Chernoffbounds, that the quality of the estimate is sufficient.
Theorem 3.5. For any constant wmin >0, there exist constants lmin and lmax, such that for an
arbitrary interval Ii, time step t, and sufficiently large α, the following holds with high probability.
In time step t, if w(Ii)≥wmin ·δ, then lmin
δ·n·w(Ii)≤l(Ii)≤lmax
δ·n·w(Ii); otherwise l(Ii)∈O1
n.
Proof. For sufficiently large constant sand an interval Iof length l(I)=s
n, let a random
variable Xdenote the weight of I. The expected value of Xis E[X]=s·δand Chernoff
bounds guarantee that for sufficiently large α,X≥wmin ·δ, with high probability. Thus,
if for an interval Iiit holds that w(Ii)≤wmin ·δ, then with high probability l(Ii)≤s
n∈O1
n.
For an interval Iwith l(I)≥s
n, let a random variable Xdenote its weight again. Since
its expected value E[X]=n·δ·l(I)∈Ω(log n), Chernoffbounds guarantee that with
high probability X∈Θ(E[X]). Let the constants in the Θnotation be such that with high
probability 1
lmax ·n·δ·l(I)≤X=w(I)≤1
lmin ·n·δ·l(I). Then for an interval Ii, with high
probability lmin
δ·n·w(Ii)≤l(Ii)≤lmax
δ·n·w(Ii).
Without using the knowledge about weights, it is hard for the nodes to compare their
length to the average length (which is 1/n) due to the lack of knowledge about the exact
number of participants n. On the other hand it is easy for them to compare their weight
to the average weight, since the latter is a global constant equal to δ.
24 A C N N
Similarly to the previous section, we categorize the intervals according to their weight.
However, this time the categorization has a parameter wmin which states the threshold to
consider an interval very light. We assume that this threshold satisfies 0 ≤wmin <16. An
interval can change its category even without changing its absolute length. It happens
when it gains or loses weight, which is caused by other intervals joining or leaving the
network. However, since the maximum dilation Din the network is O(log n), it takes
other nodes at most O(log n) time steps to update the information about their markers,
so the information about the category to which a node currently belongs is at most
O(log n) time steps old. An interval Iiis called:
very light if w(Ii)≤wmin ·δ
light if wmin ·δ≤w(Ii)≤16 ·δ
medium if 16 ·δ≤w(Ii)≤32 ·δ
heavy if w(Ii)≥32 ·δ
Such categorization together with Theorem 3.5 lets us look at weights of intervals in
situations in which we would like to know their relative lengths. The load balancing
algorithm will reorganize the positions of nodes on the ring so that the resulting intervals
have weights betwen wmin ·δand 32 ·δ. By Theorem 3.5, this implies that in the end the
smoothness is constant, with high probability.
Notice that from the definition we can conclude that each medium or heavy interval
has weight at least 16 ·δ, and since the total weight of the network is n·δ, there are at
most n
16 medium and heavy intervals in the network at any point in time. Besides, for
any not very light interval Ii, its length is proportional to its weight l(i)= Θ w(Ii)
δ·n.
The following fact summarizes the results of this section which will be needed in
Chapter 5.
Fact 3.6. The approximation algorithm together with the classification of weights have the
following properties, all holding with high probability:
•each node is able to find to which category it belongs,
•all but very light intervals have lengths Ω1
n,
•all but heavy intervals have lengths O 1
n,
•there are at most n
16 medium and heavy intervals,
•after gluing two adjacent light intervals, one never gets a heavy interval,
•after halving a heavy interval, one never gets a light interval.
C 4
Self Stabilizing Load Balancing
In this chapter we describe the algorithm AS , a self-stabilizing scheme that starting
with an arbitrary placement of nodes on the ring, repairs the network in a constant
number of rounds, where a single round lasts as long as it takes a peer to send a message
to some other peer in the network, i.e. Dtime steps. We describe the algorithm, so
that nodes work in synchronized rounds, which is not crucial for the correctness but
significantly simplifies the description of the algorithm. However, it is important, that at
least some synchronicity is guaranteed, that is that all nodes are of the same speed up to
constant factors and that messages send from neighbor to neighbor take only bounded
time to reach their destination. In such case, one can use techniques similar to those
described in Chapter 5 instead of dividing the execution into rounds.
For the simplicity of the analysis, we fix a static situation with some number nof peers
in the system, which means that no nodes join or leave the system. Such assumption
is needed in the analysis but the algorithm itself is designed in such a way that it can
be implemented also in a dynamic system. We do not prove the latter, but we show
the results of experiments that we performed and that indicate good behavior of the
algorithm.
4.1. The load balancing algorithm AS
In Chapter 3 we divided the intervals created by the placement of nodes on the ring into
four classes: very short, short, medium, and long. In the rest of this chapter we need
only the properties of the classification stated in Fact 3.4 and we will not use the concrete
values defined in Chapter 3.
As mentioned above, we assume that the network is frozen for the execution of the
balancing procedure. Therefore we may think that the algorithm A-1 , which
estimates the number of nodes in the network is run once in the beginning. Since
25
26 S S L B
Algorithm 2 AS (node iresponsible for interval Ii)
1: if Iiis short then
2: i.state ←staying
3: if i.pred is short then
4: with probability 1
2i.state ←leaving
5: synchronize
6: if i.state =leaving and i.pred.state=staying then
7: p←random([0,1))
8: P←the node responsible for p
9: send help proposals to Pand its F ·log(u·ni) successors on the ring
10: if a node jaccepts then
11: migrate to the middle of interval Ij
12: at any time if any node proposes then reject
13: else if Iiis medium then
14: if node jcontacts then
15: reject
16: else if Iiis long then
17: wait for help proposals
18: if node jcontacts then
19: let jmigrate to the middle of Ii
no nodes join or leave the network in the meantime, the next calls to A-1 will
categorize the intervals identically to the initial call.
The algorithm AS will minimize the length of the longest interval, but we also
have to take care that no interval is too short. Therefore, before we begin the routine,
we force all nodes responsible for very short intervals to leave the network. By doing
this, we assure that the length of the shortest interval in the network will be bounded
from below by Ω1
n. We have to explain why this does not destroy the structure of the
network.
First of all, it is possible that we remove a huge fraction of the nodes. It is even possible
that a very long interval appears, even though the network was balanced before. This is
not a problem, since the algorithm will rebalance the system. Besides, if AS is used
also for new nodes at the moment of joining, this initialization will never be needed.
We do not completely remove the nodes with too short intervals from the network. The
number of nodes nand thus also the number of markers is unaffected, and the removed
nodes will later act as though they were simply responsible for short intervals. Each
of these nodes can contact the network through the node responsible for its marker.
Another possibility is for each such node to have a virtual position on the ring and
4.1 T AS 27
contact the network through a node responsible for this position.
The algorithm AS works in rounds. The routine works differently for different
nodes, depending on the initial node’s interval’s length. The pseudocode accounting
for all types of intervals is depicted above. Below, we elaborate on the behavior of all
types of nodes.
The most interesting part of the algorithm takes place in nodes responsible for short
intervals (lines 1 −12). The purpose of the nodes responsible for short intervals is to
find and help nodes responsible for long intervals, thus reducing their load. The first
thing a node responsible for a short interval does is to make a decision if it can send
a help proposal in this round. The variable state stores the decision of a node: it is
set either to leaving or staying which means that the node is or is not willing to help,
respectively. In line 2 the node initially decides to stay, as in order to migrate it has to
check that its neighborhood fulfills additional conditions. The first condition, checked
in line 3 is that the predecessor of the node is short too. If this is the case, and the node
migrates, its predecessor overtakes its job and is not overloaded. The second condition,
checked in line 6 is that the predecessor decides not to migrate in this round. In order to
maximize the number of nodes which fulfill both conditions and are willing to migrate,
each node, after checking the first condition, decides to change its state to leaving
only with probability 1
2in line 4. Before checking the second condition all nodes are
synchronized in line 5. It assures that a node checks the state of its predecessor only if it
has already been fixed. In the next section we prove that in each round, linear number
of nodes are willing to migrate, with high probability.
If a node responsible for a short interval decides to send a help proposal, it sends it
to a random place pon the ring chosen in line 7. In lines 9 −12 it checks whether the
node responsible for por any of its Θ(log n) successors is responsible for a long interval
and needs help. The distance through which a help proposal should be forwarded
is dependent on a constant Fwhich is a parameter of the algorithm. The larger the
constant, the longer the distance. If a node responsible for a long interval is found, the
node sending a help proposal migrates to its middle. In the next section we prove that
linear number of proposals together with each proposal visiting logarithmic number
of nodes, suffice to eventually cut each long interval into medium intervals with high
probability. In line 12 the algorithm assures that if a help proposal comes to a node
responsible for a short interval it is not accepted.
The medium intervals are on the one hand short enough, so that nodes responsible
for them do not need help, and on the other hand, they are too long to send help offers
themselves. Therefore the pseudocode for them (lines 13 −15) is the simplest: it rejects
all help proposals.
As long as a node is responsible for a long interval, it wants to split. It does not
send messages calling for help, but rather waits for messages with help offers. A node
28 S S L B
responsible for a long interval awaits help proposals (lines 16 −17) and splits as soon as
one comes (lines 18 −19).
4.2. The analysis of AS
All properties of the algorithm AS are stated and proved in the following theorem.
Theorem 4.1. AS has the following properties, all holding with high probability:
1. In each round each node incurs a communication cost of at most O(D+log n).
2. The total number of migrated nodes is within a constant factor from the number of mi-
grations generated by an optimal centralized algorithm with the same initial network
state.
3. Each node is migrated at most once.
4. O(1) rounds are sufficient to achieve constant smoothness.
Proof. The first statement of the theorem follows easily from the algorithm due to
the fact that each node responsible for a short interval sends a message to a random
destination which takes time at most Dand then consecutively contacts the successors of
the found node. This incurs additional communication cost of at most F ·(log n+log u).
Additionally, in each round each node changes the position of its marker and this
operation also incurs communication cost at most D.
If a node tries to leave the network and join it somewhere else, it is certain that
its predecessor is short and is not going to change its location. This assures, that the
predecessor will take over the job of the migrating node and it will not become long.
Therefore, no long interval is ever created and as a result of the algorithm long intervals
are cut into halves. Both AS and an optimal centralized algorithm have to cut each
long interval into medium intervals. Let Mand Sbe the upper thresholds for the lengths
of a medium and short interval, respectively, and Ibe an arbitrary long interval with
length l(I). An optimal algorithm needs at least dl(I)/Mecuts, whereas AS always
cuts an interval in the middle and performs at most 2dlog(l(I)/S)ecuts, which can be at most
constant times larger because M/Sis constant. This proves the second statement of the
theorem.
The statement that each node is migrated at most once follows from the reasoning
below. A node is migrated only if its interval is short. Due to the gap between the
upper threshold for short interval and the lower threshold for long interval, after being
migrated a node never takes responsibility for a short interval, so it will not be migrated
again.
4.2 T AS 29
In order to prove the last statement of the theorem, we show the following two lemmas.
The first one shows how many nodes responsible for short intervals are willing to help
during a constant number of rounds. The second one states how many helpful nodes
are needed so that AS succeeds in balancing the system.
Lemma 4.2. For any constant a ≥0, there exists a constant c, such that in c rounds at least a ·n
nodes send help proposals, with high probability.
Proof. As stated before, the length of each medium or long interval is at least 4
nand
thus at most 1
4·nintervals are medium or long. Therefore, we have at least 3
4·nnodes
responsible for short intervals.
We number all the nodes in order of their position on the ring with numbers 0,...,n−1.
For simplicity we assume that nis even, and divide the set of all nodes into n/2 pairs
Pi=(2i,2i+1), where i=0, ..., n
2−1. Then, there are at least 1
2·n−1
4·n=1
4·npairs Pi,
which contain indexes of two short intervals. Since the first element of a pair is assigned
state staying with probability at least 1/2 and the second element state leaving with
probability 1/2, the probability that the second element is eager to migrate is at least 1/4.
For two different pairs Piand Pj, migrations of their second elements are independent.
We stress here that this reasoning only bounds the number of nodes able to migrate from
below. For example, we do not consider first elements of pairs which also may migrate
in some cases. Nevertheless, we are able to show that the number of migrating elements
is large enough. Notice also that even if in one round many of the nodes migrate, it is
still guaranteed that in each of the next rounds there will still exist at least 3
4·nshort
intervals.
The above process stochastically dominates a Bernoulli process with c·n/4 trials
and single trial success probability p=1/4. Let Xbe a random variable denoting the
number of successes in the Bernoulli process. Then E[X]=c·n/16 and Chernoffbound
guarantees that X≥a·n, with high probability, if we only choose clarge enough with
respect to a.
In the following lemma we deal with cutting one long interval into medium intervals.
Lemma 4.3. There exists a constant b such that for any long interval I, after b ·n help proposals
are generated overall, the interval I will be cut into medium intervals, w.h.p.
Proof. For the further analysis we will need that l(I)≤log n
n, therefore we first consider
the case where l(I)>log n
n. We first estimate the number of help proposals that have to
be generated in order to cut Iinto intervals of length at most log n
n. We depict the process
of cutting Ion a binary tree in the following way. Let Ibe the root of this tree and its
children the two intervals into which Iis cut after it receives the first help proposal. The
tree is built further in the same way and achieves its lowest level when its nodes have
30 S S L B
length ssuch that 1
2·log n
n≤s≤log n
n. The tree has height at most log n. If a leaf gets log n
help proposals, it can use them to cover the whole path from itself to the root. Such
covering is a witness that this interval will be separated from others. Thus, if each of the
leaves receives log nhelp proposals, the interval Iwill be cut into intervals of length at
most log n
n.
Let b1be a sufficiently large constant and consider the first b1·nhelp proposals. We will
bound the probability that one of the leaves gets at most log nof these help proposals.
Let Xbe a random variable depicting how many help proposals fall into a leaf J. The
probability that a help proposal hits a leaf is equal to the length of this leaf and the
expected number of help proposals that hit a leaf is E[X]≥b1·log n. Chernoffbound
guarantees that, if b1is large enough, the number of help proposals is at least log n, with
high probability.
There are at most nleaves in this tree, so each of them gets sufficiently many help
proposals, with high probability. In the further phase we assume that all the intervals
existing in the network are of length at most log n
n.
Let Jbe any of such intervals. Consider the maximal possible set Kof predecessors
of J, such that their total length is at most 2 ·log n
n(excluding J). Maximality assures that
l(K)≥log n
n. The upper bound on the length assures that even if the intervals belonging to
Kand Jare cut (“are cut” in this context means “have been cut”, “are being cut” and/or
”will be cut”) into smallest possible pieces (of length 2
n), their number does not exceed
6·log n. The parameter Fis sufficiently large, so that if a help proposal hits some of
them and is not needed by any of them, it is forwarded to Jand can reach its end. We
consider only the help proposals that hit K. Some of them will be used by Kand the rest
will be forwarded to J.
Let b2be a constant and Ybe a random variable denoting the number of help proposals
that fall into Kin a process in which b2·nhelp proposals are generated in the network.
We want to show that, with high probability, Yis large enough, i.e. Y≥2·n·(l(J)+l(K)).
The expected value of Ycan be estimated as E[Y]=b2·n·l(K)≥b2·log n. Again, Chernoff
bound guarantees that Y≥6·log n, with high probability, if b2is large enough. This is
sufficient to cut both Kand Jinto medium intervals.
Now taking b=b1+b2, finishes the proof of Lemma 4.3.
Combining Lemmas 4.2 and 4.3 and setting a=b, finishes the proof of Theorem 4.1.
Both the approximation and load balancing algorithms are analyzed in a static en-
vironment but they can also run as everlasting processes. In such case the following
corollary summarizes the results of Theorem 4.1 and Chapter 3.
4.3 E 31
Corollary 4.4. If the algorithms A-1 and AS are executed as an everrunning pro-
cesses and the system is static for Θ(log n)time steps, they rebalance the system so, that smooth-
ness is constant, with high probability. The probability of success in time step t is independent of
the probability of success in time step t + Θ(log n).
4.3. Experimental evaluation
In order to adjust the parameters and examine the practical usefulness of our approxi-
mation scheme (the one using log nmarkers) and the algorithm AS we performed
experiments on a single machine. In all tests the underlying space was the ring of natural
numbers {0,...,264 −1}with arithmetics mod 264.
0
20
40
60
80
100
120
140
160
180
200
0 20 40 60 80 100 120 140
The resulting smoothness
Number of markers per node
Figure 4.1.: Smoothness according to approximation of log n
There were two groups of experiments. The first group was to find the best parameters
for the algorithm in the static case. In the tests in this group nnodes were inserted in the
very beginning and the system was static (no nodes joined or left during the execution
of the algorithm. The number of nodes nwas 25, 211, 214 or 217; they were chosen to be
powers of two only for easier implementation. We present the results for the highest
number of nodes, that is for approximately 130 thousands nodes.
32 S S L B
The first parameter we tested was the number of markers that each peer should
insert into the network. The number of markers δranged from 8 to 128 and the other
parameters were set as follows. Each help proposal was forwarded through 128 nodes to
assure quick convergence. A node could send a help proposal if its interval glued with
the interval of its predecessor had weight at most δ. An interval was considered very
light if its weight was below δ/2. We measured the final smoothness in the system and
noticed that setting δto 40 already yields smoothness below 20 and values larger than
64 do not improve the system much. The final smoothness for different δparameters is
depicted in Figure 4.1.
With the number of markers fixed to 64 we performed tests which had to estimate
what intervals should be considered very light. The parameter deciding how far a
request should be forwarded was set to 128 again. The final smoothness for different
weights of a very light interval is depicted in Figure 4.2. We propose to consider an
interval very light if its weight is below 7·δ
8=56.
0
20
40
60
80
100
120
140
160
180
0 10 20 30 40 50 60 70
The resulting smoothness
Very Light Interval parameter
Figure 4.2.: Smoothness according to the weight of the very light interval
The last set of tests for the static scenario was developed to optimize the distance to
which a proposal should be forwarded. This time the output of the analysis was not
smoothness but the time it takes for the scheme to stabilize. From Figure 4.3 it can be
4.3 E 33
seen that already very short forwarding gives good results. We chose to forward the
proposal through 16 nodes.
0
5
10
15
20
25
30
35
40
45
0 20 40 60 80 100 120 140
Steps to converge
The Forward parameter
Figure 4.3.: Convergence according to the forward parameter
To test the behavior of our algorithm in a dynamic scenario, we took the following
model from [PRU01]. In each time step the number of nodes joining at this time step is
chosen according to the Poisson distribution with mean λ, and for each node its time
to stay in the system is chosen according to the exponential distribution with mean µ.
We tested the behavior of the system for two cases. In the first one these parameters
were set to (λ, µ)=(10,100), which yields the expected number of nodes present in the
system at any time step equal to λ·µ=1000. In the second case the parameters were
(λ, µ)=(100,100) yielding the expected number of nodes present in the system equal to
λ·µ=10000. In both cases the smoothness never exceeded 14 and was at most 9 in 97%
of time steps. For comparison, if we insert the nodes uniformly and independently at
random the smoothness varies approximately between 2500 and 3 ·104for 1000 nodes.
For 10000 nodes it is usually approximately between 3 ·104and 7 ·106.
C 5
Bounding The Communication Cost
In this chapter we describe a communication cost efficient algorithm AD . This
scheme reacts slower to the changes taking place in the system but its communication
cost can be amortized against the communication cost of keeping the network connected.
Similarly to AS , AD can start with arbitrary placement of nodes on the ring
but there are a few differences in the model describing the system. First of all, the
system is dynamic and the rate at which nodes join and leave the system is chosen by an
oblivious adversary. Secondly, the model of synchronicity we use for AD is much
weaker than the one used for AS . We still assume that all nodes have the same
speed up to constant factors and that messages sent between neighbors are guaranteed
to reach their destinations in a bounded time. However, this time each node works for
itself and it can even happen that some nodes perform the scheme while others sleep.
Last, we analyze the behavior of the algorithm in a bounded time interval which has
length equal to half-life. We prove that in this time interval each node generates traffic of
at most O(log2n) neighbor-to-neighbor messages. This is at most by a logarithmic factor
larger than the communication cost of keeping the system connected. We show that
AD repairs the imbalance present in the system in the beginning of the considered
time interval. The imbalance introduced by the adversary during the interval has to be
coped with later.
5.1. The load balancing algorithm AD
In Chapter 3 we classified nodes according to their weights and said that the algorithm
AD would reorganize the positions of nodes on the ring, so that the resulting
intervals have weights between wmin ·δand 32 ·δ. By Fact 3.6, this implies that in
the end the smoothness is constant, with high probability.
35
36 B T C C
The algorithm minimizes the weight of the heaviest interval, but we also have to take
care that no interval is too light. We do a similar thing to what we did in the self-
stabilizing scheme, that is, we make all nodes responsible for very light intervals leave
the network. They still keep their markers, so that the total weight stored in the system
is unaffected. Such nodes are still called very light nodes, even though they do not
actively participate in network construction anymore, i.e. none of them is responsible
for any part of the [0,1) ring.
Each of the very light nodes knows its virtual position on the ring and contacts the
network through the node responsible for that position. We assume that the node is not
heavy, since it could immediately split using the very light node. If very light nodes
choose their virtual positions uniformly at random, they increase the in-degree of the
nodes through which they communicate at most by Olog n
log log n, with high probability.
The routine works differently for different nodes, depending on the node’s interval
weight. Except for the classification of nodes according to the weights of their inter-
vals we also distinguish two types of light nodes. In order to be able to bound the
communication cost we divide the light nodes into two groups: active and passive. The
pseudocodes for different types of nodes are listed at the end of the chapter. They are
all meant to run in infinite loops, so that before every loop execution the current weight
of a node is checked in order to decide which part of code should be run.
The most important part of the algorithm is its behavior for the active light nodes. The
procedure which active light nodes execute periodically can be divided into three main
parts. The first part is a locking protocol in which it is settled which nodes are allowed
to participate in the second part. The second part relies on sending help proposals and
possibly migrating and the third part decides how long a node should stay active.
An active node, in order to send a help proposal, has to make sure that its predecessor
will overtake its job in case the proposal is accepted – it has to lock the predecessor for
the time of sending the proposal. Nodes use the following types of messages in order
to communicate with their successors and predecessors. A node itells a predecessor
that iwould like to send a help proposal in this turn and also lock the predecessor by
sending a lock-request message. In order to tell the predecessor that iis not willing
to send help proposals at the moment, isends a no-lock-request message to i.pred. To
inform the successor about the decision, isends a lock-allow or lock-deny message in
order to allow or deny the successor to lock i, respectively. If a node isucceeds to lock
its predecessor and isends its help proposal, it has to inform the predecessor about it. It
does so by sending an unlock message to it.
The main part of the locking protocol is defined in lines 1 −12. Every time the
procedure is executed, each node makes a random decision if it should try to send a
help proposal or let its successor send one. Each of these possibilities has probability 1
2.
After making the decision, a node informs its predecessor and successor about it (lines
5.1 T AD 37
5−6 and 8−9) and waits for their answers. In this point, the scheme syncrhonizes itself,
that is, each node waits for messages from its neighbors, independently of its decision
(lines 5 −7 and 10 −12) and also independently of the current weights of its neighbors.
For each node there are a few possibilities, depending on the decisions of itself and the
decisions of its neighbors. The most important two possibilities are when the decision
of the considered node iis compatible with the decision of the proper neighbor. One
case is when the node idecides to try to send a help proposal and its predecessor i.pred
allows it to; this is the code in lines 18 −30. The other case (symmetrical to the previous
one) is when the node idecides to allow the successor to send a help proposal and the
successor wants to send it; this is the code in lines 13 −17.
In case when a node decides to let its successor send a help proposal and the successor
wants to do it, the node is not allowed to take any actions while the successor is looking
for a heavy node. That is because the node may have to overtake the job of its successor
when it finds a heavy interval. Therefore in lines 14−17 the considered node waits for an
information from the successor that the successor has finished its search. However, the
node is not allowed to ignore messages from its predecessor during this period of time.
Therefore, whenever any kind of message comes from its predecessor, the considered
node always sends a message telling the predecessor that our node is not willing to send
help proposals at the moment.
If a node makes a decision to send a help proposal and it has a permission from
its predecessor, it is also careful not to ignore messages from its successor. Whenever,
during the procedure of looking for heavy intervals, a message from the successor comes,
the node answers that it does not allow the successor to send any help proposals. The
main part of the algorithm is the search for a heavy node and possible migration in order
to help. This is done by choosing a random position on the ring (in line 21) and the node
jresponsible for it (in line 22). If it is possible to migrate to exactly the chosen position,
the node idoes so and if it is not, it checks a logarithmic number of successors of j. If any
of them is heavy, imigrates to the middle of its interval (lines 26 −28). Independently
of the success of the search, in the end the predecessor of iis unlocked (line 29).
The last part of the algorithm for active light nodes is responsible for bounding the
number of sent help proposals. Every time the node isends a help proposal, the number
of help proposals which still have to be sent is decreased and checked if it fell down to
0. If so, the node becomes passive (lines 30 −32).
The part of the algorithm responsible for handling passive light intervals is much
simpler, since they do not take active part in sending help proposals. The algorithm
may be divided into two blocks. The first one (lines 1 −9) assures that active light
intervals do not fall into deadlocks waiting for messages from neighbors. The second
part is responsible for activating a passive node. A passive light node becomes active
whenever the set of the markers it stores, changes (lines 10 −12). When it becomes
38 B T C C
active, the initial number of messages it wants to send is set to maximum, that is to the
constant A. This is done in lines 10 −12.
A passive node idoes not want to send help proposals but does not mind overtaking
the load of its successor. For the first reason, whenever its predecessor awaits a message,
the node itells it that idoes not want to lock it (lines 1 −2). If the successor of iwants
to migrate, iwants to enable it to. Therefore, it sends a message letting the successor
send help proposals and waits until it has finished (lines 3 −7). If the successor sends a
non-requesting message, ionly answers because the successor awaits an answer (lines
8−9).
The only task of nodes responsible for medium and heavy intervals is to properly
answer the requests of their neighbors when they await an answer. Neither medium
nor heavy intervals want to send help proposals or overtake the job of their successors.
Therefore, when a message comes to a medium or heavy node ifrom its predecessor
it is answered with a message telling it that idoes not want to send help proposals.
When a message from the successor of icomes, the successor is not allowed to send
help proposals. It is important that a requesting neighbor is always answered to avoid
deadlocks in the communication protocol. Active light nodes check weight of their
predecessors but it might happen that this weight grows afterwards.
5.2. The performance and dynamics model
We model an oblivious adversary, against which the cost of the algorithm is measured
in the following way.
1. The adversary generates a pattern of insertions and removals of nodes in any way
she wants as long as she does not measure the performance of our algorithm. Thus,
she can generate the worst possible starting situation in an adaptive way, in the
sense, that she sees the random choices of AD .
2. The adversary chooses a moment in time t0, at which she wants to start the mea-
surement. The number of nodes at time tifor i≥0 is denoted by ni. As mentioned
above, the distribution of the nodes on the ring at t0can be arbitrary. However,
it has to be an output of our AD , so there are no very light intervals in the
system.
3. The adversary chooses the patterns of insertions and removals for the current
phase. She can decide about insertions online in an adaptive way, that is the
number of nodes inserted at time imay depend on the behavior of the algorithm
up to time i−1. The removals are generated in an oblivious way, i.e. the adversary
decides about Time To Live (TTL) of each node at the moment of its insertion; for
5.3 P AD 39
the sake of the analysis, the TTLs of the nodes present in the network at time t0,
are assumed to be generated at t0. The algorithm does not know the TTLs.
4. At every time step nodes are inserted and removed according to the pattern chosen
by the adversary. The algorithm responds with communication and possible
migrations of nodes.
5. The phase ends after a half-life, that is when the adversary either inserts n0new
nodes or removes n0
2of the nodes present in time step t0. (τt0time steps). The
adversary is switched offand the algorithm is allowed to communicate and migrate
in an extension of Θ(log n) time steps.
In order to precisely express the performance of the algorithm, we consider the fol-
lowing virtual process after the half-life and the extension have ended. For each node,
which left during the considered half-life, we re-insert it into the network in its last po-
sition, provided that such an insertion does not create a very light interval. We perform
the re-insertions sequentially in any order, so that there are no ambiguities with parallel
checks. In Theorem 5.1 we prove that the system is balanced, i.e. the smoothness is
constant, after such reinsertions. Thus, our algorithm copes with the imbalance present
in the network at the beginning of the half-life. The imbalance created by the adversarial
deletions of nodes will be repaired in the next half-life.
5.3. Performance and communication cost of AD
We prove the following theorem characterizing the communication cost of AD in a
half-life and stating how well the algorithm rebalances the system. In order to be able
to bound the communication cost, we set A(the number of messages a node sends after
it is activated) to be a sufficiently large constant.
Theorem 5.1. In the defined model AD fulfills the following properties, both with high
probability.
1. The neighbor-to-neighbor communication cost of AD in the considered half-life is at
most O(log2n). This is at most by a logarithmic factor larger than the cost of keeping the
network connected.
2. If we re-insert the nodes removed in the current half-life as described above, smoothness is
constant at the end of the half-life.
For further analysis we need the following claims about the behavior of the network
in a half-life.
40 B T C C
Lemma 5.2. Denote the number of nodes in the network in the beginning of a half-life by n0,
the number of nodes present at time t by nt, and the number of markers at time t as mt, for t ≥0.
Then during the considered half-life:
1. the number of nodes in the network at time t is bounded by n0
2≤nt≤2·n0,
2. the number of markers in the network at time t is bounded by n0·δ
2≤mt≤2·n0·δ,
3. the number of insert /delete operations ranges from n0
2to 5·n0
2.
Proof. Let the nodes present at time t0be called old and the nodes inserted during the
half-life new. Before the half-life ends the adversary can remove at most n0
2old nodes
and insert at most 2 ·n0new nodes. Thus, the total number of nodes ranges from n0
2to
2·n0and the number of markers ranges from n0·δ
2to 2 ·n0·δ.
It is straightforward that removing n0/2 nodes is the fastest way to end the half-life,
whereas inserting n0−1 new nodes and removing them and n0/2 old ones is the best
way for the half-life to last long. Thus, the last statement follows.
Using Lemma 5.2, we can prove the communication-competitiveness of AD .
Proof of Statement 1of Theorem 5.1. For the exact analysis we modify our approxi-
mation and balancing algorithms slightly. Instead of δmarkers, each node inserts δ/2
insertion-markers and δ/2 deletion-markers into the system. If a passive light node i
stores a marker of a node jthen iis activated in two cases: if jjoins the network and the
marker is an insertion-marker or if jleaves the network and the marker is a deletion-
marker. Since the adversary is oblivious, we may think that the deletion-markers are
generated at the moment when a node leaves the network. Thus, when a node joins
or leaves the network, it contacts Θ(δ) places chosen uniformly at random at the mo-
ment in which the operation happens. Since, from Lemma 5.2, the total number of joins
and leaves in a half-life is Θ(n), the total number of contacted places in a half-life is
Θ(n·δ)= Θ(n·log n), all of which are independent.
There are a few types of communication cost that each node incurs. The first one,
we consider is the cost of inserting or removing markers when a node joins or leaves
the network. Each node inserts or removes δmarkers in such a situation and contacts
δrandom places. If δ=O(log n) this means total cost of O(D·δ)=O(log2n) direct
(neighbor-to-neighbor) messages.
The most direct cost incurred in the process of load balancing is the cost for sending
help proposals. Each node sends at most A·O(δ)=O(log n) proposals to random places,
with high probability. Each of these messages is sent to a distance of O(D+log n)=
O(log n), so the total neighbor-to-neighbor cost is O(log2n).
5.3 P AD 41
One of two indirect cost types is when a node tries to communicate to its predecessor
but the predecessor is busy sending a help proposal and always answers with lock-deny
messages. We amortize this cost against the cost of sending help proposals by the
predecessor. This means that for each node, this cost is A·O(δ)·O(D+log n)=O(log2n)
Another indirect communication cots is when an active light node iand its active or
passive light predecessor take random decisions. The probability that as a result of these
decisions, isends a help proposal is at least 1
4and the total number of help proposals
iis going to send is O(δ)=O(log n). With high probability, each successful decision
making is preceded by at most O(log n) unsuccessful ones. Thus, with high probability,
the number of unsuccessful decision making messages is O(δ·log n)=O(log2n)
The total neighbor-to-neighbor communication cost incurred by a single node is
O(log2n) and by Theorem 1.2 each node has to receive Ω(log n) messages in a half
life, so the communication cost of the approximation and balancing algorithms is at
most O(log n) times larger than the cost of keeping the system connected. This finishes
the proof of the first statement of Theorem 5.1.
It remains to prove that such communication is still sufficient to keep the system bal-
anced. First, we show that the number of help proposals in a half-life is sufficiently
high.
Lemma 5.3. There exist n0
16 light nodes alive during the whole half-life which send at least A
help proposals and still are not accepted anywhere. With high probability, a fraction 1
3of the help
proposals they want to send is sent in the half-life or in the extension.
Remark 5.4. If our model was static and we considered only one half-life and an extension, we
could prove that each node can finish its communication in logarithmic time. In the model we
consider it is possible that a node is active much longer than the duration of the half-life, even
with the extension.
Proof. There are at least n0/2 nodes which are alive during the whole half-life - we
call them immortals. Since an interval is light when its weight is below 16 ·δand, by
Lemma 5.2, the total weight in the network is at most 2 ·n0·δ, at most n0/8 nodes are
not light, and thus at every time step at least 3
8·n0immortals are light.
Consider the distribution of the immortals in the beginning of the half-life and ignore
other nodes for a while. We couple the immortals in the following way. An arbitrary
immortal is chosen as the first element of the first pair and its successor is the second
element of the first pair. Consecutive immortals (according to the order on the ring)
build consecutive pairs. We want to lower bound the number of pairs with two light
nodes. There are n0/4 pairs, at most n0/8 of which can have at least one medium or
heavy node. Thus, in at least n0/8 pairs, an immortal will be willing to migrate when it
42 B T C C
is activated. Notice that presence of other nodes can only decrease the weight of some
immortals and cannot discourage any immortals from sending proposals.
As a node only wishes to share its load with another node, if its current weight exceeds
32 ·δ,n0/16 nodes suffice to store all load. Thus, only n0/16 immortals have a chance to
migrate and at least n0/16 send the maximum number Aof messages.
Assume that in order to send a message, an active light node does not have to lock its
predecessor. Then all Amessages sent by a node will be sent in time O(log n). Suppose
that the part of the algorithm responsible for trying to lock a neighbor is performed
immediately. Then at every time step, if an active light node is not sending a message, it
means that either it cannot lock its predecessor or that it has been locked by its successor.
In either case it implies that one of its neighbors is sending a message. This means that
at least 1
3of the messages will be delivered in the time, in which all messages would be
delivered if nodes could send messages without locking their predecessors.
We assign additional Θ(log n) time steps to each active light node. Each time there is a
chance for an active light node to synchronize with its predecessor and the synchroniza-
tion does not take place, we pay for such waste of time from this additional cost. There
are two reasons why the synchronization should not occur. One is that the considered
node decides to try to let its successor lock it. Another is that the predecessor of the
node decides not to allow the node to lock it. If within the Θ(log n) additional time steps
a node succeeds Atimes, it manages to send all its messages. At each trial the node
succeeds with probability at least 1
4. With high probability, there is at least one success
in Θ(log n) time steps, and thus also Asuccesses in Θ(log n) trials. Also, all nodes use
Θ(log n) additional time steps to succeed Atimes, with high probability. This finishes
the proof of Lemma 5.3.
Below we prove the second part of Theorem 5.1. As writing all constants exactly
significantly decreases the readability of the proof, we simplify it by assuming that
nodes operate on lengths instead of weights and that an interval is considered very
short, if its length is below 1
n. This assumption is justified by Fact 3.6.
Proof of Statement 2of Theorem 5.1. From Lemma 5.3, we know that in the half-life
(or in the extension guaranteed by the definition of the model) Θ(n) proposals from
immortals are generated in the system, and that we can make the constant in the Θ
notation arbitrarily large just by increasing the Aconstant in the algorithm.
Let y∈[0,1) be an arbitrary real number on the ring. Consider the interval Iyof length
log n
nstarting in y. Let xand zbe equal to y−log n
nand y+log n
nrespectively. Ixand Izare
intervals of the same length as the length of Iystarting in xand z, respectively. In other
words, Ixand Izare the intervals preceding and succeeding Iy, respectively.
5.3 P AD 43
Chernoffbound guarantees that, with high probability, the number of help proposals
from immortals that fall into Ixis Θ(log n). We say that a proposal reaches z, if it contacts
the node responsible for z. The parameter Fis chosen so, that even if between xand z
there are as many nodes as possible, a proposal travelling from xand forwarded through
all the nodes can reach z.
At this point of the analysis we allow the adversary to be adaptive. She can choose,
which of the mortal nodes die and when. When she removes a node from the network,
another node can become heavy, and it should be split by an immortal. Our analysis
is based on the fact that when an immortal helps a heavy node, it does it at least
until the end of the half-life. The adversary may remove some nodes from Iyafter all
proposals from immortals have been forwarded through Iyand the algorithm repairs
such imbalance in the next half-life.
Below we show that the following event happens at least 2 ·log ntimes: a help
proposal that hit Ixreaches zat the moment when there is a node in Iz. If such an
event happens, then either there are no heavy nodes in Iy, or the immortal is inserted
somewhere between yand the node in Iz. At most 2 ·log nimmortals can be inserted in
the interval Iy]Iz, so after 2 ·log nsuch events there are no heavy nodes in Iy.
First, we show that Θ(log n) of the proposals reach zin the process of forwarding (not
necessarily when there exists a node in Iz). This is true, since in Ixand Iythere is place
for at most 2 ·log nof them. Notice that we cannot be sure which proposals (in order
of coming) reach z: it may happen that the first one gets through, right after that many
nodes die in the network, and the next immortals split the heavy intervals that appear
in Ixand Iy.
We introduce a notion of a steady barrier as the first node behind zon the ring which will
survive until the end of the half-life. At the beginning of the process we can bound the
distance between yand the barrier by 1. When during our process an immortal reaches z
and there is no node in Izthen the immortal is inserted between yand the current steady
barrier and becomes the new steady barrier. In the worst case, the distance between y
and the steady barrier halves each time when it happens, so after fewer than log nsuch
events there is a steady barrier in Iz. Thus, log nimmortals can be used to assure that
there is a steady barrier in zand further 2 ·log nto assure that Iyis balanced.
If we divide the ring into n
log ndisjoint intervals each of length log n
n, then the probability
that the algorithm fails for at least one of them is at most ntimes larger than the
probability of failure for a single interval. Thus, the algorithm AD succeeds on the
whole ring, with high probability.
44 B T C C
Algorithm 3 AD (an active light node iresponsible for an interval Ii)
1: if i.pred is not light then
2: change ito passive light
3: decision ←random({0,1})
4: if decision =0then
5: send a lock-request message to i.pred
6: send a lock-deny message to i.succ
7: else
8: send a no-lock-request message to i.pred
9: send a lock-allow message to i.succ
10: repeat
11: do nothing
12: until received messages from i.succ and i.pred
13: if decision =1 and received a lock-request message from i.succ then
14: repeat
15: if received a message from i.pred then
16: send a no-lock-request to i.pred
17: until received an unlock message from i.succ
18: else if decision =0and received a lock-allow message from i.pred then
19: if a message lock-request from i.succ comes during execution of lines 22−30 then
20: send a lock-deny to i.succ
21: p←random([0,1))
22: j←the node responsible for p
23: if jis heavy and migrating to pdoes not produce a very light or light interval
then
24: migrate ito the point p
25: else
26: contact jand its F ·∆successors
27: if a node kaccepts then
28: migrate to the middle of Ik
29: send an unlock message to i.pred
30: activity level ←activity level −1
31: if activity level =0then
32: change ito passive light
5.3 P AD 45
Algorithm 4 AD (a passive light node iresponsible for an interval Ii)
1: if a message comes from i.pred then
2: send a no-lock message to i.pred
3: if a lock-request message comes from i.succ then
4: send a lock-allow message to i.succ
5: repeat
6: do nothing
7: until received an unlock message from i.succ
8: if a no-lock-request message comes from i.succ then
9: send a lock-deny message to i.succ
10: if any change in markers stored in Iithen
11: i.activity level ←(A)
12: change ito active light
Algorithm 5 AD (a medium or heavy node iresponsible for an interval Ii)
1: if a message comes from i.pred then
2: send a no-lock message to i.pred
3: if a lock-request message comes from i.succ then
4: send a lock-deny message to i.succ
Part II.
Transparent Data Structures
47
C 6
Introduction to Transparent Data
Structures
In the second part of this thesis we propose a new class of memory models for im-
plementing data structures so that they can be emulated in a distributed environment
in a scalable and robust way. For this we use Peer-to-Peer networks as a basis. We
demonstrate the effectiveness of our approach by looking at a specific memory model
and implementing a binary search tree in it.
Mostof thedata structuresin thealgorithms literature arebased ononeof thefollowing
two basic models: the pointer model or the linear addressable memory model. In the
linear addressable memory model we have a linear addressable memory and every
read and write request to a memory cell can be processed at unit cost. In the pointer
model we only have memory cells and labeled pointers interconnecting the memory
cells. Instead of directly addressing the memory cells, all read and write accesses have
to be handled via pointers. The standard assumption is that forwarding a request along
a pointer can be done at unit cost. For papers on various pointer machine models see,
for example [BAG92, CD93, GK89, vEB90].
Both models are useful for the design of data structures for a single machine. However,
in a distributed environment both models have their limitations. The basic problem with
these models is that they give too much freedom in designing a data structure. This
can make it tempting to design data structures that are hard to emulate in a distributed
environment in a scalable, efficient, and robust way. Scalability means that a low degree
overlay network can be used for the emulation of the data structure, efficiency means that
operations in the data structure can be emulated with a low time and work overhead, and
robustness means that the data structure can recover from (a potentially large number of
arbitrary or random) memory faults. If a data structure is scalable, efficient, and robust,
Peer-to-Peer networks should form a very good foundation for it. If it is implemented
49
50 I T D S
on top of them, it should be able to use their main advantage, namely be able to access
a huge amount of resources.
6.1. Our contribution and organization of the second
part
We introduce a class of structured memory models that we call transparent memory
models. Transparent memory models are memory models in which memory accesses
can be emulated in a scalable overlay network with constant work overhead. The
transparency of the proposed models means that it should be possible to design data
structures without detailed knowledge about the underlying network.
As a specific example, we introduce a transparent memory model called the hypertree
memory model, which is based on the structures of a binary tree and a hypercube.
We design a scalable, dynamic overlay network that can emulate memory accesses
in this model with constant work. The network is based on the Continuous-Discrete
Approach [NW03a, NW03b] with its de Bruijn links but has additionally hypercubic
links like in Chord [SMK+01]. The model is designed specifically to emulate a binary
tree but it is sufficiently general to support other data structures, e.g., heaps.
We show how to implement a search tree in the hypertree memory model with the
property that the amortized work for insert, delete and search is the same as for the best
search trees in the pointer model, and additionally it can efficiently recover all remaining
information under arbitrary memory faults. Moreover, in the emulation, every node in
the dynamic overlay network only has to perform a worst-case logarithmic amount of
work for any insert, delete or search operation.
In the rest of this chapter, we describe two classic memory models in the context of
faulty memory cells and robustness. We discuss why the linear addressable memory
model and the pointer model are inadequate for achieving scalability, efficiency, and ro-
bustness in a distributed environment. Then, we give a detailed definition of transparent
memory models. We also describe the literature on search structures, as our example is a
binary search tree. In this description we concentrate on the robustness aspect of data
structures.
In Chapter 7 we define the hypertree memory model, which is a complete binary tree
with a set of hypercubic connections connecting nodes of each level. Such structure can
be emulated using de Bruijn connections for tree edges and Chord like connections for
the hypercubic edges. Using the Continuous-Discrete Approach, we develop a dynamic
graph which can emulate the hypertree with dilation at most 1, i.e. for each pointer in
the hypertree, the memory cells it connects are stored on the same node or on nodes
neighboring in the emulating network.
6.2 L 51
In Chapter 8 we present a specific application of the hypertree model, namely a
balanced binary search tree. The balance in the tree is achieved by assuring that for each
node the sizes of its left and right subtrees differ by only a constant factor. This assures
logarithmic height of the tree. The balancing of the tree is performed by procedures
similar to the procedures used to balance AVL and red-black trees, but in our model
moving a subtree incurs a cost proportional to its size and takes time proportional to
its height. With careful analysis we can show that the amortized cost of all search tree
operations is logarithmic per operation and that each single operation takes at most
logarithmic time.
6.2. Linear addressable memory model in a distributed
environment
Consider the problem of emulating a data structure in the linear addressable memory
model in a distributed environment. Since in this model it may be hard to predict
the access pattern without knowing the input in advance, a universal shared memory
implementation may have to be used. A vast number of shared memory platforms have
already been developed. In the context of Peer-to-Peer networks one can easily adapt
the Distributed Hash Tables like Chord [SMK+01], CAN [RFH+01], Pastry [RD01], and
Tapestry [HKRZ02] to build a scalable shared memory platform.
However, the problem with any scalable universal shared memory implementation is
that it has an inherent (time and work) overhead of essentially Ω(log n/log log n). This is
because by definition, a scalable solution requires the shared memory to be distributed
in a system interconnected by an overlay network of at most polylogarithmic degree.
The diameter of a network of polylogarithmic degree is at least Ω(log n/log log n) and
if a data structure and a network are not adjusted to one another we have to assume
that this is the the distance which has to be traveled at each access in the data structure.
The time overhead may be reduced with the help of pipelining techniques but reducing
the work overhead can be very difficult. It usually requires knowledge about the access
structure so that parts of the data structure that have a high access correlation (such as
a subheap in a heap) can be stored together or at a small hop-distance in the distributed
system and therefore fewer transmissions of requests over the network are necessary.
However, since in a pure shared memory model there is no incentive for a programmer
to implement data structures that have a high degree of locality, there may not be
a high access correlation that can be exploited. Providing the programmer with a
model in which each memory access costs only 1 unit independently of the access
pattern, we have to take into account that the access pattern is arbitrary, even adversarial.
Hence, it is not surprising that pure shared memory models have been criticized as
52 I T D S
being inadequate for distributed systems (e.g. [WWWK94]) and people have considered
alternative models including LogP [CKP+93], BSP [GV94], QSM [Ram97], and many
others. For a comprehensive discussion of these models we refer the reader to the book
by Savage [Sav98].
Robustness can also be a problem. A general technique for obtaining a robust, dis-
tributed shared memory platform is to use redundancy. However, redundancy is ex-
pensive and cannot protect against all possible bad events that may happen. Therefore,
it is important to also consider events in which an item in a data structure may not be
recoverable. Here, compact representations of data structures can help. Consider, for
example, the array implementation of a heap. If an element in the heap disappears, then
a simple reheap operation on the array can repair the heap. For other data structures, a
recovery may not be as easy as for the heap, and the linear addressable memory model
does not provide a natural way of dealing with it due to its structure-less nature.
6.3. Pointer model in a distributed environment
If we use the pointer model instead of the linear addressable memory model, then we can
get rid of some of the problems, such as identifying the access structure. For example,
by maintaining a network connection for each pointer in the data structure, we can
obtain a faithful (i.e., constant overhead) emulation of the original data structure in the
distributed system. However, if the data structure is much larger than the number of
processing units available for its emulation, then even though the data structure itself
may be scalable, embedding it in the wrong way into the distributed system may result
in a non-scalable overlay network for the emulation of the data structure. Embedding
a data structure in the right way (i.e., so that each pair of elements connected by a
pointer lie at a small distance from one another in the network) is a non-trivial problem
in general and has been extensively investigated in the context of network embeddings
and partitions in the past (e.g. [Lei92, MS90]). This problem can be removed when
embedding the data structure into a shared space, but as mentioned above, a scalable
implementation of a shared space has an inherent overhead of Ω(log n/log log n).
Data structures based on the pointer model can have serious problems with robustness
if they are not designed carefully. Pointer structures such as linear lists and trees, for ex-
ample, are not useful for a distributed environment because a single failure of a memory
cell (such as the root of a search tree) can make the whole data structure or a large part of
it inaccessible. Therefore, more fault-tolerant pointer structures have been investigated.
See, for instance, the work by Aumann and Bender [AB96] or on Skip Graphs [AS03],
Skip Nets [HJS+03], and deterministic variants of the skip graph [AS04b, HM04]. Since
these pointer structures have a high expansion, they can suffer many memory cell faults
6.4 T 53
and still have a large connected component. However, those parts disconnected from
the rest of the structure are lost. Our goal, instead, is to find mechanisms so that all
information in the data structure that did not get lost due to memory faults can be
recovered and reorganized so that the data structure is back in a legal state.
6.4. Transparent memory models
Transparent memory models belong to the class of structured memory models which
are defined as follows.
Definition 6.1. Astructured memory model consists of:
•a countably infinite set U of memory cells,
•an infinite family of pointer structures H={Hn=(Un,Fn)|n∈N,Un⊆U,Fn⊆
Un×Un},
where Hnis a connected graph for all n ∈Nand U =Sn∈NUn. We demand also that Un⊆Un+1
and Fn⊆Fn+1.
In such a model, requests can only be exchanged between adjacent cells and every
such request can be processed at a unit cost.
Definition 6.2. A structured memory model is called transparent if there is an infinite family
G={Gn=(Vn,En)|n∈N}of graphs of at most polylogarithmic degree such that for all n ∈N
and m ∈N, Hmcan be mapped to Gn:
•with dilation at most 1,
•with load at most O(log(|Um|)+|Um|/|Vn|).
This means that, for all nand m, there is a mapping f:Um→Vnof the cells in Hm
to the nodes in Gnsuch that the number of cells stored in a single node is at most by a
logarithmic term larger than average (for all v∈Vn,|f−1(v)|=O(log(|Um|)+|Um|/|Vn|)),
and a connection between any two cells is emulated by an edge or by storing them on
the same node (for all {v,w} ∈ Fm,f(v)=f(w) or {f(v),f(w)} ∈ En).
We think that transparent memory models will succeed in implementing of diverse
data structures on top of Peer-to-Peer networks. However, we bound our present
considerations to only one model, namely the hypertree memory model. It turns out that
for the hypertree model, there is not only an infinite family of static graphs that can
emulate Hwith dilation at most 1 but we can also specify dynamic graphs (i.e., nodes
can continuously enter and leave the graph) that can emulate Hwith dilation at most 1
at any time.
54 I T D S
6.5. Other work on search structures
Work on search structures has a long history. The most popular search structures are
probably AVL trees, red-black trees, skip lists, and splay trees. Whereas all of these
structures are efficient allowing to process insert, delete, and search operations with a
cost of O(log n), none of these structures is robust to memory faults.
There are basically two approaches of making search structures robust to memory
faults: using high-expansion pointer structures such as Skip Graphs [AS03] or the
Hyperring [AS04b], or embedding the search structure in a compact form into an ar-
ray [AL90, BDFC00, BFJ02, IKR81]. The first approach cannot recover from arbitrary
memory faults but can at least make sure that the number of non-faulty entries that
get disconnected from the largest connected component of non-faulty entries is within
a constant factor of the faulty entries [AW05]. The second approach allows, in princi-
ple, recovery from arbitrary memory faults, but it is not clear how much time would
be needed for that. But even if there were an efficient recovery mechanism, the fact
that insert and delete operations in these search structures require an amortized work
of Θ(log2n) [AL90, BDFC00, BFJ02, IKR81], which is also believed to be best possi-
ble [BDFC00, DZ90, DSZ94], makes them not particularly attractive.
Hence, it is rather surprising that we can demonstrate the feasibility of a search
structure with amortized O(log n) work for insert and delete operations and worst case
O(log n) work for search operations that can recover efficiently to a maximum possible
extent from arbitrary memory faults.
C 7
Hypertree Memory Model
In this chapter we describe a specific memory model, the hypertree model. We demon-
strate that the hypertree memory model can be emulated by scalable, dynamic graphs
in an efficient and robust way. For this we first introduce a new family of static graphs
and then demonstrate how to turn them into dynamic graphs that allow the hypertree
memory model to be emulated with dilation 1. This new family, the de Bruijn cube,
consists of graphs that are the union of the de Bruijn graph and the hypercube. We need
to take the union of these graphs because neither the de Bruijn graph nor the hypercube
can emulate our hypertree model with dilation 1 but their combination can. The dy-
namic version of the de Bruijn cube, a continuous de Bruijn cube, is solely based on the
Continuous-Discrete Approach by Naor and Wieder.
In the hypertree memory model, U={0,1}∗, i.e., the cells in the model are labeled
by binary strings. The structure Hinterconnecting these cells is defined as follows.
(Recall that the Hamming distance H(v,w) of any two bit strings v,w∈ {0,1}kis equal to
Pk
i=1|vi−wi|.)
Definition 7.1. Ahypertree H=(U,E)is an infinite binary tree in which
•the root has the label (the empty label),
•every node v ∈V is connected to v0and v1(the tree edges) and
•every node v ∈V is connected to every node w with |v|=|w|and Hamming distance
H(v,w)=1(the hypercube edges), i.e., v and w only differ in one bit.
We designed the hypertree specifically to support an efficient implementation of a
search tree because the tree edges will be needed for the search tree structure and the
hypercube edges will be needed for the balancing.
55
56 H M M
7.1. A family of graphs for the hypertree model
Consider the following two well-known classes of graphs.
Definition 7.2 (de Bruijn).For any d ≥0, the d-dimensional de Bruijn graph DB(d)is an
undirected graph G =(V,E)with node set V ={0,1}dand edge set E that contains all edges
{v,w}with v =(v1,...,vd)and w ∈ {(x,v1,...,vd−1)|x∈ {0,1}}.
Definition 7.3 (Hypercube).For any d ≥0, the d-dimensional hypercube HC(d)is an
undirected graph G0=(V,E0)with node set V ={0,1}dand edge set E0that contains an edge
between any two nodes v,w∈V with H(v,w)=1.
When combining these two graphs, we obtain the following new class of graphs.
Definition 7.4 (de Bruijn cube).For any d ≥0, the d-dimensional de Bruijn cube DC(d)is
an undirected graph Gd=(Vd,Ed)with node set Vd={0,1}dand edge set Ed=E∪E0, where
E={{v,w}:v=(v1,...,vd),w∈ {(x,v1,...,vd−1) : x∈ {0,1}}}and E0={{v,w}:v,w∈V,
H(v,w)=1}, i.e. E and E0define respectively de Bruijn and hypercubic sets of edges on the same
set of nodes.
Given a finite binary string b=(b1b2. . . bk), let bR=(bk. . . b2b1) and prefixk0(b)=
(b1. . . bmin{k,k0}). For any two binary strings aand b,a◦brepresents their concatenation. In
order to map the cells of any Hdto the nodes of any Gd0, we use the following mapping
fd,d0:Ud→Vd0:
fd,d0(s)=prefixd0(sR◦r) for all s∈Ud\{}
where r∈ {0,1}d0is the bit string the root is mapped to in fand can be selected in an
arbitrary way. Intuitionally, if a node of the Hdis mapped to a node v=(v1,...,vd0) of
the Gd0, then its children in Hdare mapped to (0,v1,...,vd0−1) and (1,v1,...,vd0−1), that is
to two of the de Bruijn neighbors of vin Gd0. This mapping has the following important
property:
Theorem 7.5. For any d ∈N, d0∈Nand r ∈ {0,1}d0, the embedding of Hdinto Gd0via fd,d0has
a dilation of at most 1 and a load of at most |Ud|/|Vd0|+d+1.
Proof. For every node vof Hd, its edges to v0 and v1 can be emulated by the de Bruijn
edges of Gd0and every connection between vand wwith Hamming distance H(v,w)=1
in Hdcan be emulated by the hypercubic edges of Gd0.
Every level 0 ≤i≤dof Hdis evenly distributed among the nodes of Gd0. If i≤d0, then
each node of Gd0has load at most 1. Otherwise, each node of Gd0has load exactly 2i−d0. If
we sum the load over all d+1 levels, the maximum load of a node is |Ud|/|Vd0|+d+1.
7.2 D 57
Hence, the hypertree model satisfies the properties of a transparent memory model,
that is, it can be emulated by a de Bruijn cube with low dilation and also low load as
long as the label sizes of the allocated memory cells are not too widely spread.
A particularly interesting property of the de Bruijn cube is that the root of the hypertree
model can be placed at any position in [0,1). This allows to emulate multiple hypertree
memories simultaneously in a load-balanced manner, which is very useful when running
multiple applications on top of the same distributed system.
7.2. Dynamic graphs for the hypertree model
Using the Continuous-Discrete Approach of Naor and Wieder [NW03b], one can trans-
form the family of de Bruijn cubes into a dynamic de Bruijn cube suitable for Peer-to-Peer
systems. Interpreting every binary label (v1,...,vd) as a point x=Pd
s=1vs/2sin [0,1), we
can formulate the following continuous variant of the de Bruijn cube.
Definition 7.6. The continuous de Bruijn cube consists of the space V =[0,1) and a set of
functions f0,f1:V→V and gs:V→V, s ≥1with
•fs(x)=(x+s)/2for both s ∈ {0,1}(which represents the de Bruijn edges) and
•gs(x)=x⊕2−sfor all s ∈N(which represents the hypercube edges), where x ⊕y is the
bit-wise XOR of (the binary representations of) x and y.
Next we show how to convert this back into a discrete graph, using the Continuous-
Discrete Approach of Naor and Wieder [NW03b]. Recall from the first part of this thesis,
that for a placement of a set of nnodes on the [0,1) ring, a node iis responsible for
interval Iifrom itself to its successor succ(i).
Definition 7.7. The de Bruijn cube DC(V)of a node set V ⊂[0,1) is an undirected graph with
node set V that contains an edge {i,j}for every pair of points i,j∈V with
•j=succ(i)(which connects the nodes to a ring), or
•fs(Ii)∩Ij,∅for some s ∈ {0,1}, or
•gs(Ii)∩Ij,∅for some s ≥1.
Here, we also allow self-loops.
This definition implies the following crucial fact:
Fact 7.8. For any set of points V and any two points x,y∈[0,1) with y =fs(x)or y =gs(x)for
some s it holds for the owner i of x and the owner j of y that {v,w}is an edge in DC(V).
58 H M M
As we will see below, the fact implies that the hypertree memory model can be
emulated with dilation at most 1 in DC(V) for any node set V. In the following, we
assume that the nodes in Valways have a sufficiently large but finite precision (i.e., bit
representation).
7.3. Joining and leaving the dynamic de Bruijn cube
In the following we assume that the network is balanced by the procedures from the
first part of this thesis or similar or that at least the nodes choose their IDs uniformly at
random. Then, the following theorem shows bounds for the diameter and degree of the
network.
Theorem 7.9. If V is a set of n nodes in [0,1) forming a balanced network, then DC(V)has both
maximum degree and diameter of O(log n), w.h.p. If the IDs of nodes are chosen uniformly at
random, then the diameter is O(log n)and the maximum degree is O(log2n), w.h.p.
Proof. The continuous de Bruijn cube is constructed in such a way that the neighborhood
of an interval Iforms logarithmic (in the number of nodes n) number of intervals of length
the same as the length of I, up to constant factors. If the network is balanced, then only
constant number of nodes can be responsible for each of these neighbors, provided that
only one node is responsible for I. If nodes are distributed uniformly at random, then
the length of Iis Olog n
n, with high probability. Thus, the total length of all neighbors
of Iis Olog2n
n, and Chernoffbounds guarantee that, with high probability, Olog2n
nodes are responsible for the whole neighborhood of I.
In both cases, the shortest interval has length at least Ω1
n2and the hypercubic
connections assure that the diameter of the network is at most 2 ·log n+O(1).
Whenever a new node joins or an old node leaves V, we aim at adjusting the con-
nections in DC(V) so that Definition 7.7 is satisfied. Interestingly, this is very easy (see
also [NW03b]):
Suppose that a new node j∈[0,1) wants to join V, and suppose that i=pred(j), i.e. i
is the closest predecessor of jin V. Given that Ii=[i,k), it follows that Inew
i=[i,j) and
Ij=[j,k). Thus, according to the definition of DC(V), the new edges of iand the edges of
jare a subset of the old edges of i, and all edges not associated with a point in Iiremain
unchanged. Hence, by routing the join request of jto iand communicating with iand
its adjacent nodes, jcan be fully integrated into the system.
Suppose that a node jwants to leave V, and suppose that i=pred(j). Then it suffices
that jforwards all of its edges to ibefore leaving, because iwill take over j’s region.
Hence, we get:
7.4 R 59
Theorem 7.10. Given any node set V of n nodes in DC(V)forming a balanced network, it takes
at most O(log n)work to process a join or leave operation, w.h.p. If the IDs of nodes are chosen
uniformly at random, then a join or leave operation takes at most O(log n)work, w.h.p.
7.4. Robustness
The hypertree space model also has the advantage that, in principle, it allows efficient
recovery from arbitrary memory faults (as long as the distributed system emulating it can
recover). We call a data structure implementation in the hypertree model compact if, in the
fault-free state, the memory cells of the data structure form a connected component in the
hypertree. A data structure is called recoverable if it has a recovery mechanism allowing it
to recover from arbitrary lost memory cells. Suppose that we use the following strategy
whenever a node iestablishes a new edge to a node j:Wake up the recovery mechanism in
all memory cells in i. Then the following result holds:
Theorem 7.11. Any compact data structure implementation in the hypertree model that can
recover from arbitrary lost memory cells can be efficiently emulated by a scalable overlay network
so that also in the emulation it can recover from arbitrary lost memory cells.
Proof. Follows from the fact that the data structure is compact and that any lost memory
cell will eventually be detected once the edge in the network emulating a connection to
a still working memory cell has been recovered.
We use the term “efficient” in the theorem because the recovery mechanism only
needs to be invoked in the case that there is an edge change in the network. As long
as the network is static, no checks have to be performed (under the assumption that
local memory is reliable). This repair procedure will be much clearer in the next chapter
when we show its implementation for a specific example, namely a binary search tree.
C 8
Search Tree
In this chapter we give a concrete example of a transparent data structure, a binary
search tree. The tree has the usual property of a binary search tree that all items stored in
a left subtree of a node are not larger than all items from the right subtree. The balance
is assured by the property that for each node the sizes of its left and right subtree differ
by at most constant factor. This property yields logarithmic height of the tree.
Such definition of the tree makes it possible that a single balancing operation incurs
linear cost, but we show how to perform the balancing in such a way that the amortized
cost per any tree operation is logarithmic. Such a property is still not satisfying in a
distributed environment, as it is also desirable that no operation burdens a single node
too much. We show that our balancing procedure needs time proportional to the height
of the tree, and thus load of each node is at most logarithmic.
In the end, we describe how the tree can recover from arbitrary failures. Thanks to the
logarithmic bound of the height of the tree and the underlying hypertree structure, it is
possible to repair the tree in logarithmic time and with cost by a logarithmic factor larger
than the number of failures. In the end, the tree looks as though the broken items were
simply removed. The balancing and recovery procedures are described independently,
that is, in the description of the balancing procedures we assume there are no failures
or recovery procedures running and in the description of the recovery procedure we
assume the tree is balanced.
We define a balanced binary search tree in the following way.
Condition 8.1. For each node r of the tree T where the subtrees rooted in the left and right child
of r are αand β, respectively:
1. r stores at most one entry
2. if r stores an entry and r is not the root, also the parent of r stores an entry
3. for all a ∈αand b ∈βa≤r≤b (i.e., the tree is sorted)
61
62 S T
4. |α|
3−1≤ |β| ≤ 3·|α|+1(i.e., the tree is balanced)
5. r stores the weight of the subtree rooted in it
In addition, the root of the search tree must be at position , the root of the hypertree
model. The first two conditions ensure that the tree is stored in a compact form and
that data loss in case of peer failures is kept low. The third condition is the usual search
tree condition. The fourth condition assures balance in the tree and gives a logarithmic
upper bound on the height of the tree.
In order to perform operations on the tree and maintain the balance in it we use the
following primitives, each of them working on a subtree Tvrooted in a node v:
move upwards vis moved to its parent and Tvis moved so, that it remains the subtree
rooted in v
move downwards vis moved to its child and Tvis moved so, that it remains the subtree
rooted in v
move sideways vis moved to its tree-sibling and Tvis moved so, that it remains the
subtree rooted
Lemma 8.2. Each of the primitive operations:
•takes constant time to be planned
•takes logarithmic time in the size of the moved subtree
•needs communication linear in the size of the moved subtree
Proof. When rtakes a decision to perform a primitive operation it gives proper orders to
its children and moves along one edge, which takes constant time and communication
since communication with the children and moving upwards or downwards can use
the tree edges and moving sideways can use the hypercubic edges of the Hypertree.
Similarly, when a node gets an order to move from its parent, it forwards the order to
its children (if it has any) and moves along one edge. This also takes constant time and
communication. Thus, the total communication cost is bounded by the number of nodes
in the subtree and the total time is bounded by the height of the subtree.
The basic tree operations are performed as follows. Each of them starts in the root of
the tree.
Search(a) If the item in the root is equal to a, the address of the root is returned.
Otherwise the Search operation is performed recursively in the left (if ais smaller
than the value in the root) or right (if ais larger) subtree. When the subtree does
not exist (e.g. when we are in a leaf), the item is reported to be not found.
63
Insert(a) First, Search(a) is performed. If ais not found, then ais inserted at the proper
place below the leaf reached by Search(a). In case ais found, the operation continues
recursively to any subtree, as in this case we are allowed to insert it recursively in
any of the subtrees.
Delete(a) First, Search(a) is performed. Let xbe the found node (if it exists). If it is a
leaf, it is simply removed from the tree. If not, let ybe the rightmost node in its left
subtree (or the leftmost node in its right subtree if the left subtree is empty). We
remove afrom xand move the item from yto x. The node ywas either a leaf or its
right (left) subtree was empty. If it was a leaf, we can simply remove it. Otherwise
the whole non-empty subtree is moved upwards.
Below we show that as long as the tree is balanced, each of the above operations takes
log(n) time, where nis the initial number of nodes.
Lemma 8.3. An isolated Search or Insert operation or any m consecutive Delete operations
preserve Condition 8.1 (except point 4) in the tree with logarithmic cost per operation.
Proof. A Search operation does not change the structure of the tree at all. Since the
tree fulfills the Condition 8.1 and thus is balanced, its height is O(log n) and the height
bounds the cost of the Search operation.
During an Insert operation a path from the root to the new position of ais traversed.
The cost of the operation is bounded by the new height of the tree, which is by at most 1
larger than the old height, which is O(log n). The new node is inserted as an extension
of a path from the root which assures the first two properties. The third property is
assured by the way the path is built - in the same way as a Search operation would visit
the nodes in the tree. The weights of subtrees have to be updated only for the nodes
on the path from the root to the new position of awhich can be done at the moment of
traversing the path.
Since the Delete operations can only decrease the height of the tree, we can assume
that during all moperations the height of the tree is O(log n). Thus, it suffices to prove
that a single Delete operation preserves Condition 8.1 and has cost bounded by the
height of the tree. A single Delete operation takes care that no hole appears — when x
is removed from the tree and it was not a leaf, its place is filled by another node ywhich
in turn can be removed from the tree without creating another hole. Obviously, no new
information is inserted anywhere except into a hole created by removing a node. Thus,
the first two properties are fulfilled. If xis a leaf, removing it does not influence the
order of other items in the tree in any way. If it is not, ycontains the largest item smaller
than x(or the smallest item larger than x), so moving it to the place previously occupied
by xdoes not change the order, either. Finally, if yhad a subtree, moving it upwards
as a whole, only compresses the tree but does not change the order as well. The cost of
64 S T
moving the tree is linear in its size but the latter is constant since the other subtree of y
does not exist and the tree is balanced. Updating the weights of subtrees can be done
during traversal of the path from the root to y(or to xif xwas a leaf).
After performing an operation on the tree it is easy to check whether the tree is still
balanced, as each node that participated in the operation can check whether its children
fulfill the balance condition (Condition 8.1, property 4). If they do not, a rebalancing
routine is needed. The simplest way would be to rebalance the whole subtree so that
it is perfectly balanced — such approach has logarithmic amortized cost per operation
(we omit the proof). However a single rebalancing routine can take linear time, which
is unacceptable in a distributed environment. In the next section we show a routine that
has logarithmic amortized cost per operation and that performs each rebalancing routine
in logarithmic time. The latter implies that each node participating in the network has to
send O(log n) messages during the rebalancing routine. Since no node is overburdened,
we call this approach fair.
8.1. Amortized analysis
Before we reveal the balancing algorithm, we prepare the ground for it by some defini-
tions. In order to analyze the cost of the algorithm, we use amortized analysis. Whenever
our algorithm has to perform a tree operation it is paid some number of virtual coins.
The algorithm has to pay a coin for each operation which takes constant time. If the cost
of the current tree operation is lower than the number of coins given to the algorithm,
the algorithm can save the spare coins in form of potential. If the cost is higher, the
algorithm has to use some of its savings. We have to assure that the algorithm is never
in debt, that is, its potential is never negative. The number of coins that the algorithm is
paid for each operation is its amortized cost per operation.
We define a potential function for each node rof the tree in the following way. The
potential of the tree is defined as the sum of potentials of all its nodes.
Definition 8.4. For a tree T with subtrees αand βthe potential stored in its root r is the
difference between the weights of αand β:
φr=||α|−|β||
Obviously, for each node rthe potential in ris never negative. Thus, the potential of
the whole tree Tis never negative either. We prove below that if we can attribute Θ(log n)
additional cost (in the form of virtual coins) to each Insert and Delete operation, we can
use such capital to ensure that the potential in all nodes of the tree is in accordance with
the definition. Later, we can use the stored coins to pay for a rebalancing routine.
8.1 A 65
Lemma 8.5. For an Insert or Delete operation O(log n)additional coins suffice to update the
potentials in the tree there where it is necessary.
Proof. As the tree is balanced before the operation the height of the tree is O(log n).
When an Insert or Delete operation is performed, a path Pfrom the root to a leaf is
visited. For each node rin Pat most one of the two subtrees below rgrows or shrinks
by one item. Thus, φrchanges by at most 1 and the total potential changes by at most
the length of Pwhich is O(log n).
The balancing algorithm uses rotations, known from other approaches (see for exam-
ple “Introduction to Algorithms” [CLR89]) for balancing the tree, as a basic routine. We
do not define rotations as a written description would not be as clear as pictures. The
reader is encouraged to check that the tree depicted in Figure 8.1 can be easily trans-
formed into the tree depicted in Figure 8.2 by a constant number of primitive operations.
The same holds for transforming the tree from Figure 8.3 into the tree in Figure 8.4 and
the tree in Figure 8.5 into the tree in Figure 8.6.
In Section 8.3 we show a technical local balancing procedure with the proof of the
following properties.
Lemma 8.6. If for a node r the subtrees rooted in the children or r are balanced (all nodes
fulfill point 4of Condition 8.1, called the balance condition further on), then the local balancing
procedure rebuilds the tree so that:
•r fulfills the balance condition
•on each path from r to a leaf at most one node does not fulfill the balance condition
•only a constant number of primitive operations are used
•the potential stored in the tree decreases and the total communication cost is at most by a
constant factor larger than the decrease in potential
For a node rthe local balancing procedure checks 3 levels of descendants of rand,
depending on the weights of the trees rooted in them, it performs some rotations. If
some other balancing procedures are currently being executed lower in the tree they do
not influence the procedure in r. It is only important that no node from the highest 3
levels is participating in any rotations. The trees below the third level are only moved
as a whole and individual nodes of such trees are moved after they have finished the
rotations started previously.
66 S T
The balancing of the tree is done in a bottom-up-down fashion, that is it is started
in the leaves, it follows in the direction of the root and comes back to the leaves . If
imbalance appears in the tree rooted in a node r,ris informed about it by one of its
children and it happens after the child has finished the balancing. If rdoes not fulfill
the balance condition at that point in time it performs the necessary rotations and tells
those of its descendants that lost their balance property that they should run the balance
procedure. Lemma 8.6 guarantees that each of these descendants can run the procedure
without delay, since all subtrees below them are balanced. After constant number of
time-steps all children and grandchildren of rfulfill the balance property and rcan
inform its parent about the change in weight and tell it to start the balancing.
The following theorem states the main result of this chapter, namely it shows that the
proposed tree maintenance algorithm works with logarithmic amortized cost per tree
operation and that each tree operation needs at most logarithmic time to finish.
Theorem 8.7. The total communication cost of the balancing procedure is O(∆φ+m·log n)
and the time of the balancing procedure is O(log n)where ∆φis the total decrease of potential,
n is the initial number of nodes in the tree, and m is the number of Insert or Delete operations
executed right before the balancing.
Proof. The communication cost is twofold: the cost of executing the primitive operations
and the cost of child-parent communication to inform about possible weight-changes
and to give orders. According to Lemma 8.6 the total cost of all primitive operations
is O(∆φ). If a node communicates to its parent it informs about the change of weight
and the total number of such messages is meach travelling a distance of log n. If a node
communicates to its descendants it does so only after it has performed a local balancing
procedure (which decreased potential) and it sends a constant number of messages.
Thus, the total number of such messages is O(∆φ).
The balancing procedure starts in the leaves of the tree and each node needs only
constant time to initiate the necessary primitive operations and let its parent start the
balancing. After O(log n) steps, each node has initiated the balancing. It takes constant
time in each node and ends after reaching the leaves, i. e. after O(log n) time steps.
8.2. Recovery
In this section we describe how our search tree can recover from arbitrary node failures.
We assume the simplest possible model of failures defined as follows. All failures
happen at a single time step and there are no failures during the recovery. Also, no
failures happen during load balancing, that is, the tree is balanced at the moment
when failures happen. The system as a whole is synchronized, i.e. we may assume that
8.2 R 67
communication takes place in discrete time steps and that in a single time step a message
sent from neighbor to neighbor is guaranteed to reach it. In the following we use the
term nodes to denote peers building the de Bruijn Cube and the term items to denote
values stored in a search tree.
It should be clear that failures happen in nodes building the system and as a result
all items stored in an interval for which a broken node was responsible are lost. When
a node takes over a previously broken interval it does not know if there were items
belonging to any tree in this interval. Still, the algorithm has to repair the structure of
all trees which stored their items in this particular interval.
The recovery proceeds in three phases. In the first phase new special items are created
in the places of lost items. Instead of a key, a node can store a hole which means that
it should get an item to store in the later phases. We create holes only on paths from
survivors to the root, where survivors are the items that were not broken in the failure.
If a whole subtree (in particular a single leaf) disappears we do not reconstruct it. Thus,
the first phase is responsible for connecting all parts of the original tree (from before the
failure) and forming an intermediate tree, which is a minimum subgraph of the original
tree in which each surviving item has all ancestors. After building the intermediate tree,
we start the second phase in which we plan how to fill in the holes. This is done in a
top-down fashion. If the root of a subtree contains a hole, it sends a request to fill it to
one of its subtrees and then runs the fill-in procedure in each child, provided that the
subtree rooted in the child will still contain some items after serving all the requests
from its ancestors. In the third phase, the items which were ordered to fill-in the holes,
do so by travelling upwards. In the end the rebalancing procedure is run in the root of
the tree, but this is no longer a part of the recovery process.
As mentioned in Chapter 7, with a proper wake-up mechanism in the de Bruijn Cube
emulating a hypertree, we can assume that nodes which take over the role of the nodes
that failed, run a wake up mechanism in the newly acquired intervals. When a node
discovers that its successor on the ring is no longer present in the system it takes over
its job. It cannot access the items previously stored in the newly acquired interval as
they are irrevocably lost but it can contact the intervals responsible for the parents and
children of any items previously stored there. If there are items in the neighboring
(along the de Bruijn edges) intervals which had tree neighbors in the restored interval,
we start the wake up procedure in these items. If not, in particular if the parent or child
interval has also just been restored, we do nothing. Such procedure starts the wake up
mechanism in each item that survived the failure but the parent or at least one child of
which was lost.
Each woken item has information about the original weight of its subtree and knows
the address of the root of the tree. Basing on the information about the weight of its
subtree it computes an estimate of where a leaf could have been below. If the weight is
68 S T
w, a good estimate is any number between l=1
2·log wand h=3·log w(the former is an
exact value guaranteed by the balance of the tree, the latter is a bound for log4/3w, as in
a balanced tree all leaves have to be on such heights. A woken item sends two messages
upwards, one of which can reach the root of the tree very fast and the other carries more
information. The first message is called fast and the second slow. One purpose of the
fast messages is to wake up the whole intermediate tree, so that each item knows about
the failure. Another purpose is to inform every item if it should expect slow messages
from its children. The fast messages also carry information about the time when the
failure happened. The slow messages carry information about the updated weight of
the tree (the number of survivors in each subtree), but each item has to be sure it has
gathered this information from its whole subtree before sending the information to its
parent. Below we describe the first phase in details.
Each item that woke up after the failure immediately sends a fast message to its parent
and if the parent does not exist, a hole is created in its place. Each item that receives a
fast message from one of its children and has not sent a fast message to its parent, sends
a fast message immediately too. The following lemmas state the properties of the fast
messages that we will need later. The expression possible location of a leaf in the subtree of
an item means that for the original weight of the subtree rooted in the considered item,
there exists such balanced distribution of items in the subtree that there is a leaf in the
mentioned location.
Lemma 8.8. All fast messages are sent within 3·log n time steps from the moment of the failure.
After this time, each item of the intermediate tree knows a possible location of a leaf in its subtree.
Proof. The height of the original tree is smaller than 3 ·log n, as the tree is balanced.
Each survivor can compute information about a leaf in its subtree from the information
about the weight of the subtree. For a broken item we consider two cases. Either its
whole subtree was broken and it does not belong to the intermediate tree, or at least one
of its descendants has survived and will send an information about a leaf in its subtree.
Such information is also valid for the considered item.
If an item has received a fast message from one of its children, it needs one time step
to send it to its parent, so the height of the tree is an upper bound for the time in which
fast messages are sent.
Lemma 8.9. If an item knows (either because it survived or because it received a fast message
from one of its children) that there is a leaf at height h in its subtree then fast messages from its
children come within 6·h time steps or never.
Proof. If there is a leaf in the subtree of the considered item at height h, it means that
its weight before the failure was at most 4h, as in every balanced tree of weight wthe
shallowest leaves are at height 1
2·log w. On the other hand, in a balanced tree of weight
8.2 R 69
w=4h, the deepest leaves are at height smaller than 3 ·log w, so the height of the
considered subtree is at most 6 ·h. If there was at least one survivor and at least one
broken item in the subtree, a message from it will come in time bounded by its height,
that is 6 ·h.
The purpose of the slow messages is to update the information about the weights of
subtrees in all items of the intermediate subtree. Each item vacts as follows, concerning
the slow messages. When it has an information about the failure together with an
information about a leaf at height hin its subtree, it waits 6 ·htime steps for possible fast
messages from its children. If a fast message comes from a child, then vwaits for a slow
message from this child. The slow message will carry the updated weight of the tree of
the considered child. If no fast message comes from a child within the time limit, there
are two cases. If a child is alive, we may assume that the whole subtree of the child is
alive. If not, we may assume that the whole subtree of the child is dead. In the first case,
we can read the information about the weight from the information stored in the child.
In the second we know the weight is 0. After calculating the weights of the subtrees of
its children, vcalculates the weight of its subtree (calculating its weight as 0, in case it
is a hole) and passes the information in a slow message to its parent. The correctness
of the computed weight is obvious due to the simple structure of a tree. It remains to
prove that the weight of the whole intermediate tree is calculated in logarithmic time. It
is proved in the following lemma.
Lemma 8.10. All slow messages are sent within 24 ·log n time steps from the failure.
Proof. From Lemma 8.8 we know that each item receives a fast message within 3 ·log n
time steps and that it waits for a fast message from its second child for at most 6 ·htime
steps, where his the height of the leaf it knows about. The height of the whole tree is
bounded by 3 ·log n, so all items finish waiting for fast messages within (3 +18) ·log n
time steps.
Let t0be the time step in which all items have received the fast messages or made sure
that fast messages will never come. Let ti, for i>0, be consecutive time steps. We prove
by induction that the heights of all items which have not sent their slow message by time
step tiare bounded by (3·log n)−i, that is, that in time step t3·log nthe procedure finishes.
In time step t0all items at level 3 ·log ncan send their slow messages, as they do not
have children to wait for. This proves the induction base. For the inductive step assume
that all alive items from levels higher than (3 ·log n)−(i−1) have sent their messages at
latest in time step ti−1. By time step tiitems at level (3 ·log n)−ireceive messages from
their alive descendants and know that they will never receive messages from dead ones,
so they can send their slow messages in this time step.
70 S T
Since the height of the intermediate tree is at most 3 ·log n, in time step 3 ·log n, all
items will have sent their slow messages. We know that t0≤21 ·log n, so the total time
needed to send all slow messages is bounded by 24 ·log n.
In the second phase of the recovery process, requests are generated to refill the failed
and newly built items, that is holes. Holes generate requests to items which can refill
them in a top-down fashion. The procedure starts from the root and an item can start it
only when its parent allows it to. If an item is allowed to start the procedure and it is
not a hole, it simply forwards the permission to its children but only to which sent a fast
message in the first phase. If it is a hole, before it allows its children to act, it performs
the following actions. If both of its subtrees are empty (the information comes from the
weight stored in the root of each subtree) it finishes the second phase, as it does not need
to be filled because it does not belong to the intermediate tree. In other case, it chooses
a non-empty subtree from which it wants to take an item. If it chose the right subtree, it
wants the leftmost item in it and if it is the left subtree, it wants the rightmost item there.
It sends a request downwards and the message carrying the request has the following
properties. One is that it will not be overtaken by a permission to start the second phase.
Another is that it can already update the weight information in all items on its way.
And the last one is that a requesting message from a parent is never overtaken by any
message from a child, so when an item gets a permission to start the second phase, its
weight has already been updated by all requesting messages sent by all its ancestors.
Since each item needs at most two time steps, one to possibly send a requesting
message to one of its subtrees and another to send or forward a permission message to
both of its children, the whole second phase is finished in at most 6 ·log ntime steps.
In the third and last phase, the items which received request messages are moved
upwards to fill the holes which sent the requesting messages. The time of the phase is
bounded by the height of the original tree, that is 3 ·log n. After the phase has finished,
a balancing operation can be started in the leaves, which then follows to the root.
In the first and second phase there are only messages on paths from broken items to
the root of the tree, so each message can be attributed to a broken item below. Moreover,
each item sends only a constant number of messages, so the total communication cost is
bounded by O(mlog n), where mis the number of failures. Similarly in sending requests
in the second phase and in the third phase broken items send messages which can travel
along paths in the tree, so the total cost is bounded by O(mlog n), too. Thus, we come to
the following theorem about the recovery algorithm.
Theorem 8.11. The recovery algorithm allows the search tree of size n to recover from any set of
m failures in O(log n)time using at most O(mlog n)work.
8.3 L 71
8.3. Local balancing procedure
Below we present a procedure used to take local decisions if imbalance appears in the
tree. We assume that the procedure is run in an item rand that the descendants of r
up to three levels fulfill the balance condition. After performing the balancing in r, we
will guarantee that on each path from rto a leaf at most one item loses balance in its
subtree. Intuitionally, it can be thought of as waves of balancing going from the root
to the leaves. The new wave that we introduce, while balancing r, will never catch or
overtake the waves that existed and moved in the direction of leaves before.
Proof of Lemma 8.6. Let nbe the number of items in the considered tree. For simplicity
of notation for any (sub)tree αits name denotes also its weight, i.e. the number of items
it contains. The simplest starting situation is depicted in Figure 8.1. We will give more
detailed views as we explore more details about the tree. Imbalance in the network
means simply that 0 ≤α≤n
4. In order to cope with it, we consider the following cases.
α
b
a
γ δ
Figure 8.1.: Initial state of the tree, the simplest case
α γ δ
a
b
Figure 8.2.: The tree after rotation, the simplest case
72 S T
1. If δ≥5n
16 then we perform a single rotation so that the tree looks as in Figure 8.2.
After the rotation the balance in the root is ensured by n
4<5n
16 ≤δ≤3
4·(δ+γ)≤3n
4.
Before the rotation φ(a)=δ+γ−αand φ(b)=|δ−γ|. After the rotation the
potentials should be φ0(a)=|α−γ|and φ0(b)=|α+γ−δ|. Below we show that
∆φ:=φ(a)+φ(b)−φ0(a)−φ0(b)= Ω(n), i.e. the initial potential is sufficient to store
the resulting potential and pay for the rotation. We consider three cases:
•δ≤γ:∆φ=(γ+δ−α)+(γ−δ)−(γ−α)−(γ+α−δ)=δ−α≥n
16
•γ≤δ≤α+γ:∆φ=(γ+δ−α)+(δ−γ)−(α+γ−δ)−|α−γ|=3δ−2α−γ−|α−γ|=
δ−α+2δ−2 max{α, γ} ≥ δ−α≥n
16
•α+γ≤δ:∆φ=(δ+γ−α)+(δ−γ)−(δ−α−γ)−|α−γ|=δ+γ−|α−γ| ≥ δ−α≥n
16
Obviously, any imbalance can only be in a.
2. If 3n
16 ≤δ≤5n
16 and 3n
16 ≤α≤4n
16 we assume that the tree looks as depicted in
Figure 8.3 and we perform a rotation so that it looks as in Figure 8.4. From the
assumptions we have 6n
16 ≤α+δ≤9n
16 , so 7n
16 ≤+λ≤10n
16 and 7n
64 ≤, λ ≤30n
64 . The
balance in the root is assured by n
4<19n
64 ≤α+λ≤46n
64 <3n
4. We need to show that
∆φ:=φ(a)+φ(b)+φ(c)−φ0(a)−φ0(b)−φ0(c)=(δ+λ+−α)+|λ−|+(λ+−
δ)−|λ−α|−|α+λ−δ−|−|−δ|= Ω(n) We consider two cases.
•If α+λ≥δ+then we have ∆φ=λ+−α+2·max{λ, }−2·max{λ, α}+2·min{, δ}.
Now if λ≥αthen ∆φ≥λ+−α≥3n
16 . Otherwise ∆φ≥2λ+2−3α≥2n
16 .
•If α+λ≤δ+then we have ∆φ=λ+−α+2·max{λ, }−2·max{, δ}+2·min{α, λ}.
Now if ≥δthen ∆φ≥λ+−α≥3n
16 . Otherwise ∆φ≥2λ+2−α−2δ+2·
min{α, λ} ≥ 14n
16 .
Obviously imbalance can only be in aor c, that is at only one level in each branch.
ε δλα
b
c
a
Figure 8.3.: Initial state of the tree, intermediate case
8.3 L 73
3. If 3n
16 ≤δ≤5n
16 and 2n
16 ≤α≤3n
16 we assume that the tree looks as depicted in
Figure 8.3 and we perform a rotation so that it looks as in Figure 8.4. From the
assumptions we have 5n
16 ≤α+δ≤8n
16 , so 8n
16 ≤+λ≤11n
16 and 8n
64 ≤, λ ≤33n
64 . The
balance in the root is assured by n
4≤16n
64 ≤α+λ≤45n
64 <3n
4. We need to show that
∆φ:=φ(a)+φ(b)+φ(c)−φ0(a)−φ0(b)−φ0(c)=(δ+λ+−α)+|λ−|+(λ+−
δ)−|λ−α|−|α+λ−δ−|−|−δ|= Ω(n) We consider two cases.
•If α+λ≥δ+then we have ∆φ=λ+−α+2·max{λ, }−2·max{λ, α}+2·min{, δ}.
Now if λ≥αthen ∆φ≥λ+−α≥5n
16 . Otherwise ∆φ≥2λ+2−3α≥7n
16 .
•If α+λ≤δ+then we have ∆φ=λ+−α+2·max{λ, }−2·max{, δ}+2·min{λ, α}.
Now if ≥δthen ∆φ≥λ+−α≥5n
16 . Otherwise ∆φ≥2λ+2−α−2δ+2·
min{α, λ} ≥ 7n
16 .
The same as in the previous case, imbalance can only be in aor c.
ε δλα
b
a c
Figure 8.4.: The tree after rotation, intermediate case
b
a
δα λ σ τ
c
d
Figure 8.5.: Initial state of the tree, the most advanced case
4. If 3n
16 ≤δ≤5n
16 and n
16 ≤α≤2n
16 and 12n
64 ≤λ≤36n
64 we assume that the tree looks as
depicted in Figure 8.3 and we perform a rotation so that it looks as in Figure 8.4.
From the first two assumptions we have 4n
16 ≤α+δ≤7n
16 , so 9n
16 ≤+λ≤12n
16 and
9n
64 ≤, λ ≤36n
64 but from the last assumption we have also 12n
64 ≤λ. The balance
74 S T
δα λ σ τ
a
b
c
d
Figure 8.6.: The tree after rotation, the most advanced case
b
a
e
σα µ ν
Figure 8.7.: Zoom of the most advanced case
in the root is assured by n
4≤16n
64 ≤α+λ≤44n
64 <3n
4. We need to show that
∆φ:=φ(a)+φ(b)+φ(c)−φ0(a)−φ0(b)−φ0(c)=(δ+λ+−α)+|λ−|+(λ+−
δ)−|λ−α|−|α+λ−δ−|−|−δ|= Ω(n) We consider two cases.
•If α+λ≥δ+then we have ∆φ=λ+−α+2·max{λ, }−2·max{λ, α}+2·min{, δ}.
Now λ≥αso ∆φ≥λ+−α≥7n
16 .
•If α+λ≤δ+then we have ∆φ=λ+−α+2·max{λ, }−2·max{, δ}+2·min{λ, α}.
Now if ≥δthen ∆φ≥λ+−α≥7n
16 . Otherwise ∆φ≥2λ+2−α−2δ+2α≥9n
16 .
The same as in the previous two cases, imbalance can only be in aor c.
5. If 3n
16 ≤δ≤5n
16 and n
16 ≤α≤2n
16 and 9n
64 ≤λ≤12n
64 we assume that the tree
looks as depicted in Figure 8.5 and we perform a rotation so that it looks as
in Figure 8.6. From the first two assumptions we have 4n
16 ≤α+δ≤7n
16 , so
9n
16 ≤σ+τ+λ≤12n
16 and 9n
64 ≤σ+τ, λ ≤36n
64 but from the last assumption we
have also λ≤12n
64 , so 24n
64 ≤σ+τ≤36n
64 and 6n
64 ≤σ, τ ≤27n
64 . The balance in
the root is assured by n
4<19n
64 ≤α+λ+σ≤47n
64 <3n
4. We need to show that
∆φ:=φ(a)+φ(b)+φ(c)+φ(d)−φ0(a)−φ0(b)−φ0(c)−φ0(d)=(δ+λ+σ+τ−α)+(σ+
τ−λ)+|σ−τ|+(λ+σ+τ−δ)−(λ−α)−|α+λ−σ|−|α+λ+σ−δ−τ|−|δ−τ|= Ω(n). We notice
that ∆φ=3σ+3τ+|σ−τ|−|α+λ−σ|−|α+λ+σ−δ−τ|−|δ−τ| ≥ 72n
64 +0−14n
64 −29n
64 −15n
64 =14n
64 .
8.3 L 75
σα µ ν
b
a
e
Figure 8.8.: Zoom of the most advanced case after rotation
Possible imbalance in dis not a problem, since it is the only imbalance in the right
branch. There can be imbalance in cbut ais balanced, since α≥n
16 ≥1
3·3n
16 ≥λand
α≤2n
16 <9n
64 ≤λ.
6. If 3n
16 ≤δ≤5n
16 and 0 ≤α≤n
16 and 16n
64 ≤λ≤39n
64 we assume that the tree looks as
depicted in Figure 8.3 and we perform a rotation so that it looks as in Figure 8.4.
From the first two assumptions we have 3n
16 ≤α+δ≤6n
16 , so 10n
16 ≤+λ≤13n
16
and 10n
64 ≤, λ ≤39n
64 but from the last assumption we have also 16n
64 ≤λ. The
balance in the root is assured by n
4≤16n
64 ≤α+λ≤43n
64 <3n
4. We need to show that
∆φ:=φ(a)+φ(b)+φ(c)−φ0(a)−φ0(b)−φ0(c)=(δ+λ+−α)+|λ−|+(λ+−
δ)−(λ−α)−|α+λ−δ−|−|−δ|= Ω(n) We consider two cases.
•If α+λ≥δ+then we have ∆φ=λ+−α+2·max{λ, }−2·max{λ, α}+2·min{, δ}.
Now λ≥αso ∆φ≥λ+−α≥9n
16 .
•If α+λ≤δ+then we have ∆φ=λ+−α+2·max{λ, }−2·max{, δ}+2·min{λ, α}.
Now if ≥δthen ∆φ≥λ+−α≥9n
16 . Otherwise ∆φ≥2λ+2−α−2δ+2α≥10n
16 .
The same as in cases 2,3 and 4, imbalance can only be in aor c.
7. If 3n
16 ≤δ≤5n
16 and 0 ≤α≤n
16 and 10n
64 ≤λ≤16n
64 we assume that the tree
looks as depicted in Figure 8.5 and we perform a rotation so that it looks as
in Figure 8.6. From the first two assumptions we have 3n
16 ≤α+δ≤6n
16 , so
10n
16 ≤σ+τ+λ≤13n
16 and 10n
64 ≤σ+τ, λ ≤39n
64 but from the last assumption we
have also λ≤16n
64 , so 24n
64 ≤σ+τ≤39n
64 and 6n
64 ≤σ, τ < 117n
256 . The balance in the
root is assured by n
4=16n
64 ≤α+λ+σand 1
4<18n
64 ≤τ+δ. We need to show that
∆φ:=φ(a)+φ(b)+φ(c)+φ(d)−φ0(a)−φ0(b)−φ0(c)−φ0(d)=(δ+λ+σ+τ−α)+(σ+τ−λ)+
|σ−τ|+(λ+σ+τ−δ)−(λ−α)−|α+λ−σ|−|α+λ+σ−δ−τ|−|δ−τ|= Ω(n). We notice that
∆φ=3σ+3τ+|σ−τ|−|α+λ−σ|−|α+λ+σ−δ−τ|−|δ−τ| ≥ 288n
256 +0−77n
256 −133n
256 −69n
256 =9n
256
76 S T
In this case a problem can appear if bis not balanced. Since σ≤117n
256 <120n
256 =3·10n
64 ≤
3·λ≤3·(α+λ), this means that σ≤1
3·(α+λ) but 6n
64 ≤σ≤1
3·(α+λ)≤1
3·18n
64 =6n
64 ,
which means that α=4n
64 ,λ=16n
64 σ=6n
64 . In such case we take a closer look
at λ. The interesting part of the tree is depicted in Figure 8.7. We perform a
rotation so that the tree looks as in Figure 8.8. The tree rooted in eis balanced, since
α+µ, ν+σ≤18n
64 ≤3
4·24
64 <3
4·(α+µ+ν+σ). Thus only a,bor dcan be in imbalance but
they are all in separate branches. The decrease in potential can cover the cost, since
∆φ=(µ+ν−α)+(α+µ+ν−σ)−(µ−α)−|ν−σ|−|α+µ−ν−σ| ≥ 12n
64 +14n
64 −8n
64 −6n
64 −10n
64 =2n
64 .
The minimum decrease of potential in a rotation is 9n
256. In each case a constant number
of primitive rotations is performed and each of the primitive operations moves at most
nnodes to constant distance. Thus, the total movement cost is at most linear.
C 9
Summary and Outlook
In this thesis we concentrated on algorithmic aspects of Peer-to-Peer networks. This is
the area in which hundreds if not thousands of scientists all over the world are interested
and we were hoping to contribute something too.
9.1. Load balancing
Load balancing has been a problem addressed already in the first algorithmic designs
of Peer-to-Peer networks and much work has been done concerning it. We considered
only the area of smoothing the intervals for which the nodes are responsible and still
one can see that there is no one and only solution of this problem, even though more
than a dozen publications on this topic exist.
The algorithm AS , we presented, is meant to perform the load balancing proce-
dures as quickly as possible and it achieves this but pays with significant communication
cost. The experiments show that the efficiency of the algorithm can be proved not only
on paper but also in applications. Even though, the constants emerging from the anal-
ysis are in the range of thousands, the experiments show that the smoothness can be
reduced from more than 2000 if the nodes are distributed uniformly to under 20, if we
let our algorithm take care of the network.
The algorithm AD is meant to repair the main disadvantage of the first one. Its
low communication cost can be provably bounded but the balance of the system is
always a few steps behind its dynamics.
We do not provide experimental analysis for the second algorithm, as deeper consid-
erations lead to a conclusion that it would be a project in itself. A model of the dynamics
of the system is needed which captures the changing rates at which nodes join and leave.
Intuitionally, one can think that for example in the morning there are many joins and in
the evening many leaves. Such view is complicated by timezones, density of computer
77
78 S O
equipped population on the whole planet and different activities of people on weekends
and working days. We do not have the knowledge or tools to devise and check the
correctness of a model which would grasp all these intuitions. Except this, simulations
should be executed at least on a cluster of computers, since a single computer was hardly
capable of performing simulations for the simple model.
A significant question that remains open is how to balance the load in a heterogeneous
system, where heterogeneity has at least a few independent aspects like storage capacity,
communication bandwidth and computation power. To the best of our knowledge, the
only means of treating this problem that has been suggested is introduction of virtual
processors. Other approaches should be investigated in the nearest future.
An aspect that we did not investigate, is the balancing of the items stored in the system
when the scarce resource is storage capacity and dealing with the popularity of items,
when communication bandwith should be spared.
9.2. Data structures
The second part of this thesis can also be thought of as load balancing, thanks to its
concentrating on balanced trees. We presented a scheme that lets implement a balanced
binary search trees on top of a Peer-to-Peer network in a robust way. Hopefully, this
is only the first step in the direction of implementing other useful data structures in
such an environment. It is probably possible to use the Hypertree we have designed
as a foundation of a simple heap structure. Other data structures will need more
sophisticated designs.
We hope that our work on transparent data structures will advance the Peer-to-Peer
networks as a tool for large scale distributed computation. This is also the direction of
research we would like to persue in the nearest future.
We would like to concentrate on creating a model of computation based on virtual
processes which can be migrated in a Peer-to-Peer network and thus achieve a huge
distributed computation platform. In such a system each process works for itself com-
municating to other processes through messages or common data structures. For the
second possibility, transparent data structures can be adapted.
Such a system could exploit the advantages of Peer-to-Peer networks and at the same
time can be designed to counteract their deficiencies. For example the heterogeneity in
different aspects like computing power, storage space and network bandwidth can be
used to assign more processes, more backup burden or more responsibilities for network
organization to peers strong in respective areas.
On the other hand robustness should be taken into consideration, as in a Peer-to-
Peer network we have to assume that some nodes will not only leave but also fail
9.2 D 79
without notice. Even using counteracting techniques like replication, we have to take
into account that some data will be lost from time to time. Procedures which cope with
such inconveniences should be very carefully designed and transparent data structures
are the first step and a good example how it can be done.
Bibliography
[AB96] Yonatan Aumann and Michael A. Bender. Fault Tolerant Data Structures.
In Proc. of the 37th IEEE Symp. on Foundations of Computer Science (FOCS),
pages 580–589, 1996.
[AHKV03] Micah Adler, Eran Halperin, Richard M. Karp, and Vijay Vazirani. A
Stochastic Process on the Hypercube with Applications to Peer-to-Peer
Networks. In Proc. of the 35th ACM Symp. on Theory of Computing (STOC),
pages 575–584, 2003.
[AL90] Arne Andersson and Tony W. Lai. Fast Updating of Well-Balanced Trees.
In Proc. of the 2nd Scandinavian Workshop on Algorithm Theory (SWAT), pages
111–121, 1990.
[AS03] James Aspnes and Gauri Shah. Skip Graphs. In Proc. of the 14th ACM-SIAM
Symp. on Discrete Algorithms (SODA), pages 384–393, 2003.
[AS04a] Baruch Awerbuch and Christian Scheideler. Group Spreading: A Protocol
for Provably Secure Distributed Name Service. In Proc. of the 31st Int.
Colloquium on Automata, Languages, and Programming (ICALP), pages 183–
195, 2004.
[AS04b] Baruch Awerbuch and Christian Scheideler. The Hyperring: A Low-
Congestion Deterministic Data Structure for Distributed Environments.
In Proc. of the 15th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages
318–327, 2004.
[AW05] James Aspnes and Udi Wieder. The Expansion and Mixing Time of Skip
Graphs with Applications. In Proc. of the 17th ACM Symp. on Parallel Algo-
rithms and Architectures (SPAA), pages 126–134, 2005.
81
82 B
[BAG92] Amir M. Ben-Amram and Zvi Galil. On Pointers Versus Addresses. Journal
of the ACM, 39(3):617–648, 1992.
[BCM03] John W. Byers, Jeffrey Considine, and Michael Mitzenmacher. Simple
Load Balancing for Distributed Hash Tables. In Proc. of the 2nd International
Workshop on Peer-to-Peer Systems, pages 80–87, 2003.
[BDFC00] Michael A. Bender, Erik D. Demaine, and Martin Farach-Colton. Cache-
Oblivious B-Trees. In Proc. of the 41st IEEE Symp. on Foundations of Computer
Science (FOCS), pages 399–409, 2000.
[BFJ02] Gerth Stlting Brodal, Rolf Fagerberg, and Riko Jacob. Cache-Oblivious
Search Trees via Trees of Small Height. In Proc. of the 13th ACM-SIAM
Symp. on Discrete Algorithms (SODA), pages 39–48, 2002.
[BK05] Marcin Bienkowski and Miroslaw Korzeniowski. Bounding the Commu-
nication Cost in Dynamic Load Balancing of Distributed Hash Tables. In
Proc. of the 9th International Conference on Principles of Distributed Systems,
2005.
[BKM05] Marcin Bienkowski, Miroslaw Korzeniowski, and Friedhelm
Meyer auf der Heide. Dynamic Load Balancing in Distributed Hash
Tables. In Proc. of the 4th International Workshop on Peer-to-Peer Systems,
2005.
[CD93] Stephen A. Cook and Patrick W. Dymond. Parallel Pointer Machines.
Computational Complexity, 3:19–30, 1993.
[CKP+93] Dadiv E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahaj,
Klaus E. Schauser, Eunice E. Santos, Ramesh Subramonian, and Thorsten
van Eicken. LogP: Towards a Realistic Model for Parallel Computation.
In Proc. of the 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel
Programming, pages 1–12, 1993.
[CLR89] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduc-
tion to Algorithms. The MIT Press, Cambridge, Mass., 1989.
[DR98] Devdatt P. Dubhashi and Desh Ranjan. Balls and Bins: A study in Negative
Dependence. Random Structures and Algorithms, 13(2):99–124, 1998.
[DSZ94] Paul F. Dietz, Joel I. Seiferas, and Ju Zhang. A Tight Lower Bound for
On-Line Monotonic List Labeling. In Proc. of the 6th Scandinavian Workshop
on Algorithm Theory (SWAT), pages 131–142, 1994.
B 83
[DZ90] Paul F. Dietz and Ju Zhang. Lower Bounds for Monotonic List Labeling. In
Proc. of the 2nd Scandinavian Workshop on Algorithm Theory (SWAT), pages
173–180, 1990.
[GK89] Michael T. Goodrich and S. Rao Kosaraju. Sorting on a Parallel Pointer
Machine with Applications to Set Expression Evaluation. In Proc. of the
30th IEEE Symp. on Foundations of Computer Science (FOCS), pages 190–195,
1989.
[GLS+04] B. Godfrey, K. Lakshminarayanan, S. Surana, Richard M. Karp, and Ion
Stoica. Load Balancing in Dynamic Structured P2P Systems. In Proc. of the
23rd Conference of the IEEE Communications Society (INFOCOM), 2004.
[GV94] Alexandros V. Gerbessiotis and Leslie G. Valiant. Direct Bulk-Synchronous
Parallel Algorithms. Journal of Parallel and Distributed Computing, 22(2):251–
267, 1994.
[HJS+03] Nicholas J. A. Harvey, Michael B. Jones, Stefan Saroiu, Marvin Theimer, and
Alec Wolman. Skipnet: A Scalable Overlay Network with Practical Locality
Properties. In Proc. of the 4th USENIX Symposium on Internet Technologies
and Systems (USITS ’03), 2003.
[HKRZ02] Kirsten Hildrum, John D. Kubiatowicz, Satish Rao, and Ben Y. Zhao. Dis-
tributed Object Location in a Dynamic Network. In Proc. of the 14th ACM
Symp. on Parallel Algorithms and Architectures (SPAA), pages 41–52, 2002.
[HM04] Nicholas J. A. Harvey and J. Ian Munro. Deterministic SkipNet. Information
Processing Letters, 90(4):205–208, 2004.
[IKR81] Alon Itai, Alan G. Konheim, and Michael Rodeh. A Sparse Table Im-
plementation of Sorted Sets. Technical Report RC 9146, IBM T.J. Watson
Research Center, Yorktown Heights, New York, 1981.
[KLL+97] David R. Karger, Eric Lehman, Frank Thomson Leighton, Matthew S.
Levine, Daniel Lewin, and Rina Panigrahy. Consistent Hashing and Ran-
dom Trees: Distributed Caching Protocols for Relieving Hot Spots on the
World Wide Web. In Proc. of the 29th ACM Symp. on Theory of Computing
(STOC), pages 654–663, 1997.
[KM05] Krishnaram Kenthapadi and Gurmeet Singh Manku. Decentralized Algo-
rithms Using Both Local and Random Probes for P2P Load Balancing. In
Proc. of the 17th ACM Symp. on Parallel Algorithms and Architectures (SPAA),
pages 135–144, 2005.
84 B
[KR04a] David R. Karger and Matthias Ruhl. Simple Efficient Load Balancing Al-
gorithms for Peer-to-Peer Systems. In Proc. of the 3rd International Workshop
on Peer-to-Peer Systems, pages 131–140, 2004.
[KR04b] David R. Karger and Matthias Ruhl. Simple Efficient Load Balancing
Algorithms for Peer-to-Peer Systems. In Proc. of the 16th ACM Symp. on
Parallelism in Algorithms and Architectures (SPAA), pages 36–43, 2004.
[KS05] Miroslaw Korzeniowski and Christian Scheideler. Transparent Data Struc-
tures, Or How to Make Search Trees Robust in a Distributed Environment.
In Proc. of the 8th International Symposium on Parallel Architectures, Algo-
rithms, and Networks (I-SPAN), 2005.
[Lei92] Frank Thomson Leighton. Introduction to Parallel Algorithms and Architec-
tures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, 1992.
[LNBK02] David Liben-Nowell, Hari Balakrishnan, and David R. Karger. Analysis
of the Evolution of Peer-to-Peer Systems. In Proc. of the 21st annual ACM
symposium on Principles of Distributed Computing(PODC), pages 233–242,
2002.
[Man04] Gurmeet Singh Manku. Balanced Binary Trees for ID Management And
Load Balance in Distributed Hash Tables. In Proc. of the 23rd annual ACM
symposium on Principles of Distributed Computing (PODC), pages 197–205,
2004.
[MRS00] Michael Mitzenmacher, Andra W. Richa, and Ramesh K. Sitaraman. The
Power of Two Random Choices: A Survey of Techniques and Results. In
Handbook of Randomized Computing. P. Pardalos, S.Rajasekaran, J.Rolim, and
Eds. Kluwer, editors, 2000.
[MS90] Burkhard Monien and Ivan Hal Sudborough. Embedding One Intercon-
nection Network in Another. Computing Supplementum, 7:257–282, 1990.
[MV99] Friedhelm Meyer auf der Heide and Berthold V¨
ocking. Shortest Path
Routing in Arbitrary Networks. Journal of Algorithms, 31(1):105–131, 1999.
[NW03a] Moni Naor and Udi Wieder. A Simple Fault Tolerant Distributed Hash
Table. In Proc. of the 2nd International Workshop on Peer-to-Peer Systems,
pages 88–97, 2003.
[NW03b] Moni Naor and Udi Wieder. Novel Architectures for P2P Applications: the
Continuous-Discrete Approach. In Proc. of the 15th ACM Symp. on Parallel
Algorithms and Architectures (SPAA), pages 50–59, 2003.
B 85
[PRU01] Gopal Pandurangan, Prabhakar Raghavan, and Eli Upfal. Building Low-
Diameter Peer-to-Peer Networks. In Proc. of the 42nd IEEE symposium on
Foundations of Computer Science (FOCS), pages 492–499, 2001.
[Ram97] Vijaya Ramachandran. QSM: A General Purpose Shared-Memory Model
for Parallel Computation. In Proc. of the 17th Conference on Foundations of
Software Technology and Theoretical Computer Science (FSTTCS), pages 1–5,
1997.
[RD01] Antony Rowstron and Peter Druschel. Pastry: Scalable, Decentralized
Object Location, and Routing for Large-Scale Peer-to-Peer Systems. Lecture
Notes in Computer Science, 2218:329–350, 2001.
[RFH+01] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard M. Karp, and Scott
Shenker. A Scalable Content Addressable Network. In Proc. of the ACM
SIGCOMM, pages 161–172, 2001.
[RLS+03] Ananth Rao, Karthik Lakshminarayanan, Sonesh Surana, Richard M. Karp,
and Ion Stoica. Load Balancing in Structured P2P Systems. In Proc. of the
2nd International Workshop on Peer-to-Peer Systems, pages 68–79, 2003.
[Sav98] John E. Savage. Models of Computation: Exploring the Power of Computing.
Addison Wesley, 1998.
[Sch00] Christian Scheideler. Probabilistic Methods for Coordination Problems.
Habilitation thesis, University of Paderborn, 2000.
[SMK+01] Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and Hari
Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet
Applications. In Proc. of the ACM SIGCOMM, pages 149–160, 2001.
[vEB90] Peter van Emde Boas. Handbook of Theoretical Computer Science, Vol. A,
chapter Machine Models and Simulations, pages 1–66. Elsevier, 1990.
[Wie05] Udi Wieder. The Continuous-Discrete Approach for Designing P2P Networks
and Algorithms. PhD thesis, 2005.
[WWWK94] Jim Waldo, GeoffWyant, Ann Wollrath, and Samuel C. Kendall. A Note on
Distributed Computing. Technical report SMLI TR-94-29, Sun Microsys-
tems Laboratories, 1994.