scieee Science in your language
[en] (orig)
On Fault-Tolerant Data Placement in
Storage Networks
Dissertation
by
Mario Mense
Faculty of Computer Science, Electrical Engineering and Mathematics
Department of Computer Science and Heinz Nixdorf Institute
University of Paderborn, Germany
August 2008
ii
Reviewers:
Prof. Dr. Friedhelm Meyer auf der Heide, University of Paderborn, Germany
Prof. Dr. Christian Schindelhauer, University of Freiburg, Germany
iii
Acknowledgments
First of all, I would like to thank my advisor Prof. Dr. Friedhelm Meyer auf der Heide for
his great support. Friedhelm always left me the freedom to address the problems I was
interested to investigate and which have now become the core of this thesis. Moreover,
basing on his early results in PRAM simulation and resource management in networks, I
had several motivating discussions with him from which I always could extract interesting
proposals and useful perspectives for my work. Furthermore, I would also like to thank
Christian Schindelhauer from Albert-Ludwigs-Universität, Freiburg, for reviewing this
thesis. Collaboration with Christian is always interesting, progressive, sometimes strange,
but in the end leads to fascinating views on many different problems.
At last, I express my thanks to those people with whom I had a great collaboration
and great time, mainly the members of the research group "Algorithms and Complexity"
and my co-authors of the main results in this thesis, namely Michael Kortenjan and Mar-
tin Ziegler from the University of Paderborn, Christian Scheideler from the Technische
Universität München, Germany, as well as Christian Schindelhauer and Arne Vater from
Albert-Ludwigs-Universität, Freiburg, Germany. Another special thank goes to Michael
Kortenjan and Gunnar Schomaker for suffering me as their office member during the last
years.
Paderborn, August 2008 Mario Mense
v
To Nicole and Charlotte
Contents
1 Introduction 3
1.1 Storage (Area) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 The Storage Virtualization Layer . . . . . . . . . . . . . . . . . . 6
1.1.2 Storage Management Schemes . . . . . . . . . . . . . . . . . . . 7
1.2 Demands on Data Placement in Storage Networks . . . . . . . . . . . . . 9
1.2.1 A Brief Note on Erasure Encoding and Replication . . . . . . . . 12
1.3 OurContribution .............................. 14
2 Some Classical Approaches 15
3 Random Allocation of Data Copies 23
3.1 PreliminarySection............................. 26
3.1.1 TheBasicModel.......................... 26
3.1.2 On the Practical Realization of the Allocation Function F. . . . 31
3.1.3 Further Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2 The Redundant Allocation Problem . . . . . . . . . . . . . . . . . . . . 43
3.3 Analysis of the Probability Distribution P................. 46
3.3.1 Existence and Uniqueness . . . . . . . . . . . . . . . . . . . . . 47
3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations . . . . 49
3.4 COMB: An Allocation Strategy for Static Scenarios . . . . . . . . . . . . 60
3.5 SPREAD: An Adaptive Allocation Scheme for Dynamic Scenarios . . . . 64
3.5.1 Previous Adaptive Strategies . . . . . . . . . . . . . . . . . . . . 65
3.5.2 The SPREAD Strategy . . . . . . . . . . . . . . . . . . . . . . . 69
3.6 Conclusion and Open Problems . . . . . . . . . . . . . . . . . . . . . . . 89
4 Erasure Codes for Reading and Writing 91
4.0.1 Preliminaries ............................ 93
4.0.2 Data Reconstruction from Failures . . . . . . . . . . . . . . . . . 96
vii
viii Contents
4.1 The Operational Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 LowerBounds................................103
4.3 Encoding/Decoding in the RWC . . . . . . . . . . . . . . . . . . . . . . 104
4.3.1 The Matrix Approach . . . . . . . . . . . . . . . . . . . . . . . . 106
4.4 Security and Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5 AdaptiveRWCodes.............................112
4.6 Boolean Read-Write-Codes . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.7 GeneralRW-Codes .............................118
4.8 Conclusion .................................120
CHAPTER 1
Introduction
The advances in networking technology over the last decades, the explosive popularity
of the World Wide Web and, depending on this, the growth of the Internet have greatly
widened the user base of computers, making it nearly universal. In times of the so-called
Web 2.0, information exchange has become ubiquitous, and novel pervasive applications,
like e.g. online banking, audio- and video downloads, data warehousing or the wide and
continuously growing field of e-commerce, have been enabled by this progress. Further-
more, it has become usual for people in the industrial nations that information is available
24 hours a day, 7 days a week1, and complex multimedia based content, like video clips
and MP3 music files, can be downloaded at high rates or viewed online.
Driven by this evolution, data has become the central and most valuable asset for many
companies and organizations, and generally, the efficient storage and retrieval of huge
amounts of information has emerged as a driving force of our information society’s econ-
omy. Due to an ICD study sponsored by EMC2[Gan07], the worldwide amount of digital
data in 2006 is projected at 161 Exabyte (billion GByte). In other words, on average, each
of the 6.7 billion people on earth has created 24 GByte of new data in 2006. This is plau-
sible since multimedia applications have entered a phase of rapid growth which is driven
by mainly two forces. First, the rapidly dropping costs of computer and networking hard-
ware has made multimedia applications accessible to an ever-increasing number of users.
Second, the growing size of the Internet has made it possible to publish multimedia con-
tent comparatively cheaply to an immense audience. For example, YouTube, a company
that did not exist a few years ago, hosts 100 million video streams a day, and experts say
more than a billion songs a day are shared over the Internet in MP3 format. Assuming
an expansion rate of 57%, in 2010, the worldwide data volume will be about six times as
much as today, 988 Exabyte. Particularly, from the total global data volume, about 70%
of all information is created by private persons, but nevertheless, roughly 85% of all data
1often abbreviated 24/7 or 24x7
3
4 Introduction
require the involvement of a company in terms of storing, providing or transmitting the
data. As a result, for providing and enabling all this information, multimedia content and
technical processes, companies have to administrate, process and backup these quantities
of data appropriately, and it has turned out that, on average, enterprises have to deal with
an information amount that doubles every year.
Therefore, in the recent years, stored data has become more and more critical and irre-
placeable, and the demand for quick and uninterrupted data availability has reached high-
est priority, thus, basic issues such as fault-tolerance, continuous information availability,
fast data access and advanced backup principles are crucial for every company today.
To guarantee all this, the concepts to manage and store data are of great importance be-
cause the quality of the data availability as well as the access speed have direct impact on
any company’s success. As an example, consider the case of a video-on-demand service
provider storing all his frequently accessed data on the same storage device. Due to band-
width limitations, during playback, the device may become unable to satisfy all incoming
request appropriately leading to loss of customer goodwill and hence, may be considered
expensive. As a result from summarizing all demands addressing modern advanced data
allocation, we can identify two major objectives any contemporary storage subsystem has
to accomplish: high-level serviceability, i.e. always serve given I/O requests quickly and
efficiently, independent of either the network’s size or occurring changes in the set of
storage devices, and data availability (or reliability), i.e. to feature appropriate fail-over
mechanisms to be safe from disk failures, and thus, data loss.
1.1 Storage (Area) Networks
The technological foundation to face the dramatic growth of enterprise data storage ca-
pacity as well as to meet the given objectives is to consolidate the storage capacity inside
a so-called storage area network (SAN) (see e.g. [JT05, Gup02] for details), a dedicated
network which is built around block-addressed storage units and that disbands the tradi-
tional tight coupling of servers and storage devices, but instead, establishes an any-to-any
connection from servers to disks by an underlying high-speed network. This concept
of storage consolidation leads to the so-called storage-centralized architecture enabling
enterprises to facilitate efficient parallel access to the data as well as cost-effective sys-
tem management. Unlike common server-centralized architectures in which a single disk
drive is directly attached to a workstation, the disks are rather encapsulated inside ad-
vanced storage subsystems, like e.g. disk arrays or high-end enterprise storage cabinets,
that today offer a capacity ranging from some TByte to several PByte each. Every such
storage subsystem can be directly accessed by any of the connected servers, thus, enhanc-
ing the ability to assign free capacity in an easy and flexible manner (see Figure 1.1).
Additionally today, almost all SANs are scalable although not all implement convenient
1.1. Storage (Area) Networks 5
storage-on-demand concepts, that is, to add (theoretically unlimited) storage capacity to
the system only when it is needed. Another problem that comes along with scalable envi-
ronments is heterogeneity. Depending on the underlying storage configuration (c.f. Sec-
tion 1.1.2), the ability of a storage network to scale may induce the storage components to
become heterogeneous over time because new storage devices enter the system as well as
older ones are being removed or replaced. This indeed simplifies migration processes, but
on the other hand forces the management modules to be able to cope with heterogeneity.
However, even if not all SANs are capable of handling heterogeneity yet, the ability to
scale already reduces the total costs of ownership (TCO) [GM00, Fal07, BSV04].
Last but not least, every SAN is required to have appropriate reliability mechanisms
installed to face the mentioned problem of data loss in case of device failures, while the
probability of such failures to occur increases with the number of storage devices in the
SAN (c.f. [PGK88] and Section 1.2). The common way to achieve high data availability,
and moreover, to ensure business continuity appropriately, is given by inducing redun-
dancy into the system in terms of applying either complete replication of data blocks or
erasure (resilient) encoding schemes (e.g. RAID). Unfortunately, almost all utilized fault-
tolerant placement strategies suffer from serious drawbacks, like e.g. the lacking ability
to properly cope with heterogeneity as well as insufficient failure resilience or inherent
I/O overheads when modifying the data.
SAN
SAN
. . .
LAN/WAN
Storage
Server
Figure 1.1: Storage Area Network
To summarize, depending on storage consolidation and supported by advanced data
and resource management strategies, modern SANs offer a high potential regarding the
efficient adaption to changing capacity demands without system downtimes as well as
parallel access to the (non-uniform) storage devices, low maintenance costs, simplified
6 Introduction
administration and integrated recovery services to react on hardware failures. Surely, to
ensure all these benefits, appropriate mechanisms to provide fault-tolerance,system scal-
ability and efficient data access even for non-uniform storage devices must be installed.
1.1.1 The Storage Virtualization Layer
Inside a SAN, the utilized data placement scheme is the building block to face the sketched
challenges and which is usually part of a software layer implementing the so-called stor-
age virtualization, an abstract concept separating the physical from the logical view of
storage resources (see e.g. [BMS+03, BHMadH+04, Sal04]). This abstraction enables
the consolidated storage devices to be managed as one single entity, normally as one
virtual disk, which is then offered to the application servers while hiding the physical
representation and explicit storage of the data from the user (Figure 1.2). In particular,
the logical address space presented to the accessing servers in terms of virtual volumes
is scrambled into equal sized data blocks which are then distributed among the physical
devices by the data distribution strategy according to the desired serviceability and avail-
ability objectives (additionally, modern SANs offer improved management services, like
e.g. remote copy or volume snapshots [BEHV06]).
Applicationserver
Physical Storage
Virtual Volume
Virtualization Layer
Remote Copy Snapshot …….
Data Placement Engine
Figure 1.2: Storage Virtualization Layer
However, obviously, both the given storage management scheme (see next section) and,
depending on the layout, the capability of the utilized allocation scheme directly deter-
mines the overall performance of the storage network regarding the described challenges.
1.1.2 Storage Management Schemes 7
1.1.2 Storage Management Schemes
In most commercial SANs today, RAID is still the usual way to allocate data blocks to
the set of storage devices, and for fault-tolerance purpose, most companies apply RAID
levels 1 (replication, resp. mirroring) or 4/5 (erasure (resilient) encoding) (for details, see
[PGK88, Mas97]) which arranges the scrambled data blocks into a fix sized stripe of data
blocks and additionally includes a parity block for providing one-failure tolerant redun-
dancy; if a higher level of redundancy is required, often Reed-Solomon erasure encoding
is considered (c.f. Chapter 4). Although RAID implements a deterministic (resp. man-
ual) data placement, it works for small storage environments as RAID sets are shipped in
terms of rigid storage cabinets each covering some fixed number of homogeneous disks on
which the data is striped by an internal controller to ensure low system latencies. Surely,
fixed RAID sets provide optimal fairness (by uniform data distribution) and since such
systems can tolerate one single disk failure at a time, vendors almost add so-called spare
disks for compensating a missing disk in case of failure.
However, if considering large-scale systems or those that may undergo changes in the
number of connected storage systems, the traditional rigid layout of such SANs makes
storage management become more and more less efficient because the only way to scale
is to concatenate multiple RAID sets disjointly, as depicted in Figure 1.3.
“manual”
placement
Figure 1.3: Classical Storage Management: Concatenated RAID sets
The drawbacks are quite obvious. First, if such systems base on disks of same type, the
probability of an occurring failure increases significantly (c.f. next section). Furthermore,
easy system and data migration is merely possible inside a given cabinet, but most im-
portantly, since the storage virtualization is located close to the data inside the cabinets,
the benefits of virtualization, like e.g. fault-tolerance, disk utilization or access efficiency,
have only local effects inducing the need for an additional superior virtualization layer. If
then such layer also implements RAID, system scaling becomes a hard challenge because
if new disks are added over time, with high probability, they are of different characteristic
making the scaling problem a heterogeneous one.
8 Introduction
Therefore, a self-evident way to achieve highest system performance is given if envi-
ronments get rid of all the rigid cabinets and arrange the complete storage in a single, flat
storage solution, as depicted in Figure 1.4 ? Surely, such solution is much more chal-
lenging to the employed allocation scheme because all mentioned issues directly address
the superior virtualization layer. However, besides the possibility to efficiently handle
heterogeneous devices, an open, flat storage environment further offers increased scal-
ing possibilities as well as better capacity consumption by efficiently using the complete
storage space and thus, letting spare disks become obsolete.
Figure 1.4: Open Storage Management: Single, flat storage arrangement
In this work, we present and discuss different data placement schemes for both clas-
sical and open storage configurations, and thus, depending on different restrictions, also
different demands with respect to the virtualization engine are considered. For example,
for a classical RAID-based storage layout as described, we ignore scalability and hetero-
geneity, and instead, rather focus on fault-tolerance and efficient access behavior that is
almost gained by erasure resilient encoding.
Therefore, in Chapter 4, we present a flexible framework for generating different era-
sure resilient codes that, compared to classic erasure codes, include improved update
behaviors which are very useful for SANs. On the other hand, in case of open storage en-
vironments, scalability and heterogeneity additionally come into play forcing the applied
distribution scheme to handle all challenges mentioned in the previous section. Therefore,
in Chapter 3, we introduce efficient and redundant strategies that base on replication and
which are furthermore capable of coping with scalable and heterogeneous storage config-
urations, respectively. However, before we go into further detail investigating these novel
allocation schemes, we first introduce and briefly describe the complete set of different
requirements that generally address data allocation and which must, at least partially, be
handled by any placement algorithm employed in a storage network.
1.2. Demands on Data Placement in Storage Networks 9
1.2 Demands on Data Placement in Storage Networks
We give a brief summary of the basic challenges that must at least in part be considered by
a data allocation scheme to ensure the serviceability and availability objective for a con-
sidered storage environment. In particular, as the underlying storage configuration may
restrict the set of demands to be considered by the applied placement scheme, different
overall performances might be achieved for varying configurations. However, all strate-
gies observed in this thesis in common is that they can be reduced to pure data placement
only, i.e. all considered strategies are not concerned with limiting properties of the under-
lying network topology, like e.g. congestion or network latency. This results from the fact
that the I/O accesses to the single storage devices cause the bottleneck of the overall sys-
tem because any such access is served by comparatively slow mechanical driven access
mechanisms inside a single storage device. Consequently, we can suppose the underlying
network topology to be unrestricted because the system bottleneck appears directly at the
storage devices such that potential problems when routing the data packets through the
network can be neglected.
Therefore, the usual way to ease the access bottleneck at the disks in order to decrease
system latency in both the classic and the open storage model is data striping which
spreads the data blocks of some given file to multiple storage devices making multiple
portions of the file be accessible in parallel, thus yielding effective transfer rates. Ba-
sically, the principles of data striping can be partitioned into random and deterministic
approaches, and as we have seen, in traditional (commercial) SANs, respectively disk ar-
rays, the well-known deterministic RAID scheme [PGK88, Mas97] is almost still applied.
However, in general, besides parallel access, striping the data as evenly as possible among
the disks further results in an optimal disk utilization which, at first, also improves system
performance because each write I/O involves several disks for data storage, and second,
reduces the TCO (c.f. [Mas97]). Moreover, data striping over multiple devices consider-
ably supports business continuity which, for instance, becomes apparent in the example of
the video-on-demand provider who runs into problems when storing frequently accessed
data onto either a single or a small number of devices such that parallelism could not be
exploited accurately. On the other hand, placing only infrequent accessed material on a
storage device may result in the device to be underutilized. Thus importantly, any allo-
cation scheme must provide an appropriate load balancing with respect to both the data
blocks and the incoming I/O requests over all accessible disks.
As we explained already, modern open scalable storage systems additionally need to
consider storage devices of non-uniform characteristics (e.g. capacity, bandwidth, la-
tency, age etc.). Therefore, advanced data distribution strategies are required to handle
heterogeneity appropriately; with respect to Salzwedel [Sal04], we denote this the hetero-
geneity problem, and any data distribution strategy able to efficiently handle heterogeneity
10 Introduction
is called fair. Similar to uniform capacities, a fair distribution has the same great influence
on disk utilization and thus, overall system performance.
Another issue significantly affecting system latency is system scaling what is caused
by an ever growing storage data volume as well as occurring disk failures. In case of a
growing data volume, the simple concatenation of storage devices obviously implies ac-
cess bottlenecks caused by the so-called 80/20 problem describing the fact that without
data redistribution, new data to which roughly 80% of all current accesses go is almost
exclusively stored on the newer devices covering roughly 20% of the total storage capac-
ity (c.f. Chapter 2). Additionally, depending on technical progress, age reasons or failure
probability, added or replaced disks are often of different characteristic compared to the
ones already connected. Thus, even if initially of homogeneous trait, over time, a SAN
becomes more and more heterogeneous. Furthermore, to obtain the best performance, the
newly added disks should also be included into the striping process but this evokes the
problem of how to add newer disks efficiently such that, in the long run, the overall per-
formance is not affected noticeably. Clearly, in case of fixed sized data stripes, any system
growth induces a complete redistribution of almost all of the stored data. If then TBytes
or PBytes of data are stored, either serious performance losings or complete downtimes of
the system for hours or days must be considered, what is intolerable for every company.
Thus consequently, system expansion should be done as fast as possible, and after finish-
ing the redistribution, the strategy should balance both the data and the requests optimally
over all disks. A scalable data placement scheme that keeps the amount of redistributed
data at a near-minimum is called adaptive.
Besides performance considerations, the stripe size has further influence on the relia-
bility of the system because, according to Patterson et al. [PGK88], the failure rate of a
disk array increases proportional to the number of covered disks which are expected to
fail independently. More precisely, failures in disk arrays are often assumed to satisfy
the memoryless property, that is, the life expectancy of a disk is estimated only upon the
condition that the disk is working now. Hence, the reliability of a disk array is modeled by
the exponential distribution [Gib99]. Consequently, for low disk failure rates, the failure
rate of a disk array is proportional to the number of disks it contains. However, as a rule of
thumb, we can alternatively consider the so-called Mean Time Between Failure (MTBF),
the vendor given operation time of a single disk expressing its reliability (for example, the
Hitatchi UltrastarTMA7K1000, a conventional disk in 2007, exhibits an MTBF of targeted
1.2 million hours [Tec07]). Assuming a constant failure rate, the MTBF of a disk array
then is
MTBF of a Disk Array =MTBF of a single disk
Number of disk in the array
Thus, for a SAN consisting of 100 UltrastarTMdisks, the MTBF is calculated as roughly
1.2 million / 100 hours (= 1.5 years). At a first glance, this seems sufficient, but if we
1.2. Demands on Data Placement in Storage Networks 11
scale up to 1000 disks, the MTBF drops to 1,200 hours, or less than 2 months, what is
dismal. With respect to the disk failures that might occur, we primarily distinguish three
types. The first, called transient errors, arise from noise corruption and are dealt with
by repeating the requests. The second, called media defects, are caused by permanent
defects in material and are detected and masked by the manufacturer. The last, on which
we concentrate in this thesis, are catastrophic failures such as head crashes and failures
of the disk controller electronics. According to Hellerstein et al. [HGK+94], we denote
such failures erasures throughout this work because when a disk suffers a catastrophic
failure, its data is rendered unreadable, and thus, is effectively erased.
Consequently, besides appropriate backup mechanisms, any allocation scheme should
furthermore provide some certain degree of fault-tolerance. Usually, fault-tolerance is
achieved by introducing redundancy into the system that can be realized by either repli-
cation, i.e. seperately storing identical copies of every data block, or erasure resilient
encoding by which the information is redundantly encoded inside a codeword made up
of a number of blocks that are stored on separated devices. As we see in the follow-
ing chapters, erasure encoding is suitable rather for a homogeneous set of disks whereas
with replication, heterogeneity and scalability can be tackled efficiently. That is, all data
placement strategies we introduce in this work are fault-tolerant approaches while redun-
dancy is either provided by replication or erasure resilient encoding. However, the central
algorithmic problem we address in this thesis is the following
Data Availability Problem: Guarantee to always ensure a fast and efficient access to
all stored data in a storage network even in case of erasures.
Surely, the data availability problem covers the fulfillment of both the serviceability and
the availability objective defined in the beginning including all properties and require-
ments discussed above. In the following list, we summarize these general requirements
that result from the given discussion:
1. Redundancy: Data items are redundantly stored on separated disks.
2. Fairness: To ensure a best optimal disk utilization and reduced access latency, a
data placement strategy must assign every disk a fair share of the total data load,
i.e. a share of data that is equivalent to its share of the total capacity. This should
hold not only for the data, but also for the I/O requests.
3. Heterogeneity: Non-uniform capacities or bandwidths should also be handled op-
timally.
4. Scalability: The system should be able to efficiently handle the addition or removal
of storage devices.
12 Introduction
5. Adaptivity: A scheme is called adaptive if in case of any system changes (in the
number of data items or storage devices), it allows to adapt to the new configuration
with minimum amount of data replacement while maintaining all given demands.
6. Availability: Suitable redundancy mechanisms should be incorporated to compen-
sate the loss of failed (or blocked) disk drives.
7. Efficiency: The total space for storing control information should only depend on
the number of storage devices and not on their differences in capacity. Furthermore,
the time for computing the position of any data item stored in the system should be
at most logarithmic in the system size.
Again, driven by specific application demands concerning the applied allocation strat-
egy as well as different characteristics of the given hardware environment often allow a
placement scheme to cover only a subset of the listed issues, as we see in the following
chapters.
Before we investigate the different strategies in more detail, for the sake of complete-
ness, we first take a brief look on the parameters that must be chosen for replication as
well as erasure coding because in large-scaled storage environments, a proper selection
of values for the redundancy parameters directly influences the capacity provisioning and
thus, the TCO of the company.
1.2.1 A Brief Note on Erasure Encoding and Replication
Replication is a simple scheme where ridentical copies of each data object are stored
separately on rdifferent disks, thus, enabling the system to tolerate up to r1 erasures.
In contrast, with an erasure-coded redundancy scheme, each object is divided into kfrag-
ments and redundantly encoded into a codeword of n>kfragments (blocks) which are
also stored separately. Then, the key property is that the original information can be re-
constructed from any kfragments of the encoding. Basically, the main difference between
replication and erasure encoding is that with replication, the entire information of the data
block is stored in each copy implying that only one out of rcopies suffices for recovering
the data, what is an advantage as most applications today feature a rather read-intensive
access behavior. On the other hand, this leads to a serious drawback if capacity sav-
ings become apparent as replication suffers from a storage overhead of a factor of r1.
Therefore, the main reason for using erasure coding (if only homogeneous storage de-
vices are considered) is that a high degree of redundancy, respectively data availability,
can be achieved with less additional storage because the redundancy factor is only sc=n
k.
Nevertheless, in case of non-uniform capacities or a scalable storage system, replication
features the discussed advantages.
1.2.1 A Brief Note on Erasure Encoding and Replication 13
Often, the values r,kand nare simply set by a rule of thumb but in large systems, made
up of a hundred or thousands of storage devices, these values should rather be chosen
carefully with respect to the mentioned storage overhead. Moreover, this also includes
the question of how many copies to hold for data items featuring different popularities.
For this problem, a suitable way to determine and thus control the storage overhead of
both approaches was presented by Rodruiges and Liskov [RL05]. They focus on high
availability in peer-to-peer networks and model the values of r, respectively sc, according
to a desired per object unavailability target ε>0 in combination with an average node
availability adenoting the time a peer is reachable. In particular, in case of replication,
they assume the node availability to be distributed independently and identically. Then,
the values for εand r, respectively, are computed as
ε=Pr[object ois unavailable]
=Pr[all rreplicas of oare unavailable]
=Pr[one replica is unavailable]r
= (1a)r
which upon solving for ryields r=llog ε
log(1a)m.
For the case of erasure coding, we sketch a summary of the complete derivation to
compute scwhich can be found in [BSV03]. Object availability is given by the probability
that at least kout of sc·kfragments are available which can be modeled by a binomial
distribution as
1ε=
sck
i=ksck
iai(1a)scki.
Applying some algebraic simplifications and the normal approximation to the binomial
distribution, the storage overhead sccan be obtained as
sc=
σεqa(1a)
k+qσ2
εa(1a)
k+4a
2a
2
where σεis the number of the standard deviations in a normal distribution for the required
level of availability.
Besides the result of Rodrigues and Liskov [RL05], there exists further comparisons
[BSV03, DLS+04a, WK02], all of which argue that erasure coding is the clear victor due
to capacity savings for the same level of availability. However, the authors only compared
approaches for homogeneous, unscaled systems, and as we already described, in case of
non-uniform capacities or dynamic system behavior, an open replication based allocation
has considerable advantages.
14 Introduction
1.3 Our Contribution
Depending on the different storage configurations briefly described in Section 1.1.2, the
schemes we introduce in the following are coarsely divided into two different parts. With
respect to an open, heterogeneous and perhaps dynamic system, in the first chapter, we in-
vestigate replication-based schemes that adapt balls-into-bins games (see Section 3.3.2.1)
for randomly allocating identical copies of data items to the set of storage devices ensur-
ing that no two identical copies are hosted by the same device. Thus, given a replication
parameter of r, the system can tolerate up to r1 failures and is still operational. We
consider randomized algorithms because in the non-redundant case, a random allocation
of the data items has turned out to be a useful way to conveniently model the load balanc-
ing problem. Furthermore, we distinguish between static (i.e. not dynamic) and dynamic
environments that may undergo changes in the number of connected disks, and as we
see, if scalability is neglected, a random and fair allocation to non-uniform capacities
can already be achieved very easy. Nevertheless, as we show, with a random and redun-
dant allocation of identical copies to non-uniform disks, an appropriate load balancing is
harder to gain than in the non-fault-tolerant case. Furthermore, if the system is allowed
to behave somehow dynamic, this challenge increases by magnitudes, but as we show by
the SPREAD strategy, an adaptive allocation scheme for redundant and fair storage in
dynamic heterogeneous storage systems, this can be handled efficiently.
Although storing replicated copies implies the efficient and appropriate handling of
hardware heterogeneity as well as dynamic behavior, unfortunately, such schemes require
to update all copies in case of any data modification to keep the system consistent, and
moreover, they require a higher waste of resources. Therefore, in the second part, we
introduce the so-called Read-Write Erasure Coding Systems (RWC), a flexible system
for generating parametrized erasure codes that, unlike usual erasure codes such as RAID
or Reed-Solomon codes, offer improved update properties making them very useful for
utilization in storage networks. Furthermore, the generated codes additionally include
new security features for a secure storage of redundant data.
Before we start introducing our main results, we first sketch some of the most prominent
previous results, each of which was designed to tackle a certain subset of the listed issues
above.
CHAPTER 2
Some Classical Approaches
In this chapter, we give a brief survey of the most prominent approaches in the field of data
placement algorithms that are usable for storage networks. Basically, most of the schemes
we discuss in this thesis distribute the given data by applying either hashing methods,
randomization, erasure resilient encoding or a mixture of these techniques, and for most
of them, it holds that depending on given parameters, configurations or objectives, it
often suffices to fulfill only some of the number of listed demands on data placement. For
example, if the system is not expected to scale and the connected storage devices are of
uniform size, simple and efficient XOR-based RAID schemes could be applied, instead
of complex distributed hash table approaches.
Since we investigate randomized allocation schemes that adapt balls-into-bins games
and erasure resilient encoding more deeply in the two major parts of this thesis, we now
discuss alternative approaches that base on a different design. All these approaches in
common is that non of them covers the data availability problem completely, thus, we
partition the description of the algorithms according to the requirement they mainly aim
to fulfill, namely fairness, data availability, heterogeneity and scalability (moreover, in
this chapter, we restrict ourselves to scalability rather than adaptivity because for some
scalable strategies the degree of adaptivity has not yet been analyzed or is simply ignored).
As fairness is one of the most important issues with respect to the serviceability objec-
tive, we start with introducing techniques that aim at both obtaining a fair data and access
distribution over the disks to provide optimal system response times. In particular, these
schemes can mainly be distinguished into either deterministic or random based ones.
Fairness
Again, the ability to distribute the data in a fair manner among the (perhaps non-uniform)
disks in a SAN (which is often also referred to as space balance [Sal04]) directly corre-
lates to an optimal system saturation, respectively an overall disk utilization. Moreover,
15
16 Some Classical Approaches
besides decreasing the waste of storage space by full capacity usage, inside a disk, con-
tinuous access algorithms, like the elevator strategy, are applied making the response time
of a disk correlate with its utilization.
Moreover, as briefly sketched in Section 1.2 and shown by the example of the audio-
and video provider in the previous chapter, not only the data must be distributed fairly;
the second factor that has great influence on the performance are user access patterns.
In particular, it has turned out that the frequency of access to data objects follows a Zipf
distribution1[WAS+96] (Figure 2.1) where about 80% of the accesses are to 26% of the
data indicating a high locality of reference which leads to access hot spots that have to
be dissolved by the placement algorithm. Unfortunately, many of the older expendable
data placement schemes do not consider this problem appropriately. For example, Fargin
et al. introduce Extensible Hashing, a scheme that uses a directory for addressing pages
[FNPS79, ED88]. The scheme applies a hash function to evenly spread the data across the
directory, and the addresses of the pages are stored in at least one index of the directory.
An important characteristic of this directory is that several entries may point to the same
primary data page which leads to access clustering on the pages. Moreover, in case of
key overflows on some pages, the directory expands by complete doubling. In that case,
another hash function addressing a wider index range is applied which in turn intensifies
the access clustering on some pages.
Figure 2.1: A typical Zipf-law rank distribution. The y-axis represents occurrence fre-
quency, and the x-axis represents rank (highest at the left).
1In a Zipf distribution, if the objects are sorted due to the access frequency, then the access frequency
for the ith object is given by pi=c/i(1θ), where θis the parameter for the distribution and cis a
normalization constant.
17
Therefore, space and access balance are mutually dependent, and it is always necessary
to store the data such that I/O accesses are balanced evenly across the disks, regardless of
using uniform or non-uniform sized disks.
When looking at algorithmic approaches, usually in storage networks, deterministic
strategies are used, while the most-applied deterministic technique is uniform striping
[Joh84, PGK88, Mas97, HJH02] by which, given a fixed stripe size n, the i-th element in
the data stripe is mapped to disk Dimod non position idiv n(c.f. the widely used RAID
schemes with levels 0/4/5/6/...). However, deterministic striping is hard to scale because
whenever a fixed stripe size is broken up and scaled to a different size, this induces data
redistributions at great expense. Therefore, the natural extension is to simply concatenate
additional disk arrays to the given system, but again, this leads to the above 80/20-problem
(also discussed in the previous chapter). An alternative approach is to group the disks into
clusters in advance while each cluster idefines its own stripe of length `isuch that inside
the cluster, uniform striping is applied. Unfortunately then, either the complete system
has to be planned also for future situations or, in case of concatenated disk arrays, the
number of new disks is determined by fixed stripe sizes perhaps resulting in an undesired
over-provisioning of storage capacity. As a consequence, there exists a number of simple
striping variations to break the very regular structure of this approach (e.g. [BGMJ94]).
A more general drawback of deterministic schemes is given by the fact that if the place-
ment policy is completely known, bad inputs causing placement collisions can easily be
created. Thus, the most effective protection to be safe from bad input is given by ran-
domization (see e.g. [CW77, LYD71, AT97, SMRN00] and c.f. Section 3.1.2.1). With
a randomized placement scheme, each data block is assigned to a disk drawn at random
by either a pseudo-random hash function or a precomputed random distribution [SM98a].
Generally, randomized placement schemes are often much simpler than their deterministic
counterparts, efficient to use, redundantize the need for complex and sensitive centralized
control, provide efficient results even for non-uniform settings, and most importantly, un-
like deterministic striping patterns, offer high potential to adaptivity, i.e. being able to
react quickly on occurring configuration changes (see [Sch00] for a good overview about
randomization techniques and Chapter 3 for efficient novel strategies).
Scalability
Again, todays SANs still follow the classical storage management (Section 1.1.2) for
which the expansion rule is simple concatenation of rigid RAID arrays in case of sys-
tem scaling. As mentioned above, this results in an undesired and costly capacity over-
provisioning implying a negative effect on the TCO. Moreover, assuming a fixed stripe
size, the induced redistribution phase dramatically influences the overall system perfor-
mance.
18 Some Classical Approaches
a b c d
00 01 10 11
(a)
a b c d
000 01 10 11
A
100
w
(b)
Figure 2.2: Linear Hashing
An early approach on scalability to overcome these drawbacks is given by the Linear
Hashing strategy providing a fair and access balanced distribution of data records directly
on a number of continuous (primary) buckets as long as the address space remains un-
changed. The buckets are addressed by a pair of hash functions hiand hi+1,i=0,1,2,...,
starting with h0, and each hihashes keys to primary buckets (Figure 2.2a).
As shown in Figure 2.2(b), continuous key insertions on single buckets may lead to
bucket overflows (bucket b) inducing the address space to expand which is done by split-
ting overflowed buckets in two starting from left to right (that is, bucket ais split before
bucket b) while one of them is added as a new (secondary) bucket at the rightmost end
of the primary address space. To address the split buckets, a linear extension hi+1of hiis
then used for addressing the new (extended) address space.
However, although designed for scaling systems, Linear Hashing features serious draw-
backs. First, the continuous (and linear) addition of buckets leads to bad space utilization
as unsplit buckets are more likely to overflow than newly added ones leading to waste of
storage. Second, the best performance for an unscaling system is gained by hash func-
tions that distribute the data uniformly over the address space. Unfortunately, as new
buckets are sequentially added at the rightmost end, the overflow chains of buckets at
the rightmost end become too long (Figure 2.2) because they are the last to split, thus,
causing an imbalance. Therefore, a decreasing exponential distribution that maps more
keys to the left end would perform better. However, there are different approaches that
handle this problem accurately but by inducing slightly more complexity into the system
[Mar79, Mul84, Oto88].
Basing on the idea of Linear Hashing, Litwin et. al introduced LH* [LNS93], a class
of scalable distributed data structures (SDDS) generalizing the concept of Linear Hashing
to parallel and distributed RAM and disk files without employing a centralized coordi-
nator. In contrast to Linear Hashing, the address space is shared and manipulated by
several clients, and to enable distributed processing appropriately, all clients have their
own view on the file, and calculations are made on local values. Accurate object manipu-
lations are then coordinated by a messaging system where clients process local parameter
adjustments while only the bucket splitting is controlled by a dedicated server, the split
19
coordinator.
Like Linear Hashing, LH* allows a file (address space) to grow gracefully in a dis-
tributed environment but adopts many of the drawbacks sketched above. As an ex-
tension, several LH* variants have been created that incorporate fault-tolerant features,
such as mirroring, checksum, scalable availability or Reed-Solomon encoding (see e.g.
[LN96, LMR98, LS00, LR02]). However, depending on the overflow policy, space uti-
lization is also more or less moderate, and more importantly, no LH* variant provides
mechanisms to cope with non-uniform disk capacities.
Data Availability
We already explained that the easiest way to achieve fault-tolerance is to separately store
identical copies of all data items on different disks, like in the RAID 1 scheme [PGK88]
or the PAST storage server [RD01, DR01], and although keeping additional copies for
each data item results in a higher waste of storage capacity, compared to erasure resilient
encoding, replication offers better access properties for read-I/O dominated applications,
like not being concerned with complex data en- or decoding as well as offering the chance
for accesses to alternate on identical copies for I/O speed-up. Nevertheless, instead of
replication, commercial SANs rather use RAID-parity encoding while business continuity
is provided by the concept of (hot) spare disks, dedicated disks that are kept for fail-over
only. Negatively, in normal mode, these disks remain unused, and only in case of a
failed disk, they take over the missing part. Additionally, this concept does not ensure the
expandability of the system due to growing storage demands.
With respect to the LH* schemes, Honicky and Miller presented a way to overcome
their inherent drawbacks [HM03, HM04, WBMM06]. They introduced a family of pseudo-
random algorithms for decentralized data distribution that map replicated objects to a
scalable collection of storage servers or disks. These algorithms are primarily designed
for very large-scale storage systems in which new disks are always added in terms of
weighted homogeneous clusters. Moreover, all of an object’s replicas are stored strictly
disjoint (for example, some peer-to-peer systems, such as OceanStore [REG+03], do not
make such guarantees). However, since all algorithms are laid-out to work for large-
scale systems, a per-cluster weighting is performed rather than weighting individual disks
which leads to a fair distribution with respect to clusters only. Furthermore, in case
of system scaling, the redistribution of data objects solely considers uniform servers in
weighted clusters, thus, fairness can only be achieved for uniform disk capacities making
the scheme less usable for heterogeneous disks (but which is tolerable since they assume
server addition in clusters only).
Another approach to obtain a scalable and fault-tolerant strategy that is furthermore able
to handle non-uniform capacities was presented by Brinkmann et al. [BEHV05, BE07].
20 Some Classical Approaches
They show the coupling of a scalable data placement scheme called SHARE (c.f. Sec-
tion 3.5.1 for details on SHARE) with a separate RAID 1 layer on top. This approach
is implemented inside the storage virtualization solution V:DRIVE using SHARE as the
core allocation algorithm to become scalable. In particular, for a given replication pa-
rameter r, the V:DRIVE layer creates rvirtual volumes from the physical storage devices
which are then exported to the RAID layer for generating mirrored RAID volumes on top.
However, although efficient, the inherent drawback of this generated stack of strategies
is an increase in error-proneness because of a higher number of different but interacting
strategies.
In addition to storing multiple copies for each data item, the use of parity information to
tolerate a small number of failures has reached high attraction. The key idea is to introduce
redundant information into each data stripe by keeping the bit-wise exclusive-or (XOR)
information of the data blocks in the stripe in an extra parity block. Then, all blocks
are stored on separated disks, and in case of a single disk failure, the XOR operation
guarantees complete information recovery from the residual disks. Due to their simplicity,
efficiency and low capacity consumption, parity based schemes have widely been applied,
like in RAID level 4/5 [PGK88, CLG+94] or numerous video servers [BGMJ94, SM98a,
TPBG93, VRG95].
As simple parity approaches only prevent from one erasure, advanced schemes, like
the EVENODD strategy [BBBM94], are able to tolerate two missing disks. This is ac-
complished by keeping a matrix in which the regular RAID level 5 layout is encoded and
parity information is added for each diagonal of the matrix. Moreover, an EVENODD ap-
proach can be extended to tolerate up to tarbitrary simultaneous disk failures, but in this
case, t·`11/tdisks are needed for keeping parity information [HGK+94], where ` < n
denotes the stripe length. Thus, with respect to a fast reconstruction, this approach is only
of theoretical interest. As a consequence, more complex erasure codes even for storage
networks, like SANs or peer-to-peer networks, have become important that are able to
recover from a higher number of disk failures (but which we leave out for the moment
and refer to Chapter 4 for more details).
However, all known erasure coding schemes mainly focus on an efficient and almost
optimal data recovery in case of erasures. But as we see in Chapter 4, this induces a
negative update behavior in case of information modifications. Therefore, to overcome
this drawback, we introduce so-called Read-Write Codes that feature an improved update
flexibility for stored codewords and which are very useful for SANs.
Heterogeneity
Obviously, an easy solution to face heterogeneity could be to divide disks into clusters
of equal characteristics, as proposed by Honicky and Miller [HM03, HM04, WBMM06].
21
Since inside each cluster, the disks appear to be uniform, the aforementioned strategies
can be applied without explicitly regarding heterogeneity. However, this principle com-
prises some drawback. If newer and perhaps faster disks are added, a significant improve
in access performance is only given for the subset the new disks were added to. Thus, re-
sponse time strongly correlates to the clusters and the clustering policy, respectively. As
an advantage, if access patterns are known in advance and remain stable over time, such
clustering could be used to dissolve access hot spots. Unfortunately, such assumption is
not very realistic.
Another principle that bases on grouping is given by the HERA approach, a heteroge-
neous extension of RAID [ZG00]. In HERA, a strategy called Disk Merging is introduced
that was first proposed in [ZG97, Zim98] and which constructs a logical collection of uni-
form sized volumes from an array of heterogeneous physical disks. Each logical volume
is created by aggregating a certain number of fixed sized fractions of the bandwidth or
capacity provided by different physical disks. For example, in Figure 2.3, the logical
volume number 2 (d`
2) is fed by fractions from the disks 0, 1, and 2 (dp
0,dp
1,dp
2).
G0
G1
G2
G3
0.0 0.1 0.2 0.P
1.0 1.1 1.2 1.P
2.0 2.1 2.2 2.P
3.0 3.1 3.2 3.P
Logical: d0ld1ld2ld3ld4ld5ld6ld7ld8ld9ld10ld12ld13ld14ld15l
d11l
d0pd1pd2pd3pd4pd5p
Physical:
Parity
Groups:
p0=2.2 p1=2.2 p2=3.6 p3=2.2 p4=2.2 p5=3.6
Figure 2.3: HERA: Heterogeneous Extension of RAID
According to a given parity scheme, the volumes are then grouped into different parity
groups. As a main constraint, it must be considered that two logical volumes may be
mapped to the same physical device, thus, they become dependent because if the physi-
cal disk fails, both volumes are affected. Consequently, since with simple XOR exactly
one failure can be tolerated, two different volumes may not be mapped to the same par-
ity group. To show the reliability of the system, the behavior is modeled by different
processes, like the failure process and the reconstruction process. Because the MTTF
of any disk is given by vendors, the time of these processes can be estimated; using a
Markov model [Gib99], the mean time to service loss (MTTSL) can be derived for such
heterogeneous environments. It was shown that the system reliability is only a factor of
approximately 10 away from the best possible configuration, namely the clustering of
22 Some Classical Approaches
identical disks, as outlined above. However, the scheme itself and its provided perfor-
mance strongly depends on the control of an administrator and is thus unusable for large
environments.
In AdaptRAID [CL00, CL01], the regular RAID layout is extended such that hetero-
geneity in terms of different capacities can be handled. We only sketch the initial idea
here which is as follows. The disks are sorted according to an increasing capacity. Then,
the goal is to put as many stripes of length `=non all disks as possible. As soon as some
kdisks are saturated by data blocks, stripes of length nkare mapped onto the remain-
ing nkdisks, and so forth. The main drawback of this approach is that, depending on
different stripe lengths, access to data stored in the upper part of a disk is faster than in
the lower part. To overcome this problem, a second layer of different data patterns must
be defined carefully which compensates the gaps between the initial patterns. Access to
the data blocks that are encapsulated in fix sized stripes then is twofold making it hardly
usable if the environment is allowed to scale.
At last, we briefly sketch the RIO (Random I/O) Mediaserver [SM98a, SM98b], a
generic multimedia storage system capable of efficient, concurrent retrieval of many types
of media objects. The RIO server defines a randomized distribution strategy and supports
real-time data delivery with statistic delay guarantees. Data is placed to randomly cho-
sen positions on randomly chosen heterogeneous disks. Disk bandwidth and capacity are
coupled by a bandwidth-to-space ratio BSR, and disks with a high BSR can deliver data
faster than some with lower BSR. To better utilize the preferred disks, replication is used
such that copies of data blocks are assigned to these disks. The only remaining problem
then is to identify the degree of redundancy to introduce into the system to sustain a given
load but which can be estimated by simple calculations (that we leave out at this point).
Importantly, this scheme already emphasizes that randomization performs best when
having to cope with heterogeneous disk such that disk utilization and overall system per-
formance can optimally be provided.
For the rest of this work, note that additional related work that bears a higher relation to
the results representing the core of this thesis can be found located closely to the particular
strategies.
CHAPTER 3
Random Allocation of Data Copies
In this first scenario, we refer to an open storage management scheme as described in
Section 1.1.2 in which we consider a SAN as a single, flat storage solution consisting of
storage devices of arbitrary capacities. Furthermore then, we investigate data placement
approaches that use replication for obtaining fault-tolerance in storage networks. As we
already mentioned, replication is the easiest way for introducing redundancy into a stor-
age system because the only thing to do is to store ridentical copies of each data item
separately on rdifferent disks what makes the system capable of tolerating up to r1
erasures. Surely, the storage of r1 additional replicas of some data block implies a
storage overhead of a factor of r1, but this waste of capacity is outweighed by the in-
herent simplicity of the scheme because, basically, the only challenge to cope with results
in controlling the separate storage of the rcopies on different disk. Depending on this
inherent simplicity, replication has become very popular in many application areas be-
cause it neither depends on complex control structures nor on any operational overhead,
as, for instance, given with erasure coding (see Chapter (4)). Again, the most promi-
nent representative using replication in the context of storage networks is the well-known
RAID scheme whose level 1 definition (also known as mirroring) applies replication in
a pure deterministic manner, and practical usage has shown that, as long as the observed
environments do not exceed some certain size, the RAID 1 scheme is quite efficient.
However, if systems grow, the attempt to control any deterministic allocation regarding
the redundancy and fairness conditions becomes pretty challenging. Moreover, if, be-
sides the required redundancy, additional parameters like arbitrary disk capacities and/or
dynamic system behavior come into play, with a deterministic scheduling, the costs, re-
spectively computational complexity for the allocation considering all demands listed in
Section (1.2), rise significantly with the size of the storage environment, what is dismal.
Thus, alternatively, a well-known and very efficient paradigm to overcome these draw-
backs is given by randomization which has become a standard approach in algorithm
design since efficiency and simplicity are the main features of randomized algorithms.
23
24 Random Allocation of Data Copies
The usage of randomization is not that new, and probabilistic based schemes have been
applied successfully to conveniently model load balancing in many allocation problems,
such as job scheduling, hashing, network routing or data allocation in e.g. distributed
hash tables (DHT) [Sch01, DKM+88, KLM92, ACMR95]. Again, randomized place-
ment schemes are often much simpler than their deterministic counterparts, efficient to
use, redundantize the need for complex and sensitive centralized control (as required by
deterministic schemes) and provide efficient results even for non-uniform settings.
Surely, one of the main differences between deterministic and randomized strategies is
that, in general, going from a deterministic scheduling that becomes very costly for large
systems to short computations of randomized algorithms is often paid for by the risk of
computing a wrong, respectively erroneous outcome, what is also undesired. Fortunately,
for all our strategies, we can show that, besides being redundant, they appropriately satisfy
the desired fairness condition, and the deviation of the disk loads resulting from the ran-
domized allocation compared to an optimal data layout is considerably small, with high
probability. Note that, when using a randomized placement, the higher the concentration
of the load of some disk around the expected value (which equals the load assigned to
the disks by an optimal algorithm) the more evenly the data layout becomes. Again, this
basic feature prepares the ground for efficient parallel access to the data stored on slow
mechanical based devices because by storing the data in a randomized fashion among
all accessible disks, all I/O requests are also fragmented and evenly distributed over the
storage network.
Furthermore, unlike any deterministic approach, randomization offers high potential
to scalability (see e.g. [JK77, KLMadH96, BMadHS97, Sch01]) which has become an
important property of modern SANs over the last decade to efficiently face the increasing
system sizes and huge amounts of stored data making any downtime or interruption of
service intolerable, and therefore crucial for a company. Hence, to keep the system at high
operation, the costs for restoring an efficient data layout after any change in the set of disks
should be minimized. To guarantee this, a data placement scheme should be adaptive, that
is, in case of changes, it should guarantee to redistribute only a minimum amount of data
while after redistribution, a fair data layout should again be achieved. This task is already
of great challenge in a non-fault-tolerant and homogeneous setting considering uniform
capacities only, but in a heterogeneous environment, this challenge further increases, and
finally, if redundancy must furthermore be taken into account, this task turns to be pretty
hard. Nevertheless, depending on the inherent properties of randomization, this demand
can be tackled more conveniently by a randomized strategy. A good introduction to the
design and analysis of a broad bouquet of randomized algorithms is given in [MR95,
HZ05]. Furthermore, Scheideler [Sch00] presents several probability methods that can
be used efficiently for different types of problems, like e.g. routing in networks, job
scheduling, coloring hypergraphs and data distribution processes.
25
Outline of this Chapter
For what follows, we investigate and analyze novel randomized replication strategies un-
der fault-tolerance consideration, and the major problem on what we are interested in this
chapter is to always find a fair load balancing for storage networks that may consist of ar-
bitrary disk capacities. Obviously, this is a pretty challenging task because, compared to
usual randomized placement algorithms which allocate all data items independently from
each other, regarding the redundancy condition at any point in time induces dependen-
cies on the allocation of any two identical copies, thus, making the allocations no longer
independent from each other. For dynamic systems that are often allowed to undergo
changes in the number of connected devices caused by the addition or removal of accessi-
ble disks due to maintenance, capacity extensions or disk failures, this problem becomes
even harder.
Since companies run static as well as dynamic systems, we divide our observations into
two parts. At first, we consider static storage environments only, while the second part
also tolerates dynamic system behavior. This implies that, in the first part, we can restrict
our analysis on the data layout, i.e. on the load of the disks after allocation, whereas in
the second part, we also have to consider the replacement of data blocks in case of disk
additions or removals.
In more detail, for the static scenario, we analyze strategies that are closely related to
those applied for the classical resource allocation problem. In that problem, a system
is considered in which for each arriving data block a scheduling algorithm chooses the
location of the block on-line and uniformly at random from a fixed number of identical
resources (e.g. disks of same size). Since in this scenario, both the resources and the
data blocks are supposed to be uniform, and additionally, the blocks are considered to
be pairwise independent (i.e. no redundancy is assumed), near-optimal allocations can
be obtained by using algorithms which base on the widely known Balls-into-Bins model,
a simple model that has become very prominent in the load balancing community (see
e.g. [JK77, RS98, ABKU00, BCSV00, Sch00]). In short words, in a balls-into-bins
model, one sequentially allocates a set of mindependent balls (representing tasks, jobs,
data blocks,...) at random to a set of nbins (servers, processors, disks, ...) such that
the maximum number of balls in any bin is minimized. In the recent years, many use-
ful and efficient results have been obtained by this simple scheme, most of which hold
with high probability1. However, unfortunately, nearly all of the known results only con-
sider uniform capacity distributions, and more important, none of the results covers the
redundancy condition we address in this work (we return to the balls-into-bins approach
in more detail in Section 3.4).
1We denote an event Ato occur with high probability (w.h.p.) if Pr[A]1n`for an arbitrarily chosen
constant `1.
26 Random Allocation of Data Copies
The dynamic scenario is significantly harder as to maintain fairness, any given change
in the capacities of the disks caused by disk additions or removals implies some redis-
tributions on the stored data blocks. Thus, to avoid any intermediate state in which at
least one of the given requirements is violated, the applied placement scheme is forced to
re-establish the fair and redundant data layout after any step in the redistribution phase.
3.1 Preliminary Section
Summarizing the descriptions in the previous section, we are interested in finding effi-
cient randomized algorithms for some sort of redundant resource allocation problem for
arbitrary capacities under either static or dynamic conditions. This section is dedicated
to give useful abstractions for addressing this problem in a theoretical manner. First, we
introduce a basic model of the considered storage networks and then define essential at-
tributes on which we concentrate in the subsequent sections to evaluate the resulting load
balancing after the random data allocation. That is, as we are mainly interested in the
layout quality of the allocations throughout the whole chapter, we solely focus on the
insertion of data blocks and omit the deletion of already stored blocks. In addition, we
only consider a finite number of data blocks to be allocated, thus, we restrict ourselves to
afinite allocation problem what is different to an infinite allocation model which is more
suitable for problems such as job allocation on processors or video-on-demand scenarios.
(Moreover, for the rest of this chapter, we use the terms bins, disks, nodes interchangeably,
and this also holds for balls, data blocks and data items.)
3.1.1 The Basic Model
First of all, for the following descriptions, we define the set [c]:={1,...,c}for any cN.
Then, for all observed storage systems, we consider arbitrary but fixed numbers M,NN
sufficiently large such that the set V= [N]denotes the address space of all possible disks
(bins, nodes), and U= [M]is the address space of all possible balls (data items) in the
system. Given these sets, we suppose that, at any time, mMis the number of balls
currently in the system, and nNalways denotes the current number of accessible disks.
Naturally, as storage networks consist of a collection of hard disks each covering of a
huge number of data blocks, we assume mn, and since our distribution schemes pay
no attention to the content of any given data block, this also implies to consider the balls
as uniform, that is, their weights are normalized, i.e. each ball jUhas weight wj=1.
Furthermore, due to redundancy requirements, let each data item be represented in the
system by r[n]identical copies, with rcalled the replication parameter (for a non-
redundant system, we set r=1). Since we assume rto be fixed for all data items and we
always consider mMballs currently in the system, let [k]be the set of original, pairwise
3.1.1 The Basic Model 27
distinct data items, for k=m
rN. Generally, any given I/O operation on the data blocks
(i.e. read or write access) is referred to as data access or data request.
For a formal description of our storage network model, we refer to some definitions
from the thesis of Kay Salzwedel [Sal04].
Definition 3.1. (Storage Network) Astorage network S(n,C[n]) is a collection of n N
accessible disks [n]V. Each disk i is characterized by a capacity CiNdescribing
the total number of (uniform) balls the disk can store. Let C[n] = (C1,C2,...,Cn), and the
total capacity of S(n,C[n]) is Ctotal =1inCi.
A storage network S(n,C[n]) is called homogeneous (or uniform) if all disk have the
same capacity, that is, for any two disks 1i,jn, it holds that Ci=Cj. Otherwise, the
system is called heterogeneous (or non-uniform).
In Figure 3.1, we depict an arbitrary storage network consisting of a number nof disks
connected by some dedicated network. As further given in the picture, the disks may
feature different characteristics, like e.g. the capacity Cior the bandwidth Bistating the
average transfer time when accessing the disk by an I/O operation. For the purpose of il-
lustration, we will restrict ourselves to disk capacities only throughout this thesis because
disk utilization is more self-evident than throughput utilization.
Network
D1
D2
D3D4D5
Dn
C1
C2
C3C4
C5
Cn
B1
B2
B3B4
B5
Bn
Figure 3.1: A storage network with non-uniform characteristics (capacity, bandwidth).
Since throughout this chapter, we consider random allocations on the disks of a storage
network, S(n,C[n]) is considered as the given sample space for allocation on which we
can model an appropriate probability distribution by the following shares of the disks.
28 Random Allocation of Data Copies
Definition 3.2. (Share) Given a storage network S(n,C[n]), the share of a disk i is defined
by its relative capacity
ci=Ci
Ctotal [0,1].
Thus, 1inci=1. In other words, the share of each disk currently in the system is a
normalized measure of the heterogeneity in a storage network. Moreover, given the share
tuple ¯c= (c1,c2,...,cn)for the disks in S(n,C[n]), obviously, (S(n,C[n]),¯c)defines an
appropriate probability space for random allocation in which the sampling of the disks
is supposed to be done independently at random (at a first glance, the probability space
(S(n,C[n]),¯c)defines a base for our investigations).
Given the share tuple ¯c= (c1,c2,...,cn), we can now define our random allocation
problem for a storage network S(n,C[n]).
Definition 3.3. (Redundant Allocation Problem) Let S(n,C[n]) be a storage network
with a set [n]of accessible disks and capacities given by the share tuple ¯c= (c1,c2,...,cn).
Furthermore, consider a set [k]U of data items and a replication parameter r [n]that
holds for every d [k].
Depending on r, let (f1,f2,..., fr)be an r-tuple of allocation functions fi:[k][n],
i[r], satisfying the following condition:
i,j[r],i,j,d[k]:fi(d),fj(d).
As for each d [k],(f1,f2,..., fr)generates an allocation tuple (f1(d),f2(d),..., fr(d)),
we consider the set Mr
k={(f1(d),f2(d),..., fr(d))|d[k]}[n]rof allocation tuples.
Then, let F = ( f1,f2,..., fr)be a redundant allocation function defined as
F:[k]Mr
k,F(d) = (f1(d),f2(d),..., fr(d)) d[k].
Since |Mr
k|=k, we obtain a total of m =k·r different allocations (balls) in the system.
Then, S(n,C[n]), the share tuple ¯c, and the random allocation function F define a redun-
dant allocation problem on the probability space (S(n,C[n]),¯c). The allocation problem
is finite as Mr
kis finite.
Again, since we always assume mMballs in the system, we fix m=k·rin the
following. Additionally, as considering the balls to be uniform, we introduce the load `i
of a given disk ias the number of data blocks allocated to it.
Now, given the definition of the redundant allocation problem, we can formulate a
suitable definition for a redundant data distribution strategy as we require in the following
descriptions.
Definition 3.4. (Redundant Data Distribution Strategy) Aredundant data distribution
strategy for the redundant allocation problem defines a disjoint mapping F :[k]Mr
kfor
any number m Ctotal of blocks such that for each disk i the load `iCi.
3.1.1 The Basic Model 29
To address the fairness condition on the disks in S(n,C[n]) in a formal manner, we
apply a set of binomially distributed random variables X1,X2,...,Xnsuch that for each
disk i[n], let Xidenote the load `iof disk i. Then, for a total of mdata blocks in the
system and any given integer kN, it holds: Pr[Xi=k] = bm,ci(k):=m
kck
i(1ci)mk
and E[Xi] = m·ciis the expected value.
With this random variables, we are now able to measure the fairness property of the
defined redundant data placement scheme (sometimes also referred to as space balance)
(c.f. Section 1.2).
Definition 3.5. (Fairness) For a redundant allocation problem, let X1,X2,...,Xnbe bino-
mially distributed random variables such that Xidenotes the load of disk i [n]. Then, a
redundant data distribution strategy is fair if for all disks i [n]
`i=E[Xi] = r·k·ci.
Moreover, since we require that we come close to optimal, the fairness condition above is
called tight if the expected load of disk i is in range
`i= (1±ε)·r·k·ci
w.h.p., where ε>0can be made arbitrarily small.
Recall that the fairness property ensures the utilization of the storage network, and as
each disk’s charging level corresponds to its share, a tight fair data distribution enables
the system to get filled up almost completely.
Up to this point, the given definitions enable us to characterize a redundant allocation
only for a static storage network, that is, the configuration of the storage network does
not change over time; the only dynamic aspect that comes into play with respect to such
systems is given if the number mof blocks in the system changes.
However, this is not yet sufficient as we are further interested in an appropriate formula-
tion for a dynamic setting in which a given configuration of current disks may change due
to altering space requirements. Thus, in the following, we formalize the adaptivity condi-
tion for a redundant allocation strategy, that is, we enable a storage network to increase or
decrease in the number of disks. The major algorithmic challenge coming along with this
additional property is to ensure both the fairness and redundancy condition at any time.
A natural measure for the efficiency of a distribution strategy in terms of adaptivity is
the number of blocks that have to be redistributed to ensure the fairness property in case
of changes in the configuration. This measure determines how long the storage system
works in a degraded mode and in which performance losings must be considered. In the
following, we call operations that induce a replacement of blocks change operations.
30 Random Allocation of Data Copies
As usual, we apply competitive analysis (see Section 3.1.3.5 for details) which is a well-
known method to model and evaluate this dynamic process that we can also comprehend
as an online algorithm. In short words, competitive analysis compares an online algorithm
with an optimal offline strategy. In particular, for our considerations, we would require an
optimal algorithm to ensure fairness after each change operation ωin the system implying
that, for instance, if ωis triggered by the addition of a disk, any algorithm has to replace at
least cn+1·mblocks. We call an adaptive distribution strategy c-competitive if it requires
at most ctimes the number of replacements performed by the optimal offline algorithm
for any sequence of change operations. Now, we can formulate a useful definition for an
adaptive redundant data placement strategy as we consider in the following sections.
Definition 3.6. (Adaptive Redundant Data Distribution Strategy) Consider a storage
network S(n,C[n]) and any change operation ω. We call a redundant data distribution
strategy adaptive for S(n,C[n]) if it is able to restore the fairness and redundancy property
on S(n,C[n]) after any change in the number of current disks n induced by ω.
It is c-competitive if the number of blocks necessary to restore fairness is at most c
times the number of blocks an optimal strategy has to replace, on expectation.
Obviously, to appropriately model and analyze the dynamic case observed in Sec-
tion 3.5, we suppose an extended share tuple ¯c= (c1,c2,...,cN)for Nnand a given
storage network S(n,C[n]). Furthermore then, all disks inot currently in the network are
assigned are share ci=0. Hence, we consider an increase in the number of disks up to an
upper bound N. Then, given two share tuples ¯c= (c1,c2,...,cN)and ¯c0= (c0
1,c0
2,...,c0
N),
for any data distribution strategy to be adaptive, it must be able to handle any capacity
change from ¯cto ¯c0. Certainly, it is easy to see that every storage strategy that wants to
preserve fairness has to replace at least a
i:ci>c0
i
(cic0
i) = 1
2
i|cic0
i|
fraction of the data in the system.
Regarding all given definitions, we can, roughly speaking, formulate the redundant
allocation problem, on that we concentrate throughout this chapter as follows:
Is it always possible to find an efficient, fair, redundant (and perhaps) adaptive random
data distribution strategy for arbitrary capacities such that, at any time, best performance
provided by parallel access as well as fewest resource consumption can be guaranteed ?
Before, in the rest of this chapter, we concentrate on the description and analysis of our
novel data distribution strategies that address the given problem appropriately, either for
a static or dynamic setting, we first give some preliminaries in the next section which we
require for the modeling and the analysis of those allocation schemes.
3.1.2 On the Practical Realization of the Allocation Function F31
3.1.2 On the Practical Realization of the Allocation Function F
Obviously, finding a positive answer to the formulation of the redundant allocation prob-
lem can be directly translated into finding an appropriate redundant data distribution strat-
egy F= ( f1,f2,..., fr)for either a static or dynamic system. Thus, for the rest of this
chapter, we aim at finding a suitable translation of the theoretic definition given in the
previous section into some practical realizations of Ffor any given storage system that
can easily be implemented inside a respective application. In particular, this means that
such realization always has to ensure that
any two functions fi,fj,i,j, in the r-tuple (f1,f2,..., fr)allocate a given data
block d[k]on distinct bins `,o[n],`,o,
each of the functions fiin (f1,f2,..., fr)is efficiently computable, and
for a given number mCtotal of blocks, (f1,f2,..., fr)yields a fair data layout up
to some factor (1±ε)for each disk.
Clearly, by allocating data blocks in a given storage network S(n,C[n]), the function
Falso implements the dictionary functions search,insert and delete for the data blocks
on S(n,C[n]) (again, we solely concentrate our analysis on the data layout only, that is,
we do never consider the deletion of blocks). With respect to a technical implementation
of F, at a first glance, it seems to be a good idea to realize the functions fiin Fby r
different hash functions h1,h2,...,hrbecause hash functions can be efficiently computed,
and more importantly, by using hash functions, the expected complexity of the dictionary
functions can be reduced to O(1)(in Section 3.1.2.1, we go into further details on hash
functions).
However, the proposed idea imposes some severe problems. First of all, the usage of
rindependent hash functions cannot adequately ensure a perfectly disjoint allocation of
all identical copies, but what we strictly require, because all of them hash each data block
into the same range [n]of disks. Second, depending on r, the number of applied hash
functions may be too small to achieve fairness as defined in the previous section. At last,
all of those hash functions must be kept in memory, thus, if ris supposed to be large,
this implies a waste of memory, what is dismal. Hence, it is easy to see that, since all
our introduced strategies are randomized algorithms, this makes a separate allocation of
identical copies become a pretty hard challenge. Moreover importantly, the redundancy
property must always be ensured even in dynamic storage environments.
Therefore, as we show by the descriptions in this chapter, there is considerably more
work to do to realize Ffor a given storage environment. Recall that each function Fhas to
be chosen such that the elements from the set Mr
kare allocated to S(n,C[n]) in a way that,
on expectation, fairness can almost perfectly be achieved. Nevertheless, it is a good idea
32 Random Allocation of Data Copies
to consider hashing for allocation, and as we show in the following description, there are
useful universal hash functions, called k-universal hash functions for some parameter k,
that ensure a suitable distribution of the data blocks on the set of disks and which underly
all of the strategies introduced in this chapter.
We start our descriptions with introducing a basic design paradigm for randomized
online algorithms that is applied by the functions and strategies presented in the following.
After that, we briefly sketch universal hashing (because it belongs to folklore) before we
introduce k-universal hash functions. At last, we prove that for both static and dynamic
storage environments, O(logn)-universal hash functions are a good choice to obtain a
useful dispersal of blocks w.h.p.
Design Paradigm: Foiling the Adversary. To find a suitable realization for all ran-
dom allocation functions Fin this chapter, we apply the well-known design paradigm
for randomized algorithms, the method of foiling the adversary, also called the method of
avoiding the worst-case problem instance (cf e.g. [HZ05]). The paradigm considers the
design of an algorithm as a game between two players: a designer who tries to design an
efficient algorithm for a given problem, and an adversary who for any designed algorithm
tries to construct an input on which the algorithm fares poorly, that is, the output is not
efficient or incorrect. Typically, in the analysis of deterministic algorithms, the adver-
sary argument establishes a lower bound on the worst-case complexity, and, in general,
only those algorithms that run efficiently on every given input are considered to be effi-
cient. As the designer has to present his algorithm at first in the game, the adversary has
some advantage since, knowing the algorithm, he can find a hard problem instance for
the algorithm. This situation can essentially change if the designer designs a randomized
algorithm A, that is, the adversary does not know which of the possible runs of the algo-
rithm Awill be chosen at random. Therefore, the paradigm is very suitable for problems
for which
1. every deterministic algorithm (strategy) has some worst-case instances at which the
algorithm is not able to efficiently compute the correct (or optimal) output,
2. but there exists a class of deterministic algorithms such that, for every problem
instance, most algorithms of this class efficiently compute the correct result.
If (2.) holds, then for any given input instance, one can pick an algorithm from the class
at random and expect to get the correct result efficiently with reasonable probability. As
we will see, the following universal hashing approach describes a successful application
of this paradigm.
3.1.2 On the Practical Realization of the Allocation Function F33
3.1.2.1 Excursus: Universal Hashing
Hashing is a well-known method of dynamic data management. In particular, given a
data structure T= [N], called hash table, which consists of NNaddressable slots to
which direct access is provided2as well as some data item addressed by its key k, hashing
efficiently performs the dictionary functions search(T,k),insert(T,k) and delete(T,k). The
key kis taken from a universe Uof all possible keys that may be a finite subset of N,
or U=N. The major goal of hashing then is to perform the dictionary operations in
O(1)time, what is different to other dynamic data structures such as AVL-trees or B-trees
performing these operations in logarithmic time in the worst case.
We furthermore consider some certain application by which we are usually given a
finite input set SUof size MN, and the aim is to save Sin Tsuch that the directory
operations can be provided as efficient as possible on T. It is easy to see that, to achieve
the desired efficiency for the dictionary operations, the task now is to determine a mapping
h:UT, called a hash function, on which we pose the following requirements:
hcan be efficiently computed,
hmaps every input set SUin a ’well-dispersed’ fashion to Tin a way that, for
all i[N], the cardinality of the set T(i) = {dS|h(d) = i}is in O|S|
|T|.
Unfortunately, the main problem arising with respect to the mapping his that we cannot
influence the choice of Sas well as we do not have any preliminary information about
S; the set Sis given completely by the application (user), thus, we are not able to predict
anything about it. Moreover, one can show that for every hash function h, there exist
"bad" sets S, all of whose elements are mapped by hto some slot iof T(for instance,
consider Sto be a subset of the set Uh,i={dU|h(d) = i}(see [HZ05])). On the other
hand, for each given set S, there exist useful hash functions that appropriately distribute
the keys of Samong the slots of T.
Summarizing this, we have an exemplary situation calling for applying the ’foiling
the adversary’ paradigm. For every choice of h, the adversary can construct arbitrarily
bad inputs S, but on the other hand, there are many hash functions that assure the best
possible distribution for most inputs S. Following the paradigm, the only possibility then
to distribute all (and also bad) input sets evenly among the slots of Tis to choose a hash
function huniformly at random from a suitable set Hof hash functions. In other words,
unlike assuming the input set Sbeing chosen uniformly at random from the universe U, we
rather suppose a class Hof hash functions in which these functions are evenly distributed.
Obviously, for most input sets, on average, this sampling guarantees an even distribution
of the keys in Tw.h.p. The considered suitable’ classes Hare widely-known as universal
classes of hash functions which are defined as follows:
2One is able to examine any slot of Tin time O(1).
34 Random Allocation of Data Copies
Definition 3.7. (Universal Hash Functions): Let H be a finite set of hash functions from
U to T (|T|=N). The set H is called universal if for each pair of elements x,yU, x ,y,
|{hH|h(x) = h(y)}| |H|
N
holds, i.e. at most every N-th hash function from H maps x and y to the same slot of T.
That is, for any fixed pair of keys x,yU,x,y, it holds the following condition
PrH[h(x) = h(y)] 1
N.(3.1)
As a major result, it can be shown that for an arbitrary finite set SUof keys and a
universal class Hof hash functions from Uto T, the expected number of keys in any slot
of Tis smaller than 1+|S|
|T|[HZ05]. (We skip going into further detail at this point because
universal hash function are subject to any basic course on algorithms and can furthermore
be found in e.g [HZ05, CLRS01].)
As we can see, the concept of universal hash functions provides a useful mechanism
to obtain an appropriate distribution of the data items among the storage devices. It only
remains to show that such classes exist at all, and if so, to find a universal class Hof
hash functions that is suitable for our requirements. The first question can be answered
positively by the following lemma.
Lemma 3.8. Let U be a finite set. The class A ={g|g:UT}of all functions from U
to T is universal.
We omit the proof here (but which can be found in e.g. [HZ05]). Now, we know that
classes of universal hash functions exist but, unfortunately, many of them suffer from
huge size. For instance, the number of all functions in Ais3N|U|what, obviously, makes
Aimpossible for real applications since
1. Ais too large4to be able to perform a random choice of an hfrom Aefficiently, and
2. most functions from Acan be viewed as random mappings in the sense5that they do
not have any essential shorter description than their full table representation. Such
functions are not efficiently computable.
Instead, we would like to have a universal class Hof hash functions such that
(i) His small (at most polynomial in |U|), and
3One has to assign one of the Nvalues from Tto each element of U.
4Observe that Uis already very large, thus, N|U|is usually substantially larger than the number of protons
in the known universe.
5c.f. [Hro03].
3.1.2 On the Practical Realization of the Allocation Function F35
(ii) every hash function from His efficiently computable.
The most prominent universal hash functions satisfying the formulated wishes are the
linear hash functions ha,b:UTdefined by
ha,b(x) = ((ax +b)mod p)mod |T|,a,bU,pprime.
for every xU. The corresponding universal class of hash functions is
Hp
lin ={ha,b|a[p1]and b{0,1,...,p1}}
of size |Hp
lin|=p·(p1). Unfortunately, as we can see from the definition of ha,b,
linear hash functions guarantee the desired properties only for any two elements x,yU
(see [HZ05]). Instead, for our further analysis in this thesis, we require what we call
k-independence, that is, for any given set of kpairwise distinct keys from U, the hash
function guarantees that Condition 3.1 holds also (independently) for all kelements. This
demand is satisfied by the following k-universal hash functions (the following is taken
from [MadH06]; further references are [CW77, DKM+88]) (recall [M]and [N]to be the
address spaces of the balls and bins, respectively).
Definition 3.9. (k-Universal Hash Functions): Consider H A={g|g:[M][N]},
kN. H is a k-universal class of hash functions if, for all `k, all pairwise distinct
x1,...,x`[M]and all j1,..., j`[N], it holds that
Pr[h(xi) = ji,i[`]] 2
N`
To simplify our further analysis, we weakened the given bound for the probability of
the random function hthat, if considered to be an arbitrary random function from the set
A, would be 1
N`. However, for our assumptions, hHis sufficiently random, and if we
merely consider `kinput values, hnearly behaves like a real random function.
In what follows, we prove that some special classes of polynomials are universal hash
functions. Let Mbe an arbitrary but fixed prime. Then, we can define the following:
Definition 3.10. Suppose d N. For a tuple ¯a= (ad,...,a0)[M]d+1of coefficients, let
h¯a(x):= d
i=0
ai·xi!mod M!mod N,for x [M].
Furthermore, we define Hd
N:={h¯a:[M][N]|¯a[M]d+1}.
That is, the class Hd
Nconsists of all polynomials of degree-bound dtaken from the
finite field ZN(for every prime p, there exists a finite field Zpwith pelements (see e.g.
[HP03, Rom96]) whose images have finally to be mapped into the set [N].
36 Random Allocation of Data Copies
Theorem 3.11. The class Hd
Nhas the following properties:
1. each function h¯aHd
Ncan be evaluated in time O(d), and
2. Hd
Nis (d+1)-universal.
Proof. Case (1) is obvious.
Case (2): For the tuple ¯a[M]d+1of coefficients, let f¯a:[M][M]be the polynomial
f¯a(x) = d
i=0
ai·xi!mod M.
Since Mis a prime, f¯ais a polynomial of degree-bound dover the field ZM, and as we
know, in a field, each polynomial of degree-bound dis uniquely determined by d+1
distinct points x0,...,xd.
Now, consider an arbitrary but fixed `, 1 `d+1. Let x1,...,x`[M]be `pairwise
distinct input values for f¯a. Furthermore, let j1,..., j`[N]be possible output values
’modulo N’, and let
A`:={h¯aHd
N|h¯a(xi) = ji,i[`]}.
Then, it holds:
h¯aA`h¯a=f¯amod N,and f¯a(xi){ji+α·N|α[M
N]}
| {z }
=:Bi
,i[`]
Since the f¯aare polynomials of degree-bound dover a finite field ZM, each polynomial is
uniquely determined by d+1 distinct values implying that the coefficients ¯aof f¯aare also
uniquely determined. Furthermore, since we already fixed `points of each polynomial
in the beginning by choosing the `point-values xiand ji, it remains to determine the
residual d`+1 distinct values to determine the polynomial uniquely. Thus, for each
tuple (k1,...,k`)B1×...×B`with f¯a(xi) = ki, we can still determine further [M]d`+1
distinct values because the domain of f¯ais [M]and d`+1 points are still left to choose.
From this, it follows that the number of hash functions that map to the chosen values ji
is
|A`|=|B0|·...·|Bd|·[M]d`+1M
N`
·[M]d`+1.
Moreover, since |Hd
N|=Md+1holds (because each tuple ¯a[M]d+1defines a different
polynomial over the field ZM), if follows:
Pr[h¯a(xi) = ji,i[`]] = |A`|
Md+1M
N`·[M]d`+1
Md+12
N`
.
Thus, the class Hd
Nis (d+1)-universal.
3.1.2 On the Practical Realization of the Allocation Function F37
3.1.2.2 Analyzing the Distribution Quality of the Hash Functions
Given the above definition of k-universal hash functions, we can now analyze the distri-
bution quality of the hash functions that we apply for our strategies. In particular, for a
storage network S(n,C[n]) with arbitrary capacities, a set P[M]of data blocks (note
that, for our analysis, we neglect redundancy, i.e. all data blocks are considered pairwise
distinct), and a hash function hchosen uniformly at random from an O(log n)-universal
class of hash functions HA={g|g:[P][n]}, we can state the following theorem.
Theorem 3.12. Consider a storage network S(n,C[n]) with arbitrary capacities. Then,
for every set P of m n data blocks, the hash function h provides an even distribution of
the blocks among the set [n]of bins, w.h.p
Proof. Before we start, we introduce the following well-known inequality for binomial
coefficients which we use for the analysis in Lemma 3.14.
Lemma 3.13. For k,` N,0`k, it holds
k
`k·e
``
Proof.
k
`=k·(k1)···(k`+1)
`!k`
`!=k`
``·``
`!k`
``·e`,since e`=
i=0
`i
i!``
`!.
Now, for every relative capacity value ci[0,1]in S(n,C[n]), fix some arbitrary bin
i[n]of size ciand proceed as follows.
Assume the input set Pto be of size m=αi·nlog nfor some appropriate constant
αiN1such that Pcan be partitioned into equal sized batches P={B1,B2,...,Bαi}.
To determine αi, let |Bk|=˜c·log n
cifor all k[αi]where ˜cis some constant which is chosen
such that ˜c
ci=nand that can easily be determined if we assume ci=1
nfor a uniform
network, and for non-uniform capacities, we consider ci=c0·1
nfor an appropriate constant
c00.
Now, consider each batch Bkto be allocated independently to the nbins by applying
the hash function h. Then, we can state the following lemma concerning the distribution
of the elements of Bk.
Lemma 3.14. Let S Bk. Then, for the load `iof some bin i [n], the following holds
1. Pr[`ic·log n]2e
cc·log nfor an appropriate constant c
38 Random Allocation of Data Copies
2. For all bins i [n]of size ci, it holds: Pr[maxi{`i} c·log n]1
n`for some
arbitrary constant `N1and an appropriate constant c
3. For all bins i [n]of size ci, it holds: E[maxi{`i}]O(log n)
Proof. Case (1)
Pr[`ic·log n] = Pr[AS,|A|=c·log n,dA:h(d) = i]
=Pr
[
AS
|A|=c·log n
h(d) = idA
()
AS
|A|=c·log n
Pr[h(d) = idA]
(∗∗)
nlog n
c·log n·2
nc·log n
(∗∗∗)
nlog n·e
c·log n·2
nc·log n
=2e
cc·log n
.
The estimate ()holds by Boole’s inequality (see Section 3.1.3.1), (∗∗)holds since His
an O(log n)-universal class of hash functions, and ()holds according to Lemma 3.13.
Case (2)
Now, we show that for all bins i[n]of size ci, the result from Case (1) holds w.h.p.
That is, instead of focusing on a single bin only, we now consider all bins of size ciand
show that there is no bin having a higher load w.h.p. We denote by maxi{`i}the maximum
load of the bins we concentrate on. Furthermore, let Aidenote the event that the load of
any bin iis `ic·log n. Then,
Pr[max
i{`i}c·log n] = Pr[i[n]:`ic·log n]
=Pr[A1A2...An]
()
n
i=1
Pr[Ai] = n·2e
cc·log n
.
Again, estimate ()holds by Boole’s inequality. Now, let c=4e. Then, we obtain the
following.
n·2e
cc·log n
n·1
2c·log n
=n·1
nc
1
n`
for all `c1.
3.1.3 Further Preliminaries 39
Case (3)
Given the results above, we can now determine the expected maximum load of the bins
i[n]of size ci.
E[maxi{`i}]Pr[max
i{`i}<c·log n]·c·log n+Pr[max
i{`i}c·log n]·nlog n
11
n`·c·log n+nlog n
n`
c·log n+nlog n
n`(c+1)·log nO(log n).
Since the results from Lemma 3.14 hold for any partition P=αi·nlog nfor some
constant αithat is determined by some capacity ci, the theorem follows.
3.1.3 Further Preliminaries
In the following, we sketch some useful mathematical tools taken from probability theory
and combinatorial optimization which help to estimate appropriate bounds on the devia-
tion of any disk’s load from its expectation.
3.1.3.1 Some Facts from Probability Theory
For our analysis, we use basic results and definitions from fundamental probability theory,
like e.g. conditional probability, random variables or expectation. In the following, we
introduce some tools which are useful for the analysis of our strategies.
Deviation Bounds. To bound the deviation of the load on a given disk from its expecta-
tion, we apply the following version of the prominent Chernoff Bounds found in [HR90]
that give the tail estimates based on expectation for sums of independent binary random
variables (we omit the proofs here that are given in detail in e.g. in [MR95]).
Lemma 3.15. (Chernoff Bound) Consider any set of n independent binary random vari-
ables X1,...,Xn. Let X =n
i=1Xiand µ=E[X]. Then, it holds for all δ0that
Pr[X(1+δ)µ]e-min[δ2,δ]·µ
3
and for all 0δ1that
Pr[X(1δ)µ]e-δ2·µ
2.
40 Random Allocation of Data Copies
An Independent Bounded Difference Inequality. The following theorem presents tail
estimates that are based on bounds on the difference of functions over independent ran-
dom variables. The given results can be found in [McD89, Sch00, Ros06].
Theorem 3.16. Let X= (X1,...,Xn)be a vector of independent random variables with
Xitaking values in a set Aifor each i. Suppose that the real-valued function f defined on
A1×...×Ansatisfies
|f(x)f(x0)|ci
whenever the vectors xand x0differ only in the ith coordinate. Let µbe the expected value
of f (x). Then, for any d 0,
Pr[f(X)µd]e2d2/n
i=1c2
i
and
Pr[f(X)µd]e2d2/n
i=1c2
i
The Inclusion-Exclusion-Principle on Events. Consider any arbitrary collection of n
sets A1,...,An. Then, the inclusion-exclusion principle is given as
n
[
i=1
Ai
=
n
k=1
(1)k+1
i1<i2<...<ik
k
\
j=1
Ai j
.
If then A1,...,Anis considered as a collection of events, the inclusion-exclusion principle
implies
Pr"n
[
i=1
Ai#=
n
k=1
(1)k+1
i1<i2<...<ik"k
\
j=1
Ai j#.(3.2)
In cases where it is too difficult to evaluate Equation 3.2 exactly, one can use Boole’s
inequalities to find suitable approximations:
Pr"n
[
i=1
Ai#
n
i=1
Pr[Ai]
and
Pr"n
[
i=1
Ai#
n
i=1
Pr[Ai]
1i<jn
Pr[AiAj].
3.1.3 Further Preliminaries 41
3.1.3.2 Some Basics from Linear Programming and Farkas’ Lemma
Linear programming involves the optimization of a linear function under linear inequality
constraints. Since linear programming covers a wide field in mathematics, here, we give
only some brief definitions which are useful for the understanding of Farkas’ Lemma
[Roc70, p.200] (for more details on linear programming, see e.g. [PS98, Ros00])
Definition 3.17. (Basics Facts:) Alinear programming (LP) problem is an optimization
problem that can be written
maximize: cx
subject to: Ax b(3.3)
where A = (ai j)i=1,...,q,j=1,...,nis a given q ×n matrix, c Rnis a given row vector of
length n, and b Rqis a given column vector of length q (note that in (3.3), the ’-
relation is defined on vectors, i.e. for each component in the vectors). Each of the q
inequalities in Ax b is a constraint, and the decision variables of the Problem 3.3 are
represented by the column vector x of length n.
Afeasible solution is a vector x satisfying Ax b. The feasible region is the subset
of all feasible solutions in Rn. If no feasible solution exists (so that the feasible region is
empty), the LP problem is infeasible; otherwise it is feasible.
For a feasible solution x, the function z =cx is the objective function and cx the ob-
jective value of x. If the feasible solution is also the maximum, the feasible solution xis
an optimal solution. If the optimal value can be made arbitrarily large over the feasible
region, the LP problem is unbounded. Otherwise it is bounded.
Since every LP problem, referred to as a primal problem, can be converted into a
dual problem providing an upper bound to the optimal value of the primal problem, and
additionally, every feasible solution for an LP gives a bound on the optimal value of the
objective function of its dual. The weak duality theorem states that the objective function
value of the dual at any feasible solution is always greater than or equal to the objective
function value of the primal at any feasible solution. The strong duality theorem states
that if the primal has an optimal solution, x, then the dual also has an optimal solution,
y, such that cTx=bTy.
The duality theorems given above base on the following Farkas’ Lemma which is an
example of a theorem of the alternative; a theorem stating that of two systems, one or
the other has a solution, but not both. Thus, it can be applied to show the existence (and
uniqueness) of solutions to linear models and stationary distributions in finite Markov
chains.
42 Random Allocation of Data Copies
Lemma 3.18. (Farkas’ Lemma) Let A be a real-valued q ×n matrix and x and b are
vectors. Then, the system:
Ax =b,x0
has no solution if and only if the system
ATy0,bTy<0
has a solution, where y is a vector of length q.
3.1.3.3 Hypergraphs
The analysis in the next section depends on structures described by hypergraphs for which
we give the major properties we require by the following definition.
Definition 3.19. (Hypergraphs) Let V be a finite set of nodes. Further, let E Si2Vi
be a finite subset of the set of tuples over V. The set E is called the set of edges. An
element e = (v1,...,vj)E is said to be self-loop-free if vi,vi0for all 1ii0j. If
all elements are self-loop-free, the pair G = (V,E)is called a hypergraph. A hypergraph
G= (v,E)is d-uniform if E Vd. An edge e E has size dif e Vd.
Let e = (v1,...,vj)E be an edge of a hypergraph G and let v V be a node of G.
The node v is said to be incident to e if there is an 1ij with vi=v. An edge e is
incident to a node v if v is incident to e. If a node v is incident to a number i of edges, the
node v has degree i (deg(v) = i). If v ,v0are nodes incident to the same edge e E, the
nodes v and v0are said to be adjacent.
For j 1, a sequence of nodes v1,...,vjis called a path connecting v1and vjif viis
adjacent to vi+1for each 1ij1. The length of a path v1,...,vjis j 1. Two nodes
v and w are said to have distance i (dist(v,w) = i) if the shortest path connecting v and
w has length i. If W V is a subset of the nodes, the distance between a node v V
and W is defined by minwWdist(v,w). Two paths v1,...,viand w1,...,wjare distinct if
{v1,...,vi}{w1,...,wj}=/0.
3.1.3.4 Multisets
The following definition of multisets is well-known from set theory.
Definition 3.20. Amultiset M(E)is a pair (E,#E)where E is some underlying set and
#E:EN0is a function specifying the multiplicity of the elements in E. For each e E,
let µE(e):=#E(e)denote its multiplicity. Thus, µE(e) = 0for all e <E. Furthermore, the
cardinality of a multiset |M(E)|=eEµE(e).
Subtraction on multisets is defined as M(E0) = M(E)\e, thus, µE0(e):=µE(e)1if
µE(e)>0and µE0(e):=µE(e) = 0otherwise. The union of multisets is defined analo-
gously, i.e. µE0(e):=µE(e)+1.
3.2. The Redundant Allocation Problem 43
3.1.3.5 Competitive Analysis
In part two of this chapter, we analyze the introduced strategies by applying competitive
analysis which is a well-known method of evaluating online algorithms (see [BEY98]).
With respect to our dynamic data placement strategies, we apply competitive analysis
to evaluate their adaptivity. That is, for any change operation ω, we compare the number
of (re)-placements of data blocks performed by the observed strategy with the number
of (re)-placements performed by an optimal strategy such that all required conditions are
satisfied. To formalize our intension, we introduce the following definition.
Definition 3.21. (Competitiveness) Let S(n,C[n]) be a storage network, and let k be the
identifier of some storage device. Then, let L ={insert(S(n,C[n]),k),delete(S(n,C[n]),k)}
be the set of operations on S(n,C[n]) that cause changes in the number of connected stor-
age devices. Furthermore, let A denote a dynamic data placement strategy on S(n,C[n]).
Then, for any operation ωL, the competitive ratio compA(ω)of Aon ωis the number
compA(ω) = max(COpt(ω)
CA(ω),CA(ω)
COpt(ω)),
where COpt(ω)denotes the cost of an optimal algorithm for the instance ω, and CA(ω)
denotes the cost of the online strategy A.
Let c 1. Then, the storage strategy A is called c-competitive concerning operation ω
if, on expectation, compA(ω)c for all ωL.
Note that randomized algorithms must fulfill this bound with high probability, that is,
the number of redistributed data blocks implied by an operation ωshould be at most
`·(c·COpt(ω)) with probability at least 11
n`.
3.2 The Redundant Allocation Problem
Given all the formal definitions and required mathematical tools, we can now refine the
redundant allocation problem with respect to a rather probabilistic based formulation.
In particular, the following central problem of allocation describes the major allocation
problem underlying the design and analysis of all schemes we discuss in this chapter. In
other words, all algorithms we introduce in the following, whether adaptive or not, basi-
cally aim at finding a suitable solution for this allocation problem.
Central Problem of Allocation: Given a redundant allocation problem for a given
storage network S(n,C[n]), find an appropriate probability distribution P on the set
[n]of disks in either a static or dynamic setting that satisfies the redundancy and
fairness conditions for any number m Ctotal of data blocks.
44 Random Allocation of Data Copies
Clearly, if we ignore redundancy, we can simply set P=¯cbecause then, all blocks are
allocated independently at random according to P. For such settings, there exist numerous
strategies ensuring a fair allocation for static as well as for dynamic storage systems (note
that we omit dynamic systems in this section but return back to them in Section 3.5).
For a static setting, a fair data layout can easily be achieved by applying standard balls-
into-bins models. Again, given a storage network S(n,C[n]) and P=¯c, a balls-into-bins
allocation stores each data item (ball) d[k]independently at random to some bin i[n]
with probability ci. For non-uniform capacities, this yields a fair data layout up to some
factor (1±ε), whereas in a uniform setting, we can come close to optimal if one out of
d2 choices for each allocation is allowed (we return to standard balls-into-bins games
a little more precisely in Section 3.3.2).
However, whenever regarding redundancy, the observed problem becomes more chal-
lenging because, as we know from the realization of the function Fin Section 3.1.2, the
redundancy condition imposes high carefulness on the allocation functions, and more im-
portantly, for any ridentical copies, r1 allocations are no longer independent. This
implies a process in which for each of the functions f1,f2,..., frin F, respectively for a
given sequence of ridentical copies, we always have to find a sequence P= (P1,P2,...,Pr)
of probability distributions, where Piis valid for all allocations via function fionly.
Therefore, in what follows in this section, we concentrate on the investigation of appro-
priate sequences P= (P1,P2,...,Pr)subject to a fair data allocation. At first, we describe
a very intuitive balls-into-bins-like algorithm which aims at ensuring a redundant alloca-
tion on the share tuples of disks that remain valid in some particular allocation step (note
that for a data item d[k]and any step i[r], the random allocation of the i-th copy of d
to some disk corresponds to the allocation fi(d)of F).
Algorithm 1 Naive Balls-into-Bins Allocation
Require: Probability Space (S(n,C[n]),¯c), Set [k]of balls, r-tuple (f1,f2,..., fr)
Ensure: Redundant Allocation of m=k·rreplicated balls
1: for all d[k]do
2: To hold the r-th copy of d, choose disk fr(d) = ir[n]with probability cir
3: repeat
4: Set [n0]([n]\{ir}){Reduce sample space by chosen disk ir}
5: Set r0(r1){Reduce current number of copy}
6: for all i[n0]do
7: Set c0
i(Ci/j[n0]Cj){(Re-)assign disk probabilities (relative capacities)}
8: To hold the r0-th copy of d, choose disk fr0(d) = ir0[n0]with probability c0
ir0
9: until (r0=0)
Obviously, this approach is quite self-evident for obtaining a fair allocation under re-
dundancy conditions because it is simple, features low complexity and is closely related to
3.2. The Redundant Allocation Problem 45
the well-analyzed standard balls-into-bins model for independent allocations. However,
as we show in the following, the given algorithm is only useful for uniform capacities
because for heterogeneous settings, a naive balls-into-bins scheme always induces an im-
balance on the data layout such that fairness can never be obtained. Since, generally,
balls-into-bins schemes feature high attraction to many designers of data placement algo-
rithms, it is worth investigating the occurring imbalance in more detail. We start by first
examining a homogeneous setting for Algorithm 1.
Theorem 3.22. Let S(n,C[n]) be a uniform storage network with relative capacities ci=1
n
for all disks i [n]. Furthermore, let [k]be a set of data items and r [n]the replication
parameter for each d [k]. Then, for any number m Ctotal (m =k·r) of replicated
blocks, Algorithm 1 realizes a balanced (i.e. equal) load on the set [n]of disks.
Proof. In fact, this process can be seen as (the first rsteps of) sampling a permutation
π:[n][n]uniformly at random: first choose π(1)[n](with probability 1
n), then
π(2)[n]0= [n]\{π(1)}and so on. Since for each i[r],π(i)is chosen uniformly
from [n]\{π(1),...,π(i1)}, that is, with probability 1
ni+1independent of the values
π(1),...,π(i1), induction on rshows that each r-tuple π(1),...,π(r)of pairwise dis-
tinct numbers between 1 and nappears with equal probability nr!
n!. Proceeding from an
r-tuple π(1),...,π(r)to the r-element set {π(1),...,π(r)}constitutes a mapping with
exactly r! fibers of equal probability nr!
n!each: every r-element subset therefore arises
with probability 1/n
r.
Observation 3.23. Given a uniform storage network S(n,C[n]), Theorem 3.22 constitutes
an efficient algorithm for sampling uniformly at random an r-element subset of the set [n]
of disks. For the subsequent descriptions, let us interpret Algorithm 1 differently: con-
sider the uniform sampling of a hyperedge E from the complete r-uniform n1
r1-regular
hypergraph [n]
r:={E[n]| |E|=r}. Notice that this random hyperedge E contains
any fixed disk i [n]with probability pi=r
n(because we have n different elements, each
of which chosen with probability 1
nand r possible entries for each of the elements in [n]).
So far the homogeneous case. Now consider Algorithm 1 for the inhomogeneous case.
For this, in the following, we restrict ourselves to random processes which are online
and memory-less in the sense that they distribute (the rcopies of) each given data block
independently from the previous ones. Thus, balanced load then means that the rdisks
which are drawn for storing the copies of some current data block should be sampled from
the set [n]in a way such that disk igets chosen with probability pi:=E([n]
r):iEPr[E],
proportional to its relative capacity ci=Ci/jCj.
However, as the following example shows, this assignment (and thus Algorithm 1) fails
for non-uniform capacities.
46 Random Allocation of Data Copies
Example 3.24. For simplicity, in the following two examples, we consider the size of the
disks due to the normalized data items of size 1.
a) Consider the case n =3, r =2, and disks 1,2,3with capacities C1=C2=1, C3=2.
In order to fill these disks in a balanced way, disk 3has to receive twice as many
copies as both disk 1and 2. That is, to obtain fairness (and thus, full utilization),
disk tuples {1,3}and {2,3}must be chosen as where to place copies with proba-
bility 1
2each; whereas tuple {1,2}must not occur. In contrast, Algorithm 1 chooses
disk 1with probability c1=1
4to hold the first copy; and then disk 2with probability
c0
2=1
3yielding a positive chance for the tuple {1,2}to occur.
b) Even worse in case n =3, r =2, and C1=C2=1, C3=3, there is no way to achieve
a balanced distribution at all, regardless of which algorithm one employs !
According to the given translation on hypergraphs, we can now refine our examinations
on finding a suitable probability distribution Pfor the disjoint storage of replicated balls.
We start with posing the following central question.
Question 3.25. Consider the replication parameter r and fixed integers p1,...,pn[0,1].
a) Does there exist a probability distribution P :[n]
r[0,1]such that6
i[n]:pi=
E([n]
r):iE
P(E)? (3.4)
b) Is such a distribution uniquely determined by Equation 3.4 ?
c) How can one efficiently algorithmically determine this distribution P from the tuple
¯p= (p1,...,pn)?
d) How can one efficiently algorithmically sample E [n]
raccording to P for either
static or dynamic systems ?
For the rest of this chapter, we concentrate on answering the questions above. In partic-
ular, in the following section, we examine the properties of a possibly existing distribution
Pin more detail. Section 3.5 then is dedicated to the latter two questions above.
3.3 Analysis of the Probability Distribution P
Based on the above observations, we now analyze the properties of Pin terms of existence,
uniqueness, and of algorithmic tractability. Moreover, we show that using a balls-into-
bins scheme for allocation as realized by Algorithm 1 always yields an imbalance on the
resulting data layout whenever replicated balls are to be stored on heterogeneous disks.
6To emphasize the probability distribution P, in this section, we use P(E)rather than Pr[E]to denote the
probability of some event E.
3.3.1 Existence and Uniqueness 47
3.3.1 Existence and Uniqueness
Regarding Question 3.25(a), a simple double-counting gives rise to the following neces-
sary (pre-)condition:
Lemma 3.26. Consider a storage network S(n,C[n]) and fixed replication parameter r.
Then, P :[n]
r[0,1]denotes a probability distribution on the set [n]of disks such that
pi:=E([n]
r):iEP(E)satisfies n
i=1pi=r.
Proof.
n
i=1
pi=
(i,E)[n]×([n]
r):iE
P(E) =
E([n]
r)
iE
P(E)
| {z }
=P(E)·|E|
=r·
E([n]
r)
P(E) = r(3.5)
Note that subject to Definition 3.5, we require pi=r·cilater for our allocations. Then,
to achieve a fair and redundant data distribution at all, Lemma 3.26 implies that for the
relative capacities ciof all disks i[n], it must hold that ci1
r(c.f. Lemma 3.29).
3.3.1.1 Existence
Given the result from Lemma 3.26, we can now focus on the existence of the probability
distribution Pregarding the above condition.
Theorem 3.27. Let p1,...,pn[0,1]such that ipi=r. Then, there exists a probability
distribution P :[n]
r[0,1]with pi=E([n]
r):iEP(E)for each i [n].
Proof. Consider the incidence matrix Aof the r-uniform n1
r1-regular hypergraph [n]
r,
that is, A= (ai,E)(i,E)[n]×([n]
r)where ai,E=1 if iEand ai,E=0 otherwise. We are thus
looking for a vector
x=P(E)E([n]
r)[0,)([n]
r)s.t. A·x=
E([n]
r):iE
xEi[n]
!
= (pi)i[n].(3.6)
Let ATdenote the transposed matrix of A. Then, we can observe that for vector y
Rn,AT·yis a vector of size n
rwith E-th component equal to iEyiwhere E[n]
r.
Now, AT·y0 is infeasible for iyipi<0 according to the claim below; hence by
Farkas’ Lemma ([Roc70]), we obtain that Equation 3.6 is feasible. Moreover, reading
Equation 3.5 reversely reveals such a solution xto satisfy E([n]
r)xE=1, i.e. to yield a
probability distribution P.
48 Random Allocation of Data Copies
Lemma 3.28. Let y1,y2,...,ynRand p1,...,pn[0,1]with ipi=rN. Suppose
that iEyi0for each E [n]
r. Then, it holds that n
i=1yipi0.
Proof. W.l.o.g. let y1y2···yn. Then, n
i=1yipibecomes minimal (subject to the
constraints 0 p1,...,pn1 and ipi=r) for 1 =p1=···=prand 0 =pr+1=···=pn.
But that minimum equals iEyi0 for E={1,...,r}.
So far, we have shown that there exists a probability distribution Pas asked for in
Question 3.25(a). Now, we tackle on the uniqueness of such distribution, or in other
words, are there more than one possibilities to assign probabilities to the disks in the set
[n]in order to obtain the desired fair and redundant allocation ?
3.3.1.2 Uniqueness
Uniqueness of Pwould mean that ¯p= (p1,...,pn)determines P:[n]
r[0,1]subject to
Condition 3.4. However, this fails already in the uniform case on [4]
2, that is, with pi=1
2
for all i[4]: In addition to the straight-forward solution P=1
6, also the following one is
easily verified to satisfy Equation 3.4.
P({1,2}) =1
61
11,P({1,3}) =1
61
7,P({1,4}) =1
6+1
11 +1
7,
P({2,3}) =1
6+1
11 +1
7,P({2,4}) =1
61
7,P({3,4}) =1
61
11 .
The given example proves that Equation 3.4 fails to uniquely determine a probability
distribution Pfor our redundant allocations. Thus, consequently, there is not only a single
chance to find the desired data layout.
In particular, the most intuitive way of determining a correct Pobviously would be
to turn the proof of Theorem 3.27 into an algorithm as follows: given a set [n]of bins
with relative capacities c1,c2,...,cn[0,1], we define pi:=r·cisatisfying ipi=r
(Lemma 3.26). Then, one could use his favorite linear optimizer to solve Equation 3.6.
Since linear programming is in the complexity class P (see e.g. [GLS93]), this yields a
valid probability distribution P:[n]
r[0,1]as desired using running time polynomial
in [n]
r=Θn
rr.
However, merely storing Pcompletely may already take exponential space (for in-
stance, for r=n
2). Thus, obviously, one would be better off just sampling some hyper-
graph edge E[n]
raccording to the distribution Pas this can be performed without
actually calculating Pentirely.
Therefore, with respect to the very efficient and successful allocations gained by stan-
dard balls-into-bins games for the non-redundant case, it seems reasonable to apply Algo-
rithm 1 also for redundant settings. Some further motivating fact is that in Theorem 3.22,
we already proved that for a uniform configuration with c1=c2=... =cn=1
n, the de-
picted Algorithm 1 provides the desired data layout.
3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 49
3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations
Before we start comparing the results and properties of Algorithm 1 with a standard balls-
into-bins model, we first give some brief overview about the most important results on
allocating data blocks by a balls-into-bins approach. Notice that since there are no useful
results considering redundant placement, we only refer to independent allocations.
3.3.2.1 Previous Results on Balls-into-Bins
In a standard balls-into-bins model, one sequentially allocates a set of mindependent
balls at random to a set of nbins such that the maximum number of balls in any bin is
minimized. This paradigm has become very popular because it is simple, quite general,
and can be adopted for a variety of applications. To our knowledge, the first use was
given by Karp et al. [KLM92] who deal with PRAM simulations on Distributed Memory
Machines (DMMs). They realize the balls-into-bins paradigm by universal hash functions
that map the PRAM memory cells to the memory modules of the DMM in a way that an
even dispersal of the cells over the DMM modules can be obtained such that, for any
access request to some PRAM cell, an efficient response time can be ensured.
Depending on the number of used hash functions, the mapping of the cells to the mem-
ory modules is either realizing a so-called single-choice or multiple-choice approach.
The classical single-choice approach allocates each ball to a bin chosen independently
and uniformly at random. For m=nballs, the maximum load in any bin is Θlogn
log log n.
More generally, for mnlog nballs, it is well-known that there exists a bin receiving
m
n+Θqmlog n
nballs (see e.g. [RS98]). This result holds not only on expectation but
with high probability. Let the maximum height above average denote the difference be-
tween the number of balls in the fullest bin and the average number of balls per bin. Then,
the maximum height above average of the single-choice approach is Θqmlog n
n, and it
is easy to see that the deviation between the randomized single-choice allocation and the
optimal allocation increases with the number of balls. Importantly, this cannot be done
better for this kind of game, i.e. for single-choice processes.
Nevertheless, the maximum load can be decreased dramatically by allowing every ball
to choose one out of d2 bins independently and uniformly at random. The classical
results for this multiple-choice game are given by Azar et al. [ABKU00] who, following
the results from the PRAM simulations, introduced the Greedy[d] protocol for allocating
nballs to nbins. In each step, Greedy[d] chooses d2 bins for each ball and allo-
cates the ball into a bin with minimum load. Then, it can be shown that the maximum
height above average obtained by Greedy[d] is only log log n
log d+Θm
n, w.h.p. which is an
exponential decrease compared to single-choice games. Moreover, Vöcking [Voe99] in-
troduced the Always-Go-Left protocol yielding a maximum height above average of only
50 Random Allocation of Data Copies
log log n
log d+Θm
nwith ddlog 2. Since all these results hold only for the case mn, in
[BCSV00], the authors analyze Greedy[d] for the heavily loaded case, i.e. mn(what
is, clearly, a more suitable investigation when considering the allocation of data blocks
to storage devices). They show that the maximum load is m
n+log log n
log d, w.h.p. This re-
sult yields a fundamental difference between the allocation obtained by multiple- and the
single-choice approach; while the single-choice allocation deviates more and more from
the optimal allocation with an increasing number of balls, the deviation gained by the
multiple-choice allocation is independent from the number of balls.
The above results show that with a multiple-choice allocation, on one hand, a near-
optimal load balancing can be obtained while, on the other hand, the analysis of these
allocations is challenging even for uniform capacities and pairwise independent balls.
With respect to heterogeneous bins, Wieder [Wie07] introduced the analysis of a multiple-
choice allocation obtaining a tight upper bound of m
n+log log n
log(1+ε)+O(1)w.h.p. using a
modification of the Greedy[d] algorithm called Greedy[U,1+ε], where Udenotes the
uniform probability distribution on nbins. Greedy[U,1+ε] inserts mballs sequentially,
and in each round, the probability of a ball to be inserted into one of the i-th most heavy
bins is exactly (i
n)1+ε. However, two main drawbacks come along with this approach.
First, Greedy[U,1+ε] is not of practical usage, and second, the analysis is derived from
allocations in peer-to-peer networks, that is, the author assumes homogeneous peers (all
of same size) to which an imbalanced probability distribution is wrongfully assigned (the
imbalance is caused by a previous allocation process, like e.g. CHORD [SMK+01a], such
that an appropriate uniform distribution on the peers is biased). Obviously, this scenario
does not reflect the assumptions and demands of the system we observe here.
3.3.2.2 Analysis of the Imbalance
In Lemma 3.22, we showed that symmetry considerations significantly simplified the
analysis of Algorithm 1 in a uniform system because in such settings, the dependencies
between any two choices of disks are always the same. This implies that every r-tuple
π(1),...,π(r)was equally likely to be chosen for allocation, which in turn implies that
also all their r! fibers {π(1),...,π(r)have equal probability. In contrast, as we show in
this section, this behaves significantly different if any non-uniform capacity distribution
is considered. In particular, Theorem 3.30 shows that the referred algorithm, realizing the
redundant type of a standard balls-into-bins game, fails when being applied in heteroge-
neous storage environments for which an optimal disk utilization is required.
As usual, we consider a given storage network S(n,C[n]), and since all of our strategies
work on values defined by the share tuples ¯cfor the disks [n], before we start, we first state
an important precondition on the capacity of each disk which is necessary for obtaining a
fair and redundant data layout at all.
3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 51
Lemma 3.29. A fair and redundant distribution is feasible if and only if ci1
rfor all
bins i [n].
Proof. Consider the random variables from Definition 3.5. Now, assume some bin jwith
relative capacity cj>1
r. Then, for the variable Xj, it holds that E[Xj] = r·k·cj>r·k·1
r,
thus, bin jcontains at least two identical copies violating the redundancy condition.
However, even by supposing that all disks ihave relative capacity ci1
r, the following
theorem shows that Lemma 3.29 states only a necessary condition on the allocation.
Theorem 3.30. Consider a storage network S(n,C[n]) and a share tuple ¯c with ci1
rfor
all i [n]. If then there exists at least any two indices i,j[n]with ci,cj, Algorithm 1
never provides a fair allocation on the set [n]of disks.
Proof. Consider the share tuple ¯c={c1,c2,...,cn}and the replication parameter r. More-
over, for non-uniformity and w.l.o.g., we assume c1<cn. Then, we start by defining the
values pi:=r·ci, and since n
i=1ci=1, it holds that n
i=1pi=rsatisfying the necessary
conditions for balanced allocations in Lemma 3.26. For what follows, we further assume
w.l.o.g. that p1p2... pn, and we define the following extracting function
Ξ:= (ξ1,ξ2,...,ξr):[n]
r[n]r
covering rcomponent functions ξ1,ξ2,...,ξrwith
ξi:[n]
r[n]s.t. the condition
r
[
i=1{ξi(E)}=EE[n]
r
on the considered hypergraph which, given some edge E[n]
r, extracts an r-tuple by the
component functions. More precisely, each function ξiextracts an element from E, and
according to its position iin Ξ,ξithus determines the i-th element of the resulting r-tuple.
Moreover, as due to the referred condition all extracted integers are pairwise distinct, the
functions define an order on the elements in E.
Now, consider the symmetric group Sron the set [r]. Then, for all permutations πSr
and any edge E[n]
r, we define
Ξπ(E):= (ξπ(1)(E),ξπ(2)(E),...,ξπ(r)(E)).
W.l.o.g. let ξr(E) = nfor some edge E[n]
r:nE. Then, for the rest of this proof, we
aim at stating that
pn>
E([n]
r):nE
P(E)(3.7)
violating Theorem 3.27 and thus showing that fairness can not be achieved.
52 Random Allocation of Data Copies
Since we are focusing on a balls-into-bins game, for each r-tuple of identical copies, we
consider a sequential algorithm to place the rcopies to the set [n]of disks. Thus, for any
r-tuple and any bin i[n], let Et
idenote the event that bin iis drawn for storing the t-th
copy, 1 tr. Moreover, according to the redundancy condition, only the first copy of
each tuple is allocated independently from any other placement. Hence, for redundantly
allocating any number `rof identical copies to bins i1,i2,...,i`, let the conditional
probabilities qi1,i2,...,i`:=Pr[E1
i1E2
i2...E`
i`]be defined as
qi1,i2,...,i`:=ci1·ci2
1ci1·...·ci`
1ci1ci2...ci`1
(3.8)
Now, consider the allocation of `+1 identical copies. Then, for the probability qi1,i2,...,i`,
we can state the following lemma.
Lemma 3.31.
qi1,i2,...,i`=
i`+1[n]
i`+1,i1,i2,...,i`
qi1,i2,...,i`,i`+1
Proof. Since n
i=1ci=1, we can state:
i`+1[n]
i`+1,i1,i2,...,i`
qi1,i2,...,i`,i`+1
=ci1·ci2
1ci1·...·ci`
1ci1ci2...ci`1·
i`+1[n]
i`+1,i1,i2,...,i`
ci`+1
1ci1ci2...ci`1ci`
=ci1·ci2
1ci1·...·ci`
1ci1ci2...ci`1·1ci1ci2...ci`1ci`
1ci1ci2...ci`1ci`
=qi1,i2,...,i`
Now, returning to Inequality 3.7, we obtain
E([n]
r):nE
P(E) =
E([n]
r):nE
πSr
P[(ξπ(1)(E),ξπ(2)(E),...,ξπ(r)(E))].
Due to Definition 3.8 above, let qπ(1),π(2),...,π(r)(E):=qξπ(1)(E),ξπ(2)(E),...,ξπ(r)(E). Then,
E([n]
r):nE
πSr
P[(ξπ(1)(E),...,ξπ(r)(E))] =
E([n]
r):nE
πSr
qπ(1),...,π(r)(E)
3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 53
Now, for all E[n]
r:nEand indices i[r], let us consider the cases with bin nat
position ξπ(i). Then,
E([n]
r):nE
πSr
qπ(1),...,π(r)(E)
=
E([n]
r):nE
πSr
ξπ(1)(E)=n
qπ(1),π(2),...,π(r)(E)+...+
πSr
ξπ(r)(E)=n
qπ(1),π(2),...,π(r)(E)
=
E([n]
r):nE
πSr
ξπ(1)(E)=n
qπ(1),π(2),...,π(r)(E)+...+
E([n]
r):nE
πSr
ξπ(r)(E)=n
qπ(1),π(2),...,π(r)(E)
Before we continue our descriptions, we first take a closer look at the summands given
in the above summation. In particular, for some index k, 1 kr, let us fix the case for
which ξπ(k)=n. Additionally, for the sake of convenience, we define the following sets
of tuples of indices
Iu
`:={(i`,i`+1,...,iu)[n]u`+1|ia,ib;la,bu:a,b}(3.9)
Then,
E([n]
r):nE
πSr
ξπ(k)(E)=n
qπ(1),π(2),...,π(r)(E)
=
(i1,i2,...,ir)Ir
1
ik=n
qi1,i2,...,ir
=
(i1,i2,...,ik1)Ik1
1
(ik+1,...,ir)Ir
k+1
qi1,i2,...,ik1,n,ik+1,...,ir
=
(i1,i2,...,ik1)Ik1
1
qi1,i2,...,ik1,n
which follows from Lemma 3.31 by induction. More precisely, since we concentrate on
the allocation on bin n, we can simply ignore the last rksteps of the r-fold placement.
Now, we continue our description by using the last result. Thus, in total, we obtain
E([n]
r):nE
P(E) = qn
|{z}
=:q1
n
+
1i1<n
qi1,n
| {z }
=:q2
n
+
1i1,i2<n
i1,i2
qi1,i2,n
| {z }
=:q3
n
+...+
(i1,i2,...,ir1)Ir1
1
qi1,i2,...,ir1,n
| {z }
=:qr
n
with qi
ndenoting the probability of bin nto be drawn in the i-th sequential step of Algo-
rithm 1 (recall that the algorithm samples an r-tuple from the set [n]). By the following
lemma, we show the imbalance occurring with these probabilities.
54 Random Allocation of Data Copies
Lemma 3.32. For any step t 2of a given r-tuple, it holds that qt
n<cn.
Proof. The proof follows by induction. For the basis, we consider t=2. Then, according
to Definition 3.8 of conditional probabilities, for arbitrary bins i1,i2[n],i1,i2, the
probability Pr[E1
i1E2
i2]to receive a copy each is qi1,i2=ci2·ci1
1ci1
. Thus,
q2
n=
n1
i1=1
qi1,n=cn·
n1
i1=1
ci1
1ci1
.
From this, it follows:
q2
n=cn·
n1
i1=1
ci1
1ci1
(c1<cn)
<cn·
n1
i1=1
ci1
1cn
=cn·n1
i1=1ci1
1cn
=cn·1cn
1cn
(3.10)
that is, q2
n<cn(analogues, one could show: q2
1>c1).
Now, let t>2. Then again, consider arbitrary bins i1,i2,...,it[n],ia,iba,b[t],
a,b, such that we calculate Pr[E1
i1E2
i2...Et
it]as
qi1,i2,...,it=cit·
t1
`=1
ci`
1`
m=1cim.
Since this term calculates the probability only for a fixed set of chosen bins, we continue
with calculating the absolute probability qt
nof bin nto be drawn in the t-th step which
considers all possible combinations of bins to be drawn for allocation in the t1 previous
steps. For our description, we again refer to the index sets Iu
`as defined in (3.9). Then,
we obtain
qt
n=
(i1,...,it1,n)It
1
qi1,...,it1,n=cn·
(i1,...,it1,n)It
1
t1
`=1
ci`
1`
m=1cim
=cn·
(i1,...,it2,n)It1
1
1it1n
it1,i1,...,it2,n
t1
`=1
ci`
1`
m=1cim
We can now factor out the first t2 factors because these factors and the indices i1,...,it2
do not depend on the index it1. For convenience, we shorten the description by defining
F(js):=
s
`=1
ci`
1`
m=1cim.
Then, we obtain
qt
n=cn·
(i1,...,it2,n)It1
1
F(it2)·
1it1n
it1,i1,...,it2,n
cit1
1ci1ci2...cit1
(3.11)
3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 55
It now remains to find an appropriate estimate for the innermost sum because then, the
calculation can be reduced to the previous step of the induction. Since only the innermost
sum depends on index it1, again, we substitute the capacity cit1in the denominator of
the innermost sum by cnyielding the following inequality:
qt
n=cn·
(i1,...,it2,n)It1
1
F(it2)·
1it1n
it1,i1,...,it2,n
cit1
1ci1...cit2cit1
cn·
(i1,...,it2,n)It1
1
F(it2)·
1it1n
it1,i1,...,it2,n
cit1
1ci1...cit2cn
=cn·
(i1,...,it2,n)It1
1
F(it2)·
1it1n
it1,i1,...,it2,n
cit1
1ci1...cit2cn
=cn·
(i1,...,it2,n)It1
1
F(it2)·1cnt2
l=1cil
1cnt2
l=1cil
The reduction continues until reaching the induction base that contains a <-relation as
shown in Inequality 3.10, thus, we obtain qt
n<cnand the lemma follows.
Lemma 3.32 shows that for any step t2 of allocation, qt
n<cnyielding the following
result:
E([n]
r):nE
P(E) = q1
n+q2
n+...+qr
n<r·cn=pn
satisfying Inequality 3.7 and thus, the theorem follows.
3.3.2.3 Determining Upper Bounds for the Imbalance
Since by Lemma 3.32, we know about the existence of the imbalance for any step 2 tr
if Algorithm 1 is used for allocation on non-uniform capacities, in this section, we are fur-
thermore interested in determining the size of the imbalance, i.e. its maximum deviation
from an optimal allocation. More precisely, to obtain an even data layout, Algorithm 1
usually supposes any bin ito be drawn with allocation probability ciin every allocation
step t. Unfortunately, as we have shown, the desired data layout can only be achieved if
solely uniform capacities are considered, otherwise, we obtain a biased data distribution.
In other words, in a uniform setting, the imbalance equals zero, whereas with non-uniform
capacities, there is a deviation from a fair layout (Theorem 3.30). Moreover, the more the
capacities of the bins deviate from being uniform the greater the imbalance. Therefore,
obviously, the imbalance directly depends on the underlying disk configuration.
56 Random Allocation of Data Copies
As the imbalance in the system is determined for each disk exclusively, we now cal-
culate the maximum deviation with respect to bin nthat we assume to have maximum
capacity cn=1
r. Clearly, this could be of practical interest for e.g. administrators who al-
ways aim at providing the best access performance of the administrated storage network,
even if the SAN consists of a mixture of heterogeneous capacities, like e.g. some 80 to
500 GByte disks combined with a number of 3 TByte RAID-systems. Then, condition to
a required total storage capacity and with knowledge about the sizes of the particularly
induced imbalances, the administrator can decide what combination of storage devices to
chose in order to ease the imbalance and thus, to increase system performance.
Since the deviation is determined according to a fixed capacity distribution for which
exact values are needed, for our calculations we first define some capacity range for the
disk capacities. Then, given this range, we can calculate the induced imbalance (for this,
recall Lemma 3.29 for denoting the maximum size of a disk).
Definition 3.33. Consider a storage network S(n,C[n]), a share tuple ¯c and a replica-
tion parameter r 2. Then, a capacity distribution c1,...,cnis called γ-bounded if any
capacity ciin ¯c is taken from the range 1
γrci1
rfor an arbitrary γQ>1.
Due the proof of Lemma 3.32, the task now is to determine a value for the <-relation
in each step t, as first shown in Inequality 3.10. Again, as different steps induce different
degrees of index dependencies, for different steps 2 t,t0r,t,t0, we have to consider
different functions for our calculations. Obviously, since regarding a <-relation, the im-
balance occurring in any step t2 becomes worst if the corresponding function reaches
its global minimum. Furthermore, since first, in a uniform setting, there is no imbalance,
and second, all bins i[n]are supposed to have capacity ci1
r, we start our considera-
tions from the smallest setting of bins which is given by a set of rbins of capacity 1
reach.
Then, our analysis is as follows.
Again, we fix bin nof capacity cn=1
rwhich, initially, holds for also for all other bins
and that we assume to remain unchanged for the whole process. Then, for each step
t2, the task is to find the minimum over all possible γ-bounded capacity distributions
on the remaining bins. Clearly, as we fixed bin n, the total capacity that is provided by
any disk configuration consisting of r1 additional disks is always 1 1
r. Moreover,
since the initial configuration is the only one to provide this total capacity by solely r
bins, any other capacity distribution force them to grow by number. We start with the
initial configuration consisting of bin nand r1 additional bins, each having capacity 1
r.
Then, we determine the global minimum of any possible capacity distribution on these
r1 remaining bins. After that, we proceed with calculating the global minimum for
any capacity distribution on rbins (with 2 bins of size <1
r), and so forth, until finally,
we reach the maximum number of remaining bins, denoted by n1 in the following, in
which all bins have capacity 1
γr.
3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 57
The Case t=2
We start with stating the size of the imbalance for step t=2. That is, we consider the
calculation for the probability q2
nas given in Lemma 3.32 for which we determine the
minimum over all γ-bounded capacity distributions on the set of remaining bins in the
system (recall that to ensuring the total capacity 11
rby different distributions a variable
number of bins in different steps must be considered). More formally, we calculate
min
r1`n1(min
c1,...,c`
`
i1=1
ci1
1ci1).(3.12)
Then, the following theorem shows that the imbalance becomes worst if the configuration
consists of nbins, that is, for n1 bins the capacity is 1
γr.
Theorem 3.34. For all i [n], let ci[1
γr,1
r], and fix cn=1
r. Then, Equation 3.12 is
minimal if `=n1and ci=1
γri[n1].
Proof. Due to step t=2, we denote the innermost sum by X(2):=`
i1=1
ci1
1ci1
. Further-
more, we consider the capacity c:= (11
r)/` which defines an even distribution of the
total capacity 1 1
rcovered by the `observed bins. It is easy to show that the innermost
term ci1
1ci1
corresponds to a convex and strictly monotonically increasing function on the
compact interval [1
γr,1
r]implying that X(2)is also convex and strictly monotonically in-
creasing for each `, and moreover, this means that Equation 3.12 has a unique solution.
With this, we can state our assumption by using the following lemma to show that X(2)is
minimal if and only if all bins have size c.
Lemma 3.35. X(2)is minimal if and only if for all i,j[`], i ,j, it holds that ci=cj.
Proof. The proof is straightforward. We assume that at least two indices i,j,i,j, exist
such that ci=c+εand cj=cεfor an arbitrary ε0. Then, we simply have to find
the minimal εfor ci
1ci
+cj
1cj2c
1c.
For this, we define the following function due to ciand cj
f(ε):=c+ε
1cε+cε
1c+ε,
and to find the minimum of f(ε), we have to look at the first and second order derivatives
f0(ε)and f00(ε), respectively. With
f0(ε) = 1
(1cε)21
(1c+ε)2
we obtain f0(ε) = 0 if and only if ε=0. Furthermore, for ε=0, f00(ε)>0, and hence,
the lemma follows.
58 Random Allocation of Data Copies
The result from Lemma 3.35 implies that the minimum is obtained if X(2)=`
i1=1c
1c
which can then be transformed into the following term that we denote by Y:
Y:=11
`
11(1/r)
`
Since Yis monotonically decreasing in `, its minimum is given if `=n1 and hence, the
theorem follows.
So far, we showed that for step t=2 the imbalance on the data distribution becomes
worst if there is at most one bin of maximum size 1
rand all remaining bins have smallest
size 1
γr. Thus, it remains to show the same for the subsequent steps.
The Case t>2
According to the proof of Theorem 3.30, to calculate the allocation bias for any subse-
quent step t, 3 tr, we again consider the functions for the probability qt
nas follows
(for better description, we again use the sets It
1, as defined in (3.9)).
min
r1`n1
min
c1,...,c`
(i1,...,it1,n)It
1 t1
`=1
ci`
1`
m=1cim!
| {z }
=:X(t)
(3.13)
For this problem and any observed function, we again suppose the global minimum if
`=n1, that is, if n1 bins have minimum capacity 1
γrbut this is only true if the
following conjecture holds (for which no proof is given so far).
Conjecture 3.36. For any step t 3, X(t)is a convex and strictly monotonically increas-
ing function on the compact interval [1
γr,1
r].
Corollary 3.37. If and only if Conjecture 3.36 is true, the minimum for Equation 3.13 is
given if `=n1and ci=1
γri[n1].
Theorem 3.38. Consider a storage network S(n,C[n]) with γ-bounded capacities for
some γQ>1, a replication parameter r and a fixed capacity cn=1
r. If Corollary 3.37
holds, then the maximum allocation imbalance for bin n in any step 2tr is given by
qt
n=1
r·
t1
i=1
ni
γriif γQ>1,
and
qt
n=1
r·n1
t1/γr1
t1 if (γr)Z>1.
3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 59
Proof. The proof is given by induction on t. Particularly, we show for each tthat qt
n<1
r,
and further, that for any step t>2, qt
n=1
r·t1
i=1
ni
γri<qt1
n.
We start with t=2. Furthermore, we assume ndisks with cn=1
rand cj=1
γrfor all
j[n1]. Then, by replacing the values in Equation 3.10 (in proof of Lemma 3.32), the
first assumption holds since
q2
n=1
r·
n1
i1=1
1
γr
11
γr
=1
r·n1
γr1
(γr>n)
<1
r
which holds because as nis fixed, 1
γr<1
nin the non-uniform case. To show this, recall
that 1
γris the smallest capacity to be chosen for any bin. Then, condition to a fixed n,
a total capacity of at least n·1
γr1 is provided. If we now assume 1
γr1
n, it directly
follows that 1
γr=1
nimplying the uniform case what is a contradiction.
Now, consider t>2 and assume for all previous steps tj,j1, that qtj
n<1
rholds.
Then, according to Equation 3.11 (in Lemma 3.32), we consider
qt
n=cn·
(i1,...,jt2,n)It1
1
F(it2)·
1it1n
it1,i1,...,it2,n
cit1
1ci1ci2...cit1
and as we suppose ci=1
γrfor all i,n, we can substitute as follows:
qt
n=cn·
(i1,...,jt2,n)It1
1
F(it2)·
nt+1
j=1
1
γr
11
γr1
γr...1
γr
| {z }
t1elements
=cn·
(i1,...,jt2,n)It1
1
F(it2)·nt+1
γrt+1
As qt1
n=cn·(i1,...,jt2,n)It1
1F(it2)<1
rby assumption, the induction follows. More-
over, again, since γr>n, it follows that nt+1
γrt+1<1, and thus, qt
n<qt1
n.
If γrZ>1, the second result can simply be calculated as
qt
n=1
r·n1
γr1·n2
γr2·...·n(t1)
γr(t1)=1
r·n1
t1
γr1
t1
So far, we showed that if sampling some edge E[n]
raccording to the distribution P
by using a common balls-into-bins scheme (Algorithm 1), a redundant data layout com-
bined with an optimal utilization of the offered system capacity can never be achieved
for heterogeneous environments. However, in the following sections, we show that the
desired data distribution can be obtained by placing the data blocks somewhat more con-
trolled, and what holds for static as well as dynamic storage environments.
60 Random Allocation of Data Copies
3.4 COMB: An Allocation Strategy for Static Scenarios
Although we showed in the previous section that applying a standard balls-into-bins ap-
proach for redundant data allocating can achieve fairness only for homogeneous envi-
ronments, there are numerous results making such approaches very suitable for random
allocation because, as we briefly sketched in Section 3.3.2.1, the results are quite convinc-
ing to linger on them. Thus, we now show that introducing some slight modifications to
the data placement in Algorithm 1 suffices to obtain the desired data layout.
Before we start our description, we first briefly discuss some intuitive approaches for
introducing redundancy into the standard balls-into-bins scheme.
Naive Approaches.
1. Allocate all balls independently by some common balls-into-bins scheme but ignore
all trials in which the same bin is drawn for identical copies such that, finally, all
elements of some certain edge E[n]
rare stored separately.
2. Since, according to Definition 3.5, fairness requires load `i=r·k·cifor all bins
i[n], one could try to partition the set [n]into rdisjoint subsets of size 1
reach
and then allocate the elements of each edge E[n]
rsuch that each partition gets
at most one copy.
We call these approaches naive because it is quite obvious that they do not provide
the desired result. In particular, if we consider non-uniform disk capacities for the first
approach, then, after allocating all balls to the disks, the valid trials (i.e. the ones that were
not ignored) again lead to a biased data placement according to the result of Theorem 3.30.
The second scheme also runs into problems because the mentioned approach corresponds
to the well-known PARTITION problem which is NP-complete.
Therefore, in this section, we introduce an efficient placement scheme that yields the
desired layout by just sampling a hypergraph edge Eaccording to a given probability
distribution P. Particularly, for a storage network S(n,C[n]) and the following allocation
strategy, Pis simply defined by the values of the corresponding share tuple ¯c. With respect
to Lemma 3.26 and Theorem 3.27, this is feasible because we sample r-tuples and as we
define pi=r·ci[0,1]for each bin iand n
i=1ci=1, the necessary conditions to obtain
a fair distribution are satisfied (surely, provided that ci1
r).
The following algorithm is called COMB strategy, and it is closely related to a redun-
dant balls-into-bins model but differs due to the fact that only the first of ridentical copies
is always allocated at random to the set [n]of bins, and all subsequent ones are placed de-
terministically on distinct bins. Moreover, we show first that the resulting load of any bin
corresponds to a standard single-choice balls-into-bins allocation, and second, that it is
highly concentrated around the expected value.
3.4. COMB: An Allocation Strategy for Static Scenarios 61
In particular, consider a storage network S(n,C[n]) and a corresponding share tuple ¯c
for the disks in the set [n]. Let the bins be numbered in a range from 1 to nand arranged in
a linear manner in the [0,1)-interval (that we consider as a ring) according to their shares
in ¯c. That is, bin iis responsible for the half-open interval Ii= [αi,αi+ci)[0,1)of
length cistarting at point αi=i1
j=1cj. Now, given a position x[0,1), let IC(x)[n]
denote the index of the interval Iiwhich xlies in. Then, the COMB strategy works as
follows.
Algorithm 2 The COMB Strategy
Require: Probability space (S(n,C[n]),¯c), Set [k]of balls, Replication parameter r
Ensure: A set Mr
k={(I1
1,I1
2,...,I1
r),...,(Ik
1,Ik
2,...,Ik
r)}of r-tuples of indices of bins.
1: for all d[k]do
2: Place duniformly at random at position x1[0,1)yielding interval I1=IC(x1)
3: Set i2
4: while irdo
5: Allocate dto position xi= (x1+(i1)·1
r)yielding interval Ii=IC(xi)
6: Set i(i+1)
The COMB strategy, which works on disk assignments into the [0,1)-interval, is graph-
ically illustrated in Figure 3.2 for a single 4-tuple. The first copy is allocated uniformly
at random to some point x1[0,1), and all subsequent copies i{2,3,4}are determin-
istically placed to locations xiby a distance of (i1)·1
4from x1in the unit ring. In the
following theorem, we show the correctness and layout quality of the COMB algorithm.
Theorem 3.39. For a static storage network S(n,C[n]) of arbitrary capacities, a set [k]
of distinct data blocks and r 2, the COMB strategy provides a fair and redundant dis-
tribution of m =k·r replicated balls on the set [n]of bins. Moreover, any bin i gets at
most
`i=ci·k·r+O(pci·k·r·log n)
data blocks w.h.p. if q3·`·log n
ci·k·r<1for some constant ` > 0.
Proof. We prove the correctness of the COMB strategy in terms of redundancy and a
balanced data load.
Redundancy. As any bin i[n]is assigned a half-open interval Iiof size ci1
r, and
furthermore, all copies of any data block d[k]are allocated by a distance of 1
rfrom each
other7, no half-open interval can simultaneously contain two different locations xj,xj+1
for any two identical copies. Thus, all copies dj,j[r], lie in different intervals, and from
this it follows that all values j1,j2,..., jrin the sampled r-tuple are pairwise distinct. If
7More precisely, any distance in the interval [maxi{Ii},1
r]is valid to achieve the desired data layout.
62 Random Allocation of Data Copies
c1
c2
c3
c4
c5
c6
c7
c8
c9
x1
x2 (x1+ 1/4)
x3
(x1 + 1/2)
x4
(x1 + 3/4)
Figure 3.2: The COMB strategy on the [0,1)-interval for n=9 bins and r=4.
then some function f:[n]r[n]
ris applied that maps each sampled r-tuple to some edge
E[n]
r, this yields the required r-element set.
Fairness. Given the set [k]and parameter r, we obtain kdistinct r-tuples independently
sampled by the COMB strategy. As, for each such tuple, the first location x1is chosen
uniformly at random in the [0,1)-ring, clearly, the probability for some interval Iito be
chosen is Pr[IC(x1) = Ii] = ci. Moreover, if x1is chosen uniformly random, then so is
xj=x1+j1
rfor each j {2,...,r}; therefore, regardless of the obvious correlation
between x1and xj, and depending on the size c`for some interval I`to be chosen, it holds
that Pr[IC(xj) = I`)] = c`for all `[n].
For the rest of this proof, consider an arbitrary but fixed bin iwith allocation probability
ci. Furthermore, for an arbitrary r-tuple S, 1 Sk, let XS
ibe the binary random variable
denoting the event that bin iis drawn by S, and let Xi=SXS
i. Then, referring to Ques-
tion 3.25, it holds that Pr[XS
i=1] = r·ci(= pi)which satisfies the necessary condition
in Lemma 3.26 for balanced allocations. Additionally, since each random variable Xiis
binomially distributed, its expected value E[Xi] = r·k·ciwhich, on expectation, ensures
a fair data layout according to Definition 3.5.
At last, it remains to show that any disk contains the expected number of blocks also
w.h.p. Since all r-tuples are mapped to the [0,1)-interval independently at random and the
XS
is are binary random variables, we can apply the Chernoff bounds from Lemma 3.15
gaining the following upper bound on the load of bin i:
Pr[Xi(1+ε)·ci·k·r]e-min[δ2,ε]·ci·k·r/3(3.14)
3.4. COMB: An Allocation Strategy for Static Scenarios 63
for every ε>0.
To prove the bound on the concentration, we need to derive the particular εfor which
the probability drops below n`(we exemplarily sketch the calculation for obtaining the
correct εonce in this thesis). Considering Equation 3.14, from the right hand side, we get
the following, assuming ε<1,
eε2·ci·k·r/3n`
n`eε2·ci·k·r/3
Setting ε=pε0·log nleads to
n`eε0·log n2·ci·k·r/3
n`elog n·ε0·ci·k·r/3
n`nε0·ci·k·r/3
ε03·`
ci·k·r
Substituting this in Inequality 3.14 concludes the proof.
The above result shows that for bin iwith capacity cithe maximum deviation of the load
from its expected value is in O(ci·k·r·log n). Recall that this directly corresponds to
the deviation of a standard single-choice balls-into-bins game (Section 3.3.2.1). Thus,
as the COMB strategy implements some sort of redundant single-choice balls-into-bins
approach, the achieved result is the best to obtain for a redundant single-choice process.
However, like in the standard (non-redundant) case, the deviation heavily depends on the
total number of balls allocated to the bins, that is, the more balls thrown the worse the
deviation. However, we leave a possible improvement of this drawback as the following
Open Problem: Is it possible to improve the presented single-choice deviation
bound by some multiple-choice allocation algorithm that allocates r identical
copies to one out of d 2sampled r-tuples (that are determined by their first copy
allocations similar to the shown COMB strategy) ?
Furthermore, as a more essential drawback, the COMB strategy neither is nor can be
made adaptive. As we have shown, the allocation process works on an n-fold partitioning
of the [0,1)-interval into nintervals of lengths c1,c2,...,cn, respectively. Since ici=1,
it is easy to see that this scheme works efficiently for static environments only because
if the number of disks changes, the system obviously runs into fragmentation problems.
That is, whenever a new disk enters the system, this would affect the sizes of all intervals
(even if the disk has only small capacity) what in turn leads to a huge amount of re-
allocations and thus, to intolerable performance deficits over a certain time.
As a consequence, we address the problem of occurring dynamics in the storage net-
work in more detail in the following section.
64 Random Allocation of Data Copies
3.5 SPREAD: An Adaptive Allocation Scheme for
Dynamic Scenarios
We introduce an advanced data placement scheme called SPREAD [MS08] which, unlike
the former strategies, provides useful adaptivity properties to face dynamic changes in
scalable SANs, that is, it is capable of handling any occurring changes in the capacity dis-
tribution on the disks in the storage network that may be caused by faulty disks or growing
storage demands. More precisely, we present a redundant data placement algorithm that
satisfies redundancy, fairness and adaptivity in a time- and space-efficient manner,
is able to locate any stored data item efficiently (access time is at most logarithmic
in the system size),
uses lookup functions whose space requirements only depend on the number of
disks in the system and not on the number of data items or capacity distribution,
given a replication parameter r, may cope with a variable number rdof copies per
data item das long as rdr.
In contrast to the algorithms described in the last two sections that primarily focused
on the obtained data layout without giving a practical notion of the allocation functions,
we now consider the practical realization of such functions as well as used data structures
and lookup-times for realizing the dictionary functions search(T,k) and insert(T,k) (c.f.
Section 3.1.2.1).
As we know, the most efficient approach to implement the dictionary functions is pro-
vided by hashing, where some given set PUof data blocks with keys taken from a uni-
verse Uof all possible keys is distributed sufficiently even over a set of disks in a storage
network by applying an appropriate hash function h(see Figure 3.3). More particularly,
for achieving an appropriate even allocation of data items to the disks, all strategies we
present in this section apply k-universal hash functions (recall from Section 3.1.2.2 that
k=O(log n)suffices for our demands). Again, the very advantage we gain by this way of
data placement is that allocations can be done with complexity O(1). However, as hash-
ing is usually utilized for static environments, and as we have further shown with Linear
Hashing in Section 2, a simple substitution of the hash functions in case of a scaling envi-
ronment is not sufficient for satisfying our demands. Thus, we are left with the following
question: How to design an adaptive hash function ?
Before we describe the SPREAD strategy in more detail, we first tackle this design
problem by briefly introducing the most important adaptive strategies that have been pub-
lished so far and discuss why it is so difficult to adapt these so that all conditions are met.
Furthermore, note that the adaptive schemes we introduce in the following differ from
3.5.1 Previous Adaptive Strategies 65
Figure 3.3: Hashing
SPREAD with respect to the fact that none of them is designed to handle redundancy
either appropriately or at all, that is, they feature no or only less efficient fault-tolerant
mechanisms. Thus, we simply ignore the redundancy condition for a moment.
3.5.1 Previous Adaptive Strategies
If all storage devices have the same capacity, surely, we are not concerned with the het-
erogeneity issue such that any strategy has to assign each disk simply the same data load.
Thus, in a static environment, a standard balls-into-bins algorithm could easily provide
the desired fair distribution (and COMB if redundancy is demanded), but as we consider
dynamic scenarios, we are left with the difficulty of changes in the environment, that is,
the addition and removal of disks, respectively data items. Nevertheless, the usage of only
uniform capacities simplifies the allocation problem considerably.
Exemplary, we restrict our descriptions on two prominent schemes, called Consistent
Hashing and SHARE, which were the first approaches introduced to feature adaptivity for
uniform and also non-uniform capacity distributions.
3.5.1.1 Consistent Hashing for Uniform Capacities
Consistent Hashing, which is also called Nearest Neighbor Strategy, was published by
Karger et al.[KLL+97]. Initially, the scheme was part of a global distribution scheme for
web-based contents, and it has further been applied to Web caching in [KSB+99].
In its basic form, Consistent Hashing uses two hash functions, g:U[0,1)for map-
ping the (pairwise independent) data items into the [0,1)-interval (which again is consid-
ered as a modulo 1 ring) and h:V[0,1)to map the nodes (resp. storage devices) into the
same [0,1)-ring. Then, each node vis responsible for the interval [h((v),h((w)) [0,1),
with some node wbeing the direct successor of vin the mapping into the unit ring. Now,
given the two functions, every data item dUis placed to position g(d)and furthermore
stored at the node vwith h(v)being closest from above to g(d), that is, it is assigned to
the interval [h((v),h((w)) (in Figure 3.4, F2marks such interval for v=2).
66 Random Allocation of Data Copies
12345
01
hash h
hash g
F2
Figure 3.4: ConsistentHashing
In the basic case, if only one single function gis used for placing the nodes indepen-
dently and uniformly into the [0,1)-ring, this would lead to a biased data layout as stated
by the following lemma (the proof can be found in [MNR02]).
Lemma 3.40. If n nodes are allocated to the [0,1)-ring independently and uniform at
random, the length of the longest segment is w.h.p. Θlog n
n, and w.h.p. there is no
segment which is shorter than Θ1
n2.
Fortunately, this problem can be circumvented by using O(log n)differently labeled
copies for each node v, all of which are allocated to the [0,1)-interval. If then gand h
are O(log n)-universal hash functions, it is easy to show that, first, Consistent Hashing
yields an even distribution on the nodes, and second, the strategy is 2-competitive (see
e.g. [KLL+97, Sal04]). Moreover, also multiple copies for each data item dcan easily
be stored in a redundant way while preserving fairness by the rule that rcopies of d
are stored in the rnodes vwith h(v)being among the rclosest successors of g(d)in
[0,1). However, Consistent Hashing is no longer fair if applied to nodes of non-uniform
capacities. This can be repaired by cutting the nodes into pieces of uniform capacity, but
then the space for the hash table would depend on the differences in capacity and not
just the number of nodes, which causes maintenance problems. In particular, it would be
difficult to adapt this approach to changes in the minimum node capacity without inducing
high data replacements and an undesired storage overhead, respectively.
Another adaptive hash table method for uniform storage devices, called Cut-and-Paste
Strategy, has been presented in [BSS00, San01] for which it can be shown that, in contrast
to Consistent Hashing, it keeps the deviation from the desired number of balls in a bin
extremely small w.h.p. However, like Consistent Hashing, the Cut-and-Paste strategy is
also hard to extend to the non-uniform case.
3.5.1 Previous Adaptive Strategies 67
01
I1
I2I3
I4I5
I6
Fj
g(i)
h(b)
Figure 3.5: The SHARE Strategy
3.5.1.2 The SHARE Strategy for Non-Uniform Capacities
One of the first adaptive hash tables that can handle storage devices of arbitrary capacities
was introduced in [BSS02], called SHARE, and which works in two phases. At first, there
is a reduction phase implementing the conversion of the given non-uniform allocation
problem into several uniform ones. In the second phase, a uniform allocation strategy,
like e.g. Consistent Hashing, is then applied to each of these uniform problems to finally
assign the data items to the bins.
In more detail, given any data item bUfor allocation, like in Consistent Hashing,
the reduction phase of SHARE also applies two random hash functions, h:U[0,1)
to map the data items, and g:V[0,1)to map the nodes into the [0,1)-interval (which
again is considered as a modulo ring). Surely, both functions must ensure a sufficient
even distribution of the items which, as we know, can be obtained by using O(log n)-
universal hash functions for hand g. Then, unlike in Consistent Hashing, each node vV
is assigned a subinterval Ivin the [0,1)-ring whose starting point is g(v)[0,1)and that
has length s·cv, where cvdenotes the relative capacity of the node v, and s=Θ(log n)is
astretch factor that is needed to ensure a complete coverage of the [0,1)-interval by the
subintervals, what, furthermore, guarantees high probability on the allocation. That is,
any interval Iithat starts at some point g(i)ends at point (g(i)+s·ci)mod 1.
Furthermore, as depicted in Figure 3.5, given the hashed node intervals, each start- or
endpoint of an interval marks the beginning or ending of some frame Fj. Since all such
frames are disjoint in the ring, each data item bthat is hashed to some location h(b)is
thus exclusively assigned to exactly one frame. Moreover, as the borders of the frames are
defined by the start- or endpoints of the subintervals, each frame is completely covered
by a number of subintervals. At last, the chosen uniform placement strategy is applied
to any frame to select the target node out of the set of nodes assigned to each frame for
68 Random Allocation of Data Copies
data placement. Then, the SHARE scheme yields the desired fair distribution on an either
uniform or non-uniform as well as scalable storage environment w.h.p. and is (2+ε)-
competitive, where εcan be made arbitrarily small.
Another approach that can also be found in [BSS02] and that is called SIEVE is orga-
nized as a multistage filter and overcomes some of the drawbacks inherent to SHARE,
especially, it does not rely on an extra placement strategy for uniform capacities. How-
ever, to provide a fair placement, the SIEVE algorithm maintains a logarithmic number
of random hash functions, one for each stage, and surely, keeping all these functions in
memory leads to an undesired space consumption.
Nevertheless, for any capacity distribution, SHARE and SIEVE are efficient, (2 +ε)-
adaptive and both yield a fair data dispersal within a (1±ε)factor, where ε>0 can be
made arbitrarily small. Particularly, the bound on the adaptivity of SHARE and SIEVE
is based upon the following Fact 3.41. In more detail, for a hash table to be adaptive, it
has to be able to handle any capacity change from (c1,...,cn)to (c0
1,...,c0
n), and as we
already mentioned in Section 3.1.1, any adaptive storage strategy that wants to preserve
fairness has to replace at least a
i:ci>c0
i
(cic0
i) = 1
2
i|cic0
i|
fraction of the data in the system, hence, the following fact holds.
Fact 3.41. If, for any change from (c1,...,cn)to (c0
1,...,c0
n), a storage strategy only
needs to replace a d i|cic0
i|-fraction of the data in the system, then it is 2d-adaptive
w.r.t. capacity changes.
Regarding fault-tolerance, obviously, the attempt to make SIEVE redundant is a hard
challenge (if possible at all) but introducing redundancy into SHARE in some rudimentary
way can be done as follows. If for some given stretch factor s=Θ(log n)the capacity
of every disk i[n]is bounded by ci1
r·s, then one can assign the copies of any data
item dto positions h(d),h(d) + 1
r,h(d) + 2
r,...,h(d) + r1
rin the [0,1)-ring (recall that
|Ii|=ci·s). However, in what follows in the next section, we aim at yielding redundancy,
fairness and adaptivity for disks up to size 1
r.
In [SS05], a related strategy, called DHHT, was introduced but, similar to SHARE
and SIEVE, it has not been shown yet that this strategy can handle both redundancy and
fairness simultaneously and in an appropriate manner.
Recently, Brinkmann et al. [BEMadHS07] proposed the first scheme for systems of
disks with arbitrary capacities that is fair and redundant but what is only Θ(r2)-adaptive
in the worst case for rcopies per data item whereas we are aiming at a constant adaptivity
that is independent of r.
3.5.2 The SPREAD Strategy 69
3.5.2 The SPREAD Strategy
For the rest of this chapter, we describe and analyze the SPREAD strategy in more de-
tail. First, we sketch the basic framework of SPREAD but leaving out various details
about how to achieve fairness, redundancy and adaptivity. Afterwards, we bound the
time- and space-efficiency of SPREAD (Section 3.5.2.2), describe how to achieve fair-
ness and redundancy (Section 3.5.2.3), and show how to make SPREAD O(1)-adaptive
(Sections 3.5.2.4 and 3.5.2.5). At the end, we discuss how to extend SPREAD to the case
that the data items have different levels of redundancy.
3.5.2.1 The basic scheme.
Like in the previous sections, in the following, we again consider a given storage network
S(n,C[n]) and a corresponding share tuple ¯cfor the nodes in the set [n]. Again, assume
the nodes to be numbered uniquely in a range from 1 to n, and furthermore due to the
adaptivity condition, for any history of changes, we always use the strategy that every
node newly entering the system obtains the lowest available identifier 1, thus, we can
make sure that the nodes will be numbered in the defined range from 1 to nany further
(note that the capacity of any currently unoccupied identifier is equal to 0, but it will
nevertheless be part of the SPREAD data structure.).
SPREAD uses three (pseudo-)random (O(log n)-universal) hash functions: a hash func-
tion h:V[0,1)for the nodes and two hash functions g1,g2:U[0,1)to map the data
items into the [0,1)-interval (which again will be considered as a modulo1 ring). Simi-
lar to the SHARE strategy, SPREAD makes use of a stretch factor s=σ·log N, where
N=|V|(resp. the maximum number of nodes the system can expect) and σ1 is a
sufficiently large constant such that sN. Note that, while in SHARE, the stretch factor
was used to ensure the complete coverage of the unit ring, in SPREAD, it is furthermore
used to guarantee that for each point x[0,1), there are at least rintervals passing x.
Now, given these preconditions, the SPREAD strategy works as follows.
For every node v[n], we identify an interval I(v)=[h(v),h(v) + s·r·cimod 1)of
length s·r·ciin the [0,1)-ring, where h(v)is called the starting point of the interval and
h(v)+s·r·cimod 1 is its endpoint. If s·r·ci>1, we consider I(v)to be wrapped around
the whole [0,1)-interval bs·r·cicmany times.
With respect to the previous strategy, the SPREAD scheme realizes some form of a
hybrid solution combining the Consistent Hashing approach and the SHARE strategy. In
particular, like Consistent Hashing, the SPREAD data structure maintains a partition of
the [0,1)interval into n frames F1,...,Fnwith Fvstarting at h(v)and ending at the closest
successor of h(v)among the points {h(1),...,h(n)}\{h(v)}(recall that [0,1)is treated
as a ring). Surely, the intervals differ in size, and since all intervals are allocated to the
[0,1)-ring independently and uniform at random, they determine overlapping areas, called
70 Random Allocation of Data Copies
areas of influence, for some given point x[0,1), that is, there are numerous intervals
passing x(see Figure 3.6 for an example).
01
12345
x
Fj
Figure 3.6: The interval decomposition of SPREAD
Moreover, like in SHARE, for reducing the non-uniform allocation problem to a uni-
form one, each frame Fvis further partitioned into a number of subframes (we will explain
later how these subframes are chosen. For now, just note that the subframe decomposition
only depends on nand h(1),...,h(n)and not on the capacities of the nodes, and as ngrows
according to system scalings, some of these subframes may be partitioned into smaller
subframes). Then, given the replication parameter r, for each subframe f, SPREAD aims
at maintaining one (and sometimes two, as will be explained later) (s
α)×r-table Tfcon-
sisting of r·s
αslots which are organized into s
αgroups (respectively columns) of size r
each, where α>0 is a small constant that is selected such that a certain degree of fair-
ness is maintained. Every slot is assigned to (resp. owned by) exactly one node, and the
assignment has to be chosen so that for each group of rslots, each slot is assigned to
a different node (see Figure 3.7). Then, we can use the following strategy for ensuring
redundancy.
(s/α)
r
group 1 group (s/α)
direction of allocation
Figure 3.7: Allocation table for a subframe f
3.5.2 The SPREAD Strategy 71
At first, whenever we want to read or overwrite the copies of some data item d, we
perform the following steps.
1. Identify the unique subframe fvia g1(d)f,
2. Pick group d(s
α)·g2(d)ein Tfand either read or store the copies of din the rnodes
owning the slots in this group
(in Figure 3.8, the allocation of a data item to the slots in the corresponding subframe
table is depicted). Given that the hash functions can be evaluated in constant time and
there are msubframes in the system, standard data structures such as search trees and
arrays can be used to obtain the following result (recall that we assign slots to nodes, and
surely, ndifferent nodes require log nbits for encoding).
01
….
r
s/α cols
g1(d)
places of rcopies
g2(d)
Figure 3.8: Allocation of a slot in a subframe table
Lemma 3.42. The lookup and insert operations of SPREAD can be implemented with
runtime O(r+log m)and space O(m(r·s
α)log n).
Hence, as long as mis close to linear in n, SPREAD is time- and space-efficient. In the
following subsections, we show that there is a way of maintaining the tables so that also
the fairness, redundancy and adaptivity conditions are satisfied.
3.5.2.2 Subframe management.
As we see in the following, to make the reduction phase that transforms the non-uniform
problem into a number of uniform ones be able to cope also with a scaling storage system
appropriately, ideally, we would like to have the following subframe decomposition for
each frame Fv[0,1). For the subframes f0,f1,f2,... following Fvin the [0,1)-ring (in
ascending direction), it holds that
|fi|=ε(1+ε)i·|Fv|for some fixed ε>0.(3.15)
72 Random Allocation of Data Copies
If this were true, we could approximate any interval I(v)[0,1)of size |I(v)||Fv|by
an interval I0(v)ending at the starting point of some subframe fiso that |I0(v)|is within
(1±ε)|I(v)|. In fact, the following lemma holds.
Lemma 3.43. Suppose some number k Nto be the maximum value for which it holds
that |Fv|+k1
i=0|fi| |I(v)|. Then, it also holds that |Fv|+k1
i=0|fi| (1ε)|I(v)|and
|Fv|+k
i=0|fi| (1+ε)|I(v)|.
Proof. It holds that
|Fv|+
k1
i=0|fi|= (1+ε
k1
i=0
(1+ε)i)|Fv|= (1+ε)k|Fv|
for any k0. Hence, for |Fv|+k1
i=0|fi| |I(v)| |Fv|+k
i=0|fi|, we get that
|Fv|+
k1
i=0|fi| 1
1+ε|I(v)|= (1ε
1+ε)|I(v)| (1ε)|I(v)|
and |Fv|+k
i=0|fi| (1+ε)|I(v)|.
Thus, we can either decide to round I(v)down to the starting point of the subframe
containing its endpoint or up to the starting point of the next subframe in order to obtain
an interval I0(v)whose size is within (1±ε)|I(v)|. If |I(v)|is at most |Fv|, we identify
I0(v)with I(v), i.e., we do not round I(v)(see Figure 3.9 for illustration).
v
exact I(v)rounded I(v)
Subframes of size
(1+
ε
)i
ε
|Fv|
Fv
Figure 3.9: Rounded Up Interval
Of course, choosing a subframe decomposition such that for each frame Fv, there are
subsequent subframes f0,f1,f2,... of sizes |fi|=ε(1+ε)i|Fv|is not possible, but it is
sufficient to cut each frame Fwinto subframes such that the following condition is true for
every frame Fv:
3.5.2 The SPREAD Strategy 73
Subframe condition: For any two frames or subframes fand f0, let (f,f0)be the
distance (in ascending direction along the [0,1)-ring) from the endpoint of fto the starting
point of f0. Furthermore, for every w[n]\{v}and every subframe fin Fv, we require
that |f|ε(1+ε)k|Fw|, where kN0is maximum possible so that
ε
k1
i=1
(1+ε)i|Fw| (Fw,f).
If this condition is true, it is easy to see that the interval I(v)can be rounded down to the
starting point of the subframe ficontaining its endpoint or up to the starting point of the
next subframe fi+1in order to obtain an interval I0(v)with |I(v0)| (1±ε)|I(v)|. Now,
to satisfy the subframe condition, we use the following decomposition strategy for the
[0,1)-ring.
Given h(1),...,h(n)[0,1), we start with some single subframe representing the en-
tire [0,1)-interval, and we keep cutting subframes in half until the subframe condition is
true everywhere (when considering each subframe crossing b1 frame borders as being
cut into b+1 subframes at these borders). This leads to a unique decomposition into
subframes (for any given node mappings h(1),...,h(n)) with the property that for every
subframe fthat is not the first or last subframe in some frame Fv,|f|=1
2kfor some
kN0. Moreover, whenever the number nof disks is increased by system scaling, the
decomposition strategy will only cause some subframes to be cut into smaller subframes
(which turns out to be useful for the adaptivity of SPREAD). The following two lemmas
show that, w.h.p., our decomposition rule does not create too many subframes.
Lemma 3.44. The number of frames Fvin the system with |Fv| δ
nfor some given δ>0is
bounded by 2δn+O(n), w.h.p. Furthermore, w.h.p. there is no frame Fvwith |Fv| 1
nk
for some constant k, w.h.p.
Proof. For every node v[n], consider a random variable Xvwith Xv=h(v). Then, given
a fixed δ, let the function
f(X1,...,Xn):[0,1)nN
represent the number of frames Fvwith |Fv| δ
n. Certainly, whenever (x1,...,xn)and
(x0
1,...,x0
n)differ in at most one coordinate, it holds that
|f(x1,...,xn)f(x0
1,...,x0
n)| 2.
74 Random Allocation of Data Copies
Moreover, for any node v[n]and δ1, it holds that
Pr|Fv| δ
n=11δ
nn1
1 1(n1)δ
n+n1
2δ
n2!
= (n1)δ
nn1
2δ
n2
δ
which implies that E[f]δn. Hence, it follows from the method of bounded differences
(c.f. Section 3.1.3.1) that for any d0, it holds that
Pr[fδn+d]ed2/(2n).
Choosing d=δn+Θ(n)results in the first part of the lemma. The second part holds
because for any node v, it holds that
Pr|Fv| 1
nk1
nk1.
Therefore, the probability that there exists a v[n]with |Fv| 1
nkis polynomially small
in n.
Lemma 3.45. The number of subframes in the system is bounded by O(n
ε), w.h.p.
Proof. For any iN0, let Sibe the set of all frames of size in [2i
n,2(i+1)
n). Then, for
each frame FSi, we need at most
k=log1+ε1
n|F|log1+ε2i+1=Oi
ε
subframes f0,f1,..., fkin the ideal setting until |fk|=ε(1+ε)k|F| ε
n. Hence, it follows
from Lemma 3.44 that the total number of subframes needed is bounded by
log n
i=02n
2i·Oi
ε+On·Olog n
ε+On
ε=On
ε
w.h.p.
Next, we explain how to manage the tables.
3.5.2 The SPREAD Strategy 75
3.5.2.3 Table management.
For each subframe f[0,1),C(f)represents a multiset of nodes wfor which fI(w),/0
holds. Furthermore, for every node win C(f), let its multiplicity µf(w)be defined as the
number of times woccurs in C(f)(see Section 3.1.3.4 for details on multisets). Initially,
µf(w)is set to the number of times I(w)crosses the starting point of f, thus, surely,
µf(w) {br·s·cwc,dr·s·cwe}. Moreover, since ci1
rfor every i[n], it holds that
µf(w){0,...,s}.
Now, for what follows, consider some frame Fv. Then, for every subframe fFvand
nodes w,v, we are using the following rules:
Rounding conditions:
1. Whenever the endpoint of I(w)crosses fcompletely (i.e. moves beyond the start-
and endpoint of f) for the first time, we set µf(w) = µf(w)+1.
2. Whenever the endpoint of I(w)moves below the starting point of ffor the first time
after crossing the endpoint of f(or after whas been introduced to the system), we
set µf(w) = µf(w)1.
We also apply the same rules if w=vand |I(v)|>|Fv|. For the special case that w=v
and |I(v)|≤|Fv|,µf(w)is defined as the number of times I(w)crosses the starting point
of f(recall that the frames Fvare defined by the values h(v), thus, |I(v)|Fvmay occur).
With these rules, it holds that whenever the endpoint of I(w)is outside of f, then µf(w)
equals the number of times I(w)crosses f, and otherwise µf(w){br·s·cic,dr·s·cie}.
Hence, since no interval can start inside of f, the number of times intervals cross fat the
left and right endpoints is an upper and lower bound on |C(f)|. More precisely, for any
subframe f, we can bound |C(f)|as follows.
Lemma 3.46. For every subframe f [0,1)and corresponding multiset C(f),|C(f)|
is within (1±β)r·s, w.h.p., where the constant β>0can be made arbitrarily small
depending on the constant σin s.
Proof. To bound |C(f)|for any subframe f[0,1), the arguments above imply that it
suffices to consider any point x[0,1)that represents a starting or endpoint of some
interval I(v). In the following, let us first consider the starting point of an interval I(v).
Consider some fixed node vwith starting point h(v)[0,1). For any node win the
system (including v), let the random variable Xwbe defined as
Xw=br·s·cwc+Yw
where the binary random variable Yw=1 if and only if I(w)contains h(v)dr·s·cwemany
times. Furthermore, let X=wXwand Y=wYw. It certainly holds that E[Xw] = s·r·cw
for all w,vand that X=ds·r·cve+w,vXw. Hence,
E[X] = ds·r·cve+s·r·(1cv)[s·r,s·r+1].
76 Random Allocation of Data Copies
Moreover, E[Y]E[X], and since the starting points of the intervals are chosen indepen-
dently at random and the Yws are binary random variables, it follows from the Chernoff
bounds (see Section 3.1.3.1) that
Pr[|YE[Y]|βE[Y]] eβ2E[Y]
2(1+β/3)
for any β>0. Hence, Xis within (1±β)s·r, w.h.p., where the constant β>0 can be
made arbitrarily small depending on the constant σin s.
For the endpoint of an interval I(v), it is easy to see that E[X][s·r1,s·r]. Thus,
the same deviation bounds can also be shown here.
For our further descriptions, we need somewhat more refined estimates on the sizes
of multisets corresponding to the nodes in the system. As we will see later, this is im-
portant for the design and management of the used tables to appropriately handle system
dynamics.
For what follows, consider the sets
W`:=nv[n]|cv1
(s1)ro[n]
and Ws:= [n]\W`(it is easy to see that all nodes vW`are guaranteed to have multiplicity
µf(v)1 in every subframe f[0,1)).
Now, for any subframe f, we define C`(f):=C(f)|W`(i.e., the multiset containing
only those nodes in C(f)that are also in W`) and Cs(f):=C(f)|Ws. Furthermore, let
a`=vW`cvand as=1a`. Then, a more refined version of Lemma 3.46 is as follows.
Lemma 3.47. For any subframe f and any subset W0
`W`, it holds that |C0
`(f)|is within
s·r·a0
`±β|W0
`|+O(log n), w.h.p., where C0
`(f) = C`(f)|W0
`and a0
`=vW0
`cv. Further-
more, |Cs(f)|is within (1±β)s·r·as+O(log n), w.h.p.
In both cases, the constant β>0can be made arbitrarily small depending on the constant
σin s.
Proof. Recall the definition of Xwand Ywin the proof of the previous lemma. Applied to
nodes wW0
`, it holds that
E[X] =
wW0
`bs·r·cwc+E[Y][s·r·a0
`1,s·r·a0
`+1].
Since E[Y]|W0
`|, the first bound follows from the fact that
Pr[|YE[Y]| βE[Y]] eβ2E[Y]
2(1+β/3)
for any β>0.
The second bound follows along the same lines as in the proof of Lemma 3.46.
3.5.2 The SPREAD Strategy 77
As we know now, any subframe is always crossed by a sufficient number of intervals.
Furthermore, the following result is crucial for SPREAD to be redundant.
Lemma 3.48. For every subframe f , there are at least r different nodes in C(f), w.h.p.,
where the probability depends on the constant σin the stretch factor.
Proof. Let q=|W`|. If qr, then the lemma is trivially true. So suppose that q<r.
Since cv1
rfor all v[n], a relative capacity of at least 1 q
rmust be covered by Ws.
Hence,
|Ws|1q
r/1
(s1)r= (s1)(rq)
and
vWs|I(v)|(r·s)1q
r=s(rq).
Suppose that the nodes in Wsare numbered from 1 to t. Consider any fixed node vWs
and focus on the point yimplied by vs interval I(v)=[x,y). For any node wW, let the
binary random variable Xw=1 if and only if yI0(w)and let X=wXw. Since
E[Xw] = Pr[Xw=1] = min{|I0(w)|,1}for all wWs\{v},
it holds that
E[X] = q+
wWs\{v}
E[Xw] = q+
wWs\{v}
min{|I0(w)|,1}
q+(1ε)s(rq)1r+(1ε)s2
Since s=Θ(logN)and the starting points of the intervals are chosen uniformly and inde-
pendently at random, it follows from the Chernoff bounds that Xrfor point y, w.h.p.
Since we only need to consider those points in [0,1)that are starting points or endpoints
of intervals I(v)in order to cover all subframes in [0,1)and the lowest values for E[X]
are reached when focusing on endpoints of intervals I(v)of nodes vWs, the lemma
follows.
Recall that for each subframe f, we maintain one (and sometimes two) s
α×r-table Tf
of r·s
αslots which are organized into s
αgroups (or columns) of size reach (Figure 3.7).
We assume that α>0 is a sufficiently small constant with 1
αN. Our goal is to assign
the slots of Tfto nodes in fso that the following conditions are met:
Table conditions:
1. Every slot is assigned to (resp. owned by) exactly one node in C(f).
78 Random Allocation of Data Copies
2. Every node vin C(f)owns within (1±γ)µf(v)/αmany slots but at most s
αmany
slots in Tffor some constant 0 <γ<1 to be specified below.
3. Every group consists of slots belonging to different nodes.
The constant γwhich is sufficient to maintain the table conditions is given in the following
lemma. Moreover, in this lemma, αis the parameter used in the table size and βis the
parameter used in the Lemmas 3.46 and 3.47.
Lemma 3.49. If the bounds in Lemma 3.47 are true and αand βare sufficiently small
constants, then Conditions 1 and 2 can be met with any γβ
1β+α.
Proof. Recall the bounds in Lemma 3.47 and ignore the O(log n)terms for a moment.
Then, in this proof, we consider the nodes in W`and Wsseparately, starting with W`.
Let WhW`be the subset of all nodes vW`with capacity cvlarge enough so that
it is guaranteed that µf(v)1
γfor γas chosen in the lemma. In this case, every vWh
satisfies
(1γ)µf(v)
αµf(v)1
αjs·r·cv
αk
and
(1+γ)µf(v)
αµf(v)+1
αls·r·cv
αm.
Moreover, ds·r·cv
αe s
α. Hence, each vWhcan be given either bs·r·cv
αcor ds·r·cv
αe
many slots without violating Condition 1 implying a slot assignment to the nodes in Wh
such that the total number of slots used by the nodes in Whis in
hjs·r·ah
αk,ls·r·ah
αmi
where ah=vWhcv. This is perfect up to an additive 1.
Next, consider the remaining nodes in the subset W0
`=W`\Whwith capacity cvsuch
that µf(v)<1
γ. For this, let C0
`(f) = C`(f)|W0
`be the multiset of nodes vC`(f)with
v<Wh. In this case, for each node vW0
`, it holds that
(1+γ)µf(v)
α<s
α
(given that sis sufficiently large). Thus, we do not have to worry about limiting the
number s
αfor the slots of v.
Let a0
`=vW0
`cv. Then, suppose for an upper bound on the number of slots per node
that |C0
`(f)|=s·r·a0
`β|W0
`|(see Lemma 3.47). If a constant φ>0 is chosen such that
vW0
`µf(v)
α+φ
αs·r·a0
`
α,(3.16)
3.5.2 The SPREAD Strategy 79
then it suffices to assign at most
µf(v)+φ
α+1
slots to every node vC0
`(f)to cover at least an a0
`-fraction of the slots in f. Furthermore,
since vW0
`µf(v) = |C0
`(f)|, Inequality 3.16 above holds if and only if
|C0
`(f)| s·r·a0
`φ|W0
`|,
so we have to choose φβfor this.
For a lower bound, suppose that |C0
`(f)|=s·r·a0
`+β|W0
`|. If the constant φ>0 is
chosen so that
vW0
`µf(v)
αφ
αs·r·a0
`
α,
then at least µf(v)φ
α1
slots can be assigned to every node vC0
`(f)to cover at most an a0
`-fraction of the slots
in f. For this, we have to choose φβ. Together with the fact that µf(v)1 for all
vC0
`(f), it follows that γ=β+αis sufficient so that
µf(v)+β
α+1(1+γ)µf(v)
α
and
µf(v)β
α1(1γ)µf(v)
α.
Thus, we are done with determining bounds on the slot assignment for the nodes in W`,
that is, for nodes vthat are guaranteed to have multiplicity µf(v)1 in C(f).
Finally, we consider the nodes in Ws, that is, nodes vwith µf(v)<1 that may not cross
every subframe f[0,1). Suppose for an upper bound on the number of slots per node
that |Cs(f)|= (1β)s·r·as(see Lemma 3.47). If then constant φ>0 is chosen so that
vWs
(1+φ)µf(v)
αs·r·as
α,
it suffices to assign at most
(1+φ)µf(v)
α+1
slots to every node vCfto cover an as-fraction of the slots in f. Since it holds that
vWsµf(v) = |Cs(f)|, we have to choose φsuch that
(1+φ)(1β)s·r·as
αs·r·as
α,
80 Random Allocation of Data Copies
which works for φ1
1β1=β
1β.
For a lower bound, suppose that |Cs(f)|= (1+β)s·r·as. If then the constant φ>0 is
chosen so that
vWs
(1φ)µf(v)
αs·r·as
α,
at least
(1φ)µf(v)
α1
slots can be assigned to every node vCs(f)without exceeding an as-fraction of the slots
in f. For this, we have to choose φsuch that
(1φ)(1+β)s·r·as
αs·r·as
α
which works for φ11
1+β=β
1+β. Hence, γβ
1β+αis sufficient for both cases.
Finally, notice that there is this O(log n)term in the bounds in Lemma 3.47 that we
ignored so far. However, whenever this term is dominant for a set of nodes in some
subframe fj, then it holds that vfjcvδfor some constant δ>0 that can be made
arbitrarily small depending on the constant σin the stretch factor s. In other words, since
we are considering only three sets of nodes (W`,W0
`,Ws), those sets of nodes where the
O(logn)term is negligible (and can therefore be covered by β) represent a total capacity
of at least 1 2δwhich is sufficient to fill all slots just with these nodes, or leave enough
space for the other nodes, as long as δis sufficiently small compared to αand β.
If Conditions 1 and 2 are true, also Condition 3 can be met. To show this, consider
the slots to be numbered column-wise from 1 to r·s
αby giving slot (i,j)in group ithe
number (j1)·s
α+i(c.f. Figure 3.7). Then, assign the slots to the nodes such that each
node vC(f)owns a consecutive sequence of slots. As every node owns at most s
αslots,
no group can have two slots owned by the same node, which proves our claim.
However, the challenge, of course, will be to maintain these conditions in case of sys-
tem changes caused by some change operation ωwithout rearranging too many slot as-
signments. Before we show how to do this, we prove that the given table conditions allow
us to maintain fairness.
Lemma 3.50. If the table conditions are met, then for every node v [n]in the system
and every data item d U, it holds that
Pr[v stores a copy of d][(1γ)(1ε)r·cv,(1+γ)(1+ε)r·cv]
for the γ>0as given in Condition 2 and ε>0as chosen for the interval rounding.
3.5.2 The SPREAD Strategy 81
Proof. Consider any node v[n]and data item dU. Let p=|I0(v)|mod 1, where
I0(v)is the rounded form of I(v)(see Figure 3.9). For an area Ap= [xi,xj][0,1)of
size |Ap|=p, the multiplicity of vis µf(v) = d|I0(v)|e for all subframes fAp. For the
remaining area A0p[0,1)of size |A0p|= (1p), the multiplicity is µf(v) = b|I0(v)|cfor
all subframes fA0p. Moreover, it follows from the table conditions that for any node v
in some subframe f,
Pr[vselected by a data item](1γ)µf(v)
s,min{(1+γ)µf(v),s}
s.
Hence, it holds for data item dthat
Pr[vstores a copy of d]p·(1γ)d|I0(v)|e
s+(1p)·(1γ)b|I0(v)|c
s
= (1γ)·|I0(v)|
s(1γ)(1ε)r·cv
Similarly, it holds that Pr[vstores a copy of d](1+γ)(1+ε)r·cv, which implies the
lemma.
Since ε>0 and γ>0 can be made arbitrarily small, the lemma implies the fairness of
SPREAD. Next we show that SPREAD is also adaptive.
3.5.2.4 Amortized adaptivity.
We start with amortized adaptivity, i.e. we show how to perform updates so that move-
ments of data copies can always be charged to capacity changes in the past.
Note that SPREAD does not need to perform any adaptations of the slots as new data
items are added or old data items are removed from the system since Lemma 3.50 implies
that the data items in the system will remain fairly distributed among the nodes. Hence, it
remains to describe how to react to changes in the capacities of the nodes.
In what follows, we suppose that the capacities of the nodes change from (c1,...,cn)
to (c0
1,...,c0
n). In case that the number nof nodes in the system increases, then, for each
node identifier vVthat has not been used before, we have to rearrange the subframe
layout in the [0,1)-interval in a way that some subframes may have to be cut into smaller
subframes that take over the tables of the previous subframes. Then, it follows that if the
previous subframes satisfied the table conditions, the new ones will also do so. After that,
our rounding conditions may require updates to these tables, which can be charged to
past capacity changes as for the other cases below. At last, we adapt the intervals I(v)to
the new capacity distribution (c0
1,...,c0
n). Clearly, this may cause changes in the multiset
C(f)of some subframe finducing required updates in its table (or tables). We call a
subframe f dirty if fFv, and furthermore, fcontains the endpoint of the interval I(v)
with |I(v)||Fv|. Otherwise, it is called clean.
82 Random Allocation of Data Copies
In the following, we first describe how to update the table of a clean subframe, and then
we consider dirty subframes.
Updating a clean subframe f.Let C(f)be the multiset of nodes in fbefore the change
and C0(f)be the multiset of nodes in fafter the change in capacities. Furthermore,
let µC
f(v)denote the multiplicity of some node vC(f)and µC0
f(v)the multiplicity of
vC0(f). If then C0(f),C(f), we go through the following stages.
1. Pairing stage: Suppose that
vC(f)
(µC
f(v)µC0
f(v)) = δd
is the total decrease in the multiplicities of nodes in C(f), and
vC(f)
(µC0
f(v)µC
f(v)) = δi
is the total increase in the multiplicities of nodes in C(f). Then, we can identify
|δiδd|pairs of nodes (v,w)where vwants to decrease its multiplicity whereas w
wants to increase its multiplicity. For each such pair, we set
µC0
f(v):=µC
f(v)1 and µC0
f(w):=µC
f(w)+1
and then change slot assignments according to µC0
f(v)and µC0
f(w), respectively,
until table Condition 2 is satisfied for vand w.
For each such slot reassignment, we distinguish between three cases.
a) If both vand wviolate Condition 2, then a slot of vis given to w.
b) If only vviolates Condition 2, we give a slot of vto any node w0who can still
take a slot without violating Condition 2 (we will see below that such a node
w0can always be found, w.h.p.).
c) If only wviolates Condition 2, we give a slot from any node v0who can lose a
slot without violating Condition 2 to w.
For each slot xTfgiven from some node uto some node u0, we use the following
slot switching strategy to preserve Table condition 3.
Switching strategy: If xbelongs to some group gTfin which no other slot
is assigned to u0, we are done. Otherwise, there must be a group g0with no slot
assigned to u0since otherwise u0would have more than s
αslots at the end, violating
Condition 2. Since Condition 3 was true before the movement, there must be a slot
x0in g0that is assigned to some node u00 having no slot in g. Then, switch slots x
and x0among u0and u00 which repairs Condition 3.
3.5.2 The SPREAD Strategy 83
2. Movement stage: After the pairing stage, we only have nodes left that all want to
decrease or increase their multiplicities. We consider these node by node. For each
node vamong these, we determine µC0
f(v), and then either move slots to vor away
from vusing the slot switching strategy in the pairing stage, if necessary, until v
satisfies Condition 2.
Of course, it is not obvious that suitable slots can always be found for the reassignments
(besides the pairing stage in which the nodes vand wstill violate condition 2), but the
following lemma implies that this is possible. In it, C00(f)represents the intermediate
multiset during the process of moving from C(f)to C0(f).
Lemma 3.51. In any situation in which |C00(f)|is within |C(f)|and |C0(f)|, Conditions
1 and 3 are true, and at most one node violates Condition 2. But then, Condition 2 can be
repaired for it so that all table conditions are met, w.h.p.
Proof. For any node wC00(f), let swbe the number of slots whas in the table Tf. Let
vbe the node that is violating Condition 2. Then, veither needs additional slots or has to
give up slots. Suppose first that vneeds additional slots. In this case,
sv<(1γ)µC0
f(v)
α.
As long as there is a node wwith
sw1(1γ)µC0
f(w)
α,
we can move a slot from wto vuntil
sv(1γ)µC0
f(v)
α.
Then, we repaired Condition 2 for vwithout violating the Condition 2 for any of the other
nodes.
Suppose, however, that we reach a point in which sv<(1γ)µC0
f(v)/αbut there is no
node wany more with sw1(1γ)µC0
f(w)/α. In this case, the total number of slots
occupied by the nodes in C00(f)is less than
(1γ)µC0
f(v)
α+
w,v"(1γ)µC0
f(w)
α+1#<1γ
α
wC00(f)
µC0
f(w)+|C00(f)|
=1γ
α+1|C00(f)|.
84 Random Allocation of Data Copies
If γβ
1β+α, it follows from Lemma 3.46 that this is at most
12β
α(1β)·|C00(f)| 12β
α(1β)·(1+β)s·r
w.h.p. Furthermore, it holds that
(12β)(1+β) = 1β2β2<1β,
thus, wsw<s·r
α, which is a contradiction since all slots must be owned by a node at
any time. Hence, it will always be possible to reassign slots so as to repair Condition 2
for node vin this case.
Next, we consider the case that node vneeds to give up slots. In this case, vgets stuck
if sv>(1+γ)µC0
f(v)/αand for all other nodes w, it holds that sw+1>(1+γ)µc0
f(w)/α.
Then, the total number of slots occupied by the nodes in C00(f)is more than
(1+γ)µC0
f(v)
α+
w,v"(1+γ)µC0
f(w)
α1#>1+γ
α
wC00(f)
µC0
f(w)|C00(f)|
1
α(1β)+1
wC00(f)
µC0
f(w)|C00(f)|
1
α(1β)·(1β)s·r=s·r
α
w.h.p. This, however, is a contradiction. Hence, slots from node vcan be reassigned to
other nodes until Condition 2 holds for v.
Next, we bound the number of slot reassignments. A step in the movement or pairing
stage is defined as the process of fixing the table conditions after the multiplicity of a node
or pair of nodes has changed by 1.
Lemma 3.52. In each step of the pairing or movement stage, at most 2(1+γ)
αslots have to
be reassigned in order to repair the table conditions.
Proof. First, consider the movement stage. Suppose that the multiplicity of some node v
increases by 1, that is, µC0
f(v) = µC
f+1. We know that for its old multiplicity µC
f(v), it
holds that
sv(1γ)µC
f(v)
α.
Hence, at most 1
αslots have to be moved to vto satisfy
sv(1γ)(µC0
f(v))
α.
(3.17)
3.5.2 The SPREAD Strategy 85
Since each slot movement may require a flip with another slot to repair Condition 3, the
total number of slot reassignments is at most 2
αin this case.
For the case that the multiplicity µC
fof some node vdecreases by 1, at most 2(1+γ)
αslots
have to be reassigned. The worst case happens if µC
f(v)was previously 1 and vhad 1+γ
α
slots.
Next, consider a step of the pairing stage. Suppose that for node vthe multiplicity
µC0
f(v) = µC
f(v)1 whereas the multiplicity of node wis µC0
f(w) = µC
f(w) + 1. Then,
vhas to give up at most 1+γ
αslots while whas to get at most 1
αslots in order to repair
Condition 2. In any step of repairing Condition 2 for vand/or w(vgives a slot to w,
or vgives up a slot, or wgains a slot), at most 2 slot reassignments are necessary, so
altogether, at most 2(1+γ)
αslot reassignments are needed to repair Condition 2 for the
nodes vand w.
Hence, given a total change in the multiplicities of the nodes by µ, at most 2µ(1+γ)
αslot
reassignments are necessary to get from C(f)to C0(f). Given kslot reassignments, the
probability that a specific copy of a data item dUwith g1(d)fneeds to be replaced
is equal to
k
r·s
α
.
Thus, the expected number of copy movements is at most
2µ(1+γ)
α
r·s
α·|f|·r|D|=2(1+γ)µ|f|
s·|D|
where DUis the set of data items that are stored in the system. A change in multi-
plicities by µcan be charged to a capacity change of c(f)|f|µ
s·rwith respect to fin
the past, and the way we perform interval rounding makes it possible that every capacity
change is charged at most once. Thus, with respect to c(f), the expected number of copy
movements is at most
2(1+γ)s·r·c(f)
s·|D|=2(1+γ)c(f)·r|D|
According to Fact 3.41, a capacity change of c(f)requires the replacement of at least
c(f)·r|D|
2copies for the copy distribution to remain fair with respect to f. Hence, for
clean subframes, SPREAD is amortized 4(1+γ)-adaptive.
Updating a dirty subframe f.If fis dirty, we maintain two tables for f. One table, T1,
for the interval f1from the starting point of ftill the endpoint of I(v), and one table, T2,
for the interval f2from the endpoint of I(v)till the endpoint of f. The multisets C(f1)
and C(f2)are equal to C(f)with the difference that C(f1)contains a copy of node vwhile
86 Random Allocation of Data Copies
I(v)
h(v)
f
f1f2
Figure 3.10: A Dirty Subframe
C(f2)does not contain v. Our goal is to make sure that T1and T2differ in at most 2(1+γ)
α
slots, which we call the proximity condition. This is ensured by the following strategy.
Suppose that C(f)stays the same but the size of I(v)changes. Then, we only update f1
and f2accordingly and leave the tables T1and T2as before, which satisfies the proximity
condition. If C(f)only changes because the endpoint of I(v)enters ffrom below, then T2
inherits the table of f, and T1is obtained by applying slot reassignments for the node vto
table T2until the table conditions are met for f1. According to Lemma 3.52, this requires
the reassignment of at most 2(1+γ)
αslots, so the proximity condition holds afterwards. If
C(f)only changes because the endpoint of I(v)is moving below f, then T2is chosen as
the table for f. Similar solutions can be found if the endpoint of I(v)enters or leaves f
from above.
In all other cases, we first ignore a potential change in I(v), which means that the sizes
of f1and f2remain the same. We then adapt the table Tiof the interval fiof largest size
among f1and f2as described for the table of a clean subframe fto get from C(f)to
C0(f)(ignoring changes in C(f)due to I(v)entering or leaving f), and then we construct
the table of the other interval by performing at most 2(1+γ)
αfurther slot reassignments in
order to remove or add slots for node v. Afterwards, we update the sizes of f1and f2if
necessary (i.e. if I(v)has changed).
Next, we bound the adaptivity of SPREAD for dirty subframes. Suppose that C(f)
stays the same but the size of I(v)changes by some `in f(i.e. `|f|). Then, this can
be charged to a change in capacity of node vof c=`
s·r. Hence, the proximity condition
ensures that the expected number of copy movements is at most
`·2(1+γ)
α
r·s
α·r|D|=2(1+γ)c·r|D|
which implies that, with respect to c, SPREAD is 4(1+γ)-adaptive in this case.
3.5.2 The SPREAD Strategy 87
If C(f)only changes because the endpoint of I(v)enters ffrom below or is moving
below f, and this is associated with a change of I(v)of length `with respect to f, then it
follows analogously to the first case that SPREAD is 4(1+γ)-adaptive.
It remains to consider any remaining case. Let µ1 be the total change in multiplici-
ties in C(f)(ignoring the change caused by I(v)). W.l.o.g., suppose that |f1||f2|. Then,
at most
2µ(1+γ)
α
slots are reassigned in T1and at most
(µ+2)2(1+γ)
α
slots are reassigned in T2. The latter bound holds because T2differs in at most 2(1+γ)
αslots
from T1before and after the reassignments.
Since a total change of µcan be charged to a capacity change of c(f)|f|µ
s·rwith
respect to fin the past, |f1| |f2|, and µ1, it follows from the arguments for clean
subframes that the expected number of copy movements due to these changes is at most
µ·2(1+γ)
α
r·s
α
+(µ+2)2(1+γ)
α
r·s
α!|f|
2·r|D| 2µ|f|·2(1+γ)
α
r·s
α·|D|
4(1+γ)s·r·c(f)
s·|D|
=4(1+γ)c(f)·r|D|
For any additional changes due to I(v), SPREAD is 4(1+γ)-adaptive. Hence, overall
SPREAD is amortized 8(1+γ)-adaptive for dirty subframes. Summing up the adaptivity
bounds over all subframes results in an amortized adaptivity of 8(1+γ).
3.5.2.5 Adaptivity.
In order to get from amortized adaptivity to adaptivity, we replace the deterministic round-
ing rules for the intervals above by a randomized rounding rule. More specifically, we
choose an additional (pseudo-)random hash function h0:V[0,1), and for every in-
terval I(w)and subframe f= [x,y)in Fvwith v,w(or |I(w)|≥|Fw|), we check the
following:
Randomized rounding conditions:
1. Whenever the endpoint of I(w)crosses x+h0(v)
|f|mod 1 from below, µC
f(w)is in-
creased by 1.
2. Whenever the endpoint of I(w)crosses x+h0(v)
|f|mod 1 from above, µC
f(w)is de-
creased by 1.
88 Random Allocation of Data Copies
With this rule we obtain the following result:
Lemma 3.53. For any capacity change, SPREAD is 8(1+γ)-adaptive.
Proof. First, suppose that ndoes not change. Consider any capacity change in the system,
and for any node v, let (I(v)) be the interval representing the difference between I(v)
before and I(v)after the change. Suppose that the starting point of (I(v)) is in some
subframe fand the endpoint of (I(v)) is in some subframe f0.
First, consider the case that (I(v)) f, i.e., f=f0. Then, it is easy to check that
the probability that µC
f(v)increases or decreases by 1 is equal to |(I(v))|
|f|. We know from
above that an increase or decrease of a multiplicity by 1 in a subframe frequires the
replacement of an expected number of at most
4(1+γ)|f|
s·|D|
copies in the system. Since cvchanged by cv=|I(v)|
r·s, this means that the expected
number of copies replaced due to node vis at most
|(I(v))|
|f|·4(1+γ)|f|
s·|D|=4(1+γ)cv·r|D|.
For (I(v)) *f, similar arguments also yield that the expected number of copies replaced
due to vis at most 4(1+γ)cv·r|D|.
Now, let us consider that nincreases. Then, for any rounded I(v)with endpoint in
some subframe f= [x,y)before the increase, I(v)is only rounded again, applying the
new decomposition, if I(v)I0(v)and the endpoint of I(v)passes x+h0(v)
|f|or y, or if
I0(v)I(v)and the endpoint of I(v)passes xor x+h0(v)
|f|, which preserves our adaptivity
bound.
Combining all of the results in this section satisfies all given demands. Note that with
the help of Chernoff bounds the adaptivity bound can also be shown to hold w.h.p. (up to
minor order terms) if the total capacity change is ω(ϕlogn), where ϕis an upper bound
on the maximum size of a subframe in the system. Hence, the smaller the subframes, the
smaller will be the deviation from the expected number of replaced copies.
3.5.2.6 Variable number of copies per data item.
If the number of copies per data item varies but is upper bounded by r, then we just need
to slightly adapt our storage strategy. For any data item with r0rcopies, we first select
a group of rdistinct nodes as before and then store r0copies among r0of these nodes
by selecting a (pseudo-)random starting node vin the group (via some additional hash
3.6. Conclusion and Open Problems 89
function) and then storing copies at the subsequent r0nodes in the group (where we treat
the group as a ring). It is not difficult to show that this preserves all the properties shown
above for data items with redundancy exactly r.
3.6 Conclusion and Open Problems
In this chapter, we observed and analyzed strategies for obtaining a redundant and fair
distribution of normalized copies of data blocks among a given storage environment that
may be considered static or, perhaps, dynamic. We showed, that when allocating the data
items to disks of non-uniform capacity by some sequential balls-into-bins like algorithm,
this will lead to an imbalance in the probability distributions of the allocations for each
step t2. However, we presented an efficient data placement strategy, called COMB, that
overcomes this imbalance problem such that for ridentical copies of some data item, all
copies are redundantly stored on rdistinct disks of arbitrary capacity and furthermore, the
fairness condition is satisfied. Unfortunately, since COMB is not adaptive and thus only
works efficiently for static storage environments, after that, we introduced the adaptive
SPREAD strategy which is efficient, always yields a redundant and fair data layout and is
O(1)-adaptive for any change in the capacities of the disks in case of system scalings.
However, with respect to the observed allocation problem, the following open problems
remain to be addressed in some future work.
Determine (by some function) the shift in the assigned weights of the r-tuples when
moving from a uniform to a non-uniform data placement.
Investigate the given data redundant allocation problem (and thus, all presented
algorithms) also for weighted data blocks.
Find a multiple-choice approach for the standard balls-into-bins problem consider-
ing non-uniform capacities.
Modify the r-tuple placement of the COMB strategy by the multiple-choice ap-
proach (and show (perhaps better) deviation bounds).
Prove Conjecture 3.36.
CHAPTER 4
Erasure Codes for Reading and Writing
We now switch to allocation strategies that provide good results but only for classical
storage environments (c.f. Section 1.1.2) and that use erasure (resilient) encoding to
ensure fault-tolerance in a storage network. In general, erasure coding is a technique
that has widely been employed for achieving high data availability and reliability in stor-
age networks (e.g. [PGK88, PT03, Pla97, REG+03]) and communication systems (e.g.
[ABB+97, NB97, BLM02, BLMR98, Riz97]).
From a general perspective, the main difference to replication-based strategies is that
with erasure coding, the original information is no longer stored separately from the re-
dundant data (copies) but is mixed up by the encoding yielding a codeword that contains
the desired degree of redundancy. More specific, an erasure (resilient) code maps a word
xof ksymbols drawn from some finite set Σ, which is referred to as the code alphabet,
into a codeword y of n>ksymbols from the same alphabet; the nsymbols of yare stored
separately on ndistinct disks in the storage network. If then some disks fail for reading, in
the optimal case, any ksymbols from yare sufficient to recover x. That is, such codes can
tolerate up to nkerasures which may be caused by failed, respectively temporarily not
accessible disks (throughout this chapter, we refer to xas an information word or message
of fixed size k, according to the erasure channel model introduced by Elias [Eli55]).
Although various erasure codes exist, the special set of codes we consider in the fol-
lowing are those for RAID-like storage systems such as RAID-arrays and SANs (e.g.
[PGK88, PT03, Pla97, REG+03]) and whose most prominent representatives are parity
codes like RAID and EVENODD [PGK88, BBBM94] or Reed-Solomon codes [HP03,
Pla97] (we assume the reader to be somewhat familiar with basic properties of erasure
codes in order to limit the scope of this thesis. However, a good overview can be found
in e.g. [HP03, Rom96]). In contrast to other applications, RAID-like storage systems are
much more sensitive concerning the size of the codeword because the separated storage
of the code symbols induces additional storage costs. Hence, the codeword size has great
influence on the required storage capacity but which is of less overhead than given with
91
92 Erasure Codes for Reading and Writing
replication, and this, clearly, is the very great advantage of erasure encoding in storage en-
vironments. Again, with replication, one has to cope with a storage overhead of a factor of
r1 that depends on the strict separation of the original and the redundant data (copies).
Instead, with erasure coding, redundancy is directly encoded into the data symbols, and
the ratio of original information to total code symbols is called the (information) rate of
the code rc=k
n<1 which imposes a storage overhead (or stretch factor) sc=1
rc=n
k.
Surely, it should hold that sc<2k(otherwise, one would be better off using replication
because of its simplicity). The least storage consumption of such codes is given by parity-
based schemes that (considering n=k) require a factor sc=n+1
n(RAID) and sc=n+2
n
(EVENODD). Additionally, since basing on simple XOR-operations, those schemes are
highly suitable for hardware realization. Unfortunately, parity codes can merely tolerate
a very small number of erasures at a time, what is often insufficient, especially in large-
scale SANs. In that case, more complex codes, like Reed-Solomon codes, are applied that
provide improved fault-tolerance.
However, compared to replication, the applied erasure codes lack of some substantial
drawbacks, that are (a) to almost be applicable in uniform storage environments only,
(b) to suffer from high complex algebraic operations for encoding/decoding, and (c) to
lack of the ability to adapt to changing storage environments, but most importantly, all
such codes suffer from a negative write and update behavior, respectively, because most
of them strictly focus on providing an (almost) optimal data recovery. That is, for any
k-sized information word xand corresponding codeword y, those codes aim at recovering
the data from up to nkerasures (what is an optimal recovery behavior). Unfortunately,
this optimal recovery property induces negative update behavior since, in particular, yis
generated as a function of the original symbols implying that on every change of x, the
complete codeword must also be updated, what is dismal. This overhead is widely known
as write-penalty. Moreover, in a SAN, the write procedure is no longer atomic but rather
describes a sequence of write operations which usually involves writing to multiple disks.
These concurrent writes may then induce inconsistencies but which can, for instance, be
reduced if only a small fraction of all parity symbols must be rewritten. Some approaches
to reduce the update complexity are given by e.g. X-codes [XB99] but, however, such
codes are limited to recover from any two simultaneous erasures only.
In the following, we overcome the update problem by the Read-Write-Coding System
(RWC), a very flexible class of erasure codes that are closely related to Reed-Solomon
codes but offer enhanced flexibility and improved update behavior. However, before dis-
cussing the RWC system in more detail, we first require some brief introduction to linear
block codes, the class of codes to which our Read-Write codes belong. To bound the
scope of this thesis, we suppose the reader to be familiar with basics from linear algebra,
number theory like e.g. modular arithmetic, and abstract algebra (e.g. group theory and
the theory of finite fields) (see e.g. [HP03, Rom96, LN86] for overviews).
4.0.1 Preliminaries 93
4.0.1 Preliminaries
Throughout this chapter, we address so called linear block codes which have useful prop-
erties for the storage systems we refer to. In particular, we consider static RAID-like
storage systems that (a) are supposed to consist of a fixed sized set [n]of disks for which
furthermore a fixed allocation stripe size can be defined, and (b) for which, driven by me-
chanical components, access latency to the storage devices is magnitudes higher than the
required time for encoding or decoding the data. With respect to the first issue, usually in
those storage systems, so-called block codes are utilized in which all codewords have the
same fixed length nand that have good potential to meet the desired disk utilization (in
what follows, we state state some important facts about linear codes in very brief without
proving the listed theorems; the proofs can be found in the references mentioned above).
Definition 4.1. (Block Code) Let Σbe an arbitrary finite alphabet with |Σ|2. Then, an
[n,m]-block code over Σis an m-elementary subset C Σnwith m 1. The elements of
C are referred to as codewords, all of which have fixed length n, and the number |C|=m
of codewords in C is the size of the code.
Now, to be more specific, the special class of block codes we concentrate on in what
follows are linear codes which are defined by the means of linear algebra and thus, in-
duce an easier description/analysis in general and with respect to the encoding/decoding
operations. In particular, linear codes are codes over finite fields, that is, for a linear
code, Σ=Fqand Fqis a finite field with qelements. Let Fk
qdenote the k-dimensional
vector space of all k-vectors over the finite field Fq. Then, our k-sized messages are vec-
tors x= (x1,x2,...,xk)Fk
q, and we refer to Fk
qas the message, respectively information
space. Furthermore, to introduce redundancy, we consider the following embedding of
the vector space Fk
qinto the n-dimensional vector space Fn
q,n>k.
Definition 4.2. (Linear Code) Consider the following injective linear mapping
γ:Fk
qFn
q
which defines a linear embedding of the message space Fk
qinto Fn
q. Then, the image
C=γ(Fk
q)is a k-dimensional subspace of Fn
qthat is isomorphic to Fk
q, and to which we
refer to as linear (n,k)-code, or simply (n,k)-code, of size |C|=qk.
It can be shown that every linear (n,k)-code Cover the field Fqis an (n,qk)-block code
over the alphabet Σ=Fq. The parameter nis called the block length and kis the dimension
of the code.
The isomorphism between Fk
qand the (n,k)-code C=γ(Fk
q)is usually described by
ak×n generator matrix Γof rank kover the field Fqwhose rows define a basis for C.
Thus, we obtain
C=γ(Fk
q) = {vΓ|vFk
q}.
94 Erasure Codes for Reading and Writing
For any set of kindependent columns of Γ, the corresponding set of coordinates forms an
information set for C. The remaining r=nkcoordinates are termed a redundancy set
and ris called the redundancy of C. If the first kcoordinates form an information set, the
code has a unique generator matrix of the form (Ik|A)where Ikis the k×kidentity matrix
and we denote Γto be in standard form as depicted in the following example.
Example 4.3. The matrix Γ= (I4|A), where
Γ=
1 0 0 0 0 1 0
0 1 0 0 1 0 1
0 0 1 0 1 1 0
0 0 0 1 1 1 1
,
is a generator matrix in standard form for a (7,4)binary code.
The overall complexity of the encoding vvΓis given by the complexity of the multi-
plication vΓwhich is O(n·k)assuming O(1)-time for addition/multiplication in Fq.
We now introduce a useful metric on Fn
qwhich, besides the block length and the di-
mension, is the third important parameter to determine the performance of a code.
Definition 4.4. (Hamming Metric) Consider the function d :Fn
q×Fn
qNdefined as
d(w,w0):=|{i|wi,w0
i}| 0
and which satisfies the following properties:
1. (non-negativity) d(w,w0)0for all w,w0Fn
q,
2. d(w,w0) = 0if and only if w =w0,
3. (symmetry) d(w,w0) = d(w0,w)for all w,w0Fn
q,
4. (triangle inequality) d(w,w00)d(w,w0)+d(w0,w00)for all w,w0,w00 Fn
q.
Then, d defines the so-called Hamming metric on Fn
q, and d(w,w0)is called the Hamming
distance between the vectors w,w0Fn
q. The pair (Fn
q,d)is referred to as a metric space,
called the Hamming space of dimension n over the alphabet Fq.
Given the Hamming distance, we can now define the minimum distance of a code Cwhich
measures the difference between most similar codewords and that is an important invariant
determining the error-correcting capability of C.
Definition 4.5. (Minimum Distance) Let C be an (n,k)-code with |C| 2. Then, the
minimum distance d(C)between distinct codewords is
d(C)) :=min{d(y,y0)|y,y0C,y,y0}.
4.0.1 Preliminaries 95
Since y,y0, it follows that d(C)1. If the minimum distance of an (n,k)-code is known,
we let d:=d(C)and refer to the code as an (n,k,d)-code. Then, we can state the follow-
ing important theorem on the erasure correction capability of a linear code.
Theorem 4.6. With an (n,k,d)-block code C, it is guaranteed that, after sending any
codeword yiC over some noisy channel, yican completely be corrected if the number
of erasures in yiis at most d 1.
Proof. Suppose by the way of contradiction that, after sending yi, a word wis received
containing ed1 erasures. Then, the pattern of necorrect symbols in wmatches yi,
thus, the Hamming distance d(w,yi)d1. If then there is another codeword yjwith
d(w,yj)d1, it follows that d(yi,yj)d1 because yiand yjcan only differ in the
ed1 coordinates. But this is a contradiction since according to Definition 4.5, there
are no two codewords yi,yjwith d(yi,yj)d1.
Therefore, the higher the minimum distance dof some (n,k,d)-code, the more erasures
the code can tolerate. With respect to the block length, we call the ratio δ(C) = d
nthe
relative distance of the code which is a measure of the error-correcting capability of the
code relative to its length. This ratio holds for any code (linear or not) of length nthat
has minimum distance d. Now, recall the information rate rc=k
n<1 as given in the
introduction of this chapter. It holds that the higher the rate the higher the proportion of
coordinates in a codeword actually containing information rather than redundancy, what
makes the code less fault-tolerant. On the other hand, a high rate implies low storage
overhead. As we can see, there is a trade-off between the rate and the relative distance
when applying linear block codes in storage environments, and the major goal in coding
theory is to find codes that have high rate as well as high relative distance (that is, both
parameters should be close to 1). With respect to this objective, the following theorem
states an important upper bound on the parameters (n,k,d).
Theorem 4.7. (Singleton Bound) For every linear (n,k,d)-code over Fq, it holds that
dnk+1.
Therefore, according to the Singleton Bound, optimal codes are those for which equal-
ity d=nk+1 holds. Such codes are called maximum distance separable codes (MDS
codes) which can be defined as follows.
Definition 4.8. (MDS Code) A linear (n,k)-code C is an MDS-Code if in each generator
matrix Γfor C, any k columns are linearly independent.
Corollary 4.9. A linear (n,k)-code C is an MDS-Code if and only if any k components
define an information set.
96 Erasure Codes for Reading and Writing
It holds that no code of length nand minimum distance dhas more codewords than an
MDS code with parameters nand d. Trivial MDS codes, for instance, are the (n,n1,2)-
parity codes in which a parity symbol is added to an information word of length k=n
representing the parity of the sum of all ninformation symbols as well as codes with
parameters (n,1,n),(n,n,1)who exist over every field Fq(c.f. Section 4.6).
Definition 4.8 already emphasized that the recovery capability of a linear code depends
on the invertibility of the generator matrix Γ, thus, the ability to recover the information
word from the codeword always requires an invertible k×ksub-matrix of Γ. In the fol-
lowing definition, we introduce the prominent Vandermonde matrix in which any s×s
sub-matrix is invertible for some given s. Hence, the Vandermonde matrix has very useful
properties for data recovery in linear block codes.
Definition 4.10. (Vandermonde Matrix) Let α1,...,αsbe elements in a field F. The
s×s matrix V = [vi,j], where
vi,j=αj1
i=
1α1α2
1... αs1
1
1α2α2
2... αs1
2
1α3α2
3... αs1
3
1.
.
.....
.
.
1αsα2
s... αs1
s
,
is called a Vandermonde matrix.
Lemma 4.11. The determinant det V=1i<js(αjαi). In particular, det V,0if the
elements α1,...,αsare distinct.
That is, Vis nonsingular, and therefore, it is invertible.
4.0.2 Data Reconstruction from Failures
Besides providing an efficient computation of the algebraic operations used for encoding,
the major goal when designing erasure codes for storage-based application is to ensure a
more or less optimal data reconstruction in case of occurring erasures. In todays RAID-
like storage systems, whenever obtaining high fault-tolerance is the major design focus
(rather than efficient computation as given with parity codes), advanced MDS codes, like
e.g. Reed-Solomon codes, are almost applied which can reconstruct the information from
up to nkerasures as described in the following.
Consider the information vector x= (x1,x2,...,xk), and let Γbe an n×kgenerator
matrix. Then, a linear (n,k)-code Cis represented by the following equation:
Γx=y.
4.0.2 Data Reconstruction from Failures 97
Furthermore, we can define functions Γ1,Γ2,...,Γnwhere each Γidenotes a linear com-
bination of the information word that works on a word-by-word basis for generating the
code symbol yi:
yi=Γi(x1,x2,...,xk) =
k
j=1
xjγi,j
In other words, the functions Γiare the rows of Γfrom which by multiplication with the
vector xthe complete codeword yis obtained. If we furthermore define Γto be the n×k
Vandermonde matrix, that is, γi,j=αi1
j(Definition 4.10) with pairwise distinct elements
α1,...,αkFq, the equation Γx=ybecomes:
γ1,1γ1,2... γ1,k
γ2,1γ2,2... γ2,k
.
.
..
.
.....
.
.
γn,1γn,2... γn,k
x1
x2
.
.
.
xk
=
1 1 1 ... 1
α1α2α3... αk
.
.
..
.
..
.
.....
.
.
1αn1
2αn1
3... αn1
k
x1
x2
.
.
.
xk
=
y1
y2
.
.
.
yn
Assuming that kcomponents of yare available at the receiver, the information vector
can be reconstructed by using any kequations that correspond to the known components
of yand which form an invertible k×ksub-matrix Γ0. Surely, the extracted matrix Γ0is
only invertible if the kchosen equations are linearly independent (but this is always the
fact with a Vandermonde matrix).
Now, we describe a useful modification of the encoding that simplifies the data recon-
struction in case one excepts rather few erasures. For this, we consider a code whose
codewords yinclude a verbatim copy of the information blocks, and which we call sys-
tematic. This corresponds to including the identity matrix Ikin Γas already shown in
Example 4.3. We obtain the matrix ΓS= (Ik|Γ)and a codevector yS= (x|y)leading to
the following equation (ΓSx=yS):
1 0 0 ... 0
0 1 0 ... 0
.
.
..
.
..
.
.....
.
.
0 0 0 ... 1
1 1 1 ... 1
α1α2α3... αk
.
.
..
.
..
.
.....
.
.
1αn1
2αn1
3... αn1
k
x1
x2
.
.
.
xk
=
x1
x2
.
.
.
xk
y1
y2
.
.
.
yn
98 Erasure Codes for Reading and Writing
The reconstruction of the information word xfrom any kavailable blocks of the code-
word ySis depicted in Figure 4.1.
1 0 0
0 1 0
·
·
·
·
0 0 1
n
y x
k
Encoder Decoder
xy
n
k
Γ
S
Γ
S
1 0 0
0 1 0
·
·
·
·
0 0 1
Figure 4.1: Encoding/Decoding in matrix form for a systematic code (the top krows
constitute the identity matrix Ik). y0
Sand Γ0
Scorrespond to the grey areas of the vector and
the matrix on the right hand side.
Reconstructing xfrom any combination of up to ksymbols from yis always possible
since Γis a generator matrix and thus, any valid codeword yis a linear combination of
its columns. In particular, as Γhas rank k, always any krows are linearly independent
imposing the linear mapping xΓxto be injective, thus, xcan uniquely be determined
from its corresponding codeword y. Moreover, the system Γ0y0=xhas a unique solution
which can be obtained by Gaussian Elimination. Hence, xcan be reconstructed from any
subset of kcode symbols because each code symbol conveys information on all the k
source blocks (the ratio of this fraction of information is given by the rate).
However, any code that is able to reconstruct xfrom up to nkerasures suffers from a
bad update behavior. In particular, if one information symbol changes from xito x0
i, any
codesymbol yimust also be modified. This can be effected by subtracting out the portion
of the checksum word that corresponds to xi, and adding the required amount for x0
ias
follows:
y0
i=Γi,j(xi,x0
i,yi) = yi+γi,j(y0
iyi).
Surely, both calculation and maintenance of the symbols yican be done by simple arith-
metic but which are operations in the underlying finite field Fqwhich may be of different
complexity depending on the field chosen. Furthermore, if any of the disks keeping yi
is not accessible, there is no chance to store the modified codeword (except merely the
plain information word if a systematic code is applied such as given, for example, with
the RAID 4/5 encoding).
4.0.2 Data Reconstruction from Failures 99
Outline of this Chapter
In order to face the negative update behavior that is inherent to usual linear blocks codes
and which turns to be pretty costly when such codes are applied in RAID-like storage en-
vironments, we introduce the Read-Write-Coding-System (RWC) a very flexible frame-
work for generating different linear block codes, called Read-Write (RW) codes in the
following, which feature enhanced update properties for given codewords by simultane-
ously offering different degrees of fault-tolerance.
In particular, as we have seen, usual linear block codes merely consider a k-sized infor-
mation vector x, a codeword yof fixed length n, and some linear function γrealizing the
mapping from xto y. Instead, an RWC defines further parameters kr,wnwhich offer
enhanced possibilities to adjust the redundancy, and thus, the fault-tolerance capability of
an RW code. In the language of coding theory, for any fixed r, an RWC provides linear
(n,r,d)-codes over some finite field Fqthat have (a) minimum distance d=nr+1
(thus, are MDS codes if r=k), and (b) any two codewords are within a distance of at
most wfrom each other. More specific, an RWC generates appropriate subcodes of Reed-
Solomon (RS) codes (see e.g. [HP03] for details on RS codes) of dimension rand length
nin which every codeword has distance w. Depending on the values r,wand the field Fq
chosen, different block codes can be generated, e.g. parity codes (if q=2).
The ensured degree of redundancy mixed up with the improved update behavior offered
by an Read-Write code provides significant benefits for the observed storage systems
which, driven by the application’s read and write behavior, on one hand, suffer from
very costly I/O operations, and on the other hand, have to ensure some defined level of
fault-tolerance at any time. Clearly, a Read-Write code provides best improvements for
write-intensive applications because, given an n-sized codeword yand parameters r,w, it
can decode the information from any rsymbols of ywhereas only any wsymbols of y
must be updated whenever the information word xchanges completely (recall that we can
choose w<n). This is different to other linear codes, like e.g. Reed-Solomon codes,
which always must rewrite the codeword ycompletely as xchanges.
An example. Consider a RAID 4-parity code with n=4 hard disks storing a data file
bit by bit, Σ={0,1}. We encode k=3 bits x1,x2,x3to symbols y1=x1,y2=x2,y3=x3
and y4=x1+x2+x3, where addition denotes the XOR-operation and the code symbols yi,
1i4, are stored separately on distinct disks. The XOR-operation enables to recover
the original three bits from any combination of r=3 hard disks, e.g. giving y2,y3,y4
we have x1=y2+y3+y4,x2=y2, and x3=y3. Thus, if any one disk is temporarily
not available, reading data is still possible. However then, writing data is not possible
since a complete change of the original information involves the change of the entire
code; we call this code consistency (this also holds for any other erasure code applied in
storage environments). The second example shows an RW-code. Again, consider n=4
100 Erasure Codes for Reading and Writing
Contents Code Line
x1x2y1y2y3y4v
0 0 0 0 0 0 0
0 0 1 1 1 1 1
0 1 0 1 0 1 0
0 1 1 0 1 0 1
1 0 0 0 1 1 0
1 0 1 1 0 0 1
1 1 0 1 1 0 0
1 1 1 0 0 1 1
Table 4.1: A(2,3,3,4)2-Read-Write-Code for contents x1,x2and code y1,y2,y3,y4. Ev-
ery information vector has two possible code words. Even if only three of the four code
words are available for reading and writing, the system can perform read and write oper-
ations (see Figure 4.3 for the encoding).
hard disks with code bits y1,y2,y3,y4. Now, we encode k=2 information bits x1,x2
such that any r=3 code bits yi,yj,ykcan be used to recover the original message, and
furthermore, only any of such w=3 code bits yi0,yj0,yk0need to be changed to encode a
completely new information. For instance, start with the codeword (0,1,1,?). According
to Table 4.1, the information is (1,1)and therefore the complete code is (0,1,1,0). Now,
we want to encode (0,1)without changing the second entry. For this, we choose line 0
for information (0,1)and get code (0,1,0,1).
Moreover, RW codes can exploit system information about existing erasures, which are
caused by failed or blocked disks and that have rather long-term character in a SAN, for
encoding and decoding. For instance, consider a codeword ywith symbols stored on n
disk from which bnwdisks are unreachable (e.g. failed or blocked). Then, using
an RW code, ycan still be updated to the codeword y0in a code consistent manner. Fur-
thermore, if then some of the formerly blocked disks become available again while some
other b0nrdisks turn to be unreachable, we can still recover the new information
word x0by simply selecting any rof the remaining nb0accessible disks. Therefore, as
long as sufficient disks are accessible, an RW code provides code consistent operations
by circumventing blocked disks.
At last, some RW codes offer the possibility to change any of the parameters k,r,w,n
during runtime, that is, a (k,r,w,n)-RW code can be changed to (nearly) any choice of
(k0,r0,w0,n0)giving such codes the ability to adapt to changing system conditions.
From now on, we denote a (k,r,w,n)b-Read-Write code a code which is given a k-
symbol message xthat is encoded into an n-symbol codeword ywith symbols drawn from
4.1. The Operational Model 101
some b-symbol alphabet Σ. Furthermore, we consider parameters rfor recovering the
information word and wfor modifying a given codeword y, with kr,wn. In the next
section, we state the operations of the RWC formally. After that, we prove general bounds
for the parameters of RW codes and present a general scheme to generate (k,r,w,n)b-RW
codes as long as k+nr+wholds for an appropriate choice of b. Then, we introduce
adaptive RW codes for which any of the given parameters can be subject to changes. At
last, we investigate the question for which choices of (k,r,w,n)a coding system exists
over the binary alphabet F2={0,1}and discuss how RW codes can be combined. At
last, notice that the following description is given in more operation-based terms rather
than conceptual since RW codes base on the same well-studied algebraic principles as RS
codes.
4.1 The Operational Model
Before we now introduce the concepts of the Read-Write codes in more detail, we first
present the operations assigned to a Read-Write Coding System. Again, Read-Write
codes encode information words into codewords. The information is given by a k-tuple
over some finite alphabet Σ, and since Read-Write codes are linear block codes, the code-
word is an n-tuple over the same alphabet, for k<n. For what follows, let b=|Σ|and
P(M)denotes the power set of some set M. Moreover, let P`(M):={SP(M)| |S|=`}.
Then, a (k,r,w,n)bRead-Write-Coding-System (RWC) provides the following opera-
tions.
1. Initial state x0Σk,y0Σn
This is the initial state of the system with information x0and codeword y0. This
state is crucial because all further operations ensuring the beneficial features of an
RW code depend on this initial state.
2. Read function f:Pr([n])×ΣrΣk
This function reconstructs the information by reading rsymbols of the codeword
whose positions are known. The first parameter shows the positions (indices) of
the symbols in the code, and the second parameter gives the corresponding code
symbols. The outcome is the decoded information.
3a. Write function g:Pr([n])×Σr×Σk×Pw([n]) Σw
This function adapts the codeword to a changed information by changing wsymbols
of the codeword at wgiven positions. The first two parameters describe the reading
of the original information. Then, we have the new information as a parameter, and
the last parameter indicates which code symbols to change in the codeword. The
outcome are the values of the new wcode symbols.
102 Erasure Codes for Reading and Writing
3b. Differential write function δ:Pw([n])×ΣkΣw
This is a restricted alternative to the write function whose parameters are the po-
sitions Sof symbols available for writing as well as the difference of the original
information xand the new information x0, but without reading the wcode entries.
The outcome is the difference of the available old and the new codeword symbols.
Thus, for two functions 1:Σk×ΣkΣkand 2:Σw×ΣwΣwand wgiven po-
sitions from y, we can describe the write function gabove by the differential write
function as
y0=2(y,δ(S,1(x,x0))).
All RW codes presented here have such differential write functions where 1,2
denote the bit-wise XOR-operations. The goal is that e.g. a controller in a storage
device ican, by itself, update its kept block yiby simply adding (XOR) the received
difference γof yiand y0
i, i.e. y0
i=yi+γ, if, for example, the device is blocked
between reading the old and writing the modified block.
For a tuple y= (y1,...,yn)and a subset SP`([n]), let CHOOSE(S,y)be the tuple
(yi1,yi2,...,yi`)where i1,...,i`are the ordered elements of S. Furthermore, for an `-tuple
d, let SUBST(S,y,d)be the tuple where according to Seach indexed element yi1,yi2,...,yi`
of yis replaced by the element taken from dsuch that CHOOSE(S,SUBST(S,y,d)) = d
and all other elements in yremain unchanged in the outcome.
Now, for S0Pr([n]), define the read operation
Read(S0,y):=f(S0,CHOOSE(S0,y))
and for SPw([n]) and x0Σk, the write operation
Write(S,S0,y,x0):=SUBST(S,y,g(Read(S0,y),x0,S)).
Since any Read-Write code needs to start at some initial state, we define the set of possible
codewords Yas the transitive closure of the function y7→Write(S,S0,y,x0)starting with
y=y0and allowing all values S,S0,x. Then, an RW code is correct if the following
statements are fulfilled.
1. Correctness of the initial state: SPr([n]):
Read(S,y0) = x0.
2. Consistency of read operation:S,S0Pr([n])
yY:Read(S,y) = Read(S0,y).
3. Correctness of write operation:
SPw([n]),S0Pr([n]),yY,
xΣn:Read(S0,Write(S,S0,y,x))=x.
4.2. Lower Bounds 103
Before we continue explaining the inherent mechanisms of an RWC in order to prove
the operational statements above more formally, we first state some lower bounds on the
parameters of an RWC that must be satisfied for the existence of RW codes at all.
4.2 Lower Bounds
The example of a (2,3,3,4)2-RW code in the previous section stores two symbols of
information in a four symbol code (c.f. Table 4.1). Unfortunately, this storage overhead
of a factor two is unavoidable, as the following theorem shows (moreover, this implies
that e.g. no (3,3,3,4)b-RWC exists).
Theorem 4.12. For parameters r +w<k+n or r,w<k and any base b, there does not
exist any (k,r,w,n)b-RWC.
Proof. Consider a write operation and a subsequent read operation where the index set W
of the write operation (|W|=w) and the index set Rof the read operation (|R|=r) have an
intersection: WR=Swith |S|=r+wn. Then, there are bkpossible change vectors
with symbols in Sthat need to be encoded by the write operation since this is the only
base of information for the subsequent read operation. This holds because all further R\S
code symbols remain unchanged. Now, assume that |S|<k. Then, at most bk1possible
changes can be encoded, and therefore, the read operation will produce faulty outputs for
some write operations. Thus, r+wnkand the claim follows.
If r<k, only brdifferent messages can be distinguished while bkdifferent messages
exist. Then, from the pigeonhole principle, it follows that such a code does not exist. For
the case w<k, this is analogous.
From Theorem 4.12, it follows that in the best case (k,r,w,n)b-RW codes have param-
eters r+w=k+n. We call such RW codes perfect. Unfortunately, such perfect codes do
not always exist for any choice of parameters as the following lemma shows.
Lemma 4.13. There is no (1,2,2,3)2-RW code.
Proof. Consider a read operation on the code bits y1,y2and a write operation on y2,y3.
Then, y2is the only intersecting bit which must be inverted in case of an information bit
flip. The same holds for bit y3when considering a read operation on y1,y3and a write
operation on y2,y3. Thus, together, both y2and y3have to be inverted if the information
bit x1flips. Now, consider a sequence of three write operations on bits (1,2),(2,3),(1,3)
each inverting the information bit x1. After these operations, all code bits have been
inverted twice bringing it back to the original state. In contrast, the information bit has
been inverted thrice and is thus inverted. Therefore, all read operations lead to wrong
results.
104 Erasure Codes for Reading and Writing
Contents Code Line
x y1y2y3v
0 0 0 0 0
0 1 1 1 1
0 2 2 2 2
1 0 1 2 0
1 1 2 0 1
1 2 0 1 2
2 0 2 1 0
2 1 0 2 1
2 2 1 0 2
Table 4.2: A(1,2,2,3)3-Read-Write code for an information word xand codeword ycon-
sisting of symbols y1,y2,y3. For every information, there are three possible codewords,
and if only any two of them three are available for reading and writing, the system can
perform read and write operations (see Figure 4.4 in the next section for the encoding).
However, if we allow a larger symbol alphabet, we can find an RW code.
Lemma 4.14. There exists a (1,2,2,3)3-RW code.
Proof. See Table 4.2 for an example. The correctness is straight-forward.
Clearly, concerning operational complexity, b=2 (i.e. F2) is the best choice for codes
applied in SANs because XOR-based I/O operations can often efficiently be realized
in hardware. However, as common RAID 4/5 schemes as well as parity-based Reed-
Solomon codes correspond to an (n,n,n+1,n+1)2-RW code, n1, the following lemma
shows that there is no parity-based placement scheme offering better update properties.
Lemma 4.15. For n 1, there is no (n,n,n,n+1)2-RW code.
Proof. The proof follows directly from Theorem 4.12.
4.3 Encoding/Decoding in the RWC
Now, we know that as long as the symbol alphabet is chosen sufficiently large, perfect RW
codes always exist, and in the following, we show how perfect RW codes are generated
by describing its encoding and decoding process in terms of vector arithmetic (surely,
we always assume the symbol alphabet (i.e. the basis) to be sufficiently large whereas
throughout this chapter, we use the terms basis and symbol alphabet interchangeably).
4.3. Encoding/Decoding in the RWC 105
Fqk
Fqr
Fqn
γ(Fqk)
ζ(Fqr)
γ:Fqk
Fqn
x0
y0
S0
ext:Fqk
Fqr
x‘0
y‘0
x1
x‘1
y‘1
ζ:Fqr
Fqn
w
x
+ ∆ v
Figure 4.2: Illustration of Encoding/Decoding RW Codes
As already mentioned, Read-Write codes are closely related to Reed-Solomon codes
and can thus be described by the same methods (i.e. by polynomials over finite fields).
Nevertheless, in this thesis, we rather use a vector-based description that is better suited
to sketch the encoding/decoding operations. In particular, for given information tuples
x= (x1,...,xk)Σkwhose base bis a sufficiently large finite field Fqand additional
parameters kr,wn, we examine special subcodes of larger Reed-Solomon codes that
have dimension rand length nand in which every codeword y= (y1,...,yn)Σnhas
distance at most w(Figure 4.2 depicts the generation of an RW code).
Every RW encoding starts with an initial information word x0generating an initial
codeword y0. As usual with linear codes, since we have chosen some finite field Fqfor the
basis, we can consider x0to be a vector in the vector space Fk
q. In a usual linear encoding,
x0would then be mapped directly into the vector space Fn
qof dimension n>kby some
linear function γ:Fk
qFn
qyielding y0. More particularly, the resulting codeword y0lies
in some k-dimensional subspace γ(Fk
q)of Fn
qthat is isomorphic to the vector space Fk
q,
and as already illustrated in Section 4.0.2, if we suppose an MDS code, optimal recovery
from up to nkerasures can then be achieved (recall that such codes have minimum
distance d=nk+1). However, as we also stated already, whenever the information
word x0changes to x1, all ncode symbols of y0must be modified implying a distance
d=nbetween any two codewords in γ(Fn
q), what, again, is dismal in a SAN.
106 Erasure Codes for Reading and Writing
To improve the update behavior of linear codes such that, at most, w<nsymbols suf-
fice to encode a complete information change, a Read-Write encoding takes the following
detour concerning the vector mapping. First, depending on the parameter r, each informa-
tion vector xFk
qis extended to an r-sized vector x0by adding `=rk(slack) symbols
v0
1,...,v0
`from the underlying symbol alphabet Fqthat carry no particular information.
In other words, we translate xinto an r-dimensional vector x0Fr
qby some mapping
function ext :Fk
qFr
q. Importantly, (only) in case of the initial information word x0,
the `symbols can be arbitrarily chosen from Fqyielding q`possible extension vectors x0
0
(denoted by the subset S0in Figure 4.2). Depending on the choice of the slack symbols,
a vector x0
0S0is fixed which is then mapped by some linear function ζ:Fr
qFn
qto
the codeword y0
0lying in the subspace ζ(Fr
q)of Fn
q(c.f. dashed lines). Clearly, ζ(Fr
q)is
an r-dimensional subspace of Fn
qwhich is isomorphic to Fr
q. Moreover, as being also an
(n,r)-linear MDS code, it is spanned by a set of n-dimensional basis vectors b1,...,br,
and for given parameters n,k,r, a (k,r,w,n)b-Read-Write code has minimum distance
dRW =nr+1 and is thus capable of tolerating up to nrerasures (see next section
for a more formal description). Surely, this recovery property is not at good as offered by
usual MDS codes but which is outweighed by the following improved update behavior.
Again, consider the initial information word x0and the corresponding RW codeword
y0ζ(Fr
q). Now, let x0somehow be changed/modified by the user/application what
results in the new information word x1=x0+xwith xdenoting the information dif-
ference between x0and x1. Usually, in common linear (MDS) codes, this would create
a new, independent encoding γ(x1) = y1Fn
qof distance d {nk+1,...,n}whose
upper bound cannot be improved. Instead, in a Read-Write code, in order to achieve that,
at most, w<ncode symbols must be modified, any new encoding, i.e. any vector pair
x0
i,y0
i, strongly depends on the initial setting x0,y0and all subsequent vector pairs x0j,y0jfor
0<j<i. The secret lies in appropriately adjusting the `=nw=rkslack variables
in each step. In particular, for an arbitrary step i>0, consider the extended vector x0
iwith
slack variables v0
ivalid for the corresponding codeword y0
i. Then, to achieve y0
i+1=y0
i+y
from x0
i+1=x0
i+xin the following step, the new slack variables v0
i+1=v0
i+vmust be
determined such that in the vector yexactly `symbols remain unchanged (that is, are
set to 0). From this, it follows that, at most, wsymbols must be rewritten, and thus, the
codeword space ζ(Fr
q)corresponds to Y, the transitive closure of the initial codeword y0
(c.f. Section 4.1).
4.3.1 The Matrix Approach
We now state the generation of RW codes formally for what we present a matrix-based
approach which is similar to common linear encodings. In particular, we consider the
information word x= (x1,...,xk)Fk
q, the corresponding codeword y= (y1,...yn)Fn
q,
4.3.1 The Matrix Approach 107
and for any modification in x, let δ=xbe the information change vector. Moreover,
let v= (v1,...,v`)denote the vector of internal slack variables with `=nw=rk
and which carry no particular information. The linear mapping is given by an appropriate
n×rgenerator matrix Γwith Γi,jFq; the sub-matrix (Γi,j)i[n],j∈{k+1,...,r}is called the
variable matrix. Then, an RW code relies on the following equation realizing the mapping
function ζ:Fr
qFn
q.
Γ1,1Γ1,2··· Γ1,r
Γ2,1Γ2,2··· Γ2,r
.
.
..
.
..
.
.
Γn,1Γn,2··· Γn,r
x1
.
.
.
xk
v1
.
.
.
v`
=
y1
y2
.
.
.
yn
(4.1)
Operations
Initialization:
Starting with an information vector x0= (x1,...,xk), the variables (v1,...,v`)can
be set to arbitrary values (furthermore, if one wants to benefit from additional se-
curity features of this coding system (see Section 4.4), the slack variables must
be chosen uniformly at random). The codeword y0= (y1,...,yn)is computed by
Equation 4.1.
Read: Given rcode entries from y, compute x
We rearrange the rows of Γand the rows of ysuch that the first rentries of yare
available for reading, and let y0and Γ0denote these rearranged vector and matrix,
respectively. The first rrows of Γ0describe the r×rmatrix Γ00 that we assume to
be invertible. Then, the information vector x(and the variable vector v) is obtained
by: x
v= (Γ00)1y.
Differential write: Given the information change vector δand wcode entries from
y, compute the difference vector γfor the wcode entries. Recall that yis updated
by γwithout first reading the information of yat the wcode positions.
The new information vector x0is given by x0
i=xi+δi. This notation allows to
change the vector x0without reading its entries. Clearly, only the choices w<r
make sense. Now, according to Equation 4.1, given the new k-dimensional in-
formation vector x0, the task is to find another (rk)-dimensional vector ρwith
v0=v+ρsuch that the new codeword
y0=Γx0|v0T=Γ(x+δ|v+ρ)T
108 Erasure Codes for Reading and Writing
is a vector of distance at most w. Since we only consider at most wpositions of
y0, we may, without loss of generality, assume that the last nwpositions are zero
such that Γ(x0|v0)T= (y0
w|0)T, with y0
wof length w, and the vector 0= (0,...,0)T
is of length nw. Clearly, we must rearrange the rows of the matrix Γdue to
the vector (y0
w|0)T. Thus, for simpler description, we partition Γaccording to the
lengths of the sub-vectors involved, that is, we rearrange the rows of Γand ysuch
that the writable code symbols are y1,...,yw. Let Γ0denote this rearranged matrix
and y0the rearranged code vector. Hence, we define the following sub-matrices of
Γ0.
Γ←↑ = (Γ0
i,j)i[w],j[k],
Γ↑→ = (Γ0
i,j)i[w],j∈{k+1,...,r},
Γ←↓ = (Γ0
i,j)i∈{w+1,...,n},j[k],
Γ↓→ = (Γ0
i,j)i∈{w+1,...,n},j∈{k+1,...,r}.
Given these definitions, we then obtain
Γ←↑x0+Γ↑→v0=y0
w
Γ←↓x0+Γ↓→v0=0.
Obviously, an important precondition of the write operation is the invertibility of
the submatrix Γ↓→. The code symbol vector is then updated by the w-dimensional
vector
γ= ((Γ←↑)(Γ↑→)(Γ↓→)1(Γ←↓)) δ,
such that the new codeword y0is derived from the former code symbols at the w
given positions by simple addition, that is,
y0=y+γ.
In fact, the (2,3,3,4)2-RWC in Table 4.1 can be generated by this matrix based ap-
proach (Figure 4.3 gives the encoding. Additionally, compare also the (1,2,2,3)3-RWC
in Table 4.2 and Figure 4.4).
The following definition and statements formalize the properties of the parameters of
an RWC.
Definition 4.16. An n ×k-matrix A over any base b with n k is row-wise invertible if
each k ×k matrix, constructed by combining k distinct rows of A, has full rank (and is
therefore invertible).
4.3.1 The Matrix Approach 109
0 0 1
0 1 1
1 0 1
1 1 1
x1
x2
v1
=
y1
y2
y3
y4
Readable x1x2
code symbols
y1,y2,y3y1+y3y1+y2
y1,y2,y4y2+y4y1+y2
y1,y3,y4y1+y3y3+y4
y2,y3,y4y2+y4y3+y4
Write (x0
1,x0
2) = (x1+δ1,x2+δ2)
code y0
1=y1+y0
2=y2+y0
3=y3+y0
4=y4+
1,2,3δ1+δ2δ1δ20
1,2,4δ1δ1+δ20δ2
1,3,4δ20δ1+δ2δ1
2,3,4 0 δ2δ1δ1+δ2
Figure 4.3: A(2,3,3,4)2-Read-Write-Code over F2={0,1}modulo 2.
Theorem 4.17. The matrix-based RWC is correct and well-defined if the n ×r generator
matrix Γas well as the n ×(rk)variable sub-matrix Γ0is row-wise invertible.
Proof. Follows from the definition of row-wise invertibility and the description of the op-
erations. To prove the correctness of the coding system, we show that after each operation
Equation 4.1 is valid. This is straightforward for the initialization and read operations. It
remains to prove the correctness of the write operation.
Again, consider the additive vector (ρ1,...,ρ`)denoting the changes in the variable
vector vand the vector (γ1,...,γw). With this and the information change vector δ, we
obtain x0=x+δ,v0=v+ρand y0=y+γ. The correctness of the write operation then
follows by combining:
Γx0
v0=Γx+δ
v+ρ=Γx
v+Γδ
ρ
=
y1
.
.
.
yw
yw+1
.
.
.
yn
+
γ1
.
.
.
γw
0
.
.
.
0
This equation is equivalent to the following.
(Γ←↑)δ+(Γ↑→)ρ=γ
(Γ↓→)ρ+(Γ←↓)δ=0.
110 Erasure Codes for Reading and Writing
0 1
1 1
2 1
x
v=
y1
y2
y3
Readable
code symbols x
y1,y22y1+y2
y1,y3y1+2y3
y2,y32y2+y3
x0=x+δ
Writable symbols y0
1=y0
2=y0
3=
y1,y2δ+y12δ+y2y3
y1,y32δ+y1y2δ+y3
y2,y3y1δ+y22δ+y3
Figure 4.4: The (1,2,2,3)3-Read-Write code over the alphabet F3={0,1,2}modulo 3.
corresponding to Table 4.2 above.
Since δis given as a parameter, ρcan be computed as
ρ=Γ↓→1Γ←↓δ,
and γby the last upper equation. If ρis known, then the product Γ·(δ|ρ)T(reduced
to the first wrows) gives the difference vector γproviding the new code entries of y0by
y0=y+γ.
Theorem 4.18. For parameters k r,wn with r +w=k+n, there exists a (k,r,w,n)b-
RWC for an appropriate base b. Furthermore, this coding system can be computed in
polynomial time.
Proof. Follows from the following lemma and the fact that we use standard Gaussian
elimination for recovery.
Lemma 4.19. For each n k and basis b 2dlog2n+1e, there is a row-wise invertible
n×k-matrix V over the finite field Fb. Furthermore, each submatrix of V is also row-wise
invertible.
Proof. Define an n×kVandermonde like matrix Vconsisting of non-zero and pairwise
distinct elements (α1,...,αn)F[2dlog2n+1e]. Then, erase any nkrows what yields a
k×ksubmatrix V0that is also a Vandermonde-matrix. Since all Vandermonde-matrices
are invertible, the lemma follows.
4.4. Security and Redundancy 111
4.4 Security and Redundancy
Induced by the usage of additional slack variables and besides adjustable fault-tolerance
and improved update properties, Read-Write codes furthermore offer useful properties
concerning data availability and security. Consider, for instance, the very extreme sce-
nario of a combination of hard disks of nportable (laptop) computers within an office. If
then a (k,r,w,n)b-RWC is used for encoding of at most nlaptops, it is sufficient if at least
max{r,w}computers are accessible at the office at any time for data access and changes.
If merely rcomputers are connected, at least the read operations can be performed. Now,
what happens if computer hard disks are broken or information on some hard disks has
changed ? Then, the inherent redundancy of the (k,r,w,k)b-RWC allows to point out the
number of wrong data and repair it (to some extent).
A different problem occurs if computers are stolen by some adversary to achieve knowl-
edge about company data. The good news is that, for every matrix based RWC, it holds
that one can give away any nwhard disks without revealing any information to the
adversary. If the slack variables are chosen uniformly at random from Σ, the attacker will
receive hard disks with perfect random sequences, absolutely useless without the other
hard disks. As a surplus, this redundantizes the need for complex encryption algorithms.
Redundancy
Theorem 4.20. Every (k,r,w,n)b-RWC can detect and repair up to `faulty code symbols
if
n!(r+`)!
(n`)r!<1
2.
Additionally, it can reconstruct the data from n r missing code symbols (erasures).
Proof. The latter statement is trivial because if nrcode symbols are missing, then by
the definition of an RWC, the complete information can be recovered from any raccessi-
ble code symbols. Furthermore, if then `out of these rcode symbols are faulty, we can
simply test every combination of the n
rcombinations of rreceived code symbols and
take a majority vote over the information vector. In this vote, at least n`
rcombinations
produce the correct result what is a majority if
n`
r1
2n
r
which, by transformation, is equivalent to
n!(r+`)!
(n`)!r!<1
2.
112 Erasure Codes for Reading and Writing
Security
If the coded symbols are stored on distinct storage devices, with an (k,r,w,n)-RWC, the
loss of at most nmax{r,w}devices can be tolerated. For instance, if these storage
devices were stolen, then the following theorem shows that the thief cannot reveal any
information whatsoever from the encoded information. More particular, the attacker sees
only a completely random sequence vector.
Theorem 4.21. Every matrix based (k,r,w,n)b-RWC for which k+n=r+w holds can be
used such that every choice of n w code symbols does not reveal any information about
the original information vector.
Proof. Choose random values for the slack variables v1,...,v`from the symbol alphabet
Σat initialization (let |Σ|=b). Then, there is an isomorphism between these slack vari-
ables and the stolen code symbols yielding a total of b`possibilities for the stolen code
symbols to be changed. If more symbols are added, this starts to reveal some informa-
tion.
4.5 Adaptive RW Codes
Usually in this chapter, we consider static scenarios for a SAN environment, but neverthe-
less, some perfect RW codes can be modified to also behave somehow adaptive in case of
dynamic changes in the number of accessible disks (note that we still assume the storage
devices to be of uniform size). More clearly, we already know from the previous chapter
that in a SAN, adding and removing hard disks are the most delicate maneuvers. Provided
that the size of the underlying symbol alphabet is chosen appropriately, we show in the
following that perfect RW codes exist which allow to seamlessly continue all operations
without forcing the system to be in some intermediate and, more importantly, invalid state.
Consider the following example. Assume 10 disk in a SAN using an (8,9,9,10)-RW
code. Then, for better space utilization, the system administrator wants to switch to a
(4,7,7,10)-RW code. In a usual encoding, this requires all disks to be available for the
switch. Instead, in this section, we show that some special adaptive RW codes exist that
allow to switch from one code to the other with only 9 disks being accessible for reading
and writing. If then, after computing the re-encoding of all data on the 9 disks, the 10-th
disk returns, it can immediately participate in the new (4,7,7,10)-RW code. Moreover,
if the 10-th disk is permanently lost, it can be completely reconstructed from the new
(4,7,7,10)-RW code.
In particular, an adaptive RW code is a set of certain RW codes (k,r,w,n)bwith fixed
alphabet and which, unlike the previously described codes, has a switch function. If a
code is switchable, all parameters k,r,w,ncan be subject to change. Furthermore, with
respect to the codeword y, not all of the code symbols need to be read or changed.
4.5. Adaptive RW Codes 113
Theorem 4.22. Consider a sufficiently large constant M. Then, for a basis b 2dlog2M+1e,
there is an (M,b)-adaptive-RWC. In this system, it is possible to switch at any time from a
(k,r,w,n)b-RWC to any (k0,r0,w0,n0)b-RWC provided that n,n0M, and k0+n0=r0+w0
by merely reading any set of r encoded symbols and changing any set of w0encoded
symbols.
Proof. At first, we choose an appropriate base bby selecting a finite field Fqof suffi-
cient size q2dlog2M+1eand take the Vandermonde Matrix based approach as shown in
Section 4.3.1. We change the main equation to the following.
1α1α2
1... αM1
1
1α2α2
2... αM1
2
1.
.
.....
.
.
1αM1α2
M1... αM1
M1
x1
.
.
.
xk
v1
.
.
.
vrk
0
.
.
.
0
=
y1
.
.
.
yn
z1
.
.
.
zMn
Again, x1,...,xkare the content symbols, v1,···,vrkare the slack variables and y1,...yn
are the code symbols. The variables z1,...,zMncan be ignored for the beginning; they
are neither contents, slack nor code symbols and can be generated from the content and
slack symbols at any time. The initial vector as well as the read and write function are
chosen as in the matrix based approach. Then, the switch operation, that is, switching
from a (k,r,w,n)-RWC to a (k0,r0,w0,n0)-RWC, works as follows.
First, we read rcode symbols at given positions and decode the vectors xand vaccord-
ing to the matrix based approach. Then, we adapt the size of the former code to the new
code size. If n0>n, we compute the corresponding variables zifrom xand v. If n0<n,
we rename nn0code variables to z-variables, and thus, reduce the code size. If r0>r,
the content/slack-variable vector (x|v)Tis extended by (r0r)0-entries. We assume that
new contents are written during the switch-operation (especially, if k,k0). For this, let
v0
1,...,v0
r0k0be the new set of slack variables. Furthermore, we suppose at most w0code
symbols (positions) available for writing.
We start by erasing the rows n0+1,...,Min yand in the Vandermonde matrix since
they are of no interest for this operation. Then, like in Section 4.3.1, we rearrange the
residual matrix and the residual code vector such that the first wpositions are the writable
variables. We additionally rearrange the columns of the Vandermonde matrix and the
contents/slack vector such that the new slack variables are on the rightmost columns,
respectively lowermost lines. This results in the generator matrix Γ0(c.f. Section 4.3.1),
and the original vector xis rearranged up to the lowest r0k0entries (possibly containing a
mixture of old contents, old slack variables, and 0-entries). Let x0be the vector of the new
114 Erasure Codes for Reading and Writing
contents (adequately rearranged), and let v0be the new slack vector with r0k0entries.
If r0r,xhas k0entries, and otherwise, x(x0) has rr0additional entries resulting from
former slack or content variables that must to be set to 0.
We first consider the case r0r. The number of entries in xis k0. Then, we can perform
an RWC write operation changing w0code symbols. Let `0=r0k0=n0w0and partition
Γ0like in Section 4.3.1. That is, let Γ←↑ be a w0×k0-sub-matrix of Γ0,Γ↑→ aw0×k0-sub-
matrix, Γ←↓ an `0×n0-sub-matrix and Γ↓→ an invertible `0×`0-sub-matrix of Γ0. Again,
according to the matrix based approach, the new (rearranged) code vector y0is obtained
by
y0=y+h(Γ←↑)(Γ↑→)(Γ↓→)1(Γ←↓)i·(x0x)(4.2)
using the old (rearranged) writable symbol vector y. The proof of correctness is analogous
to Section 4.3.1.
Now, consider the case r0<r. Then, the number of entries in xand x0is ˜
k=k0+rr0.
Again, let x0be the new (adequately rearranged) vector containing the ˜
knew symbols, and
v0is the new slack variable vector with r0k0entries. Note that x0xcan be computed
at this stage. We now perform a slightly adapted matrix based RWC write operation that
changes w0code symbols. Clearly, compared to the previous case, that matrix consists of
additional rr0columns but what does not cause any problem since we only have to adapt
the sub-matrices. Furthermore, let `0=r0k0=n0w0and ˜w=w+rr0. Then, let Γ←↑
be a ˜wט
k-sub-matrix of Γ0,Γ↑→ a ˜w×`0-sub-matrix, Γ←↓ an `0ט
k-sub-matrix and Γ↓→
an invertible `0×`0-sub-matrix of Γ0. As usual, ydenotes the old and y0the new writeable
symbols. Then, applying the defined matrices, the new vector y0is obtained as given with
Equation 4.2, and again, the proof is analogous to the proof given in Section 4.3.1.
4.6 Boolean Read-Write-Codes
We already mentioned that the computational speed of RW codes strongly depends on
the size of the finite field used (c.f. Reed-Solomon codes) because encoding/decoding is
hardly driven by algebraic computations in the applied field, and the larger the field chosen
the more complex the encoding/decoding process. Thus, the most interesting case for the
choice of the alphabet is the binary case with Σ={0,1}. Unfortunately, as we have shown
in Lemma 4.13, there are no perfect Boolean (1,2,2,3)-RW codes, and furthermore, also
for the matrix based RW codes (also called parity Read-Write codes, according to parity
Reed-Solomon codes), the Boolean basis poses a severe restriction. The reason is that
only for some dimensions Boolean matrices are row-wise invertible. We address these
cases of valid (k,r,w,n)2-RW codes in the following.
4.6. Boolean Read-Write-Codes 115
Lemma 4.23. For each n, there exists an n ×1row-wise invertible Boolean matrix. Fur-
thermore, for k 2, there are n ×k row-wise invertible Boolean matrices if and only if
n{k,k+1}.
Proof. The first claim is trivial. For the second, note there is a (k+1)×krow-wise
invertible Boolean matrix, e.g. Γ= (Γi,j)i[k+1],j[k]such that
Γi,j=1 : i=ni=j
0 : else
Now, we prove that there are no (k+2)×krow-wise invertible matrices as we show that,
given a k×kfull rank Boolean matrix Γ, there is always a unique vector leading to a
(k+1)×krow-wise Boolean matrix.
Consider Γas above and remove an arbitrary row i. Then, there are 2k1possibilities
to add a row receiving a matrix with full rank. Such a row is described by the vector set
Γ(x1,...,xk)Twhere xi=1 and the rest is chosen arbitrarily. The only vector that can be
added to each of these combinations is Γ(1,...,1)T. After adding this vector, there is no
other vector which is linearly independent from the other vectors.
This lemma has severe implications on the matrix based method for Boolean bases, as
we show in the following.
Theorem 4.24. If r +w=k+n, there are Boolean n ×k generator matrices V for a valid
(k,r,w,n)2-RW encoding only for the following cases:
1. (1,1,n,n)2,(1,n,1,n)2,(k,k,k,k)2
2. For k 2:(k,k+1,k+1,k+2)2
3. For k 1:(k,k,k+1,k+1)2
4. For k 1:(k,k+1,k,k+1)2
Proof. Follows by combining the prerequisites of the matrix based RWC method with
Lemma 4.23.
If we merely consider the encoding of a 1-bit information word, the following lemma
holds.
Lemma 4.25. Every (1,r,w,r+w1)2-RW code is a matrix based RW code.
116 Erasure Codes for Reading and Writing
Proof. Consider a write operation on the subset Wof code symbols with |W|=wand a
read operation on the index set Rwith |R|=rsuch that |RW|=1. Let ibe the index
of the element in the intersection. Now, if the information bit xflips to x0, the code bit yi
must flip as well. Otherwise, the result of the read operation would be the same as before,
what is incorrect. Thus, we can describe the operation on yiby
y0
iyi+x0+x(mod 2).
Notice, there is always a set Rwith |R|=rand |RW|=1 for any index iW. Thus,
the congruence holds for all bits leading to the matrix representation of RW codes.
Unfortunately, there are no perfect Boolean RW codes of higher dimensions.
Lemma 4.26. There is no (2,r,w,n)2-RW code for r +w=2+n and w 4.
Proof. Consider arbitrary 4 bits of some codeword yand let their indices be given by
the set F={i,j,k,l} [n]. Then, partition the residual index set IR= [n]\Finto two
disjoint index sets Rwith |R|=r2 and Wwith |W|=w4. W.l.o.g. consider the set
F={1,2,3,4}describing the code bits y1,y2,y3,y4. Now, consider write operations on
the index sets Wi,j=W{i,j}for all distinct i,jFand read operations on the index
set Ri,j=R{i,j}, again for all i,jF, imposing an intersection on at most two code
symbols.
Now, what is the number of bits to flip whenever the information (x1,x2) changes to
(x0
1,x0
2) (for our example, there are three change possibilities) ? To cover all cases, we
define pias a predicate which is only true if yimust be changed, i.e. inverted. If we then
consider the read operation on the set Ri,j, clearly, all bits in Rremain unchanged, and the
only way the write operation induces a change for the read operation is that, at least, one
of the code bits yi,yjis changed. To consider all such read operations, we have to fulfill
the following term: ^
i,j[4],i,j
pipj=
(p1p2p3)(p1p2p4)(p1p3p4)(p2p3p4).
As we see, all but (at most) one bit have to be inverted. Hence, there are five possibilities
to encode all three possible changes to the information vector (x1,x2). Moreover, by the
definition of the set Ri,j, each read operation can only observe two out of the four rewritten
bits. In particular, the following cases occur (we start with R1,2):
1. A: y1and y2are inverted.
2. B: Only y1is inverted.
3. C: Only y2is inverted.
4.6. Boolean Read-Write-Codes 117
Thus, A, B, and C can be mapped to all three possible changes of (x1,x2) to (x0
1,x0
2). Un-
fortunately, if we continue this game, we will observe a conflict with the read operation
R3,4in the situations Band Cthat both hold for R3,4, that is, both bits y3and y4must
be inverted inducing an indistinguishable situation for the read operation. Thus, in other
words, the read operation of R3,4cannot uniquely distinguish the change in the informa-
tion vector.
This lemma can be generalized to the following theorem.
Theorem 4.27. There is no (k,r,w,n)2-RW code for r +w=k+n and w k+2.
Proof. The proof is analogous to the proof of Lemma 4.26 but we now consider k+2 bits
of y, and w.l.o.g. we set F= [k+2]describing the code bits y1,y2,...,yk+2. Then again,
we partition the nkresidual code bits (i.e. the residual index set IR= [n]\F) into the
sets Rand Wwith |R|=r2 and |W|=wk2. Let the considered read and write
operations on the index sets Wi,jand Ri,jbe defined as in Lemma 4.26.
Again, we are interested in the number of bits to flip if the information (xi1,...,xik)
changes to (xi1,...,xik+2) (there are 2k1 change possibilities). Let the predicate pialso
be defined as before. If we then consider the read operation on R{i1,...,ik}, all bits
in Rremain the same. Thus, the only way the write operation induces a change is that,
at least, one of the code bits yi1,...,yikis changed (note that the read operation will read
only two bits from this intersection). Considering all such read operations, we must now
fulfill the following term:
^
S[k+2]:|S|=k_
iS
pi=_
S[k+2]:|S|≥3^
iS
pi.
Thus, at least 3 bits must be inverted in every write operation. Moreover, there are
2k+2
2
i=0k+2
i
possibilities to encode all 2k1 possible changes to the vector (x1,...,xk).
However, in each read operation, only kbits are accessed (in the intersection) which
requires 2k1 possible outcomes. If we then combine any two of the read operations, this
implies for the whole set that, at least, 2k+27 possible outcomes are demanded. Now,
consider the intersection of read operations that have only k2 positions in common. If
the corresponding bits remain the same, clearly then, each of the residual pairs of two bits
must differ, leaving 9 possibilities. In the other case, we have 2k21 possibilities for
the common bit vector and 16 possibilities for the ‘private’ bit pairs. Hence, there are
2
i=0k+2
i7>1
codes missing, for all k1.
118 Erasure Codes for Reading and Writing
4.7 General RW-Codes
Since only few perfect Boolean RW codes exist, we now consider more general RW codes
for which, in particular, hold
r+w>n+k.
In order to appropriately describe the existence of such codes, we use a modified matrix
based approach in which all matrix entries are chosen uniformly at random from the
alphabet Σ. Then, given this structures, we prove that with positive probability the read
and write operations work.
We call the following approach a random matrix (k,`,n)b-RW code. As usual, we
consider the information vector xto consists of ksymbols plus additional `used slack
variables, and an n-symbol code ygenerated by the matrix based encoding. Again, let
b=|Σ|denote the size of the symbol alphabet Σ. Furthermore, we consider an n×(k+`)-
generator matrix Γwhose elements are drawn (uniformly) at random from Σ. And as
usual, we use the matrix based code representation
Γ(x|v)T=y.
Then, the read and write operations can be derived as follows:
Read
Consider rk+`given readable positions in the code vector y. Then, we can
derive the matrix Γ0by choosing all rows of Γcorresponding to the rows available
for reading (c.f. Section 4.3.1). In Γ0, we choose `+klinear independent row
vectors yielding matrix Γ00 (note that in case of an insufficient number of existing
linear independent rows in the matrix (what may occur since the elements were
chosen at random), the read operation fails). Corresponding to Γ00, we also reduce
the code vector yyielding a codeword y0of size k+`and compute the k-sized
content vector xas well as the slack vector vas usual by
(x|v)T= (Γ00)1y0.
Write
Consider the n×lvariable sub-matrix
Γ= (Γi,j)i[n],j∈{k+1,...,k+l}.
4.7. General RW-Codes 119
There are nwnon-writable positions corresponding to rows of the code vector
yand the variable sub-matrix Γ. In case that these nwrows of Γare not
linearly independent, clearly, the write operation fails (recall from Section 4.3.1 that
a precondition for the write operation is the invertibility of the (nw)×lvariable
submatrix Γ↓→). Then, we could add some other `n+wlinear independent rows
of Γ(again, if these do not exist, the operation fails). Now, provided that we have
found appropriate linearly independent rows, these rows then correspond to non-
writable positions, and by using the complement of this rows as writable positions,
we can directly apply the matrix based RW code write operation.
We now investigate the success probability of read and write.
Lemma 4.28. Consider a k ×(k+`)random matrix A over a given base b 2. Then,
Pr[A has an invertible k ×k-sub matrix ]1b`+2.
Proof. Since the column rank and row rank of a matrix are always equal, it is more con-
venient to argue the following by the column rank of the matrix A. Then, the probability
that a random k×(k+`)submatrix of Adoes not have rank kequals the probability
that all k+`columns of Alie in some k1-dimensional subspace of bk. To sum up
all these subspaces, we enumerate each subspace by its 1-dimensional orthogonal com-
plement yielding a total of bk1
b1such subspaces (since we omit the zero vector, we have
bk1 possibilities for a 1-dimensional vector or size k, and since for each such vector
that spans some certain subspace, a multiplication with b1 non-zero elements from the
underlying base still spans the same subspace). Then, for each of such subspaces, the
probability that all columns of the matrix are in this subspace is b(k+`)1. By the union
bound, the probability that all the columns are in the same subspace is upper bounded by
b(k+`)bk1
b1b`
b1b`.
Thus, the probability that the matrix has full rank is at least 1 b`.
From this, it follows that if only few combinations of read and write index sets occur,
the overhead for general codes is small.
Theorem 4.29. For given parameters r and w, consider a general RWC with read index
sets R={R1,...,R|R|},|Ri| r, and write index sets W={W1,...,W|W|},|Wi| w.
Then, there is a restricted (k,r,w,n)b-RWC for any
r>k+logb|R|+2,
w+r>n+k+logb|W|+2.
1For each column vector cit holds: Pr[clies in this subspace ] = b1=bk1
bk. In other words, one aims at
sampling a vector from the k-dimensional space which is also in the k1-dimensional subspace.
120 Erasure Codes for Reading and Writing
Proof. According to Lemma 4.28, the probability of having a valid code for reading by
using rrows is at least 1 b(rk)+2. Now, summing up the error probabilities of all
possible read operations, we end up with
|R|
i=1
Pr[Riis not valid ] |R|b(rk)+2<1
provided that r>k+logb|R|+3.
For the write operation, we consider an `×(nw)sub-matrix for `=rkin which
we need to find nw`independent rows, with positive probability. The other failure
case can be omitted. Again, due to the previous lemma, with a success probability greater
than 1 bnw`+2we succeed. Thus, for
nwr+k+2>logb|W|,
the summed error probability is less than 1. Combining both error probabilities, the over-
all error probability is also less than 1. Hence, there exists a matrix allowing these re-
stricted read and write index sets.
Yet, the overall number of possible read and write index sets is quite high. The follow-
ing theorem shows that random matrices perform quite well for most of the read and write
index sets.
Theorem 4.30. For any base b, a random matrix based (k,`,n)b-RW code successfully
performs a read operation for a random choice of k +`+q code symbols with probability
1bq+2and successfully performs a write operation for a random choice of n `+q
writable code symbols with probability 1bq+2.
Proof. Follows directly from Lemma 4.28 and from the definition of the Read and Write
operation of random matrix based (k,`,n)b-RW-Codes.
4.8 Conclusion
In this chapter, we presented a useful extension of linear erasure codes, called Read-Write
codes, for application in RAID-like storage systems. Such systems often store valuable
business data implying some certain degree of reliability on the stored data. Furthermore,
to exploit access parallelism, the data is distributed among ndistinct storage devices but
what makes the system subject to failures as the probability of failed disks increases with
the number of devices connected. Therefore, fault-tolerant data placement schemes are
highly required to beware from data loss. In addition, reducing system latency is also of
great concern since access to the devices on which the data is stored is magnitudes slower
than any memory access which leads to comparatively high I/O costs.
4.8. Conclusion 121
Usually, employed RAID schemes (RAID 4/5) and Reed-Solomon encoding are sim-
ple and efficient techniques that, on one hand, focus on aggregating disk performance by
employing parallel access to them, and on the other hand, guarantee sufficient reliability
to tolerate multiple disk failures, all at a low cost, but unfortunately, all such codes re-
quire naccessible disks for storing an n-block codeword appropriately, what leads to the
following problems.
1. The modification of the original information implies access to all nkchecksum
disks; this write penalty leads to a performance bottleneck.
2. A complete modification of the information word results in inconsistencies if at
least any of the ndisks is missing.
3. Once the stripe size of an encoding is chosen, it cannot be changed any further,
thus, it cannot be adjusted to potentially changing demands (e.g. changing degree
of redundancy, adjusting to a higher number of tolerable failures, changing stripe
size, changing document size, etc.).
RW codes can significantly reduce, or even eliminated these problems. In a perfect
(k,r,w,n)-RW code (with r+w=k+n), any rdisks suffice for reading, and any max{r,w}
disks suffice for writing. The overhead is described by the stretch factor m
n, and any of
these combinations (k,r,w,n)with kr,wncan be chosen. In the case of adaptive RW
codes, these parameters can be adjusted during runtime without reading and rewriting all
disks. Furthermore, the information of nrdisks can be recovered and the system can
continue working if not more than nmax{r,w}disks fail. There is some redundancy
against faulty hard disks. However, redundancy is better solved by reducing it with disk-
wise checksums to the problem of failed disks. Interestingly, there is also a security
feature which allows the theft of up to nwdisks. The main advantage of RW codes
in a SAN is that the current r, respectively w, fastest disks can be used for read or write
operation. This leads to more efficient storage area networks.
In addition, in [DLS+04b], methods have been introduced for a wide-area distributed
hash table (DHT) that provides high-throughput and low-latency network storage. The au-
thors suggest a system called DHash++ which based on the peer-to-peer network Chord
[SMK+01b], and one feature of DHash++ is to use an erasure-resilient code for stor-
ing data. In particular, they store each 8192-byte block as 14 1171-byte erasure-coded
fragments, any seven of which are sufficient to reconstruct the block, using the IDA (In-
formation Dispersal Algorithm) coding algorithm from [Rab89]. The benefit of this code
is low latency as the fastest peers can be used to reconstruct the stored data. On the other
hand, this system suffers from a bad update behavior making it a promising application
for Read-Write-Codes to improve peer-to-peer-networks.
Bibliography
[ABB+97] Micah Adler, Yair Bartal, John W. Byers, Michael Luby, and Danny Raz.
A modular analysis of network transmission protocols. In Israel Sympo-
sium on Theory of Computing Systems, pages 54–62, 1997.
[ABKU00] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced
allocations. SIAM Journal on Computing, 29(1):180–200, 2000.
[ACMR95] Micah Adler, Soumen Chakrabarti, Michael Mitzenmacher, and Lars
Rasmussen. Parallel randomized load balancing. pages 238–247, 1995.
[AT97] J. Alemany and J. S. Thathachar. Random striping news on demand
servers. Technical Report TR-97-02-02, 1997.
[BBBM94] M. Blaum, J. Brady, J. Bruck, and J. Menon. EVENODD: an optimal
scheme for tolerating double disk failures in RAID architectures. In Pro-
ceedings of the 21st Annual International Symposium on Computer Ar-
chitecture, pages 245–254. IEEE Computer Society TCCA and ACM
SIGARCH, April 18–21, 1994.
[BCSV00] Petra Berenbrink, Artur Czumaj, Angelika Steger, and Berthold Vöck-
ing. Balanced allocations: the heavily loaded case. pages 745–754,
2000.
[BE07] André Brinkmann and Sascha Effert. Inter-node communication in peer-
to-peer storage clusters. In Proceedings of the 24th IEEE Conference on
Mass Storage Systems and Technologies (MSST), San Diego, California,
24 - 26 September 2007.
123
124 Bibliography
[BEHV05] André Brinkmann, Sascha Effert, Michael Heidebuer, and Mario
Vodisek. Distributed md. In In Proceedings of the International Work-
shop on Storage Network Architecture and Parallel I/Os, pages 81 88,
Saint Louis, Missouri, USA, 18 September 2005.
[BEHV06] André Brinkmann, Sascha Effert, Michael Heidebuer, and Mario
Vodisek. Realizing multilevel snapshots in dynamically changing vir-
tualized storage environments. In 5th International Conference on Net-
working (ICN), number 5, Mauritius, 23 - 26 April 2006. Springer Verlag
LNCS.
[BEMadHS07] André Brinkmann, Sascha Effert, Friedhelm Meyer auf der Heide, and
Christian Scheideler. Dynamic and redundant data placement. In
27th IEEE International Conference on Distributed Computing Systems
(ICDCS 2007), Toronto, Canada, 25 - 29 June 2007.
[BEY98] Allan Borodin and Ran El-Yaniv. Online computation and competitive
analysis. Cambridge University Press, New York, NY, USA, 1998.
[BGMJ94] Steven Berson, Shahram Ghandeharizadeh, Richard Muntz, and Xi-
angyu Ju. Staggered striping in multimedia information systems. pages
79–90, 1994.
[BHMadH+04] André Brinkmann, Michael Heidebuer, Friedhelm Meyer auf der Heide,
Ulrich Rückert, Kay Salzwedel, and Mario Vodisek. V:drive - costs and
benefits of an out-of-band storage virtualization system. In Proceedings
of the 12th NASA Goddard, 21st IEEE Conference on Mass Storage Sys-
tems and Technologies (MSST), pages 153 157, College Park, Mary-
land, USA, 13 - 16 April 2004.
[BLM02] J. Byers, M. Luby, and M. Mitzenmacher. A digital fountain approach
to asynchronous reliable multicast. IEEE Journal on Selected Areas in
Communications, 20(8), Oct, 2002.
[BLMR98] John W. Byers, Michael Luby, Michael Mitzenmacher, and Ashutosh
Rege. A digital fountain approach to reliable distribution of bulk data.
In SIGCOMM’98, pages 56–67, Sep 1998.
[BMadHS97] Petra Berenbrink, Friedhelm Meyer auf der Heide, and Klaus Schröder.
Allocating weighted jobs in parallel. SPAA 1997, pages 302–310, June
1997.
Bibliography 125
[BMS+03] A. Brinkmann, F. Meyer auf der Heide, K. Salzwedel, C. Scheideler,
M. Vodisek, and U. Rückert. Storage Management as Means to Cope
with Exponential Information Growth. In Proceedings of the Interna-
tional Conference on Advances in Infrastructure for Electronic Business,
Education, Science, Medcine, and Mobile Technologies on the Internet,
SSGRR 2003w, January 2003.
[BSS00] A. Brinkmann, Kay Salzwedel, and C. Scheideler. Efficient, distributed
data placement strategies for storage area networks. In Proc. of the 12th
ACM Symposium on Parallel Algorithms and Architectures (SPAA’00),
pages 119–128, 2000.
[BSS02] A. Brinkmann, Kay Salzwedel, and C. Scheideler. Compact, adaptive
placement schemes for non-uniform distribution requirements. In Proc.
of the 14th ACM Symposium on Parallel Algorithms and Architectures
(SPAA’02), pages 53–62, 2002.
[BSV03] R. Bhagwan, S. Savage, and G. Voelker. Understanding availability. In
Proceedings of the 2nd International Workshop on Peer-to-Peer Systems
(IPTPS ’03), February 2003.
[BSV04] André Brinkmann, Kay Salzwedel, and Mario Vodisek. A case for vir-
tualized arrays of raid. In Proceedings of the International Workshop
on Storage Network Architecture and Parallel I/Os SNAPI 2004, pages
9–16, Antibes Juan-les-pins, France, 30 September 2004.
[CL00] T. Cortes and J. Labarta. A case for heterogenenous disk arrays. In
Proc. of the IEEE International Conference on Cluster Computing (Clus-
ter’2000), pages 319–325, 2000.
[CL01] T. Cortes and J. Labarta. Extending heterogeneity to RAID level 5. In
USENIX 2001, Boston, June 2001.
[CLG+94] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and
David A. Patterson. RAID: High-performance, reliable secondary stor-
age. ACM Computing Surveys, 26(2):145–185, 1994.
[CLRS01] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clif-
ford Stein. Introduction to Algorithms, Second Edition. The MIT Press,
September 2001.
126 Bibliography
[CW77] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash
functions (extended abstract). In STOC ’77: Proceedings of the ninth
annual ACM symposium on Theory of computing, pages 106–112, New
York, NY, USA, 1977. ACM.
[DKM+88] Martin Dietzfelbinger, Anna R. Karlin, Kurt Mehlhorn, Fried-
helm Meyer auf der Heide, Hans Rohnert, and Robert Endre Tarjan. Dy-
namic perfect hashing: Upper and lower bounds. In IEEE Symposium on
Foundations of Computer Science, pages 524–531, 1988.
[DLS+04a] Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans
Kaashoek, and Robert Morris. Designing a dht for low latency and high
throughput. In NSDI’04: Proceedings of the 1st conference on Sym-
posium on Networked Systems Design and Implementation, pages 7–7,
Berkeley, CA, USA, 2004. USENIX Association.
[DLS+04b] Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans
Kaashoek, and Robert Morris. Designing a DHT for low latency and
high throughput. In NSDI, pages 85–98, 2004.
[DR01] P. Druschel and A. Rowstron. PAST: A large-scale, persistent peer-to-
peer storage utility. pages 75–80, 2001.
[ED88] R. J. Enbody and H. C. Du. Dynamic hashing schemes. ACM Comput.
Surv., 20(2):850–113, 1988.
[Eli55] P. Elias. Coding for two noisy channels. In Information Theory, Third
London Symposium, pages 61–76. Butterswort’s Scientific Publications,
1955.
[Fal07] FalconStor. The Value of Storage Virtualization. White Paper, FalconStor
Software Inc., 2007.
[FNPS79] Ronald Fagin, Jürg Nievergelt, Nicholas Pippenger, and H. Raymond
Strong. Extendible hashing - a fast access method for dynamic files.
ACM Trans. Database Syst., 4(3):315–344, 1979.
[Gan07] John F. Gantz et al. ICD Study: The Expanding Digital Universe. White
Paper, EMC2, March 2007.
[Gib99] Garth A Gibson. Redundant disk arrays: Reliable, parallel secondary
storage. Technical report, Berkeley, CA, USA, 1999.
Bibliography 127
[GLS93] Martin Grötschel, Lászlo Lovász, and Alexander Schrijver. Geometric
Algorithms and Combinatorial Optimization, volume 2 of Algorithms
and Combinatorics. Springer, second corrected edition edition, 1993.
[GM00] Garth A. Gibson and Rodney Van Meter. Network attached storage ar-
chitecture. Commun. ACM, 43(11):37–45, 2000.
[Gup02] M. Gupta. Storage Area Network Fundamentals. Cisco Press, 2002.
[HGK+94] Lisa Hellerstein, Garth A. Gibson, Richard M. Karp, Randy H. Katz, and
David A. Patterson. Coding techniques for handling failures in large disk
arrays. Algorithmica, 12(2/3):182–208, 1994.
[HJH02] Kai Hwang, Hai Jin, and Roy S.C. Ho. Orthogonal striping and mirroring
in distributed raid for i/o-centric cluster computing. IEEE Trans. Parallel
Distrib. Syst., 13(1):26–44, 2002.
[HM03] R. Honicky and E. Miller. A fast algorithm for online placement and
reorganization of replicated data. 2003.
[HM04] R. J. Honicky and Ethan L. Miller. Replication Under Scalable Hashing:
A Family of Algorithms for Scalable Decentralized Data Distribution. In
Proceedings of the 18th IPDPS Conference, 2004.
[HP03] Cary Huffmann and Vera Pless. Fundamentals of Error-Correcting
Codes. Cambridge University Press, Cambridge, UK, 2003.
[HR90] Torben Hagerup and C. Rüb. A guided tour of chernoff bounds. Inf.
Process. Lett., 33(6):305–308, 1990.
[Hro03] J. Hromkovic. Algorithms for Hard Problems, 2nd Edition. (Texts in
Theoretical Computer Science. An EATCS Series). Springer-Verlag New
York, Inc., Secaucus, NJ, USA, 2003.
[HZ05] J. Hromkovic and I. Zámecniková. Design and Analysis of Random-
ized Algorithms: Introduction to Design Paradigms (Texts in Theoretical
Computer Science. An EATCS Series). Springer-Verlag New York, Inc.,
Secaucus, NJ, USA, 2005.
[JK77] Norman L. Johnson and Samuel Kotz. Urn models and their application.
John Wiley & Sons, New York-London-Sydney, 1977. An approach
to modern discrete probability theory, Wiley Series in Probability and
Mathematical Statistics.
128 Bibliography
[Joh84] Olin G. Johnson. Three-dimensional wave equation computations on
vector computers. Proceedings of the IEEE, 72(1):90–95, January 1984.
[JT05] A. Telles J. Tate, R. Kanth. Introduction to Storage Area Networks. Tech-
nical report, IBM, May 2005.
[KLL+97] David Karger, Eric Lehman, Tom Leighton, Mathhew Levine, Daniel
Lewin, and Rina Panigrahy. Consistent hashing and random trees: Dis-
tributed caching protocols for relieving hot spots on the world wide web.
In ACM Symposium on Theory of Computing, pages 654–663, May 1997.
[KLM92] Richard M. Karp, Michael Luby, and Friedhelm Meyer auf der Heide.
Efficient PRAM simulation on a distributed memory machine. pages
318–326, 1992.
[KLMadH96] Richard M. Karp, Michael Luby, and Friedhelm Meyer auf der Heide.
Efficient pram simulation on a distributed memory machine. Algorith-
mica, 16(4/5):517–542, 1996.
[KSB+99] David Karger, Alex Sherman, Andy Berkheimer, Bill Bogstad, Rizwan
Dhanidina, Ken Iwamoto, Brian Kim, Luke Matkins, and Yoav
Yerushalmi. Web caching with consistent hashing. In WWW ’99: Pro-
ceedings of the eighth international conference on World Wide Web,
pages 1203–1213, New York, NY, USA, 1999. Elsevier North-Holland,
Inc.
[LMR98] W. Litwin, J. Menon, and T. Risch. LH* schemes with scalable avail-
ability. Technical Report RJ 10121 (91937), IBM Research, Almaden
Center, May 1998.
[LN86] Rudolf Lidl and Harald Niederreiter. Introduction to finite fields and
their applications. Cambridge University Press, New York, NY, USA,
1986.
[LN96] Witold Litwin and Marie-Anne Neimat. High-availability LH* schemes
with mirroring. In Conference on Cooperative Information Systems,
pages 196–205, 1996.
[LNS93] Witold Litwin, Marie-Anne Neimat, and Donovan A. Schneider. Lh*:
Linear hashing for distributed files. In SIGMOD ’93: Proceedings of the
1993 ACM SIGMOD international conference on Management of data,
pages 327–336, New York, NY, USA, 1993. ACM.
Bibliography 129
[LR02] Witold Litwin and Tore Risch. LH*G: A high-availability scalable dis-
tributed data structure by record grouping. IEEE Transactions on Knowl-
edge and Data Engineering, 14(4):923–927, 2002.
[LS00] Witold Litwin and Thomas Schwarz. LH* RS : A high-availability scal-
able distributed data structure using reed solomon codes. In SIGMOD
Conference, pages 237–248, 2000.
[LYD71] V. Y. Lum, P. S. T. Yuen, and M. Dodd. Key-to-address transform tech-
niques: a fundamental performance study on large existing formatted
files. Commun. ACM, 14(4):228–239, 1971.
[MadH06] Friedhelm Meyer auf der Heide. Kommunikation in Parallelen Rechen-
modellen. Skript, University of Paderborn, July 2006.
[Mar79] G. N.N. Martin. Spiral storage: Incrementally augmentable hash ad-
dressed storage. Technical report, Coventry, UK, UK, 1979.
[Mas97] P. Massiglia. RAB. The RAIDbook. A Storage System Technology Hand-
book. RAID Advisory Board, 1997.
[McD89] C. McDiarmid. On the method of bounded differences. In J. Siemons,
editor, Surveys in Combinatorics. London Mathematical Society Lecture
Note Series 141, Cambridge University Press, 1989.
[MNR02] D. MALKHI, M. NAOR, and D. RATAJCZAK. Viceroy: A scalable and
dynamic emulation of the butterfly. In Proceedings of the 21st annual
ACM symposium on Principles of distributed computing. ACM Press,
2002.
[MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms.
Cambridge University Press, New York, NY, USA, 1995.
[MS08] Mario Mense and Christian Scheideler. Spread: An adaptive scheme for
redundant and fair storage in dynamic heterogeneous storage systems. In
19th ACM-SIAM Symposium on Discrete Algorithms (SODA), San Fran-
cisco, California, USA, 20.-22. Febr.„ January 2008.
[Mul84] James K. Mullin. Unified dynamic hashing. In VLDB ’84: Proceedings
of the 10th International Conference on Very Large Data Bases, pages
473–480, San Francisco, CA, USA, 1984. Morgan Kaufmann Publishers
Inc.
130 Bibliography
[NB97] J. Nonnenmacher and E. Biersack. Asynchronous multicast push: Amp.
In Proceedings of ICCC’97, pp. 419–430, Cannes, France, November,
1997.
[Oto88] Ekow J. Otoo. Linearizing the directory growth in order preserving ex-
tendible hashing. In Proceedings of the Fourth International Conference
on Data Engineering, pages 580–588, Washington, DC, USA, 1988.
IEEE Computer Society.
[PGK88] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant
Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM
Conference on Management of Data (SIGMOD), pages 109–116, June
1988.
[Pla97] James S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance
in RAID-like systems. Software, Practice and Experience, 27(9):995–
1012, 1997.
[PS98] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Opti-
mization : Algorithms and Complexity. Dover Publications, July 1998.
[PT03] J. S. Plank and M. G. Thomason. On the practical use of ldpc erasure
codes for distributed storage applications. Technical Report CS-03-510,
University of Tennessee, September 2003.
[Rab89] M. O. Rabin. Efficient dispersal of information for security, load balac-
ing and fault tolerance. Journal of the ACM, 36(2):335–348, 1989.
[RD01] Antony I. T. Rowstron and Peter Druschel. Storage management and
caching in past, a large-scale, persistent peer-to-peer storage utility. In
Symposium on Operating Systems Principles, pages 188–201, 2001.
[REG+03] S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao, and J. Kubiatow-
icz. Pond: The oceanstore prototype. In Proceedings of the Conference
on File and Storage Technologies. USENIX, 2003.
[Riz97] Luigi Rizzo. Effective erasure codes for reliable computer communica-
tion protocols. ACM Computer Communication Review, 27(2):24–36,
April 1997.
[RL05] Rodrigo Rodrigues and Barbara Liskov. High availability in dhts: Era-
sure coding vs. replication. In Miguel Castro and Robbert van Renesse,
editors, IPTPS, volume 3640 of Lecture Notes in Computer Science,
pages 226–239. Springer, 2005.
Bibliography 131
[Roc70] R. T. Rockafellar. Convex Analysis. Princeton University Press, Prince-
ton, New Jersey, 1970.
[Rom96] Steven Roman. Introduction to coding and information theory. Springer-
Verlag New York, Inc., Secaucus, NJ, USA, 1996.
[Ros00] K.H. Rosen. Handbook of Discrete and Combinatorial Mathematics.
CRC; 1 edition, 2000.
[Ros06] Sheldon M. Ross. Introduction to Probability Models, Ninth Edition.
Academic Press, Inc., Orlando, FL, USA, 2006.
[RS98] Martin Raab and Angelika Steger. “balls into bins” A simple and tight
analysis. Lecture Notes in Computer Science, 1518:159–??, 1998.
[Sal04] Kay Salzwedel. Data Distribution Algorithms for Storage Networks.
Dissertation, Universität Paderborn, Heinz Nixdorf Institut, Theoretische
Informatik, 2004. EUR 20,00 ISBN 3-935433-62-X.
[San01] P. Sanders. Reconciling simplicity and realism in parallel disk mod-
els. In Proc. of the 12th ACM-SIAM Symposium on Discrete Algorithms
(SODA), pages 67–76. SIAM, Philadelphia, PA, 2001.
[Sch00] C. Scheideler. Probabilistic Methods for Coordination Problems. HNI-
Verlagsschriftenreihe 78, University of Paderborn, 2000.
[Sch01] Klaus Schröder. Balls into Bins: A Paradigm for Job Allocation, Data
Distribution Processes, and Routing. Dissertation, Universität Pader-
born, Heinz Nixdorf Institut, Theoretische Informatik, 2001. ISBN 3-
931466-88-4.
[SM98a] J. R. Santos and R. Muntz. Performance analysis of the RIO multimedia
storage system with heterogeneous disk configuration. In ACM Multi-
media 98, pages 303–308, 1998.
[SM98b] Jose Renato Santos and Richard Muntz. Using heterogeneous disks on
a multimedia storage system with random data allocation. Technical
Report 980011, 18, 1998.
[SMK+01a] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari
Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet
applications. In Proc. of the ACM SIGCOMM 2001, pages 149–160,
August 2001.
132 Bibliography
[SMK+01b] Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and
Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for
internet applications. In SIGCOMM, pages 149–160, 2001.
[SMRN00] Jose Renato Santos, Richard R. Muntz, and Berthier A. Ribeiro-Neto.
Comparing random data allocation and data striping in multimedia
servers. In Measurement and Modeling of Computer Systems, pages 44–
55, 2000.
[SS05] Christian Schindelhauer and Gunnar Schomaker. Weighted distributed
hash tables. In SPAA ’05: Proceedings of the seventeenth annual ACM
symposium on Parallelism in algorithms and architectures, pages 218–
227, New York, NY, USA, 2005. ACM Press.
[Tec07] Hitatchi Global Storage Technologies. Hitatchi UltrastarTMA7K1000.
Specification Sheet, Hitatchi Global Storage Technologies, 2007.
[TPBG93] F. Tobagi, J. Pang, R. Baird, and M. Gang. Streaming raid - a disk
array management system for video files. In Proceedings of of 1st ACM
Multimedia, pages 393–400, 1993.
[Voe99] Berthold Voecking. How asymmetry helps load balancing. In IEEE
Symposium on Foundations of Computer Science, pages 131–141, 1999.
[VRG95] Harrick M. Vin, S. S. Rao, and Pawan Goyal. Optimizing the placement
of multimedia objects on disk arrays. In International Conference on
Multimedia Computing and Systems, pages 158–166, 1995.
[WAS+96] Stephen Williams, Marc Abrams, Charles R. Standridge, Ghaleb Ab-
dulla, and Edward A. Fox. Removal policies in network caches for
World-Wide Web documents. In Procedings of the ACM SIGCOMM
’96 Conference, Stanford University, CA, 1996.
[WBMM06] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, and Carlos Maltzahn.
Crush: controlled, scalable, decentralized placement of replicated data.
In SC ’06: Proceedings of the 2006 ACM/IEEE conference on Super-
computing, page 122, New York, NY, USA, 2006. ACM.
[Wie07] Udi Wieder. Balanced allocations with heterogenous bins. In SPAA ’07:
Proceedings of the nineteenth annual ACM symposium on Parallel al-
gorithms and architectures, pages 188–193, New York, NY, USA, 2007.
ACM.
Bibliography 133
[WK02] Hakim Weatherspoon and John Kubiatowicz. Erasure coding vs. replica-
tion: A quantitative comparison. In IPTPS ’01: Revised Papers from the
First International Workshop on Peer-to-Peer Systems, pages 328–338,
London, UK, 2002. Springer-Verlag.
[XB99] Lihao Xu and Jehoshua Bruck. X-code: MDS array codes with optimal
encoding. IEEE Transactions on Information Theory, 45(1):272–276,
1999.
[ZG97] Roger Zimmermann and Shahram Ghandeharizadeh. Continuous dis-
play using heterogeneous disk-subsystems. In MULTIMEDIA ’97: Pro-
ceedings of the fifth ACM international conference on Multimedia, pages
227–238, New York, NY, USA, 1997. ACM.
[ZG00] Roger Zimmermann and Shahram Ghandeharizadeh. HERA: Hetero-
geneous extension of RAID. In Proc. of the International Conference
on Parallel and Distributed Processing Techniques and Applications
(PDPTA 2000), 2000.
[Zim98] Roger Zimmermann. Continuous media placement and scheduling in
heterogeneous disk storage systems. PhD thesis, Los Angeles, CA, USA,
1998. Adviser-Shahram Ghandeharizadeh.