Document [original]

On Fault-Tolerant Data Placement in

Storage Networks

Dissertation

Mario Mense

Faculty of Computer Science, Electrical Engineering and Mathematics

Department of Computer Science and Heinz Nixdorf Institute

University of Paderborn, Germany

August 2008

Reviewers:

•Prof. Dr. Friedhelm Meyer auf der Heide, University of Paderborn, Germany

•Prof. Dr. Christian Schindelhauer, University of Freiburg, Germany

iii

Acknowledgments

First of all, I would like to thank my advisor Prof. Dr. Friedhelm Meyer auf der Heide for

his great support. Friedhelm always left me the freedom to address the problems I was

interested to investigate and which have now become the core of this thesis. Moreover,

basing on his early results in PRAM simulation and resource management in networks, I

had several motivating discussions with him from which I always could extract interesting

proposals and useful perspectives for my work. Furthermore, I would also like to thank

Christian Schindelhauer from Albert-Ludwigs-Universität, Freiburg, for reviewing this

thesis. Collaboration with Christian is always interesting, progressive, sometimes strange,

but in the end leads to fascinating views on many different problems.

At last, I express my thanks to those people with whom I had a great collaboration

and great time, mainly the members of the research group "Algorithms and Complexity"

and my co-authors of the main results in this thesis, namely Michael Kortenjan and Mar-

tin Ziegler from the University of Paderborn, Christian Scheideler from the Technische

Universität München, Germany, as well as Christian Schindelhauer and Arne Vater from

Albert-Ludwigs-Universität, Freiburg, Germany. Another special thank goes to Michael

Kortenjan and Gunnar Schomaker for suffering me as their office member during the last

years.

Paderborn, August 2008 Mario Mense

To Nicole and Charlotte

Contents

1 Introduction 3

1.1 Storage (Area) Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.1 The Storage Virtualization Layer . . . . . . . . . . . . . . . . . . 6

1.1.2 Storage Management Schemes . . . . . . . . . . . . . . . . . . . 7

1.2 Demands on Data Placement in Storage Networks . . . . . . . . . . . . . 9

1.2.1 A Brief Note on Erasure Encoding and Replication . . . . . . . . 12

1.3 OurContribution .............................. 14

2 Some Classical Approaches 15

3 Random Allocation of Data Copies 23

3.1 PreliminarySection............................. 26

3.1.1 TheBasicModel.......................... 26

3.1.2 On the Practical Realization of the Allocation Function F. . . . 31

3.1.3 Further Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2 The Redundant Allocation Problem . . . . . . . . . . . . . . . . . . . . 43

3.3 Analysis of the Probability Distribution P................. 46

3.3.1 Existence and Uniqueness . . . . . . . . . . . . . . . . . . . . . 47

3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations . . . . 49

3.4 COMB: An Allocation Strategy for Static Scenarios . . . . . . . . . . . . 60

3.5 SPREAD: An Adaptive Allocation Scheme for Dynamic Scenarios . . . . 64

3.5.1 Previous Adaptive Strategies . . . . . . . . . . . . . . . . . . . . 65

3.5.2 The SPREAD Strategy . . . . . . . . . . . . . . . . . . . . . . . 69

3.6 Conclusion and Open Problems . . . . . . . . . . . . . . . . . . . . . . . 89

4 Erasure Codes for Reading and Writing 91

4.0.1 Preliminaries ............................ 93

4.0.2 Data Reconstruction from Failures . . . . . . . . . . . . . . . . . 96

vii

viii Contents

4.1 The Operational Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.2 LowerBounds................................103

4.3 Encoding/Decoding in the RWC . . . . . . . . . . . . . . . . . . . . . . 104

4.3.1 The Matrix Approach . . . . . . . . . . . . . . . . . . . . . . . . 106

4.4 Security and Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.5 AdaptiveRWCodes.............................112

4.6 Boolean Read-Write-Codes . . . . . . . . . . . . . . . . . . . . . . . . . 114

4.7 GeneralRW-Codes .............................118

4.8 Conclusion .................................120

CHAPTER 1

Introduction

The advances in networking technology over the last decades, the explosive popularity

of the World Wide Web and, depending on this, the growth of the Internet have greatly

widened the user base of computers, making it nearly universal. In times of the so-called

Web 2.0, information exchange has become ubiquitous, and novel pervasive applications,

like e.g. online banking, audio- and video downloads, data warehousing or the wide and

continuously growing field of e-commerce, have been enabled by this progress. Further-

more, it has become usual for people in the industrial nations that information is available

24 hours a day, 7 days a week1, and complex multimedia based content, like video clips

and MP3 music files, can be downloaded at high rates or viewed online.

Driven by this evolution, data has become the central and most valuable asset for many

companies and organizations, and generally, the efficient storage and retrieval of huge

amounts of information has emerged as a driving force of our information society’s econ-

omy. Due to an ICD study sponsored by EMC2[Gan07], the worldwide amount of digital

data in 2006 is projected at 161 Exabyte (billion GByte). In other words, on average, each

of the 6.7 billion people on earth has created 24 GByte of new data in 2006. This is plau-

sible since multimedia applications have entered a phase of rapid growth which is driven

by mainly two forces. First, the rapidly dropping costs of computer and networking hard-

ware has made multimedia applications accessible to an ever-increasing number of users.

Second, the growing size of the Internet has made it possible to publish multimedia con-

tent comparatively cheaply to an immense audience. For example, YouTube, a company

that did not exist a few years ago, hosts 100 million video streams a day, and experts say

more than a billion songs a day are shared over the Internet in MP3 format. Assuming

an expansion rate of 57%, in 2010, the worldwide data volume will be about six times as

much as today, 988 Exabyte. Particularly, from the total global data volume, about 70%

of all information is created by private persons, but nevertheless, roughly 85% of all data

1often abbreviated 24/7 or 24x7

4 Introduction

require the involvement of a company in terms of storing, providing or transmitting the

data. As a result, for providing and enabling all this information, multimedia content and

technical processes, companies have to administrate, process and backup these quantities

of data appropriately, and it has turned out that, on average, enterprises have to deal with

an information amount that doubles every year.

Therefore, in the recent years, stored data has become more and more critical and irre-

placeable, and the demand for quick and uninterrupted data availability has reached high-

est priority, thus, basic issues such as fault-tolerance, continuous information availability,

fast data access and advanced backup principles are crucial for every company today.

To guarantee all this, the concepts to manage and store data are of great importance be-

cause the quality of the data availability as well as the access speed have direct impact on

any company’s success. As an example, consider the case of a video-on-demand service

provider storing all his frequently accessed data on the same storage device. Due to band-

width limitations, during playback, the device may become unable to satisfy all incoming

request appropriately leading to loss of customer goodwill and hence, may be considered

expensive. As a result from summarizing all demands addressing modern advanced data

allocation, we can identify two major objectives any contemporary storage subsystem has

to accomplish: high-level serviceability, i.e. always serve given I/O requests quickly and

efficiently, independent of either the network’s size or occurring changes in the set of

storage devices, and data availability (or reliability), i.e. to feature appropriate fail-over

mechanisms to be safe from disk failures, and thus, data loss.

1.1 Storage (Area) Networks

The technological foundation to face the dramatic growth of enterprise data storage ca-

pacity as well as to meet the given objectives is to consolidate the storage capacity inside

a so-called storage area network (SAN) (see e.g. [JT05, Gup02] for details), a dedicated

network which is built around block-addressed storage units and that disbands the tradi-

tional tight coupling of servers and storage devices, but instead, establishes an any-to-any

connection from servers to disks by an underlying high-speed network. This concept

of storage consolidation leads to the so-called storage-centralized architecture enabling

enterprises to facilitate efficient parallel access to the data as well as cost-effective sys-

tem management. Unlike common server-centralized architectures in which a single disk

drive is directly attached to a workstation, the disks are rather encapsulated inside ad-

vanced storage subsystems, like e.g. disk arrays or high-end enterprise storage cabinets,

that today offer a capacity ranging from some TByte to several PByte each. Every such

storage subsystem can be directly accessed by any of the connected servers, thus, enhanc-

ing the ability to assign free capacity in an easy and flexible manner (see Figure 1.1).

Additionally today, almost all SANs are scalable although not all implement convenient

1.1. Storage (Area) Networks 5

storage-on-demand concepts, that is, to add (theoretically unlimited) storage capacity to

the system only when it is needed. Another problem that comes along with scalable envi-

ronments is heterogeneity. Depending on the underlying storage configuration (c.f. Sec-

tion 1.1.2), the ability of a storage network to scale may induce the storage components to

become heterogeneous over time because new storage devices enter the system as well as

older ones are being removed or replaced. This indeed simplifies migration processes, but

on the other hand forces the management modules to be able to cope with heterogeneity.

However, even if not all SANs are capable of handling heterogeneity yet, the ability to

scale already reduces the total costs of ownership (TCO) [GM00, Fal07, BSV04].

Last but not least, every SAN is required to have appropriate reliability mechanisms

installed to face the mentioned problem of data loss in case of device failures, while the

probability of such failures to occur increases with the number of storage devices in the

SAN (c.f. [PGK88] and Section 1.2). The common way to achieve high data availability,

and moreover, to ensure business continuity appropriately, is given by inducing redun-

dancy into the system in terms of applying either complete replication of data blocks or

erasure (resilient) encoding schemes (e.g. RAID). Unfortunately, almost all utilized fault-

tolerant placement strategies suffer from serious drawbacks, like e.g. the lacking ability

to properly cope with heterogeneity as well as insufficient failure resilience or inherent

I/O overheads when modifying the data.

SAN

. . .

LAN/WAN

Storage

Server

Figure 1.1: Storage Area Network

To summarize, depending on storage consolidation and supported by advanced data

and resource management strategies, modern SANs offer a high potential regarding the

efficient adaption to changing capacity demands without system downtimes as well as

parallel access to the (non-uniform) storage devices, low maintenance costs, simplified

6 Introduction

administration and integrated recovery services to react on hardware failures. Surely, to

ensure all these benefits, appropriate mechanisms to provide fault-tolerance,system scal-

ability and efficient data access even for non-uniform storage devices must be installed.

1.1.1 The Storage Virtualization Layer

Inside a SAN, the utilized data placement scheme is the building block to face the sketched

challenges and which is usually part of a software layer implementing the so-called stor-

age virtualization, an abstract concept separating the physical from the logical view of

storage resources (see e.g. [BMS+03, BHMadH+04, Sal04]). This abstraction enables

the consolidated storage devices to be managed as one single entity, normally as one

virtual disk, which is then offered to the application servers while hiding the physical

representation and explicit storage of the data from the user (Figure 1.2). In particular,

the logical address space presented to the accessing servers in terms of virtual volumes

is scrambled into equal sized data blocks which are then distributed among the physical

devices by the data distribution strategy according to the desired serviceability and avail-

ability objectives (additionally, modern SANs offer improved management services, like

e.g. remote copy or volume snapshots [BEHV06]).

Applicationserver

Physical Storage

Virtual Volume

Virtualization Layer

Remote Copy Snapshot …….

Data Placement Engine

Figure 1.2: Storage Virtualization Layer

However, obviously, both the given storage management scheme (see next section) and,

depending on the layout, the capability of the utilized allocation scheme directly deter-

mines the overall performance of the storage network regarding the described challenges.

1.1.2 Storage Management Schemes 7

1.1.2 Storage Management Schemes

In most commercial SANs today, RAID is still the usual way to allocate data blocks to

the set of storage devices, and for fault-tolerance purpose, most companies apply RAID

levels 1 (replication, resp. mirroring) or 4/5 (erasure (resilient) encoding) (for details, see

[PGK88, Mas97]) which arranges the scrambled data blocks into a fix sized stripe of data

blocks and additionally includes a parity block for providing one-failure tolerant redun-

dancy; if a higher level of redundancy is required, often Reed-Solomon erasure encoding

is considered (c.f. Chapter 4). Although RAID implements a deterministic (resp. man-

ual) data placement, it works for small storage environments as RAID sets are shipped in

terms of rigid storage cabinets each covering some fixed number of homogeneous disks on

which the data is striped by an internal controller to ensure low system latencies. Surely,

fixed RAID sets provide optimal fairness (by uniform data distribution) and since such

systems can tolerate one single disk failure at a time, vendors almost add so-called spare

disks for compensating a missing disk in case of failure.

However, if considering large-scale systems or those that may undergo changes in the

number of connected storage systems, the traditional rigid layout of such SANs makes

storage management become more and more less efficient because the only way to scale

is to concatenate multiple RAID sets disjointly, as depicted in Figure 1.3.

“manual”

placement

Figure 1.3: Classical Storage Management: Concatenated RAID sets

The drawbacks are quite obvious. First, if such systems base on disks of same type, the

probability of an occurring failure increases significantly (c.f. next section). Furthermore,

easy system and data migration is merely possible inside a given cabinet, but most im-

portantly, since the storage virtualization is located close to the data inside the cabinets,

the benefits of virtualization, like e.g. fault-tolerance, disk utilization or access efficiency,

have only local effects inducing the need for an additional superior virtualization layer. If

then such layer also implements RAID, system scaling becomes a hard challenge because

if new disks are added over time, with high probability, they are of different characteristic

making the scaling problem a heterogeneous one.

8 Introduction

Therefore, a self-evident way to achieve highest system performance is given if envi-

ronments get rid of all the rigid cabinets and arrange the complete storage in a single, flat

storage solution, as depicted in Figure 1.4 ? Surely, such solution is much more chal-

lenging to the employed allocation scheme because all mentioned issues directly address

the superior virtualization layer. However, besides the possibility to efficiently handle

heterogeneous devices, an open, flat storage environment further offers increased scal-

ing possibilities as well as better capacity consumption by efficiently using the complete

storage space and thus, letting spare disks become obsolete.

Figure 1.4: Open Storage Management: Single, flat storage arrangement

In this work, we present and discuss different data placement schemes for both clas-

sical and open storage configurations, and thus, depending on different restrictions, also

different demands with respect to the virtualization engine are considered. For example,

for a classical RAID-based storage layout as described, we ignore scalability and hetero-

geneity, and instead, rather focus on fault-tolerance and efficient access behavior that is

almost gained by erasure resilient encoding.

Therefore, in Chapter 4, we present a flexible framework for generating different era-

sure resilient codes that, compared to classic erasure codes, include improved update

behaviors which are very useful for SANs. On the other hand, in case of open storage en-

vironments, scalability and heterogeneity additionally come into play forcing the applied

distribution scheme to handle all challenges mentioned in the previous section. Therefore,

in Chapter 3, we introduce efficient and redundant strategies that base on replication and

which are furthermore capable of coping with scalable and heterogeneous storage config-

urations, respectively. However, before we go into further detail investigating these novel

allocation schemes, we first introduce and briefly describe the complete set of different

requirements that generally address data allocation and which must, at least partially, be

handled by any placement algorithm employed in a storage network.

1.2. Demands on Data Placement in Storage Networks 9

1.2 Demands on Data Placement in Storage Networks

We give a brief summary of the basic challenges that must at least in part be considered by

a data allocation scheme to ensure the serviceability and availability objective for a con-

sidered storage environment. In particular, as the underlying storage configuration may

restrict the set of demands to be considered by the applied placement scheme, different

overall performances might be achieved for varying configurations. However, all strate-

gies observed in this thesis in common is that they can be reduced to pure data placement

only, i.e. all considered strategies are not concerned with limiting properties of the under-

lying network topology, like e.g. congestion or network latency. This results from the fact

that the I/O accesses to the single storage devices cause the bottleneck of the overall sys-

tem because any such access is served by comparatively slow mechanical driven access

mechanisms inside a single storage device. Consequently, we can suppose the underlying

network topology to be unrestricted because the system bottleneck appears directly at the

storage devices such that potential problems when routing the data packets through the

network can be neglected.

Therefore, the usual way to ease the access bottleneck at the disks in order to decrease

system latency in both the classic and the open storage model is data striping which

spreads the data blocks of some given file to multiple storage devices making multiple

portions of the file be accessible in parallel, thus yielding effective transfer rates. Ba-

sically, the principles of data striping can be partitioned into random and deterministic

approaches, and as we have seen, in traditional (commercial) SANs, respectively disk ar-

rays, the well-known deterministic RAID scheme [PGK88, Mas97] is almost still applied.

However, in general, besides parallel access, striping the data as evenly as possible among

the disks further results in an optimal disk utilization which, at first, also improves system

performance because each write I/O involves several disks for data storage, and second,

reduces the TCO (c.f. [Mas97]). Moreover, data striping over multiple devices consider-

ably supports business continuity which, for instance, becomes apparent in the example of

the video-on-demand provider who runs into problems when storing frequently accessed

data onto either a single or a small number of devices such that parallelism could not be

exploited accurately. On the other hand, placing only infrequent accessed material on a

storage device may result in the device to be underutilized. Thus importantly, any allo-

cation scheme must provide an appropriate load balancing with respect to both the data

blocks and the incoming I/O requests over all accessible disks.

As we explained already, modern open scalable storage systems additionally need to

consider storage devices of non-uniform characteristics (e.g. capacity, bandwidth, la-

tency, age etc.). Therefore, advanced data distribution strategies are required to handle

heterogeneity appropriately; with respect to Salzwedel [Sal04], we denote this the hetero-

geneity problem, and any data distribution strategy able to efficiently handle heterogeneity

10 Introduction

is called fair. Similar to uniform capacities, a fair distribution has the same great influence

on disk utilization and thus, overall system performance.

Another issue significantly affecting system latency is system scaling what is caused

by an ever growing storage data volume as well as occurring disk failures. In case of a

growing data volume, the simple concatenation of storage devices obviously implies ac-

cess bottlenecks caused by the so-called 80/20 problem describing the fact that without

data redistribution, new data to which roughly 80% of all current accesses go is almost

exclusively stored on the newer devices covering roughly 20% of the total storage capac-

ity (c.f. Chapter 2). Additionally, depending on technical progress, age reasons or failure

probability, added or replaced disks are often of different characteristic compared to the

ones already connected. Thus, even if initially of homogeneous trait, over time, a SAN

becomes more and more heterogeneous. Furthermore, to obtain the best performance, the

newly added disks should also be included into the striping process but this evokes the

problem of how to add newer disks efficiently such that, in the long run, the overall per-

formance is not affected noticeably. Clearly, in case of fixed sized data stripes, any system

growth induces a complete redistribution of almost all of the stored data. If then TBytes

or PBytes of data are stored, either serious performance losings or complete downtimes of

the system for hours or days must be considered, what is intolerable for every company.

Thus consequently, system expansion should be done as fast as possible, and after finish-

ing the redistribution, the strategy should balance both the data and the requests optimally

over all disks. A scalable data placement scheme that keeps the amount of redistributed

data at a near-minimum is called adaptive.

Besides performance considerations, the stripe size has further influence on the relia-

bility of the system because, according to Patterson et al. [PGK88], the failure rate of a

disk array increases proportional to the number of covered disks which are expected to

fail independently. More precisely, failures in disk arrays are often assumed to satisfy

the memoryless property, that is, the life expectancy of a disk is estimated only upon the

condition that the disk is working now. Hence, the reliability of a disk array is modeled by

the exponential distribution [Gib99]. Consequently, for low disk failure rates, the failure

rate of a disk array is proportional to the number of disks it contains. However, as a rule of

thumb, we can alternatively consider the so-called Mean Time Between Failure (MTBF),

the vendor given operation time of a single disk expressing its reliability (for example, the

Hitatchi UltrastarTMA7K1000, a conventional disk in 2007, exhibits an MTBF of targeted

1.2 million hours [Tec07]). Assuming a constant failure rate, the MTBF of a disk array

then is

MTBF of a Disk Array =MTBF of a single disk

Number of disk in the array

Thus, for a SAN consisting of 100 UltrastarTMdisks, the MTBF is calculated as roughly

1.2 million / 100 hours (= 1.5 years). At a first glance, this seems sufficient, but if we

1.2. Demands on Data Placement in Storage Networks 11

scale up to 1000 disks, the MTBF drops to 1,200 hours, or less than 2 months, what is

dismal. With respect to the disk failures that might occur, we primarily distinguish three

types. The first, called transient errors, arise from noise corruption and are dealt with

by repeating the requests. The second, called media defects, are caused by permanent

defects in material and are detected and masked by the manufacturer. The last, on which

we concentrate in this thesis, are catastrophic failures such as head crashes and failures

of the disk controller electronics. According to Hellerstein et al. [HGK+94], we denote

such failures erasures throughout this work because when a disk suffers a catastrophic

failure, its data is rendered unreadable, and thus, is effectively erased.

Consequently, besides appropriate backup mechanisms, any allocation scheme should

furthermore provide some certain degree of fault-tolerance. Usually, fault-tolerance is

achieved by introducing redundancy into the system that can be realized by either repli-

cation, i.e. seperately storing identical copies of every data block, or erasure resilient

encoding by which the information is redundantly encoded inside a codeword made up

of a number of blocks that are stored on separated devices. As we see in the follow-

ing chapters, erasure encoding is suitable rather for a homogeneous set of disks whereas

with replication, heterogeneity and scalability can be tackled efficiently. That is, all data

placement strategies we introduce in this work are fault-tolerant approaches while redun-

dancy is either provided by replication or erasure resilient encoding. However, the central

algorithmic problem we address in this thesis is the following

Data Availability Problem: Guarantee to always ensure a fast and efficient access to

all stored data in a storage network even in case of erasures.

Surely, the data availability problem covers the fulfillment of both the serviceability and

the availability objective defined in the beginning including all properties and require-

ments discussed above. In the following list, we summarize these general requirements

that result from the given discussion:

1. Redundancy: Data items are redundantly stored on separated disks.

2. Fairness: To ensure a best optimal disk utilization and reduced access latency, a

data placement strategy must assign every disk a fair share of the total data load,

i.e. a share of data that is equivalent to its share of the total capacity. This should

hold not only for the data, but also for the I/O requests.

3. Heterogeneity: Non-uniform capacities or bandwidths should also be handled op-

timally.

4. Scalability: The system should be able to efficiently handle the addition or removal

of storage devices.

12 Introduction

5. Adaptivity: A scheme is called adaptive if in case of any system changes (in the

number of data items or storage devices), it allows to adapt to the new configuration

with minimum amount of data replacement while maintaining all given demands.

6. Availability: Suitable redundancy mechanisms should be incorporated to compen-

sate the loss of failed (or blocked) disk drives.

7. Efficiency: The total space for storing control information should only depend on

the number of storage devices and not on their differences in capacity. Furthermore,

the time for computing the position of any data item stored in the system should be

at most logarithmic in the system size.

Again, driven by specific application demands concerning the applied allocation strat-

egy as well as different characteristics of the given hardware environment often allow a

placement scheme to cover only a subset of the listed issues, as we see in the following

chapters.

Before we investigate the different strategies in more detail, for the sake of complete-

ness, we first take a brief look on the parameters that must be chosen for replication as

well as erasure coding because in large-scaled storage environments, a proper selection

of values for the redundancy parameters directly influences the capacity provisioning and

thus, the TCO of the company.

1.2.1 A Brief Note on Erasure Encoding and Replication

Replication is a simple scheme where ridentical copies of each data object are stored

separately on rdifferent disks, thus, enabling the system to tolerate up to r−1 erasures.

In contrast, with an erasure-coded redundancy scheme, each object is divided into kfrag-

ments and redundantly encoded into a codeword of n>kfragments (blocks) which are

also stored separately. Then, the key property is that the original information can be re-

constructed from any kfragments of the encoding. Basically, the main difference between

replication and erasure encoding is that with replication, the entire information of the data

block is stored in each copy implying that only one out of rcopies suffices for recovering

the data, what is an advantage as most applications today feature a rather read-intensive

access behavior. On the other hand, this leads to a serious drawback if capacity sav-

ings become apparent as replication suffers from a storage overhead of a factor of r−1.

Therefore, the main reason for using erasure coding (if only homogeneous storage de-

vices are considered) is that a high degree of redundancy, respectively data availability,

can be achieved with less additional storage because the redundancy factor is only sc=n

Nevertheless, in case of non-uniform capacities or a scalable storage system, replication

features the discussed advantages.

1.2.1 A Brief Note on Erasure Encoding and Replication 13

Often, the values r,kand nare simply set by a rule of thumb but in large systems, made

up of a hundred or thousands of storage devices, these values should rather be chosen

carefully with respect to the mentioned storage overhead. Moreover, this also includes

the question of how many copies to hold for data items featuring different popularities.

For this problem, a suitable way to determine and thus control the storage overhead of

both approaches was presented by Rodruiges and Liskov [RL05]. They focus on high

availability in peer-to-peer networks and model the values of r, respectively sc, according

to a desired per object unavailability target ε>0 in combination with an average node

availability adenoting the time a peer is reachable. In particular, in case of replication,

they assume the node availability to be distributed independently and identically. Then,

the values for εand r, respectively, are computed as

ε=Pr[object ois unavailable]

=Pr[all rreplicas of oare unavailable]

=Pr[one replica is unavailable]r

= (1−a)r

which upon solving for ryields r=llog ε

log(1−a)m.

For the case of erasure coding, we sketch a summary of the complete derivation to

compute scwhich can be found in [BSV03]. Object availability is given by the probability

that at least kout of sc·kfragments are available which can be modeled by a binomial

distribution as

1−ε=

sck

∑

i=ksck

iai(1−a)sck−i.

Applying some algebraic simplifications and the normal approximation to the binomial

distribution, the storage overhead sccan be obtained as

sc=





σεqa(1−a)

k+qσ2

εa(1−a)

k+4a

2a





where σεis the number of the standard deviations in a normal distribution for the required

level of availability.

Besides the result of Rodrigues and Liskov [RL05], there exists further comparisons

[BSV03, DLS+04a, WK02], all of which argue that erasure coding is the clear victor due

to capacity savings for the same level of availability. However, the authors only compared

approaches for homogeneous, unscaled systems, and as we already described, in case of

non-uniform capacities or dynamic system behavior, an open replication based allocation

has considerable advantages.

14 Introduction

1.3 Our Contribution

Depending on the different storage configurations briefly described in Section 1.1.2, the

schemes we introduce in the following are coarsely divided into two different parts. With

respect to an open, heterogeneous and perhaps dynamic system, in the first chapter, we in-

vestigate replication-based schemes that adapt balls-into-bins games (see Section 3.3.2.1)

for randomly allocating identical copies of data items to the set of storage devices ensur-

ing that no two identical copies are hosted by the same device. Thus, given a replication

parameter of r, the system can tolerate up to r−1 failures and is still operational. We

consider randomized algorithms because in the non-redundant case, a random allocation

of the data items has turned out to be a useful way to conveniently model the load balanc-

ing problem. Furthermore, we distinguish between static (i.e. not dynamic) and dynamic

environments that may undergo changes in the number of connected disks, and as we

see, if scalability is neglected, a random and fair allocation to non-uniform capacities

can already be achieved very easy. Nevertheless, as we show, with a random and redun-

dant allocation of identical copies to non-uniform disks, an appropriate load balancing is

harder to gain than in the non-fault-tolerant case. Furthermore, if the system is allowed

to behave somehow dynamic, this challenge increases by magnitudes, but as we show by

the SPREAD strategy, an adaptive allocation scheme for redundant and fair storage in

dynamic heterogeneous storage systems, this can be handled efficiently.

Although storing replicated copies implies the efficient and appropriate handling of

hardware heterogeneity as well as dynamic behavior, unfortunately, such schemes require

to update all copies in case of any data modification to keep the system consistent, and

moreover, they require a higher waste of resources. Therefore, in the second part, we

introduce the so-called Read-Write Erasure Coding Systems (RWC), a flexible system

for generating parametrized erasure codes that, unlike usual erasure codes such as RAID

or Reed-Solomon codes, offer improved update properties making them very useful for

utilization in storage networks. Furthermore, the generated codes additionally include

new security features for a secure storage of redundant data.

Before we start introducing our main results, we first sketch some of the most prominent

previous results, each of which was designed to tackle a certain subset of the listed issues

above.

CHAPTER 2

Some Classical Approaches

In this chapter, we give a brief survey of the most prominent approaches in the field of data

placement algorithms that are usable for storage networks. Basically, most of the schemes

we discuss in this thesis distribute the given data by applying either hashing methods,

randomization, erasure resilient encoding or a mixture of these techniques, and for most

of them, it holds that depending on given parameters, configurations or objectives, it

often suffices to fulfill only some of the number of listed demands on data placement. For

example, if the system is not expected to scale and the connected storage devices are of

uniform size, simple and efficient XOR-based RAID schemes could be applied, instead

of complex distributed hash table approaches.

Since we investigate randomized allocation schemes that adapt balls-into-bins games

and erasure resilient encoding more deeply in the two major parts of this thesis, we now

discuss alternative approaches that base on a different design. All these approaches in

common is that non of them covers the data availability problem completely, thus, we

partition the description of the algorithms according to the requirement they mainly aim

to fulfill, namely fairness, data availability, heterogeneity and scalability (moreover, in

this chapter, we restrict ourselves to scalability rather than adaptivity because for some

scalable strategies the degree of adaptivity has not yet been analyzed or is simply ignored).

As fairness is one of the most important issues with respect to the serviceability objec-

tive, we start with introducing techniques that aim at both obtaining a fair data and access

distribution over the disks to provide optimal system response times. In particular, these

schemes can mainly be distinguished into either deterministic or random based ones.

Fairness

Again, the ability to distribute the data in a fair manner among the (perhaps non-uniform)

disks in a SAN (which is often also referred to as space balance [Sal04]) directly corre-

lates to an optimal system saturation, respectively an overall disk utilization. Moreover,

16 Some Classical Approaches

besides decreasing the waste of storage space by full capacity usage, inside a disk, con-

tinuous access algorithms, like the elevator strategy, are applied making the response time

of a disk correlate with its utilization.

Moreover, as briefly sketched in Section 1.2 and shown by the example of the audio-

and video provider in the previous chapter, not only the data must be distributed fairly;

the second factor that has great influence on the performance are user access patterns.

In particular, it has turned out that the frequency of access to data objects follows a Zipf

distribution1[WAS+96] (Figure 2.1) where about 80% of the accesses are to 26% of the

data indicating a high locality of reference which leads to access hot spots that have to

be dissolved by the placement algorithm. Unfortunately, many of the older expendable

data placement schemes do not consider this problem appropriately. For example, Fargin

et al. introduce Extensible Hashing, a scheme that uses a directory for addressing pages

[FNPS79, ED88]. The scheme applies a hash function to evenly spread the data across the

directory, and the addresses of the pages are stored in at least one index of the directory.

An important characteristic of this directory is that several entries may point to the same

primary data page which leads to access clustering on the pages. Moreover, in case of

key overflows on some pages, the directory expands by complete doubling. In that case,

another hash function addressing a wider index range is applied which in turn intensifies

the access clustering on some pages.

Figure 2.1: A typical Zipf-law rank distribution. The y-axis represents occurrence fre-

quency, and the x-axis represents rank (highest at the left).

1In a Zipf distribution, if the objects are sorted due to the access frequency, then the access frequency

for the ith object is given by pi=c/i(1−θ), where θis the parameter for the distribution and cis a

normalization constant.

Therefore, space and access balance are mutually dependent, and it is always necessary

to store the data such that I/O accesses are balanced evenly across the disks, regardless of

using uniform or non-uniform sized disks.

When looking at algorithmic approaches, usually in storage networks, deterministic

strategies are used, while the most-applied deterministic technique is uniform striping

[Joh84, PGK88, Mas97, HJH02] by which, given a fixed stripe size n, the i-th element in

the data stripe is mapped to disk Dimod non position idiv n(c.f. the widely used RAID

schemes with levels 0/4/5/6/...). However, deterministic striping is hard to scale because

whenever a fixed stripe size is broken up and scaled to a different size, this induces data

redistributions at great expense. Therefore, the natural extension is to simply concatenate

additional disk arrays to the given system, but again, this leads to the above 80/20-problem

(also discussed in the previous chapter). An alternative approach is to group the disks into

clusters in advance while each cluster idefines its own stripe of length `isuch that inside

the cluster, uniform striping is applied. Unfortunately then, either the complete system

has to be planned also for future situations or, in case of concatenated disk arrays, the

number of new disks is determined by fixed stripe sizes perhaps resulting in an undesired

over-provisioning of storage capacity. As a consequence, there exists a number of simple

striping variations to break the very regular structure of this approach (e.g. [BGMJ94]).

A more general drawback of deterministic schemes is given by the fact that if the place-

ment policy is completely known, bad inputs causing placement collisions can easily be

created. Thus, the most effective protection to be safe from bad input is given by ran-

domization (see e.g. [CW77, LYD71, AT97, SMRN00] and c.f. Section 3.1.2.1). With

a randomized placement scheme, each data block is assigned to a disk drawn at random

by either a pseudo-random hash function or a precomputed random distribution [SM98a].

Generally, randomized placement schemes are often much simpler than their deterministic

counterparts, efficient to use, redundantize the need for complex and sensitive centralized

control, provide efficient results even for non-uniform settings, and most importantly, un-

like deterministic striping patterns, offer high potential to adaptivity, i.e. being able to

react quickly on occurring configuration changes (see [Sch00] for a good overview about

randomization techniques and Chapter 3 for efficient novel strategies).

Scalability

Again, todays SANs still follow the classical storage management (Section 1.1.2) for

which the expansion rule is simple concatenation of rigid RAID arrays in case of sys-

tem scaling. As mentioned above, this results in an undesired and costly capacity over-

provisioning implying a negative effect on the TCO. Moreover, assuming a fixed stripe

size, the induced redistribution phase dramatically influences the overall system perfor-

mance.

18 Some Classical Approaches

a b c d

00 01 10 11

(a)

a b c d

000 01 10 11

100

(b)

Figure 2.2: Linear Hashing

An early approach on scalability to overcome these drawbacks is given by the Linear

Hashing strategy providing a fair and access balanced distribution of data records directly

on a number of continuous (primary) buckets as long as the address space remains un-

changed. The buckets are addressed by a pair of hash functions hiand hi+1,i=0,1,2,...,

starting with h0, and each hihashes keys to primary buckets (Figure 2.2a).

As shown in Figure 2.2(b), continuous key insertions on single buckets may lead to

bucket overflows (bucket b) inducing the address space to expand which is done by split-

ting overflowed buckets in two starting from left to right (that is, bucket ais split before

bucket b) while one of them is added as a new (secondary) bucket at the rightmost end

of the primary address space. To address the split buckets, a linear extension hi+1of hiis

then used for addressing the new (extended) address space.

However, although designed for scaling systems, Linear Hashing features serious draw-

backs. First, the continuous (and linear) addition of buckets leads to bad space utilization

as unsplit buckets are more likely to overflow than newly added ones leading to waste of

storage. Second, the best performance for an unscaling system is gained by hash func-

tions that distribute the data uniformly over the address space. Unfortunately, as new

buckets are sequentially added at the rightmost end, the overflow chains of buckets at

the rightmost end become too long (Figure 2.2) because they are the last to split, thus,

causing an imbalance. Therefore, a decreasing exponential distribution that maps more

keys to the left end would perform better. However, there are different approaches that

handle this problem accurately but by inducing slightly more complexity into the system

[Mar79, Mul84, Oto88].

Basing on the idea of Linear Hashing, Litwin et. al introduced LH* [LNS93], a class

of scalable distributed data structures (SDDS) generalizing the concept of Linear Hashing

to parallel and distributed RAM and disk files without employing a centralized coordi-

nator. In contrast to Linear Hashing, the address space is shared and manipulated by

several clients, and to enable distributed processing appropriately, all clients have their

own view on the file, and calculations are made on local values. Accurate object manipu-

lations are then coordinated by a messaging system where clients process local parameter

adjustments while only the bucket splitting is controlled by a dedicated server, the split

coordinator.

Like Linear Hashing, LH* allows a file (address space) to grow gracefully in a dis-

tributed environment but adopts many of the drawbacks sketched above. As an ex-

tension, several LH* variants have been created that incorporate fault-tolerant features,

such as mirroring, checksum, scalable availability or Reed-Solomon encoding (see e.g.

[LN96, LMR98, LS00, LR02]). However, depending on the overflow policy, space uti-

lization is also more or less moderate, and more importantly, no LH* variant provides

mechanisms to cope with non-uniform disk capacities.

Data Availability

We already explained that the easiest way to achieve fault-tolerance is to separately store

identical copies of all data items on different disks, like in the RAID 1 scheme [PGK88]

or the PAST storage server [RD01, DR01], and although keeping additional copies for

each data item results in a higher waste of storage capacity, compared to erasure resilient

encoding, replication offers better access properties for read-I/O dominated applications,

like not being concerned with complex data en- or decoding as well as offering the chance

for accesses to alternate on identical copies for I/O speed-up. Nevertheless, instead of

replication, commercial SANs rather use RAID-parity encoding while business continuity

is provided by the concept of (hot) spare disks, dedicated disks that are kept for fail-over

only. Negatively, in normal mode, these disks remain unused, and only in case of a

failed disk, they take over the missing part. Additionally, this concept does not ensure the

expandability of the system due to growing storage demands.

With respect to the LH* schemes, Honicky and Miller presented a way to overcome

their inherent drawbacks [HM03, HM04, WBMM06]. They introduced a family of pseudo-

random algorithms for decentralized data distribution that map replicated objects to a

scalable collection of storage servers or disks. These algorithms are primarily designed

for very large-scale storage systems in which new disks are always added in terms of

weighted homogeneous clusters. Moreover, all of an object’s replicas are stored strictly

disjoint (for example, some peer-to-peer systems, such as OceanStore [REG+03], do not

make such guarantees). However, since all algorithms are laid-out to work for large-

scale systems, a per-cluster weighting is performed rather than weighting individual disks

which leads to a fair distribution with respect to clusters only. Furthermore, in case

of system scaling, the redistribution of data objects solely considers uniform servers in

weighted clusters, thus, fairness can only be achieved for uniform disk capacities making

the scheme less usable for heterogeneous disks (but which is tolerable since they assume

server addition in clusters only).

Another approach to obtain a scalable and fault-tolerant strategy that is furthermore able

to handle non-uniform capacities was presented by Brinkmann et al. [BEHV05, BE07].

20 Some Classical Approaches

They show the coupling of a scalable data placement scheme called SHARE (c.f. Sec-

tion 3.5.1 for details on SHARE) with a separate RAID 1 layer on top. This approach

is implemented inside the storage virtualization solution V:DRIVE using SHARE as the

core allocation algorithm to become scalable. In particular, for a given replication pa-

rameter r, the V:DRIVE layer creates rvirtual volumes from the physical storage devices

which are then exported to the RAID layer for generating mirrored RAID volumes on top.

However, although efficient, the inherent drawback of this generated stack of strategies

is an increase in error-proneness because of a higher number of different but interacting

strategies.

In addition to storing multiple copies for each data item, the use of parity information to

tolerate a small number of failures has reached high attraction. The key idea is to introduce

redundant information into each data stripe by keeping the bit-wise exclusive-or (XOR)

information of the data blocks in the stripe in an extra parity block. Then, all blocks

are stored on separated disks, and in case of a single disk failure, the XOR operation

guarantees complete information recovery from the residual disks. Due to their simplicity,

efficiency and low capacity consumption, parity based schemes have widely been applied,

like in RAID level 4/5 [PGK88, CLG+94] or numerous video servers [BGMJ94, SM98a,

TPBG93, VRG95].

As simple parity approaches only prevent from one erasure, advanced schemes, like

the EVENODD strategy [BBBM94], are able to tolerate two missing disks. This is ac-

complished by keeping a matrix in which the regular RAID level 5 layout is encoded and

parity information is added for each diagonal of the matrix. Moreover, an EVENODD ap-

proach can be extended to tolerate up to tarbitrary simultaneous disk failures, but in this

case, t·`1−1/tdisks are needed for keeping parity information [HGK+94], where ` < n

denotes the stripe length. Thus, with respect to a fast reconstruction, this approach is only

of theoretical interest. As a consequence, more complex erasure codes even for storage

networks, like SANs or peer-to-peer networks, have become important that are able to

recover from a higher number of disk failures (but which we leave out for the moment

and refer to Chapter 4 for more details).

However, all known erasure coding schemes mainly focus on an efficient and almost

optimal data recovery in case of erasures. But as we see in Chapter 4, this induces a

negative update behavior in case of information modifications. Therefore, to overcome

this drawback, we introduce so-called Read-Write Codes that feature an improved update

flexibility for stored codewords and which are very useful for SANs.

Heterogeneity

Obviously, an easy solution to face heterogeneity could be to divide disks into clusters

of equal characteristics, as proposed by Honicky and Miller [HM03, HM04, WBMM06].

Since inside each cluster, the disks appear to be uniform, the aforementioned strategies

can be applied without explicitly regarding heterogeneity. However, this principle com-

prises some drawback. If newer and perhaps faster disks are added, a significant improve

in access performance is only given for the subset the new disks were added to. Thus, re-

sponse time strongly correlates to the clusters and the clustering policy, respectively. As

an advantage, if access patterns are known in advance and remain stable over time, such

clustering could be used to dissolve access hot spots. Unfortunately, such assumption is

not very realistic.

Another principle that bases on grouping is given by the HERA approach, a heteroge-

neous extension of RAID [ZG00]. In HERA, a strategy called Disk Merging is introduced

that was first proposed in [ZG97, Zim98] and which constructs a logical collection of uni-

form sized volumes from an array of heterogeneous physical disks. Each logical volume

is created by aggregating a certain number of fixed sized fractions of the bandwidth or

capacity provided by different physical disks. For example, in Figure 2.3, the logical

volume number 2 (d`

2) is fed by fractions from the disks 0, 1, and 2 (dp

0,dp

1,dp

2).

0.0 0.1 0.2 0.P

1.0 1.1 1.2 1.P

2.0 2.1 2.2 2.P

3.0 3.1 3.2 3.P

Logical: d0ld1ld2ld3ld4ld5ld6ld7ld8ld9ld10ld12ld13ld14ld15l

d11l

d0pd1pd2pd3pd4pd5p

Physical:

Parity

Groups:

p0=2.2 p1=2.2 p2=3.6 p3=2.2 p4=2.2 p5=3.6

Figure 2.3: HERA: Heterogeneous Extension of RAID

According to a given parity scheme, the volumes are then grouped into different parity

groups. As a main constraint, it must be considered that two logical volumes may be

mapped to the same physical device, thus, they become dependent because if the physi-

cal disk fails, both volumes are affected. Consequently, since with simple XOR exactly

one failure can be tolerated, two different volumes may not be mapped to the same par-

ity group. To show the reliability of the system, the behavior is modeled by different

processes, like the failure process and the reconstruction process. Because the MTTF

of any disk is given by vendors, the time of these processes can be estimated; using a

Markov model [Gib99], the mean time to service loss (MTTSL) can be derived for such

heterogeneous environments. It was shown that the system reliability is only a factor of

approximately 10 away from the best possible configuration, namely the clustering of

22 Some Classical Approaches

identical disks, as outlined above. However, the scheme itself and its provided perfor-

mance strongly depends on the control of an administrator and is thus unusable for large

environments.

In AdaptRAID [CL00, CL01], the regular RAID layout is extended such that hetero-

geneity in terms of different capacities can be handled. We only sketch the initial idea

here which is as follows. The disks are sorted according to an increasing capacity. Then,

the goal is to put as many stripes of length `=non all disks as possible. As soon as some

kdisks are saturated by data blocks, stripes of length n−kare mapped onto the remain-

ing n−kdisks, and so forth. The main drawback of this approach is that, depending on

different stripe lengths, access to data stored in the upper part of a disk is faster than in

the lower part. To overcome this problem, a second layer of different data patterns must

be defined carefully which compensates the gaps between the initial patterns. Access to

the data blocks that are encapsulated in fix sized stripes then is twofold making it hardly

usable if the environment is allowed to scale.

At last, we briefly sketch the RIO (Random I/O) Mediaserver [SM98a, SM98b], a

generic multimedia storage system capable of efficient, concurrent retrieval of many types

of media objects. The RIO server defines a randomized distribution strategy and supports

real-time data delivery with statistic delay guarantees. Data is placed to randomly cho-

sen positions on randomly chosen heterogeneous disks. Disk bandwidth and capacity are

coupled by a bandwidth-to-space ratio BSR, and disks with a high BSR can deliver data

faster than some with lower BSR. To better utilize the preferred disks, replication is used

such that copies of data blocks are assigned to these disks. The only remaining problem

then is to identify the degree of redundancy to introduce into the system to sustain a given

load but which can be estimated by simple calculations (that we leave out at this point).

Importantly, this scheme already emphasizes that randomization performs best when

having to cope with heterogeneous disk such that disk utilization and overall system per-

formance can optimally be provided.

For the rest of this work, note that additional related work that bears a higher relation to

the results representing the core of this thesis can be found located closely to the particular

strategies.

CHAPTER 3

Random Allocation of Data Copies

In this first scenario, we refer to an open storage management scheme as described in

Section 1.1.2 in which we consider a SAN as a single, flat storage solution consisting of

storage devices of arbitrary capacities. Furthermore then, we investigate data placement

approaches that use replication for obtaining fault-tolerance in storage networks. As we

already mentioned, replication is the easiest way for introducing redundancy into a stor-

age system because the only thing to do is to store ridentical copies of each data item

separately on rdifferent disks what makes the system capable of tolerating up to r−1

erasures. Surely, the storage of r−1 additional replicas of some data block implies a

storage overhead of a factor of r−1, but this waste of capacity is outweighed by the in-

herent simplicity of the scheme because, basically, the only challenge to cope with results

in controlling the separate storage of the rcopies on different disk. Depending on this

inherent simplicity, replication has become very popular in many application areas be-

cause it neither depends on complex control structures nor on any operational overhead,

as, for instance, given with erasure coding (see Chapter (4)). Again, the most promi-

nent representative using replication in the context of storage networks is the well-known

RAID scheme whose level 1 definition (also known as mirroring) applies replication in

a pure deterministic manner, and practical usage has shown that, as long as the observed

environments do not exceed some certain size, the RAID 1 scheme is quite efficient.

However, if systems grow, the attempt to control any deterministic allocation regarding

the redundancy and fairness conditions becomes pretty challenging. Moreover, if, be-

sides the required redundancy, additional parameters like arbitrary disk capacities and/or

dynamic system behavior come into play, with a deterministic scheduling, the costs, re-

spectively computational complexity for the allocation considering all demands listed in

Section (1.2), rise significantly with the size of the storage environment, what is dismal.

Thus, alternatively, a well-known and very efficient paradigm to overcome these draw-

backs is given by randomization which has become a standard approach in algorithm

design since efficiency and simplicity are the main features of randomized algorithms.

24 Random Allocation of Data Copies

The usage of randomization is not that new, and probabilistic based schemes have been

applied successfully to conveniently model load balancing in many allocation problems,

such as job scheduling, hashing, network routing or data allocation in e.g. distributed

hash tables (DHT) [Sch01, DKM+88, KLM92, ACMR95]. Again, randomized place-

ment schemes are often much simpler than their deterministic counterparts, efficient to

use, redundantize the need for complex and sensitive centralized control (as required by

deterministic schemes) and provide efficient results even for non-uniform settings.

Surely, one of the main differences between deterministic and randomized strategies is

that, in general, going from a deterministic scheduling that becomes very costly for large

systems to short computations of randomized algorithms is often paid for by the risk of

computing a wrong, respectively erroneous outcome, what is also undesired. Fortunately,

for all our strategies, we can show that, besides being redundant, they appropriately satisfy

the desired fairness condition, and the deviation of the disk loads resulting from the ran-

domized allocation compared to an optimal data layout is considerably small, with high

probability. Note that, when using a randomized placement, the higher the concentration

of the load of some disk around the expected value (which equals the load assigned to

the disks by an optimal algorithm) the more evenly the data layout becomes. Again, this

basic feature prepares the ground for efficient parallel access to the data stored on slow

mechanical based devices because by storing the data in a randomized fashion among

all accessible disks, all I/O requests are also fragmented and evenly distributed over the

storage network.

Furthermore, unlike any deterministic approach, randomization offers high potential

to scalability (see e.g. [JK77, KLMadH96, BMadHS97, Sch01]) which has become an

important property of modern SANs over the last decade to efficiently face the increasing

system sizes and huge amounts of stored data making any downtime or interruption of

service intolerable, and therefore crucial for a company. Hence, to keep the system at high

operation, the costs for restoring an efficient data layout after any change in the set of disks

should be minimized. To guarantee this, a data placement scheme should be adaptive, that

is, in case of changes, it should guarantee to redistribute only a minimum amount of data

while after redistribution, a fair data layout should again be achieved. This task is already

of great challenge in a non-fault-tolerant and homogeneous setting considering uniform

capacities only, but in a heterogeneous environment, this challenge further increases, and

finally, if redundancy must furthermore be taken into account, this task turns to be pretty

hard. Nevertheless, depending on the inherent properties of randomization, this demand

can be tackled more conveniently by a randomized strategy. A good introduction to the

design and analysis of a broad bouquet of randomized algorithms is given in [MR95,

HZ05]. Furthermore, Scheideler [Sch00] presents several probability methods that can

be used efficiently for different types of problems, like e.g. routing in networks, job

scheduling, coloring hypergraphs and data distribution processes.

Outline of this Chapter

For what follows, we investigate and analyze novel randomized replication strategies un-

der fault-tolerance consideration, and the major problem on what we are interested in this

chapter is to always find a fair load balancing for storage networks that may consist of ar-

bitrary disk capacities. Obviously, this is a pretty challenging task because, compared to

usual randomized placement algorithms which allocate all data items independently from

each other, regarding the redundancy condition at any point in time induces dependen-

cies on the allocation of any two identical copies, thus, making the allocations no longer

independent from each other. For dynamic systems that are often allowed to undergo

changes in the number of connected devices caused by the addition or removal of accessi-

ble disks due to maintenance, capacity extensions or disk failures, this problem becomes

even harder.

Since companies run static as well as dynamic systems, we divide our observations into

two parts. At first, we consider static storage environments only, while the second part

also tolerates dynamic system behavior. This implies that, in the first part, we can restrict

our analysis on the data layout, i.e. on the load of the disks after allocation, whereas in

the second part, we also have to consider the replacement of data blocks in case of disk

additions or removals.

In more detail, for the static scenario, we analyze strategies that are closely related to

those applied for the classical resource allocation problem. In that problem, a system

is considered in which for each arriving data block a scheduling algorithm chooses the

location of the block on-line and uniformly at random from a fixed number of identical

resources (e.g. disks of same size). Since in this scenario, both the resources and the

data blocks are supposed to be uniform, and additionally, the blocks are considered to

be pairwise independent (i.e. no redundancy is assumed), near-optimal allocations can

be obtained by using algorithms which base on the widely known Balls-into-Bins model,

a simple model that has become very prominent in the load balancing community (see

e.g. [JK77, RS98, ABKU00, BCSV00, Sch00]). In short words, in a balls-into-bins

model, one sequentially allocates a set of mindependent balls (representing tasks, jobs,

data blocks,...) at random to a set of nbins (servers, processors, disks, ...) such that

the maximum number of balls in any bin is minimized. In the recent years, many use-

ful and efficient results have been obtained by this simple scheme, most of which hold

with high probability1. However, unfortunately, nearly all of the known results only con-

sider uniform capacity distributions, and more important, none of the results covers the

redundancy condition we address in this work (we return to the balls-into-bins approach

in more detail in Section 3.4).

1We denote an event Ato occur with high probability (w.h.p.) if Pr[A]≥1−n−`for an arbitrarily chosen

constant `≥1.

26 Random Allocation of Data Copies

The dynamic scenario is significantly harder as to maintain fairness, any given change

in the capacities of the disks caused by disk additions or removals implies some redis-

tributions on the stored data blocks. Thus, to avoid any intermediate state in which at

least one of the given requirements is violated, the applied placement scheme is forced to

re-establish the fair and redundant data layout after any step in the redistribution phase.

3.1 Preliminary Section

Summarizing the descriptions in the previous section, we are interested in finding effi-

cient randomized algorithms for some sort of redundant resource allocation problem for

arbitrary capacities under either static or dynamic conditions. This section is dedicated

to give useful abstractions for addressing this problem in a theoretical manner. First, we

introduce a basic model of the considered storage networks and then define essential at-

tributes on which we concentrate in the subsequent sections to evaluate the resulting load

balancing after the random data allocation. That is, as we are mainly interested in the

layout quality of the allocations throughout the whole chapter, we solely focus on the

insertion of data blocks and omit the deletion of already stored blocks. In addition, we

only consider a finite number of data blocks to be allocated, thus, we restrict ourselves to

afinite allocation problem what is different to an infinite allocation model which is more

suitable for problems such as job allocation on processors or video-on-demand scenarios.

(Moreover, for the rest of this chapter, we use the terms bins, disks, nodes interchangeably,

and this also holds for balls, data blocks and data items.)

3.1.1 The Basic Model

First of all, for the following descriptions, we define the set [c]:={1,...,c}for any c∈N.

Then, for all observed storage systems, we consider arbitrary but fixed numbers M,N∈N

sufficiently large such that the set V= [N]denotes the address space of all possible disks

(bins, nodes), and U= [M]is the address space of all possible balls (data items) in the

system. Given these sets, we suppose that, at any time, m≤Mis the number of balls

currently in the system, and n≤Nalways denotes the current number of accessible disks.

Naturally, as storage networks consist of a collection of hard disks each covering of a

huge number of data blocks, we assume mn, and since our distribution schemes pay

no attention to the content of any given data block, this also implies to consider the balls

as uniform, that is, their weights are normalized, i.e. each ball j∈Uhas weight wj=1.

Furthermore, due to redundancy requirements, let each data item be represented in the

system by r∈[n]identical copies, with rcalled the replication parameter (for a non-

redundant system, we set r=1). Since we assume rto be fixed for all data items and we

always consider m≤Mballs currently in the system, let [k]be the set of original, pairwise

3.1.1 The Basic Model 27

distinct data items, for k=m

r∈N. Generally, any given I/O operation on the data blocks

(i.e. read or write access) is referred to as data access or data request.

For a formal description of our storage network model, we refer to some definitions

from the thesis of Kay Salzwedel [Sal04].

Definition 3.1. (Storage Network) Astorage network S(n,C[n]) is a collection of n ≤N

accessible disks [n]⊆V. Each disk i is characterized by a capacity Ci∈Ndescribing

the total number of (uniform) balls the disk can store. Let C[n] = (C1,C2,...,Cn), and the

total capacity of S(n,C[n]) is Ctotal =∑1≤i≤nCi.

A storage network S(n,C[n]) is called homogeneous (or uniform) if all disk have the

same capacity, that is, for any two disks 1≤i,j≤n, it holds that Ci=Cj. Otherwise, the

system is called heterogeneous (or non-uniform).

In Figure 3.1, we depict an arbitrary storage network consisting of a number nof disks

connected by some dedicated network. As further given in the picture, the disks may

feature different characteristics, like e.g. the capacity Cior the bandwidth Bistating the

average transfer time when accessing the disk by an I/O operation. For the purpose of il-

lustration, we will restrict ourselves to disk capacities only throughout this thesis because

disk utilization is more self-evident than throughput utilization.

Network

D3D4D5

C3C4

B3B4

Figure 3.1: A storage network with non-uniform characteristics (capacity, bandwidth).

Since throughout this chapter, we consider random allocations on the disks of a storage

network, S(n,C[n]) is considered as the given sample space for allocation on which we

can model an appropriate probability distribution by the following shares of the disks.

28 Random Allocation of Data Copies

Definition 3.2. (Share) Given a storage network S(n,C[n]), the share of a disk i is defined

by its relative capacity

ci=Ci

Ctotal ∈[0,1].

Thus, ∑1≤i≤nci=1. In other words, the share of each disk currently in the system is a

normalized measure of the heterogeneity in a storage network. Moreover, given the share

tuple ¯c= (c1,c2,...,cn)for the disks in S(n,C[n]), obviously, (S(n,C[n]),¯c)defines an

appropriate probability space for random allocation in which the sampling of the disks

is supposed to be done independently at random (at a first glance, the probability space

(S(n,C[n]),¯c)defines a base for our investigations).

Given the share tuple ¯c= (c1,c2,...,cn), we can now define our random allocation

problem for a storage network S(n,C[n]).

Definition 3.3. (Redundant Allocation Problem) Let S(n,C[n]) be a storage network

with a set [n]of accessible disks and capacities given by the share tuple ¯c= (c1,c2,...,cn).

Furthermore, consider a set [k]⊆U of data items and a replication parameter r ∈[n]that

holds for every d ∈[k].

Depending on r, let (f1,f2,..., fr)be an r-tuple of allocation functions fi:[k]→[n],

i∈[r], satisfying the following condition:

∀i,j∈[r],i,j,∀d∈[k]:fi(d),fj(d).

As for each d ∈[k],(f1,f2,..., fr)generates an allocation tuple (f1(d),f2(d),..., fr(d)),

we consider the set Mr

k={(f1(d),f2(d),..., fr(d))|d∈[k]}⊆[n]rof allocation tuples.

Then, let F = ( f1,f2,..., fr)be a redundant allocation function defined as

F:[k]→Mr

k,F(d) = (f1(d),f2(d),..., fr(d)) ∀d∈[k].

Since |Mr

k|=k, we obtain a total of m =k·r different allocations (balls) in the system.

Then, S(n,C[n]), the share tuple ¯c, and the random allocation function F define a redun-

dant allocation problem on the probability space (S(n,C[n]),¯c). The allocation problem

is finite as Mr

kis finite.

Again, since we always assume m≤Mballs in the system, we fix m=k·rin the

following. Additionally, as considering the balls to be uniform, we introduce the load `i

of a given disk ias the number of data blocks allocated to it.

Now, given the definition of the redundant allocation problem, we can formulate a

suitable definition for a redundant data distribution strategy as we require in the following

descriptions.

Definition 3.4. (Redundant Data Distribution Strategy) Aredundant data distribution

strategy for the redundant allocation problem defines a disjoint mapping F :[k]→Mr

kfor

any number m ≤Ctotal of blocks such that for each disk i the load `i≤Ci.

3.1.1 The Basic Model 29

To address the fairness condition on the disks in S(n,C[n]) in a formal manner, we

apply a set of binomially distributed random variables X1,X2,...,Xnsuch that for each

disk i∈[n], let Xidenote the load `iof disk i. Then, for a total of mdata blocks in the

system and any given integer k∈N, it holds: Pr[Xi=k] = bm,ci(k):=m

kck

i(1−ci)m−k

and E[Xi] = m·ciis the expected value.

With this random variables, we are now able to measure the fairness property of the

defined redundant data placement scheme (sometimes also referred to as space balance)

(c.f. Section 1.2).

Definition 3.5. (Fairness) For a redundant allocation problem, let X1,X2,...,Xnbe bino-

mially distributed random variables such that Xidenotes the load of disk i ∈[n]. Then, a

redundant data distribution strategy is fair if for all disks i ∈[n]

`i=E[Xi] = r·k·ci.

Moreover, since we require that we come close to optimal, the fairness condition above is

called tight if the expected load of disk i is in range

`i= (1±ε)·r·k·ci

w.h.p., where ε>0can be made arbitrarily small.

Recall that the fairness property ensures the utilization of the storage network, and as

each disk’s charging level corresponds to its share, a tight fair data distribution enables

the system to get filled up almost completely.

Up to this point, the given definitions enable us to characterize a redundant allocation

only for a static storage network, that is, the configuration of the storage network does

not change over time; the only dynamic aspect that comes into play with respect to such

systems is given if the number mof blocks in the system changes.

However, this is not yet sufficient as we are further interested in an appropriate formula-

tion for a dynamic setting in which a given configuration of current disks may change due

to altering space requirements. Thus, in the following, we formalize the adaptivity condi-

tion for a redundant allocation strategy, that is, we enable a storage network to increase or

decrease in the number of disks. The major algorithmic challenge coming along with this

additional property is to ensure both the fairness and redundancy condition at any time.

A natural measure for the efficiency of a distribution strategy in terms of adaptivity is

the number of blocks that have to be redistributed to ensure the fairness property in case

of changes in the configuration. This measure determines how long the storage system

works in a degraded mode and in which performance losings must be considered. In the

following, we call operations that induce a replacement of blocks change operations.

30 Random Allocation of Data Copies

As usual, we apply competitive analysis (see Section 3.1.3.5 for details) which is a well-

known method to model and evaluate this dynamic process that we can also comprehend

as an online algorithm. In short words, competitive analysis compares an online algorithm

with an optimal offline strategy. In particular, for our considerations, we would require an

optimal algorithm to ensure fairness after each change operation ωin the system implying

that, for instance, if ωis triggered by the addition of a disk, any algorithm has to replace at

least cn+1·mblocks. We call an adaptive distribution strategy c-competitive if it requires

at most ctimes the number of replacements performed by the optimal offline algorithm

for any sequence of change operations. Now, we can formulate a useful definition for an

adaptive redundant data placement strategy as we consider in the following sections.

Definition 3.6. (Adaptive Redundant Data Distribution Strategy) Consider a storage

network S(n,C[n]) and any change operation ω. We call a redundant data distribution

strategy adaptive for S(n,C[n]) if it is able to restore the fairness and redundancy property

on S(n,C[n]) after any change in the number of current disks n induced by ω.

It is c-competitive if the number of blocks necessary to restore fairness is at most c

times the number of blocks an optimal strategy has to replace, on expectation.

Obviously, to appropriately model and analyze the dynamic case observed in Sec-

tion 3.5, we suppose an extended share tuple ¯c= (c1,c2,...,cN)for N≥nand a given

storage network S(n,C[n]). Furthermore then, all disks inot currently in the network are

assigned are share ci=0. Hence, we consider an increase in the number of disks up to an

upper bound N. Then, given two share tuples ¯c= (c1,c2,...,cN)and ¯c0= (c0

1,c0

2,...,c0

N),

for any data distribution strategy to be adaptive, it must be able to handle any capacity

change from ¯cto ¯c0. Certainly, it is easy to see that every storage strategy that wants to

preserve fairness has to replace at least a

∑

i:ci>c0

(ci−c0

i) = 1

2∑

i|ci−c0

fraction of the data in the system.

Regarding all given definitions, we can, roughly speaking, formulate the redundant

allocation problem, on that we concentrate throughout this chapter as follows:

Is it always possible to find an efficient, fair, redundant (and perhaps) adaptive random

data distribution strategy for arbitrary capacities such that, at any time, best performance

provided by parallel access as well as fewest resource consumption can be guaranteed ?

Before, in the rest of this chapter, we concentrate on the description and analysis of our

novel data distribution strategies that address the given problem appropriately, either for

a static or dynamic setting, we first give some preliminaries in the next section which we

require for the modeling and the analysis of those allocation schemes.

3.1.2 On the Practical Realization of the Allocation Function F31

3.1.2 On the Practical Realization of the Allocation Function F

Obviously, finding a positive answer to the formulation of the redundant allocation prob-

lem can be directly translated into finding an appropriate redundant data distribution strat-

egy F= ( f1,f2,..., fr)for either a static or dynamic system. Thus, for the rest of this

chapter, we aim at finding a suitable translation of the theoretic definition given in the

previous section into some practical realizations of Ffor any given storage system that

can easily be implemented inside a respective application. In particular, this means that

such realization always has to ensure that

•any two functions fi,fj,i,j, in the r-tuple (f1,f2,..., fr)allocate a given data

block d∈[k]on distinct bins `,o∈[n],`,o,

•each of the functions fiin (f1,f2,..., fr)is efficiently computable, and

•for a given number m≤Ctotal of blocks, (f1,f2,..., fr)yields a fair data layout up

to some factor (1±ε)for each disk.

Clearly, by allocating data blocks in a given storage network S(n,C[n]), the function

Falso implements the dictionary functions search,insert and delete for the data blocks

on S(n,C[n]) (again, we solely concentrate our analysis on the data layout only, that is,

we do never consider the deletion of blocks). With respect to a technical implementation

of F, at a first glance, it seems to be a good idea to realize the functions fiin Fby r

different hash functions h1,h2,...,hrbecause hash functions can be efficiently computed,

and more importantly, by using hash functions, the expected complexity of the dictionary

functions can be reduced to O(1)(in Section 3.1.2.1, we go into further details on hash

functions).

However, the proposed idea imposes some severe problems. First of all, the usage of

rindependent hash functions cannot adequately ensure a perfectly disjoint allocation of

all identical copies, but what we strictly require, because all of them hash each data block

into the same range [n]of disks. Second, depending on r, the number of applied hash

functions may be too small to achieve fairness as defined in the previous section. At last,

all of those hash functions must be kept in memory, thus, if ris supposed to be large,

this implies a waste of memory, what is dismal. Hence, it is easy to see that, since all

our introduced strategies are randomized algorithms, this makes a separate allocation of

identical copies become a pretty hard challenge. Moreover importantly, the redundancy

property must always be ensured even in dynamic storage environments.

Therefore, as we show by the descriptions in this chapter, there is considerably more

work to do to realize Ffor a given storage environment. Recall that each function Fhas to

be chosen such that the elements from the set Mr

kare allocated to S(n,C[n]) in a way that,

on expectation, fairness can almost perfectly be achieved. Nevertheless, it is a good idea

32 Random Allocation of Data Copies

to consider hashing for allocation, and as we show in the following description, there are

useful universal hash functions, called k-universal hash functions for some parameter k,

that ensure a suitable distribution of the data blocks on the set of disks and which underly

all of the strategies introduced in this chapter.

We start our descriptions with introducing a basic design paradigm for randomized

online algorithms that is applied by the functions and strategies presented in the following.

After that, we briefly sketch universal hashing (because it belongs to folklore) before we

introduce k-universal hash functions. At last, we prove that for both static and dynamic

storage environments, O(logn)-universal hash functions are a good choice to obtain a

useful dispersal of blocks w.h.p.

Design Paradigm: Foiling the Adversary. To find a suitable realization for all ran-

dom allocation functions Fin this chapter, we apply the well-known design paradigm

for randomized algorithms, the method of foiling the adversary, also called the method of

avoiding the worst-case problem instance (cf e.g. [HZ05]). The paradigm considers the

design of an algorithm as a game between two players: a designer who tries to design an

efficient algorithm for a given problem, and an adversary who for any designed algorithm

tries to construct an input on which the algorithm fares poorly, that is, the output is not

efficient or incorrect. Typically, in the analysis of deterministic algorithms, the adver-

sary argument establishes a lower bound on the worst-case complexity, and, in general,

only those algorithms that run efficiently on every given input are considered to be effi-

cient. As the designer has to present his algorithm at first in the game, the adversary has

some advantage since, knowing the algorithm, he can find a hard problem instance for

the algorithm. This situation can essentially change if the designer designs a randomized

algorithm A, that is, the adversary does not know which of the possible runs of the algo-

rithm Awill be chosen at random. Therefore, the paradigm is very suitable for problems

for which

1. every deterministic algorithm (strategy) has some worst-case instances at which the

algorithm is not able to efficiently compute the correct (or optimal) output,

2. but there exists a class of deterministic algorithms such that, for every problem

instance, most algorithms of this class efficiently compute the correct result.

If (2.) holds, then for any given input instance, one can pick an algorithm from the class

at random and expect to get the correct result efficiently with reasonable probability. As

we will see, the following universal hashing approach describes a successful application

of this paradigm.

3.1.2 On the Practical Realization of the Allocation Function F33

3.1.2.1 Excursus: Universal Hashing

Hashing is a well-known method of dynamic data management. In particular, given a

data structure T= [N], called hash table, which consists of N∈Naddressable slots to

which direct access is provided2as well as some data item addressed by its key k, hashing

efficiently performs the dictionary functions search(T,k),insert(T,k) and delete(T,k). The

key kis taken from a universe Uof all possible keys that may be a finite subset of N,

or U=N. The major goal of hashing then is to perform the dictionary operations in

O(1)time, what is different to other dynamic data structures such as AVL-trees or B-trees

performing these operations in logarithmic time in the worst case.

We furthermore consider some certain application by which we are usually given a

finite input set S⊆Uof size M≈N, and the aim is to save Sin Tsuch that the directory

operations can be provided as efficient as possible on T. It is easy to see that, to achieve

the desired efficiency for the dictionary operations, the task now is to determine a mapping

h:U→T, called a hash function, on which we pose the following requirements:

•hcan be efficiently computed,

•hmaps every input set S⊆Uin a ’well-dispersed’ fashion to Tin a way that, for

all i∈[N], the cardinality of the set T(i) = {d∈S|h(d) = i}is in O|S|

|T|.

Unfortunately, the main problem arising with respect to the mapping his that we cannot

influence the choice of Sas well as we do not have any preliminary information about

S; the set Sis given completely by the application (user), thus, we are not able to predict

anything about it. Moreover, one can show that for every hash function h, there exist

"bad" sets S, all of whose elements are mapped by hto some slot iof T(for instance,

consider Sto be a subset of the set Uh,i={d∈U|h(d) = i}(see [HZ05])). On the other

hand, for each given set S, there exist useful hash functions that appropriately distribute

the keys of Samong the slots of T.

Summarizing this, we have an exemplary situation calling for applying the ’foiling

the adversary’ paradigm. For every choice of h, the adversary can construct arbitrarily

bad inputs S, but on the other hand, there are many hash functions that assure the best

possible distribution for most inputs S. Following the paradigm, the only possibility then

to distribute all (and also bad) input sets evenly among the slots of Tis to choose a hash

function huniformly at random from a suitable set Hof hash functions. In other words,

unlike assuming the input set Sbeing chosen uniformly at random from the universe U, we

rather suppose a class Hof hash functions in which these functions are evenly distributed.

Obviously, for most input sets, on average, this sampling guarantees an even distribution

of the keys in Tw.h.p. The considered ’suitable’ classes Hare widely-known as universal

classes of hash functions which are defined as follows:

2One is able to examine any slot of Tin time O(1).

34 Random Allocation of Data Copies

Definition 3.7. (Universal Hash Functions): Let H be a finite set of hash functions from

U to T (|T|=N). The set H is called universal if for each pair of elements x,y∈U, x ,y,

|{h∈H|h(x) = h(y)}|≤ |H|

holds, i.e. at most every N-th hash function from H maps x and y to the same slot of T.

That is, for any fixed pair of keys x,y∈U,x,y, it holds the following condition

PrH[h(x) = h(y)] ≤1

N.(3.1)

As a major result, it can be shown that for an arbitrary finite set S⊆Uof keys and a

universal class Hof hash functions from Uto T, the expected number of keys in any slot

of Tis smaller than 1+|S|

|T|[HZ05]. (We skip going into further detail at this point because

universal hash function are subject to any basic course on algorithms and can furthermore

be found in e.g [HZ05, CLRS01].)

As we can see, the concept of universal hash functions provides a useful mechanism

to obtain an appropriate distribution of the data items among the storage devices. It only

remains to show that such classes exist at all, and if so, to find a universal class Hof

hash functions that is suitable for our requirements. The first question can be answered

positively by the following lemma.

Lemma 3.8. Let U be a finite set. The class A ={g|g:U→T}of all functions from U

to T is universal.

We omit the proof here (but which can be found in e.g. [HZ05]). Now, we know that

classes of universal hash functions exist but, unfortunately, many of them suffer from

huge size. For instance, the number of all functions in Ais3N|U|what, obviously, makes

Aimpossible for real applications since

1. Ais too large4to be able to perform a random choice of an hfrom Aefficiently, and

2. most functions from Acan be viewed as random mappings in the sense5that they do

not have any essential shorter description than their full table representation. Such

functions are not efficiently computable.

Instead, we would like to have a universal class Hof hash functions such that

(i) His small (at most polynomial in |U|), and

3One has to assign one of the Nvalues from Tto each element of U.

4Observe that Uis already very large, thus, N|U|is usually substantially larger than the number of protons

in the known universe.

5c.f. [Hro03].

3.1.2 On the Practical Realization of the Allocation Function F35

(ii) every hash function from His efficiently computable.

The most prominent universal hash functions satisfying the formulated wishes are the

linear hash functions ha,b:U→Tdefined by

ha,b(x) = ((ax +b)mod p)mod |T|,a,b∈U,pprime.

for every x∈U. The corresponding universal class of hash functions is

lin ={ha,b|a∈[p−1]and b∈{0,1,...,p−1}}

of size |Hp

lin|=p·(p−1). Unfortunately, as we can see from the definition of ha,b,

linear hash functions guarantee the desired properties only for any two elements x,y∈U

(see [HZ05]). Instead, for our further analysis in this thesis, we require what we call

k-independence, that is, for any given set of kpairwise distinct keys from U, the hash

function guarantees that Condition 3.1 holds also (independently) for all kelements. This

demand is satisfied by the following k-universal hash functions (the following is taken

from [MadH06]; further references are [CW77, DKM+88]) (recall [M]and [N]to be the

address spaces of the balls and bins, respectively).

Definition 3.9. (k-Universal Hash Functions): Consider H ⊆A={g|g:[M]→[N]},

k∈N. H is a k-universal class of hash functions if, for all `≤k, all pairwise distinct

x1,...,x`∈[M]and all j1,..., j`∈[N], it holds that

Pr[h(xi) = ji,∀i∈[`]] ≤2

N`

To simplify our further analysis, we weakened the given bound for the probability of

the random function hthat, if considered to be an arbitrary random function from the set

A, would be 1

N`. However, for our assumptions, h∈His sufficiently random, and if we

merely consider `≤kinput values, hnearly behaves like a real random function.

In what follows, we prove that some special classes of polynomials are universal hash

functions. Let Mbe an arbitrary but fixed prime. Then, we can define the following:

Definition 3.10. Suppose d ∈N. For a tuple ¯a= (ad,...,a0)∈[M]d+1of coefficients, let

h¯a(x):= d

∑

i=0

ai·xi!mod M!mod N,for x ∈[M].

Furthermore, we define Hd

N:={h¯a:[M]→[N]|¯a∈[M]d+1}.

That is, the class Hd

Nconsists of all polynomials of degree-bound dtaken from the

finite field ZN(for every prime p, there exists a finite field Zpwith pelements (see e.g.

[HP03, Rom96]) whose images have finally to be mapped into the set [N].

36 Random Allocation of Data Copies

Theorem 3.11. The class Hd

Nhas the following properties:

1. each function h¯a∈Hd

Ncan be evaluated in time O(d), and

2. Hd

Nis (d+1)-universal.

Proof. Case (1) is obvious.

Case (2): For the tuple ¯a∈[M]d+1of coefficients, let f¯a:[M]→[M]be the polynomial

f¯a(x) = d

∑

i=0

ai·xi!mod M.

Since Mis a prime, f¯ais a polynomial of degree-bound dover the field ZM, and as we

know, in a field, each polynomial of degree-bound dis uniquely determined by d+1

distinct points x0,...,xd.

Now, consider an arbitrary but fixed `, 1 ≤`≤d+1. Let x1,...,x`∈[M]be `pairwise

distinct input values for f¯a. Furthermore, let j1,..., j`∈[N]be possible output values

’modulo N’, and let

A`:={h¯a∈Hd

N|h¯a(xi) = ji,∀i∈[`]}.

Then, it holds:

h¯a∈A`⇐⇒h¯a=f¯amod N,and f¯a(xi)∈{ji+α·N|α∈[M

N]}

| {z }

=:Bi

,∀i∈[`]

Since the f¯aare polynomials of degree-bound dover a finite field ZM, each polynomial is

uniquely determined by d+1 distinct values implying that the coefficients ¯aof f¯aare also

uniquely determined. Furthermore, since we already fixed `points of each polynomial

in the beginning by choosing the `point-values xiand ji, it remains to determine the

residual d−`+1 distinct values to determine the polynomial uniquely. Thus, for each

tuple (k1,...,k`)∈B1×...×B`with f¯a(xi) = ki, we can still determine further [M]d−`+1

distinct values because the domain of f¯ais [M]and d−`+1 points are still left to choose.

From this, it follows that the number of hash functions that map to the chosen values ji

|A`|=|B0|·...·|Bd|·[M]d−`+1≤M

N`

·[M]d−`+1.

Moreover, since |Hd

N|=Md+1holds (because each tuple ¯a∈[M]d+1defines a different

polynomial over the field ZM), if follows:

Pr[h¯a(xi) = ji,∀i∈[`]] = |A`|

Md+1≤M

N`·[M]d−`+1

Md+1≤2

N`

Thus, the class Hd

Nis (d+1)-universal.

3.1.2 On the Practical Realization of the Allocation Function F37

3.1.2.2 Analyzing the Distribution Quality of the Hash Functions

Given the above definition of k-universal hash functions, we can now analyze the distri-

bution quality of the hash functions that we apply for our strategies. In particular, for a

storage network S(n,C[n]) with arbitrary capacities, a set P⊆[M]of data blocks (note

that, for our analysis, we neglect redundancy, i.e. all data blocks are considered pairwise

distinct), and a hash function hchosen uniformly at random from an O(log n)-universal

class of hash functions H⊆A={g|g:[P]→[n]}, we can state the following theorem.

Theorem 3.12. Consider a storage network S(n,C[n]) with arbitrary capacities. Then,

for every set P of m ≥n data blocks, the hash function h provides an even distribution of

the blocks among the set [n]of bins, w.h.p

Proof. Before we start, we introduce the following well-known inequality for binomial

coefficients which we use for the analysis in Lemma 3.14.

Lemma 3.13. For k,` ∈N,0≤`≤k, it holds

k

`≤k·e

``

Proof.

k

`=k·(k−1)···(k−`+1)

`!≤k`

`!=k`

``·``

`!≤k`

``·e`,since e`=

∞

∑

i=0

i!≥``

`!.

Now, for every relative capacity value ci∈[0,1]in S(n,C[n]), fix some arbitrary bin

i∈[n]of size ciand proceed as follows.

Assume the input set Pto be of size m=αi·nlog nfor some appropriate constant

αi∈N≥1such that Pcan be partitioned into equal sized batches P={B1,B2,...,Bαi}.

To determine αi, let |Bk|=˜c·log n

cifor all k∈[αi]where ˜cis some constant which is chosen

such that ˜c

ci=nand that can easily be determined if we assume ci=1

nfor a uniform

network, and for non-uniform capacities, we consider ci=c0·1

nfor an appropriate constant

c0≥0.

Now, consider each batch Bkto be allocated independently to the nbins by applying

the hash function h. Then, we can state the following lemma concerning the distribution

of the elements of Bk.

Lemma 3.14. Let S ⊆Bk. Then, for the load `iof some bin i ∈[n], the following holds

1. Pr[`i≥c·log n]≤2e

cc·log nfor an appropriate constant c

38 Random Allocation of Data Copies

2. For all bins i ∈[n]of size ci, it holds: Pr[maxi{`i} ≥ c·log n]≤1

n`for some

arbitrary constant `∈N≥1and an appropriate constant c

3. For all bins i ∈[n]of size ci, it holds: E[maxi{`i}]∈O(log n)

Proof. Case (1)

Pr[`i≥c·log n] = Pr[∃A⊆S,|A|=c·log n,∀d∈A:h(d) = i]

=Pr



[

A⊆S

|A|=c·log n

h(d) = i∀d∈A





(∗)

≤∑

A⊆S

|A|=c·log n

Pr[h(d) = i∀d∈A]

(∗∗)

≤nlog n

c·log n·2

nc·log n

(∗∗∗)

≤nlog n·e

c·log n·2

nc·log n

=2e

cc·log n

The estimate (∗)holds by Boole’s inequality (see Section 3.1.3.1), (∗∗)holds since His

an O(log n)-universal class of hash functions, and (∗∗∗)holds according to Lemma 3.13.

Case (2)

Now, we show that for all bins i∈[n]of size ci, the result from Case (1) holds w.h.p.

That is, instead of focusing on a single bin only, we now consider all bins of size ciand

show that there is no bin having a higher load w.h.p. We denote by maxi{`i}the maximum

load of the bins we concentrate on. Furthermore, let Aidenote the event that the load of

any bin iis `i≥c·log n. Then,

Pr[max

i{`i}≥c·log n] = Pr[∃i∈[n]:`i≥c·log n]

=Pr[A1∪A2∪...∪An]

(∗)

≤

∑

i=1

Pr[Ai] = n·2e

cc·log n

Again, estimate (∗)holds by Boole’s inequality. Now, let c=4e. Then, we obtain the

following.

n·2e

cc·log n

≤n·1

2c·log n

=n·1

nc

≤1

n`

for all `≤c−1.

3.1.3 Further Preliminaries 39

Case (3)

Given the results above, we can now determine the expected maximum load of the bins

i∈[n]of size ci.

E[maxi{`i}]≤Pr[max

i{`i}<c·log n]·c·log n+Pr[max

i{`i}≥c·log n]·nlog n

≤1−1

n`·c·log n+nlog n

≤c·log n+nlog n

n`≤(c+1)·log n∈O(log n).

Since the results from Lemma 3.14 hold for any partition P=αi·nlog nfor some

constant αithat is determined by some capacity ci, the theorem follows.

3.1.3 Further Preliminaries

In the following, we sketch some useful mathematical tools taken from probability theory

and combinatorial optimization which help to estimate appropriate bounds on the devia-

tion of any disk’s load from its expectation.

3.1.3.1 Some Facts from Probability Theory

For our analysis, we use basic results and definitions from fundamental probability theory,

like e.g. conditional probability, random variables or expectation. In the following, we

introduce some tools which are useful for the analysis of our strategies.

Deviation Bounds. To bound the deviation of the load on a given disk from its expecta-

tion, we apply the following version of the prominent Chernoff Bounds found in [HR90]

that give the tail estimates based on expectation for sums of independent binary random

variables (we omit the proofs here that are given in detail in e.g. in [MR95]).

Lemma 3.15. (Chernoff Bound) Consider any set of n independent binary random vari-

ables X1,...,Xn. Let X =∑n

i=1Xiand µ=E[X]. Then, it holds for all δ≥0that

Pr[X≥(1+δ)µ]≤e-min[δ2,δ]·µ

and for all 0≤δ≤1that

Pr[X≤(1−δ)µ]≤e-δ2·µ

40 Random Allocation of Data Copies

An Independent Bounded Difference Inequality. The following theorem presents tail

estimates that are based on bounds on the difference of functions over independent ran-

dom variables. The given results can be found in [McD89, Sch00, Ros06].

Theorem 3.16. Let X= (X1,...,Xn)be a vector of independent random variables with

Xitaking values in a set Aifor each i. Suppose that the real-valued function f defined on

A1×...×Ansatisfies

|f(x)−f(x0)|≤ci

whenever the vectors xand x0differ only in the ith coordinate. Let µbe the expected value

of f (x). Then, for any d ≥0,

Pr[f(X)−µ≥d]≤e−2d2/∑n

i=1c2

and

Pr[f(X)−µ≤−d]≤e−2d2/∑n

i=1c2

The Inclusion-Exclusion-Principle on Events. Consider any arbitrary collection of n

sets A1,...,An. Then, the inclusion-exclusion principle is given as



[

i=1

Ai

∑

k=1

(−1)k+1∑

i1<i2<...<ik

j=1

Ai j

If then A1,...,Anis considered as a collection of events, the inclusion-exclusion principle

implies

Pr"n

[

i=1

Ai#=

∑

k=1

(−1)k+1∑

i1<i2<...<ik"k

j=1

Ai j#.(3.2)

In cases where it is too difficult to evaluate Equation 3.2 exactly, one can use Boole’s

inequalities to find suitable approximations:

Pr"n

[

i=1

Ai#≤

∑

i=1

Pr[Ai]

and

Pr"n

[

i=1

Ai#≥

∑

i=1

Pr[Ai]−∑

1≤i<j≤n

Pr[Ai∩Aj].

3.1.3 Further Preliminaries 41

3.1.3.2 Some Basics from Linear Programming and Farkas’ Lemma

Linear programming involves the optimization of a linear function under linear inequality

constraints. Since linear programming covers a wide field in mathematics, here, we give

only some brief definitions which are useful for the understanding of Farkas’ Lemma

[Roc70, p.200] (for more details on linear programming, see e.g. [PS98, Ros00])

Definition 3.17. (Basics Facts:) Alinear programming (LP) problem is an optimization

problem that can be written

maximize: cx

subject to: Ax ≥b(3.3)

where A = (ai j)i=1,...,q,j=1,...,nis a given q ×n matrix, c ∈Rnis a given row vector of

length n, and b ∈Rqis a given column vector of length q (note that in (3.3), the ’≥’-

relation is defined on vectors, i.e. for each component in the vectors). Each of the q

inequalities in Ax ≥b is a constraint, and the decision variables of the Problem 3.3 are

represented by the column vector x of length n.

Afeasible solution is a vector x satisfying Ax ≥b. The feasible region is the subset

of all feasible solutions in Rn. If no feasible solution exists (so that the feasible region is

empty), the LP problem is infeasible; otherwise it is feasible.

For a feasible solution x, the function z =cx is the objective function and cx the ob-

jective value of x. If the feasible solution is also the maximum, the feasible solution x∗is

an optimal solution. If the optimal value can be made arbitrarily large over the feasible

region, the LP problem is unbounded. Otherwise it is bounded.

Since every LP problem, referred to as a primal problem, can be converted into a

dual problem providing an upper bound to the optimal value of the primal problem, and

additionally, every feasible solution for an LP gives a bound on the optimal value of the

objective function of its dual. The weak duality theorem states that the objective function

value of the dual at any feasible solution is always greater than or equal to the objective

function value of the primal at any feasible solution. The strong duality theorem states

that if the primal has an optimal solution, x∗, then the dual also has an optimal solution,

y∗, such that cTx∗=bTy∗.

The duality theorems given above base on the following Farkas’ Lemma which is an

example of a theorem of the alternative; a theorem stating that of two systems, one or

the other has a solution, but not both. Thus, it can be applied to show the existence (and

uniqueness) of solutions to linear models and stationary distributions in finite Markov

chains.

42 Random Allocation of Data Copies

Lemma 3.18. (Farkas’ Lemma) Let A be a real-valued q ×n matrix and x and b are

vectors. Then, the system:

Ax =b,x≥0

has no solution if and only if the system

ATy≥0,bTy<0

has a solution, where y is a vector of length q.

3.1.3.3 Hypergraphs

The analysis in the next section depends on structures described by hypergraphs for which

we give the major properties we require by the following definition.

Definition 3.19. (Hypergraphs) Let V be a finite set of nodes. Further, let E ⊆Si≥2Vi

be a finite subset of the set of tuples over V. The set E is called the set of edges. An

element e = (v1,...,vj)∈E is said to be self-loop-free if vi,vi0for all 1≤i≤i0≤j. If

all elements are self-loop-free, the pair G = (V,E)is called a hypergraph. A hypergraph

G= (v,E)is d-uniform if E ⊆Vd. An edge e ∈E has size dif e ∈Vd.

Let e = (v1,...,vj)∈E be an edge of a hypergraph G and let v ∈V be a node of G.

The node v is said to be incident to e if there is an 1≤i≤j with vi=v. An edge e is

incident to a node v if v is incident to e. If a node v is incident to a number i of edges, the

node v has degree i (deg(v) = i). If v ,v0are nodes incident to the same edge e ∈E, the

nodes v and v0are said to be adjacent.

For j ≥1, a sequence of nodes v1,...,vjis called a path connecting v1and vjif viis

adjacent to vi+1for each 1≤i≤j−1. The length of a path v1,...,vjis j −1. Two nodes

v and w are said to have distance i (dist(v,w) = i) if the shortest path connecting v and

w has length i. If W ⊆V is a subset of the nodes, the distance between a node v ∈V

and W is defined by minw∈Wdist(v,w). Two paths v1,...,viand w1,...,wjare distinct if

{v1,...,vi}∩{w1,...,wj}=/0.

3.1.3.4 Multisets

The following definition of multisets is well-known from set theory.

Definition 3.20. Amultiset M(E)is a pair (E,#E)where E is some underlying set and

#E:E→N0is a function specifying the multiplicity of the elements in E. For each e ∈E,

let µE(e):=#E(e)denote its multiplicity. Thus, µE(e) = 0for all e <E. Furthermore, the

cardinality of a multiset |M(E)|=∑e∈EµE(e).

Subtraction on multisets is defined as M(E0) = M(E)\e, thus, µE0(e):=µE(e)−1if

µE(e)>0and µE0(e):=µE(e) = 0otherwise. The union of multisets is defined analo-

gously, i.e. µE0(e):=µE(e)+1.

3.2. The Redundant Allocation Problem 43

3.1.3.5 Competitive Analysis

In part two of this chapter, we analyze the introduced strategies by applying competitive

analysis which is a well-known method of evaluating online algorithms (see [BEY98]).

With respect to our dynamic data placement strategies, we apply competitive analysis

to evaluate their adaptivity. That is, for any change operation ω, we compare the number

of (re)-placements of data blocks performed by the observed strategy with the number

of (re)-placements performed by an optimal strategy such that all required conditions are

satisfied. To formalize our intension, we introduce the following definition.

Definition 3.21. (Competitiveness) Let S(n,C[n]) be a storage network, and let k be the

identifier of some storage device. Then, let L ={insert(S(n,C[n]),k),delete(S(n,C[n]),k)}

be the set of operations on S(n,C[n]) that cause changes in the number of connected stor-

age devices. Furthermore, let A denote a dynamic data placement strategy on S(n,C[n]).

Then, for any operation ω∈L, the competitive ratio compA(ω)of Aon ωis the number

compA(ω) = max(COpt(ω)

CA(ω),CA(ω)

COpt(ω)),

where COpt(ω)denotes the cost of an optimal algorithm for the instance ω, and CA(ω)

denotes the cost of the online strategy A.

Let c ≥1. Then, the storage strategy A is called c-competitive concerning operation ω

if, on expectation, compA(ω)≤c for all ω∈L.

Note that randomized algorithms must fulfill this bound with high probability, that is,

the number of redistributed data blocks implied by an operation ωshould be at most

`·(c·COpt(ω)) with probability at least 1−1

n`.

3.2 The Redundant Allocation Problem

Given all the formal definitions and required mathematical tools, we can now refine the

redundant allocation problem with respect to a rather probabilistic based formulation.

In particular, the following central problem of allocation describes the major allocation

problem underlying the design and analysis of all schemes we discuss in this chapter. In

other words, all algorithms we introduce in the following, whether adaptive or not, basi-

cally aim at finding a suitable solution for this allocation problem.

Central Problem of Allocation: Given a redundant allocation problem for a given

storage network S(n,C[n]), find an appropriate probability distribution P on the set

[n]of disks in either a static or dynamic setting that satisfies the redundancy and

fairness conditions for any number m ≤Ctotal of data blocks.

44 Random Allocation of Data Copies

Clearly, if we ignore redundancy, we can simply set P=¯cbecause then, all blocks are

allocated independently at random according to P. For such settings, there exist numerous

strategies ensuring a fair allocation for static as well as for dynamic storage systems (note

that we omit dynamic systems in this section but return back to them in Section 3.5).

For a static setting, a fair data layout can easily be achieved by applying standard balls-

into-bins models. Again, given a storage network S(n,C[n]) and P=¯c, a balls-into-bins

allocation stores each data item (ball) d∈[k]independently at random to some bin i∈[n]

with probability ci. For non-uniform capacities, this yields a fair data layout up to some

factor (1±ε), whereas in a uniform setting, we can come close to optimal if one out of

d≥2 choices for each allocation is allowed (we return to standard balls-into-bins games

a little more precisely in Section 3.3.2).

However, whenever regarding redundancy, the observed problem becomes more chal-

lenging because, as we know from the realization of the function Fin Section 3.1.2, the

redundancy condition imposes high carefulness on the allocation functions, and more im-

portantly, for any ridentical copies, r−1 allocations are no longer independent. This

implies a process in which for each of the functions f1,f2,..., frin F, respectively for a

given sequence of ridentical copies, we always have to find a sequence P= (P1,P2,...,Pr)

of probability distributions, where Piis valid for all allocations via function fionly.

Therefore, in what follows in this section, we concentrate on the investigation of appro-

priate sequences P= (P1,P2,...,Pr)subject to a fair data allocation. At first, we describe

a very intuitive balls-into-bins-like algorithm which aims at ensuring a redundant alloca-

tion on the share tuples of disks that remain valid in some particular allocation step (note

that for a data item d∈[k]and any step i∈[r], the random allocation of the i-th copy of d

to some disk corresponds to the allocation fi(d)of F).

Algorithm 1 Naive Balls-into-Bins Allocation

Require: Probability Space (S(n,C[n]),¯c), Set [k]of balls, r-tuple (f1,f2,..., fr)

Ensure: Redundant Allocation of m=k·rreplicated balls

1: for all d∈[k]do

2: To hold the r-th copy of d, choose disk fr(d) = ir∈[n]with probability cir

3: repeat

4: Set [n0]←([n]\{ir}){Reduce sample space by chosen disk ir}

5: Set r0←(r−1){Reduce current number of copy}

6: for all i∈[n0]do

7: Set c0

i←(Ci/∑j∈[n0]Cj){(Re-)assign disk probabilities (relative capacities)}

8: To hold the r0-th copy of d, choose disk fr0(d) = ir0∈[n0]with probability c0

ir0

9: until (r0=0)

Obviously, this approach is quite self-evident for obtaining a fair allocation under re-

dundancy conditions because it is simple, features low complexity and is closely related to

3.2. The Redundant Allocation Problem 45

the well-analyzed standard balls-into-bins model for independent allocations. However,

as we show in the following, the given algorithm is only useful for uniform capacities

because for heterogeneous settings, a naive balls-into-bins scheme always induces an im-

balance on the data layout such that fairness can never be obtained. Since, generally,

balls-into-bins schemes feature high attraction to many designers of data placement algo-

rithms, it is worth investigating the occurring imbalance in more detail. We start by first

examining a homogeneous setting for Algorithm 1.

Theorem 3.22. Let S(n,C[n]) be a uniform storage network with relative capacities ci=1

for all disks i ∈[n]. Furthermore, let [k]be a set of data items and r ∈[n]the replication

parameter for each d ∈[k]. Then, for any number m ≤Ctotal (m =k·r) of replicated

blocks, Algorithm 1 realizes a balanced (i.e. equal) load on the set [n]of disks.

Proof. In fact, this process can be seen as (the first rsteps of) sampling a permutation

π:[n]→[n]uniformly at random: first choose π(1)∈[n](with probability 1

n), then

π(2)∈[n]0= [n]\{π(1)}and so on. Since for each i∈[r],π(i)is chosen uniformly

from [n]\{π(1),...,π(i−1)}, that is, with probability 1

n−i+1independent of the values

π(1),...,π(i−1), induction on rshows that each r-tuple π(1),...,π(r)of pairwise dis-

tinct numbers between 1 and nappears with equal probability n−r!

n!. Proceeding from an

r-tuple π(1),...,π(r)to the r-element set {π(1),...,π(r)}constitutes a mapping with

exactly r! fibers of equal probability n−r!

n!each: every r-element subset therefore arises

with probability 1/n

r.

Observation 3.23. Given a uniform storage network S(n,C[n]), Theorem 3.22 constitutes

an efficient algorithm for sampling uniformly at random an r-element subset of the set [n]

of disks. For the subsequent descriptions, let us interpret Algorithm 1 differently: con-

sider the uniform sampling of a hyperedge E from the complete r-uniform n−1

r−1-regular

hypergraph [n]

r:={E⊆[n]| |E|=r}. Notice that this random hyperedge E contains

any fixed disk i ∈[n]with probability pi=r

n(because we have n different elements, each

of which chosen with probability 1

nand r possible entries for each of the elements in [n]).

So far the homogeneous case. Now consider Algorithm 1 for the inhomogeneous case.

For this, in the following, we restrict ourselves to random processes which are online

and memory-less in the sense that they distribute (the rcopies of) each given data block

independently from the previous ones. Thus, balanced load then means that the rdisks

which are drawn for storing the copies of some current data block should be sampled from

the set [n]in a way such that disk igets chosen with probability pi:=∑E∈([n]

r):i∈EPr[E],

proportional to its relative capacity ci=Ci/∑jCj.

However, as the following example shows, this assignment (and thus Algorithm 1) fails

for non-uniform capacities.

46 Random Allocation of Data Copies

Example 3.24. For simplicity, in the following two examples, we consider the size of the

disks due to the normalized data items of size 1.

a) Consider the case n =3, r =2, and disks 1,2,3with capacities C1=C2=1, C3=2.

In order to fill these disks in a balanced way, disk 3has to receive twice as many

copies as both disk 1and 2. That is, to obtain fairness (and thus, full utilization),

disk tuples {1,3}and {2,3}must be chosen as where to place copies with proba-

bility 1

2each; whereas tuple {1,2}must not occur. In contrast, Algorithm 1 chooses

disk 1with probability c1=1

4to hold the first copy; and then disk 2with probability

2=1

3yielding a positive chance for the tuple {1,2}to occur.

b) Even worse in case n =3, r =2, and C1=C2=1, C3=3, there is no way to achieve

a balanced distribution at all, regardless of which algorithm one employs !

According to the given translation on hypergraphs, we can now refine our examinations

on finding a suitable probability distribution Pfor the disjoint storage of replicated balls.

We start with posing the following central question.

Question 3.25. Consider the replication parameter r and fixed integers p1,...,pn∈[0,1].

a) Does there exist a probability distribution P :[n]

r→[0,1]such that6

∀i∈[n]:pi=∑

E∈([n]

r):i∈E

P(E)? (3.4)

b) Is such a distribution uniquely determined by Equation 3.4 ?

c) How can one efficiently algorithmically determine this distribution P from the tuple

¯p= (p1,...,pn)?

d) How can one efficiently algorithmically sample E ∈[n]

raccording to P for either

static or dynamic systems ?

For the rest of this chapter, we concentrate on answering the questions above. In partic-

ular, in the following section, we examine the properties of a possibly existing distribution

Pin more detail. Section 3.5 then is dedicated to the latter two questions above.

3.3 Analysis of the Probability Distribution P

Based on the above observations, we now analyze the properties of Pin terms of existence,

uniqueness, and of algorithmic tractability. Moreover, we show that using a balls-into-

bins scheme for allocation as realized by Algorithm 1 always yields an imbalance on the

resulting data layout whenever replicated balls are to be stored on heterogeneous disks.

6To emphasize the probability distribution P, in this section, we use P(E)rather than Pr[E]to denote the

probability of some event E.

3.3.1 Existence and Uniqueness 47

3.3.1 Existence and Uniqueness

Regarding Question 3.25(a), a simple double-counting gives rise to the following neces-

sary (pre-)condition:

Lemma 3.26. Consider a storage network S(n,C[n]) and fixed replication parameter r.

Then, P :[n]

r→[0,1]denotes a probability distribution on the set [n]of disks such that

pi:=∑E∈([n]

r):i∈EP(E)satisfies ∑n

i=1pi=r.

Proof.

∑

i=1

pi=∑

(i,E)∈[n]×([n]

r):i∈E

P(E) = ∑

E∈([n]

∑

i∈E

P(E)

| {z }

=P(E)·|E|

=r·∑

E∈([n]

P(E) = r(3.5)

Note that subject to Definition 3.5, we require pi=r·cilater for our allocations. Then,

to achieve a fair and redundant data distribution at all, Lemma 3.26 implies that for the

relative capacities ciof all disks i∈[n], it must hold that ci≤1

r(c.f. Lemma 3.29).

3.3.1.1 Existence

Given the result from Lemma 3.26, we can now focus on the existence of the probability

distribution Pregarding the above condition.

Theorem 3.27. Let p1,...,pn∈[0,1]such that ∑ipi=r. Then, there exists a probability

distribution P :[n]

r→[0,1]with pi=∑E∈([n]

r):i∈EP(E)for each i ∈[n].

Proof. Consider the incidence matrix Aof the r-uniform n−1

r−1-regular hypergraph [n]

r,

that is, A= (ai,E)(i,E)∈[n]×([n]

r)where ai,E=1 if i∈Eand ai,E=0 otherwise. We are thus

looking for a vector

x=P(E)E∈([n]

r)∈[0,∞)([n]

r)s.t. A·x=∑

E∈([n]

r):i∈E

xEi∈[n]

= (pi)i∈[n].(3.6)

Let ATdenote the transposed matrix of A. Then, we can observe that for vector y∈

Rn,AT·yis a vector of size n

rwith E-th component equal to ∑i∈Eyiwhere E∈[n]

r.

Now, AT·y≥0 is infeasible for ∑iyipi<0 according to the claim below; hence by

Farkas’ Lemma ([Roc70]), we obtain that Equation 3.6 is feasible. Moreover, reading

Equation 3.5 reversely reveals such a solution xto satisfy ∑E∈([n]

r)xE=1, i.e. to yield a

probability distribution P.

48 Random Allocation of Data Copies

Lemma 3.28. Let y1,y2,...,yn∈Rand p1,...,pn∈[0,1]with ∑ipi=r∈N. Suppose

that ∑i∈Eyi≥0for each E ∈[n]

r. Then, it holds that ∑n

i=1yipi≥0.

Proof. W.l.o.g. let y1≤y2≤···≤yn. Then, ∑n

i=1yipibecomes minimal (subject to the

constraints 0 ≤p1,...,pn≤1 and ∑ipi=r) for 1 =p1=···=prand 0 =pr+1=···=pn.

But that minimum equals ∑i∈Eyi≥0 for E={1,...,r}.

So far, we have shown that there exists a probability distribution Pas asked for in

Question 3.25(a). Now, we tackle on the uniqueness of such distribution, or in other

words, are there more than one possibilities to assign probabilities to the disks in the set

[n]in order to obtain the desired fair and redundant allocation ?

3.3.1.2 Uniqueness

Uniqueness of Pwould mean that ¯p= (p1,...,pn)determines P:[n]

r→[0,1]subject to

Condition 3.4. However, this fails already in the uniform case on [4]

2, that is, with pi=1

for all i∈[4]: In addition to the straight-forward solution P=1

6, also the following one is

easily verified to satisfy Equation 3.4.

P({1,2}) =1

6−1

11,P({1,3}) =1

6−1

7,P({1,4}) =1

6+1

11 +1

P({2,3}) =1

6+1

11 +1

7,P({2,4}) =1

6−1

7,P({3,4}) =1

6−1

11 .

The given example proves that Equation 3.4 fails to uniquely determine a probability

distribution Pfor our redundant allocations. Thus, consequently, there is not only a single

chance to find the desired data layout.

In particular, the most intuitive way of determining a correct Pobviously would be

to turn the proof of Theorem 3.27 into an algorithm as follows: given a set [n]of bins

with relative capacities c1,c2,...,cn∈[0,1], we define pi:=r·cisatisfying ∑ipi=r

(Lemma 3.26). Then, one could use his favorite linear optimizer to solve Equation 3.6.

Since linear programming is in the complexity class P (see e.g. [GLS93]), this yields a

valid probability distribution P:[n]

r→[0,1]as desired using running time polynomial

in [n]

r=Θn

rr.

However, merely storing Pcompletely may already take exponential space (for in-

stance, for r=n

2). Thus, obviously, one would be better off just sampling some hyper-

graph edge E∈[n]

raccording to the distribution Pas this can be performed without

actually calculating Pentirely.

Therefore, with respect to the very efficient and successful allocations gained by stan-

dard balls-into-bins games for the non-redundant case, it seems reasonable to apply Algo-

rithm 1 also for redundant settings. Some further motivating fact is that in Theorem 3.22,

we already proved that for a uniform configuration with c1=c2=... =cn=1

n, the de-

picted Algorithm 1 provides the desired data layout.

3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 49

3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations

Before we start comparing the results and properties of Algorithm 1 with a standard balls-

into-bins model, we first give some brief overview about the most important results on

allocating data blocks by a balls-into-bins approach. Notice that since there are no useful

results considering redundant placement, we only refer to independent allocations.

3.3.2.1 Previous Results on Balls-into-Bins

In a standard balls-into-bins model, one sequentially allocates a set of mindependent

balls at random to a set of nbins such that the maximum number of balls in any bin is

minimized. This paradigm has become very popular because it is simple, quite general,

and can be adopted for a variety of applications. To our knowledge, the first use was

given by Karp et al. [KLM92] who deal with PRAM simulations on Distributed Memory

Machines (DMMs). They realize the balls-into-bins paradigm by universal hash functions

that map the PRAM memory cells to the memory modules of the DMM in a way that an

even dispersal of the cells over the DMM modules can be obtained such that, for any

access request to some PRAM cell, an efficient response time can be ensured.

Depending on the number of used hash functions, the mapping of the cells to the mem-

ory modules is either realizing a so-called single-choice or multiple-choice approach.

The classical single-choice approach allocates each ball to a bin chosen independently

and uniformly at random. For m=nballs, the maximum load in any bin is Θlogn

log log n.

More generally, for m≥nlog nballs, it is well-known that there exists a bin receiving

n+Θqmlog n

nballs (see e.g. [RS98]). This result holds not only on expectation but

with high probability. Let the maximum height above average denote the difference be-

tween the number of balls in the fullest bin and the average number of balls per bin. Then,

the maximum height above average of the single-choice approach is Θqmlog n

n, and it

is easy to see that the deviation between the randomized single-choice allocation and the

optimal allocation increases with the number of balls. Importantly, this cannot be done

better for this kind of game, i.e. for single-choice processes.

Nevertheless, the maximum load can be decreased dramatically by allowing every ball

to choose one out of d≥2 bins independently and uniformly at random. The classical

results for this multiple-choice game are given by Azar et al. [ABKU00] who, following

the results from the PRAM simulations, introduced the Greedy[d] protocol for allocating

nballs to nbins. In each step, Greedy[d] chooses d≥2 bins for each ball and allo-

cates the ball into a bin with minimum load. Then, it can be shown that the maximum

height above average obtained by Greedy[d] is only log log n

log d+Θm

n, w.h.p. which is an

exponential decrease compared to single-choice games. Moreover, Vöcking [Voe99] in-

troduced the Always-Go-Left protocol yielding a maximum height above average of only

50 Random Allocation of Data Copies

log log n

log d∗+Θm

nwith d∗≈dlog 2. Since all these results hold only for the case m≈n, in

[BCSV00], the authors analyze Greedy[d] for the heavily loaded case, i.e. mn(what

is, clearly, a more suitable investigation when considering the allocation of data blocks

to storage devices). They show that the maximum load is m

n+log log n

log d, w.h.p. This re-

sult yields a fundamental difference between the allocation obtained by multiple- and the

single-choice approach; while the single-choice allocation deviates more and more from

the optimal allocation with an increasing number of balls, the deviation gained by the

multiple-choice allocation is independent from the number of balls.

The above results show that with a multiple-choice allocation, on one hand, a near-

optimal load balancing can be obtained while, on the other hand, the analysis of these

allocations is challenging even for uniform capacities and pairwise independent balls.

With respect to heterogeneous bins, Wieder [Wie07] introduced the analysis of a multiple-

choice allocation obtaining a tight upper bound of m

n+log log n

log(1+ε)+O(1)w.h.p. using a

modification of the Greedy[d] algorithm called Greedy[U,1+ε], where Udenotes the

uniform probability distribution on nbins. Greedy[U,1+ε] inserts mballs sequentially,

and in each round, the probability of a ball to be inserted into one of the i-th most heavy

bins is exactly (i

n)1+ε. However, two main drawbacks come along with this approach.

First, Greedy[U,1+ε] is not of practical usage, and second, the analysis is derived from

allocations in peer-to-peer networks, that is, the author assumes homogeneous peers (all

of same size) to which an imbalanced probability distribution is wrongfully assigned (the

imbalance is caused by a previous allocation process, like e.g. CHORD [SMK+01a], such

that an appropriate uniform distribution on the peers is biased). Obviously, this scenario

does not reflect the assumptions and demands of the system we observe here.

3.3.2.2 Analysis of the Imbalance

In Lemma 3.22, we showed that symmetry considerations significantly simplified the

analysis of Algorithm 1 in a uniform system because in such settings, the dependencies

between any two choices of disks are always the same. This implies that every r-tuple

π(1),...,π(r)was equally likely to be chosen for allocation, which in turn implies that

also all their r! fibers {π(1),...,π(r)have equal probability. In contrast, as we show in

this section, this behaves significantly different if any non-uniform capacity distribution

is considered. In particular, Theorem 3.30 shows that the referred algorithm, realizing the

redundant type of a standard balls-into-bins game, fails when being applied in heteroge-

neous storage environments for which an optimal disk utilization is required.

As usual, we consider a given storage network S(n,C[n]), and since all of our strategies

work on values defined by the share tuples ¯cfor the disks [n], before we start, we first state

an important precondition on the capacity of each disk which is necessary for obtaining a

fair and redundant data layout at all.

3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 51

Lemma 3.29. A fair and redundant distribution is feasible if and only if ci≤1

rfor all

bins i ∈[n].

Proof. Consider the random variables from Definition 3.5. Now, assume some bin jwith

relative capacity cj>1

r. Then, for the variable Xj, it holds that E[Xj] = r·k·cj>r·k·1

thus, bin jcontains at least two identical copies violating the redundancy condition.

However, even by supposing that all disks ihave relative capacity ci≤1

r, the following

theorem shows that Lemma 3.29 states only a necessary condition on the allocation.

Theorem 3.30. Consider a storage network S(n,C[n]) and a share tuple ¯c with ci≤1

rfor

all i ∈[n]. If then there exists at least any two indices i,j∈[n]with ci,cj, Algorithm 1

never provides a fair allocation on the set [n]of disks.

Proof. Consider the share tuple ¯c={c1,c2,...,cn}and the replication parameter r. More-

over, for non-uniformity and w.l.o.g., we assume c1<cn. Then, we start by defining the

values pi:=r·ci, and since ∑n

i=1ci=1, it holds that ∑n

i=1pi=rsatisfying the necessary

conditions for balanced allocations in Lemma 3.26. For what follows, we further assume

w.l.o.g. that p1≤p2≤... ≤pn, and we define the following extracting function

Ξ:= (ξ1,ξ2,...,ξr):[n]

r→[n]r

covering rcomponent functions ξ1,ξ2,...,ξrwith

ξi:[n]

r→[n]s.t. the condition

[

i=1{ξi(E)}=E∀E∈[n]

r

on the considered hypergraph which, given some edge E∈[n]

r, extracts an r-tuple by the

component functions. More precisely, each function ξiextracts an element from E, and

according to its position iin Ξ,ξithus determines the i-th element of the resulting r-tuple.

Moreover, as due to the referred condition all extracted integers are pairwise distinct, the

functions define an order on the elements in E.

Now, consider the symmetric group Sron the set [r]. Then, for all permutations π∈Sr

and any edge E∈[n]

r, we define

Ξπ(E):= (ξπ(1)(E),ξπ(2)(E),...,ξπ(r)(E)).

W.l.o.g. let ξr(E) = nfor some edge E∈[n]

r:n∈E. Then, for the rest of this proof, we

aim at stating that

pn>∑

E∈([n]

r):n∈E

P(E)(3.7)

violating Theorem 3.27 and thus showing that fairness can not be achieved.

52 Random Allocation of Data Copies

Since we are focusing on a balls-into-bins game, for each r-tuple of identical copies, we

consider a sequential algorithm to place the rcopies to the set [n]of disks. Thus, for any

r-tuple and any bin i∈[n], let Et

idenote the event that bin iis drawn for storing the t-th

copy, 1 ≤t≤r. Moreover, according to the redundancy condition, only the first copy of

each tuple is allocated independently from any other placement. Hence, for redundantly

allocating any number `≤rof identical copies to bins i1,i2,...,i`, let the conditional

probabilities qi1,i2,...,i`:=Pr[E1

i1∩E2

i2∩...∩E`

i`]be defined as

qi1,i2,...,i`:=ci1·ci2

1−ci1·...·ci`

1−ci1−ci2...−ci`−1

(3.8)

Now, consider the allocation of `+1 identical copies. Then, for the probability qi1,i2,...,i`,

we can state the following lemma.

Lemma 3.31.

qi1,i2,...,i`=∑

i`+1∈[n]

i`+1,i1,i2,...,i`

qi1,i2,...,i`,i`+1

Proof. Since ∑n

i=1ci=1, we can state:

∑

i`+1∈[n]

i`+1,i1,i2,...,i`

qi1,i2,...,i`,i`+1

=ci1·ci2

1−ci1·...·ci`

1−ci1−ci2...−ci`−1·

∑i`+1∈[n]

i`+1,i1,i2,...,i`

ci`+1

1−ci1−ci2−...−ci`−1−ci`

=ci1·ci2

1−ci1·...·ci`

1−ci1−ci2...−ci`−1·1−ci1−ci2−...−ci`−1−ci`

1−ci1−ci2−...−ci`−1−ci`

=qi1,i2,...,i`

Now, returning to Inequality 3.7, we obtain

∑

E∈([n]

r):n∈E

P(E) = ∑

E∈([n]

r):n∈E

∑

π∈Sr

P[(ξπ(1)(E),ξπ(2)(E),...,ξπ(r)(E))].

Due to Definition 3.8 above, let qπ(1),π(2),...,π(r)(E):=qξπ(1)(E),ξπ(2)(E),...,ξπ(r)(E). Then,

∑

E∈([n]

r):n∈E

∑

π∈Sr

P[(ξπ(1)(E),...,ξπ(r)(E))] = ∑

E∈([n]

r):n∈E

∑

π∈Sr

qπ(1),...,π(r)(E)

3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 53

Now, for all E∈[n]

r:n∈Eand indices i∈[r], let us consider the cases with bin nat

position ξπ(i). Then,

∑

E∈([n]

r):n∈E

∑

π∈Sr

qπ(1),...,π(r)(E)

=∑

E∈([n]

r):n∈E





∑

π∈Sr

ξπ(1)(E)=n

qπ(1),π(2),...,π(r)(E)+...+∑

π∈Sr

ξπ(r)(E)=n

qπ(1),π(2),...,π(r)(E)





=∑

E∈([n]

r):n∈E

∑

π∈Sr

ξπ(1)(E)=n

qπ(1),π(2),...,π(r)(E)+...+∑

E∈([n]

r):n∈E

∑

π∈Sr

ξπ(r)(E)=n

qπ(1),π(2),...,π(r)(E)

Before we continue our descriptions, we first take a closer look at the summands given

in the above summation. In particular, for some index k, 1 ≤k≤r, let us fix the case for

which ξπ(k)=n. Additionally, for the sake of convenience, we define the following sets

of tuples of indices

`:={(i`,i`+1,...,iu)∈[n]u−`+1|ia,ib;∀l≤a,b≤u:a,b}(3.9)

Then,

∑

E∈([n]

r):n∈E

∑

π∈Sr

ξπ(k)(E)=n

qπ(1),π(2),...,π(r)(E)

=∑

(i1,i2,...,ir)∈Ir

ik=n

qi1,i2,...,ir

=∑

(i1,i2,...,ik−1)∈Ik−1

∑

(ik+1,...,ir)∈Ir

k+1

qi1,i2,...,ik−1,n,ik+1,...,ir

=∑

(i1,i2,...,ik−1)∈Ik−1

qi1,i2,...,ik−1,n

which follows from Lemma 3.31 by induction. More precisely, since we concentrate on

the allocation on bin n, we can simply ignore the last r−ksteps of the r-fold placement.

Now, we continue our description by using the last result. Thus, in total, we obtain

∑

E∈([n]

r):n∈E

P(E) = qn

|{z}

=:q1

+∑

1≤i1<n

qi1,n

| {z }

=:q2

+∑

1≤i1,i2<n

i1,i2

qi1,i2,n

| {z }

=:q3

+...+∑

(i1,i2,...,ir−1)∈Ir−1

qi1,i2,...,ir−1,n

| {z }

=:qr

with qi

ndenoting the probability of bin nto be drawn in the i-th sequential step of Algo-

rithm 1 (recall that the algorithm samples an r-tuple from the set [n]). By the following

lemma, we show the imbalance occurring with these probabilities.

54 Random Allocation of Data Copies

Lemma 3.32. For any step t ≥2of a given r-tuple, it holds that qt

n<cn.

Proof. The proof follows by induction. For the basis, we consider t=2. Then, according

to Definition 3.8 of conditional probabilities, for arbitrary bins i1,i2∈[n],i1,i2, the

probability Pr[E1

i1∩E2

i2]to receive a copy each is qi1,i2=ci2·ci1

1−ci1

. Thus,

n−1

∑

i1=1

qi1,n=cn·

n−1

∑

i1=1

ci1

1−ci1

From this, it follows:

n=cn·

n−1

∑

i1=1

ci1

1−ci1

(c1<cn)

<cn·

n−1

∑

i1=1

ci1

1−cn

=cn·∑n−1

i1=1ci1

1−cn

=cn·1−cn

1−cn

(3.10)

that is, q2

n<cn(analogues, one could show: q2

1>c1).

Now, let t>2. Then again, consider arbitrary bins i1,i2,...,it∈[n],ia,ib∀a,b∈[t],

a,b, such that we calculate Pr[E1

i1∩E2

i2∩...∩Et

it]as

qi1,i2,...,it=cit·

t−1

∏

`=1

ci`

1−∑`

m=1cim.

Since this term calculates the probability only for a fixed set of chosen bins, we continue

with calculating the absolute probability qt

nof bin nto be drawn in the t-th step which

considers all possible combinations of bins to be drawn for allocation in the t−1 previous

steps. For our description, we again refer to the index sets Iu

`as defined in (3.9). Then,

we obtain

n=∑

(i1,...,it−1,n)∈It

qi1,...,it−1,n=cn·∑

(i1,...,it−1,n)∈It

t−1

∏

`=1

ci`

1−∑`

m=1cim

=cn·∑

(i1,...,it−2,n)∈It−1

∑

1≤it−1≤n

it−1,i1,...,it−2,n

t−1

∏

`=1

ci`

1−∑`

m=1cim

We can now factor out the first t−2 factors because these factors and the indices i1,...,it−2

do not depend on the index it−1. For convenience, we shorten the description by defining

F(js):=

∏

`=1

ci`

1−∑`

m=1cim.

Then, we obtain

n=cn·∑

(i1,...,it−2,n)∈It−1

F(it−2)·∑

1≤it−1≤n

it−1,i1,...,it−2,n

cit−1

1−ci1−ci2−...−cit−1

(3.11)

3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 55

It now remains to find an appropriate estimate for the innermost sum because then, the

calculation can be reduced to the previous step of the induction. Since only the innermost

sum depends on index it−1, again, we substitute the capacity cit−1in the denominator of

the innermost sum by cnyielding the following inequality:

n=cn·∑

(i1,...,it−2,n)∈It−1

F(it−2)·∑

1≤it−1≤n

it−1,i1,...,it−2,n

cit−1

1−ci1−...−cit−2−cit−1

≤cn·∑

(i1,...,it−2,n)∈It−1

F(it−2)·∑

1≤it−1≤n

it−1,i1,...,it−2,n

cit−1

1−ci1−...−cit−2−cn

=cn·∑

(i1,...,it−2,n)∈It−1

F(it−2)·

∑1≤it−1≤n

it−1,i1,...,it−2,n

cit−1

1−ci1−...−cit−2−cn

=cn·∑

(i1,...,it−2,n)∈It−1

F(it−2)·1−cn−∑t−2

l=1cil

1−cn−∑t−2

l=1cil

The reduction continues until reaching the induction base that contains a <-relation as

shown in Inequality 3.10, thus, we obtain qt

n<cnand the lemma follows.

Lemma 3.32 shows that for any step t≥2 of allocation, qt

n<cnyielding the following

result:

∑

E∈([n]

r):n∈E

P(E) = q1

n+q2

n+...+qr

n<r·cn=pn

satisfying Inequality 3.7 and thus, the theorem follows.

3.3.2.3 Determining Upper Bounds for the Imbalance

Since by Lemma 3.32, we know about the existence of the imbalance for any step 2 ≤t≤r

if Algorithm 1 is used for allocation on non-uniform capacities, in this section, we are fur-

thermore interested in determining the size of the imbalance, i.e. its maximum deviation

from an optimal allocation. More precisely, to obtain an even data layout, Algorithm 1

usually supposes any bin ito be drawn with allocation probability ciin every allocation

step t. Unfortunately, as we have shown, the desired data layout can only be achieved if

solely uniform capacities are considered, otherwise, we obtain a biased data distribution.

In other words, in a uniform setting, the imbalance equals zero, whereas with non-uniform

capacities, there is a deviation from a fair layout (Theorem 3.30). Moreover, the more the

capacities of the bins deviate from being uniform the greater the imbalance. Therefore,

obviously, the imbalance directly depends on the underlying disk configuration.

56 Random Allocation of Data Copies

As the imbalance in the system is determined for each disk exclusively, we now cal-

culate the maximum deviation with respect to bin nthat we assume to have maximum

capacity cn=1

r. Clearly, this could be of practical interest for e.g. administrators who al-

ways aim at providing the best access performance of the administrated storage network,

even if the SAN consists of a mixture of heterogeneous capacities, like e.g. some 80 to

500 GByte disks combined with a number of 3 TByte RAID-systems. Then, condition to

a required total storage capacity and with knowledge about the sizes of the particularly

induced imbalances, the administrator can decide what combination of storage devices to

chose in order to ease the imbalance and thus, to increase system performance.

Since the deviation is determined according to a fixed capacity distribution for which

exact values are needed, for our calculations we first define some capacity range for the

disk capacities. Then, given this range, we can calculate the induced imbalance (for this,

recall Lemma 3.29 for denoting the maximum size of a disk).

Definition 3.33. Consider a storage network S(n,C[n]), a share tuple ¯c and a replica-

tion parameter r ≥2. Then, a capacity distribution c1,...,cnis called γ-bounded if any

capacity ciin ¯c is taken from the range 1

γr≤ci≤1

rfor an arbitrary γ∈Q>1.

Due the proof of Lemma 3.32, the task now is to determine a value for the <-relation

in each step t, as first shown in Inequality 3.10. Again, as different steps induce different

degrees of index dependencies, for different steps 2 ≤t,t0≤r,t,t0, we have to consider

different functions for our calculations. Obviously, since regarding a <-relation, the im-

balance occurring in any step t≥2 becomes worst if the corresponding function reaches

its global minimum. Furthermore, since first, in a uniform setting, there is no imbalance,

and second, all bins i∈[n]are supposed to have capacity ci≤1

r, we start our considera-

tions from the smallest setting of bins which is given by a set of rbins of capacity 1

reach.

Then, our analysis is as follows.

Again, we fix bin nof capacity cn=1

rwhich, initially, holds for also for all other bins

and that we assume to remain unchanged for the whole process. Then, for each step

t≥2, the task is to find the minimum over all possible γ-bounded capacity distributions

on the remaining bins. Clearly, as we fixed bin n, the total capacity that is provided by

any disk configuration consisting of r−1 additional disks is always 1 −1

r. Moreover,

since the initial configuration is the only one to provide this total capacity by solely r

bins, any other capacity distribution force them to grow by number. We start with the

initial configuration consisting of bin nand r−1 additional bins, each having capacity 1

Then, we determine the global minimum of any possible capacity distribution on these

r−1 remaining bins. After that, we proceed with calculating the global minimum for

any capacity distribution on rbins (with 2 bins of size <1

r), and so forth, until finally,

we reach the maximum number of remaining bins, denoted by n−1 in the following, in

which all bins have capacity 1

γr.

3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 57

The Case t=2

We start with stating the size of the imbalance for step t=2. That is, we consider the

calculation for the probability q2

nas given in Lemma 3.32 for which we determine the

minimum over all γ-bounded capacity distributions on the set of remaining bins in the

system (recall that to ensuring the total capacity 1−1

rby different distributions a variable

number of bins in different steps must be considered). More formally, we calculate

min

r−1≤`≤n−1(min

c1,...,c`

∑

i1=1

ci1

1−ci1).(3.12)

Then, the following theorem shows that the imbalance becomes worst if the configuration

consists of nbins, that is, for n−1 bins the capacity is 1

γr.

Theorem 3.34. For all i ∈[n], let ci∈[1

γr,1

r], and fix cn=1

r. Then, Equation 3.12 is

minimal if `=n−1and ci=1

γr∀i∈[n−1].

Proof. Due to step t=2, we denote the innermost sum by X(2):=∑`

i1=1

ci1

1−ci1

. Further-

more, we consider the capacity c∗:= (1−1

r)/` which defines an even distribution of the

total capacity 1 −1

rcovered by the `observed bins. It is easy to show that the innermost

term ci1

1−ci1

corresponds to a convex and strictly monotonically increasing function on the

compact interval [1

γr,1

r]implying that X(2)is also convex and strictly monotonically in-

creasing for each `, and moreover, this means that Equation 3.12 has a unique solution.

With this, we can state our assumption by using the following lemma to show that X(2)is

minimal if and only if all bins have size c∗.

Lemma 3.35. X(2)is minimal if and only if for all i,j∈[`], i ,j, it holds that ci=cj.

Proof. The proof is straightforward. We assume that at least two indices i,j,i,j, exist

such that ci=c∗+εand cj=c∗−εfor an arbitrary ε≥0. Then, we simply have to find

the minimal εfor ci

1−ci

+cj

1−cj≥2c∗

1−c∗.

For this, we define the following function due to ciand cj

f(ε):=c∗+ε

1−c∗−ε+c∗−ε

1−c∗+ε,

and to find the minimum of f(ε), we have to look at the first and second order derivatives

f0(ε)and f00(ε), respectively. With

f0(ε) = 1

(1−c∗−ε)2−1

(1−c∗+ε)2

we obtain f0(ε) = 0 if and only if ε=0. Furthermore, for ε=0, f00(ε)>0, and hence,

the lemma follows.

58 Random Allocation of Data Copies

The result from Lemma 3.35 implies that the minimum is obtained if X(2)=∑`

i1=1c∗

1−c∗

which can then be transformed into the following term that we denote by Y:

Y:=1−1

1−1−(1/r)

`

Since Yis monotonically decreasing in `, its minimum is given if `=n−1 and hence, the

theorem follows.

So far, we showed that for step t=2 the imbalance on the data distribution becomes

worst if there is at most one bin of maximum size 1

rand all remaining bins have smallest

size 1

γr. Thus, it remains to show the same for the subsequent steps.

The Case t>2

According to the proof of Theorem 3.30, to calculate the allocation bias for any subse-

quent step t, 3 ≤t≤r, we again consider the functions for the probability qt

nas follows

(for better description, we again use the sets It

1, as defined in (3.9)).

min

r−1≤`≤n−1











min

c1,...,c`∑

(i1,...,it−1,n)∈It

1 t−1

∏

`=1

ci`

1−∑`

m=1cim!

| {z }

=:X(t)











(3.13)

For this problem and any observed function, we again suppose the global minimum if

`=n−1, that is, if n−1 bins have minimum capacity 1

γrbut this is only true if the

following conjecture holds (for which no proof is given so far).

Conjecture 3.36. For any step t ≥3, X(t)is a convex and strictly monotonically increas-

ing function on the compact interval [1

γr,1

r].

Corollary 3.37. If and only if Conjecture 3.36 is true, the minimum for Equation 3.13 is

given if `=n−1and ci=1

γr∀i∈[n−1].

Theorem 3.38. Consider a storage network S(n,C[n]) with γ-bounded capacities for

some γ∈Q>1, a replication parameter r and a fixed capacity cn=1

r. If Corollary 3.37

holds, then the maximum allocation imbalance for bin n in any step 2≤t≤r is given by

n=1

r·

t−1

∏

i=1

n−i

γr−iif γ∈Q>1,

and

n=1

r·n−1

t−1/γr−1

t−1 if (γr)∈Z>1.

3.3.2 On the Imbalance Induced by Balls-into-Bins Allocations 59

Proof. The proof is given by induction on t. Particularly, we show for each tthat qt

n<1

and further, that for any step t>2, qt

n=1

r·∏t−1

i=1

n−i

γr−i<qt−1

We start with t=2. Furthermore, we assume ndisks with cn=1

rand cj=1

γrfor all

j∈[n−1]. Then, by replacing the values in Equation 3.10 (in proof of Lemma 3.32), the

first assumption holds since

n=1

r·

n−1

∑

i1=1

γr

1−1

γr

r·n−1

γr−1

(γr>n)

which holds because as nis fixed, 1

γr<1

nin the non-uniform case. To show this, recall

that 1

γris the smallest capacity to be chosen for any bin. Then, condition to a fixed n,

a total capacity of at least n·1

γr≤1 is provided. If we now assume 1

γr≥1

n, it directly

follows that 1

γr=1

nimplying the uniform case what is a contradiction.

Now, consider t>2 and assume for all previous steps t−j,j≥1, that qt−j

n<1

rholds.

Then, according to Equation 3.11 (in Lemma 3.32), we consider

n=cn·∑

(i1,...,jt−2,n)∈It−1

F(it−2)·∑

1≤it−1≤n

it−1,i1,...,it−2,n

cit−1

1−ci1−ci2−...−cit−1

and as we suppose ci=1

γrfor all i,n, we can substitute as follows:

n=cn·∑

(i1,...,jt−2,n)∈It−1

F(it−2)·

n−t+1

∑

j=1

γr

1−1

γr−1

γr−...−1

γr

| {z }

t−1elements

=cn·∑

(i1,...,jt−2,n)∈It−1

F(it−2)·n−t+1

γr−t+1

As qt−1

n=cn·∑(i1,...,jt−2,n)∈It−1

1F(it−2)<1

rby assumption, the induction follows. More-

over, again, since γr>n, it follows that n−t+1

γr−t+1<1, and thus, qt

n<qt−1

If γr∈Z>1, the second result can simply be calculated as

n=1

r·n−1

γr−1·n−2

γr−2·...·n−(t−1)

γr−(t−1)=1

r·n−1

t−1

γr−1

t−1

So far, we showed that if sampling some edge E∈[n]

raccording to the distribution P

by using a common balls-into-bins scheme (Algorithm 1), a redundant data layout com-

bined with an optimal utilization of the offered system capacity can never be achieved

for heterogeneous environments. However, in the following sections, we show that the

desired data distribution can be obtained by placing the data blocks somewhat more con-

trolled, and what holds for static as well as dynamic storage environments.

60 Random Allocation of Data Copies

3.4 COMB: An Allocation Strategy for Static Scenarios

Although we showed in the previous section that applying a standard balls-into-bins ap-

proach for redundant data allocating can achieve fairness only for homogeneous envi-

ronments, there are numerous results making such approaches very suitable for random

allocation because, as we briefly sketched in Section 3.3.2.1, the results are quite convinc-

ing to linger on them. Thus, we now show that introducing some slight modifications to

the data placement in Algorithm 1 suffices to obtain the desired data layout.

Before we start our description, we first briefly discuss some intuitive approaches for

introducing redundancy into the standard balls-into-bins scheme.

Naive Approaches.

1. Allocate all balls independently by some common balls-into-bins scheme but ignore

all trials in which the same bin is drawn for identical copies such that, finally, all

elements of some certain edge E∈[n]

rare stored separately.

2. Since, according to Definition 3.5, fairness requires load `i=r·k·cifor all bins

i∈[n], one could try to partition the set [n]into rdisjoint subsets of size 1

reach

and then allocate the elements of each edge E∈[n]

rsuch that each partition gets

at most one copy.

We call these approaches naive because it is quite obvious that they do not provide

the desired result. In particular, if we consider non-uniform disk capacities for the first

approach, then, after allocating all balls to the disks, the valid trials (i.e. the ones that were

not ignored) again lead to a biased data placement according to the result of Theorem 3.30.

The second scheme also runs into problems because the mentioned approach corresponds

to the well-known PARTITION problem which is NP-complete.

Therefore, in this section, we introduce an efficient placement scheme that yields the

desired layout by just sampling a hypergraph edge Eaccording to a given probability

distribution P. Particularly, for a storage network S(n,C[n]) and the following allocation

strategy, Pis simply defined by the values of the corresponding share tuple ¯c. With respect

to Lemma 3.26 and Theorem 3.27, this is feasible because we sample r-tuples and as we

define pi=r·ci∈[0,1]for each bin iand ∑n

i=1ci=1, the necessary conditions to obtain

a fair distribution are satisfied (surely, provided that ci≤1

r).

The following algorithm is called COMB strategy, and it is closely related to a redun-

dant balls-into-bins model but differs due to the fact that only the first of ridentical copies

is always allocated at random to the set [n]of bins, and all subsequent ones are placed de-

terministically on distinct bins. Moreover, we show first that the resulting load of any bin

corresponds to a standard single-choice balls-into-bins allocation, and second, that it is

highly concentrated around the expected value.

3.4. COMB: An Allocation Strategy for Static Scenarios 61

In particular, consider a storage network S(n,C[n]) and a corresponding share tuple ¯c

for the disks in the set [n]. Let the bins be numbered in a range from 1 to nand arranged in

a linear manner in the [0,1)-interval (that we consider as a ring) according to their shares

in ¯c. That is, bin iis responsible for the half-open interval Ii= [αi,αi+ci)⊆[0,1)of

length cistarting at point αi=∑i−1

j=1cj. Now, given a position x∈[0,1), let IC(x)∈[n]

denote the index of the interval Iiwhich xlies in. Then, the COMB strategy works as

follows.

Algorithm 2 The COMB Strategy

Require: Probability space (S(n,C[n]),¯c), Set [k]of balls, Replication parameter r

Ensure: A set Mr

k={(I1

1,I1

2,...,I1

r),...,(Ik

1,Ik

2,...,Ik

r)}of r-tuples of indices of bins.

1: for all d∈[k]do

2: Place duniformly at random at position x1∈[0,1)yielding interval I1=IC(x1)

3: Set i←2

4: while i≤rdo

5: Allocate dto position xi= (x1+(i−1)·1

r)yielding interval Ii=IC(xi)

6: Set i←(i+1)

The COMB strategy, which works on disk assignments into the [0,1)-interval, is graph-

ically illustrated in Figure 3.2 for a single 4-tuple. The first copy is allocated uniformly

at random to some point x1∈[0,1), and all subsequent copies i∈{2,3,4}are determin-

istically placed to locations xiby a distance of (i−1)·1

4from x1in the unit ring. In the

following theorem, we show the correctness and layout quality of the COMB algorithm.

Theorem 3.39. For a static storage network S(n,C[n]) of arbitrary capacities, a set [k]

of distinct data blocks and r ≥2, the COMB strategy provides a fair and redundant dis-

tribution of m =k·r replicated balls on the set [n]of bins. Moreover, any bin i gets at

most

`i=ci·k·r+O(pci·k·r·log n)

data blocks w.h.p. if q3·`·log n

ci·k·r<1for some constant ` > 0.

Proof. We prove the correctness of the COMB strategy in terms of redundancy and a

balanced data load.

Redundancy. As any bin i∈[n]is assigned a half-open interval Iiof size ci≤1

r, and

furthermore, all copies of any data block d∈[k]are allocated by a distance of 1

rfrom each

other7, no half-open interval can simultaneously contain two different locations xj,xj+1

for any two identical copies. Thus, all copies dj,j∈[r], lie in different intervals, and from

this it follows that all values j1,j2,..., jrin the sampled r-tuple are pairwise distinct. If

7More precisely, any distance in the interval [maxi{Ii},1

r]is valid to achieve the desired data layout.

62 Random Allocation of Data Copies

x2 (x1+ 1/4)

(x1 + 1/2)

(x1 + 3/4)

Figure 3.2: The COMB strategy on the [0,1)-interval for n=9 bins and r=4.

then some function f:[n]r→[n]

ris applied that maps each sampled r-tuple to some edge

E∈[n]

r, this yields the required r-element set.

Fairness. Given the set [k]and parameter r, we obtain kdistinct r-tuples independently

sampled by the COMB strategy. As, for each such tuple, the first location x1is chosen

uniformly at random in the [0,1)-ring, clearly, the probability for some interval Iito be

chosen is Pr[IC(x1) = Ii] = ci. Moreover, if x1is chosen uniformly random, then so is

xj=x1+j−1

rfor each j∈ {2,...,r}; therefore, regardless of the obvious correlation

between x1and xj, and depending on the size c`for some interval I`to be chosen, it holds

that Pr[IC(xj) = I`)] = c`for all `∈[n].

For the rest of this proof, consider an arbitrary but fixed bin iwith allocation probability

ci. Furthermore, for an arbitrary r-tuple S, 1 ≤S≤k, let XS

ibe the binary random variable

denoting the event that bin iis drawn by S, and let Xi=∑SXS

i. Then, referring to Ques-

tion 3.25, it holds that Pr[XS

i=1] = r·ci(= pi)which satisfies the necessary condition

in Lemma 3.26 for balanced allocations. Additionally, since each random variable Xiis

binomially distributed, its expected value E[Xi] = r·k·ciwhich, on expectation, ensures

a fair data layout according to Definition 3.5.

At last, it remains to show that any disk contains the expected number of blocks also

w.h.p. Since all r-tuples are mapped to the [0,1)-interval independently at random and the

i’s are binary random variables, we can apply the Chernoff bounds from Lemma 3.15

gaining the following upper bound on the load of bin i:

Pr[Xi≥(1+ε)·ci·k·r]≤e-min[δ2,ε]·ci·k·r/3(3.14)

3.4. COMB: An Allocation Strategy for Static Scenarios 63

for every ε>0.

To prove the bound on the concentration, we need to derive the particular εfor which

the probability drops below n−`(we exemplarily sketch the calculation for obtaining the

correct εonce in this thesis). Considering Equation 3.14, from the right hand side, we get

the following, assuming ε<1,

e−ε2·ci·k·r/3≤n−`

n`≤eε2·ci·k·r/3

Setting ε=pε0·log nleads to

n`≤e√ε0·log n2·ci·k·r/3

n`≤elog n·ε0·ci·k·r/3

n`≤nε0·ci·k·r/3

ε0≥3·`

ci·k·r

Substituting this in Inequality 3.14 concludes the proof.

The above result shows that for bin iwith capacity cithe maximum deviation of the load

from its expected value is in O(√ci·k·r·log n). Recall that this directly corresponds to

the deviation of a standard single-choice balls-into-bins game (Section 3.3.2.1). Thus,

as the COMB strategy implements some sort of redundant single-choice balls-into-bins

approach, the achieved result is the best to obtain for a redundant single-choice process.

However, like in the standard (non-redundant) case, the deviation heavily depends on the

total number of balls allocated to the bins, that is, the more balls thrown the worse the

deviation. However, we leave a possible improvement of this drawback as the following

Open Problem: Is it possible to improve the presented single-choice deviation

bound by some multiple-choice allocation algorithm that allocates r identical

copies to one out of d ≥2sampled r-tuples (that are determined by their first copy

allocations similar to the shown COMB strategy) ?

Furthermore, as a more essential drawback, the COMB strategy neither is nor can be

made adaptive. As we have shown, the allocation process works on an n-fold partitioning

of the [0,1)-interval into nintervals of lengths c1,c2,...,cn, respectively. Since ∑ici=1,

it is easy to see that this scheme works efficiently for static environments only because

if the number of disks changes, the system obviously runs into fragmentation problems.

That is, whenever a new disk enters the system, this would affect the sizes of all intervals

(even if the disk has only small capacity) what in turn leads to a huge amount of re-

allocations and thus, to intolerable performance deficits over a certain time.

As a consequence, we address the problem of occurring dynamics in the storage net-

work in more detail in the following section.

64 Random Allocation of Data Copies

3.5 SPREAD: An Adaptive Allocation Scheme for

Dynamic Scenarios

We introduce an advanced data placement scheme called SPREAD [MS08] which, unlike

the former strategies, provides useful adaptivity properties to face dynamic changes in

scalable SANs, that is, it is capable of handling any occurring changes in the capacity dis-

tribution on the disks in the storage network that may be caused by faulty disks or growing

storage demands. More precisely, we present a redundant data placement algorithm that

•satisfies redundancy, fairness and adaptivity in a time- and space-efficient manner,

•is able to locate any stored data item efficiently (access time is at most logarithmic

in the system size),

•uses lookup functions whose space requirements only depend on the number of

disks in the system and not on the number of data items or capacity distribution,

•given a replication parameter r, may cope with a variable number rdof copies per

data item das long as rd≤r.

In contrast to the algorithms described in the last two sections that primarily focused

on the obtained data layout without giving a practical notion of the allocation functions,

we now consider the practical realization of such functions as well as used data structures

and lookup-times for realizing the dictionary functions search(T,k) and insert(T,k) (c.f.

Section 3.1.2.1).

As we know, the most efficient approach to implement the dictionary functions is pro-

vided by hashing, where some given set P⊆Uof data blocks with keys taken from a uni-

verse Uof all possible keys is distributed sufficiently even over a set of disks in a storage

network by applying an appropriate hash function h(see Figure 3.3). More particularly,

for achieving an appropriate even allocation of data items to the disks, all strategies we

present in this section apply k-universal hash functions (recall from Section 3.1.2.2 that

k=O(log n)suffices for our demands). Again, the very advantage we gain by this way of

data placement is that allocations can be done with complexity O(1). However, as hash-

ing is usually utilized for static environments, and as we have further shown with Linear

Hashing in Section 2, a simple substitution of the hash functions in case of a scaling envi-

ronment is not sufficient for satisfying our demands. Thus, we are left with the following

question: How to design an adaptive hash function ?

Before we describe the SPREAD strategy in more detail, we first tackle this design

problem by briefly introducing the most important adaptive strategies that have been pub-

lished so far and discuss why it is so difficult to adapt these so that all conditions are met.

Furthermore, note that the adaptive schemes we introduce in the following differ from

3.5.1 Previous Adaptive Strategies 65

1234

hash function h

Figure 3.3: Hashing

SPREAD with respect to the fact that none of them is designed to handle redundancy

either appropriately or at all, that is, they feature no or only less efficient fault-tolerant

mechanisms. Thus, we simply ignore the redundancy condition for a moment.

3.5.1 Previous Adaptive Strategies

If all storage devices have the same capacity, surely, we are not concerned with the het-

erogeneity issue such that any strategy has to assign each disk simply the same data load.

Thus, in a static environment, a standard balls-into-bins algorithm could easily provide

the desired fair distribution (and COMB if redundancy is demanded), but as we consider

dynamic scenarios, we are left with the difficulty of changes in the environment, that is,

the addition and removal of disks, respectively data items. Nevertheless, the usage of only

uniform capacities simplifies the allocation problem considerably.

Exemplary, we restrict our descriptions on two prominent schemes, called Consistent

Hashing and SHARE, which were the first approaches introduced to feature adaptivity for

uniform and also non-uniform capacity distributions.

3.5.1.1 Consistent Hashing for Uniform Capacities

Consistent Hashing, which is also called Nearest Neighbor Strategy, was published by

Karger et al.[KLL+97]. Initially, the scheme was part of a global distribution scheme for

web-based contents, and it has further been applied to Web caching in [KSB+99].

In its basic form, Consistent Hashing uses two hash functions, g:U→[0,1)for map-

ping the (pairwise independent) data items into the [0,1)-interval (which again is consid-

ered as a modulo 1 ring) and h:V→[0,1)to map the nodes (resp. storage devices) into the

same [0,1)-ring. Then, each node vis responsible for the interval [h((v),h((w)) ⊆[0,1),

with some node wbeing the direct successor of vin the mapping into the unit ring. Now,

given the two functions, every data item d∈Uis placed to position g(d)and furthermore

stored at the node vwith h(v)being closest from above to g(d), that is, it is assigned to

the interval [h((v),h((w)) (in Figure 3.4, F2marks such interval for v=2).

66 Random Allocation of Data Copies

12345

hash h

hash g

Figure 3.4: ConsistentHashing

In the basic case, if only one single function gis used for placing the nodes indepen-

dently and uniformly into the [0,1)-ring, this would lead to a biased data layout as stated

by the following lemma (the proof can be found in [MNR02]).

Lemma 3.40. If n nodes are allocated to the [0,1)-ring independently and uniform at

random, the length of the longest segment is w.h.p. Θlog n

n, and w.h.p. there is no

segment which is shorter than Θ1

n2.

Fortunately, this problem can be circumvented by using O(log n)differently labeled

copies for each node v, all of which are allocated to the [0,1)-interval. If then gand h

are O(log n)-universal hash functions, it is easy to show that, first, Consistent Hashing

yields an even distribution on the nodes, and second, the strategy is 2-competitive (see

e.g. [KLL+97, Sal04]). Moreover, also multiple copies for each data item dcan easily

be stored in a redundant way while preserving fairness by the rule that rcopies of d

are stored in the rnodes vwith h(v)being among the rclosest successors of g(d)in

[0,1). However, Consistent Hashing is no longer fair if applied to nodes of non-uniform

capacities. This can be repaired by cutting the nodes into pieces of uniform capacity, but

then the space for the hash table would depend on the differences in capacity and not

just the number of nodes, which causes maintenance problems. In particular, it would be

difficult to adapt this approach to changes in the minimum node capacity without inducing

high data replacements and an undesired storage overhead, respectively.

Another adaptive hash table method for uniform storage devices, called Cut-and-Paste

Strategy, has been presented in [BSS00, San01] for which it can be shown that, in contrast

to Consistent Hashing, it keeps the deviation from the desired number of balls in a bin

extremely small w.h.p. However, like Consistent Hashing, the Cut-and-Paste strategy is

also hard to extend to the non-uniform case.

3.5.1 Previous Adaptive Strategies 67

I2I3

I4I5

g(i)

h(b)

Figure 3.5: The SHARE Strategy

3.5.1.2 The SHARE Strategy for Non-Uniform Capacities

One of the first adaptive hash tables that can handle storage devices of arbitrary capacities

was introduced in [BSS02], called SHARE, and which works in two phases. At first, there

is a reduction phase implementing the conversion of the given non-uniform allocation

problem into several uniform ones. In the second phase, a uniform allocation strategy,

like e.g. Consistent Hashing, is then applied to each of these uniform problems to finally

assign the data items to the bins.

In more detail, given any data item b∈Ufor allocation, like in Consistent Hashing,

the reduction phase of SHARE also applies two random hash functions, h:U→[0,1)

to map the data items, and g:V→[0,1)to map the nodes into the [0,1)-interval (which

again is considered as a modulo ring). Surely, both functions must ensure a sufficient

even distribution of the items which, as we know, can be obtained by using O(log n)-

universal hash functions for hand g. Then, unlike in Consistent Hashing, each node v∈V

is assigned a subinterval Ivin the [0,1)-ring whose starting point is g(v)∈[0,1)and that

has length s·cv, where cvdenotes the relative capacity of the node v, and s=Θ(log n)is

astretch factor that is needed to ensure a complete coverage of the [0,1)-interval by the

subintervals, what, furthermore, guarantees high probability on the allocation. That is,

any interval Iithat starts at some point g(i)ends at point (g(i)+s·ci)mod 1.

Furthermore, as depicted in Figure 3.5, given the hashed node intervals, each start- or

endpoint of an interval marks the beginning or ending of some frame Fj. Since all such

frames are disjoint in the ring, each data item bthat is hashed to some location h(b)is

thus exclusively assigned to exactly one frame. Moreover, as the borders of the frames are

defined by the start- or endpoints of the subintervals, each frame is completely covered

by a number of subintervals. At last, the chosen uniform placement strategy is applied

to any frame to select the target node out of the set of nodes assigned to each frame for

68 Random Allocation of Data Copies

data placement. Then, the SHARE scheme yields the desired fair distribution on an either

uniform or non-uniform as well as scalable storage environment w.h.p. and is (2+ε)-

competitive, where εcan be made arbitrarily small.

Another approach that can also be found in [BSS02] and that is called SIEVE is orga-

nized as a multistage filter and overcomes some of the drawbacks inherent to SHARE,

especially, it does not rely on an extra placement strategy for uniform capacities. How-

ever, to provide a fair placement, the SIEVE algorithm maintains a logarithmic number

of random hash functions, one for each stage, and surely, keeping all these functions in

memory leads to an undesired space consumption.

Nevertheless, for any capacity distribution, SHARE and SIEVE are efficient, (2 +ε)-

adaptive and both yield a fair data dispersal within a (1±ε)factor, where ε>0 can be

made arbitrarily small. Particularly, the bound on the adaptivity of SHARE and SIEVE

is based upon the following Fact 3.41. In more detail, for a hash table to be adaptive, it

has to be able to handle any capacity change from (c1,...,cn)to (c0

1,...,c0

n), and as we

already mentioned in Section 3.1.1, any adaptive storage strategy that wants to preserve

fairness has to replace at least a

∑

i:ci>c0

(ci−c0

i) = 1

2∑

i|ci−c0

fraction of the data in the system, hence, the following fact holds.

Fact 3.41. If, for any change from (c1,...,cn)to (c0

1,...,c0

n), a storage strategy only

needs to replace a d ∑i|ci−c0

i|-fraction of the data in the system, then it is 2d-adaptive

w.r.t. capacity changes.

Regarding fault-tolerance, obviously, the attempt to make SIEVE redundant is a hard

challenge (if possible at all) but introducing redundancy into SHARE in some rudimentary

way can be done as follows. If for some given stretch factor s=Θ(log n)the capacity

of every disk i∈[n]is bounded by ci≤1

r·s, then one can assign the copies of any data

item dto positions h(d),h(d) + 1

r,h(d) + 2

r,...,h(d) + r−1

rin the [0,1)-ring (recall that

|Ii|=ci·s). However, in what follows in the next section, we aim at yielding redundancy,

fairness and adaptivity for disks up to size ≤1

In [SS05], a related strategy, called DHHT, was introduced but, similar to SHARE

and SIEVE, it has not been shown yet that this strategy can handle both redundancy and

fairness simultaneously and in an appropriate manner.

Recently, Brinkmann et al. [BEMadHS07] proposed the first scheme for systems of

disks with arbitrary capacities that is fair and redundant but what is only Θ(r2)-adaptive

in the worst case for rcopies per data item whereas we are aiming at a constant adaptivity

that is independent of r.

3.5.2 The SPREAD Strategy 69

3.5.2 The SPREAD Strategy

For the rest of this chapter, we describe and analyze the SPREAD strategy in more de-

tail. First, we sketch the basic framework of SPREAD but leaving out various details

about how to achieve fairness, redundancy and adaptivity. Afterwards, we bound the

time- and space-efficiency of SPREAD (Section 3.5.2.2), describe how to achieve fair-

ness and redundancy (Section 3.5.2.3), and show how to make SPREAD O(1)-adaptive

(Sections 3.5.2.4 and 3.5.2.5). At the end, we discuss how to extend SPREAD to the case

that the data items have different levels of redundancy.

3.5.2.1 The basic scheme.

Like in the previous sections, in the following, we again consider a given storage network

S(n,C[n]) and a corresponding share tuple ¯cfor the nodes in the set [n]. Again, assume

the nodes to be numbered uniquely in a range from 1 to n, and furthermore due to the

adaptivity condition, for any history of changes, we always use the strategy that every

node newly entering the system obtains the lowest available identifier ≥1, thus, we can

make sure that the nodes will be numbered in the defined range from 1 to nany further

(note that the capacity of any currently unoccupied identifier is equal to 0, but it will

nevertheless be part of the SPREAD data structure.).

SPREAD uses three (pseudo-)random (O(log n)-universal) hash functions: a hash func-

tion h:V→[0,1)for the nodes and two hash functions g1,g2:U→[0,1)to map the data

items into the [0,1)-interval (which again will be considered as a modulo1 ring). Simi-

lar to the SHARE strategy, SPREAD makes use of a stretch factor s=σ·log N, where

N=|V|(resp. the maximum number of nodes the system can expect) and σ≥1 is a

sufficiently large constant such that s∈N. Note that, while in SHARE, the stretch factor

was used to ensure the complete coverage of the unit ring, in SPREAD, it is furthermore

used to guarantee that for each point x∈[0,1), there are at least rintervals passing x.

Now, given these preconditions, the SPREAD strategy works as follows.

For every node v∈[n], we identify an interval I(v)=[h(v),h(v) + s·r·cimod 1)of

length s·r·ciin the [0,1)-ring, where h(v)is called the starting point of the interval and

h(v)+s·r·cimod 1 is its endpoint. If s·r·ci>1, we consider I(v)to be wrapped around

the whole [0,1)-interval bs·r·cicmany times.

With respect to the previous strategy, the SPREAD scheme realizes some form of a

hybrid solution combining the Consistent Hashing approach and the SHARE strategy. In

particular, like Consistent Hashing, the SPREAD data structure maintains a partition of

the [0,1)interval into n frames F1,...,Fnwith Fvstarting at h(v)and ending at the closest

successor of h(v)among the points {h(1),...,h(n)}\{h(v)}(recall that [0,1)is treated

as a ring). Surely, the intervals differ in size, and since all intervals are allocated to the

[0,1)-ring independently and uniform at random, they determine overlapping areas, called

70 Random Allocation of Data Copies

areas of influence, for some given point x∈[0,1), that is, there are numerous intervals

passing x(see Figure 3.6 for an example).

12345

Figure 3.6: The interval decomposition of SPREAD

Moreover, like in SHARE, for reducing the non-uniform allocation problem to a uni-

form one, each frame Fvis further partitioned into a number of subframes (we will explain

later how these subframes are chosen. For now, just note that the subframe decomposition

only depends on nand h(1),...,h(n)and not on the capacities of the nodes, and as ngrows

according to system scalings, some of these subframes may be partitioned into smaller

subframes). Then, given the replication parameter r, for each subframe f, SPREAD aims

at maintaining one (and sometimes two, as will be explained later) (s

α)×r-table Tfcon-

sisting of r·s

αslots which are organized into s

αgroups (respectively columns) of size r

each, where α>0 is a small constant that is selected such that a certain degree of fair-

ness is maintained. Every slot is assigned to (resp. owned by) exactly one node, and the

assignment has to be chosen so that for each group of rslots, each slot is assigned to

a different node (see Figure 3.7). Then, we can use the following strategy for ensuring

redundancy.

(s/α)

group 1 group (s/α)

direction of allocation

Figure 3.7: Allocation table for a subframe f

3.5.2 The SPREAD Strategy 71

At first, whenever we want to read or overwrite the copies of some data item d, we

perform the following steps.

1. Identify the unique subframe fvia g1(d)∈f,

2. Pick group d(s

α)·g2(d)ein Tfand either read or store the copies of din the rnodes

owning the slots in this group

(in Figure 3.8, the allocation of a data item to the slots in the corresponding subframe

table is depicted). Given that the hash functions can be evaluated in constant time and

there are msubframes in the system, standard data structures such as search trees and

arrays can be used to obtain the following result (recall that we assign slots to nodes, and

surely, ndifferent nodes require log nbits for encoding).

….

s/α cols

g1(d)

places of rcopies

g2(d)

Figure 3.8: Allocation of a slot in a subframe table

Lemma 3.42. The lookup and insert operations of SPREAD can be implemented with

runtime O(r+log m)and space O(m(r·s

α)log n).

Hence, as long as mis close to linear in n, SPREAD is time- and space-efficient. In the

following subsections, we show that there is a way of maintaining the tables so that also

the fairness, redundancy and adaptivity conditions are satisfied.

3.5.2.2 Subframe management.

As we see in the following, to make the reduction phase that transforms the non-uniform

problem into a number of uniform ones be able to cope also with a scaling storage system

appropriately, ideally, we would like to have the following subframe decomposition for

each frame Fv⊆[0,1). For the subframes f0,f1,f2,... following Fvin the [0,1)-ring (in

ascending direction), it holds that

|fi|=ε(1+ε)i·|Fv|for some fixed ε>0.(3.15)

72 Random Allocation of Data Copies

If this were true, we could approximate any interval I(v)⊆[0,1)of size |I(v)|≥|Fv|by

an interval I0(v)ending at the starting point of some subframe fiso that |I0(v)|is within

(1±ε)|I(v)|. In fact, the following lemma holds.

Lemma 3.43. Suppose some number k ∈Nto be the maximum value for which it holds

that |Fv|+∑k−1

i=0|fi| ≤ |I(v)|. Then, it also holds that |Fv|+∑k−1

i=0|fi| ≥ (1−ε)|I(v)|and

|Fv|+∑k

i=0|fi| ≤ (1+ε)|I(v)|.

Proof. It holds that

|Fv|+

k−1

∑

i=0|fi|= (1+ε

k−1

∑

i=0

(1+ε)i)|Fv|= (1+ε)k|Fv|

for any k≥0. Hence, for |Fv|+∑k−1

i=0|fi| ≤ |I(v)| ≤ |Fv|+∑k

i=0|fi|, we get that

|Fv|+

k−1

∑

i=0|fi| ≥ 1

1+ε|I(v)|= (1−ε

1+ε)|I(v)| ≥ (1−ε)|I(v)|

and |Fv|+∑k

i=0|fi| ≤ (1+ε)|I(v)|.

Thus, we can either decide to round I(v)down to the starting point of the subframe

containing its endpoint or up to the starting point of the next subframe in order to obtain

an interval I0(v)whose size is within (1±ε)|I(v)|. If |I(v)|is at most |Fv|, we identify

I0(v)with I(v), i.e., we do not round I(v)(see Figure 3.9 for illustration).

exact I(v)rounded I(v)

Subframes of size

(1+

|Fv|

Figure 3.9: Rounded Up Interval

Of course, choosing a subframe decomposition such that for each frame Fv, there are

subsequent subframes f0,f1,f2,... of sizes |fi|=ε(1+ε)i|Fv|is not possible, but it is

sufficient to cut each frame Fwinto subframes such that the following condition is true for

every frame Fv:

3.5.2 The SPREAD Strategy 73

Subframe condition: For any two frames or subframes fand f0, let ∆(f,f0)be the

distance (in ascending direction along the [0,1)-ring) from the endpoint of fto the starting

point of f0. Furthermore, for every w∈[n]\{v}and every subframe fin Fv, we require

that |f|≤ε(1+ε)k|Fw|, where k∈N0is maximum possible so that

k−1

∑

i=1

(1+ε)i|Fw| ≤ ∆(Fw,f).

If this condition is true, it is easy to see that the interval I(v)can be rounded down to the

starting point of the subframe ficontaining its endpoint or up to the starting point of the

next subframe fi+1in order to obtain an interval I0(v)with |I(v0)|≤ (1±ε)|I(v)|. Now,

to satisfy the subframe condition, we use the following decomposition strategy for the

[0,1)-ring.

Given h(1),...,h(n)∈[0,1), we start with some single subframe representing the en-

tire [0,1)-interval, and we keep cutting subframes in half until the subframe condition is

true everywhere (when considering each subframe crossing b≥1 frame borders as being

cut into b+1 subframes at these borders). This leads to a unique decomposition into

subframes (for any given node mappings h(1),...,h(n)) with the property that for every

subframe fthat is not the first or last subframe in some frame Fv,|f|=1

2kfor some

k∈N0. Moreover, whenever the number nof disks is increased by system scaling, the

decomposition strategy will only cause some subframes to be cut into smaller subframes

(which turns out to be useful for the adaptivity of SPREAD). The following two lemmas

show that, w.h.p., our decomposition rule does not create too many subframes.

Lemma 3.44. The number of frames Fvin the system with |Fv|≤ δ

nfor some given δ>0is

bounded by 2δn+O(√n), w.h.p. Furthermore, w.h.p. there is no frame Fvwith |Fv|≤ 1

for some constant k, w.h.p.

Proof. For every node v∈[n], consider a random variable Xvwith Xv=h(v). Then, given

a fixed δ, let the function

f(X1,...,Xn):[0,1)n→N

represent the number of frames Fvwith |Fv| ≤ δ

n. Certainly, whenever (x1,...,xn)and

(x0

1,...,x0

n)differ in at most one coordinate, it holds that

|f(x1,...,xn)−f(x0

1,...,x0

n)| ≤ 2.

74 Random Allocation of Data Copies

Moreover, for any node v∈[n]and δ≤1, it holds that

Pr|Fv| ≤ δ

n=1−1−δ

nn−1

≤1− 1−(n−1)δ

n+n−1

2δ

n2!

= (n−1)δ

n−n−1

2δ

n2

≤δ

which implies that E[f]≤δn. Hence, it follows from the method of bounded differences

(c.f. Section 3.1.3.1) that for any d≥0, it holds that

Pr[f≥δn+d]≤e−d2/(2n).

Choosing d=δn+Θ(√n)results in the first part of the lemma. The second part holds

because for any node v, it holds that

Pr|Fv|≤ 1

nk≤1

nk−1.

Therefore, the probability that there exists a v∈[n]with |Fv|≤ 1

nkis polynomially small

in n.

Lemma 3.45. The number of subframes in the system is bounded by O(n

ε), w.h.p.

Proof. For any i∈N0, let Sibe the set of all frames of size in [2−i

n,2−(i+1)

n). Then, for

each frame F∈Si, we need at most

k=log1+ε1

n|F|≤log1+ε2i+1=Oi

ε

subframes f0,f1,..., fkin the ideal setting until |fk|=ε(1+ε)k|F|≥ ε

n. Hence, it follows

from Lemma 3.44 that the total number of subframes needed is bounded by

log n

∑

i=02n

2i·Oi

ε+O√n·Olog n

ε+On

ε=On

ε

w.h.p.

Next, we explain how to manage the tables.

3.5.2 The SPREAD Strategy 75

3.5.2.3 Table management.

For each subframe f⊆[0,1),C(f)represents a multiset of nodes wfor which f∩I(w),/0

holds. Furthermore, for every node win C(f), let its multiplicity µf(w)be defined as the

number of times woccurs in C(f)(see Section 3.1.3.4 for details on multisets). Initially,

µf(w)is set to the number of times I(w)crosses the starting point of f, thus, surely,

µf(w)∈ {br·s·cwc,dr·s·cwe}. Moreover, since ci≤1

rfor every i∈[n], it holds that

µf(w)∈{0,...,s}.

Now, for what follows, consider some frame Fv. Then, for every subframe f⊆Fvand

nodes w,v, we are using the following rules:

Rounding conditions:

1. Whenever the endpoint of I(w)crosses fcompletely (i.e. moves beyond the start-

and endpoint of f) for the first time, we set µf(w) = µf(w)+1.

2. Whenever the endpoint of I(w)moves below the starting point of ffor the first time

after crossing the endpoint of f(or after whas been introduced to the system), we

set µf(w) = µf(w)−1.

We also apply the same rules if w=vand |I(v)|>|Fv|. For the special case that w=v

and |I(v)|≤|Fv|,µf(w)is defined as the number of times I(w)crosses the starting point

of f(recall that the frames Fvare defined by the values h(v), thus, |I(v)|≤Fvmay occur).

With these rules, it holds that whenever the endpoint of I(w)is outside of f, then µf(w)

equals the number of times I(w)crosses f, and otherwise µf(w)∈{br·s·cic,dr·s·cie}.

Hence, since no interval can start inside of f, the number of times intervals cross fat the

left and right endpoints is an upper and lower bound on |C(f)|. More precisely, for any

subframe f, we can bound |C(f)|as follows.

Lemma 3.46. For every subframe f ⊆[0,1)and corresponding multiset C(f),|C(f)|

is within (1±β)r·s, w.h.p., where the constant β>0can be made arbitrarily small

depending on the constant σin s.

Proof. To bound |C(f)|for any subframe f⊆[0,1), the arguments above imply that it

suffices to consider any point x∈[0,1)that represents a starting or endpoint of some

interval I(v). In the following, let us first consider the starting point of an interval I(v).

Consider some fixed node vwith starting point h(v)∈[0,1). For any node win the

system (including v), let the random variable Xwbe defined as

Xw=br·s·cwc+Yw

where the binary random variable Yw=1 if and only if I(w)contains h(v)dr·s·cwemany

times. Furthermore, let X=∑wXwand Y=∑wYw. It certainly holds that E[Xw] = s·r·cw

for all w,vand that X=ds·r·cve+∑w,vXw. Hence,

E[X] = ds·r·cve+s·r·(1−cv)∈[s·r,s·r+1].

76 Random Allocation of Data Copies

Moreover, E[Y]≤E[X], and since the starting points of the intervals are chosen indepen-

dently at random and the Yw’s are binary random variables, it follows from the Chernoff

bounds (see Section 3.1.3.1) that

Pr[|Y−E[Y]|≥βE[Y]] ≤e−β2E[Y]

2(1+β/3)

for any β>0. Hence, Xis within (1±β)s·r, w.h.p., where the constant β>0 can be

made arbitrarily small depending on the constant σin s.

For the endpoint of an interval I(v), it is easy to see that E[X]∈[s·r−1,s·r]. Thus,

the same deviation bounds can also be shown here.

For our further descriptions, we need somewhat more refined estimates on the sizes

of multisets corresponding to the nodes in the system. As we will see later, this is im-

portant for the design and management of the used tables to appropriately handle system

dynamics.

For what follows, consider the sets

W`:=nv∈[n]|cv≥1

(s−1)ro⊆[n]

and Ws:= [n]\W`(it is easy to see that all nodes v∈W`are guaranteed to have multiplicity

µf(v)≥1 in every subframe f⊆[0,1)).

Now, for any subframe f, we define C`(f):=C(f)|W`(i.e., the multiset containing

only those nodes in C(f)that are also in W`) and Cs(f):=C(f)|Ws. Furthermore, let

a`=∑v∈W`cvand as=1−a`. Then, a more refined version of Lemma 3.46 is as follows.

Lemma 3.47. For any subframe f and any subset W0

`⊆W`, it holds that |C0

`(f)|is within

s·r·a0

`±β|W0

`|+O(log n), w.h.p., where C0

`(f) = C`(f)|W0

`and a0

`=∑v∈W0

`cv. Further-

more, |Cs(f)|is within (1±β)s·r·as+O(log n), w.h.p.

In both cases, the constant β>0can be made arbitrarily small depending on the constant

σin s.

Proof. Recall the definition of Xwand Ywin the proof of the previous lemma. Applied to

nodes w∈W0

`, it holds that

E[X] = ∑

w∈W0

`bs·r·cwc+E[Y]∈[s·r·a0

`−1,s·r·a0

`+1].

Since E[Y]≤|W0

`|, the first bound follows from the fact that

Pr[|Y−E[Y]| ≥ βE[Y]] ≤e−β2E[Y]

2(1+β/3)

for any β>0.

The second bound follows along the same lines as in the proof of Lemma 3.46.

3.5.2 The SPREAD Strategy 77

As we know now, any subframe is always crossed by a sufficient number of intervals.

Furthermore, the following result is crucial for SPREAD to be redundant.

Lemma 3.48. For every subframe f , there are at least r different nodes in C(f), w.h.p.,

where the probability depends on the constant σin the stretch factor.

Proof. Let q=|W`|. If q≥r, then the lemma is trivially true. So suppose that q<r.

Since cv≤1

rfor all v∈[n], a relative capacity of at least 1 −q

rmust be covered by Ws.

Hence,

|Ws|≥1−q

r/1

(s−1)r= (s−1)(r−q)

and

∑

v∈Ws|I(v)|≥(r·s)1−q

r=s(r−q).

Suppose that the nodes in Wsare numbered from 1 to t. Consider any fixed node v∈Ws

and focus on the point yimplied by v’s interval I(v)=[x,y). For any node w∈W, let the

binary random variable Xw=1 if and only if y∈I0(w)and let X=∑wXw. Since

E[Xw] = Pr[Xw=1] = min{|I0(w)|,1}for all w∈Ws\{v},

it holds that

E[X] = q+∑

w∈Ws\{v}

E[Xw] = q+∑

w∈Ws\{v}

min{|I0(w)|,1}

≥q+(1−ε)s(r−q)−1≥r+(1−ε)s−2

Since s=Θ(logN)and the starting points of the intervals are chosen uniformly and inde-

pendently at random, it follows from the Chernoff bounds that X≥rfor point y, w.h.p.

Since we only need to consider those points in [0,1)that are starting points or endpoints

of intervals I(v)in order to cover all subframes in [0,1)and the lowest values for E[X]

are reached when focusing on endpoints of intervals I(v)of nodes v∈Ws, the lemma

follows.

Recall that for each subframe f, we maintain one (and sometimes two) s

α×r-table Tf

of r·s

αslots which are organized into s

αgroups (or columns) of size reach (Figure 3.7).

We assume that α>0 is a sufficiently small constant with 1

α∈N. Our goal is to assign

the slots of Tfto nodes in fso that the following conditions are met:

Table conditions:

1. Every slot is assigned to (resp. owned by) exactly one node in C(f).

78 Random Allocation of Data Copies

2. Every node vin C(f)owns within (1±γ)µf(v)/αmany slots but at most s

αmany

slots in Tffor some constant 0 <γ<1 to be specified below.

3. Every group consists of slots belonging to different nodes.

The constant γwhich is sufficient to maintain the table conditions is given in the following

lemma. Moreover, in this lemma, αis the parameter used in the table size and βis the

parameter used in the Lemmas 3.46 and 3.47.

Lemma 3.49. If the bounds in Lemma 3.47 are true and αand βare sufficiently small

constants, then Conditions 1 and 2 can be met with any γ≥β

1−β+α.

Proof. Recall the bounds in Lemma 3.47 and ignore the O(log n)terms for a moment.

Then, in this proof, we consider the nodes in W`and Wsseparately, starting with W`.

Let Wh⊆W`be the subset of all nodes v∈W`with capacity cvlarge enough so that

it is guaranteed that µf(v)≥1

γfor γas chosen in the lemma. In this case, every v∈Wh

satisfies

(1−γ)µf(v)

α≤µf(v)−1

α≤js·r·cv

αk

and

(1+γ)µf(v)

α≥µf(v)+1

α≥ls·r·cv

αm.

Moreover, ds·r·cv

αe≤ s

α. Hence, each v∈Whcan be given either bs·r·cv

αcor ds·r·cv

αe

many slots without violating Condition 1 implying a slot assignment to the nodes in Wh

such that the total number of slots used by the nodes in Whis in

hjs·r·ah

αk,ls·r·ah

αmi

where ah=∑v∈Whcv. This is perfect up to an additive 1.

Next, consider the remaining nodes in the subset W0

`=W`\Whwith capacity cvsuch

that µf(v)<1

γ. For this, let C0

`(f) = C`(f)|W0

`be the multiset of nodes v∈C`(f)with

v<Wh. In this case, for each node v∈W0

`, it holds that

(1+γ)µf(v)

α<s

(given that sis sufficiently large). Thus, we do not have to worry about limiting the

number s

αfor the slots of v.

Let a0

`=∑v∈W0

`cv. Then, suppose for an upper bound on the number of slots per node

that |C0

`(f)|=s·r·a0

`−β|W0

`|(see Lemma 3.47). If a constant φ>0 is chosen such that

∑

v∈W0

`µf(v)

α+φ

α≥s·r·a0

α,(3.16)

3.5.2 The SPREAD Strategy 79

then it suffices to assign at most

µf(v)+φ

α+1

slots to every node v∈C0

`(f)to cover at least an a0

`-fraction of the slots in f. Furthermore,

since ∑v∈W0

`µf(v) = |C0

`(f)|, Inequality 3.16 above holds if and only if

|C0

`(f)| ≥ s·r·a0

`−φ|W0

`|,

so we have to choose φ≥βfor this.

For a lower bound, suppose that |C0

`(f)|=s·r·a0

`+β|W0

`|. If the constant φ>0 is

chosen so that

∑

v∈W0

`µf(v)

α−φ

α≤s·r·a0

α,

then at least µf(v)−φ

α−1

slots can be assigned to every node v∈C0

`(f)to cover at most an a0

`-fraction of the slots

in f. For this, we have to choose φ≥β. Together with the fact that µf(v)≥1 for all

v∈C0

`(f), it follows that γ=β+αis sufficient so that

µf(v)+β

α+1≤(1+γ)µf(v)

and

µf(v)−β

α−1≥(1−γ)µf(v)

α.

Thus, we are done with determining bounds on the slot assignment for the nodes in W`,

that is, for nodes vthat are guaranteed to have multiplicity µf(v)≥1 in C(f).

Finally, we consider the nodes in Ws, that is, nodes vwith µf(v)<1 that may not cross

every subframe f∈[0,1). Suppose for an upper bound on the number of slots per node

that |Cs(f)|= (1−β)s·r·as(see Lemma 3.47). If then constant φ>0 is chosen so that

∑

v∈Ws

(1+φ)µf(v)

α≥s·r·as

α,

it suffices to assign at most

(1+φ)µf(v)

α+1

slots to every node v∈Cfto cover an as-fraction of the slots in f. Since it holds that

∑v∈Wsµf(v) = |Cs(f)|, we have to choose φsuch that

(1+φ)(1−β)s·r·as

α≥s·r·as

α,

80 Random Allocation of Data Copies

which works for φ≥1

1−β−1=β

1−β.

For a lower bound, suppose that |Cs(f)|= (1+β)s·r·as. If then the constant φ>0 is

chosen so that

∑

v∈Ws

(1−φ)µf(v)

α≤s·r·as

α,

at least

(1−φ)µf(v)

α−1

slots can be assigned to every node v∈Cs(f)without exceeding an as-fraction of the slots

in f. For this, we have to choose φsuch that

(1−φ)(1+β)s·r·as

α≤s·r·as

which works for φ≥1−1

1+β=β

1+β. Hence, γ≥β

1−β+αis sufficient for both cases.

Finally, notice that there is this O(log n)term in the bounds in Lemma 3.47 that we

ignored so far. However, whenever this term is dominant for a set of nodes in some

subframe fj, then it holds that ∑v∈fjcv≤δfor some constant δ>0 that can be made

arbitrarily small depending on the constant σin the stretch factor s. In other words, since

we are considering only three sets of nodes (W`,W0

`,Ws), those sets of nodes where the

O(logn)term is negligible (and can therefore be covered by β) represent a total capacity

of at least 1 −2δwhich is sufficient to fill all slots just with these nodes, or leave enough

space for the other nodes, as long as δis sufficiently small compared to αand β.

If Conditions 1 and 2 are true, also Condition 3 can be met. To show this, consider

the slots to be numbered column-wise from 1 to r·s

αby giving slot (i,j)in group ithe

number (j−1)·s

α+i(c.f. Figure 3.7). Then, assign the slots to the nodes such that each

node v∈C(f)owns a consecutive sequence of slots. As every node owns at most s

αslots,

no group can have two slots owned by the same node, which proves our claim.

However, the challenge, of course, will be to maintain these conditions in case of sys-

tem changes caused by some change operation ωwithout rearranging too many slot as-

signments. Before we show how to do this, we prove that the given table conditions allow

us to maintain fairness.

Lemma 3.50. If the table conditions are met, then for every node v ∈[n]in the system

and every data item d ∈U, it holds that

Pr[v stores a copy of d]∈[(1−γ)(1−ε)r·cv,(1+γ)(1+ε)r·cv]

for the γ>0as given in Condition 2 and ε>0as chosen for the interval rounding.

3.5.2 The SPREAD Strategy 81

Proof. Consider any node v∈[n]and data item d∈U. Let p=|I0(v)|mod 1, where

I0(v)is the rounded form of I(v)(see Figure 3.9). For an area Ap= [xi,xj]⊆[0,1)of

size |Ap|=p, the multiplicity of vis µf(v) = d|I0(v)|e for all subframes f⊆Ap. For the

remaining area A0p⊆[0,1)of size |A0p|= (1−p), the multiplicity is µf(v) = b|I0(v)|cfor

all subframes f⊆A0p. Moreover, it follows from the table conditions that for any node v

in some subframe f,

Pr[vselected by a data item]∈(1−γ)µf(v)

s,min{(1+γ)µf(v),s}

s.

Hence, it holds for data item dthat

Pr[vstores a copy of d]≥p·(1−γ)d|I0(v)|e

s+(1−p)·(1−γ)b|I0(v)|c

= (1−γ)·|I0(v)|

s≥(1−γ)(1−ε)r·cv

Similarly, it holds that Pr[vstores a copy of d]≤(1+γ)(1+ε)r·cv, which implies the

lemma.

Since ε>0 and γ>0 can be made arbitrarily small, the lemma implies the fairness of

SPREAD. Next we show that SPREAD is also adaptive.

3.5.2.4 Amortized adaptivity.

We start with amortized adaptivity, i.e. we show how to perform updates so that move-

ments of data copies can always be charged to capacity changes in the past.

Note that SPREAD does not need to perform any adaptations of the slots as new data

items are added or old data items are removed from the system since Lemma 3.50 implies

that the data items in the system will remain fairly distributed among the nodes. Hence, it

remains to describe how to react to changes in the capacities of the nodes.

In what follows, we suppose that the capacities of the nodes change from (c1,...,cn)

to (c0

1,...,c0

n). In case that the number nof nodes in the system increases, then, for each

node identifier v∈Vthat has not been used before, we have to rearrange the subframe

layout in the [0,1)-interval in a way that some subframes may have to be cut into smaller

subframes that take over the tables of the previous subframes. Then, it follows that if the

previous subframes satisfied the table conditions, the new ones will also do so. After that,

our rounding conditions may require updates to these tables, which can be charged to

past capacity changes as for the other cases below. At last, we adapt the intervals I(v)to

the new capacity distribution (c0

1,...,c0

n). Clearly, this may cause changes in the multiset

C(f)of some subframe finducing required updates in its table (or tables). We call a

subframe f dirty if f⊆Fv, and furthermore, fcontains the endpoint of the interval I(v)

with |I(v)|≤|Fv|. Otherwise, it is called clean.

82 Random Allocation of Data Copies

In the following, we first describe how to update the table of a clean subframe, and then

we consider dirty subframes.

Updating a clean subframe f.Let C(f)be the multiset of nodes in fbefore the change

and C0(f)be the multiset of nodes in fafter the change in capacities. Furthermore,

let µC

f(v)denote the multiplicity of some node v∈C(f)and µC0

f(v)the multiplicity of

v∈C0(f). If then C0(f),C(f), we go through the following stages.

1. Pairing stage: Suppose that

∑

v∈C(f)

(µC

f(v)−µC0

f(v)) = δd

is the total decrease in the multiplicities of nodes in C(f), and

∑

v∈C(f)

(µC0

f(v)−µC

f(v)) = δi

is the total increase in the multiplicities of nodes in C(f). Then, we can identify

|δi−δd|pairs of nodes (v,w)where vwants to decrease its multiplicity whereas w

wants to increase its multiplicity. For each such pair, we set

µC0

f(v):=µC

f(v)−1 and µC0

f(w):=µC

f(w)+1

and then change slot assignments according to µC0

f(v)and µC0

f(w), respectively,

until table Condition 2 is satisfied for vand w.

For each such slot reassignment, we distinguish between three cases.

a) If both vand wviolate Condition 2, then a slot of vis given to w.

b) If only vviolates Condition 2, we give a slot of vto any node w0who can still

take a slot without violating Condition 2 (we will see below that such a node

w0can always be found, w.h.p.).

c) If only wviolates Condition 2, we give a slot from any node v0who can lose a

slot without violating Condition 2 to w.

For each slot x∈Tfgiven from some node uto some node u0, we use the following

slot switching strategy to preserve Table condition 3.

Switching strategy: If xbelongs to some group g∈Tfin which no other slot

is assigned to u0, we are done. Otherwise, there must be a group g0with no slot

assigned to u0since otherwise u0would have more than s

αslots at the end, violating

Condition 2. Since Condition 3 was true before the movement, there must be a slot

x0in g0that is assigned to some node u00 having no slot in g. Then, switch slots x

and x0among u0and u00 which repairs Condition 3.

3.5.2 The SPREAD Strategy 83

2. Movement stage: After the pairing stage, we only have nodes left that all want to

decrease or increase their multiplicities. We consider these node by node. For each

node vamong these, we determine µC0

f(v), and then either move slots to vor away

from vusing the slot switching strategy in the pairing stage, if necessary, until v

satisfies Condition 2.

Of course, it is not obvious that suitable slots can always be found for the reassignments

(besides the pairing stage in which the nodes vand wstill violate condition 2), but the

following lemma implies that this is possible. In it, C00(f)represents the intermediate

multiset during the process of moving from C(f)to C0(f).

Lemma 3.51. In any situation in which |C00(f)|is within |C(f)|and |C0(f)|, Conditions

1 and 3 are true, and at most one node violates Condition 2. But then, Condition 2 can be

repaired for it so that all table conditions are met, w.h.p.

Proof. For any node w∈C00(f), let swbe the number of slots whas in the table Tf. Let

vbe the node that is violating Condition 2. Then, veither needs additional slots or has to

give up slots. Suppose first that vneeds additional slots. In this case,

sv<(1−γ)µC0

f(v)

α.

As long as there is a node wwith

sw−1≥(1−γ)µC0

f(w)

α,

we can move a slot from wto vuntil

sv≥(1−γ)µC0

f(v)

α.

Then, we repaired Condition 2 for vwithout violating the Condition 2 for any of the other

nodes.

Suppose, however, that we reach a point in which sv<(1−γ)µC0

f(v)/αbut there is no

node wany more with sw−1≥(1−γ)µC0

f(w)/α. In this case, the total number of slots

occupied by the nodes in C00(f)is less than

(1−γ)µC0

f(v)

α+∑

w,v"(1−γ)µC0

f(w)

α+1#<1−γ

α∑

w∈C00(f)

µC0

f(w)+|C00(f)|

=1−γ

α+1|C00(f)|.

84 Random Allocation of Data Copies

If γ≥β

1−β+α, it follows from Lemma 3.46 that this is at most

1−2β

α(1−β)·|C00(f)| ≤ 1−2β

α(1−β)·(1+β)s·r

w.h.p. Furthermore, it holds that

(1−2β)(1+β) = 1−β−2β2<1−β,

thus, ∑wsw<s·r

α, which is a contradiction since all slots must be owned by a node at

any time. Hence, it will always be possible to reassign slots so as to repair Condition 2

for node vin this case.

Next, we consider the case that node vneeds to give up slots. In this case, vgets stuck

if sv>(1+γ)µC0

f(v)/αand for all other nodes w, it holds that sw+1>(1+γ)µc0

f(w)/α.

Then, the total number of slots occupied by the nodes in C00(f)is more than

(1+γ)µC0

f(v)

α+∑

w,v"(1+γ)µC0

f(w)

α−1#>1+γ

α∑

w∈C00(f)

µC0

f(w)−|C00(f)|

≥1

α(1−β)+1∑

w∈C00(f)

µC0

f(w)−|C00(f)|

≥1

α(1−β)·(1−β)s·r=s·r

w.h.p. This, however, is a contradiction. Hence, slots from node vcan be reassigned to

other nodes until Condition 2 holds for v.

Next, we bound the number of slot reassignments. A step in the movement or pairing

stage is defined as the process of fixing the table conditions after the multiplicity of a node

or pair of nodes has changed by 1.

Lemma 3.52. In each step of the pairing or movement stage, at most 2(1+γ)

αslots have to

be reassigned in order to repair the table conditions.

Proof. First, consider the movement stage. Suppose that the multiplicity of some node v

increases by 1, that is, µC0

f(v) = µC

f+1. We know that for its old multiplicity µC

f(v), it

holds that

sv≥(1−γ)µC

f(v)

α.

Hence, at most 1

αslots have to be moved to vto satisfy

sv≥(1−γ)(µC0

f(v))

α.

(3.17)

3.5.2 The SPREAD Strategy 85

Since each slot movement may require a flip with another slot to repair Condition 3, the

total number of slot reassignments is at most 2

αin this case.

For the case that the multiplicity µC

fof some node vdecreases by 1, at most 2(1+γ)

αslots

have to be reassigned. The worst case happens if µC

f(v)was previously 1 and vhad 1+γ

slots.

Next, consider a step of the pairing stage. Suppose that for node vthe multiplicity

µC0

f(v) = µC

f(v)−1 whereas the multiplicity of node wis µC0

f(w) = µC

f(w) + 1. Then,

vhas to give up at most 1+γ

αslots while whas to get at most 1

αslots in order to repair

Condition 2. In any step of repairing Condition 2 for vand/or w(vgives a slot to w,

or vgives up a slot, or wgains a slot), at most 2 slot reassignments are necessary, so

altogether, at most 2(1+γ)

αslot reassignments are needed to repair Condition 2 for the

nodes vand w.

Hence, given a total change in the multiplicities of the nodes by µ, at most 2µ(1+γ)

αslot

reassignments are necessary to get from C(f)to C0(f). Given kslot reassignments, the

probability that a specific copy of a data item d∈Uwith g1(d)∈fneeds to be replaced

is equal to

r·s

Thus, the expected number of copy movements is at most

2µ(1+γ)

r·s

α·|f|·r|D|=2(1+γ)µ|f|

s·|D|

where D⊆Uis the set of data items that are stored in the system. A change in multi-

plicities by µcan be charged to a capacity change of c(f)≥|f|µ

s·rwith respect to fin

the past, and the way we perform interval rounding makes it possible that every capacity

change is charged at most once. Thus, with respect to c(f), the expected number of copy

movements is at most

2(1+γ)s·r·c(f)

s·|D|=2(1+γ)c(f)·r|D|

According to Fact 3.41, a capacity change of c(f)requires the replacement of at least

c(f)·r|D|

2copies for the copy distribution to remain fair with respect to f. Hence, for

clean subframes, SPREAD is amortized 4(1+γ)-adaptive.

Updating a dirty subframe f.If fis dirty, we maintain two tables for f. One table, T1,

for the interval f1from the starting point of ftill the endpoint of I(v), and one table, T2,

for the interval f2from the endpoint of I(v)till the endpoint of f. The multisets C(f1)

and C(f2)are equal to C(f)with the difference that C(f1)contains a copy of node vwhile

86 Random Allocation of Data Copies

I(v)

h(v)

f1f2

Figure 3.10: A Dirty Subframe

C(f2)does not contain v. Our goal is to make sure that T1and T2differ in at most 2(1+γ)

slots, which we call the proximity condition. This is ensured by the following strategy.

Suppose that C(f)stays the same but the size of I(v)changes. Then, we only update f1

and f2accordingly and leave the tables T1and T2as before, which satisfies the proximity

condition. If C(f)only changes because the endpoint of I(v)enters ffrom below, then T2

inherits the table of f, and T1is obtained by applying slot reassignments for the node vto

table T2until the table conditions are met for f1. According to Lemma 3.52, this requires

the reassignment of at most 2(1+γ)

αslots, so the proximity condition holds afterwards. If

C(f)only changes because the endpoint of I(v)is moving below f, then T2is chosen as

the table for f. Similar solutions can be found if the endpoint of I(v)enters or leaves f

from above.

In all other cases, we first ignore a potential change in I(v), which means that the sizes

of f1and f2remain the same. We then adapt the table Tiof the interval fiof largest size

among f1and f2as described for the table of a clean subframe fto get from C(f)to

C0(f)(ignoring changes in C(f)due to I(v)entering or leaving f), and then we construct

the table of the other interval by performing at most 2(1+γ)

αfurther slot reassignments in

order to remove or add slots for node v. Afterwards, we update the sizes of f1and f2if

necessary (i.e. if I(v)has changed).

Next, we bound the adaptivity of SPREAD for dirty subframes. Suppose that C(f)

stays the same but the size of I(v)changes by some `in f(i.e. `|f|). Then, this can

be charged to a change in capacity of node vof c=`

s·r. Hence, the proximity condition

ensures that the expected number of copy movements is at most

`·2(1+γ)

r·s

α·r|D|=2(1+γ)c·r|D|

which implies that, with respect to c, SPREAD is 4(1+γ)-adaptive in this case.

3.5.2 The SPREAD Strategy 87

If C(f)only changes because the endpoint of I(v)enters ffrom below or is moving

below f, and this is associated with a change of I(v)of length `with respect to f, then it

follows analogously to the first case that SPREAD is 4(1+γ)-adaptive.

It remains to consider any remaining case. Let µ≥1 be the total change in multiplici-

ties in C(f)(ignoring the change caused by I(v)). W.l.o.g., suppose that |f1|≥|f2|. Then,

at most

2µ(1+γ)

slots are reassigned in T1and at most

(µ+2)2(1+γ)

slots are reassigned in T2. The latter bound holds because T2differs in at most 2(1+γ)

αslots

from T1before and after the reassignments.

Since a total change of µcan be charged to a capacity change of c(f)≥|f|µ

s·rwith

respect to fin the past, |f1| ≥ |f2|, and µ≥1, it follows from the arguments for clean

subframes that the expected number of copy movements due to these changes is at most

µ·2(1+γ)

r·s

+(µ+2)2(1+γ)

r·s

α!|f|

2·r|D| ≤ 2µ|f|·2(1+γ)

r·s

α·|D|

≤4(1+γ)s·r·c(f)

s·|D|

=4(1+γ)c(f)·r|D|

For any additional changes due to I(v), SPREAD is 4(1+γ)-adaptive. Hence, overall

SPREAD is amortized 8(1+γ)-adaptive for dirty subframes. Summing up the adaptivity

bounds over all subframes results in an amortized adaptivity of 8(1+γ).

3.5.2.5 Adaptivity.

In order to get from amortized adaptivity to adaptivity, we replace the deterministic round-

ing rules for the intervals above by a randomized rounding rule. More specifically, we

choose an additional (pseudo-)random hash function h0:V→[0,1), and for every in-

terval I(w)and subframe f= [x,y)in Fvwith v,w(or |I(w)|≥|Fw|), we check the

following:

Randomized rounding conditions:

1. Whenever the endpoint of I(w)crosses x+h0(v)

|f|mod 1 from below, µC

f(w)is in-

creased by 1.

2. Whenever the endpoint of I(w)crosses x+h0(v)

|f|mod 1 from above, µC

f(w)is de-

creased by 1.

88 Random Allocation of Data Copies

With this rule we obtain the following result:

Lemma 3.53. For any capacity change, SPREAD is 8(1+γ)-adaptive.

Proof. First, suppose that ndoes not change. Consider any capacity change in the system,

and for any node v, let ∆(I(v)) be the interval representing the difference between I(v)

before and I(v)after the change. Suppose that the starting point of ∆(I(v)) is in some

subframe fand the endpoint of ∆(I(v)) is in some subframe f0.

First, consider the case that ∆(I(v)) ⊆f, i.e., f=f0. Then, it is easy to check that

the probability that µC

f(v)increases or decreases by 1 is equal to |∆(I(v))|

|f|. We know from

above that an increase or decrease of a multiplicity by 1 in a subframe frequires the

replacement of an expected number of at most

4(1+γ)|f|

s·|D|

copies in the system. Since cvchanged by ∆cv=|∆I(v)|

r·s, this means that the expected

number of copies replaced due to node vis at most

|∆(I(v))|

|f|·4(1+γ)|f|

s·|D|=4(1+γ)∆cv·r|D|.

For ∆(I(v)) *f, similar arguments also yield that the expected number of copies replaced

due to vis at most 4(1+γ)∆cv·r|D|.

Now, let us consider that nincreases. Then, for any rounded I(v)with endpoint in

some subframe f= [x,y)before the increase, I(v)is only rounded again, applying the

new decomposition, if I(v)⊂I0(v)and the endpoint of I(v)passes x+h0(v)

|f|or y, or if

I0(v)⊂I(v)and the endpoint of I(v)passes xor x+h0(v)

|f|, which preserves our adaptivity

bound.

Combining all of the results in this section satisfies all given demands. Note that with

the help of Chernoff bounds the adaptivity bound can also be shown to hold w.h.p. (up to

minor order terms) if the total capacity change is ω(ϕlogn), where ϕis an upper bound

on the maximum size of a subframe in the system. Hence, the smaller the subframes, the

smaller will be the deviation from the expected number of replaced copies.

3.5.2.6 Variable number of copies per data item.

If the number of copies per data item varies but is upper bounded by r, then we just need

to slightly adapt our storage strategy. For any data item with r0≤rcopies, we first select

a group of rdistinct nodes as before and then store r0copies among r0of these nodes

by selecting a (pseudo-)random starting node vin the group (via some additional hash

3.6. Conclusion and Open Problems 89

function) and then storing copies at the subsequent r0nodes in the group (where we treat

the group as a ring). It is not difficult to show that this preserves all the properties shown

above for data items with redundancy exactly r.

3.6 Conclusion and Open Problems

In this chapter, we observed and analyzed strategies for obtaining a redundant and fair

distribution of normalized copies of data blocks among a given storage environment that

may be considered static or, perhaps, dynamic. We showed, that when allocating the data

items to disks of non-uniform capacity by some sequential balls-into-bins like algorithm,

this will lead to an imbalance in the probability distributions of the allocations for each

step t≥2. However, we presented an efficient data placement strategy, called COMB, that

overcomes this imbalance problem such that for ridentical copies of some data item, all

copies are redundantly stored on rdistinct disks of arbitrary capacity and furthermore, the

fairness condition is satisfied. Unfortunately, since COMB is not adaptive and thus only

works efficiently for static storage environments, after that, we introduced the adaptive

SPREAD strategy which is efficient, always yields a redundant and fair data layout and is

O(1)-adaptive for any change in the capacities of the disks in case of system scalings.

However, with respect to the observed allocation problem, the following open problems

remain to be addressed in some future work.

•Determine (by some function) the shift in the assigned weights of the r-tuples when

moving from a uniform to a non-uniform data placement.

•Investigate the given data redundant allocation problem (and thus, all presented

algorithms) also for weighted data blocks.

•Find a multiple-choice approach for the standard balls-into-bins problem consider-

ing non-uniform capacities.

•Modify the r-tuple placement of the COMB strategy by the multiple-choice ap-

proach (and show (perhaps better) deviation bounds).

•Prove Conjecture 3.36.

CHAPTER 4

Erasure Codes for Reading and Writing

We now switch to allocation strategies that provide good results but only for classical

storage environments (c.f. Section 1.1.2) and that use erasure (resilient) encoding to

ensure fault-tolerance in a storage network. In general, erasure coding is a technique

that has widely been employed for achieving high data availability and reliability in stor-

age networks (e.g. [PGK88, PT03, Pla97, REG+03]) and communication systems (e.g.

[ABB+97, NB97, BLM02, BLMR98, Riz97]).

From a general perspective, the main difference to replication-based strategies is that

with erasure coding, the original information is no longer stored separately from the re-

dundant data (copies) but is mixed up by the encoding yielding a codeword that contains

the desired degree of redundancy. More specific, an erasure (resilient) code maps a word

xof ksymbols drawn from some finite set Σ, which is referred to as the code alphabet,

into a codeword y of n>ksymbols from the same alphabet; the nsymbols of yare stored

separately on ndistinct disks in the storage network. If then some disks fail for reading, in

the optimal case, any ksymbols from yare sufficient to recover x. That is, such codes can

tolerate up to n−kerasures which may be caused by failed, respectively temporarily not

accessible disks (throughout this chapter, we refer to xas an information word or message

of fixed size k, according to the erasure channel model introduced by Elias [Eli55]).

Although various erasure codes exist, the special set of codes we consider in the fol-

lowing are those for RAID-like storage systems such as RAID-arrays and SANs (e.g.

[PGK88, PT03, Pla97, REG+03]) and whose most prominent representatives are parity

codes like RAID and EVENODD [PGK88, BBBM94] or Reed-Solomon codes [HP03,

Pla97] (we assume the reader to be somewhat familiar with basic properties of erasure

codes in order to limit the scope of this thesis. However, a good overview can be found

in e.g. [HP03, Rom96]). In contrast to other applications, RAID-like storage systems are

much more sensitive concerning the size of the codeword because the separated storage

of the code symbols induces additional storage costs. Hence, the codeword size has great

influence on the required storage capacity but which is of less overhead than given with

92 Erasure Codes for Reading and Writing

replication, and this, clearly, is the very great advantage of erasure encoding in storage en-

vironments. Again, with replication, one has to cope with a storage overhead of a factor of

r−1 that depends on the strict separation of the original and the redundant data (copies).

Instead, with erasure coding, redundancy is directly encoded into the data symbols, and

the ratio of original information to total code symbols is called the (information) rate of

the code rc=k

n<1 which imposes a storage overhead (or stretch factor) sc=1

rc=n

Surely, it should hold that sc<2k(otherwise, one would be better off using replication

because of its simplicity). The least storage consumption of such codes is given by parity-

based schemes that (considering n=k) require a factor sc=n+1

n(RAID) and sc=n+2

(EVENODD). Additionally, since basing on simple XOR-operations, those schemes are

highly suitable for hardware realization. Unfortunately, parity codes can merely tolerate

a very small number of erasures at a time, what is often insufficient, especially in large-

scale SANs. In that case, more complex codes, like Reed-Solomon codes, are applied that

provide improved fault-tolerance.

However, compared to replication, the applied erasure codes lack of some substantial

drawbacks, that are (a) to almost be applicable in uniform storage environments only,

(b) to suffer from high complex algebraic operations for encoding/decoding, and (c) to

lack of the ability to adapt to changing storage environments, but most importantly, all

such codes suffer from a negative write and update behavior, respectively, because most

of them strictly focus on providing an (almost) optimal data recovery. That is, for any

k-sized information word xand corresponding codeword y, those codes aim at recovering

the data from up to n−kerasures (what is an optimal recovery behavior). Unfortunately,

this optimal recovery property induces negative update behavior since, in particular, yis

generated as a function of the original symbols implying that on every change of x, the

complete codeword must also be updated, what is dismal. This overhead is widely known

as write-penalty. Moreover, in a SAN, the write procedure is no longer atomic but rather

describes a sequence of write operations which usually involves writing to multiple disks.

These concurrent writes may then induce inconsistencies but which can, for instance, be

reduced if only a small fraction of all parity symbols must be rewritten. Some approaches

to reduce the update complexity are given by e.g. X-codes [XB99] but, however, such

codes are limited to recover from any two simultaneous erasures only.

In the following, we overcome the update problem by the Read-Write-Coding System

(RWC), a very flexible class of erasure codes that are closely related to Reed-Solomon

codes but offer enhanced flexibility and improved update behavior. However, before dis-

cussing the RWC system in more detail, we first require some brief introduction to linear

block codes, the class of codes to which our Read-Write codes belong. To bound the

scope of this thesis, we suppose the reader to be familiar with basics from linear algebra,

number theory like e.g. modular arithmetic, and abstract algebra (e.g. group theory and

the theory of finite fields) (see e.g. [HP03, Rom96, LN86] for overviews).

4.0.1 Preliminaries 93

4.0.1 Preliminaries

Throughout this chapter, we address so called linear block codes which have useful prop-

erties for the storage systems we refer to. In particular, we consider static RAID-like

storage systems that (a) are supposed to consist of a fixed sized set [n]of disks for which

furthermore a fixed allocation stripe size can be defined, and (b) for which, driven by me-

chanical components, access latency to the storage devices is magnitudes higher than the

required time for encoding or decoding the data. With respect to the first issue, usually in

those storage systems, so-called block codes are utilized in which all codewords have the

same fixed length nand that have good potential to meet the desired disk utilization (in

what follows, we state state some important facts about linear codes in very brief without

proving the listed theorems; the proofs can be found in the references mentioned above).

Definition 4.1. (Block Code) Let Σbe an arbitrary finite alphabet with |Σ|≥2. Then, an

[n,m]-block code over Σis an m-elementary subset C ⊆Σnwith m ≥1. The elements of

C are referred to as codewords, all of which have fixed length n, and the number |C|=m

of codewords in C is the size of the code.

Now, to be more specific, the special class of block codes we concentrate on in what

follows are linear codes which are defined by the means of linear algebra and thus, in-

duce an easier description/analysis in general and with respect to the encoding/decoding

operations. In particular, linear codes are codes over finite fields, that is, for a linear

code, Σ=Fqand Fqis a finite field with qelements. Let Fk

qdenote the k-dimensional

vector space of all k-vectors over the finite field Fq. Then, our k-sized messages are vec-

tors x= (x1,x2,...,xk)∈Fk

q, and we refer to Fk

qas the message, respectively information

space. Furthermore, to introduce redundancy, we consider the following embedding of

the vector space Fk

qinto the n-dimensional vector space Fn

q,n>k.

Definition 4.2. (Linear Code) Consider the following injective linear mapping

γ:Fk

q→Fn

which defines a linear embedding of the message space Fk

qinto Fn

q. Then, the image

C=γ(Fk

q)is a k-dimensional subspace of Fn

qthat is isomorphic to Fk

q, and to which we

refer to as linear (n,k)-code, or simply (n,k)-code, of size |C|=qk.

It can be shown that every linear (n,k)-code Cover the field Fqis an (n,qk)-block code

over the alphabet Σ=Fq. The parameter nis called the block length and kis the dimension

of the code.

The isomorphism between Fk

qand the (n,k)-code C=γ(Fk

q)is usually described by

ak×n generator matrix Γof rank kover the field Fqwhose rows define a basis for C.

Thus, we obtain

C=γ(Fk

q) = {vΓ|v∈Fk

q}.

94 Erasure Codes for Reading and Writing

For any set of kindependent columns of Γ, the corresponding set of coordinates forms an

information set for C. The remaining r=n−kcoordinates are termed a redundancy set

and ris called the redundancy of C. If the first kcoordinates form an information set, the

code has a unique generator matrix of the form (Ik|A)where Ikis the k×kidentity matrix

and we denote Γto be in standard form as depicted in the following example.

Example 4.3. The matrix Γ= (I4|A), where

Γ=





1 0 0 0 0 1 0

0 1 0 0 1 0 1

0 0 1 0 1 1 0

0 0 0 1 1 1 1







is a generator matrix in standard form for a (7,4)binary code.

The overall complexity of the encoding v→vΓis given by the complexity of the multi-

plication vΓwhich is O(n·k)assuming O(1)-time for addition/multiplication in Fq.

We now introduce a useful metric on Fn

qwhich, besides the block length and the di-

mension, is the third important parameter to determine the performance of a code.

Definition 4.4. (Hamming Metric) Consider the function d :Fn

q×Fn

q→Ndefined as

d(w,w0):=|{i|wi,w0

i}| ≥0

and which satisfies the following properties:

1. (non-negativity) d(w,w0)≥0for all w,w0∈Fn

2. d(w,w0) = 0if and only if w =w0,

3. (symmetry) d(w,w0) = d(w0,w)for all w,w0∈Fn

4. (triangle inequality) d(w,w00)≤d(w,w0)+d(w0,w00)for all w,w0,w00 ∈Fn

Then, d defines the so-called Hamming metric on Fn

q, and d(w,w0)is called the Hamming

distance between the vectors w,w0∈Fn

q. The pair (Fn

q,d)is referred to as a metric space,

called the Hamming space of dimension n over the alphabet Fq.

Given the Hamming distance, we can now define the minimum distance of a code Cwhich

measures the difference between most similar codewords and that is an important invariant

determining the error-correcting capability of C.

Definition 4.5. (Minimum Distance) Let C be an (n,k)-code with |C| ≥ 2. Then, the

minimum distance d(C)between distinct codewords is

d(C)) :=min{d(y,y0)|y,y0∈C,y,y0}.

4.0.1 Preliminaries 95

Since y,y0, it follows that d(C)≥1. If the minimum distance of an (n,k)-code is known,

we let d:=d(C)and refer to the code as an (n,k,d)-code. Then, we can state the follow-

ing important theorem on the erasure correction capability of a linear code.

Theorem 4.6. With an (n,k,d)-block code C, it is guaranteed that, after sending any

codeword yi∈C over some noisy channel, yican completely be corrected if the number

of erasures in yiis at most d −1.

Proof. Suppose by the way of contradiction that, after sending yi, a word wis received

containing e≤d−1 erasures. Then, the pattern of n−ecorrect symbols in wmatches yi,

thus, the Hamming distance d(w,yi)≤d−1. If then there is another codeword yjwith

d(w,yj)≤d−1, it follows that d(yi,yj)≤d−1 because yiand yjcan only differ in the

e≤d−1 coordinates. But this is a contradiction since according to Definition 4.5, there

are no two codewords yi,yjwith d(yi,yj)≤d−1.

Therefore, the higher the minimum distance dof some (n,k,d)-code, the more erasures

the code can tolerate. With respect to the block length, we call the ratio δ(C) = d

nthe

relative distance of the code which is a measure of the error-correcting capability of the

code relative to its length. This ratio holds for any code (linear or not) of length nthat

has minimum distance d. Now, recall the information rate rc=k

n<1 as given in the

introduction of this chapter. It holds that the higher the rate the higher the proportion of

coordinates in a codeword actually containing information rather than redundancy, what

makes the code less fault-tolerant. On the other hand, a high rate implies low storage

overhead. As we can see, there is a trade-off between the rate and the relative distance

when applying linear block codes in storage environments, and the major goal in coding

theory is to find codes that have high rate as well as high relative distance (that is, both

parameters should be close to 1). With respect to this objective, the following theorem

states an important upper bound on the parameters (n,k,d).

Theorem 4.7. (Singleton Bound) For every linear (n,k,d)-code over Fq, it holds that

d≤n−k+1.

Therefore, according to the Singleton Bound, optimal codes are those for which equal-

ity d=n−k+1 holds. Such codes are called maximum distance separable codes (MDS

codes) which can be defined as follows.

Definition 4.8. (MDS Code) A linear (n,k)-code C is an MDS-Code if in each generator

matrix Γfor C, any k columns are linearly independent.

Corollary 4.9. A linear (n,k)-code C is an MDS-Code if and only if any k components

define an information set.

96 Erasure Codes for Reading and Writing

It holds that no code of length nand minimum distance dhas more codewords than an

MDS code with parameters nand d. Trivial MDS codes, for instance, are the (n,n−1,2)-

parity codes in which a parity symbol is added to an information word of length k=n

representing the parity of the sum of all ninformation symbols as well as codes with

parameters (n,1,n),(n,n,1)who exist over every field Fq(c.f. Section 4.6).

Definition 4.8 already emphasized that the recovery capability of a linear code depends

on the invertibility of the generator matrix Γ, thus, the ability to recover the information

word from the codeword always requires an invertible k×ksub-matrix of Γ. In the fol-

lowing definition, we introduce the prominent Vandermonde matrix in which any s×s

sub-matrix is invertible for some given s. Hence, the Vandermonde matrix has very useful

properties for data recovery in linear block codes.

Definition 4.10. (Vandermonde Matrix) Let α1,...,αsbe elements in a field F. The

s×s matrix V = [vi,j], where

vi,j=αj−1

i=





1α1α2

1... αs−1

1α2α2

2... αs−1

1α3α2

3... αs−1

.....

1αsα2

s... αs−1







is called a Vandermonde matrix.

Lemma 4.11. The determinant det V=∏1≤i<j≤s(αj−αi). In particular, det V,0if the

elements α1,...,αsare distinct.

That is, Vis nonsingular, and therefore, it is invertible.

4.0.2 Data Reconstruction from Failures

Besides providing an efficient computation of the algebraic operations used for encoding,

the major goal when designing erasure codes for storage-based application is to ensure a

more or less optimal data reconstruction in case of occurring erasures. In todays RAID-

like storage systems, whenever obtaining high fault-tolerance is the major design focus

(rather than efficient computation as given with parity codes), advanced MDS codes, like

e.g. Reed-Solomon codes, are almost applied which can reconstruct the information from

up to n−kerasures as described in the following.

Consider the information vector x= (x1,x2,...,xk), and let Γbe an n×kgenerator

matrix. Then, a linear (n,k)-code Cis represented by the following equation:

Γx=y.

4.0.2 Data Reconstruction from Failures 97

Furthermore, we can define functions Γ1,Γ2,...,Γnwhere each Γidenotes a linear com-

bination of the information word that works on a word-by-word basis for generating the

code symbol yi:

yi=Γi(x1,x2,...,xk) =

∑

j=1

xjγi,j

In other words, the functions Γiare the rows of Γfrom which by multiplication with the

vector xthe complete codeword yis obtained. If we furthermore define Γto be the n×k

Vandermonde matrix, that is, γi,j=αi−1

j(Definition 4.10) with pairwise distinct elements

α1,...,αk∈Fq, the equation Γx=ybecomes:







γ1,1γ1,2... γ1,k

γ2,1γ2,2... γ2,k

.....

γn,1γn,2... γn,k

























1 1 1 ... 1

α1α2α3... αk

.....

1αn−1

2αn−1

3... αn−1



















=











Assuming that kcomponents of yare available at the receiver, the information vector

can be reconstructed by using any kequations that correspond to the known components

of yand which form an invertible k×ksub-matrix Γ0. Surely, the extracted matrix Γ0is

only invertible if the kchosen equations are linearly independent (but this is always the

fact with a Vandermonde matrix).

Now, we describe a useful modification of the encoding that simplifies the data recon-

struction in case one excepts rather few erasures. For this, we consider a code whose

codewords yinclude a verbatim copy of the information blocks, and which we call sys-

tematic. This corresponds to including the identity matrix Ikin Γas already shown in

Example 4.3. We obtain the matrix ΓS= (Ik|Γ)and a codevector yS= (x|y)leading to

the following equation (ΓSx=yS):







1 0 0 ... 0

0 1 0 ... 0

.....

0 0 0 ... 1

1 1 1 ... 1

α1α2α3... αk

.....

1αn−1

2αn−1

3... αn−1































98 Erasure Codes for Reading and Writing

The reconstruction of the information word xfrom any kavailable blocks of the code-

word ySis depicted in Figure 4.1.

1 0 0

0 1 0

0 0 1

y x

Encoder Decoder

1 0 0

0 1 0

0 0 1

Figure 4.1: Encoding/Decoding in matrix form for a systematic code (the top krows

constitute the identity matrix Ik). y0

Sand Γ0

Scorrespond to the grey areas of the vector and

the matrix on the right hand side.

Reconstructing xfrom any combination of up to ksymbols from yis always possible

since Γis a generator matrix and thus, any valid codeword yis a linear combination of

its columns. In particular, as Γhas rank k, always any krows are linearly independent

imposing the linear mapping x→Γxto be injective, thus, xcan uniquely be determined

from its corresponding codeword y. Moreover, the system Γ0y0=xhas a unique solution

which can be obtained by Gaussian Elimination. Hence, xcan be reconstructed from any

subset of kcode symbols because each code symbol conveys information on all the k

source blocks (the ratio of this fraction of information is given by the rate).

However, any code that is able to reconstruct xfrom up to n−kerasures suffers from a

bad update behavior. In particular, if one information symbol changes from xito x0

i, any

codesymbol yimust also be modified. This can be effected by subtracting out the portion

of the checksum word that corresponds to xi, and adding the required amount for x0

ias

follows:

i=Γi,j(xi,x0

i,yi) = yi+γi,j(y0

i−yi).

Surely, both calculation and maintenance of the symbols yican be done by simple arith-

metic but which are operations in the underlying finite field Fqwhich may be of different

complexity depending on the field chosen. Furthermore, if any of the disks keeping yi

is not accessible, there is no chance to store the modified codeword (except merely the

plain information word if a systematic code is applied such as given, for example, with

the RAID 4/5 encoding).

4.0.2 Data Reconstruction from Failures 99

Outline of this Chapter

In order to face the negative update behavior that is inherent to usual linear blocks codes

and which turns to be pretty costly when such codes are applied in RAID-like storage en-

vironments, we introduce the Read-Write-Coding-System (RWC) – a very flexible frame-

work for generating different linear block codes, called Read-Write (RW) codes in the

following, which feature enhanced update properties for given codewords by simultane-

ously offering different degrees of fault-tolerance.

In particular, as we have seen, usual linear block codes merely consider a k-sized infor-

mation vector x, a codeword yof fixed length n, and some linear function γrealizing the

mapping from xto y. Instead, an RWC defines further parameters k≤r,w≤nwhich offer

enhanced possibilities to adjust the redundancy, and thus, the fault-tolerance capability of

an RW code. In the language of coding theory, for any fixed r, an RWC provides linear

(n,r,d)-codes over some finite field Fqthat have (a) minimum distance d=n−r+1

(thus, are MDS codes if r=k), and (b) any two codewords are within a distance of at

most wfrom each other. More specific, an RWC generates appropriate subcodes of Reed-

Solomon (RS) codes (see e.g. [HP03] for details on RS codes) of dimension rand length

nin which every codeword has distance w. Depending on the values r,wand the field Fq

chosen, different block codes can be generated, e.g. parity codes (if q=2).

The ensured degree of redundancy mixed up with the improved update behavior offered

by an Read-Write code provides significant benefits for the observed storage systems

which, driven by the application’s read and write behavior, on one hand, suffer from

very costly I/O operations, and on the other hand, have to ensure some defined level of

fault-tolerance at any time. Clearly, a Read-Write code provides best improvements for

write-intensive applications because, given an n-sized codeword yand parameters r,w, it

can decode the information from any rsymbols of ywhereas only any wsymbols of y

must be updated whenever the information word xchanges completely (recall that we can

choose w<n). This is different to other linear codes, like e.g. Reed-Solomon codes,

which always must rewrite the codeword ycompletely as xchanges.

An example. Consider a RAID 4-parity code with n=4 hard disks storing a data file

bit by bit, Σ={0,1}. We encode k=3 bits x1,x2,x3to symbols y1=x1,y2=x2,y3=x3

and y4=x1+x2+x3, where addition denotes the XOR-operation and the code symbols yi,

1≤i≤4, are stored separately on distinct disks. The XOR-operation enables to recover

the original three bits from any combination of r=3 hard disks, e.g. giving y2,y3,y4

we have x1=y2+y3+y4,x2=y2, and x3=y3. Thus, if any one disk is temporarily

not available, reading data is still possible. However then, writing data is not possible

since a complete change of the original information involves the change of the entire

code; we call this code consistency (this also holds for any other erasure code applied in

storage environments). The second example shows an RW-code. Again, consider n=4

100 Erasure Codes for Reading and Writing

Contents Code Line

x1x2y1y2y3y4v

0 0 0 0 0 0 0

0 0 1 1 1 1 1

0 1 0 1 0 1 0

0 1 1 0 1 0 1

1 0 0 0 1 1 0

1 0 1 1 0 0 1

1 1 0 1 1 0 0

1 1 1 0 0 1 1

Table 4.1: A(2,3,3,4)2-Read-Write-Code for contents x1,x2and code y1,y2,y3,y4. Ev-

ery information vector has two possible code words. Even if only three of the four code

words are available for reading and writing, the system can perform read and write oper-

ations (see Figure 4.3 for the encoding).

hard disks with code bits y1,y2,y3,y4. Now, we encode k=2 information bits x1,x2

such that any r=3 code bits yi,yj,ykcan be used to recover the original message, and

furthermore, only any of such w=3 code bits yi0,yj0,yk0need to be changed to encode a

completely new information. For instance, start with the codeword (0,1,1,?). According

to Table 4.1, the information is (1,1)and therefore the complete code is (0,1,1,0). Now,

we want to encode (0,1)without changing the second entry. For this, we choose line 0

for information (0,1)and get code (0,1,0,1).

Moreover, RW codes can exploit system information about existing erasures, which are

caused by failed or blocked disks and that have rather long-term character in a SAN, for

encoding and decoding. For instance, consider a codeword ywith symbols stored on n

disk from which b≤n−wdisks are unreachable (e.g. failed or blocked). Then, using

an RW code, ycan still be updated to the codeword y0in a code consistent manner. Fur-

thermore, if then some of the formerly blocked disks become available again while some

other b0≤n−rdisks turn to be unreachable, we can still recover the new information

word x0by simply selecting any rof the remaining n−b0accessible disks. Therefore, as

long as sufficient disks are accessible, an RW code provides code consistent operations

by circumventing blocked disks.

At last, some RW codes offer the possibility to change any of the parameters k,r,w,n

during runtime, that is, a (k,r,w,n)-RW code can be changed to (nearly) any choice of

(k0,r0,w0,n0)giving such codes the ability to adapt to changing system conditions.

From now on, we denote a (k,r,w,n)b-Read-Write code a code which is given a k-

symbol message xthat is encoded into an n-symbol codeword ywith symbols drawn from

4.1. The Operational Model 101

some b-symbol alphabet Σ. Furthermore, we consider parameters rfor recovering the

information word and wfor modifying a given codeword y, with k≤r,w≤n. In the next

section, we state the operations of the RWC formally. After that, we prove general bounds

for the parameters of RW codes and present a general scheme to generate (k,r,w,n)b-RW

codes as long as k+n≤r+wholds for an appropriate choice of b. Then, we introduce

adaptive RW codes for which any of the given parameters can be subject to changes. At

last, we investigate the question for which choices of (k,r,w,n)a coding system exists

over the binary alphabet F2={0,1}and discuss how RW codes can be combined. At

last, notice that the following description is given in more operation-based terms rather

than conceptual since RW codes base on the same well-studied algebraic principles as RS

codes.

4.1 The Operational Model

Before we now introduce the concepts of the Read-Write codes in more detail, we first

present the operations assigned to a Read-Write Coding System. Again, Read-Write

codes encode information words into codewords. The information is given by a k-tuple

over some finite alphabet Σ, and since Read-Write codes are linear block codes, the code-

word is an n-tuple over the same alphabet, for k<n. For what follows, let b=|Σ|and

P(M)denotes the power set of some set M. Moreover, let P`(M):={S∈P(M)| |S|=`}.

Then, a (k,r,w,n)bRead-Write-Coding-System (RWC) provides the following opera-

tions.

1. Initial state x0∈Σk,y0∈Σn

This is the initial state of the system with information x0and codeword y0. This

state is crucial because all further operations ensuring the beneficial features of an

RW code depend on this initial state.

2. Read function f:Pr([n])×Σr→Σk

This function reconstructs the information by reading rsymbols of the codeword

whose positions are known. The first parameter shows the positions (indices) of

the symbols in the code, and the second parameter gives the corresponding code

symbols. The outcome is the decoded information.

3a. Write function g:Pr([n])×Σr×Σk×Pw([n]) →Σw

This function adapts the codeword to a changed information by changing wsymbols

of the codeword at wgiven positions. The first two parameters describe the reading

of the original information. Then, we have the new information as a parameter, and

the last parameter indicates which code symbols to change in the codeword. The

outcome are the values of the new wcode symbols.

102 Erasure Codes for Reading and Writing

3b. Differential write function δ:Pw([n])×Σk→Σw

This is a restricted alternative to the write function whose parameters are the po-

sitions Sof symbols available for writing as well as the difference of the original

information xand the new information x0, but without reading the wcode entries.

The outcome is the difference of the available old and the new codeword symbols.

Thus, for two functions ∆1:Σk×Σk→Σkand ∆2:Σw×Σw→Σwand wgiven po-

sitions from y, we can describe the write function gabove by the differential write

function as

y0=∆2(y,δ(S,∆1(x,x0))).

All RW codes presented here have such differential write functions where ∆1,∆2

denote the bit-wise XOR-operations. The goal is that e.g. a controller in a storage

device ican, by itself, update its kept block yiby simply adding (XOR) the received

difference γof yiand y0

i, i.e. y0

i=yi+γ, if, for example, the device is blocked

between reading the old and writing the modified block.

For a tuple y= (y1,...,yn)and a subset S∈P`([n]), let CHOOSE(S,y)be the tuple

(yi1,yi2,...,yi`)where i1,...,i`are the ordered elements of S. Furthermore, for an `-tuple

d, let SUBST(S,y,d)be the tuple where according to Seach indexed element yi1,yi2,...,yi`

of yis replaced by the element taken from dsuch that CHOOSE(S,SUBST(S,y,d)) = d

and all other elements in yremain unchanged in the outcome.

Now, for S0∈Pr([n]), define the read operation

Read(S0,y):=f(S0,CHOOSE(S0,y))

and for S∈Pw([n]) and x0∈Σk, the write operation

Write(S,S0,y,x0):=SUBST(S,y,g(Read(S0,y),x0,S)).

Since any Read-Write code needs to start at some initial state, we define the set of possible

codewords Yas the transitive closure of the function y7→Write(S,S0,y,x0)starting with

y=y0and allowing all values S,S0,x. Then, an RW code is correct if the following

statements are fulfilled.

1. Correctness of the initial state: ∀S∈Pr([n]):

Read(S,y0) = x0.

2. Consistency of read operation:∀S,S0∈Pr([n])

∀y∈Y:Read(S,y) = Read(S0,y).

3. Correctness of write operation:

∀S∈Pw([n]),∀S0∈Pr([n]),∀y∈Y,

∀x∈Σn:Read(S0,Write(S,S0,y,x))=x.

4.2. Lower Bounds 103

Before we continue explaining the inherent mechanisms of an RWC in order to prove

the operational statements above more formally, we first state some lower bounds on the

parameters of an RWC that must be satisfied for the existence of RW codes at all.

4.2 Lower Bounds

The example of a (2,3,3,4)2-RW code in the previous section stores two symbols of

information in a four symbol code (c.f. Table 4.1). Unfortunately, this storage overhead

of a factor two is unavoidable, as the following theorem shows (moreover, this implies

that e.g. no (3,3,3,4)b-RWC exists).

Theorem 4.12. For parameters r +w<k+n or r,w<k and any base b, there does not

exist any (k,r,w,n)b-RWC.

Proof. Consider a write operation and a subsequent read operation where the index set W

of the write operation (|W|=w) and the index set Rof the read operation (|R|=r) have an

intersection: W∩R=Swith |S|=r+w−n. Then, there are bkpossible change vectors

with symbols in Sthat need to be encoded by the write operation since this is the only

base of information for the subsequent read operation. This holds because all further R\S

code symbols remain unchanged. Now, assume that |S|<k. Then, at most bk−1possible

changes can be encoded, and therefore, the read operation will produce faulty outputs for

some write operations. Thus, r+w−n≥kand the claim follows.

If r<k, only brdifferent messages can be distinguished while bkdifferent messages

exist. Then, from the pigeonhole principle, it follows that such a code does not exist. For

the case w<k, this is analogous.

From Theorem 4.12, it follows that in the best case (k,r,w,n)b-RW codes have param-

eters r+w=k+n. We call such RW codes perfect. Unfortunately, such perfect codes do

not always exist for any choice of parameters as the following lemma shows.

Lemma 4.13. There is no (1,2,2,3)2-RW code.

Proof. Consider a read operation on the code bits y1,y2and a write operation on y2,y3.

Then, y2is the only intersecting bit which must be inverted in case of an information bit

flip. The same holds for bit y3when considering a read operation on y1,y3and a write

operation on y2,y3. Thus, together, both y2and y3have to be inverted if the information

bit x1flips. Now, consider a sequence of three write operations on bits (1,2),(2,3),(1,3)

each inverting the information bit x1. After these operations, all code bits have been

inverted twice bringing it back to the original state. In contrast, the information bit has

been inverted thrice and is thus inverted. Therefore, all read operations lead to wrong

results.

104 Erasure Codes for Reading and Writing

Contents Code Line

x y1y2y3v

0 0 0 0 0

0 1 1 1 1

0 2 2 2 2

1 0 1 2 0

1 1 2 0 1

1 2 0 1 2

2 0 2 1 0

2 1 0 2 1

2 2 1 0 2

Table 4.2: A(1,2,2,3)3-Read-Write code for an information word xand codeword ycon-

sisting of symbols y1,y2,y3. For every information, there are three possible codewords,

and if only any two of them three are available for reading and writing, the system can

perform read and write operations (see Figure 4.4 in the next section for the encoding).

However, if we allow a larger symbol alphabet, we can find an RW code.

Lemma 4.14. There exists a (1,2,2,3)3-RW code.

Proof. See Table 4.2 for an example. The correctness is straight-forward.

Clearly, concerning operational complexity, b=2 (i.e. F2) is the best choice for codes

applied in SANs because XOR-based I/O operations can often efficiently be realized

in hardware. However, as common RAID 4/5 schemes as well as parity-based Reed-

Solomon codes correspond to an (n,n,n+1,n+1)2-RW code, n≥1, the following lemma

shows that there is no parity-based placement scheme offering better update properties.

Lemma 4.15. For n ≥1, there is no (n,n,n,n+1)2-RW code.

Proof. The proof follows directly from Theorem 4.12.

4.3 Encoding/Decoding in the RWC

Now, we know that as long as the symbol alphabet is chosen sufficiently large, perfect RW

codes always exist, and in the following, we show how perfect RW codes are generated

by describing its encoding and decoding process in terms of vector arithmetic (surely,

we always assume the symbol alphabet (i.e. the basis) to be sufficiently large whereas

throughout this chapter, we use the terms basis and symbol alphabet interchangeably).

4.3. Encoding/Decoding in the RWC 105

Fqk

Fqr

Fqn

γ(Fqk)

ζ(Fqr)

γ:Fqk

→

Fqn

ext:Fqk

→

Fqr

x‘0

y‘0

x‘1

y‘1

ζ:Fqr

→

Fqn

∆x

+ ∆ v

Figure 4.2: Illustration of Encoding/Decoding RW Codes

As already mentioned, Read-Write codes are closely related to Reed-Solomon codes

and can thus be described by the same methods (i.e. by polynomials over finite fields).

Nevertheless, in this thesis, we rather use a vector-based description that is better suited

to sketch the encoding/decoding operations. In particular, for given information tuples

x= (x1,...,xk)∈Σkwhose base bis a sufficiently large finite field Fqand additional

parameters k≤r,w≤n, we examine special subcodes of larger Reed-Solomon codes that

have dimension rand length nand in which every codeword y= (y1,...,yn)∈Σnhas

distance at most w(Figure 4.2 depicts the generation of an RW code).

Every RW encoding starts with an initial information word x0generating an initial

codeword y0. As usual with linear codes, since we have chosen some finite field Fqfor the

basis, we can consider x0to be a vector in the vector space Fk

q. In a usual linear encoding,

x0would then be mapped directly into the vector space Fn

qof dimension n>kby some

linear function γ:Fk

q→Fn

qyielding y0. More particularly, the resulting codeword y0lies

in some k-dimensional subspace γ(Fk

q)of Fn

qthat is isomorphic to the vector space Fk

and as already illustrated in Section 4.0.2, if we suppose an MDS code, optimal recovery

from up to n−kerasures can then be achieved (recall that such codes have minimum

distance d=n−k+1). However, as we also stated already, whenever the information

word x0changes to x1, all ncode symbols of y0must be modified implying a distance

d=nbetween any two codewords in γ(Fn

q), what, again, is dismal in a SAN.

106 Erasure Codes for Reading and Writing

To improve the update behavior of linear codes such that, at most, w<nsymbols suf-

fice to encode a complete information change, a Read-Write encoding takes the following

detour concerning the vector mapping. First, depending on the parameter r, each informa-

tion vector x∈Fk

qis extended to an r-sized vector x0by adding `=r−k(slack) symbols

1,...,v0

`from the underlying symbol alphabet Fqthat carry no particular information.

In other words, we translate xinto an r-dimensional vector x0∈Fr

qby some mapping

function ext :Fk

q→Fr

q. Importantly, (only) in case of the initial information word x0,

the `symbols can be arbitrarily chosen from Fqyielding q`possible extension vectors x0

(denoted by the subset S0in Figure 4.2). Depending on the choice of the slack symbols,

a vector x0

0∈S0is fixed which is then mapped by some linear function ζ:Fr

q→Fn

qto

the codeword y0

0lying in the subspace ζ(Fr

q)of Fn

q(c.f. dashed lines). Clearly, ζ(Fr

q)is

an r-dimensional subspace of Fn

qwhich is isomorphic to Fr

q. Moreover, as being also an

(n,r)-linear MDS code, it is spanned by a set of n-dimensional basis vectors b1,...,br,

and for given parameters n,k,r, a (k,r,w,n)b-Read-Write code has minimum distance

dRW =n−r+1 and is thus capable of tolerating up to n−rerasures (see next section

for a more formal description). Surely, this recovery property is not at good as offered by

usual MDS codes but which is outweighed by the following improved update behavior.

Again, consider the initial information word x0and the corresponding RW codeword

y0∈ζ(Fr

q). Now, let x0somehow be changed/modified by the user/application what

results in the new information word x1=x0+∆xwith ∆xdenoting the information dif-

ference between x0and x1. Usually, in common linear (MDS) codes, this would create

a new, independent encoding γ(x1) = y1∈Fn

qof distance d∈ {n−k+1,...,n}whose

upper bound cannot be improved. Instead, in a Read-Write code, in order to achieve that,

at most, w<ncode symbols must be modified, any new encoding, i.e. any vector pair

i,y0

i, strongly depends on the initial setting x0,y0and all subsequent vector pairs x0j,y0jfor

0<j<i. The secret lies in appropriately adjusting the `=n−w=r−kslack variables

in each step. In particular, for an arbitrary step i>0, consider the extended vector x0

iwith

slack variables v0

ivalid for the corresponding codeword y0

i. Then, to achieve y0

i+1=y0

i+∆y

from x0

i+1=x0

i+∆xin the following step, the new slack variables v0

i+1=v0

i+∆vmust be

determined such that in the vector ∆yexactly `symbols remain unchanged (that is, are

set to 0). From this, it follows that, at most, wsymbols must be rewritten, and thus, the

codeword space ζ(Fr

q)corresponds to Y, the transitive closure of the initial codeword y0

(c.f. Section 4.1).

4.3.1 The Matrix Approach

We now state the generation of RW codes formally for what we present a matrix-based

approach which is similar to common linear encodings. In particular, we consider the

information word x= (x1,...,xk)∈Fk

q, the corresponding codeword y= (y1,...yn)∈Fn

4.3.1 The Matrix Approach 107

and for any modification in x, let δ=∆xbe the information change vector. Moreover,

let v= (v1,...,v`)denote the vector of internal slack variables with `=n−w=r−k

and which carry no particular information. The linear mapping is given by an appropriate

n×rgenerator matrix Γwith Γi,j∈Fq; the sub-matrix (Γi,j)i∈[n],j∈{k+1,...,r}is called the

variable matrix. Then, an RW code relies on the following equation realizing the mapping

function ζ:Fr

q→Fn







Γ1,1Γ1,2··· Γ1,r

Γ2,1Γ2,2··· Γ2,r

Γn,1Γn,2··· Γn,r



















=











(4.1)

Operations

•Initialization:

Starting with an information vector x0= (x1,...,xk), the variables (v1,...,v`)can

be set to arbitrary values (furthermore, if one wants to benefit from additional se-

curity features of this coding system (see Section 4.4), the slack variables must

be chosen uniformly at random). The codeword y0= (y1,...,yn)is computed by

Equation 4.1.

•Read: Given rcode entries from y, compute x

We rearrange the rows of Γand the rows of ysuch that the first rentries of yare

available for reading, and let y0and Γ0denote these rearranged vector and matrix,

respectively. The first rrows of Γ0describe the r×rmatrix Γ00 that we assume to

be invertible. Then, the information vector x(and the variable vector v) is obtained

by: x

v= (Γ00)−1y.

•Differential write: Given the information change vector δand wcode entries from

y, compute the difference vector γfor the wcode entries. Recall that yis updated

by γwithout first reading the information of yat the wcode positions.

The new information vector x0is given by x0

i=xi+δi. This notation allows to

change the vector x0without reading its entries. Clearly, only the choices w<r

make sense. Now, according to Equation 4.1, given the new k-dimensional in-

formation vector x0, the task is to find another (r−k)-dimensional vector ρwith

v0=v+ρsuch that the new codeword

y0=Γx0|v0T=Γ(x+δ|v+ρ)T

108 Erasure Codes for Reading and Writing

is a vector of distance at most w. Since we only consider at most wpositions of

y0, we may, without loss of generality, assume that the last n−wpositions are zero

such that Γ(x0|v0)T= (y0

w|0)T, with y0

wof length w, and the vector 0= (0,...,0)T

is of length n−w. Clearly, we must rearrange the rows of the matrix Γdue to

the vector (y0

w|0)T. Thus, for simpler description, we partition Γaccording to the

lengths of the sub-vectors involved, that is, we rearrange the rows of Γand ysuch

that the writable code symbols are y1,...,yw. Let Γ0denote this rearranged matrix

and y0the rearranged code vector. Hence, we define the following sub-matrices of

Γ0.

Γ←↑ = (Γ0

i,j)i∈[w],j∈[k],

Γ↑→ = (Γ0

i,j)i∈[w],j∈{k+1,...,r},

Γ←↓ = (Γ0

i,j)i∈{w+1,...,n},j∈[k],

Γ↓→ = (Γ0

i,j)i∈{w+1,...,n},j∈{k+1,...,r}.

Given these definitions, we then obtain

Γ←↑x0+Γ↑→v0=y0

Γ←↓x0+Γ↓→v0=0.

Obviously, an important precondition of the write operation is the invertibility of

the submatrix Γ↓→. The code symbol vector is then updated by the w-dimensional

vector

γ= ((Γ←↑)−(Γ↑→)(Γ↓→)−1(Γ←↓)) δ,

such that the new codeword y0is derived from the former code symbols at the w

given positions by simple addition, that is,

y0=y+γ.

In fact, the (2,3,3,4)2-RWC in Table 4.1 can be generated by this matrix based ap-

proach (Figure 4.3 gives the encoding. Additionally, compare also the (1,2,2,3)3-RWC

in Table 4.2 and Figure 4.4).

The following definition and statements formalize the properties of the parameters of

an RWC.

Definition 4.16. An n ×k-matrix A over any base b with n ≥k is row-wise invertible if

each k ×k matrix, constructed by combining k distinct rows of A, has full rank (and is

therefore invertible).

4.3.1 The Matrix Approach 109







0 0 1

0 1 1

1 0 1

1 1 1











=











Readable x1x2

code symbols

y1,y2,y3y1+y3y1+y2

y1,y2,y4y2+y4y1+y2

y1,y3,y4y1+y3y3+y4

y2,y3,y4y2+y4y3+y4

Write (x0

1,x0

2) = (x1+δ1,x2+δ2)

code y0

1=y1+y0

2=y2+y0

3=y3+y0

4=y4+

1,2,3δ1+δ2δ1δ20

1,2,4δ1δ1+δ20δ2

1,3,4δ20δ1+δ2δ1

2,3,4 0 δ2δ1δ1+δ2

Figure 4.3: A(2,3,3,4)2-Read-Write-Code over F2={0,1}modulo 2.

Theorem 4.17. The matrix-based RWC is correct and well-defined if the n ×r generator

matrix Γas well as the n ×(r−k)variable sub-matrix Γ0is row-wise invertible.

Proof. Follows from the definition of row-wise invertibility and the description of the op-

erations. To prove the correctness of the coding system, we show that after each operation

Equation 4.1 is valid. This is straightforward for the initialization and read operations. It

remains to prove the correctness of the write operation.

Again, consider the additive vector (ρ1,...,ρ`)denoting the changes in the variable

vector vand the vector (γ1,...,γw). With this and the information change vector δ, we

obtain x0=x+δ,v0=v+ρand y0=y+γ. The correctness of the write operation then

follows by combining:

Γx0

v0=Γx+δ

v+ρ=Γx

v+Γδ

ρ







yw+1













γ1

γw







This equation is equivalent to the following.

(Γ←↑)δ+(Γ↑→)ρ=γ

(Γ↓→)ρ+(Γ←↓)δ=0.

110 Erasure Codes for Reading and Writing





0 1

1 1

2 1 

x

v=







Readable

code symbols x

y1,y22y1+y2

y1,y3y1+2y3

y2,y32y2+y3

x0=x+δ

Writable symbols y0

1=y0

2=y0

y1,y2δ+y12δ+y2y3

y1,y32δ+y1y2δ+y3

y2,y3y1δ+y22δ+y3

Figure 4.4: The (1,2,2,3)3-Read-Write code over the alphabet F3={0,1,2}modulo 3.

corresponding to Table 4.2 above.

Since δis given as a parameter, ρcan be computed as

ρ=Γ↓→−1−Γ←↓δ,

and γby the last upper equation. If ρis known, then the product Γ·(δ|ρ)T(reduced

to the first wrows) gives the difference vector γproviding the new code entries of y0by

y0=y+γ.

Theorem 4.18. For parameters k ≤r,w≤n with r +w=k+n, there exists a (k,r,w,n)b-

RWC for an appropriate base b. Furthermore, this coding system can be computed in

polynomial time.

Proof. Follows from the following lemma and the fact that we use standard Gaussian

elimination for recovery.

Lemma 4.19. For each n ≥k and basis b ≥2dlog2n+1e, there is a row-wise invertible

n×k-matrix V over the finite field Fb. Furthermore, each submatrix of V is also row-wise

invertible.

Proof. Define an n×kVandermonde like matrix Vconsisting of non-zero and pairwise

distinct elements (α1,...,αn)∈F[2dlog2n+1e]. Then, erase any n−krows what yields a

k×ksubmatrix V0that is also a Vandermonde-matrix. Since all Vandermonde-matrices

are invertible, the lemma follows.

4.4. Security and Redundancy 111

4.4 Security and Redundancy

Induced by the usage of additional slack variables and besides adjustable fault-tolerance

and improved update properties, Read-Write codes furthermore offer useful properties

concerning data availability and security. Consider, for instance, the very extreme sce-

nario of a combination of hard disks of nportable (laptop) computers within an office. If

then a (k,r,w,n)b-RWC is used for encoding of at most nlaptops, it is sufficient if at least

max{r,w}computers are accessible at the office at any time for data access and changes.

If merely rcomputers are connected, at least the read operations can be performed. Now,

what happens if computer hard disks are broken or information on some hard disks has

changed ? Then, the inherent redundancy of the (k,r,w,k)b-RWC allows to point out the

number of wrong data and repair it (to some extent).

A different problem occurs if computers are stolen by some adversary to achieve knowl-

edge about company data. The good news is that, for every matrix based RWC, it holds

that one can give away any n−whard disks without revealing any information to the

adversary. If the slack variables are chosen uniformly at random from Σ, the attacker will

receive hard disks with perfect random sequences, absolutely useless without the other

hard disks. As a surplus, this redundantizes the need for complex encryption algorithms.

Redundancy

Theorem 4.20. Every (k,r,w,n)b-RWC can detect and repair up to `faulty code symbols

n!(r+`)!

(n−`)r!<1

Additionally, it can reconstruct the data from n −r missing code symbols (erasures).

Proof. The latter statement is trivial because if n−rcode symbols are missing, then by

the definition of an RWC, the complete information can be recovered from any raccessi-

ble code symbols. Furthermore, if then `out of these rcode symbols are faulty, we can

simply test every combination of the n

rcombinations of rreceived code symbols and

take a majority vote over the information vector. In this vote, at least n−`

rcombinations

produce the correct result what is a majority if

n−`

r≥1

2n

r

which, by transformation, is equivalent to

n!(r+`)!

(n−`)!r!<1

112 Erasure Codes for Reading and Writing

Security

If the coded symbols are stored on distinct storage devices, with an (k,r,w,n)-RWC, the

loss of at most n−max{r,w}devices can be tolerated. For instance, if these storage

devices were stolen, then the following theorem shows that the thief cannot reveal any

information whatsoever from the encoded information. More particular, the attacker sees

only a completely random sequence vector.

Theorem 4.21. Every matrix based (k,r,w,n)b-RWC for which k+n=r+w holds can be

used such that every choice of n −w code symbols does not reveal any information about

the original information vector.

Proof. Choose random values for the slack variables v1,...,v`from the symbol alphabet

Σat initialization (let |Σ|=b). Then, there is an isomorphism between these slack vari-

ables and the stolen code symbols yielding a total of b`possibilities for the stolen code

symbols to be changed. If more symbols are added, this starts to reveal some informa-

tion.

4.5 Adaptive RW Codes

Usually in this chapter, we consider static scenarios for a SAN environment, but neverthe-

less, some perfect RW codes can be modified to also behave somehow adaptive in case of

dynamic changes in the number of accessible disks (note that we still assume the storage

devices to be of uniform size). More clearly, we already know from the previous chapter

that in a SAN, adding and removing hard disks are the most delicate maneuvers. Provided

that the size of the underlying symbol alphabet is chosen appropriately, we show in the

following that perfect RW codes exist which allow to seamlessly continue all operations

without forcing the system to be in some intermediate and, more importantly, invalid state.

Consider the following example. Assume 10 disk in a SAN using an (8,9,9,10)-RW

code. Then, for better space utilization, the system administrator wants to switch to a

(4,7,7,10)-RW code. In a usual encoding, this requires all disks to be available for the

switch. Instead, in this section, we show that some special adaptive RW codes exist that

allow to switch from one code to the other with only 9 disks being accessible for reading

and writing. If then, after computing the re-encoding of all data on the 9 disks, the 10-th

disk returns, it can immediately participate in the new (4,7,7,10)-RW code. Moreover,

if the 10-th disk is permanently lost, it can be completely reconstructed from the new

(4,7,7,10)-RW code.

In particular, an adaptive RW code is a set of certain RW codes (k,r,w,n)bwith fixed

alphabet and which, unlike the previously described codes, has a switch function. If a

code is switchable, all parameters k,r,w,ncan be subject to change. Furthermore, with

respect to the codeword y, not all of the code symbols need to be read or changed.

4.5. Adaptive RW Codes 113

Theorem 4.22. Consider a sufficiently large constant M. Then, for a basis b ≥2dlog2M+1e,

there is an (M,b)-adaptive-RWC. In this system, it is possible to switch at any time from a

(k,r,w,n)b-RWC to any (k0,r0,w0,n0)b-RWC provided that n,n0≤M, and k0+n0=r0+w0

by merely reading any set of r encoded symbols and changing any set of w0encoded

symbols.

Proof. At first, we choose an appropriate base bby selecting a finite field Fqof suffi-

cient size q≥2dlog2M+1eand take the Vandermonde Matrix based approach as shown in

Section 4.3.1. We change the main equation to the following.







1α1α2

1... αM−1

1α2α2

2... αM−1

.....

1αM−1α2

M−1... αM−1

M−1













vr−k













zM−n







Again, x1,...,xkare the content symbols, v1,···,vr−kare the slack variables and y1,...yn

are the code symbols. The variables z1,...,zM−ncan be ignored for the beginning; they

are neither contents, slack nor code symbols and can be generated from the content and

slack symbols at any time. The initial vector as well as the read and write function are

chosen as in the matrix based approach. Then, the switch operation, that is, switching

from a (k,r,w,n)-RWC to a (k0,r0,w0,n0)-RWC, works as follows.

First, we read rcode symbols at given positions and decode the vectors xand vaccord-

ing to the matrix based approach. Then, we adapt the size of the former code to the new

code size. If n0>n, we compute the corresponding variables zifrom xand v. If n0<n,

we rename n−n0code variables to z-variables, and thus, reduce the code size. If r0>r,

the content/slack-variable vector (x|v)Tis extended by (r0−r)0-entries. We assume that

new contents are written during the switch-operation (especially, if k,k0). For this, let

1,...,v0

r0−k0be the new set of slack variables. Furthermore, we suppose at most w0code

symbols (positions) available for writing.

We start by erasing the rows n0+1,...,Min yand in the Vandermonde matrix since

they are of no interest for this operation. Then, like in Section 4.3.1, we rearrange the

residual matrix and the residual code vector such that the first wpositions are the writable

variables. We additionally rearrange the columns of the Vandermonde matrix and the

contents/slack vector such that the new slack variables are on the rightmost columns,

respectively lowermost lines. This results in the generator matrix Γ0(c.f. Section 4.3.1),

and the original vector xis rearranged up to the lowest r0−k0entries (possibly containing a

mixture of old contents, old slack variables, and 0-entries). Let x0be the vector of the new

114 Erasure Codes for Reading and Writing

contents (adequately rearranged), and let v0be the new slack vector with r0−k0entries.

If r0≥r,xhas k0entries, and otherwise, x(x0) has r−r0additional entries resulting from

former slack or content variables that must to be set to 0.

We first consider the case r0≥r. The number of entries in xis k0. Then, we can perform

an RWC write operation changing w0code symbols. Let `0=r0−k0=n0−w0and partition

Γ0like in Section 4.3.1. That is, let Γ←↑ be a w0×k0-sub-matrix of Γ0,Γ↑→ aw0×k0-sub-

matrix, Γ←↓ an `0×n0-sub-matrix and Γ↓→ an invertible `0×`0-sub-matrix of Γ0. Again,

according to the matrix based approach, the new (rearranged) code vector y0is obtained

y0=y+h(Γ←↑)−(Γ↑→)(Γ↓→)−1(Γ←↓)i·(x0−x)(4.2)

using the old (rearranged) writable symbol vector y. The proof of correctness is analogous

to Section 4.3.1.

Now, consider the case r0<r. Then, the number of entries in xand x0is ˜

k=k0+r−r0.

Again, let x0be the new (adequately rearranged) vector containing the ˜

knew symbols, and

v0is the new slack variable vector with r0−k0entries. Note that x0−xcan be computed

at this stage. We now perform a slightly adapted matrix based RWC write operation that

changes w0code symbols. Clearly, compared to the previous case, that matrix consists of

additional r−r0columns but what does not cause any problem since we only have to adapt

the sub-matrices. Furthermore, let `0=r0−k0=n0−w0and ˜w=w+r−r0. Then, let Γ←↑

be a ˜w×˜

k-sub-matrix of Γ0,Γ↑→ a ˜w×`0-sub-matrix, Γ←↓ an `0×˜

k-sub-matrix and Γ↓→

an invertible `0×`0-sub-matrix of Γ0. As usual, ydenotes the old and y0the new writeable

symbols. Then, applying the defined matrices, the new vector y0is obtained as given with

Equation 4.2, and again, the proof is analogous to the proof given in Section 4.3.1.

4.6 Boolean Read-Write-Codes

We already mentioned that the computational speed of RW codes strongly depends on

the size of the finite field used (c.f. Reed-Solomon codes) because encoding/decoding is

hardly driven by algebraic computations in the applied field, and the larger the field chosen

the more complex the encoding/decoding process. Thus, the most interesting case for the

choice of the alphabet is the binary case with Σ={0,1}. Unfortunately, as we have shown

in Lemma 4.13, there are no perfect Boolean (1,2,2,3)-RW codes, and furthermore, also

for the matrix based RW codes (also called parity Read-Write codes, according to parity

Reed-Solomon codes), the Boolean basis poses a severe restriction. The reason is that

only for some dimensions Boolean matrices are row-wise invertible. We address these

cases of valid (k,r,w,n)2-RW codes in the following.

4.6. Boolean Read-Write-Codes 115

Lemma 4.23. For each n, there exists an n ×1row-wise invertible Boolean matrix. Fur-

thermore, for k ≥2, there are n ×k row-wise invertible Boolean matrices if and only if

n∈{k,k+1}.

Proof. The first claim is trivial. For the second, note there is a (k+1)×krow-wise

invertible Boolean matrix, e.g. Γ= (Γi,j)i∈[k+1],j∈[k]such that

Γi,j=1 : i=n∨i=j

0 : else

Now, we prove that there are no (k+2)×krow-wise invertible matrices as we show that,

given a k×kfull rank Boolean matrix Γ, there is always a unique vector leading to a

(k+1)×krow-wise Boolean matrix.

Consider Γas above and remove an arbitrary row i. Then, there are 2k−1possibilities

to add a row receiving a matrix with full rank. Such a row is described by the vector set

Γ(x1,...,xk)Twhere xi=1 and the rest is chosen arbitrarily. The only vector that can be

added to each of these combinations is Γ(1,...,1)T. After adding this vector, there is no

other vector which is linearly independent from the other vectors.

This lemma has severe implications on the matrix based method for Boolean bases, as

we show in the following.

Theorem 4.24. If r +w=k+n, there are Boolean n ×k generator matrices V for a valid

(k,r,w,n)2-RW encoding only for the following cases:

1. (1,1,n,n)2,(1,n,1,n)2,(k,k,k,k)2

2. For k ≥2:(k,k+1,k+1,k+2)2

3. For k ≥1:(k,k,k+1,k+1)2

4. For k ≥1:(k,k+1,k,k+1)2

Proof. Follows by combining the prerequisites of the matrix based RWC method with

Lemma 4.23.

If we merely consider the encoding of a 1-bit information word, the following lemma

holds.

Lemma 4.25. Every (1,r,w,r+w−1)2-RW code is a matrix based RW code.

116 Erasure Codes for Reading and Writing

Proof. Consider a write operation on the subset Wof code symbols with |W|=wand a

read operation on the index set Rwith |R|=rsuch that |R∩W|=1. Let ibe the index

of the element in the intersection. Now, if the information bit xflips to x0, the code bit yi

must flip as well. Otherwise, the result of the read operation would be the same as before,

what is incorrect. Thus, we can describe the operation on yiby

i≡yi+x0+x(mod 2).

Notice, there is always a set Rwith |R|=rand |R∩W|=1 for any index i∈W. Thus,

the congruence holds for all bits leading to the matrix representation of RW codes.

Unfortunately, there are no perfect Boolean RW codes of higher dimensions.

Lemma 4.26. There is no (2,r,w,n)2-RW code for r +w=2+n and w ≥4.

Proof. Consider arbitrary 4 bits of some codeword yand let their indices be given by

the set F={i,j,k,l} ⊆ [n]. Then, partition the residual index set IR= [n]\Finto two

disjoint index sets Rwith |R|=r−2 and Wwith |W|=w−4. W.l.o.g. consider the set

F={1,2,3,4}describing the code bits y1,y2,y3,y4. Now, consider write operations on

the index sets Wi,j=W∪{i,j}for all distinct i,j∈Fand read operations on the index

set Ri,j=R∪{i,j}, again for all i,j∈F, imposing an intersection on at most two code

symbols.

Now, what is the number of bits to flip whenever the information (x1,x2) changes to

(x0

1,x0

2) (for our example, there are three change possibilities) ? To cover all cases, we

define pias a predicate which is only true if yimust be changed, i.e. inverted. If we then

consider the read operation on the set Ri,j, clearly, all bits in Rremain unchanged, and the

only way the write operation induces a change for the read operation is that, at least, one

of the code bits yi,yjis changed. To consider all such read operations, we have to fulfill

the following term: ^

i,j∈[4],i,j

pi∨pj=

(p1∧p2∧p3)∨(p1∧p2∧p4)∨(p1∧p3∧p4)∨(p2∧p3∧p4).

As we see, all but (at most) one bit have to be inverted. Hence, there are five possibilities

to encode all three possible changes to the information vector (x1,x2). Moreover, by the

definition of the set Ri,j, each read operation can only observe two out of the four rewritten

bits. In particular, the following cases occur (we start with R1,2):

1. A: y1and y2are inverted.

2. B: Only y1is inverted.

3. C: Only y2is inverted.

4.6. Boolean Read-Write-Codes 117

Thus, A, B, and C can be mapped to all three possible changes of (x1,x2) to (x0

1,x0

2). Un-

fortunately, if we continue this game, we will observe a conflict with the read operation

R3,4in the situations Band Cthat both hold for R3,4, that is, both bits y3and y4must

be inverted inducing an indistinguishable situation for the read operation. Thus, in other

words, the read operation of R3,4cannot uniquely distinguish the change in the informa-

tion vector.

This lemma can be generalized to the following theorem.

Theorem 4.27. There is no (k,r,w,n)2-RW code for r +w=k+n and w ≥k+2.

Proof. The proof is analogous to the proof of Lemma 4.26 but we now consider k+2 bits

of y, and w.l.o.g. we set F= [k+2]describing the code bits y1,y2,...,yk+2. Then again,

we partition the n−kresidual code bits (i.e. the residual index set IR= [n]\F) into the

sets Rand Wwith |R|=r−2 and |W|=w−k−2. Let the considered read and write

operations on the index sets Wi,jand Ri,jbe defined as in Lemma 4.26.

Again, we are interested in the number of bits to flip if the information (xi1,...,xik)

changes to (xi1,...,xik+2) (there are 2k−1 change possibilities). Let the predicate pialso

be defined as before. If we then consider the read operation on R∪{i1,...,ik}, all bits

in Rremain the same. Thus, the only way the write operation induces a change is that,

at least, one of the code bits yi1,...,yikis changed (note that the read operation will read

only two bits from this intersection). Considering all such read operations, we must now

fulfill the following term:

S∈[k+2]:|S|=k_

i∈S

pi=_

S∈[k+2]:|S|≥3^

i∈S

pi.

Thus, at least 3 bits must be inverted in every write operation. Moreover, there are

2k+2−

∑

i=0k+2

i

possibilities to encode all 2k−1 possible changes to the vector (x1,...,xk).

However, in each read operation, only kbits are accessed (in the intersection) which

requires 2k−1 possible outcomes. If we then combine any two of the read operations, this

implies for the whole set that, at least, 2k+2−7 possible outcomes are demanded. Now,

consider the intersection of read operations that have only k−2 positions in common. If

the corresponding bits remain the same, clearly then, each of the residual pairs of two bits

must differ, leaving 9 possibilities. In the other case, we have 2k−2−1 possibilities for

the common bit vector and 16 possibilities for the ‘private’ bit pairs. Hence, there are

∑

i=0k+2

i−7>1

codes missing, for all k≥1.

118 Erasure Codes for Reading and Writing

4.7 General RW-Codes

Since only few perfect Boolean RW codes exist, we now consider more general RW codes

for which, in particular, hold

r+w>n+k.

In order to appropriately describe the existence of such codes, we use a modified matrix

based approach in which all matrix entries are chosen uniformly at random from the

alphabet Σ. Then, given this structures, we prove that with positive probability the read

and write operations work.

We call the following approach a random matrix (k,`,n)b-RW code. As usual, we

consider the information vector xto consists of ksymbols plus additional `used slack

variables, and an n-symbol code ygenerated by the matrix based encoding. Again, let

b=|Σ|denote the size of the symbol alphabet Σ. Furthermore, we consider an n×(k+`)-

generator matrix Γwhose elements are drawn (uniformly) at random from Σ. And as

usual, we use the matrix based code representation

Γ(x|v)T=y.

Then, the read and write operations can be derived as follows:

•Read

Consider r≥k+`given readable positions in the code vector y. Then, we can

derive the matrix Γ0by choosing all rows of Γcorresponding to the rows available

for reading (c.f. Section 4.3.1). In Γ0, we choose `+klinear independent row

vectors yielding matrix Γ00 (note that in case of an insufficient number of existing

linear independent rows in the matrix (what may occur since the elements were

chosen at random), the read operation fails). Corresponding to Γ00, we also reduce

the code vector yyielding a codeword y0of size k+`and compute the k-sized

content vector xas well as the slack vector vas usual by

(x|v)T= (Γ00)−1y0.

•Write

Consider the n×lvariable sub-matrix

Γ→= (Γi,j)i∈[n],j∈{k+1,...,k+l}.

4.7. General RW-Codes 119

There are n−wnon-writable positions corresponding to rows of the code vector

yand the variable sub-matrix Γ→. In case that these n−wrows of Γ→are not

linearly independent, clearly, the write operation fails (recall from Section 4.3.1 that

a precondition for the write operation is the invertibility of the (n−w)×lvariable

submatrix Γ↓→). Then, we could add some other `−n+wlinear independent rows

of Γ→(again, if these do not exist, the operation fails). Now, provided that we have

found appropriate linearly independent rows, these rows then correspond to non-

writable positions, and by using the complement of this rows as writable positions,

we can directly apply the matrix based RW code write operation.

We now investigate the success probability of read and write.

Lemma 4.28. Consider a k ×(k+`)random matrix A over a given base b ≥2. Then,

Pr[A has an invertible k ×k-sub matrix ]≥1−b−`+2.

Proof. Since the column rank and row rank of a matrix are always equal, it is more con-

venient to argue the following by the column rank of the matrix A. Then, the probability

that a random k×(k+`)submatrix of Adoes not have rank kequals the probability

that all k+`columns of Alie in some k−1-dimensional subspace of bk. To sum up

all these subspaces, we enumerate each subspace by its 1-dimensional orthogonal com-

plement yielding a total of bk−1

b−1such subspaces (since we omit the zero vector, we have

bk−1 possibilities for a 1-dimensional vector or size k, and since for each such vector

that spans some certain subspace, a multiplication with b−1 non-zero elements from the

underlying base still spans the same subspace). Then, for each of such subspaces, the

probability that all columns of the matrix are in this subspace is b−(k+`)1. By the union

bound, the probability that all the columns are in the same subspace is upper bounded by

b−(k+`)bk−1

b−1≤b−`

b−1≤b−`.

Thus, the probability that the matrix has full rank is at least 1 −b−`.

From this, it follows that if only few combinations of read and write index sets occur,

the overhead for general codes is small.

Theorem 4.29. For given parameters r and w, consider a general RWC with read index

sets R={R1,...,R|R|},|Ri| ≥ r, and write index sets W={W1,...,W|W|},|Wi| ≥ w.

Then, there is a restricted (k,r,w,n)b-RWC for any

r>k+logb|R|+2,

w+r>n+k+logb|W|+2.

1For each column vector cit holds: Pr[clies in this subspace ] = b−1=bk−1

bk. In other words, one aims at

sampling a vector from the k-dimensional space which is also in the k−1-dimensional subspace.

120 Erasure Codes for Reading and Writing

Proof. According to Lemma 4.28, the probability of having a valid code for reading by

using rrows is at least 1 −b−(r−k)+2. Now, summing up the error probabilities of all

possible read operations, we end up with

|R|

∑

i=1

Pr[Riis not valid ]≤ |R|b−(r−k)+2<1

provided that r>k+logb|R|+3.

For the write operation, we consider an `×(n−w)sub-matrix for `=r−kin which

we need to find n−w≤`independent rows, with positive probability. The other failure

case can be omitted. Again, due to the previous lemma, with a success probability greater

than 1 −bn−w−`+2we succeed. Thus, for

n−w−r+k+2>logb|W|,

the summed error probability is less than 1. Combining both error probabilities, the over-

all error probability is also less than 1. Hence, there exists a matrix allowing these re-

stricted read and write index sets.

Yet, the overall number of possible read and write index sets is quite high. The follow-

ing theorem shows that random matrices perform quite well for most of the read and write

index sets.

Theorem 4.30. For any base b, a random matrix based (k,`,n)b-RW code successfully

performs a read operation for a random choice of k +`+q code symbols with probability

1−b−q+2and successfully performs a write operation for a random choice of n −`+q

writable code symbols with probability 1−b−q+2.

Proof. Follows directly from Lemma 4.28 and from the definition of the Read and Write

operation of random matrix based (k,`,n)b-RW-Codes.

4.8 Conclusion

In this chapter, we presented a useful extension of linear erasure codes, called Read-Write

codes, for application in RAID-like storage systems. Such systems often store valuable

business data implying some certain degree of reliability on the stored data. Furthermore,

to exploit access parallelism, the data is distributed among ndistinct storage devices but

what makes the system subject to failures as the probability of failed disks increases with

the number of devices connected. Therefore, fault-tolerant data placement schemes are

highly required to beware from data loss. In addition, reducing system latency is also of

great concern since access to the devices on which the data is stored is magnitudes slower

than any memory access which leads to comparatively high I/O costs.

4.8. Conclusion 121

Usually, employed RAID schemes (RAID 4/5) and Reed-Solomon encoding are sim-

ple and efficient techniques that, on one hand, focus on aggregating disk performance by

employing parallel access to them, and on the other hand, guarantee sufficient reliability

to tolerate multiple disk failures, all at a low cost, but unfortunately, all such codes re-

quire naccessible disks for storing an n-block codeword appropriately, what leads to the

following problems.

1. The modification of the original information implies access to all n−kchecksum

disks; this write penalty leads to a performance bottleneck.

2. A complete modification of the information word results in inconsistencies if at

least any of the ndisks is missing.

3. Once the stripe size of an encoding is chosen, it cannot be changed any further,

thus, it cannot be adjusted to potentially changing demands (e.g. changing degree

of redundancy, adjusting to a higher number of tolerable failures, changing stripe

size, changing document size, etc.).

RW codes can significantly reduce, or even eliminated these problems. In a perfect

(k,r,w,n)-RW code (with r+w=k+n), any rdisks suffice for reading, and any max{r,w}

disks suffice for writing. The overhead is described by the stretch factor m

n, and any of

these combinations (k,r,w,n)with k≤r,w≤ncan be chosen. In the case of adaptive RW

codes, these parameters can be adjusted during runtime without reading and rewriting all

disks. Furthermore, the information of n−rdisks can be recovered and the system can

continue working if not more than n−max{r,w}disks fail. There is some redundancy

against faulty hard disks. However, redundancy is better solved by reducing it with disk-

wise checksums to the problem of failed disks. Interestingly, there is also a security

feature which allows the theft of up to n−wdisks. The main advantage of RW codes

in a SAN is that the current r, respectively w, fastest disks can be used for read or write

operation. This leads to more efficient storage area networks.

In addition, in [DLS+04b], methods have been introduced for a wide-area distributed

hash table (DHT) that provides high-throughput and low-latency network storage. The au-

thors suggest a system called DHash++ which based on the peer-to-peer network Chord

[SMK+01b], and one feature of DHash++ is to use an erasure-resilient code for stor-

ing data. In particular, they store each 8192-byte block as 14 1171-byte erasure-coded

fragments, any seven of which are sufficient to reconstruct the block, using the IDA (In-

formation Dispersal Algorithm) coding algorithm from [Rab89]. The benefit of this code

is low latency as the fastest peers can be used to reconstruct the stored data. On the other

hand, this system suffers from a bad update behavior making it a promising application

for Read-Write-Codes to improve peer-to-peer-networks.

Bibliography

[ABB+97] Micah Adler, Yair Bartal, John W. Byers, Michael Luby, and Danny Raz.

A modular analysis of network transmission protocols. In Israel Sympo-

sium on Theory of Computing Systems, pages 54–62, 1997.

[ABKU00] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced

allocations. SIAM Journal on Computing, 29(1):180–200, 2000.

[ACMR95] Micah Adler, Soumen Chakrabarti, Michael Mitzenmacher, and Lars

Rasmussen. Parallel randomized load balancing. pages 238–247, 1995.

[AT97] J. Alemany and J. S. Thathachar. Random striping news on demand

servers. Technical Report TR-97-02-02, 1997.

[BBBM94] M. Blaum, J. Brady, J. Bruck, and J. Menon. EVENODD: an optimal

scheme for tolerating double disk failures in RAID architectures. In Pro-

ceedings of the 21st Annual International Symposium on Computer Ar-

chitecture, pages 245–254. IEEE Computer Society TCCA and ACM

SIGARCH, April 18–21, 1994.

[BCSV00] Petra Berenbrink, Artur Czumaj, Angelika Steger, and Berthold Vöck-

ing. Balanced allocations: the heavily loaded case. pages 745–754,

2000.

[BE07] André Brinkmann and Sascha Effert. Inter-node communication in peer-

to-peer storage clusters. In Proceedings of the 24th IEEE Conference on

Mass Storage Systems and Technologies (MSST), San Diego, California,

24 - 26 September 2007.

123

124 Bibliography

[BEHV05] André Brinkmann, Sascha Effert, Michael Heidebuer, and Mario

Vodisek. Distributed md. In In Proceedings of the International Work-

shop on Storage Network Architecture and Parallel I/Os, pages 81 – 88,

Saint Louis, Missouri, USA, 18 September 2005.

[BEHV06] André Brinkmann, Sascha Effert, Michael Heidebuer, and Mario

Vodisek. Realizing multilevel snapshots in dynamically changing vir-

tualized storage environments. In 5th International Conference on Net-

working (ICN), number 5, Mauritius, 23 - 26 April 2006. Springer Verlag

LNCS.

[BEMadHS07] André Brinkmann, Sascha Effert, Friedhelm Meyer auf der Heide, and

Christian Scheideler. Dynamic and redundant data placement. In

27th IEEE International Conference on Distributed Computing Systems

(ICDCS 2007), Toronto, Canada, 25 - 29 June 2007.

[BEY98] Allan Borodin and Ran El-Yaniv. Online computation and competitive

analysis. Cambridge University Press, New York, NY, USA, 1998.

[BGMJ94] Steven Berson, Shahram Ghandeharizadeh, Richard Muntz, and Xi-

angyu Ju. Staggered striping in multimedia information systems. pages

79–90, 1994.

[BHMadH+04] André Brinkmann, Michael Heidebuer, Friedhelm Meyer auf der Heide,

Ulrich Rückert, Kay Salzwedel, and Mario Vodisek. V:drive - costs and

benefits of an out-of-band storage virtualization system. In Proceedings

of the 12th NASA Goddard, 21st IEEE Conference on Mass Storage Sys-

tems and Technologies (MSST), pages 153 – 157, College Park, Mary-

land, USA, 13 - 16 April 2004.

[BLM02] J. Byers, M. Luby, and M. Mitzenmacher. A digital fountain approach

to asynchronous reliable multicast. IEEE Journal on Selected Areas in

Communications, 20(8), Oct, 2002.

[BLMR98] John W. Byers, Michael Luby, Michael Mitzenmacher, and Ashutosh

Rege. A digital fountain approach to reliable distribution of bulk data.

In SIGCOMM’98, pages 56–67, Sep 1998.

[BMadHS97] Petra Berenbrink, Friedhelm Meyer auf der Heide, and Klaus Schröder.

Allocating weighted jobs in parallel. SPAA 1997, pages 302–310, June

1997.

Bibliography 125

[BMS+03] A. Brinkmann, F. Meyer auf der Heide, K. Salzwedel, C. Scheideler,

M. Vodisek, and U. Rückert. Storage Management as Means to Cope

with Exponential Information Growth. In Proceedings of the Interna-

tional Conference on Advances in Infrastructure for Electronic Business,

Education, Science, Medcine, and Mobile Technologies on the Internet,

SSGRR 2003w, January 2003.

[BSS00] A. Brinkmann, Kay Salzwedel, and C. Scheideler. Efficient, distributed

data placement strategies for storage area networks. In Proc. of the 12th

ACM Symposium on Parallel Algorithms and Architectures (SPAA’00),

pages 119–128, 2000.

[BSS02] A. Brinkmann, Kay Salzwedel, and C. Scheideler. Compact, adaptive

placement schemes for non-uniform distribution requirements. In Proc.

of the 14th ACM Symposium on Parallel Algorithms and Architectures

(SPAA’02), pages 53–62, 2002.

[BSV03] R. Bhagwan, S. Savage, and G. Voelker. Understanding availability. In

Proceedings of the 2nd International Workshop on Peer-to-Peer Systems

(IPTPS ’03), February 2003.

[BSV04] André Brinkmann, Kay Salzwedel, and Mario Vodisek. A case for vir-

tualized arrays of raid. In Proceedings of the International Workshop

on Storage Network Architecture and Parallel I/Os SNAPI 2004, pages

9–16, Antibes Juan-les-pins, France, 30 September 2004.

[CL00] T. Cortes and J. Labarta. A case for heterogenenous disk arrays. In

Proc. of the IEEE International Conference on Cluster Computing (Clus-

ter’2000), pages 319–325, 2000.

[CL01] T. Cortes and J. Labarta. Extending heterogeneity to RAID level 5. In

USENIX 2001, Boston, June 2001.

[CLG+94] Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, and

David A. Patterson. RAID: High-performance, reliable secondary stor-

age. ACM Computing Surveys, 26(2):145–185, 1994.

[CLRS01] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clif-

ford Stein. Introduction to Algorithms, Second Edition. The MIT Press,

September 2001.

126 Bibliography

[CW77] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash

functions (extended abstract). In STOC ’77: Proceedings of the ninth

annual ACM symposium on Theory of computing, pages 106–112, New

York, NY, USA, 1977. ACM.

[DKM+88] Martin Dietzfelbinger, Anna R. Karlin, Kurt Mehlhorn, Fried-

helm Meyer auf der Heide, Hans Rohnert, and Robert Endre Tarjan. Dy-

namic perfect hashing: Upper and lower bounds. In IEEE Symposium on

Foundations of Computer Science, pages 524–531, 1988.

[DLS+04a] Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans

Kaashoek, and Robert Morris. Designing a dht for low latency and high

throughput. In NSDI’04: Proceedings of the 1st conference on Sym-

posium on Networked Systems Design and Implementation, pages 7–7,

Berkeley, CA, USA, 2004. USENIX Association.

[DLS+04b] Frank Dabek, Jinyang Li, Emil Sit, James Robertson, M. Frans

Kaashoek, and Robert Morris. Designing a DHT for low latency and

high throughput. In NSDI, pages 85–98, 2004.

[DR01] P. Druschel and A. Rowstron. PAST: A large-scale, persistent peer-to-

peer storage utility. pages 75–80, 2001.

[ED88] R. J. Enbody and H. C. Du. Dynamic hashing schemes. ACM Comput.

Surv., 20(2):850–113, 1988.

[Eli55] P. Elias. Coding for two noisy channels. In Information Theory, Third

London Symposium, pages 61–76. Butterswort’s Scientific Publications,

1955.

[Fal07] FalconStor. The Value of Storage Virtualization. White Paper, FalconStor

Software Inc., 2007.

[FNPS79] Ronald Fagin, Jürg Nievergelt, Nicholas Pippenger, and H. Raymond

Strong. Extendible hashing - a fast access method for dynamic files.

ACM Trans. Database Syst., 4(3):315–344, 1979.

[Gan07] John F. Gantz et al. ICD Study: The Expanding Digital Universe. White

Paper, EMC2, March 2007.

[Gib99] Garth A Gibson. Redundant disk arrays: Reliable, parallel secondary

storage. Technical report, Berkeley, CA, USA, 1999.

Bibliography 127

[GLS93] Martin Grötschel, Lászlo Lovász, and Alexander Schrijver. Geometric

Algorithms and Combinatorial Optimization, volume 2 of Algorithms

and Combinatorics. Springer, second corrected edition edition, 1993.

[GM00] Garth A. Gibson and Rodney Van Meter. Network attached storage ar-

chitecture. Commun. ACM, 43(11):37–45, 2000.

[Gup02] M. Gupta. Storage Area Network Fundamentals. Cisco Press, 2002.

[HGK+94] Lisa Hellerstein, Garth A. Gibson, Richard M. Karp, Randy H. Katz, and

David A. Patterson. Coding techniques for handling failures in large disk

arrays. Algorithmica, 12(2/3):182–208, 1994.

[HJH02] Kai Hwang, Hai Jin, and Roy S.C. Ho. Orthogonal striping and mirroring

in distributed raid for i/o-centric cluster computing. IEEE Trans. Parallel

Distrib. Syst., 13(1):26–44, 2002.

[HM03] R. Honicky and E. Miller. A fast algorithm for online placement and

reorganization of replicated data. 2003.

[HM04] R. J. Honicky and Ethan L. Miller. Replication Under Scalable Hashing:

A Family of Algorithms for Scalable Decentralized Data Distribution. In

Proceedings of the 18th IPDPS Conference, 2004.

[HP03] Cary Huffmann and Vera Pless. Fundamentals of Error-Correcting

Codes. Cambridge University Press, Cambridge, UK, 2003.

[HR90] Torben Hagerup and C. Rüb. A guided tour of chernoff bounds. Inf.

Process. Lett., 33(6):305–308, 1990.

[Hro03] J. Hromkovic. Algorithms for Hard Problems, 2nd Edition. (Texts in

Theoretical Computer Science. An EATCS Series). Springer-Verlag New

York, Inc., Secaucus, NJ, USA, 2003.

[HZ05] J. Hromkovic and I. Zámecniková. Design and Analysis of Random-

ized Algorithms: Introduction to Design Paradigms (Texts in Theoretical

Computer Science. An EATCS Series). Springer-Verlag New York, Inc.,

Secaucus, NJ, USA, 2005.

[JK77] Norman L. Johnson and Samuel Kotz. Urn models and their application.

John Wiley & Sons, New York-London-Sydney, 1977. An approach

to modern discrete probability theory, Wiley Series in Probability and

Mathematical Statistics.

128 Bibliography

[Joh84] Olin G. Johnson. Three-dimensional wave equation computations on

vector computers. Proceedings of the IEEE, 72(1):90–95, January 1984.

[JT05] A. Telles J. Tate, R. Kanth. Introduction to Storage Area Networks. Tech-

nical report, IBM, May 2005.

[KLL+97] David Karger, Eric Lehman, Tom Leighton, Mathhew Levine, Daniel

Lewin, and Rina Panigrahy. Consistent hashing and random trees: Dis-

tributed caching protocols for relieving hot spots on the world wide web.

In ACM Symposium on Theory of Computing, pages 654–663, May 1997.

[KLM92] Richard M. Karp, Michael Luby, and Friedhelm Meyer auf der Heide.

Efficient PRAM simulation on a distributed memory machine. pages

318–326, 1992.

[KLMadH96] Richard M. Karp, Michael Luby, and Friedhelm Meyer auf der Heide.

Efficient pram simulation on a distributed memory machine. Algorith-

mica, 16(4/5):517–542, 1996.

[KSB+99] David Karger, Alex Sherman, Andy Berkheimer, Bill Bogstad, Rizwan

Dhanidina, Ken Iwamoto, Brian Kim, Luke Matkins, and Yoav

Yerushalmi. Web caching with consistent hashing. In WWW ’99: Pro-

ceedings of the eighth international conference on World Wide Web,

pages 1203–1213, New York, NY, USA, 1999. Elsevier North-Holland,

Inc.

[LMR98] W. Litwin, J. Menon, and T. Risch. LH* schemes with scalable avail-

ability. Technical Report RJ 10121 (91937), IBM Research, Almaden

Center, May 1998.

[LN86] Rudolf Lidl and Harald Niederreiter. Introduction to finite fields and

their applications. Cambridge University Press, New York, NY, USA,

1986.

[LN96] Witold Litwin and Marie-Anne Neimat. High-availability LH* schemes

with mirroring. In Conference on Cooperative Information Systems,

pages 196–205, 1996.

[LNS93] Witold Litwin, Marie-Anne Neimat, and Donovan A. Schneider. Lh*:

Linear hashing for distributed files. In SIGMOD ’93: Proceedings of the

1993 ACM SIGMOD international conference on Management of data,

pages 327–336, New York, NY, USA, 1993. ACM.

Bibliography 129

[LR02] Witold Litwin and Tore Risch. LH*G: A high-availability scalable dis-

tributed data structure by record grouping. IEEE Transactions on Knowl-

edge and Data Engineering, 14(4):923–927, 2002.

[LS00] Witold Litwin and Thomas Schwarz. LH* RS : A high-availability scal-

able distributed data structure using reed solomon codes. In SIGMOD

Conference, pages 237–248, 2000.

[LYD71] V. Y. Lum, P. S. T. Yuen, and M. Dodd. Key-to-address transform tech-

niques: a fundamental performance study on large existing formatted

files. Commun. ACM, 14(4):228–239, 1971.

[MadH06] Friedhelm Meyer auf der Heide. Kommunikation in Parallelen Rechen-

modellen. Skript, University of Paderborn, July 2006.

[Mar79] G. N.N. Martin. Spiral storage: Incrementally augmentable hash ad-

dressed storage. Technical report, Coventry, UK, UK, 1979.

[Mas97] P. Massiglia. RAB. The RAIDbook. A Storage System Technology Hand-

book. RAID Advisory Board, 1997.

[McD89] C. McDiarmid. On the method of bounded differences. In J. Siemons,

editor, Surveys in Combinatorics. London Mathematical Society Lecture

Note Series 141, Cambridge University Press, 1989.

[MNR02] D. MALKHI, M. NAOR, and D. RATAJCZAK. Viceroy: A scalable and

dynamic emulation of the butterfly. In Proceedings of the 21st annual

ACM symposium on Principles of distributed computing. ACM Press,

2002.

[MR95] Rajeev Motwani and Prabhakar Raghavan. Randomized Algorithms.

Cambridge University Press, New York, NY, USA, 1995.

[MS08] Mario Mense and Christian Scheideler. Spread: An adaptive scheme for

redundant and fair storage in dynamic heterogeneous storage systems. In

19th ACM-SIAM Symposium on Discrete Algorithms (SODA), San Fran-

cisco, California, USA, 20.-22. Febr.„ January 2008.

[Mul84] James K. Mullin. Unified dynamic hashing. In VLDB ’84: Proceedings

of the 10th International Conference on Very Large Data Bases, pages

473–480, San Francisco, CA, USA, 1984. Morgan Kaufmann Publishers

Inc.

130 Bibliography

[NB97] J. Nonnenmacher and E. Biersack. Asynchronous multicast push: Amp.

In Proceedings of ICCC’97, pp. 419–430, Cannes, France, November,

1997.

[Oto88] Ekow J. Otoo. Linearizing the directory growth in order preserving ex-

tendible hashing. In Proceedings of the Fourth International Conference

on Data Engineering, pages 580–588, Washington, DC, USA, 1988.

IEEE Computer Society.

[PGK88] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant

Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM

Conference on Management of Data (SIGMOD), pages 109–116, June

1988.

[Pla97] James S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance

in RAID-like systems. Software, Practice and Experience, 27(9):995–

1012, 1997.

[PS98] Christos H. Papadimitriou and Kenneth Steiglitz. Combinatorial Opti-

mization : Algorithms and Complexity. Dover Publications, July 1998.

[PT03] J. S. Plank and M. G. Thomason. On the practical use of ldpc erasure

codes for distributed storage applications. Technical Report CS-03-510,

University of Tennessee, September 2003.

[Rab89] M. O. Rabin. Efficient dispersal of information for security, load balac-

ing and fault tolerance. Journal of the ACM, 36(2):335–348, 1989.

[RD01] Antony I. T. Rowstron and Peter Druschel. Storage management and

caching in past, a large-scale, persistent peer-to-peer storage utility. In

Symposium on Operating Systems Principles, pages 188–201, 2001.

[REG+03] S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao, and J. Kubiatow-

icz. Pond: The oceanstore prototype. In Proceedings of the Conference

on File and Storage Technologies. USENIX, 2003.

[Riz97] Luigi Rizzo. Effective erasure codes for reliable computer communica-

tion protocols. ACM Computer Communication Review, 27(2):24–36,

April 1997.

[RL05] Rodrigo Rodrigues and Barbara Liskov. High availability in dhts: Era-

sure coding vs. replication. In Miguel Castro and Robbert van Renesse,

editors, IPTPS, volume 3640 of Lecture Notes in Computer Science,

pages 226–239. Springer, 2005.

Bibliography 131

[Roc70] R. T. Rockafellar. Convex Analysis. Princeton University Press, Prince-

ton, New Jersey, 1970.

[Rom96] Steven Roman. Introduction to coding and information theory. Springer-

Verlag New York, Inc., Secaucus, NJ, USA, 1996.

[Ros00] K.H. Rosen. Handbook of Discrete and Combinatorial Mathematics.

CRC; 1 edition, 2000.

[Ros06] Sheldon M. Ross. Introduction to Probability Models, Ninth Edition.

Academic Press, Inc., Orlando, FL, USA, 2006.

[RS98] Martin Raab and Angelika Steger. “balls into bins” — A simple and tight

analysis. Lecture Notes in Computer Science, 1518:159–??, 1998.

[Sal04] Kay Salzwedel. Data Distribution Algorithms for Storage Networks.

Dissertation, Universität Paderborn, Heinz Nixdorf Institut, Theoretische

Informatik, 2004. EUR 20,00 ISBN 3-935433-62-X.

[San01] P. Sanders. Reconciling simplicity and realism in parallel disk mod-

els. In Proc. of the 12th ACM-SIAM Symposium on Discrete Algorithms

(SODA), pages 67–76. SIAM, Philadelphia, PA, 2001.

[Sch00] C. Scheideler. Probabilistic Methods for Coordination Problems. HNI-

Verlagsschriftenreihe 78, University of Paderborn, 2000.

[Sch01] Klaus Schröder. Balls into Bins: A Paradigm for Job Allocation, Data

Distribution Processes, and Routing. Dissertation, Universität Pader-

born, Heinz Nixdorf Institut, Theoretische Informatik, 2001. ISBN 3-

931466-88-4.

[SM98a] J. R. Santos and R. Muntz. Performance analysis of the RIO multimedia

storage system with heterogeneous disk configuration. In ACM Multi-

media 98, pages 303–308, 1998.

[SM98b] Jose Renato Santos and Richard Muntz. Using heterogeneous disks on

a multimedia storage system with random data allocation. Technical

Report 980011, 18, 1998.

[SMK+01a] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari

Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet

applications. In Proc. of the ACM SIGCOMM 2001, pages 149–160,

August 2001.

132 Bibliography

[SMK+01b] Ion Stoica, Robert Morris, David R. Karger, M. Frans Kaashoek, and

Hari Balakrishnan. Chord: A scalable peer-to-peer lookup service for

internet applications. In SIGCOMM, pages 149–160, 2001.

[SMRN00] Jose Renato Santos, Richard R. Muntz, and Berthier A. Ribeiro-Neto.

Comparing random data allocation and data striping in multimedia

servers. In Measurement and Modeling of Computer Systems, pages 44–

55, 2000.

[SS05] Christian Schindelhauer and Gunnar Schomaker. Weighted distributed

hash tables. In SPAA ’05: Proceedings of the seventeenth annual ACM

symposium on Parallelism in algorithms and architectures, pages 218–

227, New York, NY, USA, 2005. ACM Press.

[Tec07] Hitatchi Global Storage Technologies. Hitatchi UltrastarTMA7K1000.

Specification Sheet, Hitatchi Global Storage Technologies, 2007.

[TPBG93] F. Tobagi, J. Pang, R. Baird, and M. Gang. Streaming raid - a disk

array management system for video files. In Proceedings of of 1st ACM

Multimedia, pages 393–400, 1993.

[Voe99] Berthold Voecking. How asymmetry helps load balancing. In IEEE

Symposium on Foundations of Computer Science, pages 131–141, 1999.

[VRG95] Harrick M. Vin, S. S. Rao, and Pawan Goyal. Optimizing the placement

of multimedia objects on disk arrays. In International Conference on

Multimedia Computing and Systems, pages 158–166, 1995.

[WAS+96] Stephen Williams, Marc Abrams, Charles R. Standridge, Ghaleb Ab-

dulla, and Edward A. Fox. Removal policies in network caches for

World-Wide Web documents. In Procedings of the ACM SIGCOMM

’96 Conference, Stanford University, CA, 1996.

[WBMM06] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, and Carlos Maltzahn.

Crush: controlled, scalable, decentralized placement of replicated data.

In SC ’06: Proceedings of the 2006 ACM/IEEE conference on Super-

computing, page 122, New York, NY, USA, 2006. ACM.

[Wie07] Udi Wieder. Balanced allocations with heterogenous bins. In SPAA ’07:

Proceedings of the nineteenth annual ACM symposium on Parallel al-

gorithms and architectures, pages 188–193, New York, NY, USA, 2007.

ACM.

Bibliography 133

[WK02] Hakim Weatherspoon and John Kubiatowicz. Erasure coding vs. replica-

tion: A quantitative comparison. In IPTPS ’01: Revised Papers from the

First International Workshop on Peer-to-Peer Systems, pages 328–338,

London, UK, 2002. Springer-Verlag.

[XB99] Lihao Xu and Jehoshua Bruck. X-code: MDS array codes with optimal

encoding. IEEE Transactions on Information Theory, 45(1):272–276,

1999.

[ZG97] Roger Zimmermann and Shahram Ghandeharizadeh. Continuous dis-

play using heterogeneous disk-subsystems. In MULTIMEDIA ’97: Pro-

ceedings of the fifth ACM international conference on Multimedia, pages

227–238, New York, NY, USA, 1997. ACM.

[ZG00] Roger Zimmermann and Shahram Ghandeharizadeh. HERA: Hetero-

geneous extension of RAID. In Proc. of the International Conference

on Parallel and Distributed Processing Techniques and Applications

(PDPTA 2000), 2000.

[Zim98] Roger Zimmermann. Continuous media placement and scheduling in

heterogeneous disk storage systems. PhD thesis, Los Angeles, CA, USA,

1998. Adviser-Shahram Ghandeharizadeh.