
Dissertation

Algorithms for

Dynamic Geometric Data Streams

Dipl.–Math. Gereon Frahling

Fakultät für Elektrotechnik, Informatik und Mathematik

Institut für Informatik & Heinz Nixdorf Institut (HNI) &

Paderborn Institute for Scientific Computation (PaSCo)

Warburgerstraße 100, D - 33098 Paderborn

Reviewers: Jun. Prof. Dr. Christian Sohler, Universität Paderborn

Prof. Dr. Friedhelm Meyer auf der Heide, Universität Paderborn

Prof. Dr. Stefano Leonardi, Università di Roma "La Sapienza", Italy

Acknowledgements

First of all I would like to thank my advisor Christian Sohler for his great support. It was not

always easy to keep pace with his great ability to find interesting problems and develop new ideas

to solve them. During the whole time he gave me the feeling that I can always ask (even stupid)

questions and was responsible for the great atmosphere in Paderborn. Without the fun I had at

work I would have never been able to develop the results presented in this thesis.

I also benefited a lot from the great experience of my co-advisor Friedhelm Meyer auf der

Heide. He gave me the opportunity to come to Paderborn and the freedom to choose a research

area to work on. He always lent a sympathetic ear for all kinds of problems. I would also like to

thank Friedhelm’s whole research group for the nice time in Paderborn.

Then I would like to thank Kristina for her patience, empathy, love, and all the other things

that make it so worthwhile to know her. She turned even the most stressful days into a wonderful

time.

Finally I would like to thank those to whom I owe the most: my parents, Adolf and Margret. They always gave me the feeling of being loved and supported, whatever I choose to do in my life.


Contents

Acknowledgements
Contents

1 Introduction
1.1 Motivation
1.2 Related Work

2 Preliminaries
2.1 General Notations
2.2 Data Streams
2.3 Clustering
2.4 Chernoff Bounds with Limited Independence

3 Sampling Data Streams
3.1 The Unique Element (UE) Data Structure
3.2 The Distinct Elements (DE) Data Structure
3.3 A Sample Data Structure using Totally Random Hash Functions
3.4 A Sample Data Structure using Random Number Generators
3.5 A Sample Data Structure using Pairwise Independent Hash Functions

4 Sampling Geometric Data Streams and Applications
4.1 Sampling Geometric Data Streams
4.2 ε-Nets and ε-Approximations in Data Streams
4.3 Random Sampling with Neighborhood Information
4.4 Estimating the Weight of a Euclidean Minimum Spanning Tree

5 The Coreset Method
5.1 Definitions
5.2 Coresets for k-Median
5.3 Coresets for k-Means
5.4 Coresets for Oblivious Optimization Problems
5.5 Constructing Solutions on the Coreset
5.6 Coresets via Sampling

6 Coresets in Data Streams
6.1 Insertions
6.2 Deletions
6.3 Maximum Spanning Tree

7 A Kinetic Data Structure for MaxCut
7.1 Kinetic Tournament Trees
7.2 Approximating the Bounding Cube
7.3 The Kinetic Data Structure for MaxCut

8 An Efficient k-Means Implementation using Coresets
8.1 Definitions and Notations
8.2 The Algorithm
8.3 Experiments

9 Counting Motifs in Data Streams
9.1 Counting Triangles in Adjacency Streams
9.2 Counting Triangles in Incidence Streams
9.3 Counting Cliques of Arbitrary Size
9.4 Counting K_{3,3} in Incidence Streams

10 Conclusions

Bibliography

1 Introduction

The increasing interconnectivity of modern computer systems has led to the phenomenon of massive data sets occurring in the form of data streams. Terabytes of Internet traffic are routed through routers with small memory, and telecommunication companies collect gigabytes of communication data each day that have to be analyzed automatically. In almost every area of our digital life a huge amount of data is created. Part of the data is stored in gigantic data centers on thousands of large hard disks for later analysis. But most of the data is created on the fly, never stored anywhere, and forgotten after seconds.

In this thesis we concentrate on the analysis of such elusive data appearing as a data stream. In the data stream model the data arrives one item at a time and cannot be stored locally. We try to maintain a small summary or sketch of the data in local memory. This summary should later help us to answer certain kinds of queries about the data.

In a large part of this thesis we concentrate on a very common problem on huge data sets: clustering. The computational task of clustering is to partition a given input into subsets of equal characteristics. These subsets are usually called clusters and ideally consist of similar objects that are dissimilar to objects in other clusters. This way one can use clusters as a coarse representation of the data. We lose the accuracy of the original data set but reduce the complexity of the data.

Clustering has straightforward applications in data streaming scenarios. First, we are supposed to handle data sets that we are not able to store. Reducing the complexity of these data sets by clustering can enable us to store the clustered data for later examination. Second, on such huge data sets clustering is often an effective way to understand the structure of the data set in the first place. It addresses the problem of not seeing the wood for the trees, which often arises in the analysis of huge data sets.

To cluster a data set one first has to define a distance measure between objects. This is often done in practice by mapping all objects to points in a d-dimensional Euclidean space. The distance between the objects can then be measured by the Euclidean distance between points. Finally one can use established clustering objectives like k-median or k-means and corresponding algorithms to cluster the data.

One of the main results of this thesis is a technique to reduce the complexity of a huge point set to a weighted point set of logarithmic size, called a coreset. Good approximate clusterings for the whole point set can still be computed on the small coreset.

We will show how to maintain a coreset using polylogarithmic memory on dynamic geometric data streams consisting of insertions and deletions of points. Our algorithm enables us to compute (1±ε)-approximate k-median, k-means, and MaxCut clusterings on dynamic geometric data streams using polylogarithmic memory. With a (problem-dependent) mapping of objects to Euclidean points in mind, we can also compute clusterings on dynamic data streams of arbitrary items.

We now give an overview of the results developed in this thesis.

After introducing some notation, the data stream models, and the clustering objectives used throughout the thesis in Chapter 2, we start the development of data stream algorithms in Chapter 3. The data streams we consider here follow the turnstile model, i.e. they consist of update operations on a high-dimensional vector. We show how to sample sets of indices almost uniformly at random from the support of the current vector. This is equivalent to sampling random elements from a dynamic multiset P given as a data stream of insert and delete operations. The algorithm uses O(log²(UM/δ)) memory bits, where U denotes the dimension of the vector (resp. the number of possible elements to be included in the set), M an upper bound on each vector component (resp. the multiplicity of single elements), and δ the desired statistical difference of the sampling from a uniform one. The difficulty here lies in the desired uniformity of the sampling, independent of the multiplicity of the elements in P. Furthermore, if the current multiset P is small after many insert and delete operations, we must be able to reconstruct P to get uniform samples.
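The following Python sketch illustrates one standard way to sample from the support of a multiset under insertions and deletions. It is an assumption-laden illustration, not the construction of Chapter 3: the names `OneSparseCell` and `SupportSampler`, the linear hash function, and the level scheme are all hypothetical. Each cell keeps three counters that reveal whether exactly one distinct element is present (assuming multiplicities never become negative), and a hash function thins the universe geometrically so that some level is likely to contain a single surviving element.

```python
import random


class OneSparseCell:
    """Three counters that reveal whether exactly one distinct element is present."""

    def __init__(self):
        self.s0 = 0  # sum of multiplicities
        self.s1 = 0  # sum of element * multiplicity
        self.s2 = 0  # sum of element^2 * multiplicity

    def update(self, x, delta):
        self.s0 += delta
        self.s1 += x * delta
        self.s2 += x * x * delta

    def unique(self):
        # For non-negative multiplicities, Cauchy-Schwarz gives s1^2 <= s0 * s2,
        # with equality exactly when all multiplicity sits on a single element.
        if self.s0 > 0 and self.s1 * self.s1 == self.s0 * self.s2:
            return self.s1 // self.s0
        return None


class SupportSampler:
    """Hypothetical sampler: level j receives an element with probability about 2^-j."""

    def __init__(self, universe_size, seed=0):
        rng = random.Random(seed)
        self.prime = (1 << 61) - 1
        self.a = rng.randrange(1, self.prime)
        self.b = rng.randrange(self.prime)
        self.levels = [OneSparseCell() for _ in range(universe_size.bit_length() + 1)]

    def _depth(self, x):
        # Number of trailing zero bits of the hash value, capped at the last level.
        h = (self.a * x + self.b) % self.prime
        depth = 0
        while depth < len(self.levels) - 1 and (h >> depth) & 1 == 0:
            depth += 1
        return depth

    def update(self, x, delta):
        # delta = +1 for an insertion, -1 for a deletion.
        for cell in self.levels[: self._depth(x) + 1]:
            cell.update(x, delta)

    def sample(self):
        # Inspect the deepest non-empty level; it yields a sample only if a
        # single distinct element survived there, otherwise the attempt fails.
        for cell in reversed(self.levels):
            if cell.s0 > 0:
                return cell.unique()
        return None
```

Note that the multiplicity of an element does not influence the level it is hashed to, which is why the result does not favour frequent elements. In practice one would run several independent copies to reduce the failure probability.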

We apply this sampling technique to point sets in Chapter 4 and present low-storage data structures to sample points from a dynamic geometric data stream consisting of insertions and deletions of points from the d-dimensional discrete Euclidean space {1,...,∆−1}^d. The sampling is done almost uniformly. We also show direct applications of our sampling technique. Let P be the dynamically evolving point set encoded in the stream.

The data structures developed in Section 4.2 maintain ε-nets and ε-approximations of range spaces of P having small VC-dimension D. The number of memory bits our data structures use is bounded by poly(D, ε⁻¹, log(∆/δ)), where δ is the desired error probability. Although we do not store the whole point set, we can, after passing over the dynamic geometric data stream, approximately answer certain queries on ranges (e.g. about the number of points in a given rectangle) using the statistics we maintain.
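To make the idea of answering range queries from a sample concrete, here is a minimal Python sketch. It assumes the sample is drawn uniformly with replacement from a point list held in memory, which is a simplification: the data structures of Section 4.2 obtain their samples from the stream. The function name `estimate_range_count` and its parameters are hypothetical.

```python
import random


def estimate_range_count(points, sample_size, rect, seed=0):
    """Estimate how many 2D points lie in an axis-aligned rectangle from a uniform sample."""
    rng = random.Random(seed)
    sample = [rng.choice(points) for _ in range(sample_size)]
    (x_lo, y_lo), (x_hi, y_hi) = rect
    hits = sum(1 for (x, y) in sample if x_lo <= x <= x_hi and y_lo <= y <= y_hi)
    # Scale the fraction of sampled points inside the rectangle up to the full set.
    return len(points) * hits / sample_size
```

The estimate is within an additive error of ε·|P| with good probability once the sample size is large enough relative to 1/ε², which is the kind of guarantee an ε-approximation provides for all ranges simultaneously.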

Based on a more sophisticated sampling of points and their respective neighbourhood we also present a low-storage data structure to approximate the weight of a Euclidean minimum spanning tree of the points in P in Section 4.4.

The results of Chapters 3 and 4 have been published in [G. Frahling, P. Indyk and C. Sohler, Sampling in Dynamic Data Streams and Applications, In: Proceedings of the 21st Annual Symposium on Computational Geometry (SoCG), pages 142–149, 2005. Invited to the special issue of SoCG 2005, to appear in International Journal of Computational Geometry and Applications (IJCGA)].

The heart of the thesis is Chapter 5. We develop (1+ε)-approximation algorithms for k-median, k-means, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson (MaxTSP), and average distance. Our algorithms compute a small weighted coreset consisting of O(k·log n/ε^{O(d)}) points that approximates the input point set with respect to the considered problem. The coresets can be computed in nearly linear time.
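The notion of a weighted point set that approximates the input with respect to a clustering objective can be illustrated with a small Python sketch. It uses a naive uniform grid (every point is moved to the center of its grid cell, and coinciding points are merged into one weighted point); this is an assumption made for illustration only and is much cruder than the coreset construction of Chapter 5. The names `kmedian_cost` and `grid_summary` are hypothetical.

```python
import math
from collections import Counter


def kmedian_cost(weighted_points, centers):
    """Weighted sum of distances from each point to its nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers) for p, w in weighted_points)


def grid_summary(points, cell):
    """Snap points to the centers of a uniform grid and merge them into weighted points."""
    counts = Counter(tuple(math.floor(x / cell) for x in p) for p in points)
    return [(tuple((i + 0.5) * cell for i in idx), w) for idx, w in counts.items()]
```

Since every point moves by at most half a cell diagonal, the k-median cost of the summary differs from the cost of the original n points by at most n·cell·√d/2 for every choice of centers. A coreset in the sense of the thesis achieves a relative error of ε instead of such an additive error.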

Given a coreset, one only needs a fast approximation algorithm for the weighted problem to compute a solution quickly. In fact, even an exponential-time algorithm is sometimes feasible, since its running time on the small coreset may still be polynomial in n. We will use algorithms from [61] to compute (1+ε)-approximate solutions for k-median and k-means on the coreset in poly(log n, exp(1/ε)) time. For MaxCut our technique (reducing the large point set to a small coreset and then computing an approximate solution on the coreset) leads to the fastest known PTAS (in terms of n) for Euclidean MaxCut. It is presented in Section 5.5.3.

The new coreset method also has the advantage that it does not rely on assumptions about the optimal clustering solution (in contrast to previous approaches like [61]). This helps us to develop the first efficient algorithms to maintain coresets during the m insert/delete operations of a dynamic geometric data stream. The space used and the update time per insert/delete operation are bounded by poly(ε⁻¹, log m, log ∆) for constant k and dimension d. At each point in time during the dynamic geometric data stream we can efficiently extract a coreset from the summary held in memory and compute a (1±ε)-approximate solution for k-median, k-means, MaxCut, and all other problems mentioned above. The algorithms are presented in Chapter 6.

Chapters 5 and 6 are based on [G. Frahling and C. Sohler, Coresets in Dynamic Geometric

Data Streams, In: Proceedings of the 37th Annual ACM Symposium on Theory of Computing

(STOC), pages 209–217, 2005].

In Chapter 7 we will use the coreset technique developed in Chapter 5 to develop an efficient kinetic data structure to maintain a (1+ε)-approximate MaxCut clustering of n points moving linearly in ℝ^d. The data structure is able to answer queries of the form “to which side of the partition does query point p belong?” during the whole movement of the points, each query in time polylogarithmic in n.

Previously it was not known whether a set of n points moving linearly could force Ω(n²) updates of a (1±ε)-approximate MaxCut solution. Our data structure shows that such effort is not needed: under linear motion the data structure processes a number of events linear in n, each requiring O(log² n) time. A flight plan update can also be performed in small expected time when it is performed on a point chosen uniformly from the set of points. No efficient kinetic data structures for MaxCut were known before.

In Chapter 8 we present an efficient implementation of a k-means clustering algorithm. Our algorithm is a variant of KMHybrid [83, 104], i.e. it uses a combination of Lloyd steps and random swaps, but as a novel feature it uses the coreset construction of Chapter 5 to speed up the algorithm. The main strength of the algorithm is that it can quickly determine clusterings of the same point set for many values of k. This is necessary in many applications, since, typically, one does not know a good value for k in advance. Once we have clusterings for many different values of k we can determine a good choice of k using a quality measure of clusterings that is independent of k, for example the average silhouette coefficient. The average silhouette coefficient can be approximated using coresets.

To evaluate the performance of our algorithm we compare it with algorithm KMHybrid [104] on typical 3D data sets for an image compression application and on artificially created instances. Our data sets consist of 300,000 to 4.9 million points. We show that our algorithm significantly outperforms KMHybrid on most of these input instances. Additionally, the quality of the solutions computed by our algorithm deviates less than that of KMHybrid.

We also compute clusterings and approximate average silhouette coefficients for each k between 1 and 100 for our input instances and discuss the performance of our algorithm in detail.

The description of the algorithm and experimental results have been previously published in [G. Frahling and C. Sohler, A Fast k-Means Implementation using Coresets, In: Proceedings of the 22nd Annual Symposium on Computational Geometry (SoCG), pages 135–143, 2006. Invited to the special issue of SoCG 2006, to appear in International Journal of Computational Geometry and Applications (IJCGA)].

Chapter 9 concentrates on graphs given as a data stream of edges. We develop space-bounded algorithms that with probability at least 1−δ compute a (1±ε)-approximation of the number of small motifs in these graphs. All algorithms are based on random sampling. Our first algorithm does not make any assumptions about the order of edges in the stream and approximates the number of triangles occurring in the input graph. It uses space that is inversely related to the ratio between the number of triangles and the number of vertex triples whose induced subgraph contains at least one edge, and uses constant expected processing time per edge.
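The sampling idea behind such a triangle estimator can be sketched in Python as follows. This is a simplified offline variant under stated assumptions, not the one-pass algorithm of Chapter 9: it keeps the whole edge set in memory, assumes vertices are numbered 0 to n−1 with n ≥ 3 and no self-loops, and the function name `estimate_triangles` is hypothetical. Each trial picks a uniformly random edge {a, b} and a uniformly random third vertex v, and checks whether both {a, v} and {b, v} are edges. A trial succeeds with probability 3·T/(|E|·(n−2)), where T is the number of triangles, so rescaling the success rate gives an unbiased estimate of T.

```python
import random


def estimate_triangles(num_vertices, edges, samples, seed=0):
    """Estimate the number of triangles by sampling (edge, third vertex) pairs."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    edge_list = [tuple(e) for e in edge_set]
    hits = 0
    for _ in range(samples):
        a, b = rng.choice(edge_list)
        v = rng.choice([u for u in range(num_vertices) if u != a and u != b])
        # The triple {a, b, v} is a triangle if both remaining edges exist.
        if frozenset((a, v)) in edge_set and frozenset((b, v)) in edge_set:
            hits += 1
    return hits / samples * len(edge_list) * (num_vertices - 2) / 3
```

The number of trials needed for a good estimate grows as the success probability shrinks, which mirrors the space bound stated above: the rarer triangles are among the triples containing at least one edge, the more samples are required.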

Our second triangle counting algorithm is designed for incidence streams (all edges incident

to the same vertex appear consecutively). It uses space that is inversely related to the ratio

between the number of triangles and the number of paths of length two in the graph and also has

small expected processing time per edge. These results significantly improve over previous work

[117, 81]. We generalize the results to the counting of cliques of size k.

Last but not least we present an algorithm to count the number of bipartite cliques with three nodes in each partition (K_{3,3}) in directed incidence streams with bounded out-degree of the nodes. The space needed for the approximation is inversely related to the ratio between the number of K_{3,3} and the number of bipartite cliques having one node in the destination partition (K_{3,1}).

Since the space complexity of our algorithms depends only on the structure of the input graph

and not on the number of nodes, our algorithms scale very well with increasing graph size and

provide a basic tool to analyze the structure of large graphs.

The algorithms to count triangle motifs have been published together with some experiments in [L. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, and C. Sohler, Counting Triangles in Data Streams, In: Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 253–262, 2006].

1.1 Motivation

A UC Berkeley study by Lyman and Varian [96] reports that in 2002, the most recent year for which figures are available, humankind stored about 5 exabytes of information (that is, 5 million terabytes, or 37,000 times the information in the Library of Congress). This is just the information that is stored. Most of the information created by humankind, about 18 exabytes according to the same study, is created on the fly and then forgotten. Most of this data would be interesting for later examination, but cannot be stored because of high data rates or storage capacity limits.


A prominent example of such data is the Internet traffic at a backbone router. Assume we want to maintain some statistics about the routed packets. It would be far too costly to store the required information (e.g., source and destination) for every packet routed. It is much more attractive to maintain a small sketch (or synopsis) of the data seen so far. Such a sketch should contain an approximation of the information we are interested in. This leads us to data streaming algorithms. They process the data one item at a time without storing it. After processing the stream they are able to answer certain queries about the data.

There are many other examples of data streaming scenarios: telephone call network optimization, sensor networks, banking and credit card transactions, peer-to-peer connection and transmission data, financial stock trade data, etc.

Data streaming algorithms often have an advantage even in environments where huge data sets are actually stored. According to the study cited above, about 92% of the stored data is held on hard disks. From these disks the data can be read sequentially at a high rate. Unfortunately, only the data items below the head(s) of the hard disk can be accessed immediately. If we decide to read data from another position on the disk, the disk head has to be moved, and after the movement we have to wait for the disk to rotate to the desired data item. This takes milliseconds even on current high-speed disks. Real-life applications for huge data sets therefore try to avoid such random access and read data sequentially.

Data streaming algorithms can be fed with a sequential stream of the data, avoiding random access altogether. If for a given problem we are able to find a data streaming algorithm that works with limited main memory, we can guarantee that no hard disk head moves are triggered.

Dynamic Geometric Data Streams. Let us assume we have a data stream consisting of insertions of objects into and deletions of objects from a big set Q. To learn something about the structure of the set Q we have to examine the relations of the objects of Q to each other. The first questions that arise concern the difference between objects. To answer such questions we first have to define what difference between objects means.

A first attempt would be to model complicated distance measures between objects of Q. To each pair of objects (a, b) we assign a number d(a, b) measuring the distance between the objects. This attempt would give us a precise modelling of difference, but has a major drawback: how can we store the distances between pairs of objects when we are not even able to store the objects themselves?

The only solution to this problem is to provide an implicit distance measure between objects. In practice objects are often mapped to a d-dimensional Euclidean space by a mapping α. The distance d(a, b) between objects is then measured as the Euclidean distance between α(a) and α(b) and can be computed implicitly. The dynamic data stream of objects then translates naturally into a dynamic geometric data stream, consisting of insertions and deletions of points, by applying α to each object in the stream.

There are even more direct applications of dynamic geometric data streams: sensor networks, mobile ad-hoc networks, or the analysis of astrophysical data often provide us directly with data streams of positional data, i.e. data streams of points in ℝ^d. For example, in a mobile ad-hoc network the participants may regularly broadcast updates of their current position. All participants want to maintain information about the distribution of the participants in order to maintain an efficient communication network. Since mobile devices usually have limited memory, it would be desirable to do this using only a small amount of space. Of course, one can model an update as deleting the old position and inserting the new one. Therefore, the model of dynamic geometric data streams applies.
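The reduction from position updates to insertions and deletions can be written down in a few lines of Python. This is only an illustrative sketch: the names `updates_to_stream` and `replay`, the operation labels "insert" and "delete", and the tuple encoding are assumptions, and the translator remembers the last position of each device, which in a real network the broadcasting device would know itself.

```python
from collections import Counter


def updates_to_stream(position_updates):
    """Turn (device, new_position) updates into insert/delete operations on points."""
    current = {}
    for device, new_pos in position_updates:
        if device in current:
            yield ("delete", current[device])
        yield ("insert", new_pos)
        current[device] = new_pos


def replay(stream):
    """Reference implementation: the multiset of points encoded by the stream."""
    points = Counter()
    for op, p in stream:
        points[p] += 1 if op == "insert" else -1
        if points[p] == 0:
            del points[p]
    return points
```

For example, the updates `[("a", (1, 1)), ("b", (2, 2)), ("a", (3, 1))]` produce three insertions and one deletion, and replaying them leaves the points (2, 2) and (3, 1). A streaming algorithm would consume the same operations without storing the multiset that `replay` builds.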

In the context of dynamic geometric data streams we have to reduce the complexity of the point sets. We are only able to store a small representation, which we want to use to answer certain queries about the points. In the geometric context one of the most interesting questions concerns the distribution of the points. In Section 4.2 we will address queries like “How many points lie in a certain rectangle?” and show how to answer them by maintaining ε-nets and ε-approximations of the points in small space.

In the context of mobile ad-hoc and sensor networks the question of the efficiency of the current communication network often arises. We can see the Euclidean minimum spanning tree as one of the most efficient communication networks, minimizing the total length of the connections between sensors. Estimating the weight of the Euclidean minimum spanning tree of the current network gives us information about the efficiency of the current communication structure. We will show how to estimate this weight in Section 4.4.

Clustering. The problem of clustering data sets according to some similarity measure is among the most extensively studied optimization problems. Clustering often serves as the first step towards understanding huge data sets. Clustered data helps to identify big groups of similar items in the data. Furthermore, one can easily find items similar to a query item.

Search engines like Ask.com advertise the ability to present websites on one topic as a cluster

to the user. Amazon buyers are clustered to find other users having similar interests. There are

many more applications in computational biology, machine learning, data mining and pattern

recognition. Since the quality of a clustering is rather problem dependent, there is no general

clustering algorithm. Consequently, over the years many different clustering algorithms have

been developed.

The most prominent and widely used clustering algorithm is Lloyd’s algorithm, sometimes also referred to as the k-means algorithm. This algorithm requires the input set to be a set of points in the d-dimensional Euclidean space. Its goal is to find k cluster centers and a partitioning of the points such that the sum of squared distances to the nearest center is minimized. The algorithm is a heuristic that converges to a local optimum. The main benefits of Lloyd’s algorithm are its simplicity and its foundation on the analysis of variance. Also, it is relatively efficient.
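A minimal Python sketch of Lloyd's algorithm is given below to make the two alternating steps explicit. It is a textbook version under simple assumptions (points given as tuples, initial centers supplied by the caller, empty clusters keep their old center); the function names `cost` and `lloyd` are chosen here for illustration and do not refer to the implementation of Chapter 8.

```python
import math


def cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def lloyd(points, centers, iterations=20):
    """Alternate between assigning points to centers and recomputing centroids."""
    centers = [tuple(c) for c in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the centroid of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Neither step can increase the objective, which is why the procedure converges, but only to a local optimum that depends on the initial centers. Every iteration scans all points in the assignment step.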

One major drawback of the k-means algorithm is that it needs access to the whole point set

in each iteration. In the data streaming scenarios described above it can therefore not be applied

directly.

We will provide a method to reduce the complexity of the point set P to a weighted coreset in Chapter 5. This method is the first to efficiently compute (1±ε)-approximate k-median, k-means, and MaxCut clusterings in dynamic geometric data streams, where deletions of points are allowed (see Chapter 6). The reduction is done using only polylogarithmic space and is therefore applicable to huge dynamic geometric data streams. Using a mapping of objects to Euclidean spaces as described above, our reduction technique can also be applied to huge dynamic data streams consisting of insertions and deletions of arbitrary objects.

In Chapter 8 we present an efficient implementation of our coreset reduction technique. It shows that even for data sets in main memory our method can be used to accelerate the first steps of the popular k-means method and achieve faster convergence. It can also be used in scenarios where the number of clusters is not specified in advance, because it is suitable for quickly computing clusterings for different values of k.

Clustering Kinetic Data. Clustering also plays a central role in ad-hoc mobile communication and sensor networks, where the underlying communication structures often depend on the proximity or other similarity characteristics of the stations in motion. For example, mobile networks are often organized in hierarchical clusters, where all the stations inside one cluster are in close proximity and communicate directly. The hierarchy of clusters then induces a tree structure on their leaders which can be used to establish communication (or perform other data management tasks) between different clusters.

Maintaining clusters of mobile nodes is a very challenging task in mobile networks, because of

the dynamic character of the moving nodes. Good clustering algorithms should ensure a tradeoff

between the quality of the clustering at any given time and its stability and efficiency under

motion.

In Chapter 7 we will develop the first kinetic data structure for MaxCut clustering. We assume that n points (the sensors) are moving along linear trajectories. We show that in this scenario we are able to maintain an approximate MaxCut clustering with Õ(n) effort during the motion of points. Updates on the point velocities can be handled efficiently.

Our algorithm is based on the coreset construction technique described in Chapter 5. Since

the coreset construction generally applies to the k-means and k-median problems as well, we

are confident that our ideas to develop kinetic data structures using coresets can lead to efficient

kinetic data structures for k-means and k-median clustering in the future as well.

Counting Motifs in Graphs. Graphs are fundamental structures for modeling complex relationships between data in Web documents, chemical compounds, XML, social networks etc. A basic tool to uncover their structural design principles and to extract relevant information is to mine the most frequent interconnection patterns occurring in the graph. The computation of network indices based on counting the number of certain small subgraphs is a basic tool in the analysis of the structure of large networks.

As an example, the occurrence of a very large number of certain dense subgraphs has been observed in the Webgraph, the graph formed by Web pages and their hyperlinks [88], in an attempt to trace the emergence of hidden cyber-communities. A stochastic model of the growth of the Webgraph [89], the "copying model", was then developed; it uses these dense subgraphs as building blocks of the process of network formation.

Another example is large software systems. The simplest way of reusing software is to duplicate some portion of the program and later adapt it to some specific needs. In [121] it is argued that the analysis of frequent network motifs in software architectures suggests that duplication and diversification mechanisms are responsible for a significant part of the observed topological features of large software graphs.

Since huge graphs (Webgraph crawls, for example) do not fit into main memory, we have to find algorithms that work on graphs stored on disk. To avoid random access altogether we will look into algorithms working on streams of edges.

Recent implementations and experiments suggest that our algorithms are suitable to compute

good estimations on the number of triangles of real webgraph crawls in time comparable to the

time to read the graph from the hard disc. Some of the experimental results have already been

published in [17].

1.2 Related Work

We subdivide the related work into the subjects data streams, geometric data streams, clustering,

kinetic data structures, and work related to the counting of motifs in graphs.

Data Streams. In 1996 Alon, Matias, and Szegedy analyzed data streaming algorithms to approximate the frequency moments [4] and also gave lower bounds on the memory needed for the approximation. The paper laid the foundation for a large body of research in the subsequent years. Some areas of interest are the counting of distinct items and the estimation of frequency moments [4, 22, 40, 46, 57, 77], the counting of frequent items [24, 30], and the computation of histograms [48, 53, 79]. For other work on streaming algorithms in general we refer to the survey by Muthukrishnan [105].

Ganguly, Garofalakis, and Rastogi gave a method to track set expression cardinalities in dynamic data streams [46]. Their method can potentially be altered to provide random samples as done in Chapter 3. However, this was not the purpose of [46] and the alterations needed are not stated in the paper. Cormode and Muthukrishnan [31] developed a sampling technique for dynamic data streams similar to the technique given in Chapter 3. Their result was obtained at the same time as, and independently of, our publication [42].

Geometric Data Streams. One of the first geometric problems studied on a stream of points was to approximate the diameter of a point set in the plane [36] using O(1/ε) space. Later this problem was also considered in higher dimensions [74], where an algorithm with space complexity O(d·n^{1/(c²−1)}) to maintain a c-approximate diameter for c > √2 was obtained. Chan and Sadjad proposed an algorithm to maintain an approximation of the diameter in the sliding window model [33]. In this model one considers an infinite input stream in which only the last n elements (the window) are relevant. They gave a (1+ε)-approximation algorithm for fixed dimension, which stores O((1/ε)^{(d+1)/2} log(R/ε)) points, where R denotes the ratio between the diameter and the smallest pairwise distance between two points in the window.

Cormode and Muthukrishnan introduced the radial histogram [29] to approximate different

geometric problems in the plane. A radial histogram is a subdivision of the plane given by con-

centric circles around a center point and halflines starting at the center. This way we can assign

1.2 Related Work

every point in the stream to a cell of the radial histogram. Problems that can be approximated via radial histograms include the diameter, convex hull, and the furthest neighbor problem. Hershberger and Suri showed how to maintain a set of $2r$ points such that the distance from the true convex hull of the points seen so far is $O(D/r^2)$, where $D$ is the current diameter of the sample set [69].

Agarwal et al. introduced a framework for approximating various extent measures of point

sets for constant dimension [2]. Their technique can be used to obtain streaming algorithms

for problems like the diameter, width, smallest bounding box, ball, and cylinder of the point

set. Chan used this framework to develop improved streaming algorithms (using less space than

previous ones) for a number of geometric problems with constant dimension including diameter,

width, minimum-radius enclosing cylinder, minimum width enclosing annulus, etc.

Bagchi et al. [9] gave two deterministic streaming algorithms that maintain $\varepsilon$-nets and $\varepsilon$-approximations under insertions of points. They apply their algorithm to approximate several robust statistics in data streams including Tukey depth, simplicial depth, regression depth, the Theil-Sen estimator and the least median of squares. Since their algorithm is deterministic it cannot be extended to the dynamic streaming model including deletions. Suri et al. gave both deterministic and randomized algorithms to compute a (weighted) $\varepsilon$-approximation for ranges that are axis-aligned boxes [118].

Indyk introduced the model of dynamic geometric data streams used in this thesis [75]. He

gave $O(d \cdot \log \Delta)$-approximation algorithms for (the weight of) minimum weighted matching, minimum bichromatic matching and minimum spanning tree. He also showed how to approximate the weight of an optimal solution of the facility location problem within a factor of $O(d \log^2 \Delta)$. For the $k$-median problem he gave a $(1+\varepsilon)$-approximation algorithm with query time $O\big(\Delta^{kd} \cdot k \cdot \varepsilon^{-d-1} (\log \Delta + \tfrac{1}{\varepsilon} \log \tfrac{1}{\varepsilon})\big)$ (exhaustive search). He also developed an $O(1)$-approximation that needs $O\big(\Delta^{d} \cdot k \cdot \varepsilon^{-d-1} (\log \Delta + \tfrac{1}{\varepsilon} \log \tfrac{1}{\varepsilon})\big)$ time to compute an approximation from the maintained data structure. He further gave a $(1+\varepsilon, O(\log \Delta (\log \Delta + \log(1/\varepsilon)/\varepsilon)))$-approximation algorithm, i.e. an algorithm that returns $O(\log \Delta (\log \Delta + \log(1/\varepsilon)/\varepsilon)) \cdot k$ medians whose cost is at most $1+\varepsilon$ times the cost of an optimal solution. This algorithm has polylogarithmic query time. All of the algorithms given above work using $O((k + \log \Delta + 1/\varepsilon)^{O(1)})$ space.

Minimum Spanning Tree Approximation. The problem of estimating the weight of a

minimum spanning tree has been considered in the context of sublinear time approximation

algorithms. The first such algorithm for the minimum spanning tree weight is designed for

sparse graphs and computes a $(1+\varepsilon)$-approximation [25]. It has a running time of $\tilde{O}(D \cdot W/\varepsilon^2)$ when the edge weights are in a range from $1$ to $W$ and the average degree of the input graph is $D$. In the geometric context a $\tilde{O}(\sqrt{n}/\varepsilon^{O(1)})$ time $(1+\varepsilon)$-approximation algorithm was given, if the point set can be accessed using certain complex data structures [32]. In the metric case one can compute a $(1+\varepsilon)$-approximation of the minimum spanning tree weight in $O(n/\varepsilon^{O(1)})$ time [20].


Clustering. It is beyond the scope of this section to give a comprehensive overview of the

clustering literature. We want to concentrate on results using coresets and then give a brief

overview of the most important developments with focus on partitioning algorithms. For a more

comprehensive overview of the work in clustering we refer to the surveys/books [14, 65, 80].

Har-Peled and Mazumdar gave $(1+\varepsilon)$-approximation algorithms for the $k$-median and $k$-means problem, when the points are given in a data stream consisting of insertions (and no

deletions) [61]. Their algorithm is based on maintaining coresets of logarithmic size. They

also mention the extension of their results to the case of dynamic streaming algorithms as an

interesting open problem.

Chan used coresets to approximate different geometric problems including diameter and min-

volume bounding box [19]. Together with Sadjad he considered the diameter and width of a

point set in the sliding window model [20].

In the context of clustering algorithms several other coreset constructions have been developed

for the k-median and k-means clustering problem [8, 60]. These coresets found applications

in approximation algorithms [8, 60] and clustering of moving data [59]. Also for projective

clustering, coresets have been developed [63].

Apart from clustering, coresets have found applications in basic problems in computational

geometry, for example, to compute an approximation of the smallest enclosing ball of a point set

[7] or to approximate extent measure of point sets [2, 19].

An overview of coreset constructions is given in [3].

The most popular algorithm for the k-means clustering problem is Lloyd’s algorithm [41, 62,

95, 97]. It is known that this algorithm converges to a local optimum [115]. Recently,

a number of very efficient implementations of this algorithm have been developed [5, 82, 83,

108, 109, 110]. These algorithms reduce the time needed to compute the nearest neighbors

in a Lloyd’s iteration, which is the most time consuming step of the algorithm. Arthur and Vassilvitskii showed that there are instances which require $2^{\Omega(\sqrt{n})}$ iterations [6].

In the Euclidean space there are many $(1+\varepsilon)$-approximation algorithms for the $k$-means

clustering problem [8, 38, 61, 71, 90, 99]. Also for the k-means problem in metric spaces efficient

constant factor approximation algorithms are known [83, 100].

The quality of random sampling in metric spaces has been analyzed for some clustering prob-

lems including the metric and the Euclidean k-median [34, 102]. The analysis can be easily

extended to the k-means clustering problem. A testbed for k-means clustering algorithms has

been given in [104].

Streaming algorithms for clustering problems have also been considered in the more general

metric space setting [23, 54, 101]. The currently best known algorithm for the k-median problem

is an $O(1)$-approximation using $O(k \cdot \operatorname{polylog} n)$ space [21].

MaxCut. MaxCut is known to be Max-SNP-hard for general graphs. It has a very easy 0.5-

approximation algorithm and an exciting 0.87856-approximation algorithm by Goemans and


Williamson [50]. For metric graphs (and hence also for geometric instances), Fernandez de la Vega, Karpinski, and Kenyon [37] designed a PTAS which computes a $(1 \pm \varepsilon)$-approximation in time $O(n^2 \cdot 2^{O(1/\varepsilon^2)})$. Indyk [72] designed a PTAS for metric graphs having runtime $O(n \cdot \log n \cdot (2^{(1/\varepsilon)^{O(1)}} + \log n))$. These are also the best known algorithms for the Euclidean version of MaxCut, for which it is still not known if the problem is NP-hard.

Kinetic Data Structures. There has been a lot of work on designing KDS for various cluster-

ing problems. For example, there are efficient KDS for the problems of finding (approximately)

minimum number of squares (or other geometric objects) that contain all the input points [47, 68]

and for the problem of finding $k$ clusters of minimum maximum radius that cover all points [59];

for other examples, see e.g., [1, 15, 56, 68]. However, unlike the MaxCut problem studied in this

thesis, the prior work typically has focused on clustering problems in which each clustered set

has been defined only by the inner-cluster properties. In contrast, the MaxCut is the problem of

clustering the input set of points into two sets for which the sum of the inter-cluster distances is

maximized, that is, for which the dissimilarity between the points in different clusters is maximized. Compared to the other clustering problems, MaxCut depends more on global properties of the input points and as such it requires different (novel) techniques to be solved efficiently.

Counting Motifs in Graphs. Recently much attention has been devoted to the analysis of

complex networks arising in information systems, software systems, overlay networks etc. Min-

ing the most frequent subgraphs is here aimed to identify the building blocks of universal classes

of complex networks [78, 92]. As an example, the occurrence of a very large number of certain

dense subgraphs has been observed in the Webgraph, the graph formed by Web pages and hyper-

linked connections [88], in the attempt of tracing the emergence of hidden cyber-communities.

A stochastic model of the growth of the Webgraph [89], the ”copying model”, has these dense

subgraphs as building blocks of the process of network formation.

D. Coppersmith and S. Winograd showed that the number of triangles in a subgraph can be

counted using matrix multiplication [27]. Schank and Wagner [112] give an extensive experi-

mental study of the performance of algorithms for counting and listing triangles in graphs.

Finding frequent graph patterns also finds application to graph databases [123].

Valverde and Solé used the analysis of frequent network motifs in software architectures to argue that network models based on duplication and diversification mechanisms account for a significant part of the observed topological features of large software graphs [121].


2 Preliminaries

In this chapter we introduce notations and models used throughout the thesis. In Section 2.1 we

begin with general notations.

We will then formally introduce different data stream models and complexity measures used

for the analysis of data stream algorithms in Section 2.2. After considering general data streams

consisting of arbitrary items we will concentrate on point sets and graphs encoded as data

streams.

A big part of the thesis introduces new methods to cluster huge point sets into groups of points.

In Section 2.3 we will introduce different clustering objectives, i.e. the k-median, k-means, and

MaxCut clustering.

2.1 General Notations

We begin with some basic notations and definitions.

We define

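$$\mathbb{R}^+ := \{x \in \mathbb{R} \mid x > 0\}.$$

We use the $\tilde{O}$-notation to hide polylogarithmic factors. Formally let $f : \mathbb{R} \to \mathbb{R}^+$ and $g : \mathbb{R} \to \mathbb{R}^+$ be functions with positive function values. We use the notation

$$f(n) = \tilde{O}(g(n))$$

to measure the growth of $f(n)$ with $n \to \infty$, iff there is a constant $k \in \mathbb{N}$ such that

$$f(n) = O(g(n) \cdot \log^k(g(n))) \quad \text{as } n \to \infty.$$

We will use $[n]$ to denote the set $\{0, \ldots, n-1\}$. For $a, b \in \mathbb{R}$ we define

$$(a, b) := \{x \in \mathbb{R} \mid x > a \wedge x < b\}$$

and

$$[a, b] := \{x \in \mathbb{R} \mid x \ge a \wedge x \le b\}.$$

For $x \in \mathbb{R}$ and $\varepsilon \in \mathbb{R}^+$ we define

$$(x \pm \varepsilon) := [x - \varepsilon, x + \varepsilon].$$

We define multiplications of intervals $[a, b]$ with positive scalars $x \in \mathbb{R}^+$ as

$$[a, b] \cdot x := x \cdot [a, b] := [a \cdot x, b \cdot x]$$

and multiplications of positive intervals $[a, b]$ and $[c, d]$ having $0 < a < b$ and $0 < c < d$ as:

$$[a, b] \cdot [c, d] := [a \cdot c, b \cdot d].$$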


Euclidean Spaces

The $d$-dimensional vector space $\mathbb{R}^d$ is the set of all $d$-tuples $(x_1, \ldots, x_d)$ of real numbers $x_1, \ldots, x_d$. The elements of the space are called vectors or points. For each point $p \in \mathbb{R}^d$ we denote the $i$th component of the point as $p^{(i)}$, such that $p = (p^{(1)}, p^{(2)}, \ldots, p^{(d)})$. Addition and subtraction of points in $\mathbb{R}^d$ gives again a point in $\mathbb{R}^d$ and is defined component-wise:

$$\forall p, q \in \mathbb{R}^d: \quad (p + q)^{(i)} = p^{(i)} + q^{(i)}.$$

The multiplication of a scalar $x \in \mathbb{R}$ with a point $p \in \mathbb{R}^d$ gives again a point in $\mathbb{R}^d$ and is defined as

$$(x \cdot p)^{(i)} = x \cdot p^{(i)}.$$

Throughout the thesis we will always assume that $d$ is a constant. Therefore we will write

$$O(d) = O(1).$$

If we restrict the coordinates of a point set to a subset $R$ of $\mathbb{R}$ we write $R^d$ for the set of $d$-tuples of numbers from $R$. In the thesis we will use the special cases $[0, 1]^d$, where $R$ is the compact interval of numbers between $0$ and $1$, and $[\Delta]^d = \{0, \ldots, \Delta - 1\}^d$ with $\Delta \in \mathbb{N}$.

As distance measure between the points we will often use the Euclidean distance $d : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^+$ defined as

$$d(p, q) := \sqrt{\sum_{i=1}^{d} \big(p^{(i)} - q^{(i)}\big)^2}.$$

We generalize this definition to sets, i.e.

$$\forall p \in \mathbb{R}^d \; \forall Q \subset \mathbb{R}^d: \quad d(p, Q) = \min_{q \in Q} d(p, q)$$

and

$$\forall Q \subset \mathbb{R}^d \; \forall R \subset \mathbb{R}^d: \quad d(Q, R) = \min_{q \in Q,\, r \in R} d(q, r).$$

For a finite set $P = \{p_1, \ldots, p_n\}$ of points from $\mathbb{R}^d$ the center of gravity $\mu(P)$ is itself a point in $\mathbb{R}^d$ and defined as

$$\mu(P) = \big(\mu(P)^{(1)}, \ldots, \mu(P)^{(d)}\big) \quad \text{with} \quad \mu(P)^{(i)} := \frac{1}{n} \cdot \sum_{p \in P} p^{(i)}.$$

We also call the center of gravity $\mu(P)$ the mean of $P$.
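
The definitions of this subsection translate directly into code. The following minimal Python sketch illustrates the Euclidean distance, its extension to a point set, and the center of gravity. The function names `dist`, `dist_to_set`, and `mean` are illustrative and not part of the thesis; points are assumed to be tuples of equal length and all sets are assumed to be non-empty.

```python
import math


def dist(p, q):
    """Euclidean distance d(p, q) between two points given as tuples."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def dist_to_set(p, Q):
    """d(p, Q): distance from p to its nearest point in the non-empty set Q."""
    return min(dist(p, q) for q in Q)


def mean(P):
    """Center of gravity mu(P): the component-wise average of the points."""
    n = len(P)
    d = len(P[0])
    return tuple(sum(p[i] for p in P) / n for i in range(d))
```

For instance, `mean([(0, 0), (2, 4)])` averages each coordinate separately, which is exactly the component-wise definition of $\mu(P)$.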

2.2 Data Streams

We now define different types of data streams. The first models we present, the cash register and turnstyle models, are very general and encode (huge) multisets of arbitrary items. We will then present special cases of these cash register and turnstyle models we obtain when we consider points in $[\Delta]^d$ as items. The respective data streams are called geometric data streams and dynamic geometric data streams. Finally we will define streams which encode graphs.


Cash Register and Turnstyle Model

We first define the cash register and turnstyle models of data streams. They are very gen-

eral and have been subject of many recent theoretical and practical papers. See the survey of

Muthukrishnan [105] for a lineup of recent results.

Let $Q$ be a finite set of $U$ items of different kinds. Let $q_0, \ldots, q_{U-1}$ be these items. We assume that $Q$ is a very large set.

The data streams we consider encode a multiset of items from $Q$ into a data stream. We use a $U$-dimensional vector $x = (x_0, x_1, \ldots, x_{U-1})$ to describe this multiset: $x_i$ signals that we have exactly $x_i$ items of kind $q_i$ in the current multiset. We assume that we always have at most $M - 1$ items of the same kind in our multiset, i.e. $x \in [M]^U$.

We assume that at the beginning of the data stream the multiset is empty, i.e. all $U$ components of $x$ are zero.

In the turnstyle model the data stream consists of a sequence of $m$ update operations on the vector $x$. Each update operation has the form UPDATE$(i, a)$ with $i \in [U]$ and $a \in \{-M, \ldots, M\}$ and means that we have to add $a$ to $x_i$. All update operations in the stream lead to an $x_i$ value within $\{0, \ldots, M-1\}$. Thus, at any moment, $0 \le x_i < M$ for $i \in [U]$ (our algorithms will not verify this assumption). An operation UPDATE$(i, a)$ with $a \ge 0$ can be seen as adding $a$ items of kind $q_i$ to our current multiset. An operation UPDATE$(i, -a)$ with $a > 0$ can be seen as deleting $a$ items of kind $q_i$ from our current multiset. This way we have a huge multiset of items encoded in the stream which is dynamically changing.

In the cash register model all UPDATE$(i, a)$ operations have a value $a \ge 0$. Therefore we have to deal only with insertions of items into the current multiset; deletions do not occur in the stream.
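
To make the semantics of the two models concrete, the following Python sketch replays a stream of UPDATE operations on an explicitly stored vector. It is only a reference for the meaning of the operations: it keeps every non-zero entry in memory, which is exactly what a streaming algorithm has to avoid. The names `apply_updates` and `is_cash_register` are hypothetical, and, unlike the algorithms in this thesis, the sketch does check the assumption $0 \le x_i < M$.

```python
def apply_updates(updates, M):
    """Replay UPDATE(i, a) operations and return the non-zero entries of x.

    The vector x is stored explicitly as a dict from index i to x_i, so this
    is the large-memory baseline, not a streaming algorithm.
    """
    x = {}
    for i, a in updates:
        value = x.get(i, 0) + a
        if not 0 <= value < M:
            raise ValueError("stream violates 0 <= x_i < M")
        if value == 0:
            x.pop(i, None)  # keep only the support of x
        else:
            x[i] = value
    return x


def is_cash_register(updates):
    """A stream fits the cash register model if no update deletes items."""
    return all(a >= 0 for _, a in updates)
```

A stream such as `[(3, 2), (5, 1), (3, -2)]` is a valid turnstyle stream but not a cash register stream, because its last operation deletes two items of kind $q_3$.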

Our algorithms see the data streams one update operation at a time and we cannot store all data items due to memory restrictions. In particular, random access to the data is impossible. At certain times the algorithms are asked to answer certain queries about the current vector $x$, or equivalently the current multiset. For example in Chapter 3 we show how to return a random element from the current multiset, each element with approximately the same probability independently of its multiplicity.

Most of our algorithms will output just an approximate solution to a given problem. For example when we consider optimization problems on the current point set, we will only require the algorithm to return a solution having objective function value within $(1 \pm \varepsilon) \cdot \mathrm{Opt}$, where $\mathrm{Opt}$ denotes the optimal objective function value. We call this solution a $(1 \pm \varepsilon)$-approximate solution. We will always assume that $\varepsilon < 1/100$.

Space complexity. We assume that the size $U$ of the set of possible items, the length of the stream $m$, and the maximum number of occurrences of one item $M$ can be very large. In particular it is impossible to store the whole set of items locally.

Therefore the whole data stored by our algorithm should have bit complexity which is at most polylogarithmic in $U$, $m$, and $M$.


Time complexity. We measure time complexity in the Real Random Access Machine (Real RAM) model, as usual in computational geometry to avoid calculations bounding numerical errors. However, note that all algorithms could be altered to use only numbers that could be stored using $c = O(\log(mUM/\varepsilon))$ bits, where $\varepsilon$ is a small value related to the accuracy of our data streaming algorithms. Since all of our geometric algorithms return approximate results, we are confident that all stated computation times can also be proven in a RAM model having $c$ bits per register.

There are two different measures of time complexity. First, in many real world scenarios we have to react very fast to each UPDATE operation. Therefore we are supposed to develop algorithms which spend time polylogarithmic in $U$, $m$, and $M$ to process an UPDATE operation. The polynomial degree should be as small as possible. Second, there is the time to answer a query on the current set of items. This time can be considerably longer than the update time, but it should also preferably be polylogarithmic in $U$, $m$, and $M$.

Dynamic Geometric Data Streams

The model of dynamic geometric data streams was first defined in [75]. A dynamic geometric data stream consists of $m$ UPDATE operations on a point set $P \subseteq [\Delta]^d$ in a discrete $d$-dimensional space which is initially empty. We always assume that $\Delta > 100$. An UPDATE operation can be an ADD$(p)$ operation of a point $p \in [\Delta]^d$, which inserts $p$ into $P$, or a REMOVE$(p)$ operation, which deletes $p$ from $P$. We assume that no point is inserted when it is already in $P$ and that no point is deleted from $P$ when it is not currently in the set.

The model of dynamic geometric data streams can be seen as a special case of the turnstyle model by setting $U = \Delta^d$ and $M = 2$. We again assume that we only have access to the sequence one operation at a time, and are allowed to use a number of memory bits polylogarithmic in $\Delta$ and

Adjacency Streams

Recently much effort has been made to investigate the structure of the webgraph. The difficulty

here is that it is quite impossible to hold huge structures in main memory. In a first scenario we

could have crawls on the webgraph on hard disc, and are able to access this data. However,

random access is very slow and has to be avoided. If we find algorithms which deal with the set

of edges given as a data stream, we could solve problems on the webgraph only using sequential

access to the data.

In another scenario one or more computers just crawl webpages. They read pages, follow links

and read the pages they are directed to recursively. This way these crawlers provide us with a

stream of edges.

In the adjacency stream model the data stream encodes an undirected graph G= (V, E). We

assume that we first are given the node set Vand that we are able to store it in local memory.


Then we are able to read a data stream of edges sequentially, i.e. one edge at a time. We are not allowed to go back in the data stream and read a previous item. We try to develop algorithms which use fewer memory bits than needed to encode the graph, i.e. $o(|E|)$ memory bits. This model is called the one pass adjacency stream model. Algorithms working in this model are called one pass algorithms.

When the data is stored on hard disk we can do multiple passes over the data by reading them $k$ times sequentially. We then speak of the $k$ pass adjacency stream model and $k$ pass algorithms.

Incidence Streams

Incidence streams are a special form of adjacency streams, but now the edges have a special

kind of ordering. In the incidence model we assume that each edge $(v, w)$ appears twice, once as $(v, w)$ and once as $(w, v)$. The whole set of $2 \cdot |E|$ edges is then presented as a data stream

ordered by source nodes. The model is more restricted than the general adjacency model. We

will develop algorithms for incidence streams in Section 9.2 having better memory bounds than

the algorithms for adjacency streams.
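
As a small illustration of the incidence model, the following Python sketch converts a list of undirected edges into the order in which an incidence stream would present them. The function name `incidence_stream` is hypothetical, node labels are assumed to be comparable, and the whole stream is materialized in memory purely for demonstration. Sorting also orders the edges of one source node by destination, which the model itself does not require.

```python
def incidence_stream(edges):
    """Turn undirected edges into directed pairs ordered by source node."""
    doubled = []
    for v, w in edges:
        doubled.append((v, w))  # the edge as seen from v
        doubled.append((w, v))  # the same edge as seen from w
    return sorted(doubled)
```

Because all pairs with the same source node are adjacent in the output, an algorithm reading this stream sees the complete neighborhood of a node in one contiguous block.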

In Section 9.4 we will alter the model a little bit. We will look at directed graphs having

bounded out-degree of the nodes. We are again presented a data stream of edges, but this time

each (directed) edge appears just once. The edges are ordered by destination nodes.

2.3 Clustering

Assume we have a huge set of objects $O$. To understand the structure of the set it is often a good idea to group the objects into clusters, such that objects in the same cluster are similar to each other and different from objects in other clusters. To do that we first have to define the notion of different. Mathematically this means to define a distance function $d : O \times O \to \mathbb{R}^+$ on pairs of objects. Two objects having a large distance are different. The distance function can be defined in various ways. One way is to embed the objects in the $d$-dimensional Euclidean space $\mathbb{R}^d$ by an embedding $\alpha : O \to \mathbb{R}^d$. We can then define the distance between objects as the Euclidean distance between the corresponding points:

$$\forall a, b \in O: \quad d(a, b) := d(\alpha(a), \alpha(b)) = \sqrt{\sum_{i=1}^{d} \big(\alpha(a)^{(i)} - \alpha(b)^{(i)}\big)^2}.$$

Having such an embedding in mind we concentrate in the following on clustering points in $\mathbb{R}^d$.

Based on the distance measure $d$ between points we now define certain clustering objectives,

i.e. what is a good clustering and what is a bad clustering.

For huge point sets often the k-median or k-means objective functions are used to define good

clusterings. They have the advantage that each cluster can be represented by one single cluster

center.


The k-median and k-means objectives measure the distance of all points to their corresponding

cluster centers in the following way:

Weighted Euclidean k-median clustering.

In the weighted Euclidean $k$-median clustering problem we are given a weighted set $P$ of points in $\mathbb{R}^d$ with weight function $w : P \to \mathbb{R}^+$. The goal is to find a set $C = \{c_1, \ldots, c_k\}$ of $k$ cluster centers in $\mathbb{R}^d$ and a partition of the set $P$ into $k$ clusters $C_1, \ldots, C_k$ such that

$$\mathrm{Median}(P, C, C_1, \ldots, C_k) := \sum_{i=1}^{k} \sum_{p \in C_i} w(p) \cdot d(p, c_i)$$

is minimized. In the unweighted version of the problem all point weights are $1$.

If the partition $C_1, \ldots, C_k$ relates each point to its nearest cluster center, i.e. if

$$\forall p \in P \; \forall i: \quad p \in C_i \Rightarrow d(p, c_i) = \min_{j} d(p, c_j),$$

then we write for short

$$\mathrm{Median}(P, C) := \mathrm{Median}(P, C, C_1, \ldots, C_k).$$

Weighted Euclidean k-means clustering.

In the weighted Euclidean $k$-means clustering problem we are given a weighted set $P$ of points in $\mathbb{R}^d$ with weight function $w : P \to \mathbb{R}^+$. The goal is to find a set $C = \{c_1, \ldots, c_k\}$ of $k$ cluster centers in $\mathbb{R}^d$ and a partition of the set $P$ into $k$ clusters $C_1, \ldots, C_k$ such that

$$\mathrm{Means}(P, C, C_1, \ldots, C_k) := \sum_{i=1}^{k} \sum_{p \in C_i} w(p) \cdot d(p, c_i)^2$$

is minimized. In the unweighted version of the problem all weights are $1$.

If the partition $C_1, \ldots, C_k$ relates each point to its nearest cluster center, i.e. if

$$\forall p \in P \; \forall i: \quad p \in C_i \Rightarrow d(p, c_i) = \min_{j} d(p, c_j),$$

then we write for short

$$\mathrm{Means}(P, C) := \mathrm{Means}(P, C, C_1, \ldots, C_k).$$
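
The two objective functions differ only in whether the distance to the nearest center is squared. The following Python sketch evaluates $\mathrm{Median}(P, C)$ and $\mathrm{Means}(P, C)$ for a given set of centers, assigning every point to its nearest center. The function names and the representation of the weight function as an optional dictionary `w` are assumptions made for this illustration; it only evaluates the cost of given centers and does not search for good ones.

```python
import math


def nearest_center_distance(p, C):
    """Distance from point p to its nearest center in the non-empty list C."""
    return min(math.dist(p, c) for c in C)


def median_cost(P, C, w=None):
    """Median(P, C): weighted sum of distances to the nearest center."""
    return sum((w[p] if w else 1) * nearest_center_distance(p, C) for p in P)


def means_cost(P, C, w=None):
    """Means(P, C): weighted sum of squared distances to the nearest center."""
    return sum((w[p] if w else 1) * nearest_center_distance(p, C) ** 2 for p in P)
```

Calling both functions on the same input shows how the squared distances in the $k$-means objective penalize far-away points more strongly than the $k$-median objective does.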


Weighted Euclidean MaxCut clustering.

The Euclidean MaxCut problem is the classical MaxCut problem on Euclidean graphs. Therefore, the goal is to find a partition of a point set $P$ into two sets $L$ and $R$ such that the total weight of the edges of the complete Euclidean graph that run between $L$ and $R$,

$$\mathrm{Cut}(P, L, R) := \sum_{p \in L} \sum_{q \in R} d(p, q),$$

is maximized.
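
The objective can be evaluated directly from its definition. The following Python sketch computes $\mathrm{Cut}(P, L, R)$ and, for very small inputs only, finds an optimal partition by trying all bipartitions. The exhaustive search takes exponential time and is meant purely as an illustration of the objective; it is not one of the algorithms developed in this thesis, and the function names are hypothetical.

```python
import math


def cut_value(L, R):
    """Cut(P, L, R): sum of Euclidean distances over all pairs in L x R."""
    return sum(math.dist(p, q) for p in L for q in R)


def brute_force_maxcut(P):
    """Try every bipartition of P; exponential time, for tiny inputs only."""
    n = len(P)
    best_value, best_L, best_R = 0.0, [], list(P)
    # The last point always stays in R, which skips mirrored partitions.
    for mask in range(2 ** max(n - 1, 0)):
        L = [P[i] for i in range(n) if (mask >> i) & 1]
        R = [P[i] for i in range(n) if not (mask >> i) & 1]
        value = cut_value(L, R)
        if value > best_value:
            best_value, best_L, best_R = value, L, R
    return best_value, best_L, best_R
```

On three collinear points at positions 0, 1, and 10 the search separates the far point from the two close ones, which reflects the global nature of the objective.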

2.4 Chernoff Bounds with Limited Independence

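We finally want to introduce a variant of Chernoff bounds from [113] that we will frequently use in the analysis of the streaming algorithms we develop. In contrast to traditional Chernoff bounds [58] they assume only $k$-wise independence of the underlying random variables. A set of random variables is called $k$-wise independent if the random variables in every subset of $k$ variables are independent.

Theorem 1 (Theorem 5, [113]). If $X$ is the sum of $k$-wise independent random variables, each of which is confined to the interval $[0, 1]$ with $\mu = E[X]$, then:

• For $\delta \le 1$:

  – if $k \le \lfloor \delta^2 \mu e^{-1/3} \rfloor$, then $\Pr[|X - \mu| \ge \delta\mu] \le e^{-\lfloor k/2 \rfloor}$.

  – if $k = \lfloor \delta^2 \mu e^{-1/3} \rfloor$, then $\Pr[|X - \mu| \ge \delta\mu] \le e^{-\lfloor \delta^2 \mu/3 \rfloor}$.

• For $\delta \ge 1$:

  – if $k \le \lfloor \delta \mu e^{-1/3} \rfloor$, then $\Pr[|X - \mu| \ge \delta\mu] \le e^{-\lfloor k/2 \rfloor}$.

  – if $k = \lfloor \delta \mu e^{-1/3} \rfloor$, then $\Pr[|X - \mu| \ge \delta\mu] \le e^{-\lfloor \delta \mu/3 \rfloor}$.

• For $\delta \ge 1$ and $k = \lceil \delta\mu \rceil$:

$$\Pr[|X - \mu| \ge \delta\mu] \le e^{-\delta \ln(1+\delta)\mu/2} < e^{-\delta\mu/3}.$$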


3 Sampling Data Streams

In this chapter we consider the problem of taking a random sample from a dynamic multiset of elements. The sampling should be uniform, i.e. each element of the set should be chosen with almost the same probability independent of its multiplicity. The dynamic multiset is given as a data stream of insert and delete operations. Usually it is represented by a high dimensional vector. The stream is then a stream of update operations on this vector as defined in the turnstyle model in Section 2.2.

In this section we present a general sampling technique on turnstyle data streams. We note that theoretically a similar method can be obtained from the methods Ganguly, Garofalakis, and Rastogi used in [46]. However, they did not state a sampling result as this was not the purpose of [46]. Cormode and Muthukrishnan [31] developed a sampling technique for turnstyle streams

similar to the technique given here. Their result was obtained at the same time independently of

our publication [42].

Our methods to sample random elements from turnstyle streams will be specialized to dynamic

geometric data streams in Chapter 4. We also give some direct applications of the sampling

method there. In Chapter 6 we will use the sampling results together with coreset techniques to

compute clusterings of dynamic data streams.

We denote by $x \in [M]^U$ the current vector and by UPDATE$(i, a)$ the update operations of the vector encoded within the stream.

Let $\mathrm{Supp}(x) = \{i \in [U] \mid x_i > 0\}$ be the support of $x$. We use the notation $\|x\|_0 = |\mathrm{Supp}(x)|$. After passing over the sequence of update operations we want to be able to answer queries which ask about a uniformly distributed random element $i$ from $\mathrm{Supp}(x)$. We want to return the index $i$ and the vector component value $x_i$.

Since we are working on a high dimensional vector $x = (x_0, \ldots, x_{U-1})$, we don't want to use the $\Theta(U \cdot \log M)$ memory bits needed by a trivial algorithm. Instead we will develop a data structure which accomplishes the task in space complexity polylogarithmic in $M$ and $U$.

Our data structure is parametrized by two numbers $\varepsilon, \delta > 0$. The operations are as follows:

• UPDATE$(i, a)$: performs $x_i \leftarrow x_i + a$, where $i \in [U]$, $a \in \{-M, \ldots, M\}$.

• SAMPLE: returns either a pair $(i, x_i)$ for some $i \in [U]$ or a flag FAIL. The procedure satisfies the following constraints:

  – If a pair $(i, x_i)$ is returned, then $i$ is chosen at random from $\mathrm{Supp}(x)$ such that for any $j \in \mathrm{Supp}(x)$,

$$\Pr[i = j] = \frac{1}{\|x\|_0} \pm \delta.$$

  – The probability of returning FAIL is at most $\delta$.

Keeping $O(s)$ instances of this data structure it is possible to choose a sample set $S$ of size $s$ (almost) uniformly at random from the non-zero entries of $x$.

Our sample data structure uses two elementary data structures. The first such structure checks whether there is exactly one non-zero entry in $x$. If this is the case the index of the entry and its value are returned. The second data structure approximates the number of non-zero entries in $x$.

3.1 The Unique Element (UE) Data Structure

The first elementary data structure called UE checks whether there is exactly one non-zero entry in $x$. The data structure is deterministic. It supports the following operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$.

• UPDATE$(i, a)$: as above

• REPORT: if $\|x\|_0 \ne 1$, then it returns FAIL. If $\|x\|_0 = 1$, then it returns the unique pair $(i, x_i)$ such that $x_i > 0$.

The data structure keeps three counters $c_0$, $c_1$, and $c_2$ which are initialized to $0$. Our UPDATE operation will ensure that $c_j = \sum_{i \in [U]} x_i \cdot i^j$ at any point of time. The operations UPDATE and REPORT are implemented as follows:

UPDATE$(i, a)$

    $c_0 = c_0 + a$

    $c_1 = c_1 + a \cdot i$

    $c_2 = c_2 + a \cdot i^2$

REPORT

    if $c_0 \cdot c_2 - c_1^2 \ne 0$ or $c_0 = 0$ then return FAIL

    else $i \leftarrow c_1 / c_0$

         $x_i \leftarrow c_0$

         return $(i, x_i)$

Claim 3.1.1 The data structure UE uses $O(\log(UM))$ bits of space. One UPDATE operation needs $O(1)$ time and a REPORT query needs $O(1)$ time. UE returns FAIL if and only if $\|x\|_0 \ne 1$. Otherwise, it correctly returns the unique pair $(i, x_i)$ with $x_i \ne 0$.

Proof: The maximum value of the counters $c_0, c_1, c_2$ is $O(U^3 \cdot M)$ and so the counters need $O(\log(UM))$ bits. It remains to prove that the data structure correctly recognizes the case $\|x\|_0 = 1$ and returns the unique pair $(i, x_i)$ with $x_i \ne 0$. From $x_i \ge 0$ for all $i \in [U]$ it follows


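that $c_0 = 0$ if and only if $\|x\|_0 = 0$. Furthermore

$$\begin{aligned}
c_0 \cdot c_2 - c_1^2 &= \Big(\sum_{i \in [U]} x_i\Big) \cdot \Big(\sum_{i \in [U]} x_i \cdot i^2\Big) - \Big(\sum_{i \in [U]} x_i \cdot i\Big)^2 \\
&= \Big(\sum_{i \in [U]} x_i\Big) \cdot \Big(\sum_{i \in [U]} x_i \cdot i^2\Big) - \sum_{i,j \in [U]} x_i \cdot i \cdot x_j \cdot j \\
&= \sum_{i,j \in [U]} x_i x_j \cdot j^2 - \sum_{i,j \in [U]} x_i \cdot i \cdot x_j \cdot j \\
&= \frac{1}{2} \sum_{i,j \in [U]} x_i x_j \cdot j^2 + \frac{1}{2} \sum_{i,j \in [U]} x_i x_j \cdot i^2 - \sum_{i,j \in [U]} x_i x_j \cdot i \cdot j \\
&= \frac{1}{2} \sum_{i,j \in [U]} x_i x_j (j^2 - 2ij + i^2) \\
&= \frac{1}{2} \sum_{i,j \in [U]} x_i x_j (j - i)^2.
\end{aligned}$$

All summands are zero unless there exist $i, j \in [U]$ with $i \ne j$ and $x_i, x_j > 0$. In the latter case one summand is positive. Hence, $c_0 \cdot c_2 - c_1^2 > 0$ iff $\|x\|_0 > 1$. The correctness of the data structure follows immediately. □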
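
The UE data structure is short enough to be written out in full. The following Python sketch mirrors the UPDATE and REPORT operations given above; the class name `UniqueElement` is illustrative, and FAIL is represented by `None`, which is an assumption of this sketch. Python integers have arbitrary precision, so the counters cannot overflow here, whereas a fixed-width implementation would have to reserve $O(\log(UM))$ bits per counter as stated in Claim 3.1.1.

```python
class UniqueElement:
    """Sketch of UE: three counters c_j = sum over i of x_i * i**j."""

    def __init__(self):
        self.c0 = 0
        self.c1 = 0
        self.c2 = 0

    def update(self, i, a):
        """UPDATE(i, a): add a to x_i without storing x itself."""
        self.c0 += a
        self.c1 += a * i
        self.c2 += a * i * i

    def report(self):
        """REPORT: the unique pair (i, x_i), or None (FAIL) if ||x||_0 != 1."""
        if self.c0 == 0 or self.c0 * self.c2 - self.c1 ** 2 != 0:
            return None
        i = self.c1 // self.c0  # exact division when only x_i is non-zero
        return (i, self.c0)
```

The test $c_0 \cdot c_2 - c_1^2 = 0$ is what allows the structure to undo insertions: after a matching deletion the counters are exactly those of the remaining entries, so no history has to be kept.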

3.2 The Distinct Elements (DE) Data Structure

The data structure supports two operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$: UPDATE (as above) and REPORT, which with probability $1 - \delta$ returns a value $k \in [0, U]$ such that $\|x\|_0 \le k \le (1 + \psi) \cdot \|x\|_0$; the numbers $\delta$ and $\psi$ are parameters.

One can use a data structure from Ganguly, Garofalakis, and Rastogi [46] to solve this problem using $O(1/\psi^2 \cdot \log(1/\psi) \log(1/\delta) \log(U) \log(UM))$ bits of space. An UPDATE operation needs

O(1

2·log(1/δ)) time, a REPORT operation needs O(log U)time.

3.3 A Sample Data Structure using Totally Random Hash Functions

The basic idea behind our data structure is to use a hash function that maps the universe to a smaller space $[2^j]$. The value $2^j$ corresponds to a guess of the number of non-zero entries currently present in vector $x$. Assuming a fully random hash function every non-zero entry is


mapped to $0$ with the same probability. Further, whether exactly one non-zero entry is mapped to $0$ can be checked using our unique element data structure. If this is the case, our sample data structure returns the corresponding entry (it returns the index and the value of the entry). We now give the procedure in detail.

Our data structure uses hash functions $h_j$, $j \in [\lceil \log U \rceil + 1]$. Each $h_j$ is of the form $h_j : [U] \to [2^j]$. Initially, we assume that each $h_j$ is a fully random hash function; we relax this assumption later. The value $2^j$ corresponds to the guess that, currently, there are roughly $2^j$ non-zero entries in $x$.

In addition, we use:

• Unique Element data structures $\mathrm{UE}_j$, $j \in [\lceil \log U \rceil + 1]$, and

• a Distinct Elements data structure DE, with parameter $\psi = 1$ and error probability parameter $1/22$.

We write $\mathrm{UE}_j$.UPDATE resp. $\mathrm{UE}_j$.REPORT to denote an UPDATE resp. REPORT operation for data structure $\mathrm{UE}_j$. The operations are implemented as follows:

UPDATE$(i, a)$

    for $j \in [\lceil \log U \rceil + 1]$ do

        if $h_j(i) = 0$ then

            $\mathrm{UE}_j$.UPDATE$(i, a)$

    DE.UPDATE$(i, a)$

SAMPLE

    $j = \lceil \log(\text{DE.REPORT}) \rceil$

    return $\mathrm{UE}_j$.REPORT
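
The procedure above can be simulated in a few lines of Python. The sketch below makes two simplifications that are assumptions of the illustration and not part of the construction: the fully random hash functions $h_j$ are emulated by drawing and remembering a random value per index, and the DE structure is replaced by an exact count of the support. Both stand-ins need memory linear in the number of indices seen, so the sketch demonstrates the logic of UPDATE and SAMPLE but not the space bound. The class names `UE` and `Sampler` are hypothetical, and FAIL is represented by `None`.

```python
import random


class UE:
    """Unique Element counters c_j = sum over i of x_i * i**j, j = 0, 1, 2."""

    def __init__(self):
        self.c = [0, 0, 0]

    def update(self, i, a):
        for j in range(3):
            self.c[j] += a * i ** j

    def report(self):
        c0, c1, c2 = self.c
        if c0 == 0 or c0 * c2 - c1 * c1 != 0:
            return None  # FAIL
        return (c1 // c0, c0)


class Sampler:
    def __init__(self, U, seed=0):
        # One level per j in [ceil(log U) + 1]; bit_length avoids float logs.
        self.levels = (U - 1).bit_length() + 1
        self.rng = random.Random(seed)
        self.hash = [{} for _ in range(self.levels)]  # lazily drawn h_j
        self.ue = [UE() for _ in range(self.levels)]
        self.x = {}  # exact stand-in for DE, not small space

    def _h(self, j, i):
        """Emulated fully random hash h_j: [U] -> [2**j]."""
        table = self.hash[j]
        if i not in table:
            table[i] = self.rng.randrange(2 ** j)
        return table[i]

    def update(self, i, a):
        for j in range(self.levels):
            if self._h(j, i) == 0:
                self.ue[j].update(i, a)
        self.x[i] = self.x.get(i, 0) + a
        if self.x[i] == 0:
            del self.x[i]

    def sample(self):
        k = len(self.x)  # exact number of non-zero entries
        if k == 0:
            return None
        j = (k - 1).bit_length()  # ceil(log2(k))
        return self.ue[j].report()
```

Since a single SAMPLE call may return FAIL, one would in practice keep several independent instances, as described for the sample set $S$ above.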

Correctness. Assume that DE is correct. Note that this happens with probability at least $1 - 1/22$. We have $\mathrm{UE}_j.\text{REPORT} \ne \text{FAIL}$ iff $|\mathrm{Supp}(x) \cap h_j^{-1}(0)| = 1$. Since we assume fully random hash functions, the element reported by $\mathrm{UE}_j$ is an element chosen uniformly at random from $\mathrm{Supp}(x)$.

It remains to show a lower bound on the probability of $|\mathrm{Supp}(x) \cap h_j^{-1}(0)| = 1$. Denote $S_j = h_j^{-1}(0)$ and $\ell = \|x\|_0$. Because of $j = \lceil \log(\text{DE.REPORT}) \rceil$ and $\|x\|_0 \le \text{DE.REPORT} \le 2\|x\|_0$, we observe $\ell \le 2^j \le 4\ell$. Thus

$$\Pr[|S_j \cap \mathrm{Supp}(x)| = 1] = \ell \cdot 2^{-j} \cdot (1 - 2^{-j})^{\ell - 1} \ge 1/4 \cdot (1 - 1/\ell)^{\ell - 1}.$$

The function $(1 - 1/\ell)^{\ell - 1}$ is monotonically decreasing for $\ell \ge 1$ and $\lim_{\ell \to \infty} (1 - 1/\ell)^{\ell - 1} = 1/e$. Therefore we obtain:

$$\Pr[|S_j \cap \mathrm{Supp}(x)| = 1] \ge \frac{1}{4e} \ge \frac{1}{11}.$$

Since the error probability in our distinct elements structure is at most $1/22$, we obtain that our algorithm returns a random element with probability at least $1/11 - 1/22 = 1/22$.
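
The lower bound of $1/11$ can also be checked numerically. The following Python sketch evaluates the exact probability $\ell \cdot 2^{-j} \cdot (1 - 2^{-j})^{\ell - 1}$ for all admissible pairs with $\ell \le 2^j \le 4\ell$ up to a chosen limit and returns the smallest value found. It is a sanity check under floating point arithmetic, not a proof, and the function names are hypothetical.

```python
def unique_hit_probability(l, j):
    """Pr[exactly one of l non-zero entries hashes to 0] for range size 2**j."""
    p = 2.0 ** -j
    return l * p * (1.0 - p) ** (l - 1)


def worst_case(max_l):
    """Smallest probability over all l <= max_l and j with l <= 2**j <= 4*l."""
    worst = 1.0
    for l in range(1, max_l + 1):
        j = 0
        while 2 ** j <= 4 * l:
            if 2 ** j >= l:
                worst = min(worst, unique_hit_probability(l, j))
            j += 1
    return worst
```

Comparing `worst_case(2000)` with $1/11$ confirms the bound on this range and indicates how much slack the estimate $1/(4e)$ leaves.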


3.4 A Sample Data Structure using Random Number Generators

We will now give two different methods to overcome the assumption of totally random hash functions and achieve space complexity polylogarithmic in $M$ and $U$. First we present a general approach based on the random number generator of Nisan [106]. The method introduces a small additive error in the probability of each element to be sampled. This error can be translated into a small multiplicative error of $\varepsilon$. However, the method is difficult to implement, and for many applications a multiplicative error is sufficient. In Section 3.5 we present a method based on pairwise independent hash functions. It is easy to implement but introduces a multiplicative error in the probability of each element to be sampled.

To replace the assumption of fully random $h_j$ we use the following lemma developed by Indyk [73], which is based on Nisan's random number generator [106].

Lemma 3.4.1 [73] Consider an algorithm $A$ which, given a stream $S'$ of pairs $(i, a)$ and a function $f : [n'] \times \{0, 1\}^{R'} \times \{-M' \ldots M'\} \to \{-M'^{O(1)} \ldots M'^{O(1)}\}$, does the following:

• Set $O = 0$; initialize length-$R'$ chunks $R_i$, $i \in [n']$, of independent random bits

• For each new pair $(i, a)$: perform $O = O + f(i, R_i, a)$

• Output $A(S') = O$

Assume that the function $f(\cdot, \cdot, \cdot)$ is supported with an evaluation algorithm using $O(C + R')$ space and $O(T)$ time. Then there is an algorithm $A'$ producing output $A'(S')$ that uses only $O(C + R' + \log(M'n'))$ bits of storage and $O([C + R' + \log(M'n')] \log(n'R'))$ random bits, such that

$$\Pr[A(S') \neq A'(S')] \le 1/n'$$

over some joint probability space of randomness of $A$ and $A'$. The algorithm $A'$ uses $O(T + \log(n'R'))$ arithmetic operations per pair $(i, a)$. □

For each fixed $j$ our algorithm uses the hash function $h_j$ to select a subset $S_j := \{i \in [U] : h_j(i) = 0\}$ of indices. Each index is in the set $S_j$ with probability $1/2^j$. The UE data structure maintains the values $c_0 = \sum_{i \in S_j} x_i$, $c_1 = \sum_{i \in S_j} x_i \cdot i$, and $c_2 = \sum_{i \in S_j} x_i \cdot i^2$.

Instead of using a hash function $h_j$ to select the set $S_j$ we can use chunks $R_0, \ldots, R_{U-1}$ of $\log U$ random bits. We select $S_j$ as

$$S_j := \{i \in [U] : \text{all of the first } j \text{ bits of } R_i \text{ are } 0\}.$$

Again each index $i \in [U]$ is selected to be in the set $S_j$ with probability $1/2^j$. An update $(i, a)$ on the UE data structure then simply adds the functions $f_0$ to $c_0$, $f_1$ to $c_1$, and $f_2$ to $c_2$, where

• $f_0(i, R_i, a) := 0$ if one of the first $j$ bits of $R_i$ is 1, and $f_0(i, R_i, a) := a$ if all of the first $j$ bits of $R_i$ are 0

• $f_1(i, R_i, a) := 0$ if one of the first $j$ bits of $R_i$ is 1, and $f_1(i, R_i, a) := a \cdot i$ if all of the first $j$ bits of $R_i$ are 0

• $f_2(i, R_i, a) := 0$ if one of the first $j$ bits of $R_i$ is 1, and $f_2(i, R_i, a) := a \cdot i^2$ if all of the first $j$ bits of $R_i$ are 0
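The three update functions can be illustrated with a short Python sketch. It assumes that a chunk $R_i$ is given as a list of bits; the helper names `selected`, `f`, and `apply_update` are hypothetical and only serve the illustration.

```python
def selected(chunk, j):
    """True iff the first j bits of the chunk R_i are all 0, i.e. i is in S_j."""
    return all(bit == 0 for bit in chunk[:j])


def f(power, i, chunk, a, j):
    """Contribution of update (i, a) to the counter c_power at level j."""
    return a * i ** power if selected(chunk, j) else 0


def apply_update(counters, i, chunk, a, j):
    """Adds the three contributions to the counters (c_0, c_1, c_2) of level j."""
    return tuple(c + f(k, i, chunk, a, j) for k, c in enumerate(counters))
```

Because the same chunk is inspected for every $j$, the sets are nested: an index selected at level $j$ is also selected at every smaller level.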

We use Lemma 3.4.1 and plug in the values $n' = 1/\delta + U$, $M' = U \cdot M$, $R' = \lceil \log U \rceil$, and $C = \log(UM)$. We can replace the random bits $R_i$ by $O([C + R' + \log(M'n')] \log(n'R')) = O(\log(UM/\delta) \log(U/\delta))$ random bits. We use the same random bits for all $j$ and get an algorithm having the desired properties and using just $O(\log^2(UM/\delta))$ bits of storage. The distribution changes by less than $\delta$.

Lemma 3.4.2 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability at least $1/22 - \delta$ returns a pair $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ and returns a flag FAIL otherwise. The statistical difference between the distribution of the returned pair and the uniform distribution is at most $\delta$; in particular $\Pr[i = j] = 1/\|x\|_0 \pm \delta$ for every $j \in \mathrm{Supp}(x)$. The algorithm uses $O(\log^2(UM/\delta))$ space. □

Theorem 2 (Sampling in data streams.) Let $\delta \le \frac{1}{44}$. Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns $s$ pairs $(i_0, x_{i_0}), \ldots, (i_{s-1}, x_{i_{s-1}})$ with $i_j \in \mathrm{Supp}(x)$ and returns a flag FAIL otherwise. The returned pairs are independent of each other and may contain duplicates. The statistical difference between the distribution of each returned pair and the uniform distribution is at most $\delta$; in particular, for all $j \in \mathrm{Supp}(x)$ and all $k \in [s]$: $\Pr[i_k = j] = \frac{1}{\|x\|_0} \pm \delta$. The algorithm uses $O\big((s + \log(1/\delta)) \cdot \log^2(UM/\delta)\big)$ space.

Proof: We apply Lemma 3.4.2 and invoke $352 \cdot (s + \ln(1/\delta))$ instances of the data structure, each with independent random choices. Let $Y$ denote the random variable for the number of instances that return a random pair. Since $\delta \le \frac{1}{44}$, each instance returns a random element with probability at least $\frac{1}{22} - \delta \ge \frac{1}{44}$. Hence, $E[Y] \ge 8 \cdot (s + \ln(1/\delta))$ and, by a Chernoff bound,

$$\Pr[Y < s] \le \Pr[Y \le (1 - \tfrac{1}{2}) E[Y]] \le e^{-(1/2)^2 \cdot E[Y]/2} \le \delta.$$

Therefore with probability at least $1 - \delta$ we have at least $s$ samples. We return the first $s$ samples we obtain from the instances. □
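The amplification step of this proof can be sketched as follows. The constant 352 and the per-instance success probability $1/44$ are taken from the proof; the function names are hypothetical, and the sketch only reproduces the arithmetic and the selection of the first $s$ successful results.

```python
import math


def instances_needed(s, delta, constant=352):
    """Number of independent sampler instances used in the proof (rounded up)."""
    return math.ceil(constant * (s + math.log(1.0 / delta)))


def chernoff_failure_bound(s, delta, success_prob=1.0 / 44):
    """Upper bound exp(-(1/2)^2 * E[Y] / 2) on Pr[Y < s]."""
    expected = instances_needed(s, delta) * success_prob
    return math.exp(-0.25 * expected / 2.0)


def first_s_samples(results, s):
    """Returns the first s successful results, or None (FAIL) if there are too few."""
    good = [r for r in results if r is not None]
    return good[:s] if len(good) >= s else None
```

Here `results` stands for the list of outputs of the individual instances, with `None` representing FAIL.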

We will now show that we can recover the whole vector $x$ if we spend space slightly larger than $\|x\|_0$. This will be useful in situations when the support of the current vector is very small.

Corollary 3.4.3 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$ and given an oracle which tells us in advance the value of $\|x\|_0$ at the end of the stream, there is a data structure that with probability $1 - \delta$ returns all pairs $(i_0, x_{i_0}), \ldots, (i_{\|x\|_0 - 1}, x_{i_{\|x\|_0 - 1}})$ and returns a flag FAIL otherwise.

The algorithm uses $O\big(\|x\|_0 \cdot \log\frac{U}{\delta} \cdot \log^2\frac{UM}{\delta}\big)$ space.

Proof: If $U < 22$ we can store the whole vector using $O(\log M)$ space. Otherwise $\frac{1}{2U} \le \frac{1}{44}$ and we can apply the data structure of Theorem 2 with $s = 2\|x\|_0(\ln \|x\|_0 + \ln(2/\delta))$ and error probability parameter $\frac{\delta}{2U}$. We simply return all distinct samples we obtain.

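Let us fix an arbitrary index $j \in \mathrm{Supp}(x)$. We have

$$\Pr[\forall k \in [s] : j \neq i_k] \le \left(1 - \frac{1 - \delta}{\|x\|_0}\right)^s \le \left(1 - \frac{1}{2\|x\|_0}\right)^s \le e^{-\ln \|x\|_0 - \ln(2/\delta)} = \frac{\delta}{2\|x\|_0}.$$

It follows from the Union Bound:

$$\Pr[\exists j \in \mathrm{Supp}(x)\ \forall k \in [s] : j \neq i_k] \le \|x\|_0 \cdot \frac{\delta}{2\|x\|_0} \le \delta/2.$$

Therefore, the overall probability of failure is at most $\delta$.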

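The space requirement of the algorithm, with $\delta' = \frac{\delta}{2U}$, is:

$$O\left(\left(s + \log\frac{2U}{\delta}\right) \cdot \log^2\frac{UM}{\delta'}\right) = O\left(\left(\|x\|_0 \cdot \left(\log \|x\|_0 + \log\frac{1}{\delta}\right) + \log\frac{U}{\delta}\right) \cdot \log^2\frac{UM}{\delta}\right) = O\left(\|x\|_0 \cdot \log\frac{U}{\delta} \cdot \log^2\frac{UM}{\delta}\right).$$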

A second consequence can be obtained by translating Theorem 2 (which uses independent

draws and thus the sample set may contain multiple copies of the same point) to the case of

sampling of random subsets.

Corollary 3.4.4 Let $s \le \|x\|_0/2$. Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns a subset $S \subset \mathrm{Supp}(x)$ of $s$ indices and all pairs $(i, x_i)$ for $i \in S$ and returns a flag FAIL otherwise. The statistical difference between the distribution of the returned subset and the uniform distribution is at most $\delta$; in particular $\Pr[j \in S] = \frac{s}{\|x\|_0} \pm \delta$ for every $j \in \mathrm{Supp}(x)$.

The algorithm uses $O\big((s + \log(U/\delta)) \cdot \log^2(UM/\delta)\big)$ space.

Proof: Let us first assume that we have a data structure that returns an index distributed exactly uniformly among $\mathrm{Supp}(x)$. Then we can select $s' = c \cdot (s + \log(1/\delta))$ indices $i'_0, \ldots, i'_{s'}$ from $\mathrm{Supp}(x)$ uniformly at random with repetitions, where $c$ is a suitable constant. Since $s \le \|x\|_0/2$, as long as fewer than $s$ distinct indices have been chosen, each $i'_j$ is with probability at least $1/2$ an index that is not among the previously chosen indices. Hence, for $c$ large enough, with probability at least $1 - \delta/2$ we have at least $s$ distinct indices among $i'_0, \ldots, i'_{s'}$. If there are more than $s$ distinct indices among $i'_0, \ldots, i'_{s'}$ we choose $s$ indices uniformly at random from the distinct indices. Clearly, the computed index set is distributed uniformly at random.

We use Theorem 2 to select $s'$ indices almost uniformly at random. It remains to deal with the fact that Theorem 2 does not provide us with an exact uniform distribution. We will use error probability parameter $\delta' = \delta/(s' \cdot U)$ when we apply the theorem. Each time we choose a random point the statistical difference to the uniform distribution is at most $\delta' \cdot U$. Since we choose $s'$ elements the overall statistical difference of our process to the ideal one described above is at most $\delta' \cdot U \cdot s' = \delta$. Therefore, $\Pr[j \in S] = \frac{s}{\|x\|_0} \pm \delta$ for every $j \in \mathrm{Supp}(x)$ and the overall probability of error is at most $\delta$.

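The space needed by the algorithm is

$$O\left(\left(s' + \log\frac{1}{\delta'}\right) \cdot \log^2\frac{UM}{\delta'}\right) = O\left(\left(s + \log\frac{1}{\delta} + \log\frac{s' \cdot U}{\delta}\right) \cdot \log^2\left(\frac{U^2 M (s + \frac{1}{\delta})}{\delta}\right)\right) = O\left(\left(s + \log\frac{U}{\delta}\right) \cdot \log^2\frac{UM}{\delta}\right).$$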

Lemma 3.4.5 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that returns all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ when $\|x\|_0 \le A$ and a flag FAIL if $\|x\|_0 > A$. The algorithm works with probability $1 - \delta$ and uses $O\big(A \cdot \log\frac{U}{\delta} \cdot \log^2\frac{UM}{\delta}\big)$ space.

Proof: We start the algorithm of Corollary 3.4.3 with the assumption that $\|x\|_0 = 2A + 2$ and with error probability parameter $\delta' = \delta/2$. We call it algorithm 1. In case that $\|x\|_0 \le 2A + 2$ at the end of the stream, this algorithm reconstructs all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$.

In parallel we start the algorithm of Corollary 3.4.4 with $s = A + 1$ and error probability parameter $\delta' = \delta/2$ and call it algorithm 2. If algorithm 2 returns $A + 1$ elements, we return FAIL. If it returns fewer than $A + 1$ elements, we know that with probability $1 - \delta/2$ we have $\|x\|_0 < 2A + 2$. In that case algorithm 1 provides us with all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ with probability $1 - \delta/2$. We count the number of pairs. If $\|x\|_0 \le A$, we return all pairs, otherwise we return FAIL. □

Remark 3.4.6 All of the results stated with an additive error of $\delta$ can be transferred to a multiplicative error of $\delta$: When we apply the results with $\delta' = \delta/U$, we get an additive error of $\delta/U$. So each element is sampled with probability $\frac{1}{\|x\|_0} \pm \frac{\delta}{U}$. Since $\frac{1}{\|x\|_0} \ge \frac{1}{U}$ we conclude that each element is sampled with probability $\frac{1}{\|x\|_0} \cdot (1 \pm \delta)$. Since all memory bounds depend only on $O\big(\log\frac{U}{\delta}\big)$, the replacement of $\delta$ by $\frac{\delta}{U}$ does not change the memory bounds.

3.5 A Sample Data Structure using Pairwise Independent Hash Functions

In this section we focus on the development of a sample data structure that only uses pairwise independent hash functions [18] instead of the approach using a pseudo-random generator. We will consider a relative error instead of an additive error.

The basic idea behind the second sample data structure is similar to the structure using totally

random hash functions.

Assume we knew the number $n$ of non-zero entries in our vector $x$. Then we could use a hash function $h$ that maps $[U]$ to a space somewhat larger than $n$, say to $[n/\varepsilon]$. It is easy to see that such a hash function has $\varepsilon n$ collisions in expectation. This means that typically most of the non-zero entries in $x$ do not collide. We choose a value $\alpha \in [n/\varepsilon]$ uniformly at random and check using the UE data structure whether exactly one non-zero entry was mapped to $\alpha$. If this is the case we return the corresponding unique pair $(i, x_i)$ with $h(i) = \alpha$ and $x_i \neq 0$.

Since there are only few collisions this probability is close to $n/(n/\varepsilon) = \varepsilon$. If we keep $O(1/\varepsilon)$ instances of the data structure, one of them is likely to return a pair. We will show that the index of the returned pair is almost uniformly distributed among the non-zero entries in $x$. To deal with the fact that we do not know the value $n$ and that it changes over time we follow the approach of Section 3.3. We will keep $\log U$ hash functions $h_j$, $1 \le j \le \lceil \log U \rceil + 2$, each corresponding to a guess $n \approx 2^j$. It will suffice to work with a good guess, which can be identified using our DE data structure.

We now give a detailed description of our sample data structure. Our data structure uses $O(\log U \cdot \log(1/\delta)/\varepsilon)$ pairwise independent hash functions $h_{j,k}$ with $j \in [\lceil \log U \rceil + 2]$, $k \in [I]$, and number of instances $I = \log_{(1 - \varepsilon/32)}(\delta/2) = O(\log(1/\delta)/\varepsilon)$. Each $h_{j,k}$ is of the form $h_{j,k} : [U] \to [T]$ with $T := 2^{j+1}/\varepsilon$ and corresponds to the $k$-th hash function for the $j$-th guess, $n \approx 2^j$. For each hash function we choose a value $\alpha_{j,k}$, $j \in [\lceil \log U \rceil + 2]$, $k \in [I]$, uniformly at random from $[T]$. Additionally, we need UE data structures $\mathrm{UE}_{j,k}$ for $j \in [\lceil \log U \rceil + 2]$ and $k \in [I]$. Each of these data structures is supposed to handle a subset of the UPDATE$(i, a)$ operations in the input stream. Namely, data structure $\mathrm{UE}_{j,k}$ will process all updates with $h_{j,k}(i) = \alpha_{j,k}$. We also need one instance of the DE data structure using parameter $\psi = 1$ and error probability parameter $\delta' = \delta/2$.

All hash functions, $\alpha_{j,k}$ and $\mathrm{UE}_{j,k}$ are initialized during a separate initialization step. We write $\mathrm{UE}_{j,k}$.UPDATE to denote an UPDATE operation for data structure $\mathrm{UE}_{j,k}$.

The operations UPDATE and SAMPLE are implemented as follows:

UPDATE(i, a)
    for j ∈ [⌈log U⌉ + 2] do
        for k ∈ [I] do
            if h_{j,k}(i) = α_{j,k}
                then UE_{j,k}.UPDATE(i, a)
    DE.UPDATE(i, a)

SAMPLE
    j = ⌈log(DE.REPORT)⌉
    if ∀k UE_{j,k}.REPORT = FAIL then return FAIL
    k_0 ← min{k | UE_{j,k}.REPORT ≠ FAIL}
    return UE_{j,k_0}.REPORT
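A Python sketch of these two operations is given below. Several choices are assumptions made for the illustration and not taken from the thesis: the hash family $h(i) = ((a \cdot i + b) \bmod p) \bmod T$ with the Mersenne prime $p = 2^{61} - 1$ is a standard construction that is only approximately pairwise independent after the final reduction modulo $T$; the DE structure is replaced by an exact dictionary (not small space); and the uniqueness test in `report` is an assumed check valid for non-negative entries. The class names are hypothetical.

```python
import math
import random

PRIME = 2 ** 61 - 1  # Mersenne prime, assumed larger than the universe size U


class UniqueElement:
    def __init__(self):
        self.c0 = self.c1 = self.c2 = 0

    def update(self, i, a):
        self.c0 += a
        self.c1 += a * i
        self.c2 += a * i * i

    def report(self):
        # Assumed test for non-negative entries (Cauchy-Schwarz equality case).
        if self.c0 > 0 and self.c0 * self.c2 == self.c1 * self.c1:
            return (self.c1 // self.c0, self.c0)
        return None  # FAIL


class PairwiseSampler:
    def __init__(self, U, eps, delta, seed=0):
        rng = random.Random(seed)
        self.levels = math.ceil(math.log2(U)) + 2
        self.I = math.ceil(math.log(delta / 2) / math.log(1 - eps / 32))
        self.cells = []
        for j in range(self.levels):
            T = math.ceil(2 ** (j + 1) / eps)
            row = []
            for _ in range(self.I):
                a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
                alpha = rng.randrange(T)
                row.append((a, b, T, alpha, UniqueElement()))
            self.cells.append(row)
        self.x = {}  # exact stand-in for DE; NOT small space

    def update(self, i, value):
        for row in self.cells:
            for a, b, T, alpha, ue in row:
                if ((a * i + b) % PRIME) % T == alpha:
                    ue.update(i, value)
        new = self.x.get(i, 0) + value
        if new:
            self.x[i] = new
        else:
            self.x.pop(i, None)

    def sample(self):
        if not self.x:
            return None
        j = math.ceil(math.log2(len(self.x)))
        for cell in self.cells[j]:
            result = cell[4].report()
            if result is not None:
                return result  # smallest k whose UE structure does not FAIL
        return None
```

Each update touches all `levels * I` cells, which mirrors the stated update time of $O(\log U \cdot \log(1/\delta)/\varepsilon)$ hash evaluations.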

Lemma 3.5.1 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns a pair $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ such that $\Pr[i = j] = (1 \pm \varepsilon)/\|x\|_0$ for every $j \in \mathrm{Supp}(x)$. The algorithm uses $O(\log(UM/\varepsilon) \cdot \log U \cdot \log(1/\delta)/\varepsilon)$ space. An UPDATE operation can be done in $O(\log U \cdot \log(1/\delta)/\varepsilon)$ time and a SAMPLE query can be processed in $O(\log U + \log(1/\delta)/\varepsilon)$ time.

Proof: Denote $n := |\mathrm{Supp}(x)| = \|x\|_0$ and assume that DE returns a value DE.REPORT with $n \le \mathrm{DE.REPORT} \le 2n$, which happens with probability at least $1 - \delta/2$. We choose $j = \lceil \log(\mathrm{DE.REPORT}) \rceil$ such that $n \le 2^j \le 4n$.

Let $k \in [I]$ be fixed. For simplicity of notation we define $h = h_{j,k}$ and $\alpha = \alpha_{j,k}$. We say that $l \in [U]$ is chosen if it is the only index with $x_l \neq 0$ and $h(l) = \alpha$.

For fixed $l \neq m$ the probability over the choice of $h$ for the event $h(l) = h(m)$ is $1/T$ by pairwise independence. Let us fix an $l$ with $x_l \neq 0$. Using the union bound we get

$$\Pr[\exists m \neq l : x_m \neq 0 \wedge h(l) = h(m)] \le (n - 1)/T \le \varepsilon \cdot (n - 1)/2^{j+1} \le \varepsilon/2.$$

Since $\alpha$ is chosen uniformly at random from the target space $[T]$ and independently of the choice of $h$, the events $h(l) = \alpha$ and $\exists m \neq l\,(x_m \neq 0 \wedge h(l) = h(m))$ are independent. Therefore,

$$\frac{1}{T} \ge \Pr[l \text{ is chosen}] = \Pr[h(l) = \alpha] \cdot \big(1 - \Pr[\exists m \neq l\,(x_m \neq 0 \wedge h(l) = h(m))]\big) \ge \frac{1}{T} \cdot (1 - \varepsilon/2).$$

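Since these events are disjoint for all $l$ we get

$$\frac{n}{T} \ge \Pr[\text{any element is chosen}] \ge \frac{n}{T} \cdot (1 - \varepsilon/2).$$

It follows for each $l$ with $x_l \neq 0$

$$\Pr[l \text{ is chosen} \mid \text{any element is chosen}] \ge \frac{\frac{1}{T} \cdot (1 - \varepsilon/2)}{\frac{n}{T}} = (1 - \varepsilon/2) \cdot 1/n$$

and

$$\Pr[l \text{ is chosen} \mid \text{any element is chosen}] \le \frac{1/T}{n/T \cdot (1 - \varepsilon/2)} = \frac{1}{n \cdot (1 - \varepsilon/2)} \le \frac{1}{n}(1 + \varepsilon).$$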

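Thus, when an element is chosen, it is chosen almost uniformly at random from $\mathrm{Supp}(x)$. It remains to show that for at least one $k$ an element is chosen. From the argument above it follows:

$$\Pr[\text{any element is chosen}] \ge \frac{n}{T}(1 - \varepsilon/2) \ge \frac{\varepsilon n}{2^{j+2}}\left(1 - \frac{\varepsilon}{2}\right) \ge \frac{\varepsilon}{16}\left(1 - \frac{\varepsilon}{2}\right) = \frac{\varepsilon(2 - \varepsilon)}{32} \ge \frac{\varepsilon}{32}.$$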

Note that our data structure fails exactly when the data structure DE does not work (probability $\delta/2$) or when no element is chosen for all $k$. No element is chosen when for all $k$ the data structure $\mathrm{UE}_{j,k}$ fails. Since the number of instances $I$ is $\log_{(1 - \varepsilon/32)}(\delta/2)$ we know that the probability that no element is chosen for any $k$ is at most $(1 - \varepsilon/32)^I \le \delta/2$. Hence the overall probability of error is at most $\delta$ and our data structure is correct with probability at least $1 - \delta$.

The algorithm uses O(log U·log(1/δ)/)hash functions, αj,k values and UE data structures.

Each hash function uses O(log U)space, each αj,k value uses O(log(U/)) space and each UE

data structures uses O(log(UM)) space. The DE data structure uses O(log(1/δ)log(U)log(UM))

space.

Hence the overall space complexity is O(log(UM/)·log U·log(1/δ)/). We can evaluate

the hash function in constant time and obtain the stated running times. 2

Theorem 3 (Sampling in data streams.) Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns $s$ pairs $(i_0, x_{i_0}), \ldots, (i_{s-1}, x_{i_{s-1}})$ with $i_j \in \mathrm{Supp}(x)$ and returns a flag FAIL otherwise. The returned pairs are independent of each other and may contain duplicates. For all $j \in \mathrm{Supp}(x)$ and all $k \in [s]$: $\Pr[i_k = j] = \frac{1 \pm \varepsilon}{\|x\|_0}$. The algorithm uses

$$O\big((s + \log(1/\delta)) \cdot \log(UM/\varepsilon) \cdot \log(U)/\varepsilon\big)$$

space. An UPDATE operation can be processed in $O((s + \log(1/\delta)) \cdot \log(U)/\varepsilon)$ time. A query operation can be processed in $O((s + \log(1/\delta)) \cdot (\log U + 1/\varepsilon))$ time.

Proof: We apply Lemma 3.5.1 and invoke $16 \cdot (s + \ln(1/\delta))$ instances of the data structure, each with independent random choices and error probability parameter $1/2$. Let $Y$ denote the random variable for the number of instances that return a random pair. Each instance returns a random element with probability at least $1/2$. Hence, $E[Y] \ge 8 \cdot (s + \ln(1/\delta))$ and, by a Chernoff bound,

$$\Pr[Y < s] \le \Pr[Y \le (1 - \tfrac{1}{2}) E[Y]] \le e^{-(1/2)^2 \cdot E[Y]/2} \le \delta.$$

Therefore with probability at least $1 - \delta$ we have at least $s$ samples. We return the first $s$ samples we obtain from the instances. □

Corollary 3.5.2 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$ and given an oracle which tells us in advance the value of $\|x\|_0$ at the end of the stream, there is a data structure that with probability $1 - \delta$ returns all pairs $(i_0, x_{i_0}), \ldots, (i_{\|x\|_0 - 1}, x_{i_{\|x\|_0 - 1}})$ and returns a flag FAIL otherwise.

The algorithm uses $O(\|x\|_0 \cdot (\log \|x\|_0 + \log(1/\delta)) \cdot \log(UM) \cdot \log U)$ space. UPDATE operations and query operations can be processed in $O(\|x\|_0 \cdot (\log \|x\|_0 + \log(1/\delta)) \cdot \log U)$ time.

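Proof: We apply Theorem 3 with $s = 2\|x\|_0(\ln \|x\|_0 + \ln(2/\delta))$, $\varepsilon = 1/2$, and error probability parameter $\delta/2$ and simply return all distinct samples we get.

Let us fix an arbitrary index $j \in \mathrm{Supp}(x)$. We have

$$\Pr[\forall k \in [s] : j \neq i_k] \le \left(1 - \frac{1 - \varepsilon}{\|x\|_0}\right)^s \le \left(1 - \frac{1}{2\|x\|_0}\right)^s \le e^{-\ln \|x\|_0 - \ln(2/\delta)} = \frac{\delta}{2\|x\|_0}.$$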

It follows from the Union Bound:

$$\Pr[\exists j \in \mathrm{Supp}(x)\ \forall k \in [s] : j \neq i_k] \le \|x\|_0 \cdot \frac{\delta}{2\|x\|_0} \le \delta/2.$$

Since the sampling fails with probability at most $\delta/2$, the overall probability of failure is at most $\delta$.

A second consequence can be obtained by translating Theorem 3 (which uses independent

draws and thus the sample set may contain multiple copies of the same point) to the case of

returning subsets.

Corollary 3.5.3 Let $s \le \|x\|_0/2$. Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns a subset $S \subset \mathrm{Supp}(x)$ of $s$ indices and all pairs $(i, x_i)$ for $i \in S$ and returns a flag FAIL otherwise. The algorithm uses $O\big((s + \log(1/\delta)) \cdot \log(UM) \cdot \log(U)\big)$ space. UPDATE operations and query operations can be processed in $O((s + \log(1/\delta)) \cdot \log(U))$ time.

Proof: We use the data structure of Theorem 3 with parameter $\varepsilon = 1/2$ and error probability parameter $\delta' = \delta/2$ to obtain $s' = c \cdot (s + \log(1/\delta))$ indices $i'_0, \ldots, i'_{s'}$ from $\mathrm{Supp}(x)$ almost uniformly at random with repetitions, where $c$ is a suitable constant.

Since $s \le \|x\|_0/2$ we know that for each $i'_j$ either at least $s$ different indices have been chosen previously, or $i'_j$ is with probability at least $\frac{1 - \varepsilon}{\|x\|_0} \cdot (\|x\|_0 - s) \ge 1/4$ an index that is not among the previously chosen indices. We get that with probability at least $1 - \delta/2$ we have at least $s$ distinct indices among $i'_0, \ldots, i'_{s'}$ for $c$ large enough. If there are more than $s$ distinct indices among $i'_0, \ldots, i'_{s'}$ we choose $s$ arbitrary indices from the distinct indices and return them. □

Lemma 3.5.4 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that returns all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ when $\|x\|_0 \le A$ and a flag FAIL if $\|x\|_0 > A$. The algorithm works with probability $1 - \delta$ and uses $O(A \cdot (\log A + \log(1/\delta)) \cdot \log(UM) \cdot \log U)$ space. UPDATE operations and query operations can be processed in $O(A \cdot (\log A + \log(1/\delta)) \cdot \log(U))$ time.

Proof: We start the algorithm of Corollary 3.5.2 with the assumption that $\|x\|_0 = 2A + 2$ and with error probability parameter $\delta' = \delta/2$. We call it algorithm 1. In case that $\|x\|_0 \le 2A + 2$ at the end of the stream, this algorithm reconstructs all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$.

In parallel we start the algorithm of Corollary 3.5.3 with $s = A + 1$ and error probability parameter $\delta' = \delta/2$ and call it algorithm 2. If algorithm 2 returns $A + 1$ elements, we return FAIL. If it returns fewer than $A + 1$ elements, we know that with probability $1 - \delta/2$ we have $\|x\|_0 < 2A + 2$. In that case algorithm 1 provides us with all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ with probability $1 - \delta/2$. We count the number of pairs. If $\|x\|_0 \le A$, we return all pairs, otherwise we return FAIL. □

4 Sampling Geometric Data Streams and Applications

In this chapter we will transfer the sampling technique of Section 3.4 to the context of dynamic geometric data streams. It gives us the ability to choose a point almost uniformly at random from the current point set $P$ encoded in a dynamic geometric data stream.

Based on that we will provide randomized streaming algorithms for three well-studied geometric problems over dynamic geometric streams:

1. Maintaining an $\varepsilon$-net of $P$; that is, a subset $N \subset P$ such that for any range $R$ from a fixed family of ranges of VC dimension $D$ (e.g., the set of all rectangles), we have $|N \cap R| > 0$ if $\frac{|R \cap P|}{|P|} \ge \varepsilon$. We show how to maintain a set of $\tilde{O}\big(\frac{D + \log(1/\delta)}{\varepsilon}\big)$ points that with probability $1 - \delta$ is an $\varepsilon$-net of $P$.

Our data structure uses $O\big(\big(\frac{\log(1/\delta)}{\varepsilon} + \frac{D}{\varepsilon} \log\frac{D}{\varepsilon}\big) \cdot d^2 \cdot \log^2(\Delta/\delta)\big)$ space.

2. Maintaining an $\varepsilon$-approximation of $P$; that is, a subset $A \subset P$ such that for any range $R$ from a fixed family of ranges of bounded VC dimension $D$, we have $\frac{|A \cap R|}{|A|} = \frac{|R \cap P|}{|P|} \pm \varepsilon$. In this case our algorithm maintains a set of $\tilde{O}\big(\frac{D + \log(1/\delta)}{\varepsilon^2}\big)$ points that with probability $1 - \delta$ is an $\varepsilon$-approximation.

Our data structure uses $O\big(\frac{1}{\varepsilon^2}\big(D \log\frac{D}{\varepsilon} + \log\frac{1}{\delta}\big) d^3 \log^3\frac{\Delta}{\delta}\big)$ space.

The $\varepsilon$-approximations have applications to many problems, including Tukey depth, simplicial depth, regression depth, the Theil-Sen estimator, and the least median of squares [9].

3. Maintaining a $(1 + \varepsilon)$-approximation of the cost of the minimum weight tree spanning the points in $P$. This quantity in turn makes it possible to achieve constant factor approximations for other problems, such as TSP or Steiner tree cost. Our algorithm uses space $O(\log(1/\delta) \cdot (\log \Delta/\varepsilon)^{O(d)})$ and is correct with probability $1 - \delta$.

Having a random sample of points from the technique developed in the last chapter, the algorithms to maintain $\varepsilon$-nets and $\varepsilon$-approximations will follow relatively easily from [66] and [119].

To compute the weight of the Euclidean minimum spanning tree our sampling procedure is used in a more subtle way. It is known that the EMST weight can be expressed as a formula depending on the number of connected components in certain subgraphs of the complete Euclidean graph of the current point set [25, 20]. We use an algorithm from [25] to count the number of connected components in these subgraphs. This algorithm is based on a BFS-like procedure starting at a randomly chosen point $p$. The BFS runs for a constant number of rounds only and one can show that it can never leave a ball around $p$ of a certain radius. Therefore, it suffices to maintain a random sample point and all points within a certain radius around this sample point. This task can also be approximately performed by a variant of our sampling routine described in Section 4.3.

4.1 Sampling Geometric Data Streams

First, we observe that we can apply our data structure in the setting of dynamic geometric data streams in the following way. We will use $U = \Delta^d$ and $M = 2$, $[M] = \{0, 1\}$. An ADD$(p)$ operation with $p = (p_0, \ldots, p_{d-1})$ is implemented as an UPDATE$(P, 1)$ operation with $P = \sum_j p_j \cdot \Delta^j$, i.e. by interpreting $p$ as a $\Delta$-ary number with $d$ digits. In a similar way, a REMOVE$(p)$ operation translates to UPDATE$(P, -1)$. Using the SAMPLE procedure we can get a pair $(i, x_i)$ having $x_i = 1$, which can also be re-interpreted as the corresponding unique point $p = (p_0, \ldots, p_{d-1})$ with $i = \sum_j p_j \cdot \Delta^j$. Thus we can sample a point from the current point set.
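The translation between grid points and vector indices can be sketched in Python as follows. The function names and the parameter name `grid_size` (standing for $\Delta$) are hypothetical.

```python
def encode(point, grid_size):
    """Interprets (p_0, ..., p_{d-1}) as a number with d digits in base grid_size."""
    return sum(coord * grid_size ** j for j, coord in enumerate(point))


def decode(index, grid_size, d):
    """Recovers the unique point whose encoding is the given index."""
    coords = []
    for _ in range(d):
        index, coord = divmod(index, grid_size)
        coords.append(coord)
    return tuple(coords)
```

With these helpers, ADD$(p)$ corresponds to an update of `encode(p, grid_size)` by $+1$ and REMOVE$(p)$ to an update by $-1$, and a sampled index is mapped back to a point with `decode`.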

We will now translate the results of Section 3.4 to the context of dynamic geometric data

streams.

Theorem 4 (Sampling in geometric data streams.) Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns $s$ points $r_0, \ldots, r_{s-1}$ from the current point set $P = \{p_0, \ldots, p_{n-1}\}$ and a flag FAIL otherwise. The returned points are independent of each other and may contain duplicates. The statistical difference between the distribution of each sample point and the uniform distribution is at most $\frac{\delta}{\Delta^d}$; in particular $\Pr[r_i = p_j] = \frac{1}{n} \pm \frac{\delta}{\Delta^d}$ for every $j \in [n]$.

The algorithm uses $O\big((s + \log(1/\delta)) \cdot d^2 \cdot \log^2(\Delta/\delta)\big)$ space.

Proof: We apply Theorem 2 with $\delta' = \delta/\Delta^d$. □

We remark that Theorem 4 requires that no point in $P$ occurs more than once, i.e. $P$ is not a multiset.

Corollary 4.1.1 Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns the current point set $P = \{p_0, \ldots, p_{n-1}\}$.

The algorithm uses $O\big(n d^3 \log^3\frac{\Delta}{\delta}\big)$ space.

Proof: This corollary follows directly from Corollary 3.4.3. □

Corollary 4.1.2 Let $s \le n/2$. Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns a subset $S = \{r_0, \ldots, r_{s-1}\}$ of $s$ points from the current point set $P = \{p_0, \ldots, p_{n-1}\}$. The statistical difference between the distribution of the returned subset and the uniform distribution is at most $\frac{\delta}{\Delta^d}$; in particular $\Pr[p_j \in S] = \frac{s}{n} \pm \frac{\delta}{\Delta^d}$ for every $j \in [n]$. The algorithm uses $O\big((s + d \cdot \log(\Delta/\delta)) \cdot d^2 \cdot \log^2(\Delta/\delta)\big)$ space.

Proof: This corollary follows directly from Corollary 3.4.4. □

4.2 -Nets and -Approximations in Data Streams

A consequence of Theorem 4 is that we can get -nets and -approximations of the current point

set. We briefly recapitulate the definitions required for -nets and -approximations, which can,

for example, be found in [52].

Definition 4.2.1 (Range Spaces) Let $X$ be a set of objects and $R$ be a family of subsets of $X$. Then we call the set system $\Sigma = (X, R)$ a range space. The elements of $R$ are the ranges of $\Sigma$. If $X$ is a finite set, then $\Sigma$ is called a finite range space.

Definition 4.2.2 (VC-dimension) The Vapnik-Chervonenkis dimension (VC-dimension) of a range space $\Sigma = (X, R)$ is the size of the largest subset of $X$ that is shattered by $R$. We say that a set $A$ is shattered by $R$ if $\{A \cap r \mid r \in R\} = 2^A$.

Definition 4.2.3 (-nets, -approximation) Let Σ= (X, R)be a finite range space. A subset

N⊂Xis called -net, if N∩r6=∅for every r∈Rwith |r|≥|X|. A subset A⊆Xis called

-approximation, if for every r∈Rwe have |A∩r|

|A|−|r|

|X|≤.

Obviously, an -approximation is always an -net, while the contrary is not necessarily true.

A Data Streaming Algorithm for -Approximations. The following theorem by Vapnik

and Chervonenkis shows that for any finite range space with constant VC-dimension Da random

sample of size e

O(D+log(1/δ)

2)is an -approximation with probability at least 1−δ.

Theorem 5 [119] There is a positive constant $c$ such that if $(X, R)$ is any range space of VC-dimension at most $D$, $A \subset X$ is a finite subset and $\varepsilon, \delta > 0$, then a random subset $B$ of cardinality $s$ of $A$, where $s$ is at least the minimum of $|A|$ and

$$\frac{c}{\varepsilon^2} \cdot \left(D \cdot \log\frac{D}{\varepsilon} + \log\frac{1}{\delta}\right),$$

is an $\varepsilon$-approximation for $A$ with probability at least $1 - \delta$.

We can now combine Corollary 4.1.2 (or Corollary 4.1.1 in the case that the current point set is small) with Theorem 5 to obtain a data structure that with probability $1 - \delta$ returns an $\varepsilon$-approximation of the current point set.

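Theorem 6 Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns a set $A$ of $\tilde{O}\big(\frac{D + \log(1/\delta)}{\varepsilon^2}\big)$ points that is an $\varepsilon$-approximation of a range space $(X, R)$ with VC-dimension $D$. The algorithm uses $O\big(\frac{1}{\varepsilon^2}\big(D \log\frac{D}{\varepsilon} + \log\frac{1}{\delta}\big) d^3 \log^3\frac{\Delta}{\delta}\big)$ space.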

Proof: Let $a = \frac{c}{\varepsilon^2} \cdot \big(D \cdot \log\frac{D}{\varepsilon} + \log\frac{3}{\delta}\big)$. By Theorem 5 a sample set of size $a$ is an $\varepsilon$-approximation with probability at least $1 - \delta/3$. Let $P = \{p_0, \ldots, p_{n-1}\}$ denote the current point set. We can easily track the size $n$ of $P$ by increasing a counter with every ADD operation and decreasing it with every REMOVE operation. If $n \le a$ we use the data structure from Corollary 4.1.1 of size $O\big(a d^3 \log^3\frac{\Delta}{\delta}\big)$ to recover $P$ completely.

If $n > a$ we will use our data structure from Corollary 4.1.2 of size $O\big(a d^3 \log^3\frac{\Delta}{\delta}\big)$ to obtain a random sample of size $a$. We will use failure parameter $\delta' = \delta/3$. This guarantees that the overall statistical difference from the same process using the uniform distribution is at most $\delta'/\Delta^d \cdot n \le \delta/3$. Similarly, the data structure fails with probability at most $\delta' \le \delta/3$. And a set of size $a$ is with probability $1 - \delta/3$ an $\varepsilon$-approximation. Summing up the errors we get an $\varepsilon$-approximation with probability $1 - \delta$.

The space requirement follows immediately from Corollaries 4.1.2 and 4.1.1 and Theorem 5.

A Data Streaming Algorithm for -Nets. Haussler and Welzl showed that a random sam-

ple of size e

O(D+log(1/δ)

)is an -net with probability at least 1−δ.

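Theorem 7 [66] Let $(X, R)$ be a range space of VC-dimension $D$, let $A$ be a finite subset of $X$ and suppose $0 < \varepsilon, \delta < 1$. Let $N$ be a set obtained by $m$ random independent draws from $A$, where

$$m \ge \max\left(\frac{4}{\varepsilon} \log\frac{2}{\delta},\ \frac{8 \cdot D}{\varepsilon} \log\frac{8 \cdot D}{\varepsilon}\right),$$

then $N$ is an $\varepsilon$-net for $A$ with probability at least $1 - \delta$.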

Combining Theorem 4 with Theorem 7 we obtain

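Theorem 8 Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns a set $N$ of $\tilde{O}\big(\frac{D + \log(1/\delta)}{\varepsilon}\big)$ points that with probability at least $1 - \delta$ is an $\varepsilon$-net of a range space $(X, R)$ with VC-dimension $D$. The algorithm uses $O\big(\big(\frac{\log(1/\delta)}{\varepsilon} + \frac{D}{\varepsilon} \log\frac{D}{\varepsilon}\big) \cdot d^2 \cdot \log^2(\Delta/\delta)\big)$ space.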

Proof: We use a random sample (with repetitions) as given by the data structure from Theorem 4. We choose failure parameter $\delta' = \delta/3$. Since the statistical difference from the exact uniform distribution is at most $\delta' = \delta/3$, the failure probability is at most $\delta' = \delta/3$, and a set of size $\max\big(\frac{4}{\varepsilon} \log\frac{2}{\delta'},\ \frac{8 \cdot D}{\varepsilon} \log\frac{8 \cdot D}{\varepsilon}\big)$ is an $\varepsilon$-net with probability at least $1 - \delta' = 1 - \delta/3$, we get that our sample is an $\varepsilon$-net with probability at least $1 - \delta$. □

4.3 Random Sampling with Neighborhood Information

We now want to develop a more sophisticated sampling strategy. We would like to draw a set of points (almost) uniformly at random and for each point we also would like to know its neighborhood, for example all points within a distance of at most $z$ or all points in a square with side length $z$ centered at the random point. Formally, we define a neighborhood in the following way.

Definition 4.3.1 (V-neighborhood) Let $V = \{v_0, \ldots, v_{Z-1}\}$ denote a set of grid vectors with $v_0 = (0, \ldots, 0)$. We define the $V$-neighborhood of a point $p$ to be the set $N(V, p) = \bigcup_{i \in [Z]} \{p + v_i\}$. We call $Z = |V|$ the size of the $V$-neighborhood.

We will typically assume that the size $Z$ of a $V$-neighborhood is small, i.e. polylogarithmic. We show that we are able to get information about the $V$-neighborhood of a sample point. This can be achieved in the following way:

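For simplicity we identify $[\Delta]^d$ with $[\Delta^d]$, such that $P \subset \{0, \ldots, \Delta^d - 1\}$ and $V \subset \{0, \ldots, \Delta^d - 1\}$. We use Theorem 2 with $U = \Delta^d$ and $M = 2^Z$ to map the problem from the discrete Euclidean space to a vector problem. We want to maintain the invariant that the vector $x$ represents the neighborhood of each point in the following way:

$$\forall p \in [\Delta^d]: \quad x_p = \sum_{j=0}^{Z-1} a_j \cdot 2^j \quad \text{with} \quad a_j = \begin{cases} 1 & \text{if } p + v_j \in P \\ 0 & \text{if } p + v_j \notin P \end{cases} \qquad (4.1)$$

where $P$ denotes the current point set after insert and delete operations.

In particular, $x_p \equiv 1 \pmod 2 \Leftrightarrow p \in P$.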

To maintain the invariant we have to translate the insert and delete operations of points into $Z$ update operations in the following way:

ADD(p) −→ for all i ∈ [Z]
              if p − v_i ∈ [∆^d] then
                  UPDATE(p − v_i, 2^i)

REMOVE(p) −→ for all i ∈ [Z]
                 if p − v_i ∈ [∆^d] then
                     UPDATE(p − v_i, −2^i)
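The translation above can be illustrated with a small Python sketch. For readability it makes two assumptions that are not part of the text: the space is one-dimensional, so points and offsets are plain integers, and the streamed vector $x$ is stored explicitly in a dictionary instead of being passed to the sampling data structure. The class name `NeighborhoodVector` is hypothetical.

```python
class NeighborhoodVector:
    def __init__(self, universe, offsets):
        assert offsets[0] == 0  # v_0 is the zero vector
        self.universe = universe
        self.offsets = offsets
        self.x = {}  # plain dictionary standing in for the streamed vector x

    def _update(self, index, amount):
        self.x[index] = self.x.get(index, 0) + amount
        if self.x[index] == 0:
            del self.x[index]

    def add(self, p):
        # ADD(p): one UPDATE(p - v_i, 2^i) per offset that stays inside the universe
        for i, v in enumerate(self.offsets):
            if 0 <= p - v < self.universe:
                self._update(p - v, 2 ** i)

    def remove(self, p):
        # REMOVE(p): the same updates with negated amounts
        for i, v in enumerate(self.offsets):
            if 0 <= p - v < self.universe:
                self._update(p - v, -(2 ** i))

    def neighborhood(self, q):
        """Points of P in N(V, q), decoded from the bits of x_q."""
        value = self.x.get(q, 0)
        return {q + v for i, v in enumerate(self.offsets) if value >> i & 1}
```

After any sequence of `add` and `remove` calls, bit $j$ of `x[q]` is set exactly when $q + v_j$ is in the current point set, so an odd entry identifies a point of $P$ and its remaining bits describe the neighborhood.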

We have to deal with the fact that the sample data structure used in Theorem 2 samples elements from $\mathrm{Supp}(x)$, which is a larger set than $P$.

Lemma 4.3.2 $|P| \ge \frac{\|x\|_0}{Z}$

Proof: Let $i \in \mathrm{Supp}(x)$, so $x_i \neq 0$. Because of equation (4.1) there must be a responsible point $p \in P$ with $p \in N(V, i)$. Since $|V| = Z$, each point $p$ can be responsible for at most $Z$ non-zero entries $x_i$. We conclude that there are at most $|P| \cdot Z$ entries $i$ with $x_i \neq 0$, and hence $|P| \ge \frac{\|x\|_0}{Z}$. □

We apply Theorem 2 with $s' = 16 \cdot Z \cdot (s + \ln(2/\delta))$ and $\delta' = \frac{\delta}{Z \cdot \Delta^d}$ and get a set $S'$ consisting of $s'$ pairs $(i, x_i)$ with $x_i \neq 0$. Starting with an empty sample set $S$ we check for every pair $(i, x_i)$ whether $x_i \equiv 1 \pmod 2$. If this is the case we add the pair $(i, x_i)$ (containing the sample point $i \in P$) to our sample set $S$. We stop the procedure when $|S| = s$.

We first show that the set $S$ contains at least $s$ sample points with probability $1 - \delta$. Let $Y$ be the random variable for the number of samples having $x_i \equiv 1 \pmod 2$ (being sample points from $P$). For each sample $(i, x_i) \in S'$ and each $p \in P$ we have

$$\Pr[i = p] = \frac{1}{\|x\|_0} \pm \delta' \ge \frac{1}{|P| \cdot Z} - \delta' \ge \frac{1}{2 \cdot |P| \cdot Z}.$$

It follows that

$$E[Y] \ge \frac{1}{2 \cdot |P| \cdot Z} \cdot |P| \cdot s' = 8\,(s + \ln(2/\delta)).$$

From Chernoff Bounds [58]:

$$\Pr[Y < s] \le \Pr[Y \le (1 - \tfrac{1}{2}) E[Y]] \le e^{-(1/2)^2 \cdot E[Y]/2} \le \delta/2.$$

Since Theorem 2 gives us $s'$ samples with probability at least $1 - \delta/2$, we conclude that the constructed set $S$ consists of at least $s$ samples $(i, x_i)$ having $x_i \equiv 1 \pmod 2$ with probability $1 - \delta$. It remains to show that the sampling is almost uniform.

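Let $S = \{(r_1, x_{r_1}), \dots, (r_s, x_{r_s})\}$ and let $(k, x_k)$ be the first sample returned by the algorithm of Theorem 2. By the method to construct the set $S$ we see that for all $i \in \{1, \dots, s\}$ and all $p \in P$:

\[
\Pr[r_i = p] = \Pr[k = p] \cdot \frac{\|x\|_0}{|P|} = \left(\frac{1}{\|x\|_0} \pm \delta'\right) \cdot \frac{\|x\|_0}{|P|} = \frac{1}{|P|} \pm \frac{\delta}{Z} \cdot \frac{\|x\|_0}{|P| \cdot \Delta^d}.
\]

Since $|P| \ge \frac{\|x\|_0}{Z}$ by Lemma 4.3.2 we have:

\[
\Pr[r_i = p] = \frac{1}{|P|} \pm \frac{\delta}{\Delta^d}.
\]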

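The space requirement to apply Theorem 2 with $s' = 16 \cdot Z \cdot (s + \ln(2/\delta))$, $\delta' = \frac{\delta}{Z \cdot \Delta^d}$, $U = \Delta^d$, and $M = 2^Z$ is:

\[
\begin{aligned}
O\left(\left(s' + \log\frac{1}{\delta'}\right) \cdot \log^2\frac{UM}{\delta'}\right)
&= O\left(\left(Z \cdot \left(s + \log\frac{2}{\delta}\right) + \log\frac{Z \Delta^d}{\delta}\right) \cdot \log^2\frac{\Delta^d \cdot 2^Z \cdot Z \cdot \Delta^d}{\delta}\right) \\
&= O\left(s \cdot Z^3 \cdot d^3 \log^3\frac{\Delta}{\delta}\right).
\end{aligned}
\]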


Theorem 9 Let the set $V = \{v_0, \dots, v_{Z-1}\}$ be fixed. Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns $s$ points $r_0, \dots, r_{s-1}$ from the current point set $P = \{p_0, \dots, p_{n-1}\}$ such that $\Pr[r_i = p_j] = \frac{1}{n} \pm \frac{\delta}{\Delta^d}$ for every $j \in [n]$. The points are independent of each other and may contain duplicates. Additionally, the algorithm returns the sets $P \cap N(V, r_i)$ for every $i \in [s]$. The algorithm uses $\tilde{O}\left(s \cdot Z^3 \cdot d^3 \cdot \log^3(\Delta/\delta)\right)$ space. □

4.4 Estimating the Weight of a Euclidean Minimum

Spanning Tree

In this section we will show how to estimate the weight of a Euclidean minimum spanning tree in a dynamic geometric data stream. We denote by $P = \{p_1, \dots, p_n\}$ the current point set. Further, $\mathrm{EMST}$ denotes the weight of the Euclidean minimum spanning tree of the current set.

We impose $\log_{1+\epsilon}(\sqrt{d}\,\Delta)$ square grids over the point space. The side lengths of the grid cells are $\frac{\epsilon \cdot (1+\epsilon)^i}{\sqrt{d}}$ for $0 \le i \le \log_{1+\epsilon}(\sqrt{d}\,\Delta)$. Our algorithm maintains certain statistics of the distribution of points in the grids. We show that these statistics can be used to compute a $(1+\epsilon)$-approximation of the weight $\mathrm{EMST}$.

Our computation is based on a formula from [33] for the value of the minimum spanning tree of an $n$-point metric space. Let $G_P$ denote the complete Euclidean graph of a point set $P$ and $W$ an upper bound on its longest edge. Further let $c_P^{((1+\epsilon)^i)}$ denote the number of connected components in $G_P^{((1+\epsilon)^i)}$, which is the subgraph of $G_P$ containing all edges of length at most $(1+\epsilon)^i$. Under these assumptions we can use the formula from [33]:

\[
\frac{1}{1+\epsilon} \cdot \mathrm{EMST} \le n - W + \epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} W - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} \le \mathrm{EMST} \tag{4.2}
\]

where $n$ is the number of points in $P$.

Instead of considering the number of connected components in $G_P^{(t)}$ for $t = (1+\epsilon)^i$ we first move all points of $P$ to the centers of a grid of side length $\frac{\epsilon \cdot t}{\sqrt{d}}$. After removing multiplicities we obtain the point set $P^{(t)}$. Then we consider the graph $G^{(t)}$ whose vertex set is $P^{(t)}$ and that contains an edge between two points if their distance is at most $t$. Instead of counting the connected components in $G_P^{(t)}$ we count the connected components in $G^{(t)}$. It follows from Claim 4.4.1 that this only introduces a small error. We denote by $c^{(t)}$ the number of connected components of $G^{(t)}$. Then we get

Claim 4.4.1

\[
c_P^{((1+\epsilon)^{i+1})} \le c^{((1+\epsilon)^i)} \le c_P^{((1+\epsilon)^{i-2})}
\]

Proof: Let us consider two arbitrary points $p, q \in P$ and the centers $p', q'$ of their corresponding cells in the grid graph $G^{((1+\epsilon)^i)}$. Recall that the corresponding grid has side length $\frac{\epsilon \cdot (1+\epsilon)^i}{\sqrt{d}}$. Thus by moving $p$ and $q$ to the centers of the corresponding grid cells their distance changes by at most $\epsilon \cdot (1+\epsilon)^i$.

Now assume that $p, q$ are in the same connected component in $G_P^{((1+\epsilon)^{i-2})}$. Then they are connected by a path of edges of length at most $(1+\epsilon)^{i-2}$. If we now consider the path of the corresponding centers of grid cells, then any edge of the path has length at most $(1+\epsilon)^{i-2} + \epsilon \cdot (1+\epsilon)^i \le (1+\epsilon)^i$. Therefore, $p', q'$ are in the same connected component of the grid cell graph. We conclude $c^{((1+\epsilon)^i)} \le c_P^{((1+\epsilon)^{i-2})}$.

Assume that $p$ and $q$ are in the same connected component of the grid graph $G^{((1+\epsilon)^i)}$. They are connected by a path of edges of length at most $(1+\epsilon)^i$ in the grid graph $G^{((1+\epsilon)^i)}$. After switching to the point graph $G_P$ any edge of the corresponding path has a length of at most $(1+\epsilon)^i + \epsilon (1+\epsilon)^i = (1+\epsilon)^{i+1}$. Therefore $p$ and $q$ are in the same connected component of $G_P^{((1+\epsilon)^{i+1})}$. We conclude $c_P^{((1+\epsilon)^{i+1})} \le c^{((1+\epsilon)^i)}$. □

We denote by $n^{(t)} = |P^{(t)}|$ the number of non-empty grid cells of side length $\frac{\epsilon t}{\sqrt{d}}$. Our algorithm maintains approximations $\tilde{n}$, $\tilde{W}$, $\tilde{n}^{(t)}$, and $\tilde{c}^{(t)}$ (for $t = (1+\epsilon)^0, (1+\epsilon)^1, (1+\epsilon)^2, \dots$) of the number $n$ of points currently in the set, the diameter $W$, the size $n^{(t)}$ of $P^{(t)}$, and the number $c^{(t)}$ of connected components in $G^{(t)}$, respectively. The approximation is then derived by inserting the maintained approximations into formula (4.2).

In the following we discuss the data structures we need to maintain our approximations.

4.4.1 Required Data Structures

Number of points. We observe that we can remember the value of $n$ exactly by increasing/decreasing $\tilde{n}$ in case of an ADD/REMOVE operation.

Diameter. We show how to maintain an approximation $\tilde{W}$ of $W$ with $W \le \tilde{W} \le 4\sqrt{d}\,W$, where $W$ is the largest distance between two points in the current point set. To do so, we maintain an approximation $\tilde{W}_j$ of the diameter of the point set in each of the $d$ dimensions with $W_j \le \tilde{W}_j \le 4 W_j$, where $W_j$ is the diameter in dimension $j$ for $1 \le j \le d$. The maximum of the $\tilde{W}_j$ is our approximation $\tilde{W}$.

We maintain the diameter of the point set in dimension $j$ in the following way: For each $i \in \{0, \dots, \log \Delta\}$ we introduce two one-dimensional grids $G_{i,1}$ and $G_{i,2}$, each of them having cells of side length $2^i$. $G_{i,2}$ is displaced by $2^{i-1}$ against $G_{i,1}$. Let $g_{i,1}$ and $g_{i,2}$ be the number of occupied cells in the grids $G_{i,1}$ and $G_{i,2}$, respectively.

We use our Distinct Elements data structure from Section 3.2 to count the number of grid cells containing a point. We only want to distinguish between the case $g_{i,1} = 1$ and $g_{i,1} > 1$ (we assume that there is always at least one point in the set; otherwise the problem becomes trivial).

If there is exactly one point in the current set we have $g_{0,1} = 1$ and $g_{0,2} = 1$ and the diameter is 0. Otherwise, the diameter must be at least 1. Therefore in the finest grids $G_{0,1}$ and $G_{0,2}$ at least two cells are occupied, which means that $g_{0,1} > 1$ and $g_{0,2} > 1$. We now find the smallest value $i$ such that $g_{i+1,1} = 1$ or $g_{i+1,2} = 1$. In this case we know that $W_j \le 2^{i+1}$.

Since $g_{i,1} > 1$ and $g_{i,2} > 1$, we know that in both grids $G_{i,1}$ and $G_{i,2}$ at least two cells are occupied. This means that the convex hull of the point set contains the border of a cell in both grids $G_{i,1}$ and $G_{i,2}$. Since these cell borders have a distance of at least $2^{i-1}$ we have $W_j \ge 2^{i-1}$. Therefore, we can output $\tilde{W}_j = 2^{i+1}$ as a 4-approximation of the diameter in dimension $j$.
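The two-shifted-grids argument can be sketched in Python for a single dimension. The sketch assumes non-negative integer coordinates and replaces the Distinct Elements data structure by exact Python sets, so it illustrates only the grid logic and not the streaming space bound. The function name `estimate_spread_1d` is hypothetical.

```python
def estimate_spread_1d(coords):
    """Return w with W <= w <= 4*W, where W = max(coords) - min(coords)."""
    values = set(coords)
    if len(values) <= 1:
        return 0  # a single point has diameter 0

    def occupied(level):
        # Cells have side 2**level. Coordinates are doubled so that the
        # displacement 2**(level-1) of the second grid stays an integer.
        side2 = 2 ** (level + 1)
        g1 = {(2 * x) // side2 for x in values}
        g2 = {(2 * x + 2 ** level) // side2 for x in values}
        return len(g1), len(g2)

    i = 0
    while True:
        g1, g2 = occupied(i + 1)
        if g1 == 1 or g2 == 1:
            # All points fit into one cell of side 2**(i+1); both grids of
            # side 2**i still had at least two occupied cells.
            return 2 ** (i + 1)
        i += 1
```

The second, displaced grid is what prevents two adjacent coordinates such as 7 and 8 from being separated at every level of the first grid.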

Size of $P^{(t)}$. The problem of finding an estimate $\tilde{n}^{(t)}$ with $(1-\epsilon)\,n^{(t)} \le \tilde{n}^{(t)} \le n^{(t)}$ is equivalent to maintaining the number of distinct elements in a data stream. This can be seen as follows. Once a point arrives we can determine its grid cell from its position. Thus we can interpret the input stream as a stream of grid cells and we are interested in the number of distinct grid cells. This can be approximated using an instance of the Distinct Elements (DE) data structure of Section 3.2.

The Sample Set. To approximate the number of connected components we have to maintain multisets $S^{(t)}$ of points chosen uniformly at random (with repetitions) from $P^{(t)}$. We will use $V_R$ to denote the set of grid vectors of length at most $R$. For each point $p \in S^{(t)}$ we maintain all points in the $V_R$-neighborhood of $p$ for some suitably chosen value $R$. Since our input stream contains ADD and REMOVE operations of points from $P$ rather than $P^{(t)}$ we have to map every point from $P$ to the corresponding point from $P^{(t)}$. This may have the effect that $P^{(t)}$ becomes a multiset although $P$ is not. This is no problem because our procedure from Lemma 3.4.2 samples from the support of the vector (or, in this case, from the support of the multiset). Straightforward modifications of Theorem 9 show that we can also maintain the required sets $S^{(t)}$.

Having such a sample and the value $\tilde{n}^{(t)}$ we can use an algorithm from [25] to obtain the number of connected components with sufficiently small error. This is proven in Section 4.4.2 using a modified analysis that charges the error in the approximation to the weight of the minimum spanning tree. This way we get our estimate $\tilde{c}^{(t)}$ of $c^{(t)}$.

4.4.2 Computing $\tilde{c}^{(t)}$

In this section we show how to compute our estimator $\tilde{c}^{(t)}$. To do this we will use our sample set $S^{(t)}$. In the computation of the sample set $S^{(t)}$ we need to specify the value $R$. We choose

\[
R := \log(\sqrt{d} \cdot \Delta) \cdot 2^{d+4} \sqrt{d}\, \epsilon^{-2} \cdot t.
\]

Further we need in the following the value $D := R/t$. Our algorithm for estimating $c^{(t)}$ works as follows. First, we check whether $\tilde{W} < 4t$. If that is the case, $W < 4t$ follows and for an arbitrary sample point we know that every point of the current point set is contained in the radius $R$. Therefore, we know the whole graph $G^{(t)}$ and can compute $c^{(t)}$ exactly.

Thus let us assume $\tilde{W} \ge 4t$. In this case our algorithm is essentially similar to the one presented in [25], but our analysis is somewhat different. The difference comes from the special structure of our input graphs $G^{(t)}$. We exploit the lower bound from Lemma 4.4.2 below to relate the error induced by our approximation algorithm to the weight of the EMST.


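Lemma 4.4.2 If $\tilde{W} \ge 4t$ then

\[
\mathrm{EMST} \ge \frac{n^{(t)} \epsilon t}{\sqrt{d}\, 2^{d+1}}.
\]

Proof: We distinguish between the case $n^{(t)} \ge 2^{d+1}$ and $n^{(t)} < 2^{d+1}$.

We start with the case $n^{(t)} \ge 2^{d+1}$. In this case we can color the grid cells using $2^d$ colors in such a way that no two adjacent cells have the same color. Since we have $n^{(t)}$ occupied cells there must be one color $c$ which is assigned to at least $\lceil n^{(t)}/2^d \rceil$ occupied cells. Notice that $\lceil n^{(t)}/2^d \rceil \ge 2$. The occupied cells of the same color are pairwise not adjacent, therefore any pair of points that is contained in two distinct of these cells has a distance of at least $\frac{\epsilon \cdot t}{\sqrt{d}}$. We can conclude

\[
\mathrm{EMST} \ge \left(\left\lceil \frac{n^{(t)}}{2^d} \right\rceil - 1\right) \frac{\epsilon t}{\sqrt{d}} \ge \left(\left\lceil \frac{n^{(t)}}{2^d} \right\rceil - \frac{1}{2} \left\lceil \frac{n^{(t)}}{2^d} \right\rceil\right) \frac{\epsilon t}{\sqrt{d}} \ge \frac{n^{(t)} \epsilon t}{\sqrt{d}\, 2^{d+1}}.
\]

In the second case we get $W \ge \frac{t}{\sqrt{d}} \ge \frac{n^{(t)} \epsilon t}{\sqrt{d}\, 2^{d+1}}$ since $\tilde{W} \ge 4t$. This implies the result. □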

We now present a description of our method to estimate $c^{(t)}$. The idea is to pick a random set of vertices (with repetition) and start a BFS with a stochastic stopping rule at each vertex $v$ from the sample to determine the size of the connected component of $v$. If the BFS explores the whole connected component we set a corresponding indicator variable $\beta$ to 1 and otherwise to 0. To implement this algorithm we can use our sample set $S^{(t)}$. The sample set provides a multiset of points from $P^{(t)}$ chosen uniformly at random. It also provides all other points within a distance of at most $R = Dt$. Since we consider only edges of length at most $t$ and since the algorithm below stops exploring a component when it has size $D$ or larger, the BFS cannot reach a point with distance more than $R$ from the starting vertex. Therefore, our sample set $S^{(t)}$ is sufficient for our purposes. We remark that the random points from $S^{(t)}$ are not chosen exactly uniformly. We will choose the failure parameter $\delta'$ in the sampling data structure in such a way that the probability of each point is between $(1-\epsilon)$ and $(1+\epsilon)$ times its probability in the uniform distribution (this means we choose $\delta' = \epsilon \cdot \delta/\Delta^d$, and each point $p \in P^{(t)}$ is chosen with probability $(1 \pm \epsilon) \cdot \frac{1}{n^{(t)}}$). We take care of this fact in the analysis. The algorithm we use is given below.

APPROXCONNECTEDCOMPONENTS(P, t, )

Choose spoints q1, . . . , qs∈P(t)uniformly at random

for each qido

Choose integer Xaccording to distribution Prob[X≥k] = 1/k

if X≥Dthen βi=0

else

if Connected component of G(t)containing qihas at most Xvertices

then set βi=1

else set βi=0

Output: ^c(t)=

n(t)

s·Ps

i=1βi
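The estimator can be tried out offline with the Python sketch below. It is a simplification and not the streaming implementation: it samples start vertices exactly uniformly from an in-memory point list, uses the exact number of points in place of $\tilde{n}^{(t)}$, and finds neighbors by brute force. The function name and the parameters `s`, `D` and `rng` are assumptions made for this illustration.

```python
import math
import random
from collections import deque


def approx_connected_components(points, t, s, D, rng):
    """Estimate the number of components of the graph joining points at distance <= t."""
    n = len(points)
    hits = 0
    for _ in range(s):
        start = rng.randrange(n)
        u = 1.0 - rng.random()   # uniform in (0, 1]
        X = int(1.0 / u)         # Pr[X >= k] = Pr[u <= 1/k] = 1/k
        if X >= D:
            continue             # beta_i = 0
        # BFS that gives up as soon as more than X vertices have been seen.
        seen = {start}
        queue = deque([start])
        while queue and len(seen) <= X:
            v = queue.popleft()
            for w in range(n):
                if w not in seen and math.dist(points[v], points[w]) <= t:
                    seen.add(w)
                    queue.append(w)
        if not queue and len(seen) <= X:
            hits += 1            # whole component explored, size <= X: beta_i = 1
    return n * hits / s
```

Drawing $X$ as $\lfloor 1/u \rfloor$ for uniform $u \in (0, 1]$ is one simple way to realize the distribution $\Pr[X \ge k] = 1/k$; a component of size $|C| < D$ is then counted with probability $1/|C| - 1/D$, which is the quantity used in the analysis that follows.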

Thus, $\beta_i$ is an indicator random variable for the event that the connected component containing $q_i$ has at most $X$ vertices. We first show upper and lower bounds on the expected output value of the algorithm. Then we compute the variance and use it to show that the output is concentrated around its expectation. We obtain

\[
\begin{aligned}
E[\beta_i] &= \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} \Pr[q_i \in C] \cdot \Pr[X \ge |C| \wedge X < D] \\
&\le \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} \Pr[q_i \in C] \cdot \Pr[X \ge |C|] \\
&\le \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} (1+\epsilon) \cdot \frac{|C|}{n^{(t)}} \cdot \frac{1}{|C|} = (1+\epsilon) \cdot \frac{c^{(t)}}{n^{(t)}}.
\end{aligned}
\]

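For the output value

\[
\hat{c}^{(t)} = \frac{\tilde{n}^{(t)}}{s} \cdot \sum_{i=1}^{s} \beta_i
\]

of our algorithm we obtain

\[
E[\hat{c}^{(t)}] \le \frac{\tilde{n}^{(t)}}{n^{(t)}} (1+\epsilon)\, c^{(t)} \le (1+\epsilon)\, c^{(t)}. \tag{4.3}
\]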

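From

\[
\begin{aligned}
E[\beta_i] &= \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} \Pr[q_i \in C] \cdot \Pr[X \ge |C| \wedge X < D] \\
&\ge \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} (1-\epsilon) \cdot \frac{|C|}{n^{(t)}} \cdot \left(\frac{1}{|C|} - \frac{1}{D}\right) \\
&= (1-\epsilon) \cdot \left(\frac{c^{(t)}}{n^{(t)}} - \frac{1}{D}\right)
\end{aligned}
\]

and Lemma 4.4.2 we obtain

\[
E[\hat{c}^{(t)}] \ge (1-\epsilon) \cdot \frac{\tilde{n}^{(t)}}{n^{(t)}} \left(c^{(t)} - \frac{n^{(t)}}{D}\right) \ge (1-\epsilon)^2 \left(c^{(t)} - \frac{\mathrm{EMST} \cdot 2^{d+1} \sqrt{d}}{\epsilon\, t\, D}\right) \tag{4.4}
\]

\[
\ge (1-2\epsilon)\, c^{(t)} - \frac{\epsilon \cdot \mathrm{EMST}}{8\, t \log(\sqrt{d} \cdot \Delta)}. \tag{4.5}
\]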

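Our next step is to find an upper bound for the variance of $\hat{c}^{(t)}$. Since the $\beta_i$ are $\{0, 1\}$ random variables, we get:

\[
\mathrm{Var}[\beta_i] \le E[\beta_i^2] = E[\beta_i] \le (1+\epsilon) \cdot \frac{c^{(t)}}{n^{(t)}}.
\]

By mutual independence of the $\beta_i$'s we obtain for the variance of $\hat{c}^{(t)}$ for fixed $\tilde{n}^{(t)}$:

\[
\begin{aligned}
\mathrm{Var}[\hat{c}^{(t)}] &= \mathrm{Var}\left[\frac{\tilde{n}^{(t)}}{s} \sum_{i=1}^{s} \beta_i\right] \\
&= \frac{(\tilde{n}^{(t)})^2}{s^2} \cdot s \cdot \mathrm{Var}[\beta_i] \\
&\le (1+\epsilon) \cdot \frac{(\tilde{n}^{(t)})^2}{s} \cdot \frac{c^{(t)}}{n^{(t)}} \\
&\le (1+\epsilon) \cdot \frac{n^{(t)} c^{(t)}}{s}.
\end{aligned}
\]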

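Using (4.3) and (4.5) we obtain

\[
\left| c^{(t)} - E[\hat{c}^{(t)}] \right| \le 2\epsilon\, c^{(t)} + \frac{3\epsilon \cdot \mathrm{EMST}}{8 \cdot t \cdot \log(\sqrt{d} \cdot \Delta)}. \tag{4.6}
\]

We choose $s$, the number of sample points, as

\[
s := \frac{(1+\epsilon)\, 2^{2d+10} \cdot d \cdot \log^2(\sqrt{d} \cdot \Delta) \cdot \log_{1+\epsilon}(\sqrt{d} \cdot \Delta)}{\epsilon^4} = O\left(\frac{\log^3 \Delta}{\epsilon^5}\right).
\]

Chebyshev's inequality and Lemma 4.4.2 imply:

\[
\begin{aligned}
\Pr\left[\left|\hat{c}^{(t)} - E[\hat{c}^{(t)}]\right| \ge \frac{\epsilon \cdot \mathrm{EMST}}{8 \cdot t \cdot \log(\sqrt{d} \cdot \Delta)}\right]
&\le \frac{(1+\epsilon) \cdot n^{(t)} \cdot c^{(t)}}{s} \cdot \frac{64 \cdot t^2 \cdot \log^2(\sqrt{d} \cdot \Delta)}{\epsilon^2 \cdot \mathrm{EMST}^2} \\
&\le \frac{(1+\epsilon) \cdot 64 \cdot d \cdot 2^{2d+2} \cdot \log^2(\sqrt{d} \cdot \Delta)}{s \cdot \epsilon^4} \\
&\le \frac{1}{4 \log_{1+\epsilon}(\sqrt{d} \cdot \Delta)}.
\end{aligned}
\]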

Therefore we get together with (4.6):

Lemma 4.4.3 With probability $1 - \frac{1}{4 \log_{1+\epsilon}(\sqrt{d} \cdot \Delta)}$ we have $|\hat{c}^{(t)} - c^{(t)}| \le 2\epsilon\, c^{(t)} + \frac{\epsilon \cdot \mathrm{EMST}}{2 \cdot t \cdot \log(\sqrt{d} \cdot \Delta)}$.

It follows from the union bound that with probability at least 3/4 all $\tilde{c}^{(t)}$ values satisfy the inequality in Lemma 4.4.3. It remains to sum up the overall error taking into account that we considered connected components of the graph $G^{(t)}$ and not of the corresponding subgraph of $G_P$. Intuitively, the connected components of $G^{(t)}$ are sufficient because in each of the $G^{(t)}$ we moved every point by at most $\epsilon t$, which is small compared to the threshold edge length of $t$.


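Lemma 4.4.4 Let $\tilde{M}$ be the output of our algorithm. Then

\[
\left| \mathrm{EMST} - \tilde{M} \right| \le 69 \sqrt{d} \cdot \epsilon \cdot \mathrm{EMST}.
\]

Proof: We will first show that our output value

\[
\tilde{M} := n - \tilde{W} + \epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, \tilde{c}^{((1+\epsilon)^i)}
\]

is close to

\[
M_P := n - \tilde{W} + \epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)}
\]

which is a $(1+\epsilon)$-approximation of the EMST value by equation (4.2). From Lemma 4.4.3 and Claim 4.4.1 it follows that

\[
(1-2\epsilon)\, c_P^{((1+\epsilon)^{i+1})} - \frac{\epsilon \cdot \mathrm{EMST}}{2 (1+\epsilon)^i \log(\sqrt{d}\,\Delta)} \le \tilde{c}^{((1+\epsilon)^i)} \le (1+2\epsilon)\, c_P^{((1+\epsilon)^{i-2})} + \frac{\epsilon \cdot \mathrm{EMST}}{2 (1+\epsilon)^i \log(\sqrt{d}\,\Delta)}
\]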

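holds with probability at least $1 - \frac{1}{4 \log_{1+\epsilon}(\sqrt{d} \cdot \Delta)}$. By the union bound and some calculation we get with probability 3/4:

\[
\begin{aligned}
\epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, \tilde{c}^{((1+\epsilon)^i)}
&\ge \epsilon \cdot (1-2\epsilon) \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^{i+1})} - \frac{1}{2} \epsilon \cdot \mathrm{EMST} \\
&\ge \epsilon \cdot (1-2\epsilon) \sum_{i=1}^{\log_{1+\epsilon} \tilde{W}} (1+\epsilon)^{i-1} \, c_P^{((1+\epsilon)^i)} - \epsilon \cdot \mathrm{EMST} \\
&\ge \epsilon \cdot \frac{1-2\epsilon}{1+\epsilon} \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} - \frac{\epsilon (1-2\epsilon)}{1+\epsilon}\, n - \epsilon \cdot \mathrm{EMST} \\
&\ge (1-4\epsilon) \cdot \epsilon \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} - \epsilon n - \epsilon \cdot \mathrm{EMST}
\end{aligned}
\]

and

\[
\begin{aligned}
\epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, \tilde{c}^{((1+\epsilon)^i)}
&\le \frac{1}{2} \epsilon \cdot \mathrm{EMST} + \epsilon \sum_{i=-2}^{\log_{1+\epsilon} \tilde{W} - 3} (1+\epsilon)^{i+2} (1+2\epsilon)\, c_P^{((1+\epsilon)^i)} \\
&\le \epsilon \cdot \mathrm{EMST} + (1+\epsilon)^2 (1+2\epsilon)\, \epsilon \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} + 2\epsilon (1+\epsilon)(1+2\epsilon)\, n \\
&\le (1+11\epsilon) \cdot \epsilon \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} + 12 \epsilon n + \epsilon \cdot \mathrm{EMST}
\end{aligned}
\]

which gives us (together with $\tilde{W} \le 4\sqrt{d} \cdot \mathrm{EMST}$ and $n \le \mathrm{EMST}$) a bound on the difference of $M_P$ and $\tilde{M}$:

\[
\begin{aligned}
\left| M_P - \tilde{M} \right| &\le 11 \epsilon^2 \sum_{i=0}^{\log_{1+\epsilon} \tilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} + 12 \epsilon n + \epsilon \cdot \mathrm{EMST} \\
&= 11 \epsilon\, (M_P - n + \tilde{W}) + 12 \epsilon n + \epsilon \cdot \mathrm{EMST} \\
&\le 24 \epsilon \cdot \mathrm{EMST} + 44 \sqrt{d}\, \epsilon \cdot \mathrm{EMST}.
\end{aligned}
\]

By the triangle inequality and (4.2) we get the final result:

\[
\left| \tilde{M} - \mathrm{EMST} \right| \le \left| \tilde{M} - M_P \right| + \left| M_P - \mathrm{EMST} \right| \le 24 \epsilon \cdot \mathrm{EMST} + 44 \sqrt{d}\, \epsilon \cdot \mathrm{EMST} + \epsilon \cdot \mathrm{EMST} \le 69 \sqrt{d}\, \epsilon \cdot \mathrm{EMST}.
\]

□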

From this lemma our final result follows immediately using standard amplification techniques

to ensure that the estimation is correct at every point of time.

Theorem 10 Given a sequence of insertions / deletions of points from the discrete $d$-dimensional space $\{1, \dots, \Delta\}^d$ there is a streaming algorithm that uses $O(\log^3(1/\delta) \cdot (\log(\Delta)/\epsilon)^{O(d)})$ space and $O(\log^3(1/\delta) \cdot (\log(\Delta)/\epsilon)^{O(d)})$ time (for constant $d$) for each update and computes with probability $1 - \delta$ a $(1+\epsilon)$-approximation of the weight of the Euclidean minimum spanning tree.

5 The Coreset Method

In this chapter we introduce a method to reduce the complexity of huge point sets. Assume we are given a huge point set $P$. To understand the structure of the point set it is often a good idea to partition the point set into clusters, such that two points near one another are in the same cluster, but points far from each other are in different clusters. In this section we will look at different clustering objectives: the k-median, k-means, and MaxCut clustering.

Computing clusterings on the huge point sets directly is often impossible. Traditional cluster-

ing algorithms usually have time and space complexity at least linear in the number of points and

need random access to the data. For huge point sets which do not fit into our local memory at all

traditional algorithms are not applicable.

In this chapter we will address this problem by introducing a new technique to reduce the complexity of the point set. We will combine points into so-called weighted coreset points. One coreset point $p$ having a weight of $w(p)$ then represents $w(p)$ points of our input instance. We will compute a coreset having certain theoretical guarantees. The most important ones are that it is small (its size is logarithmic in the number of points) and that each clustering solution computed on the coreset is a $(1 \pm \epsilon)$-approximate solution on the input point set.

Unlike previous coreset construction techniques [8, 60, 61] our method does not make assump-

tions on the distribution of points in advance. This will enable us to develop the fastest known

PTAS for Euclidean MaxCut in Section 5.5.3, the first efficient k-median and k-means clustering

algorithms for dynamic geometric data streams in Chapter 6, the first kinetic data structures for

MaxCut in Chapter 7 and efficient k-means implementations for huge point sets in Chapter 8.

5.1 Definitions

The most important problems we will address with our coreset technique are the clustering prob-

lems k-median, k-means, and MaxCut as defined in Section 2.3. However, we will define a set of

other example problems here, which can as well be solved using the coreset technique presented

later.

The maximum matching problem asks to find a perfect matching of the points in $P$ that maximizes the sum of the lengths of the matching edges. For a matching $M$ we denote its cost by

\[
\mathrm{MaxMatching}(P, M) = \sum_{(p,q) \in M} d(p, q).
\]

The maximum travelling salesperson problem is to find a simple cycle $C$ (a tour) of the points in $P$ with maximal cost. We denote its cost by

\[
\mathrm{MaxTSP}(P, C) = \sum_{(p,q) \in C} d(p, q).
\]

We will also consider the problem to compute the average distance between points in P.

5.1.1 Oblivious Optimization Problems

In this section we define oblivious optimization problems over point sets. Intuitively, an oblivious

optimization problem has the property that for a fixed input the set of feasible solutions depends

on the cardinality of the input set and not on the input set itself. Hence, there is a set of solutions

that are feasible independent of the position of the input points. The quality of the solutions,

however, may depend on the positions. MaxCut, MaxMatching, MaxTSP, and AverageDistance

can be formulated as oblivious optimization problems.

Let us consider an optimization problem $\Pi$ on point sets in $\mathbb{R}^d$. $\Pi$ can be either a maximization or minimization problem.

We call $\Pi$ an oblivious optimization problem if it has the following structure. For any $n$-tuple of points $P = (p_1, \dots, p_n)$ let $S_\Pi(n)$ denote the set of feasible solutions, i.e. the set of feasible solutions depends only on the size of the input instance and not on the instance itself. Further, we have for each $n \in \mathbb{N}$ and $s \in S_\Pi(n)$ an objective function $\mathrm{cost}_\Pi^{(n,s)}$ that assigns to $P = (p_1, \dots, p_n)$ a non-negative cost. We assume that given a permutation $\pi$ of the points and a solution $s$, there is always another solution $s'$ having the cost $\mathrm{cost}_\Pi^{(n,s')}(\pi(p_1), \dots, \pi(p_n)) = \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n)$ (we say that $\Pi$ does not depend on the order of the points). If $\Pi$ is a maximization (minimization) problem then one seeks to find for a given $P = \{p_1, \dots, p_n\}$ the solution $s^*$ that maximizes (minimizes) $\mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n)$. We write $\mathrm{Opt}_\Pi(P) := \mathrm{cost}_\Pi^{(n,s^*)}(p_1, \dots, p_n)$.

Example 5.1.1 Euclidean MaxCut: A feasible solution $s$ consists of a partition of $\{1, \dots, n\}$ into 2 groups $C_1, C_2$. For technical reasons we scale the usually used cost of the MaxCut problem by $\frac{1}{n}$. The cost of $s$ on $P = \{p_1, \dots, p_n\}$ is then given by

\[
\mathrm{cost}_{\mathrm{maxcut}}^{(n,s)}(p_1, \dots, p_n) = \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} d(p_i, p_j).
\]

Example 5.1.2 Euclidean MaxMatching: Assume that $n$ is even. A feasible solution $s$ is a partition of $\{1, \dots, n\}$ into $n/2$ pairs $E_1, \dots, E_{n/2}$ where $E_i = (a_i, b_i)$. The cost of $s$ on $P = \{p_1, \dots, p_n\}$ is given by

\[
\mathrm{cost}_{\mathrm{maxmatching}}^{(n,s)}(p_1, \dots, p_n) = \sum_{i=1}^{n/2} d(p_{a_i}, p_{b_i}).
\]

Example 5.1.3 Euclidean MaxTSP: A feasible solution $s$ is a permutation of $\{1, \dots, n\}$. The cost of $s$ on $P = \{p_1, \dots, p_n\}$ is given by

\[
\mathrm{cost}_{\mathrm{maxtsp}}^{(n,s)}(p_1, \dots, p_n) = d(p_{s(n)}, p_{s(1)}) + \sum_{i=1}^{n-1} d(p_{s(i)}, p_{s(i+1)}).
\]

Example 5.1.4 Average Distance: Since we only want to estimate the value of the average distance, there is only one feasible solution $s$. For technical reasons we scale the average distance of the points by $n$ and denote the cost function of the solution by:

\[
\mathrm{cost}_{\mathrm{avgdistance}}^{(n,s)}(p_1, \dots, p_n) = \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} d(p_i, p_j).
\]
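The four cost functions of Examples 5.1.1 to 5.1.4 can be written down directly. The Python sketch below is only an illustration of the definitions; it uses 0-based indices instead of the 1-based indices of the text, and the function names are hypothetical. A solution is passed as index sets, index pairs, or a permutation, so the same solution can be evaluated on different point tuples of the same size.

```python
import math


def maxcut_cost(points, c1, c2):
    """Scaled MaxCut cost: (1/n) * sum of distances across the cut (c1, c2)."""
    n = len(points)
    return sum(math.dist(points[i], points[j]) for i in c1 for j in c2) / n


def maxmatching_cost(points, pairs):
    """Sum of distances over the index pairs (a_i, b_i) of a perfect matching."""
    return sum(math.dist(points[a], points[b]) for a, b in pairs)


def maxtsp_cost(points, perm):
    """Length of the closed tour that visits the points in the order perm."""
    n = len(points)
    return sum(math.dist(points[perm[i]], points[perm[(i + 1) % n]]) for i in range(n))


def avg_distance_cost(points):
    """Average distance scaled by n: (1/(n-1)) * sum over ordered pairs i != j."""
    n = len(points)
    total = sum(math.dist(points[i], points[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n - 1)
```

Because a solution refers only to indices, evaluating it on a slightly moved copy of the points is well defined, which is exactly what the Lipschitz condition below compares.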

Since the definition of the solution is oblivious of the position of points we can speak of

the change of the cost of solution swhen moving from point set Pto another point set P0. In

particular our coreset construction needs the following two conditions.

Definition 5.1.5 ($\ell$-Lipschitz) Let $\ell \ge 1$ be a constant. We say that $\mathrm{cost}_\Pi^{(n,s)}$ is $\ell$-Lipschitz, if for arbitrary points $p_1, \dots, p_n$ and $p'_1, \dots, p'_n$ with $\sum_{i=1}^{n} d(p_i, p'_i) \le D$ we have

\[
\left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \le \ell \cdot D
\]

and

\[
p_1 = p_2 = \dots = p_n \implies \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) = 0.
\]

We call $\Pi$ $\ell$-Lipschitz, if for every $n \in \mathbb{N}$ and $s \in S_\Pi(n)$ the objective function $\mathrm{cost}_\Pi^{(n,s)}$ is $\ell$-Lipschitz.

Definition 5.1.6 ($\lambda$-mean preserving) Let $\lambda \le 1$ be a constant. We say that $\Pi$ is $\lambda$-mean preserving, if for any point set $P$ in $\mathbb{R}^d$ we have

\[
\mathrm{Opt}_\Pi(P) \ge \lambda \cdot \sum_{p \in P} d(p, \mu),
\]

where $\mu := \mu(P)$ is the mean or center of gravity of $P$ (see Section 2.1).

We will show that all optimization problems stated before are $\ell$-Lipschitz and $\lambda$-mean preserving with constants $\ell$ and $\lambda$.

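Lemma 5.1.7 The Euclidean MaxCut problem is $\ell$-Lipschitz with $\ell = 1$ and $\lambda$-mean preserving with $\lambda = \frac{1}{4}$.

Proof: We first show that MaxCut is 1-Lipschitz.

\[
\begin{aligned}
&\left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \\
&= \left| \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} d(p_i, p_j) - \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} d(p'_i, p'_j) \right| \\
&\le \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} \left| d(p_i, p_j) - d(p'_i, p'_j) \right| \\
&= \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} \max\{ d(p_i, p_j) - d(p'_i, p'_j),\; d(p'_i, p'_j) - d(p_i, p_j) \} \\
&\le \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} \max\{ d(p_i, p'_i) + d(p'_i, p'_j) + d(p'_j, p_j) - d(p'_i, p'_j), \\
&\qquad\qquad\qquad\qquad\;\; d(p'_i, p_i) + d(p_i, p_j) + d(p_j, p'_j) - d(p_i, p_j) \} \\
&= \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} \left( d(p_i, p'_i) + d(p'_j, p_j) \right) \\
&\le \frac{1}{n} \sum_{i \in C_1} n \cdot d(p_i, p'_i) + \frac{1}{n} \cdot n \cdot \sum_{j \in C_2} d(p_j, p'_j) = \sum_{i=1}^{n} d(p_i, p'_i) = D
\end{aligned}
\]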

To show that MaxCut is $\lambda$-mean preserving we first show the following known inequality (also shown in [39]):

\[
d(p, \mu) \le \frac{1}{n} \sum_{q \in P} d(p, q). \tag{5.1}
\]

First, note that by projecting all points on the line through $p$ and $\mu$, the left side does not change and the right side does not increase. We can therefore assume that all points lie on a line, i.e. that all points are real numbers. For real numbers we obtain:

\[
d(p, \mu) = |p - \mu| = \left| p - \frac{1}{n} \cdot \sum_{q \in P} q \right| = \frac{1}{n} \cdot \left| n \cdot p - \sum_{q \in P} q \right| = \frac{1}{n} \cdot \left| \sum_{q \in P} (p - q) \right| \le \frac{1}{n} \sum_{q \in P} |p - q|.
\]

We now show that MaxCut is $\lambda$-mean preserving with $\lambda = \frac{1}{4}$. We first consider a random cut $C_1, C_2$. For each $i \in \{1, \dots, n\}$ we flip an unbiased coin to decide whether it belongs to $C_1$ or to $C_2$. Since for every pair of indices $i, j$ the probability of separating the indices is $\frac{1}{2}$, the distance from $p_i$ to $p_j$ is counted in the sum with probability $\frac{1}{2}$. The expected value of the objective function $\frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} d(p_i, p_j)$ is therefore $\frac{1}{4n} \sum_{p,q \in P} d(p, q)$. Since $\mathrm{Opt}$ denotes the maximum value of such a cut, we have

\[
\mathrm{Opt} \ge \frac{1}{4n} \sum_{p,q \in P} d(p, q).
\]

From equation (5.1) it follows that

\[
\mathrm{Opt} \ge \frac{1}{4} \sum_{p \in P} d(p, \mu)
\]

which means that the Euclidean MaxCut problem is $\lambda$-mean preserving with $\lambda = \frac{1}{4}$. □
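The bound $\mathrm{Opt} \ge \frac{1}{4} \sum_{p \in P} d(p, \mu)$ can be checked numerically on tiny instances. The Python sketch below enumerates all $2^n$ cuts, so it is meant only as a sanity check for a handful of points and not as an algorithm; the function names are hypothetical.

```python
import math
from itertools import product


def maxcut_opt(points):
    """Optimal scaled MaxCut value (cost divided by n), by brute force over all cuts."""
    n = len(points)
    best = 0.0
    for sides in product((0, 1), repeat=n):
        cost = sum(math.dist(points[i], points[j])
                   for i in range(n) for j in range(n)
                   if sides[i] == 0 and sides[j] == 1) / n
        best = max(best, cost)
    return best


def sum_of_distances_to_mean(points):
    """Sum over all points of the distance to the center of gravity mu(P)."""
    n = len(points)
    d = len(points[0])
    mu = tuple(sum(p[a] for p in points) / n for a in range(d))
    return sum(math.dist(p, mu) for p in points)
```

On the four corners of the unit square the optimal scaled cut value is $(2 + 2\sqrt{2})/4 \approx 1.21$, while $\frac{1}{4} \sum_{p} d(p, \mu) = \sqrt{2}/2 \approx 0.71$.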

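Lemma 5.1.8 The Euclidean Maximum Weighted Matching problem is $\ell$-Lipschitz with $\ell = 1$ and $\lambda$-mean preserving with $\lambda = \frac{1}{2}$.

Proof: We first show that Euclidean Maximum Weighted Matching is 1-Lipschitz.

\[
\begin{aligned}
&\left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \\
&= \left| \sum_{i=1}^{n/2} d(p_{a_i}, p_{b_i}) - \sum_{i=1}^{n/2} d(p'_{a_i}, p'_{b_i}) \right| \\
&\le \sum_{i=1}^{n/2} \left| d(p_{a_i}, p_{b_i}) - d(p'_{a_i}, p'_{b_i}) \right| \\
&= \sum_{i=1}^{n/2} \max\{ d(p_{a_i}, p_{b_i}) - d(p'_{a_i}, p'_{b_i}),\; d(p'_{a_i}, p'_{b_i}) - d(p_{a_i}, p_{b_i}) \} \\
&\le \sum_{i=1}^{n/2} \max\{ d(p_{a_i}, p'_{a_i}) + d(p'_{a_i}, p'_{b_i}) + d(p'_{b_i}, p_{b_i}) - d(p'_{a_i}, p'_{b_i}), \\
&\qquad\qquad\;\; d(p'_{a_i}, p_{a_i}) + d(p_{a_i}, p_{b_i}) + d(p_{b_i}, p'_{b_i}) - d(p_{a_i}, p_{b_i}) \} \\
&= \sum_{i=1}^{n/2} \left( d(p_{a_i}, p'_{a_i}) + d(p'_{b_i}, p_{b_i}) \right) \\
&= \sum_{i=1}^{n} d(p_i, p'_i) = D
\end{aligned}
\]

where the last equality comes from the fact that $\{1, \dots, n\}$ is partitioned into pairs $(a_i, b_i)$.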

To show that it is $\frac{1}{2}$-mean preserving we look at a random matching constructed in the following way: We connect the first index $i = 1$ to one other index $j \in \{1, \dots, n\} \setminus \{1\}$ chosen uniformly from all other indices. Then we delete both indices from $\{1, \dots, n\}$ and go on with this construction until all indices are matched. Notice that each pair $(i, j)$ with $i \ne j$ belongs to our matching with probability $\frac{1}{n-1}$. When $M$ denotes the aggregated cost of this matching, then

\[
\mathrm{Opt} \ge E[M] = \frac{1}{2(n-1)} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \frac{1}{2n} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \frac{1}{2} \sum_{p \in P} d(p, \mu)
\]

where the last inequality follows from equation (5.1). □

Lemma 5.1.9 The Euclidean Maximum Travelling Salesman problem is $\ell$-Lipschitz with $\ell = 2$ and $\lambda$-mean preserving with $\lambda = 1$.

Proof: We first show that Euclidean Maximum Travelling Salesman is 2-Lipschitz.

\[
\begin{aligned}
&\left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \\
&= \left| d(p_{s(n)}, p_{s(1)}) + \sum_{i=1}^{n-1} d(p_{s(i)}, p_{s(i+1)}) - d(p'_{s(n)}, p'_{s(1)}) - \sum_{i=1}^{n-1} d(p'_{s(i)}, p'_{s(i+1)}) \right| \\
&\le \left| d(p_{s(n)}, p_{s(1)}) - d(p'_{s(n)}, p'_{s(1)}) \right| + \sum_{i=1}^{n-1} \left| d(p_{s(i)}, p_{s(i+1)}) - d(p'_{s(i)}, p'_{s(i+1)}) \right| \\
&= \max\{ d(p_{s(n)}, p_{s(1)}) - d(p'_{s(n)}, p'_{s(1)}),\; d(p'_{s(n)}, p'_{s(1)}) - d(p_{s(n)}, p_{s(1)}) \} \\
&\quad + \sum_{i=1}^{n-1} \max\{ d(p_{s(i)}, p_{s(i+1)}) - d(p'_{s(i)}, p'_{s(i+1)}),\; d(p'_{s(i)}, p'_{s(i+1)}) - d(p_{s(i)}, p_{s(i+1)}) \} \\
&\le \max\{ d(p_{s(n)}, p'_{s(n)}) + d(p'_{s(n)}, p'_{s(1)}) + d(p'_{s(1)}, p_{s(1)}) - d(p'_{s(n)}, p'_{s(1)}), \\
&\qquad\quad\; d(p'_{s(n)}, p_{s(n)}) + d(p_{s(n)}, p_{s(1)}) + d(p_{s(1)}, p'_{s(1)}) - d(p_{s(n)}, p_{s(1)}) \} \\
&\quad + \sum_{i=1}^{n-1} \max\{ d(p_{s(i)}, p'_{s(i)}) + d(p'_{s(i)}, p'_{s(i+1)}) + d(p'_{s(i+1)}, p_{s(i+1)}) - d(p'_{s(i)}, p'_{s(i+1)}), \\
&\qquad\qquad\quad\;\; d(p'_{s(i)}, p_{s(i)}) + d(p_{s(i)}, p_{s(i+1)}) + d(p_{s(i+1)}, p'_{s(i+1)}) - d(p_{s(i)}, p_{s(i+1)}) \} \\
&= d(p_{s(n)}, p'_{s(n)}) + d(p'_{s(1)}, p_{s(1)}) + \sum_{i=1}^{n-1} \left( d(p_{s(i)}, p'_{s(i)}) + d(p'_{s(i+1)}, p_{s(i+1)}) \right) \\
&= 2 \sum_{i=1}^{n} d(p_{s(i)}, p'_{s(i)}) = 2 \sum_{i=1}^{n} d(p_i, p'_i) = 2 \cdot D.
\end{aligned}
\]

To show that it is 1-mean preserving we look at a random permutation $s$ chosen uniformly from the set of permutations: Notice that each possible edge $(i, j)$ with $i \ne j$ appears in the sum with probability $\frac{1}{n-1} + \frac{1}{n-2} \ge \frac{2}{n-1}$. When $M$ denotes the aggregated cost of this solution, then

\[
\mathrm{Opt} \ge E[M] \ge \frac{1}{n-1} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \frac{1}{n} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \sum_{p \in P} d(p, \mu)
\]

where the last inequality follows from equation (5.1). □


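Lemma 5.1.10 The Euclidean Average Distance problem is $\ell$-Lipschitz with $\ell = 4$ and $\lambda$-mean preserving with $\lambda = 1$.

Proof: We first show that Euclidean Average Distance is 4-Lipschitz.

\[
\begin{aligned}
&\left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \\
&= \left| \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} d(p_i, p_j) - \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} d(p'_i, p'_j) \right| \\
&\le \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} \left| d(p_i, p_j) - d(p'_i, p'_j) \right| \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} \max\{ d(p_i, p_j) - d(p'_i, p'_j),\; d(p'_i, p'_j) - d(p_i, p_j) \} \\
&\le \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} \max\{ d(p_i, p'_i) + d(p'_i, p'_j) + d(p'_j, p_j) - d(p'_i, p'_j), \\
&\qquad\qquad\qquad\qquad\qquad\qquad\;\; d(p'_i, p_i) + d(p_i, p_j) + d(p_j, p'_j) - d(p_i, p_j) \} \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} \left( d(p_i, p'_i) + d(p_j, p'_j) \right) \\
&\le \frac{2}{n} \sum_{i,j=1}^{n} 2 \cdot d(p_i, p'_i) = 4 \cdot \sum_{i=1}^{n} d(p_i, p'_i) = 4 \cdot D
\end{aligned}
\]

Euclidean Average Distance is 1-mean preserving:

\[
\frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} d(p_i, p_j) = \frac{1}{n-1} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \frac{n}{n-1} \sum_{i=1}^{n} d(p_i, \mu) \ge \sum_{p \in P} d(p, \mu)
\]

□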

5.1.2 Coresets

Intuitively, a coreset is a small weighted set of points that approximates a large (typically un-

weighted) point set with respect to an optimization problem. We first give a definition of coresets

for k-median and k-means clustering [61].

Definition 5.1.11 (Coresets I.) [61] Let $P$ be a weighted set of $n$ points in $\mathbb{R}^d$. A weighted point set $P_{core}$ in $\mathbb{R}^d$ is an $\epsilon$-coreset for the k-median problem, if for every set $C$ of $k$ centers $(1-\epsilon) \cdot \mathrm{Median}(P, C) \le \mathrm{Median}(P_{core}, C) \le (1+\epsilon) \cdot \mathrm{Median}(P, C)$. In a similar way a weighted point set $P_{core}$ is an $\epsilon$-coreset for the k-means clustering problem, if for every set $C$ of $k$ centers $(1-\epsilon) \cdot \mathrm{Means}(P, C) \le \mathrm{Means}(P_{core}, C) \le (1+\epsilon) \cdot \mathrm{Means}(P, C)$.


Now we generalize the definition of coresets to arbitrary oblivious optimization problems.

Our definition will be only for unweighted point sets. However, by replacing weighted points by

multiple copies we can easily generalize this definition to weighted point sets.

Definition 5.1.12 (Coresets II.) Let $\Pi$ be an oblivious optimization problem. Let $P$ be a set of $n$ points in $\mathbb{R}^d$. A weighted set of points $P_{core}$ in $\mathbb{R}^d$ with weight function $w: P_{core} \to \mathbb{N}$ is an $\epsilon$-coreset for $\Pi$, if there exists a mapping $\gamma: P \to P_{core}$ that satisfies the following constraints.

• For every $q \in P_{core}$ we have $|\gamma^{-1}(q)| = w(q)$.

• For every solution $s \in S_\Pi(n)$ we have

\[
\mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \epsilon \cdot \mathrm{Opt}_\Pi(P) \le \mathrm{cost}_\Pi^{(n,s)}(\gamma(p_1), \dots, \gamma(p_n)) \le \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) + \epsilon \cdot \mathrm{Opt}_\Pi(P).
\]

For an ordered point set $P := (p_1, \dots, p_n)$ we define $\gamma(P) = (\gamma(p_1), \dots, \gamma(p_n))$. Since the oblivious optimization problem does not depend on the order of the points, we can define $\mathrm{Opt}_\Pi(P_{core}, w) := \mathrm{Opt}_\Pi(\gamma(P))$.

From Definition 5.1.12 the definition of coresets for the problems Euclidean MaxCut, Euclidean MaxTSP, Euclidean MaxMatching, and Average Distance follows.

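Lemma 5.1.13 Let $\Pi$ be an oblivious optimization problem, $P$ be a set of $n$ points in $\mathbb{R}^d$ and let $P_{core} \subset \mathbb{R}^d$ with weight function $w: P_{core} \to \mathbb{N}$ be an $\epsilon$-coreset for $P$ and $\Pi$. Then

\[
\mathrm{Opt}_\Pi(P_{core}, w) \in \mathrm{Opt}_\Pi(P) \cdot (1 \pm \epsilon)
\]

and

\[
\mathrm{Opt}_\Pi(P) \in \mathrm{Opt}_\Pi(P_{core}, w) \cdot (1 \pm 2\epsilon).
\]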

5.2 Coresets for k-Median

We now give a description of our technique to construct coresets of small size. We always assume that all points lie in a bounding cube of side length 1, for example $[0, 1]^d$. This can be achieved by scaling the points appropriately. Additionally we assume that the optimal objective function value $\mathrm{Opt}$ of the respective problem is at least $1/\tilde{\Delta}$. We only need a weak bound $\tilde{\Delta}$, since all space and time bounds of our algorithms will depend logarithmically on $\tilde{\Delta}$.

In this section we describe our coreset construction for the k-median problem. The next chap-

ters will adapt the techniques to construct coresets for k-means and oblivious optimization prob-

lems.


5.2.1 Construction of the Coreset

We impose $Z$ nested square grids $G_0,\dots,G_{Z-1}$ over the point space for the parameter

$$Z = \left\lceil \log\frac{4\cdot k\cdot 10^d\cdot n(1+\log n)\cdot d^{(d+1)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+1}} \right\rceil + 1.$$

The side length of the grid cells in grid $G_i$ is $1/2^i$. Our goal will be to identify for each grid $G_i$ its heavy cells, i.e. cells that contain more than a certain threshold of points. This threshold depends on the side length of the grid cells and grows with its inverse (the smaller the cells, the larger the threshold). We will parametrize the threshold by some small value $\delta < n$, which is specified later.

Definition 5.2.1 (Heavy Cells) We call a cell of grid $G_i$ heavy, if it contains at least $\delta\cdot 2^i$ points of $P$. A grid cell that is not heavy is light.

Our process consists of two phases. In phase I we determine the coreset points. In phase II we determine their weights. We begin with a description of phase I. The algorithm starts with the coarsest grid $G_0$. First, it identifies every heavy cell in $G_0$. Note that grid $G_0$ consists of only one cell containing all points. Since $\delta < n$ this cell is a heavy cell. Then the algorithm subdivides every heavy cell $C$ into $2^d$ equal-sized cubic subcells. These subcells are contained in grid $G_1$. We call $C$ the parent cell of these subcells. If none of these subcells is heavy we place a coreset point in the center of the cell $C$. Otherwise, the algorithm recurses this process with all heavy subcells. The recursion eventually stops because at some point a heavy cell is required to have more than $n$ points inside.

It remains to determine the weight of each coreset point. This is done in phase II of the algorithm. We can think of phase II as 'moving the points' to their corresponding coreset points. The weight of a coreset point is simply the number of points moved to its position. The movement must satisfy the following invariant:

(a) every point stays in the smallest heavy cell it is contained in.

By our construction, every heavy cell must contain a coreset point. Thus it is easy to satisfy our invariant. We can simply move every point $p$ of $P$ to an arbitrary coreset point that is contained in the smallest heavy cell containing $p$. Finally, the weight of each coreset point is determined by the number of points moved to it. We will prove that for suitably chosen $\delta$ the resulting weighted set of points $P_{core}$ is an $\epsilon$-coreset for the $k$-median problem of size $O(k\cdot\log n/\epsilon^{d+1})$.

Below we describe our algorithm in pseudocode. It computes the coreset $P_{core}$ together with weights $w(p)$ for each point $p \in P_{core}$.

COMPUTECORESET($P$, Opt)
    Let $H_1,\dots,H_h$ denote the heavy cells in grid $G_0$
    return $\bigcup_{i=1}^{h}$ COMPUTECORESETPOINTS($H_i$)

COMPUTECORESETPOINTS(cell $C$)
    if $C$ has no heavy subcells then
        $p \leftarrow$ center of $C$
        $w(p) \leftarrow$ number of points in $C$
        return $\{p\}$
    else
        Let $H_1,\dots,H_h$ denote the heavy subcells of $C$
        Let $L_1,\dots,L_\ell$ denote the light subcells of $C$
        Let $m$ denote the number of points in $\bigcup_{i=1}^{\ell} L_i$
        $P_{core} = \bigcup_{i=1}^{h}$ COMPUTECORESETPOINTS($H_i$)
        Let $q$ be an arbitrary point in $P_{core}$
        $w(q) \leftarrow w(q) + m$
        return $P_{core}$
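The pseudocode above can be turned into a short runnable sketch. The Python below is only an illustration under stated assumptions: the points are tuples in $[0,1]^d$, the parameter `max_level` (hypothetical, standing in for $Z-1$) caps the recursion depth, the cell counts are kept in hash tables rather than in a quadtree, and the names `cell_of` and `compute_coreset` are not from the text.

```python
from collections import Counter
from itertools import product


def cell_of(point, level):
    """Index of the cell of grid G_level (side 2**-level) containing `point`."""
    side = 2 ** level
    return tuple(min(int(x * side), side - 1) for x in point)


def compute_coreset(points, delta, max_level):
    """Return {coreset point: weight} using the heavy-cell threshold delta * 2**i."""
    d = len(points[0])
    # counts[i][cell] = number of input points in that cell of grid G_i
    counts = [Counter(cell_of(p, i) for p in points) for i in range(max_level + 1)]

    def is_heavy(level, cell):
        return counts[level][cell] >= delta * 2 ** level

    def recurse(level, cell):
        heavy_subcells = []
        if level < max_level:
            for bits in product((0, 1), repeat=d):
                sub = tuple(2 * c + b for c, b in zip(cell, bits))
                if is_heavy(level + 1, sub):
                    heavy_subcells.append(sub)
        if not heavy_subcells:
            # heavy cell without heavy subcells: one coreset point at its center
            center = tuple((c + 0.5) / 2 ** level for c in cell)
            return {center: counts[level][cell]}
        core = {}
        for sub in heavy_subcells:
            core.update(recurse(level + 1, sub))
        # points in light subcells move to an arbitrary coreset point of this cell
        m = counts[level][cell] - sum(counts[level + 1][s] for s in heavy_subcells)
        q = next(iter(core))
        core[q] += m
        return core

    root = (0,) * d
    return recurse(0, root) if is_heavy(0, root) else {}
```

For instance, four copies of (0.1, 0.1) plus the point (0.9, 0.9) with `delta=2` yield a single coreset point at (0.25, 0.25) carrying the total weight 5: the light cell holding (0.9, 0.9) is charged to the only coreset point inside the heavy root cell.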

We will now prove that the point set computed by COMPUTECORESET is indeed an $\epsilon$-coreset for $k$-median.

We denote by $\mathcal{L}(i)$ the set of non-empty light cells of grid $G_i$ whose parent cell is heavy. Notice that every point of $P$ is contained in exactly one cell of $\bigcup_i \mathcal{L}(i)$.

Claim 5.2.2 Any point $p \in \mathcal{L}(i)$ is moved a distance of at most $\sqrt{d}/2^{i-1}$ during our coreset construction.

Proof: Our invariant ensures that every point $p$ stays within the smallest heavy cell it is contained in. Every point $p$ that is contained in $\mathcal{L}(i)$ is contained in a heavy cell in grid $G_{i-1}$. Therefore, it is moved at most the diagonal length of cells in $G_{i-1}$, i.e. $\sqrt{d}/2^{i-1}$. □

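From now on let $C$ be an arbitrary fixed set of $k$ centers. We partition the sets $\mathcal{L}(i)$ into two subsets $\mathcal{L}_{near}(i)$ and $\mathcal{L}_{dist}(i)$. $\mathcal{L}_{near}(i)$ contains all cells $\mathcal{C}$ whose distance $\min_{q\in C} d(q,\mathcal{C})$ to the nearest center from $C$ is at most $\frac{4}{\epsilon}\cdot\frac{\sqrt d}{2^i}$, i.e.

$$\mathcal{L}_{near}(i) = \Big\{\mathcal{C}\in\mathcal{L}(i) \;\Big|\; \min_{q\in C} d(q,\mathcal{C}) \le \frac{4}{\epsilon}\cdot\frac{\sqrt d}{2^i}\Big\}.$$

$\mathcal{L}_{dist}(i)$ contains all other cells from $\mathcal{L}(i)$, i.e.

$$\mathcal{L}_{dist}(i) = \Big\{\mathcal{C}\in\mathcal{L}(i) \;\Big|\; \min_{q\in C} d(q,\mathcal{C}) > \frac{4}{\epsilon}\cdot\frac{\sqrt d}{2^i}\Big\}.$$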

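Claim 5.2.3 The total movement $\sum_i\sum_{p\in\mathcal{L}_{dist}(i)}\frac{\sqrt d}{2^{i-1}}$ of points in distant cells satisfies

$$\sum_i\sum_{p\in\mathcal{L}_{dist}(i)}\frac{\sqrt d}{2^{i-1}} \le \frac{\epsilon}{2}\,\mathrm{Median}(P,C).$$

Proof: We use a charging argument from [61].

$$\sum_i\sum_{p\in\mathcal{L}_{dist}(i)}\frac{\sqrt d}{2^{i-1}} = \frac{\epsilon}{2}\sum_i\sum_{p\in\mathcal{L}_{dist}(i)}\frac{4}{\epsilon}\cdot\frac{\sqrt d}{2^{i}} \le \frac{\epsilon}{2}\cdot\mathrm{Median}(P,C),$$

where the last inequality follows from the fact that every point in $\mathcal{L}_{dist}(i)$ contributes more than $\frac{4}{\epsilon}\cdot\frac{\sqrt d}{2^i}$ to the cost of the solution $C$. □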

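Claim 5.2.4 For $\delta \le \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{4\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{(d+1)/2}}$ we get

$$\sum_i\sum_{p\in\mathcal{L}_{near}(i)}\frac{\sqrt d}{2^{i-1}} \le \frac{\epsilon}{2}\,\mathrm{Median}(P,C).$$

Proof: We observe that the furthest point in a cell in $\mathcal{L}_{near}(i)$ can have a distance of at most $\frac{\sqrt d}{2^i} + \frac{4}{\epsilon}\cdot\frac{\sqrt d}{2^i} = \left(1+\frac{4}{\epsilon}\right)\cdot\frac{\sqrt d}{2^i}$ to the nearest center. Hence, every cell in $\mathcal{L}_{near}(i)$ is contained in a cube of side length $2\cdot\left(1+\frac{4}{\epsilon}\right)\cdot\frac{\sqrt d}{2^i}$ that is centered at one of the $k$ centers of $C$. Each of these cubes has volume $\left(2\cdot\left(1+\frac{4}{\epsilon}\right)\cdot\frac{\sqrt d}{2^i}\right)^d \le \left(\frac{10}{\epsilon}\cdot\frac{\sqrt d}{2^i}\right)^d$. Every cell in grid $G_i$ has volume $\frac{1}{2^{id}}$. Hence, there can be at most $k\cdot\left(\frac{10}{\epsilon}\right)^d\cdot d^{d/2}$ cells in $\mathcal{L}_{near}(i)$.

Each of the considered cells is light and so it contains at most $\delta\cdot 2^i$ points. Hence for our choice of $\delta$:

$$\begin{aligned}
\sum_i\sum_{p\in\mathcal{L}_{near}(i)}\frac{\sqrt d}{2^{i-1}}
&= \sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\;\sum_{p\in\mathcal{L}_{near}(i)}\frac{\sqrt d}{2^{i-1}} \\
&\le \sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\left(k\cdot\Big(\frac{10}{\epsilon}\Big)^d\cdot d^{d/2}\right)\cdot\delta\cdot 2^i\cdot\left(\frac{\sqrt d}{2^{i-1}}\right) \\
&\le \sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\delta\cdot\left(2\cdot k\cdot\Big(\frac{10}{\epsilon}\Big)^d\cdot d^{d/2}\right)\cdot\sqrt d \\
&\le \frac{\epsilon}{2}\cdot\frac{1}{1+\log n}\sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\mathrm{Opt} \\
&\le \frac{\epsilon}{2}\cdot\frac{\mathrm{Median}(P,C)}{1+\log n}\cdot\big|\{i : \mathcal{L}_{near}(i)\neq\emptyset\}\big|.
\end{aligned}$$

Now we observe that $\mathcal{L}_{near}(i)\neq\emptyset$ implies that there are non-empty light cells in grid $G_i$ and heavy cells in grid $G_{i-1}$. We can have non-empty light cells only if $\delta\cdot 2^i > 1$, since otherwise all non-empty cells are heavy. We can have heavy cells in grid $G_{i-1}$ only if $\delta\cdot 2^{i-1}\le n$, since otherwise all cells are light. Hence we sum up only over those values of $i$ that satisfy $1/2 < \delta\cdot 2^{i-1}\le n$. Clearly, these are at most $1+\log n$ distinct values and so we get

$$\frac{\epsilon}{2}\cdot\frac{\mathrm{Median}(P,C)}{1+\log n}\cdot\big|\{i : \mathcal{L}_{near}(i)\neq\emptyset\}\big| \le \frac{\epsilon}{2}\cdot\mathrm{Median}(P,C),$$

which concludes the proof of Claim 5.2.4. □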

Lemma 5.2.5 The set $P_{core}$ is an $\epsilon$-coreset for $\delta \le \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{4\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{(d+1)/2}}$.

Proof: We first observe that every point of $P$ is contained in some cell in $\bigcup_i\mathcal{L}(i)$. By Claim 5.2.2 we know that every point that is contained in a grid cell in $\mathcal{L}(i)$ is moved a distance of at most $\sqrt d/2^{i-1}$. Therefore, Claims 5.2.3 and 5.2.4 imply that the points are moved an overall distance of at most $\epsilon\cdot\mathrm{Median}(P,C)$. Finally, we observe that the cost of any set of $k$ centers changes by at most $\pm D$ when the points of the point set $P$ are moved by an overall distance of $D$. Hence the set $P_{core}$ constructed by our algorithm is an $\epsilon$-coreset for $k$-median. □
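The last step of this proof, that moving the points by an overall distance $D$ changes the $k$-median cost of any fixed center set by at most $D$, is the triangle inequality applied point by point. A minimal numeric sketch in Python follows; the helper names `median_cost` and `total_movement` are illustrative and not from the text.

```python
import math


def median_cost(points, centers):
    """k-median cost: sum over points of the distance to the nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def total_movement(points, moved):
    """Overall distance D by which the points were moved (same ordering)."""
    return sum(math.dist(p, q) for p, q in zip(points, moved))


def cost_change_within_movement(points, moved, centers, tol=1e-12):
    """Check |cost(P, C) - cost(P', C)| <= D for one concrete instance."""
    change = abs(median_cost(points, centers) - median_cost(moved, centers))
    return change <= total_movement(points, moved) + tol
```

The check holds for every choice of centers, which is why a bound on the total movement translates directly into the coreset guarantee.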

5.2.2 Size of the Coreset

Our next step is to give an upper bound on the size of the coreset. For every grid $G_i$ we define $\mathcal{M}(i)$ to be the set of heavy cells that do not have a heavy subcell. Notice that the cells $\mathcal{M}(i)$ are exactly those cells that contain a coreset point. Hence, it will be sufficient to determine the cardinality of this set.

Definition 5.2.6 (Center cells) Let $C_{opt}$ denote an optimal set of $k$ centers for the $k$-median problem. A cell $\mathcal{C}$ in grid $G_i$ is called a center cell, if its distance to the nearest center in $C_{opt}$ is less than $\frac{\sqrt d}{2^{i+1}}$, i.e. if $\min_{q\in C_{opt}} d(q,\mathcal{C}) < \frac{\sqrt d}{2^{i+1}}$.

We define $\mathcal{M}_{center}(i) = \{\mathcal{C}\in\mathcal{M}(i) \mid \mathcal{C} \text{ is a center cell}\}$ to be the subset of $\mathcal{M}(i)$ that are center cells. We use $\mathcal{M}_{external}(i) = \mathcal{M}(i)\setminus\mathcal{M}_{center}(i)$ to denote the remaining cells of $\mathcal{M}(i)$. We call these cells external cells.

Claim 5.2.7 Every external cell contributes at least $\delta\cdot\sqrt d/2$ to the cost Opt of $C_{opt}$.

Proof: Every external cell $\mathcal{C}\in\mathcal{M}(i)$ is a heavy cell and so it contains at least $\delta\cdot 2^i$ points. Each point contributes at least $\frac{\sqrt d}{2^{i+1}}$ to the cost of the optimal solution. Hence the overall contribution of the points in $\mathcal{C}$ is at least $\delta\cdot\sqrt d/2$. □

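Lemma 5.2.8 If $\delta \ge \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{8\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{(d+1)/2}}$, the size of the coreset is at most $\frac{17\cdot k\cdot 10^d\cdot(2+\log n)\cdot d^{d/2}}{\epsilon^{d+1}} = O(k\cdot\log n/\epsilon^{d+1})$.

Proof: From Claim 5.2.7 it follows that

$$\Big|\bigcup_i\mathcal{M}_{external}(i)\Big| \le \frac{2\cdot\mathrm{Opt}}{\delta\cdot\sqrt d}.$$

All cells from $\mathcal{M}_{center}(i)$ are contained in $k$ cubes of side length $2\left(\frac{\sqrt d}{2^i}+\frac{\sqrt d}{2^{i+1}}\right) = 3\cdot\frac{\sqrt d}{2^i}$. Since each cube volume is $\left(3\cdot\frac{\sqrt d}{2^i}\right)^d$ and each cell volume is $\left(\frac{\sqrt d}{2^i}\right)^d / d^{d/2}$, we know that

$$|\mathcal{M}_{center}(i)| \le k\cdot 3^d\cdot d^{d/2}.$$

Using similar arguments as in the proof of Claim 5.2.4 we obtain that $\mathcal{M}(i)\neq\emptyset$ for at most $1+\log n$ distinct values of $i$. Therefore, we get

$$\Big|\bigcup_i\mathcal{M}_{center}(i)\Big| \le (\log n+1)\cdot k\cdot 3^d\cdot d^{d/2}.$$

Since $\bigcup_i\mathcal{M}(i) = \bigcup_i\mathcal{M}_{external}(i)\cup\mathcal{M}_{center}(i)$, the size of the coreset is at most $\frac{2\cdot\mathrm{Opt}}{\delta\cdot\sqrt d} + (\log n+2)\cdot k\cdot 3^d\cdot d^{d/2}$.

For $\delta \ge \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{8\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{(d+1)/2}}$, the size of the coreset is at most $\frac{17\cdot k\cdot 10^d\cdot(2+\log n)\cdot d^{d/2}}{\epsilon^{d+1}}$. □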

5.2.3 Finding a suitable value of δ

The coreset construction so far depended on a value of $\delta$, which itself depended on the unknown value of Opt. We show how to find a suitable value of $\delta$.

A good value for $\delta$ can be found using the statements of Lemmas 5.2.5 and 5.2.8. Let

$$\delta_0 := \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{4k\cdot 10^d(1+\log n)\,d^{(d+1)/2}}.$$

• All coresets constructed with a value $\delta\le\delta_0$ are $\epsilon$-coresets according to Lemma 5.2.5.

• All coresets constructed with a value $\delta\ge\delta_0/2$ have a size of at most $S := \frac{17\cdot k\cdot 10^d\cdot(2+\log n)\cdot d^{d/2}}{\epsilon^{d+1}}$ according to Lemma 5.2.8.

For each value of $j\in\{0,1,\dots,\lfloor\log(n\cdot\widetilde{\Delta}\cdot\sqrt d)\rfloor\}$ let

$$\delta(j) = \frac{\epsilon^{d+1}}{4k\cdot 10^d(1+\log n)\,d^{(d+1)/2}\cdot\widetilde{\Delta}}\cdot 2^j.$$

Note that the coreset constructed for the highest value of $\delta(j)$ is of size at most $S$ because this value of $\delta$ is greater than $\delta_0\cdot n\cdot\sqrt d/(2\cdot\mathrm{Opt})$, which is at least $\delta_0/2$, since all points lie in a unit cube.

We want to identify a value $j_0\in\{0,1,\dots,\lfloor\log(n\cdot\widetilde{\Delta}\cdot\sqrt d)\rfloor\}$ such that

• the coreset constructed with the value $\delta=\delta(j_0)$ is of size at most $S$, and

• the coreset constructed with the value $\delta=\delta(j_0-1)$ is of size greater than $S$.

If the coreset constructed with the value $\delta=\delta(0)$ is of size at most $S$, we set $j_0=0$. The value $j_0$ can easily be obtained by performing a binary search on the values of $j$.
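The binary search can be sketched as follows. This is a minimal illustration, not the thesis's implementation: `size_at_most_S` is a hypothetical predicate that builds the coreset for $\delta(j)$ and reports whether its size is at most $S$, and `j_max` stands for $\lfloor\log(n\cdot\widetilde{\Delta}\cdot\sqrt d)\rfloor$. The sketch only relies on the predicate being true at `j_max`, which the text argues above; it maintains a false lower end and a true upper end, so it does not need the predicate to be monotone.

```python
def find_j0(size_at_most_S, j_max):
    """Return j0 with size_at_most_S(j0) true and, if j0 > 0, size_at_most_S(j0 - 1) false.

    Assumes size_at_most_S(j_max) is true.
    """
    if size_at_most_S(0):
        return 0
    lo, hi = 0, j_max  # invariant: predicate is false at lo and true at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if size_at_most_S(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

Each iteration halves the interval, so the predicate is evaluated $O(\log j_{max})$ times, which matches the $O(\log\log(\widetilde{\Delta}\cdot n))$ iterations used in the running time analysis.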


Lemma 5.2.9 The coreset constructed with $\delta=\delta(j_0)$ is an $\epsilon$-coreset for $P$. The size of the coreset is at most $S$.

Proof: By the choice of $j_0$ the size of the computed coreset is at most $S$.

We show that the constructed coreset is an $\epsilon$-coreset. If $j_0=0$ then the coreset is an $\epsilon$-coreset because $\delta(j_0) = \delta_0/(\mathrm{Opt}\cdot\widetilde{\Delta}) \le \delta_0$.

If $j_0\neq 0$ then the coreset constructed for the value $\delta=\delta(j_0-1)=\delta(j_0)/2$ is of size greater than $S$. Therefore we must have $\delta(j_0)/2 < \delta_0/2$. It follows that $\delta(j_0)<\delta_0$ and the computed coreset is an $\epsilon$-coreset. □

In each iteration of the binary search we have to decide if the size of the coreset is at most $S$. This can be done by constructing the coreset and stopping the process if the number of disjoint heavy cells exceeds $S$. The marking process leading to the actual coreset can be seen as a quadtree traversal. Each inner node of this quadtree corresponds to a heavy cell. If on one grid the number of heavy cells exceeds $S$, we stop the process. Since we have at most $Z = O(\log(n\cdot\widetilde{\Delta}/\epsilon))$ grids, the tree traversal can be done in time $O(S\cdot\log(\widetilde{\Delta}\cdot n/\epsilon)) = O(k\cdot\log n\cdot\log(\widetilde{\Delta}\cdot n/\epsilon)/\epsilon^{d+1})$.

Since we have $O(\log\log(\widetilde{\Delta}\cdot n))$ iterations of the binary search, the coreset can be constructed in time $O(k\cdot\log n\cdot\log(\widetilde{\Delta}\cdot n/\epsilon)\cdot\log\log(\widetilde{\Delta}\cdot n)/\epsilon^{d+1})$.

Note that given a point set $P$ we can first build a quadtree of depth $Z$ of the points in time $O(n\cdot Z) = O(n\cdot\log(n\cdot\widetilde{\Delta}/\epsilon))$. Using this quadtree we can then answer queries on the number of points in certain grid cells in constant time.
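A counting oracle of this kind can be sketched in a few lines of Python. The sketch below swaps the quadtree for one hash table per grid level, which is an assumption made for brevity: lookups then take expected rather than worst-case constant time. The name `build_count_oracle` and the cell indexing (integer coordinates of the cell within grid $G_i$) are illustrative.

```python
from collections import Counter


def build_count_oracle(points, depth):
    """Preprocess points in [0,1]^d; return count(level, cell) for levels 0..depth."""
    tables = [Counter() for _ in range(depth + 1)]
    for p in points:
        for level in range(depth + 1):
            side = 2 ** level
            # integer index of the cell of side 2**-level that contains p
            cell = tuple(min(int(x * side), side - 1) for x in p)
            tables[level][cell] += 1

    def count(level, cell):
        # Counter returns 0 for cells that contain no point
        return tables[level][cell]

    return count
```

Preprocessing touches every point once per level, i.e. $O(n\cdot Z)$ table updates for depth $Z$, in line with the construction time stated above.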

Theorem 11 Assume that we are given a point set $P\subset[0,1]^d$ of size $n\in\mathbb{N}$ and the guarantee that the value Opt of an optimal $k$-median solution is at least $1/\widetilde{\Delta}$. For each value of $i\in\{0,1,\dots,Z\}$ with $Z = \left\lceil\log\frac{4\cdot k\cdot 10^d\cdot n(1+\log n)\cdot d^{(d+1)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+1}}\right\rceil+1$ we have a square grid $G_i$ over the point space, all cells of side length $1/2^i$.

Given an oracle, which answers queries on the number of points in grid cells in constant time, we can compute in time $O(k\cdot\log n\cdot\log(\widetilde{\Delta}\cdot n/\epsilon)\cdot\log\log(\widetilde{\Delta}\cdot n)/\epsilon^{d+1})$ an $\epsilon$-coreset of size $O(k\cdot\log n/\epsilon^{d+1})$ for $k$-median.

A respective oracle can be constructed in time $O(n\cdot\log(n\cdot\widetilde{\Delta}/\epsilon))$.

5.3 Coresets for k-Means

We show how to modify our coreset construction for the k-median problem in a way such that it

works for the k-means problem.

5.3.1 Construction of the Coreset

We again use nested square grids $G_0,\dots,G_Z$ for $Z = \left\lceil\log\frac{8\cdot k\cdot 33^{d+1}\cdot n(1+\log n)\cdot d^{(d+2)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+2}}\right\rceil+1$. The side length of the cells in grid $G_i$ is $1/2^i$. We use the following modified definition for heavy cells.

Definition 5.3.1 (Heavy Cells) We call a grid cell of grid $G_i$ heavy for the $k$-means clustering problem, if it contains at least $\delta\cdot 4^i$ points.

We use the same invariant as in the construction of the coreset for the $k$-median problem, i.e. every point stays in the smallest heavy cell it is contained in. Similarly to the construction for $k$-median, we denote by $\mathcal{L}(i)$ the set of non-empty light cells in grid $G_i$ whose parent cell is heavy. We get the following modified version of Claim 5.2.2 with an analogous proof.

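Claim 5.3.2 Any point $p\in\mathcal{L}(i)$ is moved a distance of at most $\sqrt d/2^{i-1}$ during our coreset construction. □

From now on let $C$ be an arbitrary fixed set of $k$ centers. We partition the sets $\mathcal{L}(i)$ into two subsets $\mathcal{L}_{near}(i)$ and $\mathcal{L}_{dist}(i)$. $\mathcal{L}_{near}(i)$ contains all cells $\mathcal{C}$ whose distance $\min_{q\in C} d(q,\mathcal{C})$ to the nearest center from $C$ is at most $\frac{16}{\epsilon}\cdot\frac{\sqrt d}{2^i}$, i.e.

$$\mathcal{L}_{near}(i) = \Big\{\mathcal{C}\in\mathcal{L}(i) \;\Big|\; \min_{q\in C} d(q,\mathcal{C}) \le \frac{16}{\epsilon}\cdot\frac{\sqrt d}{2^i}\Big\}.$$

$\mathcal{L}_{dist}(i)$ contains all other cells from $\mathcal{L}(i)$, i.e.

$$\mathcal{L}_{dist}(i) = \Big\{\mathcal{C}\in\mathcal{L}(i) \;\Big|\; \min_{q\in C} d(q,\mathcal{C}) > \frac{16}{\epsilon}\cdot\frac{\sqrt d}{2^i}\Big\}.$$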

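Claim 5.3.3 For any point $p\in P$ let $\hat p$ denote its corresponding coreset point. Then we have

$$\Big|\sum_{p\in\mathcal{L}_{dist}(i)} d(p,C)^2 - \sum_{p\in\mathcal{L}_{dist}(i)} d(\hat p,C)^2\Big| \le \frac{\epsilon}{2}\,\mathrm{Means}(P,C).$$

Proof: We use a charging argument similar to one from [61]. Let $p$ be an arbitrary point in $\mathcal{L}_{dist}(i)$ and $\hat p$ be the corresponding coreset point. By Claim 5.3.2 we know that $d(p,\hat p)\le\frac{\sqrt d}{2^{i-1}}$. We get

$$d(\hat p,C)^2 \le \Big(d(p,C)+\frac{\sqrt d}{2^{i-1}}\Big)^2 = d(p,C)^2 + 2\,d(p,C)\cdot\frac{\sqrt d}{2^{i-1}} + \frac{d}{4^{i-1}}$$

and since $d(p,C) > \frac{\sqrt d}{2^{i-1}}$:

$$d(\hat p,C)^2 \ge \Big(d(p,C)-\frac{\sqrt d}{2^{i-1}}\Big)^2 \ge d(p,C)^2 - 2\,d(p,C)\cdot\frac{\sqrt d}{2^{i-1}} - \frac{d}{4^{i-1}}.$$

We also know that $d(p,C)\ge\frac{16}{\epsilon}\cdot\frac{\sqrt d}{2^i}$ and so

$$\big|d(p,C)^2 - d(\hat p,C)^2\big| \le 2\,d(p,C)\cdot\frac{\sqrt d}{2^{i-1}} + \frac{d}{4^{i-1}} \le \frac{\epsilon}{4}\,d(p,C)^2 + \frac{\epsilon^2}{8^2}\,d(p,C)^2 \le \frac{\epsilon}{2}\,d(p,C)^2.$$

We get

$$\Big|\sum_{p\in\mathcal{L}_{dist}(i)} d(p,C)^2 - \sum_{p\in\mathcal{L}_{dist}(i)} d(\hat p,C)^2\Big| \le \sum_{p\in\mathcal{L}_{dist}(i)}\big|d(p,C)^2 - d(\hat p,C)^2\big| \le \sum_{p\in\mathcal{L}_{dist}(i)}\frac{\epsilon}{2}\,d(p,C)^2 \le \frac{\epsilon}{2}\,\mathrm{Means}(P,C). \qquad\Box$$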

Our next step is to give an upper bound on the change of the contribution of points in near

cells.

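Claim 5.3.4 For $\delta \le \frac{\epsilon^{d+2}\cdot\mathrm{Opt}}{8(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{(d+2)/2}}$ we get

$$\Big|\sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\;\sum_{p\in\mathcal{L}_{near}(i)}\big(d(p,C)^2 - d(\hat p,C)^2\big)\Big| \le \frac{\epsilon}{2}\cdot\mathrm{Means}(P,C).$$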

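Proof: We use a similar volume argument as in the proof of Claim 5.2.4. Any cell in $\mathcal{L}_{near}(i)$ must be contained in one of $k$ cubes of volume

$$\Big[2\cdot\frac{16\sqrt d}{\epsilon\cdot 2^i}+\frac{\sqrt d}{2^i}\Big]^d = \Big[\Big(\frac{32}{\epsilon}+1\Big)\frac{\sqrt d}{2^i}\Big]^d \le \Big(\frac{33}{\epsilon}\Big)^d\cdot\Big(\frac{\sqrt d}{2^i}\Big)^d.$$

Each cell has a volume of $\Big(\frac{\sqrt d}{\sqrt d\cdot 2^i}\Big)^d$. Therefore, there can be at most $k\cdot\big(\frac{33}{\epsilon}\big)^d\cdot d^{d/2}$ such cells.

Let $p$ be an arbitrary point in $\mathcal{L}_{near}(i)$ and $\hat p$ be the corresponding coreset point. We will show $d(\hat p,C)^2 - d(p,C)^2 \le \big(3+\frac{16}{\epsilon}\big)\cdot\frac{d}{4^{i-1}}$ and $d(\hat p,C)^2 - d(p,C)^2 \ge -\big(3+\frac{16}{\epsilon}\big)\cdot\frac{d}{4^{i-1}}$.

We observe that $d(p,C)\le\big(1+\frac{8}{\epsilon}\big)\cdot\frac{\sqrt d}{2^{i-1}}$. We get

$$\begin{aligned}
d(\hat p,C)^2 &\le \Big(d(p,C)+\frac{\sqrt d}{2^{i-1}}\Big)^2 \\
&\le d(p,C)^2 + 2\,d(p,C)\cdot\frac{\sqrt d}{2^{i-1}} + \frac{d}{4^{i-1}} \\
&\le d(p,C)^2 + 2\Big(1+\frac{8}{\epsilon}\Big)\cdot\frac{\sqrt d}{2^{i-1}}\cdot\frac{\sqrt d}{2^{i-1}} + \frac{d}{4^{i-1}} \\
&\le d(p,C)^2 + \Big(3+\frac{16}{\epsilon}\Big)\cdot\frac{d}{4^{i-1}}.
\end{aligned}$$

Case A: $d(p,C)\ge\frac{\sqrt d}{2^{i-1}}$. Then:

$$\begin{aligned}
d(\hat p,C)^2 &\ge \Big(d(p,C)-\frac{\sqrt d}{2^{i-1}}\Big)^2 \\
&\ge d(p,C)^2 - 2\,d(p,C)\cdot\frac{\sqrt d}{2^{i-1}} - \frac{d}{4^{i-1}} \\
&\ge d(p,C)^2 - 2\Big(1+\frac{8}{\epsilon}\Big)\cdot\frac{\sqrt d}{2^{i-1}}\cdot\frac{\sqrt d}{2^{i-1}} - \frac{d}{4^{i-1}} \\
&\ge d(p,C)^2 - \Big(3+\frac{16}{\epsilon}\Big)\cdot\frac{d}{4^{i-1}}.
\end{aligned}$$

Case B: $d(p,C)<\frac{\sqrt d}{2^{i-1}}$. Then

$$d(\hat p,C)^2 \ge 0 \ge d(p,C)^2 - \frac{d}{4^{i-1}} \ge d(p,C)^2 - \Big(3+\frac{16}{\epsilon}\Big)\cdot\frac{d}{4^{i-1}}.$$

Altogether we obtain $\big|d(\hat p,C)^2 - d(p,C)^2\big| \le \big(3+\frac{16}{\epsilon}\big)\cdot\frac{d}{4^{i-1}}$.

Each of the considered cells is light and so it contains at most $\delta\cdot 4^i$ points. Hence for our choice of $\delta$:

$$\begin{aligned}
\Big|\sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\;\sum_{p\in\mathcal{L}_{near}(i)}\big(d(p,C)^2 - d(\hat p,C)^2\big)\Big|
&\le \sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\;\sum_{p\in\mathcal{L}_{near}(i)}\big|d(p,C)^2 - d(\hat p,C)^2\big| \\
&\le \sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\underbrace{\delta\cdot 4^i}_{\text{number of points per cell}}\cdot\underbrace{k\cdot\Big(\frac{33}{\epsilon}\Big)^d\cdot d^{d/2}}_{\text{number of cells}}\cdot\Big(3+\frac{16}{\epsilon}\Big)\cdot\frac{d}{4^{i-1}} \\
&\le \sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\delta\cdot 4\cdot k\cdot\Big(\frac{33}{\epsilon}\Big)^{d+1}\cdot d^{d/2}\cdot d \\
&\le \frac{\epsilon}{2}\cdot\frac{1}{1+\log n}\cdot\sum_{i:\,\mathcal{L}_{near}(i)\neq\emptyset}\mathrm{Opt} \\
&\le \frac{\epsilon}{2}\cdot\mathrm{Means}(P,C),
\end{aligned}$$

where the last inequality follows from a similar argument as in the proof of Claim 5.2.4. □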

It follows immediately that

Lemma 5.3.5 The set $P_{core}$ is an $\epsilon$-coreset for $\delta \le \frac{\epsilon^{d+2}\cdot\mathrm{Opt}}{8(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{(d+2)/2}}$. □


5.3.2 Size of the Coreset

We adapt the proof for the $k$-median problem to $k$-means. In the following we give the slightly changed definitions we require. For every grid $G_i$ we define $\mathcal{M}(i)$ to be the set of heavy cells that do not have a heavy subcell.

Definition 5.3.6 (Center cells) Let $C_{opt}$ denote an optimal set of $k$ centers for the $k$-means problem. A cell $\mathcal{C}$ in grid $G_i$ is called a center cell, if its distance to the nearest center in $C_{opt}$ is less than $\frac{\sqrt d}{2^{i+1}}$, i.e. if $\min_{q\in C_{opt}} d(q,\mathcal{C}) < \frac{\sqrt d}{2^{i+1}}$.

We define $\mathcal{M}_{center}(i) = \{\mathcal{C}\in\mathcal{M}(i) \mid \mathcal{C} \text{ is a center cell}\}$ to be the subset of $\mathcal{M}(i)$ that are center cells. We use $\mathcal{M}_{external}(i) = \mathcal{M}(i)\setminus\mathcal{M}_{center}(i)$ to denote the remaining cells of $\mathcal{M}(i)$. We call these cells external cells.

Claim 5.3.7 Every external cell contributes at least $\delta\cdot d/4$ to the cost Opt of $C_{opt}$.

Proof: Every external cell $\mathcal{C}$ is a heavy cell and so it contains at least $\delta\cdot 4^i$ points. Each point contributes at least $\big(\frac{\sqrt d}{2^{i+1}}\big)^2$ to the cost of the optimal solution. Hence the overall contribution of the points in $\mathcal{C}$ is at least $\delta\cdot d/4$. □

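Lemma 5.3.8 If $\delta \ge \frac{\epsilon^{d+2}\cdot\mathrm{Opt}}{16(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{(d+2)/2}}$, the size of the coreset is at most $\frac{65(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{d/2}}{\epsilon^{d+2}} = O(k\log n/\epsilon^{d+2})$.

Proof: From Claim 5.3.7 it follows that

$$\Big|\bigcup_i\mathcal{M}_{external}(i)\Big| \le \frac{4\cdot\mathrm{Opt}}{\delta\cdot d}.$$

All cells from $\mathcal{M}_{center}(i)$ are contained in $k$ cubes of side length $2\left(\frac{\sqrt d}{2^i}+\frac{\sqrt d}{2^{i+1}}\right) = 3\cdot\frac{\sqrt d}{2^i}$. Since each cube volume is $\left(3\cdot\frac{\sqrt d}{2^i}\right)^d$ and each cell volume is $\left(\frac{\sqrt d}{2^i}\right)^d / d^{d/2}$, we know that

$$|\mathcal{M}_{center}(i)| \le k\cdot 3^d\cdot d^{d/2}.$$

Using similar arguments as in the proof of Claim 5.2.4 we obtain that $\mathcal{M}(i)\neq\emptyset$ for at most $1+\log n$ distinct values of $i$. Therefore, we get

$$\Big|\bigcup_i\mathcal{M}_{center}(i)\Big| \le (\log n+1)\cdot k\cdot 3^d\cdot d^{d/2}.$$

Since $\bigcup_i\mathcal{M}(i) = \bigcup_i\mathcal{M}_{external}(i)\cup\mathcal{M}_{center}(i)$, the size of the coreset is at most $\frac{4\cdot\mathrm{Opt}}{\delta\cdot d} + (\log n+1)\cdot k\cdot 3^d\cdot d^{d/2}$. For $\delta \ge \frac{\epsilon^{d+2}\cdot\mathrm{Opt}}{16(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{(d+2)/2}}$, the size of the coreset is at most $\frac{65(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{d/2}}{\epsilon^{d+2}}$. □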


5.3.3 Finding a suitable value of δ

To find a suitable value of $\delta$ we use the method of Section 5.2.3. We plug in the values $\delta_0 := \frac{\epsilon^{d+2}\cdot\mathrm{Opt}}{8(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{(d+2)/2}}$ and $S := \frac{65(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{d/2}}{\epsilon^{d+2}}$, and set for $j\in\{0,1,\dots,\lfloor\log(n\cdot\widetilde{\Delta}\cdot\sqrt d)\rfloor\}$:

$$\delta = \delta(j) = \frac{\epsilon^{d+2}}{8(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{(d+2)/2}\cdot\widetilde{\Delta}}\cdot 2^j.$$

Doing a binary search on the values of $j$ we again find a value $j_0$ such that the coreset constructed with the value $\delta=\delta(j_0)$ is of size at most $S$ and the coreset constructed with the value $\delta=\delta(j_0-1)$ is of size greater than $S$ (if there is no such value we set $j_0=0$). Lemma 5.2.9 and its proof can be stated in the same way, and the coreset constructed for this value of $\delta$ is then an $\epsilon$-coreset for $k$-means.

Theorem 12 Assume that we are given a point set $P\subset[0,1]^d$ of size $n\in\mathbb{N}$ and the guarantee that the value Opt of an optimal $k$-means solution is at least $1/\widetilde{\Delta}$. For each value of $i\in\{0,1,\dots,Z\}$ with $Z = \left\lceil\log\frac{8\cdot k\cdot 33^{d+1}\cdot n(1+\log n)\cdot d^{(d+2)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+2}}\right\rceil+1$ we have a square grid $G_i$ over the point space, all cells of side length $1/2^i$.

Given an oracle, which answers queries on the number of points in grid cells in constant time, we can compute in time $O(k\cdot\log n\cdot\log(\widetilde{\Delta}\cdot n/\epsilon)\cdot\log\log(\widetilde{\Delta}\cdot n)/\epsilon^{d+2})$ an $\epsilon$-coreset of size $O(k\cdot\log n/\epsilon^{d+2})$ for $k$-means.

A respective oracle can be constructed in time $O(n\cdot\log(n\cdot\widetilde{\Delta}/\epsilon))$.

5.4 Coresets for Oblivious Optimization Problems

Let us assume that $\Pi$ is $\ell$-Lipschitz and $\lambda$-mean preserving. We show that under these conditions our $k$-median algorithm run on instance $P$ constructs a weighted point set $P_{core}$ that is an $\epsilon$-coreset for $\Pi$. We modify our proofs of the $k$-median coreset.

5.4.1 Construction of the Coreset

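We also use $Z$ nested grids $G_i$ with side length $1/2^i$ for $Z = \left\lceil\log\frac{n(1+\log n)\cdot d^{(d+1)/2}\cdot(10\ell)^{d+1}\cdot\widetilde{\Delta}}{\epsilon^{d+1}\cdot\lambda^d}\right\rceil+1$. We use the same definition of heavy cells.

Definition 5.4.1 (Heavy Cells) We call a cell of grid $G_i$ heavy, if it contains at least $\delta\cdot 2^i$ points of $P$. A grid cell that is not heavy is light.

We denote by $\mathcal{L}(i)$ the set of light cells of grid $G_i$ whose parent cell is heavy. Notice that each point is contained in exactly one cell of $\bigcup_i\mathcal{L}(i)$. Claim 5.2.2 still holds.

We partition the sets $\mathcal{L}(i)$ into two subsets $\mathcal{L}_{near}(i)$ and $\mathcal{L}_{dist}(i)$. $\mathcal{L}_{near}(i)$ contains all cells $\mathcal{C}$ whose distance $d(\mu,\mathcal{C})$ to the center of gravity $\mu$ is at most $\frac{4\cdot\ell}{\epsilon\cdot\lambda}\cdot\frac{\sqrt d}{2^i}$, i.e.

$$\mathcal{L}_{near}(i) = \Big\{\mathcal{C}\in\mathcal{L}(i) \;\Big|\; d(\mu,\mathcal{C}) \le \frac{4\cdot\ell}{\epsilon\cdot\lambda}\cdot\frac{\sqrt d}{2^i}\Big\}.$$

$\mathcal{L}_{dist}(i)$ contains all other cells from $\mathcal{L}(i)$, i.e.

$$\mathcal{L}_{dist}(i) = \Big\{\mathcal{C}\in\mathcal{L}(i) \;\Big|\; d(\mu,\mathcal{C}) > \frac{4\cdot\ell}{\epsilon\cdot\lambda}\cdot\frac{\sqrt d}{2^i}\Big\}.$$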

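Claim 5.4.2

$$\sum_i\sum_{p\in\mathcal{L}_{dist}(i)}\frac{\sqrt d}{2^{i-1}} \le \frac{\epsilon}{2\cdot\ell}\cdot\mathrm{Opt}.$$

Proof: Any point in $\mathcal{L}_{dist}(i)$ has a distance of more than $\frac{4\cdot\ell}{\epsilon\cdot\lambda}\cdot\frac{\sqrt d}{2^i}$ from the center of gravity $\mu$. Therefore, we get

$$\sum_i\sum_{p\in\mathcal{L}_{dist}(i)}\frac{\sqrt d}{2^{i-1}} = \frac{\epsilon\cdot\lambda}{2\cdot\ell}\sum_i\sum_{p\in\mathcal{L}_{dist}(i)}\frac{4\cdot\ell}{\epsilon\cdot\lambda}\cdot\frac{\sqrt d}{2^i} \le \frac{\epsilon\cdot\lambda}{2\cdot\ell}\sum_i\sum_{p\in\mathcal{L}_{dist}(i)} d(p,\mu) \le \frac{\epsilon}{2\cdot\ell}\cdot\mathrm{Opt},$$

where the last inequality holds because $\Pi$ is $\lambda$-mean preserving. □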

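Claim 5.4.3 For $\delta \le \frac{\epsilon^{d+1}\cdot\lambda^d\cdot\mathrm{Opt}}{(1+\log n)\cdot d^{(d+1)/2}\cdot(10\cdot\ell)^{d+1}}$ we get

$$\sum_{i:\,\mathcal{L}(i)\neq\emptyset}\;\sum_{p\in\mathcal{L}_{near}(i)}\frac{\sqrt d}{2^{i-1}} \le \frac{\epsilon}{2\cdot\ell}\,\mathrm{Opt}.$$

Proof: We observe that the furthest point in a cell in $\mathcal{L}_{near}(i)$ can have a distance of at most $\big(1+\frac{4\ell}{\epsilon\lambda}\big)\cdot\frac{\sqrt d}{2^i}$ to the center of gravity. Hence, every cell in $\mathcal{L}_{near}(i)$ is contained in a cube of side length $2\cdot\big(1+\frac{4\ell}{\epsilon\lambda}\big)\frac{\sqrt d}{2^i}$. The cube has volume $\Big(2\cdot\big(1+\frac{4\ell}{\epsilon\lambda}\big)\frac{\sqrt d}{2^i}\Big)^d \le \Big(\frac{10\ell\cdot\sqrt d}{\epsilon\lambda\cdot 2^i}\Big)^d$. Since every cell in grid $G_i$ has volume $\frac{1}{2^{id}}$, there can be at most $\big(\frac{10\ell}{\epsilon\lambda}\big)^d\cdot d^{d/2}$ cells in $\mathcal{L}_{near}(i)$.

Each cell in $\mathcal{L}_{near}(i)$ is light and so it contains at most $\delta\cdot 2^i$ points. Hence,

$$\sum_{i:\,\mathcal{L}(i)\neq\emptyset}\;\sum_{p\in\mathcal{L}_{near}(i)}\frac{\sqrt d}{2^{i-1}} \le \sum_{i:\,\mathcal{L}(i)\neq\emptyset}\delta\cdot 2^i\cdot\Big(\frac{10\cdot\ell}{\epsilon\lambda}\Big)^d\cdot d^{d/2}\cdot\frac{\sqrt d}{2^{i-1}} \le \frac{\epsilon}{2\cdot\ell}\cdot\mathrm{Opt}$$

for our choice of $\delta$ and using the same arguments as in Claim 5.2.4. □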


Lemma 5.4.4 The set $P_{core}$ is an $\epsilon$-coreset for $\delta \le \frac{\epsilon^{d+1}\cdot\lambda^d\cdot\mathrm{Opt}}{(1+\log n)\cdot d^{(d+1)/2}\cdot(10\cdot\ell)^{d+1}}$.

Proof: We first observe that every point of $P$ is contained in exactly one cell in $\bigcup_i\mathcal{L}(i)$. By Claim 5.2.2 we know that every point that is contained in a grid cell in $\mathcal{L}(i)$ is moved a distance of at most $\frac{\sqrt d}{2^{i-1}}$. Therefore, Claims 5.4.2 and 5.4.3 imply that the points are moved an overall distance of at most $\frac{\epsilon}{\ell}\,\mathrm{Opt}$. Since the optimization problem $\Pi$ is $\ell$-Lipschitz we know that the cost of any solution changes by at most $\pm\epsilon\cdot\mathrm{Opt}$ under this movement. Hence the set $P_{core}$ constructed by our algorithm is a coreset. □

5.4.2 Size of the Coreset

Our next step is to give an upper bound on the size of the coreset. For every grid $G_i$ we define $\mathcal{M}(i)$ to be the set of heavy cells that do not contain a heavy subcell. Notice that the cells $\mathcal{M}(i)$ are exactly those cells that contain a coreset point. Hence, it will be sufficient to determine the cardinality of this set.

Definition 5.4.5 (Center Cells) Let $\mu$ denote the center of gravity of $P$. A cell $\mathcal{C}$ in grid $G_i$ is called a center cell, if its distance to $\mu$ is at most $\frac{\sqrt d}{2^{i+1}}$, i.e. if $d(\mu,\mathcal{C})\le\frac{\sqrt d}{2^{i+1}}$.

We define $\mathcal{M}_{center}(i) = \{\mathcal{C}\in\mathcal{M}(i) \mid \mathcal{C} \text{ is a center cell}\}$ to be the subset of $\mathcal{M}(i)$ that are center cells. We use $\mathcal{M}_{external}(i) = \mathcal{M}(i)\setminus\mathcal{M}_{center}(i)$ to denote the remaining cells of $\mathcal{M}(i)$. We call these cells external cells. We use $G = \sum_{p\in P} d(p,\mu)$ to denote the sum of distances to the center of gravity.

Claim 5.4.6 Every external cell contributes at least $\delta\cdot\sqrt d/2$ to $G$.

Proof: Every external cell $\mathcal{C}$ is a heavy cell and so it contains at least $\delta\cdot 2^i$ points. Each point has a distance of at least $\frac{\sqrt d}{2^{i+1}}$ to the center of gravity. Hence the overall contribution of the points in $\mathcal{C}$ is at least $\delta\cdot\sqrt d/2$. □

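Lemma 5.4.7 If $\delta \ge \frac{\epsilon^{d+1}\cdot\lambda^d\cdot\mathrm{Opt}}{2(1+\log n)\cdot d^{(d+1)/2}\cdot(10\cdot\ell)^{d+1}}$, the size of the coreset is at most

$$\frac{4\cdot(1+\log n)\cdot d^{d/2}\cdot(10\ell)^{d+1}}{\epsilon^{d+1}\cdot\lambda^{d+1}} = O(\log n/\epsilon^{d+1}).$$

Proof: From Claim 5.4.6 and $G = \sum_{p\in P} d(p,\mu) \le \frac{1}{\lambda}\cdot\mathrm{Opt}$ it follows that

$$\Big|\bigcup_i\mathcal{M}_{external}(i)\Big| \le \frac{2\cdot\mathrm{Opt}}{\delta\cdot\lambda\cdot\sqrt d}.$$

All cells from $\mathcal{M}_{center}(i)$ are contained in a cube of side length $2\left(\frac{\sqrt d}{2^i}+\frac{\sqrt d}{2^{i+1}}\right) = 3\cdot\frac{\sqrt d}{2^i}$. Since the cube volume is $\left(3\cdot\frac{\sqrt d}{2^i}\right)^d$ and each cell volume is $\frac{1}{2^{id}}$, we know that

$$|\mathcal{M}_{center}(i)| \le 3^d\cdot d^{d/2}.$$

Using similar arguments as in the proof of Claim 5.2.4 we obtain that $\mathcal{M}(i)\neq\emptyset$ for at most $1+\log n$ distinct values of $i$. Therefore, we get

$$\Big|\bigcup_i\mathcal{M}_{center}(i)\Big| \le (\log n+1)\cdot 3^d\cdot d^{d/2}.$$

Since $\bigcup_i\mathcal{M}(i) = \bigcup_i\mathcal{M}_{external}(i)\cup\mathcal{M}_{center}(i)$, the size of the coreset is at most $\frac{2\cdot\mathrm{Opt}}{\delta\cdot\lambda\cdot\sqrt d} + (\log n+2)\cdot 3^d\cdot d^{d/2}$. If $\delta \ge \frac{\epsilon^{d+1}\cdot\lambda^d\cdot\mathrm{Opt}}{2(1+\log n)\cdot d^{(d+1)/2}\cdot(10\cdot\ell)^{d+1}}$, the size of the coreset is at most $\frac{4\cdot(1+\log n)\cdot d^{d/2}\cdot(10\ell)^{d+1}}{\epsilon^{d+1}\cdot\lambda^{d+1}}$. □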

5.4.3 Finding a suitable value of δ

To find a suitable value of $\delta$ we use the method of Section 5.2.3. We plug in the values $\delta_0 := \frac{\epsilon^{d+1}\cdot\lambda^d\cdot\mathrm{Opt}}{(1+\log n)\cdot d^{(d+1)/2}\cdot(10\cdot\ell)^{d+1}}$ and $S := \frac{4\cdot(1+\log n)\cdot d^{d/2}\cdot(10\ell)^{d+1}}{\epsilon^{d+1}\cdot\lambda^{d+1}}$, and set for $j\in\{0,1,\dots,\lfloor\log(n\cdot\widetilde{\Delta}\cdot\sqrt d\cdot\ell)\rfloor\}$:

$$\delta = \delta(j) = \frac{\epsilon^{d+1}\cdot\lambda^d}{(1+\log n)\cdot d^{(d+1)/2}\cdot(10\cdot\ell)^{d+1}\cdot\widetilde{\Delta}}\cdot 2^j.$$

Doing a binary search on the values of $j$ we again find a value $j_0$ such that the coreset constructed with the value $\delta=\delta(j_0)$ is of size at most $S$ and the coreset constructed with the value $\delta=\delta(j_0-1)$ is of size greater than $S$ (if there is no such value we set $j_0=0$). Lemma 5.2.9 and its proof can be stated in the same way, and the coreset constructed for this value of $\delta$ is then an $\epsilon$-coreset for the respective problem.

Theorem 13 Assume that we are given a point set $P\subset[0,1]^d$ of size $n\in\mathbb{N}$ and the guarantee that the value Opt of an optimal solution of the oblivious optimization problem is at least $1/\widetilde{\Delta}$. For each value of $i\in\{0,1,\dots,Z\}$ with $Z = \left\lceil\log\frac{n(1+\log n)\cdot d^{(d+1)/2}\cdot(10\ell)^{d+1}\cdot\widetilde{\Delta}}{\epsilon^{d+1}\cdot\lambda^d}\right\rceil+1$ we have a square grid $G_i$ over the point space, all cells of side length $1/2^i$.

Given an oracle, which answers queries on the number of points in grid cells in constant time, we can compute in time $O(\log n\cdot\log(\widetilde{\Delta}\cdot n/\epsilon)\cdot\log\log(\widetilde{\Delta}\cdot n)/\epsilon^{d+1})$ an $\epsilon$-coreset of size $O(\log n/\epsilon^{d+1})$ for the respective problem.

A respective oracle can be constructed in time $O(n\cdot\log(n\cdot\widetilde{\Delta}/\epsilon))$.

Corollary 5.4.8 Given a set $P$ of $n$ points in $\mathbb{R}^d$, a coreset for an oblivious optimization problem can be constructed in time $O(n\cdot\log(n/\epsilon) + \log^2(n/\epsilon)\cdot\log\log n/\epsilon^{d+1})$.

Proof: We first scale the points such that all points lie in $[0,1]^d$ and that there are two points $p,q\in P$ and a dimension $i\in\{1,\dots,d\}$ such that $|p^{(i)}-q^{(i)}|=1$. Since the problem is $\lambda$-mean preserving, we can conclude from the triangle inequality that $\mathrm{Opt}\ge\lambda/2$. Therefore we can set $\widetilde{\Delta} := 2/\lambda$ and have $\mathrm{Opt}\ge 1/\widetilde{\Delta}$. We apply our coreset technique on the scaled point set. After constructing the coreset we rescale the points. □

5.5 Constructing Solutions on the Coreset

In this section we briefly show how to construct solutions to the various problems after computing the coreset.

5.5.1 k-Median

We follow an approach of Har-Peled and Mazumdar [61].

Lemma 5.5.1 [61] Given a weighted set $P_{core}$ of $|P_{core}|$ points in $\mathbb{R}^d$ with total weight $n$, one can compute a set $D$ of size $O(k^2\epsilon^{-2d}\log^2 n)$ such that at least one subset $C\subset D$ of size $k$ is a $(1\pm\epsilon)$-approximate solution to the $k$-median problem on $P_{core}$. The running time of this algorithm is $O(|P_{core}|\cdot\log^2 n + k^5\log^9 n + k^2\epsilon^{-2d}\log^2 n)$.

As in [61] we can use this candidate set and an algorithm from [87] to compute a solution:

Lemma 5.5.2 [87] Given a weighted point set $P_{core}$ of $|P_{core}|$ points in $\mathbb{R}^d$ with total weight $n$, a set $D$ of size at most $|P_{core}|$ such that at least one subset $C\subset D$ of size $k$ is a $(1\pm\epsilon)$-approximate solution to the $k$-median problem on $P_{core}$, and a parameter $\delta>0$, one can compute a $(1\pm\epsilon)$-approximate $k$-median clustering of $P_{core}$ using only centers from $D$. The overall running time is $O(\rho\cdot|P_{core}|\cdot(\log k)(\log n)\log(1/\delta))$, where $\rho = \exp\big[O\big((1+\log 1/\epsilon)/\epsilon\big)^{d-1}\big]$. The algorithm succeeds with probability $\ge 1-\delta$.

Theorem 14 On the coreset $P_{core}$ we can compute a $(1\pm\epsilon)$-approximate $k$-median solution in time

$$O\big(\rho k^2\cdot\log k\cdot\log^3 n\cdot\log(1/\delta) + k\cdot\log^3 n\cdot\epsilon^{-d-1} + k^5\log^9 n + k^2\epsilon^{-2d}\log^2 n\big)$$

where $\rho = \exp\big[O\big((1+\log 1/\epsilon)/\epsilon\big)^{d-1}\big]$.

5.5.2 k-Means

We again follow an approach of Har-Peled and Mazumdar [61] to compute a $(1\pm\epsilon)$-approximate $k$-means clustering on $P_{core}$. They use the following lemma:

Lemma 5.5.3 [99] Given a weighted set $P_{core}$ of $|P_{core}|$ points in $\mathbb{R}^d$ with total weight $n$, one can compute a set $D$ of size $O(k^2\epsilon^{-2d}\log n\cdot\log(\frac{1}{\epsilon}))$ such that at least one subset $C\subset D$ of size $k$ is a $(1\pm\epsilon)$-approximate solution to the $k$-means problem on $P_{core}$. The running time of this

).

As in [61] we can use this candidate set $D$ to compute a solution. We simply enumerate all $k$-tuples in $D$, and compute the $k$-means clustering value of each candidate center set. This takes $O(|D|^k\cdot k\cdot|P_{core}|)$ time. The best tuple provides the required approximation.
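The enumeration step is simple enough to sketch directly. The Python below is a minimal illustration under stated assumptions: the coreset is given as a dictionary mapping points to weights, `candidates` plays the role of $D$, and the function name `best_k_means_centers` is hypothetical. It enumerates $k$-element subsets rather than ordered tuples, which suffices because the cost does not depend on the order of the centers.

```python
import math
from itertools import combinations


def best_k_means_centers(core, candidates, k):
    """Brute-force search over all k-subsets of the candidate set.

    core: dict mapping coreset points (tuples) to integer weights.
    Returns the k-subset of `candidates` with the smallest weighted k-means cost.
    """
    def cost(centers):
        # each coreset point pays weight * squared distance to its nearest center
        return sum(w * min(math.dist(p, c) for c in centers) ** 2
                   for p, w in core.items())

    return min(combinations(candidates, k), key=cost)
```

Evaluating one candidate set costs $O(k\cdot|P_{core}|)$ distance computations, so the loop reflects the $O(|D|^k\cdot k\cdot|P_{core}|)$ bound stated above.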

Theorem 15 On the coreset $P_{core}$ we can compute a $(1\pm\epsilon)$-approximate $k$-means solution in time $\widetilde{O}(k^{2k+2}\cdot\epsilon^{-2kd-d-2}\cdot\log^{k+1} n)$.


5.5.3 MaxCut

In this section we describe how to compute a MaxCut on the computed coreset. We adapt an

algorithm from [39] for unweighted metric MaxCut, which reduces metric MaxCut to MaxCut

in big dense weighted graphs. In general we could replace each weighted point by a number of

unweighted points and run [39] on the unweighted instance (having nnodes). Such a technique

has also been used in [72].

To avoid building the graph of size nwe construct a new reduction from MaxCut on coresets

to MaxCut in small dense graphs. We will prove that the reduction can be done in space and time

polylogarithmic in n.

We assume that our coreset construction always constructs the coreset points exactly in the

middle of a heavy cell. Then we can use a property of our coreset, that each point p∈Pcore has

either a big weight or is far away from the next coreset point. Using this property the techniques

of this section follow the ideas of [39].

For every coreset point $p\in P_{core}$ let $w(p)$ denote the weight of point $p$. We also assume that $n = \sum_{p\in P_{core}} w(p)$. For each partition $(L,R)$ of $P_{core}$ we write

$$\mathrm{Cut}(P_{core},L,R) = \sum_{p\in L,\,q\in R} w(p)\cdot w(q)\cdot d(p,q)$$

and for each complete graph $G=(V,E)$ with weight function $\omega: E\to\mathbb{R}^+$ and each partition $(L,R)$ of $V$ we write

$$\mathrm{Cut}(G,L,R) = \sum_{p\in L,\,q\in R}\omega(p,q).$$

Recall that Opt denotes the value of an optimum cut $(L_{Opt},R_{Opt})$ of the input point set $P$, scaled by $1/n$:

$$\mathrm{Opt} = \max_{(L,R)\text{ partition of }P}\;\frac{1}{n}\cdot\sum_{p\in L,\,q\in R} d(p,q).$$

We scale the distances such that the weighted average distance between the coreset points is 1, i.e.

$$\sum_{p,q\in P_{core}} w(p)\cdot w(q)\cdot d(p,q) = n^2.$$

In the following we will always assume that this equality holds.

Lemma 5.5.4 $\frac{n}{4}\le\mathrm{Opt}\le 2\cdot n$

Proof: Since $P_{core}$ is an $\epsilon$-coreset for $P$, we have:

$$\mathrm{Opt} = \max_{(L,R)\text{ partition of }P}\;\frac{1}{n}\cdot\sum_{p\in L,\,q\in R} d(p,q) \le \frac{1}{1-\epsilon}\cdot\max_{(L,R)\text{ partition of }P_{core}}\;\frac{1}{n}\cdot\sum_{p\in L,\,q\in R} w(p)\cdot w(q)\cdot d(p,q) \le 2\cdot n.$$

To show that $\mathrm{Opt}\ge n/4$ we consider a random cut $(A,B)$, where each point $p\in P_{core}$ is independently put into $A$ with probability $1/2$ and otherwise put into $B$. The probability of each edge $(p,q)$ to be in the cut is then $1/2$. By linearity of expectation the expected value of the cut is $E[\mathrm{Cut}(P_{core},A,B)] = 1/2\cdot\sum_{p,q\in P_{core}} w(p)\cdot w(q)\cdot d(p,q) = n^2/2$. Therefore the maximum cut on $P_{core}$ must have a value of at least $n^2/2$. Since Opt is the value of a maximum cut on $P$ scaled by $1/n$ and since $P_{core}$ is an $\epsilon$-coreset for $P$, we conclude

$$\mathrm{Opt}\ge(1-\epsilon)\cdot n/2\ge n/4. \qquad\Box$$

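We will use the coreset $P_{core}$ of an instance started with a value of $\delta$ fulfilling

$$\frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{80\cdot(1+\log n)\cdot d^{d/2}\cdot 40^d} \le \delta \le \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{10\cdot(1+\log n)\cdot d^{d/2}\cdot 40^d}.$$

We can easily find such a value of $\delta$ because of Lemma 5.5.4. We define

$$\beta := \frac{\epsilon^{d+1}}{80\cdot(1+\log n)\cdot d^{d/2}\cdot 40^d}.$$

Then we have

$$\beta\le\frac{\delta}{\mathrm{Opt}}.$$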

Lemma 5.5.5 For each point $p\in P_{core}$ let $d(p) := \min\{d(p,q) \mid q\in P_{core},\ q\neq p\}$ be the distance to the nearest other coreset point. Then we have:

$$d(p)\cdot w(p)\ge\frac{\beta\cdot n}{8}.$$

Proof: Our algorithm constructs the coreset points beginning with the finest grid $G_{Z-1}$. In a grid $G_i$ a coreset point $p$ is introduced in the middle of a cell $\mathcal{C}$ if and only if $\mathcal{C}$ is heavy and has no heavy subcell (only in that case we would already have a coreset point within $\mathcal{C}$). Since the grid $G_i$ has side length $\frac{1}{2^i}$ we conclude $d(p)\ge\frac{1}{2\cdot 2^i}$. Since the cell $\mathcal{C}$ is heavy and all points in the cell are mapped to $p$, we have $w(p)\ge\delta\cdot 2^i$ and therefore $d(p)\cdot w(p)\ge\frac{\delta}{2}\ge\frac{\beta\cdot\mathrm{Opt}}{2}\ge\frac{\beta\cdot n}{8}$. After introducing the coreset point $p$ no other coreset point can be introduced within distance $\frac{1}{2\cdot 2^i}$ of $p$, because the coreset construction goes on with larger cells and the distance to the border of the cell containing $p$ is then again at least $\frac{1}{2\cdot 2^i}$. Our method introduces no new coreset points in cells already containing a coreset point. Since the weight of a coreset point can only increase during the coreset construction, the lemma follows. □

Definition 5.5.6 (Distance Weight) [39]: The distance weight $\omega_p$ of a coreset point $p \in P_{core}$ is defined as $\omega_p := w(p)\cdot\sum_{q\in P_{core}} w(q)\,d(p,q)$.


Definition 5.5.7 (Graph of Clones) We define a weighted complete graph $W = (X, E)$ where $X$ (called the set of clones) is a multiset of points. Each point $p \in P_{core}$ is cloned to create $\left\lfloor \frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2} \right\rfloor$ identical points of $X$, and the edge between $p' \in X$, a clone of $p \in P_{core}$, and $q' \in X$, a clone of $q \in P_{core}$, has weight

$$e_{p'q'} := \frac{\epsilon^2\beta^2 n^4\cdot w(p)\cdot w(q)}{64\cdot\omega_p\cdot\omega_q}\cdot d(p,q).$$

Edges between clones $p', q'$ of the same point have weight $e_{p'q'} := 0$.
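The definition can be made concrete with a short sketch. It assumes Euclidean distances, takes $n$ to be the sum of the coreset weights, and rescales the distances so that the sum of $w(p)\cdot w(q)\cdot d(p,q)$ over ordered pairs equals $n^2$. The function name `graph_of_clones` and the sample values of `eps` and `beta` are hypothetical, chosen only for illustration.

```python
import math


def graph_of_clones(points, weights, eps, beta):
    """Return clone counts, distance weights and an edge-weight function."""
    m = len(points)
    n = sum(weights)  # assumption: the coreset weights sum to n
    raw = [[math.dist(points[i], points[j]) for j in range(m)] for i in range(m)]
    total = sum(weights[i] * weights[j] * raw[i][j]
                for i in range(m) for j in range(m))
    scale = n * n / total  # enforce sum w(p) w(q) d(p,q) = n^2
    d = [[scale * raw[i][j] for j in range(m)] for i in range(m)]

    # Distance weight omega_p = w(p) * sum_q w(q) d(p,q)
    omega = [weights[i] * sum(weights[j] * d[i][j] for j in range(m))
             for i in range(m)]
    # Number of clones of p: floor(8 omega_p / (eps beta n^2))
    clones = [math.floor(8 * omega[i] / (eps * beta * n * n)) for i in range(m)]

    def edge_weight(i, j):
        """Weight of an edge between a clone of point i and a clone of point j."""
        if i == j:
            return 0.0
        return (eps ** 2 * beta ** 2 * n ** 4 * weights[i] * weights[j]
                / (64 * omega[i] * omega[j]) * d[i][j])

    return clones, omega, edge_weight


# Illustrative data (assumed, not taken from the text).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
weights = [3, 1, 2, 2]
eps, beta = 0.5, 0.1
clones, omega, edge_weight = graph_of_clones(points, weights, eps, beta)
n = sum(weights)
```

With this scaling the distance weights sum to $n^2$, so the total number of clones cannot exceed $8/(\epsilon\beta)$ regardless of the input.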

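Lemma 5.5.8 We have for each $p \in P_{core}$:

$$\left\lfloor \frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2} \right\rfloor \ge (1-\epsilon)\cdot\frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2}.$$

Proof: Let $p \in P_{core}$ be an arbitrary coreset point. Then Lemma 5.5.5 shows:

$$\omega_p = \sum_{q\in P_{core}} w(p)\cdot w(q)\cdot d(p,q) \ge \sum_{q\in P_{core}} w(q)\cdot\frac{\beta\cdot n}{8} = \frac{\beta\cdot n^2}{8}.$$

Therefore $\frac{8\,\omega_p}{\epsilon\cdot\beta\cdot n^2} \ge \frac{1}{\epsilon}$. Since $\lfloor x \rfloor \ge x - 1 \ge (1-\epsilon)\cdot x$ for every $x \ge 1/\epsilon$, the lemma follows. □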

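We now show that the constructed auxiliary graph $W = (X, E)$ is small.

Lemma 5.5.9

$$(1-\epsilon)\cdot\frac{8}{\epsilon\cdot\beta} \le |X| \le \frac{8}{\epsilon\cdot\beta}$$

Proof: We have

$$|X| = \sum_{p\in P_{core}} \left\lfloor \frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2} \right\rfloor \ge \sum_{p\in P_{core}} (1-\epsilon)\cdot\frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2} = (1-\epsilon)\cdot\sum_{p,q\in P_{core}} \frac{8}{\epsilon\cdot\beta\cdot n^2}\cdot w(p)\cdot w(q)\cdot d(p,q) = (1-\epsilon)\cdot\frac{8}{\epsilon\cdot\beta}$$

where the last equality comes from the scaling of the point distances, and

$$|X| = \sum_{p\in P_{core}} \left\lfloor \frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2} \right\rfloor \le \sum_{p\in P_{core}} \frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2} = \frac{8}{\epsilon\cdot\beta}.$$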


Lemma 5.5.10 Each cut $(L, R)$ of $P_{core}$ corresponds to a cut $(L', R')$ of $W$ having the value

$$\mathrm{Cut}(W, L', R') \ge (1-\epsilon)^2\cdot\mathrm{Cut}(P_{core}, L, R).$$

Each cut $(L', R')$ of $W$ can easily be extended to a cut $(L, R)$ of $P_{core}$ having the value

$$\mathrm{Cut}(P_{core}, L, R) \ge \mathrm{Cut}(W, L', R').$$

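Proof: Consider a cut $(L, R)$ of $P_{core}$, and let $(L', R')$ be the induced cut of $W$ where all clone vertices of a point $p \in L$ belong to $L'$ and all clone vertices of a point $p \in R$ belong to $R'$. We have:

$$\mathrm{Cut}(W, L', R') = \sum_{p'\in L',\,q'\in R'} e_{p'q'} = \sum_{p\in L,\,q\in R} \left\lfloor \frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2} \right\rfloor\cdot\left\lfloor \frac{8\cdot\omega_q}{\epsilon\cdot\beta\cdot n^2} \right\rfloor\cdot\frac{\epsilon^2\beta^2 n^4\cdot w(p)\cdot w(q)\cdot d(p,q)}{64\cdot\omega_p\cdot\omega_q}$$

$$\ge \sum_{p\in L,\,q\in R} (1-\epsilon)^2\cdot w(p)\cdot w(q)\cdot d(p,q) = (1-\epsilon)^2\cdot\mathrm{Cut}(P_{core}, L, R),$$

where the inequality follows from Lemma 5.5.8.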

On the other hand let $(L', R')$ be an arbitrary cut of $W$. We will first alter the cut $(L', R')$ in a way such that the value of the cut does not decrease and all clone vertices of a point $p \in P_{core}$ belong to the same partition.

Consider a point $p \in P_{core}$ and let $v$ be one of its clone vertices. We compute the values

$$C_L := \sum_{w\in L'} e_{v,w} \quad\text{and}\quad C_R := \sum_{w\in R'} e_{v,w}.$$

Notice that the value of the cut increases by $C_L - C_R$ if we move one clone vertex of $p$ from $L'$ to $R'$. If we move one clone vertex of $p$ from $R'$ to $L'$ the value of the cut decreases by $C_L - C_R$. If $C_L \ge C_R$ we put all clone vertices of $p$ into $R'$, not decreasing the cut. If $C_L < C_R$ we put all clone vertices of $p$ into $L'$, not decreasing the cut. We do this iteratively for all vertices $p \in P_{core}$. After that for each vertex $p \in P_{core}$ all clone vertices belong to the same partition.

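We then construct a cut $(L, R)$ of $P_{core}$ in the following way: We put a point $p \in P_{core}$ into partition $L$ if its clone vertices belong to partition $L'$. Otherwise we put $p$ into $R$. The value of the cut is then:

$$\mathrm{Cut}(P_{core}, L, R) = \sum_{p\in L,\,q\in R} w(p)\cdot w(q)\cdot d(p,q) \ge \sum_{p\in L,\,q\in R} \left\lfloor \frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2} \right\rfloor\cdot\left\lfloor \frac{8\cdot\omega_q}{\epsilon\cdot\beta\cdot n^2} \right\rfloor\cdot\frac{\epsilon^2\beta^2 n^4\cdot w(p)\cdot w(q)\cdot d(p,q)}{64\cdot\omega_p\cdot\omega_q}$$

$$= \sum_{p'\in L',\,q'\in R'} e_{p'q'} = \mathrm{Cut}(W, L', R').$$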
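The regrouping step of this proof (moving all clones of a point to the better side) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis implementation: clones are identified by the index of their original point in a list `owner`, sides are encoded as 0 for $L'$ and 1 for $R'$, and `edge_weight(p, q)` is a hypothetical callback that returns 0 for `p == q`.

```python
def cut_value(owner, side, edge_weight):
    """Total weight of clone edges crossing the cut."""
    total = 0.0
    for u in range(len(owner)):
        for v in range(u + 1, len(owner)):
            if side[u] != side[v]:
                total += edge_weight(owner[u], owner[v])
    return total


def consolidate(owner, side, edge_weight):
    """Move all clones of each original point to one side without losing cut value."""
    side = list(side)
    for p in sorted(set(owner)):
        # C_L and C_R as seen from any clone of p (clones of p contribute 0).
        c_left = sum(edge_weight(p, owner[u])
                     for u in range(len(owner)) if side[u] == 0)
        c_right = sum(edge_weight(p, owner[u])
                      for u in range(len(owner)) if side[u] == 1)
        target = 1 if c_left >= c_right else 0  # R' if C_L >= C_R, else L'
        for u in range(len(owner)):
            if owner[u] == p:
                side[u] = target
    return side


# Illustrative edge weights between three original points (assumed values).
PAIR_WEIGHT = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 0.5}


def example_edge_weight(p, q):
    return 0.0 if p == q else PAIR_WEIGHT[(min(p, q), max(p, q))]


owner = [0, 0, 1, 1, 2]      # clone -> original point
before = [0, 1, 0, 1, 0]     # an arbitrary cut that splits clones
after = consolidate(owner, before, example_edge_weight)
```

Each clone of $p$ contributes $C_L$ when placed in $R'$ and $C_R$ when placed in $L'$, so placing all clones on the side with the larger contribution never lowers the cut value.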


Lemma 5.5.10 shows that each $(1\pm\epsilon)$-approximate solution of MaxCut on $W$ can be extrapolated to a $(1\pm\epsilon)^3$-approximate solution of MaxCut on $P_{core}$. It remains to show how to compute an approximate MaxCut on $W$. We will use an algorithm of [51] to compute such an approximate solution. The algorithm works for so-called dense graphs. We can show that $W$ is dense by showing that the maximum weight of an edge is at most a constant factor larger than the average edge weight.

Lemma 5.5.11 $\max_{p',q'\in X}(e_{p'q'}) \le 16\cdot\mathrm{avg}_{p',q'\in X}(e_{p'q'})$

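Proof: The average value of an edge $e \in E$ is:

$$\frac{\sum_{p,q\in P_{core}} \left\lfloor \frac{8\cdot\omega_p}{\epsilon\cdot\beta\cdot n^2} \right\rfloor\cdot\left\lfloor \frac{8\cdot\omega_q}{\epsilon\cdot\beta\cdot n^2} \right\rfloor\cdot\frac{\epsilon^2\beta^2 n^4\cdot w(p)\cdot w(q)\cdot d(p,q)}{64\cdot\omega_p\cdot\omega_q}}{|X|^2} \ge \frac{\sum_{p,q\in P_{core}} (1-\epsilon)^2\cdot w(p)\cdot w(q)\cdot d(p,q)}{|X|^2} = \frac{(1-\epsilon)^2 n^2}{|X|^2} \ge \frac{(1-\epsilon)^2 n^2\cdot\epsilon^2\cdot\beta^2}{64}$$

where the last inequality comes from Lemma 5.5.9.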

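To show an upper bound for all edge weights we use that for each point $p \in P_{core}$:

$$\frac{\omega_p}{w(p)} = \sum_{q\in P_{core}} w(q)\cdot d(p,q) = \frac{1}{n}\left(\sum_{q,z\in P_{core}} w(q)\cdot w(z)\cdot d(p,q)\right)$$

$$= \frac{1}{2n}\left(\left(\sum_{q,z\in P_{core}} w(q)\cdot w(z)\cdot d(p,q)\right) + \left(\sum_{q,z\in P_{core}} w(q)\cdot w(z)\cdot d(p,z)\right)\right)$$

$$\ge \frac{1}{2n}\cdot\sum_{q,z\in P_{core}} w(q)\cdot w(z)\cdot d(q,z) = \frac{n}{2}.$$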

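Now take an arbitrary edge between clone vertices of $p$ and $q$. Using the triangle inequality and the fact that for arbitrary $a, b \ge 1$ we have $\frac{a+b}{a\cdot b} \le \frac{2}{\min\{a,b\}}$, we can bound the weight of the edge as follows:

$$e_{p'q'} = d(p,q)\cdot\frac{\epsilon^2\beta^2 n^4\cdot w(p)\cdot w(q)}{64\cdot\omega_p\cdot\omega_q} \le \frac{1}{n}\sum_{z\in P_{core}} w(z)\cdot\big(d(p,z) + d(z,q)\big)\cdot\frac{\epsilon^2\beta^2 n^4\cdot w(p)\cdot w(q)}{64\cdot\omega_p\cdot\omega_q}$$

$$= \frac{\epsilon^2\beta^2 n^3}{64}\cdot\left(\frac{\omega_p}{w(p)} + \frac{\omega_q}{w(q)}\right)\cdot\frac{w(p)\cdot w(q)}{\omega_p\cdot\omega_q} \le \frac{\epsilon^2\beta^2 n^3}{64}\cdot\frac{2}{\min\left\{\frac{\omega_p}{w(p)}, \frac{\omega_q}{w(q)}\right\}} \le \frac{\epsilon^2\beta^2 n^3}{64}\cdot\frac{4}{n} = \frac{\epsilon^2\beta^2 n^2}{16}.$$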


We conclude

$$\max_{p',q'\in X}(e_{p'q'}) \le \frac{1}{(1-\epsilon)^2}\cdot 4\cdot\mathrm{avg}_{p',q'\in X}(e_{p'q'}) \le 16\cdot\mathrm{avg}_{p',q'\in X}(e_{p'q'}).$$

Since $W$ is a dense graph we can apply the algorithm of [51] to find a $(1\pm\epsilon)$-approximate MaxCut $(L', R')$ for $W$ in $O\big(|X|^2\cdot 2^{(1/\epsilon)^{O(1)}}\big) = O\big(\log^2 n\cdot 2^{(1/\epsilon)^{O(1)}}\big)$ time. We extrapolate this cut to a $(1\pm\epsilon)^3 = (1\pm O(\epsilon))$-approximate MaxCut $(L, R)$ on $P_{core}$ and get the following result:

Theorem 16 A $(1+\epsilon)$-approximate MaxCut $(L, R)$ on $P_{core}$ can be found in time $O\big(\log^2 n\cdot 2^{(1/\epsilon)^{O(1)}}\big)$.

Remark 5.5.12 We can even compute in the same time an implicit MaxCut $(A, B)$ of the input point set $P$ from the coreset:

After computing a good cut $(L, R)$ of $P_{core}$ we can provide a partition of the whole space $[0,1]^d$ into two parts $\mathcal{L}$ and $\mathcal{R}$. During our coreset construction we store the information about the mappings of points to coreset points. When the points of a cell $C$ are mapped to a coreset point $p \in L$ during the coreset construction, we assign the whole cell $C$ to $\mathcal{L}$. When the points of a cell $C$ are mapped to a coreset point $p \in R$ during the coreset construction, we assign the whole cell $C$ to $\mathcal{R}$. After the construction of the partition $(\mathcal{L}, \mathcal{R})$ of the space we know that the partition $(A, B)$ of $P$ with $A := P \cap \mathcal{L}$ and $B := P \cap \mathcal{R}$ is a $(1+\epsilon)$-approximate MaxCut of $P$.

The computation of $(\mathcal{L}, \mathcal{R})$ from $(L, R)$ can be done in $\mathrm{poly}(\log n, 1/\epsilon)$ time and space.

Together with the results of Corollary 5.4.8 we obtain the fastest method published so far (in terms of $n$) to find a $(1\pm\epsilon)$-approximation of the Euclidean MaxCut of a point set. Previous methods had runtime $O\big(n\cdot\log n\cdot(2^{(1/\epsilon)^{O(1)}} + \log n)\big)$ [72] and $O\big(n^2\cdot 2^{O(1/\epsilon^2)}\big)$ [37].

Theorem 17 Given a set $P$ of $n$ points in $\mathbb{R}^d$, a $(1\pm\epsilon)$-approximate solution for the Euclidean MaxCut of the points can be found in time $O\big(n\cdot\log(n/\epsilon) + \log^2 n\cdot\log\log n\cdot 2^{(1/\epsilon)^{O(1)}}\big)$.

5.5.4 MaxMatching

We are not aware of a method to compute an approximate MaxMatching solution on the weighted coreset points directly without expanding the coreset.

One could replace each point $p \in P_{core}$ having weight $w(p)$ by $w(p)$ unweighted points and run the algorithm of Gabow [45], which finds an exact best solution in $O(n^3)$ time.
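The expansion step mentioned above is simple to sketch. The helper name `expand_coreset` is hypothetical, integral weights are assumed, and the matching algorithm itself is not shown here.

```python
def expand_coreset(points, weights):
    """Replace each coreset point p of weight w(p) by w(p) unweighted copies."""
    expanded = []
    for point, weight in zip(points, weights):
        expanded.extend([point] * weight)
    return expanded


# Illustrative data (assumed, not taken from the text).
coreset_points = [(0.25, 0.25), (0.75, 0.5)]
coreset_weights = [3, 2]
expanded = expand_coreset(coreset_points, coreset_weights)
```

The expanded multiset has as many points as the total weight of the coreset, which is why the running time is stated in terms of $n$ rather than the coreset size.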

5.5.5 MaxTSP

We are again not aware of a method to find a $(1\pm\epsilon)$-approximation of the MaxTSP tour even on unweighted points. We could obtain a solution on the coreset $P_{core}$ by expanding the coreset to a point set of size $n$ and running an exhaustive search. The running time is $O(n!)$.


5.5.6 AverageDistance

The weighted average distance $\frac{1}{n^2}\sum_{p,q\in P_{core}} w(p)\cdot w(q)\cdot d(p,q)$ on the coreset $P_{core}$ can be easily computed in time $O(|P_{core}|^2) = O(\log^2 n\cdot\epsilon^{-d-2})$.
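A direct quadratic-time evaluation of this formula might look as follows. This is a minimal sketch assuming Euclidean distances and taking $n$ as the sum of the coreset weights; the function name is hypothetical.

```python
import math


def weighted_average_distance(points, weights):
    """(1/n^2) * sum over ordered pairs (p, q) of w(p) * w(q) * d(p, q)."""
    n = sum(weights)  # assumption: coreset weights sum to n
    total = 0.0
    for p, wp in zip(points, weights):
        for q, wq in zip(points, weights):
            total += wp * wq * math.dist(p, q)
    return total / (n * n)


# Two points at distance 2 with weights 3 and 1: (2 * 3 * 1 * 2) / 4^2 = 0.75
example_value = weighted_average_distance([(0.0, 0.0), (0.0, 2.0)], [3, 1])
```

The double loop touches every ordered pair once, matching the $O(|P_{core}|^2)$ bound stated in the text.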

5.6 Coresets via Sampling

In the last sections we introduced a coreset construction technique, which is suitable to reduce the

complexity of a huge point set. We showed that the huge point set can be replaced by a weighted

point set of logarithmic size, which still holds all information needed to compute approximate

solutions for various clustering problems. However, to construct the coreset we need access to

the whole point set, which is not always given in real world applications dealing with huge point

sets.

In this chapter we will alter the construction such that it depends only on point samples. We

will show that the information about all point samples can itself be stored in polylogarithmic

memory. This will help us to develop data streaming algorithms in Chapter 6. The point sample

technique will also help us to maintain MaxCut clusterings of points when points are moving

along linear trajectories. See Chapter 7 for details.

We still assume that all points lie in $[0,1]^d$ and that $\mathrm{Opt} \ge 1/\widetilde{\Delta}$.

5.6.1 k-Median

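We again consider $Z$ grids $G_0, \dots, G_{Z-1}$ for $Z = \left\lceil \log\left(\frac{4\cdot k\cdot 10^d\cdot n(1+\log n)\cdot d^{(d+1)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+1}}\right) \right\rceil + 1$, grid $G_i$ having cell side length $\frac{1}{2^i}$. In each grid $G_i$ we pick a random sample $S_i$ of points. To select our random sample we take every point with probability

$$p_i := \min\left\{\frac{\alpha}{\delta\cdot 2^i},\ 1\right\}$$

into our sample $S_i$, where

$$\alpha = 6\cdot\epsilon^{-2}\ln(2\cdot Z\cdot 2^{Z\cdot d}/\rho)$$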

and $\rho$ is the desired error probability of our algorithm. The sampling is done at least $\alpha$-wise independently, which means that for each set $A \subset P$ of at most $\alpha$ points and each partition $\{B, C\}$ of $A$:

$$\Pr[B \subset S_i \wedge C \cap S_i = \emptyset] = (p_i)^{|B|}\cdot(1-p_i)^{|C|}.$$

Essentially this means that for each subset $A \subset P$ of size $\alpha$ the sampling is done independently.

We will show that it follows from a variant of Chernoff bounds [113] that we can approximate the number of points in every heavy cell up to a multiplicative error of $(1\pm\epsilon)$ just using our point samples. The approximations will furthermore be good enough to detect the heavy cells and to construct an $\epsilon$-coreset in the same way as described before.


Definition 5.6.1 (Considered as Heavy) For each cell $C$ in grid $G_i$ we define $n_C$ as the number of points in $C$. We define our estimation on the number of points as

$$\widetilde{n}_C := |S_i \cap C|\cdot\frac{1}{p_i}.$$

A cell $C$ in grid $G_i$ is considered as heavy, if

$$\widetilde{n}_C \ge (1-\epsilon)\cdot\delta\cdot 2^i.$$

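Lemma 5.6.2 The following events hold with probability at least $1-\rho/2$ for all grids $G_i$ and each grid cell in $G_i$:

• If $i \le \log\left(\frac{2}{\epsilon\delta}\right)$, then $\widetilde{n}_C = n_C$.

• If $C$ contains at least $\delta\cdot 2^{i-1}$ points, then $(1-\epsilon)\cdot n_C \le \widetilde{n}_C \le (1+\epsilon)\cdot n_C$.

• If $C$ contains less than $\delta\cdot 2^{i-1}$ points, then $\widetilde{n}_C < (1-\epsilon)\cdot\delta\cdot 2^i$ (and the cell $C$ is not considered as heavy).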

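Proof: If $i \le \log\left(\frac{2}{\epsilon\delta}\right)$ then $p_i = 1$, the sample set equals the point set, and $\widetilde{n}_C = n_C$.

Let $C$ be an arbitrary grid cell in $G_i$. To prove the last two statements for the single cell $C$ we use Theorem 1 of Section 2.4.

For each point $p \in P$ let $X_p$ denote the indicator random variable for the event that $p \in S_i$. We want to show that $\sum_{p\in C} X_p$ does not deviate much from its expectation. If a cell contains at least $\delta\cdot 2^{i-1}$ points then $E[\sum_{p\in C} X_p] \ge \alpha/2$. From Theorem 1 it follows:

$$\Pr\left[\left|\sum_{p\in C} X_p - E\Big[\sum_{p\in C} X_p\Big]\right| \ge \epsilon\cdot E\Big[\sum_{p\in C} X_p\Big]\right] \le e^{-\min\{\lfloor\alpha/2\rfloor,\ \lfloor\epsilon^2\alpha/6\rfloor\}}.$$

Plugging in $\sum_{p\in C} X_p = \widetilde{n}_C\cdot p_i$ and $E[\sum_{p\in C} X_p] = n_C\cdot p_i$ we obtain

$$\Pr\big[|\widetilde{n}_C - n_C| \ge \epsilon\cdot n_C\big] \le \frac{\rho}{2\cdot Z\cdot 2^{Z\cdot d}},$$

and the second statement follows with probability $1 - \frac{\rho}{2\cdot Z\cdot 2^{Z\cdot d}}$.

Assume that $C$ contains at most $\delta\cdot 2^{i-1}$ points. If $C$ contains exactly $\delta\cdot 2^{i-1}$ points, we can conclude from the formula above that

$$\widetilde{n}_C \le (1+\epsilon)\cdot n_C < (1+1/3)\cdot\delta\cdot 2^{i-1} = (1-1/3)\cdot\delta\cdot 2^i < (1-\epsilon)\cdot\delta\cdot 2^i$$

holds with probability $1-\rho/(2\cdot Z\cdot 2^{Zd})$. We observe that the distribution of $\widetilde{n}_C$ shifts towards lower values when the number of points in the cell decreases, which means that $\Pr[\widetilde{n}_C < (1-\epsilon)\delta\cdot 2^i] \ge 1-\rho/(2\cdot Z\cdot 2^{Zd})$ also holds for smaller numbers of points in $C$.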


We conclude that the two statements are valid for one fixed single cell $C$ with probability $1-\rho/(2\cdot Z\cdot 2^{Zd})$. Since we have at most $Z$ grids, each grid having at most $2^{Z\cdot d}$ cells, the two statements are valid with probability $1-\rho/2$ for all cells in all grids by the union bound.
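The choice of $\alpha$ is exactly what makes the per-cell failure probability small enough for the union bound. The following sketch checks this arithmetic; it ignores the floor operations in the exponent, and the function names and sample parameter values are assumptions made only for illustration.

```python
import math


def sample_parameter(eps, Z, d, rho):
    """alpha = 6 * eps^-2 * ln(2 * Z * 2^(Z*d) / rho)."""
    return 6 * eps ** -2 * math.log(2 * Z * 2 ** (Z * d) / rho)


def per_cell_failure(eps, alpha):
    """exp(-eps^2 * alpha / 6), the dominating Chernoff term (floors ignored)."""
    return math.exp(-eps ** 2 * alpha / 6)


# Illustrative parameters (assumed, not taken from the text).
eps, Z, d, rho = 0.25, 5, 2, 0.1
alpha = sample_parameter(eps, Z, d, rho)
single = per_cell_failure(eps, alpha)
number_of_cells = Z * 2 ** (Z * d)       # at most Z grids with 2^(Z*d) cells each
union_bound = number_of_cells * single   # total failure probability
```

Summing the per-cell failure probability over all cells of all grids gives $\rho/2$, which is the probability budget claimed in the lemma.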

If $\widetilde{n}_C \ge (1-\epsilon)\cdot\delta\cdot 2^i$, a cell $C \in G_i$ is considered as heavy. This way, we detect every heavy cell but we also consider some light cells as heavy.

We then compute a coreset by introducing a coreset point in each cell considered as heavy (as described in Section 5.2.1). This will increase the size of our coreset. The following corollaries show that the size of the coreset is still logarithmic in $n$.

Corollary 5.6.3 Assume that the statements of Lemma 5.6.2 hold for all cells in all grids (which happens with probability $1-\rho/2$).

If $\delta \ge \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{8\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{(d+1)/2}}$, the size of the computed coreset is at most $\frac{33\cdot k\cdot 10^d\cdot(2+\log n)\cdot d^{d/2}}{\epsilon^{d+1}} = O(k\cdot\log n/\epsilon^{d+1})$.

Proof: We can easily modify the proof of Lemma 5.2.8 by plugging in $\delta/2$ for the old value of $\delta$. The proof stays exactly the same and we can conclude that the size of the coreset is at most $\frac{4\cdot\mathrm{Opt}}{\delta\cdot\sqrt{d}} + (\log n + 2)\cdot k\cdot 3^d\cdot d^{d/2}$, which is smaller than the stated coreset size for $\delta \ge \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{8\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{(d+1)/2}}$.

An important property of our sample technique is that although the sample can be large, it just

occupies a small number of cells (and can, as we show later, be stored efficiently).

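Lemma 5.6.4 Let $\delta \ge \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{8\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{(d+1)/2}}$. Then we have points from at most

$$\frac{193\cdot Z\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{d/2}\cdot\ln(2\cdot Z\cdot 2^{Zd}/\rho)}{\epsilon^{d+3}} = \widetilde{O}\big(k\cdot\log n\cdot\log^2(\widetilde{\Delta})\cdot\log(\rho^{-1})/\epsilon^{d+3}\big)$$

cells in the union of our sample sets with probability at least $1-\rho/2$.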

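Proof: Let $G_i$ be a fixed grid. We determine an upper bound on the number of points in non-center grid cells. Let us recall from the proof of Lemma 5.2.8 that every point except for those contained in the $k\cdot 3^d\cdot d^{d/2}$ center cells has a distance of at least $\frac{\sqrt{d}}{2^{i+1}}$ to the nearest center in an optimal solution. Thus the overall number of points in non-center cells is at most $\frac{\mathrm{Opt}\cdot 2^{i+1}}{\sqrt{d}}$. Let $X_p$ denote the indicator random variable for the event that $p \in S_i$. Let $D$ denote the set of non-center grid cells. We have $E[\sum_{p\in D} X_p] \le p_i\cdot\frac{\mathrm{Opt}\cdot 2^{i+1}}{\sqrt{d}} \le \frac{2\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot\sqrt{d}}$. We will assume $E[\sum_{p\in D} X_p] = \frac{2\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot\sqrt{d}}$, as the distribution of $\sum_{p\in D} X_p$ shifts towards lower values when $E[\sum_{p\in D} X_p] < \frac{2\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot\sqrt{d}}$.

Applying Theorem 1 from Section 2.4 we get

$$\Pr\left[\sum_{p\in D} X_p \ge \frac{4\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot\sqrt{d}}\right] \le \Pr\left[\left|\sum_{p\in D} X_p - E\Big[\sum_{p\in D} X_p\Big]\right| \ge E\Big[\sum_{p\in D} X_p\Big]\right] \le e^{-\min\{\lfloor\alpha/2\rfloor,\ \lfloor\frac{2\alpha}{\delta}/3\rfloor\}} \le \frac{\rho}{2Z}.$$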


Therefore, with probability at least $1-\rho/(2Z)$ we have at most $\frac{4\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot\sqrt{d}}$ points from non-center cells in our sample. If no two of these points are contained in the same grid cell we get an upper bound of $\frac{4\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot\sqrt{d}}$ on the number of non-center cells that contain a sample point. Since there are at most $k\cdot 3^d\cdot d^{d/2}$ center cells the number of cells occupied by sample points is at most

$$\frac{4\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot\sqrt{d}} + k\cdot 3^d\cdot d^{d/2} \le \frac{193\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{d/2}\cdot\ln(2\cdot Z\cdot 2^{Zd}/\rho)}{\epsilon^{d+3}}.$$

Since these arguments hold for each of the $Z$ grids with probability $1-\rho/(2Z)$, the lemma follows from the union bound. □

To obtain a coreset we use the estimations $\widetilde{n}_C$ of the number of points in cells to identify the cells we consider as heavy (all cells having $\widetilde{n}_C \ge (1-\epsilon)\,\delta\,2^i$). Since all heavy cells are considered as heavy we obtain a finer coreset than before. We will now show how to find a good assignment of weights to the computed coreset points, such that the computed coreset is an $\epsilon$-coreset for $P$.

Since the weight of a coreset point will also depend on the number of points in some light cells, we have to estimate the number of points in these cells. To get an estimate for all required cells we use the following procedure. We require that the estimate $\widetilde{n}_C$ for the number of points in a cell considered as heavy is a $(1\pm\epsilon)$-approximation and that in every cell $C \in G_i$ considered as light there are not more than $\delta\cdot 2^i$ points (our coreset construction uses only these assumptions, and they hold according to Lemma 5.6.2 with probability $1-\rho/2$).

We call a cell useful if it is either considered as heavy or a direct subcell of a cell considered as heavy. We have to deal with the fact that the total estimated number of points $\sum_{C_i\ \text{subcell of}\ C} \widetilde{n}_{C_i}$ in the subcells of $C$ can exceed the estimated number of points $\widetilde{n}_C$ in $C$. To avoid this we have to compute new integral estimates $E_C$ for the number of points in each useful cell $C$, which are still guaranteed to be near the real value $n_C$ and which are consistent with the values $E_{C_i}$ of the subcells $C_i$ of $C$. We do this by first computing upper and lower bounds $U_C$ resp. $L_C$ on $n_C$ for all useful cells. We will then adjust these bounds to be consistent with the bounds for the subcells. Finally we will use the bounds to compute new estimates $E_C$.

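For $i > \log\left(\frac{2}{\epsilon\delta}\right)$ and every cell $C \in G_i$ considered as heavy we define $L_C = \lceil \widetilde{n}_C/(1+\epsilon) \rceil$ and $U_C = \lfloor \widetilde{n}_C/(1-\epsilon) \rfloor$. For $i > \log\left(\frac{2}{\epsilon\delta}\right)$ and every cell $C \in G_i$ considered as light we define $L_C = 0$ and $U_C = \lfloor \delta 2^i \rfloor$. For $i \le \log\left(\frac{2}{\epsilon\delta}\right)$ and every cell $C \in G_i$ we define $L_C = \widetilde{n}_C$ and $U_C = \widetilde{n}_C$ (since we know the number of points in $C$ exactly). Using these definitions we know for every cell that $L_C \le n_C \le U_C$.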

The estimates $E_C$ can be computed by adjusting the bounds $L_C$ and $U_C$ in cases of conflicts:

We first compute new lower and upper bounds $L_C$ and $U_C$ for all useful cells bottom-up. We look at the smallest cell $C$ considered as heavy. Let $C_i$, $i \in \{1, \dots, 2^d\}$ be its subcells. If $\sum_{i=1}^{2^d} L_{C_i} > L_C$, we set $L_C := \sum_{i=1}^{2^d} L_{C_i}$. If $\sum_{i=1}^{2^d} U_{C_i} < U_C$, we set $U_C := \sum_{i=1}^{2^d} U_{C_i}$. After the assignment $L_C \le n_C \le U_C$ still holds. We use this technique for all cells considered as heavy (in the order of increasing size), getting better bounds $L_C$ and $U_C$. From these bounds we then compute the values $E_C$ top-down. Since the bounds $L_C$ and $U_C$ are always at least as strong as the bounds of the subcells, we can always easily find integral values $E_C$ satisfying $L_C \le E_C \le U_C$ and $\sum_{i=1}^{2^d} E_{C_i} = E_C$.
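The two passes described above (tightening the bounds bottom-up, then distributing integral estimates top-down) can be sketched on a small cell tree. The `Cell` class, the greedy distribution rule, and the choice of the root estimate are assumptions made for illustration; the text only requires that some consistent integral assignment is found.

```python
class Cell:
    """A useful cell with bounds L_C <= n_C <= U_C and optional subcells."""

    def __init__(self, lower, upper, children=()):
        self.lower = lower
        self.upper = upper
        self.children = list(children)
        self.estimate = None


def tighten(cell):
    """Bottom-up pass: make the bounds of a cell consistent with its subcells."""
    for child in cell.children:
        tighten(child)
    if cell.children:
        cell.lower = max(cell.lower, sum(c.lower for c in cell.children))
        cell.upper = min(cell.upper, sum(c.upper for c in cell.children))


def assign(cell, value):
    """Top-down pass: split the integral estimate of a cell among its subcells."""
    cell.estimate = value
    if not cell.children:
        return
    spare = value - sum(c.lower for c in cell.children)
    for child in cell.children:
        extra = min(spare, child.upper - child.lower)
        assign(child, child.lower + extra)
        spare -= extra


# Illustrative bounds (assumed): one heavy cell with four subcells (d = 2).
leaves = [Cell(0, 3), Cell(2, 4), Cell(1, 1), Cell(0, 2)]
root = Cell(5, 9, leaves)
tighten(root)
assign(root, root.lower)  # any integer in [root.lower, root.upper] would do
```

After tightening, every value between the bounds of a cell lies between the sums of the lower and upper bounds of its subcells, so the greedy split always succeeds and the subcell estimates add up to the estimate of the parent.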

Corollary 5.6.5 Assume that the statements of Lemma 5.6.2 are true for all grids and all cells.


Then for each cell $C$ identified as heavy we have $(1-4\epsilon)\,n_C \le E_C \le (1+4\epsilon)\,n_C$.

For each cell $C \in G_i$ with $i \le \log\left(\frac{2}{\epsilon\delta}\right)$ we have $E_C = n_C$.

All estimates $E_C$ are integral and consistent with the estimates $E_{C_i}$ for the subcells $C_i$ of $C$.

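Proof: The claim follows directly from the following two sequences of inequalities.

$$E_C \ge L_C \ge \widetilde{n}_C/(1+\epsilon) \ge \frac{1-\epsilon}{1+\epsilon}\,n_C \ge (1-2\epsilon)\,n_C$$

and

$$E_C \le U_C \le \widetilde{n}_C/(1-\epsilon) \le \frac{1+\epsilon}{1-\epsilon}\,n_C \le (1+4\epsilon)\,n_C.$$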

We now apply the algorithm described in Section 5.2 to our estimations $E_C$ and compute a coreset.

Lemma 5.6.6 If $\delta \le \frac{\epsilon^{d+1}\cdot\mathrm{Opt}}{4\cdot k\cdot 10^d\cdot(1+\log n)\cdot d^{(d+1)/2}}$ and $\epsilon < 1/15$, the coreset computed with respect to the values $E_C$ is an $11\epsilon$-coreset of $P$ with probability $1-\rho$.

Proof: Let $P'$ be a point set that is distributed according to our estimations $E_C$ (so for every useful cell $C$ we have $|P' \cap C| = E_C$). The proof of Lemma 5.2.5 shows that the coreset computed by our algorithm is an $\epsilon$-coreset for $P'$. Let $Q = \{q_1, \dots, q_m\}$ be the computed coreset points. We will show that if we knew the point sets $P$ and $P'$, we could (using the coreset method) compute mappings $\gamma: P \to Q$ and $\gamma': P' \to Q$ and corresponding weight functions $w: Q \to \mathbb{N}$ and $w': Q \to \mathbb{N}$, such that $(Q, w)$ is an $\epsilon$-coreset for $P$ and $(Q, w')$ is an $\epsilon$-coreset for $P'$ and for all $q_i \in Q$ we have:

$$(1-4\epsilon)(w(q_i) - 1) \le w'(q_i) \le (1+4\epsilon)\,w(q_i) + 1 \qquad (5.2)$$

and

$$w(q_i) \le \frac{1}{\epsilon} \implies w(q_i) = w'(q_i). \qquad (5.3)$$

From that we will conclude that each solution on the point set $P'$ differs by at most a factor of $(1+O(\epsilon))$ from the solution on the point set $P$. Since the computed coreset is an $\epsilon$-coreset for $P'$ it follows that it is an $O(\epsilon)$-coreset for $P$.

Let us construct the mappings $\gamma$ and $\gamma'$. Lemma 5.2.5 shows that we construct a $(1+\epsilon)$-coreset when we map each point $p$ to a coreset point in the smallest heavy cell it is contained in. We start the assignment of points to coreset points within the smallest useful cells. Since the smallest useful cells are not heavy we do not assign any points to them. We proceed to assign points in the useful cells at the next higher level. Going through the levels bottom-up we will assign all points in useful cells and maintain the invariants (5.2) and (5.3).


Let $C$ be a cell considered as heavy. If there is no subcell considered as heavy, the algorithm introduces a new coreset point $q$. We map all $E_C$ points from $P'$ to $q$ and all $n_C$ points from $P$ to $q$. Then $w(q) = n_C$ and $w'(q) = E_C$. Notice that $w(q) < \frac{1}{\epsilon}$ can only happen for a coreset point $q$ when a cell in grid $G_i$ or a subcell is considered as heavy. Then $\delta\cdot 2^{i-1} < \frac{1}{\epsilon}$ according to Lemma 5.6.2. This only happens in grids $G_i$ with $i < \log\left(\frac{2}{\epsilon\delta}\right)$ where $E_C = n_C$ (Corollary 5.6.5) and therefore $w(q) = w'(q)$ after the assignment. This shows that invariant (5.3) holds. Invariant (5.2) follows directly from Corollary 5.6.5.

Let us now consider the case that $C \in G_i$ already has $c$ coreset points $q_1, \dots, q_c \in Q$ with weights $w(q_i)$ and $w'(q_i)$, respectively, and let us assume that the invariants (5.2) and (5.3) hold for all these coreset points. Let $l := n_C - \sum_{i=1}^{c} w(q_i)$ resp. $l' := E_C - \sum_{i=1}^{c} w'(q_i)$ be the number of points which have to be assigned to the coreset points $q_i$ by $\gamma$ resp. $\gamma'$.

We consider six cases:

• $l = 0$ and $l' = 0$: In this case nothing has to be assigned and the invariant holds by the assumption of the induction step.

• $l > 0$ and $l' = 0$ and $i \ge \log\left(\frac{2}{\epsilon\delta}\right)$: Each cell considered as heavy has at least $\delta\cdot 2^{i-1}$ points according to Lemma 5.6.2 (the threshold can only have been higher during the coreset construction so far). Therefore each coreset point must have a weight of at least $\delta\cdot 2^{i-1} \ge \frac{1}{\epsilon}$. It remains to show invariant (5.2). We have

$$(1-4\epsilon)\sum_{i=1}^{c} w(q_i) = (1-4\epsilon)(n_C - l) < (1-4\epsilon)\,n_C \le E_C = \sum_{i=1}^{c} w'(q_i).$$

Therefore for one $q_i$ we have $(1-4\epsilon)\,w(q_i) < w'(q_i)$ and we can assign at least one point from $P$ to $q_i$ by $\gamma$ without violating invariant (5.2). After that assignment either $l = 0$ or we find again a $q_i$ we can assign points to. We go on with this assignment until $l = 0$.

• $l = 0$ and $l' > 0$ and $i \ge \log\left(\frac{2}{\epsilon\delta}\right)$: Again we only have to show invariant (5.2). We have

$$\sum_{i=1}^{c} w'(q_i) = E_C - l' < E_C \le (1+4\epsilon)\,n_C = (1+4\epsilon)(n_C - l) = (1+4\epsilon)\sum_{i=1}^{c} w(q_i).$$

Therefore for one $q_i$ we have $w'(q_i) < (1+4\epsilon)\,w(q_i)$ and we can assign at least one point from $P'$ to $q_i$ by $\gamma'$ without violating the invariant. After that assignment either $l' = 0$ or we again find a $q_i$ we can assign points to. We go on with this assignment until $l' = 0$.

• $l > 0$ and $l' = 0$ and $i < \log\left(\frac{2}{\epsilon\delta}\right)$: In this case we have $E_C = n_C$ and

$$\sum_{i=1}^{c} w(q_i) = n_C - l < n_C = E_C = \sum_{i=1}^{c} w'(q_i).$$


Since all $w(q_i)$ and $w'(q_i)$ are integral, we have $w(q_i) \le w'(q_i) - 1$ for one $q_i$ and we can assign at least one point from $P$ to $q_i$ by $\gamma$ without violating the invariants. After that assignment either $l = 0$ or we find again a $q_i$ we can assign points to. We go on with this assignment until $l = 0$.

• $l = 0$ and $l' > 0$ and $i < \log\left(\frac{2}{\epsilon\delta}\right)$: Again we have $E_C = n_C$ and

$$\sum_{i=1}^{c} w'(q_i) = E_C - l' < E_C = n_C = (n_C - l) = \sum_{i=1}^{c} w(q_i).$$

Therefore for one $q_i$ we have $w'(q_i) \le w(q_i) - 1$ and can assign at least one point from $P'$ to $q_i$ by $\gamma'$ without violating the invariant. After that assignment either $l' = 0$ or we again find a $q_i$ we can assign points to. We go on with this assignment until $l' = 0$.

• $l > 0$ and $l' > 0$: We assign $\min\{l, l'\}$ points from $P$ to $q_1$ by $\gamma$ and $\min\{l, l'\}$ points from $P'$ to $q_1$ by $\gamma'$. This does not violate the invariant. After the assignment we are in one of the other cases.

After the inductive assignment we have constructed mappings $\gamma$ and $\gamma'$ and corresponding weight functions $w, w'$, such that invariants (5.2) and (5.3) hold. Using the invariants it is easy to show that for all coreset points $q$:

$$(1-5\epsilon)\,w(q) \le w'(q) \le (1+5\epsilon)\,w(q). \qquad (5.4)$$

If $w(q) \le \frac{1}{\epsilon}$ the inequality follows from invariant (5.3). If $w(q) > \frac{1}{\epsilon}$ we have:

$$w'(q) \ge (1-4\epsilon)(w(q) - 1) \ge (1-4\epsilon)(w(q) - \epsilon\cdot w(q)) = (1 - 4\epsilon - \epsilon + 4\epsilon^2)\,w(q) \ge (1-5\epsilon)\,w(q)$$

and

$$w'(q) \le (1+4\epsilon)\,w(q) + 1 \le (1+4\epsilon)\,w(q) + \epsilon\cdot w(q) = (1+5\epsilon)\,w(q)$$

and inequality (5.4) follows.

Let $A$ denote the coreset computed by our sample algorithm. Since $A$ is an $\epsilon$-coreset for $P'$ we know for each set of centers $C$:

$$\mathrm{Median}(A, C) \in (1\pm\epsilon)\cdot\mathrm{Median}(P', C).$$

From the arguments above we know that

$$\mathrm{Median}(P', C) \in \frac{1}{1\pm\epsilon}\cdot\mathrm{Median}((Q, w'), C) \subset (1\pm 2\epsilon)\cdot\mathrm{Median}((Q, w'), C).$$

Since the weights $w'(q)$ and $w(q)$ of each coreset point $q \in Q$ differ by at most $5\epsilon\cdot w(q)$, we can conclude:

$$\mathrm{Median}((Q, w'), C) \in (1\pm 5\epsilon)\cdot\mathrm{Median}((Q, w), C).$$


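Since $(Q, w)$ is an $\epsilon$-coreset for $P$, we obtain:

$$\mathrm{Median}((Q, w), C) \in (1\pm\epsilon)\cdot\mathrm{Median}(P, C).$$

Altogether we get for $\epsilon < 1/15$:

$$\mathrm{Median}(A, C) \in (1\pm\epsilon)^2\cdot(1\pm 2\epsilon)\cdot(1\pm 5\epsilon)\cdot\mathrm{Median}(P, C) \subset (1\pm 11\epsilon)\cdot\mathrm{Median}(P, C).$$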

5.6.2 k-Means

In this section we will adapt the sampling technique of the last section to the problem k-means.

The proofs will be very similar to the proofs of the last section.

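We again consider $Z$ nested grids $G_0, \dots, G_{Z-1}$ for $Z = \left\lceil \log\left(\frac{8\cdot k\cdot 33^{d+1}\cdot n(1+\log n)\cdot d^{(d+2)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+2}}\right) \right\rceil + 1$, grid $G_i$ having cell side length $\frac{1}{2^i}$. In each grid $G_i$ we pick a random sample $S_i$ of points. To select our random sample we take every point with probability

$$p_i = \min\left\{\frac{\alpha}{\delta\cdot 4^i},\ 1\right\}$$

into our sample $S_i$, where

$$\alpha = 6\cdot\epsilon^{-2}\ln(2\cdot Z\cdot 2^{Z\cdot d}/\rho)$$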

and $\rho$ is the desired error probability of our algorithm. The sampling is done at least $\alpha$-wise independently, which means that for each set $A \subset P$ of at most $\alpha$ points and each partition $\{B, C\}$ of $A$:

$$\Pr[B \subset S_i \wedge C \cap S_i = \emptyset] = (p_i)^{|B|}\cdot(1-p_i)^{|C|}.$$

Essentially this means that for each subset $A \subset P$ of size $\alpha$ the sampling is done independently.

We will show that it follows from a variant of Chernoff bounds [113] that we can approximate the number of points in every heavy cell up to a multiplicative error of $(1\pm\epsilon)$ just using our point samples. The approximations will furthermore be good enough to detect the heavy cells and to construct an $\epsilon$-coreset in the same way as described before.

Definition 5.6.7 (Considered as Heavy) For each cell $C$ in grid $G_i$ we define $n_C$ as the number of points in $C$. We define our estimation on the number of points as

$$\widetilde{n}_C := |S_i \cap C|\cdot\frac{1}{p_i}.$$

A cell $C$ in grid $G_i$ is considered as heavy, if

$$\widetilde{n}_C \ge (1-\epsilon)\cdot\delta\cdot 4^i.$$

Lemma 5.6.8 The following events hold with probability at least $1-\rho/2$ for all grids $G_i$ and each grid cell in $G_i$:


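• If $i \le \log_4\left(\frac{2}{\epsilon\delta}\right)$, then $\widetilde{n}_C = n_C$.

• If $C$ contains at least $\delta\cdot 4^i/2$ points, then $(1-\epsilon)\cdot n_C \le \widetilde{n}_C \le (1+\epsilon)\cdot n_C$.

• If $C$ contains less than $\delta\cdot 4^i/2$ points, then $\widetilde{n}_C < (1-\epsilon)\cdot\delta\cdot 4^i$ (and the cell $C$ is not considered as heavy).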

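Proof: If $i \le \log_4\left(\frac{2}{\epsilon\delta}\right)$ then $p_i = 1$, the sample set equals the point set, and $\widetilde{n}_C = n_C$.

Let $C$ be an arbitrary grid cell in $G_i$. To prove the last two statements for the single cell $C$ we use Theorem 1 of Section 2.4.

For each point $p \in P$ let $X_p$ denote the indicator random variable for the event that $p \in S_i$. We want to show that $\sum_{p\in C} X_p$ does not deviate much from its expectation. If a cell contains at least $\delta\cdot 4^i/2$ points then $E[\sum_{p\in C} X_p] \ge \alpha/2$. From Theorem 1 it follows:

$$\Pr\left[\left|\sum_{p\in C} X_p - E\Big[\sum_{p\in C} X_p\Big]\right| \ge \epsilon\cdot E\Big[\sum_{p\in C} X_p\Big]\right] \le e^{-\min\{\lfloor\alpha/2\rfloor,\ \lfloor\epsilon^2\alpha/6\rfloor\}}.$$

Plugging in $\sum_{p\in C} X_p = \widetilde{n}_C\cdot p_i$ and $E[\sum_{p\in C} X_p] = n_C\cdot p_i$ we obtain

$$\Pr\big[|\widetilde{n}_C - n_C| \ge \epsilon\cdot n_C\big] \le \frac{\rho}{2\cdot Z\cdot 2^{Z\cdot d}},$$

and the second statement follows with probability $1 - \frac{\rho}{2\cdot Z\cdot 2^{Z\cdot d}}$.

Assume that $C$ contains at most $\delta\cdot 4^i/2$ points. If $C$ contains exactly $\delta\cdot 4^i/2$ points, we can conclude from the formula above that

$$\widetilde{n}_C \le (1+\epsilon)\cdot n_C < (1+1/3)\cdot\delta\cdot 4^i/2 = (1-1/3)\cdot\delta\cdot 4^i < (1-\epsilon)\cdot\delta\cdot 4^i$$

holds with probability $1-\rho/(2\cdot Z\cdot 2^{Zd})$. We observe that the distribution of $\widetilde{n}_C$ shifts towards lower values when the number of points in the cell decreases, which means that $\Pr[\widetilde{n}_C < (1-\epsilon)\delta\cdot 4^i] \ge 1-\rho/(2\cdot Z\cdot 2^{Zd})$ also holds for smaller numbers of points in $C$.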

We conclude that the two statements are valid for one fixed single cell $C$ with probability $1-\rho/(2\cdot Z\cdot 2^{Zd})$. Since we have at most $Z$ grids, each grid having at most $2^{Z\cdot d}$ cells, the two statements are valid with probability $1-\rho/2$ for all cells in all grids by the union bound.

If $\widetilde{n}_C \ge (1-\epsilon)\cdot\delta\cdot 4^i$, a cell $C \in G_i$ is considered as heavy. This way, we detect every heavy cell but we also consider some light cells as heavy.

We then compute a coreset by introducing a coreset point in each cell considered as heavy (as described in Section 5.3.1). This will increase the size of our coreset. The following corollaries show that the size of the coreset is still logarithmic in $n$.

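Corollary 5.6.9 Assume that the statements of Lemma 5.6.8 hold for all cells in all grids.

If $\delta \ge \frac{\epsilon^{d+2}\cdot\mathrm{Opt}}{16\cdot k\cdot 33^{d+1}\cdot(1+\log n)\cdot d^{(d+2)/2}}$, the size of the computed coreset is at most

$$\frac{129\cdot k\cdot 33^{d+1}\cdot(1+\log n)\cdot d^{d/2}}{\epsilon^{d+2}} = O(k\cdot\log n/\epsilon^{d+2}).$$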


Proof: We can easily modify the proof of Lemma 5.3.8 by plugging in $\delta/2$ for the old value of $\delta$. The proof stays exactly the same and we can conclude that the size of the coreset is at most $\frac{8\cdot\mathrm{Opt}}{\delta\cdot d} + (\log n + 1)\cdot k\cdot 3^d\cdot d^{d/2}$, which is smaller than the stated coreset size for $\delta \ge \frac{\epsilon^{d+2}\cdot\mathrm{Opt}}{16\cdot k\cdot 33^{d+1}\cdot(1+\log n)\cdot d^{(d+2)/2}}$.

An important property of our sample technique is that although the sample can be large, it just

occupies a small number of cells (and can, as we show later, be stored efficiently).

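Lemma 5.6.10 Let $\delta \ge \frac{\epsilon^{d+2}\cdot\mathrm{Opt}}{16\cdot k\cdot 33^{d+1}\cdot(1+\log n)\cdot d^{(d+2)/2}}$. Then we have points from at most

$$\frac{769\cdot Z\cdot k\cdot 33^{d+1}\cdot(1+\log n)\cdot d^{d/2}\cdot\ln(2\cdot Z\cdot 2^{Zd}/\rho)}{\epsilon^{d+4}} = \widetilde{O}\big(k\cdot\log n\cdot\log^2(\widetilde{\Delta})\cdot\log(\rho^{-1})/\epsilon^{d+4}\big)$$

cells in the union of our sample sets with probability at least $1-\rho/2$.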

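Proof: Let $G_i$ be a fixed grid. We determine an upper bound on the number of points in non-center grid cells. Let us recall from the proof of Lemma 5.3.8 that every point except for those contained in the $k\cdot 3^d\cdot d^{d/2}$ center cells has a distance of at least $\frac{\sqrt{d}}{2^{i+1}}$ to the nearest center in an optimal solution and therefore contributes at least $\frac{d}{4^{i+1}}$ to the optimal solution. Thus the overall number of points in non-center cells is at most $\frac{\mathrm{Opt}\cdot 4^{i+1}}{d}$. Let $X_p$ denote the indicator random variable for the event that $p \in S_i$. Let $D$ denote the set of non-center grid cells. We have $E[\sum_{p\in D} X_p] \le p_i\cdot\frac{\mathrm{Opt}\cdot 4^{i+1}}{d} \le \frac{4\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot d}$. We will assume $E[\sum_{p\in D} X_p] = \frac{4\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot d}$, as the distribution of $\sum_{p\in D} X_p$ shifts towards lower values when $E[\sum_{p\in D} X_p] < \frac{4\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot d}$.

Applying Theorem 1 of Section 2.4 we get

$$\Pr\left[\sum_{p\in D} X_p \ge \frac{8\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot d}\right] \le \Pr\left[\left|\sum_{p\in D} X_p - E\Big[\sum_{p\in D} X_p\Big]\right| \ge E\Big[\sum_{p\in D} X_p\Big]\right] \le e^{-\min\{\lfloor\alpha/2\rfloor,\ \lfloor\frac{2\alpha}{\delta}/3\rfloor\}} \le \frac{\rho}{2Z}.$$

Therefore, with probability at least $1-\rho/(2Z)$ we have at most $\frac{8\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot d}$ points from non-center cells in our sample. Since in the worst case no two of these points are contained in the same grid cell we get an upper bound of $\frac{8\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot d}$ on the number of non-center cells that contain a sample point. Since there are at most $k\cdot 3^d\cdot d^{d/2}$ center cells the number of cells occupied by sample points is at most

$$\frac{8\cdot\alpha\cdot\mathrm{Opt}}{\delta\cdot d} + k\cdot 3^d\cdot d^{d/2} \le \frac{769\cdot k\cdot 33^{d+1}\cdot(1+\log n)\cdot d^{d/2}\cdot\ln(2\cdot Z\cdot 2^{Zd}/\rho)}{\epsilon^{d+4}}.$$

Since these arguments hold for each of the $Z$ grids with probability $1-\rho/(2Z)$, the lemma follows from the union bound. □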

To obtain a coreset we use the estimations $\widetilde{n}_C$ of the number of points in heavy cells to identify the cells we consider as heavy (all cells having $\widetilde{n}_C \ge (1-\epsilon)\,\delta\,4^i$). Since all heavy cells are considered as heavy we obtain a finer coreset than before. We will now show how to find a good assignment of weights to the computed coreset points, such that the computed coreset is an $\epsilon$-coreset for $P$.


Since the weight of a coreset point will also depend on the number of points in some light cells, we have to estimate the number of points in these cells. To get an estimate for all required cells we use the following procedure. We require that the estimate $\widetilde{n}_C$ for the number of points in a heavy cell is a $(1\pm\epsilon)$-approximation and that in every cell $C \in G_i$ considered as light there are not more than $\delta\cdot 4^i$ points (our coreset construction uses only these assumptions, and they hold according to Lemma 5.6.8 with probability $1-\rho/2$).

We call a cell useful if it is either considered as heavy or a direct subcell of a cell considered as heavy. We have to deal with the fact that the total estimated number of points $\sum_{C_i\ \text{subcell of}\ C} \widetilde{n}_{C_i}$ in the subcells of $C$ can exceed the estimated number of points $\widetilde{n}_C$ in $C$. To avoid this we have to compute new integral estimates $E_C$ for the number of points in each useful cell $C$, which are still guaranteed to be near the real value $n_C$ and which are consistent with the values $E_{C_i}$ of the subcells $C_i$ of $C$. We do this by first computing upper and lower bounds $U_C$ resp. $L_C$ on $n_C$ for all useful cells. We will then adjust these bounds to be consistent with the bounds for the subcells. Finally we will use the bounds to compute new estimates $E_C$.

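For $i > \log_4\left(\frac{2}{\epsilon\delta}\right)$ and every cell $C \in G_i$ considered as heavy we define $L_C = \lceil \widetilde{n}_C/(1+\epsilon) \rceil$ and $U_C = \lfloor \widetilde{n}_C/(1-\epsilon) \rfloor$. For $i > \log_4\left(\frac{2}{\epsilon\delta}\right)$ and every cell $C \in G_i$ considered as light we define $L_C = 0$ and $U_C = \lfloor \delta 4^i \rfloor$. For $i \le \log_4\left(\frac{2}{\epsilon\delta}\right)$ and every cell $C \in G_i$ we define $L_C = \widetilde{n}_C$ and $U_C = \widetilde{n}_C$ (since we know the number of points in $C$ exactly). Using these definitions we know for every cell that $L_C \le n_C \le U_C$.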

The estimates $E_C$ can be computed bottom-up by adjusting the bounds $L_C$ and $U_C$ in cases of conflicts:

We first compute new lower and upper bounds $L_C$ and $U_C$ for all useful cells bottom-up. We look at the smallest cell $C$ considered as heavy. Let $C_i$, $i \in \{1,\dots,2^d\}$, be its subcells. If $\sum_{i=1}^{2^d} L_{C_i} > L_C$, we set $L_C := \sum_{i=1}^{2^d} L_{C_i}$. If $\sum_{i=1}^{2^d} U_{C_i} < U_C$, we set $U_C := \sum_{i=1}^{2^d} U_{C_i}$. After the assignment $L_C \le n_C \le U_C$ still holds. We use this technique for all cells considered as heavy (in the order of increasing size), getting better bounds $L_C$ and $U_C$. From these bounds we then compute the values $E_C$ top-down. Since the bounds $L_C$ and $U_C$ are always at least as strong as the bounds of the subcells, we can always easily find integral values $E_C$ satisfying $L_C \le E_C \le U_C$ and $\sum_{i=1}^{2^d} E_{C_i} = E_C$.
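The two passes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Cell` class, the function names `tighten` and `assign`, and the greedy way of splitting a parent's estimate among its subcells are hypothetical choices, not taken from the text, which only states that suitable integral values can be found.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    """A useful cell with bounds L_C <= n_C <= U_C (hypothetical helper type)."""
    lower: int
    upper: int
    children: List["Cell"] = field(default_factory=list)
    estimate: Optional[int] = None


def tighten(cell: Cell) -> None:
    """Bottom-up pass: make the bounds of a cell at least as strong as
    the summed bounds of its subcells."""
    if not cell.children:
        return
    for child in cell.children:
        tighten(child)
    cell.lower = max(cell.lower, sum(c.lower for c in cell.children))
    cell.upper = min(cell.upper, sum(c.upper for c in cell.children))


def assign(cell: Cell, value: int) -> None:
    """Top-down pass: fix E_C = value and split it among the subcells so
    that the subcell estimates sum to E_C and respect their own bounds."""
    assert cell.lower <= value <= cell.upper
    cell.estimate = value
    if not cell.children:
        return
    # Start each subcell at its lower bound and hand out the rest greedily
    # (an assumed tie-breaking rule; any feasible split would do).
    remainder = value - sum(c.lower for c in cell.children)
    for child in cell.children:
        extra = min(remainder, child.upper - child.lower)
        remainder -= extra
        assign(child, child.lower + extra)
    assert remainder == 0
```

Because `tighten` guarantees that the parent interval lies inside the interval spanned by the subcell sums, the greedy split in `assign` always uses up the whole remainder, provided the true counts lie inside all intervals.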

Corollary 5.6.11 Assume that the statements of Lemma 5.6.8 are true for all grids and all cells. Then for each cell $C$ identified as heavy we have $(1-4\epsilon)\,n_C \le E_C \le (1+4\epsilon)\,n_C$.

For each cell $C \in \mathcal{G}_i$ with $i \le \log_4(2/\delta)$ we have $E_C = n_C$.

All estimates $E_C$ are integral and consistent with the estimates $E_{C_i}$ for the subcells $C_i$ of $C$.

Proof : The proof follows exactly the proof of Corollary 5.6.5. □

We now apply the algorithm described in Section 5.3 to our estimates $E_C$ and compute a coreset.

Lemma 5.6.12 If
\[ \delta \le \frac{\epsilon^{d+2}\cdot Opt}{8\cdot k\cdot 33^{d+1}\cdot(1+\log n)\cdot d^{(d+2)/2}} \]
and $\epsilon < 1/15$, the coreset computed with respect to the values $E_C$ is an $11\epsilon$-coreset of $P$ with probability $1-\rho$.

Proof : Let $P'$ be a point set that is distributed according to our estimates $E_C$ (so for every useful cell $C$ we have $|P' \cap C| = E_C$). The proof of Lemma 5.3.5 shows that the coreset computed by our algorithm is an $\epsilon$-coreset for $P'$. Let $Q = \{q_1,\dots,q_m\}$ be the computed coreset points. Following exactly the proof of Lemma 5.6.6 we can show that if we knew the point sets $P$ and $P'$, we could (using our coreset method) compute mappings $\gamma: P \to Q$ and $\gamma': P' \to Q$ and corresponding weight functions $w: Q \to \mathbb{N}$ and $w': Q \to \mathbb{N}$, such that $(Q,w)$ is an $\epsilon$-coreset for $P$, $(Q,w')$ is an $\epsilon$-coreset for $P'$, and for all $q_i \in Q$ we have:

\[ (1-5\epsilon)\,w(q_i) \le w'(q_i) \le (1+5\epsilon)\,w(q_i) \tag{5.5} \]

Let $A$ denote the coreset computed by our sampling algorithm. From Lemma 5.3.5 we know for each set of centers $C$:

\[ \mathrm{Means}(A,C) \in (1\pm\epsilon)\cdot\mathrm{Means}(P',C). \]

From the arguments above we know that

\[ \mathrm{Means}(P',C) \in \frac{1}{1\pm\epsilon}\cdot\mathrm{Means}((Q,w'),C) \subset (1\pm2\epsilon)\cdot\mathrm{Means}((Q,w'),C). \]

Since the weights $w'(q)$ and $w(q)$ of each coreset point $q \in Q$ differ by at most $5\epsilon\cdot w(q)$, we can conclude:

\[ \mathrm{Means}((Q,w'),C) \in (1\pm5\epsilon)\cdot\mathrm{Means}((Q,w),C). \]

Since $(Q,w)$ is an $\epsilon$-coreset for $P$, we obtain:

\[ \mathrm{Means}((Q,w),C) \in (1\pm\epsilon)\cdot\mathrm{Means}(P,C). \]

Altogether we get for $\epsilon < 1/15$:

\[ \mathrm{Means}(A,C) \in (1\pm\epsilon)^2\cdot(1\pm2\epsilon)\cdot(1\pm5\epsilon)\cdot\mathrm{Means}(P,C) \subset (1\pm11\epsilon)\cdot\mathrm{Means}(P,C). \]
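The last inclusion can be sanity-checked numerically. The following sketch is not part of the proof; it merely evaluates both ends of the interval product $(1\pm\epsilon)^2(1\pm2\epsilon)(1\pm5\epsilon)$ with exact rational arithmetic and compares them with $1\pm11\epsilon$. The function names are hypothetical.

```python
from fractions import Fraction


def interval_product(eps):
    """Return the lower and upper end of (1±eps)^2 (1±2eps) (1±5eps)."""
    lo = (1 - eps) ** 2 * (1 - 2 * eps) * (1 - 5 * eps)
    hi = (1 + eps) ** 2 * (1 + 2 * eps) * (1 + 5 * eps)
    return lo, hi


def within_eleven_eps(eps):
    """Check whether the product interval is contained in [1-11eps, 1+11eps]."""
    lo, hi = interval_product(eps)
    return 1 - 11 * eps <= lo and hi <= 1 + 11 * eps
```

For values of $\epsilon$ below $1/15$ the check succeeds, while for a larger value such as $\epsilon = 1/5$ the upper end of the product exceeds $1 + 11\epsilon$, which illustrates why the lemma restricts $\epsilon$.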

5.6.3 Oblivious Optimization Problems

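We again consider $Z$ grids $\mathcal{G}_0,\dots,\mathcal{G}_{Z-1}$ for

\[ Z = \left\lceil \log \frac{(10\ell)^{d+1}\cdot n(1+\log n)\cdot d^{(d+1)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+1}\cdot\lambda^{d}} \right\rceil + 1, \]

grid $\mathcal{G}_i$ having cell side length $1/2^i$. In each grid $\mathcal{G}_i$ we pick a random sample $S_i$ of points. To select our random sample we take every point with probability

\[ p_i = \min\left\{ \frac{\alpha}{\delta\cdot 2^i},\ 1 \right\} \]

into our sample $S_i$, where

\[ \alpha = 6\cdot\epsilon^{-2}\ln(2\cdot Z\cdot 2^{Z\cdot d}/\rho) \]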

and $\rho$ is the desired error probability of our algorithm. The sampling is done at least $\alpha$-wise independently, which means that for each set $A \subset P$ of at most $\alpha$ points and each partition $\{B, C\}$ of $A$:

\[ \Pr[B \subset S_i \wedge C \cap S_i = \emptyset] = (p_i)^{|B|}\cdot(1-p_i)^{|C|}. \]

Essentially this means that for each subset $A \subset P$ of size $\alpha$ the sampling is done independently.

We will show that it follows from a variant of Chernoff bounds [113] that we can approximate the number of points in every heavy cell up to a multiplicative error of $(1\pm\epsilon)$ just using our point samples. The approximations will furthermore be good enough to detect the heavy cells and to construct a coreset in the same way as described before.

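Definition 5.6.13 (Considered as Heavy) For each cell $C$ in grid $\mathcal{G}_i$ we define $n_C$ as the number of points in $C$. We define our estimate of the number of points as

\[ \widetilde{n}_C := |S_i \cap C|\cdot\frac{1}{p_i}. \]

A cell $C$ in grid $\mathcal{G}_i$ is considered as heavy if

\[ \widetilde{n}_C \ge (1-\epsilon)\cdot\delta\cdot 2^i. \]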
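As a small illustration of this definition, the following sketch computes the sampling probability, the estimate, and the heaviness test for a single cell. It assumes that the estimate rescales the sample count by $1/p_i$; the function names and the concrete parameter values in the usage note are hypothetical.

```python
def sampling_probability(alpha: float, delta: float, i: int) -> float:
    """p_i = min(alpha / (delta * 2^i), 1) for grid G_i."""
    return min(alpha / (delta * 2 ** i), 1.0)


def estimate_count(sample_count: int, p_i: float) -> float:
    """Estimate of n_C: the number of sample points in C scaled by 1/p_i."""
    return sample_count / p_i


def considered_heavy(sample_count: int, alpha: float, delta: float,
                     i: int, eps: float) -> bool:
    """A cell of grid G_i is considered heavy if its estimate is at least
    (1 - eps) * delta * 2^i."""
    p_i = sampling_probability(alpha, delta, i)
    return estimate_count(sample_count, p_i) >= (1 - eps) * delta * 2 ** i
```

In coarse grids, where $\alpha/(\delta\cdot 2^i) \ge 1$, every point is sampled and the estimate equals the true count. In finer grids the estimate is a rescaled sample count that is compared against the threshold.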

Lemma 5.6.14 The following events hold with probability at least $1-\rho/2$ for all grids $\mathcal{G}_i$ and each grid cell in $\mathcal{G}_i$:

• If $i \le \log(2/\delta)$, then $\widetilde{n}_C = n_C$.

• If $C$ contains at least $\delta\cdot 2^{i-1}$ points, then $(1-\epsilon)\cdot n_C \le \widetilde{n}_C \le (1+\epsilon)\cdot n_C$.

• If $C$ contains fewer than $\delta\cdot 2^{i-1}$ points, then $\widetilde{n}_C < (1-\epsilon)\cdot\delta\cdot 2^i$ (and the cell $C$ is not considered as heavy).

Proof : The proof is exactly the same as the proof of Lemma 5.6.2. □

If $\widetilde{n}_C \ge (1-\epsilon)\cdot\delta\cdot 2^i$, a cell $C \in \mathcal{G}_i$ is considered as heavy. This way, we detect every heavy cell but we also consider some light cells as heavy.

We then compute a coreset by introducing a coreset point in each cell considered as heavy (as described in Section 5.4.1). This will increase the size of our coreset. The following corollaries show that the size of the coreset is still logarithmic in $n$.

Corollary 5.6.15 Assume that the statements of Lemma 5.6.14 hold for all cells in all grids. If

\[ \delta \ge \frac{\epsilon^{d+1}\cdot\lambda^{d}\cdot Opt}{2\cdot(10\ell)^{d+1}\cdot(1+\log n)\cdot d^{(d+1)/2}}, \]

the size of the computed coreset is at most

\[ \frac{9\cdot(10\ell)^{d+1}\cdot(2+\log n)\cdot d^{d/2}}{\lambda^{d+1}\cdot\epsilon^{d+1}} = O(\log n/\epsilon^{d+1}). \]

Proof : We can easily modify the proof of Lemma 5.4.7 by plugging in $\delta/2$ for the old value of $\delta$. The proof stays exactly the same and we can conclude that the size of the coreset is at most $\frac{4\cdot Opt}{\delta\cdot\lambda\cdot\sqrt{d}} + (\log n+2)\cdot 3^d\cdot d^{d/2}$, which is smaller than the stated coreset size for $\delta \ge \frac{\epsilon^{d+1}\cdot\lambda^{d}\cdot Opt}{2\cdot(10\ell)^{d+1}\cdot(1+\log n)\cdot d^{(d+1)/2}}$.

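An important property of our sampling technique is that although the sample can be large, it occupies only a small number of cells (and can, as we show later, be stored efficiently).

Lemma 5.6.16 Let $\delta \ge \frac{\epsilon^{d+1}\cdot\lambda^{d}\cdot Opt}{2\cdot(10\ell)^{d+1}\cdot(1+\log n)\cdot d^{(d+1)/2}}$. Then we have points from at most

\[ \frac{49\cdot Z\cdot(10\ell)^{d+1}\cdot(1+\log n)\cdot d^{d/2}\cdot\ln(2\cdot Z\cdot 2^{Zd}/\rho)}{\epsilon^{d+3}\cdot\lambda^{d+1}} = \widetilde{O}\big(\log n\cdot\log\widetilde{\Delta}\cdot\log(\rho^{-1})/\epsilon^{d+3}\big) \]

cells in the union of our sample sets with probability at least $1-\rho/2$.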

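Proof : Let $\mathcal{G}_i$ be a fixed grid. We determine an upper bound on the number of points in non-center grid cells. Let us recall from the proof of Lemma 5.4.7 that every point except for those contained in the $3^d\cdot d^{d/2}$ center cells has a distance of at least $\frac{\sqrt{d}}{2^{i+1}}$ to the center of gravity and contributes at least $\frac{\lambda\cdot\sqrt{d}}{2^{i+1}}$ to the cost of an optimal solution. Thus the overall number of points in non-center cells is at most $\frac{Opt\cdot 2^{i+1}}{\lambda\cdot\sqrt{d}}$. Let $X_p$ denote the indicator random variable for the event that $p \in S_i$. Let $D$ denote the set of non-center grid cells. We have

\[ E\Big[\sum_{p\in D} X_p\Big] \le p_i\cdot\frac{Opt\cdot 2^{i+1}}{\lambda\cdot\sqrt{d}} \le \frac{2\cdot\alpha\cdot Opt}{\delta\cdot\lambda\cdot\sqrt{d}}. \]

We will assume $E[\sum_{p\in D} X_p] = \frac{2\cdot\alpha\cdot Opt}{\delta\cdot\lambda\cdot\sqrt{d}}$, as the distribution of $\sum_{p\in D} X_p$ shifts towards lower values when $E[\sum_{p\in D} X_p] < \frac{2\cdot\alpha\cdot Opt}{\delta\cdot\lambda\cdot\sqrt{d}}$.

Applying Theorem 1 of Section 2.4 we get

\[ \Pr\Big[\sum_{p\in D} X_p \ge \frac{4\cdot\alpha\cdot Opt}{\delta\cdot\lambda\cdot\sqrt{d}}\Big] \le \Pr\Big[\sum_{p\in D} X_p - E\Big[\sum_{p\in D} X_p\Big] \ge E\Big[\sum_{p\in D} X_p\Big]\Big] \le e^{-\min\{\lfloor\alpha/2\rfloor,\ \lfloor\frac{2\alpha}{\delta}/3\rfloor\}} \le \frac{\rho}{2Z}. \]

Therefore, with probability at least $1-\rho/(2Z)$ we have at most $\frac{4\cdot\alpha\cdot Opt}{\delta\cdot\lambda\cdot\sqrt{d}}$ points from non-center cells in our sample. In the worst case no two of these points are contained in the same grid cell and we get an upper bound of $\frac{4\cdot\alpha\cdot Opt}{\delta\cdot\lambda\cdot\sqrt{d}}$ on the number of non-center cells that contain a sample point. Since there are at most $3^d\cdot d^{d/2}$ center cells, the number of cells occupied by sample points is at most

\[ \frac{4\cdot\alpha\cdot Opt}{\delta\cdot\lambda\cdot\sqrt{d}} + 3^d\cdot d^{d/2} \le \frac{49\cdot(10\ell)^{d+1}\cdot(1+\log n)\cdot d^{d/2}\cdot\ln(2\cdot Z\cdot 2^{Zd}/\rho)}{\epsilon^{d+3}\cdot\lambda^{d+1}}. \]

Since these arguments hold for each of the $Z$ grids with probability $1-\rho/(2Z)$, the lemma follows from the union bound. □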

To obtain a coreset we use the estimates $\widetilde{n}_C$ of the number of points in heavy cells to identify the cells we consider as heavy (all cells having $\widetilde{n}_C \ge (1-\epsilon)\,\delta\,2^i$). Since all heavy cells are considered as heavy, we obtain a finer coreset than before. We will now show how to find a good assignment of weights to the computed coreset points, such that the computed coreset is an $\epsilon$-coreset for $P$.

Since the weight of a coreset point will also depend on the number of points in some light cells, we have to estimate the number of points in these cells. To get an estimate for all required cells we use the following procedure. We require that the estimate $\widetilde{n}_C$ for the number of points in a heavy cell is a $(1\pm\epsilon)$-approximation and that every cell $C \in \mathcal{G}_i$ considered as light contains at most $\delta\cdot 2^i$ points (our coreset construction uses only these assumptions, and they hold according to Lemma 5.6.14 with probability $1-\rho/2$).

We use exactly the same technique described for k-median to obtain consistent estimates $E_C$:

We call a cell useful if it is either considered as heavy or a direct subcell of a cell considered as heavy. We have to deal with the fact that the total estimated number of points $\sum_{C_i \text{ subcell of } C} \widetilde{n}_{C_i}$ in the subcells of $C$ can exceed the estimated number of points $\widetilde{n}_C$ in $C$. To avoid this we compute new integral estimates $E_C$ for the number of points in each useful cell $C$, which are still guaranteed to be close to the real value $n_C$ and which are consistent with the values $E_{C_i}$ of the subcells of $C$. We do this by first computing upper and lower bounds $U_C$ and $L_C$ on $n_C$ for all useful cells. We then adjust these bounds to be consistent with the bounds for the subcells. Finally we use the bounds to compute the new estimates $E_C$.

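For $i > \log(2/\delta)$ and every cell $C \in \mathcal{G}_i$ considered as heavy we define $L_C = \lceil \widetilde{n}_C/(1+\epsilon) \rceil$ and $U_C = \lfloor \widetilde{n}_C/(1-\epsilon) \rfloor$. For $i > \log(2/\delta)$ and every cell $C \in \mathcal{G}_i$ considered as light we define $L_C = 0$ and $U_C = \lfloor \delta 2^i \rfloor$. For $i \le \log(2/\delta)$ and every cell $C \in \mathcal{G}_i$ we define $L_C = \widetilde{n}_C$ and $U_C = \widetilde{n}_C$ (since we know the number of points in $C$ exactly). Using these definitions we know for every cell that $L_C \le n_C \le U_C$.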

The estimates $E_C$ can be computed bottom-up by adjusting the bounds $L_C$ and $U_C$ in cases of conflicts:

We first compute new lower and upper bounds $L_C$ and $U_C$ for all useful cells bottom-up. We look at the smallest cell $C$ considered as heavy. Let $C_i$, $i \in \{1,\dots,2^d\}$, be its subcells. If $\sum_{i=1}^{2^d} L_{C_i} > L_C$, we set $L_C := \sum_{i=1}^{2^d} L_{C_i}$. If $\sum_{i=1}^{2^d} U_{C_i} < U_C$, we set $U_C := \sum_{i=1}^{2^d} U_{C_i}$. After the assignment $L_C \le n_C \le U_C$ still holds. We use this technique for all cells considered as heavy (in the order of increasing size), getting better bounds $L_C$ and $U_C$. From these bounds we then compute the values $E_C$ top-down. Since the bounds $L_C$ and $U_C$ are always at least as strong as the bounds of the subcells, we can always easily find integral values $E_C$ satisfying $L_C \le E_C \le U_C$ and $\sum_{i=1}^{2^d} E_{C_i} = E_C$.

Corollary 5.6.17 Assume that the statements of Lemma 5.6.14 are true for all grids and all cells. Then for each cell $C$ identified as heavy we have $(1-4\epsilon)\,n_C \le E_C \le (1+4\epsilon)\,n_C$.

For each cell $C \in \mathcal{G}_i$ with $i \le \log(2/\delta)$ we have $E_C = n_C$.

All estimates $E_C$ are integral and consistent with the estimates $E_{C_i}$ for the subcells $C_i$ of $C$.

Proof : Exactly as the proof of Corollary 5.6.5. □

We apply the algorithm described in Section 5.4 to our estimates $E_C$ and compute a coreset.

Lemma 5.6.18 If
\[ \delta \le \frac{\epsilon^{d+1}\cdot\lambda^{d}\cdot Opt}{(10\ell)^{d+1}\cdot(1+\log n)\cdot d^{(d+1)/2}} \]
and $\epsilon < 1/12$, the coreset computed with respect to the values $E_C$ is a $67\frac{\ell}{\lambda}\epsilon$-coreset of $P$ with probability $1-\rho$.

Proof : Let $P'$ be a point set that is distributed according to our estimates $E_C$ (so for every useful cell $C$ we have $|P' \cap C| = E_C$). The proof of Lemma 5.4.4 shows that the coreset computed by our algorithm is an $\epsilon$-coreset for $P'$. Let $Q = \{q_1,\dots,q_m\}$ be the computed coreset points. Exactly as in the proof of Lemma 5.6.6 we can show that if we knew the point sets $P$ and $P'$, we could (using our coreset method) compute mappings $\gamma: P \to Q$ and $\gamma': P' \to Q$ and corresponding weight functions $w: Q \to \mathbb{N}$ and $w': Q \to \mathbb{N}$, such that $(Q,w)$ is an $\epsilon$-coreset for $P$, $(Q,w')$ is an $\epsilon$-coreset for $P'$, and for all $q \in Q$ we have:

\[ (1-5\epsilon)\,w(q) \le w'(q) \le (1+5\epsilon)\,w(q). \tag{5.6} \]

This guarantees that each solution on $(Q,w)$ can differ from a solution on $(Q,w')$ by at most $10\frac{\ell}{\lambda}\epsilon\cdot Opt(Q,w)$:

Consider a movement of points from $(Q,w)$ to $(Q,w')$. The total movement increases when we first move each moved point to the center of gravity $\mu$ and then in a second step to its destination. Since $w'(q) \ge (1-5\epsilon)\,w(q)$, we know that for each $q \in Q$ at most $5\epsilon\,w(q)$ points are moved away from $q$. The total movement of the first step therefore is smaller than

\[ \sum_{q\in Q} 5\epsilon\,w(q)\,d(q,\mu). \]

Since the considered problem is $\ell$-Lipschitz, this movement changes the objective function by at most

\[ \ell\cdot\sum_{q\in Q} 5\epsilon\,w(q)\,d(q,\mu). \]

Now we look at the second movement step. Since $w'(q) \le (1+5\epsilon)\,w(q)$ we know that for each $q \in Q$ at most $5\epsilon\,w(q)$ points are moved from the center of gravity to $q$. The total movement of the second step must therefore be smaller than

\[ \sum_{q\in Q} 5\epsilon\,w(q)\,d(q,\mu), \]

which changes the objective function again by at most

\[ \ell\cdot\sum_{q\in Q} 5\epsilon\,w(q)\,d(q,\mu). \]

Since the problem is $\lambda$-mean preserving we conclude that the total change of the objective function caused by both movements is at most

\[ 10\ell\epsilon\cdot\sum_{q\in Q} w(q)\,d(q,\mu) \le 10\frac{\ell}{\lambda}\epsilon\cdot Opt(Q,w). \tag{5.7} \]

We constructed different mappings proving coreset properties for different sets. We use the notation $E: Q \to \mathbb{N}$ for the coreset weight function computed by our sampling algorithm and $m: \gamma(P) \to \gamma'(P)$ for the mapping which moves points as shown in the last paragraph.

We know the following facts:

1. $\gamma: P \to Q$ proves that $(Q,w)$ is a coreset for $P$.

2. $m: \gamma(P) \to \gamma'(P)$ changes the cost of a solution by at most $10\frac{\ell}{\lambda}\epsilon\cdot Opt(\gamma(P))$.

3. $\gamma': P' \to Q$ proves that $(Q,w')$ is a coreset for $P'$.

4. $\alpha: P' \to Q$ (as constructed by our algorithm) proves that $(Q,E)$ is a coreset for $P'$.

Let $P = (p_1,\dots,p_n)$ be the set of input points and $s \in S_\Pi(n)$ a solution of the oblivious optimization problem $\Pi$. The mapping $\gamma\circ m\circ\gamma'^{-1}\circ\alpha: P \to \alpha(\gamma'^{-1}(m(\gamma(P))))$ proves that $(Q,E)$ is a $67\frac{\ell}{\lambda}\epsilon$-coreset for $P$:

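We write $cost(P)$ for $cost^{(n,s)}_\Pi(P)$ and $Opt(P)$ for $Opt_\Pi(P)$ and conclude:

\begin{align}
& cost(\alpha(\gamma'^{-1}(m(\gamma(P))))) \tag{5.8} \\
&\in cost(\gamma'^{-1}(m(\gamma(P)))) \pm \epsilon\cdot Opt(\gamma'^{-1}(m(\gamma(P)))) \tag{5.9} \\
&\subset cost(m(\gamma(P))) \pm 2\epsilon\cdot Opt(\gamma'^{-1}(m(\gamma(P)))) \tag{5.10} \\
&\subset cost(m(\gamma(P))) \pm 4\epsilon\cdot Opt(m(\gamma(P))) \tag{5.11} \\
&\subset cost(\gamma(P)) \pm \Big( 4\epsilon\cdot Opt(\gamma(P)) + 4\epsilon\cdot 10\tfrac{\ell}{\lambda}\epsilon\cdot Opt(\gamma(P)) + 10\tfrac{\ell}{\lambda}\epsilon\cdot Opt(\gamma(P)) \Big) \tag{5.12} \\
&\subset cost(\gamma(P)) \pm 44\tfrac{\ell}{\lambda}\epsilon\cdot Opt(\gamma(P)) \tag{5.13} \\
&\subset cost(P) \pm \Big( 44\tfrac{\ell}{\lambda}\epsilon\cdot Opt(P) + 44\tfrac{\ell}{\lambda}\epsilon\cdot\epsilon\cdot Opt(\gamma(P)) + \epsilon\cdot Opt(P) \Big) \tag{5.14} \\
&\subset cost(P) \pm 67\tfrac{\ell}{\lambda}\epsilon\cdot Opt(P) \tag{5.15}
\end{align}

where (5.9) comes from fact 4, (5.10) from fact 3, (5.11) holds because of Lemma 5.1.13 and fact 3, (5.12) because of the bounded movement costs of $m$ (fact 2), (5.13) because of $\ell \ge 1$, $\lambda \le 1$, and $\epsilon \le 1/2$, and (5.14) because of fact 1. □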

6 Coresets in Data Streams

In this chapter we introduce algorithms to compute coresets on dynamic geometric data streams. We combine the coreset results of Chapter 5 with the methods of Chapter 3 for sampling items in data streams.

We briefly recapitulate the results of Chapter 5, particularly Section 5.6. To state the results for all problems at once we use the following definitions:

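For the k-median problem we set

\begin{align*}
\delta_0 &:= \frac{\epsilon^{d+1}\cdot Opt}{4\cdot k\cdot 10^{d}\cdot(1+\log n)\cdot d^{(d+1)/2}}, \\
j_0 &:= \lceil \log(n\cdot\widetilde{\Delta}\cdot\sqrt{d}) \rceil, \\
Z &:= \left\lceil \log \frac{4\cdot k\cdot 10^{d}\cdot n(1+\log n)\cdot d^{(d+1)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+1}} \right\rceil + 1, \\
S_C &:= \frac{33\cdot k\cdot 10^{d}\cdot(2+\log n)\cdot d^{d/2}}{\epsilon^{d+1}}, \\
p_i &:= \min\left\{ \frac{\alpha}{\delta\cdot 2^i},\ 1 \right\}, \\
A &:= \frac{193\cdot Z\cdot k\cdot 10^{d}\cdot(1+\log n)\cdot d^{d/2}\cdot\ln(2\cdot Z\cdot 2^{Zd}/\rho)}{\epsilon^{d+3}}, \text{ and} \\
\widetilde{\Delta} &:= \Delta.
\end{align*}

For the k-means problem we set

\begin{align*}
\delta_0 &:= \frac{\epsilon^{d+2}\cdot Opt}{8\cdot k\cdot 33^{d+1}\cdot(1+\log n)\cdot d^{(d+2)/2}}, \\
j_0 &:= \lceil \log(n\cdot\widetilde{\Delta}\cdot\sqrt{d}) \rceil, \\
Z &:= \left\lceil \log \frac{8\cdot k\cdot 33^{d+1}\cdot n(1+\log n)\cdot d^{(d+2)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+2}} \right\rceil + 1, \\
S_C &:= \frac{129(1+\log n)\cdot k\cdot 33^{d+1}\cdot d^{d/2}}{\epsilon^{d+2}}, \\
p_i &:= \min\left\{ \frac{\alpha}{\delta\cdot 4^i},\ 1 \right\}, \\
A &:= \frac{769\cdot Z\cdot k\cdot 33^{d+1}\cdot(1+\log n)\cdot d^{d/2}\cdot\ln(2\cdot Z\cdot 2^{Zd}/\rho)}{\epsilon^{d+4}}, \text{ and} \\
\widetilde{\Delta} &:= \Delta^2.
\end{align*}

For oblivious optimization problems we set

\begin{align*}
\delta_0 &:= \frac{\epsilon^{d+1}\cdot\lambda^{d}\cdot Opt}{(10\ell)^{d+1}\cdot(1+\log n)\cdot d^{(d+1)/2}}, \\
j_0 &:= \lceil \log(n\cdot\widetilde{\Delta}\cdot\sqrt{d}\cdot\ell) \rceil, \\
Z &:= \left\lceil \log \frac{(10\ell)^{d+1}\cdot n(1+\log n)\cdot d^{(d+1)/2}\cdot\widetilde{\Delta}}{\epsilon^{d+1}\cdot\lambda^{d}} \right\rceil + 1, \\
S_C &:= \frac{9\cdot(1+\log n)\cdot d^{d/2}\cdot(10\ell)^{d+1}}{\epsilon^{d+1}\cdot\lambda^{d+1}}, \\
p_i &:= \min\left\{ \frac{\alpha}{\delta\cdot 2^i},\ 1 \right\}, \\
A &:= \frac{49\cdot Z\cdot(10\ell)^{d+1}\cdot(1+\log n)\cdot d^{d/2}\cdot\ln(2\cdot Z\cdot 2^{Zd}/\rho)}{\epsilon^{d+3}\cdot\lambda^{d+1}}, \text{ and} \\
\widetilde{\Delta} &:= \Delta/\lambda.
\end{align*}

For all problems we set

\[ \alpha := 6\cdot\epsilon^{-2}\ln(2\cdot Z\cdot 2^{Z\cdot d}/\rho) \]

and

\[ \rho := \psi/(3j_0), \]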

where $\psi$ is the maximum error probability that our data streaming algorithm is allowed to have.

In Section 5.6 we showed that for a set of $n$ points in $[0,1]^d$ with $Opt \ge 1/\widetilde{\Delta}$ an $\epsilon$-coreset for k-median, k-means and oblivious optimization problems can be extracted using only the following statistics:

• $Z$ nested grids $\mathcal{G}_0,\dots,\mathcal{G}_Z$. Each cell in $\mathcal{G}_i$ has side length $1/2^i$.

• In each grid $\mathcal{G}_i$ a sample $S_i$ of points, each point chosen to be in $S_i$ with probability $p_i$. The sampling is assumed to be $\alpha$-wise independent.

Note that we do not need to know the exact location of each sample point. During our construction we only use the information about the number of sample points in certain cells of the grid. Therefore the information we have to compute in our data stream is the following:

1. The coordinates and size of each cell $C$ occupied by at least one sample point.

2. The number of sample points $|S_i \cap C|$ in each occupied cell $C$.

Notice that according to Lemmas 5.6.4, 5.6.10, and 5.6.16 all information we need can be stored in small space:

Lemma 6.0.19 Let $\delta \ge \delta_0/2$. Then we have points from at most $A$ cells in the union of our sample sets with probability at least $1-\rho/2$. □

The extracted coreset is a coreset according to Lemmas 5.6.6, 5.6.12, and 5.6.18, and its size is small according to Corollaries 5.6.3, 5.6.9, and 5.6.15:

Lemma 6.0.20 The following statements hold with probability at least $1-\rho/2$:

If $\delta \le \delta_0$, we can compute a coreset from only the statistics 1 and 2. The coreset then is an $O(\epsilon)$-coreset of $P$.

If $\delta \ge \delta_0/2$, the size of the computed coreset is at most $S_C$. □

By dividing all point coordinates by $\Delta$, we can ensure that all points lie in $[0,1]^d$. We can also assume that we always have $Opt \ge 1/\widetilde{\Delta}$:

For k-median we know the exact solution when we currently only have $k$ points in the data stream. We can use a data structure of Lemma 3.5.4 to obtain the exact point coordinates in that case. Otherwise $Opt \ge 1/\widetilde{\Delta}$, because we have two points of (scaled) distance at least $1/\Delta = 1/\widetilde{\Delta}$ in one cluster.

For k-means we also know the exact solution when we only have $k$ points in the stream. Otherwise $Opt \ge 1/\Delta^2 = 1/\widetilde{\Delta}$, because we have two points of (scaled) distance at least $1/\Delta$ in one cluster.

For an oblivious optimization problem $\Pi$ we are also able to recover the point set by Lemma 3.5.4 if it consists of only one point. If $P$ consists of at least two points $p$ and $q$, they must have a scaled distance of at least $1/\Delta$. Therefore the total distance of all points to the center of gravity $\mu$ is at least $1/\Delta$. Since $\Pi$ is $\lambda$-mean preserving, we have $Opt \ge \lambda\sum_{p\in P} d(p,\mu) \ge \lambda/\Delta = 1/\widetilde{\Delta}$.

6.1 Insertions

We show how to maintain the statistics 1 and 2 when the data stream consists only of insertions of points. Since we do not know the right value of $\delta$ in advance, we follow the approach of Sections 5.2.3, 5.3.3, and 5.4.3. We start $j_0$ different instances $I_1,\dots,I_{j_0}$ of our algorithm in parallel, instance $I_j$ with a value of $\delta = \delta^{(j)} = c\cdot 2^j$ for a constant $c$ (see Sections 5.2.3, 5.3.3, and 5.4.3 for the definition of $c$).

For each $j$ we maintain at most $A$ cells $C_{j,1},\dots,C_{j,A}$ and the number of sample points $s_{j,l}$ within each cell $C_{j,l}$. For each instance $I_j$ we maintain a value RUNNING$_j$, which is 1 if the instance is still running or 0 if the instance has been stopped, and a value $c_j$ which denotes the number of cells currently stored by the $j$-th instance.

An insert operation of a point pis done as follows:

INSERT(p)
  for each j do
    if RUNNING_j = 1 then
      for each i ∈ {1, ..., Z} do
        Do a biased coin flip with Pr[coin shows heads] = p_i.
        if coin shows heads do
          Let C ∈ G_i denote the cell in grid G_i that contains the point p.
          if ∃ l ∈ {1, ..., c_j} with C_{j,l} = C do
            set s_{j,l} ← s_{j,l} + 1.    // update number of sample points within this cell
          else do
            if c_j > A do
              set RUNNING_j = 0.          // stop storing cells for this j
            else do
              set c_j ← c_j + 1.
              set C_{j,c_j} ← C.          // store cell occupied by point
              set s_{j,c_j} ← 1.          // store number of sample points within this cell
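One instance $I_j$ of the INSERT procedure can be sketched as follows. This is an illustration under assumptions that are not in the text: the class name `InsertOnlyInstance`, the use of a Python dictionary as the hash table of cells, fully independent coin flips from `random.Random` (stronger than the required $\alpha$-wise independence), and points given as coordinate tuples in $[0,1)^d$.

```python
import random
from typing import Dict, List, Tuple

CellKey = Tuple[int, Tuple[int, ...]]  # (grid index i, integer cell coordinates)


class InsertOnlyInstance:
    """State of one instance I_j: sampled cells and their sample counts."""

    def __init__(self, probs: List[float], max_cells: int, rng: random.Random):
        self.probs = probs            # probs[i] plays the role of p_i
        self.max_cells = max_cells    # the bound A on the number of stored cells
        self.rng = rng
        self.running = True           # RUNNING_j
        self.counts: Dict[CellKey, int] = {}  # cell -> s_{j,l}

    def insert(self, point: Tuple[float, ...]) -> None:
        if not self.running:
            return
        for i, p_i in enumerate(self.probs):
            if self.rng.random() >= p_i:
                continue              # coin shows tails: p is no sample point in G_i
            # Cells of grid G_i have side length 1/2^i.
            cell = (i, tuple(int(x * 2 ** i) for x in point))
            if cell in self.counts:
                self.counts[cell] += 1
            elif len(self.counts) > self.max_cells:
                self.running = False  # too many occupied cells: stop this instance
                return
            else:
                self.counts[cell] = 1
```

The dictionary lookup corresponds to the existence query for $C_{j,l} = C$ and takes constant expected time, which matches the hash-table argument used in the running time analysis of the insertion algorithm.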

The choice of $j_0$ ensures that for one $j$ we have $\frac{1}{2}\cdot\delta_0 \le \delta^{(j)} \le \delta_0$ (see Sections 5.2.3, 5.3.3, and 5.4.3). According to Lemma 6.0.19, for this choice of $\delta = \delta^{(j)}$ we have at most $A$ cells occupied by sample points. Therefore we have $c_j \le A$ during the algorithm and RUNNING$_j = 1$. From Lemma 6.0.20 we know that when we extract a coreset from the statistics for this value of $j$, we obtain an $O(\epsilon)$-coreset for the respective problem. We also obtain an $O(\epsilon)$-coreset from all statistics having smaller values of $j$, since they used a smaller value of $\delta = \delta^{(j)}$.

We use the smallest value of $j$ such that RUNNING$_j = 1$. From the corresponding statistics $C_{j,1},\dots,C_{j,c_j}$ and $s_{j,1},\dots,s_{j,c_j}$ we compute our coreset, which is an $O(\epsilon)$-coreset for $P$. If the coreset size is bigger than $S_C$ we know from Lemma 6.0.20 that $\delta^{(j)} < \delta_0/2$. Therefore we can set $j \leftarrow j+1$, still knowing that $\delta^{(j)} \le \delta_0$ and that the computed coreset for this value of $j$ is also an $O(\epsilon)$-coreset. We continue with this method until the size of the constructed coreset is at most $S_C$.

Theorem 18 Given a data stream of point insertions our streaming algorithm maintains a data structure for $\epsilon$-coresets for k-median, k-means, and oblivious optimization problems. At any point of time an $O(\epsilon)$-coreset for the respective problem can be extracted with probability $1-\psi$.

The data structure for k-median needs $\widetilde{O}(k\cdot\log^4(\Delta)\cdot\log(1/\psi)/\epsilon^{d+3})$ space, the data structure for k-means needs $\widetilde{O}(k\cdot\log^4(\Delta)\cdot\log(1/\psi)/\epsilon^{d+4})$ space, and the data structure for oblivious optimization problems needs $\widetilde{O}(\log^4(\Delta)\cdot\log(1/\psi)/\epsilon^{d+3})$ space.

An insertion can be processed in $O(\log(\Delta)\cdot\log(k\cdot\Delta/\epsilon))$ time.

An $\epsilon$-coreset for k-median can be extracted in $O(k\log^3\Delta/\epsilon^{d+1})$ time.

An $\epsilon$-coreset for k-means can be extracted in $O(k\log^3\Delta/\epsilon^{d+2})$ time.

An $\epsilon$-coreset for oblivious optimization problems can be extracted in $O(\log^3\Delta/\epsilon^{d+1})$ time.

Proof : We set $\rho = \psi/(3\cdot j_0)$. By the union bound, Lemma 5.6.2 and Lemma 5.6.4 hold with probability $1-\psi$ for all choices of $j$. Thus with probability $1-\psi$ the returned coreset is an $O(\epsilon)$-coreset.

We will now count the number of memory cells we use. For each of the $j_0 = O(\log\Delta)$ instances we maintain at most $A$ counters. They occupy $j_0\cdot A$ memory cells, each consisting of $O(\log\Delta)$ bits. The respective values of $A$ lead to the stated memory bounds.

When a new point arrives in the stream we process one for-loop for each of the $j_0$ values of $j$ and each of the $Z$ values of $i$. Using hash tables the query $\exists\, l \in \{1,\dots,c_j\}: C_{j,l} = C$ can be answered in constant time. All other operations of the algorithm run in constant time as well. Therefore the time to insert a point is $O(Z\cdot j_0) = O(\log(\Delta)\cdot\log(k\cdot\Delta/\epsilon))$.

We will now bound the coreset extraction time for k-median. Let us first assume we know the right value of $\delta$ in advance. Since (according to the proof of Lemma 5.2.8) there are at most $O(k\log n/\epsilon^{d+1})$ heavy cells in each grid, we have a bound of $O(k\log^2\Delta/\epsilon^{d+1})$ on the number of heavy cells. The marking process leading to the actual coreset can be seen as a quadtree traversal. Since each inner node of this tree corresponds to a heavy cell, the tree traversal can be done in time $O(k\log^2\Delta/\epsilon^{d+1})$. Since we have to do this process for each value of $\delta$ (and then choose the minimum value with a small coreset), we get a total running time of $O(k\log^3\Delta/\epsilon^{d+1})$.

For k-means and oblivious optimization problems the number of heavy cells in each grid is $O(k\log n/\epsilon^{d+2})$ and $O(\log n/\epsilon^{d+1})$, respectively. Therefore we get a total coreset extraction time of $O(k\log^3\Delta/\epsilon^{d+2})$ for k-means and $O(\log^3\Delta/\epsilon^{d+1})$ for oblivious optimization problems. □

6.2 Deletions

When we allow deletions two problems occur. First, when we encounter a DELETE operation of a point $p$, we have to decide if this point is a sample point. We achieve this by replacing the coin flips in our algorithm by the use of an $\alpha$-wise independent hash function $h_{i,j}: \Delta^d \to \langle\lceil \Delta^d/\rho \rceil\rangle$. We choose all points $p$ having $h_{i,j}(p) < p_i\cdot\lceil \Delta^d/\rho \rceil$ as sample points.

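Lemma 6.2.1 The choice of the sample points is done $\alpha$-wise independently. Each point $p \in P$ is chosen to be a sample point with probability $p \in [p_i,\ p_i + \rho/\Delta^d]$. The total statistical difference to the method which samples each point with probability $p_i$ is at most $\rho$.

Proof :

\[ p = \frac{\lceil p_i\cdot\lceil \Delta^d/\rho \rceil \rceil}{\lceil \Delta^d/\rho \rceil} \ge p_i \]

and

\[ p = \frac{\lceil p_i\cdot\lceil \Delta^d/\rho \rceil \rceil}{\lceil \Delta^d/\rho \rceil} \le \frac{p_i\cdot\lceil \Delta^d/\rho \rceil + 1}{\lceil \Delta^d/\rho \rceil} = p_i + \frac{1}{\lceil \Delta^d/\rho \rceil} \le p_i + \frac{\rho}{\Delta^d}. \]

Since we have at most $\Delta^d$ points, the errors in the probability to sample each single point sum up to a total statistical difference of at most $\rho$. □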
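The rounding argument of the proof can be illustrated with exact rational arithmetic. The sketch below is not part of the original text; the function name `rounded_probability` and the example values are hypothetical. It computes the probability with which a uniformly distributed hash value from a range of size $\lceil \Delta^d/\rho \rceil$ falls below the threshold $p_i\cdot\lceil \Delta^d/\rho \rceil$.

```python
from fractions import Fraction
from math import ceil


def rounded_probability(p_i: Fraction, universe: int, rho: Fraction) -> Fraction:
    """Probability that a uniform hash value in {0, ..., m-1} is smaller than
    p_i * m, where m = ceil(universe / rho) and universe stands for Delta^d."""
    m = ceil(Fraction(universe) / rho)
    # Exactly ceil(p_i * m) of the m hash values are strictly below p_i * m.
    hits = ceil(p_i * m)
    return Fraction(hits, m)
```

For instance, with $p_i = 1/3$, $\Delta^d = 16$, and $\rho = 1/10$ the hash range has 160 values, 54 of which lie below the threshold, so the effective sampling probability is $27/80$, which lies between $p_i$ and $p_i + \rho/\Delta^d$ as the lemma states.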

We conclude that for a fixed $j$ our sampling method behaves with probability $1-\rho$ exactly like an $\alpha$-wise independent sampling of each point with probability $p_i$. Since $\rho = \psi/(3j_0)$, with probability $1-\psi/3$ all samplings behave exactly as demanded.

The second problem is that we cannot stop an instance of our algorithm if the number $c_j$ of occupied cells exceeds $A$. For example, it could happen that in the first half of the stream many points are inserted and the number of occupied cells $c_j$ is far too large. But then most of these points can be deleted in the second half of the stream such that we eventually have fewer than $A$ occupied cells. If we want to obtain a coreset at that point of time we have to know the respective statistics 1 and 2.

To overcome this problem we use the data structure of Lemma 3.5.4. The data structure supports update operations on the entries of a high-dimensional vector $x$ and is able to recover the whole vector, as long as the support of the vector is smaller than $S_C$.

Let $j$ be fixed. Let $C_0,\dots,C_t$ with $t = O(\Delta^d)$ denote the set of all cells in all grids $\mathcal{G}_i$ for one fixed $j$. Let $x_i$ denote the number of sample points in the cell $C_i$. When we are able to recover the whole vector $x = (x_0,\dots,x_t)$ after insert and delete operations of sample points, we can reconstruct the statistics 1 and 2. Therefore for each $j$ we use a data structure RECOVER$_j$ of Lemma 3.5.4 with parameters $U = \Theta(\Delta^d)$, $M = \Theta(\Delta^d)$, and error probability parameter $\rho = \psi/(3j_0)$. Then with probability $1-\psi/3$ the structures RECOVER$_j$ work for each $j$.

Let UPDATE$_j$ denote an UPDATE operation on the data structure RECOVER$_j$. We implement INSERT and DELETE operations of points in the following way:

INSERT(p)
  for each j do
    for each i ∈ {1, ..., Z} do
      if h_{i,j}(p) < p_i · ⌈Δ^d/ρ⌉ do    // p is sample point
        Let C_l ∈ G_i denote the cell that contains the point p.
        UPDATE_j(l, 1).

DELETE(p)
  for each j do
    for each i ∈ {1, ..., Z} do
      if h_{i,j}(p) < p_i · ⌈Δ^d/ρ⌉ do    // p is sample point
        Let C_l ∈ G_i denote the cell that contains the point p.
        UPDATE_j(l, −1).

The method ensures that $x_l$ always represents the number of sample points within the cell $C_l$.
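The interplay of hash-based sampling with INSERT and DELETE can be sketched as follows for one fixed $j$. Several parts are assumptions made for illustration only: the hash family is a random polynomial of degree $k-1$ over a prime field (one standard way to obtain $k$-wise independence, not necessarily the family used in the text), the hash range is the field itself instead of $\lceil \Delta^d/\rho \rceil$, and a plain dictionary stands in for the space-efficient RECOVER$_j$ structure of Lemma 3.5.4. All class and method names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

PRIME = (1 << 61) - 1  # Mersenne prime; assumes Delta^d < PRIME


class PolyHash:
    """Random polynomial of degree k-1 over Z_PRIME (a k-wise independent family)."""

    def __init__(self, k: int, rng: random.Random):
        self.coeffs = [rng.randrange(PRIME) for _ in range(k)]

    def __call__(self, key: int) -> int:
        acc = 0
        for c in self.coeffs:  # Horner evaluation modulo PRIME
            acc = (acc * key + c) % PRIME
        return acc


class DynamicInstance:
    """One instance j: the vector x of sample counts per cell, kept explicitly."""

    def __init__(self, probs: List[float], k: int, side: int, rng: random.Random):
        self.probs = probs  # probs[i] plays the role of p_i
        self.side = side    # Delta: coordinates are integers in {0, ..., Delta-1}
        self.hashes = [PolyHash(k, rng) for _ in probs]
        self.x: Dict[Tuple[int, Tuple[int, ...]], int] = defaultdict(int)

    def _key(self, point: Tuple[int, ...]) -> int:
        key = 0
        for c in point:     # encode the point as one integer below Delta^d
            key = key * self.side + c
        return key

    def _update(self, point: Tuple[int, ...], sign: int) -> None:
        key = self._key(point)
        for i, (p_i, h) in enumerate(zip(self.probs, self.hashes)):
            if h(key) < p_i * PRIME:  # the same decision on insert and delete
                cell = (i, tuple(c * 2 ** i // self.side for c in point))
                self.x[cell] += sign
                if self.x[cell] == 0:
                    del self.x[cell]

    def insert(self, point: Tuple[int, ...]) -> None:
        self._update(point, +1)

    def delete(self, point: Tuple[int, ...]) -> None:
        self._update(point, -1)
```

Because the sampling decision depends only on the hash value of the point, a deletion undoes exactly the updates of the matching insertion, so the stored vector always reflects the sample of the current point set.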

Theorem 19 Given a data stream of point insertions and deletions our streaming algorithm maintains a data structure for $\epsilon$-coresets for k-median, k-means, and oblivious optimization problems. At any point of time an $O(\epsilon)$-coreset for the respective problem can be extracted with probability $1-\psi$.

The data structure for k-median needs $\widetilde{O}(k\cdot\log^6(\Delta)\cdot\log(\Delta/\psi)/\epsilon^{d+3})$ space, the data structure for k-means needs $\widetilde{O}(k\cdot\log^6(\Delta)\cdot\log(\Delta/\psi)/\epsilon^{d+4})$ space, and the data structure for oblivious optimization problems needs $\widetilde{O}(\log^6(\Delta)\cdot\log(\Delta/\psi)/\epsilon^{d+3})$ space.

Insertions and deletions of points can be processed in $\widetilde{O}(k\cdot\log^6(\Delta)\cdot\log(\Delta/\psi)/\epsilon^{d+3})$ time for k-median, in $\widetilde{O}(k\cdot\log^6(\Delta)\cdot\log(\Delta/\psi)/\epsilon^{d+4})$ time for k-means, and in $\widetilde{O}(\log^6(\Delta)\cdot\log(\Delta/\psi)/\epsilon^{d+3})$ time for oblivious optimization problems.

An $\epsilon$-coreset for k-median can be extracted in $\widetilde{O}(k\log^5\Delta\cdot\log(\Delta/\psi)/\epsilon^{d+3})$ time.

An $\epsilon$-coreset for k-means can be extracted in $\widetilde{O}(k\log^5\Delta\cdot\log(\Delta/\psi)/\epsilon^{d+4})$ time.

An $\epsilon$-coreset for oblivious optimization problems can be extracted in $\widetilde{O}(\log^5\Delta\cdot\log(\Delta/\psi)/\epsilon^{d+3})$ time.

Proof : With probability $1-\psi/3$ the hash functions doing the sampling behave like sampling each point $\alpha$-wise independently with probability $p_i$. We condition on this event.

We used data structures RECOVER$_j$ that work with probability $1-\psi/3$ for all $j$. We condition on this event.

In that event we can recover the whole vector $x$, which represents exactly the statistics 1 and 2. Therefore we can construct an $O(\epsilon)$-coreset. The coreset construction works with probability $1-\psi/3$ according to Lemma 5.6.2 and Lemma 5.6.4.

Let us now bound the space needed. For $j_0$ values of $j$ we have to store the data structure RECOVER$_j$ of Lemma 3.5.4, each using space

\[ O\big(A\cdot(\log A+\log(1/\rho))\cdot\log^2\Delta\big) = \widetilde{O}\big(A\cdot\log^2(\Delta)\cdot\log(\Delta/\psi)\big). \]

Therefore the total space to store all data structures RECOVER$_j$ is $\widetilde{O}\big(A\cdot\log^3(\Delta)\cdot\log(\Delta/\psi)\big)$. The space consumption of the hash functions is negligible compared to the space consumption of the RECOVER$_j$ structures. We obtain the stated memory bounds by inserting the right values of $A$.

To process an update we have to process $j_0\cdot Z$ update operations on RECOVER structures, each taking $O(A\cdot(\log A+\log(1/\rho))\cdot\log\Delta)$ time. Altogether this takes $\widetilde{O}(Z\cdot A\cdot\log^2\Delta\cdot\log(\Delta/\psi))$ time, which dominates the whole processing time. Plugging in the values of $Z$ and $A$ leads to the stated update times.

To extract the coreset for a fixed value of $j$ we first have to extract the statistics 1 and 2 from RECOVER$_j$. This takes $O(A\cdot(\log A+\log(\Delta/\psi))\cdot\log\Delta) = \widetilde{O}(A\cdot\log\Delta\cdot\log(\Delta/\psi))$ time. Plugging in the value of $A$ and multiplying by $j_0 = O(\log\Delta)$ (the number of possible values of $j$), we obtain a total recovery time for k-median of $\widetilde{O}(k\log^5\Delta\cdot\log(\Delta/\psi)/\epsilon^{d+3})$, which dominates the coreset extraction time of $O(k\log^3\Delta/\epsilon^{d+1})$. For k-means and oblivious optimization problems we get the stated results in the same way. □

6.3 Maximum Spanning Tree

In our publication [43] we stated the problem of finding a maximum spanning tree of points in a Euclidean space as an oblivious optimization problem, which is solvable using our coreset method. Although this can be done in general, the problem is not $\ell$-Lipschitz for a constant $\ell$, and therefore the proofs of the previous chapters do not apply. In this section we close this gap to the publication [43] and develop a very simple and more effective method to construct a coreset for the maximum spanning tree problem.

We will first define the maximum spanning tree problem as an oblivious optimization problem.

Definition 6.3.1 (Euclidean MaxSP) The Euclidean maximum spanning tree problem (MaxSP) asks for a spanning tree connecting all points of a given input point set $P \subset \mathbb{R}^d$ of size $n$, such that the total length of all tree edges is maximized.

The problem can be formulated as an oblivious optimization problem:

We look at a complete graph $K_n$ with node set $\{1,\dots,n\}$. A feasible solution $s$ for the MaxSP problem is then a spanning tree of $K_n$. Let $T$ be the set of edges of this spanning tree of $K_n$. The cost of $s$ on $P = \{p_1,\dots,p_n\}$ is then given by

\[ cost^{(n,s)}_{maxtsp}(p_1,\dots,p_n) = \sum_{(a,b)\in T} d(p_a,p_b). \]

This definition of MaxSP as an oblivious optimization problem also defines $\epsilon$-coresets for MaxSP by Definition 5.1.12.

We first show how to construct a coreset when we have access to the whole input point set $P$.

We compute an approximation $\widetilde{B}$ to the biggest extent of the point set in one dimension. Specifically, we set

\[ B := \max_{p,q\in P}\ \max_{i\in\{1,\dots,d\}}\ p^{(i)} - q^{(i)} \]

and assume that we have computed an approximation $\widetilde{B}$ satisfying $B \le \widetilde{B} \le 2\cdot B$.

We then introduce a grid consisting of square cells of side length $\frac{\epsilon\cdot\widetilde{B}}{8\sqrt{d}}$. Notice that at most $(8\cdot\sqrt{d}/\epsilon)^d$ cells in the grid are occupied by points of $P$.

We introduce a coreset point in each non-empty grid cell and map all input points from $P$ lying in a cell $C$ to the corresponding coreset point in $C$.
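The grid construction can be sketched in a few lines. The sketch makes two assumptions that the text leaves open: the coreset point of a cell is placed at the cell center, and the coreset is returned as a dictionary from coreset points to weights. The function name `maxst_coreset` is hypothetical.

```python
from collections import Counter
from math import floor, sqrt
from typing import Dict, Sequence, Tuple


def maxst_coreset(points: Sequence[Tuple[float, ...]], eps: float,
                  extent_estimate: float) -> Dict[Tuple[float, ...], int]:
    """Snap every point to the grid cell of side eps * B~ / (8 * sqrt(d)) that
    contains it and return one weighted coreset point per non-empty cell."""
    d = len(points[0])
    side = eps * extent_estimate / (8 * sqrt(d))
    # Count how many input points fall into each grid cell.
    weights = Counter(tuple(floor(x / side) for x in p) for p in points)
    # Use the cell center as the coreset point (an assumed placement).
    return {tuple((c + 0.5) * side for c in cell): w
            for cell, w in weights.items()}
```

With this placement every input point moves by at most half the cell diagonal, which is below the bound $\epsilon\cdot\widetilde{B}/8$ on the movement of an edge endpoint that the analysis of the construction relies on.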

Lemma 6.3.2 The construction given above constructs an $\epsilon$-coreset for $P$. The coreset size is $O(1/\epsilon^d)$.

Proof: Let $s$ be an arbitrary solution. We denote by $T$ the corresponding spanning tree connecting the points from $P$. $T$ contains fewer than $n$ edges. Since each endpoint of an edge is moved a distance of at most $\epsilon\cdot\tilde{B}/8$, we conclude that the length of each edge changes by at most $\epsilon\cdot\tilde{B}/4$. The total change of the cost of the solution $s$ is therefore at most $n\cdot\epsilon\cdot\tilde{B}/4 \le n\cdot\epsilon\cdot B/2$.

We now show that this change is at most $\epsilon\cdot\mathrm{Opt}_\Pi(P)$ by constructing a spanning tree having large cost. We first connect the two points $p_1$ and $p_2$ of $P$ defining the extent $B$ by an edge. We iterate over the rest of the nodes. A node $q\in P\setminus\{p_1,p_2\}$ is connected to $p_1$ by an edge iff $d(q,p_1)\ge d(q,p_2)$. Otherwise $q$ is connected to $p_2$ by an edge. By the triangle inequality, each of these edges has length at least $d(p_1,p_2)/2$. The cost of the resulting spanning tree is then at least

\[ d(p_1,p_2) + (n-2)\cdot d(p_1,p_2)/2 \ \ge\ n\cdot B/2. \]

Therefore the cost $\mathrm{Opt}_\Pi(P)$ of an optimum solution must be at least $n\cdot B/2$. It follows that the change of the cost of the solution $s$ by mapping points to coreset points is bounded by

\[ n\cdot\epsilon\cdot B/2 \ \le\ \epsilon\cdot\mathrm{Opt}_\Pi(P). \]

The statistics we need to construct a coreset can be easily maintained in a dynamic geometric data stream consisting of insertions and deletions of points from $\{0,\dots,\Delta-1\}^d$ using techniques of the previous sections. Let $\psi$ be the desired error probability.

We start $\lceil\log(d\cdot\Delta)+1\rceil$ parallel instances of a streaming algorithm, each with a different value of $j\in\{0,\dots,\lceil\log(d\cdot\Delta)\rceil\}$. For each instance we set $\tilde{B}_j := 2^j$.

Instance $j$ introduces a square grid $G_j$ of cell side length $\epsilon\cdot\tilde{B}_j/(8\sqrt{d})$. Let $C_{j,1},\dots,C_{j,l}$ be the cells of $G_j$. The instance uses the data structure of Lemma 3.5.4 with values $U=\Delta^d$, $M=\Delta^d$, $A=(8\cdot\sqrt{d}/\epsilon)^d$, and $\delta=\psi/\lceil\log(d\cdot\Delta)+1\rceil$, which is able to recover the support of a large vector $x$ under UPDATE operations on $x$, when the support is smaller than $A$. For each INSERT($p$) operation in the stream, instance $j$ determines the cell $C_{j,i}$ containing the point $p$. It then triggers an UPDATE($i$, 1) operation on its data structure, increasing the value $x_i$ by 1. A DELETE($p$) operation is transformed into an UPDATE($i$, −1) operation on the data structure, decreasing the value $x_i$ by 1. This way we ensure that the vector entry $x_i$ always equals the number of points in $C_{j,i}$.

We can extract a coreset in the following way. We assume that all data structures to recover the vectors work well (which happens by the Union Bound with probability $1-\psi$). We query all data structures of all instances to recover the support of their corresponding vectors. The data structures return FAIL only if the support of their corresponding vector is larger than $A$. This way we can decide which grids contain at most $A$ occupied cells.

We take the instance with the lowest value of $j$ having at most $A$ occupied cells and use the data structure of that instance to recover the corresponding vector $x$. This vector contains the information about the set of occupied cells in $G_j$ and the number of points within each cell. Using that information we construct a coreset by introducing a coreset point within each occupied cell of $G_j$ and setting the weight of each coreset point to the number of input points in the corresponding cell.
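The interplay of the parallel instances can be sketched in Python as follows. This is only an illustration under a strong simplifying assumption: the sparse-recovery data structure of Lemma 3.5.4 is replaced by an exact dictionary of cell counts, so the sketch does not achieve the stated space bound. The class name `MaxSTStreamSketch` and its parameter names are hypothetical.

```python
import math
from collections import defaultdict

class MaxSTStreamSketch:
    """One exact cell-count table per guess B_j = 2**j (stand-in for Lemma 3.5.4)."""

    def __init__(self, grid_size, d, eps):
        self.d, self.eps = d, eps
        # Instances j = 0, ..., ceil(log2(d * grid_size)).
        self.levels = math.ceil(math.log2(d * grid_size)) + 1
        # A: maximum number of occupied cells an instance may report.
        self.capacity = math.ceil((8 * math.sqrt(d) / eps) ** d)
        self.counts = [defaultdict(int) for _ in range(self.levels)]

    def _side(self, j):
        return self.eps * (2 ** j) / (8 * math.sqrt(self.d))

    def _update(self, p, sign):
        for j, table in enumerate(self.counts):
            side = self._side(j)
            cell = tuple(math.floor(x / side) for x in p)
            table[cell] += sign
            if table[cell] == 0:
                del table[cell]        # keep only occupied cells

    def insert(self, p):
        self._update(p, +1)

    def delete(self, p):
        self._update(p, -1)

    def coreset(self):
        """Use the smallest j whose grid has at most A occupied cells."""
        for j, table in enumerate(self.counts):
            if len(table) <= self.capacity:
                side = self._side(j)
                return {tuple((c + 0.5) * side for c in cell): w
                        for cell, w in table.items()}
        return None
```

For example, inserting the point (1, 2) twice and inserting and then deleting (10, 3) leaves a coreset with a single point of weight 2.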

Theorem 20 Given a dynamic geometric data stream of insert and delete operations of points from $\{0,\dots,\Delta-1\}^d$, there is an algorithm that maintains a data structure for $\epsilon$-coresets for MaxST. At any point of time an $\epsilon$-coreset can be extracted with probability $1-\psi$. The size of the coreset is $O(1/\epsilon^d)$. The data structure uses $O\left(\frac{1}{\epsilon^d}\cdot\log\frac{\log\Delta}{\epsilon\cdot\psi}\cdot\log^2\Delta\right)$ space. Insertions and deletions of points can be processed in $O\left(\frac{1}{\epsilon^d}\cdot\log\frac{\log\Delta}{\epsilon\cdot\psi}\cdot\log\Delta\right)$ time.

Proof: We first prove that the returned coreset is indeed an $\epsilon$-coreset for MaxST. We know that there is one instance started with a value of $\tilde{B}_{j_0}$ satisfying $B\le\tilde{B}_{j_0}\le 2\cdot B$. By the argumentation above we have at most $A$ occupied cells in the corresponding grid. If we constructed a coreset from the information of that instance, this would be an $\epsilon$-coreset. However, we construct the coreset from the information of the instance having the smallest value of $j$ and at most $A$ occupied grid cells. Therefore the coreset is constructed from an instance $j$ with value $j\le j_0$. We have either $j=j_0$ or the grid of instance $j$ is even finer than that of instance $j_0$. Therefore the constructed coreset is an $\epsilon$-coreset. Its size is at most $A$.

The space and update time we need is dominated by the space and update time of the $\lceil\log(d\cdot\Delta)+1\rceil$ data structures from Lemma 3.5.4. This leads to the stated space requirements and update times.


7 A Kinetic Data Structure for MaxCut

In this chapter we will focus on clustering moving points as described in the framework of kinetic

data structures (KDS). The framework of kinetic data structures has been introduced by Basch

et al. [12] and it has been used since as the central model of studying geometric objects in

motion, see, e.g., [2, 12, 55, 56] and the references therein. The KDSs are data structures for

maintaining a certain attribute (for example, in the case of a clustering problem, assignment of

the points to the clusters) for a set of continuously moving geometric objects. The main idea

underlying the framework of KDSs is that even if the input objects are moving in a continuous

fashion, the underlying combinatorial structure of the moving objects changes only at discrete

times. Therefore, there is no need to maintain the data structure continuously but rather only

when certain combinatorial events happen: a KDS maintains a configuration function of interest

by watching for updates needed to be performed when an event occurs.

In the kinetic setting, we consider a set of points in $\mathbb{R}^d$ that are continuously moving. Each

point follows a (known) trajectory that is defined by a continuous function of time; for simplicity

of presentation, we will assume that it is a linear function. In other words, every point is moving

with a constant speed along a line; the line and the speed are the parameters of the movement of

a given point. Additionally, we allow the points to change their trajectory, i.e., to perform a flight

plan update.

To measure the quality of a KDS, we will consider the following two most important perfor-

mance measures (for more details, see, e.g., [55, 56]): the time needed to update the KDS when

an event occurs and the bound for the number of events that may occur during the motion. An-

other important measure is the time to handle flight plan updates.

In this chapter we describe a kinetic data structure to maintain a $(1-\epsilon)$-approximation of a maximum cut. Our data structure supports queries of the type

• To which side of the partition does query point $p$ belong?

To support such a query the data structure maintains a subdivision of the space that has complexity $O(\log n/\epsilon^{d+1})$. Each cell of the subdivision is colored red or blue. Every point located in a red cell is red and every point in a blue cell is blue. Then our colored partition into red and blue points is a $(1-\epsilon)$-approximation to the MaxCut.

Our data structure uses two auxiliary kinetic data structures: kinetic tournament trees as defined in Section 7.1 and a data structure to approximate the bounding cube as defined in Section 7.2.


7.1 Kinetic Tournament Trees

In this section we recap a construction of randomized tournament trees from Basch [11]. There are other structures leading to better amortized time bounds (e.g. [16]). We will use the construction from [11] because it deterministically achieves almost the same runtime bounds, but these bounds hold even in the worst case.

A randomized tournament tree is a randomized data structure to maintain the maximum (or the minimum, depending on the application) point from a set $P=\{p_1,\dots,p_n\}$ of $n$ linearly moving points in $\mathbb{R}^1$. It stores all points in the leaves of a binary tree. Inner nodes of the tree always store the bigger point of the two child nodes (and therefore the maximum of the subtree). It also stores an event queue, that is, a priority queue holding the times of the next events when inner nodes will change (when we get a new maximum in the subtree, i.e. when the two children of a node change their order).

When a point stored in an inner node changes, the maximum of the subtree must have changed. The number of times this can happen is bounded by the number of nodes in the subtree. Summing up over all inner nodes we get that the total number of changes of inner nodes is at most $O(n\cdot\log n)$.

When an inner node changes at an event, it can lead to changes of the inner nodes above. Notice that we treat these additional changes as separate events. Therefore each such event can be processed in $O(\log n)$ time (we just have to adapt the event queue and test if we have an immediate event for the parent node as well).

An insertion of a point can be done by inserting a leaf into the tree and adjusting the inner nodes in time $O(\log^2 n)$. A deletion of a point can be done by deleting the respective leaf, adjusting the inner nodes, and moving the rightmost leaf to the position of the deleted leaf to balance the tree; each of these steps takes time $O(\log^2 n)$.
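The static part of this structure can be sketched in a few lines of Python. The sketch below only builds the tree and the initial event queue; it does not process events, insertions, or deletions, and it is not the randomized construction of [11]. Points are represented as hypothetical pairs `(a, b)` with position $a + b\cdot t$, and the tree is stored in an array with the root at index 1.

```python
import heapq

def position(p, t):
    a, b = p
    return a + b * t

def failure_time(winner, loser, t):
    """Time after t at which `loser` overtakes `winner`, or None if it never does."""
    (a1, b1), (a2, b2) = winner, loser
    if b2 <= b1:
        return None                      # the loser is not faster
    t_cross = (a1 - a2) / (b2 - b1)
    return t_cross if t_cross > t else None

def build_tournament(points, t=0.0):
    """Return (tree, events): array-based max-tournament tree and its event heap."""
    size = 1
    while size < len(points):
        size *= 2
    tree = [None] * (2 * size)
    tree[size:size + len(points)] = points        # leaves
    events = []
    for node in range(size - 1, 0, -1):
        left, right = tree[2 * node], tree[2 * node + 1]
        if left is None or right is None:
            tree[node] = left if right is None else right
            continue
        # Ties in position are broken in favour of the faster point.
        if (position(left, t), left[1]) >= (position(right, t), right[1]):
            winner, loser = left, right
        else:
            winner, loser = right, left
        tree[node] = winner
        ft = failure_time(winner, loser, t)
        if ft is not None:
            heapq.heappush(events, (ft, node))    # certificate of this node fails at ft
    return tree, events
```

The smallest entry of `events` identifies the inner node whose stored maximum changes next.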

Theorem 21 ([11]) Let $P$ be an initially empty set of points moving along linear trajectories in $\mathbb{R}^1$. Let $\sigma=\sigma_1,\dots,\sigma_m$ be a sequence of $m$ operations $\sigma_i$ of the form INSERT($p$, $t_i$) and DELETE($p$, $t_i$), such that for any two operations $\sigma_i,\sigma_j$ with $i<j$ we have $t_i<t_j$ (the operations are performed sequentially in time). An INSERT($p$, $t_i$) inserts point $p$ into $P$ at time $t_i$. A DELETE($p$, $t_i$) removes $p$ from $P$ at time $t_i$. A kinetic tournament tree maintains the biggest element of $P$. It requires $O(\log m)$ time to process an event and the expected number of events is $O(m\log m)$. Insertions and deletions are performed in expected time $O(\log^2 m)$.

7.2 Approximating the Bounding Cube

Our data structure uses a data structure from Agarwal, Har-Peled, and Varadarajan [2]:

Theorem 22 ([2]) Let $P$ be a set of $n$ points moving in $\mathbb{R}^d$. If $P$ is moving linearly, then after $O(n)$ preprocessing, we can construct a kinetic data structure of size $O(1)$ that maintains a 2-approximation of the smallest orthogonal box containing $P$. The data structure processes $O(1)$ events, and each event takes $O(1)$ time. The sides of the maintained box are moving linearly between the events.


It can be decided in constant time if a flight plan update of a point $p$ changes the data structure. At each point of time only flight plan updates of a constant number of points can potentially change the data structure.

We use this data structure to efficiently maintain a bounding cube $B$ of $P$ having the following properties.

• All points lie in $B$ during the whole movement of points.

• There is always one dimension such that the extent of the point set in that dimension is at least half the side length of $B$.

• The data structure to maintain the bounding cube processes $O(d^2)$ events.

• Between these events all side borders of the bounding cube are moving linearly.

• Given a flight plan update we can decide in time $O(d)$ if the data structure has to be updated.

• At each point of time only flight plan updates of $O(d)$ points can potentially change the data structure.

For each dimension we maintain a 2-approximation of the 1-dimensional extent using the KDS from [2] (see Theorem 22 above). Using these approximations we can easily maintain an approximation $B$ of a smallest bounding cube of $P$ by maintaining a cube having the largest extent as side length.

Each extent data structure processes $O(1)$ events. Between such events the dimension that defines the size of the bounding cube can change although none of the one-dimensional extent data structures processes an event. This happens when the size of the bounding cube is determined by a new one-dimensional extent. This can happen at most $d$ times. Therefore, we can have at most $O(d^2)$ events.

We store the approximation of each extent in a kinetic tournament tree whose priority is given by the width of the 1-dimensional approximations.

7.3 The Kinetic Data Structure for MaxCut

Our kinetic data structure for MaxCut uses the data structure of the last section to maintain a bounding cube $B$. This data structure processes $O(d^2)$ events, which we call major events.

Between all major events we will set up a data structure to maintain a coreset under a linear movement of the bounding box $B$. We will use the technique described in Section 5.6.3 to construct coresets for oblivious optimization problems using point samples. Recall that MaxCut (with objective function scaled by $1/n$) is an oblivious optimization problem which is $\ell$-Lipschitz with $\ell=1$ and $\lambda$-mean preserving with $\lambda=1/4$.


The coreset technique assumes that all points always lie within $[0,1]^d$ and that we have a lower bound $1/\tilde{\Delta}$ on Opt (recall that Opt denotes the cost of an optimum cut divided by $n$). This can easily be achieved in the following way: Between two major events we know that the movement of the bounding cube $B$ is linear. At each major event we scale and move all coordinates in a way such that the bounding cube $B$ is $[0,1]^d$. Since the scaling function is linear in the time $t$, all point movements are still linear after this scaling.

Since there is always one dimension such that the extent of the point set in that dimension is at least half the side length of the bounding cube, there are two points at distance at least $1/2$ from each other. By the triangle inequality, the sum of the distances of these two points to the center of gravity is at least $1/2$. Since MaxCut is $1/4$-mean preserving, we know that $\mathrm{Opt}\ge 1/\tilde{\Delta}$ for $\tilde{\Delta}=8$.

In the following we assume that all points lie in $[0,1]^d$ and that the bounding cube $B=[0,1]^d$ does not move.

We will maintain $Z$ nested square grids $G_0,\dots,G_Z$ for

\[ Z = \log\left(\frac{8\cdot 10^{d+1}\cdot 4^d\cdot n(1+\log n)\cdot d^{(d+1)/2}}{\epsilon^{d+1}}\right)+1. \]

Grid $G_i$ partitions $[0,1]^d$ into $2^{id}$ cells of side length $1/2^i$. We assume that these cells are numbered from 1 to $2^{id}$, that the index of a cell can be computed efficiently from its coordinates, and that the neighbors of a cell can also be computed efficiently.

Let $j_0 := \lfloor\log(8\cdot n\cdot\sqrt{d})\rfloor$. For $j\in\{0,\dots,j_0\}$ let

\[ \delta=\delta(j) = \frac{\epsilon^{d+1}}{8\cdot(1+\log n)\cdot d^{(d+1)/2}\cdot 40^{d+1}}\cdot 2^j. \]

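For each value of $j\in\{0,\dots,j_0\}$ and each grid $G_i$ we create a sample $S_{i,j}$, each point chosen independently with probability $p_{i,j}=\min\{\frac{\alpha}{\delta\cdot 2^i},1\}$, where $\alpha=6\cdot\epsilon^{-2}\cdot\ln(2\cdot Z\cdot 2^{Z\cdot d}\cdot j_0/\psi)$.

To simplify the notation we set

\[ K := \frac{6\cdot\ln(2\cdot Z\cdot 2^{Z\cdot d}\cdot j_0/\psi)\cdot 8\cdot(1+\log n)\cdot d^{(d+1)/2}\cdot 40^{d+1}}{\epsilon^{d+3}} = \frac{\alpha\cdot 2^j}{\delta}. \]

Notice that $K$ is polylogarithmic in $n$ and does not depend on $i$ or $j$, and that

\[ p_{i,j} = \min\left\{\frac{K}{2^{i+j}},\,1\right\}. \]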
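As a small illustration of the sampling step, the following Python sketch draws the samples $S_{i,j}$ for a given value of $K$. The function name `build_samples`, the explicit `seed` parameter, and the index ranges $0\le i\le Z$, $0\le j\le j_0$ are assumptions of the sketch; $K$ is passed in directly rather than computed from $\epsilon$ and $\psi$.

```python
import random

def build_samples(points, K, Z, j0, seed=0):
    """Return samples[(i, j)]: each point kept with probability min(K / 2**(i + j), 1)."""
    rng = random.Random(seed)          # fixed seed only to make the sketch reproducible
    samples = {}
    for i in range(Z + 1):
        for j in range(j0 + 1):
            prob = min(K / 2 ** (i + j), 1.0)
            # Independent coin flip per point and per pair (i, j).
            samples[(i, j)] = [p for p in points if rng.random() < prob]
    return samples
```

For coarse grids and small $j$ the probability is capped at 1, so those samples contain every point; the samples shrink geometrically as $i+j$ grows.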

Following the ideas of Section 5.4.3 there is at least one value of $\delta(j)$ such that the coreset constructed using this value of $\delta=\delta(j)$ has size at most $S=O(\log n/\epsilon^{d+1})$ (Corollary 5.6.15) and is an $O(\epsilon)$-coreset for MaxCut (Lemma 5.6.18). We can easily identify such a value of $\delta$ by taking the coreset constructed with the smallest value of $\delta$ such that the size is at most $S$.

For each $1\le i\le Z$, we maintain under the movement of points

• the set of all cells containing sample points, and

• the number of sample points in each non-empty cell.

A $(1+\epsilon)$-approximate solution on the coreset can be efficiently computed using the algorithm of Section 5.5.3 in time $O(\log^2 n\cdot 2^{(1/\epsilon)^{O(1)}})$.

The data structure. At each major event we calculate a new linear scaling of the movements of points, such that all points lie in the scaled bounding cube $B=[0,1]^d$ as described earlier. We will then set up the following data structure and maintain it until the next major event triggers.

We assume that the cells in grid $G_i$ are numbered from 1 to $2^{id}$. For each sample set $S_{i,j}$ we maintain a search tree $T_{i,j}$ that stores the cells in grid $G_i$ that contain at least one point from $S_{i,j}$. For each non-empty cell we maintain $2d$ kinetic tournament trees. For $1\le k\le d$ we maintain one kinetic max-tournament tree and one kinetic min-tournament tree, where the priority of points is given by their $k$-th coordinate. To implement these tournament trees we use kinetic tournament trees that efficiently support insertions and deletions of points (see Section 7.1).

The events. In addition to the events caused by the kinetic tournament trees, our data structure stores the following (possible) events in a global event queue. For each grid $G_i$ and each non-empty cell we have an event for each dimension $k$, $1\le k\le d$, when the maximum or minimum point with respect to that dimension crosses the corresponding cell boundary in that dimension. These events are called minor events. Additionally, we have major events that occur when an event changes the movement of the bounding cube. At major events the whole data structure is updated as defined above.

Every event in a kinetic tournament tree is called a tournament event.

Time to process events. We first consider minor events. In every minor event, a point $p$ in some set $S_{i,j}$ moves from one cell $C_1$ of the grid into another cell $C_2$. Therefore, $p$ is deleted from the $2d$ tournament trees corresponding to $C_1$ and is inserted into the $2d$ tournament trees corresponding to $C_2$. If the point moves into a cell that was previously empty, we must insert the cell index of $C_2$ into the search tree $T_{i,j}$ and initialize the $2d$ tournament trees. If $p$ was the only point in $C_1$ we have to delete the $2d$ tournament trees. Since (cf. Section 7.1) one can insert a point in a tournament tree or search tree in $O(\log^2 n)$ time and since any insertion in a kinetic tournament tree creates $O(\log n)$ new events in expectation, we get:

Lemma 7.3.1 Any minor event can be processed in $O(d\log^2 n)$ time. It creates $O(d\log n)$ new events in kinetic tournament trees in expectation.

Next we want to analyze the running time required to set up the whole data structure at major events.

Lemma 7.3.2 Any major event can be processed in expected time $O(d\cdot K\cdot n\cdot\log n)$.

Proof: The time to set up our data structure at a major event is dominated by the time to set up the kinetic tournament trees for the boundary events. Since each kinetic tournament tree consisting of $m$ points can be constructed in time $O(m\cdot\log m)$ we have to count the number of sample points in all kinetic tournament trees. Each sample point is inserted into $2d$ kinetic tournament trees. The expected number of points in $S_{i,j}$ is $Kn/2^{i+j}$. By linearity of expectation we get that the expected total number of points in all kinetic tournament trees is

\[ \sum_{i,j} 2d\cdot\frac{Kn}{2^{i+j}} = O(dKn). \]
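The last bound follows from two geometric series. As a worked sketch, assuming the indices range over $0\le i\le Z$ and $0\le j\le j_0$ (the exact ranges do not affect the asymptotic bound):

```latex
\[
\sum_{i=0}^{Z}\sum_{j=0}^{j_0} 2d\cdot\frac{Kn}{2^{i+j}}
  = 2dKn\left(\sum_{i=0}^{Z}2^{-i}\right)\left(\sum_{j=0}^{j_0}2^{-j}\right)
  \le 2dKn\cdot 2\cdot 2
  = 8\,dKn .
\]
```

Since no tree holds more than $n$ points, building all trees then costs $O(\log n)$ per stored point, which gives the $O(d\cdot K\cdot n\cdot\log n)$ bound of the lemma.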

Remark 7.3.3 Instead of setting up the data structure at major events we can precompute the major events in time $O(d^2\cdot n)$. Then in time $O(d^2\cdot n)$ we can precompute all times of major events and the corresponding scaled positions and velocities of the points. From that we can set up all data structures at the beginning instead of setting them up at major events. Using this technique we get a total expected setup time of $O(d^3\cdot K\cdot n\cdot\log n)$ and can process major events in time $O(1)$ by just switching to the precomputed structure.

Analysis of the number of events. In Section 7.2 we showed:

Lemma 7.3.4 There are at most $O(d^2)$ major events. □

Since (after scaling the points and velocities at major events) the bounding cube $B=[0,1]^d$ is fixed between major events, it follows that for every grid $G_i$ the boundaries of the grid cells are fixed as well between major events. Therefore we get:

Lemma 7.3.5 During the linear motion between major events every point crosses at most $d\cdot(2^i-1)$ cells in grid $G_i$.

Proof: Let us consider an arbitrary point $p$. We regard the cell boundaries in each dimension separately. In grid $G_i$ we have $2^i-1$ internal boundaries. Since $p$ moves linearly in time, $p$ can cross each boundary at most once. Since this can happen in each of the dimensions, the lemma follows. □
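The counting argument can be checked numerically. The following Python sketch (the function name `count_crossings` is hypothetical) counts, for a point moving linearly inside $[0,1]^d$, how many internal boundaries of grid $G_i$ lie strictly between its start and end position in each dimension.

```python
def count_crossings(start, velocity, t_end, i):
    """Number of internal boundaries of grid G_i crossed during [0, t_end]."""
    crossings = 0
    for x0, v in zip(start, velocity):
        x1 = x0 + v * t_end
        lo, hi = min(x0, x1), max(x0, x1)
        # Internal boundaries in one dimension lie at k / 2**i, k = 1, ..., 2**i - 1.
        # Motion is monotone per coordinate, so each boundary is crossed at most once.
        crossings += sum(1 for k in range(1, 2 ** i) if lo < k / 2 ** i < hi)
    return crossings
```

A point travelling almost diagonally across the unit square in $G_2$ crosses all $2\cdot(2^2-1)=6$ internal boundaries, which matches the bound of Lemma 7.3.5.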

Corollary 7.3.6 The expected number of minor events is $O(d^3\cdot K\cdot n\cdot Z)$.

Proof: The expected number of minor events involving points from $S_{i,j}$ is at most $\frac{Kn}{2^{i+j}}\cdot d\cdot 2^i = dKn/2^j$ between two major events. Summing up over all $i,j$ and multiplying by the number $O(d^2)$ of major events, we get that there are at most $O(d^3\cdot K\cdot n\cdot Z)$ events. □

Corollary 7.3.7 The expected number of tournament events is $O(d^4\cdot K\cdot n\cdot Z\cdot\log n)$.

Proof: Every minor event creates $O(d\log n)$ new events in kinetic tournament trees. Linearity of expectation implies that the expected number of events in kinetic tournament trees is $O(d^4\cdot K\cdot n\cdot Z\cdot\log n)$. □


Flight plan updates. In KDS it is typically assumed that at certain points of time the “flight plan” of an object can change. At every such flight plan update the data structure is notified that a point now moves in another direction (possibly at a different speed). Such a flight plan update compels us to update all events in the event queue that involve this particular point. In our case we distinguish between two types of points. First, there is a constant number of points on which flight plan updates change the data structure maintaining the bounding cube. If the movement of one of these points is changed we have to set up a whole new data structure in time $O(d^3\cdot K\cdot n)$.

If the flight plan of any other point is updated, the movement of the bounding cube and the scaling of the points stay the same. We simply have to update all events the point is involved in. Since it requires $O(\log^2 n)$ time to update a randomized kinetic tournament tree we have to compute the expected number of such kinetic tournament trees a point is involved in. Every point is stored in $2d$ kinetic tournament trees for each set $S_{i,j}$ it is contained in. We get

Lemma 7.3.8 Let $p\in P$ be an arbitrary point. Then $p$ is stored in $O(dK)$ kinetic tournament trees in expectation.

Proof: The probability that $p$ is contained in set $S_{i,j}$ is at most $K/2^{i+j}$. Hence, we get that the expected number of sets $S_{i,j}$ that contain $p$ is

\[ \sum_{i,j}\frac{K}{2^{i+j}} = \sum_{i}\frac{K}{2^i}\sum_{j}\frac{1}{2^j} \le \sum_{i}\frac{2K}{2^i} = O(K). \]

The lemma follows from the fact that every point is stored in $2d$ kinetic tournament trees for each set $S_{i,j}$ it is contained in. □

Assume we fix some point of time and specify for each point an arbitrary flight plan update. If we choose one of these updates uniformly at random then the expected time to perform the update is small, i.e., the average cost of a flight plan update is low.

Lemma 7.3.9 A flight plan update can be done in $O(d\cdot(d^3+\log n)\cdot\log n\cdot K)$ average expected time.

Proof: It requires $O(\log^2 n)$ time to do a flight plan update of a kinetic tournament tree. In expectation every point is stored in $O(dK)$ kinetic tournament trees. Hence the expected time required to update these tournament trees is $O(dK\log^2 n)$. Additionally, we have to deal with updates of the $O(d)$ points that change the bounding cube structure. We can process such an update in $O(d^3\cdot K\cdot n\cdot\log n)$ time. Averaging over all points we get that the average expected update time for this kind of update is $O(\frac{d}{n}\cdot d^3Kn\log n)=O(d^4\cdot K\cdot\log n)$. □

Extracting the coreset. We describe how to extract a small coreset from our data structure. We try to find a value of $j$ such that the coreset constructed with the value $\delta=\delta(j)$ has size at most $S=O(\log n/\epsilon^{d+1})$ (Corollary 5.6.15), and such that the size of the coreset constructed with $\delta=\delta(j-1)$ is greater than $S$. By Corollary 5.6.15 and Lemma 5.6.18 this coreset is then with probability $1-\psi/j_0$ an $\epsilon$-coreset for MaxCut.

Testing the coreset size for one value of $j\in\{0,\dots,j_0\}$ and extracting one coreset of size $S$ can be done in time $O(\log^2 n/\epsilon^{d+1})$: Since (according to the proof of Lemma 5.4.7) there are at most $O(\log n/\epsilon^{d+1})$ heavy cells in each grid $G_i$, we have a bound of $O(\log^2 n/\epsilon^{d+1})$ on the total number of heavy cells. The marking process leading to the actual coreset can be seen as a quadtree traversal. Since each inner node of this tree corresponds to a heavy cell, the tree traversal can be done in time $O(\log^2 n/\epsilon^{d+1})$. When we see that we have too many heavy cells we stop the process.

We can search for the right value of $j$ using a binary search. This takes a total time of $O(\log^2 n\cdot\log\log n/\epsilon^{d+1})$. Since we construct fewer than $j_0$ coresets, the whole process works with probability $1-\psi$.

Computing an approximate MaxCut from the coreset. According to Section 5.5.3 we can compute a $(1\pm\epsilon)$-approximate MaxCut solution from the coreset in $O(\log^2 n\cdot 2^{(1/\epsilon)^{O(1)}})$ time.

We can apply this algorithm to the computed coreset and obtain our main result. In the theorem we assume that $d$ is a constant.

Theorem 23 There is a kinetic data structure that maintains a $(1+\epsilon)$-approximation for the Euclidean MaxCut problem, which is correct with probability $1-\psi$. The data structure answers queries of the form 'to which side of the partition does query point $p$ belong?' in $O(\log^2 n\cdot\log\log n\cdot\epsilon^{-2(d+1)}\cdot 2^{1/\epsilon^{O(1)}})$ time. Under linear motion the data structure processes an expected number of $\tilde{O}(n\log(\psi^{-1})/\epsilon^{d+3})$ events, each of which requires $O(\log^2 n)$ time. A flight plan update can be performed in $\tilde{O}(\log^4 n\cdot\log(\psi^{-1})/\epsilon^{d+3})$ average expected time, where the average is taken over the worst case update times of the points at an arbitrary point of time. The data structure needs an expected setup time of $\tilde{O}(n\cdot\log(\psi^{-1})/\epsilon^{d+3})$. □


8 An Efficient k-Means Implementation using Coresets

In this chapter we develop an efficient k-means clustering algorithm called CoreMeans. The

main new idea is that the algorithm uses the coreset construction of Section 5.3 to speed up the

computation of the clustering. Our algorithm first computes a small coreset of the input points

and then runs a variant of KMHybrid [104, 83], which is a combination of Lloyd’s algorithm and

random swaps. Then the algorithm doubles the size of the coreset and runs for a few steps on this

coreset. This process is done until the coreset coincides with the whole point set. The coreset

computation is supported by a quadtree (or higher dimensional equivalent) based data structure.

This data structure can also be used to speed up nearest neighbor queries.

We will compare our algorithm with algorithm KMHybrid [104, 83] in Section 8.3. On most of the input instances our algorithm significantly outperforms KMHybrid, especially for low dimensional instances. For high dimensional instances our algorithm finds good solutions faster but KMHybrid’s solution after a few seconds is slightly better. If we want to compute a clustering for one value of $k$, the running time of both algorithms is often dominated by the setup time to compute auxiliary data structures. In this case CoreMeans benefits from its smaller setup time.

However, in many applications we do not know the right value of $k$ in advance. In such a case one has to compute clusterings for many different values of $k$. Then one can use a quality measure independent of $k$ to find the best clustering. A prominent quality measure for such a scenario is the average silhouette coefficient [86]. Although there are no theoretical guarantees for the average silhouette coefficient, it is often used to evaluate the quality of different clusterings. Unfortunately, computing the average silhouette coefficient for one clustering takes time quadratic in the number of points, which is not feasible for point sets of medium and large size. However, we would like to compute the silhouette coefficient for many values of $k$.

In this situation we can see the real strength of coresets. Using coresets it is possible to find clusterings and compute their average silhouette coefficient for large point sets and many values of $k$. For example, we computed clusterings for $k=1,\dots,100$ and approximated their average silhouette coefficient for a set of more than 4.9 million points in 3D consisting of the RGB values of an image in a few seconds on one core of an Intel Pentium D dual core processor with 3 GHz core frequency. In higher dimensions we did the same computations for an (artificially created) point set of 300,000 points in 20 dimensions for all values of $k$ between 1 and 100 in less than 8 minutes. Without coresets, the computation of the silhouette coefficient even for one value of $k$ takes several hours.

First, we develop some notation and introduce the basic k-means method.


8.1 Definitions and Notations

In this chapter we will deal with weighted and unweighted sets of points. We will always assume that the set $P$ represents the input instance, which is an unweighted set of $n$ points in $\mathbb{R}^d$. A weighted point set will usually be denoted by $R$. The weights of the points in $R$ are given by $w(r)$ for every $r\in R$. We will only consider integer weights in this chapter. For each point $p\in\mathbb{R}^d$ let $p^{(j)}$ denote its $j$-th coordinate.

Recall the definition of the k-means clustering problem from Section 2.3.

Two easy characterizations of an optimal k-means solution are known. We state them in

Lemma 8.1.1 and Lemma 8.1.2:

Lemma 8.1.1 Let $C=\{c_1,\dots,c_k\}$ be a set of fixed cluster centers with $|C|=k$. Let $C_1,\dots,C_k$ be a partition of $P$ which fulfills

\[ p\in C_i \ \Rightarrow\ \forall j\in\{1,\dots,k\}:\ d(p,c_i)\le d(p,c_j). \]

Then the partition $C_1,\dots,C_k$ minimizes the $k$-means objective function for the fixed set of centers $C$. □

We write Means($P$, $C$) to denote the cost of an optimal partition of $P$ with respect to the centers in $C$. In a similar way we write Means($R$, $C$) to denote the cost of an optimal partition of a weighted set $R$ with respect to $C$.

The other way around we can easily find the best centers for a given partition. The result is

also well known (and proven in many publications like [94]).

Lemma 8.1.2 Let $C_1,\dots,C_k$ be a fixed partition of $P$. For all $i\in\{1,\dots,k\}$ let $c_i := \mu(C_i)$ be the center of gravity of $C_i$ as defined in Section 2.1. Then the centers $c_1,\dots,c_k$ minimize the $k$-means objective function for the fixed partition $C_1,\dots,C_k$ of $P$.

Proof: We can do the minimization within each cluster separately. Therefore we have to show for a set of points $Q$ that the function $f:\mathbb{R}^d\to\mathbb{R}$ with $f(r)=\sum_{q\in Q}d(q,r)^2$ is minimized at $r=\mu(Q)$.

We have

\[ f(r) = \sum_{q\in Q} d(q,r)^2 = \sum_{q\in Q}\sum_{i=1}^{d}\left(q^{(i)}-r^{(i)}\right)^2 = \sum_{i=1}^{d}\sum_{q\in Q}\left(q^{(i)}-r^{(i)}\right)^2. \]

Therefore we can minimize each dimension separately: For $i\in\{1,\dots,d\}$ let

\[ f_i(x) = \sum_{q\in Q}\left(q^{(i)}-x\right)^2. \]

When we choose $r^{(i)}$ such that $f_i(r^{(i)})=\sum_{q\in Q}(q^{(i)}-r^{(i)})^2$ is minimized for every $i$, then $r$ minimizes $f$.

Let $i\in\{1,\dots,d\}$ be fixed. Obviously $f_i:\mathbb{R}\to\mathbb{R}$ is continuously differentiable. Furthermore $\lim_{x\to\infty}f_i(x)=\infty$ and $\lim_{x\to-\infty}f_i(x)=\infty$. Therefore $f_i$ must have a global minimum, and at that global minimum we must have $f_i'(x) := \frac{d}{dx}f_i(x)=0$.

We observe:

\[ f_i'(x) = \sum_{q\in Q} 2\cdot\left(x-q^{(i)}\right). \]

Therefore we have

\[ f_i'(x)=0 \ \Leftrightarrow\ \sum_{q\in Q}2\cdot\left(x-q^{(i)}\right)=0 \ \Leftrightarrow\ \left(\sum_{q\in Q}x\right)-\left(\sum_{q\in Q}q^{(i)}\right)=0 \]
\[ \Leftrightarrow\ |Q|\cdot x=\sum_{q\in Q}q^{(i)} \ \Leftrightarrow\ x=\frac{1}{|Q|}\sum_{q\in Q}q^{(i)} \ \Leftrightarrow\ x=\mu(Q)^{(i)}. \]

Therefore $\mu(Q)$ is the only point in $\mathbb{R}^d$ minimizing $f$. □

8.1.1 The Basic k-Means Method

Based on the observations of Lemma 8.1.1 and Lemma 8.1.2 that it is easy to compute an optimal

partition for a fixed set of centers and an optimal set of centers for a fixed partition, a simple and

elegant clustering heuristic has been developed [41, 95, 97]. Nowadays, one often refers to this

heuristic as the k-means algorithm or Lloyd’s algorithm. This algorithm runs in iterations. At

the beginning of an iteration the algorithm has a set of $k$ centers $\{c_1,\dots,c_k\}$. Every iteration consists of two steps:

1. For every $p\in P$ compute its nearest center in $\{c_1,\dots,c_k\}$. Partition $P$ into $k$ sets $C_1,\dots,C_k$ such that $C_i$ contains all points whose nearest center is $c_i$ (and break ties arbitrarily).

2. For every cluster $C_i$ compute its center of gravity $\mu(C_i)$, i.e. the optimal center of that cluster. Then set $c_i := \mu(C_i)$ for every $1\le i\le k$.

Each iteration runs in O(ndk)time. Typically, the algorithm runs for a fixed number of iterations

(standard values are in the range from 50 to 500). It is well known that the algorithm only

converges to a local optimum and that the quality of this solution depends strongly on the initial

set of centers. Therefore, the algorithm is usually repeated several times with different sets of

initial centers and the best discovered partition is returned.
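The two steps of an iteration can be sketched in plain Python as follows. This is a minimal illustration, not the implementation evaluated in this chapter: the function name `lloyd`, the early stop when the centers no longer change, and the rule of keeping the old center for an empty cluster are assumptions of the sketch.

```python
import math

def lloyd(points, centers, iterations=50):
    """Run Lloyd's algorithm; points and centers are lists of coordinate tuples."""
    clusters = []
    for _ in range(iterations):
        # Step 1: assign every point to its nearest center (ties go to the lowest index).
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Step 2: move every center to the center of gravity of its cluster.
        new_centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[c]
            for c, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break                       # fixed point reached
        centers = new_centers
    return centers, clusters
```

Each iteration performs $n\cdot k$ distance computations in $d$ dimensions, matching the $O(ndk)$ bound stated above.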

8.1.2 Algorithm KMHybrid

In the experiments we compare our algorithm to an algorithm called KMHybrid [104, 83].

KMHybrid combines Lloyd’s algorithm with swapping of centers (moving centers away to the

position of a random point to break local minima) and a variant of simulated annealing. The

algorithm does one swap followed by a sequence of Lloyd’s steps and an acceptance test. If the


current solution passes the test, the algorithm continues with the current solution. Otherwise, it

restores the old one. The acceptance test is based on a variant of simulated annealing. Addition-

ally, the algorithm uses a kD-tree to speed up nearest neighbor search in the Lloyd’s steps. For

more details see [104, 83].

8.1.3 The Silhouette Coefficient

In many applications the right number of clusters is not known in advance. Since the k-means objective function drops monotonically as k increases, one needs a different measure for the quality of a clustering that is independent of k. Such a measure is provided by the average silhouette coefficient [86] of the clustering. The silhouette coefficient of a point p_i is computed as follows. First compute the average distance a_i of p_i to the points in the same cluster as p_i. Then for each cluster C that does not contain p_i compute the average distance from p_i to all points in C. Let b_i denote the minimum of these average distances. Then the silhouette coefficient of p_i is defined as (b_i − a_i)/max(a_i, b_i).

The value of the silhouette coefficient of a point varies between −1 and 1. A value near −1 indicates that the point is clustered badly. A value near 1 indicates that the point is well-clustered. To evaluate the quality of a clustering one can compute the average silhouette coefficient of all points.
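The definition can be made concrete with a short Python sketch. It assumes at least two clusters, excludes p_i itself from the average a_i, and sets a_i = 0 for a point that is alone in its cluster; these conventions and the function names are assumptions of the sketch, not statements from the text.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(points, labels, i):
    """Silhouette coefficient (b_i - a_i) / max(a_i, b_i) of point i."""
    own = labels[i]
    distances_by_cluster = {}
    for j, q in enumerate(points):
        if j != i:
            distances_by_cluster.setdefault(labels[j], []).append(dist(points[i], q))
    same = distances_by_cluster.pop(own, [])
    # Assumption: a point alone in its cluster gets a_i = 0.
    a = sum(same) / len(same) if same else 0.0
    # b_i is the smallest average distance to any other cluster.
    b = min(sum(ds) / len(ds) for ds in distances_by_cluster.values())
    largest = max(a, b)
    return (b - a) / largest if largest > 0 else 0.0


def average_silhouette(points, labels):
    """Average silhouette coefficient over all points."""
    return sum(silhouette(points, labels, i) for i in range(len(points))) / len(points)
```

Computed this way, the average costs a quadratic number of distance evaluations, which is why it is attractive to evaluate it on a small point set.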

8.2 The Algorithm

We first provide a high-level description of our algorithm and then give some more details on the implementation.

Our algorithm uses the observation that the first iterations of k-means algorithms or swapping heuristics do not need the accuracy of the whole point set. It uses the coreset construction technique of Section 5.3 to reduce the complexity of the point set. The hope is that the first iterations of iterative algorithms can be done much more efficiently on small coresets while still significantly improving the objective function.

The algorithm starts by computing a coreset of size roughly 2k and chooses k points from this coreset as a starting solution. Then it repeats the following two steps max{40/√k, 2} times (see footnote 1); these steps form the main loop of the algorithm.

• First it runs Lloyd's algorithm for d steps. After this, the current solution is compared to the previously best one and the algorithm continues with the better of these solutions.

• In the second step, the algorithm chooses a number k' between 0 and k uniformly at random. Then it picks centers from the current set of centers according to the following probability distribution until k' different centers are chosen. The probability that the center of cluster C_j is chosen is

\[ \frac{1}{|C_j| \cdot \sum_{i=1}^{k} \frac{1}{|C_i|}}, \]

where C_1, . . . , C_k denotes the current clustering. Therefore centers of smaller clusters are picked with higher probability. Finally, these k' random centers are replaced by points chosen uniformly at random from the coreset.

Footnote 1: The value comes from the idea that the time to extract the coreset should be comparable to the time to run iterations on the coreset. Without a dependency on k here, the coreset extraction time dominates the algorithm runtime for small values of k at the beginning of the algorithm. The special value 40/√k has been empirically adjusted.

After the main loop is finished the algorithm doubles the size of the coreset and continues with the main loop. This is done until the coreset is the whole point set. The algorithm is given in Figure 8.1.

COREMEANS(P, k)
    m ← 2k
    while m ≤ |P| do
        Compute coreset of size m.
        if m = 2k then
            C ← k random points from coreset
            K ← C
        repeat max{40/√k, 2} times (* Main loop *)
            Do d iterations of Lloyd's algorithm starting with C
            Let C denote the current solution
            if Means(P, K) < Means(P, C) then
                C ← K
            K ← C
            Choose k' randomly from {0, . . . , k}
            Swap k' centers from C with points chosen uniformly from the coreset
        m ← 2·m

Figure 8.1: The COREMEANS algorithm
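The swap step of the main loop can be sketched in Python as follows. The helper names are hypothetical, and the sketch assumes that every cluster is non-empty so that the weight 1/|C_j| is defined. `random.choices` normalizes the weights itself, so passing 1/|C_j| yields a distribution in which the probability of a center is inversely proportional to the size of its cluster.

```python
import random


def pick_centers_to_swap(cluster_sizes, k_prime, rng):
    """Draw center indices with probability proportional to 1/|C_j|
    until k_prime different indices have been chosen."""
    # Assumption of this sketch: all cluster sizes are at least 1.
    weights = [1.0 / size for size in cluster_sizes]
    indices = range(len(cluster_sizes))
    chosen = set()
    while len(chosen) < k_prime:
        chosen.add(rng.choices(indices, weights=weights)[0])
    return chosen


def swap_centers(centers, cluster_sizes, coreset, rng):
    """Replace a random number k' in {0, ..., k} of centers by random coreset points."""
    k_prime = rng.randint(0, len(centers))  # both endpoints are included
    new_centers = list(centers)
    for j in pick_centers_to_swap(cluster_sizes, k_prime, rng):
        new_centers[j] = rng.choice(coreset)
    return new_centers
```

Passing an explicit `random.Random` instance keeps runs reproducible, which helps when comparing solution quality across several restarts.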

To support efficient computation of coresets and to speed up nearest neighbor queries in Lloyd's algorithm we use a quadtree or its higher-dimensional equivalent. Our approach is analogous to the kD-tree algorithm from [83]. The root of the quadtree corresponds to a bounding box of the point set. With each occupied cell B associated with a node of the quadtree we store:

• The number $n_B = |P \cap B|$ of points contained in the cell,

• A vector $b_B = (b_B^{(1)}, \ldots, b_B^{(d)})$ with $b_B^{(i)} := \sum_{p \in P \cap B} p^{(i)}$,

• The sum of the squared ℓ2-norms of the point vectors $e_B := \sum_{p \in P \cap B} \|p\|_2^2 = \sum_{p \in P \cap B} \sum_{i=1}^{d} (p^{(i)})^2$.

This information can be used to quickly compute the exact cost of the partition of P that corresponds to a given partition of the coreset. The same precomputations have also been used in [83, 124]. Since all points from one cell are assigned to the same coreset point and hence to the same cluster, we can compute the cost of a cluster C in the following way. Let B_1, . . . , B_l be the disjoint cells assigned to C. We first compute the center of gravity c := µ(C) using the formula

\[ \mu(C)^{(i)} = \frac{1}{|C|} \sum_{p \in C} p^{(i)} = \frac{1}{|C|} \sum_{j=1}^{l} \sum_{p \in P \cap B_j} p^{(i)} = \frac{1}{|C|} \sum_{j=1}^{l} b_{B_j}^{(i)}, \qquad |C| = \sum_{j=1}^{l} n_{B_j}. \]

Then we have to compute

\[ \sum_{p \in C} d(p, c)^2 = \sum_{p \in C} \sum_{i=1}^{d} \left(p^{(i)} - c^{(i)}\right)^2 = \sum_{p \in C} \sum_{i=1}^{d} \left( (p^{(i)})^2 - 2 p^{(i)} c^{(i)} + (c^{(i)})^2 \right) \]
\[ = \sum_{j=1}^{l} e_{B_j} - 2 \cdot \sum_{i=1}^{d} \left( c^{(i)} \cdot \sum_{j=1}^{l} b_{B_j}^{(i)} \right) + \left( \sum_{j=1}^{l} n_{B_j} \right) \cdot \left( \sum_{i=1}^{d} (c^{(i)})^2 \right), \]

which can be done efficiently by using the information stored with each cell. Notice that the runtime of this computation only depends on the number of cells, not on the number of points inside the cells.
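A small Python sketch shows how the cost of a cluster follows from the triples (n_B, b_B, e_B) alone. The function names are hypothetical; `cell_statistics` exists only to build test input and is not part of the described data structure.

```python
def cell_statistics(points):
    """Return (n_B, b_B, e_B) for the points of one cell."""
    dim = len(points[0])
    n = len(points)
    b = [sum(p[i] for p in points) for i in range(dim)]
    e = sum(x * x for p in points for x in p)
    return n, b, e


def cluster_cost_from_cells(cells):
    """Center of gravity and k-means cost of a cluster given as a list of
    cell statistics (n_B, b_B, e_B); no individual point is touched."""
    dim = len(cells[0][1])
    n = sum(cell[0] for cell in cells)
    b = [sum(cell[1][i] for cell in cells) for i in range(dim)]
    e = sum(cell[2] for cell in cells)
    center = [b_i / n for b_i in b]
    # cost = sum e_B - 2 * sum_i c_i * b_i + n * sum_i c_i^2
    cost = e - 2 * sum(center[i] * b[i] for i in range(dim)) + n * sum(c * c for c in center)
    return center, cost
```

For two cells holding the points (0,0), (2,0) and (0,2), (2,2), the center of gravity is (1,1) and each point has squared distance 2 to it, so the cost is 8, which is what the formula returns from the aggregated statistics.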

In our implementation we fixed the depth of the quadtree to 11 (see footnote 2). We build the tree bottom-up. We first identify the non-empty cells in the grid corresponding to the 11th level of the tree. The non-empty cells are stored in a hash table with cell coordinates as keys. After we have computed the non-empty cells in the finest grid, we iterate over these cells and compute all non-empty cells in the next coarser grid together with the corresponding point statistics.
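The bottom-up construction can be sketched with Python dictionaries acting as the hash tables. The sketch assumes a cubic bounding box [lo, hi]^d and integer grid coordinates as keys; the function name `build_levels` and the clamping of points on the upper boundary are assumptions made for illustration.

```python
def build_levels(points, depth, lo, hi):
    """Return {level: {cell_key: (n, b, e)}} for levels 0..depth, where level
    `depth` is the finest grid and only non-empty cells are stored."""
    dim = len(points[0])
    side = 2 ** depth  # number of cells per axis in the finest grid
    finest = {}
    for p in points:
        # Assumption: points on the upper boundary are clamped into the last cell.
        key = tuple(min(side - 1, int((p[i] - lo) / (hi - lo) * side)) for i in range(dim))
        n, b, e = finest.get(key, (0, [0.0] * dim, 0.0))
        finest[key] = (n + 1, [x + y for x, y in zip(b, p)], e + sum(x * x for x in p))
    levels = {depth: finest}
    # Aggregate the statistics of each level into the next coarser grid.
    for level in range(depth, 0, -1):
        coarser = {}
        for key, (n, b, e) in levels[level].items():
            parent = tuple(c // 2 for c in key)
            pn, pb, pe = coarser.get(parent, (0, [0.0] * dim, 0.0))
            coarser[parent] = (pn + n, [x + y for x, y in zip(pb, b)], pe + e)
        levels[level - 1] = coarser
    return levels
```

Only occupied cells ever enter a table, so the memory per level is bounded by the number of input points regardless of the dimension.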

To compute a coreset of around m points we first have to identify a good guess for δ. Larger values of δ lead to a smaller coreset but also to a worse approximation (see Section 5.3). We find a good value by setting δ to a large value and iteratively dividing it by 1.3. After each iteration we compute the heavy cells from the cell statistics, and from that the size of the coreset (without computing the actual coreset points). For high values of δ this is very fast, since the coreset size can be computed using a few large cells. Altogether, the time to compute the coreset size is negligible compared to the coreset computation time.

The coreset for a given value of δ is computed using a recursive depth-first search on the quadtree cells. For the root cell we call the function COMPUTECORESETPOINTS (see Figure 8.2). This function takes a cell as input parameter together with statistics about points to be moved into that cell. COMPUTECORESETPOINTS adds the input points to the cell statistics. Then it checks each subcell for heaviness. If there is at least one heavy subcell, it calls COMPUTECORESETPOINTS for all heavy subcells. The points given as function parameters and the points in all light subcells are moved into one heavy subcell by adding up their statistics and passing these statistics to one call of COMPUTECORESETPOINTS as function parameters. If a cell has no heavy subcell, a coreset point is introduced from the statistics about cell points and the statistics given as function input.

Footnote 2: During the computation of big coresets in later iterations it can happen that the coreset extraction needs point information of deeper levels. However, since we used images as test instances with 8 bits per color channel, a depth of 11 was enough. We believe that this also suffices for most other real-world scenarios.

Global variables used in functions COMPUTECORESETPOINTS, COMPUTEASSIGNMENTFORCELL, and COMPUTEASSIGNMENT

ℒ: the current set of centers.
For each center c ∈ ℒ:
    n_c ∈ N: the number of points assigned to the center.
    b_c ∈ R^d: the sum of coordinates of points assigned to the center.
    e_c ∈ R: the sum of squared ℓ2-norms of points assigned to the center.
R: the coreset computed by COMPUTECORESETPOINTS.
For each cell C with points in it:
    n_C ∈ N, b_C ∈ R^d, e_C ∈ R: the statistics for the points P ∩ C in cell C.
    ñ_C ∈ N, b̃_C ∈ R^d, ẽ_C ∈ R: the statistics for the points P_C in cell C, taking into account the movement of points into cells during coreset construction.

COMPUTECORESETPOINTS returns a set of coreset points, each being a triple from N × R^d × R.

COMPUTECORESETPOINTS(Cell C, n ∈ N, b ∈ R^d, e ∈ R)
    ñ_C ← n_C + n.
    b̃_C ← b_C + b.
    ẽ_C ← e_C + e.
    Let H_1, . . . , H_h be the heavy subcells of C.
    Let L_1, . . . , L_l be the light subcells of C.
    for j = 1, . . . , l do:
        ñ_{L_j} ← 0. // Points are moved into a heavy cell.
        b̃_{L_j} ← (0, . . . , 0) ∈ R^d.
        ẽ_{L_j} ← 0.
    N ← n + Σ_{j=1}^{l} n_{L_j}. // Number of points to be assigned.
    B ← b + Σ_{j=1}^{l} b_{L_j}.
    E ← e + Σ_{j=1}^{l} e_{L_j}.
    if C has at least one heavy subcell do:
        Coreset ← COMPUTECORESETPOINTS(H_1, N, B, E).
        for j = 2, . . . , h do:
            Coreset ← Coreset ∪ COMPUTECORESETPOINTS(H_j, 0, (0, . . . , 0), 0).
    else
        Coreset ← {(N, B, E)}. // Create new coreset point.
    RETURN Coreset.

Figure 8.2: The COMPUTECORESETPOINTS function
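The recursion can be sketched in Python on an explicit tree. The heaviness test is defined in Section 5.3 and is therefore passed in as a hypothetical predicate `is_heavy`. The `Cell` class is likewise an assumption of this sketch, and a cell without heavy subcells emits its statistics plus the incoming ones, which for an inner cell equals the incoming statistics plus those of all its (light) subcells.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    n: int        # number of points in the cell
    b: list       # coordinate sums of the points in the cell
    e: float      # sum of squared l2-norms of the points in the cell
    children: list = field(default_factory=list)


def add(u, v):
    return [x + y for x, y in zip(u, v)]


def compute_coreset_points(cell, n, b, e, is_heavy):
    """Return coreset points as triples (count, coordinate sum, squared-norm sum).
    (n, b, e) are the statistics of points moved into `cell` from outside."""
    heavy = [c for c in cell.children if is_heavy(c)]
    light = [c for c in cell.children if not is_heavy(c)]
    if not heavy:
        # No heavy subcell: everything in the cell becomes one coreset point.
        return [(cell.n + n, add(cell.b, b), cell.e + e)]
    # Incoming points and all light subcells are moved into the first heavy subcell.
    moved_n, moved_b, moved_e = n, list(b), e
    for c in light:
        moved_n, moved_b, moved_e = moved_n + c.n, add(moved_b, c.b), moved_e + c.e
    coreset = compute_coreset_points(heavy[0], moved_n, moved_b, moved_e, is_heavy)
    zero = [0.0] * len(b)
    for c in heavy[1:]:
        coreset += compute_coreset_points(c, 0, zero, 0.0, is_heavy)
    return coreset
```

A useful sanity check is conservation: the counts, coordinate sums, and squared-norm sums of the returned coreset points add up to the statistics of the root cell.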

To speed up the later k-means algorithm we store the following statistics. For each cell occupied by coreset points we store a pointer to a corresponding coreset point (if there is one) and pointers to all subcells containing coreset points. For a cell B let P_B be the points from P which are in cell B after moving points during the coreset construction. During the construction of coreset R we additionally compute for each cell B the corresponding coreset statistics (i.e. the cell statistics taking into account that points are moved into cells during the coreset construction):

• The number $\tilde{n}_B = |P_B|$ of points contained in the cell (after moving points),

• A vector $\tilde{b}_B = (\tilde{b}_B^{(1)}, \ldots, \tilde{b}_B^{(d)})$ with $\tilde{b}_B^{(i)} := \sum_{p \in P_B} p^{(i)}$,

• The sum of the squared ℓ2-norms of the point vectors $\tilde{e}_B := \sum_{p \in P_B} \|p\|_2^2 = \sum_{p \in P_B} \sum_{i=1}^{d} (p^{(i)})^2$.

See Figure 8.2 for details. The function COMPUTECORESETPOINTS returns a set of triples from N × R^d × R, where a triple (n', b', e') stands for a coreset point representing n' points from the input instance with coordinate sum b' and sum of squared ℓ2-norms e'.

One iteration of the k-means algorithm is done as follows. Instead of searching the nearest center for each coreset point separately, we use an approach analogous to the kD-tree approach in [83]. We start with a list L of all centers as possible candidates for nearest centers and do a depth-first walk on those quadtree cells which contain a coreset point. For each cell C in the quadtree we check whether we can rule out that some centers q in L are the nearest center to any point in C. This is done by first computing the point l from L that is nearest to the center of C. Then we check for each point q ∈ L whether C lies completely on the same side as l of the bisector between l and q. All centers which cannot be nearest centers for coreset points in C are evicted from L and the algorithm proceeds to the children of cell C.

If |L| = 1 we know the nearest center for all coreset points within the cell. Since we hold statistics for all coreset points within each cell, we can then assign all coreset points in one step to the center l ∈ L and stop.

If |R ∩ C| = 1, we compute the distances of the coreset point to all l ∈ L directly, assign the coreset point, and stop the depth-first walk.

See Figure 8.3 for details.
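The bisector test for an axis-parallel cell can be sketched in Python. Whether a point is closer to l than to q is a linear condition, so it suffices to test the corner of the cell that lies farthest in the direction from l to q. The function names are hypothetical and the corner test is the standard way to implement such a check, assumed here rather than taken from the text.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def box_closer_to(l, q, lo, hi):
    """True if every point of the box [lo, hi] is strictly closer to l than to q."""
    # Corner of the box that is extreme in direction q - l.
    corner = [hi[i] if q[i] > l[i] else lo[i] for i in range(len(l))]
    return sq_dist(corner, l) < sq_dist(corner, q)


def prune_candidates(candidates, lo, hi):
    """Remove all candidate centers that cannot be nearest to any point of the box."""
    mid = [(a + b) / 2 for a, b in zip(lo, hi)]
    l = min(candidates, key=lambda c: sq_dist(c, mid))
    return [q for q in candidates if q == l or not box_closer_to(l, q, lo, hi)]
```

For the unit square and the candidates (0.5, 0.5), (10, 10), and (1.2, 0.5), the far center (10, 10) is pruned, while (1.2, 0.5) stays because the corner (1, 0) is closer to it than to (0.5, 0.5).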

COMPUTEASSIGNMENTFORCELL(Cell C, List of possible centers L ⊆ ℒ)
    if |R ∩ C| = 1 do: // Only one coreset point in C.
        Let g ∈ R ∩ C be the coreset point in cell C.
        Compute center l ∈ L nearest to g.
        n_l ← n_l + ñ_C.
        b_l ← b_l + b̃_C.
        e_l ← e_l + ẽ_C.
    else: // More than one coreset point in C.
        Let a be the center of cell C.
        Compute the center l ∈ L having smallest distance to a.
        for each q ∈ L:
            if each point of cell C has smaller distance to l than to q do:
                L ← L \ {q}.
        if |L| = 1 do: // Then L = {l}.
            n_l ← n_l + ñ_C.
            b_l ← b_l + b̃_C.
            e_l ← e_l + ẽ_C.
        else: // |L| ≥ 2.
            for each subcell C' of C having a coreset point do:
                COMPUTEASSIGNMENTFORCELL(C', L).

COMPUTEASSIGNMENT()
    for each l ∈ ℒ do:
        n_l ← 0.
        b_l ← (0, . . . , 0) ∈ R^d.
        e_l ← 0.
    COMPUTEASSIGNMENTFORCELL(Largest cell C, ℒ).

Figure 8.3: The COMPUTEASSIGNMENT function

SILHOUETTE(K)
    m ← 5.
    while m ≤ K do:
        Compute coreset of size m.
        for each k ∈ {1, . . . , 100} do
            Use main loop of CoreMeans to compute clustering.
            Compute average silhouette coefficients for the current coreset and centers.
        m ← 2·m.

Figure 8.4: The SILHOUETTE function

                              Setup Times
Instance         Size         KMHybrid    CoreMeans
Tower            4,915,200    28.59       4.77
Bridge           3,145,728    18.13       2.95
PaSCo            3,145,728    19.41       4.29
Frymire          1,234,390    4.71        0.65
Clegg            716,320      2.76        1.05
Monarch          393,216      1.43        0.63
Artificial5D     300,000      2.27        1.49
Artificial10D    300,000      3.71        2.17
Artificial15D    300,000      4.71        2.70
Artificial20D    300,000      6.09        3.87

Table 8.1: Data sets and setup time.

Computation of silhouette coefficients. The computation of the silhouette coefficient for each point p_i is sped up in the following way. We first compute the average distance a_i to all points in the same cluster. To compute b_i, the minimum over the average distances b_{i,j} to the points in the other clusters C_j, we identify the second nearest cluster center c_l and compute the average distance b_{i,l} to all points in C_l. In most cases b_{i,l} is the minimum of all b_{i,j} for other clusters. To get a certificate for this, we use the lower bound d(p_i, c_j) ≤ b_{i,j}. We check for all other clusters whether d(p_i, c_j) ≥ b_{i,l}. If this inequality holds, then b_{i,l} ≤ b_{i,j}, so b_{i,j} cannot be smaller than b_{i,l}. In that case we save the computation of all distances to points in cluster C_j.
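The certificate for skipping clusters in the silhouette computation can be sketched in Python. The sketch assumes that `centers[j]` is the center of gravity of `clusters[j]`, which is what makes d(p, c_j) a lower bound on the average distance. The function name is hypothetical, and comparing against the running minimum instead of the fixed value b_{i,l} is a small variation chosen for this sketch.

```python
import math


def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def avg_dist(p, cluster):
    return sum(dist(p, q) for q in cluster) / len(cluster)


def min_avg_dist_to_other_clusters(p, own, clusters, centers):
    """Return (b, evaluated): the minimum average distance from p to any cluster
    other than `own`, and the number of clusters whose points were scanned."""
    others = [j for j in range(len(clusters)) if j != own]
    # Start with the cluster whose center is nearest to p (second nearest overall).
    l = min(others, key=lambda j: dist(p, centers[j]))
    best = avg_dist(p, clusters[l])
    evaluated = 1
    for j in others:
        if j == l:
            continue
        # Certificate: avg_dist(p, C_j) >= dist(p, c_j) >= best, so C_j is skipped.
        if dist(p, centers[j]) >= best:
            continue
        evaluated += 1
        best = min(best, avg_dist(p, clusters[j]))
    return best, evaluated
```

When clusters are well separated, most of them are rejected by a single distance computation to their center, so the quadratic scan is restricted to a few nearby clusters.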

8.3 Experiments

We implemented our algorithm in C++. The code was compiled with gcc version 3.4.4 using optimization level 2 (-O2). We compare our algorithm to the implementation of KMHybrid from [83]. KMHybrid was compiled using the same compiler and also with optimization level 2. We ran it using the standard settings given by the developers.

We ran our experiments on an Intel Pentium D dual-core processor. Both algorithms used only one core with a core frequency of 3 GHz. The computer has 2 GByte RAM.

8.3.1 Data Sets

We performed our experiments on two different types of instances. The first type of instance consists of images, and we want to cluster the RGB values of the pixels. Thus the input points lie in 3D and the i-th input point corresponds to the RGB values of the i-th pixel of the image. Such a clustering has applications in lossy data compression, since one can reduce the palette of colors used in the picture to the colors corresponding to the cluster centers.

Our test images consist of three large images (Tower, Bridge, and PaSCo) and three medium-size images (Monarch, Frymire, and Clegg). The latter images are commonly used to evaluate the performance of image compression algorithms. The exact sizes of the test images can be found in Table 8.1. The images are available at homepages.uni-paderborn.de/frahling/coremeans.html


Artificially Created Instances.

The second type of instance is artificially created. Instance ArtificialxD consists of 300,000 points in x dimensions. The instance is generated by taking a random point from one of 20 Gaussian distributed clusters, whose centers are picked uniformly at random from the unit cube. The standard deviation of the Gaussian distribution is 0.02·√d, i.e. it is the product of the one-dimensional Gaussian distributions with standard deviation 0.02. An example of a sample of points from instance Artificial2D is given in Figure 8.15.
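An instance of this kind can be generated with a few lines of Python. The sketch assumes that each point picks one of the 20 clusters uniformly at random, which the description above does not state explicitly; the function name, the parameter names, and the seed are hypothetical.

```python
import random


def make_artificial_instance(num_points, dim, num_clusters=20, sigma=0.02, seed=0):
    """Draw cluster centers uniformly from the unit cube and sample points
    whose coordinates are independent Gaussians with standard deviation sigma."""
    rng = random.Random(seed)
    centers = [[rng.random() for _ in range(dim)] for _ in range(num_clusters)]
    points = []
    for _ in range(num_points):
        # Assumption of this sketch: the cluster of a point is chosen uniformly.
        center = rng.choice(centers)
        points.append([rng.gauss(center[i], sigma) for i in range(dim)])
    return centers, points
```

Calling `make_artificial_instance(300000, 5)` would correspond to the size and dimension of Artificial5D, although the concrete points naturally differ from the instances used in the experiments.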

8.3.2 Comparison of CoreMeans and KMHybrid

To evaluate the performance of CoreMeans we compare our algorithm to KMHybrid. We first compare the setup times for both algorithms, i.e. the time to construct the auxiliary data structures. If one wants to compute a clustering for a fixed value of k, then the setup times often dominate the running time of the algorithm. If a good value of k is not known, then one often wants to compute a clustering for multiple values of k. In this case, it is more interesting to compare the running time of both algorithms without setup time (however, the time to extract the coresets from the data structure is contained in the given running times). This is done in the remaining parts of this section. We first compare both algorithms for different input sizes. We then focus on the performance with increasing dimension, and finally we investigate the dependence on the number of clusters.

Setup time

The times to compute the auxiliary data structures are given in Table 8.1. The time to build these structures does not depend on k. The setup time for KMHybrid is between 1.5 and 7 times higher than that of CoreMeans. There is a tendency that the gap becomes larger for larger instances. However, there also seems to be a dependence on the distribution of points, as the largest factor was achieved for the medium-size instance Frymire.

If one computes a clustering for one value of k, then the setup time is typically larger than the computation time. Even for the larger instances both algorithms obtain a good clustering in a few seconds (see also Section 8.3.2).


Dependence on Input Size

To evaluate the dependence on the input size we ran both algorithms on the instances Monarch, Clegg, Frymire, PaSCo, Bridge, and Tower. We used parameter k = 50. In general, CoreMeans performs better for smaller k (see Section 8.3.2) and tends to perform similarly to KMHybrid as k increases. The results are shown in Figures 8.5 to 8.10. The plots give the average performance of 10 runs. The vertical bars indicate the best and worst solution found within these runs. The relative performance of CoreMeans increases slightly with the input size n. We would like to emphasize that the difference between the best and worst solution found during the 10 runs is much smaller for CoreMeans. Therefore, to guarantee a good solution we have to run KMHybrid more often than CoreMeans. Another interesting observation is that CoreMeans achieves slightly better approximations for larger instances.

Figure 8.5: Performance on data set Monarch for k=50 excluding setup time. (Plot: objective value versus time in seconds; curves: KMLocalHybrid, CoreMeans.)

Figure 8.6: Performance for Clegg for k=50 excluding setup time. (Plot: objective value versus time in seconds; curves: KMLocalHybrid, CoreMeans.)

Figure 8.7: Performance for Frymire for k=50 excluding setup time. (Plot: objective value versus time in seconds; curves: KMLocalHybrid, CoreMeans.)

Figure 8.8: Performance for PaSCo for k=50 excluding setup time. (Plot: objective value versus time in seconds; curves: KMLocalHybrid, CoreMeans.)

Figure 8.9: Performance for Bridge for k=50 excluding setup time. (Plot: objective value versus time in seconds; curves: KMLocalHybrid, CoreMeans.)

Figure 8.10: Performance for Tower for k=50 excluding setup time. (Plot: objective value versus time in seconds; curves: KMLocalHybrid, CoreMeans.)

Dependence on the Dimension

Next we are interested in the dependence on the dimension. To evaluate this dependence, we compare the average performance of 10 runs of KMHybrid and CoreMeans for k = 20 on the instances ArtificialxD for x = 5, 10, 15, and 20. The graphs are shown in Figures 8.11 and 8.12. CoreMeans performs better on all instances. The most significant difference in performance can be found in the 5D instance, where CoreMeans performs a factor of 10 to 30 better. The higher the dimension, the smaller the advantage of CoreMeans. In these experiments the deviation of KMHybrid was much bigger than that of CoreMeans. Although CoreMeans shows the much better average performance, the best solution found by KMHybrid was better than the best solution found by CoreMeans. Overall, the performance of the algorithm for medium dimensions was much better than the theory, which predicts an exponential dependence on the dimension, suggests.

Figure 8.11: Performance for ArtificialxD with x=5 and x=10 excluding setup time. (Plot: objective value versus time in seconds; curves: d=10 KMLocalHybrid, d=10 CoreMeans, d=5 KMLocalHybrid, d=5 CoreMeans.)

Figure 8.12: Performance for ArtificialxD with x=15 and x=20 excluding setup time. (Plot: objective value versus time in seconds; curves: d=20 KMLocalHybrid, d=15 KMLocalHybrid, d=20 CoreMeans, d=15 CoreMeans.)

Dependence on the Number of Clusters

To investigate the dependence on the number of cluster centers, we ran a number of experiments on different inputs. Due to space limitations we only present results for k = 10, 50, 100, and 200 for instance Bridge. These results are typical for the performance we encountered. As before, Figures 8.13 and 8.14 show the average performance of 10 runs excluding the setup times. Typically, our algorithm performs significantly better for small values of k. For example, for k = 10 CoreMeans often performs a factor of 10 to 100 better. Additionally, the quality of the solutions computed by KMHybrid varies significantly. CoreMeans is less sensitive to the random choices of the algorithm. As a consequence one must perform more runs of KMHybrid to obtain a good solution with high probability. As k grows larger the performance gap between the two algorithms decreases. The reason for this is that the quality of the coreset decreases as k grows.

Figure 8.13: Performance for Bridge with k=10 and k=50 excluding setup time. (Plot: objective value versus time in seconds; curves: k=10 KMLocalHybrid, k=10 CoreMeans, k=50 KMLocalHybrid, k=50 CoreMeans.)

Figure 8.14: Performance for Bridge with k=100 and k=200 excluding setup time. (Plot: objective value versus time in seconds; curves: k=100 KMLocalHybrid, k=100 CoreMeans, k=200 KMLocalHybrid, k=200 CoreMeans.)

8.3.3 Computing the Silhouette Coefficient

We computed the approximate average silhouette coefficient for 1 ≤ k ≤ 100 for instances Tower, Clegg, Monarch, and ArtificialxD with x = 2, 10, 20 using coresets of different sizes. Table 8.2 summarizes the running times of our tests. The column Time gives the overall running time of the computation and the column Silhouette states the time spent computing the silhouette coefficients. Since the time to compute the silhouette coefficient is quadratic in the coreset size, the fraction of time spent on this computation increases significantly with increasing coreset size.

To show the effectiveness of our method we focus on instance Artificial2D. A sample of points from this instance is shown in Figure 8.15. The average silhouette coefficient for this instance and coreset sizes 427, 1616, and 6431 is given in Figure 8.16. We see that even the smallest coreset suffices to approximate the coefficient quite well. The only problem is that the silhouette coefficient increases slightly with k. A reason for this may be that the number of centers is already relatively large compared to the number of coreset points. If some clusters contain only one point, then these points have silhouette coefficient exactly 1, and this may lead to a slightly increasing coefficient if k is large compared to the coreset size. For our applications a coreset of size roughly 1600 will definitely suffice. There is almost no difference to one with more than 6000 points.

The highest silhouette coefficient value was achieved for 14 clusters (using the larger coreset) by the cluster centers shown in Figure 8.15. The reason why only 14 clusters were found (although we had 20 cluster centers) is that some of the clusters were very close to each other, so the silhouette coefficient is higher when one assigns only one center to these clusters.

Instance         Coreset    Time      Silhouette
Tower            404        7.99      0.84
                 1607       19.24     6.43
Clegg            423        4.69      0.8
                 1720       15.07     6.58
Monarch          428        4.80      0.77
                 1626       15.37     6.11
Artificial2D     427        2.52      0.62
                 1616       7.73      4.3
                 6431       51.89     45.57
Artificial10D    400        43.34     1.88
                 1711       123.38    17.68
Artificial20D    408        139.58    4.18
                 1778       442.5     40.62

Table 8.2: Time to compute clusterings and approximate average silhouette coefficients. The column Time contains the overall running time (including setup). The column Silhouette gives the time required to compute the approximate average silhouette coefficient.

Figure 8.15: A sample of points from instance Artificial2D. The bold points are the centers that achieve the best average silhouette coefficient.

Figure 8.16: The average silhouette coefficient of Artificial2D. (Plot: silhouette coefficient versus k; curves: Coreset 427, Coreset 1616, Coreset 6431.)

8.3.4 Summary

In summary, our algorithm CoreMeans performs very well compared to KMHybrid [104] for small dimension and small to medium k. When we compare the computation time of CoreMeans with KMHybrid we see that for one clustering the running time of both algorithms is typically dominated by the setup time. The quality of the solutions of CoreMeans varies less than that of KMHybrid, which implies that we need fewer runs to guarantee a good solution.

The main strength of our algorithm is to quickly find relatively good approximations for many values of k, for example when a good value for k is not known in advance. In this case, we can also use the coresets to compute the average silhouette coefficient and thus to find a good choice of k.


9 Counting Motifs in Data Streams

In this chapter we present estimators for the number of triangles, the number of cliques of any size, and the number of bipartite cliques K_{3,3} with three nodes in each partition for graphs given as a data stream of edges. Our data stream algorithms compute a (1+ε)-approximation of the respective value with probability 1−δ.

To estimate the number of triangles for a graph given as an adjacency stream, our algorithm uses

\[ O\left( \frac{1}{\varepsilon^2} \cdot \log\frac{1}{\delta} \cdot \left(1 + \frac{|T_1| + |T_2|}{|T_3|}\right) \cdot \log(|V|) \right) \]

memory bits, where T_i denotes the set of node triples having i edges in the induced subgraph (see Definition 9.1.1). This is always better than the naive sampling algorithm, which requires

\[ O\left( \frac{1}{\varepsilon^2} \log\frac{1}{\delta} \left(1 + \frac{|T_0| + |T_1| + |T_2|}{|T_3|}\right) \cdot \log(|V|) \right) \]

memory bits, and it strongly improves the

\[ O\left( \frac{1}{\varepsilon^2} \cdot \log\frac{1}{\delta} \cdot \left(1 + \frac{|T_1| + |T_2|}{|T_3|}\right)^3 \cdot \log|V| \right) \]

solution provided in [117]. Comparing our results in this model with the previous work in [81], we obtain a one-pass algorithm that achieves the same space bound as, and a better update time than, the three-pass algorithm from [81]. The two other algorithms in [81] either require bounded maximum degree or are incomparable to our result because the space complexity depends on different parameters (e.g., the number of cycles of length 4 and 6 in the graph).

The number of memory bits used by our algorithm still depends on the value of |T_1|/|T_3|, which can be as large as O(|E|·|V|). Our method for graphs given in arbitrary order is therefore of practical interest for networks with a large number of triangles.

We then develop a method of greater practical relevance for graphs given as an incidence stream of edges, which uses

\[ O\left( \frac{1}{\varepsilon^2} \log\frac{1}{\delta} \log^2(|V|) \left(1 + \frac{|T_2|}{|T_3|}\right) \right) \]

bits. To give a flavor of the quality of our result, observe that |T_3|/|T_2| is exactly equal to 1/3 of the inverse of the transitivity coefficient of the graph, a universal measure closely related to the clustering coefficient, whose value for networks of practical interest is rarely bigger than 10^5. Our algorithmic results improve the result of [117], which requires

\[ O\left( \frac{1}{\varepsilon^2} \log\frac{1}{\delta} \log(|V|) \left(1 + \frac{|T_2|}{|T_3|}\right)^2 + d \log|V| \right) \]

memory bits (where d denotes the maximum degree of the graph), and improve over the naive sampling method.

Our method is suitable to be adapted to several other classes of subgraphs. As an example we provide an algorithm to estimate the number of cliques of size α in the incidence stream model that uses

\[ O\left( \frac{1}{\varepsilon^2} \log\frac{1}{\delta} \log^2(|V|) \left(1 + \frac{|S_\alpha|}{|K_\alpha|}\right) \right) \]

memory bits, where S_α is the set of stars of α nodes and K_α is the set of cliques of size α in the graph.

Denote by K_{i,j} the set of complete bipartite cliques in the graph where each of i vertices links to all of j other vertices. As a last contribution we provide a data stream algorithm that approximates the number of K_{3,3} of the graph in the incidence stream model ordered by destination nodes with outdegree bounded by ∆, which needs

\[ O\left( \frac{\log^2(|V|) \cdot |K_{3,1}| \cdot \Delta^2 \ln\frac{1}{\delta}}{|K_{3,3}| \cdot \varepsilon^2} \right) \]

memory bits.

In Section 9.1 we present our algorithm to count the number of triangles in the adjacency stream model (as defined in Section 2.2). For input graphs given as an incidence stream (see Section 2.2), we develop a better algorithm to count the number of triangles in Section 9.2. This algorithm will be generalized to cliques of any size in Section 9.3. Section 9.4 presents an algorithm to count bipartite cliques K_{3,3} in the incidence stream model.

9.1 Counting Triangles in Adjacency Streams

We consider an undirected graph G = (V, E) without self-loops. Each edge is an unordered pair of nodes (v, w) such that (v, w) = (w, v). We assume that V = {1, . . . , n} and n is known in advance, and that G is then given as an adjacency stream consisting of all edges in the graph as defined in Section 2.2. The edges appear in arbitrary order and no edge is repeated in the stream. There is no bound on the degree of the nodes.

Definition 9.1.1 (Node triples, T_0, T_1, T_2, T_3) We define a node triple as a set {v_1, v_2, v_3} ⊂ V consisting of exactly 3 different nodes of V. We partition the set of node triples into four sets T_0, T_1, T_2, and T_3. A node triple {v_1, v_2, v_3} belongs to

• T_0 iff no edge exists between the nodes v_1, v_2, and v_3,

• T_1 iff exactly one of the edges (v_1, v_2), (v_2, v_3), and (v_3, v_1) exists,

• T_2 iff exactly two of the edges (v_1, v_2), (v_2, v_3), and (v_3, v_1) exist,

• T_3 iff all of the edges (v_1, v_2), (v_2, v_3), and (v_3, v_1) exist (i.e. iff {v_1, v_2, v_3} is a triangle).

Therefore |T_3| denotes the number of triangles in G.

The algorithms we present here find a (1±ε)-approximation of |T_3| using local memory of size

\[ \Theta\left( \frac{1}{\varepsilon^2} \left(1 + \frac{|T_1| + |T_2|}{|T_3|}\right) \cdot \log|V| \right). \]

In social graphs and the webgraph the value (|T_1| + |T_2|)/|T_3| usually is O(|V|).

9.1.1 3 Pass Algorithm

We will first present an algorithm which passes three times over the stream and computes a (1±ε)-approximation of the number of triangles. A different algorithm with the same space complexity has been presented in [81]. However, our algorithm has a significantly improved update time and, as we show later, we can combine the three passes into a one-pass algorithm.

We introduce a streaming algorithm SAMPLETRIANGLE, which outputs a {0, 1} variable β with expected value 3|T_3|/(|T_1| + 2·|T_2| + 3·|T_3|). The algorithm is given in Figure 9.1.

It is easy to see that each of the passes can be implemented in a single pass over the set of edges (i.e., the input stream) using O(1) memory cells, each storing O(log |V|) bits.


SAMPLETRIANGLE
    1st pass:
        Count the number of edges |E| in the stream
    2nd pass:
        Sample an edge e = (a, b) uniformly chosen from E
        Choose a node v uniformly from V \ {a, b}.
    3rd pass:
        if (a, v) ∈ E ∧ (b, v) ∈ E then set β ← 1
        else set β ← 0
    return β

Figure 9.1: The 3-pass algorithm SAMPLETRIANGLE for adjacency streams

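Lemma 9.1.2 Algorithm SAMPLETRIANGLE outputs a value $\beta$ with expected value

$$E[\beta] = \frac{3|T_3|}{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|} .$$

Furthermore

$$|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3| = |E| \cdot (|V| - 2)$$

and

$$|T_3| = E[\beta] \cdot |E| \cdot (|V| - 2)/3 .$$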

Proof : We look at all node triples. Each triple belongs to one of the sets $T_0$, $T_1$, $T_2$, or $T_3$. The algorithm chooses such a triple by choosing an edge $e = (a, b)$ together with one node $v \in V \setminus \{a, b\}$. Therefore, no triple from $T_0$ is chosen.

We will show how many different choices of one edge and one node choose a node triple belonging to $T_1$ (resp. $T_2$, $T_3$). Since all pairs of one edge and one node have the same probability to be chosen, we can then compute the probability to select a triangle.

Let us denote by $t = \{w_1, w_2, w_3\}$ a fixed triple from $T_1$. Wlog. let $(w_1, w_2) \in E$ and so $(w_2, w_3), (w_3, w_1) \notin E$. The algorithm chooses $t$ iff it samples edge $(w_1, w_2)$ and vertex $w_3$. Therefore there are exactly $|T_1|$ choices of an edge and a node that select a triple from $T_1$.

Now assume $t \in T_2$. Then $t$ is chosen by SAMPLETRIANGLE iff one of the two edges in the triple is sampled and $v$ equals the remaining node of the triple. Therefore there are exactly $2 \cdot |T_2|$ choices of an edge and a node that select a triple from $T_2$.

For the same reason, a triple in $T_3$ is chosen whenever one of its three edges and the remaining vertex are chosen. Therefore there are exactly $3 \cdot |T_3|$ choices of an edge and a node that select a triple from $T_3$. These are exactly the choices that lead to $\beta = 1$.

We conclude that

$$E[\beta] = \frac{3|T_3|}{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|} .$$


Since there are $|E| \cdot (|V| - 2) = |T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|$ choices to sample an edge and a node, it follows that $|T_3| = E[\beta] \cdot |E| \cdot (|V| - 2)/3$. □
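The counting argument of Lemma 9.1.2 can be checked by brute force on a small graph. The following Python sketch is only an illustration under stated assumptions: the function names `triple_classes` and `exact_expectation` are hypothetical, and the whole graph is held in memory, which a streaming algorithm would not do.

```python
from itertools import combinations

def triple_classes(nodes, edges):
    """Return [|T0|, |T1|, |T2|, |T3|] for an undirected simple graph."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(nodes, 3):
        # number of the three possible edges that are present in the triple
        present = sum(frozenset(p) in edge_set for p in combinations(triple, 2))
        counts[present] += 1
    return counts

def exact_expectation(nodes, edges):
    """Enumerate every (edge, node) choice of SAMPLETRIANGLE.

    Returns (number of choices with beta = 1, total number of choices).
    """
    edge_set = {frozenset(e) for e in edges}
    hits = total = 0
    for a, b in edges:
        for v in nodes:
            if v in (a, b):
                continue
            total += 1
            if frozenset((a, v)) in edge_set and frozenset((b, v)) in edge_set:
                hits += 1
    return hits, total
```

For a triangle {0, 1, 2} with a path 2-3-4 attached, the enumeration gives $|T_1| = 6$, $|T_2| = 3$, $|T_3| = 1$, and both sides of the identity equal 15.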

A streaming algorithm COUNTTRIANGLES, which outputs an estimate of $|T_3|$, easily follows. It can be adjusted using an input parameter $s$.

COUNTTRIANGLES ($s \in \mathbb{N}$)
    Run $s$ instances of SAMPLETRIANGLE in parallel.
    Let $\beta_i$ be the value returned by the $i$th instance.
    $\tilde{T}_3 \leftarrow \left(\frac{1}{s} \sum_{i=1}^{s} \beta_i\right) \cdot |E| \cdot (|V| - 2)/3$.
    return $\tilde{T}_3$.

Figure 9.2: The 3-pass algorithm COUNTTRIANGLES for adjacency streams
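Figures 9.1 and 9.2 can be sketched in Python as follows. This is a minimal illustration, not the implementation analysed in this chapter: it assumes that `stream` is a callable returning a fresh iterator over the edges (one call per pass), that the graph has at least one edge and three nodes, and it runs the instances one after another instead of in parallel. Choosing $v$ from an explicit node list is also a simplification. All names are hypothetical.

```python
import random

def sample_triangle(stream, nodes, rng):
    """One run of SAMPLETRIANGLE; returns (beta, |E|)."""
    m = sum(1 for _ in stream())                 # 1st pass: count |E|
    target = rng.randrange(m)                    # index of the uniformly sampled edge
    for idx, (a, b) in enumerate(stream()):      # 2nd pass: fetch that edge
        if idx == target:
            break
    v = rng.choice([x for x in nodes if x not in (a, b)])
    need = {frozenset((a, v)), frozenset((b, v))}
    for e in stream():                           # 3rd pass: look for both closing edges
        need.discard(frozenset(e))
    return (0 if need else 1), m

def count_triangles(stream, nodes, s, seed=0):
    """Average s samples and rescale by |E| * (|V| - 2) / 3 as in Figure 9.2."""
    rng = random.Random(seed)
    total, m = 0, 0
    for _ in range(s):
        beta, m = sample_triangle(stream, nodes, rng)
        total += beta
    return total / s * m * (len(nodes) - 2) / 3
```

On the complete graph with four nodes every sample closes a triangle, so the estimate equals $6 \cdot 2 / 3 = 4$ for any $s$; on a star no sample does, and the estimate is 0.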

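Lemma 9.1.3 Algorithm COUNTTRIANGLES outputs a value $\tilde{T}_3$ having expected value $E[\tilde{T}_3] = |T_3|$. If $s \ge \frac{1}{\varepsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{2}{\delta}\right)$, then with probability $1 - \delta$ the algorithm outputs a value $\tilde{T}_3$ satisfying

$$(1 - \varepsilon) \cdot |T_3| \le \tilde{T}_3 \le (1 + \varepsilon) \cdot |T_3| .$$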

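Proof : We use Chernoff's Bounds [58]:

$$\Pr\left[\frac{1}{s} \sum_{i=1}^{s} \beta_i \ge (1 + \varepsilon) E[\beta]\right] < e^{-\varepsilon^2 \cdot E[\beta] \cdot s/3}$$

$$\Pr\left[\frac{1}{s} \sum_{i=1}^{s} \beta_i \le (1 - \varepsilon) E[\beta]\right] < e^{-\varepsilon^2 \cdot E[\beta] \cdot s/2}$$

For $s \ge \frac{1}{\varepsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{2}{\delta}\right)$ the sum of both probabilities is bounded by $\delta$. The lemma follows now from Lemma 9.1.2 stating that $|T_3| = E[\beta] \cdot |E| \cdot (|V| - 2)/3$. □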
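The step from the bound on $s$ to the bound $\delta$ can be spelled out as follows. This is a sketch that uses only the expected value from Lemma 9.1.2 and the two tail bounds above.

```latex
\varepsilon^2 \cdot E[\beta] \cdot \frac{s}{3}
  = \varepsilon^2 \cdot \frac{|T_3|}{|T_1| + 2|T_2| + 3|T_3|} \cdot s
  \;\ge\; \ln\frac{2}{\delta}
\quad\Longrightarrow\quad
e^{-\varepsilon^2 E[\beta] s/3} \le \frac{\delta}{2},
\qquad
e^{-\varepsilon^2 E[\beta] s/2} \le \left(\frac{\delta}{2}\right)^{3/2} \le \frac{\delta}{2}.
```

Adding both tail probabilities gives at most $\delta$.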

Lemma 9.1.3 guarantees that algorithm COUNTTRIANGLES outputs a value $\tilde{T}_3$ having the right expectation.

However, there could be applications which need guaranteed approximation bounds. When we run COUNTTRIANGLES with a predefined number of instances $s$, it outputs an estimate $\tilde{T}_3$ of $|T_3|$, but the requirement $s \ge \frac{1}{\varepsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{2}{\delta}\right)$ of Lemma 9.1.3 is impossible to check, because it depends on the unknown value $|T_3|$. We cannot find out whether we used enough instances to ensure a $(1 \pm \varepsilon)$-approximation.

To overcome this problem we later develop a new algorithm COUNTTRIANGLESSAFE based on COUNTTRIANGLES. It outputs a pair $(\tilde{T}_3, \tilde{\varepsilon})$ and guarantees that $\tilde{T}_3$ is a $(1 \pm \tilde{\varepsilon})$-approximation of $|T_3|$ with probability $1 - \delta$.


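COUNTTRIANGLESSAFE ($s \in \mathbb{N}$, $\psi \in (0, 1)$)
    Set $r \leftarrow \lceil 2 \cdot \log(2/\psi) \rceil$.
    Set $t \leftarrow \lceil s/(4 \log(2/\psi)) \rceil$.
    Run $r$ instances of COUNTTRIANGLES($t$) in parallel.
    Let $\tilde{T}_3^{(i)}$ be the value returned by the $i$th instance.
    Set $\tilde{T}_3 \leftarrow \mathrm{median}_i(\tilde{T}_3^{(i)})$.
    Set $\tilde{\varepsilon} \leftarrow \sqrt{\frac{176}{\tilde{T}_3} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}} = \sqrt{\frac{176}{\tilde{T}_3} \cdot \frac{|E| \cdot (|V| - 2)}{s} \cdot \log\frac{2}{\psi}}$.
    return $(\tilde{T}_3, \tilde{\varepsilon})$.

Figure 9.3: The 3-pass algorithm COUNTTRIANGLESSAFE for adjacency streams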

Let us first analyse the update time of COUNTTRIANGLES. As before we analyse time in the Real RAM model. Note that we could also show our results in a RAM model, assuming that all stored values (each needing $O(\log |V|)$ memory bits to be stored) fit into one single register.

If we implement the different instances of our algorithm independently of each other, we require $O(s)$ time to process each edge during the third pass. We show how to reduce this to expected constant time. Before we invoke the third pass, we collect all edge-vertex pairs chosen by different instances of the algorithm. For each pair with edge $e = (a, b)$ and vertex $v$ we would like to find out whether $(a, v)$ and $(b, v)$ are in $E$. Therefore, we construct a set $M$ of missing edges that for each such edge-vertex pair contains the edges $(a, v)$ and $(b, v)$. Next, we construct a hash table for $M$ using a uniform hash function that requires linear space, as proposed in [107]. Now we can implement the third pass in the following way. For each edge $e$, we look up whether it is in the set $M$. If $e \in M$ we mark it. These steps can both be done in expected constant time. In a postprocessing step we can then determine the edge-vertex pairs that are triangles.
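The batched third pass described above can be sketched as follows. The sketch assumes that Python's built-in `dict` stands in for the uniform hashing scheme of [107], and it credits the waiting samples directly when an edge of $M$ is seen instead of marking the edge and resolving it in a postprocessing step. The name `third_pass` and the representation of samples as `((a, b), v)` tuples are hypothetical.

```python
def third_pass(stream, samples):
    """samples: list of ((a, b), v) pairs; returns the list of beta values."""
    waiting = {}                                   # missing edge -> indices of samples waiting for it
    for idx, ((a, b), v) in enumerate(samples):
        for edge in (frozenset((a, v)), frozenset((b, v))):
            waiting.setdefault(edge, []).append(idx)
    seen = [0] * len(samples)                      # closing edges found so far, per sample
    for e in stream:
        # one hash lookup per stream edge; each waiting entry is visited at most once
        for idx in waiting.pop(frozenset(e), ()):
            seen[idx] += 1
    return [1 if c == 2 else 0 for c in seen]
```

Every stream edge costs one lookup, and the total extra work over the whole pass is proportional to the $2s$ entries stored in the table.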

We now develop an algorithm COUNTTRIANGLESSAFE based on COUNTTRIANGLES. It has a number $s$ and a desired error probability $\psi$ as input parameters and outputs a pair $(\tilde{T}_3, \tilde{\varepsilon})$. The algorithm gives the guarantee that $\tilde{T}_3$ is a $(1 \pm \tilde{\varepsilon})$-approximation of $|T_3|$ with probability $1 - \psi$. It runs at most $s$ instances of SAMPLETRIANGLE in parallel. We show the pseudocode of COUNTTRIANGLESSAFE in Figure 9.3.
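The parameter choice and the final combination step of Figure 9.3 can be sketched as follows. The sketch assumes that $\log$ denotes the logarithm to base 2, which is what the bound $(1/2)^{r/2} \le \psi/2$ in the analysis requires. Returning an infinite $\tilde{\varepsilon}$ when the median is 0 is an assumption of this sketch and is not covered by the figure. The function names are hypothetical.

```python
import math
import statistics

def safe_parameters(s, psi):
    """Number of groups r and group size t as chosen in Figure 9.3."""
    r = math.ceil(2 * math.log2(2 / psi))
    t = math.ceil(s / (4 * math.log2(2 / psi)))
    return r, t

def combine_estimates(estimates, num_edges, num_nodes, s, psi):
    """Turn the r outputs of COUNTTRIANGLES(t) into the pair (T3~, eps~)."""
    t3 = statistics.median(estimates)
    if t3 <= 0:
        return t3, math.inf          # assumption: no finite guarantee without a sampled triangle
    eps = math.sqrt(176 / t3 * num_edges * (num_nodes - 2) / s * math.log2(2 / psi))
    return t3, eps
```

For example, $s = 1000$ and $\psi = 1/2$ give $r = 4$ groups of $t = 125$ instances each, so $r \cdot t = 500 \le s$.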

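Lemma 9.1.4 Let $(\tilde{T}_3, \tilde{\varepsilon})$ be the output of algorithm COUNTTRIANGLESSAFE. With probability $1 - \psi$ the following statements are true:

• $(1 - \tilde{\varepsilon}) \cdot |T_3| < \tilde{T}_3 < (1 + \tilde{\varepsilon}) \cdot |T_3|$

• If $s \ge \frac{352}{\varepsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \log(2/\psi)$, then the algorithm outputs $(\tilde{T}_3, \tilde{\varepsilon})$ with $\tilde{\varepsilon} \le \varepsilon$.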

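Proof : COUNTTRIANGLESSAFE outputs $(\tilde{T}_3, \tilde{\varepsilon})$ with $\tilde{\varepsilon} = \sqrt{\frac{176}{\tilde{T}_3} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}}$.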


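Therefore we have

$$\tilde{T}_3 = \frac{176}{\tilde{\varepsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi} .$$

Because of the choice of $t$ it follows

$$\tilde{T}_3 \ge \frac{44}{\tilde{\varepsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{t} . \qquad (9.1)$$

From Markov's inequality it follows that

$$\forall i \in \{1, \dots, r\}: \quad \Pr\left[\tilde{T}_3^{(i)} \ge 11 \cdot E[\tilde{T}_3^{(i)}]\right] \le \frac{1}{11} .$$

Since for all $i$ we have $E[\tilde{T}_3^{(i)}] = |T_3|$ it follows:

$$\forall i \in \{1, \dots, r\}: \quad \Pr\left[\tilde{T}_3^{(i)} \ge 11 \cdot |T_3|\right] \le \frac{1}{11} .$$

Because of $\tilde{T}_3 = \mathrm{median}_i(\tilde{T}_3^{(i)})$ we can only have $\tilde{T}_3 \ge 11 \cdot |T_3|$ if for at least $r/2$ values of $i$ we have $\tilde{T}_3^{(i)} \ge 11 \cdot |T_3|$. For each single value of $i$ the probability for that is at most $1/11$ as shown above. The probability to have at least $r/2$ values of $i$ fulfilling that inequality is therefore bounded by:

$$\Pr\left[\tilde{T}_3 \ge 11 \cdot |T_3|\right] \le \binom{r}{r/2} \cdot \left(\frac{1}{11}\right)^{r/2} \le \left(\frac{e \cdot r}{r/2}\right)^{r/2} \cdot \left(\frac{1}{11}\right)^{r/2} = \left(\frac{2e}{11}\right)^{r/2} \le \left(\frac{1}{2}\right)^{r/2} \le \psi/2 .$$

From inequality (9.1) we conclude

$$\Pr\left[\frac{44}{\tilde{\varepsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{t} \ge 11 \cdot |T_3|\right] \le \psi/2$$

and finally

$$\Pr\left[|T_3| \le \frac{4}{\tilde{\varepsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{t}\right] \le \psi/2 .$$

In the following we therefore condition on the event that $|T_3| > \frac{4}{\tilde{\varepsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{t}$ (which happens with probability at least $1 - \psi/2$). Then for $\delta := 1/11$ we have $\ln(2/\delta) \le 4$ and

$$t \ge \frac{1}{\tilde{\varepsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\frac{2}{\delta} .$$

Therefore by Lemma 9.1.3 each instance of COUNTTRIANGLES outputs a value $\tilde{T}_3^{(i)} \in (1 \pm \tilde{\varepsilon}) \cdot |T_3|$ with probability at least $1 - \delta = 1 - \frac{1}{11}$.

Since $\tilde{T}_3$ is set to the median of all $\tilde{T}_3^{(i)}$, we can only have $\tilde{T}_3 \notin (1 \pm \tilde{\varepsilon}) \cdot |T_3|$ if for at least $r/2$ of the $\tilde{T}_3^{(i)}$ we have $\tilde{T}_3^{(i)} \notin (1 \pm \tilde{\varepsilon}) \cdot |T_3|$. The probability for that is bounded by:

$$\Pr\left[\tilde{T}_3 \notin (1 \pm \tilde{\varepsilon}) \cdot |T_3|\right] \le \binom{r}{r/2} \cdot \left(\frac{1}{11}\right)^{r/2} \le \left(\frac{e \cdot r}{r/2}\right)^{r/2} \cdot \left(\frac{1}{11}\right)^{r/2} = \left(\frac{2e}{11}\right)^{r/2} \le \left(\frac{1}{2}\right)^{r/2} \le \psi/2 .$$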


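The first statement of the lemma follows directly.

We now show the second statement. We condition on the event that $\tilde{T}_3 \le (1 + \tilde{\varepsilon}) |T_3|$, which happens with probability $1 - \psi$ by the first statement of the lemma.

If $s \ge \frac{352}{\varepsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \log\frac{2}{\psi}$ then we have

$$|T_3| \ge \frac{352}{\varepsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}$$

and therefore

$$\tilde{T}_3 \ge \frac{|T_3|}{2} \ge \frac{176}{\varepsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi} .$$

It follows directly:

$$\varepsilon \ge \sqrt{\frac{176}{\tilde{T}_3} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}} = \tilde{\varepsilon} .$$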

Notice that $r \cdot t \le s$ for $s \ge 8 \log(2/\psi)$. Therefore we have at most $s$ instances of SAMPLETRIANGLE running in parallel. The time COUNTTRIANGLESSAFE($s$) needs to process an edge in the stream is therefore at most the time COUNTTRIANGLES($s$) needs.

We summarize our results in the following theorem. We remark that we significantly improve the update time over the previously best result from [81] while achieving the same space complexity. The update time in [81] is roughly proportional to the space complexity, compared to expected constant time for our algorithm.

Theorem 24 There is a 3-pass streaming algorithm to count the number of triangles in a stream of edges up to a multiplicative error of $1 \pm \varepsilon$ with probability at least $1 - \psi$, which needs $O\left(\frac{1}{\varepsilon^2} \cdot \log\left(\frac{1}{\psi}\right) \cdot \left(1 + \frac{|T_1| + |T_2|}{|T_3|}\right) \cdot \log |V|\right)$ memory bits and constant expected update time. □

9.1.2 1 Pass Algorithm

In this section we show that the previous 3-pass algorithms can be implemented in one pass using the same amount of space and constant expected amortized update time, if $|E|$ is significantly larger than the number of instances we run.

We first show how to adapt algorithm SAMPLETRIANGLE. We observe that we can find a random edge in one pass by reservoir sampling [120], i.e. choosing the first edge as a sample edge and replacing this edge by the $i$th edge of the stream with probability $1/i$. It is known that this method can be implemented in $O(\log |V|)$ expected time per sample (not counting the time to read the stream) by randomly choosing the next index of the replacing edge according to an appropriate probability distribution.


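SAMPLETRIANGLEONEPASS
    $i \leftarrow 1$
    for each edge $e = (u, w)$ in the stream do
        Flip a coin. With probability $1/i$ do
            $a \leftarrow u$; $b \leftarrow w$
            $v \leftarrow$ node uniformly chosen from $V \setminus \{a, b\}$
            $x \leftarrow$ false; $y \leftarrow$ false
        end do
        if $e = (a, v)$ then $x \leftarrow$ true
        if $e = (b, v)$ then $y \leftarrow$ true
        $i \leftarrow i + 1$
    end for
    if $x =$ true $\wedge$ $y =$ true then return $\beta \leftarrow 1$ else return $\beta \leftarrow 0$.

Figure 9.4: The 1-pass algorithm SAMPLETRIANGLEONEPASS for adjacency streams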

We combine this with the third pass and obtain algorithm SAMPLETRIANGLEONEPASS as shown in Figure 9.4. It may happen that we sample an edge $e = (a, b)$ of the stream together with a node $v$, but we do not see the edge $(a, v)$ or $(b, v)$ in the subsequent stream (because they appeared before the edge $e$). In this case, we do not detect $a, b, v$ as a triangle. However, we detect $a, b, v$ iff $(a, b)$ is the first edge of the triangle that appears in the stream. This changes the expected value of $\beta$ by a factor of 3.
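Figure 9.4 translates almost line by line into Python. The following is a minimal sketch, assuming an in-memory node list from which $v$ is drawn and an `rng` object offering `random()` and `choice()` as in Python's `random.Random`; the function name is hypothetical. It flips one coin per edge and does not use the faster skip-based reservoir sampling mentioned above.

```python
import random

def sample_triangle_one_pass(stream, nodes, rng):
    """Single pass: reservoir-sample an edge, then watch for the two closing edges."""
    a = b = v = None
    x = y = False
    for i, (u, w) in enumerate(stream, start=1):
        if rng.random() < 1.0 / i:           # replace the sample with probability 1/i
            a, b = u, w
            v = rng.choice([n for n in nodes if n not in (a, b)])
            x = y = False                    # closing edges seen earlier do not count
        else:
            e = frozenset((u, w))
            x = x or e == frozenset((a, v))
            y = y or e == frozenset((b, v))
    return 1 if x and y else 0
```

A call such as `sample_triangle_one_pass(edges, nodes, random.Random(1))` returns 1 only if the finally kept edge is the first edge of a triangle in stream order and $v$ is the third node of that triangle.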

Lemma 9.1.5 Algorithm SAMPLETRIANGLEONEPASS outputs a value $\beta$ having expected value

$$E[\beta] = \frac{|T_3|}{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|} .$$

Proof : The proof is similar to the proof of Lemma 9.1.2, taking into account that only 1/3 of the choices that select a $T_3$-triple actually detect a triangle and lead to value $\beta = 1$. □

An algorithm COUNTTRIANGLESONEPASS outputting an estimate $\tilde{T}_3$ of $|T_3|$ can be developed similarly to COUNTTRIANGLES (see Figure 9.5).

Lemma 9.1.6 Algorithm COUNTTRIANGLESONEPASS outputs a value $\tilde{T}_3$ having expected value $E[\tilde{T}_3] = |T_3|$. If $s \ge \frac{3}{\varepsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{2}{\delta}\right)$, then with probability $1 - \delta$ the algorithm outputs a value $\tilde{T}_3$ satisfying

$$(1 - \varepsilon) \cdot |T_3| \le \tilde{T}_3 \le (1 + \varepsilon) \cdot |T_3| .$$

Proof : The proof is similar to the proof of Lemma 9.1.3 (with an additional factor of 3 in most formulas). □


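COUNTTRIANGLESONEPASS ($s \in \mathbb{N}$)
    Run $s$ instances of SAMPLETRIANGLEONEPASS in parallel.
    Let $\beta_i$ be the value returned by the $i$th instance.
    $\tilde{T}_3 \leftarrow \left(\frac{1}{s} \sum_{i=1}^{s} \beta_i\right) \cdot |E| \cdot (|V| - 2)$.
    return $\tilde{T}_3$.

Figure 9.5: The 1-pass algorithm COUNTTRIANGLESONEPASS for adjacency streams

COUNTTRIANGLESONEPASSSAFE ($s \in \mathbb{N}$, $\psi \in (0, 1)$)
    Set $r \leftarrow \lceil 2 \cdot \log(2/\psi) \rceil$.
    Set $t \leftarrow \lceil s/(4 \log(2/\psi)) \rceil$.
    Run $r$ instances of COUNTTRIANGLESONEPASS($t$) in parallel.
    Let $\tilde{T}_3^{(i)}$ be the value returned by the $i$th instance.
    Set $\tilde{T}_3 \leftarrow \mathrm{median}_i(\tilde{T}_3^{(i)})$.
    Set $\tilde{\varepsilon} \leftarrow \sqrt{\frac{528}{\tilde{T}_3} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}} = \sqrt{\frac{528}{\tilde{T}_3} \cdot \frac{|E| \cdot (|V| - 2)}{s} \cdot \log\frac{2}{\psi}}$.
    return $(\tilde{T}_3, \tilde{\varepsilon})$.

Figure 9.6: The 1-pass algorithm COUNTTRIANGLESONEPASSSAFE for adjacency streams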

Similarly to COUNTTRIANGLESSAFE we can develop a one-pass algorithm COUNTTRIANGLESONEPASSSAFE (Figure 9.6) which outputs an approximation $\tilde{T}_3$ together with an approximation guarantee.

Lemma 9.1.7 Let $(\tilde{T}_3, \tilde{\varepsilon})$ be the output of algorithm COUNTTRIANGLESONEPASSSAFE. With probability $1 - \psi$ the following statements are true:

• $(1 - \tilde{\varepsilon}) \cdot |T_3| < \tilde{T}_3 < (1 + \tilde{\varepsilon}) \cdot |T_3|$

• If $s \ge \frac{1056}{\varepsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \log(2/\psi)$, then the algorithm outputs $(\tilde{T}_3, \tilde{\varepsilon})$ with $\tilde{\varepsilon} \le \varepsilon$.

Proof : The proof can be done similarly to the proof of Lemma 9.1.4 (with an additional factor of 3 in most formulas). □

By applying the reservoir sampling algorithm from [120] to select the edge, the selection requires $O(\log |V|)$ expected time for each instance of SAMPLETRIANGLEONEPASS for the whole stream. Additionally we use the hash table approach from the previous section to efficiently find instances of SAMPLETRIANGLEONEPASS which search for an edge in the stream. Altogether we get expected $O\left(1 + s \cdot \frac{\log |E|}{|E|}\right)$ update time per edge in the stream.


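Theorem 25 There is a 1-pass streaming algorithm to count the number of triangles in a stream of edges up to a multiplicative error of $1 \pm \varepsilon$ with probability at least $1 - \psi$, which needs $O\left(\frac{1}{\varepsilon^2} \cdot \log\left(\frac{1}{\psi}\right) \cdot \left(1 + \frac{|T_1| + |T_2|}{|T_3|}\right) \cdot \log |V|\right)$ memory bits and expected update time $O\left(1 + \frac{1}{\varepsilon^2} \cdot \frac{|T_1| + |T_2|}{|T_3|} \cdot \frac{\log |E|}{|E|} \cdot \log\frac{1}{\psi}\right)$.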

9.2 Counting Triangles in Incidence Streams

When a graph $G = (V, E)$ is coded as an incidence stream (see Section 2.2), all edges incident to the same vertex appear consecutively in the stream. First all edges incident to vertex $v_1$ arrive, followed by all edges incident to $v_2$, and so on. The ordering $v_1, \dots, v_n$ of the vertices can be arbitrary, i.e. determined by an adversary. We consider undirected graphs and so each edge appears twice (within the incidence list of both incident nodes). There is no bound on the degree of the nodes (in contrast to [117]).

Often large graphs (e.g. the webgraph) are stored on hard discs as incidence lists of edges. Our methods can therefore be used to approximate the number of triangles in these graphs using only sequential access to the data.

9.2.1 3 Pass Algorithm

We will again first develop a 3-pass algorithm, and later combine the passes to get a one-pass algorithm. Let $d_i$ denote the degree of node $v_i$. The 3-pass algorithm SAMPLETRIANGLE2 is presented in Figure 9.7. The algorithm SAMPLETRIANGLE2 can be implemented using $O(1)$ memory cells, each consisting of $O(\log |V|)$ bits.

SAMPLETRIANGLE2
    1st pass:
        Count the number $P$ of paths of length 2 in the graph $G$.
    2nd pass:
        Uniformly choose one of these paths using algorithm UNIFORMTWOPATH($P$).
        Let $(a, v, b)$ be this path.
    3rd pass:
        Test if edge $(a, b)$ appears within the stream.
        if $(a, b) \in E$ then set $\beta \leftarrow 1$
        else set $\beta \leftarrow 0$
    return $\beta$

Figure 9.7: The 3-pass algorithm SAMPLETRIANGLE2 for incidence streams


We observe that the number of paths of length 2 in the graph $G$ is exactly

$$P := |T_2| + 3 \cdot |T_3| = \sum_{i=1}^{|V|} d_i \cdot (d_i - 1)/2 .$$

Thus we can easily count the number of paths of length 2 by determining the degree of each node. This is possible because the edges appear as an incidence stream.
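The first pass can be sketched in a few lines of Python. The sketch assumes that the incidence stream is given as a sequence of pairs `(v, neighbour)` in which all pairs with the same first entry are consecutive; this representation and the name `count_two_paths` are assumptions made for illustration.

```python
from itertools import groupby

def count_two_paths(incidence_stream):
    """Sum d * (d - 1) / 2 over all incidence lists, using O(1) counters."""
    total = 0
    for _, group in groupby(incidence_stream, key=lambda e: e[0]):
        d = sum(1 for _ in group)        # degree of the current node
        total += d * (d - 1) // 2
    return total
```

For a triangle {0, 1, 2} with an extra edge (2, 3) the degrees are 2, 2, 3, 1, so $P = 1 + 1 + 3 + 0 = 5 = |T_2| + 3 \cdot |T_3|$ with $|T_2| = 2$ and $|T_3| = 1$.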

The second pass can be implemented using reservoir sampling. However, we propose a different approach which achieves slightly better amortized running time and is based on the following idea. If $v$ is incident to the nodes $w_1, w_2, \dots, w_d$, we define an order on the possible paths of length 2 with $v$ in the middle in the following way: $(w_1, v, w_2) < (w_1, v, w_3) < (w_2, v, w_3) < (w_1, v, w_4) < \dots$. The triples $(w_i, v, w_j)$ with $i < j$ are ordered first by $\max\{i, j\} = j$; ties are ordered by $i$.

We choose a value $k \in \{1, \dots, P\}$ uniformly at random and want to select the $k$-th triple $(w_i, v, w_j)$ in the order given above. The indices $i$ and $j$ can be computed from $k$ using the formulas given in Figure 9.8. The $k$-th triple is chosen if the node $v$ is in the middle of enough paths of length 2. Otherwise we search for the $(k - d_v \cdot (d_v - 1)/2)$-th path within the next incidence list.

The algorithm UNIFORMTWOPATH is presented in Figure 9.8.

UNIFORMTWOPATH($P$)
    Select value $k$ uniformly from the set $\{1, \dots, P\}$
    For each node $v$ in the incidence stream do
        If $k > 0$ then
            Set $j \leftarrow \left\lceil \sqrt{2k + \frac{1}{4}} + \frac{1}{2} \right\rceil$
            Set $i \leftarrow j - \frac{j^2 - j}{2} + k - 1$
            Pass over the complete incidence list of node $v$.
            If the incidence list of $v$ contains at least $j$ edges then
                $a \leftarrow$ the $i$th node in the incidence list of $v$
                $b \leftarrow$ the $j$th node in the incidence list of $v$
                $w \leftarrow v$
            end if
            $d \leftarrow$ degree of node $v$
            $k \leftarrow k - \frac{d^2 - d}{2}$
        end if
    end do
    return path $(a, w, b)$

Figure 9.8: The 1-pass algorithm UNIFORMTWOPATH for incidence streams
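The index computation of Figure 9.8 can be checked against an explicit enumeration of the order. The sketch below uses an integer-arithmetic variant of the ceiling formula (via `math.isqrt`), which is a substitution made here to avoid floating-point rounding for large $k$; the value of $i$ is computed exactly as in the figure. The name `kth_pair` is hypothetical.

```python
import math

def kth_pair(k):
    """1-based indices (i, j) with i < j of the k-th pair, ordered by j and then by i."""
    # j is the smallest integer with j * (j - 1) / 2 >= k
    j = (1 + math.isqrt(8 * k - 7)) // 2 + 1
    i = j - (j * j - j) // 2 + k - 1
    return i, j
```

For instance, `kth_pair(4)` yields `(1, 4)`, which corresponds to the path $(w_1, v, w_4)$ listed fourth in the order above.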


Lemma 9.2.1 Algorithm SAMPLETRIANGLE2 outputs a value $\beta$ with expected value

$$E[\beta] = \frac{3 \cdot |T_3|}{|T_2| + 3 \cdot |T_3|} .$$

Proof : We look at all triples of nodes in $V$. Each triple belongs to one of the sets $T_0$, $T_1$, $T_2$, or $T_3$. The algorithm chooses such a triple by choosing a node $v$ together with two adjacent edges. Therefore the selected triples belong to the set $T_2 \cup T_3$. We select a triple from $T_2$ if we choose the unique node adjacent to both edges and the corresponding edges. Therefore there are exactly $|T_2|$ choices that choose a triple belonging to $T_2$.

A triple from set $T_3$ can be chosen in three different ways by selecting one of the three nodes of the triple together with both adjacent edges. Since each choice of a path of length two has the same probability, the probability of choosing a triple in $T_3$ is exactly $3 \cdot |T_3| / (|T_2| + 3 \cdot |T_3|)$ as stated. □

A streaming algorithm COUNTTRIANGLES2, which outputs an estimate of $|T_3|$, easily follows. It can be adjusted using an input parameter $s$ and is given in Figure 9.9.

COUNTTRIANGLES2 ($s \in \mathbb{N}$)
    Run $s$ instances of SAMPLETRIANGLE2 in parallel.
    Let $\beta_i$ be the value returned by the $i$th instance.
    $\tilde{T}_3 \leftarrow \left(\frac{1}{s} \sum_{i=1}^{s} \beta_i\right) \cdot \frac{|T_2| + 3 \cdot |T_3|}{3} = \left(\frac{1}{s} \sum_{i=1}^{s} \beta_i\right) \cdot \sum_{v \in V} d_v \cdot (d_v - 1)/6$.
    return $\tilde{T}_3$.

Figure 9.9: The 3-pass algorithm COUNTTRIANGLES2 for incidence streams

Lemma 9.2.2 Algorithm COUNTTRIANGLES2 outputs a value $\tilde{T}_3$ having expected value $E[\tilde{T}_3] = |T_3|$. If $s \ge \frac{1}{\varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{2}{\delta}\right)$, then with probability $1 - \delta$ the algorithm outputs a value $\tilde{T}_3$ satisfying

$$(1 - \varepsilon) \cdot |T_3| \le \tilde{T}_3 \le (1 + \varepsilon) \cdot |T_3| .$$

Proof : Equivalent to the proof of Lemma 9.1.3. □

We can again develop an algorithm COUNTTRIANGLESSAFE2 based on COUNTTRIANGLES2. It has a number $s$ and a desired error probability $\psi$ as input parameters and outputs a pair $(\tilde{T}_3, \tilde{\varepsilon})$ where $\tilde{\varepsilon}$ signals that $\tilde{T}_3$ is a $(1 \pm \tilde{\varepsilon})$-approximation of $|T_3|$. It uses at most $s$ parallel instances of SAMPLETRIANGLE2. We show the pseudocode of COUNTTRIANGLESSAFE2 in Figure 9.10.

Lemma 9.2.3 Let $(\tilde{T}_3, \tilde{\varepsilon})$ be the output of algorithm COUNTTRIANGLESSAFE2. With probability $1 - \psi$ the following statements are true:

• $(1 - \tilde{\varepsilon}) \cdot |T_3| < \tilde{T}_3 < (1 + \tilde{\varepsilon}) \cdot |T_3|$


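COUNTTRIANGLESSAFE2 ($s \in \mathbb{N}$, $\psi \in (0, 1)$)
    Set $r \leftarrow \lceil 2 \cdot \log(2/\psi) \rceil$.
    Set $t \leftarrow \lceil s/(4 \log(2/\psi)) \rceil$.
    Run $r$ instances of COUNTTRIANGLES2($t$) in parallel.
    Let $\tilde{T}_3^{(i)}$ be the value returned by the $i$th instance.
    Set $\tilde{T}_3 \leftarrow \mathrm{median}_i(\tilde{T}_3^{(i)})$.
    Set $\tilde{\varepsilon} \leftarrow \sqrt{\frac{176}{\tilde{T}_3} \cdot \frac{|T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}} = \sqrt{\frac{88}{\tilde{T}_3} \cdot \frac{\sum_{v \in V} d_v \cdot (d_v - 1)}{s} \cdot \log\frac{2}{\psi}}$.
    return $(\tilde{T}_3, \tilde{\varepsilon})$.

Figure 9.10: The 3-pass algorithm COUNTTRIANGLESSAFE2 for incidence streams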

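• If $s \ge \frac{352}{\varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \log(2/\psi)$, then the algorithm outputs $(\tilde{T}_3, \tilde{\varepsilon})$ with $\tilde{\varepsilon} \le \varepsilon$.

Proof : Equivalent to the proof of Lemma 9.1.4. □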

To get small amortized expected update time we proceed as follows. Each time the incidence list of a new vertex starts, we compute the values $i$ and $j$ for every instance. Then we insert the $j$-values into a global priority queue, keeping a pointer to the corresponding instance. While we process the incidence list of the current vertex we maintain a global counter for the number of neighbors of the current vertex we have seen. If this number is equal to the smallest value stored in the priority queue we remove it and process the corresponding instance. After the incidence list has been processed, we empty the priority queue. This way, each instance of the algorithm requires $O(1)$ time per vertex. Additionally, we need $O(s \cdot \log |V|)$ time to process the removal of the smallest element in the priority queue. Overall, the amortized cost of the second pass is $O\left(1 + s \cdot \frac{|V|}{|E|}\right)$, which is constant for moderately large values of $|E|$. To implement the third pass we use hashing in a similar way as in the algorithm for adjacency streams. This leads to expected constant update time for the third pass.

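Theorem 26 There is a 3-pass streaming algorithm to count the number of triangles in incidence streams up to a multiplicative error of $1 \pm \varepsilon$ with probability at least $1 - \psi$, which needs

$$O\left(\frac{1}{\varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \log\frac{1}{\psi} \cdot \log |V|\right)$$

memory bits and amortized expected update time

$$O\left(1 + \frac{1}{\varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \log\frac{1}{\psi} \cdot \frac{|V|}{|E|}\right) .$$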


9.2.2 1 Pass Algorithm

To get a one-pass algorithm we will again combine the passes of SAMPLETRIANGLE2. The first pass only counts the number $P$ of paths of length 2 in the graph. Instead of counting this number in advance, we will start an instance of the streaming algorithm for each guess $\tilde{P}$ of the number of length-2 paths in the set $\{1, 2, 4, 8, \dots, |V|^3\}$. In parallel we will count $P$. At the end we can find one instance started with a value $\tilde{P}$ satisfying $P \le \tilde{P} < 2P$. We choose the result of this instance as the result of our algorithm.

We have to develop a data stream algorithm which only relies on an estimate $\tilde{P}$ fulfilling $P \le \tilde{P} < 2P$.
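Once $P$ is known at the end of the stream, the matching guess is the smallest power of two that is at least $P$. A small Python sketch, assuming $P \ge 1$ and using the hypothetical helper name `matching_guess`:

```python
def matching_guess(p):
    """Return (i, 2**i) for the smallest i with 2**i >= p, assuming p >= 1."""
    i = (p - 1).bit_length()     # equals ceil(log2(p)) for p >= 1
    return i, 1 << i
```

For $P = 5$ this selects the instance with $i = 3$ and $\tilde{P} = 8$, which satisfies $5 \le 8 < 10$.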

To combine the second and third pass we only test all edges seen after drawing the sample. The algorithm SAMPLETRIANGLEONEPASS2 is given in Figure 9.11 and can be implemented using $O(\log^2 |V|)$ memory bits.

SAMPLETRIANGLEONEPASS2
    do the following things in parallel for $i = 0, 1, 2, \dots, \lfloor \log(|V|^3) \rfloor$:
        Let $\tilde{P}_i := 2^i$.
        Uniformly choose one path of length 2 using algorithm UNIFORMTWOPATH($\tilde{P}_i$).
        If UNIFORMTWOPATH did not select a path until the end of the stream, then return $\perp$.
        Let $(a, v, b)$ be the selected path.
        After choosing the path, test if edge $(a, b)$ appears within the rest of the stream.
        if $(a, b)$ appears in the stream after the incidence list of $v$ then set $\beta_i \leftarrow 1$
        else set $\beta_i \leftarrow 0$
    in parallel to the for loop do: count the number $P = \sum_v d_v (d_v - 1)/2$.
    set $\beta \leftarrow \beta_{\lceil \log P \rceil}$
    return $\beta$

Figure 9.11: The 1-pass algorithm SAMPLETRIANGLEONEPASS2 for incidence streams

Lemma 9.2.4 Algorithm SAMPLETRIANGLEONEPASS2 outputs with probability at least 1/2 a value $\beta$ having expected value

$$E[\beta] = \frac{2 \cdot |T_3|}{|T_2| + 3 \cdot |T_3|} .$$

Otherwise it outputs the value $\perp$.

Proof : We set $\beta = \beta_i$ with $i = \lceil \log P \rceil$ at the end of the algorithm. This value of $\beta_i$ has been set to 0 or 1 if UNIFORMTWOPATH($\tilde{P}_i$) did select a path of length two. Since $\tilde{P}_i = 2^{\lceil \log P \rceil}$ we have $P \le \tilde{P}_i < 2 \cdot P$.

UNIFORMTWOPATH($\tilde{P}_i$) selects a path of length two by first choosing $k \in \{1, \dots, \tilde{P}_i\}$ uniformly at random and then selecting the $k$-th path of length two in the stream. If $k \le \tilde{P}_i/2$ we have $k \le P$ and therefore a path is selected. This happens with probability 1/2 and in that case SAMPLETRIANGLEONEPASS2 does not output $\perp$.

Let us now condition on the event that $\beta \ne \perp$ and analyse the expected value of $\beta$.

Let $\{a, b, c\}$ be a fixed triangle. Wlog. we assume that we see the incidence list of $a$ first in the stream, then the incidence list of $b$ and then the incidence list of $c$.

In the algorithm SAMPLETRIANGLE2 we detected the triangle by selecting $(a, b, c)$, $(c, a, b)$, or $(b, c, a)$ as the path of length two.

Now only the selections of $(a, b, c)$ or $(c, a, b)$ lead to a detection of the triangle (because the edge $(c, a)$ resp. $(c, b)$ appears in the incidence stream after selecting $b$ resp. $a$). The selection of $(b, c, a)$ as a path of length two is done using the incidence list of $c$. Therefore the incidence lists of $a$ and $b$ have passed and we do not detect the edge $(a, b)$. We conclude that the probability to output $\beta = 1$ (under the condition that $\beta \ne \perp$) is exactly 2/3 times the probability of SAMPLETRIANGLE2 to output $\beta = 1$.

A streaming algorithm COUNTTRIANGLESONEPASS2, which outputs an estimate of $|T_3|$, easily follows. It can be adjusted using an input parameter $s$ and is given in Figure 9.12.

COUNTTRIANGLESONEPASS2 ($s \in \mathbb{N}$)
    Run $s$ instances of SAMPLETRIANGLEONEPASS2 in parallel.
    Let $s'$ be the number of instances not returning $\perp$.
    Let $\beta_i$ be the value returned by the $i$th such instance (not returning $\perp$).
    $\tilde{T}_3 \leftarrow \left(\frac{1}{s'} \sum_{i=1}^{s'} \beta_i\right) \cdot \frac{|T_2| + 3 \cdot |T_3|}{2} = \left(\frac{1}{s'} \sum_{i=1}^{s'} \beta_i\right) \cdot \sum_{v \in V} d_v \cdot (d_v - 1)/4$.
    return $\tilde{T}_3$.

Figure 9.12: The 1-pass algorithm COUNTTRIANGLESONEPASS2 for incidence streams

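Lemma 9.2.5 Algorithm COUNTTRIANGLESONEPASS2 outputs a value $\tilde{T}_3$ having expected value $E[\tilde{T}_3] = |T_3|$. If $\varepsilon \le 1/2$ and $s \ge \frac{6}{\varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{4}{\delta}\right)$, then with probability $1 - \delta$ the algorithm outputs a value $\tilde{T}_3$ satisfying

$$(1 - \varepsilon) \cdot |T_3| \le \tilde{T}_3 \le (1 + \varepsilon) \cdot |T_3| .$$

Proof : The expected value of $\tilde{T}_3$ follows easily from Lemma 9.2.4.

Let $\varepsilon \le 1/2$ and $s \ge \frac{6}{\varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{4}{\delta}\right)$. First we show that $s' \ge \frac{3}{2 \cdot \varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{4}{\delta}\right)$ with probability at least $1 - \delta/2$.

Let $c_j \in \{0, 1\}$ be the random indicator variable which is 1 if the $j$th instance of SAMPLETRIANGLEONEPASS2 returns 0 or 1 and which is 0 if the $j$th instance returns $\perp$. By Lemma 9.2.4 we have $E[c_j] \ge 1/2$. By Chernoff Bounds [58]:

$$\Pr\left[\sum_{j=1}^{s} c_j \le \frac{s}{2} \cdot E[c_j]\right] = \Pr\left[\frac{1}{s} \sum_{j=1}^{s} c_j \le \frac{1}{2} \cdot E[c_j]\right] < e^{-\frac{1}{4} \cdot E[c_j] \cdot s/2} .$$

Therefore we have:

$$\Pr\left[s' \le \frac{3}{2 \cdot \varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{4}{\delta}\right)\right] \le e^{-s/16} \le \delta/4$$

for $\varepsilon < 1/2$.

We now condition on the event that $s' \ge \frac{3}{2 \cdot \varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{4}{\delta}\right)$.

We again use Chernoff Bounds [58]:

$$\Pr\left[\frac{1}{s'} \sum_{i=1}^{s'} \beta_i \ge (1 + \varepsilon) E[\beta]\right] < e^{-\varepsilon^2 \cdot E[\beta] \cdot s'/3}$$

$$\Pr\left[\frac{1}{s'} \sum_{i=1}^{s'} \beta_i \le (1 - \varepsilon) E[\beta]\right] < e^{-\varepsilon^2 \cdot E[\beta] \cdot s'/2}$$

For $s' \ge \frac{3}{2 \cdot \varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\left(\frac{4}{\delta}\right)$ the sum of both probabilities is bounded by $\delta/2$.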

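We can again develop an algorithm COUNTTRIANGLESONEPASSSAFE2 based on COUNTTRIANGLESONEPASS2. It has a number $s$ and a desired error probability $\psi$ as input parameters and outputs a pair $(\tilde{T}_3, \tilde{\varepsilon})$. $\tilde{T}_3$ then is a $(1 \pm \tilde{\varepsilon})$-approximation of $|T_3|$ with probability $1 - \psi$. It uses at most $s$ parallel instances of SAMPLETRIANGLEONEPASS2. We show the pseudocode of COUNTTRIANGLESONEPASSSAFE2 in Figure 9.13.

COUNTTRIANGLESONEPASSSAFE2 ($s \in \mathbb{N}$, $\psi \in (0, 1)$)
    Set $r \leftarrow \lceil 2 \cdot \log(4/\psi) \rceil$.
    Set $t \leftarrow \lceil s/(4 \log(4/\psi)) \rceil$.
    Run $r$ instances of COUNTTRIANGLESONEPASS2($t$) in parallel.
    Let $\tilde{T}_3^{(i)}$ be the value returned by the $i$th instance.
    Set $\tilde{T}_3 \leftarrow \mathrm{median}_i(\tilde{T}_3^{(i)})$.
    Set $\tilde{\varepsilon} \leftarrow \sqrt{\frac{1056}{\tilde{T}_3} \cdot \frac{|T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{4}{\psi}} = \sqrt{\frac{528}{\tilde{T}_3} \cdot \frac{\sum_{v \in V} d_v \cdot (d_v - 1)}{s} \cdot \log\frac{4}{\psi}}$.
    return $(\tilde{T}_3, \tilde{\varepsilon})$.

Figure 9.13: The 1-pass algorithm COUNTTRIANGLESONEPASSSAFE2 for incidence streams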


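Lemma 9.2.6 Let $(\tilde{T}_3, \tilde{\varepsilon})$ be the output of algorithm COUNTTRIANGLESONEPASSSAFE2. With probability $1 - \psi$ the following statements are true:

• $(1 - \tilde{\varepsilon}) \cdot |T_3| < \tilde{T}_3 < (1 + \tilde{\varepsilon}) \cdot |T_3|$

• If $s \ge \frac{2112}{\varepsilon^2} \cdot \frac{|T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \log(4/\psi)$, then the algorithm outputs $(\tilde{T}_3, \tilde{\varepsilon})$ with $\tilde{\varepsilon} \le \varepsilon$.

Proof : Equivalent to the proof of Lemma 9.1.4. □

We use techniques to reduce the amortized update time as shown in the previous section. Since we start $O(\log |V|)$ parallel instances for different guesses of $\tilde{P}$, the amortized update time increases by a factor of $O(\log |V|)$.

Theorem 27 There is a 1-pass streaming algorithm to count the number of triangles in incidence streams up to a multiplicative error of $1 \pm \varepsilon$ with probability at least $1 - \psi$, which needs

$$O\left(\frac{1}{\varepsilon^2} \cdot \left(1 + \frac{|T_2|}{|T_3|}\right) \cdot \log\frac{1}{\psi} \cdot \log^2 |V|\right)$$

memory bits and amortized expected update time

$$O\left(\log(|V|) \cdot \left(1 + \frac{1}{\varepsilon^2} \cdot \left(1 + \frac{|T_2|}{|T_3|}\right) \cdot \log\frac{1}{\psi} \cdot \frac{|V|}{|E|}\right)\right) .$$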

9.3 Counting Cliques of Arbitrary Size

Using the approach of the previous sections we can count cliques of $\alpha$ nodes in incidence streams as well using one pass. We assume that $\alpha$ is a small constant. Let $S_\alpha$ be the set of $\alpha$-stars ($\alpha$ nodes $v_1, \dots, v_\alpha$ and edges $(v_1, v_2), \dots, (v_1, v_\alpha)$) in $G$ and $K_\alpha$ be the set of cliques of size $\alpha$ in $G$. Our memory bounds will depend on $|S_\alpha| / |K_\alpha|$. In network analysis we are interested in those networks where this ratio is small, for example constant.

We use the method UNIFORMSTAR given in Figure 9.14 to uniformly choose an $\alpha$-star. It uses $O(\log |V|)$ memory bits and has expected running time $O(|V| \cdot \log |V|)$, not counting the time to read the stream. The selected star is then used by SAMPLECLIQUEONEPASS given in Figure 9.15. Each instance of SAMPLECLIQUEONEPASS uses $O(\log^2 |V|)$ memory bits. When we run $s$ parallel instances of SAMPLECLIQUEONEPASS, the whole method has amortized expected running time $O\left(s \cdot \frac{|V| \cdot \log |V|}{|E|} + 1\right)$ per edge.

Lemma 9.3.1 Algorithm SAMPLECLIQUEONEPASS with probability at least 1/2 outputs a value $\beta$ having expected value

$$E[\beta] = \frac{2 \cdot |K_\alpha|}{|S_\alpha|} .$$

Otherwise it outputs the value $\perp$.


UNIFORMSTAR($P$)
    Select value $k$ uniformly from the set $\{1, \dots, P\}$.
    For each node $v$ in the stream do
        Use the reservoir sampling technique of [120] to obtain $\alpha - 1$ sample nodes from the incidence list of $v$.
        Let $d$ be the degree of node $v$.
        If $\binom{d}{\alpha - 1} \ge k$ then
            return sample nodes.
        end if
        $k \leftarrow k - \binom{d}{\alpha - 1}$
    end do
    return $\perp$.

Figure 9.14: The 1-pass algorithm UNIFORMSTAR for incidence streams
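Figure 9.14 can be sketched in Python as follows. The sketch assumes the incidence stream is a sequence of `(v, neighbour)` pairs grouped by `v`, takes the already drawn value $k$ as a parameter, and uses the simple one-coin-per-element reservoir scheme rather than the faster variant of [120]. The function name and the return value `(center, leaves)` are hypothetical.

```python
import math
import random
from itertools import groupby

def uniform_star(incidence_stream, alpha, k, rng):
    """Return (center, leaves) of the selected alpha-star, or None for the failure symbol."""
    for v, group in groupby(incidence_stream, key=lambda e: e[0]):
        reservoir, d = [], 0
        for _, w in group:
            d += 1
            if len(reservoir) < alpha - 1:
                reservoir.append(w)            # fill the reservoir first
            else:
                slot = rng.randrange(d)        # keep w with probability (alpha - 1) / d
                if slot < alpha - 1:
                    reservoir[slot] = w
        stars_here = math.comb(d, alpha - 1)   # number of alpha-stars centered at v
        if stars_here >= k:
            return v, reservoir
        k -= stars_here
    return None
```

If $k$ exceeds the total number of stars in the graph, the loop ends without a selection and the function returns `None`, which plays the role of $\perp$.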

Proof : We set $\beta = \beta_i$ with $i = \lceil \log P \rceil$ at the end of the algorithm. This value of $\beta_i$ has been set to 0 or 1 if UNIFORMSTAR($\tilde{P}_i$) did select a star. Since $\tilde{P}_i = 2^{\lceil \log P \rceil}$ we have $P \le \tilde{P}_i < 2 \cdot P$.

UNIFORMSTAR($\tilde{P}_i$) selects a star by first choosing $k \in \{1, \dots, \tilde{P}_i\}$ uniformly at random and then selecting the $k$-th star in the stream. If $k \le \tilde{P}_i/2$ we have $k \le P$ and therefore a star is selected. This happens with probability 1/2 and in that case SAMPLECLIQUEONEPASS outputs $\beta = 0$ or $\beta = 1$.

Let us now condition on the event that $\beta \ne \perp$ and analyse the expected value of $\beta$.

If the star is sampled from the incidence list of $v$ and $v$ is the first or the second of the star nodes within the stream, we can find all other edges after selecting the sample star. So if the star belongs to a clique, this clique is detected.

However, if $v$ is the third or a later node in the stream we miss the edge connecting the first and second node of the star. When we fix a clique, the probability that the chosen node $v$ is the first or second node of the clique in the stream is $2/\alpha$. Therefore two choices of stars lead to a detection of a fixed clique. It follows that the expected value of $\beta$ is

$$E[\beta] = \frac{2}{\alpha} \cdot \frac{\alpha \cdot |K_\alpha|}{|S_\alpha|} .$$

A streaming algorithm COUNTCLIQUESONEPASS, which outputs an estimate of $|K_\alpha|$, easily follows. It can be adjusted using an input parameter $s$ and is given in Figure 9.16.

Lemma 9.3.2 Algorithm COUNTCLIQUESONEPASS outputs a value $\tilde{K}_\alpha$ having expected value $E[\tilde{K}_\alpha] = |K_\alpha|$.


SAMPLECLIQUEONEPASS
    Do the following things in parallel for $i = 0, 1, 2, \dots, \lfloor \log(|V|^3) \rfloor$:
        Let $\tilde{P}_i = 2^i$.
        Uniformly choose one $\alpha$-star using algorithm UNIFORMSTAR($\tilde{P}_i$).
        If UNIFORMSTAR did not select a star until the end of the stream, then return $\perp$.
        Let $(v_1, v_2, \dots, v_\alpha)$ be this star with $v_1$ as middle node.
        After choosing the star, test if each edge $(v_i, v_j)$ for $i, j \in \{2, \dots, \alpha\}$ and $i \ne j$ appears within the rest of the stream.
        if all edges appear in the stream after the incidence list of $v_1$ then set $\beta_i \leftarrow 1$
        else set $\beta_i \leftarrow 0$
    in parallel do: count the number $P = \sum_{v \in V} \binom{d_v}{\alpha - 1}$ of stars in the graph.
    set $\beta \leftarrow \beta_{\lceil \log P \rceil}$
    return $\beta$

Figure 9.15: The 1-pass algorithm SAMPLECLIQUEONEPASS for incidence streams

COUNTCLIQUESONEPASS ($s \in \mathbb{N}$)
    Run $s$ instances of SAMPLECLIQUEONEPASS in parallel.
    Let $s'$ be the number of instances not returning $\perp$.
    Let $\beta_i$ be the value returned by the $i$th such instance (not returning $\perp$).
    $\tilde{K}_\alpha \leftarrow \left(\frac{1}{s'} \sum_{i=1}^{s'} \beta_i\right) \cdot \frac{|S_\alpha|}{2} = \left(\frac{1}{s'} \sum_{i=1}^{s'} \beta_i\right) \cdot \sum_{v \in V} \binom{d_v}{\alpha - 1} / 2$.
    return $\tilde{K}_\alpha$.

Figure 9.16: The 1-pass algorithm COUNTCLIQUESONEPASS for incidence streams

If ≤1/2 and s≥6

2·|Sα|

|Kα|·ln(4

δ). then with probability 1−δthe algorithm outputs a value

Kαsatisfying

(1−)·|Kα|≤f

Kα≤(1+)·|Kα|.

Proof : Equivalent to the proof of Lemma 9.2.5. 2

Based on that we can again develop an algorithm COUNTCLIQUESONEPASSSAFE. It has a number $s$ and a desired error probability $\psi$ as input parameters and outputs a pair $(\tilde{K}_\alpha, \tilde{\epsilon})$. The algorithm gives the guarantee that $\tilde{K}_\alpha$ is a $(1 \pm \tilde{\epsilon})$-approximation of $|K_\alpha|$ with probability $1 - \psi$ and uses at most $s$ parallel instances of SAMPLECLIQUEONEPASS. We show the pseudocode of COUNTCLIQUESONEPASSSAFE in Figure 9.17.

Lemma 9.3.3 Let $(\tilde{K}_\alpha, \tilde{\epsilon})$ be the output of algorithm COUNTCLIQUESONEPASSSAFE. With probability $1 - \psi$ the following statements are true:


9 Counting Motifs in Data Streams

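COUNTCLIQUESONEPASSSAFE ($s \in \mathbb{N}$, $\psi \in (0, 1)$)
  Set $r \leftarrow \lceil 2 \cdot \log(4/\psi) \rceil$.
  Set $t \leftarrow \lceil s / (4 \log(4/\psi)) \rceil$.
  Run $r$ instances of COUNTCLIQUESONEPASS($t$) in parallel.
  Let $\tilde{K}_\alpha^{(i)}$ be the value returned by the $i$th instance.
  $\tilde{K}_\alpha \leftarrow \mathrm{median}_i(\tilde{K}_\alpha^{(i)})$.
  Set $\tilde{\epsilon} \leftarrow \sqrt{\frac{1056}{\tilde{K}_\alpha} \cdot \frac{|S_\alpha|}{s} \cdot \log\frac{4}{\psi}} = \sqrt{\frac{1056}{\tilde{K}_\alpha} \cdot \frac{\sum_{v \in V} \binom{d_v}{\alpha - 1}}{s} \cdot \log\frac{4}{\psi}}$.
  return $(\tilde{K}_\alpha, \tilde{\epsilon})$.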

Figure 9.17: The 1-pass algorithm COUNTCLIQUESONEPASSSAFE for incidence streams

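• $(1 - \tilde{\epsilon}) \cdot |K_\alpha| < \tilde{K}_\alpha < (1 + \tilde{\epsilon}) \cdot |K_\alpha|$

• If $s \ge \frac{2112}{\epsilon^2} \cdot \frac{|S_\alpha|}{|K_\alpha|} \cdot \log(4/\psi)$, then the algorithm outputs $(\tilde{K}_\alpha, \tilde{\epsilon})$ with $\tilde{\epsilon} \le \epsilon$.

Proof: Equivalent to the proof of Lemma 9.1.4. □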
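The parameter choice and the combining step of Figure 9.17 can be sketched as below. The sketch makes two assumptions that the text does not state: logarithms are taken to base 2, and an error estimate of infinity is reported when the median is 0. The function names are hypothetical.

```python
from math import ceil, inf, log2, sqrt
from statistics import median


def safe_parameters(s, psi):
    """Number of repetitions r and samples per repetition t (Figure 9.17)."""
    r = ceil(2 * log2(4 / psi))
    t = ceil(s / (4 * log2(4 / psi)))
    return r, t


def combine_runs(estimates, num_stars, s, psi):
    """Median of the r independent estimates plus the reported error bound."""
    k_tilde = median(estimates)
    if k_tilde <= 0:
        return k_tilde, inf  # assumption: no usable error bound in this case
    eps_tilde = sqrt(1056 / k_tilde * num_stars / s * log2(4 / psi))
    return k_tilde, eps_tilde
```

Taking the median of independent repetitions is what turns a constant success probability per repetition into the overall probability $1 - \psi$. The reported $\tilde{\epsilon}$ shrinks as the median grows relative to the number of stars.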

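Theorem 28 There is a 1-pass streaming algorithm to count the number of $K_\alpha$ in incidence streams up to a multiplicative error of $1 \pm \epsilon$ with probability at least $1 - \psi$, which needs

$$O\left(\frac{1}{\epsilon^2} \cdot \left(1 + \frac{|S_\alpha|}{|K_\alpha|}\right) \cdot \log\left(\frac{1}{\psi}\right) \cdot \log^2 |V|\right)$$

memory bits and has amortized expected update time

$$O\left(1 + \frac{1}{\epsilon^2} \cdot \left(1 + \frac{|S_\alpha|}{|K_\alpha|}\right) \cdot \log\frac{1}{\psi} \cdot \log^2 |V| \cdot \frac{|V|}{|E|}\right).$$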

9.4 Counting K3,3 in Incidence Streams

We propose a method to estimate the number of K3,3, when the graph is directed and given as

an incidence stream and the outdegree of each node is bounded by ∆. The stream of edges is

ordered by destination nodes (so we see for each destination node all source nodes after one

another). Our assumption is justified because in large social graphs and the webgraph there are

often only a small number of links going out of each node. The graphs are often stored on hard

disk(s) and for each node all incoming edges are precomputed and stored with the graph.

We do not assume any ordering by source nodes. Let $\mathcal{K}_{3,3}$ denote the set of $K_{3,3}$ minors and $\mathcal{K}_{3,1}$ denote the set of $K_{3,1}$ minors as shown in Figure 9.18.

We will first show how we can choose a K3,1 uniformly at random from the stream. This is

done similarly to choosing the length-2-paths in the triangle algorithm for incidence lists. We


Figure 9.18: A K3,1 (on the left) and a K3,3 (on the right).

start a number of different estimations on the number of $K_{3,1}$. In parallel we count the number

$$|\mathcal{K}_{3,1}| = \sum_{i=1}^{|V|} d_i \cdot (d_i - 1) \cdot (d_i - 2) / 6.$$

We will extend the method UNIFORMTWOPATH to a method UNIFORMK3,1 as shown in Figure 9.19. It has an estimate $P$ of $|\mathcal{K}_{3,1}|$ as input parameter and selects uniformly at random one $K_{3,1}$ motif from the incidence stream. Using UNIFORMK3,1 we can develop a method SAMPLEK3,3, outputting a variable $\beta$ whose expectation is related to the number $|\mathcal{K}_{3,3}|$. The method SAMPLEK3,3 is given in Figure 9.20. It can be implemented using $O(\log^2 |V|)$ memory bits.

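UNIFORMK3,1($P$)
  Select value $k$ uniformly from the set $\{1, \ldots, P\}$.
  For each node $v$ in the incidence list do:
    If $k > 0$ then
      Set $h \leftarrow f^{-1}(k)$ (with $f(x) := \binom{x}{3}$)
      Set $k_2 \leftarrow k - f(h - 1)$
      Set $i \leftarrow \left\lceil \sqrt{2 k_2 + \frac{1}{4}} + \frac{1}{2} \right\rceil$
      Set $j \leftarrow i - \frac{i^2 - i}{2} + k_2 - 1$
      Pass over the complete incidence list of node $v$.
      If incidence list of $v$ contains more than $j$ edges then
        $a \leftarrow$ the $h$th node in the incidence list of $v$
        $b \leftarrow$ the $i$th node in the incidence list of $v$
        $c \leftarrow$ the $j$th node in the incidence list of $v$
        $u \leftarrow v$
      end if
      $d \leftarrow$ degree of node $v$
      $k \leftarrow k - d \cdot (d - 1) \cdot (d - 2)$
    end if
  end do
  return edges $(a, u)$, $(b, u)$ and $(c, u)$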

Figure 9.19: The 1-pass algorithm UNIFORMK3,1 for directed incidence streams
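The index arithmetic in UNIFORMK3,1 turns a single number $k$ into three distinct positions $h > i > j$ of an incidence list. A minimal sketch of this unranking step follows. It assumes that $f^{-1}(k)$ denotes the smallest $h$ with $\binom{h}{3} \ge k$, which the figure does not spell out, and it evaluates the ceiling with integer arithmetic to avoid rounding issues. The function name is hypothetical.

```python
from math import comb, isqrt


def unrank_triple(k):
    """Map k in {1, ..., C(d, 3)} to list positions (h, i, j) with h > i > j >= 1."""
    h = 3
    while comb(h, 3) < k:  # assumed meaning of h = f^{-1}(k), f(x) = C(x, 3)
        h += 1
    k2 = k - comb(h - 1, 3)  # rank among the triples whose largest position is h
    # i = ceil(sqrt(2 * k2 + 1/4) + 1/2), i.e. the smallest i with C(i, 2) >= k2
    i = (1 + isqrt(8 * k2 + 1)) // 2
    if i * (i - 1) // 2 < k2:
        i += 1
    j = i - (i * i - i) // 2 + k2 - 1
    return h, i, j
```

For an incidence list with $d$ entries, the values $k = 1, \ldots, \binom{d}{3}$ are mapped to pairwise different triples, so a uniformly chosen $k$ yields a uniformly chosen $K_{3,1}$ at that node.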


SAMPLEK3,3
  Do the following in parallel for $i = 0, 1, 2, \ldots, \lfloor \log(|V|^4) \rfloor$:
    Let $\tilde{P}_i = 2^i$.
    From all $K_{3,1}$ occurring in the stream choose one uniformly using UNIFORMK3,1($\tilde{P}_i$).
    If UNIFORMK3,1 did not select a $K_{3,1}$ until the end of the stream, then return $\bot$.
    Let the three edges of the chosen $K_{3,1}$ be $(a, u)$, $(b, u)$ and $(c, u)$.
    Select uniformly $x_1, x_2 \in \{a, b, c\}$.
    Choose uniformly random variables $k_1, k_2 \in \{1, 2, \ldots, \Delta\}$.
    If $k_1 = k_2 \wedge x_1 = x_2$ then set $\beta_i \leftarrow 0$
    else:
      Go on passing over the rest of the stream (the part after the occurrence of the $K_{3,1}$).
      Select $(x_1, v)$ as the $k_1$-th edge $(x_1, \cdot)$ after selecting the $K_{3,1}$.
      Select $(x_2, w)$ as the $k_2$-th edge $(x_2, \cdot)$ after selecting the $K_{3,1}$.
      From the time of selecting $(x_1, v)$:
        check whether $(a, v)$, $(b, v)$, $(c, v)$ are present in the stream
      From the time of selecting $(x_2, w)$:
        check whether $(a, w)$, $(b, w)$, $(c, w)$ are present in the stream
      If both are the case, then set $\beta_i \leftarrow 1$, else set $\beta_i \leftarrow 0$.
  In parallel to the for loop count the number $|\mathcal{K}_{3,1}| = P = \sum_{v \in V} \binom{d_v}{3}$ of $K_{3,1}$ in the graph.
  set $\beta \leftarrow \beta_{\lceil \log P \rceil}$
  return $\beta$

Figure 9.20: The 1-pass algorithm SAMPLEK3,3 for directed incidence streams

Lemma 9.4.1 Algorithm SAMPLEK3,3 outputs with probability at least 1/2 a random value $\beta$ having

$$E[\beta] = \frac{2 \cdot |\mathcal{K}_{3,3}|}{9 \cdot \Delta^2 \cdot |\mathcal{K}_{3,1}|}.$$

Otherwise it outputs the value $\bot$.

Proof: We set $\beta = \beta_i$ with $i = \lceil \log P \rceil$ at the end of the algorithm. This value of $\beta_i$ has been set to 0 or 1 if UNIFORMK3,1($\tilde{P}_i$) did select a $K_{3,1}$. Since $\tilde{P}_i = 2^{\lceil \log P \rceil}$ we have $P \le \tilde{P}_i < 2 \cdot P$. UNIFORMK3,1($\tilde{P}_i$) selects a $K_{3,1}$ by first choosing $k \in \{1, \ldots, \tilde{P}_i\}$ uniformly at random and then selecting the $k$-th $K_{3,1}$ in the stream. If $k \le \tilde{P}_i / 2$ we have $k \le P$ and therefore a $K_{3,1}$ is selected. This happens with probability 1/2 and in that case SAMPLEK3,3 does not output $\bot$.

Let us now condition on the event that $\beta \neq \bot$ and analyse the expected value of $\beta$. Let $(a, b, c, u, v, w)$ be an arbitrary fixed $K_{3,3}$ with edges directed from $a$, $b$, $c$ to $u$, $v$ and $w$. Let $u$ be the vertex whose incidence list appears first within the incidence stream, with $v$ and $w$ occurring after $u$ within the stream. The $K_{3,3}$ will be detected exactly when all of the following events occur:


• $a, b, c, u$ are chosen as $K_{3,1}$ with $u$ being the destination node.

• $v$ and $w$ must be chosen.

• $x_1$ must be the first within the incidence list of $v$.

• $x_2$ must be the first within the incidence list of $w$.

The probability of the first event is $1 / |\mathcal{K}_{3,1}|$.

Conditioned on the first event the probability to choose $v$ and $w$ is $2 / \Delta^2$: Each edge $(x_1, \cdot)$ appearing after $(x_1, u)$ in the stream has a probability of $1 / \Delta$ to be chosen by the algorithm. We know that $(x_1, v)$ and $(x_1, w)$ appear after $(x_1, u)$ in the stream. Therefore each of these two edges has a probability of $1 / \Delta$ to be chosen. By a similar argument we have independently a probability of $1 / \Delta$ to choose the edge $(x_2, v)$ resp. $(x_2, w)$. We select the nodes $v$ and $w$ if we either select $(x_1, v)$ and $(x_2, w)$ or $(x_1, w)$ and $(x_2, v)$. Therefore the probability to choose $v$ and $w$ is $2 / \Delta^2$.

Observe that the probability for $v$ and $w$ to be chosen does not depend on the choice of $x_1$ and $x_2$. We can therefore exchange the order in which we analyse these events.

Conditioned on the first two events the probability for the third event is 1/3, and so is the probability for the fourth event. Altogether we get a probability of $\frac{2}{9 \cdot \Delta^2 \cdot |\mathcal{K}_{3,1}|}$ to choose the fixed $K_{3,3}$. □
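Written out, the probability to detect one fixed $K_{3,3}$ is the product of the four probabilities used in the proof. The last step below additionally assumes that a single run can detect at most one $K_{3,3}$ (the one formed by the chosen $a, b, c, u, v, w$), so that the detection events of different $K_{3,3}$ are disjoint and their probabilities add up:

```latex
\begin{align*}
\Pr[\text{fixed } K_{3,3} \text{ detected}]
  &= \frac{1}{|\mathcal{K}_{3,1}|} \cdot \frac{2}{\Delta^2} \cdot \frac{1}{3} \cdot \frac{1}{3}
   = \frac{2}{9 \cdot \Delta^2 \cdot |\mathcal{K}_{3,1}|}, \\
E[\beta]
  &= \sum_{K \in \mathcal{K}_{3,3}} \Pr[K \text{ detected}]
   = \frac{2 \cdot |\mathcal{K}_{3,3}|}{9 \cdot \Delta^2 \cdot |\mathcal{K}_{3,1}|}.
\end{align*}
```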

A streaming algorithm COUNTK3,3, which outputs an estimate of |K3,3|, easily follows. It can

be adjusted using an input parameter s. It is given in Figure 9.21.

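COUNTK3,3 ($s \in \mathbb{N}$)
  Run $s$ instances of SAMPLEK3,3 in parallel.
  Let $s'$ be the number of instances not returning $\bot$.
  Let $\beta_i$ be the value returned by the $i$th such instance (not returning $\bot$).
  $\tilde{K}_{3,3} \leftarrow \frac{1}{s'} \sum_{i=1}^{s'} \beta_i \cdot \frac{9}{2} \cdot \Delta^2 \cdot |\mathcal{K}_{3,1}| = \frac{1}{s'} \sum_{i=1}^{s'} \beta_i \cdot \frac{9}{2} \cdot \Delta^2 \cdot \sum_{v \in V} \binom{d_v}{3}$.
  return $\tilde{K}_{3,3}$.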

Figure 9.21: The 1-pass algorithm COUNTK3,3 for directed incidence streams

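Lemma 9.4.2 Algorithm COUNTK3,3 outputs a value $\tilde{K}_{3,3}$ having expected value $E[\tilde{K}_{3,3}] = |\mathcal{K}_{3,3}|$. If $\epsilon \le 1/2$ and $s \ge \frac{54}{\epsilon^2} \cdot \frac{\Delta^2 \cdot |\mathcal{K}_{3,1}|}{|\mathcal{K}_{3,3}|} \cdot \ln(\frac{2}{\delta})$ then with probability $1 - \delta$ the algorithm outputs a value $\tilde{K}_{3,3}$ satisfying

$$(1 - \epsilon) \cdot |\mathcal{K}_{3,3}| \le \tilde{K}_{3,3} \le (1 + \epsilon) \cdot |\mathcal{K}_{3,3}|.$$

Proof: Equivalent to the proof of Lemma 9.2.5. □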
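The rescaling in Figure 9.21 undoes the factor $2 / (9 \cdot \Delta^2 \cdot |\mathcal{K}_{3,1}|)$ from Lemma 9.4.1. The following sketch illustrates this together with the sample bound of Lemma 9.4.2. It assumes that $d_v$ is the length of the incidence list of $v$, i.e. the number of incoming edges in a stream ordered by destination nodes. All function and parameter names are hypothetical.

```python
from math import ceil, comb, log


def count_k31(in_degrees):
    """|K_{3,1}|: every destination node v contributes C(d_v, 3) motifs."""
    return sum(comb(d, 3) for d in in_degrees)


def estimate_k33(betas, in_degrees, max_out_degree):
    """Average of the 0/1 samples times (9/2) * Delta^2 * |K_{3,1}|."""
    if not betas:
        raise ValueError("every sampler returned no value")
    scale = 9 * max_out_degree**2 * count_k31(in_degrees) / 2
    return sum(betas) * scale / len(betas)


def sufficient_samples_k33(eps, delta, max_out_degree, num_k31, num_k33):
    """Sample bound of Lemma 9.4.2."""
    return ceil(54 / eps**2 * max_out_degree**2 * num_k31 / num_k33
                * log(2 / delta))
```

As a sanity check, a graph consisting of a single $K_{3,3}$ has three destination nodes of in-degree 3 and $\Delta = 3$, hence $E[\beta] = 2/243$. If 2 out of 243 samples equal 1, the estimate is exactly 1.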


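COUNTK3,3SAFE ($s \in \mathbb{N}$, $\psi \in (0, 1)$)
  Set $r \leftarrow \lceil 2 \cdot \log(4/\psi) \rceil$.
  Set $t \leftarrow \lceil s / (4 \log(4/\psi)) \rceil$.
  Run $r$ instances of COUNTK3,3($t$) in parallel.
  Let $\tilde{K}_{3,3}^{(i)}$ be the value returned by the $i$th instance.
  Set $\tilde{K}_{3,3} \leftarrow \mathrm{median}_i(\tilde{K}_{3,3}^{(i)})$.
  Set $\tilde{\epsilon} \leftarrow \sqrt{\frac{9504 \cdot \Delta^2}{\tilde{K}_{3,3}} \cdot \frac{|\mathcal{K}_{3,1}|}{s} \cdot \log\frac{4}{\psi}} = \sqrt{\frac{9504 \cdot \Delta^2}{\tilde{K}_{3,3}} \cdot \frac{\sum_{v \in V} \binom{d_v}{3}}{s} \cdot \log\frac{4}{\psi}}$.
  return $(\tilde{K}_{3,3}, \tilde{\epsilon})$.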

Figure 9.22: The 1-pass algorithm COUNTK3,3SAFE for directed incidence streams

We can develop an algorithm COUNTK3,3SAFE based on COUNTK3,3. It has a number $s$ and a desired error probability $\psi$ as input parameters and outputs a pair $(\tilde{K}_{3,3}, \tilde{\epsilon})$. The algorithm gives the guarantee that $\tilde{K}_{3,3}$ is a $(1 \pm \tilde{\epsilon})$-approximation of $|\mathcal{K}_{3,3}|$ with probability $1 - \psi$. It uses at most $s$ parallel instances of SAMPLEK3,3. We show the pseudocode of COUNTK3,3SAFE in Figure 9.22.

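Lemma 9.4.3 Let $(\tilde{K}_{3,3}, \tilde{\epsilon})$ be the output of algorithm COUNTK3,3SAFE. With probability $1 - \psi$ the following statements are true:

• $(1 - \tilde{\epsilon}) \cdot |\mathcal{K}_{3,3}| < \tilde{K}_{3,3} < (1 + \tilde{\epsilon}) \cdot |\mathcal{K}_{3,3}|$

• If $s \ge \frac{19008 \cdot \Delta^2}{\epsilon^2} \cdot \frac{|\mathcal{K}_{3,1}|}{|\mathcal{K}_{3,3}|} \cdot \log(4/\psi)$, then the algorithm outputs $(\tilde{K}_{3,3}, \tilde{\epsilon})$ with $\tilde{\epsilon} \le \epsilon$.

Proof: Equivalent to the proof of Lemma 9.1.4. □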

Theorem 29 There is a 1-pass streaming algorithm to count the number of $K_{3,3}$ in incidence streams ordered by destination nodes with outdegree bounded by $\Delta$ up to a multiplicative error of $\epsilon$ with probability at least $1 - \psi$, which needs

$$O\left(\frac{\log^2(|V|) \cdot |\mathcal{K}_{3,1}| \cdot \Delta^2 \ln(\frac{1}{\psi})}{|\mathcal{K}_{3,3}| \cdot \epsilon^2}\right)$$

memory bits.


10 Conclusions

In this thesis we developed new methods to analyse dynamic geometric data streams and obtain

structural information about the large data sets encoded in the streams.

In Chapter 3 we first developed a method to draw a uniform random sample from a multiset, when the multiset is given as a turnstile data stream. The method we propose is a building block for various data stream algorithms. As examples we showed in Chapter 4 some direct consequences of the new sampling method, i.e. how to maintain $\epsilon$-nets and $\epsilon$-approximations of points when the point set is given as a dynamic geometric data stream. We also developed a method to estimate the weight of a minimum tree spanning all the points encoded in a dynamic geometric data stream. As random sampling and $\epsilon$-approximations are powerful tools in computational geometry, we believe that our techniques have many more applications.

The space used by the algorithm for $\epsilon$-approximations, i.e. roughly $O(1/\epsilon^2)$, is essentially optimal as a function of $\epsilon$. This is because it is known that, for some range spaces, the size of $\epsilon$-approximations tends to $1/\epsilon^2$ as the VC dimension tends to infinity. However, for some range spaces smaller $\epsilon$-approximations can be constructed, even for points delivered in an insertions-only stream. For example, [118] showed how to compute $\epsilon$-approximations for ranges defined by halfspaces in $d$ dimensions of size roughly $O(1/\epsilon^{2 - 2/(d+1)})$. We do not know how to extend this result to dynamic geometric data streams.

To estimate the weight of the minimum spanning tree of the points encoded in a dynamic geometric data stream we used $O(\log^3(1/\delta) \cdot (\log(\Delta)/\epsilon)^{O(d)})$ space. Although we give the first algorithm at all to estimate the value in space polylogarithmic in $\Delta$, we believe that one can develop algorithms having much better memory bounds.

One of the central results of this thesis, a universal method to construct coresets for k-median, k-means, MaxCut, and more problems, has been given in Chapter 5. Our method is much simpler than previous coreset methods [8, 60, 61] and makes fewer assumptions about the distribution of the points. The only information needed to construct coresets is the number of points in heavy cells (cells containing a certain number of points) of certain square grids. The simplicity of our method enabled us to develop the fastest known PTAS for Euclidean MaxCut in Section 5.5.3 and the first efficient methods to maintain coresets for k-median, k-means, MaxCut, and more problems on dynamic geometric data streams in Chapter 6.

The coreset we obtain is of size $O(k \cdot \log n / \epsilon^{d+1})$ for k-median, $O(k \cdot \log n / \epsilon^{d+2})$ for k-means and $O(\log n / \epsilon^{d+1})$ for all other problems we consider. Recently some methods have been proposed to obtain smaller coresets for k-median and k-means. Har-Peled and Kushal [60] showed how to compute $\epsilon$-coresets of size $O(k^2 / \epsilon^d)$ for k-median and of size $O(k^3 / \epsilon^{d+1})$ for k-means (not dependent on $n$). However, their coreset construction does not apply to dynamic geometric data streams. Their results indicate that the space dependency of our algorithms on $n$ could also be improved.

All space requirements of our coreset algorithms depend exponentially on $d$ (we did not state it in the formulas because we assumed a constant dimension). Therefore our methods are not suited for high dimensional data. Recently Chen [26] proposed a method to obtain coresets using space which depends polynomially on $d$. His method is suited for streams of insertions of points, but does not translate to streams of insertions and deletions. It remains an interesting open problem to develop coreset methods for dynamic geometric data streams which have space complexity polynomial in the dimension $d$.

In Chapter 7 we used our coreset technique to develop the first kinetic data structure for the

Euclidean MaxCut problem. Our KDS can be extended to MaxTSP, MaxMatching, and average

distance. However, the time to compute a solution from the coreset (which has to be done for

each query to the data structure, or, alternatively with each event) can differ significantly.

Extending our KDS to k-median and k-means clustering requires additional ideas. The technical problem here is that one cannot get a lower bound on the solution from the width of the bounding box. Hence, it is not clear how to get an upper bound on the number of events. Developing kinetic data structures for k-median and k-means therefore remains an interesting open problem.

In Chapter 8 we presented an efficient implementation of a k-means clustering algorithm using

coresets. Our algorithm performs very well compared to KMHybrid [105] for small dimension

and small to medium k. The quality of the solutions varies less than that of KMHybrid, which

implies that we need fewer runs to guarantee a good solution. The main strength of our algorithm

is to quickly find relatively good approximations for many values of k, for example when a good

value for kis not known in advance. In this case, we can also use the coresets to compute the

average clustering coefficient and thus to find a good choice of k.

As mentioned above, recently some constructions for smaller coresets [26, 60] have been developed. It would be interesting to measure if the usage of these smaller coresets would lead to even faster convergence of k-means based algorithms.

We have proposed a methodology in Chapter 9 to find $(1 \pm \epsilon)$-approximations of the number of frequent subgraphs of large graphs given as data streams. The number of samples and memory bits needed by our algorithms depends on the number of certain small structures in these graphs. Recent results on the internal structure of the webgraph or large social graphs [17, 81, 89, 92, 78] suggest that the amount of space needed by our algorithm to count motifs is constant or at most logarithmic in the number of nodes for these graphs. Recent tests [17] suggest that our algorithms can compute good estimates of the number of triangles of real webgraph crawls in time comparable to the time to read the graph from the hard disk.


Bibliography

[1] P. K. Agarwal, J. Gao, and L. J. Guibas. Kinetic Medians and kd-Trees. Proceedings of

the 10th Annual European Symposium on Algorithms (ESA), pp. 5–16, 2002.

[2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating Extent Measures of

Points. Journal of the ACM, 51(4):606–635, July 2004.

[3] P. Agarwal, S. Har-Peled, and K. Varadarajan. Geometric Approximation via Coresets.

Survey available at http://valis.cs.uiuc.edu/ sariel/research/papers/04/survey/survey.pdf

[4] N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Fre-

quency Moments. J. Comput. Syst. Sci., 58(1), pp. 137–147, 1999.

[5] K. Alsabti, S. Ranka, and V. Singh. An Efficient k-Means Clustering Algorithm. Proceed-

ings of the first Workshop on High Performance Data Mining, 1998.

[6] D. Arthur and S. Vassilvitskii. How Slow is the k-Means Method? Proceedings of the 22nd Annual ACM Symposium on Computational Geometry (SoCG), pp. 144–153, 2006.

[7] M. Badoiu and K. Clarkson. Smaller Core-Sets for Balls. Proceedings of the 14th Sympo-

sium on Discrete Algorithms (SODA’03), pp. 801–802, 2003.

[8] M. Badoiu, S. Har-Peled, and P. Indyk. Approximate Clustering via Coresets. Proceedings

of the 34th Annual ACM Symposium on Theory of Computing (STOC’02), pp. 250–257,

2002.

[9] A. Bagchi, A. Chaudhary, D. Eppstein and M. T. Goodrich. Deterministic Sampling and

Range Counting in Geometric Data Streams. Proceedings of the 20th Annual Symposium

on Computational Geometry (SoCG), pp. 144–151, 2004.

[10] Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting Distinct

Elements in a Data Stream. Proceedings of the 6th International Workshop on Randomiza-

tion and Approximation Techniques, pages 1-10, 2002.

[11] J. Basch. Kinetic Data Structures. Ph.D. thesis, Stanford University, 1999.

[12] J. Basch, L. J. Guibas, and J. Hershberger. Data Structures for Mobile Data. J. Algorithms,

31(1):1–28 1999.

[13] J. Basch, L. J. Guibas, and G. Ramkumar. Sweeping Lines and Line Segments with a Heap.

Proceedings of the 13th Annual ACM Symposium on Computational Geometry (SoCG), pp.

469–471, 1997.


[14] P. Berkhin. Survey of Clustering Data Mining Techniques. Available at ..., 2002.

[15] S. Bespamyatnikh, B. Bhattacharya, D. Kirkpatrick, and M. Segal. Mobile Facility Loca-

tion. Proceedings of the 4th DIAL M, pp. 46–53, 2000.

[16] G. S. Brodal and R. Jacob. Dynamic Planar Convex Hull. Proceedings of the 43rd IEEE

Symposium on Foundations of Computer Science (FOCS), pp. 617–626, 2002.

[17] L. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, and C. Sohler. Counting

Triangles in Data Streams. Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Sym-

posium on Principles of Database Systems, pages 253–262, 2006

[18] J. Carter and M. Wegman. Universal Classes of Hash Functions. Journal of Computer and

System Sciences, 18(2), pp. 143–154, 1979

[19] T. M. Chan. Faster Coreset Constructions and Data Stream Algorithms in Fixed Dimen-

sions. Proceedings of the 20th Annual Symposium on Computational Geometry (SoCG),

pp. 152–159, 2004.

[20] T. Chan, B. Sadjad. Geometric Optimization Problems Over Sliding Windows. Proceedings

of the 15th Annual International Symposium on Algorithms and Computation (ISAAC), pp.

246–258, 2004.

[21] M. Charikar, L. O’Callaghan, and R. Panigrahy. Better Streaming Algorithms for Cluster-

ing Problems. Proceedings of the 35th Annual ACM Symposium on Theory of Computing

(STOC), pp. 30–39, 2003.

[22] M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards Estimation Error

Guarantees for Distinct Values. Proceedings of the 19th ACM SIGMOD Symposium on

Principles of Database Systems (PODS), 2000.

[23] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental Clustering and Dynamic

Information Retrieval. Proceedings of the 29th Annual ACM Symposium on Theory of

Computing (STOC), 626–635, 1997.

[24] M. Charikar, K. Chen, M. Farach-Colton. Finding Frequent Items in Data Streams. Proceedings of the 29th Annual International Colloquium on Automata, Languages and Programming (ICALP), pp. 693–703, 2002.

[25] B. Chazelle, R. Rubinfeld, and L. Trevisan. Approximating the Minimum Spanning Tree

Weight in Sublinear Time. Proceedings of the 28th Annual International Colloquium on

Automata, Languages and Programming (ICALP), pages 190–200, 2001.

[26] K. Chen. On k-Median Clustering in High Dimensions. Proceedings of the 17th Annual

ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1177–1185, 2006.


[27] D. Coppersmith and S. Winograd. Matrix Multiplication via Arithmetic Progressions. Jour-

nal of Symbolic Computation 3, no. 9.

[28] G. Cormode, M. Datar, P. Indyk. Comparing Data Streams using Hamming norms. Pro-

ceedings of the International Conference on Very Large Databases (VLDB), pp. 335–345,

2002.

[29] G. Cormode and S. Muthukrishnan. Radial Histograms for Spatial Streams. DIMACS

Technical Report 2003-11, 2003.

[30] G. Cormode and S. Muthukrishnan. Improved Data Stream Summaries: The Count-Min

Sketch and its Applications. Proceedings of the 6th Latin American Theoretical Informatics

(LATIN), pp. 29–38, 2004.

[31] G. Cormode and S. Muthukrishnan and I. Rozenbaum. Summarizing and Mining Inverse

Distributions on Data Streams via Dynamic Sampling. DIMACS Technical Report 2005-11,

2005.

[32] A. Czumaj, F. Ergün, L. Fortnow, A. Magen, I. Newman, R. Rubinfeld, and C. Sohler. Sublinear-Time Approximation of Euclidean Minimum Spanning Tree. SIAM Journal on Computing, 35(1):91–109, 2005.

[33] A. Czumaj and C. Sohler. Estimating the Weight of Metric Minimum Spanning Trees in

Sublinear-Time. Proceedings of the 36th Annual ACM Symposium on Theory of Computing

(STOC), pp. 175–183, 2004.

[34] A. Czumaj and C. Sohler. Sublinear-Time Approximation for Clustering via Random Sampling. Proceedings of the 31st Annual International Colloquium on Automata, Languages and Programming (ICALP'04), pp. 396–407, 2004.

[35] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing

Iceberg Queries Efficiently. Proceedings of the 1998 Intl. Conf. on Very Large Data Bases,

pp. 299-310, 1998.

[36] J. Feigenbaum, S. Kannan, and J. Zhang. Computing Diameter in the Streaming and Sliding

Window Models. Technical Report YALEU/DCS/TR-1245, Yale University, 2002.

[37] W. Fernandez de la Vega, M. Karpinski, C. Kenyon. Approximation Schemes for Metric

Minimum Bisection and Partitioning. Proceedings of the 15th Annual ACM-SIAM Sympo-

sium on Discrete Algorithms (SODA), 2004.

[38] W. Fernandez de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation

Schemes for Clustering Problems. Proceedings of the 35th Annual ACM Symposium on

Theory of Computing (STOC), pp. 50–58, 2003.

[39] W. Fernandez de la Vega and C. Kenyon. A Randomized Approximation Scheme for Metric

MAX-CUT. J. Comput. Syst. Sci., 63(4):531-541, 2001.


[40] P. Flajolet and G. Martin. Probabilistic Counting Algorithms for Data Base Applications.

Journal of Computer and System Sciences, 31:182–209, 1985.

[41] E. Forgey. Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifi-

cation. Biometrics, 21:768, 1965.

[42] G. Frahling, P. Indyk, and C. Sohler. Sampling in Dynamic Data Streams and Applica-

tions. Proceedings of the 21st Annual Symposium on Computational Geometry (SoCG),

pages 142–149, 2005. Invited to the special issue of SoCG 2005, to appear in International

Journal of Computational Geometry and Applications (IJCGA).

[43] G. Frahling and C. Sohler. Coresets in Dynamic Geometric Data Streams. Proceedings of

the 37th Annual ACM Symposium on Theory of Computing (STOC), pp. 209–217, 2005.

[44] G. Frahling and C. Sohler. A Fast k-Means Implementation using Coresets. Proceedings of

the 22nd Annual Symposium on Computational Geometry (SoCG), pages 135–143, 2006.

Invited to the special issue of SoCG 2006, to appear in International Journal of Computa-

tional Geometry and Applications (IJCGA).

[45] H.N. Gabow. Data Structures for Weighted Matching and Nearest Common Ancestors with

Linking Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms

(SODA), 434–443, 1990.

[46] S. Ganguly, M. Garofalakis and R. Rastogi. Tracking Set-Expression Cardinalities over

Continuous Update Streams. The VLDB Journal, 13(4), pp. 354–369, 2004.

[47] J. Gao, L. J. Guibas, J. Hershberger, L. Zhang, and A. Zhu. Discrete Mobile Centers.

Discrete & Computational Geometry, 30(1):45–63, 2003.

[48] A. Gilbert, S. Guha, Y. Kotidis, P. Indyk, S. Muthukrishnan, M. Strauss. Fast, Small Space

algorithm for Approximate Histogram Maintenance. Proceedings of the 34th Annual ACM

Symposium on Theory of Computing (STOC), pp.389–398, 2005.

[49] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss, Surfing Wavelets on Streams:

One-Pass Summaries for Approximate Aggregate Queries. VLDB, 2001, pp. 79–88.

[50] M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for Maximum

Cut and Satisfiability Problems using Semidefinite Programming JACM, 42:1115–1145,

1995.

[51] O. Goldreich, S. Goldwasser, D. Ron. Property Testing and its Connection to Learning and

Approximation. Journal of the ACM, 45(4):653–750, 1998.

[52] J. Goodman and J. O’Rourke. Handbook of Discrete and Computational Geometry. CRC

Press, 1997.


[53] S. Guha, N. Koudas, and K. Shim. Data-Streams and Histograms. Proceedings of the An-

nual ACM Symposium on Theory of Computing (STOC), 2001, pp. 471–475.

[54] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering Data Streams. Proceed-

ings of the 41st IEEE Symposium on Foundations of Computer Science (FOCS), 359–366,

2000.

[55] L. J. Guibas. Kinetic Data Structures — a State of the Art Report. Proceedings of the 3rd

Workshop on the Algorithmic Foundations of Robotics (WAFR), pp. 191–209, 1998.

[56] L. J. Guibas. Modeling Motion. In Handbook of Discrete and Computational Geometry,

edited by J. E. Goodman and J. O’Rourke, 2nd edition, Chapter 50, pp. 1117–1134, 2004.

[57] P. Haas, J. Naughton, S. Seshadri, and L. Stokes. Sampling-based Estimation of the Number

of Distinct Values of an Attribute. Proceedings of the 21st International Conference on Very

Large Data Bases (VLDB), pp.311–322, 1995.

[58] T. Hagerup and C. Rüb. A Guided Tour of Chernoff Bounds. Information Processing Letters, 33:305–308, 1989/90.

[59] S. Har-Peled. Clustering Motion. Discrete & Computational Geometry, 31:545–565, 2004.

[60] S. Har-Peled and A. Kushal. Smaller Coresets for k-Median and k-Means Clustering.

[61] S. Har-Peled and S. Mazumdar. Coresets for k-Means and k-Medians and Their Applica-

tions. Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC),

291–300, 2004.

[62] S. Har-Peled and B. Sadri. On Lloyd’s k-Means Method. Proceedings of the 16th Annual

ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.

[63] S. Har-Peled and K. Varadarajan. Projective Clustering in High Dimensions using Coresets.

Proceedings 18th Annual ACM Symposium on Computational Geometry (SoCG’02), pp.

312–318, 2002.

[64] F. Harary and H. J. Kommel. Matrix Measures for Transitivity and Balance. Journal of Mathematical Sociology (6), 199–210.

[65] J. Hartigan. Clustering Algorithms. John Wiley & Sons, New York, 1975.

[66] D. Haussler, E. Welzl. ε-Nets and Simplex Range Queries. Discrete and Computational Geometry, 2:127–151, 1987.

[67] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on Data Streams. 1998

[68] J. Hershberger. Smooth Kinetic Maintenance of Clusters. Computational Geometry, Theory

and Applications, 31(1–2):3–30, 2005.


[69] J. Hershberger and S. Suri. Convex Hulls and Related Problems in Data Streams. Proceed-

ings of the ACM/DIMACS Workshop on Management and Processing of Data Streams,

2003.

[70] J. Hershberger and S. Suri. Adaptive Sampling for Geometric Problems over Data Streams.

Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 2004.

[71] M. Inaba, N. Katoh, and H. Imai. Applications of Weighted Voronoi Diagrams and Ran-

domization to Variance-Based k-Clustering. Proceedings of the 10th Annual ACM Sympo-

sium on Computational Geometry (SoCG), pp. 332–339, 1994.

[72] P. Indyk. High-Dimensional Computational Geometry. Ph.D. thesis, Stanford University,

2000.

[73] P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream

Computation. Proceedings of the 41st IEEE Symposium on Foundations of Computer Sci-

ence (FOCS), pp. 189–197, 2000.

[74] P. Indyk. Better Algorithms for High-Dimensional Proximity Problems via Asymmetric

Embeddings. Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algo-

rithms (SODA), pp. 539–545, 2003.

[75] P. Indyk. Algorithms for Dynamic Geometric Problems over Data Streams. Proceedings of

the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 373–380, 2004.

[76] P. Indyk and D. Woodruff. Tight Lower Bounds for the Distinct Elements Problem. Annual

Symposium on Foundations of Computer Science, pages 283–290, 2003.

[77] P. Indyk and D. Woodruff. Optimal Approximations of the Frequency Moments of Data

Streams. Proceedings of the 37th Annual ACM Symposium on Theory of Computing

(STOC), 2005.

[78] S. Itzkovitz, N. Kashtan, D. Chklovskii, R. Milo, S. Shen-Orr, and U. Alon. Network

Motifs: Simple Building Blocks of Complex Networks. Science (298), no. 509, 824 – 827.

[79] H. Jagadish, N. Koudas, and S. Muthukrishnan. Mining Deviants in a Time Series Database. Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 102–113, 1999.

[80] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988.

[81] Hossein Jowhari and Mohammad Ghodsi. New Streaming Algorithms for Counting Trian-

gles in Graphs Proceedings of the COCOON, 2005, pp. 710–716.

[82] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. An Efficient

k-Means Clustering Algorithm: Analysis and Implementation. IEEE Trans. Pattern Anal.

Mach. Intell. 24(7): 881-892, 2002.


[83] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A Local

Search Approximation Algorithm for k-Means Clustering. Proceedings of the 18th Annual

Symposium on Computational Geometry (SoCG’02), pp. 10–18, 2002.

[84] H. Kaplan, R. E. Tarjan, and K. Tsioutsiouliklis. Faster Kinetic Heaps and Their Use in

Broadcast scheduling. SODA, pp. 834–844, 2001.

[85] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal Inapproximability Results

for Max-Cut and Other 2-Variable CSPs? Proceedings of the 45th IEEE Symposium on

Foundations of Computer Science (FOCS), pp. 146–154, 2004.

[86] D. Knuth. The Art of Computer Programming: Sorting and Searching, Vol. 3, Addison-

Wesley, 1973.

[87] S. G. Kolliopoulos and S. Rao. A Nearly Linear-Time Approximation Scheme for the

Euclidean k-Median Problem. Proceedings of the 7th Annual European Symposium on

Algorithms (ESA), pp. 378-389, 1999.

[88] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for Emerging

Cyber Communities. (1999), 403–416.

[89] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Random

Graph Models for the Web Graph. Proceedings of the IEEE Symposium on Foundations of

Computer Science (FOCS), 2000, pp. 57–65.

[90] A. Kumar, Y. Sabharwal, and S. Sen. A Simple Linear Time (1+ε)-Approximation Algo-

rithm for k-Means Clustering in any Dimensions. Proceedings of the 45th IEEE Symposium

on Foundations of Computer Science (FOCS), pp. 454–462, 2004.

[91] A. Kumar, Y. Sabharwal, and S. Sen. Linear Time Algorithms for Clustering Problems in

any Dimensions. Proceedings of the 32nd Annual International Colloquium on Automata,

Languages and Programming (ICALP), pp. 1374–1385, 2005.

[92] L. Laura, S. Leonardi, S. Millozzi, and J.F. Sybeyn. Algorithms and Experiments for the

Webgraph. Proceedings of the Annual European Symposium on Algorithms (ESA), 2002.

[93] S. Leonardi S. Millozzi L.S. Buriol, D. Donato. Link and Temporal Analysis of Wikigraphs.

Technical Report (2005).

[94] Y. Linde, A. Buzo, and R. Gray. An algorithm for Vector Quantizer Design. IEEE Trans-

action on Communications, 28(1), pp. 84–94, 1980.

[95] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory,

28: 129–137, 1982.

[96] P. Lyman and H. Varian. How much information. University of California, Berkeley,

http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/, 2003.

[97] J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations.

Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability,

volume 1, pp. 281–296, 1967.

[98] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. Proceedings

of the 2002 Intl. Conf. on Very Large Data Bases, pp. 346–357, 2002.

[99] J. Matoušek. On Approximate Geometric k-Clustering. Discrete & Computational

Geometry, 24(1): 61–84, 2000.

[100] R. Mettu and G. Plaxton. Optimal Time Bounds for Approximate Clustering. Machine

Learning, 56(1-3):35–60, 2004.

[101] A. Meyerson. Online Facility Location. Proceedings of the IEEE Symposium on

Foundations of Computer Science (FOCS), pp. 426–431, 2001.

[102] N. Mishra, D. Oblinger, and L. Pitt. Sublinear Time Approximate Clustering. Proceedings

of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 439–447,

2001.

[103] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press,

1995.

[104] D. Mount. KMlocal: A Testbed for k-means Clustering Algorithms. Available at

http://www.cs.umd.edu/~mount/Projects/KMeans/km-local-doc.pdf

[105] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends

in Theoretical Computer Science, Volume 1, Issue 2, 2005.

[106] N. Nisan. Pseudorandom Generators for Space-Bounded Computation. Proceedings of

the 22nd Annual ACM Symposium on Theory of Computing (STOC), pp. 204–212, 1990.

[107] A. Östlin and R. Pagh. Uniform Hashing in Constant Time and Linear Space. Proceedings

of the 35th Annual ACM Symposium on Theory of Computing (STOC), ACM Press, 2003,

pp. 622–628.

[108] D. Pelleg and A. Moore. Accelerating Exact k-Means Algorithms with Geometric

Reasoning. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery

and Data Mining, pp. 277–281, 1999.

[109] D. Pelleg and A. Moore. x-Means: Extending k-Means with Efficient Estimation of the

Number of Clusters. Proceedings of the 17th International Conference on Machine

Learning, 2000.

[110] S. Phillips. Acceleration of k-Means and Related Clustering Problems. Proceedings of

Algorithms Engineering and Experiments (ALENEX’02), 2002.

[111] R. Prim. Shortest Connection Networks and some Generalizations. Bell System Technical

Journal, 36:1389–1401, 1957.

[112] T. Schank and D. Wagner. Finding, Counting, and Listing all Triangles in Large Graphs,

an Experimental Study. Proceedings of the WEA, 2005, pp. 606–609.

[113] J. Schmidt, A. Siegel, and A. Srinivasan. Chernoff-Hoeffding Bounds for Applications

with Limited Independence. SIAM Journal on Discrete Mathematics, 8(2):223–250, 1995.

[114] R. Seidel and C. Aragon. Randomized Search Trees. Algorithmica 16, 464–497, 1996.

[115] S. Selim and M. Ismail. k-Means-Type Algorithms: A Generalized Convergence Theorem

and Characterizations of Local Optimality. IEEE Transactions on Pattern Analysis and

Machine Intelligence, 6:81–87, 1984.

[116] D. Shasha, J. Tsong-Li Wang, and R. Giugno. Algorithmics and Applications of Tree

and Graph Searching. Proceedings of the 21st ACM SIGMOD Symposium on Principles of

Database Systems (PODS), 2002, pp. 39–52.

[117] D. Sivakumar, Z. Bar-Yossef, and R. Kumar. Reductions in Streaming Algorithms, with an

Application to Counting Triangles in Graphs. Proceedings of the 13th Annual ACM-SIAM

Symposium on Discrete Algorithms (SODA) (2002), pp. 623–632.

[118] S. Suri, C. D. Toth, and Y. Zhou. Range Counting over Multidimensional Data Streams.

Proceedings of the 20th Annual Symposium on Computational Geometry, pp. 160–169,

2004.

[119] V. Vapnik and A. Chervonenkis. On the Uniform Convergence of Relative Frequencies of

Events to their Probabilities. Theory Probab. Appl., 16:264–280, 1971.

[120] J. S. Vitter. Random Sampling with a Reservoir. ACM Trans. Math. Softw. (11) (1985),

no. 1, 37–57.

[121] S. Valverde and R. Solé. Network Motifs in Computational Graphs: A Case Study in

Software Architecture. Physical Review E (72), 2005.

[122] D. J. Watts and S. H. Strogatz. Collective Dynamics of Small-World Networks. Nature

(393), 440–442, 1998.

[123] X. Yan, P. S. Yu, and J. Han. Graph Indexing: A Frequent Structure-Based Approach.

Proceedings of the SIGMOD, 2004, pp. 335–346.

[124] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A New Data Clustering Algorithm

and its Applications. Journal of Data Mining and Knowledge Discovery, 1(2), pp. 141–182,

1997.

[125] J. Zhao. An Implementation of Min-Wise Independent Permutation Family. (2005),

http://www.icsi.berkeley.edu/~zhao/minwise/.
