Dissertation
Algorithms for
Dynamic Geometric Data Streams
Dipl.–Math. Gereon Frahling
Fakultät für Elektrotechnik, Informatik und Mathematik
Institut für Informatik & Heinz Nixdorf Institut (HNI) &
Paderborn Institute for Scientific Computation (PaSCo)
Warburgerstraße 100, D - 33098 Paderborn
Reviewers: Jun. Prof. Dr. Christian Sohler, Universität Paderborn
Prof. Dr. Friedhelm Meyer auf der Heide, Universität Paderborn
Prof. Dr. Stefano Leonardi, Università di Roma "La Sapienza", Italy
Acknowledgements
First of all I would like to thank my advisor Christian Sohler for his great support. It was not
always easy to keep pace with his great ability to find interesting problems and develop new ideas
to solve them. During the whole time he gave me the feeling that I can always ask (even stupid)
questions and was responsible for the great atmosphere in Paderborn. Without the fun I had at
work I would have never been able to develop the results presented in this thesis.
I also benefited a lot from the great experience of my co-advisor Friedhelm Meyer auf der
Heide. He gave me the opportunity to come to Paderborn and the freedom to choose a research
area to work on. He always lent a sympathetic ear for all kinds of problems. I would also like to
thank Friedhelm’s whole research group for the nice time in Paderborn.
Then I would like to thank Kristina for her patience, empathy, love, and all the other things
that make it so worthwhile to know her. She turned even the most stressful days into a wonderful
time.
Finally I would like to thank those whom I owe the most: my parents, Adolf and Margret.
They always gave me the feeling to be loved and supported, whatever I am going to do in my
life.
Contents
Acknowledgements
Contents
1 Introduction
1.1 Motivation
1.2 Related Work
2 Preliminaries
2.1 General Notations
2.2 Data Streams
2.3 Clustering
2.4 Chernoff Bounds with Limited Independence
3 Sampling Data Streams
3.1 The Unique Element (UE) Data Structure
3.2 The Distinct Elements (DE) Data Structure
3.3 A Sample Data Structure using Totally Random Hash Functions
3.4 A Sample Data Structure using Random Number Generators
3.5 A Sample Data Structure using Pairwise Independent Hash Functions
4 Sampling Geometric Data Streams and Applications
4.1 Sampling Geometric Data Streams
4.2 ε-Nets and ε-Approximations in Data Streams
4.3 Random Sampling with Neighborhood Information
4.4 Estimating the Weight of a Euclidean Minimum Spanning Tree
5 The Coreset Method
5.1 Definitions
5.2 Coresets for k-Median
5.3 Coresets for k-Means
5.4 Coresets for Oblivious Optimization Problems
5.5 Constructing Solutions on the Coreset
5.6 Coresets via Sampling
6 Coresets in Data Streams
6.1 Insertions
6.2 Deletions
6.3 Maximum Spanning Tree
7 A Kinetic Data Structure for MaxCut
7.1 Kinetic Tournament Trees
7.2 Approximating the Bounding Cube
7.3 The Kinetic Data Structure for MaxCut
8 An Efficient k-Means Implementation using Coresets
8.1 Definitions and Notations
8.2 The Algorithm
8.3 Experiments
9 Counting Motifs in Data Streams
9.1 Counting Triangles in Adjacency Streams
9.2 Counting Triangles in Incidence Streams
9.3 Counting Cliques of Arbitrary Size
9.4 Counting K3,3 in Incidence Streams
10 Conclusions
Bibliography
1 Introduction
The increasing inter-connectivity of modern computer systems has led to the phenomenon of
massive data sets occurring in the form of data streams. Terabytes of Internet traffic are routed
through routers with small memory, and telecommunication companies collect gigabytes of
communication data each day that have to be analyzed automatically. In almost every area of our
digital life a huge amount of data is created. Parts of the data are stored in gigantic data centers
on thousands of large hard disks for later analysis. But most of the data is created on the fly,
never stored anywhere, and forgotten after seconds.
In this thesis we concentrate on the analysis of such elusive data appearing as a data stream.
In the data stream model the data arrives one item at a time and cannot be stored locally. We try
to maintain a small summary or sketch of the data in local memory. This summary should later
help us to answer certain kinds of queries about the data.
A large part of this thesis concentrates on a very common problem on huge data sets:
clustering. The computational task of clustering is to partition a given input into subsets of
equal characteristics. These subsets are usually called clusters and ideally consist of similar
objects that are dissimilar to objects in other clusters. This way one can use clusters as a coarse
representation of the data. We lose the accuracy of the original data set but we reduce the
complexity of the data.
Clustering has straightforward applications in data streaming scenarios. First, we have
to handle data sets we are not able to store. Reducing the complexity of these data sets using
clustering can give us the ability to store the clustered data for later examination. Second, on
such huge data sets clustering is often an effective method to understand the structure of the data
set in the first place. It addresses the problem of not seeing the wood for the trees, which often
arises in the analysis of huge data sets.
To cluster a data set one first has to define a distance measure between objects. This is of-
ten done in practice by mapping all objects to points in a d-dimensional Euclidean space. The
distance between the objects can then be measured by the Euclidean distance between points. Fi-
nally one can use established clustering objectives like k-median or k-means and corresponding
algorithms to cluster the data.
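To make the two objectives concrete, the following minimal sketch evaluates them for a fixed set of centers: k-median sums the distances of all points to their nearest center, k-means sums the squared distances. The function names and the toy data are illustrative assumptions and do not come from the thesis.

```python
import math

def nearest_distance(p, centers):
    """Euclidean distance from point p to its nearest center."""
    return min(math.dist(p, c) for c in centers)

def kmedian_cost(points, centers):
    """k-median objective: sum of distances to the nearest center."""
    return sum(nearest_distance(p, centers) for p in points)

def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(nearest_distance(p, centers) ** 2 for p in points)

# Toy instance (hypothetical data): three points on a line, two centers.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
print(kmedian_cost(points, centers), kmeans_cost(points, centers))
```

Only the middle point contributes here, with distance 2 to its nearest center, so the squared objective penalizes it more heavily than the linear one.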
One of the main results of this thesis is a technique to reduce the complexity of a huge point
set to a weighted point set of logarithmic size, called a coreset. Good approximate clusterings
for the whole point set can still be computed on the small coreset.
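One common way to formalize this guarantee for k-median is sketched below. The symbols S (the weighted point set), w (its weight function) and dist(p, C) (the distance from p to its nearest center in C) are notation assumed here for illustration; the precise definitions used in the thesis are those of Chapter 5.

```latex
\[
  \Bigl|\; \sum_{p \in P} \operatorname{dist}(p, C)
        \;-\; \sum_{s \in S} w(s)\,\operatorname{dist}(s, C) \;\Bigr|
  \;\le\; \varepsilon \sum_{p \in P} \operatorname{dist}(p, C)
  \qquad \text{for every set } C \text{ of } k \text{ centers},
\]
\[
  \text{where } \operatorname{dist}(p, C) = \min_{c \in C} \lVert p - c \rVert_2 .
\]
```

Under such a condition the weighted cost on S is within a factor (1±ε) of the true cost for every candidate solution, which is why optimizing on the small set suffices. For k-means the distances would be replaced by squared distances.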
We will show how to maintain a coreset using polylogarithmic memory on dynamic geometric
data streams, consisting of insertions and deletions of points. Our algorithm will give us the
ability to compute (1±ε)-approximate k-median, k-means, and MaxCut clusterings on dynamic
geometric data streams using polylogarithmic memory. Having a (problem dependent) mapping
of objects to Euclidean points in mind, we can also compute clusterings on dynamic data streams
of arbitrary items.
We now give an overview of the results developed in this thesis.
After introducing some notation, the data stream models, and the clustering objectives used
throughout the thesis in Chapter 2, we start the development of data stream algorithms in
Chapter 3. The data streams we consider here follow the turnstile model, i.e. they consist of update
operations on a high dimensional vector. We show how to sample sets of indices almost uniformly
at random from the support of the current vector. This is equivalent to sampling random
elements from a dynamic multiset P given as a data stream of insert and delete operations. The
algorithm uses O(log^2(UM/δ)) memory bits, where U denotes the dimension of the vector (resp.
the number of possible elements to be included in the set), M an upper bound on each vector
component (resp. the multiplicity of single elements), and δ the desired statistical difference of
the sampling to a uniform one. The difficulty here lies in the desired uniformity of the sampling
independently of the multiplicity of the elements in P. Furthermore, if the current multiset P
is small after many insert and many delete operations, we must be able to reconstruct P to get
uniform samples.
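The target behaviour can be stated as a small reference implementation. The sketch below is deliberately not a streaming algorithm: it stores the whole vector, which is exactly what the data structures of Chapter 3 avoid. It only illustrates what "uniform over the support, independent of multiplicity" means. All names and the example updates are illustrative assumptions.

```python
import random
from collections import Counter

def apply_updates(updates):
    """Replay turnstile updates (index, delta) and return the nonzero entries."""
    vector = Counter()
    for index, delta in updates:
        vector[index] += delta
        if vector[index] == 0:
            del vector[index]  # index leaves the support
    return vector

def sample_support(vector, rng):
    """Draw one index uniformly from the support, ignoring multiplicities."""
    return rng.choice(sorted(vector))

# Hypothetical stream: index 9 is inserted and fully deleted again,
# index 7 ends with multiplicity 4, index 3 with multiplicity 1.
updates = [(7, +5), (3, +1), (9, +2), (9, -2), (7, -1)]
vector = apply_updates(updates)
print(dict(vector))
print(sample_support(vector, random.Random(0)))
```

Indices 3 and 7 are each returned with probability 1/2 although index 7 has four times the multiplicity, and index 9 can never be returned because its insertions were cancelled by deletions.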
We apply this sampling technique to point sets in Chapter 4 and present low-storage data
structures to sample points from a dynamic geometric data stream consisting of insertions and
deletions of points from the d-dimensional discrete Euclidean space {1, ..., ∆−1}^d. The
sampling is done almost uniformly. We also show direct applications of our sampling technique. Let
P be the dynamically evolving point set encoded in the stream.
The data structures developed in Section 4.2 maintain ε-nets and ε-approximations of range
spaces of P having small VC-dimension D. The number of memory bits our data structures use
is bounded by poly(D, 1/ε, log(∆/δ)), where δ is the desired error probability. Although we
do not store the whole point set, we can, after passing over the dynamic geometric data stream,
approximately answer certain queries on ranges (e.g. about the number of points in a given
rectangle) using the statistics we maintain.
Based on a more sophisticated sampling of points and their respective neighbourhood we also
present a low-storage data structure to approximate the weight of a Euclidean minimum spanning
tree of the points in P in Section 4.4.
The results of Chapters 3 and 4 have been published in [G. Frahling, P. Indyk and C. Sohler,
Sampling in Dynamic Data Streams and Applications, In: Proceedings of the 21st Annual Sym-
posium on Computational Geometry (SoCG), pages 142–149, 2005. Invited to the special issue
of SoCG 2005, to appear in International Journal of Computational Geometry and Applications
(IJCGA)].
The heart of the thesis is Chapter 5. We develop (1+ε)-approximation algorithms for k-median,
k-means, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson
(MaxTSP), and average distance. Our algorithms compute a small weighted coreset consisting
of O(k · log n / ε^{O(d)}) points that approximates the input point set with respect to the
considered problem. The coresets can be computed in nearly linear time.
Having a coreset one only needs a fast approximation algorithm for the weighted problem to
compute a solution quickly. In fact, even an exponential algorithm is sometimes feasible as its
running time may still be polynomial in n. We will use algorithms from [61] to compute
(1+ε)-approximate solutions for k-median and k-means on the coreset in poly(log n, exp(1/ε)) time.
For MaxCut our technique (reducing the complexity of the big point set to a small coreset and
then computing an approximate solution on the coreset) will lead to the fastest known PTAS (in
terms of n) for Euclidean MaxCut. It is presented in Section 5.5.3.
The new coreset method also has the advantage that it does not rely on assumptions on the
optimal clustering solution (in contrast to previous approaches like [61]). This helps us to develop
the first efficient algorithms to maintain coresets during the m insert / delete operations of a
dynamic geometric data stream. The space used and the update time per insert / delete operation
are bounded by poly(1/ε, log m, log ∆) for constant k and dimension d. At each point of time
during the dynamic geometric data stream we can efficiently extract a coreset from the summary
held in memory and compute a (1±ε)-approximate solution for k-median, k-means, MaxCut,
and all other problems. The algorithms are presented in Chapter 6.
Chapters 5 and 6 are based on [G. Frahling and C. Sohler, Coresets in Dynamic Geometric
Data Streams, In: Proceedings of the 37th Annual ACM Symposium on Theory of Computing
(STOC), pages 209–217, 2005].
In Chapter 7 we will use the coreset technique developed in Chapter 5 to develop an efficient
kinetic data structure to maintain a (1+ε)-approximate MaxCut clustering of n points moving
linearly in R^d. The data structure is able to answer queries of the form “to which side of the
partition does query point p belong?” during the whole movement of the points, each query in time
polylogarithmic in n.
Previously it was not known if a set of n points moving linearly could force Ω(n^2) updates of
a (1±ε)-approximate MaxCut solution. Our data structure shows that such effort is not needed:
Under linear motion the data structure processes a number of events linear in n, each requiring
O(log^2 n) time. A flight plan update can also be performed in small expected time, when it is
performed on a point chosen uniformly from the set of points. No efficient kinetic data structures
for MaxCut have been known before.
In Chapter 8 we present an efficient implementation of a k-means clustering algorithm. Our
algorithm is a variant of KMHybrid [83, 104], i.e. it uses a combination of Lloyd-steps and
random swaps, but as a novel feature it uses the coreset construction of Chapter 5 to speed up
the algorithm. The main strength of the algorithm is that it can quickly determine clusterings of
the same point set for many values of k. This is necessary in many applications, since, typically,
one does not know a good value for k in advance. Once we have clusterings for many different
values of k we can determine a good choice of k using a quality measure of clusterings that is
independent of k, for example the average silhouette coefficient. The average silhouette coefficient
can be approximated using coresets.
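For reference, the sketch below computes the average silhouette coefficient in its standard textbook form: for each point, a is the mean distance to the other points of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). This exact, quadratic-time version is an assumption made for illustration; the variant used in Chapter 8 and its coreset-based approximation may differ.

```python
import math

def mean_distance(p, others):
    """Mean Euclidean distance from p to a non-empty list of points."""
    return sum(math.dist(p, q) for q in others) / len(others)

def average_silhouette(clusters):
    """clusters: list of at least two non-empty lists of points (tuples)."""
    scores = []
    for i, cluster in enumerate(clusters):
        for j, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            own = cluster[:j] + cluster[j + 1:]
            a = mean_distance(p, own)
            b = min(mean_distance(p, other)
                    for h, other in enumerate(clusters) if h != i)
            # Guard against coinciding points, where a = b = 0.
            scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two well separated clusters on a line (hypothetical data).
clusters = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
print(average_silhouette(clusters))
```

Values close to 1 indicate compact, well separated clusters. Because the score does not grow or shrink systematically with the number of clusters, it can be compared across clusterings computed for different k.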
To evaluate the performance of our algorithm we compare it with algorithm KMHybrid [104]
on typical 3D data sets for an image compression application and on artificially created instances.
Our data sets consist of 300,000 to 4.9 million points. We show that our algorithm significantly
outperforms KMHybrid on most of these input instances. Additionally, the quality of the solutions
computed by our algorithm deviates less than that of KMHybrid.
We also compute clusterings and approximate average silhouette coefficients for each k between
1 and 100 for our input instances and discuss the performance of our algorithm in detail.
The description of the algorithm and experimental results have been previously published in
[G. Frahling and C. Sohler, A Fast k-Means Implementation using Coresets, In: Proceedings of
the 22nd Annual Symposium on Computational Geometry (SoCG), pages 135–143, 2006. Invited
to the special issue of SoCG 2006, to appear in International Journal of Computational Geome-
try and Applications (IJCGA)].
Chapter 9 concentrates on graphs given as a data stream of edges. We develop space bounded
algorithms that with probability at least 1−δ compute a (1±ε)-approximation of the number of
small motifs in these graphs. All algorithms are based on random sampling. Our first algorithm
does not make any assumptions on the order of edges in the stream and approximates the number
of triangles occurring in the input graph. It uses space that is inversely related to the ratio between
the number of triangles and the number of triples with at least one edge in the induced subgraph,
and uses constant expected processing time per edge.
Our second triangle counting algorithm is designed for incidence streams (all edges incident
to the same vertex appear consecutively). It uses space that is inversely related to the ratio
between the number of triangles and the number of paths of length two in the graph and also has
small expected processing time per edge. These results significantly improve over previous work
[117, 81]. We generalize the results to the counting of cliques of size k.
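The quantity that governs the space bound of the incidence-stream algorithm, the ratio between triangles and paths of length two, can be computed exactly on a small graph held in memory. The brute-force sketch below only illustrates that ratio; it is not the sampling algorithm of Chapter 9, and the function name and input format (a list of undirected edges without self-loops) are assumptions.

```python
from itertools import combinations

def triangle_and_path_counts(edges):
    """Return (#triangles, #paths of length two) of a simple undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # A path of length two is a pair of distinct neighbours of a middle vertex.
    paths = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    # A path is closed if its two endpoints are adjacent; every triangle
    # closes exactly three paths, one per choice of middle vertex.
    closed = sum(1 for nb in adj.values()
                 for a, b in combinations(nb, 2) if b in adj[a])
    return closed // 3, paths

# A triangle with one pendant edge (hypothetical example).
triangles, paths = triangle_and_path_counts([(0, 1), (1, 2), (0, 2), (2, 3)])
print(triangles, paths, triangles / paths)
```

The denser the graph is in triangles relative to its paths of length two, the larger this ratio and, by the statement above, the less space the sampling-based estimator needs.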
Last but not least we present an algorithm to count the number of bipartite cliques (K3,3) with
three nodes in each partition in directed incidence streams with bounded out-degree of the nodes.
The space needed for the approximation is inversely related to the ratio between the number of
K3,3 and the number of bipartite cliques having one node in the destination partition (K3,1).
Since the space complexity of our algorithms depends only on the structure of the input graph
and not on the number of nodes, our algorithms scale very well with increasing graph size and
provide a basic tool to analyze the structure of large graphs.
The algorithms to count triangle motifs have been published together with some experiments
in [L. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, and C. Sohler, Counting Trian-
gles in Data Streams, In: Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium
on Principles of Database Systems (PODS), pages 253–262, 2006].
1.1 Motivation
A UC Berkeley study by Lyman and Varian [96] reports that in 2002, the most recent year
for which there are figures, humankind stored about 5 exabytes of information (that is 5
million terabytes or 37,000 times the information in the Library of Congress). This is just
the information which is stored. Most of the information created by humankind, about 18
exabytes according to the same study, is created on the fly and then forgotten. Most of this data
would be interesting for later examination, but cannot be stored because of high data rates or
storage capacity limits.
A prominent example of such data is the Internet traffic at a backbone router. Assume we
want to maintain some statistics about the routed packets. It would be far too costly to store
the required information (e.g., source and destination) for every packet routed. It seems to be
much more attractive to maintain a small sketch (or synopsis) of the data seen so far. Such a
sketch should contain an approximation of the information we are interested in. This leads us to
data streaming algorithms. They process the data one item at a time without storing it. After
processing the stream they are able to answer certain queries about the data.
There are many other examples of data streaming scenarios: Telephone call network opti-
mization, sensor networks, banking and credit card transactions, peer to peer connection and
transmission data, financial stock trade data, etc.
Data streaming algorithms often have an advantage even in environments where huge data sets
are actually stored. According to the study cited above about 92% of the stored data is held on
hard disks. From these disks the data can be read sequentially at a high rate. Unfortunately only the
data items below the head(s) of the hard disk can be accessed immediately. If we decide to read
data from another position on the disk, the disk head has to be moved and after the movement we
have to wait for the disk to turn to the desired data item. This takes milliseconds even on
current high speed disks. Real life applications for huge data therefore try to avoid this random
access and just read data sequentially.
Data streaming algorithms can be fed with a sequential stream of the data, avoiding random
access altogether. If for a given problem we are able to find a data streaming algorithm working
with limited main memory, we can guarantee not to trigger any hard disk head moves.
Dynamic Geometric Data Streams. Let us assume we have a data stream consisting of
insertions and deletions of objects into a big set Q. To learn something about the structure of the
set Q we have to examine the relations of the objects of Q to each other. The first questions that
arise concern the difference between objects. To answer such questions we first have to define
what difference between objects means.
A first attempt would be to model complicated distance measures between objects of Q. To
each pair of objects (a, b) we assign a number d(a, b) measuring the distance between the
objects. This attempt would give us a precise modelling of difference, but has a major drawback:
How can we store the distances between pairs of objects when we are not even able to store the
objects themselves?
The only solution to this problem is to provide an implicit distance measure between objects.
In practice objects are often mapped to a d-dimensional Euclidean space by a mapping α. The
distance d(a, b) between objects is then measured as the Euclidean distance between α(a) and
α(b) and can be computed implicitly. The dynamic data stream of objects then translates naturally
into a dynamic geometric data stream, consisting of insertions and deletions of points, by
applying α to each object in the stream.
There are even more direct applications of dynamic geometric data streams: Sensor networks,
mobile ad-hoc networks, or the analysis of astrophysical data often provide us directly with
data streams of positional data, i.e. data streams of points in R^d. For example, in a mobile
ad-hoc network the participants may regularly broadcast updates of their current position. All
participants want to maintain information about the distribution of the participants to maintain an
efficient communication network. Since mobile devices usually have limited memory it would
be desirable to do this using only a small amount of space. Of course, one can model an update as
deleting the old position and inserting the new one. Therefore, the model of dynamic geometric
data streams applies.
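The reduction from position updates to insertions and deletions can be sketched as follows. The operation names, the helper `move`, and the explicit multiset are illustrative assumptions; a streaming algorithm would of course not store the point set explicitly as this reference replay does.

```python
from collections import Counter

def replay(stream):
    """Replay ('insert' | 'delete', point) operations; return the current multiset."""
    current = Counter()
    for op, point in stream:
        if op == "insert":
            current[point] += 1
        elif op == "delete":
            if current[point] == 0:
                raise ValueError(f"deleting absent point {point}")
            current[point] -= 1
            if current[point] == 0:
                del current[point]
        else:
            raise ValueError(f"unknown operation {op!r}")
    return current

def move(old, new):
    """A position update expressed as two stream operations."""
    return [("delete", old), ("insert", new)]

# Two devices report positions, then the first one moves (hypothetical data).
stream = [("insert", (1, 1)), ("insert", (5, 2))] + move((1, 1), (2, 1))
print(dict(replay(stream)))
```

After the replay only the positions (5, 2) and (2, 1) remain, so any structure that supports insertions and deletions also supports moving points.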
In the context of dynamic geometric data streams we have to reduce the complexity of the
point sets. We are only able to store some small representation. Using that we want to answer
certain queries about the points. In the geometric context one of the most interesting questions
concerns the distribution of points. In Section 4.2 we will address queries like “How many
points lie in a certain rectangle?” and show how to answer them by maintaining ε-nets and
ε-approximations of points in small space.
In the context of mobile ad-hoc and sensor networks the question of the efficiency of
the current communication network often arises. We can see the Euclidean minimum spanning tree
as one of the most efficient communication networks, minimizing the total length of connections
between sensors. Estimating the weight of the Euclidean minimum spanning tree of the current
network gives us information about the efficiency of the current communication structure. We will
show how to estimate this Euclidean minimum spanning tree weight in Section 4.4.
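For comparison, the quantity being estimated can be computed exactly when all points fit in memory. The sketch below uses Prim's algorithm on the complete Euclidean graph in O(n^2) time; it is a baseline under the assumption of full access to the points, not the low-storage estimator of Section 4.4, and the function name is illustrative.

```python
import math

def emst_weight(points):
    """Exact weight of a Euclidean minimum spanning tree (Prim, O(n^2) time)."""
    n = len(points)
    if n < 2:
        return 0.0
    # best[i] = distance from point i to the tree built so far.
    best = [math.dist(points[0], p) for p in points]
    in_tree = [False] * n
    in_tree[0] = True
    total = 0.0
    for _ in range(n - 1):
        # Attach the point outside the tree that is closest to it.
        nxt = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        total += best[nxt]
        in_tree[nxt] = True
        for i in range(n):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(points[nxt], points[i]))
    return total

# Three sensors at the corners of a right angle (hypothetical data).
print(emst_weight([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]))
```

The exact computation needs every point and quadratic time, which is what makes a small-space approximation attractive when positions arrive as a stream.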
Clustering. The problem of clustering data sets according to some similarity measure belongs
to the most extensively studied optimization problems. Clustering often plays the role of the first
step to understand huge data sets. Clustered data help to identify big groups of similar items in
the data. Furthermore one can easily find items similar to a query item.
Search engines like Ask.com advertise the ability to present websites on one topic as a cluster
to the user. Amazon buyers are clustered to find other users having similar interests. There are
many more applications in computational biology, machine learning, data mining and pattern
recognition. Since the quality of a clustering is rather problem dependent, there is no general
clustering algorithm. Consequently, over the years many different clustering algorithms have
been developed.
The most prominent and widely used clustering algorithm is Lloyd’s algorithm, sometimes also
referred to as the k-means algorithm. This algorithm requires the input set to be a set of points in
the d-dimensional Euclidean space. Its goal is to find k cluster centers and a partitioning of the
points such that the sum of squared distances to the nearest center is minimized. The algorithm
is a heuristic that converges to a local optimum. The main benefit of Lloyd’s algorithm is its
simplicity and its foundation on analysis of variances. Also, it is relatively efficient.
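A minimal sketch of Lloyd's iteration follows, assuming the initial centers are supplied by the caller and that a center whose cluster becomes empty is simply kept. Both choices, as well as the function name, are assumptions made for illustration and are not taken from the thesis.

```python
import math

def lloyd(points, centers, max_iter=100):
    """Alternate assignment and mean-update steps until the centers are stable."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: every point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: every center moves to the mean of its cluster.
        new_centers = []
        for center, cluster in zip(centers, clusters):
            if cluster:
                new_centers.append(tuple(sum(p[d] for p in cluster) / len(cluster)
                                         for d in range(len(center))))
            else:
                new_centers.append(center)  # assumption: keep empty clusters' centers
        if new_centers == centers:
            break  # local optimum reached
        centers = new_centers
    return centers, clusters

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers, clusters = lloyd(points, [(0.0, 0.0), (10.0, 0.0)])
print(centers)
```

The assignment step scans all points in every iteration, so the running time of each iteration grows with the size of the input.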
One major drawback of the k-means algorithm is that it needs access to the whole point set
in each iteration. In the data streaming scenarios described above it can therefore not be applied
directly.
We will provide a method to reduce the complexity of the point set P to a weighted coreset
in Chapter 5. This method is the first to efficiently compute (1±ε)-approximate k-median,
k-means and MaxCut clusterings in dynamic geometric data streams, where deletions of points are
allowed (see Chapter 6). The reduction is done using only polylogarithmic space, and is therefore
applicable to huge dynamic geometric data streams. Using a mapping of objects to Euclidean
spaces as described above our reduction technique can also be applied to huge dynamic data
streams consisting of insertions and deletions of arbitrary objects.
In Chapter 8 we present an efficient implementation of our coreset reduction technique. It
shows that even for data sets in main memory our method can be used to accelerate the first steps
of the popular k-means method and achieve faster convergence. It can also be used in scenarios
where the number of clusters is not specified in advance, because it is suitable to quickly compute
clusterings for different values of k.
Clustering Kinetic Data. Clustering also plays a central role in ad-hoc mobile communication
and sensor networks, where the underlying communication structures often depend on
the proximity or other similarity characteristics of the stations in motion. For example, mobile
networks are often organized in hierarchical clusters, where all the stations inside one cluster are
in a close proximity and have direct communication. The hierarchy of clusters then induces a
tree structure on their leaders which can be used to establish communication (or perform other
data management tasks) between different clusters.
Maintaining clusters of mobile nodes is a very challenging task in mobile networks, because of
the dynamic character of the moving nodes. Good clustering algorithms should ensure a tradeoff
between the quality of the clustering at any given time and its stability and efficiency under
motion.
In Chapter 7 we will develop the first kinetic data structure for MaxCut clustering. We assume
that n points (the sensors) are moving along linear trajectories. We show that in this scenario we
are able to maintain an approximate MaxCut clustering with Õ(n) effort during the motion of
points. Updates on the point velocities can be handled efficiently.
Our algorithm is based on the coreset construction technique described in Chapter 5. Since
the coreset construction generally applies to the k-means and k-median problems as well, we
are confident that our ideas to develop kinetic data structures using coresets can lead to efficient
kinetic data structures for k-means and k-median clustering in the future as well.
Counting Motifs in Graphs. Graphs are fundamental structures for modeling complex re-
lationships between data in Web documents, chemical compounds, XML, social networks etc.
A basic tool to uncover their structural design principles and to extract relevant information is
to mine the most frequent interconnection patterns occurring in the graph. The computation of
network indices based on counting the number of certain small subgraphs is a basic tool in the
analysis of the structure of large networks.
As an example, the occurrence of a very large number of certain dense subgraphs has been
observed in the Webgraph, the graph formed by Web pages and hyperlinked connections [88],
in an attempt to trace the emergence of hidden cyber-communities. A stochastic model of
the growth of the Webgraph [89], the "copying model", has then been developed and uses these
dense subgraphs as building blocks of the process of network formation.
Another example is large software systems. The simplest way of reusing software is
to duplicate some portion of the program and later adapt it to some specific needs. In [121]
it is argued that the analysis of frequent network motifs in software architectures suggests that
duplication and diversification mechanisms are responsible for a significant part of the observed
topological features of large software graphs.
Since huge graphs (webgraph crawls for example) do not fit into main memory, we have to
find algorithms working on graphs stored on disks. To avoid random access altogether we will look
into algorithms working on streams of edges.
Recent implementations and experiments suggest that our algorithms are suitable to compute
good estimates of the number of triangles of real webgraph crawls in time comparable to the
time to read the graph from the hard disk. Some of the experimental results have already been
published in [17].
1.2 Related Work
We subdivide the related work into the subjects data streams, geometric data streams, clustering,
kinetic data structures, and work related to the counting of motifs in graphs.
Data Streams. In 1996 Alon, Matias, and Szegedy analyzed data streaming algorithms to
approximate the frequency moments [4] and also gave lower bounds on the memory needed
for the approximation. The paper laid the foundation for a large body of research done in
the subsequent years. Some areas of interest are the counting of distinct items and estimating
frequency moments [4, 22, 40, 46, 57, 77], the counting of frequent items [24, 30] and the
computation of histograms [48, 53, 79]. For other work on streaming algorithms in general
we refer to the survey by Muthukrishnan [105].
Ganguly, Garofalakis, and Rastogi gave a method to track set expression cardinalities in dy-
namic data streams [46]. Their method can potentially be altered to provide random samples as
done in Chapter 3. However, this was not the purpose of [46] and the alterations needed are
not stated in the paper. Cormode and Muthukrishnan [31] developed a sampling technique for
dynamic data streams similar to the technique given in Chapter 3. Their result was obtained at
the same time independently of our publication [42].
Geometric Data Streams. One of the first geometric problems studied on a stream of points
was to approximate the diameter of a point set in the plane [36] using O(1/ε) space. Later this
problem has also been considered in higher dimensions [74], where an algorithm with space
complexity O(d n^{1/(c^2−1)}) to maintain a c-approximate diameter for c > √2 has been obtained.
Chan and Sadjad proposed an algorithm to maintain an approximation of the diameter in the
sliding window model [33]. In this model one considers an infinite input stream but only the last
n elements are relevant (the window). They gave a (1+ε)-approximation algorithm for fixed
dimension, which stores O((1/ε)^{(d+1)/2} log(R/ε)) points, where R denotes the ratio between the
diameter and the smallest pairwise distance between two points in the window.
Cormode and Muthukrishnan introduced the radial histogram [29] to approximate different
geometric problems in the plane. A radial histogram is a subdivision of the plane given by
concentric circles around a center point and halflines starting at the center. This way we can assign
every point in the stream to a cell of the radial histogram. Problems that can be approximated via
radial histograms include the diameter, convex hull, and the furthest neighbor problem.
Hershberger and Suri showed how to maintain a set of 2r points such that the distance from the true
convex hull of the points seen so far is O(D/r^2), where D is the current diameter of the sample
set [69].
Agarwal et al. introduced a framework for approximating various extent measures of point
sets for constant dimension [2]. Their technique can be used to obtain streaming algorithms
for problems like the diameter, width, smallest bounding box, ball, and cylinder of the point
set. Chan used this framework to develop improved streaming algorithms (using less space than
previous ones) for a number of geometric problems with constant dimension including diameter,
width, minimum-radius enclosing cylinder, minimum width enclosing annulus, etc.
Bagchi et al. [9] gave two deterministic streaming algorithms that maintain $\epsilon$-nets and
$\epsilon$-approximations under insertions of points. They apply their algorithm to approximate several
robust statistics in data streams including Tukey depth, simplicial depth, regression depth, the
Theil-Sen estimator and the least median of squares. Since their algorithm is deterministic it
cannot be extended to the dynamic streaming model including deletions. Suri et al. gave both
deterministic and randomized algorithms to compute a (weighted) $\epsilon$-approximation for ranges
that are axis-aligned boxes [118].
Indyk introduced the model of dynamic geometric data streams used in this thesis [75]. He
gave $O(d \cdot \log \Delta)$-approximation algorithms for (the weight of) minimum weighted matching,
minimum bichromatic matching and minimum spanning tree. He also showed how to
approximate the weight of an optimal solution of the facility location problem within a factor
of $O(d \log^2 \Delta)$. For the $k$-median problem he gave a $(1+\epsilon)$-approximation algorithm with
query time O(kd ·k·d1(log +1
log 1
)) (exhaustive search). He also developed a O(1)-
approximation that needs O(d·k·d1(log +1
log 1
)) time to compute an approximation
from the maintained data structure. He further gave a $(1+\epsilon, O(\log \Delta (\log \Delta + \log(1/\epsilon)/\epsilon)))$-approximation
algorithm, i.e. an algorithm that returns $O(\log \Delta (\log \Delta + \log(1/\epsilon)/\epsilon)) \cdot k$ medians
whose cost is at most $1+\epsilon$ times the cost of an optimal solution. This algorithm has polylogarithmic
query time. All of the algorithms given above work using $O((k + \log \Delta + 1/\epsilon)^{O(1)})$
space.
Minimum Spanning Tree Approximation. The problem of estimating the weight of a
minimum spanning tree has been considered in the context of sublinear time approximation
algorithms. The first such algorithm for the minimum spanning tree weight is designed for
sparse graphs and computes a $(1+\epsilon)$-approximation [25]. It has a running time of $\tilde{O}(D \cdot W/\epsilon^2)$
when the edge weights are in a range from $1$ to $W$ and the average degree of the input graph is
$D$. In the geometric context a $\tilde{O}(\sqrt{n}/\epsilon^{O(1)})$ time $(1+\epsilon)$-approximation algorithm was given, if
the point set can be accessed using certain complex data structures [32]. In the metric case one
can compute a $(1+\epsilon)$-approximation of the minimum spanning tree weight in $O(n/\epsilon^{O(1)})$ time
[20].
Clustering. It is beyond the scope of this section to give a comprehensive overview of the
clustering literature. We want to concentrate on results using coresets and then give a brief
overview of the most important developments with focus on partitioning algorithms. For a more
comprehensive overview of the work in clustering we refer to the surveys/books [14, 65, 80].
Har-Peled and Mazumdar gave $(1+\epsilon)$-approximation algorithms for the $k$-median and $k$-means
problem, when the points are given in a data stream consisting of insertions (and no
deletions) [61]. Their algorithm is based on maintaining coresets of logarithmic size. They
also mention the extension of their results to the case of dynamic streaming algorithms as an
interesting open problem.
Chan used coresets to approximate different geometric problems including diameter and min-
volume bounding box [19]. Together with Sadjad he considered the diameter and width of a
point set in the sliding window model [20].
In the context of clustering algorithms several other coreset constructions have been developed
for the k-median and k-means clustering problem [8, 60]. These coresets found applications
in approximation algorithms [8, 60] and clustering of moving data [59]. Also for projective
clustering, coresets have been developed [63].
Apart from clustering, coresets have found applications in basic problems in computational
geometry, for example, to compute an approximation of the smallest enclosing ball of a point set
[7] or to approximate extent measure of point sets [2, 19].
An overview of coreset constructions is given in [3].
The most popular algorithm for the $k$-means clustering problem is Lloyd's algorithm [41, 62,
95, 97]. It is known that this algorithm converges to a local optimum [115]. Recently,
a number of very efficient implementations of this algorithm have been developed [5, 82, 83,
108, 109, 110]. These algorithms reduce the time needed to compute the nearest neighbors
in a Lloyd's iteration, which is the most time-consuming step of the algorithm. Arthur and
Vassilvitskii showed that there are instances which require $2^{\Omega(\sqrt{n})}$ iterations [6].
In the Euclidean space there are many $(1+\epsilon)$-approximation algorithms for the $k$-means
clustering problem [8, 38, 61, 71, 90, 99]. Also for the k-means problem in metric spaces efficient
constant factor approximation algorithms are known [83, 100].
The quality of random sampling in metric spaces has been analyzed for some clustering prob-
lems including the metric and the Euclidean k-median [34, 102]. The analysis can be easily
extended to the k-means clustering problem. A testbed for k-means clustering algorithms has
been given in [104].
Streaming algorithms for clustering problems have also been considered in the more general
metric space setting [23, 54, 101]. The currently best known algorithm for the k-median problem
is an $O(1)$-approximation using $O(k \cdot \mathrm{polylog}\, n)$ space [21].
MaxCut. MaxCut is known to be Max-SNP-hard for general graphs. It has a very easy 0.5-
approximation algorithm and an exciting 0.87856-approximation algorithm by Goemans and
Williamson [50]. For metric graphs (and hence also for geometric instances), Fernandez de la
Vega, Karpinski, and Kenyon [37] designed a PTAS which computes a $(1 \pm \epsilon)$-approximation
in time $O(n^2 \cdot 2^{O(1/\epsilon^2)})$. Indyk [72] designed a PTAS for metric graphs having runtime
$O(n \cdot \log n \cdot (2^{(1/\epsilon)^{O(1)}} + \log n))$. These are also the best known algorithms for the Euclidean version
of MaxCut, for which it is still not known if the problem is NP-hard.
Kinetic Data Structures. There has been a lot of work on designing KDS for various cluster-
ing problems. For example, there are efficient KDS for the problems of finding (approximately)
minimum number of squares (or other geometric objects) that contain all the input points [47, 68]
and for the problem of finding $k$ clusters of minimum maximum radius that cover all points [59];
for other examples, see e.g., [1, 15, 56, 68]. However, unlike the MaxCut problem studied in this
thesis, the prior work typically has focused on clustering problems in which each clustered set
has been defined only by the inner-cluster properties. In contrast, the MaxCut is the problem of
clustering the input set of points into two sets for which the sum of the inter-cluster distances is
maximized, that is, for which the dissimilarity between the points in different clusters is maximized.
Compared to the other clustering problems, MaxCut depends more on global properties
of the input points and as such it requires different (novel) techniques to be solved efficiently.
Counting Motifs in Graphs. Recently much attention has been devoted to the analysis of
complex networks arising in information systems, software systems, overlay networks etc. Min-
ing the most frequent subgraphs is here aimed to identify the building blocks of universal classes
of complex networks [78, 92]. As an example, the occurrence of a very large number of certain
dense subgraphs has been observed in the Webgraph, the graph formed by Web pages and hyper-
linked connections [88], in the attempt of tracing the emergence of hidden cyber-communities.
A stochastic model of the growth of the Webgraph [89], the ”copying model”, has these dense
subgraphs as building blocks of the process of network formation.
D. Coppersmith and S. Winograd showed that the number of triangles in a subgraph can be
counted using matrix multiplication [27]. Schank and Wagner [112] give an extensive experi-
mental study of the performance of algorithms for counting and listing triangles in graphs.
Finding frequent graph patterns also finds application in graph databases [123].
Valverde and Solé used the analysis of frequent network motifs in software architectures to
argue that network models based on duplication and diversification mechanisms account for a
significant part of the observed topological features of large software graphs [121].
2 Preliminaries
In this chapter we introduce notations and models used throughout the thesis. In Section 2.1 we
begin with general notations.
We will then formally introduce different data stream models and complexity measures used
for the analysis of data stream algorithms in Section 2.2. After considering general data streams
consisting of arbitrary items we will concentrate on point sets and graphs encoded as data
streams.
A big part of the thesis introduces new methods to cluster huge point sets into groups of points.
In Section 2.3 we will introduce different clustering objectives, i.e. the k-median, k-means, and
MaxCut clustering.
2.1 General Notations
We begin with some basic notations and definitions.
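We define
$$\mathbb{R}^+ := \{x \in \mathbb{R} \mid x > 0\}.$$
We use the $\tilde{O}$-notation to hide polylogarithmic factors. Formally, let $f: \mathbb{R} \to \mathbb{R}^+$ and
$g: \mathbb{R} \to \mathbb{R}^+$ be functions with positive function values. We use the notation
$$f(n) = \tilde{O}(g(n))$$
to measure the growth of $f(n)$ with $n$, iff there is a constant $k \in \mathbb{N}$ such that
$$f(n) = O(g(n) \cdot \log^k(g(n))) \quad \text{as } n \to \infty.$$
We will use $[n]$ to denote the set $\{0, \dots, n-1\}$. For $a, b \in \mathbb{R}$ we define
$$(a, b) := \{x \in \mathbb{R} \mid x > a \wedge x < b\}$$
and
$$[a, b] := \{x \in \mathbb{R} \mid x \ge a \wedge x \le b\}.$$
For $x \in \mathbb{R}$ and $\epsilon \in \mathbb{R}^+$ we define
$$(x \pm \epsilon) := [x - \epsilon, x + \epsilon].$$
We define multiplications of intervals $[a, b]$ with positive scalars $x \in \mathbb{R}^+$ as
$$[a, b] \cdot x := x \cdot [a, b] := [a \cdot x, b \cdot x]$$
and multiplications of positive intervals $[a, b]$ and $[c, d]$ having $0 < a < b$ and $0 < c < d$ as:
$$[a, b] \cdot [c, d] := [a \cdot c, b \cdot d].$$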
Euclidean Spaces
The $d$-dimensional vector space $\mathbb{R}^d$ is the set of all $d$-tuples $(x_1, \dots, x_d)$ of real numbers
$x_1, \dots, x_d$. The elements of the space are called vectors or points. For each point $p \in \mathbb{R}^d$ we
denote the $i$th component of the point as $p^{(i)}$, such that $p = (p^{(1)}, p^{(2)}, \dots, p^{(d)})$. Addition and
subtraction of points in $\mathbb{R}^d$ gives again a point in $\mathbb{R}^d$ and is defined component-wise:
$$\forall p, q \in \mathbb{R}^d: \; (p + q)^{(i)} = p^{(i)} + q^{(i)}.$$
The multiplication of a scalar $x \in \mathbb{R}$ with a point $p \in \mathbb{R}^d$ gives again a point in $\mathbb{R}^d$ and is defined
as
$$(x \cdot p)^{(i)} = x \cdot p^{(i)}.$$
Throughout the thesis we will always assume that $d$ is a constant. Therefore we will write
$O(d) = O(1)$.
If we restrict the coordinates of a point set to a subset $R$ of $\mathbb{R}$ we write $R^d$ for the set of
$d$-tuples of numbers from $R$. In the thesis we will use the special cases $[0, 1]^d$, where $R$ is the
compact interval of numbers between $0$ and $1$, and $[\Delta]^d = \{0, \dots, \Delta - 1\}^d$ with $\Delta \in \mathbb{N}$.
As distance measure between the points we will often use the Euclidean distance
$d: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^+$ defined as
$$d(p, q) := \sqrt{\sum_{i=1}^{d} \left(p^{(i)} - q^{(i)}\right)^2}.$$
We generalize this definition to sets, i.e.
$$\forall p \in \mathbb{R}^d \; \forall Q \subseteq \mathbb{R}^d: \; d(p, Q) = \min_{q \in Q} d(p, q)$$
and
$$\forall Q \subseteq \mathbb{R}^d \; \forall R \subseteq \mathbb{R}^d: \; d(Q, R) = \min_{q \in Q, r \in R} d(q, r).$$
For a finite set $P = \{p_1, \dots, p_n\}$ of points from $\mathbb{R}^d$ the center of gravity $\mu(P)$ is itself a point
in $\mathbb{R}^d$ and defined as
$$\mu(P) = (\mu(P)^{(1)}, \dots, \mu(P)^{(d)}) \quad \text{with} \quad \mu(P)^{(i)} := \frac{1}{n} \cdot \sum_{p \in P} p^{(i)}.$$
We also call the center of gravity $\mu(P)$ the mean of $P$.
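The definitions above can be illustrated with a small Python sketch. It is not part of the thesis; the function names `dist`, `dist_to_set`, and `center_of_gravity` are hypothetical, and points are assumed to be tuples of equal length.

```python
import math

def dist(p, q):
    """Euclidean distance d(p, q) between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def dist_to_set(p, Q):
    """d(p, Q): distance from p to the nearest point of the non-empty set Q."""
    return min(dist(p, q) for q in Q)

def center_of_gravity(P):
    """mu(P): component-wise mean of a non-empty list of points."""
    n = len(P)
    d = len(P[0])
    return tuple(sum(p[i] for p in P) / n for i in range(d))

# Four corners of a square with side length 2.
P = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
print(dist((0.0, 0.0), (3.0, 4.0)))
print(dist_to_set((5.0, 2.0), P))
print(center_of_gravity(P))
```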
2.2 Data Streams
We now define different types of data streams. The first models we present, the cash register
and turnstyle models, are very general and encode (huge) multisets of arbitrary items. We will
then present special cases of these cash register and turnstyle models we obtain when we consider
points in $[\Delta]^d$ as items. The respective data streams are called geometric data streams and
dynamic geometric data streams. Finally we will define streams which encode graphs.
Cash Register and Turnstyle Model
We first define the cash register and turnstyle models of data streams. They are very gen-
eral and have been subject of many recent theoretical and practical papers. See the survey of
Muthukrishnan [105] for a lineup of recent results.
Let $Q$ be a finite set of $U$ items of different kinds. Let $q_0, \dots, q_{U-1}$ be these items. We assume
that $Q$ is a very large set.
The data streams we consider encode a multiset of items from $Q$. We use
a $U$-dimensional vector $x = (x_0, x_1, \dots, x_{U-1})$ to describe this multiset: $x_i$ signals that we have
exactly $x_i$ items of kind $q_i$ in the current multiset. We assume that we always have at most $M - 1$
items of the same kind in our multiset, i.e. $x \in [M]^U$.
We assume that at the beginning of the data stream the multiset is empty, i.e. all $U$ components
of $x$ are zero.
In the turnstyle model the data stream consists of a sequence of $m$ update operations on the
vector $x$. Each update operation has the form UPDATE($i, a$) with $i \in [U]$ and $a \in \{-M, \dots, M\}$
and means that we have to add $a$ to $x_i$. All update operations in the stream lead to an $x_i$ value
within $\{0, \dots, M - 1\}$. Thus, at any moment, $0 \le x_i < M$ for $i \in [U]$ (our algorithms will
not verify this assumption). An operation UPDATE($i, a$) with $a \ge 0$ can be seen as adding $a$
items of kind $q_i$ to our current multiset. An operation UPDATE($i, -a$) with $a > 0$ can be seen
as deleting $a$ items of kind $q_i$ from our current multiset. This way we have a huge multiset of
items encoded in the stream which is dynamically changing.
In the cash register model all UPDATE($i, a$) operations have a value $a \ge 0$. Therefore we have
to deal only with insertions of items into the current multiset; deletions do not occur in the stream.
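To make the semantics of the two models concrete, the following Python sketch replays a stream of UPDATE operations on an explicitly stored vector. This is only a reference semantics: storing $x$ explicitly is exactly what a streaming algorithm cannot afford. The function name `apply_updates` is hypothetical, and unlike the algorithms in the thesis the sketch does check the assumption $0 \le x_i < M$.

```python
from collections import defaultdict

def apply_updates(updates, M):
    """Replay UPDATE(i, a) operations and return the non-zero entries of x.

    updates: iterable of pairs (i, a); negative a encodes deletions.
    M: bound on multiplicities, every x_i must stay within {0, ..., M-1}.
    """
    x = defaultdict(int)
    for i, a in updates:
        x[i] += a
        if not 0 <= x[i] < M:
            raise ValueError(f"x[{i}] = {x[i]} violates 0 <= x_i < M")
    return {i: v for i, v in x.items() if v != 0}

# A turnstyle stream: two items of kind 3 are inserted and later deleted again.
stream = [(3, 2), (7, 1), (3, -2), (5, 4)]
print(apply_updates(stream, M=10))

# A cash register stream is the special case in which every a is non-negative.
is_cash_register = all(a >= 0 for _, a in stream)
print(is_cash_register)
```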
Our algorithms see the data stream one update operation at a time, and we cannot store all data
items due to memory restrictions. In particular, random access to the data is impossible. At certain
times the algorithms are asked to answer certain queries about the current vector $x$, respectively
the current multiset. For example, in Chapter 3 we show how to return a random element from
the current multiset, each element with approximately the same probability independently of its
multiplicity.
Most of our algorithms will output just an approximate solution to a given problem. For example,
when we consider optimization problems on the current point set, we will only require
the algorithm to return a solution having objective function value within $(1 \pm \epsilon) \cdot \mathrm{Opt}$, where
$\mathrm{Opt}$ denotes the optimal objective function value. We call this solution a $(1 \pm \epsilon)$-approximate
solution. We will always assume that $\epsilon < 1/100$.
Space complexity. We assume that the size $U$ of the set of possible items, the length $m$ of the
stream, and the maximum number of occurrences $M$ of one item can be very large. In particular
it is impossible to store the whole set of items locally.
Therefore the whole data stored by our algorithm should have bit complexity which is at most
polylogarithmic in U,m, and M.
Time complexity. We measure time complexity in the Real Random Access Machine (Real
RAM) model, as usual in computational geometry to avoid calculations bounding numerical
errors. However, note that all algorithms could be altered to use only numbers that could be
stored using $c = O(\log(mUM/\epsilon))$ bits, where $\epsilon$ is a small value related to the accuracy of our
data streaming algorithms. Since all of our geometric algorithms return approximate results, we
are confident that all stated computation times can also be proven in a RAM model having $c$ bits
per register.
There are two different measures of time complexity. First, in many real world scenarios we
have to react very fast to each UPDATE operation. Therefore we are supposed to develop algo-
rithms which spend time polylogarithmic in U,m, and Mto process an UPDATE operation. The
polynomial degree should be as small as possible. Second there is the time to answer a query on
the current set of items. The time can be considerably longer than the update time, but it should
also preferably be polylogarithmic in U,m, and M.
Dynamic Geometric Data Streams
The model of dynamic geometric data streams was first defined in [75]. A dynamic geometric
data stream consists of $m$ UPDATE operations on a point set $P \subseteq [\Delta]^d$ in a discrete
$d$-dimensional space which is initially empty. We always assume that $\Delta > 100$. An UPDATE
operation can be an ADD($p$) operation of a point $p \in [\Delta]^d$, which inserts $p$ into $P$, or a REMOVE($p$)
operation, which deletes $p$ from $P$. We assume that no point is inserted when it is already in $P$
and that no point is deleted from $P$ when it is not currently in the set.
The model of dynamic geometric data streams can be seen as a special case of the turnstyle
model by setting $U = \Delta^d$ and $M = 2$. We again assume that we only have access to the sequence
one operation at a time, and are allowed to use a number of memory bits polylogarithmic in $\Delta$ and
$m$.
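The reduction to the turnstyle model can be sketched in Python as follows. The thesis does not fix a particular bijection between $[\Delta]^d$ and $[\Delta^d]$; the base-$\Delta$ encoding below is one possible choice, and the names `point_to_index`, `index_to_point`, and `to_turnstyle` are hypothetical.

```python
def point_to_index(p, delta):
    """Map a point p in [delta]^d to an index in [delta^d] (base-delta digits)."""
    idx = 0
    for c in p:
        if not 0 <= c < delta:
            raise ValueError("coordinate outside [delta]")
        idx = idx * delta + c
    return idx

def index_to_point(idx, delta, d):
    """Inverse of point_to_index for points of dimension d."""
    coords = []
    for _ in range(d):
        coords.append(idx % delta)
        idx //= delta
    return tuple(reversed(coords))

def to_turnstyle(op, p, delta):
    """Translate ADD(p) / REMOVE(p) into a turnstyle operation UPDATE(i, a)."""
    if op == "ADD":
        return (point_to_index(p, delta), 1)
    if op == "REMOVE":
        return (point_to_index(p, delta), -1)
    raise ValueError("unknown operation")

print(point_to_index((1, 2, 3), 10))
print(index_to_point(123, 10, 3))
print(to_turnstyle("REMOVE", (0, 5), 8))
```

Since every point is contained in $P$ at most once, each vector entry is 0 or 1, which matches $M = 2$.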
Adjacency Streams
Recently much effort has been made to investigate the structure of the webgraph. The difficulty
here is that it is practically impossible to hold such huge structures in main memory. In a first scenario we
could have crawls of the webgraph stored on hard disk and be able to access this data. However,
random access is very slow and has to be avoided. If we find algorithms which deal with the set
of edges given as a data stream, we could solve problems on the webgraph only using sequential
access to the data.
In another scenario one or more computers just crawl webpages. They read pages, follow links
and read the pages they are directed to recursively. This way these crawlers provide us with a
stream of edges.
In the adjacency stream model the data stream encodes an undirected graph $G = (V, E)$. We
assume that we first are given the node set $V$ and that we are able to store it in local memory.
Then we are able to read a data stream of edges sequentially, i.e. one edge at a time. We are not
allowed to go back in the data stream and read a previous item. We try to develop algorithms
which use fewer memory bits than needed to encode the graph, i.e. $o(|E|)$ memory bits. This model
is called the one pass adjacency stream model. Algorithms working in this model are called one
pass algorithms.
When the data is stored on hard disk we can do multiple passes over the data by reading it $k$
times sequentially. We then speak of the $k$ pass adjacency stream model and $k$ pass algorithms.
Incidence Streams
Incidence streams are a special form of adjacency streams, but now the edges have a special
kind of ordering. In the incidence model we assume that each edge $(v, w)$ appears twice, once
as $(v, w)$ and once as $(w, v)$. The whole set of $2 \cdot |E|$ edges is then presented as a data stream
ordered by source nodes. The model is more restricted than the general adjacency model. We
will develop algorithms for incidence streams in Section 9.2 having better memory bounds than
the algorithms for adjacency streams.
In Section 9.4 we will alter the model a little bit. We will look at directed graphs having
bounded out-degree of the nodes. We are again presented a data stream of edges, but this time
each (directed) edge appears just once. The edges are ordered by destination nodes.
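As an illustration of the (undirected) incidence model described above, the following Python sketch turns an edge list into an incidence stream. The name `incidence_stream` is hypothetical, and the order of edges sharing the same source node is an arbitrary choice here, since the model only fixes the ordering by source nodes.

```python
def incidence_stream(edges):
    """Return each undirected edge twice, as (v, w) and (w, v), ordered by source node."""
    directed = [(v, w) for v, w in edges] + [(w, v) for v, w in edges]
    # sorted() is stable, so edges with equal source keep their relative order.
    return sorted(directed, key=lambda e: e[0])

edges = [(1, 2), (2, 3)]
stream = incidence_stream(edges)
print(stream)
```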
2.3 Clustering
Assume we have a huge set of objects $O$. To understand the structure of the set it is often a good
idea to group the objects into clusters, such that objects in the same cluster are similar to each
other and different from objects in other clusters. To do that we first have to define the notion of
different. Mathematically this means to define a distance function $d: O \times O \to \mathbb{R}^+$ on pairs of
objects. Two objects having a large distance are different. The distance function can be defined
in various ways. One way is to embed the objects in the $d$-dimensional Euclidean space $\mathbb{R}^d$ by
an embedding $\alpha: O \to \mathbb{R}^d$. We can then define the distance between objects as the Euclidean
distance between the corresponding points:
$$\forall a, b \in O: \; d(a, b) := d(\alpha(a), \alpha(b)) = \sqrt{\sum_{i=1}^{d} \left(\alpha(a)^{(i)} - \alpha(b)^{(i)}\right)^2}.$$
Having such an embedding in mind we concentrate in the following on clustering points in Rd.
Based on the distance measure dbetween points we now define certain clustering objectives,
i.e. what is a good clustering and what is a bad clustering.
For huge point sets often the k-median or k-means objective functions are used to define good
clusterings. They have the advantage that each cluster can be represented by one single cluster
center.
The k-median and k-means objectives measure the distance of all points to their corresponding
cluster centers in the following way:
Weighted Euclidean k-median clustering.
In the weighted Euclidean $k$-median clustering problem we are given a weighted set $P$ of
points in $\mathbb{R}^d$ with weight function $w: P \to \mathbb{R}^+$. The goal is to find a set $C = \{c_1, \dots, c_k\}$ of
$k$ cluster centers in $\mathbb{R}^d$ and a partition of the set $P$ into $k$ clusters $C_1, \dots, C_k$ such that
$$\mathrm{Median}(P, C, C_1, \dots, C_k) := \sum_{i=1}^{k} \sum_{p \in C_i} w(p) \cdot d(p, c_i)$$
is minimized. In the unweighted version of the problem all point weights are 1.
If the partition $C_1, \dots, C_k$ relates each point to its nearest cluster center, i.e. if
$$\forall p \in P \; \forall i: \; p \in C_i \Rightarrow d(p, c_i) = \min_{j} d(p, c_j),$$
then we write for short
$$\mathrm{Median}(P, C) := \mathrm{Median}(P, C, C_1, \dots, C_k).$$
Weighted Euclidean k-means clustering.
In the weighted Euclidean $k$-means clustering problem we are given a weighted set $P$ of points
in $\mathbb{R}^d$ with weight function $w: P \to \mathbb{R}^+$. The goal is to find a set $C = \{c_1, \dots, c_k\}$ of $k$
cluster centers in $\mathbb{R}^d$ and a partition of the set $P$ into $k$ clusters $C_1, \dots, C_k$ such that
$$\mathrm{Means}(P, C, C_1, \dots, C_k) := \sum_{i=1}^{k} \sum_{p \in C_i} w(p) \cdot d(p, c_i)^2$$
is minimized. In the unweighted version of the problem all weights are 1.
If the partition $C_1, \dots, C_k$ relates each point to its nearest cluster center, i.e. if
$$\forall p \in P \; \forall i: \; p \in C_i \Rightarrow d(p, c_i) = \min_{j} d(p, c_j),$$
then we write for short
$$\mathrm{Means}(P, C) := \mathrm{Means}(P, C, C_1, \dots, C_k).$$
Weighted Euclidean MaxCut clustering.
The Euclidean MaxCut problem is the classical MaxCut problem on Euclidean graphs. Therefore,
the goal is to find a partition of a point set $P$ into two sets $L$ and $R$ such that the weight of
the edges of the complete Euclidean graph between $L$ and $R$,
$$\mathrm{Cut}(P, L, R) := \sum_{p \in L} \sum_{q \in R} d(p, q),$$
is maximized.
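The three objective functions can be evaluated directly from their definitions. The following Python sketch does so by brute force, assuming the partition assigns every point to its nearest center. The names `median_cost`, `means_cost`, and `cut_value` are hypothetical, and weights are passed as an optional dictionary that defaults to weight 1 (the unweighted case).

```python
import math

def dist(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def median_cost(P, C, w=None):
    """Median(P, C): weighted sum of distances to the nearest center."""
    w = w or {}
    return sum(w.get(p, 1) * min(dist(p, c) for c in C) for p in P)

def means_cost(P, C, w=None):
    """Means(P, C): weighted sum of squared distances to the nearest center."""
    w = w or {}
    return sum(w.get(p, 1) * min(dist(p, c) for c in C) ** 2 for p in P)

def cut_value(L, R):
    """Cut(P, L, R): total length of all edges between L and R."""
    return sum(dist(p, q) for p in L for q in R)

P = [(0.0, 0.0), (4.0, 0.0), (10.0, 0.0)]
C = [(0.0, 0.0), (10.0, 0.0)]
print(median_cost(P, C))
print(means_cost(P, C))
print(median_cost(P, C, w={(4.0, 0.0): 3}))
print(cut_value([(0.0, 0.0)], [(3.0, 4.0), (0.0, 1.0)]))
```

Evaluating a given solution this way takes time $O(|P| \cdot k)$ respectively $O(|L| \cdot |R|)$ and requires access to all points; it says nothing about how to find good centers or cuts in a stream.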
2.4 Chernoff Bounds with Limited Independence
We finally want to introduce a variant of Chernoff bounds from [113] we will frequently use in
the analysis of the streaming algorithms we develop. In contrast to traditional Chernoff bounds
[58] they assume only k-wise independence of the underlying random variables. A set of random
variables is called k-wise independent if the random variables in every subset of kvariables are
independent.
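Theorem 1 (Theorem 5, [113]). If $X$ is the sum of $k$-wise independent random variables, each
of which is confined to the interval $[0, 1]$ with $\mu = E[X]$, then:
• For $\delta \le 1$:
  – if $k \le \lfloor \delta^2 \mu e^{-1/3} \rfloor$, then $\Pr[|X - \mu| \ge \delta\mu] \le e^{-\lfloor k/2 \rfloor}$.
  – if $k = \lfloor \delta^2 \mu e^{-1/3} \rfloor$, then $\Pr[|X - \mu| \ge \delta\mu] \le e^{-\lfloor \delta^2 \mu / 3 \rfloor}$.
• For $\delta \ge 1$:
  – if $k \le \lfloor \delta \mu e^{-1/3} \rfloor$, then $\Pr[|X - \mu| \ge \delta\mu] \le e^{-\lfloor k/2 \rfloor}$.
  – if $k = \lfloor \delta \mu e^{-1/3} \rfloor$, then $\Pr[|X - \mu| \ge \delta\mu] \le e^{-\lfloor \delta \mu / 3 \rfloor}$.
• For $\delta \ge 1$ and $k = \lceil \delta\mu \rceil$:
  $$\Pr[|X - \mu| \ge \delta\mu] \le e^{-\delta \ln(1+\delta)\mu/2} < e^{-\delta\mu/3}.$$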
3 Sampling Data Streams
In this chapter we consider the problem of taking a random sample from a dynamic multiset of
elements. The sampling should be uniform, i.e. each element of the set should be chosen with
almost the same probability independent of its multiplicity. The dynamic multiset is given as
a data stream of insert and delete operations. Usually it is represented by a high dimensional
vector. The stream is then a stream of update operations on this vector as defined in the turnstyle
model in Section 2.2.
In this section we present a general sampling technique on turnstyle data streams. We note
that theoretically a similar method can be obtained from the methods Ganguly, Garofalakis, and
Rastogi used in [46]. However, they did not state a sampling result as this was not the purpose
of [46]. Cormode and Muthukrishnan [31] developed a sampling technique for turnstyle streams
similar to the technique given here. Their result was obtained at the same time independently of
our publication [42].
Our methods to sample random elements from turnstyle streams will be specialized to dynamic
geometric data streams in Chapter 4. We also give some direct applications of the sampling
method there. In Chapter 6 we will use the sampling results together with coreset techniques to
compute clusterings of dynamic data streams.
We denote by $x \in [M]^U$ the current vector and by UPDATE($i, a$) the update operations of the
vector encoded within the stream.
Let $\mathrm{Supp}(x) = \{i \in [U] \mid x_i > 0\}$ be the support of $x$. We use the notation $\|x\|_0 = |\mathrm{Supp}(x)|$. After
passing over the sequence of update operations we want to be able to answer queries which ask
for a uniformly distributed random element $i$ from $\mathrm{Supp}(x)$. We want to return the index $i$ and
the vector component value $x_i$.
Since we are working on a high dimensional vector $x = (x_0, \dots, x_{U-1})$, we do not want to use the
$\Theta(U \cdot \log M)$ memory bits needed by a trivial algorithm. Instead we will develop a data structure
which accomplishes the task in space complexity polylogarithmic in $M$ and $U$.
Our data structure is parametrized by two numbers $\epsilon, \delta > 0$. The operations are as follows:
• UPDATE($i, a$): performs $x_i \leftarrow x_i + a$, where $i \in [U]$, $a \in \{-M, \dots, M\}$.
• SAMPLE: returns either a pair $(i, x_i)$ for some $i \in [U]$ or a flag FAIL. The procedure
satisfies the following constraints:
  – If a pair $(i, x_i)$ is returned, then $i$ is chosen at random from $\mathrm{Supp}(x)$ such that for any
    $j \in \mathrm{Supp}(x)$,
    $$\Pr[i = j] = \frac{1}{\|x\|_0} \pm \delta.$$
  – The probability of returning FAIL is at most $\delta$.
Keeping $O(s)$ instances of this data structure it is possible to choose a sample set $S$ of size $s$
(almost) uniformly at random from the non-zero entries of $x$.
Our sample data structure uses two elementary data structures. The first such structure checks
whether there is exactly one non-zero entry in $x$. If this is the case, the index of the entry and its
value are returned. The second data structure approximates the number of non-zero entries in $x$.
3.1 The Unique Element (UE) Data Structure
The first elementary data structure called UE checks whether there is exactly one non-zero entry
in $x$. The data structure is deterministic. It supports the following operations on a vector
$x = (x_0, \dots, x_{U-1})$ with entries from $[M]$.
• UPDATE($i, a$): as above
• REPORT: if $\|x\|_0 \ne 1$, then it returns FAIL. If $\|x\|_0 = 1$, then it returns the unique pair
$(i, x_i)$ such that $x_i > 0$.
The data structure keeps three counters $c_0$, $c_1$, and $c_2$ which are initialized to 0. Our UPDATE
operation will ensure that $c_j = \sum_{i \in [U]} x_i \cdot i^j$ at any point of time. The operations UPDATE and
REPORT are implemented as follows:
    UPDATE(i, a)
        c_0 = c_0 + a
        c_1 = c_1 + a · i
        c_2 = c_2 + a · i^2

    REPORT
        if c_0 · c_2 − c_1^2 ≠ 0 or c_0 = 0 then return FAIL
        else i ← c_1 / c_0
             x_i ← c_0
             return (i, x_i)
Claim 3.1.1 The data structure UE uses $O(\log(UM))$ bits of space. One UPDATE operation
needs $O(1)$ time and a REPORT query needs $O(1)$ time. UE returns FAIL if and only if $\|x\|_0 \ne 1$.
Otherwise, it correctly returns the unique pair $(i, x_i)$ with $x_i \ne 0$.
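Proof: The maximum value of the counters $c_0, c_1, c_2$ is $O(U^3 \cdot M)$ and so the counters
need $O(\log(UM))$ bits. It remains to prove that the data structure correctly recognizes the case
$\|x\|_0 = 1$ and returns the unique pair $(i, x_i)$ with $x_i \ne 0$. From $x_i \ge 0$ for all $i \in [U]$ it follows
that $c_0 = 0$ if and only if $\|x\|_0 = 0$. Furthermore
$$\begin{aligned}
c_0 \cdot c_2 - c_1^2
&= \Big(\sum_{i \in [U]} x_i\Big) \cdot \Big(\sum_{i \in [U]} x_i \cdot i^2\Big) - \Big(\sum_{i \in [U]} x_i \cdot i\Big)^2 \\
&= \Big(\sum_{i \in [U]} x_i\Big) \cdot \Big(\sum_{i \in [U]} x_i \cdot i^2\Big) - \sum_{i,j \in [U]} x_i \cdot i \cdot x_j \cdot j \\
&= \sum_{i,j \in [U]} x_i x_j \cdot j^2 - \sum_{i,j \in [U]} x_i \cdot i \cdot x_j \cdot j \\
&= \frac{1}{2} \sum_{i,j \in [U]} x_i x_j \cdot j^2 + \frac{1}{2} \sum_{i,j \in [U]} x_i x_j \cdot i^2 - \sum_{i,j \in [U]} x_i x_j \cdot i \cdot j \\
&= \frac{1}{2} \sum_{i,j \in [U]} x_i x_j (j^2 - 2ij + i^2) \\
&= \frac{1}{2} \sum_{i,j \in [U]} x_i x_j (j - i)^2.
\end{aligned}$$
All summands are zero unless there exist $i, j \in [U]$ with $i \ne j$ and $x_i, x_j > 0$. In the latter case
at least one summand is positive. Hence, $c_0 \cdot c_2 - c_1^2 > 0$ iff $\|x\|_0 > 1$. The correctness of the data
structure follows immediately. □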
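The UE data structure translates almost line by line into Python. The sketch below is an illustration rather than the implementation used in the thesis: the class name `UniqueElement` is hypothetical, FAIL is represented by `None`, and Python's unbounded integers stand in for the $O(\log(UM))$-bit counters.

```python
class UniqueElement:
    """Detects whether the vector x has exactly one non-zero entry."""

    def __init__(self):
        # Invariant: c_j equals the sum of x_i * i^j over all indices i.
        self.c0 = 0
        self.c1 = 0
        self.c2 = 0

    def update(self, i, a):
        """UPDATE(i, a): add a to x_i."""
        self.c0 += a
        self.c1 += a * i
        self.c2 += a * i * i

    def report(self):
        """REPORT: return (i, x_i) if ||x||_0 = 1, otherwise None (FAIL)."""
        if self.c0 == 0 or self.c0 * self.c2 - self.c1 ** 2 != 0:
            return None
        # Exactly one non-zero entry: c1 = x_i * i and c0 = x_i.
        return (self.c1 // self.c0, self.c0)

ue = UniqueElement()
ue.update(5, 3)          # x_5 = 3
print(ue.report())       # one non-zero entry
ue.update(9, 1)          # x_9 = 1, now two non-zero entries
print(ue.report())
ue.update(9, -1)         # delete it again
print(ue.report())
```

Note that the test relies on the assumption $x_i \ge 0$ from the claim; with negative entries the counters could cancel.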
3.2 The Distinct Elements (DE) Data Structure
The data structure supports two operations on a vector $x = (x_0, \dots, x_{U-1})$ with entries from
$[M]$: UPDATE (as above) and REPORT, which with probability $1 - \delta$ returns a value $k \in [0, U]$
such that $\|x\|_0 \le k \le (1 + \psi) \cdot \|x\|_0$; the numbers $\delta$ and $\psi$ are parameters.
One can use a data structure from Ganguly, Garofalakis, and Rastogi [46] to solve this problem
using $O(1/\psi^2 \cdot \log(1/\psi) \log(1/\delta) \log(U) \log(UM))$ bits of space. An UPDATE operation needs
O(1
2·log(1/δ)) time, a REPORT operation needs O(log U)time.
3.3 A Sample Data Structure using Totally Random Hash Functions
The basic idea behind our data structure is to use a hash function that maps the universe to
a smaller space $[2^j]$. The value $2^j$ corresponds to a guess of the number of non-zero entries
currently present in vector $x$. Assuming a fully random hash function every non-zero entry is
mapped to $0$ with the same probability. Further, the event that exactly one non-zero entry
is mapped to $0$ can be checked using our unique element data structure. If this is the case, our
sample data structure returns the corresponding entry (it returns the index and the value of the
entry). We now give the procedure in detail.
Our data structure uses hash functions $h_j$, $j \in [\lceil \log U \rceil + 1]$. Each $h_j$ is of the form
$h_j: [U] \to [2^j]$. Initially, we assume that each $h_j$ is a fully random hash function; we relax this assumption
later. The value $2^j$ corresponds to the guess that, currently, there are roughly $2^j$ non-zero entries
in $x$.
In addition, we use:
• Unique Element data structures $\mathrm{UE}_j$, $j \in [\lceil \log U \rceil + 1]$, and
• A Distinct Elements data structure DE, with parameter $\psi = 1$ and error probability
parameter $1/22$.
We write $\mathrm{UE}_j$.UPDATE resp. $\mathrm{UE}_j$.REPORT to denote an UPDATE resp. REPORT operation for
data structure $\mathrm{UE}_j$. The operations are implemented as follows:
    UPDATE(i, a)
        for j ∈ [⌈log U⌉ + 1] do
            if h_j(i) = 0 then
                UE_j.UPDATE(i, a)
        DE.UPDATE(i, a)

    SAMPLE
        j = ⌈log(DE.REPORT)⌉
        return UE_j.REPORT
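Correctness. Assume that DE is correct. Note that this happens with probability at least
$1 - 1/22$. We have $\mathrm{UE}_j.\mathrm{REPORT} \ne \mathrm{FAIL}$ iff $|\mathrm{Supp}(x) \cap h_j^{-1}(0)| = 1$. Since we assume fully
random hash functions, the element reported by $\mathrm{UE}_j$ is an element chosen uniformly at random
from $\mathrm{Supp}(x)$.
It remains to show a lower bound on the probability of $|\mathrm{Supp}(x) \cap h_j^{-1}(0)| = 1$. Denote
$S_j = h_j^{-1}(0)$ and $\ell = \|x\|_0$. Because of $j = \lceil \log(\mathrm{DE.REPORT}) \rceil$ and
$\|x\|_0 \le \mathrm{DE.REPORT} \le 2\|x\|_0$, we observe $\ell \le 2^j \le 4\ell$. Thus
$$\Pr[|S_j \cap \mathrm{Supp}(x)| = 1] = \ell \cdot 2^{-j} \cdot (1 - 2^{-j})^{\ell - 1} \ge \frac{1}{4} \cdot (1 - 1/\ell)^{\ell - 1}.$$
The function $(1 - 1/\ell)^{\ell - 1}$ is monotonically decreasing for $\ell \ge 1$ and $\lim_{\ell \to \infty} (1 - 1/\ell)^{\ell - 1} = 1/e$.
Therefore we obtain:
$$\Pr[|S_j \cap \mathrm{Supp}(x)| = 1] \ge \frac{1}{4e} \ge \frac{1}{11}.$$
Since the error probability in our distinct elements structure is at most $1/22$, we obtain that
our algorithm returns a random element with probability at least $1/11 - 1/22 = 1/22$.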
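The interplay of the hash functions, the UE structures, and the distinct elements estimate can be sketched in Python as follows. Two simplifications are assumptions of this sketch and not part of the thesis: the DE structure of [46] is replaced by an exact dictionary of the non-zero entries, and each fully random hash function is simulated by a lazily filled table. Both use linear memory, so the sketch only illustrates the logic, not the space bound. The class names are hypothetical.

```python
import random

class UniqueElement:
    """UE structure: reports (i, x_i) iff exactly one entry is non-zero."""
    def __init__(self):
        self.c0 = self.c1 = self.c2 = 0
    def update(self, i, a):
        self.c0 += a
        self.c1 += a * i
        self.c2 += a * i * i
    def report(self):
        if self.c0 == 0 or self.c0 * self.c2 - self.c1 ** 2 != 0:
            return None  # FAIL
        return (self.c1 // self.c0, self.c0)

class Sampler:
    def __init__(self, U, seed=None):
        # Levels j = 0, ..., ceil(log2 U); (U - 1).bit_length() == ceil(log2 U).
        self.levels = (U - 1).bit_length() + 1
        self.rng = random.Random(seed)
        self.h = [dict() for _ in range(self.levels)]   # simulated random h_j
        self.ue = [UniqueElement() for _ in range(self.levels)]
        self.x = {}  # exact stand-in for the DE structure (assumption)

    def _hash(self, j, i):
        """h_j(i): a value in [2^j], drawn once and then remembered."""
        if i not in self.h[j]:
            self.h[j][i] = self.rng.randrange(2 ** j)
        return self.h[j][i]

    def update(self, i, a):
        for j in range(self.levels):
            if self._hash(j, i) == 0:
                self.ue[j].update(i, a)
        self.x[i] = self.x.get(i, 0) + a
        if self.x[i] == 0:
            del self.x[i]

    def sample(self):
        k = len(self.x)  # DE.REPORT would return k with ||x||_0 <= k <= 2||x||_0
        if k == 0:
            return None
        j = (k - 1).bit_length()  # ceil(log2 k)
        return self.ue[j].report()

s = Sampler(U=1024, seed=7)
for i in range(100, 108):
    s.update(i, i % 3 + 1)
print(s.sample())  # either None (FAIL) or a pair (i, x_i) with x_i > 0
```

A single call may return FAIL; as in the analysis above, the success probability is only a constant, so several independent instances would be kept in practice.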
3.4 A Sample Data Structure using Random Number Generators
We will now give two different methods to overcome the assumption of totally random hash functions
and achieve space complexity polylogarithmic in $M$ and $U$. First we will present a general
approach based on the random number generator of Nisan [106]. The method introduces a small
additive error in the probability of each element to be sampled. This error can be translated into
a small multiplicative error of $\epsilon$. However, the method is difficult to implement and for many
applications a multiplicative error is sufficient. In Section 3.5 we will present a method based
on pairwise independent hash functions. It is easy to implement but introduces a multiplicative
error in the probability of each element to be sampled.
To replace the assumption of fully random $h_j$ we use the following lemma developed by Indyk
[73] which is based on Nisan's random number generator [106].
Lemma 3.4.1 [73] Consider an algorithm $A$ which, given a stream $S$ of pairs $(i, a)$, and a
function $f: [n'] \times \{0, 1\}^{R'} \times \{-M' \dots M'\} \to \{-M'^{O(1)} \dots M'^{O(1)}\}$, does the following:
• Set $O = 0$; initialize length-$R'$ chunks $R_i$, $i \in [n']$, of independent random bits
• For each new pair $(i, a)$: perform $O = O + f(i, R_i, a)$
• Output $A(S) = O$
Assume that the function $f(\cdot, \cdot, \cdot)$ is supported with an evaluation algorithm using $O(C + R')$
space and $O(T)$ time. Then there is an algorithm $A'$ producing output $A'(S)$, that uses only
$O(C + R' + \log(M'n'))$ bits of storage and $O([C + R' + \log(M'n')] \log(n'R'))$ random bits,
such that
$$\Pr[A(S) \ne A'(S)] \le 1/n'$$
over some joint probability space of randomness of $A$ and $A'$. The algorithm $A'$ uses
$O(T + \log(n'R'))$ arithmetic operations per each pair $(i, a)$. □
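For each fixed $j$ our algorithm uses the hash function $h_j$ to select a subset $S_j := \{i \in [U] : h_j(i) = 0\}$ of indices. Each index is in the set $S_j$ with probability $1/2^j$. The UE data structure maintains the values $c_0 = \sum_{i \in S_j} x_i$, $c_1 = \sum_{i \in S_j} x_i \cdot i$, and $c_2 = \sum_{i \in S_j} x_i \cdot i^2$.
Instead of using a hash function $h_j$ to select the set $S_j$ we can use chunks $R_0, \ldots, R_{U-1}$ of $\log U$ random bits. We select $S_j$ as
$$S_j := \{i \in [U] : \text{all of the first } j \text{ bits of } R_i \text{ are } 0\}.$$
Again each index $i \in [U]$ is selected to be in the set $S_j$ with probability $1/2^j$. An update $(i, a)$ on the UE data structure then simply adds the functions $f_0$ to $c_0$, $f_1$ to $c_1$, and $f_2$ to $c_2$, where
$$f_0(i, R_i, a) := \begin{cases} 0 & \text{if one of the first } j \text{ bits of } R_i \text{ is } 1 \\ a & \text{if all of the first } j \text{ bits of } R_i \text{ are } 0 \end{cases}$$
$$f_1(i, R_i, a) := \begin{cases} 0 & \text{if one of the first } j \text{ bits of } R_i \text{ is } 1 \\ a \cdot i & \text{if all of the first } j \text{ bits of } R_i \text{ are } 0 \end{cases}$$
$$f_2(i, R_i, a) := \begin{cases} 0 & \text{if one of the first } j \text{ bits of } R_i \text{ is } 1 \\ a \cdot i^2 & \text{if all of the first } j \text{ bits of } R_i \text{ are } 0 \end{cases}$$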
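The chunk-based selection can be illustrated with a short sketch. It assumes the chunks are stored explicitly as truly random bit strings, which is exactly what the lemma later replaces by a pseudo-random generator; the class name `ChunkUE` and its methods are hypothetical.

```python
import random

class ChunkUE:
    """Counters c0, c1, c2 restricted to S_j = {i : first j bits of R_i are 0}.

    Illustration only: the chunks R_i are stored explicitly here, whereas the
    space-efficient version regenerates them from a short random seed.
    """

    def __init__(self, U, j, seed=0):
        rng = random.Random(seed)
        self.bits = max(1, (U - 1).bit_length())   # chunk length, about log U
        self.j = j                                 # requires j <= self.bits
        self.chunks = [rng.getrandbits(self.bits) for _ in range(U)]
        self.c0 = self.c1 = self.c2 = 0

    def selected(self, i):
        # The "first j bits" are taken to be the j most significant bits of R_i.
        return (self.chunks[i] >> (self.bits - self.j)) == 0

    def update(self, i, a):
        if self.selected(i):
            self.c0 += a            # f_0 contributes a
            self.c1 += a * i        # f_1 contributes a * i
            self.c2 += a * i * i    # f_2 contributes a * i^2
```

Each update is a sum of terms that depend only on $(i, R_i, a)$, which is the shape of computation that Lemma 3.4.1 requires.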
We use Lemma 3.4.1 and plug in the values $n' = 1/\delta + U$, $M' = U \cdot M$, $R' = \lceil \log(U) \rceil$, and $C = \log(UM)$. We can replace the random bits $R_i$ by $O([C + R' + \log(M'n')] \log(n'R')) = O(\log(UM/\delta) \log(U/\delta))$ random bits. We use the same random bits for all $j$ and get an algorithm having the desired properties and using just $O(\log^2(UM/\delta))$ bits of storage. The distribution changes by less than $\delta$.
Lemma 3.4.2 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1/22 - \delta$ returns a pair $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ and returns a flag FAIL otherwise. The statistical difference from the distribution of the returned pair to a uniform distribution is at most $\delta$, particularly $\Pr[i = j] = 1/\|x\|_0 \pm \delta$ for every $j \in \mathrm{Supp}(x)$. The algorithm uses $O(\log^2(UM/\delta))$ space. □
Theorem 2 (Sampling in data streams.) Let $\delta \le \frac{1}{44}$. Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns $s$ pairs $(i_0, x_{i_0}), \ldots, (i_{s-1}, x_{i_{s-1}})$ with $i_j \in \mathrm{Supp}(x)$ and returns a flag FAIL otherwise. The returned pairs are independent of each other and may contain duplicates. The statistical difference from the distribution of each returned pair to a uniform distribution is at most $\delta$, particularly for all $j \in \mathrm{Supp}(x)$ and all $k \in [s]$: $\Pr[i_k = j] = \frac{1}{\|x\|_0} \pm \delta$. The algorithm uses $O\big((s + \log(1/\delta)) \cdot \log^2(UM/\delta)\big)$ space.
Proof: We apply Lemma 3.4.2 and invoke $352 \cdot (s + \ln(1/\delta))$ instances of the data structure, each with independent random choices. Let $Y$ denote the random variable for the number of instances that return a random pair. Since $\delta \le \frac{1}{44}$ the probability of each instance to return a random element is at least $\frac{1}{44}$. Hence, $E[Y] \ge 8 \cdot (s + \ln(1/\delta))$ and
$$\Pr[Y < s] \le \Pr[Y \le (1 - (1/2)) E[Y]] \le e^{-(1/2)^2 \cdot E[Y]/2} \le \delta.$$
Therefore with probability at least $1 - \delta$ we have at least $s$ samples. We return the first $s$ samples we obtain from the instances. □
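The arithmetic of this amplification step can be spelled out as follows. This is a small sketch with hypothetical function names; the per-instance success probability $1/44$ and the factor 8 are taken from the proof, everything else is illustration.

```python
import math

def instances_needed(s, delta, success_prob=1 / 44, slack=8):
    """Number of independent copies so that E[Y] >= slack * (s + ln(1/delta))."""
    return math.ceil(slack / success_prob * (s + math.log(1 / delta)))

def chernoff_failure_bound(s, delta, success_prob=1 / 44):
    """Upper bound exp(-(1/2)^2 * E[Y] / 2) on Pr[Y <= E[Y]/2]."""
    expected = instances_needed(s, delta, success_prob) * success_prob
    return math.exp(-(0.5 ** 2) * expected / 2)
```

With $E[Y] \ge 8(s + \ln(1/\delta))$ the bound evaluates to at most $e^{-s} \cdot \delta$, which is where the final $\le \delta$ in the proof comes from.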
We will now show that we can recover the whole vector $x$, if we spend space slightly larger than $\|x\|_0$. This will be useful in situations when the support of the current vector is very small.
Corollary 3.4.3 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$ and given an oracle which tells us in advance the value of $\|x\|_0$ at the end of the stream, there is a data structure that with probability $1 - \delta$ returns all pairs $(i_0, x_{i_0}), \ldots, (i_{\|x\|_0 - 1}, x_{i_{\|x\|_0 - 1}})$ and returns a flag FAIL otherwise. The algorithm uses $O\big(\|x\|_0 \cdot \log\frac{U}{\delta} \cdot \log^2\frac{UM}{\delta}\big)$ space.
Proof: If $U < 22$ we can store the whole vector using $O(\log M)$ space. Otherwise $\frac{1}{2U} \le \frac{1}{44}$ and we can apply the data structure of Theorem 2 with $s = 2\|x\|_0 (\ln \|x\|_0 + \ln(2/\delta))$ and error probability parameter $\frac{\delta}{2U}$. We simply return all distinct samples we obtain.
Let us fix an arbitrary index $j \in \mathrm{Supp}(x)$. We have
$$\Pr[\forall k \in [s] : j \ne i_k] \le \Big(1 - \frac{1}{\|x\|_0} + \frac{\delta}{2U}\Big)^s \le \Big(1 - \frac{1}{2\|x\|_0}\Big)^s \le e^{-\ln \|x\|_0 - \ln(2/\delta)} = \frac{\delta}{2\|x\|_0}.$$
It follows from the Union Bound:
$$\Pr[\exists j \in \mathrm{Supp}(x)\ \forall k \in [s] : j \ne i_k] \le \|x\|_0 \cdot \frac{\delta}{2\|x\|_0} \le \delta/2.$$
Therefore, the overall probability of failure is at most $\delta$.
The space requirement of the algorithm, with $\delta' = \frac{\delta}{2U}$, is:
$$O\Big(\big(s + \log\frac{2U}{\delta}\big) \cdot \log^2\frac{UM}{\delta'}\Big) = O\Big(\big(\|x\|_0 \cdot (\log \|x\|_0 + \log\frac{1}{\delta}) + \log\frac{U}{\delta}\big) \cdot \log^2\frac{UM}{\delta}\Big) = O\Big(\|x\|_0 \cdot \log\frac{U}{\delta} \cdot \log^2\frac{UM}{\delta}\Big).$$
□
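The choice of $s$ in this proof is a coupon-collector style calculation, and a quick numerical sketch makes the union bound concrete. The function names are hypothetical; the only input taken from the proof is that each draw hits a fixed support index with probability at least $1/(2\|x\|_0)$.

```python
import math

def draws_to_see_everything(n, delta):
    """s = 2 n (ln n + ln(2/delta)), rounded up, for a support of size n."""
    return math.ceil(2 * n * (math.log(n) + math.log(2 / delta)))

def miss_probability_bound(n, delta):
    """Union bound on the probability that some support index is never drawn,
    assuming each draw hits any fixed index with probability >= 1/(2n)."""
    s = draws_to_see_everything(n, delta)
    return n * (1 - 1 / (2 * n)) ** s
```

For every support size the bound stays below $\delta/2$, leaving the other $\delta/2$ for the failure probability of the sampler itself.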
A second consequence can be obtained by translating Theorem 2 (which uses independent
draws and thus the sample set may contain multiple copies of the same point) to the case of
sampling of random subsets.
Corollary 3.4.4 Let $s \le \|x\|_0/2$. Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns a subset $S \subseteq \mathrm{Supp}(x)$ of $s$ indices and all pairs $(i, x_i)$ for $i \in S$ and returns a flag FAIL otherwise. The statistical difference from the distribution of the returned subset to a uniform distribution is at most $\delta$, particularly $\Pr[j \in S] = \frac{s}{\|x\|_0} \pm \delta$ for every $j \in \mathrm{Supp}(x)$. The algorithm uses $O\big((s + \log(U/\delta)) \cdot \log^2(UM/\delta)\big)$ space.
Proof: Let us first assume that we have a data structure that returns an index distributed exactly uniformly among $\mathrm{Supp}(x)$. Then we can select $s' = c \cdot (s + \log(1/\delta))$ indices $i'_0, \ldots, i'_{s'}$ from $\mathrm{Supp}(x)$ uniformly at random with repetitions, where $c$ is a suitable constant. Since $s \le \|x\|_0/2$ we know that, as long as fewer than $s$ distinct indices have been chosen, each $i'_j$ is with probability at least $1/2$ an index that is not among the previously chosen indices. And so we get that with probability at least $1 - \delta/2$ we have at least $s$ distinct indices among $i'_0, \ldots, i'_{s'}$ for $c$ large enough. If there are more than $s$ distinct indices among $i'_0, \ldots, i'_{s'}$ we choose $s$ indices uniformly at random from the distinct indices. Clearly, the computed index set is distributed uniformly at random.
We use Theorem 2 to select $s'$ indices almost uniformly at random. It remains to deal with the fact that Theorem 2 does not provide us with an exact uniform distribution. We will use error probability parameter $\delta' = \delta/(s' \cdot U)$ when we apply the theorem. Each time when we choose a random point the statistical difference to the uniform distribution is at most $\delta' \cdot U$. Since we choose $s'$ elements the overall statistical difference of our process to the ideal one described above is at most $\delta' \cdot U \cdot s' = \delta$. Therefore, $\Pr[j \in S] = \frac{s}{\|x\|_0} \pm \delta$ for every $j \in \mathrm{Supp}(x)$ and the overall probability of error is at most $\delta$.
The space needed by the algorithm is
$$O\Big(\big(s' + \log\frac{1}{\delta'}\big) \cdot \log^2\frac{UM}{\delta'}\Big) = O\Big(\big(s + \log\frac{1}{\delta} + \log\frac{s' \cdot U}{\delta}\big) \cdot \log^2\frac{U^2 M (s + \frac{1}{\delta})}{\delta}\Big) = O\Big(\big(s + \log\frac{U}{\delta}\big) \cdot \log^2\frac{UM}{\delta}\Big).$$
□
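The reduction from draws with repetitions to a random subset can be sketched in a few lines. The function `subset_from_draws` is a hypothetical name, and the draws are assumed to be supplied by a sampler such as the one in Theorem 2.

```python
import random

def subset_from_draws(draws, s, rng):
    """Turn independent draws (with repetitions) into a set of s distinct indices.

    Returns None, standing in for the flag FAIL, when fewer than s distinct
    indices were drawn. `rng` is a random.Random instance.
    """
    distinct = list(dict.fromkeys(draws))   # drop repeats, keep first occurrences
    if len(distinct) < s:
        return None
    # Choose s of the distinct indices uniformly at random, as in the proof.
    return set(rng.sample(distinct, s))
```

If the draws are exactly uniform, the set of distinct indices is symmetric in the support, so picking $s$ of them uniformly yields a uniform $s$-subset.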
Lemma 3.4.5 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that returns all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ when $\|x\|_0 \le A$ and a flag FAIL if $\|x\|_0 > A$. The algorithm works with probability $1 - \delta$ and uses $O\big(A \cdot \log\frac{U}{\delta} \cdot \log^2\frac{UM}{\delta}\big)$ space.
Proof: We start the algorithm of Corollary 3.4.3 with the assumption that $\|x\|_0 = 2A + 2$ and with error probability parameter $\delta' = \delta/2$. We call it algorithm 1. In case that $\|x\|_0 \le 2A + 2$ at the end of the stream, this algorithm reconstructs all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$.
In parallel we start the algorithm of Corollary 3.4.4 with $s = A + 1$ and error probability parameter $\delta' = \delta/2$ and call it algorithm 2. If algorithm 2 returns $A + 1$ elements, we return FAIL. If it returns fewer than $A + 1$ elements, we know that with probability $1 - \delta/2$ we have $\|x\|_0 < 2A + 2$. In that case algorithm 1 provides us with all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ with probability $1 - \delta/2$. We count the number of pairs. If $\|x\|_0 \le A$, we return all pairs, otherwise we return FAIL. □
Remark 3.4.6 All of the results stated with an additive error of $\delta$ can be transferred to a multiplicative error of $\delta$: When we apply the results with $\delta' = \delta/U$, we get an additive error of $\delta/U$. So each element is sampled with probability $\frac{1}{\|x\|_0} \pm \frac{\delta}{U}$. Since $\frac{1}{\|x\|_0} \ge \frac{1}{U}$ we conclude that each element is sampled with probability $\frac{1}{\|x\|_0} \cdot (1 \pm \delta)$. Since all memory bounds depend only on $O(\log\frac{U}{\delta})$ the replacement of $\delta$ by $\frac{\delta}{U}$ does not change the memory bounds.
3.5 A Sample Data Structure using Pairwise Independent Hash Functions
In this section we focus on the development of a sample data structure that only uses pairwise independent hash functions [18] instead of the approach using a pseudo-random generator. We will consider a relative error instead of an additive error.
The basic idea behind the second sample data structure is similar to the structure using totally random hash functions.
Assume we knew the number $n$ of non-zero entries in our vector $x$. Then we could use a hash function $h$ that maps $[U]$ to a space somewhat larger than $n$, say to $[n/\epsilon]$. It is easy to see that such a hash function has about $\epsilon n$ collisions in expectation. This means that typically most of the non-zero entries in $x$ do not collide. We choose a value $\alpha \in [n/\epsilon]$ uniformly at random and check using the UE data structure whether exactly one non-zero entry was mapped to $\alpha$. If this is the case we return the corresponding unique pair $(i, x_i)$ with $h(i) = \alpha$ and $x_i \ne 0$.
Since there are only few collisions this probability is close to $n/(n/\epsilon) = \epsilon$. If we keep $O(1/\epsilon)$ instances of the data structure, one of them is likely to return a pair. We will show that the index of the returned pair is almost uniformly distributed among the non-zero entries in $x$. To deal with the fact that we do not know the value $n$ and that it changes over time we follow the approach of Section 3.3. We will keep $\log U$ hash functions $h_j$, $1 \le j \le \lceil \log U \rceil + 2$, each corresponding to a guess $n \approx 2^j$. It will suffice to work with a good guess, which can be identified using our DE data structure.
We now give a detailed description of our sample data structure. Our data structure uses $O(\log U \cdot \log(1/\delta)/\epsilon)$ pairwise independent hash functions $h_{j,k}$ with $j \in [\lceil \log U \rceil + 2]$, $k \in [I]$, and number of instances $I = \log_{(1 - \epsilon/32)}(\delta/2) = O(\log(1/\delta)/\epsilon)$. Each $h_{j,k}$ is of the form $h_{j,k} : [U] \to [T]$ with $T := 2^{j+1}/\epsilon$ and corresponds to the $k$-th hash function for the $j$-th guess, $n \approx 2^j$. For each hash function we choose a value $\alpha_{j,k}$, $j \in [\lceil \log U \rceil + 2]$, $k \in [I]$, uniformly at random from $[T]$. Additionally, we need UE data structures $\mathrm{UE}_{j,k}$ for $j \in [\lceil \log U \rceil + 2]$ and $k \in [I]$. Each of these data structures is supposed to handle a subset of the UPDATE$(i, a)$ operations in the input stream. Namely, data structure $\mathrm{UE}_{j,k}$ will process all updates with $h_{j,k}(i) = \alpha_{j,k}$. We also need one instance of the DE data structure using parameters $\psi = 1$ and error probability parameter $\delta' = \delta/2$.
All hash functions, $\alpha_{j,k}$ and $\mathrm{UE}_{j,k}$ are initialized during a separate initialization step. We write $\mathrm{UE}_{j,k}$.UPDATE to denote an UPDATE operation for data structure $\mathrm{UE}_{j,k}$.
The operations UPDATE and SAMPLE are implemented as follows:
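UPDATE$(i, a)$
    for $j \in [\lceil \log U \rceil + 2]$ do
        for $k \in [I]$ do
            if $h_{j,k}(i) = \alpha_{j,k}$
                then $\mathrm{UE}_{j,k}$.UPDATE$(i, a)$
    DE.UPDATE$(i, a)$

SAMPLE
    $j = \lceil \log(\mathrm{DE.REPORT}) \rceil$
    if $\forall k$: $\mathrm{UE}_{j,k}$.REPORT $=$ FAIL then return FAIL
    $k_0 \leftarrow \min\{k \mid \mathrm{UE}_{j,k}.\mathrm{REPORT} \ne \mathrm{FAIL}\}$
    return $\mathrm{UE}_{j,k_0}$.REPORT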
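The two operations can be turned into a compact Python sketch. Several parts are assumptions made only for illustration: the hash family $h(i) = ((A \cdot i + B) \bmod p) \bmod T$ is a standard near-pairwise-independent choice rather than the family used in the thesis, the uniqueness test $c_1^2 = c_0 c_2$ inside `UE` is an assumed stand-in for the UE structure defined earlier (valid for non-negative entries), and the DE estimate is passed in by the caller instead of being maintained. All class and parameter names are hypothetical.

```python
import math
import random

PRIME = (1 << 61) - 1  # modulus of the hash family, assumed larger than U

class UE:
    """Counters c0 = sum x_i, c1 = sum x_i*i, c2 = sum x_i*i^2 over routed indices."""

    def __init__(self):
        self.c0 = self.c1 = self.c2 = 0

    def update(self, i, a):
        self.c0 += a
        self.c1 += a * i
        self.c2 += a * i * i

    def report(self):
        # Assumed uniqueness test for non-negative entries: c1^2 == c0 * c2.
        if self.c0 > 0 and self.c1 * self.c1 == self.c0 * self.c2:
            return (self.c1 // self.c0, self.c0)
        return None  # FAIL

class PairwiseSampler:
    def __init__(self, U, eps, delta, seed=None):
        rng = random.Random(seed)
        self.levels = math.ceil(math.log2(U)) + 2            # guesses n ~ 2^j
        self.I = math.ceil(math.log(delta / 2) / math.log(1 - eps / 32))
        self.T = [math.ceil(2 ** (j + 1) / eps) for j in range(self.levels)]
        self.h = [[(rng.randrange(1, PRIME), rng.randrange(PRIME))
                   for _ in range(self.I)] for _ in range(self.levels)]
        self.alpha = [[rng.randrange(self.T[j]) for _ in range(self.I)]
                      for j in range(self.levels)]
        self.ue = [[UE() for _ in range(self.I)] for _ in range(self.levels)]

    def update(self, i, a):
        for j in range(self.levels):
            for k in range(self.I):
                A, B = self.h[j][k]
                if ((A * i + B) % PRIME) % self.T[j] == self.alpha[j][k]:
                    self.ue[j][k].update(i, a)

    def sample(self, de_report):
        """de_report plays the role of DE.REPORT, an estimate n <= de_report <= 2n."""
        if de_report < 1:
            return None
        j = min(self.levels - 1, math.ceil(math.log2(de_report)))
        for k in range(self.I):          # first instance that does not FAIL
            result = self.ue[j][k].report()
            if result is not None:
                return result
        return None
```

A caller would feed every stream update to `update` and, at query time, pass the current distinct-elements estimate to `sample`; a return value of `None` corresponds to the flag FAIL.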
Lemma 3.5.1 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns a pair $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ such that $\Pr[i = j] = (1 \pm \epsilon)/\|x\|_0$ for every $j \in \mathrm{Supp}(x)$. The algorithm uses $O(\log(UM/\epsilon) \cdot \log U \cdot \log(1/\delta)/\epsilon)$ space. An UPDATE operation can be done in $O(\log U \cdot \log(1/\delta)/\epsilon)$ time and a SAMPLE query can be processed in $O(\log U + \log(1/\delta)/\epsilon)$ time.
Proof: Denote $n := |\mathrm{Supp}(x)| = \|x\|_0$ and assume that DE returns a value DE.REPORT having $n \le \mathrm{DE.REPORT} \le 2n$, which happens with probability at least $1 - \delta/2$. We choose $j = \lceil \log(\mathrm{DE.REPORT}) \rceil$ such that $n \le 2^j \le 4n$.
Let $k \in [I]$ be fixed. For simplicity of notation we define $h = h_{j,k}$ and $\alpha = \alpha_{j,k}$. We say that $l \in [U]$ is chosen, if it is the only entry $x_l$ from $x$ with $x_l \ne 0$ and $h(l) = \alpha$.
For fixed $l, m$ with $l \ne m$ the probability over the choice of $h$ for the event $h(l) = h(m)$ is $1/T$ by pairwise independence. Let us fix an $l$ with $x_l \ne 0$. Using the union bound we get
$$\Pr[\exists m \ne l : x_m \ne 0 \wedge h(l) = h(m)] \le (n-1)/T = \epsilon \cdot (n-1)/2^{j+1} \le \epsilon/2.$$
Since $\alpha$ is chosen uniformly at random from the target space $[T]$ and independently of the choice of $h$, the events $h(l) = \alpha$ and $\exists m \ne l\, (x_m \ne 0 \wedge h(l) = h(m))$ are independent. Therefore,
$$\frac{1}{T} \ge \Pr[l \text{ is chosen}] = \Pr[h(l) = \alpha] \cdot \big(1 - \Pr[\exists m \ne l\, (x_m \ne 0 \wedge h(l) = h(m))]\big) \ge \frac{1}{T} \cdot (1 - \epsilon/2).$$
Since these events are disjoint for all $l$ we get
$$\frac{n}{T} \ge \Pr[\text{any element is chosen}] \ge \frac{n}{T} \cdot (1 - \epsilon/2).$$
It follows for each $l$ with $x_l \ne 0$
$$\Pr[l \text{ is chosen} \mid \text{any element is chosen}] \ge \frac{\frac{1}{T} \cdot (1 - \epsilon/2)}{\frac{n}{T}} = (1 - \epsilon/2) \cdot 1/n$$
and
$$\Pr[l \text{ is chosen} \mid \text{any element is chosen}] \le \frac{1/T}{n/T \cdot (1 - \epsilon/2)} = \frac{1}{n \cdot (1 - \epsilon/2)} \le \frac{1}{n}(1 + \epsilon).$$
Thus, when an element is chosen, it is chosen almost uniformly at random from $\mathrm{Supp}(x)$. It remains to show that for at least one $k$ an element is chosen. From the argumentation above it follows:
$$\Pr[\text{any element is chosen}] \ge \frac{n}{T}\Big(1 - \frac{\epsilon}{2}\Big) \ge \frac{\epsilon n}{2^{j+2}}\Big(1 - \frac{\epsilon}{2}\Big) \ge \frac{\epsilon}{16}\Big(1 - \frac{\epsilon}{2}\Big) = \frac{\epsilon(2 - \epsilon)}{32} \ge \frac{\epsilon}{32}.$$
Note that our data structure fails exactly when the data structure DE does not work (probability $\delta/2$) or when no element is chosen for all $k$. No element is chosen, when for all $k$ the data structure $\mathrm{UE}_{j,k}$ fails. Since the number of instances $I$ is $\log_{(1 - \epsilon/32)}(\delta/2)$ we know that the probability that no element is chosen for any $k$ is at most $(1 - \epsilon/32)^I \le \delta/2$. Hence the overall probability of error is at most $\delta$ and our data structure is correct with probability at least $1 - \delta$.
The algorithm uses $O(\log U \cdot \log(1/\delta)/\epsilon)$ hash functions, $\alpha_{j,k}$ values and UE data structures. Each hash function uses $O(\log U)$ space, each $\alpha_{j,k}$ value uses $O(\log(U/\epsilon))$ space and each UE data structure uses $O(\log(UM))$ space. The DE data structure uses $O(\log(1/\delta) \log(U) \log(UM))$ space.
Hence the overall space complexity is $O(\log(UM/\epsilon) \cdot \log U \cdot \log(1/\delta)/\epsilon)$. We can evaluate the hash function in constant time and obtain the stated running times. □
Theorem 3 (Sampling in data streams.) Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns $s$ pairs $(i_0, x_{i_0}), \ldots, (i_{s-1}, x_{i_{s-1}})$ with $i_j \in \mathrm{Supp}(x)$ and returns a flag FAIL otherwise. The returned pairs are independent of each other and may contain duplicates. For all $j \in \mathrm{Supp}(x)$ and all $k \in [s]$: $\Pr[i_k = j] = \frac{1 \pm \epsilon}{\|x\|_0}$. The algorithm uses
$$O\big((s + \log(1/\delta)) \cdot \log(UM/\epsilon) \cdot \log(U)/\epsilon\big)$$
space. An UPDATE operation can be processed in $O((s + \log(1/\delta)) \cdot \log(U)/\epsilon)$ time. A query operation can be processed in $O((s + \log(1/\delta)) \cdot (\log U + 1/\epsilon))$ time.
Proof: We apply Lemma 3.5.1 and invoke $16 \cdot (s + \ln(1/\delta))$ instances of the data structure, each with independent random choices and error probability parameter $1/2$. Let $Y$ denote the random variable for the number of instances that return a random pair. The probability of each instance to return a random element is at least $1/2$. Hence, $E[Y] \ge 8 \cdot (s + \ln(1/\delta))$ and
$$\Pr[Y < s] \le \Pr[Y \le (1 - (1/2)) E[Y]] \le e^{-(1/2)^2 \cdot E[Y]/2} \le \delta.$$
Therefore with probability at least $1 - \delta$ we have at least $s$ samples. We return the first $s$ samples we obtain from the instances. □
Corollary 3.5.2 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$ and given an oracle which tells us in advance the value of $\|x\|_0$ at the end of the stream, there is a data structure that with probability $1 - \delta$ returns all pairs $(i_0, x_{i_0}), \ldots, (i_{\|x\|_0 - 1}, x_{i_{\|x\|_0 - 1}})$ and returns a flag FAIL otherwise. The algorithm uses $O(\|x\|_0 \cdot (\log \|x\|_0 + \log(1/\delta)) \cdot \log(UM) \cdot \log U)$ space. UPDATE operations and query operations can be processed in $O(\|x\|_0 \cdot (\log \|x\|_0 + \log(1/\delta)) \cdot \log U)$ time.
Proof: We apply Theorem 3 with $s = 2\|x\|_0 (\ln \|x\|_0 + \ln(2/\delta))$, $\epsilon = 1/2$, and error probability parameter $\delta/2$ and simply return all distinct samples we get.
Let us fix an arbitrary index $j \in \mathrm{Supp}(x)$. We have
$$\Pr[\forall k \in [s] : j \ne i_k] \le \Big(1 - \frac{1 - \epsilon}{\|x\|_0}\Big)^s \le \Big(1 - \frac{1}{2\|x\|_0}\Big)^s \le e^{-\ln \|x\|_0 - \ln(2/\delta)} = \frac{\delta}{2\|x\|_0}.$$
It follows from the Union Bound:
$$\Pr[\exists j \in \mathrm{Supp}(x)\ \forall k \in [s] : j \ne i_k] \le \|x\|_0 \cdot \frac{\delta}{2\|x\|_0} \le \delta/2.$$
Since the sampling fails with a probability at most $\delta/2$, the overall probability of failure is at most $\delta$. □
A second consequence can be obtained by translating Theorem 3 (which uses independent
draws and thus the sample set may contain multiple copies of the same point) to the case of
returning subsets.
Corollary 3.5.3 Let $s \le \|x\|_0/2$. Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that with probability $1 - \delta$ returns a subset $S \subseteq \mathrm{Supp}(x)$ of $s$ indices and all pairs $(i, x_i)$ for $i \in S$ and returns a flag FAIL otherwise. The algorithm uses $O\big((s + \log(1/\delta)) \cdot \log(UM) \cdot \log(U)\big)$ space. UPDATE operations and query operations can be processed in $O((s + \log(1/\delta)) \cdot \log(U))$ time.
Proof: We use the data structure of Theorem 3 with parameter $\epsilon = 1/2$ and error probability parameter $\delta' = \delta/2$ to obtain $s' = c \cdot (s + \log(1/\delta))$ indices $i'_0, \ldots, i'_{s'}$ from $\mathrm{Supp}(x)$ almost uniformly at random with repetitions, where $c$ is a suitable constant.
Since $s \le \|x\|_0/2$ we know that for each $i'_j$ either we have at least $s$ different indices previously chosen or $i'_j$ is with probability at least $\frac{1 - \epsilon}{\|x\|_0} \cdot (\|x\|_0 - s) \ge 1/4$ an index that is not among the previously chosen indices. We get that with probability at least $1 - \delta/2$ we have at least $s$ distinct indices among $i'_0, \ldots, i'_{s'}$ for $c$ large enough. If there are more than $s$ distinct indices among $i'_0, \ldots, i'_{s'}$ we choose arbitrarily $s$ indices from the distinct indices and return them. □
Lemma 3.5.4 Given a sequence of update operations on a vector $x = (x_0, \ldots, x_{U-1})$ with entries from $[M]$, there is a data structure that returns all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ when $\|x\|_0 \le A$ and a flag FAIL if $\|x\|_0 > A$. The algorithm works with probability $1 - \delta$ and uses $O(A \cdot (\log A + \log(1/\delta)) \cdot \log(UM) \cdot \log U)$ space. UPDATE operations and query operations can be processed in $O(A \cdot (\log A + \log(1/\delta)) \cdot \log(U))$ time.
Proof: We start the algorithm of Corollary 3.5.2 with the assumption that $\|x\|_0 = 2A + 2$ and with error probability parameter $\delta' = \delta/2$. We call it algorithm 1. In case that $\|x\|_0 \le 2A + 2$ at the end of the stream, this algorithm reconstructs all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$.
In parallel we start the algorithm of Corollary 3.5.3 with $s = A + 1$ and error probability parameter $\delta' = \delta/2$ and call it algorithm 2. If algorithm 2 returns $A + 1$ elements, we return FAIL. If it returns fewer than $A + 1$ elements, we know that with probability $1 - \delta/2$ we have $\|x\|_0 < 2A + 2$. In that case algorithm 1 provides us with all pairs $(i, x_i)$ with $i \in \mathrm{Supp}(x)$ with probability $1 - \delta/2$. We count the number of pairs. If $\|x\|_0 \le A$, we return all pairs, otherwise we return FAIL. □
4 Sampling Geometric Data Streams and Applications
In this chapter we will transfer the sampling technique of Section 3.4 to the context of dynamic geometric data streams. It gives us the ability to choose a point almost uniformly at random from the current point set $P$ encoded in a dynamic geometric data stream.
Based on that we will provide randomized streaming algorithms for three well-studied geometric problems over dynamic geometric streams:
1. Maintaining an $\epsilon$-net of $P$; that is, a subset $N \subseteq P$ such that for any range $R$ from a fixed family of ranges of VC dimension $D$ (e.g., set of all rectangles), we have $|N \cap R| > 0$, if $\frac{|R \cap P|}{|P|} \ge \epsilon$. We show how to maintain a set of $\tilde{O}\big(\frac{D + \log(1/\delta)}{\epsilon}\big)$ points that with probability $1 - \delta$ is an $\epsilon$-net of $P$.
Our data structure uses $O\big(\big(\frac{\log(1/\delta)}{\epsilon} + \frac{D}{\epsilon} \log\frac{D}{\epsilon}\big) \cdot d^2 \cdot \log^2(\Delta/\delta)\big)$ space.
2. Maintaining an $\epsilon$-approximation of $P$; that is, a subset $A \subseteq P$, such that for any range $R$ from a fixed family of ranges of bounded VC dimension $D$, we have $\frac{|A \cap R|}{|A|} = \frac{|R \cap P|}{|P|} \pm \epsilon$. In this case our algorithm maintains a set of $\tilde{O}\big(\frac{D + \log(1/\delta)}{\epsilon^2}\big)$ points that with probability $1 - \delta$ is an $\epsilon$-approximation.
Our data structure uses $O\big(\frac{1}{\epsilon^2}\big(D \log\frac{D}{\epsilon} + \log\frac{1}{\delta}\big) d^3 \log^3\frac{\Delta}{\delta}\big)$ space.
The $\epsilon$-approximations have applications to many problems, including Tukey depth, simplicial depth, regression depth, the Theil-Sen estimator, and the least median of squares [9].
3. Maintaining a $(1 + \epsilon)$-approximation of the cost of a minimum weight tree spanning the points in $P$. This quantity in turn makes it possible to achieve constant factor approximations for other problems, such as TSP or Steiner tree cost. Our algorithm uses space $O(\log(1/\delta) \cdot (\log \Delta/\epsilon)^{O(d)})$, and is correct with probability $1 - \delta$.
Having a random sample of points from the technique developed in the last chapter, the algorithms to maintain $\epsilon$-nets and $\epsilon$-approximations will follow relatively easily from [66] and [119].
To compute the weight of the Euclidean minimum spanning tree our sampling procedure is used in a more subtle way. It is known that the EMST weight can be expressed as a formula depending on the number of connected components in certain subgraphs of the complete Euclidean graph of the current point set [25, 20]. We use an algorithm from [25] to count the number of connected components in these subgraphs. This algorithm is based on a BFS-like procedure starting at a randomly chosen point $p$. The BFS runs for a constant number of rounds only and one can show that it can never leave a ball around $p$ of certain radius. Therefore, it suffices to maintain a random sample point and all points in a certain radius around this sample point. This task can also be approximately performed by a variant of our sampling routine described in Section 4.3.
4.1 Sampling Geometric Data Streams
First, we observe that we can apply our data structure in the setting of dynamic geometric data streams in the following way. We will use $U = \Delta^d$ and $M = 2$, $[M] = \{0, 1\}$. An ADD$(p)$ operation with $p = (p_0, \ldots, p_{d-1})$ is implemented as an UPDATE$(P, 1)$ operation with $P = \sum_j p_j \cdot \Delta^j$, i.e. by interpreting $p$ as a $\Delta$-ary number with $d$ digits. In a similar way, a REMOVE$(p)$ operation translates to UPDATE$(P, -1)$. Using the SAMPLE procedure we can get a pair $(i, x_i)$ having $x_i = 1$, which can also be re-interpreted as the corresponding unique point $p = (p_0, \ldots, p_{d-1})$ with $i = \sum_j p_j \cdot \Delta^j$. Thus we can sample a point from the current point set.
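The translation between points and vector indices is a plain base-$\Delta$ encoding. A minimal sketch, with hypothetical function names:

```python
def encode_point(p, delta):
    """Index sum_j p_j * delta**j of a point p = (p_0, ..., p_{d-1}) in [delta]^d."""
    index = 0
    for j, coord in enumerate(p):
        if not 0 <= coord < delta:
            raise ValueError("coordinate outside [delta]")
        index += coord * delta ** j
    return index

def decode_point(index, delta, d):
    """Inverse of encode_point: read off the d base-delta digits of index."""
    coords = []
    for _ in range(d):
        index, digit = divmod(index, delta)
        coords.append(digit)
    return tuple(coords)
```

An ADD$(p)$ then becomes an update of $+1$ at `encode_point(p, delta)`, a REMOVE$(p)$ an update of $-1$ at the same index, and a sampled index is mapped back with `decode_point`.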
We will now translate the results of Section 3.4 to the context of dynamic geometric data
streams.
Theorem 4 (Sampling in geometric data streams.) Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns $s$ points $r_0, \ldots, r_{s-1}$ from the current point set $P = \{p_0, \ldots, p_{n-1}\}$ and a flag FAIL otherwise. The returned points are independent of each other and may contain duplicates. The statistical difference from the distribution of each sample point to a uniform distribution is at most $\frac{\delta}{\Delta^d}$, particularly $\Pr[r_i = p_j] = \frac{1}{n} \pm \frac{\delta}{\Delta^d}$ for every $j \in [n]$.
The algorithm uses $O\big((s + \log(1/\delta)) \cdot d^2 \cdot \log^2(\Delta/\delta)\big)$ space.
Proof: We apply Theorem 2 with $\delta' = \delta/\Delta^d$. □
We remark that Theorem 4 requires that no point in $P$ occurs more than once, i.e. $P$ is not a multiset.
Corollary 4.1.1 Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns the current point set $P = \{p_0, \ldots, p_{n-1}\}$.
The algorithm uses $O\big(n d^3 \log^3\frac{\Delta}{\delta}\big)$ space.
Proof: This corollary follows directly from Corollary 3.4.3. □
Corollary 4.1.2 Let $s \le n/2$. Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns a subset $S = \{r_0, \ldots, r_{s-1}\}$ of $s$ points from the current point set $P = \{p_0, \ldots, p_{n-1}\}$. The statistical difference from the distribution of the returned subset to a uniform distribution is at most $\frac{\delta}{\Delta^d}$, particularly $\Pr[p_j \in S] = \frac{s}{n} \pm \frac{\delta}{\Delta^d}$ for every $j \in [n]$. The algorithm uses $O\big((s + d \cdot \log(\Delta/\delta)) \cdot d^2 \cdot \log^2(\Delta/\delta)\big)$ space.
Proof: This corollary follows directly from Corollary 3.4.4. □
4.2 $\epsilon$-Nets and $\epsilon$-Approximations in Data Streams
A consequence of Theorem 4 is that we can get $\epsilon$-nets and $\epsilon$-approximations of the current point set. We briefly recapitulate the definitions required for $\epsilon$-nets and $\epsilon$-approximations, which can, for example, be found in [52].
Definition 4.2.1 (Range Spaces) Let $X$ be a set of objects and $R$ be a family of subsets of $X$. Then we call the set system $\Sigma = (X, R)$ a range space. The elements of $R$ are the ranges of $\Sigma$. If $X$ is a finite set, then $\Sigma$ is called a finite range space.
Definition 4.2.2 (VC-dimension) The Vapnik-Chervonenkis dimension (VC-dimension) of a range space $\Sigma = (X, R)$ is the size of the largest subset of $X$ that is shattered by $R$. We say that a set $A$ is shattered by $R$, if $\{A \cap r \mid r \in R\} = 2^A$.
Definition 4.2.3 ($\epsilon$-nets, $\epsilon$-approximation) Let $\Sigma = (X, R)$ be a finite range space. A subset $N \subseteq X$ is called $\epsilon$-net, if $N \cap r \ne \emptyset$ for every $r \in R$ with $|r| \ge \epsilon |X|$. A subset $A \subseteq X$ is called $\epsilon$-approximation, if for every $r \in R$ we have $\Big|\frac{|A \cap r|}{|A|} - \frac{|r|}{|X|}\Big| \le \epsilon$.
Obviously, an $\epsilon$-approximation is always an $\epsilon$-net, while the converse is not necessarily true.
A Data Streaming Algorithm for $\epsilon$-Approximations. The following theorem by Vapnik and Chervonenkis shows that for any finite range space with constant VC-dimension $D$ a random sample of size $\tilde{O}\big(\frac{D + \log(1/\delta)}{\epsilon^2}\big)$ is an $\epsilon$-approximation with probability at least $1 - \delta$.
Theorem 5 [119] There is a positive constant $c$ such that if $(X, R)$ is any range space of VC-dimension at most $D$, $A \subseteq X$ is a finite subset and $\epsilon, \delta > 0$, then a random subset $B$ of cardinality $s$ of $A$ where $s$ is at least the minimum between $|A|$ and
$$\frac{c}{\epsilon^2} \cdot \Big(D \cdot \log\frac{D}{\epsilon} + \log\frac{1}{\delta}\Big)$$
is an $\epsilon$-approximation for $A$ with probability at least $1 - \delta$.
We can now combine Corollary 4.1.2 (or Corollary 4.1.1 in the case that the current point set is small) with Theorem 5 to obtain a data structure that with probability $1 - \delta$ returns an $\epsilon$-approximation of the current point set.
Theorem 6 Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns a set $A$ of $\tilde{O}\big(\frac{D + \log(1/\delta)}{\epsilon^2}\big)$ points that is an $\epsilon$-approximation of a range space $(X, R)$ with VC-dimension $D$. The algorithm uses $O\big(\frac{1}{\epsilon^2}\big(D \log\frac{D}{\epsilon} + \log\frac{1}{\delta}\big) d^3 \log^3\frac{\Delta}{\delta}\big)$ space.
Proof: Let $a = \frac{c}{\epsilon^2} \cdot \big(D \cdot \log\frac{D}{\epsilon} + \log\frac{3}{\delta}\big)$. By Theorem 5 a sample set of size $a$ is an $\epsilon$-approximation with probability at least $1 - \delta/3$. Let $P = \{p_0, \ldots, p_{n-1}\}$ denote the current point set. We can easily track the size $n$ of $P$ by increasing a counter with every ADD operation and decreasing it with every REMOVE operation. If $n \le a$ we use the data structure from Corollary 4.1.1 of size $O\big(a d^3 \log^3\frac{\Delta}{\delta}\big)$ to recover $P$ completely.
If $n > a$ we will use our data structure from Corollary 4.1.2 of size $O\big(a d^3 \log^3\frac{\Delta}{\delta}\big)$ to obtain a random sample of size $a$. We will use failure parameter $\delta' = \delta/3$. This guarantees that the overall statistical difference from the same process using the uniform distribution is at most $\delta'/\Delta^d \cdot n \le \delta/3$. Similarly, the data structure fails with probability at most $\delta' \le \delta/3$. And a set of size $a$ is with probability $1 - \delta/3$ an $\epsilon$-approximation. Summing up the errors we get an $\epsilon$-approximation with probability $1 - \delta$.
The space requirement follows immediately from Corollaries 4.1.2 and 4.1.1 and Theorem 5. □
A Data Streaming Algorithm for $\epsilon$-Nets. Haussler and Welzl showed that a random sample of size $\tilde{O}\big(\frac{D + \log(1/\delta)}{\epsilon}\big)$ is an $\epsilon$-net with probability at least $1 - \delta$.
Theorem 7 [66] Let $(X, R)$ be a range space of VC-dimension $D$, let $A$ be a finite subset of $X$ and suppose $0 < \epsilon, \delta < 1$. Let $N$ be a set obtained by $m$ random independent draws from $A$, where
$$m \ge \max\Big(\frac{4}{\epsilon} \log\frac{2}{\delta},\ \frac{8 \cdot D}{\epsilon} \log\frac{8 \cdot D}{\epsilon}\Big),$$
then $N$ is an $\epsilon$-net for $A$ with probability at least $1 - \delta$.
Combining Theorem 4 with Theorem 7 we obtain
Theorem 8 Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns a set $N$ of $\tilde{O}\big(\frac{D + \log(1/\delta)}{\epsilon}\big)$ points that with probability at least $1 - \delta$ is an $\epsilon$-net of a range space $(X, R)$ with VC-dimension $D$. The algorithm uses $O\big(\big(\frac{\log(1/\delta)}{\epsilon} + \frac{D}{\epsilon} \log\frac{D}{\epsilon}\big) \cdot d^2 \cdot \log^2(\Delta/\delta)\big)$ space.
Proof: We use a random sample (with repetitions) as given by the data structure from Theorem 4. We choose failure parameter $\delta' = \delta/3$. Since the statistical difference from the exact uniform distribution is at most $\delta' = \delta/3$, the failure probability is at most $\delta' = \delta/3$, and a set of size $\max\big(\frac{4}{\epsilon} \log\frac{2}{\delta'},\ \frac{8 \cdot D}{\epsilon} \log\frac{8 \cdot D}{\epsilon}\big)$ is an $\epsilon$-net with probability at least $1 - \delta' = 1 - \delta/3$, we get that our sample is an $\epsilon$-net with probability at least $1 - \delta$. □
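The sample size required by Theorem 7 is easy to evaluate. The sketch below uses a hypothetical function name and assumes the logarithm is taken to base 2, which the text above does not state explicitly.

```python
import math

def eps_net_sample_size(eps, delta, D):
    """Smallest integer m with
    m >= max(4/eps * log(2/delta), 8*D/eps * log(8*D/eps)).

    Assumption: logarithms are base 2.
    """
    first = 4 / eps * math.log2(2 / delta)
    second = 8 * D / eps * math.log2(8 * D / eps)
    return math.ceil(max(first, second))
```

In the proof of Theorem 8 the function would be called with the reduced failure parameter, for example `eps_net_sample_size(eps, delta / 3, D)`, and the resulting value used as the number $s$ of samples requested from Theorem 4.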
4.3 Random Sampling with Neighborhood Information
We now want to develop a more sophisticated sampling strategy. We would like to draw a set of points (almost) uniformly at random and for each point we also would like to know its neighborhood, for example all points within a distance of at most $z$ or all points in a square with side length $z$ centered at the random point. Formally, we define a neighborhood in the following way.
Definition 4.3.1 ($V$-neighborhood) Let $V = \{v_0, \ldots, v_{Z-1}\}$ denote a set of grid vectors with $v_0 = (0, \ldots, 0)$. We define the $V$-neighborhood of a point $p$ to be the set $N(V, p) = \bigcup_{i < Z} \{p + v_i\}$. We call $Z = |V|$ the size of the $V$-neighborhood.
We will typically assume that the size $Z$ of a $V$-neighborhood is small, i.e. polylogarithmic. We show that we are able to get information about the $V$-neighborhood of a sample point. This can be achieved in the following way:
For simplicity we identify $[\Delta]^d$ with $[\Delta^d]$, such that $P \subseteq \{0, \ldots, \Delta^d - 1\}$ and $V \subseteq \{0, \ldots, \Delta^d - 1\}$. We use Theorem 2 with $U = \Delta^d$ and $M = 2^Z$ to map the problem from the discrete Euclidean space to a vector problem. We want to maintain the invariant that the vector $x$ represents the neighborhood of each point in the way:
$$\forall p \in [\Delta^d] : x_p = \sum_{j=0}^{Z-1} a_j \cdot 2^j \quad \text{with } a_j = \begin{cases} 1 & \text{iff } p + v_j \in P \\ 0 & \text{iff } p + v_j \notin P \end{cases} \qquad (4.1)$$
where $P$ denotes the current point set after insert and delete operations.
Particularly $x_p \equiv 1 \bmod 2 \Leftrightarrow p \in P$.
To maintain the invariant we have to translate the insert and delete operations of points into $Z$ update operations in the following way:

ADD$(p)$
    for all $i \in [Z]$
        if $p - v_i \in [\Delta^d]$ then
            UPDATE$(p - v_i, 2^i)$

REMOVE$(p)$
    for all $i \in [Z]$
        if $p - v_i \in [\Delta^d]$ then
            UPDATE$(p - v_i, -2^i)$
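The invariant (4.1) and the two translation rules can be made concrete with a small sketch. It keeps the vector $x$ in an explicit dictionary and works with coordinate tuples instead of the identification of $[\Delta]^d$ with $[\Delta^d]$; both are simplifications for illustration, and the class name `NeighborhoodVector` is hypothetical. In the streaming algorithm the calls to `_update` would go to the sample data structure instead.

```python
class NeighborhoodVector:
    """Maintains x_q = sum_i 2^i * [q + v_i in P] under ADD and REMOVE."""

    def __init__(self, delta, offsets):
        assert all(c == 0 for c in offsets[0])   # v_0 must be the zero vector
        self.delta = delta
        self.offsets = offsets
        self.x = {}                              # sparse stand-in for the vector x

    def _update(self, q, a):
        value = self.x.get(q, 0) + a
        if value:
            self.x[q] = value
        else:
            self.x.pop(q, None)

    def _apply(self, p, sign):
        for i, v in enumerate(self.offsets):
            q = tuple(pc - vc for pc, vc in zip(p, v))      # q = p - v_i
            if all(0 <= c < self.delta for c in q):         # q inside the grid
                self._update(q, sign * (1 << i))

    def add(self, p):
        self._apply(p, +1)

    def remove(self, p):
        self._apply(p, -1)

    def neighbors(self, q):
        """Decode P ∩ N(V, q) from the bits of x_q."""
        bits = self.x.get(q, 0)
        return {tuple(qc + vc for qc, vc in zip(q, v))
                for i, v in enumerate(self.offsets) if (bits >> i) & 1}
```

Because $v_0$ is the zero vector, bit 0 of $x_q$ is set exactly when $q \in P$, which is the parity test used below to filter samples.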
We have to deal with the fact that the sample data structure used in Theorem 2 samples elements from $\mathrm{Supp}(x)$, which is a larger set than $P$.
Lemma 4.3.2 $|P| \ge \frac{\|x\|_0}{Z}$.
Proof: Let $i \in \mathrm{Supp}(x)$, so $x_i \ne 0$. Because of equation (4.1) there must be a responsible point $p \in P$ with $i \in N(p)$. Since $|N(p)| = Z$, $p$ can be responsible for the positivity of at most $Z$ entries $x_i$. Hence there are at most $|P| \cdot Z$ entries $i$ with $x_i \ne 0$, and we conclude $|P| \ge \frac{\|x\|_0}{Z}$. □
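We apply Theorem 2 with $s' = 16 \cdot Z \cdot (s + \ln(2/\delta))$ and $\delta' = \frac{\delta}{Z \cdot \Delta^d}$ and get a set $S'$ consisting of $s'$ pairs $(i, x_i)$ with $x_i \ne 0$. Starting with an empty sample set $S$ we check for every pair $(i, x_i)$ if $x_i \equiv 1 \bmod 2$. If this is the case we add the pair $(i, x_i)$ (containing the sample point $i \in P$) to our sample set $S$. We stop the procedure when $|S| = s$.
We first show that the set $S$ contains at least $s$ sample points with probability $1 - \delta$. Let $Y$ be the random variable for the number of samples having $x_i \equiv 1 \bmod 2$ (being sample points from $P$). For each sample $(i, x_i) \in S'$ and each $p \in P$ we have
$$\Pr[i = p] = \frac{1}{\|x\|_0} \pm \delta' \ge \frac{1}{|P| \cdot Z} - \delta' \ge \frac{1}{2 \cdot |P| \cdot Z}.$$
It follows that
$$E[Y] \ge \frac{1}{2 \cdot |P| \cdot Z} \cdot |P| \cdot s' = 8(s + \ln(2/\delta)).$$
From Chernoff Bounds [58]:
$$\Pr[Y < s] \le \Pr[Y \le (1 - \tfrac{1}{2}) E[Y]] \le e^{-(1/2)^2 \cdot E[Y]/2} \le \frac{\delta}{2}.$$
Since Theorem 2 gives us $s'$ samples with probability at least $1 - \delta/2$, we conclude that the constructed set $S$ consists of at least $s$ samples $(i, x_i)$ having $x_i \equiv 1 \bmod 2$ with probability $1 - \delta$. It remains to show that the sampling is almost uniform.
Let $S = \{(r_1, x_{r_1}), \ldots, (r_s, x_{r_s})\}$ and let $(k, x_k)$ be the first sample returned by the algorithm of Theorem 2. By the method to construct the set $S$ we see that for all $i \in \{1, \ldots, s\}$ and all $p \in P$:
$$\Pr[r_i = p] = \Pr[k = p] \cdot \frac{\|x\|_0}{|P|} = \Big(\frac{1}{\|x\|_0} \pm \delta'\Big) \cdot \frac{\|x\|_0}{|P|} = \frac{1}{|P|} \pm \frac{\delta}{Z} \cdot \frac{\|x\|_0}{|P| \cdot \Delta^d}.$$
Since $|P| \ge \frac{\|x\|_0}{Z}$ by Lemma 4.3.2 we have:
$$\Pr[r_i = p] = \frac{1}{|P|} \pm \frac{\delta}{\Delta^d}.$$
The space requirement to apply Theorem 2 with $s' = 16 \cdot Z \cdot (s + \ln(2/\delta))$, $\delta' = \frac{\delta}{Z \cdot \Delta^d}$, $U = \Delta^d$, and $M = 2^Z$ is:
$$O\Big(\big(s' + \log\frac{1}{\delta'}\big) \cdot \log^2\frac{UM}{\delta'}\Big) = O\Big(\big(Z \cdot (s + \log\frac{2}{\delta}) + \log\frac{Z \Delta^d}{\delta}\big) \cdot \log^2\frac{\Delta^d \cdot 2^Z \cdot Z \cdot \Delta^d}{\delta}\Big) = O\Big(s \cdot Z^3 \cdot d^3 \log^3\frac{\Delta}{\delta}\Big).$$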
Theorem 9 Let the set $V = \{v_0, \ldots, v_{Z-1}\}$ be fixed. Given a sequence of ADD and REMOVE operations of points from the discrete space $[\Delta]^d$, there is a data structure that with probability $1 - \delta$ returns $s$ points $r_0, \ldots, r_{s-1}$ from the current point set $P = \{p_0, \ldots, p_{n-1}\}$ such that $\Pr[r_i = p_j] = \frac{1}{n} \pm \frac{\delta}{\Delta^d}$ for every $j \in [n]$. The points are independent of each other and may contain duplicates. Additionally, the algorithm returns the sets $P \cap N(V, r_i)$ for every $i \in [s]$. The algorithm uses $\tilde{O}\big(s \cdot Z^3 \cdot d^3 \cdot \log^3(\Delta/\delta)\big)$ space. □
4.4 Estimating the Weight of a Euclidean Minimum Spanning Tree
In this section we will show how to estimate the weight of a Euclidean minimum spanning tree in a dynamic geometric data stream. We denote by $P = \{p_1, \dots, p_n\}$ the current point set. Further, EMST denotes the weight of the Euclidean minimum spanning tree of the current set.
We impose $\log_{1+\epsilon}(\sqrt{d}\,\Delta)$ square grids over the point space. The side lengths of the grid cells are $\frac{\epsilon \cdot (1+\epsilon)^i}{\sqrt{d}}$ for $0 \le i \le \log_{1+\epsilon}(\sqrt{d}\,\Delta)$. Our algorithm maintains certain statistics of the distribution of points in the grids. We show that these statistics can be used to compute a $(1+\epsilon)$-approximation of the weight EMST.
Our computation is based on a formula from [33] for the value of the minimum spanning tree of an $n$-point metric space. Let $G_P$ denote the complete Euclidean graph of a point set $P$ and $W$ an upper bound on its longest edge. Further let $c_P^{((1+\epsilon)^i)}$ denote the number of connected components in $G_P^{((1+\epsilon)^i)}$, which is the subgraph of $G_P$ containing all edges of length at most $(1+\epsilon)^i$. Under these assumptions we can use the formula from [33]:
\[ \frac{1}{1+\epsilon} \cdot \mathrm{EMST} \le n - W + \epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} W - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} \le \mathrm{EMST} \tag{4.2} \]
where $n$ is the number of points in $P$.
Instead of considering the number of connected components in $G_P^{(t)}$ for $t = (1+\epsilon)^i$ we first move all points of $P$ to the centers of a grid of side length $\frac{\epsilon \cdot t}{\sqrt{d}}$. After removing multiplicities we obtain the point set $P^{(t)}$. Then we consider the graph $G^{(t)}$ whose vertex set is $P^{(t)}$ and that contains an edge between two points if their distance is at most $t$. Instead of counting the connected components in $G_P^{(t)}$ we count the connected components in $G^{(t)}$. It follows from Claim 4.4.1 that this only introduces a small error. We denote by $c^{(t)}$ the number of connected components of $G^{(t)}$. Then we get
Claim 4.4.1
\[ c_P^{((1+\epsilon)^{i+1})} \le c^{((1+\epsilon)^i)} \le c_P^{((1+\epsilon)^{i-2})} \]
Proof: Let us consider two arbitrary points $p, q \in P$ and the centers $p', q'$ of their corresponding cells in the grid graph $G^{((1+\epsilon)^i)}$. Recall that the corresponding grid has side length $\frac{\epsilon \cdot (1+\epsilon)^i}{\sqrt{d}}$. Thus by moving $p$ and $q$ to the centers of the corresponding grid cells their distance changes by at most $\epsilon \cdot (1+\epsilon)^i$.
Now assume that $p, q$ are in the same connected component in $G_P^{((1+\epsilon)^{i-2})}$. Then they are connected by a path of edges of length at most $(1+\epsilon)^{i-2}$. If we now consider the path of the corresponding centers of grid cells, then any edge of the path has length at most $(1+\epsilon)^{i-2} + \epsilon \cdot (1+\epsilon)^i \le (1+\epsilon)^i$. Therefore, $p', q'$ are in the same connected component of the grid cell graph. We conclude $c^{((1+\epsilon)^i)} \le c_P^{((1+\epsilon)^{i-2})}$.
Assume that $p$ and $q$ are in the same connected component of the grid graph $G^{((1+\epsilon)^i)}$. They are connected by a path of edges of length at most $(1+\epsilon)^i$ in the grid graph $G^{((1+\epsilon)^i)}$. After switching to the point graph $G_P$ any edge of the corresponding path has a length of at most $(1+\epsilon)^i + \epsilon (1+\epsilon)^i = (1+\epsilon)^{i+1}$. Therefore $p$ and $q$ are in the same connected component of $G_P^{((1+\epsilon)^{i+1})}$. We conclude $c_P^{((1+\epsilon)^{i+1})} \le c^{((1+\epsilon)^i)}$. □
We denote by $n^{(t)} = |P^{(t)}|$ the number of non-empty grid cells of side length $\frac{\epsilon t}{\sqrt{d}}$. Our algorithm maintains approximations $\tilde{n}$, $\widetilde{W}$, $\tilde{n}^{(t)}$, and $\tilde{c}^{(t)}$ (for $t = (1+\epsilon)^0, (1+\epsilon)^1, (1+\epsilon)^2, \dots$) of the number $n$ of points currently in the set, the diameter $W$, the size $n^{(t)}$ of $P^{(t)}$, and the number $c^{(t)}$ of connected components in $G^{(t)}$, respectively. The approximation is then derived by inserting the maintained approximations into formula (4.2).
In the following we discuss the data structures we need to maintain our approximations.
4.4.1 Required Data Structures
Number of points. We observe that we can remember the value of $n$ exactly by increasing/decreasing $\tilde{n}$ in case of an ADD/REMOVE operation.
Diameter. We show how to maintain an approximation $\widetilde{W}$ of $W$ with $W \le \widetilde{W} \le 4\sqrt{d}\,W$, where $W$ is the largest distance between two points in the current point set. To do so, we maintain an approximation $\widetilde{W}_j$ of the diameter of the point set in each of the $d$ dimensions with $W_j \le \widetilde{W}_j \le 4 W_j$, where $W_j$ is the diameter in dimension $j$ for $1 \le j \le d$. The maximum of the $\widetilde{W}_j$ is our approximation $\widetilde{W}$.
We maintain the diameter of the point set in dimension $j$ in the following way: For each $i \in \{0, \dots, \log \Delta\}$ we introduce two one-dimensional grids $G_{i,1}$ and $G_{i,2}$, each of them having cells of side length $2^i$. $G_{i,2}$ is displaced by $2^{(i-1)}$ against $G_{i,1}$. Let $g_{i,1}$ and $g_{i,2}$ be the number of occupied cells in the grids $G_{i,1}$ and $G_{i,2}$, respectively.
We use our Distinct Elements data structure from Section 3.2 to count the number of grid cells containing a point. We only want to distinguish between the case $g_{i,1} = 1$ and $g_{i,1} > 1$ (we assume that there is always at least one point in the set; otherwise the problem becomes trivial).
If there is exactly one point in the current set we have $g_{0,1} = 1$ and $g_{0,2} = 1$ and the diameter is $0$. Otherwise, the diameter must be at least $1$. Therefore in the finest grids $G_{0,1}$ and $G_{0,2}$ at least two cells are occupied, which means that $g_{0,1} > 1$ and $g_{0,2} > 1$. We now find the smallest value $i$ such that $g_{i+1,1} = 1$ or $g_{i+1,2} = 1$. In this case we know that $W_j \le 2^{i+1}$.
Since $g_{i,1} > 1$ and $g_{i,2} > 1$, we know that in both grids $G_{i,1}$ and $G_{i,2}$ at least two cells are occupied. This means that the convex hull of the point set contains the border of a cell in both grids $G_{i,1}$ and $G_{i,2}$. Since these cell borders have a distance of at least $2^{i-1}$ we have $W_j \ge 2^{i-1}$. Therefore, we can output $\widetilde{W}_j = 2^{i+1}$ as a 4-approximation of the diameter in dimension $j$.
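The search over the two displaced grids can be sketched for a single dimension as follows. This is a minimal illustration under stated assumptions: coordinates are non-negative integers, the occupied cells are counted exactly with Python sets instead of the Distinct Elements structure of Section 3.2, and the function name `approx_diameter_1d` is hypothetical.

```python
def approx_diameter_1d(coords):
    """Return 0 for a single distinct value, otherwise 2^(i+1) for the smallest i
    such that one of the two grids of cell side 2^(i+1) has exactly one occupied cell."""
    pts = set(coords)
    if len(pts) <= 1:
        return 0
    i = 0
    while True:
        side = 2 ** (i + 1)
        g1 = {x // side for x in pts}                # occupied cells of G_{i+1,1}
        g2 = {(x + side // 2) // side for x in pts}  # G_{i+1,2}, displaced by 2^i
        if len(g1) == 1 or len(g2) == 1:
            return side
        i += 1
```

For the coordinates 3, 10, 12 (diameter 9) the loop stops at cell side 16, which lies between the diameter and four times the diameter.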
Size of $P^{(t)}$. The problem of finding an estimate $\tilde{n}^{(t)}$ with $(1-\epsilon)\, n^{(t)} \le \tilde{n}^{(t)} \le n^{(t)}$ is equivalent to maintaining the number of distinct elements in a data stream. This can be seen as follows. Once a point arrives we can determine its grid cell from its position. Thus we can interpret the input stream as a stream of grid cells and we are interested in the number of distinct grid cells. This can be approximated using an instance of the Distinct Elements (DE) data structure of Section 3.2.
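The reduction from points to grid cells can be illustrated with an exact counter. The sketch below is not the Distinct Elements structure of Section 3.2: it stores every occupied cell explicitly, so it uses far more space, and the class name `OccupiedCells` is hypothetical. It only shows how ADD and REMOVE operations on points translate into updates on cells.

```python
import math
from collections import Counter


class OccupiedCells:
    """Exact stand-in for the DE structure: tracks n^(t), the number of non-empty cells."""

    def __init__(self, t, eps, d):
        self.side = eps * t / math.sqrt(d)  # cell side length eps*t/sqrt(d)
        self.cells = Counter()

    def _cell(self, point):
        return tuple(math.floor(x / self.side) for x in point)

    def add(self, point):
        self.cells[self._cell(point)] += 1

    def remove(self, point):
        cell = self._cell(point)
        self.cells[cell] -= 1
        if self.cells[cell] == 0:
            del self.cells[cell]  # the cell became empty

    def n_t(self):
        return len(self.cells)
```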
The Sample Set. To approximate the number of connected components we have to maintain multisets $S^{(t)}$ of points chosen uniformly at random (with repetitions) from $P^{(t)}$. We will use $V_R$ to denote the set of grid vectors of length at most $R$. For each point $p \in S^{(t)}$ we maintain all points in the $V_R$-neighborhood of $p$ for some suitably chosen value $R$. Since our input stream contains ADD and REMOVE operations of points from $P$ rather than $P^{(t)}$ we have to map every point from $P$ to the corresponding point from $P^{(t)}$. This may have the effect that $P^{(t)}$ becomes a multiset although $P$ is not. This is no problem because our procedure from Lemma 3.4.2 samples from the support of the vector (or, in this case, from the support of the multiset). Straightforward modifications of Theorem 9 show that we can also maintain the required sets $S^{(t)}$.
Having such a sample and the value $\tilde{n}^{(t)}$ we can use an algorithm from [25] to obtain the number of connected components with sufficiently small error. This is proven in Section 4.4.2 using a modified analysis that charges the error in the approximation to the weight of the minimum spanning tree. This way we get our estimation $\tilde{c}^{(t)}$ of $c^{(t)}$.
4.4.2 Computing $\tilde{c}^{(t)}$
In this section we show how to compute our estimator $\tilde{c}^{(t)}$. To do this we will use our sample set $S^{(t)}$. In the computation of the sample set $S^{(t)}$ we need to specify the value $R$. We choose
\[ R := \log(\sqrt{d} \cdot \Delta) \cdot \frac{2^{d+4}\sqrt{d}}{\epsilon^2} \cdot t. \]
Further we need in the following the value $D := R/t$. Our algorithm for estimating $c^{(t)}$ works as follows. First, we check if $\widetilde{W} < 4t$. If that is the case, $W < 4t$ follows and for an arbitrary sample point we know that every point of the current point set is contained in the radius $R$. Therefore, we know the whole graph $G^{(t)}$ and can compute $c^{(t)}$ exactly.
Thus let us assume $\widetilde{W} \ge 4t$. In this case our algorithm is essentially similar to the one presented in [25], but our analysis is somewhat different. The difference comes from the special structure of our input graphs $G^{(t)}$. We exploit the lower bound from Lemma 4.4.2 below to relate the error induced by our approximation algorithm to the weight of the EMST.
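Lemma 4.4.2 If $\widetilde{W} \ge 4t$ then
\[ \mathrm{EMST} \ge \frac{n^{(t)} \epsilon t}{\sqrt{d}\, 2^{d+1}}. \]
Proof: We distinguish between the case $n^{(t)} \ge 2^{d+1}$ and $n^{(t)} < 2^{d+1}$.
We start with the case $n^{(t)} \ge 2^{d+1}$. In this case we can color the grid cells using $2^d$ colors in such a way that no two adjacent cells have the same color. Since we have $n^{(t)}$ occupied cells there must be one color $c$ which is assigned to at least $\left\lceil \frac{n^{(t)}}{2^d} \right\rceil$ occupied cells. Notice that $\left\lceil \frac{n^{(t)}}{2^d} \right\rceil \ge 2$. The occupied cells of the same color are pairwise not adjacent, therefore any pair of points that is contained in two distinct of these cells has a distance of at least $\frac{\epsilon \cdot t}{\sqrt{d}}$. We can conclude
\[ \mathrm{EMST} \ge \left( \left\lceil \frac{n^{(t)}}{2^d} \right\rceil - 1 \right) \frac{\epsilon t}{\sqrt{d}} \ge \frac{1}{2} \left\lceil \frac{n^{(t)}}{2^d} \right\rceil \frac{\epsilon t}{\sqrt{d}} \ge \frac{n^{(t)} \epsilon t}{\sqrt{d}\, 2^{d+1}}. \]
In the second case we get $W \ge \frac{t}{\sqrt{d}} \ge \frac{n^{(t)} \epsilon t}{\sqrt{d}\, 2^{d+1}}$ since $\widetilde{W} \ge 4t$. This implies the result. □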
We now present a description of our method to estimate $c^{(t)}$. The idea is to pick a random set of vertices (with repetition) and start a BFS with a stochastic stopping rule at each vertex $v$ from the sample to determine the size of the connected component of $v$. If the BFS explores the whole connected component we set a corresponding indicator variable $\beta$ to $1$, and otherwise to $0$. To implement this algorithm we can use our sample set $S^{(t)}$. The sample set provides a multiset of points from $P^{(t)}$ chosen uniformly at random. It also provides all other points within a distance of at most $R = Dt$. Since we consider only edges of length at most $t$ and since the algorithm below stops exploring a component when it has size $D$ or larger, the BFS cannot reach a point with distance more than $R$ from the starting vertex. Therefore, our sample set $S^{(t)}$ is sufficient for our purposes. We remark that the random points from $S^{(t)}$ are not chosen exactly uniformly. We will choose the failure parameter $\delta'$ in the sampling data structure in such a way that the deviation from the uniform distribution is between $(1-\epsilon)$ and $(1+\epsilon)$ times its probability in the uniform distribution (this means, we choose $\delta' = \epsilon \cdot \delta / \Delta^d$ and each point $p \in P^{(t)}$ is chosen with probability $(1 \pm \epsilon) \cdot \frac{1}{n^{(t)}}$). We take care of this fact in the analysis. The algorithm we use is given below.
APPROXCONNECTEDCOMPONENTS($P$, $t$, $\epsilon$)
  Choose $s$ points $q_1, \dots, q_s \in P^{(t)}$ uniformly at random
  for each $q_i$ do
    Choose integer $X$ according to distribution $\mathrm{Prob}[X \ge k] = 1/k$
    if $X \ge D$ then $\beta_i = 0$
    else
      if connected component of $G^{(t)}$ containing $q_i$ has at most $X$ vertices
        then set $\beta_i = 1$
        else set $\beta_i = 0$
  Output: $\hat{c}^{(t)} = \frac{\tilde{n}^{(t)}}{s} \cdot \sum_{i=1}^{s} \beta_i$
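Thus, $\beta_i$ is an indicator random variable for the event that the connected component containing $q_i$ has at most $X$ vertices. We first show upper and lower bounds on the expected output value of the algorithm. Then we compute the variance and use it to show that the output is concentrated around its expectation. We obtain
\begin{align*}
E[\beta_i] &= \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} \Pr[q_i \in C] \cdot \Pr[X \ge |C| \wedge X < D] \\
&\le \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} \Pr[q_i \in C] \cdot \Pr[X \ge |C|] \\
&\le \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} (1+\epsilon) \cdot \frac{|C|}{n^{(t)}} \cdot \frac{1}{|C|} = (1+\epsilon) \cdot \frac{c^{(t)}}{n^{(t)}}.
\end{align*}
For the output value
\[ \hat{c}^{(t)} = \frac{\tilde{n}^{(t)}}{s} \cdot \sum_{i=1}^{s} \beta_i \]
of our algorithm we obtain
\[ E[\hat{c}^{(t)}] \le \frac{\tilde{n}^{(t)}}{n^{(t)}} (1+\epsilon)\, c^{(t)} \le (1+\epsilon)\, c^{(t)}. \tag{4.3} \]
From
\begin{align*}
E[\beta_i] &= \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} \Pr[q_i \in C] \cdot \Pr[X \ge |C| \wedge X < D] \\
&\ge \sum_{\text{conn. comp. } C \text{ in } G^{(t)}} (1-\epsilon) \cdot \frac{|C|}{n^{(t)}} \cdot \left( \frac{1}{|C|} - \frac{1}{D} \right) \\
&= (1-\epsilon) \cdot \left( \frac{c^{(t)}}{n^{(t)}} - \frac{1}{D} \right)
\end{align*}
and Lemma 4.4.2 we obtain
\begin{align*}
E[\hat{c}^{(t)}] &\ge (1-\epsilon) \cdot \frac{\tilde{n}^{(t)}}{n^{(t)}} \left( c^{(t)} - \frac{n^{(t)}}{D} \right) \ge (1-\epsilon)^2 \left( c^{(t)} - \frac{\mathrm{EMST} \cdot 2^{d+1} \sqrt{d}}{\epsilon\, t\, D} \right) \tag{4.4} \\
&\ge (1-2\epsilon)\, c^{(t)} - \frac{\epsilon \cdot \mathrm{EMST}}{8\, t \log(\sqrt{d} \cdot \Delta)}. \tag{4.5}
\end{align*}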
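Our next step is to find an upper bound for the variance of $\hat{c}^{(t)}$. Since the $\beta_i$ are $\{0, 1\}$ random variables, we get:
\[ \mathrm{Var}[\beta_i] \le E[\beta_i^2] = E[\beta_i] \le (1+\epsilon) \cdot \frac{c^{(t)}}{n^{(t)}}. \]
By mutual independence of the $\beta_i$'s we obtain for the variance of $\hat{c}^{(t)}$ for fixed $\tilde{n}^{(t)}$:
\begin{align*}
\mathrm{Var}[\hat{c}^{(t)}] &= \mathrm{Var}\left[ \frac{\tilde{n}^{(t)}}{s} \sum_{i=1}^{s} \beta_i \right] \\
&= \frac{(\tilde{n}^{(t)})^2}{s^2} \cdot s \cdot \mathrm{Var}[\beta_i] \\
&\le \frac{(1+\epsilon) \cdot (\tilde{n}^{(t)})^2}{s} \cdot \frac{c^{(t)}}{n^{(t)}} \\
&\le \frac{(1+\epsilon) \cdot n^{(t)} c^{(t)}}{s}.
\end{align*}
Using (4.3) and (4.5) we obtain
\[ |c^{(t)} - E[\hat{c}^{(t)}]| \le 2\epsilon\, c^{(t)} + \frac{3\epsilon \cdot \mathrm{EMST}}{8 \cdot t \cdot \log(\sqrt{d} \cdot \Delta)}. \tag{4.6} \]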
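We choose $s$, the number of sample points, as
\[ s := \frac{(1+\epsilon)\, 2^{2d+10} \cdot d \cdot \log^2(\sqrt{d} \cdot \Delta) \cdot \log_{1+\epsilon}(\sqrt{d} \cdot \Delta)}{\epsilon^4} = O\left( \frac{\log^3 \Delta}{\epsilon^5} \right). \]
Chebyshev's inequality and Lemma 4.4.2 imply:
\begin{align*}
\Pr\left[ |\hat{c}^{(t)} - E[\hat{c}^{(t)}]| \ge \frac{\epsilon \cdot \mathrm{EMST}}{8 \cdot t \cdot \log(\sqrt{d} \cdot \Delta)} \right]
&\le \frac{(1+\epsilon) \cdot n^{(t)} \cdot c^{(t)}}{s} \cdot \frac{64 \cdot t^2 \cdot \log^2(\sqrt{d} \cdot \Delta)}{\epsilon^2 \cdot \mathrm{EMST}^2} \\
&\le \frac{(1+\epsilon) \cdot 64 \cdot d \cdot 2^{2d+2} \cdot \log^2(\sqrt{d} \cdot \Delta)}{s \cdot \epsilon^4} \\
&\le \frac{1}{4 \log_{1+\epsilon}(\sqrt{d} \cdot \Delta)}.
\end{align*}
Therefore we get together with (4.6):
Lemma 4.4.3 With probability $1 - \frac{1}{4 \log_{1+\epsilon}(\sqrt{d} \cdot \Delta)}$ we have $|\hat{c}^{(t)} - c^{(t)}| \le 2\epsilon\, c^{(t)} + \frac{\epsilon \cdot \mathrm{EMST}}{2 \cdot t \cdot \log(\sqrt{d} \cdot \Delta)}$.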
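The estimator APPROXCONNECTEDCOMPONENTS analysed above can be sketched on an explicit graph. This is only an illustration under simplifying assumptions: the graph is given as an adjacency dictionary instead of being reconstructed from the sample set $S^{(t)}$, the sample points are drawn exactly uniformly, and the names `approx_connected_components` and `component_has_at_most` are hypothetical. The variable $X$ with $\mathrm{Prob}[X \ge k] = 1/k$ is generated as $\lfloor 1/u \rfloor$ for $u$ uniform in $(0, 1]$.

```python
import random
from collections import deque


def component_has_at_most(adj, start, limit):
    """BFS from start; stop as soon as more than `limit` vertices have been seen."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                if len(seen) > limit:
                    return False
                queue.append(u)
    return True


def approx_connected_components(adj, s, D, n_tilde=None, rng=random):
    """Return (n_tilde / s) * sum of beta_i for s uniformly sampled start vertices."""
    vertices = list(adj)
    if n_tilde is None:
        n_tilde = len(vertices)  # assumption: exact vertex count is available
    hits = 0
    for _ in range(s):
        q = rng.choice(vertices)
        u = 1.0 - rng.random()   # uniform in (0, 1]
        X = int(1.0 / u)         # Prob[X >= k] = Prob[u <= 1/k] = 1/k
        if X < D and component_has_at_most(adj, q, X):
            hits += 1            # beta_i = 1
    return n_tilde * hits / s
```

With a large cut-off $D$ the expectation of the output is close to the true number of components, in line with (4.3) and (4.4); a small $D$ introduces the additive bias of order $n^{(t)}/D$ that the analysis charges to the weight of the EMST.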
It follows from the union bound that with probability at least $3/4$ all $\tilde{c}^{(t)}$ values satisfy the inequality in Lemma 4.4.3. It remains to sum up the overall error, taking into account that we considered connected components of the graph $G^{(t)}$ and not of the corresponding subgraph of $G_P$. Intuitively, the connected components of $G^{(t)}$ are sufficient because in each of the $G^{(t)}$ we moved every point by at most $\epsilon t$, which is small compared to the threshold edge length of $t$.
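Lemma 4.4.4 Let $\widetilde{M}$ be the output of our algorithm. Then
\[ |\mathrm{EMST} - \widetilde{M}| \le 69 \sqrt{d} \cdot \epsilon \cdot \mathrm{EMST}. \]
Proof: We will first show that our output value
\[ \widetilde{M} := n - \widetilde{W} + \epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, \tilde{c}^{((1+\epsilon)^i)} \]
is close to
\[ M_P := n - \widetilde{W} + \epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} \]
which is a $(1+\epsilon)$-approximation of the EMST value by equation (4.2). From Lemma 4.4.3 and Claim 4.4.1 it follows that
\[ (1-2\epsilon)\, c_P^{((1+\epsilon)^{i+1})} - \frac{\epsilon \cdot \mathrm{EMST}}{2 (1+\epsilon)^i \log(\sqrt{d}\,\Delta)} \le \tilde{c}^{((1+\epsilon)^i)} \le (1+2\epsilon)\, c_P^{((1+\epsilon)^{i-2})} + \frac{\epsilon \cdot \mathrm{EMST}}{2 (1+\epsilon)^i \log(\sqrt{d}\,\Delta)} \]
holds with probability at least $1 - \frac{1}{4 \log_{1+\epsilon}(\sqrt{d} \cdot \Delta)}$. By the union bound and some calculation we get with probability $3/4$: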
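\begin{align*}
\epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, \tilde{c}^{((1+\epsilon)^i)}
&\ge \epsilon \cdot (1-2\epsilon) \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^{i+1})} - \frac{1}{2} \epsilon \cdot \mathrm{EMST} \\
&\ge \epsilon \cdot (1-2\epsilon) \sum_{i=1}^{\log_{1+\epsilon} \widetilde{W}} (1+\epsilon)^{i-1} \, c_P^{((1+\epsilon)^i)} - \epsilon \cdot \mathrm{EMST} \\
&\ge \epsilon \cdot \frac{1-2\epsilon}{1+\epsilon} \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} - \frac{\epsilon (1-2\epsilon)}{1+\epsilon}\, n - \epsilon \cdot \mathrm{EMST} \\
&\ge (1-4\epsilon) \cdot \epsilon \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} - \epsilon n - \epsilon \cdot \mathrm{EMST}
\end{align*}
and
\begin{align*}
\epsilon \cdot \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, \tilde{c}^{((1+\epsilon)^i)}
&\le \frac{1}{2} \epsilon \cdot \mathrm{EMST} + \epsilon \sum_{i=-2}^{\log_{1+\epsilon} \widetilde{W} - 3} (1+\epsilon)^{i+2} (1+2\epsilon) \, c_P^{((1+\epsilon)^i)} \\
&\le \epsilon \cdot \mathrm{EMST} + \epsilon (1+\epsilon)^2 (1+2\epsilon) \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} + 2\epsilon (1+\epsilon)(1+2\epsilon)\, n \\
&\le (1+11\epsilon) \cdot \epsilon \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} + 12 \epsilon n + \epsilon \cdot \mathrm{EMST}
\end{align*}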
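which gives us (together with $\widetilde{W} \le 4\sqrt{d} \cdot \mathrm{EMST}$ and $n \le \mathrm{EMST}$) a bound on the difference of $M_P$ and $\widetilde{M}$:
\begin{align*}
|M_P - \widetilde{M}| &\le 11 \epsilon^2 \sum_{i=0}^{\log_{1+\epsilon} \widetilde{W} - 1} (1+\epsilon)^i \, c_P^{((1+\epsilon)^i)} + 12 \epsilon n + \epsilon \cdot \mathrm{EMST} \\
&= 11 \epsilon (M_P - n + \widetilde{W}) + 12 \epsilon n + \epsilon \cdot \mathrm{EMST} \\
&\le 24 \epsilon \cdot \mathrm{EMST} + 44 \sqrt{d}\, \epsilon \cdot \mathrm{EMST}.
\end{align*}
By the triangle inequality and (4.2) we get the final result:
\begin{align*}
|\widetilde{M} - \mathrm{EMST}| &\le |\widetilde{M} - M_P| + |M_P - \mathrm{EMST}| \le 24 \epsilon \cdot \mathrm{EMST} + 44 \sqrt{d}\, \epsilon \cdot \mathrm{EMST} + \epsilon \cdot \mathrm{EMST} \\
&\le 69 \sqrt{d}\, \epsilon \cdot \mathrm{EMST}.
\end{align*}
□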
From this lemma our final result follows immediately using standard amplification techniques to ensure that the estimation is correct at every point of time.
Theorem 10 Given a sequence of insertions / deletions of points from the discrete $d$-dimensional space $\{1, \dots, \Delta\}^d$ there is a streaming algorithm that uses $O(\log^3(1/\delta) \cdot (\log(\Delta)/\epsilon)^{O(d)})$ space and $O(\log^3(1/\delta) \cdot (\log(\Delta)/\epsilon)^{O(d)})$ time (for constant $d$) for each update and computes with probability $1-\delta$ a $(1+\epsilon)$-approximation of the weight of the Euclidean minimum spanning tree. □
5 The Coreset Method
In this chapter we introduce a method to reduce the complexity of huge point sets. Assume we are given a huge point set $P$. To understand the structure of the point set it is often a good idea to partition it into clusters, such that two points near one another are in the same cluster, but points far from each other are in different clusters. In this chapter we will look at different clustering objectives: $k$-median, $k$-means, and MaxCut clustering.
Computing clusterings on huge point sets directly is often impossible. Traditional clustering algorithms usually have time and space complexity at least linear in the number of points and need random access to the data. For huge point sets that do not fit into local memory at all, traditional algorithms are not applicable.
In this chapter we address this problem by introducing a new technique to reduce the complexity of the point set. We combine points into so-called weighted coreset points. One coreset point $p$ having a weight of $w(p)$ then represents $w(p)$ points of our input instance. We compute a coreset having certain theoretical guarantees. The most important ones are that it is small (its size is logarithmic in the number of points) and that each clustering solution computed on the coreset is a $(1 \pm \epsilon)$-approximate solution on the input point set.
Unlike previous coreset construction techniques [8, 60, 61] our method does not make assumptions on the distribution of points in advance. This will enable us to develop the fastest known PTAS for Euclidean MaxCut in Section 5.5.3, the first efficient $k$-median and $k$-means clustering algorithms for dynamic geometric data streams in Chapter 6, the first kinetic data structures for MaxCut in Chapter 7 and efficient $k$-means implementations for huge point sets in Chapter 8.
5.1 Definitions
The most important problems we will address with our coreset technique are the clustering problems $k$-median, $k$-means, and MaxCut as defined in Section 2.3. However, we will define a set of other example problems here, which can be solved using the coreset technique presented later as well.
The maximum matching problem asks to find a perfect matching of the points in $P$ that maximizes the sum of the lengths of the matching edges. For a matching $M$ we denote its cost by
\[ \mathrm{MaxMatching}(P, M) = \sum_{(p,q) \in M} d(p, q). \]
The maximum travelling salesperson problem is to find a simple cycle $C$ (a tour) of the points in $P$ with maximal cost. We denote its cost by
\[ \mathrm{MaxTSP}(P, C) = \sum_{(p,q) \in C} d(p, q). \]
We will also consider the problem of computing the average distance between points in $P$.
5.1.1 Oblivious Optimization Problems
In this section we define oblivious optimization problems over point sets. Intuitively, an oblivious optimization problem has the property that for a fixed input the set of feasible solutions depends on the cardinality of the input set and not on the input set itself. Hence, there is a set of solutions that are feasible independent of the position of the input points. The quality of the solutions, however, may depend on the positions. MaxCut, MaxMatching, MaxTSP, and AverageDistance can be formulated as oblivious optimization problems.
Let us consider an optimization problem $\Pi$ on point sets in $\mathbb{R}^d$. $\Pi$ can be either a maximization or minimization problem.
We call $\Pi$ an oblivious optimization problem, if it has the following structure. For any $n$-tuple of points $P = (p_1, \dots, p_n)$ let $S_\Pi(n)$ denote the set of feasible solutions, i.e. the set of feasible solutions depends only on the size of the input instance and not on the instance itself. Further, we have for each $n \in \mathbb{N}$ and $s \in S_\Pi(n)$ an objective function $\mathrm{cost}_\Pi^{(n,s)}$ that assigns to $P = (p_1, \dots, p_n)$ a non-negative cost. We assume that given a permutation $\pi$ of the points and a solution $s$, there is always another solution $s'$ having the cost $\mathrm{cost}_\Pi^{(n,s')}(\pi(p_1), \dots, \pi(p_n)) = \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n)$ (we say that $\Pi$ does not depend on the order of the points). If $\Pi$ is a maximization (minimization) problem then one seeks to find for a given $P = \{p_1, \dots, p_n\}$ the solution $s$ that maximizes (minimizes) $\mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n)$. We write $\mathrm{Opt}_\Pi(P) := \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n)$.
Example 5.1.1 Euclidean MaxCut: A feasible solution $s$ consists of a partition of $\{1, \dots, n\}$ into 2 groups $C_1, C_2$. For technical reasons we scale the usually used cost of the MaxCut problem by $\frac{1}{n}$. The cost of $s$ on $P = \{p_1, \dots, p_n\}$ is then given by
\[ \mathrm{cost}_{\mathrm{maxcut}}^{(n,s)}(p_1, \dots, p_n) = \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} d(p_i, p_j). \]
Example 5.1.2 Euclidean MaxMatching: Assume that $n$ is even. A feasible solution $s$ is a partition of $\{1, \dots, n\}$ into $n/2$ pairs $E_1, \dots, E_{n/2}$ where $E_i = (a_i, b_i)$. The cost of $s$ on $P = \{p_1, \dots, p_n\}$ is given by
\[ \mathrm{cost}_{\mathrm{maxmatching}}^{(n,s)}(p_1, \dots, p_n) = \sum_{i=1}^{n/2} d(p_{a_i}, p_{b_i}). \]
Example 5.1.3 Euclidean MaxTSP: A feasible solution $s$ is a permutation of $\{1, \dots, n\}$. The cost of $s$ on $P = \{p_1, \dots, p_n\}$ is given by
\[ \mathrm{cost}_{\mathrm{maxtsp}}^{(n,s)}(p_1, \dots, p_n) = d(p_{s(n)}, p_{s(1)}) + \sum_{i=1}^{n-1} d(p_{s(i)}, p_{s(i+1)}). \]
Example 5.1.4 Average Distance: Since we only want to estimate the value of the average distance, there is only one feasible solution $s$. For technical reasons we scale the average distance of the points by $n$ and denote the cost function of the solution by:
\[ \mathrm{cost}_{\mathrm{avgdistance}}^{(n,s)}(p_1, \dots, p_n) = \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} d(p_i, p_j). \]
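The four cost functions of Examples 5.1.1 to 5.1.4 translate directly into code. The sketch below is illustrative only: the function names are hypothetical, indices are 0-based instead of 1-based, and solutions are passed as plain Python lists (two index lists for a cut, a list of index pairs for a matching, a list of indices for a tour).

```python
import math


def maxcut_cost(points, c1, c2):
    """(1/n) * sum of d(p_i, p_j) over i in C1 and j in C2."""
    n = len(points)
    return sum(math.dist(points[i], points[j]) for i in c1 for j in c2) / n


def maxmatching_cost(points, pairs):
    """Sum of d(p_a, p_b) over the matched pairs (a, b)."""
    return sum(math.dist(points[a], points[b]) for a, b in pairs)


def maxtsp_cost(points, perm):
    """Length of the closed tour visiting the points in the order given by perm."""
    n = len(perm)
    return sum(math.dist(points[perm[i]], points[perm[(i + 1) % n]]) for i in range(n))


def avgdistance_cost(points):
    """(1/(n-1)) * sum of d(p_i, p_j) over all ordered pairs with i != j."""
    n = len(points)
    total = sum(math.dist(points[i], points[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n - 1)
```

In each case the solution argument refers only to indices, never to coordinates, which is exactly the obliviousness property: the same solution can be evaluated on any other point set of the same size.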
Since the definition of the solution is oblivious of the position of points we can speak of the change of the cost of solution $s$ when moving from point set $P$ to another point set $P'$. In particular our coreset construction needs the following two conditions.
Definition 5.1.5 ($\ell$-Lipschitz) Let $\ell \ge 1$ be a constant. We say that $\mathrm{cost}_\Pi^{(n,s)}$ is $\ell$-Lipschitz, if for arbitrary points $p_1, \dots, p_n$ and $p'_1, \dots, p'_n$ with $\sum_{i=1}^{n} d(p_i, p'_i) \le D$ we have
\[ \left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \le \ell \cdot D \]
and
\[ p_1 = p_2 = \dots = p_n \implies \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) = 0. \]
We call $\Pi$ $\ell$-Lipschitz, if for every $n \in \mathbb{N}$ and $s \in S_\Pi(n)$ the objective function $\mathrm{cost}_\Pi^{(n,s)}$ is $\ell$-Lipschitz.
Definition 5.1.6 ($\lambda$-mean preserving) Let $\lambda \le 1$ be a constant. We say that $\Pi$ is $\lambda$-mean preserving, if for any point set $P$ in $\mathbb{R}^d$ we have
\[ \mathrm{Opt}_\Pi(P) \ge \lambda \cdot \sum_{p \in P} d(p, \mu), \]
where $\mu := \mu(P)$ is the mean or center of gravity of $P$ (see Section 2.1).
We will show that all optimization problems stated before are $\ell$-Lipschitz and $\lambda$-mean preserving with constants $\ell$ and $\lambda$.
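Lemma 5.1.7 The Euclidean MaxCut problem is $\ell$-Lipschitz with $\ell = 1$ and $\lambda$-mean preserving with $\lambda = \frac{1}{4}$.
Proof: We first show that MaxCut is 1-Lipschitz.
\begin{align*}
&\left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \\
&= \left| \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} d(p_i, p_j) - \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} d(p'_i, p'_j) \right| \\
&\le \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} \left| d(p_i, p_j) - d(p'_i, p'_j) \right| \\
&= \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} \max \{ d(p_i, p_j) - d(p'_i, p'_j),\ d(p'_i, p'_j) - d(p_i, p_j) \} \\
&\le \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} \max \{ d(p_i, p'_i) + d(p'_i, p'_j) + d(p'_j, p_j) - d(p'_i, p'_j), \\
&\qquad\qquad\qquad\qquad\ \ d(p'_i, p_i) + d(p_i, p_j) + d(p_j, p'_j) - d(p_i, p_j) \} \\
&= \frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} \left( d(p_i, p'_i) + d(p'_j, p_j) \right) \\
&\le \frac{1}{n} \sum_{i \in C_1} n \cdot d(p_i, p'_i) + \frac{1}{n} \cdot n \cdot \sum_{j \in C_2} d(p_j, p'_j) \\
&= \sum_{i=1}^{n} d(p_i, p'_i) = D
\end{align*}
To show that MaxCut is $\lambda$-mean preserving we first show the following known inequality (also shown in [39]):
\[ d(p, \mu) \le \frac{1}{n} \sum_{q \in P} d(p, q). \tag{5.1} \]
First, note that by projecting all points on the line through $p$ and $\mu$, the left side does not change and the right side does not increase. We can therefore assume that all points lie on a line, i.e. that all points are real numbers. For real numbers we obtain:
\[ d(p, \mu) = |p - \mu| = \left| p - \frac{1}{n} \cdot \sum_{q \in P} q \right| = \frac{1}{n} \cdot \left| n \cdot p - \sum_{q \in P} q \right| = \frac{1}{n} \cdot \left| \sum_{q \in P} (p - q) \right| \le \frac{1}{n} \sum_{q \in P} |p - q|. \]
We now show that MaxCut is $\lambda$-mean preserving with $\lambda = \frac{1}{4}$. We first consider a random cut $C_1, C_2$. For each $i \in \{1, \dots, n\}$ we flip an unbiased coin to decide whether it belongs to $C_1$ or to $C_2$. Since for every pair of indices $i, j$ the probability of separating the indices is $\frac{1}{2}$, the distance from $p_i$ to $p_j$ is counted in the sum with probability $\frac{1}{2}$. The expected value of the objective function $\frac{1}{n} \sum_{i \in C_1} \sum_{j \in C_2} d(p_i, p_j)$ is therefore $\frac{1}{4n} \sum_{p,q \in P} d(p, q)$. Since Opt denotes the maximum value of such a cut, we have
\[ \mathrm{Opt} \ge \frac{1}{4n} \sum_{p,q \in P} d(p, q). \]
From equation (5.1) it follows that
\[ \mathrm{Opt} \ge \frac{1}{4} \sum_{p \in P} d(p, \mu), \]
which means that the Euclidean MaxCut problem is $\lambda$-mean preserving with $\lambda = \frac{1}{4}$. □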
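Both ingredients of this proof, inequality (5.1) and the resulting bound $\mathrm{Opt} \ge \frac{1}{4} \sum_{p \in P} d(p, \mu)$, can be checked by brute force on a small instance. The sketch below is purely illustrative: the function names are hypothetical, and the optimum of the scaled MaxCut objective is found by enumerating all $2^n$ cuts, which is only feasible for very small $n$.

```python
import math
from itertools import product


def centroid(points):
    """Mean (center of gravity) of a list of points given as tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def scaled_maxcut_opt(points):
    """Maximum over all cuts of (1/n) * sum of d(p_i, p_j) with i in C1, j in C2."""
    n = len(points)
    best = 0.0
    for side in product((0, 1), repeat=n):
        cut = sum(math.dist(points[i], points[j])
                  for i in range(n) for j in range(n)
                  if side[i] == 0 and side[j] == 1)
        best = max(best, cut / n)
    return best


def check_inequality_51(points, tol=1e-9):
    """True if d(p, mu) <= (1/n) * sum_q d(p, q) holds for every p."""
    mu = centroid(points)
    n = len(points)
    return all(math.dist(p, mu) <= sum(math.dist(p, q) for q in points) / n + tol
               for p in points)
```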
Lemma 5.1.8 The Euclidean Maximum Weighted Matching problem is $\ell$-Lipschitz with $\ell = 1$ and $\lambda$-mean preserving with $\lambda = \frac{1}{2}$.
Proof: We first show that Euclidean Maximum Weighted Matching is 1-Lipschitz.
\begin{align*}
&\left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \\
&= \left| \sum_{i=1}^{n/2} d(p_{a_i}, p_{b_i}) - \sum_{i=1}^{n/2} d(p'_{a_i}, p'_{b_i}) \right| \\
&\le \sum_{i=1}^{n/2} \left| d(p_{a_i}, p_{b_i}) - d(p'_{a_i}, p'_{b_i}) \right| \\
&= \sum_{i=1}^{n/2} \max \{ d(p_{a_i}, p_{b_i}) - d(p'_{a_i}, p'_{b_i}),\ d(p'_{a_i}, p'_{b_i}) - d(p_{a_i}, p_{b_i}) \} \\
&\le \sum_{i=1}^{n/2} \max \{ d(p_{a_i}, p'_{a_i}) + d(p'_{a_i}, p'_{b_i}) + d(p'_{b_i}, p_{b_i}) - d(p'_{a_i}, p'_{b_i}), \\
&\qquad\qquad\quad\ d(p'_{a_i}, p_{a_i}) + d(p_{a_i}, p_{b_i}) + d(p_{b_i}, p'_{b_i}) - d(p_{a_i}, p_{b_i}) \} \\
&= \sum_{i=1}^{n/2} \left( d(p_{a_i}, p'_{a_i}) + d(p'_{b_i}, p_{b_i}) \right) \\
&= \sum_{i=1}^{n} d(p_i, p'_i) = D
\end{align*}
where the last equality comes from the fact that $\{1, \dots, n\}$ is partitioned into pairs $(a_i, b_i)$.
To show that it is $\frac{1}{2}$-mean preserving we look at a random matching constructed the following way: We connect the first index $i = 1$ to one other index $j \in \{1, \dots, n\} \setminus \{1\}$ uniformly chosen from all other indices. Then we delete both indices from $\{1, \dots, n\}$ and go on with this construction until all indices are matched. Notice that each pair $(i, j)$ with $i \ne j$ belongs to our matching with probability $\frac{1}{n-1}$. When $M$ denotes the aggregated cost of this matching, then
\[ \mathrm{Opt} \ge E[M] = \frac{1}{2(n-1)} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \frac{1}{2n} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \frac{1}{2} \sum_{p \in P} d(p, \mu) \]
where the last inequality follows from equation (5.1). □
Lemma 5.1.9 The Euclidean Maximum Travelling Salesman problem is $\ell$-Lipschitz with $\ell = 2$ and $\lambda$-mean preserving with $\lambda = 1$.
Proof: We first show that Euclidean Maximum Travelling Salesman is 2-Lipschitz.
\begin{align*}
&\left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \\
&= \left| d(p_{s(n)}, p_{s(1)}) + \sum_{i=1}^{n-1} d(p_{s(i)}, p_{s(i+1)}) - d(p'_{s(n)}, p'_{s(1)}) - \sum_{i=1}^{n-1} d(p'_{s(i)}, p'_{s(i+1)}) \right| \\
&\le \left| d(p_{s(n)}, p_{s(1)}) - d(p'_{s(n)}, p'_{s(1)}) \right| + \sum_{i=1}^{n-1} \left| d(p_{s(i)}, p_{s(i+1)}) - d(p'_{s(i)}, p'_{s(i+1)}) \right| \\
&= \max \{ d(p_{s(n)}, p_{s(1)}) - d(p'_{s(n)}, p'_{s(1)}),\ d(p'_{s(n)}, p'_{s(1)}) - d(p_{s(n)}, p_{s(1)}) \} \\
&\quad + \sum_{i=1}^{n-1} \max \{ d(p_{s(i)}, p_{s(i+1)}) - d(p'_{s(i)}, p'_{s(i+1)}),\ d(p'_{s(i)}, p'_{s(i+1)}) - d(p_{s(i)}, p_{s(i+1)}) \} \\
&\le \max \{ d(p_{s(n)}, p'_{s(n)}) + d(p'_{s(n)}, p'_{s(1)}) + d(p'_{s(1)}, p_{s(1)}) - d(p'_{s(n)}, p'_{s(1)}), \\
&\qquad\quad\ \ d(p'_{s(n)}, p_{s(n)}) + d(p_{s(n)}, p_{s(1)}) + d(p_{s(1)}, p'_{s(1)}) - d(p_{s(n)}, p_{s(1)}) \} \\
&\quad + \sum_{i=1}^{n-1} \max \{ d(p_{s(i)}, p'_{s(i)}) + d(p'_{s(i)}, p'_{s(i+1)}) + d(p'_{s(i+1)}, p_{s(i+1)}) - d(p'_{s(i)}, p'_{s(i+1)}), \\
&\qquad\qquad\qquad\ d(p'_{s(i)}, p_{s(i)}) + d(p_{s(i)}, p_{s(i+1)}) + d(p_{s(i+1)}, p'_{s(i+1)}) - d(p_{s(i)}, p_{s(i+1)}) \} \\
&= d(p_{s(n)}, p'_{s(n)}) + d(p'_{s(1)}, p_{s(1)}) + \sum_{i=1}^{n-1} \left( d(p_{s(i)}, p'_{s(i)}) + d(p'_{s(i+1)}, p_{s(i+1)}) \right) \\
&= 2 \sum_{i=1}^{n} d(p_{s(i)}, p'_{s(i)}) = 2 \sum_{i=1}^{n} d(p_i, p'_i) = 2 \cdot D.
\end{align*}
To show that it is 1-mean preserving we look at a random permutation $s$ chosen uniformly from the set of permutations: Notice that each possible edge $(i, j)$ with $i \ne j$ appears in the sum with probability $\frac{1}{n-1} + \frac{1}{n-2} \ge \frac{2}{n-1}$. When $M$ denotes the aggregated cost of this solution, then
\[ \mathrm{Opt} \ge E[M] \ge \frac{1}{n-1} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \frac{1}{n} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \sum_{p \in P} d(p, \mu) \]
where the last inequality follows from equation (5.1). □
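Lemma 5.1.10 The Euclidean Average Distance problem is $\ell$-Lipschitz with $\ell = 4$ and $\lambda$-mean preserving with $\lambda = 1$.
Proof: We first show that Euclidean Average Distance is 4-Lipschitz.
\begin{align*}
&\left| \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \mathrm{cost}_\Pi^{(n,s)}(p'_1, \dots, p'_n) \right| \\
&= \left| \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} d(p_i, p_j) - \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} d(p'_i, p'_j) \right| \\
&\le \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} \left| d(p_i, p_j) - d(p'_i, p'_j) \right| \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} \max \{ d(p_i, p_j) - d(p'_i, p'_j),\ d(p'_i, p'_j) - d(p_i, p_j) \} \\
&\le \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} \max \{ d(p_i, p'_i) + d(p'_i, p'_j) + d(p'_j, p_j) - d(p'_i, p'_j), \\
&\qquad\qquad\qquad\qquad\qquad\qquad\ \ d(p'_i, p_i) + d(p_i, p_j) + d(p_j, p'_j) - d(p_i, p_j) \} \\
&= \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} \left( d(p_i, p'_i) + d(p_j, p'_j) \right) \\
&\le \frac{2}{n} \sum_{i,j=1}^{n} 2 \cdot d(p_i, p'_i) = 4 \cdot \sum_{i=1}^{n} d(p_i, p'_i) = 4 \cdot D
\end{align*}
Euclidean Average Distance is 1-mean preserving:
\[ \frac{1}{n-1} \sum_{i=1}^{n} \sum_{j \in \{1, \dots, n\} \setminus \{i\}} d(p_i, p_j) = \frac{1}{n-1} \sum_{i,j=1}^{n} d(p_i, p_j) \ge \frac{n}{n-1} \sum_{i=1}^{n} d(p_i, \mu) \ge \sum_{p \in P} d(p, \mu) \]
□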
5.1.2 Coresets
Intuitively, a coreset is a small weighted set of points that approximates a large (typically unweighted) point set with respect to an optimization problem. We first give a definition of coresets for $k$-median and $k$-means clustering [61].
Definition 5.1.11 (Coresets I.) [61] Let $P$ be a weighted set of $n$ points in $\mathbb{R}^d$. A weighted point set $P_{core}$ in $\mathbb{R}^d$ is an $\epsilon$-coreset for the $k$-median problem, if for every set $C$ of $k$ centers $(1-\epsilon) \cdot \mathrm{Median}(P, C) \le \mathrm{Median}(P_{core}, C) \le (1+\epsilon) \cdot \mathrm{Median}(P, C)$. In a similar way a weighted point set $P_{core}$ is an $\epsilon$-coreset for the $k$-means clustering problem, if for every set $C$ of $k$ centers $(1-\epsilon) \cdot \mathrm{Means}(P, C) \le \mathrm{Means}(P_{core}, C) \le (1+\epsilon) \cdot \mathrm{Means}(P, C)$.
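The coreset condition for one fixed set of centers can be evaluated directly. The sketch below rests on an assumption that is not restated in this section: $\mathrm{Median}(P, C)$ is taken to be the weighted sum of the distances from each point to its nearest center (the definition itself is in Section 2.3). The function names are hypothetical, and the check covers only the given centers, whereas the definition quantifies over every set of $k$ centers.

```python
import math


def median_cost(weighted_points, centers):
    """Assumed k-median cost: sum over (p, w) of w * distance from p to its nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers) for p, w in weighted_points)


def satisfies_coreset_bound(P, P_core, centers, eps):
    """Check (1-eps)*Median(P,C) <= Median(P_core,C) <= (1+eps)*Median(P,C) for this C only."""
    full = median_cost(P, centers)
    core = median_cost(P_core, centers)
    return (1 - eps) * full <= core <= (1 + eps) * full
```

For example, replacing the two unit-weight points $0$ and $0.1$ on the line by one point $0.05$ of weight $2$ leaves the cost for a center at $0.5$ essentially unchanged, but changes it by roughly ten percent for a center placed at $0.05$ itself.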
Now we generalize the definition of coresets to arbitrary oblivious optimization problems. Our definition will be only for unweighted point sets. However, by replacing weighted points by multiple copies we can easily generalize this definition to weighted point sets.
Definition 5.1.12 (Coresets II.) Let $\Pi$ be an oblivious optimization problem. Let $P$ be a set of $n$ points in $\mathbb{R}^d$. A weighted set of points $P_{core}$ in $\mathbb{R}^d$ with weight function $w : P_{core} \to \mathbb{N}$ is an $\epsilon$-coreset for $\Pi$, if there exists a mapping $\gamma : P \to P_{core}$ that satisfies the following constraints.
For every $q \in P_{core}$ we have $|\gamma^{-1}(q)| = w(q)$.
For every solution $s \in S_\Pi(n)$ we have
\[ \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) - \epsilon \cdot \mathrm{Opt}_\Pi(P) \le \mathrm{cost}_\Pi^{(n,s)}(\gamma(p_1), \dots, \gamma(p_n)) \le \mathrm{cost}_\Pi^{(n,s)}(p_1, \dots, p_n) + \epsilon \cdot \mathrm{Opt}_\Pi(P). \]
For an ordered point set $P := (p_1, \dots, p_n)$ we define $\gamma(P) = (\gamma(p_1), \dots, \gamma(p_n))$. Since the oblivious optimization problem does not depend on the order of the points, we can define
\[ \mathrm{Opt}_\Pi(P_{core}, w) := \mathrm{Opt}_\Pi(\gamma(P)). \]
From Definition 5.1.12 the definition of coresets for the problems Euclidean MaxCut, Euclidean MaxTSP, Euclidean MaxMatching, and average distance follows.
Lemma 5.1.13 Let $\Pi$ be an oblivious optimization problem, $P$ be a set of $n$ points in $\mathbb{R}^d$ and let $P_{core} \subset \mathbb{R}^d$ with weight function $w : P_{core} \to \mathbb{N}$ be an $\epsilon$-coreset for $P$ and $\Pi$. Then
\[ \mathrm{Opt}_\Pi(P_{core}, w) \in \mathrm{Opt}_\Pi(P) \cdot (1 \pm \epsilon) \]
and
\[ \mathrm{Opt}_\Pi(P) \in \mathrm{Opt}_\Pi(P_{core}, w) \cdot (1 \pm 2\epsilon). \]
□
5.2 Coresets for k-Median
We now give a description of our technique to construct coresets of small size. We always assume that all points lie in a bounding cube of side length 1, for example $[0, 1]^d$. This can be achieved by scaling the points appropriately. Additionally we assume that the optimal objective function value Opt of the respective problem is at least $1/\tilde{\Delta}$. We only need a weak bound $\tilde{\Delta}$, since all space and time bounds of our algorithms will depend logarithmically on $\tilde{\Delta}$.
In this section we describe our coreset construction for the $k$-median problem. The next chapters will adapt the techniques to construct coresets for $k$-means and oblivious optimization problems.
5.2.1 Construction of the Coreset
We impose $Z$ nested square grids $G_0, \dots, G_{Z-1}$ over the point space for some parameter $Z = \left\lceil \log \left( \frac{4 \cdot k \cdot 10^d \cdot n (1+\log n) \cdot d^{(d+1)/2} \cdot \tilde{\Delta}}{\epsilon^{d+1}} \right) \right\rceil + 1$. The side length of the grid cells in grid $G_i$ is $\frac{1}{2^i}$. Our goal will be to identify for each grid $G_i$ its heavy cells, i.e. cells that contain more than a certain threshold of points. This threshold depends on the side length of the grid cells and grows with its inverse (the smaller the cells, the larger the threshold). We will parametrize the threshold by some small value $\delta < n$, which is specified later.
Definition 5.2.1 (Heavy Cells) We call a cell of grid $G_i$ heavy, if it contains at least $\delta \cdot 2^i$ points of $P$. A grid cell that is not heavy is light.
Our process consists of two phases. In phase I we determine the coreset points. In phase II we determine their weight. We begin with a description of phase I. The algorithm starts with the coarsest grid $G_0$. First, it identifies every heavy cell in $G_0$. Note that grid $G_0$ consists of only one cell containing all points. Since $\delta < n$ this cell is a heavy cell. Then the algorithm subdivides every heavy cell $C$ into $2^d$ equal-sized square subcells. These subcells are contained in grid $G_1$. We call $C$ the parent cell of these subcells. If none of these subcells is heavy we place a coreset point in the center of the cell. Otherwise, the algorithm recurses this process with all heavy subcells. The recursion eventually stops because at some point a heavy cell is required to have more than $n$ points inside.
It remains to determine the weight of each coreset point. This is done in phase II of the algorithm. We can think of phase II as 'moving the points' to their corresponding coreset points. The weight of a coreset point is simply the number of points moved to its position. The movement must satisfy the following invariant:
(a) every point stays in the smallest heavy cell it is contained in.
By our construction, every heavy cell must contain a coreset point. Thus it is easy to satisfy our invariant. We can simply move every point $p$ of $P$ to an arbitrary coreset point that is contained in the smallest heavy cell containing $p$. Finally, the weight of each coreset point is determined by the number of points moved to it. We will prove that for suitably chosen $\delta$ the resulting weighted set of points $P_{core}$ is an $\epsilon$-coreset for the $k$-median problem of size $O(k \cdot \log n / \epsilon^d)$.
Below we describe our algorithm in pseudocode. It computes the coreset $P_{core}$ together with weights $w(p)$ for each point $p \in P_{core}$.
COMPUTECORESET($P$, Opt)
  Let $H_1, \dots, H_h$ denote the heavy cells in grid $G_0$
  return $\bigcup_{i=1}^{h}$ COMPUTECORESETPOINTS($H_i$)

COMPUTECORESETPOINTS(cell $C$)
  if $C$ has no heavy subcells then
    $p \leftarrow$ center of $C$
    $w(p) \leftarrow$ number of points in $C$
    return $\{p\}$
  else
    Let $H_1, \dots, H_h$ denote the heavy subcells of $C$
    Let $L_1, \dots, L_\ell$ denote the light subcells of $C$
    Let $m$ denote the number of points in $\bigcup_{i=1}^{\ell} L_i$
    $P_{core} = \bigcup_{i=1}^{h}$ COMPUTECORESETPOINTS($H_i$)
    Let $q$ be an arbitrary point in $P_{core}$
    $w(q) \leftarrow w(q) + m$
    return $P_{core}$
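The two procedures above can be sketched for a static point set as follows. This is a minimal illustration, not the streaming implementation: all points are held in memory, they are assumed to lie in $[0, 1]^d$, the threshold $\delta$ is passed in directly rather than derived from Opt, and the function name `compute_coreset` is hypothetical. The "arbitrary" coreset point that receives the points of the light subcells is simply the first one returned by the recursion.

```python
from collections import defaultdict


def compute_coreset(points, delta):
    """Return {coreset point: weight} following the heavy-cell recursion."""
    n, d = len(points), len(points[0])
    if n < delta:
        raise ValueError("the root cell must be heavy (the text assumes delta < n)")
    weights = {}

    def recurse(cell_points, corner, level):
        half = 0.5 ** (level + 1)  # side length of the subcells, which lie in G_{level+1}
        subcells = defaultdict(list)
        for p in cell_points:
            # index (0 or 1) of the subcell in every dimension, clamped at the cell border
            key = tuple(min(max(int((p[k] - corner[k]) // half), 0), 1) for k in range(d))
            subcells[key].append(p)
        heavy = {key: pts for key, pts in subcells.items()
                 if len(pts) >= delta * 2 ** (level + 1)}
        if not heavy:
            center = tuple(c + half for c in corner)
            weights[center] = len(cell_points)
            return [center]
        moved = len(cell_points) - sum(len(pts) for pts in heavy.values())
        core = []
        for key, pts in heavy.items():
            sub_corner = tuple(corner[k] + key[k] * half for k in range(d))
            core.extend(recurse(pts, sub_corner, level + 1))
        weights[core[0]] += moved  # points of light subcells go to some coreset point in C
        return core

    recurse(list(points), (0.0,) * d, 0)
    return weights
```

Since every input point is counted exactly once, either in a leaf cell or among the light-subcell points of some heavy cell, the weights always sum to $n$.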
We will now prove that the point set computed by COMPUTECORESET is indeed an $\epsilon$-coreset
for k-median.
We denote by $L(i)$ the set of non-empty light cells of grid $G_i$ whose parent cell is heavy. Notice
that $\bigcup_i L(i)$ partitions the plane.
Claim 5.2.2 Any point $p \in L(i)$ is moved a distance of at most $\frac{\sqrt{d}}{2^{i-1}}$ during our coreset construction.
Proof: Our invariant assures that every point $p$ stays within the smallest heavy cell it is
contained in. Every point $p$ that is contained in $L(i)$ is contained in a heavy cell in grid $G_{i-1}$.
Therefore, it is moved at most the diagonal length of cells in $G_{i-1}$, i.e. $\frac{\sqrt{d}}{2^{i-1}}$. □
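From now on let $\mathcal{C}$ be an arbitrary fixed set of $k$ centers. We partition the sets $L(i)$ into two
subsets $L_{near}(i)$ and $L_{dist}(i)$. $L_{near}(i)$ contains all cells $C$ whose distance $\min_{q \in \mathcal{C}} d(q, C)$ to the
nearest center from $\mathcal{C}$ is at most $\frac{4}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}}$, i.e.
\[ L_{near}(i) = \left\{ C \in L(i) \;\middle|\; \min_{q \in \mathcal{C}} d(q, C) \le \frac{4}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}} \right\}. \]
$L_{dist}(i)$ contains all other cells from $L(i)$, i.e.
\[ L_{dist}(i) = \left\{ C \in L(i) \;\middle|\; \min_{q \in \mathcal{C}} d(q, C) > \frac{4}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}} \right\}. \]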
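Claim 5.2.3 The total movement $\sum_i \sum_{p \in L_{dist}(i)} \frac{\sqrt{d}}{2^{i-1}}$ of points in distant cells satisfies
\[ \sum_i \sum_{p \in L_{dist}(i)} \frac{\sqrt{d}}{2^{i-1}} \le \frac{\epsilon}{2} \cdot \mathrm{Median}(P, \mathcal{C}). \]
Proof: We use a charging argument from [61].
\[ \sum_i \sum_{p \in L_{dist}(i)} \frac{\sqrt{d}}{2^{i-1}} = \frac{\epsilon}{2} \sum_i \sum_{p \in L_{dist}(i)} \frac{4}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}} \le \frac{\epsilon}{2} \cdot \mathrm{Median}(P, \mathcal{C}), \]
where the last inequality follows from the fact that every point in $L_{dist}(i)$ contributes more than
$\frac{4}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}}$ to the cost of the solution $\mathcal{C}$. □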
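Claim 5.2.4 For $\delta \le \frac{\epsilon^{d+1} \cdot Opt}{4 \cdot k \cdot 10^{d} \cdot (1+\log n) \cdot d^{(d+1)/2}}$ we get
\[ \sum_i \sum_{p \in L_{near}(i)} \frac{\sqrt{d}}{2^{i-1}} \le \frac{\epsilon}{2} \cdot \mathrm{Median}(P, \mathcal{C}). \]
Proof: We observe that the furthest point in a cell in $L_{near}(i)$ can have a distance of at most
$\frac{\sqrt{d}}{2^{i}} + \frac{4}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}} = \left(1 + \frac{4}{\epsilon}\right) \cdot \frac{\sqrt{d}}{2^{i}}$ to the nearest center. Hence, every cell in $L_{near}(i)$ is contained in
a cube of sidelength $2 \cdot \left(1 + \frac{4}{\epsilon}\right) \cdot \frac{\sqrt{d}}{2^{i}}$ that is centered at one of the $k$ centers of $\mathcal{C}$. Each of these
cubes has volume $\left(2 \cdot \left(1 + \frac{4}{\epsilon}\right) \cdot \frac{\sqrt{d}}{2^{i}}\right)^{d} \le \left(\frac{10}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}}\right)^{d}$. Every cell in grid $G_i$ has volume $\left(\frac{1}{2^{i}}\right)^{d}$.
Hence, there can be at most $k \cdot \left(\frac{10}{\epsilon}\right)^{d} \cdot d^{d/2}$ cells in $L_{near}(i)$.
Each of the considered cells is light and so it contains at most $\delta \cdot 2^{i}$ points. Hence for our
choice of $\delta$:
\begin{align*}
\sum_i \sum_{p \in L_{near}(i)} \frac{\sqrt{d}}{2^{i-1}}
&= \sum_{i : L_{near}(i) \ne \emptyset} \; \sum_{p \in L_{near}(i)} \frac{\sqrt{d}}{2^{i-1}} \\
&\le \sum_{i : L_{near}(i) \ne \emptyset} \left( k \cdot \left(\frac{10}{\epsilon}\right)^{d} \cdot d^{d/2} \right) \cdot \delta \cdot 2^{i} \cdot \left( \frac{\sqrt{d}}{2^{i-1}} \right) \\
&\le \sum_{i : L_{near}(i) \ne \emptyset} \delta \cdot \left( 2 \cdot k \cdot \left(\frac{10}{\epsilon}\right)^{d} \cdot d^{d/2} \right) \cdot \sqrt{d} \\
&\le \frac{\epsilon}{2} \cdot \frac{1}{1+\log n} \sum_{i : L_{near}(i) \ne \emptyset} Opt \\
&\le \frac{\epsilon}{2} \cdot \frac{\mathrm{Median}(P, \mathcal{C})}{1+\log n} \cdot \left| \{ i : L_{near}(i) \ne \emptyset \} \right|
\end{align*}
Now we observe that $L_{near}(i) \ne \emptyset$ implies that there are non-empty light cells in grid $G_i$ and
heavy cells in grid $G_{i-1}$. We can have non-empty light cells only if $\delta \cdot 2^{i} > 1$, since otherwise
all non-empty cells are heavy. We can have heavy cells in grid $G_{i-1}$ only if $\delta \cdot 2^{i-1} \le n$,
since otherwise all cells are light. Hence we sum up only over those values of $i$ that satisfy
$1/2 < \delta \cdot 2^{i-1} \le n$. Clearly, these are at most $1 + \log n$ distinct values and so we get
\[ \frac{\epsilon}{2} \cdot \frac{\mathrm{Median}(P, \mathcal{C})}{1+\log n} \cdot \left| \{ i : L_{near}(i) \ne \emptyset \} \right| \le \frac{\epsilon}{2} \cdot \mathrm{Median}(P, \mathcal{C}), \]
which concludes the proof of Claim 5.2.4. □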
Lemma 5.2.5 The set $P_{core}$ is an $\epsilon$-coreset for $\delta \le \frac{\epsilon^{d+1} \cdot Opt}{4 \cdot k \cdot 10^{d} \cdot (1+\log n) \cdot d^{(d+1)/2}}$.
Proof: We first observe that every point of $P$ is contained in some cell in $\bigcup_i L(i)$. By Claim
5.2.2 we know that every point that is contained in a grid cell in $L(i)$ is moved a distance of at
most $\frac{\sqrt{d}}{2^{i-1}}$. Therefore, Claims 5.2.3 and 5.2.4 imply that the points are moved an overall distance
of at most $\epsilon \cdot \mathrm{Median}(P, \mathcal{C})$. Finally, we observe that the cost of any set of $k$ centers changes by at
most $\pm D$ when the points of the point set $P$ are moved by an overall distance of $D$. Hence the
set $P_{core}$ constructed by our algorithm is an $\epsilon$-coreset for k-median. □
5.2.2 Size of the Coreset
Our next step is to give an upper bound on the size of the coreset. For every grid $G_i$ we define
$M(i)$ to be the set of heavy cells that do not have a heavy subcell. Notice that the cells $M(i)$
are exactly those cells that contain a coreset point. Hence, it will be sufficient to determine the
cardinality of this set.
Definition 5.2.6 (Center cells) Let $C_{opt}$ denote an optimal set of $k$ centers for the k-median
problem. A cell $C$ in grid $G_i$ is called a center cell, if its distance to the nearest center in $C_{opt}$ is
less than $\frac{\sqrt{d}}{2^{i+1}}$, i.e. if $\min_{q \in C_{opt}} d(q, C) < \frac{\sqrt{d}}{2^{i+1}}$.
We define $M_{center}(i) = \{ C \in M(i) \mid C \text{ is center cell} \}$ to be the subset of $M(i)$ that are center
cells. We use $M_{external}(i) = M(i) \setminus M_{center}(i)$ to denote the remaining cells of $M(i)$. We call
these cells external cells.
Claim 5.2.7 Every external cell contributes at least $\delta \cdot \sqrt{d}/2$ to the cost $Opt$ of $C_{opt}$.
Proof: Every external cell $C \in M(i)$ is a heavy cell and so it contains at least $\delta \cdot 2^{i}$ points. Each
point contributes at least $\frac{\sqrt{d}}{2^{i+1}}$ to the cost of the optimal solution. Hence the overall contribution
of the points in $C$ is at least $\delta \cdot \sqrt{d}/2$. □
Lemma 5.2.8 If $\delta \ge \frac{\epsilon^{d+1} \cdot Opt}{8 \cdot k \cdot 10^{d} \cdot (1+\log n) \cdot d^{(d+1)/2}}$, the size of the coreset is at most
$\frac{17 \cdot k \cdot 10^{d} \cdot (2+\log n) \cdot d^{d/2}}{\epsilon^{d+1}} = O(k \cdot \log n / \epsilon^{d+1})$.
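Proof: From Claim 5.2.7 it follows that
\[ \left| \bigcup_i M_{external}(i) \right| \le \frac{2 \cdot Opt}{\delta \cdot \sqrt{d}}. \]
All cells from $M_{center}(i)$ are contained in $k$ cubes of sidelength $2 \left( \frac{\sqrt{d}}{2^{i}} + \frac{\sqrt{d}}{2^{i+1}} \right) = 3 \cdot \frac{\sqrt{d}}{2^{i}}$. Since
each cube volume is $\left( 3 \cdot \frac{\sqrt{d}}{2^{i}} \right)^{d}$ and each cell volume is $\left( \frac{\sqrt{d}}{2^{i}} \right)^{d} / d^{d/2}$, we know that
\[ \left| M_{center}(i) \right| \le k \cdot 3^{d} \cdot d^{d/2}. \]
Using similar arguments as in the proof of Claim 5.2.4 we obtain that $M(i) \ne \emptyset$ for at most
$1 + \log n$ distinct values of $i$. Therefore, we get
\[ \left| \bigcup_i M_{center}(i) \right| \le (\log n + 1) \cdot k \cdot 3^{d} \cdot d^{d/2}. \]
Since $\bigcup_i M(i) = \bigcup_i \left( M_{external}(i) \cup M_{center}(i) \right)$, the size of the coreset is at most
$\frac{2 \cdot Opt}{\delta \cdot \sqrt{d}} + (\log n + 2) \cdot k \cdot 3^{d} \cdot d^{d/2}$.
For $\delta \ge \frac{\epsilon^{d+1} \cdot Opt}{8 \cdot k \cdot 10^{d} \cdot (1+\log n) \cdot d^{(d+1)/2}}$, the size of the coreset is at most $\frac{17 \cdot k \cdot 10^{d} \cdot (2+\log n) \cdot d^{d/2}}{\epsilon^{d+1}}$. □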
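The constant 17 in the last step can be checked directly. The following worked derivation is a sketch that assumes $\epsilon \le 1$ (so that $3^{d} \le 10^{d}/\epsilon^{d+1}$); this assumption is not spelled out in the proof itself.

```latex
\begin{align*}
\frac{2 \cdot Opt}{\delta \cdot \sqrt{d}}
  &\le \frac{2 \cdot 8 \cdot k \cdot 10^{d} \cdot (1+\log n) \cdot d^{(d+1)/2}}{\epsilon^{d+1} \cdot \sqrt{d}}
   = \frac{16 \cdot k \cdot 10^{d} \cdot (1+\log n) \cdot d^{d/2}}{\epsilon^{d+1}}, \\
(\log n + 2) \cdot k \cdot 3^{d} \cdot d^{d/2}
  &\le \frac{k \cdot 10^{d} \cdot (2+\log n) \cdot d^{d/2}}{\epsilon^{d+1}}, \\
\text{hence}\quad
\frac{2 \cdot Opt}{\delta \cdot \sqrt{d}} + (\log n + 2) \cdot k \cdot 3^{d} \cdot d^{d/2}
  &\le \frac{17 \cdot k \cdot 10^{d} \cdot (2+\log n) \cdot d^{d/2}}{\epsilon^{d+1}}.
\end{align*}
```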
5.2.3 Finding a suitable value of δ
The coreset construction so far depended on a value of $\delta$, which itself depended on the
unknown value of $Opt$. We show how to find a suitable value of $\delta$.
A good value for $\delta$ can be found using the statements of Lemma 5.2.5 and 5.2.8. Let
\[ \delta_0 := \frac{\epsilon^{d+1} \cdot Opt}{4k \cdot 10^{d} (1+\log n) d^{(d+1)/2}}. \]
All coresets constructed with a value $\delta \le \delta_0$ are $\epsilon$-coresets according to Lemma 5.2.5.
All coresets constructed with a value $\delta \ge \delta_0/2$ have a size of at most $S := \frac{17 \cdot k \cdot 10^{d} \cdot (2+\log n) \cdot d^{d/2}}{\epsilon^{d+1}}$
according to Lemma 5.2.8.
For each value of $j \in \{0, 1, \dots, \lfloor \log(n \cdot \tilde{\Delta} \cdot \sqrt{d}) \rfloor\}$ let
\[ \delta(j) = \frac{\epsilon^{d+1}}{4k \cdot 10^{d} (1+\log n) d^{(d+1)/2} \cdot \tilde{\Delta}} \cdot 2^{j}. \]
Note that the coreset constructed for the highest value of $\delta(j)$ is of size at most $S$ because
this value of $\delta$ is greater than $\delta_0 \cdot n \cdot \sqrt{d} / (2 \cdot Opt)$, which is at least $\delta_0/2$, since all points lie
in a unit cube.
We want to identify a value $j_0 \in \{0, 1, \dots, \lfloor \log(n \cdot \tilde{\Delta} \cdot \sqrt{d}) \rfloor\}$, such that
• the coreset constructed with the value $\delta = \delta(j_0)$ is of size at most $S$ and
• the coreset constructed with the value $\delta = \delta(j_0 - 1)$ is of size greater than $S$.
If the coreset constructed with the value $\delta = \delta(0)$ is of size at most $S$, we set $j_0 = 0$. The value
$j_0$ can easily be obtained by performing a binary search on the values of $j$.
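The binary search for $j_0$ can be sketched as follows. This is an illustrative sketch: `find_j0` and the predicate `size_at_most_s` are hypothetical names, where `size_at_most_s(j)` is assumed to report whether the coreset built with $\delta = \delta(j)$ has size at most $S$, and it is assumed to return `True` for the largest index `j_max` (as argued above).

```python
def find_j0(size_at_most_s, j_max):
    """Return j0 with size_at_most_s(j0) true and, if j0 > 0, size_at_most_s(j0 - 1) false."""
    if size_at_most_s(0):
        return 0
    # Invariant: the coreset for delta(lo) is too large, the one for delta(hi) is small enough.
    lo, hi = 0, j_max
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if size_at_most_s(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

The loop only maintains the invariant at the two interval ends, so it returns a valid boundary index after $O(\log j_{max})$ evaluations of the predicate even if the coreset size is not monotone in $j$.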
Lemma 5.2.9 The coreset constructed with $\delta = \delta(j_0)$ is an $\epsilon$-coreset for $P$. The size of the
coreset is at most $S$.
Proof: By the choice of $j_0$ the size of the computed coreset is at most $S$.
We show that the constructed coreset is an $\epsilon$-coreset. If $j_0 = 0$ then the coreset is an $\epsilon$-coreset
because $\delta(j_0) = \delta_0 / (Opt \cdot \tilde{\Delta}) \le \delta_0$.
If $j_0 \ne 0$ then the coreset constructed for the value $\delta = \delta(j_0 - 1) = \delta(j_0)/2$ is of size greater
than $S$. Therefore we must have $\delta(j_0)/2 < \delta_0/2$. It follows that $\delta(j_0) < \delta_0$ and the computed coreset
is an $\epsilon$-coreset. □
In each iteration of the binary search we have to decide if the size of the coreset is at most
$S$. This can be done by constructing the coreset and stopping the process if the number of
disjoint heavy cells exceeds $S$. The marking process leading to the actual coreset can be seen
as a quadtree traversal. Each inner node of this quadtree corresponds to a heavy cell. If on
one grid the number of heavy cells exceeds $S$, we stop the process. Since we have at most
$Z = O(\log(n \cdot \tilde{\Delta} / \epsilon))$ grids, the tree traversal can be done in time $O(S \cdot \log(\tilde{\Delta} \cdot n / \epsilon)) =
O(k \cdot \log n \cdot \log(\tilde{\Delta} \cdot n / \epsilon) / \epsilon^{d+1})$.
Since we have $O(\log\log(\tilde{\Delta} \cdot n))$ iterations of the binary search, the coreset can be constructed
in time $O(k \cdot \log n \cdot \log(\tilde{\Delta} \cdot n / \epsilon) \cdot \log\log(\tilde{\Delta} \cdot n) / \epsilon^{d+1})$.
Note that given a point set $P$ we can first build a quadtree of depth $Z$ of the points in time
$O(n \cdot Z) = O(n \cdot \log(n \cdot \tilde{\Delta} / \epsilon))$. Using this quadtree we can then answer queries on the number
of points in certain grid cells in constant time.
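A counting oracle of this kind can be sketched as below. Note the swapped-in technique: instead of an explicit quadtree, this illustrative sketch stores one hash table per grid level, which also takes $O(n \cdot Z)$ updates to build and answers queries in expected constant time. The name `build_count_oracle` and the index-tuple representation of cells are assumptions made for the example.

```python
from collections import Counter


def build_count_oracle(points, depth):
    """Count the points of [0, 1]^d per grid cell for the levels 0..depth."""
    counts = [Counter() for _ in range(depth + 1)]
    for p in points:
        for level in range(depth + 1):
            scale = 2 ** level
            # Cell index on this level; clamp so that 1.0 falls into the last cell.
            cell = tuple(min(int(x * scale), scale - 1) for x in p)
            counts[level][cell] += 1

    def query(level, cell):
        """Number of points in the given cell of grid G_level (0 for empty cells)."""
        return counts[level][cell]

    return query
```

A query such as `oracle(2, (0, 0))` then returns the number of input points in the cell $[0, 1/4) \times [0, 1/4)$ of grid $G_2$.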
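Theorem 11 Assume that we are given a point set $P \subseteq [0, 1]^{d}$ of size $n \in \mathbb{N}$ and the guarantee
that the value $Opt$ of an optimal k-median solution is at least $1/\tilde{\Delta}$. For each value of $i \in
\{0, 1, \dots, Z\}$ with $Z = \left\lceil \log \frac{4 \cdot k \cdot 10^{d} \cdot n (1+\log n) \cdot d^{(d+1)/2} \cdot \tilde{\Delta}}{\epsilon^{d+1}} \right\rceil + 1$ we have a square grid $G_i$ over the
point space, all cells of side length $\frac{1}{2^{i}}$.
Given an oracle, which answers queries on the number of points in grid cells in constant time,
we can compute in time $O(k \cdot \log n \cdot \log(\tilde{\Delta} \cdot n / \epsilon) \cdot \log\log(\tilde{\Delta} \cdot n) / \epsilon^{d+1})$ an $\epsilon$-coreset of size
$O(k \cdot \log n / \epsilon^{d+1})$ for k-median.
A respective oracle can be constructed in time $O(n \cdot \log(n \cdot \tilde{\Delta} / \epsilon))$.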
5.3 Coresets for k-Means
We show how to modify our coreset construction for the k-median problem in a way such that it
works for the k-means problem.
5.3.1 Construction of the Coreset
We again use $Z$ nested square grids $G_0, \dots, G_Z$ for $Z = \left\lceil \log \frac{8 \cdot k \cdot 33^{d+1} \cdot n (1+\log n) \cdot d^{(d+2)/2} \cdot \tilde{\Delta}}{\epsilon^{d+2}} \right\rceil + 1$.
The side length of the cells in grid $G_i$ is $\frac{1}{2^{i}}$. We use the following modified definition for heavy
cells.
Definition 5.3.1 (Heavy Cells) We call a grid cell of grid $G_i$ heavy for the k-means clustering
problem, if it contains at least $\delta \cdot 4^{i}$ points.
We use the same invariant as in the construction of the coreset for the k-median problem, i.e.
every point stays in the smallest heavy cell it is contained in. As in the construction
for k-median, we denote by $L(i)$ the set of non-empty light cells in grid $G_i$ whose parent cell is
heavy. We get the following modified version of Claim 5.2.2 with an analogous proof.
Claim 5.3.2 Any point $p \in L(i)$ is moved a distance of at most $\frac{\sqrt{d}}{2^{i-1}}$ during our coreset construction. □
From now on let $\mathcal{C}$ be an arbitrary fixed set of $k$ centers. We partition the sets $L(i)$ into two
subsets $L_{near}(i)$ and $L_{dist}(i)$. $L_{near}(i)$ contains all cells $C$ whose distance $\min_{q \in \mathcal{C}} d(q, C)$ to the
nearest center from $\mathcal{C}$ is at most $\frac{16}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}}$, i.e.
\[ L_{near}(i) = \left\{ C \in L(i) \;\middle|\; \min_{q \in \mathcal{C}} d(q, C) \le \frac{16}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}} \right\}. \]
$L_{dist}(i)$ contains all other cells from $L(i)$, i.e.
\[ L_{dist}(i) = \left\{ C \in L(i) \;\middle|\; \min_{q \in \mathcal{C}} d(q, C) > \frac{16}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}} \right\}. \]
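Claim 5.3.3 For any point $p \in P$ let $\hat{p}$ denote its corresponding coreset point. Then we have
\[ \left| \sum_i \sum_{p \in L_{dist}(i)} d(p, \mathcal{C})^{2} - \sum_i \sum_{p \in L_{dist}(i)} d(\hat{p}, \mathcal{C})^{2} \right| \le \frac{\epsilon}{2} \cdot \mathrm{Means}(P, \mathcal{C}). \]
Proof: We use a charging argument similar to one from [61]. Let $p$ be an arbitrary point in $L_{dist}(i)$
and $\hat{p}$ be the corresponding coreset point. By Claim 5.3.2 we know that $d(p, \hat{p}) \le \frac{\sqrt{d}}{2^{i-1}}$. We get
\[ d(\hat{p}, \mathcal{C})^{2} \le \left( d(p, \mathcal{C}) + \frac{\sqrt{d}}{2^{i-1}} \right)^{2} = d(p, \mathcal{C})^{2} + 2 d(p, \mathcal{C}) \cdot \frac{\sqrt{d}}{2^{i-1}} + \frac{d}{4^{i-1}} \]
and since $d(p, \mathcal{C}) > \frac{\sqrt{d}}{2^{i-1}}$:
\[ d(\hat{p}, \mathcal{C})^{2} \ge \left( d(p, \mathcal{C}) - \frac{\sqrt{d}}{2^{i-1}} \right)^{2} \ge d(p, \mathcal{C})^{2} - 2 d(p, \mathcal{C}) \cdot \frac{\sqrt{d}}{2^{i-1}} - \frac{d}{4^{i-1}}. \]
We also know that $d(p, \mathcal{C}) \ge \frac{16}{\epsilon} \cdot \frac{\sqrt{d}}{2^{i}}$ and so
\[ \left| d(p, \mathcal{C})^{2} - d(\hat{p}, \mathcal{C})^{2} \right| \le 2 d(p, \mathcal{C}) \cdot \frac{\sqrt{d}}{2^{i-1}} + \frac{d}{4^{i-1}} \le \frac{\epsilon}{4} d(p, \mathcal{C})^{2} + \frac{\epsilon^{2}}{8^{2}} d(p, \mathcal{C})^{2} \le \frac{\epsilon}{2} d(p, \mathcal{C})^{2}. \]
We get
\begin{align*}
\left| \sum_i \sum_{p \in L_{dist}(i)} d(p, \mathcal{C})^{2} - \sum_i \sum_{p \in L_{dist}(i)} d(\hat{p}, \mathcal{C})^{2} \right|
&\le \sum_i \sum_{p \in L_{dist}(i)} \left| d(p, \mathcal{C})^{2} - d(\hat{p}, \mathcal{C})^{2} \right| \\
&\le \sum_i \sum_{p \in L_{dist}(i)} \frac{\epsilon}{2} d(p, \mathcal{C})^{2} \\
&\le \frac{\epsilon}{2} \cdot \mathrm{Means}(P, \mathcal{C}).
\end{align*}
□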
Our next step is to give an upper bound on the change of the contribution of points in near
cells.
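Claim 5.3.4 For $\delta \le \frac{\epsilon^{d+2} \cdot Opt}{8 (1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{(d+2)/2}}$ we get
\[ \left| \sum_{i : L_{near}(i) \ne \emptyset} \; \sum_{p \in L_{near}(i)} \left( d(p, \mathcal{C})^{2} - d(\hat{p}, \mathcal{C})^{2} \right) \right| \le \frac{\epsilon}{2} \cdot \mathrm{Means}(P, \mathcal{C}). \]
Proof: We use a similar volume argument as in the proof of Claim 5.2.4. Any cell in $L_{near}(i)$
must be contained in one of $k$ cubes of volume
\[ \left[ 2 \cdot \frac{16 \sqrt{d}}{\epsilon \cdot 2^{i}} + \frac{\sqrt{d}}{2^{i}} \right]^{d} = \left[ \left( \frac{32}{\epsilon} + 1 \right) \frac{\sqrt{d}}{2^{i}} \right]^{d} \le \left( \frac{33}{\epsilon} \right)^{d} \cdot \left( \frac{\sqrt{d}}{2^{i}} \right)^{d}. \]
Each cell has a volume of $\left( \frac{\sqrt{d}}{\sqrt{d} \cdot 2^{i}} \right)^{d}$. Therefore, there can be at most $k \cdot \left( \frac{33}{\epsilon} \right)^{d} \cdot d^{d/2}$ such
cells.
Let $p$ be an arbitrary point in $L_{near}(i)$ and $\hat{p}$ be the corresponding coreset point. We will show
$d(\hat{p}, \mathcal{C})^{2} \le d(p, \mathcal{C})^{2} + \left( 3 + \frac{16}{\epsilon} \right) \cdot \frac{d}{4^{i-1}}$ and $d(\hat{p}, \mathcal{C})^{2} \ge d(p, \mathcal{C})^{2} - \left( 3 + \frac{16}{\epsilon} \right) \cdot \frac{d}{4^{i-1}}$:
We observe that $d(p, \mathcal{C}) \le \left( 1 + \frac{8}{\epsilon} \right) \cdot \frac{\sqrt{d}}{2^{i-1}}$. We get
\begin{align*}
d(\hat{p}, \mathcal{C})^{2} &\le \left( d(p, \mathcal{C}) + \frac{\sqrt{d}}{2^{i-1}} \right)^{2} \\
&\le d(p, \mathcal{C})^{2} + 2 d(p, \mathcal{C}) \cdot \frac{\sqrt{d}}{2^{i-1}} + \frac{d}{4^{i-1}} \\
&\le d(p, \mathcal{C})^{2} + 2 \left( 1 + \frac{8}{\epsilon} \right) \cdot \frac{\sqrt{d}}{2^{i-1}} \cdot \frac{\sqrt{d}}{2^{i-1}} + \frac{d}{4^{i-1}} \\
&\le d(p, \mathcal{C})^{2} + \left( 3 + \frac{16}{\epsilon} \right) \cdot \frac{d}{4^{i-1}}
\end{align*}
Case A: $d(p, \mathcal{C}) \ge \frac{\sqrt{d}}{2^{i-1}}$. Then:
\begin{align*}
d(\hat{p}, \mathcal{C})^{2} &\ge \left( d(p, \mathcal{C}) - \frac{\sqrt{d}}{2^{i-1}} \right)^{2} \\
&\ge d(p, \mathcal{C})^{2} - 2 d(p, \mathcal{C}) \cdot \frac{\sqrt{d}}{2^{i-1}} - \frac{d}{4^{i-1}} \\
&\ge d(p, \mathcal{C})^{2} - 2 \left( 1 + \frac{8}{\epsilon} \right) \cdot \frac{\sqrt{d}}{2^{i-1}} \cdot \frac{\sqrt{d}}{2^{i-1}} - \frac{d}{4^{i-1}} \\
&\ge d(p, \mathcal{C})^{2} - \left( 3 + \frac{16}{\epsilon} \right) \cdot \frac{d}{4^{i-1}}
\end{align*}
Case B: $d(p, \mathcal{C}) < \frac{\sqrt{d}}{2^{i-1}}$. Then
\[ d(\hat{p}, \mathcal{C})^{2} \ge 0 \ge d(p, \mathcal{C})^{2} - \frac{d}{4^{i-1}} \ge d(p, \mathcal{C})^{2} - \left( 3 + \frac{16}{\epsilon} \right) \cdot \frac{d}{4^{i-1}}. \]
Altogether we obtain $\left| d(\hat{p}, \mathcal{C})^{2} - d(p, \mathcal{C})^{2} \right| \le \left( 3 + \frac{16}{\epsilon} \right) \cdot \frac{d}{4^{i-1}}$.
Each of the considered cells is light and so it contains at most $\delta \cdot 4^{i}$ points. Hence for our
choice of $\delta$:
\begin{align*}
\left| \sum_{i : L_{near}(i) \ne \emptyset} \; \sum_{p \in L_{near}(i)} \left( d(p, \mathcal{C})^{2} - d(\hat{p}, \mathcal{C})^{2} \right) \right|
&\le \sum_{i : L_{near}(i) \ne \emptyset} \; \sum_{p \in L_{near}(i)} \left| d(p, \mathcal{C})^{2} - d(\hat{p}, \mathcal{C})^{2} \right| \\
&\le \sum_{i : L_{near}(i) \ne \emptyset} \underbrace{\delta \cdot 4^{i}}_{\text{number of points per cell}} \cdot \underbrace{k \cdot \left( \frac{33}{\epsilon} \right)^{d} \cdot d^{d/2}}_{\text{number of cells}} \cdot \left( 3 + \frac{16}{\epsilon} \right) \cdot \frac{d}{4^{i-1}} \\
&\le \sum_{i : L_{near}(i) \ne \emptyset} \delta \cdot 4 \cdot k \cdot \left( \frac{33}{\epsilon} \right)^{d+1} \cdot d^{d/2} \cdot d \\
&\le \frac{\epsilon}{2} \cdot \frac{1}{1+\log n} \cdot \sum_{i : L_{near}(i) \ne \emptyset} Opt \\
&\le \frac{\epsilon}{2} \cdot \mathrm{Means}(P, \mathcal{C}),
\end{align*}
where the last inequality follows from a similar argument as in the proof of Claim 5.2.4. □
It follows immediately that
Lemma 5.3.5 The set $P_{core}$ is an $\epsilon$-coreset for $\delta \le \frac{\epsilon^{d+2} \cdot Opt}{8 (1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{(d+2)/2}}$. □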
5.3.2 Size of the Coreset
We adapt the proof for the k-median problem to k-means. In the following we give the slightly
changed definitions we require. For every grid $G_i$ we define $M(i)$ to be the set of heavy cells
that do not have a heavy subcell.
Definition 5.3.6 (Center cells) Let $C_{opt}$ denote an optimal set of $k$ centers for the k-means problem.
A cell $C$ in grid $G_i$ is called a center cell, if its distance to the nearest center in $C_{opt}$ is less
than $\frac{\sqrt{d}}{2^{i+1}}$, i.e. if $\min_{q \in C_{opt}} d(q, C) < \frac{\sqrt{d}}{2^{i+1}}$.
We define $M_{center}(i) = \{ C \in M(i) \mid C \text{ is center cell} \}$ to be the subset of $M(i)$ that are center
cells. We use $M_{external}(i) = M(i) \setminus M_{center}(i)$ to denote the remaining cells of $M(i)$. We call
these cells external cells.
Claim 5.3.7 Every external cell contributes at least $\delta \cdot d/4$ to the cost $Opt$ of $C_{opt}$.
Proof: Every external cell $C$ is a heavy cell and so it contains at least $\delta \cdot 4^{i}$ points. Each point
contributes at least $\left( \frac{\sqrt{d}}{2^{i+1}} \right)^{2}$ to the cost of the optimal solution. Hence the overall contribution of
the points in $C$ is at least $\delta \cdot d/4$. □
Lemma 5.3.8 If $\delta \ge \frac{\epsilon^{d+2} \cdot Opt}{16 (1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{(d+2)/2}}$, the size of the coreset is at most $\frac{65 (1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{d/2}}{\epsilon^{d+2}} =
O(k \log n / \epsilon^{d+2})$.
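Proof: From Claim 5.3.7 it follows that
\[ \left| \bigcup_i M_{external}(i) \right| \le \frac{4 \cdot Opt}{\delta \cdot d}. \]
All cells from $M_{center}(i)$ are contained in $k$ cubes of sidelength $2 \left( \frac{\sqrt{d}}{2^{i}} + \frac{\sqrt{d}}{2^{i+1}} \right) = 3 \cdot \frac{\sqrt{d}}{2^{i}}$. Since
each cube volume is $\left( 3 \cdot \frac{\sqrt{d}}{2^{i}} \right)^{d}$ and each cell volume is $\left( \frac{\sqrt{d}}{2^{i}} \right)^{d} / d^{d/2}$, we know that
\[ \left| M_{center}(i) \right| \le k \cdot 3^{d} \cdot d^{d/2}. \]
Using similar arguments as in the proof of Claim 5.2.4 we obtain that $M(i) \ne \emptyset$ for at most
$1 + \log n$ distinct values of $i$. Therefore, we get
\[ \left| \bigcup_i M_{center}(i) \right| \le (\log n + 1) \cdot k \cdot 3^{d} \cdot d^{d/2}. \]
Since $\bigcup_i M(i) = \bigcup_i \left( M_{external}(i) \cup M_{center}(i) \right)$, the size of the coreset is at most $\frac{4 \cdot Opt}{\delta \cdot d} +
(\log n + 1) \cdot k \cdot 3^{d} \cdot d^{d/2}$. For $\delta \ge \frac{\epsilon^{d+2} \cdot Opt}{16 (1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{(d+2)/2}}$, the size of the coreset is at most
$\frac{65 (1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{d/2}}{\epsilon^{d+2}}$. □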
5.3.3 Finding a suitable value of δ
To find a suitable value of $\delta$ we use the method of Section 5.2.3. We plug in the values $\delta_0 :=
\frac{\epsilon^{d+2} \cdot Opt}{8 (1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{(d+2)/2}}$ and $S := \frac{65 (1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{d/2}}{\epsilon^{d+2}}$, and set for $j \in \{0, 1, \dots, \lfloor \log(n \cdot \tilde{\Delta} \cdot d) \rfloor\}$:
\[ \delta = \delta(j) = \frac{\epsilon^{d+2}}{8 (1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{(d+2)/2} \cdot \tilde{\Delta}} \cdot 2^{j}. \]
Doing a binary search on the values of $j$ we again find a value $j_0$ such that the coreset constructed
with the value $\delta = \delta(j_0)$ is of size at most $S$ and the coreset constructed with the value $\delta =
\delta(j_0 - 1)$ is of size greater than $S$ (if there is no such value we set $j_0 = 0$). Lemma 5.2.9 and
its proof can be stated in the same way, and the coreset constructed for this value of $\delta$ is then an
$\epsilon$-coreset for k-means.
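Theorem 12 Assume that we are given a point set $P \subseteq [0, 1]^{d}$ of size $n \in \mathbb{N}$ and the guarantee
that the value $Opt$ of an optimal k-means solution is at least $1/\tilde{\Delta}$. For each value of
$i \in \{0, 1, \dots, Z\}$ with $Z = \left\lceil \log \frac{8 \cdot k \cdot 33^{d+1} \cdot n (1+\log n) \cdot d^{(d+2)/2} \cdot \tilde{\Delta}}{\epsilon^{d+2}} \right\rceil + 1$ we have a square grid $G_i$
over the point space, all cells of side length $\frac{1}{2^{i}}$.
Given an oracle, which answers queries on the number of points in grid cells in constant time,
we can compute in time $O(k \cdot \log n \cdot \log(\tilde{\Delta} \cdot n / \epsilon) \cdot \log\log(\tilde{\Delta} \cdot n) / \epsilon^{d+2})$ an $\epsilon$-coreset of size
$O(k \cdot \log n / \epsilon^{d+2})$ for k-means.
A respective oracle can be constructed in time $O(n \cdot \log(n \cdot \tilde{\Delta} / \epsilon))$.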
5.4 Coresets for Oblivious Optimization Problems
Let us assume that $\Pi$ is $\ell$-Lipschitz and $\lambda$-mean preserving. We show that under these conditions
our k-median algorithm run on instance $P$ constructs a weighted point set $P_{core}$ that is an $\epsilon$-coreset
for $\Pi$. We modify our proofs of the k-median coreset.
5.4.1 Construction of the Coreset
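We also use $Z$ nested grids $G_i$ with side length $\frac{1}{2^{i}}$ for $Z = \left\lceil \log \frac{n (1+\log n) \cdot d^{(d+1)/2} \cdot (10\ell)^{d+1} \cdot \tilde{\Delta}}{\epsilon^{d+1} \cdot \lambda^{d}} \right\rceil + 1$.
We use the same definition of heavy cells.
Definition 5.4.1 (Heavy Cells) We call a cell of grid $G_i$ heavy, if it contains at least $\delta \cdot 2^{i}$ points
of $P$. A grid cell that is not heavy is light.
We denote by $L(i)$ the set of light cells of grid $G_i$ whose parent cell is heavy. Notice that each
point is contained in exactly one cell of $\bigcup_i L(i)$. Claim 5.2.2 still holds.
We partition the sets $L(i)$ into two subsets $L_{near}(i)$ and $L_{dist}(i)$. $L_{near}(i)$ contains all cells $C$
whose distance $d(\mu, C)$ to the center of gravity $\mu$ is at most $\frac{4 \cdot \ell}{\epsilon \cdot \lambda} \cdot \frac{\sqrt{d}}{2^{i}}$, i.e.
\[ L_{near}(i) = \left\{ C \in L(i) \;\middle|\; d(\mu, C) \le \frac{4 \cdot \ell}{\epsilon \cdot \lambda} \cdot \frac{\sqrt{d}}{2^{i}} \right\}. \]
$L_{dist}(i)$ contains all other cells from $L(i)$, i.e.
\[ L_{dist}(i) = \left\{ C \in L(i) \;\middle|\; d(\mu, C) > \frac{4 \cdot \ell}{\epsilon \cdot \lambda} \cdot \frac{\sqrt{d}}{2^{i}} \right\}. \]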
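Claim 5.4.2
\[ \sum_i \sum_{p \in L_{dist}(i)} \frac{\sqrt{d}}{2^{i-1}} \le \frac{\epsilon}{2 \cdot \ell} \cdot Opt. \]
Proof: Any point in $L_{dist}(i)$ has a distance of more than $\frac{4 \cdot \ell}{\epsilon \cdot \lambda} \cdot \frac{\sqrt{d}}{2^{i}}$ from the center of gravity $\mu$.
Therefore, we get
\begin{align*}
\sum_i \sum_{p \in L_{dist}(i)} \frac{\sqrt{d}}{2^{i-1}}
&= \frac{\epsilon \cdot \lambda}{2 \cdot \ell} \sum_i \sum_{p \in L_{dist}(i)} \frac{4 \cdot \ell}{\epsilon \cdot \lambda} \cdot \frac{\sqrt{d}}{2^{i}} \\
&\le \frac{\epsilon \cdot \lambda}{2 \cdot \ell} \sum_i \sum_{p \in L_{dist}(i)} d(p, \mu) \\
&\le \frac{\epsilon}{2 \cdot \ell} \cdot Opt
\end{align*}
where the last inequality holds because $\Pi$ is $\lambda$-mean preserving. □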
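Claim 5.4.3 For $\delta \le \frac{\epsilon^{d+1} \cdot \lambda^{d} \cdot Opt}{(1+\log n) \cdot d^{(d+1)/2} \cdot (10 \cdot \ell)^{d+1}}$ we get
\[ \sum_{i : L(i) \ne \emptyset} \; \sum_{p \in L_{near}(i)} \frac{\sqrt{d}}{2^{i-1}} \le \frac{\epsilon}{2 \cdot \ell} \cdot Opt. \]
Proof: We observe that the furthest point in a cell in $L_{near}(i)$ can have a distance of at most
$\left( 1 + \frac{4\ell}{\lambda\epsilon} \right) \cdot \frac{\sqrt{d}}{2^{i}}$ to the center of gravity. Hence, every cell in $L_{near}(i)$ is contained in a cube of
sidelength $2 \cdot \left( 1 + \frac{4\ell}{\lambda\epsilon} \right) \frac{\sqrt{d}}{2^{i}}$. The cube has volume $\left( 2 \cdot \left( 1 + \frac{4\ell}{\lambda\epsilon} \right) \frac{\sqrt{d}}{2^{i}} \right)^{d} \le \left( \frac{10\ell \cdot \sqrt{d}}{\epsilon\lambda \cdot 2^{i}} \right)^{d}$. Since every
cell in grid $G_i$ has volume $\left( \frac{1}{2^{i}} \right)^{d}$, there can be at most $\left( \frac{10\ell}{\lambda\epsilon} \right)^{d} \cdot d^{d/2}$ cells in $L_{near}(i)$.
Each cell in $L_{near}(i)$ is light and so it contains at most $\delta \cdot 2^{i}$ points. Hence,
\[ \sum_{i : L(i) \ne \emptyset} \; \sum_{p \in L_{near}(i)} \frac{\sqrt{d}}{2^{i-1}} \le \sum_{i : L(i) \ne \emptyset} \delta \cdot 2^{i} \cdot \left( \frac{10 \cdot \ell}{\lambda\epsilon} \right)^{d} \cdot d^{d/2} \cdot \frac{\sqrt{d}}{2^{i-1}} \le \frac{\epsilon}{2 \cdot \ell} \cdot Opt \]
for our choice of $\delta$ and using the same arguments as in Claim 5.2.4. □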
Lemma 5.4.4 The set $P_{core}$ is an $\epsilon$-coreset for $\delta \le \frac{\epsilon^{d+1} \cdot \lambda^{d} \cdot Opt}{(1+\log n) \cdot d^{(d+1)/2} \cdot (10 \cdot \ell)^{d+1}}$.
Proof: We first observe that every point of $P$ is contained in exactly one cell in $\bigcup_i L(i)$. By
Claim 5.2.2 we know that every point that is contained in a grid cell in $L(i)$ is moved a distance
of at most $\frac{\sqrt{d}}{2^{i-1}}$. Therefore, Claims 5.4.2 and 5.4.3 imply that the points are moved an overall
distance of at most $\frac{\epsilon}{\ell} \cdot Opt$. Since the optimization problem $\Pi$ is $\ell$-Lipschitz we know that the cost
of any solution changes by at most $\pm \epsilon \cdot Opt$ under this movement. Hence the set $P_{core}$ constructed
by our algorithm is a coreset. □
5.4.2 Size of the Coreset
Our next step is to give an upper bound on the size of the coreset. For every grid $G_i$ we define
$M(i)$ to be the set of heavy cells that do not contain a heavy subcell. Notice that the cells $M(i)$
are exactly those cells that contain a coreset point. Hence, it will be sufficient to determine the
cardinality of this set.
Definition 5.4.5 (Center Cells) Let $\mu$ denote the center of gravity of $P$. A cell $C$ in grid $G_i$ is
called a center cell, if its distance to $\mu$ is at most $\frac{\sqrt{d}}{2^{i+1}}$, i.e. if $d(\mu, C) \le \frac{\sqrt{d}}{2^{i+1}}$.
We define $M_{center}(i) = \{ C \in M(i) \mid C \text{ is center cell} \}$ to be the subset of $M(i)$ that are center
cells. We use $M_{external}(i) = M(i) \setminus M_{center}(i)$ to denote the remaining cells of $M(i)$. We call
these cells external cells. We use $G = \sum_{p \in P} d(p, \mu)$ to denote the sum of distances to the center
of gravity.
Claim 5.4.6 Every external cell contributes at least $\delta \cdot \sqrt{d}/2$ to $G$.
Proof: Every external cell $C$ is a heavy cell and so it contains at least $\delta \cdot 2^{i}$ points. Each point
has a distance of at least $\frac{\sqrt{d}}{2^{i+1}}$ to the center of gravity. Hence the overall contribution of the points
in $C$ is at least $\delta \cdot \sqrt{d}/2$. □
Lemma 5.4.7 If $\delta \ge \frac{\epsilon^{d+1} \cdot \lambda^{d} \cdot Opt}{2 (1+\log n) \cdot d^{(d+1)/2} \cdot (10 \cdot \ell)^{d+1}}$, the size of the coreset is at most
$\frac{4 \cdot (1+\log n) \cdot d^{d/2} \cdot (10\ell)^{d+1}}{\epsilon^{d+1} \cdot \lambda^{d+1}} = O(\log n / \epsilon^{d+1})$.
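Proof: From Claim 5.4.6 and $G = \sum_{p \in P} d(p, \mu) \le \frac{1}{\lambda} \cdot Opt$ it follows that
\[ \left| \bigcup_i M_{external}(i) \right| \le \frac{2 \cdot Opt}{\delta \cdot \lambda \cdot \sqrt{d}}. \]
All cells from $M_{center}(i)$ are contained in a cube of sidelength $2 \left( \frac{\sqrt{d}}{2^{i}} + \frac{\sqrt{d}}{2^{i+1}} \right) = 3 \cdot \frac{\sqrt{d}}{2^{i}}$. Since the
cube volume is $\left( 3 \cdot \frac{\sqrt{d}}{2^{i}} \right)^{d}$ and each cell volume is $\left( \frac{1}{2^{i}} \right)^{d}$, we know that
\[ \left| M_{center}(i) \right| \le 3^{d} \cdot d^{d/2}. \]
Using similar arguments as in the proof of Claim 5.2.4 we obtain that $M(i) \ne \emptyset$ for at most
$1 + \log n$ distinct values of $i$. Therefore, we get
\[ \left| \bigcup_i M_{center}(i) \right| \le (\log n + 1) \cdot 3^{d} \cdot d^{d/2}. \]
Since $\bigcup_i M(i) = \bigcup_i \left( M_{external}(i) \cup M_{center}(i) \right)$, the size of the coreset is at most $\frac{2 \cdot Opt}{\delta \cdot \lambda \cdot \sqrt{d}} +
(\log n + 2) \cdot 3^{d} \cdot d^{d/2}$. If $\delta \ge \frac{\epsilon^{d+1} \cdot \lambda^{d} \cdot Opt}{2 (1+\log n) \cdot d^{(d+1)/2} \cdot (10 \cdot \ell)^{d+1}}$, the size of the coreset is at most
$\frac{4 \cdot (1+\log n) \cdot d^{d/2} \cdot (10\ell)^{d+1}}{\epsilon^{d+1} \cdot \lambda^{d+1}}$. □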
5.4.3 Finding a suitable value of δ
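To find a suitable value of $\delta$ we use the method of Section 5.2.3. We plug in the values $\delta_0 :=
\frac{\epsilon^{d+1} \cdot \lambda^{d} \cdot Opt}{(1+\log n) \cdot d^{(d+1)/2} \cdot (10 \cdot \ell)^{d+1}}$ and $S := \frac{4 \cdot (1+\log n) \cdot d^{d/2} \cdot (10\ell)^{d+1}}{\epsilon^{d+1} \cdot \lambda^{d+1}}$, and set for $j \in \{0, 1, \dots, \lfloor \log(n \cdot \tilde{\Delta} \cdot \sqrt{d} \cdot \ell) \rfloor\}$:
\[ \delta = \delta(j) = \frac{\epsilon^{d+1} \cdot \lambda^{d}}{(1+\log n) \cdot d^{(d+1)/2} \cdot (10 \cdot \ell)^{d+1} \cdot \tilde{\Delta}} \cdot 2^{j}. \]
Doing a binary search on the values of $j$ we again find a value $j_0$ such that the coreset constructed
with the value $\delta = \delta(j_0)$ is of size at most $S$ and the coreset constructed with the value
$\delta = \delta(j_0 - 1)$ is of size greater than $S$ (if there is no such value we set $j_0 = 0$). Lemma 5.2.9 and
its proof can be stated in the same way, and the coreset constructed for this value of $\delta$ is then an
$\epsilon$-coreset for the respective problem.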
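Theorem 13 Assume that we are given a point set $P \subseteq [0, 1]^{d}$ of size $n \in \mathbb{N}$ and the guarantee
that the value $Opt$ of an optimal solution of the oblivious optimization problem is at least $1/\tilde{\Delta}$.
For each value of $i \in \{0, 1, \dots, Z\}$ with $Z = \left\lceil \log \frac{n (1+\log n) \cdot d^{(d+1)/2} \cdot (10\ell)^{d+1} \cdot \tilde{\Delta}}{\epsilon^{d+1} \cdot \lambda^{d}} \right\rceil + 1$ we have a
square grid $G_i$ over the point space, all cells of side length $\frac{1}{2^{i}}$.
Given an oracle, which answers queries on the number of points in grid cells in constant
time, we can compute in time $O(\log n \cdot \log(\tilde{\Delta} \cdot n / \epsilon) \cdot \log\log(\tilde{\Delta} \cdot n) / \epsilon^{d+1})$ an $\epsilon$-coreset of size
$O(\log n / \epsilon^{d+1})$ for the respective problem.
A respective oracle can be constructed in time $O(n \cdot \log(n \cdot \tilde{\Delta} / \epsilon))$.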
Corollary 5.4.8 Given a set $P$ of $n$ points in $\mathbb{R}^{d}$, a coreset for an oblivious optimization problem
can be constructed in time $O(n \cdot \log(n/\epsilon) + \log^{2}(n/\epsilon) \cdot \log\log n / \epsilon^{d+1})$.
Proof: We first scale the points such that all points lie in $[0, 1]^{d}$ and that there are two points
$p, q \in P$ and a dimension $i \in \{1, \dots, d\}$, such that $|p^{(i)} - q^{(i)}| = 1$. Since the problem is $\lambda$-mean
preserving, we can conclude from the triangle inequality that $Opt \ge \lambda/2$. Therefore we can set
$\tilde{\Delta} := 2/\lambda$ and have $Opt \ge 1/\tilde{\Delta}$. We apply our coreset technique on the scaled point set. After
constructing the coreset we rescale the points. □
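The scaling step of this proof can be sketched as follows. This is an illustration under assumptions: the helper names `scale_to_unit_cube` and `rescale` are hypothetical, points are tuples of floats, and the scaling is realised by a translation followed by division by the largest coordinate extent, so that some dimension contains two points at coordinate distance exactly 1.

```python
def scale_to_unit_cube(points):
    """Translate and scale points into [0, 1]^d so that one dimension has extent exactly 1."""
    d = len(points[0])
    lows = [min(p[i] for p in points) for i in range(d)]
    extent = max(max(p[i] for p in points) - lows[i] for i in range(d))
    if extent == 0:
        raise ValueError("all points coincide, nothing to scale")
    scaled = [tuple((p[i] - lows[i]) / extent for i in range(d)) for p in points]
    return scaled, lows, extent


def rescale(point, lows, extent):
    """Map a point of the scaled instance (e.g. a coreset point) back to the original space."""
    return tuple(x * extent + low for x, low in zip(point, lows))
```

Coreset points computed on the scaled instance are mapped back with `rescale`; their weights stay unchanged.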
5.5 Constructing Solutions on the Coreset
In this section we briefly show how to construct solutions to the various problems after
computing the coreset.
5.5.1 k-Median
We follow an approach of Har-Peled and Mazumdar [61].
Lemma 5.5.1 [61] Given a weighted set $P_{core}$ of $|P_{core}|$ points in $\mathbb{R}^{d}$ with total weight $n$, one
can compute a set $D$ of size $O(k^{2} \epsilon^{-2d} \log^{2} n)$ such that at least one subset $C \subseteq D$ of size $k$
is a $(1 \pm \epsilon)$-approximate solution to the k-median problem on $P_{core}$. The running time of this
algorithm is $O(|P_{core}| \cdot \log^{2} n + k^{5} \log^{9} n + k^{2} \epsilon^{-2d} \log^{2} n)$.
As in [61] we can use this candidate set and an algorithm from [87] to compute a solution:
Lemma 5.5.2 [87] Given a weighted point set $P_{core}$ of $|P_{core}|$ points in $\mathbb{R}^{d}$, with total weight $n$, a
set $D$ of size at most $|P_{core}|$ such that at least one subset $C \subseteq D$ of size $k$ is a $(1 \pm \epsilon)$-approximate
solution to the k-median problem on $P_{core}$, and a parameter $\delta > 0$, one can compute a $(1 \pm \epsilon)$-approximate
k-median clustering of $P_{core}$ using only centers from $D$. The overall running time is
$O(\rho \cdot |P_{core}| \cdot (\log k)(\log n) \log(1/\delta))$, where $\rho = \exp[O((1 + \log 1/\epsilon)/\epsilon)^{d-1}]$. The algorithm
succeeds with probability $\ge 1 - \delta$.
Theorem 14 On the coreset $P_{core}$ we can compute a $(1 \pm \epsilon)$-approximate k-median solution in
time
\[ O(\rho k^{2} \cdot \log k \cdot \log^{3} n \cdot \log(1/\delta) + k \cdot \log^{3} n \cdot \epsilon^{-d-1} + k^{5} \log^{9} n + k^{2} \epsilon^{-2d} \log^{2} n) \]
where $\rho = \exp[O((1 + \log 1/\epsilon)/\epsilon)^{d-1}]$.
5.5.2 k-Means
We again follow an approach of Har-Peled and Mazumdar [61] to compute a $(1 \pm \epsilon)$-approximate
k-means clustering on $P_{core}$. They use the following lemma:
Lemma 5.5.3 [99] Given a weighted set $P_{core}$ of $|P_{core}|$ points in $\mathbb{R}^{d}$ with total weight $n$, one can
compute a set $D$ of size $O(k^{2} \epsilon^{-2d} \log n \cdot \log(\frac{1}{\epsilon}))$ such that at least one subset $C \subseteq D$ of size
$k$ is a $(1 \pm \epsilon)$-approximate solution to the k-means problem on $P_{core}$. The running time of this
algorithm is $O(|P_{core}| \cdot \log(|P_{core}|) + |P_{core}| \cdot \epsilon^{-d} \log \frac{1}{\epsilon})$.
As in [61] we can use this candidate set $D$ to compute a solution. We simply enumerate all
k-tuples in $D$, and compute the k-means clustering value of each candidate center set. This takes
$O(|D|^{k} \cdot k \cdot |P_{core}|)$ time. The best tuple provides the required approximation.
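The enumeration step can be sketched as below. This is a minimal illustration, assuming the coreset is given as a dict from points to weights and the candidate set $D$ as a list of points; the name `best_k_means_centers` is hypothetical, and the sketch enumerates $k$-element subsets of $D$ (of which there are at most $|D|^{k}$) rather than ordered tuples.

```python
from itertools import combinations


def best_k_means_centers(core, candidates, k):
    """Return (cost, centers) of the cheapest k-subset of candidates on the weighted coreset."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def cost(centers):
        # Weighted k-means cost: every coreset point pays for its nearest center.
        return sum(w * min(sq_dist(p, c) for c in centers) for p, w in core.items())

    return min((cost(centers), centers) for centers in combinations(candidates, k))
```

Each candidate set is evaluated in $O(k \cdot |P_{core}|)$ time, matching the running time stated above.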
Theorem 15 On the coreset $P_{core}$ we can compute a $(1 \pm \epsilon)$-approximate k-means solution in
time $\tilde{O}(k^{2k+2} \cdot \epsilon^{-2kd-d-2} \cdot \log^{k+1} n)$.
5.5.3 MaxCut
In this section we describe how to compute a MaxCut on the computed coreset. We adapt an
algorithm from [39] for unweighted metric MaxCut, which reduces metric MaxCut to MaxCut
in big dense weighted graphs. In general we could replace each weighted point by a number of
unweighted points and run [39] on the unweighted instance (having $n$ nodes). Such a technique
has also been used in [72].
To avoid building the graph of size $n$ we construct a new reduction from MaxCut on coresets
to MaxCut in small dense graphs. We will prove that the reduction can be done in space and time
polylogarithmic in $n$.
We assume that our coreset construction always constructs the coreset points exactly in the
middle of a heavy cell. Then we can use a property of our coreset, namely that each point $p \in P_{core}$
either has a big weight or is far away from the next coreset point. Using this property the techniques
of this section follow the ideas of [39].
For every coreset point $p \in P_{core}$ let $w(p)$ denote the weight of point $p$. We also assume that
$n = \sum_{p \in P_{core}} w(p)$. For each partition $(L, R)$ of $P_{core}$ we write
\[ \mathrm{Cut}(P_{core}, L, R) = \sum_{p \in L, q \in R} w(p) \cdot w(q) \cdot d(p, q) \]
and for each complete graph $G = (V, E)$ with weight function $\omega : E \to \mathbb{R}^{+}$ and each partition
$(L, R)$ of $V$ we write
\[ \mathrm{Cut}(G, L, R) = \sum_{p \in L, q \in R} \omega(p, q). \]
Recall that $Opt$ denotes the value of an optimum cut $(L_{Opt}, R_{Opt})$ of the input point set $P$, scaled
by $1/n$:
\[ Opt = \max_{(L,R) \text{ partition of } P} \frac{1}{n} \cdot \sum_{p \in L, q \in R} d(p, q). \]
We scale the distances such that the weighted average distance between the coreset points is
1, i.e.
\[ \sum_{p, q \in P_{core}} w(p) \cdot w(q) \cdot d(p, q) = n^{2}. \]
In the following we will always assume that this equality holds.
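The cut value and the normalisation can be sketched in Python. This sketch rests on assumptions: the coreset is a dict from coordinate tuples to weights, distances are Euclidean, the function names are hypothetical, and the sum over $p, q \in P_{core}$ in the scaling condition is read as a sum over ordered pairs (the reading used in the computation of $\omega_p / w(p)$ in the proof of Lemma 5.5.11).

```python
from math import dist


def cut_value(core, left):
    """Cut(P_core, L, R): sum of w(p) * w(q) * d(p, q) over p in L and q in R = P_core minus L."""
    return sum(wp * wq * dist(p, q)
               for p, wp in core.items() if p in left
               for q, wq in core.items() if q not in left)


def scale_distances(core):
    """Scale coordinates so that the sum over ordered pairs of w(p) * w(q) * d(p, q) equals n^2."""
    n = sum(core.values())
    total = sum(wp * wq * dist(p, q)
                for p, wp in core.items() for q, wq in core.items())
    factor = n * n / total  # requires at least two distinct coreset points
    return {tuple(x * factor for x in p): w for p, w in core.items()}
```

Scaling all coordinates by a common factor scales every pairwise distance by the same factor, so the weights and the optimal partition are unaffected.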
Lemma 5.5.4 $\frac{n}{4} \le Opt \le 2 \cdot n$
Proof: Since $P_{core}$ is an $\epsilon$-coreset for $P$, we have:
\[ Opt = \max_{(L,R) \text{ partition of } P} \frac{1}{n} \cdot \sum_{p \in L, q \in R} d(p, q) \le \frac{1}{1 - \epsilon} \cdot \max_{(L,R) \text{ partition of } P_{core}} \frac{1}{n} \cdot \sum_{p \in L, q \in R} w(p) \cdot w(q) \cdot d(p, q) \le 2 \cdot n. \]
To show that $Opt \ge n/4$ we consider a random cut $(A, B)$, where each point $p \in P_{core}$ is
independently put into $A$ with probability 1/2 and otherwise put into $B$. The probability of each
edge $(p, q)$ to be in the cut is then 1/2. By linearity of expectation the expected value of the cut
is $E[\mathrm{Cut}(P_{core}, A, B)] = 1/2 \cdot \sum_{p, q \in P_{core}} w(p) \cdot w(q) \cdot d(p, q) = n^{2}/2$. Therefore the maximum
cut on $P_{core}$ must have a value of at least $n^{2}/2$. Since $Opt$ is the value of a maximum cut on $P$
scaled by $1/n$ and since $P_{core}$ is an $\epsilon$-coreset for $P$, we conclude
\[ Opt \ge (1 - \epsilon) \cdot n/2 \ge n/4. \]
□
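We will use the coreset $P_{core}$ of an instance started with a value of $\delta$ fulfilling
\[ \frac{\epsilon^{d+1} \cdot Opt}{80 \cdot (1+\log n) \cdot d^{d/2} \cdot 40^{d}} \le \delta \le \frac{\epsilon^{d+1} \cdot Opt}{10 \cdot (1+\log n) \cdot d^{d/2} \cdot 40^{d}}. \]
We can easily find such a value of $\delta$ because of Lemma 5.5.4. We define
\[ \beta := \frac{\epsilon^{d+1}}{80 \cdot (1+\log n) \cdot d^{d/2} \cdot 40^{d}}. \]
Then we have
\[ \beta \le \frac{\delta}{Opt}. \]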
Lemma 5.5.5 For each point $p \in P_{core}$ let $d(p) := \min\{ d(p, q) \mid q \in P_{core}, q \ne p \}$ be the distance to
the next coreset point. Then we have:
\[ d(p) \cdot w(p) \ge \frac{\beta \cdot n}{8}. \]
Proof:
Our algorithm constructs the coreset points beginning with the finest grid $G_{Z-1}$. In a grid $G_i$
a coreset point $p$ is introduced in the middle of a cell $C$, iff $C$ is heavy and has no heavy subcell
(only in that case we would already have a coreset point within $C$). Since the grid $G_i$ has side
length $\frac{1}{2^{i}}$ we conclude $d(p) \ge \frac{1}{2 \cdot 2^{i}}$. Since the cell $C$ is heavy and all points in the cell are
mapped to $p$, we have $w(p) \ge \delta \cdot 2^{i}$ and therefore $d(p) \cdot w(p) \ge \frac{\delta}{2} \ge \frac{\beta \cdot Opt}{2} \ge \frac{\beta \cdot n}{8}$. After
introducing the coreset point $p$ no other coreset point can be introduced within distance $\frac{1}{2 \cdot 2^{i}}$ of
$p$, because the coreset construction goes on with larger cells and the distance to the border of the
cell containing $p$ is then again at least $\frac{1}{2 \cdot 2^{i}}$. Our method introduces no new coreset points in cells
already containing a coreset point. Since the weight of a coreset point can only increase during
the coreset construction, the lemma follows. □
Definition 5.5.6 (Distance Weight) [39]: The distance weight $\omega_p$ of a coreset point $p \in P_{core}$
is defined as $\omega_p := w(p) \cdot \sum_{q \in P_{core}} w(q) d(p, q)$.
Definition 5.5.7 (Graph of Clones) We define a weighted complete graph $W = (X, E)$ where $X$
(called the set of clones) is a multiset of points. Each point $p \in P_{core}$ is cloned to create $\left\lfloor \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor$
identical points of $X$, and the edge between $p' \in X$, a clone of $p \in P_{core}$, and $q' \in X$, a clone of
$q \in P_{core}$, has weight
\[ e_{p'q'} := \frac{\epsilon^{2} \beta^{2} n^{4} \cdot w(p) \cdot w(q)}{64 \cdot \omega_p \cdot \omega_q} \cdot d(p, q). \]
Edges between clones $p', q'$ of the same point have weight $e_{p'q'} := 0$.
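The graph of clones can be represented implicitly by the number of clones per coreset point and a function for the edge weights. The following sketch assumes the coreset is a dict from coordinate tuples to weights with Euclidean distances; `clone_graph` and its return convention are hypothetical choices for illustration.

```python
from math import dist, floor


def clone_graph(core, eps, beta):
    """Return (clones, edge_weight): clone counts per coreset point and the clone edge weights."""
    n = sum(core.values())
    # Distance weight omega_p = w(p) * sum_q w(q) * d(p, q).
    omega = {p: w * sum(wq * dist(p, q) for q, wq in core.items())
             for p, w in core.items()}
    clones = {p: floor(8 * omega[p] / (eps * beta * n * n)) for p in core}

    def edge_weight(p, q):
        """Weight of an edge between a clone of p and a clone of q."""
        if p == q:
            return 0.0  # clones of the same point are joined by weight-0 edges
        return (eps ** 2 * beta ** 2 * n ** 4 * core[p] * core[q] * dist(p, q)
                / (64 * omega[p] * omega[q]))

    return clones, edge_weight
```

For two unit-weight points at distance 1 with `eps = beta = 0.5`, each point gets 8 clones and each clone edge has weight 1/64, so the 64 edges across the cut sum to $w(p) \cdot w(q) \cdot d(p, q) = 1$; when the floor is not exact, the clone cut only approximates this value.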
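Lemma 5.5.8 We have for each $p \in P_{core}$:
\[ \left\lfloor \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor \ge (1 - \epsilon) \cdot \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}}. \]
Proof: Let $p \in P_{core}$ be an arbitrary coreset point. Then Lemma 5.5.5 shows:
\[ \omega_p = \sum_{q \in P_{core}} w(p) \cdot w(q) \cdot d(p, q) \ge \sum_{q \in P_{core}} w(q) \cdot \frac{\beta \cdot n}{8} = \frac{\beta \cdot n^{2}}{8}. \]
Therefore:
\[ \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} \ge \frac{1}{\epsilon} \]
and the lemma follows. □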
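We now show that the constructed auxiliary graph $W = (X, E)$ is small.
Lemma 5.5.9
\[ (1 - \epsilon) \cdot \frac{8}{\epsilon \beta} \le |X| \le \frac{8}{\epsilon \beta} \]
Proof: We have
\begin{align*}
|X| = \sum_{p \in P_{core}} \left\lfloor \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor
&\ge \sum_{p \in P_{core}} (1 - \epsilon) \cdot \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} \\
&= (1 - \epsilon) \cdot \sum_{p, q \in P_{core}} \frac{8}{\epsilon \cdot \beta \cdot n^{2}} \cdot w(p) \cdot w(q) \cdot d(p, q) = (1 - \epsilon) \cdot \frac{8}{\epsilon \cdot \beta}
\end{align*}
where the last equality comes from the scaling of the point distances and
\[ |X| = \sum_{p \in P_{core}} \left\lfloor \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor \le \sum_{p \in P_{core}} \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} = \frac{8}{\epsilon \cdot \beta}. \]
□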
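Lemma 5.5.10 Each cut $(L, R)$ of $P_{core}$ corresponds to a cut $(L', R')$ of $W$ having the value
\[ \mathrm{Cut}(W, L', R') \ge (1 - \epsilon)^{2} \cdot \mathrm{Cut}(P_{core}, L, R). \]
Each cut $(L', R')$ of $W$ can easily be extended to a cut $(L, R)$ of $P_{core}$ having the value
\[ \mathrm{Cut}(P_{core}, L, R) \ge \mathrm{Cut}(W, L', R'). \]
Proof: Consider a cut $(L, R)$ of $P_{core}$, and let $(L', R')$ be the induced cut of $W$ where all clone
vertices of a point $p \in L$ belong to $L'$ and all clone vertices of a point $p \in R$ belong to $R'$. We
have:
\begin{align*}
\mathrm{Cut}(W, L', R') &= \sum_{p' \in L', q' \in R'} e_{p'q'} \\
&= \sum_{p \in L, q \in R} \left\lfloor \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor \cdot \left\lfloor \frac{8 \cdot \omega_q}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor \cdot \frac{\epsilon^{2} \beta^{2} n^{4} \cdot w(p) \cdot w(q) \cdot d(p, q)}{64 \cdot \omega_p \cdot \omega_q} \\
&\ge \sum_{p \in L, q \in R} (1 - \epsilon)^{2} \cdot w(p) \cdot w(q) \cdot d(p, q) = (1 - \epsilon)^{2} \cdot \mathrm{Cut}(P_{core}, L, R).
\end{align*}
On the other hand let $(L', R')$ be an arbitrary cut of $W$. We will first alter the cut $(L', R')$ in a
way such that the value of the cut does not decrease and all clone vertices of a point $p \in P_{core}$
belong to the same side of the partition.
Consider a point $p \in P_{core}$ and let $v$ be one of the clone vertices. We compute the values
\[ C_L := \sum_{w \in L'} e_{v,w} \]
and
\[ C_R := \sum_{w \in R'} e_{v,w}. \]
Notice that the value of the cut increases by $C_L - C_R$ if we move one clone vertex of $p$ from $L'$
to $R'$. If we move one clone vertex of $p$ from $R'$ to $L'$ the value of the cut decreases by $C_L - C_R$.
If $C_L \ge C_R$ we put all clone vertices of $p$ into $R'$, not decreasing the cut. If $C_L < C_R$ we
put all clone vertices of $p$ into $L'$, not decreasing the cut. We do this iteratively for all vertices
$p \in P_{core}$. After that for each vertex $p \in P_{core}$ all clone vertices belong to the same side of the partition.
We then construct a cut $(L, R)$ of $P_{core}$ in the following way: We put a point $p \in P_{core}$ into
$L$ if its clone vertices belong to $L'$. Otherwise we put $p$ into $R$. The value of
the cut is then:
\begin{align*}
\mathrm{Cut}(P_{core}, L, R) &= \sum_{p \in L, q \in R} w(p) \cdot w(q) \cdot d(p, q) \\
&\ge \sum_{p \in L, q \in R} \left\lfloor \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor \cdot \left\lfloor \frac{8 \cdot \omega_q}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor \cdot \frac{\epsilon^{2} \beta^{2} n^{4} \cdot w(p) \cdot w(q) \cdot d(p, q)}{64 \cdot \omega_p \cdot \omega_q} \\
&= \sum_{p' \in L', q' \in R'} e_{p'q'} = \mathrm{Cut}(W, L', R')
\end{align*}
□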
Lemma 5.5.10 shows that each $(1 \pm \epsilon)$-approximate solution of MaxCut on $W$ can be extrapolated
to a $(1 \pm \epsilon)^{3}$-approximate solution of MaxCut on $P_{core}$. It remains to show how to
compute an approximate MaxCut on $W$. We will use an algorithm of [51] to compute such an
approximate solution. The algorithm works for so-called dense graphs. We can show that $W$ is
dense by showing that the maximum weight of an edge is at most a constant factor larger than
the average edge weight.
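Lemma 5.5.11 $\max_{p', q' \in X}(e_{p'q'}) \le 16 \cdot \mathrm{avg}_{p', q' \in X}(e_{p'q'})$
Proof: The average value of an edge $e \in E$ is:
\begin{align*}
\frac{\sum_{p, q \in P_{core}} \left\lfloor \frac{8 \cdot \omega_p}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor \cdot \left\lfloor \frac{8 \cdot \omega_q}{\epsilon \cdot \beta \cdot n^{2}} \right\rfloor \cdot \frac{\epsilon^{2} \beta^{2} n^{4} \cdot w(p) \cdot w(q) \cdot d(p, q)}{64 \cdot \omega_p \cdot \omega_q}}{|X|^{2}}
&\ge \frac{\sum_{p, q \in P_{core}} (1 - \epsilon)^{2} \cdot w(p) \cdot w(q) \cdot d(p, q)}{|X|^{2}} \\
&= \frac{(1 - \epsilon)^{2} n^{2}}{|X|^{2}} \ge \frac{(1 - \epsilon)^{2} n^{2} \cdot \epsilon^{2} \cdot \beta^{2}}{64}
\end{align*}
where the last inequality comes from Lemma 5.5.9.
To show an upper bound for all edge weights we use that for each point $p \in P_{core}$:
\begin{align*}
\frac{\omega_p}{w(p)} = \sum_{q \in P_{core}} w(q) \cdot d(p, q)
&= \frac{1}{n} \left( \sum_{q, z \in P_{core}} w(q) \cdot w(z) \cdot d(p, q) \right) \\
&= \frac{1}{2n} \left( \left( \sum_{q, z \in P_{core}} w(q) \cdot w(z) \cdot d(p, q) \right) + \left( \sum_{q, z \in P_{core}} w(q) \cdot w(z) \cdot d(p, z) \right) \right) \\
&\ge \frac{1}{2n} \cdot \sum_{q, z \in P_{core}} w(q) \cdot w(z) \cdot d(q, z) = \frac{n}{2}
\end{align*}
Now take an arbitrary edge between clone vertices of $p$ and $q$. Using the triangle inequality and
the fact that for arbitrary $a, b \ge 1$ we have $\frac{a + b}{a \cdot b} \le \frac{2}{\min\{a, b\}}$, we can bound the weight of the edge
as follows:
\begin{align*}
e_{p'q'} &= d(p, q) \cdot \frac{\epsilon^{2} \beta^{2} n^{4} \cdot w(p) \cdot w(q)}{64 \cdot \omega_p \cdot \omega_q} \\
&\le \frac{1}{n} \sum_{z \in P_{core}} w(z) \cdot \left( d(p, z) + d(z, q) \right) \cdot \frac{\epsilon^{2} \beta^{2} n^{4} \cdot w(p) \cdot w(q)}{64 \cdot \omega_p \cdot \omega_q} \\
&= \frac{\epsilon^{2} \beta^{2} n^{3}}{64} \cdot \left( \frac{\omega_p}{w(p)} + \frac{\omega_q}{w(q)} \right) \cdot \frac{w(p) \cdot w(q)}{\omega_p \cdot \omega_q} \\
&\le \frac{\epsilon^{2} \beta^{2} n^{3}}{64} \cdot \frac{2}{\min\left\{ \frac{\omega_p}{w(p)}, \frac{\omega_q}{w(q)} \right\}} \le \frac{\epsilon^{2} \beta^{2} n^{3}}{64} \cdot \frac{4}{n} = \frac{\epsilon^{2} \beta^{2} n^{2}}{16}
\end{align*}
We conclude
\[ \max_{p', q' \in X}(e_{p'q'}) \le \frac{1}{(1 - \epsilon)^{2}} \cdot 4 \cdot \mathrm{avg}_{p', q' \in X}(e_{p'q'}) \le 16 \cdot \mathrm{avg}_{p', q' \in X}(e_{p'q'}). \]
□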
Since $W$ is a dense graph we can apply the algorithm of [51] to find a $(1 \pm \epsilon)$-approximate MaxCut $(L', R')$ for $W$ in $O(|X|^2 \cdot 2^{(1/\epsilon)^{O(1)}}) = O(\log^2 n \cdot 2^{(1/\epsilon)^{O(1)}})$ time. We extrapolate this cut to a $(1 \pm \epsilon)^3 = (1 \pm O(\epsilon))$-approximate MaxCut $(L, R)$ on $P_{core}$ and get the following result:
Theorem 16 A $(1+\epsilon)$-approximate MaxCut $(L, R)$ on $P_{core}$ can be found in time $O(\log^2 n \cdot 2^{(1/\epsilon)^{O(1)}})$.
Remark 5.5.12 We can even compute in the same time an implicit MaxCut $(A, B)$ of the input point set $P$ from the coreset:
After computing a good cut $(L, R)$ of $P_{core}$ we can provide a partition of the whole space $[0, 1]^d$ into two parts $\mathcal{L}$ and $\mathcal{R}$. During our coreset construction we store the information about the mappings of points to coreset points. When the points of a cell $C$ are mapped to a coreset point $p \in L$ during the coreset construction, we assign the whole cell $C$ to $\mathcal{L}$. When the points of a cell $C$ are mapped to a coreset point $p \in R$ during the coreset construction, we assign the whole cell $C$ to $\mathcal{R}$. After the construction of the partition $(\mathcal{L}, \mathcal{R})$ of the space we know that the partition $(A, B)$ of $P$ with $A := P \cap \mathcal{L}$ and $B := P \cap \mathcal{R}$ is a $(1+\epsilon)$-approximate MaxCut of $P$.
The computation of $(\mathcal{L}, \mathcal{R})$ from $(L, R)$ can be done in $poly(\log n, 1/\epsilon)$ time and space.
Together with the results of Corollary 5.4.8 we obtain the fastest method published so far (in terms of $n$) to find a $(1 \pm \epsilon)$-approximation of the Euclidean MaxCut of a point set. Previous methods had runtime $O(n \cdot \log n \cdot (2^{(1/\epsilon)^{O(1)}} + \log n))$ [72] and $O(n^2 \cdot 2^{O(1/\epsilon^2)})$ [37].
Theorem 17 Given a set $P$ of $n$ points in $\mathbb{R}^d$, a $(1 \pm \epsilon)$-approximate solution for the Euclidean MaxCut of the points can be found in time $O(n \cdot \log(n/\epsilon) + \log^2 n \cdot \log\log n \cdot 2^{(1/\epsilon)^{O(1)}})$.
5.5.4 MaxMatching
We are not aware of a method to compute an approximate MaxMatching solution on the weighted
coreset points directly without expanding the coreset.
One could replace each point $p \in P_{core}$ having weight $w(p)$ by $w(p)$ unweighted points and run the algorithm of Gabow [45], which finds an optimal solution in $O(n^3)$ time.
5.5.5 MaxTSP
We are again not aware of a method to find a $(1 \pm \epsilon)$-approximation of the MaxTSP tour even on unweighted points. We could obtain a solution on the coreset $P_{core}$ by expanding the coreset to a point set of size $n$ and running an exhaustive search. The running time is $O(n!)$.
5.5.6 AverageDistance
The weighted average distance 1
n2Pp,qPcore w(p)·w(q)·d(p, q)on the coreset Pcore can be
easily computed in time O(|Pcore|2) = O(log2n·d2).
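The quadratic-time computation can be written down directly. The following is a minimal sketch, assuming the coreset is given as a list of coordinate tuples with a parallel list of weights (both hypothetical names), that the distance is Euclidean, and that $n$ is the sum of the weights:

```python
import math


def weighted_average_distance(points, weights):
    """Return (1/n^2) * sum over ordered pairs (p, q) of w(p) * w(q) * d(p, q)."""
    n = sum(weights)  # the total weight stands in for the original number of points
    total = 0.0
    for p, wp in zip(points, weights):
        for q, wq in zip(points, weights):
            total += wp * wq * math.dist(p, q)
    return total / n ** 2
```

The double loop visits every ordered pair once, which is where the $O(|P_{core}|^2)$ bound comes from.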
5.6 Coresets via Sampling
In the last sections we introduced a coreset construction technique, which is suitable to reduce the
complexity of a huge point set. We showed that the huge point set can be replaced by a weighted
point set of logarithmic size, which still holds all information needed to compute approximate
solutions for various clustering problems. However, to construct the coreset we need access to
the whole point set, which is not always given in real world applications dealing with huge point
sets.
In this section we will alter the construction such that it depends only on point samples. We
will show that the information about all point samples can itself be stored in polylogarithmic
memory. This will help us to develop data streaming algorithms in Chapter 6. The point sample
technique will also help us to maintain MaxCut clusterings of points when points are moving
along linear trajectories. See Chapter 7 for details.
We still assume that all points lie in [0, 1]dand that Opt 1/e
.
5.6.1 k-Median
We again consider Zgrids G0,...,GZ1for Z=llog 4·k·10d·n(1+log n)·d(d+1)/2·
e
d+1m+1, grid Gi
having cell side length 1
2i. In each grid Giwe pick a random sample Siof points. To select our
random sample we take every point with probability
\[ p_i := \min\left\{ \frac{\alpha}{\delta \cdot 2^i},\ 1 \right\} \]
into our sample $S_i$, where
\[ \alpha = 6 \cdot \epsilon^{-2} \cdot \ln(2 \cdot Z \cdot 2^{Z \cdot d} / \rho) \]
and $\rho$ is the desired error probability of our algorithm. The sampling is done at least $\alpha$-wise independently, which means that for each set $A \subseteq P$ of at most $\alpha$ points and each partition $\{B, C\}$ of $A$:
\[ \Pr[B \subseteq S_i \wedge C \cap S_i = \emptyset] = (p_i)^{|B|} \cdot (1 - p_i)^{|C|}. \]
Essentially this means that for each subset $A \subseteq P$ of size $\alpha$ the sampling is done independently.
We will show that it follows from a variant of Chernoff bounds [113] that we can approximate the number of points in every heavy cell up to a multiplicative error of $(1 \pm \epsilon)$ just using our point samples. The approximations will furthermore be good enough to detect the heavy cells and to construct an $\epsilon$-coreset in the same way as described before.
Definition 5.6.1 (Considered as Heavy) For each cell $C$ in grid $G_i$ we define $n_C$ as the number of points in $C$. We define our estimation on the number of points as
\[ \widetilde{n_C} := |S_i \cap C| \cdot \frac{1}{p_i}. \]
A cell $C$ in grid $G_i$ is considered as heavy, if
\[ \widetilde{n_C} \ge (1 - \epsilon) \cdot \delta \cdot 2^i. \]
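The estimate and the heaviness test can be illustrated as follows. This is a sketch under simplifying assumptions: it samples fully independently with Python's pseudo-random generator (the text only requires $\alpha$-wise independence), points are coordinate tuples in $[0, 1)^d$, and all function and parameter names are hypothetical.

```python
import random
from collections import Counter


def cell_of(point, level):
    """Index of the cell of side length 2**-level that contains a point of [0, 1)^d."""
    return tuple(int(x * 2 ** level) for x in point)


def estimate_cell_counts(points, level, alpha, delta, rng):
    """Sample every point with probability p_i and scale the per-cell hit counts by 1/p_i."""
    p_i = min(alpha / (delta * 2 ** level), 1.0)
    sample = [pt for pt in points if rng.random() < p_i]
    hits = Counter(cell_of(pt, level) for pt in sample)
    return {cell: count / p_i for cell, count in hits.items()}


def heavy_cells(estimates, level, delta, eps):
    """Cells whose estimate reaches the threshold (1 - eps) * delta * 2**level."""
    threshold = (1 - eps) * delta * 2 ** level
    return {cell for cell, est in estimates.items() if est >= threshold}
```

For coarse grids, where `p_i` equals 1, every point is sampled and the estimates are exact counts.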
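Lemma 5.6.2 The following events hold with probability at least $1 - \rho/2$ for all grids $G_i$ and each grid cell in $G_i$:
• If $i \le \log(\frac{2}{\epsilon\delta})$, then $\widetilde{n_C} = n_C$.
• If $C$ contains at least $\delta \cdot 2^{i-1}$ points, then $(1-\epsilon) \cdot n_C \le \widetilde{n_C} \le (1+\epsilon) \cdot n_C$.
• If $C$ contains less than $\delta \cdot 2^{i-1}$ points, then $\widetilde{n_C} < (1-\epsilon) \cdot \delta \cdot 2^i$ (and the cell $C$ is not considered as heavy).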
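Proof: If $i \le \log(\frac{2}{\epsilon\delta})$ then $p_i = 1$, the sample set equals the point set, and $\widetilde{n_C} = n_C$.
Let $C$ be an arbitrary grid cell in $G_i$. To prove the last two statements for the single cell $C$ we use Theorem 1 of Section 2.4.
For each point $p \in P$ let $X_p$ denote the indicator random variable for the event that $p \in S_i$. We want to show that $\sum_{p \in C} X_p$ does not deviate much from its expectation. If a cell contains at least $\delta \cdot 2^{i-1}$ points then $E[\sum_{p \in C} X_p] \ge \alpha/2$. From Theorem 1 it follows:
\[ \Pr\left[ \left| \sum_{p \in C} X_p - E\Big[\sum_{p \in C} X_p\Big] \right| \ge \epsilon \cdot E\Big[\sum_{p \in C} X_p\Big] \right] \le e^{-\min\{\lfloor \alpha/2 \rfloor, \lfloor \epsilon^2 \alpha / 6 \rfloor\}} \]
Plugging in $\sum_{p \in C} X_p = \widetilde{n_C} \cdot p_i$ and $E[\sum_{p \in C} X_p] = n_C \cdot p_i$ we obtain
\[ \Pr\big[ |\widetilde{n_C} - n_C| \ge \epsilon \cdot n_C \big] \le \frac{\rho}{2 \cdot Z \cdot 2^{Z \cdot d}}, \]
and the second statement follows with probability $1 - \frac{\rho}{2 \cdot Z \cdot 2^{Z \cdot d}}$.
Assume that $C$ contains at most $\delta \cdot 2^{i-1}$ points. If $C$ contains exactly $\delta \cdot 2^{i-1}$ points, we can conclude from the formula above that
\[ \widetilde{n_C} \le (1+\epsilon) \cdot n_C < (1 + 1/3) \cdot \delta \cdot 2^{i-1} = (1 - 1/3) \cdot \delta \cdot 2^i < (1-\epsilon) \cdot \delta \cdot 2^i \]
holds with probability $1 - \rho/(2 \cdot Z \cdot 2^{Zd})$. We observe that the distribution of $\widetilde{n_C}$ shifts towards lower values when the number of points in the cell decreases, which means that $\Pr[\widetilde{n_C} < (1-\epsilon)\delta \cdot 2^i] \ge 1 - \rho/(2 \cdot Z \cdot 2^{Zd})$ also holds for smaller numbers of points in $C$.
We conclude that the two statements are valid for one fixed single cell $C$ with probability $1 - \rho/(2 \cdot Z \cdot 2^{Zd})$. Since we have at most $Z$ grids, each grid having at most $2^{Z \cdot d}$ cells, the two statements are valid with probability $1 - \rho/2$ for all cells in all grids by the union bound.
□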
If $\widetilde{n_C} \ge (1-\epsilon) \cdot \delta \cdot 2^i$, a cell $C \in G_i$ is considered as heavy. This way, we detect every heavy cell but we also consider some light cells as heavy.
We then compute a coreset by introducing a coreset point in each cell considered as heavy (as described in Section 5.2.1). This will increase the size of our coreset. The following corollaries show that the size of the coreset is still logarithmic in $n$.
Corollary 5.6.3 Assume that the statements of Lemma 5.6.2 hold for all cells in all grids (which happens with probability $1 - \rho/2$).
If $\delta \ge \frac{\epsilon^{d+1} \cdot Opt}{8 \cdot k \cdot 10^d \cdot (1+\log n) \cdot d^{(d+1)/2}}$, the size of the computed coreset is at most $\frac{33 \cdot k \cdot 10^d \cdot (2+\log n) \cdot d^{d/2}}{\epsilon^{d+1}} = O(k \cdot \log n / \epsilon^{d+1})$.
Proof: We can easily modify the proof of Lemma 5.2.8 by plugging in $\delta/2$ for the old value of $\delta$. The proof stays exactly the same and we can conclude that the size of the coreset is at most $\frac{4 \cdot Opt}{\delta \cdot \sqrt{d}} + (\log n + 2) \cdot k \cdot 3^d \cdot d^{d/2}$, which is smaller than the stated coreset size for $\delta \ge \frac{\epsilon^{d+1} \cdot Opt}{8 \cdot k \cdot 10^d \cdot (1+\log n) \cdot d^{(d+1)/2}}$.
□
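The arithmetic behind the stated size can be checked directly. The following worked step is a sketch that takes equality in the condition on $\delta$ (the worst case) and additionally assumes $\epsilon \le 1$, so that $3^d \le 10^d / \epsilon^{d+1}$:

\[
\frac{4 \cdot Opt}{\delta \cdot \sqrt{d}}
= \frac{4 \cdot 8 \cdot k \cdot 10^d \cdot (1+\log n) \cdot d^{(d+1)/2}}{\epsilon^{d+1} \cdot \sqrt{d}}
= \frac{32 \cdot k \cdot 10^d \cdot (1+\log n) \cdot d^{d/2}}{\epsilon^{d+1}},
\qquad
(\log n + 2) \cdot k \cdot 3^d \cdot d^{d/2} \le \frac{k \cdot 10^d \cdot (2+\log n) \cdot d^{d/2}}{\epsilon^{d+1}}.
\]

Adding the two terms and bounding $1 + \log n$ by $2 + \log n$ gives the factor $32 + 1 = 33$ of the corollary.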
An important property of our sample technique is that although the sample can be large, it just
occupies a small number of cells (and can, as we show later, be stored efficiently).
Lemma 5.6.4 Let δd+1·Opt
8·k·10d·(1+log n)·d(d+1)/2 . Then we have points from at most
193 ·Z·k·10d·(1+log n)·dd/2 ·ln(2·Z·2Zd)
d+3=e
O(k·log n·log2(e
)·log(ρ1)/d+3)
cells in the union of our sample sets with probability at least 1ρ/2.
Proof : Let Gibe a fixed grid. We determine an upper bound on the number of points in non-
center grid cells. Let us recall from the proof of Lemma 5.2.8 that every point except for those
contained in the k·3d·dd/2 center cells has a distance of at least d
2i+1to the nearest center in an
optimal solution. Thus the overall number of points in non-center cells is at most Opt·2i+1
d. Let Xp
denote the indicator random variable for the event that pSi. Let Ddenote the set of non-center
grid cells. We have E[Pp∈D Xp]pi·Opt·2i+1
d2·α·Opt
δ·d. We will assume E[Pp∈D Xp] = 2·α·Opt
δ·d
as the distribution of Pp∈D Xpdisplaces towards lower values when E[Pp∈D Xp]<2·α·Opt
δ·d.
Applying Theorem 1 from Section 2.4 we get
PrX
p∈D
Xp4·α·Opt
δ·dPrX
p∈D
XpE[X
p∈D
Xp]E[X
p∈D
Xp]
emin{bα/2c,b2α
δ/3c}ρ
2Z .
Therefore, with probability at least 1ρ/(2Z)we have at most 4·α·Opt
δ·dpoints from non-center
cells in our sample. If no two of these points are contained in the same grid cell we get an
upper bound of 4·α·Opt
δ·don the number of non-center cells that contain a sample point. Since there
are at most k·3d·dd/2 center cells the number of cells occupied by sample points is at most
4·α·Opt
δ·d+k·3d·dd/2 193·k·10d·(1+log n)·dd/2·ln(2·Z·2Zd)
d+3.
Since these arguments hold for each of the Zgrids with probability 1ρ/(2Z), the Lemma
follows from the union bound. 2
To obtain a coreset we use the estimations $\widetilde{n_C}$ of the number of points in cells to identify the cells we consider as heavy (all cells having $\widetilde{n_C} \ge (1-\epsilon) \delta \cdot 2^i$). Since all heavy cells are considered as heavy we obtain a finer coreset than before. We will now show how to find a good assignment of weights to the computed coreset points, such that the computed coreset is an $\epsilon$-coreset for $P$.
Since the weight of a coreset point will also depend on the number of points in some light cells, we have to estimate the number of points in these cells. To get an estimate for all required cells we use the following procedure. We require that the estimate $\widetilde{n_C}$ for the number of points in a cell considered as heavy is a $(1 \pm \epsilon)$-approximation and that in every cell $C \in G_i$ considered as light there are not more than $\delta \cdot 2^i$ points (our coreset construction uses only these assumptions, and they hold according to Lemma 5.6.2 with probability $1 - \rho/2$).
We call a cell useful, if it is either considered as heavy or a direct subcell of a cell considered as heavy. We have to deal with the fact that the sum $\sum_{C_i \text{ subcell of } C} \widetilde{n_{C_i}}$ of the estimated numbers of points in the subcells of $C$ can exceed the estimated number of points $\widetilde{n_C}$ in $C$. To avoid this we have to compute new integral estimates $E_C$ for the number of points in each useful cell $C$, which still have the guarantee to be near the real value $n_C$ and which are consistent with the values $E_{C_i}$ of the subcells $C_i$ of $C$. We do this by first computing upper and lower bounds $U_C$ resp. $L_C$ on $n_C$ for all useful cells. We will then adjust these bounds to be consistent with the bounds for the subcells. Finally we will use the bounds to compute new estimates $E_C$.
For $i > \log(\frac{2}{\epsilon\delta})$ and every cell $C \in G_i$ considered as heavy we define $L_C = \lceil \widetilde{n_C}/(1+\epsilon) \rceil$ and $U_C = \lfloor \widetilde{n_C}/(1-\epsilon) \rfloor$. For $i > \log(\frac{2}{\epsilon\delta})$ and every cell $C \in G_i$ considered as light we define $L_C = 0$ and $U_C = \lfloor \delta 2^i \rfloor$. For $i \le \log(\frac{2}{\epsilon\delta})$ and every cell $C \in G_i$ we define $L_C = \widetilde{n_C}$ and $U_C = \widetilde{n_C}$ (since we know the number of points in $C$ exactly). Using these definitions we know for every cell that $L_C \le n_C \le U_C$.
The estimates $E_C$ can be computed bottom-up by adjusting the bounds $L_C$ and $U_C$ in cases of conflicts:
We first compute new lower and upper bounds $L_C$ and $U_C$ for all useful cells bottom-up. We look at the smallest cell $C$ considered as heavy. Let $C_i$, $i \in \{1, ..., 2^d\}$, be its subcells. If $\sum_{i=1}^{2^d} L_{C_i} > L_C$, we set $L_C := \sum_{i=1}^{2^d} L_{C_i}$. If $\sum_{i=1}^{2^d} U_{C_i} < U_C$, we set $U_C := \sum_{i=1}^{2^d} U_{C_i}$. After the assignment $L_C \le n_C \le U_C$ still holds. We use this technique for all cells considered as heavy (in the order of increasing size), getting better bounds $L_C$ and $U_C$. From these bounds we then compute the values $E_C$ top-down. Since the bounds $L_C$ and $U_C$ are always at least as strong as the bounds of the subcells, we can always easily find integral values $E_C$ satisfying $L_C \le E_C \le U_C$ and $\sum_{i=1}^{2^d} E_{C_i} = E_C$.
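The two passes can be sketched as follows. This is an illustration rather than the construction used in the thesis: the `Cell` class, the choice of the midpoint for the top-level estimate, and the greedy way of distributing a parent's estimate among its subcells are assumptions; any integral choice within the tightened bounds would serve.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    lower: int                      # L_C
    upper: int                      # U_C
    children: List["Cell"] = field(default_factory=list)
    estimate: int = 0               # E_C, filled in by assign()


def tighten(cell):
    """Bottom-up pass: make the bounds of a cell at least as strong as those of its subcells."""
    if not cell.children:
        return
    for child in cell.children:
        tighten(child)
    cell.lower = max(cell.lower, sum(ch.lower for ch in cell.children))
    cell.upper = min(cell.upper, sum(ch.upper for ch in cell.children))


def assign(cell, value):
    """Top-down pass: give the cell the estimate `value` and split it among the subcells."""
    cell.estimate = value
    if not cell.children:
        return
    shares = [ch.lower for ch in cell.children]
    rest = value - sum(shares)
    for k, ch in enumerate(cell.children):
        extra = min(rest, ch.upper - ch.lower)
        shares[k] += extra
        rest -= extra
    assert rest == 0, "bounds were not tightened consistently"
    for ch, share in zip(cell.children, shares):
        assign(ch, share)


def consistent_estimates(root):
    tighten(root)
    assign(root, (root.lower + root.upper) // 2)
    return root
```

After `tighten`, every value between the bounds of a cell also lies between the sums of the lower and upper bounds of its subcells, which is why the greedy split in `assign` never runs out of room.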
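Corollary 5.6.5 Assume that the statements of Lemma 5.6.2 are true for all grids and all cells.
Then for each cell $C$ identified as heavy we have $(1-4\epsilon) n_C \le E_C \le (1+4\epsilon) n_C$.
For each cell $C \in G_i$ with $i \le \log(\frac{2}{\epsilon\delta})$ we have $E_C = n_C$.
All estimates $E_C$ are integral and consistent with the estimates $E_{C_i}$ for the subcells $C_i$ of $C$.
Proof: The claim follows directly from the following two sequences of inequalities.
\[ E_C \ge L_C \ge \widetilde{n_C}/(1+\epsilon) \ge \frac{1-\epsilon}{1+\epsilon} n_C \ge (1-2\epsilon) n_C \]
and
\[ E_C \le U_C \le \widetilde{n_C}/(1-\epsilon) \le \frac{1+\epsilon}{1-\epsilon} n_C \le (1+4\epsilon) n_C. \]
□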
We now apply the algorithm described in Section 5.2 to our estimations $E_C$ and compute a coreset.
Lemma 5.6.6 If $\delta \le \frac{\epsilon^{d+1} \cdot Opt}{4 \cdot k \cdot 10^d \cdot (1+\log n) \cdot d^{(d+1)/2}}$ and $\epsilon < 1/15$, the coreset computed with respect to the values $E_C$ is an $11\epsilon$-coreset of $P$ with probability $1 - \rho$.
Proof: Let $P'$ be a point set that is distributed according to our estimations $E_C$ (so for every useful cell $C$ we have $|P' \cap C| = E_C$). The proof of Lemma 5.2.5 shows that the coreset computed by our algorithm is an $\epsilon$-coreset for $P'$. Let $Q = \{q_1, \dots, q_m\}$ be the computed coreset points. We will show that if we knew the point sets $P$ and $P'$, we could (using the coreset method) compute mappings $\gamma: P \to Q$ and $\gamma': P' \to Q$ and corresponding weight functions $w: Q \to \mathbb{N}$ and $w': Q \to \mathbb{N}$, such that $(Q, w)$ is an $\epsilon$-coreset for $P$ and $(Q, w')$ is an $\epsilon$-coreset for $P'$ and for all $q_i \in Q$ we have:
\[ (1-4\epsilon)(w(q_i) - 1) \le w'(q_i) \le (1+4\epsilon) w(q_i) + 1 \qquad (5.2) \]
and
\[ w(q_i) \le \frac{1}{\epsilon} \ \Rightarrow\ w(q_i) = w'(q_i). \qquad (5.3) \]
From that we will conclude that each solution on the point set $P'$ differs by at most a factor of $(1+O(\epsilon))$ from the solution on the point set $P$. Since the computed coreset is an $\epsilon$-coreset for $P'$ it follows that it is an $O(\epsilon)$-coreset for $P$.
Let us construct the mappings $\gamma$ and $\gamma'$. Lemma 5.2.5 shows that we construct a $(1+\epsilon)$-coreset when we map each point $p$ to a coreset point in the smallest heavy cell it is contained in. We start the assignment of points to coreset points within the smallest useful cells. Since the smallest useful cells are not heavy we do not assign any points to them. We proceed to assign points in the useful cells at the next higher level. Going through the levels bottom-up we will assign all points in useful cells and maintain the invariants (5.2) and (5.3).
Let $C$ be a cell considered as heavy. If there is no subcell considered as heavy, the algorithm introduces a new coreset point $q$. We map all $E_C$ points from $P'$ to $q$ and all $n_C$ points from $P$ to $q$. Then $w(q) = n_C$ and $w'(q) = E_C$. Notice that $w(q) < \frac{1}{\epsilon}$ can only happen for a coreset point $q$ when a cell in grid $G_i$ or a subcell is considered as heavy. Then $\delta \cdot 2^{i-1} < \frac{1}{\epsilon}$ according to Lemma 5.6.2. This only happens in grids $G_i$ with $i < \log(\frac{2}{\epsilon\delta})$ where $E_C = n_C$ (Corollary 5.6.5) and therefore $w(q) = w'(q)$ after the assignment. This shows that invariant (5.3) holds. Invariant (5.2) follows directly from Corollary 5.6.5.
Let us now consider the case that $C \in G_i$ has already $c$ coreset points $q_1, \dots, q_c \in Q$ with weights $w(q_i)$ and $w'(q_i)$, respectively, and let us assume that the invariants (5.2) and (5.3) hold for all these coreset points. Let $l := n_C - \sum_{i=1}^{c} w(q_i)$ resp. $l' := E_C - \sum_{i=1}^{c} w'(q_i)$ be the number of points which have to be assigned to the coreset points $q_i$ by $\gamma$ resp. $\gamma'$.
We consider six cases:
• $l = 0$ and $l' = 0$: In this case nothing has to be assigned and the invariant holds by the assumption of the induction step.
• $l > 0$ and $l' = 0$ and $i \ge \log(\frac{2}{\epsilon\delta})$: Each cell considered as heavy has at least $\delta \cdot 2^{i-1}$ points according to Lemma 5.6.2 (the threshold can only have been higher during the coreset constructions so far). Therefore each coreset point must have a weight of at least $\delta \cdot 2^{i-1} \ge \frac{1}{\epsilon}$. It remains to show invariant (5.2). We have
\[ (1-4\epsilon) \sum_{i=1}^{c} w(q_i) = (1-4\epsilon)(n_C - l) < (1-4\epsilon) n_C \le E_C = \sum_{i=1}^{c} w'(q_i). \]
Therefore for one $q_i$ we have $(1-4\epsilon) w(q_i) < w'(q_i)$ and we can assign at least one point from $P$ to $q_i$ by $\gamma$ without violating invariant (5.2). After that assignment either $l = 0$ or we find again a $q_i$ we can assign points to. We go on with this assignment until $l = 0$.
• $l = 0$ and $l' > 0$ and $i \ge \log(\frac{2}{\epsilon\delta})$: Again we only have to show invariant (5.2). We have
\[ \sum_{i=1}^{c} w'(q_i) = E_C - l' < E_C \le (1+4\epsilon) n_C = (1+4\epsilon)(n_C - l) = (1+4\epsilon) \sum_{i=1}^{c} w(q_i). \]
Therefore for one $q_i$ we have $w'(q_i) < (1+4\epsilon) w(q_i)$ and we can assign at least one point from $P'$ to $q_i$ by $\gamma'$ without violating the invariant. After that assignment either $l' = 0$ or we again find a $q_i$ we can assign points to. We go on with this assignment until $l' = 0$.
• $l > 0$ and $l' = 0$ and $i < \log(\frac{2}{\epsilon\delta})$: In this case we have $E_C = n_C$ and
\[ \sum_{i=1}^{c} w(q_i) = n_C - l < n_C = E_C = \sum_{i=1}^{c} w'(q_i). \]
Since all $w(q_i)$ and $w'(q_i)$ are integral, we have $w(q_i) \le w'(q_i) - 1$ for one $q_i$ and we can assign at least one point from $P$ to $q_i$ by $\gamma$ without violating the invariants. After that assignment either $l = 0$ or we find again a $q_i$ we can assign points to. We go on with this assignment until $l = 0$.
• $l = 0$ and $l' > 0$ and $i < \log(\frac{2}{\epsilon\delta})$: Again we have $E_C = n_C$ and
\[ \sum_{i=1}^{c} w'(q_i) = E_C - l' < E_C = n_C = (n_C - l) = \sum_{i=1}^{c} w(q_i). \]
Therefore for one $q_i$ we have $w'(q_i) \le w(q_i) - 1$ and can assign at least one point from $P'$ to $q_i$ by $\gamma'$ without violating the invariant. After that assignment either $l' = 0$ or we again find a $q_i$ we can assign points to. We go on with this assignment until $l' = 0$.
• $l > 0$ and $l' > 0$: We assign $\min\{l, l'\}$ points from $P$ to $q_1$ by $\gamma$ and $\min\{l, l'\}$ points from $P'$ to $q_1$ by $\gamma'$. This does not violate the invariant. After the assignment we are in one of the other cases.
After the inductive assignment we have constructed mappings $\gamma$ and $\gamma'$ and corresponding weight functions $w, w'$, such that invariants (5.2) and (5.3) hold. Using the invariants it is easy to show that for all coreset points $q$:
\[ (1-5\epsilon) w(q) \le w'(q) \le (1+5\epsilon) w(q). \qquad (5.4) \]
If $w(q) \le \frac{1}{\epsilon}$ the inequality follows from invariant (5.3). If $w(q) > \frac{1}{\epsilon}$ we have:
\[ w'(q) \ge (1-4\epsilon)(w(q) - 1) \ge (1-4\epsilon)(w(q) - \epsilon \cdot w(q)) = (1 - 5\epsilon + 4\epsilon^2) w(q) \ge (1-5\epsilon) w(q) \]
and
\[ w'(q) \le (1+4\epsilon) w(q) + 1 \le (1+4\epsilon) w(q) + \epsilon \cdot w(q) = (1+5\epsilon) w(q) \]
and inequality (5.4) follows.
Let $A$ denote the coreset computed by our sample algorithm. Since $A$ is an $\epsilon$-coreset for $P'$ we know for each set of centers $C$:
\[ Median(A, C) \in (1 \pm \epsilon) \cdot Median(P', C). \]
From the arguments above we know that
\[ Median(P', C) \in \frac{1}{1 \pm \epsilon} \cdot Median((Q, w'), C) \subseteq (1 \pm 2\epsilon) \cdot Median((Q, w'), C). \]
Since the weights $w'(q)$ and $w(q)$ of each coreset point $q \in Q$ differ by at most $5\epsilon \cdot w(q)$, we can conclude:
\[ Median((Q, w'), C) \in (1 \pm 5\epsilon) \cdot Median((Q, w), C). \]
Since $(Q, w)$ is an $\epsilon$-coreset for $P$, we obtain:
\[ Median((Q, w), C) \in (1 \pm \epsilon) \cdot Median(P, C). \]
Altogether we get for $\epsilon < 1/15$:
\[ Median(A, C) \in (1 \pm \epsilon)^2 \cdot (1 \pm 2\epsilon) \cdot (1 \pm 5\epsilon) \cdot Median(P, C) \subseteq (1 \pm 11\epsilon) \cdot Median(P, C). \]
□
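The last containment is easy to check numerically. The sketch below (function names are illustrative) multiplies the four error factors of the chain and compares the result with the interval $[1 - 11\epsilon, 1 + 11\epsilon]$:

```python
def composed_error_interval(eps):
    """Smallest and largest factor reachable by (1 ± eps)^2 * (1 ± 2 eps) * (1 ± 5 eps)."""
    lower = (1 - eps) ** 2 * (1 - 2 * eps) * (1 - 5 * eps)
    upper = (1 + eps) ** 2 * (1 + 2 * eps) * (1 + 5 * eps)
    return lower, upper


def within_eleven_eps(eps):
    """True if the composed interval is contained in [1 - 11 eps, 1 + 11 eps]."""
    lower, upper = composed_error_interval(eps)
    return 1 - 11 * eps <= lower and upper <= 1 + 11 * eps
```

For values of `eps` below 1/15 the check succeeds, while for a larger value such as 0.1 the upper factor already exceeds $1 + 11\epsilon$, which shows why a bound on $\epsilon$ is needed.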
5.6.2 k-Means
In this section we will adapt the sampling technique of the last section to the problem k-means.
The proofs will be very similar to the proofs of the last section.
We again consider Znested grids G0,...,GZ1for Z=llog 8·k·33d+1·n(1+log n)·d(d+2)/2·
e
d+2m+
1, grid Gihaving cell side length 1
2i. In each grid Giwe pick a random sample Siof points. To
select our random sample we take every point with probability
\[ p_i = \min\left\{ \frac{\alpha}{\delta \cdot 4^i},\ 1 \right\} \]
into our sample $S_i$, where
\[ \alpha = 6 \cdot \epsilon^{-2} \cdot \ln(2 \cdot Z \cdot 2^{Z \cdot d} / \rho) \]
and $\rho$ is the desired error probability of our algorithm. The sampling is done at least $\alpha$-wise independently, which means that for each set $A \subseteq P$ of at most $\alpha$ points and each partition $\{B, C\}$ of $A$:
\[ \Pr[B \subseteq S_i \wedge C \cap S_i = \emptyset] = (p_i)^{|B|} \cdot (1 - p_i)^{|C|}. \]
Essentially this means that for each subset $A \subseteq P$ of size $\alpha$ the sampling is done independently.
We will show that it follows from a variant of Chernoff bounds [113] that we can approximate the number of points in every heavy cell up to a multiplicative error of $(1 \pm \epsilon)$ just using our point samples. The approximations will furthermore be good enough to detect the heavy cells and to construct an $\epsilon$-coreset in the same way as described before.
Definition 5.6.7 (Considered as Heavy) For each cell $C$ in grid $G_i$ we define $n_C$ as the number of points in $C$. We define our estimation on the number of points as
\[ \widetilde{n_C} := |S_i \cap C| \cdot \frac{1}{p_i}. \]
A cell $C$ in grid $G_i$ is considered as heavy, if
\[ \widetilde{n_C} \ge (1 - \epsilon) \cdot \delta \cdot 4^i. \]
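Lemma 5.6.8 The following events hold with probability at least $1 - \rho/2$ for all grids $G_i$ and each grid cell in $G_i$:
• If $i \le \log_4(\frac{2}{\epsilon\delta})$, then $\widetilde{n_C} = n_C$.
• If $C$ contains at least $\delta \cdot 4^i/2$ points, then $(1-\epsilon) \cdot n_C \le \widetilde{n_C} \le (1+\epsilon) \cdot n_C$.
• If $C$ contains less than $\delta \cdot 4^i/2$ points, then $\widetilde{n_C} < (1-\epsilon) \cdot \delta \cdot 4^i$ (and the cell $C$ is not considered as heavy).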
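Proof: If $i \le \log_4(\frac{2}{\epsilon\delta})$ then $p_i = 1$, the sample set equals the point set, and $\widetilde{n_C} = n_C$.
Let $C$ be an arbitrary grid cell in $G_i$. To prove the last two statements for the single cell $C$ we use Theorem 1 of Section 2.4.
For each point $p \in P$ let $X_p$ denote the indicator random variable for the event that $p \in S_i$. We want to show that $\sum_{p \in C} X_p$ does not deviate much from its expectation. If a cell contains at least $\delta \cdot 4^i/2$ points then $E[\sum_{p \in C} X_p] \ge \alpha/2$. From Theorem 1 it follows:
\[ \Pr\left[ \left| \sum_{p \in C} X_p - E\Big[\sum_{p \in C} X_p\Big] \right| \ge \epsilon \cdot E\Big[\sum_{p \in C} X_p\Big] \right] \le e^{-\min\{\lfloor \alpha/2 \rfloor, \lfloor \epsilon^2 \alpha / 6 \rfloor\}} \]
Plugging in $\sum_{p \in C} X_p = \widetilde{n_C} \cdot p_i$ and $E[\sum_{p \in C} X_p] = n_C \cdot p_i$ we obtain
\[ \Pr\big[ |\widetilde{n_C} - n_C| \ge \epsilon \cdot n_C \big] \le \frac{\rho}{2 \cdot Z \cdot 2^{Z \cdot d}}, \]
and the second statement follows with probability $1 - \frac{\rho}{2 \cdot Z \cdot 2^{Z \cdot d}}$.
Assume that $C$ contains at most $\delta \cdot 4^i/2$ points. If $C$ contains exactly $\delta \cdot 4^i/2$ points, we can conclude from the formula above that
\[ \widetilde{n_C} \le (1+\epsilon) \cdot n_C < (1 + 1/3) \cdot \delta \cdot 4^i/2 = (1 - 1/3) \cdot \delta \cdot 4^i < (1-\epsilon) \cdot \delta \cdot 4^i \]
holds with probability $1 - \rho/(2 \cdot Z \cdot 2^{Zd})$. We observe that the distribution of $\widetilde{n_C}$ shifts towards lower values when the number of points in the cell decreases, which means that $\Pr[\widetilde{n_C} < (1-\epsilon)\delta \cdot 4^i] \ge 1 - \rho/(2 \cdot Z \cdot 2^{Zd})$ also holds for smaller numbers of points in $C$.
We conclude that the two statements are valid for one fixed single cell $C$ with probability $1 - \rho/(2 \cdot Z \cdot 2^{Zd})$. Since we have at most $Z$ grids, each grid having at most $2^{Z \cdot d}$ cells, the two statements are valid with probability $1 - \rho/2$ for all cells in all grids by the union bound.
□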
If $\widetilde{n_C} \ge (1-\epsilon) \cdot \delta \cdot 4^i$, a cell $C \in G_i$ is considered as heavy. This way, we detect every heavy cell but we also consider some light cells as heavy.
We then compute a coreset by introducing a coreset point in each cell considered as heavy (as described in Section 5.3.1). This will increase the size of our coreset. The following corollaries show that the size of the coreset is still logarithmic in $n$.
Corollary 5.6.9 Assume that the statements of Lemma 5.6.8 hold for all cells in all grids.
If $\delta \ge \frac{\epsilon^{d+2} \cdot Opt}{16 \cdot k \cdot 33^{d+1} \cdot (1+\log n) \cdot d^{(d+2)/2}}$, the size of the computed coreset is at most
\[ \frac{129 \cdot k \cdot 33^{d+1} \cdot (1+\log n) \cdot d^{d/2}}{\epsilon^{d+2}} = O(k \cdot \log n / \epsilon^{d+2}). \]
Proof: We can easily modify the proof of Lemma 5.3.8 by plugging in $\delta/2$ for the old value of $\delta$. The proof stays exactly the same and we can conclude that the size of the coreset is at most $\frac{8 \cdot Opt}{\delta \cdot d} + (\log n + 1) \cdot k \cdot 3^d \cdot d^{d/2}$, which is smaller than the stated coreset size for $\delta \ge \frac{\epsilon^{d+2} \cdot Opt}{16 \cdot k \cdot 33^{d+1} \cdot (1+\log n) \cdot d^{(d+2)/2}}$.
□
An important property of our sample technique is that although the sample can be large, it just
occupies a small number of cells (and can, as we show later, be stored efficiently).
Lemma 5.6.10 Let δd+2·Opt
16·k·33d+1·(1+log n)·d(d+2)/2 . Then we have points from at most
769 ·Z·k·33d+1·(1+log n)·dd/2 ·ln(2·Z·2Zd)
d+4=e
O(k·log n·log2(e
)·log(ρ1)/d+4)
cells in the union of our sample sets with probability at least 1ρ/2.
Proof : Let Gibe a fixed grid. We determine an upper bound on the number of points in
non-center grid cells. Let us recall from the proof of Lemma 5.3.8 that every point except for
those contained in the k·3d·dd/2 center cells has a distance of at least d
2i+1to the nearest center
in an optimal solution and therefore contributes with at least d
4i+1to the optimal solution. Thus
the overall number of points in non-center cells is at most Opt·4i+1
d. Let Xpdenote the indicator
random variable for the event that pSi. Let Ddenote the set of non-center grid cells. We have
E[Pp∈D Xp]pi·Opt·4i+1
d4·α·Opt
δ·d. We will assume E[Pp∈D Xp] = 4·α·Opt
δ·das the distribution
of Pp∈D Xpdisplaces towards lower values when E[Pp∈D Xp]<4·α·Opt
δ·d.
Applying Theorem 1 of Section 2.4 we get
PrX
p∈D
Xp8·α·Opt
δ·dPrX
p∈D
XpE[X
p∈D
Xp]E[X
p∈D
Xp]
emin{bα/2c,b2α
δ/3c}ρ
2Z .
Therefore, with probability at least 1ρ/(2Z)we have at most 8·α·Opt
δ·dpoints from non-center
cells in our sample. Since in the worst case no two of these points are contained in the same grid
cell we get an upper bound of 8·α·Opt
δ·don the number of non-center cells that contain a sample
point. Since there are at most k·3d·dd/2 center cells the number of cells occupied by sample
points is at most 8·α·Opt
δ·d+k·3d·dd/2 769·k·33d+1·(1+log n)·dd/2·ln(2·Z·2Zd)
d+4.
Since these arguments hold for each of the Zgrids with probability 1ρ/(2Z), the Lemma
follows from the union bound. 2
To obtain a coreset we use the estimations f
nCof the number of points in heavy cells to identify
the cells we consider as heavy (all cells having f
nC(1)δ 4i). Since all heavy cells are
considered as heavy we obtain a finer coreset than before. We will now show how to find a
good assignment of weights to the computed coreset points, such that the computed coreset is an
-coreset for P.
Since the weight of a coreset point will also depend on the number of points in some light
cells, we have to estimate the number of points in these cells. To get an estimate for all required
cells we use the following procedure. We require that the estimate f
nCfor the number of points
in a heavy cell is a (1±)-approximation and that in every cell C Giconsidered as light there
are not more than δ·4ipoints (our coreset construction uses only these assumptions, and they
hold according to Lemma 5.6.8 with probability 1ρ/2).
We call a cell useful, if it is either considered as heavy or a direct subcell of a cell considered
as heavy. We have to deal with the fact that the sum of the total estimated number of points
PCisubcell of Cf
nCiin the subcells of Ccan exceed the estimated number of points f
nCin C. To
avoid this we have to compute new integral estimates ECfor the number of points in each useful
cell C, which still have the guarantee to be near the real value nCand which are consistent with
the values ECiof the subcells of EC. We do this by first computing upper and lower bounds UC
resp. LCon nCfor all useful cells. We will then adjust these bounds to be consistent with the
bounds for the subcells. Finally we will use the bounds to compute new estimates EC.
For i > log4(2
δ )and every cell C Giconsidered as heavy we define LC=df
nC/(1+)eand
UC=bf
nC/(1)c. For i > log4(2
δ )and every cell C Giconsidered as light we define LC=0
and UC=bδ4ic. For ilog4(2
δ )and every cell C Giwe define LC=f
nCand UC=f
nC(since
we know the number of points in Cexactly). Using these definitions we know for every cell that
LCnCUC).
The estimates ECcan be computed bottom-up by adjusting the bounds LCand UCin cases of
conflicts:
We first compute new lower and upper bounds LCand UCfor all useful cells bottom-up.
We look at the smallest cell Cconsidered as heavy. Let Ci, i {1, ..., 2d}be its subcells. If
P2d
i=1LCi> LC, we set LC:= P2d
i=1LCi. If P2d
i=1UCi< UC, we set UC:= P2d
i=1UCi. After the
assignment LCnCUCstill holds. We use this technique for all cells considered as heavy
(in the order of increasing size), getting better bounds LCand UC. From these bounds we then
compute the values ECtop-down. Since the bounds LCand UCare always at least as strong as the
bounds of the subcells, we can always easily find integral values ECsatisfying LCECUC
and P2d
i=1ECi=EC.
Corollary 5.6.11 Assume that the statements of Lemma 5.6.8 are true for all grids and all cells.
Then for each cell $C$ identified as heavy we have $(1-4\epsilon) n_C \le E_C \le (1+4\epsilon) n_C$.
For each cell $C \in G_i$ with $i \le \log_4(\frac{2}{\epsilon\delta})$ we have $E_C = n_C$.
All estimates $E_C$ are integral and consistent with the estimates $E_{C_i}$ for the subcells $C_i$ of $C$.
Proof: The proof follows exactly the proof of Corollary 5.6.5. □
We now apply the algorithm described in Section 5.3 to our estimations $E_C$ and compute a coreset.
Lemma 5.6.12 If $\delta \le \frac{\epsilon^{d+2} \cdot Opt}{8 \cdot k \cdot 33^{d+1} \cdot (1+\log n) \cdot d^{(d+2)/2}}$ and $\epsilon < 1/15$, the coreset computed with respect to the values $E_C$ is an $11\epsilon$-coreset of $P$ with probability $1 - \rho$.
Proof: Let $P'$ be a point set that is distributed according to our estimations $E_C$ (so for every useful cell $C$ we have $|P' \cap C| = E_C$). The proof of Lemma 5.3.5 shows that the coreset computed by our algorithm is an $\epsilon$-coreset for $P'$. Let $Q = \{q_1, \dots, q_m\}$ be the computed coreset points. Following exactly the proof of Lemma 5.6.6 we can show that if we knew the point sets $P$ and $P'$, we could (using our coreset method) compute mappings $\gamma: P \to Q$ and $\gamma': P' \to Q$ and corresponding weight functions $w: Q \to \mathbb{N}$ and $w': Q \to \mathbb{N}$, such that $(Q, w)$ is an $\epsilon$-coreset for $P$ and $(Q, w')$ is an $\epsilon$-coreset for $P'$ and for all $q_i \in Q$ we have:
\[ (1-5\epsilon) w(q_i) \le w'(q_i) \le (1+5\epsilon) w(q_i) \qquad (5.5) \]
Let $A$ denote the coreset computed by our sample algorithm. From Lemma 5.3.5 we know for each set of centers $C$:
\[ Means(A, C) \in (1 \pm \epsilon) \cdot Means(P', C). \]
From the arguments above we know that
\[ Means(P', C) \in \frac{1}{1 \pm \epsilon} \cdot Means((Q, w'), C) \subseteq (1 \pm 2\epsilon) \cdot Means((Q, w'), C). \]
Since the weights $w'(q)$ and $w(q)$ of each coreset point $q \in Q$ differ by at most $5\epsilon \cdot w(q)$, we can conclude:
\[ Means((Q, w'), C) \in (1 \pm 5\epsilon) \cdot Means((Q, w), C). \]
Since $(Q, w)$ is an $\epsilon$-coreset for $P$, we obtain:
\[ Means((Q, w), C) \in (1 \pm \epsilon) \cdot Means(P, C). \]
Altogether we get for $\epsilon < 1/15$:
\[ Means(A, C) \in (1 \pm \epsilon)^2 \cdot (1 \pm 2\epsilon) \cdot (1 \pm 5\epsilon) \cdot Means(P, C) \subseteq (1 \pm 11\epsilon) \cdot Means(P, C). \]
□
5.6.3 Oblivious Optimization Problems
We again consider Zgrids G0,...,GZ1for Z=llog (10`)d+1·n(1+log n)·d(d+1)/2·
e
d+1·λdm+1, grid Gi
having cell side length 1
2i. In each grid Giwe pick a random sample Siof points. To select our
random sample we take every point with probability
\[ p_i = \min\left\{ \frac{\alpha}{\delta \cdot 2^i},\ 1 \right\} \]
into our sample $S_i$, where
\[ \alpha = 6 \cdot \epsilon^{-2} \cdot \ln(2 \cdot Z \cdot 2^{Z \cdot d} / \rho) \]
and $\rho$ is the desired error probability of our algorithm. The sampling is done at least $\alpha$-wise independently, which means that for each set $A \subseteq P$ of at most $\alpha$ points and each partition $\{B, C\}$ of $A$:
\[ \Pr[B \subseteq S_i \wedge C \cap S_i = \emptyset] = (p_i)^{|B|} \cdot (1 - p_i)^{|C|}. \]
Essentially this means that for each subset $A \subseteq P$ of size $\alpha$ the sampling is done independently.
We will show that it follows from a variant of Chernoff bounds [113] that we can approximate the number of points in every heavy cell up to a multiplicative error of $(1 \pm \epsilon)$ just using our point samples. The approximations will furthermore be good enough to detect the heavy cells and to construct a coreset in the same way as described before.
Definition 5.6.13 (Considered as Heavy) For each cell $C$ in grid $G_i$ we define $n_C$ as the number of points in $C$. We define our estimation on the number of points as
\[ \widetilde{n_C} := |S_i \cap C| \cdot \frac{1}{p_i}. \]
A cell $C$ in grid $G_i$ is considered as heavy, if
\[ \widetilde{n_C} \ge (1 - \epsilon) \cdot \delta \cdot 2^i. \]
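Lemma 5.6.14 The following events hold with probability at least $1 - \rho/2$ for all grids $G_i$ and each grid cell in $G_i$:
• If $i \le \log(\frac{2}{\epsilon\delta})$, then $\widetilde{n_C} = n_C$.
• If $C$ contains at least $\delta \cdot 2^{i-1}$ points, then $(1-\epsilon) \cdot n_C \le \widetilde{n_C} \le (1+\epsilon) \cdot n_C$.
• If $C$ contains less than $\delta \cdot 2^{i-1}$ points, then $\widetilde{n_C} < (1-\epsilon) \cdot \delta \cdot 2^i$ (and the cell $C$ is not considered as heavy).
Proof: The proof is exactly the same as the proof of Lemma 5.6.2. □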
If $\widetilde{n_C} \ge (1-\epsilon) \cdot \delta \cdot 2^i$, a cell $C \in G_i$ is considered as heavy. This way, we detect every heavy cell but we also consider some light cells as heavy.
We then compute a coreset by introducing a coreset point in each cell considered as heavy (as described in Section 5.4.1). This will increase the size of our coreset. The following corollaries show that the size of the coreset is still logarithmic in $n$.
Corollary 5.6.15 Assume that the statements of Lemma 5.6.14 hold for all cells in all grids.
If $\delta \ge \frac{\epsilon^{d+1} \cdot \lambda^d \cdot Opt}{2 \cdot (10\ell)^{d+1} \cdot (1+\log n) \cdot d^{(d+1)/2}}$, the size of the computed coreset is at most $\frac{9 \cdot (10\ell)^{d+1} \cdot (2+\log n) \cdot d^{d/2}}{\lambda^{d+1} \cdot \epsilon^{d+1}} = O(\log n / \epsilon^{d+1})$.
Proof: We can easily modify the proof of Lemma 5.4.7 by plugging in $\delta/2$ for the old value of $\delta$. The proof stays exactly the same and we can conclude that the size of the coreset is at most $\frac{4 \cdot Opt}{\delta \cdot \lambda \cdot \sqrt{d}} + (\log n + 2) \cdot 3^d \cdot d^{d/2}$, which is smaller than the stated coreset size for $\delta \ge \frac{\epsilon^{d+1} \cdot \lambda^d \cdot Opt}{2 \cdot (10\ell)^{d+1} \cdot (1+\log n) \cdot d^{(d+1)/2}}$.
□
An important property of our sample technique is that although the sample can be large, it just
occupies a small number of cells (and can, as we show later, be stored efficiently).
Lemma 5.6.16 Let δd+1·λd·Opt
2·(10`)d+1·(1+log n)·d(d+1)/2 . Then we have points from at most
49 ·Z·(10`)d+1·(1+log n)·dd/2 ·ln(2·Z·2Zd)
d+3·λd+1=e
O(log n·log e
·log(ρ1)/d+3)
cells in the union of our sample sets with probability at least 1ρ/2.
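Proof: Let $G_i$ be a fixed grid. We determine an upper bound on the number of points in non-center grid cells. Let us recall from the proof of Lemma 5.4.7 that every point except for those contained in the $3^d \cdot d^{d/2}$ center cells has a distance of at least $\frac{\sqrt{d}}{2^{i+1}}$ to the center of gravity and contributes at least $\frac{\lambda \cdot \sqrt{d}}{2^{i+1}}$ to the cost of an optimal solution. Thus the overall number of points in non-center cells is at most $\frac{\mathrm{Opt} \cdot 2^{i+1}}{\lambda \cdot \sqrt{d}}$. Let $X_p$ denote the indicator random variable for the event that $p \in S_i$. Let $D$ denote the set of points in non-center grid cells. We have
$$E\Big[\sum_{p \in D} X_p\Big] \le p_i \cdot \frac{\mathrm{Opt} \cdot 2^{i+1}}{\lambda \cdot \sqrt{d}} \le \frac{2 \cdot \alpha \cdot \mathrm{Opt}}{\delta \cdot \lambda \cdot \sqrt{d}}.$$
We will assume $E[\sum_{p \in D} X_p] = \frac{2 \cdot \alpha \cdot \mathrm{Opt}}{\delta \cdot \lambda \cdot \sqrt{d}}$, as the distribution of $\sum_{p \in D} X_p$ shifts towards lower values when $E[\sum_{p \in D} X_p] < \frac{2 \cdot \alpha \cdot \mathrm{Opt}}{\delta \cdot \lambda \cdot \sqrt{d}}$.
Applying Theorem 1 of Section 2.4 we get
$$\Pr\Big[\sum_{p \in D} X_p \ge \frac{4 \cdot \alpha \cdot \mathrm{Opt}}{\delta \cdot \lambda \cdot \sqrt{d}}\Big] \le \Pr\Big[\sum_{p \in D} X_p - E\Big[\sum_{p \in D} X_p\Big] \ge E\Big[\sum_{p \in D} X_p\Big]\Big] \le e^{-\min\{\lfloor \alpha/2 \rfloor, \lfloor \frac{2\alpha}{\delta}/3 \rfloor\}} \le \frac{\rho}{2Z}.$$
Therefore, with probability at least $1-\rho/(2Z)$ we have at most $\frac{4 \cdot \alpha \cdot \mathrm{Opt}}{\delta \cdot \lambda \cdot \sqrt{d}}$ points from non-center cells in our sample. In the worst case no two of these points are contained in the same grid cell and we get an upper bound of $\frac{4 \cdot \alpha \cdot \mathrm{Opt}}{\delta \cdot \lambda \cdot \sqrt{d}}$ on the number of non-center cells that contain a sample point. Since there are at most $3^d \cdot d^{d/2}$ center cells, the number of cells occupied by sample points is at most
$$\frac{4 \cdot \alpha \cdot \mathrm{Opt}}{\delta \cdot \lambda \cdot \sqrt{d}} + 3^d \cdot d^{d/2} \le \frac{49 \cdot (10\ell)^{d+1} \cdot (1+\log n) \cdot d^{d/2} \cdot \ln(2 \cdot Z \cdot 2^{Zd})}{\epsilon^{d+3} \cdot \lambda^{d+1}}.$$
Since these arguments hold for each of the $Z$ grids with probability $1-\rho/(2Z)$, the lemma follows from the union bound. □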
To obtain a coreset we use the estimates $\widetilde{n}_C$ of the number of points in heavy cells to identify the cells we consider as heavy (all cells having $\widetilde{n}_C \ge (1-\epsilon) \cdot \delta \cdot 2^i$). Since all heavy cells are considered as heavy we obtain a finer coreset than before. We will now show how to find a good assignment of weights to the computed coreset points, such that the computed coreset is an $\epsilon$-coreset for $P$.
Since the weight of a coreset point will also depend on the number of points in some light cells, we have to estimate the number of points in these cells. To get an estimate for all required cells we use the following procedure. We require that the estimate $\widetilde{n}_C$ for the number of points in a heavy cell is a $(1 \pm \epsilon)$-approximation and that in every cell $C \in G_i$ considered as light there are not more than $\delta \cdot 2^i$ points (our coreset construction uses only these assumptions, and they hold according to Lemma 5.6.14 with probability $1-\rho/2$).
We use exactly the same technique described for k-median to obtain consistent estimates $E_C$: We call a cell useful if it is either considered as heavy or a direct subcell of a cell considered as heavy. We have to deal with the fact that the sum $\sum_{C_i \text{ subcell of } C} \widetilde{n}_{C_i}$ of the estimated numbers of points in the subcells of $C$ can exceed the estimated number of points $\widetilde{n}_C$ in $C$. To avoid this we compute new integral estimates $E_C$ for the number of points in each useful cell $C$, which are still guaranteed to be near the real value $n_C$ and which are consistent with the values $E_{C_i}$ of the subcells of $C$. We do this by first computing upper and lower bounds $U_C$ and $L_C$ on $n_C$ for all useful cells. We will then adjust these bounds to be consistent with the bounds for the subcells. Finally we will use the bounds to compute new estimates $E_C$.
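For $i > \log(\frac{2}{\delta})$ and every cell $C \in G_i$ considered as heavy we define $L_C = \lceil \widetilde{n}_C/(1+\epsilon) \rceil$ and $U_C = \lfloor \widetilde{n}_C/(1-\epsilon) \rfloor$. For $i > \log(\frac{2}{\delta})$ and every cell $C \in G_i$ considered as light we define $L_C = 0$ and $U_C = \lfloor \delta 2^i \rfloor$. For $i \le \log(\frac{2}{\delta})$ and every cell $C \in G_i$ we define $L_C = \widetilde{n}_C$ and $U_C = \widetilde{n}_C$ (since we know the number of points in $C$ exactly). Using these definitions we know for every cell that $L_C \le n_C \le U_C$.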
The estimates $E_C$ can be computed bottom-up by adjusting the bounds $L_C$ and $U_C$ in cases of conflicts:
We first compute new lower and upper bounds $L_C$ and $U_C$ for all useful cells bottom-up. We look at the smallest cell $C$ considered as heavy. Let $C_i$, $i \in \{1, \dots, 2^d\}$, be its subcells. If $\sum_{i=1}^{2^d} L_{C_i} > L_C$, we set $L_C := \sum_{i=1}^{2^d} L_{C_i}$. If $\sum_{i=1}^{2^d} U_{C_i} < U_C$, we set $U_C := \sum_{i=1}^{2^d} U_{C_i}$. After the assignment $L_C \le n_C \le U_C$ still holds. We use this technique for all cells considered as heavy (in the order of increasing size), getting better bounds $L_C$ and $U_C$. From these bounds we then compute the values $E_C$ top-down. Since the bounds $L_C$ and $U_C$ are always at least as strong as the bounds of the subcells, we can always easily find integral values $E_C$ satisfying $L_C \le E_C \le U_C$ and $\sum_{i=1}^{2^d} E_{C_i} = E_C$.
Corollary 5.6.17 Assume that the statements of Lemma 5.6.14 are true for all grids and all cells.
• Then for each cell $C$ identified as heavy we have $(1-4\epsilon) n_C \le E_C \le (1+4\epsilon) n_C$.
• For each cell $C \in G_i$ with $i \le \log(\frac{2}{\delta})$ we have $E_C = n_C$.
• All estimates $E_C$ are integral and consistent with the estimates $E_{C_i}$ for the subcells $C_i$ of $C$.
Proof: Exactly as the proof of Corollary 5.6.5. □
We apply the algorithm described in Section 5.4 to our estimates $E_C$ and compute a coreset.
Lemma 5.6.18 If $\delta \le \frac{\epsilon^{d+1} \cdot \lambda^d \cdot \mathrm{Opt}}{(10\ell)^{d+1} \cdot (1+\log n) \cdot d^{(d+1)/2}}$ and $\epsilon < 1/12$, the coreset computed with respect to the values $E_C$ is a $67\frac{\ell}{\lambda}\epsilon$-coreset of $P$ with probability $1-\rho$.
Proof: Let $P'$ be a point set that is distributed according to our estimates $E_C$ (so for every useful cell $C$ we have $|P' \cap C| = E_C$). The proof of Lemma 5.4.4 shows that the coreset computed by our algorithm is an $\epsilon$-coreset for $P'$. Let $Q = \{q_1, \dots, q_m\}$ be the computed coreset points. Exactly as in the proof of Lemma 5.6.6 we can show that if we knew the point sets $P$ and $P'$, we could (using our coreset method) compute mappings $\gamma: P \to Q$ and $\gamma': P' \to Q$ and corresponding weight functions $w: Q \to \mathbb{N}$ and $w': Q \to \mathbb{N}$, such that $(Q, w)$ is an $\epsilon$-coreset for $P$ and $(Q, w')$ is an $\epsilon$-coreset for $P'$ and for all $q \in Q$ we have:
$$(1-5\epsilon)\, w(q) \le w'(q) \le (1+5\epsilon)\, w(q). \tag{5.6}$$
This guarantees that each solution on $(Q, w)$ can differ from a solution on $(Q, w')$ by at most $10\frac{\ell}{\lambda}\epsilon \cdot \mathrm{Opt}(Q, w)$:
Consider a movement of points from $(Q, w)$ to $(Q, w')$. The total movement increases when we first move each moved point to the center of gravity $\mu$ and then in a second step to its destination. Since $w'(q) \ge (1-5\epsilon)\, w(q)$, we know that for each $q \in Q$ at most $5\epsilon\, w(q)$ points are moved away from $q$. The total movement of the first step therefore is smaller than
$$\sum_{q \in Q} 5\epsilon\, w(q)\, d(q, \mu).$$
Since the considered problem is $\ell$-Lipschitz, this movement changes the objective function by at most
$$\ell \cdot \sum_{q \in Q} 5\epsilon\, w(q)\, d(q, \mu).$$
Now we look at the second movement step. Since $w'(q) \le (1+5\epsilon)\, w(q)$ we know that for each $q \in Q$ at most $5\epsilon\, w(q)$ points are moved from the center of gravity to $q$. The total movement of the second step must therefore be smaller than
$$\sum_{q \in Q} 5\epsilon\, w(q)\, d(q, \mu),$$
which changes the objective function again by at most
$$\ell \cdot \sum_{q \in Q} 5\epsilon\, w(q)\, d(q, \mu).$$
Since the problem is $\lambda$-mean preserving we conclude that the total change of the objective function caused by both movements is at most
$$10\ell\epsilon \cdot \sum_{q \in Q} w(q)\, d(q, \mu) \le 10\frac{\ell}{\lambda} \cdot \epsilon \cdot \mathrm{Opt}(Q, w). \tag{5.7}$$
We constructed different mappings proving coreset properties for different sets. We use the notation $E: Q \to \mathbb{N}$ for the coreset weight function computed by our sampling algorithm and $m: \gamma(P) \to \gamma'(P)$ for the mapping which moves points as shown in the last paragraph.
We know the following facts:
1. $\gamma: P \to Q$ proves that $(Q, w)$ is a coreset for $P$.
2. $m: \gamma(P) \to \gamma'(P)$ changes the cost of a solution by at most $10\frac{\ell}{\lambda}\epsilon \cdot \mathrm{Opt}(\gamma(P))$.
3. $\gamma': P' \to Q$ proves that $(Q, w')$ is a coreset for $P'$.
4. $\alpha: P' \to Q$ (as constructed by our algorithm) proves that $(Q, E)$ is a coreset for $P'$.
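Let $P = (p_1, \dots, p_n)$ be the set of input points and $s \in S_\Pi(n)$ a solution of the oblivious optimization problem $\Pi$. The mapping $\gamma \circ m \circ \gamma'^{-1} \circ \alpha: P \to \alpha(\gamma'^{-1}(m(\gamma(P))))$ proves that $(Q, E)$ is a $67\frac{\ell}{\lambda}\epsilon$-coreset for $P$:
We write $\mathrm{cost}(P)$ for $\mathrm{cost}^{(n,s)}_\Pi(P)$ and $\mathrm{Opt}(P)$ for $\mathrm{Opt}_\Pi(P)$ and conclude:
$$\mathrm{cost}(\alpha(\gamma'^{-1}(m(\gamma(P))))) \tag{5.8}$$
$$\in \mathrm{cost}(\gamma'^{-1}(m(\gamma(P)))) \pm \epsilon \cdot \mathrm{Opt}(\gamma'^{-1}(m(\gamma(P)))) \tag{5.9}$$
$$\subseteq \mathrm{cost}(m(\gamma(P))) \pm 2\epsilon \cdot \mathrm{Opt}(\gamma'^{-1}(m(\gamma(P)))) \tag{5.10}$$
$$\subseteq \mathrm{cost}(m(\gamma(P))) \pm 4\epsilon \cdot \mathrm{Opt}(m(\gamma(P))) \tag{5.11}$$
$$\subseteq \mathrm{cost}(\gamma(P)) \pm \Big(4\epsilon \cdot \mathrm{Opt}(\gamma(P)) + 4\epsilon \cdot 10\frac{\ell}{\lambda}\epsilon \cdot \mathrm{Opt}(\gamma(P)) + 10\frac{\ell}{\lambda}\epsilon \cdot \mathrm{Opt}(\gamma(P))\Big) \tag{5.12}$$
$$\subseteq \mathrm{cost}(\gamma(P)) \pm 44\frac{\ell}{\lambda}\epsilon \cdot \mathrm{Opt}(\gamma(P)) \tag{5.13}$$
$$\subseteq \mathrm{cost}(P) \pm \Big(44\frac{\ell}{\lambda}\epsilon \cdot \mathrm{Opt}(P) + 44\frac{\ell}{\lambda}\epsilon \cdot \epsilon \cdot \mathrm{Opt}(\gamma(P)) + \epsilon \cdot \mathrm{Opt}(P)\Big) \tag{5.14}$$
$$\subseteq \mathrm{cost}(P) \pm 67\frac{\ell}{\lambda}\epsilon \cdot \mathrm{Opt}(P) \tag{5.15}$$
where (5.9) comes from fact 4, (5.10) from fact 3, (5.11) from Lemma 5.1.13 and fact 3, (5.12) from the bounded movement costs of $m$ (fact 2), (5.13) from $\ell \ge 1$, $\lambda \le 1$, and $\epsilon \le 1/2$, and (5.14) from fact 1. □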
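The constants 44 and 67 can be checked by hand. The following is only a sketch of that arithmetic, under the assumption that the error terms in (5.12) and (5.14) carry the coefficients $4\epsilon$, $4\epsilon \cdot 10\frac{\ell}{\lambda}\epsilon$, $10\frac{\ell}{\lambda}\epsilon$ and $44\frac{\ell}{\lambda}\epsilon$, $44\frac{\ell}{\lambda}\epsilon \cdot \epsilon$, $\epsilon$ respectively, that all terms are measured against the same optimum value, and that $\ell \ge 1$, $\lambda \le 1$, $\epsilon \le 1/2$ (so $\ell/\lambda \ge 1$):

```latex
\begin{align*}
4\epsilon + 4\epsilon \cdot 10\tfrac{\ell}{\lambda}\epsilon + 10\tfrac{\ell}{\lambda}\epsilon
  &\le (4 + 20 + 10)\,\tfrac{\ell}{\lambda}\epsilon
   = 34\,\tfrac{\ell}{\lambda}\epsilon
   \le 44\,\tfrac{\ell}{\lambda}\epsilon, \\
44\tfrac{\ell}{\lambda}\epsilon + 44\tfrac{\ell}{\lambda}\epsilon \cdot \epsilon + \epsilon
  &\le (44 + 22 + 1)\,\tfrac{\ell}{\lambda}\epsilon
   = 67\,\tfrac{\ell}{\lambda}\epsilon .
\end{align*}
```

Both lines use $\epsilon^2 \le \epsilon/2$ and $\epsilon \le \frac{\ell}{\lambda}\epsilon$. Under this reading the first bound has some slack.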
6 Coresets in Data Streams
In this chapter we introduce algorithms to compute coresets on dynamic geometric data streams.
We will combine the coreset results of Chapter 5 with methods of Chapter 3 to sample items in
data streams.
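We briefly recapitulate the results of Chapter 5, particularly Section 5.6. To state the results for all problems at once we use the following definitions:
For the k-median problem we set
$$\delta_0 := \frac{\epsilon^{d+1} \cdot \mathrm{Opt}}{4 \cdot k \cdot 10^d \cdot (1+\log n) \cdot d^{(d+1)/2}},$$
$$j_0 := \lceil \log(n \cdot \widetilde{\Delta} \cdot d) \rceil,$$
$$Z := \Big\lceil \log \frac{4 \cdot k \cdot 10^d \cdot n\,(1+\log n) \cdot d^{(d+1)/2} \cdot \widetilde{\Delta}}{\epsilon^{d+1}} \Big\rceil + 1,$$
$$S_C := \frac{33 \cdot k \cdot 10^d \cdot (2+\log n) \cdot d^{d/2}}{\epsilon^{d+1}},$$
$$p_i := \min\Big\{\frac{\alpha}{\delta \cdot 2^i}, 1\Big\},$$
$$A := \frac{193 \cdot Z \cdot k \cdot 10^d \cdot (1+\log n) \cdot d^{d/2} \cdot \ln(2 \cdot Z \cdot 2^{Zd})}{\epsilon^{d+3}}, \text{ and}$$
$$\widetilde{\Delta} := \Delta.$$
For the k-means problem we set
$$\delta_0 := \frac{\epsilon^{d+2} \cdot \mathrm{Opt}}{8 \cdot k \cdot 33^{d+1} \cdot (1+\log n) \cdot d^{(d+2)/2}},$$
$$j_0 := \lceil \log(n \cdot \widetilde{\Delta} \cdot d) \rceil,$$
$$Z := \Big\lceil \log \frac{8 \cdot k \cdot 33^{d+1} \cdot n\,(1+\log n) \cdot d^{(d+2)/2} \cdot \widetilde{\Delta}}{\epsilon^{d+2}} \Big\rceil + 1,$$
$$S_C := \frac{129\,(1+\log n) \cdot k \cdot 33^{d+1} \cdot d^{d/2}}{\epsilon^{d+2}},$$
$$p_i := \min\Big\{\frac{\alpha}{\delta \cdot 4^i}, 1\Big\},$$
$$A := \frac{769 \cdot Z \cdot k \cdot 33^{d+1} \cdot (1+\log n) \cdot d^{d/2} \cdot \ln(2 \cdot Z \cdot 2^{Zd})}{\epsilon^{d+4}}, \text{ and}$$
$$\widetilde{\Delta} := \Delta^2.$$
For oblivious optimization problems we set
$$\delta_0 := \frac{\epsilon^{d+1} \cdot \lambda^d \cdot \mathrm{Opt}}{(10\ell)^{d+1} \cdot (1+\log n) \cdot d^{(d+1)/2}},$$
$$j_0 := \lceil \log(n \cdot \widetilde{\Delta} \cdot d \cdot \ell) \rceil,$$
$$Z := \Big\lceil \log \frac{(10\ell)^{d+1} \cdot n\,(1+\log n) \cdot d^{(d+1)/2} \cdot \widetilde{\Delta}}{\epsilon^{d+1} \cdot \lambda^d} \Big\rceil + 1,$$
$$S_C := \frac{9 \cdot (1+\log n) \cdot d^{d/2} \cdot (10\ell)^{d+1}}{\epsilon^{d+1} \cdot \lambda^{d+1}},$$
$$p_i := \min\Big\{\frac{\alpha}{\delta \cdot 2^i}, 1\Big\},$$
$$A := \frac{49 \cdot Z \cdot (10\ell)^{d+1} \cdot (1+\log n) \cdot d^{d/2} \cdot \ln(2 \cdot Z \cdot 2^{Zd})}{\epsilon^{d+3} \cdot \lambda^{d+1}}, \text{ and}$$
$$\widetilde{\Delta} := \Delta/\lambda.$$
For all problems we set
$$\alpha := 6 \cdot \epsilon^{-2} \ln(2 \cdot Z \cdot 2^{Z \cdot d})$$
and
$$\rho := \psi/(3 j_0),$$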
where $\psi$ is the maximum error probability that our data streaming algorithm is allowed to have.
In Section 5.6 we showed that for a set of $n$ points in $[0, 1]^d$ with $\mathrm{Opt} \ge 1/\widetilde{\Delta}$ an $\epsilon$-coreset for k-median, k-means and oblivious optimization problems can be extracted using only the following statistics:
• $Z$ nested grids $G_0, \dots, G_Z$. Each cell in $G_i$ has side length $\frac{1}{2^i}$.
• In each grid $G_i$ a sample $S_i$ of points, each point chosen to be in $S_i$ with probability $p_i$. The sampling is assumed to be $\alpha$-wise independent.
Note that we do not need to know the exact location of each sample point. During our construction we only use the information about the number of sample points in certain cells of the grid. Therefore the information we have to compute in our data stream is the following:
1. The coordinates and size of each cell $C$ occupied by at least one sample point.
2. The number of sample points $|S_i \cap C|$ in each occupied cell $C$.
Notice that according to Lemmas 5.6.4, 5.6.10, and 5.6.16 all information we need can be stored in small space:
Lemma 6.0.19 Let $\delta \ge \delta_0/2$. Then we have points from at most $A$ cells in the union of our sample sets with probability at least $1-\rho/2$. □
The extracted coreset is a coreset according to Lemmas 5.6.6, 5.6.12, and 5.6.18, and its size is small according to Corollaries 5.6.3, 5.6.9, and 5.6.15:
Lemma 6.0.20 The following statements hold with probability at least $1-\rho/2$:
• If $\delta \le \delta_0$, we can compute a coreset from only the statistics 1 and 2. The coreset then is an $O(\epsilon)$-coreset of $P$.
• If $\delta \ge \delta_0/2$, the size of the computed coreset is at most $S_C$. □
By scaling all point coordinates by $1/\Delta$, we can ensure that all points lie in $[0, 1]^d$. We can also assume that we always have $\mathrm{Opt} \ge 1/\widetilde{\Delta}$:
For k-median we know the exact solution when we currently have only $k$ points in the data stream. We can use a data structure of Lemma 3.5.4 to obtain the exact point coordinates in that case. Otherwise $\mathrm{Opt} \ge 1/\Delta = 1/\widetilde{\Delta}$, because we have two points of (scaled) distance at least $1/\Delta$ in one cluster.
For k-means we also know the exact solution when we have only $k$ points in the stream. Otherwise $\mathrm{Opt} \ge 1/\Delta^2 = 1/\widetilde{\Delta}$, because we have two points of (scaled) distance at least $1/\Delta$ in one cluster.
For an oblivious optimization problem $\Pi$ we are also able to recover the point set by Lemma 3.5.4 if it consists of only one point. If $P$ consists of at least two points $p$ and $q$, they must have a scaled distance of at least $1/\Delta$. Therefore the total distance of all points to the center of gravity $\mu$ is at least $1/\Delta$. Since $\Pi$ is $\lambda$-mean preserving, we have $\mathrm{Opt} \ge \lambda \sum_{p \in P} d(p, \mu) \ge \lambda/\Delta = 1/\widetilde{\Delta}$.
6.1 Insertions
We show how to maintain the statistics 1 and 2 when the data stream consists only of insertions of points. Since we do not know the right value of $\delta$ in advance, we follow the approach of Sections 5.2.3, 5.3.3, and 5.4.3. We start $j_0$ different instances $I_1, \dots, I_{j_0}$ of our algorithm in parallel, instance $I_j$ with a value of $\delta = \delta^{(j)} = c \cdot 2^j$ for a constant $c$ (see Sections 5.2.3, 5.3.3, and 5.4.3 for the definition of $c$).
For each $j$ we maintain at most $A$ cells $C_{j,1}, \dots, C_{j,A}$ and the number of sample points $s_{j,l}$ within each cell $C_{j,l}$. For each instance $I_j$ we maintain a value RUNNING$_j$, which is 1 if the instance is still running or 0 if the instance has been stopped, and a value $c_j$ which denotes the number of cells currently stored by the $j$-th instance.
An insert operation of a point $p$ is done as follows:
INSERT($p$)
    for each $j$ do
        if RUNNING$_j$ = 1 then
            for each $i \in \{1, \dots, Z\}$ do
                Do a biased coin flip with Pr[coin shows heads] = $p_i$.
                if coin shows heads do
                    Let $C \in G_i$ denote the cell in grid $G_i$ that contains the point $p$.
                    if $\exists\, l \in \{1, \dots, c_j\}: C_{j,l} = C$ do
                        set $s_{j,l} \leftarrow s_{j,l} + 1$. // update number of sample points within this cell
                    else do
                        if $c_j > A$ do
                            set RUNNING$_j$ = 0. // stop storing cells for this $j$
                        else do
                            set $c_j \leftarrow c_j + 1$.
                            set $C_{j,c_j} \leftarrow C$. // store cell occupied by point
                            set $s_{j,c_j} \leftarrow 1$. // store number of sample points within this cell
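For readers who prefer executable code, a single instance $I_j$ of the INSERT procedure might look like the Python sketch below. It is an illustration under assumptions not made in the text: points lie in $[0,1)^d$, a cell is keyed by its grid level and integer index vector, a hash map replaces the explicit lists $C_{j,l}$ and $s_{j,l}$, and the coin flips come from Python's `random` module rather than an $\alpha$-wise independent source. The class name is hypothetical.

```python
import random

class Instance:
    """One instance I_j of the insertion-only algorithm (illustrative sketch)."""

    def __init__(self, delta, alpha, Z, A, rng=None):
        self.delta, self.alpha, self.Z, self.A = delta, alpha, Z, A
        self.rng = rng or random.Random(0)
        self.running = True        # RUNNING_j
        self.counts = {}           # (i, cell index) -> s_{j,l}; len(counts) plays the role of c_j

    def p(self, i):
        """Sampling probability p_i = min(alpha / (delta * 2^i), 1) (k-median form)."""
        return min(self.alpha / (self.delta * 2**i), 1.0)

    def insert(self, point):
        if not self.running:
            return
        for i in range(1, self.Z + 1):
            if self.rng.random() < self.p(i):                  # coin shows heads
                cell = (i, tuple(int(x * 2**i) for x in point))
                if cell in self.counts:
                    self.counts[cell] += 1                     # one more sample point in a known cell
                elif len(self.counts) > self.A:                # mirrors the test c_j > A
                    self.running = False                       # a stopped instance is never read again
                    return
                else:
                    self.counts[cell] = 1                      # store a newly occupied cell
```

Keeping the cells in a dictionary makes the existence test a constant-time lookup in expectation.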
The choice of $j_0$ ensures that for one $j$ we have $\frac{1}{2} \cdot \delta_0 \le \delta^{(j)} \le \delta_0$ (see Sections 5.2.3, 5.3.3, and 5.4.3). According to Lemma 6.0.19 for this choice of $\delta = \delta^{(j)}$ we have at most $A$ cells occupied by sample points. Therefore we have $c_j \le A$ during the algorithm and RUNNING$_j$ = 1. From Lemma 6.0.20 we know that when we extract a coreset from the statistics for this value of $j$, we obtain an $O(\epsilon)$-coreset for the respective problem. We also obtain an $O(\epsilon)$-coreset from all statistics having smaller values of $j$, since they used a smaller value of $\delta = \delta^{(j)}$.
We use the smallest value of $j$ such that RUNNING$_j$ = 1. From the corresponding statistics $C_{j,1}, \dots, C_{j,c_j}$ and $s_{j,1}, \dots, s_{j,c_j}$ we compute our coreset, which is an $O(\epsilon)$-coreset for $P$. If the coreset size is bigger than $S_C$ we know from Lemma 6.0.20 that $\delta^{(j)} < \delta_0/2$. Therefore we can set $j \leftarrow j+1$, still knowing that $\delta^{(j)} \le \delta_0$ and that the computed coreset for this value of $j$ is also an $O(\epsilon)$-coreset. We continue in this way until the size of the constructed coreset is at most $S_C$.
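The selection of $j$ at extraction time is a short loop. The sketch below assumes a hypothetical callback `coreset_size(j)` that builds the coreset from the statistics of instance $j$ and returns its size. It also relies on the guarantee discussed above that some running instance yields a coreset of size at most $S_C$, so the loop terminates.

```python
def choose_instance(running, coreset_size, S_C):
    """Return the index j whose statistics are used for the final coreset.

    running[j]      -- True if instance j is still running (RUNNING_j = 1)
    coreset_size(j) -- size of the coreset extracted from instance j (assumed callback)
    S_C             -- size bound for the coreset
    """
    j = min(idx for idx, flag in enumerate(running) if flag)   # smallest running instance
    while coreset_size(j) > S_C:                               # coreset too large: delta^(j) < delta_0 / 2
        j += 1
    return j
```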
Theorem 18 Given a data stream of point insertions, our streaming algorithm maintains a data structure for $\epsilon$-coresets for k-median, k-means, and oblivious optimization problems. At any point of time an $O(\epsilon)$-coreset for the respective problem can be extracted with probability $1-\psi$.
The data structure for k-median needs $\widetilde{O}(k \cdot \log^4(\Delta) \cdot \log(1/\psi)/\epsilon^{d+3})$ space, the data structure for k-means needs $\widetilde{O}(k \cdot \log^4(\Delta) \cdot \log(1/\psi)/\epsilon^{d+4})$ space and the data structure for oblivious optimization problems needs $\widetilde{O}(\log^4(\Delta) \cdot \log(1/\psi)/\epsilon^{d+3})$ space.
An insertion can be processed in $O(\log(\Delta) \cdot \log(k \cdot \Delta/\epsilon))$ time.
An $\epsilon$-coreset for k-median can be extracted in $O(k \log^3 \Delta/\epsilon^{d+1})$ time.
An $\epsilon$-coreset for k-means can be extracted in $O(k \log^3 \Delta/\epsilon^{d+2})$ time.
An $\epsilon$-coreset for oblivious optimization problems can be extracted in $O(\log^3 \Delta/\epsilon^{d+1})$ time.
Proof: We set $\rho = \psi/(3 \cdot j_0)$. By the union bound Lemma 5.6.2 and Lemma 5.6.4 hold with probability $1-\psi$ for all choices of $j$. Thus with probability $1-\psi$ the returned coreset is an $O(\epsilon)$-coreset.
We will now count the number of memory cells we use. For each of the $j_0 = O(\log \Delta)$ instances we maintain at most $A$ counters. They occupy $j_0 \cdot A$ memory cells, each consisting of $O(\log \Delta)$ bits. The respective values of $A$ lead to the stated memory bounds.
When a new point arrives in the stream we process one for-loop for each of the $j_0$ values of $j$ and each of the $Z$ values of $i$. Using hash tables the query $\exists\, l \in \{1, \dots, c_j\}: C_{j,l} = C$ can be done in constant time. All other operations of the algorithm run in constant time as well. Therefore the time to insert a point is $O(Z \cdot j_0) = O(\log(\Delta) \cdot \log(k \cdot \Delta/\epsilon))$.
We will now bound the coreset extraction time for k-median. Let us first assume we know the right value of $\delta$ in advance. Since (according to the proof of Lemma 5.2.8) there are at most $O(k \log n/\epsilon^{d+1})$ heavy cells in each grid, we have a bound of $O(k \log^2 \Delta/\epsilon^{d+1})$ on the number of heavy cells. The marking process leading to the actual coreset can be seen as a quadtree traversal. Since each inner node of this tree corresponds to a heavy cell, the tree traversal can be done in time $O(k \log^2 \Delta/\epsilon^{d+1})$. Since we have to do this process for each value of $\delta$ (and then choose the minimum value with a small coreset), we get a total running time of $O(k \log^3 \Delta/\epsilon^{d+1})$.
For k-means and oblivious optimization problems the number of heavy cells in each grid is $O(k \log n/\epsilon^{d+2})$ and $O(\log n/\epsilon^{d+1})$, respectively. Therefore we get a total coreset extraction time of $O(k \log^3 \Delta/\epsilon^{d+2})$ for k-means and $O(\log^3 \Delta/\epsilon^{d+1})$ for oblivious optimization problems. □
6.2 Deletions
When we allow deletions two problems occur. First, when we encounter a DELETE operation of a point $p$, we have to decide if this point is a sample point. We achieve this by replacing the coin flips in our algorithm by the use of an $\alpha$-wise independent hash function $h_{i,j}: [\Delta]^d \to \big[\lceil \frac{\Delta^d}{\rho} \rceil\big]$. We choose all points $p$ having $h_{i,j}(p) < p_i \cdot \lceil \frac{\Delta^d}{\rho} \rceil$ as sample points.
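Lemma 6.2.1 The choice of the sample points is done $\alpha$-wise independently. Each point $p \in P$ is chosen to be a sample point with probability $p \in [p_i, p_i + \frac{\rho}{\Delta^d}]$. The total statistical difference to the method which samples each point with probability $p_i$ is at most $\rho$.
Proof:
$$p = \frac{\big\lceil p_i \cdot \lceil \frac{\Delta^d}{\rho} \rceil \big\rceil}{\lceil \frac{\Delta^d}{\rho} \rceil} \ge p_i$$
and
$$p = \frac{\big\lceil p_i \cdot \lceil \frac{\Delta^d}{\rho} \rceil \big\rceil}{\lceil \frac{\Delta^d}{\rho} \rceil} \le \frac{p_i \cdot \lceil \frac{\Delta^d}{\rho} \rceil + 1}{\lceil \frac{\Delta^d}{\rho} \rceil} = p_i + \frac{1}{\lceil \frac{\Delta^d}{\rho} \rceil} \le p_i + \frac{\rho}{\Delta^d}.$$
Since we have at most $\Delta^d$ points, the errors in the probability to sample each single point sum up to a total statistical difference of at most $\rho$. □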
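The rounding argument of the proof can be replayed with exact rational arithmetic. The sketch below is only a numerical illustration: the function name is hypothetical, `num_points` stands for the universe size $\Delta^d$, and a uniformly distributed hash value is assumed.

```python
from fractions import Fraction
from math import ceil

def quantized_probability(p_i, num_points, rho):
    """Probability that a uniform hash value h in {0, ..., m-1} satisfies h < p_i * m,
    where m = ceil(num_points / rho) is the size of the hash range."""
    m = ceil(Fraction(num_points) / Fraction(rho))
    accepted = ceil(Fraction(p_i) * m)       # number of hash values below the threshold
    return Fraction(accepted, m)
```

For example, with $p_i = 1/3$, $\Delta^d = 1024$ and $\rho = 1/100$ the hash range has 102400 values, 34134 of which are accepted. The resulting probability exceeds $1/3$ by $1/153600$, which is below $\rho/\Delta^d = 1/102400$.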
We conclude that for a fixed $j$ our sampling method behaves with probability $1-\rho$ exactly like an $\alpha$-wise independent sampling of each point with probability $p_i$. Since $\rho = \psi/(3 j_0)$ we have that with probability $1-\psi/3$ all samplings behave exactly as demanded.
The second problem is that we cannot stop an instance of our algorithm if the number $c_j$ of occupied cells exceeds $A$. For example, it could happen that in the first half of the stream many points are inserted and the number of occupied cells $c_j$ is far too large. But then most of these points can be deleted in the second half of the stream such that we eventually have fewer than $A$ occupied cells. If we want to obtain a coreset at that point of time we have to know the respective statistics 1 and 2.
To overcome this problem we use the data structure of Lemma 3.5.4. The data structure supports update operations on the entries of a high-dimensional vector $x$ and is able to recover the whole vector, as long as the support of the vector is smaller than $S_C$.
Let $j$ be fixed. Let $C_0, \dots, C_t$ with $t = O(\Delta^d)$ denote the set of all cells in all grids $G_i$ for one fixed $j$. Let $x_i$ denote the number of sample points in the cell $C_i$. When we are able to recover the whole vector $x = (x_0, \dots, x_t)$ after insert and delete operations of sample points, we can reconstruct the statistics 1 and 2. Therefore for each $j$ we use a data structure RECOVER$_j$ of Lemma 3.5.4 with parameters $U = \Theta(\Delta^d)$, $M = \Theta(\Delta^d)$, and error probability parameter $\rho = \psi/(3 j_0)$. Then with probability $1-\psi/3$ the structures RECOVER$_j$ work for each $j$.
Let UPDATE$_j$ denote an UPDATE operation on the data structure RECOVER$_j$. We implement INSERT and DELETE operations of points in the following way:
INSERT($p$)
    for each $j$ do
        for each $i \in \{1, \dots, Z\}$ do
            if $h_{i,j}(p) < p_i \cdot \lceil \frac{\Delta^d}{\rho} \rceil$ do // $p$ is sample point
                Let $C_l \in G_i$ denote the cell that contains the point $p$.
                UPDATE$_j$($l$, 1).

DELETE($p$)
    for each $j$ do
        for each $i \in \{1, \dots, Z\}$ do
            if $h_{i,j}(p) < p_i \cdot \lceil \frac{\Delta^d}{\rho} \rceil$ do // $p$ is sample point
                Let $C_l \in G_i$ denote the cell that contains the point $p$.
                UPDATE$_j$($l$, $-1$).

The method ensures that $x_l$ always represents the number of sample points within the cell $C_l$.
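The key point is that the hash function makes the sampling decision repeatable, so a deletion cancels exactly the updates of the matching insertion. The sketch below illustrates this with stand-ins that are assumptions of this example only: SHA-256 replaces the $\alpha$-wise independent hash family, a plain dictionary replaces the sparse-recovery structure RECOVER$_j$ of Lemma 3.5.4, and points are taken to be integer vectors from $\{0,\dots,\Delta-1\}^d$. All names are hypothetical.

```python
import hashlib
from collections import defaultdict

def stand_in_hash(i, j, point, m):
    """Deterministic hash into {0, ..., m-1}. SHA-256 is an assumption of this sketch;
    it is NOT the alpha-wise independent family the analysis requires."""
    digest = hashlib.sha256(repr((i, j, tuple(point))).encode()).digest()
    return int.from_bytes(digest, "big") % m

class DynamicInstance:
    """Instance j with an exact dictionary in place of RECOVER_j."""

    def __init__(self, j, delta, alpha, Z, Delta, m):
        self.j, self.delta, self.alpha = j, delta, alpha
        self.Z, self.Delta, self.m = Z, Delta, m
        self.x = defaultdict(int)            # x[cell] = number of sample points in the cell

    def _update(self, point, sign):
        for i in range(1, self.Z + 1):
            p_i = min(self.alpha / (self.delta * 2**i), 1.0)
            if stand_in_hash(i, self.j, point, self.m) < p_i * self.m:   # p is a sample point
                cell = (i, tuple(c * 2**i // self.Delta for c in point))
                self.x[cell] += sign                                     # UPDATE_j(l, +1 or -1)
                if self.x[cell] == 0:
                    del self.x[cell]                                     # keep only the support

    def insert(self, point):
        self._update(point, +1)

    def delete(self, point):
        self._update(point, -1)
```

Inserting a point and later deleting it leaves the vector exactly as if the point had never appeared, which is what allows the statistics 1 and 2 to be recovered after arbitrary interleavings.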
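Theorem 19 Given a data stream of point insertions and deletions, our streaming algorithm maintains a data structure for $\epsilon$-coresets for k-median, k-means, and oblivious optimization problems. At any point of time an $O(\epsilon)$-coreset for the respective problem can be extracted with probability $1-\psi$.
The data structure for k-median needs $\widetilde{O}(k \cdot \log^6(\Delta) \cdot \log(\Delta/\psi)/\epsilon^{d+3})$ space, the data structure for k-means needs $\widetilde{O}(k \cdot \log^6(\Delta) \cdot \log(\Delta/\psi)/\epsilon^{d+4})$ space and the data structure for oblivious optimization problems needs $\widetilde{O}(\log^6(\Delta) \cdot \log(\Delta/\psi)/\epsilon^{d+3})$ space.
Insertions and deletions of points can be processed in $\widetilde{O}(k \cdot \log^6(\Delta) \cdot \log(\Delta/\psi)/\epsilon^{d+3})$ time for k-median, in $\widetilde{O}(k \cdot \log^6(\Delta) \cdot \log(\Delta/\psi)/\epsilon^{d+4})$ time for k-means, and in $\widetilde{O}(\log^6(\Delta) \cdot \log(\Delta/\psi)/\epsilon^{d+3})$ time for oblivious optimization problems.
An $\epsilon$-coreset for k-median can be extracted in $\widetilde{O}(k \log^5 \Delta \cdot \log(\Delta/\psi)/\epsilon^{d+3})$ time.
An $\epsilon$-coreset for k-means can be extracted in $\widetilde{O}(k \log^5 \Delta \cdot \log(\Delta/\psi)/\epsilon^{d+4})$ time.
An $\epsilon$-coreset for oblivious optimization problems can be extracted in $\widetilde{O}(\log^5 \Delta \cdot \log(\Delta/\psi)/\epsilon^{d+3})$ time.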
Proof: With probability $1-\psi/3$ the hash functions doing the sampling behave like sampling each point $\alpha$-wise independently with probability $p_i$. We condition on this event.
We used data structures RECOVER$_j$ that work with probability $1-\psi/3$ for all $j$. We condition on this event.
In that event we can recover the whole vector $x$, which represents exactly the statistics 1 and 2. Therefore we can construct an $O(\epsilon)$-coreset. The coreset construction works with probability $1-\psi/3$ according to Lemmas 5.6.2 and 5.6.4.
Let us now bound the space needed. For $j_0$ values of $j$ we have to store the data structure RECOVER$_j$ of Lemma 3.5.4, each using space
$$O(A \cdot (\log A + \log(1/\rho)) \cdot \log^2 \Delta) = \widetilde{O}(A \cdot \log^2(\Delta) \cdot \log(\Delta/\psi)).$$
Therefore the total space to store all data structures RECOVER$_j$ is $\widetilde{O}(A \cdot \log^3(\Delta) \cdot \log(\Delta/\psi))$. The space consumption of the hash functions is negligible compared to the space consumption of the RECOVER$_j$ structures. We obtain the stated memory bounds by inserting the right values of $A$.
To process an update we have to process $j_0 \cdot Z$ update operations on RECOVER structures, each taking $O(A \cdot (\log A + \log(1/\rho)) \cdot \log \Delta)$ time. Altogether this takes $\widetilde{O}(Z \cdot A \cdot \log^2 \Delta \cdot \log(\Delta/\psi))$ time, which dominates the whole processing time. Plugging in the values of $Z$ and $A$ leads to the stated update times.
To extract the coreset for a fixed value of $j$ we first have to extract the statistics 1 and 2 from RECOVER$_j$. This takes $O(A \cdot (\log A + \log(\Delta/\psi)) \cdot \log \Delta) = \widetilde{O}(A \cdot \log \Delta \cdot \log(\Delta/\psi))$ time. Plugging in the value of $A$ and multiplying by $j_0 = O(\log \Delta)$ (the number of possible values of $j$), we obtain a total recovery time for k-median of $\widetilde{O}(k \log^5 \Delta \cdot \log(\Delta/\psi)/\epsilon^{d+3})$, which dominates the coreset extraction time of $O(k \log^3 \Delta/\epsilon^{d+1})$. For k-means and oblivious optimization problems we get the stated results in the same way. □
6.3 Maximum Spanning Tree
In our publication [43] we stated the problem of finding a maximum spanning tree of points in a Euclidean space as an oblivious optimization problem, which is solvable using our coreset method. Although this can be done in general, the problem is not $\ell$-Lipschitz for a constant $\ell$, and therefore the proofs of the previous chapters do not apply. In this section we want to close this gap to the publication [43] and develop a very simple and more effective method to construct a coreset for the maximum spanning tree problem.
We will first define the maximum spanning tree problem as an oblivious optimization problem.
Definition 6.3.1 (Euclidean MaxST) The Euclidean maximum spanning tree problem (MaxST) asks for a spanning tree connecting all points of a given input point set $P \subset \mathbb{R}^d$ of size $n$, such that the total length of all tree edges is maximized.
The problem can be formulated as an oblivious optimization problem:
We look at a complete graph $K_n$ with node set $\{1, \dots, n\}$. A feasible solution $s$ for the MaxST problem is then a spanning tree of $K_n$. Let $T$ be the set of edges of this spanning tree of $K_n$. The cost of $s$ on $P = \{p_1, \dots, p_n\}$ is then given by
$$\mathrm{cost}^{(n,s)}_{\mathrm{MaxST}}(p_1, \dots, p_n) = \sum_{(a,b) \in T} d(p_a, p_b).$$
This definition of MaxST as an oblivious optimization problem also defines $\epsilon$-coresets for MaxST by Definition 5.1.12.
We first show how to construct a coreset when we have access to the whole input point set $P$.
We compute an approximation $\widetilde{B}$ to the biggest extent of the point set in one dimension. Specifically, we set
$$B := \max_{p,q \in P}\ \max_{i \in \{1, \dots, d\}} |p^{(i)} - q^{(i)}|$$
and assume that we have computed an approximation $\widetilde{B}$ satisfying $B \le \widetilde{B} \le 2 \cdot B$.
We then introduce a grid consisting of square cells of side length $\frac{\epsilon \cdot \widetilde{B}}{8\sqrt{d}}$. Notice that at most $(8 \cdot \sqrt{d}/\epsilon)^d$ cells in the grid are occupied by points of $P$.
We introduce a coreset point in each non-empty grid cell and map all input points from $P$ lying in a cell $C$ to the corresponding coreset point in $C$.
Lemma 6.3.2 The construction given above constructs an $\epsilon$-coreset for $P$. The coreset size is $O(\frac{1}{\epsilon^d})$.
Proof: Let $s$ be an arbitrary solution. We denote by $T$ the corresponding spanning tree connecting the points from $P$. $T$ contains fewer than $n$ edges. Since each endpoint of an edge is moved a distance of at most $\epsilon \cdot \widetilde{B}/8$, we conclude that the length of each edge changes by at most $\epsilon \cdot \widetilde{B}/4$. The total change of the cost of the solution $s$ is therefore at most $n \cdot \epsilon \cdot \widetilde{B}/4 \le n \cdot \epsilon \cdot B/2$.
We now show that this change is at most $\epsilon \cdot \mathrm{Opt}_\Pi(P)$ by constructing a spanning tree having large cost. We first connect the two points $p_1$ and $p_2$ of $P$ defining the extent $B$ by an edge. We iterate over the rest of the nodes. A node $q \in P \setminus \{p_1, p_2\}$ is connected to $p_1$ by an edge iff $d(q, p_1) \ge d(q, p_2)$. Otherwise $q$ is connected to $p_2$ by an edge. The cost of the resulting spanning tree is then at least
$$d(p_1, p_2) + (n-2) \cdot d(p_1, p_2)/2 \ge n \cdot B/2.$$
Therefore the cost $\mathrm{Opt}_\Pi(P)$ of an optimum solution must be at least $n \cdot B/2$. It follows that the change of the cost of the solution $s$ by mapping points to coreset points is bounded by
$$n \cdot \epsilon \cdot B/2 \le \epsilon \cdot \mathrm{Opt}_\Pi(P).$$
□
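The offline construction is short enough to sketch in Python. The sketch makes two choices that the text leaves open and that should be read as assumptions: it uses the exact extent $B$ in place of the approximation $\widetilde{B}$, and it places the coreset point at the center of each occupied cell. The function names are hypothetical.

```python
import math
from collections import Counter

def cell_center(point, side):
    """Center of the grid cell (side length `side`) that contains `point`."""
    return tuple((math.floor(x / side) + 0.5) * side for x in point)

def maxst_coreset(points, eps):
    """Weighted coreset {coreset point: weight} obtained by snapping points to grid cells."""
    d = len(points[0])
    # Biggest extent of the point set in one dimension.
    B = max(max(p[i] for p in points) - min(p[i] for p in points) for i in range(d))
    if B == 0:                                    # all input points coincide
        return Counter({tuple(points[0]): len(points)})
    side = eps * B / (8 * math.sqrt(d))           # cell side length of the grid
    return Counter(cell_center(p, side) for p in points)
```

With the coreset point at the cell center, each input point moves by at most half the cell diagonal, which is within the bound $\epsilon \cdot \widetilde{B}/8$ used in the proof.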
The statistics we need to construct a coreset can easily be maintained in a dynamic geometric data stream consisting of insertions and deletions of points from $\{0, \dots, \Delta-1\}^d$ using techniques of the previous sections. Let $\psi$ be the desired error probability.
We start $\lceil \log(d \cdot \Delta) + 1 \rceil$ parallel instances of a streaming algorithm, each with a different value of $j \in \{0, \dots, \lceil \log(d \cdot \Delta) \rceil\}$. For each instance we set $\widetilde{B}_j := 2^j$.
Instance $j$ introduces a square grid $G_j$ of cell side length $\frac{\epsilon \cdot \widetilde{B}_j}{8\sqrt{d}}$. Let $C_{j,1}, \dots, C_{j,l}$ be the cells of $G_j$. The instance uses the data structure of Lemma 3.5.4 with values $U = \Delta^d$, $M = \Delta^d$, $A = (8 \cdot \sqrt{d}/\epsilon)^d$, and $\delta = \psi/\lceil \log(d \cdot \Delta) + 1 \rceil$, which is able to recover the support of a large vector $x$ under UPDATE operations on $x$ when the support is smaller than $A$. For each INSERT($p$) operation in the stream, instance $j$ determines the cell $C_{j,i}$ containing the point $p$. It then triggers an UPDATE($i$, 1) operation on its data structure, increasing the value $x_i$ by 1. A DELETE($p$) operation is transformed into an UPDATE($i$, $-1$) operation on the data structure, decreasing the value $x_i$ by 1. This way we ensure that the vector entry $x_i$ always equals the number of points in $C_{j,i}$.
We can extract a coreset in the following way. We assume that all data structures to recover the vectors work well (which happens by the union bound with probability $1-\psi$). We query all data structures of all instances to recover the support of their corresponding vectors. The data structures return FAIL only if the support of their corresponding vector is larger than $A$. This way we can decide which grids contain at most $A$ occupied cells.
We take the instance with the lowest value of $j$ having at most $A$ occupied cells and use the data structure of that instance to recover the corresponding vector $x$. This vector contains the information about the set of occupied cells in $G_j$ and the number of points within each cell. Using that information we construct a coreset by introducing a coreset point within each occupied cell of $G_j$ and setting the weight of each coreset point to the number of input points in the corresponding cell.
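Theorem 20 Given a dynamic geometric data stream of insert and delete operations of points from $\{0, \dots, \Delta-1\}^d$, there is an algorithm that maintains a data structure for $\epsilon$-coresets for MaxST. At any point of time an $\epsilon$-coreset can be extracted with probability $1-\psi$. The size of the coreset is $O(1/\epsilon^d)$. The data structure uses $O\big(\frac{1}{\epsilon^d} \cdot \log \frac{\log \Delta}{\epsilon \cdot \psi} \cdot \log^2 \Delta\big)$ space. Insertions and deletions of points can be processed in $O\big(\frac{1}{\epsilon^d} \cdot \log \frac{\log \Delta}{\epsilon \cdot \psi} \cdot \log \Delta\big)$ time.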
Proof: We first prove that the returned coreset is indeed an $\epsilon$-coreset for MaxST. We know that there is one instance started with a value of $\widetilde{B}_{j_0}$ satisfying $B \le \widetilde{B}_{j_0} \le 2 \cdot B$. By the argument above we have at most $A$ occupied cells in the corresponding grid. If we constructed a coreset from the information of that instance, this would be an $\epsilon$-coreset. However, we construct the coreset from the information of the instance having the smallest value of $j$ and at most $A$ occupied grid cells. Therefore the coreset is constructed from an instance $j$ with value $j \le j_0$. We have either $j = j_0$ or the grid of instance $j$ is even finer than that of instance $j_0$. Therefore the constructed coreset is an $\epsilon$-coreset. Its size is at most $A$.
The space and update time we need is dominated by the space and update time of the $\lceil \log(d \cdot \Delta) + 1 \rceil$ data structures from Lemma 3.5.4. This leads to the stated space requirements and update times. □
7 A Kinetic Data Structure for MaxCut
In this chapter we will focus on clustering moving points as described in the framework of kinetic
data structures (KDS). The framework of kinetic data structures has been introduced by Basch
et al. [12] and it has been used since as the central model of studying geometric objects in
motion, see, e.g., [2, 12, 55, 56] and the references therein. The KDSs are data structures for
maintaining a certain attribute (for example, in the case of a clustering problem, assignment of
the points to the clusters) for a set of continuously moving geometric objects. The main idea
underlying the framework of KDSs is that even if the input objects are moving in a continuous
fashion, the underlying combinatorial structure of the moving objects changes only at discrete
times. Therefore, there is no need to maintain the data structure continuously but rather only
when certain combinatorial events happen: a KDS maintains a configuration function of interest
by watching for updates needed to be performed when an event occurs.
In the kinetic setting, we consider a set of points in Rdthat are continuously moving. Each
point follows a (known) trajectory that is defined by a continuous function of time; for simplicity
of presentation, we will assume that it is a linear function. In other words, every point is moving
with a constant speed along a line; the line and the speed are the parameters of the movement of
a given point. Additionally, we allow the points to change their trajectory, i.e., to perform a flight
plan update.
To measure the quality of a KDS, we will consider the following two most important perfor-
mance measures (for more details, see, e.g., [55, 56]): the time needed to update the KDS when
an event occurs and the bound for the number of events that may occur during the motion. An-
other important measure is the time to handle flight plan updates.
In this chapter we describe a kinetic data structure to maintain a $(1-\epsilon)$-approximation of a maximum cut. Our data structure supports queries of the type
to which side of the partition does query point $p$ belong?
To support such a query the data structure maintains a subdivision of the space that has complexity $O(\log n/\epsilon^{d+1})$. Each cell of the subdivision is colored red or blue. Every point located in a red cell is red and every point in a blue cell is blue. Then our colored partition into red and blue points is a $(1-\epsilon)$-approximation to the MaxCut.
Our data structure uses two auxiliary kinetic data structures: kinetic tournament trees as defined in Section 7.1 and a data structure to approximate the bounding cube as defined in Section 7.2.
7.1 Kinetic Tournament Trees
In this section we recap a construction of randomized tournament trees from Basch [11]. There are other structures leading to better amortized time bounds (e.g. [16]). We will use the construction from [11] because it deterministically achieves almost the same runtime bounds, but these bounds hold even in the worst case.
A randomized tournament tree is a randomized data structure to maintain the maximum (or the minimum, depending on the application) point from a set $P = \{p_1, \dots, p_n\}$ of $n$ linearly moving points in $\mathbb{R}^1$. It stores all points in the leaves of a binary tree. Inner nodes of the tree always store the bigger point of the two child nodes (and therefore the maximum of the subtree). It also stores an event queue, that is, a priority queue holding the times of the next events when inner nodes will change (when we get a new maximum in the subtree, i.e. when the two children of a node change their order).
When a point stored in an inner node changes, the maximum of the subtree must have changed.
The number of times this can happen is bounded by the number of nodes in the subtree. Summing
up over all inner nodes we get that the total number of changes of inner nodes is at most
$O(n \log n)$.
When an inner node changes at an event, it can lead to changes of the inner nodes above.
Notice that we treat these additional changes as separate events. Therefore each such event can
be processed in $O(\log n)$ time (we just have to adapt the event queue and test if we have an
immediate event for the parent node as well).
An insertion of a point can be done by inserting a leaf into the tree and adjusting the inner
nodes in time $O(\log^2 n)$. A deletion of a point can be done by deleting the respective leaf in time
$O(\log^2 n)$, adjusting the inner nodes, and moving the rightmost leaf to the position of the deleted leaf
in time $O(\log^2 n)$ to balance the tree.
Theorem 21 ([11]) Let $P$ be an initially empty set of points moving along linear trajectories
in $\mathbb{R}^1$. Let $\sigma = \sigma_1, \ldots, \sigma_m$ be a sequence of $m$ operations $\sigma_i$ of the form INSERT$(p, t_i)$ and
DELETE$(p, t_i)$, such that for any two operations $\sigma_i, \sigma_j$ with $i < j$ we have $t_i < t_j$ (the operations
are performed sequentially in time). An INSERT$(p, t_i)$ inserts at time $t_i$ point $p$ into $P$. A
DELETE$(p, t_i)$ removes $p$ from $P$ at time $t_i$. A kinetic tournament tree maintains the biggest
element of $P$. It requires $O(\log m)$ time to process an event and the expected number of events
is $O(m \log m)$. Insertions and deletions are performed in expected time $O(\log^2 m)$.
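The event mechanism of this section can be illustrated by a minimal sketch. It is not the randomized construction of [11]: it assumes a static point set (no INSERT/DELETE), a fixed balanced tree in heap layout, and it propagates a change to the ancestors directly instead of queueing it as a separate event. The names `MovingPoint`, `KineticTournament` and `overtake_time` are hypothetical.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class MovingPoint:
    x0: float  # position at time 0
    v: float   # constant velocity

    def at(self, t):
        return self.x0 + self.v * t

def overtake_time(winner, loser, now):
    """Time after `now` at which `loser` passes `winner`, or None if never."""
    dv = loser.v - winner.v
    if dv <= 0:
        return None
    t = (winner.x0 - loser.x0) / dv
    return t if t > now else None

class KineticTournament:
    """Static max-tournament over linearly moving points (no insert/delete)."""

    def __init__(self, points, t0=0.0):
        self.pts = list(points)
        self.size = 1
        while self.size < len(self.pts):
            self.size *= 2
        self.win = [None] * (2 * self.size)   # winner index stored at each node
        for i in range(len(self.pts)):
            self.win[self.size + i] = i
        self.stamp = [0] * (2 * self.size)    # invalidates outdated certificates
        self.queue = []                       # entries (failure time, node, stamp)
        self.now = t0
        self.events = 0
        for v in range(self.size - 1, 0, -1):
            self._replay(v)

    def _replay(self, v):
        """Recompute the winner of node v at time self.now and reschedule its event."""
        a, b = self.win[2 * v], self.win[2 * v + 1]
        old = self.win[v]
        self.stamp[v] += 1
        if a is None or b is None:
            self.win[v] = a if a is not None else b
            return self.win[v] != old
        key = lambda i: (self.pts[i].at(self.now), self.pts[i].v)
        w, l = (a, b) if key(a) >= key(b) else (b, a)
        self.win[v] = w
        t = overtake_time(self.pts[w], self.pts[l], self.now)
        if t is not None:
            heapq.heappush(self.queue, (t, v, self.stamp[v]))
        return w != old

    def advance(self, t):
        """Process all events up to time t in chronological order."""
        while self.queue and self.queue[0][0] <= t:
            when, v, stamp = heapq.heappop(self.queue)
            if stamp != self.stamp[v]:
                continue                      # certificate was replaced meanwhile
            self.now = when
            a, b = self.win[2 * v], self.win[2 * v + 1]
            self.win[v] = b if self.win[v] == a else a   # the children swap order
            self.stamp[v] += 1
            self.events += 1
            u = v // 2
            while u >= 1:                     # the change may propagate upwards
                if self._replay(u):
                    self.events += 1
                u //= 2
        self.now = t

    def maximum(self):
        return self.win[1]                    # index of the current maximum point
```

After a swap at a node the new winner is the faster point, so under linear motion the loser can never overtake it again and no new certificate is needed at that node; only the ancestors are re-examined.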
7.2 Approximating the Bounding Cube
Our data structure uses a data structure from Agarwal, Har-Peled, and Varadarajan [2]:
Theorem 22 ([2]) Let $P$ be a set of $n$ points moving in $\mathbb{R}^d$. If $P$ is moving linearly, then after
$O(n)$ preprocessing, we can construct a kinetic data structure of size $O(1)$ that maintains a
2-approximation of the smallest orthogonal box containing $P$. The data structure processes $O(1)$
events, and each event takes $O(1)$ time. The sides of the maintained box are moving linearly
between the events.
It can be decided in constant time if a flight plan update of a point $p$ changes the data structure.
At each point of time only flight plan updates of a constant number of points can potentially
change the data structure.
We use this data structure to efficiently maintain a bounding cube $B$ of $P$ having the following
properties.
• All points lie in $B$ during the whole movement of points.
• There is always one dimension, such that the extent of the point set in the dimension is at
least half the side length of $B$.
• The data structure to maintain the bounding cube processes $O(d^2)$ events.
• Between these events all side borders of the bounding cube are moving linearly.
• Given a flight plan update we can decide in time $O(d)$ if the data structure has to be
updated.
• At each point of time only flight plan updates of $O(d)$ points can potentially change the
data structure.
For each dimension we maintain a 2-approximation of the 1-dimensional extent using the
KDS from [2] (see Theorem 22 above). Using these approximations we can easily maintain an
approximation $B$ of a smallest bounding cube of $P$ by maintaining a cube having the largest
extent as side length.
Each extent data structure processes $O(1)$ events. Between such events the dimension that
defines the size of the bounding cube can change although none of the one-dimensional extent data
structures processes an event. This happens when the size of the bounding cube is determined by
a new one-dimensional extent. This can happen at most $d$ times. Therefore, we can have at most
$O(d^2)$ events.
We store the approximation of each extent in a kinetic tournament tree whose priority is given
by the width of the 1-dimensional approximations.
7.3 The Kinetic Data Structure for MaxCut
Our kinetic data structure for MaxCut uses the data structure of the last section to maintain a
bounding cube $B$. This data structure processes $O(d^2)$ events, which we call major events.
Between all major events we will set up a data structure to maintain a coreset under a linear
movement of the bounding box $B$. We will use the technique described in Section 5.6.3 to
construct coresets for oblivious optimization problems using point samples. Recall that MaxCut
(with objective function scaled by $1/n$) is an oblivious optimization problem which is $\ell$-Lipschitz
with $\ell = 1$ and $\lambda$-mean preserving with $\lambda = 1/4$.
The coreset technique assumes that all points always lie within $[0, 1]^d$ and that we have a lower
bound $1/\tilde{\varepsilon}$ on Opt (recall that Opt denotes the cost of an optimum cut divided by $n$). This can
easily be achieved in the following way: Between two major events we know that the movement
of the bounding cube $B$ is linear. At each major event we scale and move all coordinates in a way
such that the bounding cube $B$ is $[0, 1]^d$. Since the scaling function is linear in the time $t$, after
this scaling still all point movements are linear.
Since there is always one dimension such that the extent of the point set in that dimension is
at least half the side length of the bounding cube, the sum of distances of the two points realizing
this extent to the center of gravity is at least $1/2$. Since MaxCut is $1/4$-mean preserving, we know
that $\mathrm{Opt} \ge 1/\tilde{\varepsilon}$ for $\tilde{\varepsilon} = 8$.
In the following we assume that all points lie in $[0, 1]^d$ and that the bounding cube $B = [0, 1]^d$
does not move.
We will maintain $Z$ nested square grids $G_0, \ldots, G_Z$ for
\[ Z = \log\left(\frac{8 \cdot 10^{d+1} \cdot 4^d \cdot n (1+\log n) \cdot d^{(d+1)/2}}{\varepsilon^{d+1}}\right) + 1 . \]
Grid $G_i$ partitions $[0, 1]^d$ into $2^{id}$ cells of side length $1/2^i$. We assume that these cells are
numbered from $1$ to $2^{id}$ and that the index of a cell can be computed efficiently from its coordinates
and that the neighbors of a cell can also be computed efficiently.
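One possible numbering that satisfies these assumptions is a row-major numbering of the integer cell coordinates. The following sketch is only an illustration; the function names and the choice of axis-neighbors (cells sharing a facet) are assumptions and are not fixed by the text.

```python
def cell_coords(p, i):
    """Integer coordinates of the cell of grid G_i that contains p in [0, 1]^d."""
    side = 2 ** i
    # a coordinate equal to 1.0 is assigned to the last cell
    return tuple(min(int(x * side), side - 1) for x in p)

def cell_index(coords, i):
    """Row-major number of a cell in G_i, in the range 1 .. 2^(i*d)."""
    idx = 0
    for c in coords:
        idx = idx * (2 ** i) + c
    return idx + 1

def neighbors(coords, i):
    """Cells of G_i that share a facet with the given cell."""
    side = 2 ** i
    result = []
    for k, c in enumerate(coords):
        for step in (-1, 1):
            if 0 <= c + step < side:
                result.append(coords[:k] + (c + step,) + coords[k + 1:])
    return result
```

Both the index and the neighbor list are computed in time $O(d)$ per cell, without storing the grid explicitly.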
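Let $j_0 := \lfloor \log(8 \cdot n \cdot d) \rfloor$. For $j \in \{0, \ldots, j_0\}$ let
\[ \delta = \delta(j) = \frac{\varepsilon^{d+1}}{8 \cdot (1+\log n) \cdot d^{(d+1)/2} \cdot 40^{d+1}} \cdot 2^j . \]
For each value of $j \in \{0, \ldots, j_0\}$ and each grid $G_i$ we create a sample $S_{i,j}$, each point chosen
independently with probability $p_{i,j} = \min\{\frac{\alpha}{\delta \cdot 2^i}, 1\}$, where $\alpha = 6 \cdot \varepsilon^{-2} \cdot \ln(2 \cdot Z \cdot 2^{Z \cdot d} \cdot j_0)$.
To simplify the notation we set
\[ K := \frac{6 \cdot \ln(2 \cdot Z \cdot 2^{Z \cdot d} \cdot j_0) \cdot 8 \cdot (1+\log n) \cdot d^{(d+1)/2} \cdot 40^{d+1}}{\varepsilon^{d+3}} = \frac{\alpha \cdot 2^j}{\delta} . \]
Notice that $K$ is polylogarithmic in $n$ and does not depend on $i$ or $j$, and that
\[ p_{i,j} = \min\left\{ \frac{K}{2^{i+j}}, 1 \right\} . \]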
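The sampling scheme can be sketched as follows. The value of `K` is passed in as a plain number (an assumption for illustration; in the text it is determined by $n$, $d$, $\varepsilon$, $Z$ and $j_0$), and the helper names are hypothetical.

```python
import random

def sample_probability(K, i, j):
    """p_{i,j} = min(K / 2^(i+j), 1)."""
    return min(K / 2 ** (i + j), 1.0)

def draw_samples(points, K, Z, j0, rng):
    """One independent sample S[(i, j)] for each grid index i and each j."""
    samples = {}
    for i in range(Z + 1):
        for j in range(j0 + 1):
            p = sample_probability(K, i, j)
            samples[(i, j)] = [q for q in points if rng.random() < p]
    return samples

def expected_memberships(K, Z, j0):
    """Expected number of sets S[(i, j)] that contain one fixed point."""
    return sum(sample_probability(K, i, j)
               for i in range(Z + 1) for j in range(j0 + 1))
```

Because the probabilities decay geometrically in both $i$ and $j$, `expected_memberships` is bounded by $4K$ regardless of $Z$ and $j_0$.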
Following the ideas of Section 5.4.3 there is at least one value of $\delta(j)$, such that the coreset
constructed using this value of $\delta = \delta(j)$ is smaller than a bound $S = O(\log n/\varepsilon^{d+1})$ (Corollary
5.6.15), and an $O(\varepsilon)$-coreset for MaxCut (Lemma 5.6.18). We can easily identify such a value
of $\delta$ by taking the coreset constructed with the smallest value of $\delta$, such that the size is at most $S$.
For each $1 \le i \le Z$, we maintain under the movement of points
• the set of all cells containing sample points, and
• the number of sample points in each non-empty cell.
A $(1+\varepsilon)$-approximate solution on the coreset can be efficiently computed using the algorithm
of Section 5.5.3 in time $O(\log^2 n \cdot 2^{(1/\varepsilon)^{O(1)}})$.
The data structure. At each major event we calculate a new linear scaling of the movements
of points, such that all points lie in the scaled bounding cube $B = [0, 1]^d$ as described earlier. We
will then set up the following data structure and maintain it until the next major event triggers.
We assume that the cells in grid $G_i$ are numbered from $1$ to $2^{id}$. For each sample set $S_{i,j}$ we
maintain a search tree $T_{i,j}$ that stores the cells in grid $G_i$ that contain at least one point from $S_{i,j}$.
For each non-empty cell we maintain $2d$ kinetic tournament trees. For $1 \le k \le d$ we maintain
one kinetic max-tournament tree and one kinetic min-tournament tree, where the priority of points
is given by their $k$-th coordinate. To implement these tournament trees we use kinetic tournament
trees that efficiently support insertions and deletions of points (see Section 7.1).
The events. In addition to the events caused by the kinetic tournament trees, our data structure
stores the following (possible) events in a global event queue. For each grid $G_i$ and each
non-empty cell we have an event for each dimension $k$, $1 \le k \le d$, when the maximum or
minimum point with respect to that dimension crosses the corresponding cell boundary in that
dimension. These events are called minor events. Additionally, we have major events that occur
when an event changes the movement of the bounding cube. At major events the whole
data structure is updated as defined above.
Every event in a kinetic tournament tree is called a tournament event.
Time to process events. We first consider minor events. In every minor event, a point $p$ in
some set $S_{i,j}$ moves from one cell $C_1$ of the grid into another cell $C_2$. Therefore, $p$ is deleted from
the $2d$ tournament trees corresponding to $C_1$ and is inserted into the $2d$ tournament trees corresponding
to $C_2$. If the point moves into a cell that was previously empty, we must insert the cell index of
$C_2$ into the search tree $T_{i,j}$ and initialize the $2d$ tournament trees. If $p$ was the only point in $C_1$ we
have to delete the $2d$ tournament trees. Since (cf. Section 7.1) in $O(\log^2 n)$ time one can insert a
point into a tournament tree or search tree and since any insertion into a kinetic tournament tree creates
$O(\log n)$ new events in expectation, we get:
Lemma 7.3.1 Any minor event can be processed in $O(d \log^2 n)$ time. It creates $O(d \log n)$ new
events in kinetic tournament trees in expectation.
Next we want to analyze the running time required to set up the whole data structure at major
events.
Lemma 7.3.2 Any major event can be processed in expected time $O(d \cdot K \cdot n \cdot \log n)$.
Proof: The time to set up our data structure at a major event is dominated by the time to set up
the kinetic tournament trees for the boundary events. Since each kinetic tournament tree consisting
of $m$ points can be constructed in time $O(m \cdot \log m)$ we have to count the number of sample
points in all kinetic tournament trees. Each sample point is inserted into $2d$ kinetic tournament
trees. The expected number of points in $S_{i,j}$ is at most $Kn/2^{i+j}$. By linearity of expectation we get that
the total number of points in all kinetic tournament trees is
\[ \sum_{i,j} 2d \cdot \frac{Kn}{2^{i+j}} = O(dKn) . \]
□
Remark 7.3.3 Instead of setting up the data structure at major events, we can precompute in
time $O(d^2 \cdot n)$ all times of major events and the corresponding scaled positions and velocities
of the points. From that we can set up all data structures at the beginning instead of setting
them up at major events. Using this technique we get a total expected setup time of
$O(d^3 \cdot K \cdot n \cdot \log n)$ and can process major events in time $O(1)$ by just switching to the
precomputed structure.
Analysis of the number of events. In Section 7.2 we showed:
Lemma 7.3.4 There are at most $O(d^2)$ major events. □
Since (after scaling the points and velocities at major events) the bounding cube $B = [0, 1]^d$ is
fixed between major events, it follows that for every grid $G_i$ the boundaries of the grid cells are
fixed as well between major events. Therefore we get:
Lemma 7.3.5 During the linear motion between major events every point crosses at most
$d \cdot (2^i - 1)$ cell boundaries in grid $G_i$.
Proof: Let us consider an arbitrary point $p$. We regard the cell boundaries in each dimension
separately. In grid $G_i$ we have $2^i - 1$ internal boundaries. Since $p$ moves linearly in time, $p$ can
cross each boundary at most once. Since this can happen in each of the dimensions, the lemma
follows. □
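The counting argument can be made concrete by computing the crossing times explicitly. The sketch below uses exact rational arithmetic; the function names and the half-open time interval are assumptions made for the illustration.

```python
from fractions import Fraction

def crossing_times(x0, v, i, t_start, t_end):
    """Times in (t_start, t_end] at which x(t) = x0 + v*t hits a boundary m / 2^i."""
    if v == 0:
        return []
    side = 2 ** i
    # one candidate time per internal boundary, hence at most 2^i - 1 crossings
    times = [(Fraction(m, side) - x0) / v for m in range(1, side)]
    return sorted(t for t in times if t_start < t <= t_end)

def minor_event_times(p0, vel, i, t_start, t_end):
    """All (time, dimension) pairs at which a point changes its cell in grid G_i."""
    events = []
    for k, (x0, v) in enumerate(zip(p0, vel)):
        events += [(t, k) for t in crossing_times(x0, v, i, t_start, t_end)]
    return sorted(events)
```

Since every internal boundary contributes at most one time per dimension, the list returned by `minor_event_times` never has more than $d \cdot (2^i - 1)$ entries.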
Corollary 7.3.6 The expected number of minor events is $O(d^3 K n \cdot Z)$.
Proof: The expected number of minor events involving points from $S_{i,j}$ is at most
$\frac{Kn}{2^{i+j}} \cdot d \cdot 2^i = dKn/2^j$ between two major events. Summing up over all $i, j$ and multiplying it with the number
$O(d^2)$ of major events, we get that there are at most $O(d^3 K n Z)$ events. □
Corollary 7.3.7 The expected number of tournament events is $O(d^4 \cdot K \cdot n \cdot Z \cdot \log n)$.
Proof: Every minor event creates a number of $O(d \log n)$ new events in kinetic tournament
trees. Linearity of expectation implies that the expected number of events in kinetic tournament
trees is $O(d^4 \cdot K \cdot n \cdot Z \cdot \log n)$. □
Flight plan updates. In KDS it is typically assumed that at certain points of time the “flight
plan” of an object can change. At every such flight plan update the data structure is notified that
a point now moves in another direction (possibly at a different speed). Such a flight plan update
compels us to update all events in the event queue that involve this particular point. In our case
we distinguish between two types of points. First, there is a constant number of points on which
flight plan updates change the data structure maintaining the bounding cube. If the movement of
one of these points is changed we have to set up a whole new data structure in time $O(d^3 \cdot K \cdot n)$.
If the flight plan of any other point is updated the movement of the bounding cube and the
scaling of the points stay the same. We simply have to update all events the point is involved
in. Since it requires $O(\log^2 n)$ time to update a randomized kinetic tournament tree we have to
compute the expected number of such kinetic tournament trees a point is involved in. Every point
is stored in $2d$ kinetic tournament trees for each set $S_{i,j}$ it is contained in. We get
Lemma 7.3.8 Let $p \in P$ be an arbitrary point. Then $p$ is stored in $O(dK)$ kinetic tournament
trees in expectation.
Proof: The probability that $p$ is contained in set $S_{i,j}$ is at most $K/2^{i+j}$. Hence, we get that the
expected number of sets $S_{i,j}$ that contain $p$ is
\[ \sum_{i,j} \frac{K}{2^{i+j}} = \sum_i \frac{1}{2^i} \sum_j \frac{K}{2^j} \le \sum_i \frac{2K}{2^i} = O(K) . \]
The lemma follows from the fact that every point is stored in $2d$ kinetic tournament trees for each
set $S_{i,j}$ it is contained in. □
Assume we fix some point of time and specify for each point an arbitrary flight plan update.
If we choose one of these updates uniformly at random then the expected time to perform the
update is small, i.e., the average cost of a flight plan update is low.
Lemma 7.3.9 A flight plan update can be done in $O(d \cdot (d^3 + \log n) \cdot \log n \cdot K)$ average expected
time.
Proof: It requires $O(\log^2 n)$ time to do a flight plan update of a kinetic tournament tree. In
expectation every point is stored in $O(dK)$ kinetic tournament trees. Hence the expected time
required to update these tournament trees is $O(dK \log^2 n)$. Additionally, we have to deal with
updates of the $O(d)$ points that change the bounding cube structure. We can process such an
update in $O(d^3 \cdot K \cdot n \cdot \log n)$ time. Averaging over all points we get that the average expected
update time for this kind of update is $O(\frac{d}{n} \cdot d^3 K n \log n) = O(d^4 \cdot K \cdot \log n)$. □
Extracting the coreset. We describe how to extract a small coreset from our data structure.
We try to find a value of $j$ such that the coreset constructed with the value $\delta = \delta(j)$ has size
at most $S = O(\log n/\varepsilon^{d+1})$ (Corollary 5.6.15), and that the size of the coreset constructed with
$\delta = \delta(j-1)$ is greater than $S$. By Corollary 5.6.15 and Lemma 5.6.18 this coreset is then with
probability $1 - \psi/j_0$ an $\varepsilon$-coreset for MaxCut.
Testing the coreset size for one value of $j \in \{0, \ldots, j_0\}$ and extracting one coreset of size $S$ can
be done in time $O(\log^2 n/\varepsilon^{d+1})$:
Since (according to the proof of Lemma 5.4.7) there are at most $O(\log n/\varepsilon^{d+1})$ heavy cells in
each grid $G_i$, we have a bound of $O(\log^2 n/\varepsilon^{d+1})$ on the number of heavy cells. The marking
process leading to the actual coreset can be seen as a quadtree traversal. Since each inner node
of this tree corresponds to a heavy cell, the tree traversal can be done in time $O(\log^2 n/\varepsilon^{d+1})$.
When we see that we have too many heavy cells we stop the process.
We can search for the right value of $j$ using a binary search. This takes a total time of
$O(\log^2 n \cdot \log\log n/\varepsilon^{d+1})$. Since we construct fewer than $j_0$ coresets, the whole process works
with probability $1 - \psi$.
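A minimal sketch of this binary search is given below. The callback `coreset_size(j)` is hypothetical and stands for the size test described above; the sketch only assumes that the coreset for $j_0$ is small enough, and it returns an index $j$ whose coreset has size at most $S$ while the coreset for $j-1$ is larger than $S$ (or $j = 0$).

```python
def find_j(coreset_size, j0, S):
    """Return j with coreset_size(j) <= S and (j == 0 or coreset_size(j - 1) > S)."""
    if coreset_size(0) <= S:
        return 0
    if coreset_size(j0) > S:
        raise ValueError("no value of j yields a coreset of size at most S")
    lo, hi = 0, j0          # invariant: size(lo) > S and size(hi) <= S
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if coreset_size(mid) <= S:
            hi = mid
        else:
            lo = mid
    return hi
```

The invariant on `lo` and `hi` is all the search needs, so it performs $O(\log j_0)$ size tests and does not rely on the sizes being monotone in $j$.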
Computing an approximate MaxCut from the coreset. According to Section 5.5.3 we
can compute a $(1 \pm \varepsilon)$-approximate MaxCut solution from the coreset in $O(\log^2 n \cdot 2^{(1/\varepsilon)^{O(1)}})$
time.
We can apply this algorithm to the computed coreset and obtain our main result. In the theorem
we assume that $d$ is a constant.
Theorem 23 There is a kinetic data structure that maintains a $(1+\varepsilon)$-approximation for the
Euclidean MaxCut problem, which is correct with probability $1 - \psi$. The data structure answers
queries of the form 'to which side of the partition does query point $p$ belong?' in
$O(\log^2 n \cdot \log\log n \cdot \varepsilon^{-2(d+1)} \cdot 2^{1/\varepsilon^{O(1)}})$ time. Under linear motion the data structure processes an expected number of
$\tilde{O}(n \log(\psi^{-1})/\varepsilon^{d+3})$ events, each of which requires $O(\log^2 n)$ time. A flight plan update can be performed in
$\tilde{O}(\log^4 n \cdot \log(\psi^{-1})/\varepsilon^{d+3})$ average expected time, where the average is taken over the worst case update times of the points
at an arbitrary point of time. The data structure needs an expected setup time of
$\tilde{O}(n \cdot \log(\psi^{-1})/\varepsilon^{d+3})$. □
8 An Efficient k-Means Implementation using
Coresets
In this chapter we develop an efficient k-means clustering algorithm called CoreMeans. The
main new idea is that the algorithm uses the coreset construction of Section 5.3 to speed up the
computation of the clustering. Our algorithm first computes a small coreset of the input points
and then runs a variant of KMHybrid [104, 83], which is a combination of Lloyd’s algorithm and
random swaps. Then the algorithm doubles the size of the coreset and runs for a few steps on this
coreset. This process is done until the coreset coincides with the whole point set. The coreset
computation is supported by a quadtree (or higher dimensional equivalent) based data structure.
This data structure can also be used to speed up nearest neighbor queries.
We will compare our algorithm with algorithm KMHybrid [104, 83] in Section 8.3. On most
of the input instances our algorithm significantly outperforms KMHybrid, especially for low
dimensional instances. For high dimensional instances our algorithm finds good solutions faster
but KMHybrid’s solution after a few seconds is slightly better. If we want to compute a clustering
for one value of $k$ the running time of both algorithms is often dominated by the setup time to
compute auxiliary data structures. In this case CoreMeans benefits from its smaller setup time.
However, in many applications we do not know the right value of $k$ in advance. In such a
case one has to compute clusterings for many different values of $k$. Then one can use a quality
measure independent of $k$ to find the best clustering. A prominent quality measure for such
a scenario is the average silhouette coefficient [86]. Although there are no theoretical guarantees
for the average silhouette coefficient, it is often used to evaluate the quality of different
clusterings. Unfortunately, computing the average silhouette coefficient for one clustering takes time
quadratic in the number of points, which is not feasible for point sets of medium and large size.
However, we would like to compute the silhouette coefficient for many values of $k$.
In this situation we can see the real strength of coresets. Using coresets it is possible to find
clusterings and compute their average silhouette coefficient for large point sets and many values
of $k$. For example, we computed clusterings for $k = 1, \ldots, 100$ and approximated their average
silhouette coefficient for a set of more than 4.9 million points in 3D consisting of the RGB values
of an image in a few seconds on one core of an Intel Pentium D dual core processor with 3 GHz
core frequency. In higher dimensions we did the same computations for an (artificially created)
point set of 300,000 points in 20 dimensions for all values of $k$ between $1$ and $100$ in less than
8 minutes. Without coresets, the computation of the silhouette coefficient even for one value of $k$
takes several hours.
First, we develop some notation and introduce the basic k-means method.
8.1 Definitions and Notations
In this chapter we will deal with weighted and unweighted sets of points. We will always assume
that the set $P$ represents the input instance, which is an unweighted set of $n$ points in $\mathbb{R}^d$.
A weighted point set will usually be denoted by $R$. The weights of the points in $R$ are given by
$w(r)$ for every $r \in R$. We will only consider integer weights in this paper. For each point $p \in \mathbb{R}^d$
let $p^{(j)}$ denote its $j$-th coordinate.
Recall the definition of the k-means clustering problem from Section 2.3.
Two easy characterizations of an optimal k-means solution are known. We state them in
Lemma 8.1.1 and Lemma 8.1.2:
Lemma 8.1.1 Let $C = \{c_1, \ldots, c_k\}$ be a set of fixed cluster centers with $|C| = k$. Let $C_1, \ldots, C_k$
be a partition of $P$ which fulfills
\[ \forall p \in C_i \;\; \forall j \in \{1, \ldots, k\}: \quad d(p, c_i) \le d(p, c_j) . \]
Then the partition $C_1, \ldots, C_k$ minimizes the $k$-means objective function for the fixed set of
centers $C$. □
We write Means$(P, C)$ to denote the cost of an optimal partition of $P$ with respect to the
centers in $C$. In a similar way we write Means$(R, C)$ to denote the cost of an optimal partition of a
weighted set $R$ with respect to $C$.
Conversely, we can easily find the best centers for a given partition. The result is
also well known (and proven in many publications like [94]).
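Lemma 8.1.2 Let $C_1, \ldots, C_k$ be a fixed partition of $P$. For all $i \in \{1, \ldots, k\}$ let $c_i := \mu(C_i)$
be the center of gravity of $C_i$ as defined in Section 2.1. Then the centers $c_1, \ldots, c_k$ minimize the
$k$-means objective function for the fixed partition $C_1, \ldots, C_k$ of $P$.
Proof: We can do the minimization within each cluster separately. Therefore we have to show
for a set of points $Q$ that the function $f: \mathbb{R}^d \to \mathbb{R}$ with $f(r) = \sum_{q \in Q} d(q, r)^2$ is minimized at
$r = \mu(Q)$.
We have
\[ f(r) = \sum_{q \in Q} d(q, r)^2 = \sum_{q \in Q} \sum_{i=1}^{d} (q^{(i)} - r^{(i)})^2 = \sum_{i=1}^{d} \sum_{q \in Q} (q^{(i)} - r^{(i)})^2 . \]
Therefore we can minimize each dimension separately: For $i \in \{1, \ldots, d\}$ let
\[ f_i(x) = \sum_{q \in Q} (q^{(i)} - x)^2 . \]
When we choose $r^{(i)}$ such that $f_i(r^{(i)}) = \sum_{q \in Q} (q^{(i)} - r^{(i)})^2$ is minimized, then $r$ minimizes $f$.
Let $i \in \{1, \ldots, d\}$ be fixed. Obviously $f_i: \mathbb{R} \to \mathbb{R}$ is continuously differentiable. Furthermore
$\lim_{x \to \infty} f_i(x) = \infty$ and $\lim_{x \to -\infty} f_i(x) = \infty$. Therefore $f_i$ must have a global minimum, and at
that global minimum we must have $f_i'(x) := \frac{d}{dx} f_i(x) = 0$.
We observe:
\[ f_i'(x) = \sum_{q \in Q} 2 \cdot (x - q^{(i)}) . \]
Therefore we have
\[ f_i'(x) = 0 \;\Leftrightarrow\; \sum_{q \in Q} 2 \cdot (x - q^{(i)}) = 0 \;\Leftrightarrow\; \left(\sum_{q \in Q} x\right) - \left(\sum_{q \in Q} q^{(i)}\right) = 0 \]
\[ \;\Leftrightarrow\; |Q| \cdot x = \sum_{q \in Q} q^{(i)} \;\Leftrightarrow\; x = \frac{1}{|Q|} \sum_{q \in Q} q^{(i)} \;\Leftrightarrow\; x = \mu(Q)^{(i)} . \]
Therefore $\mu(Q)$ is the only point in $\mathbb{R}^d$ minimizing $f$. □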
8.1.1 The Basic k-Means Method
Based on the observations of Lemma 8.1.1 and Lemma 8.1.2 that it is easy to compute an optimal
partition for a fixed set of centers and an optimal set of centers for a fixed partition, a simple and
elegant clustering heuristic has been developed [41, 95, 97]. Nowadays, one often refers to this
heuristic as the $k$-means algorithm or Lloyd’s algorithm. This algorithm runs in iterations. At
the beginning of an iteration the algorithm has a set of $k$ centers $\{c_1, \ldots, c_k\}$. Every iteration
consists of two steps:
1. For every $p \in P$ compute its nearest center in $\{c_1, \ldots, c_k\}$. Partition $P$ into $k$ sets $C_1, \ldots, C_k$
such that $C_i$ contains all points whose nearest center is $c_i$ (and break ties arbitrarily).
2. For every cluster $C_i$ compute its center of gravity $\mu(C_i)$, i.e. the optimal center of that
cluster. Then set $c_i := \mu(C_i)$ for every $1 \le i \le k$.
Each iteration runs in $O(ndk)$ time. Typically, the algorithm runs for a fixed number of iterations
(standard values are in the range from 50 to 500). It is well known that the algorithm only
converges to a local optimum and that the quality of this solution depends strongly on the initial
set of centers. Therefore, the algorithm is usually repeated several times with different sets of
initial centers and the best discovered partition is returned.
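The two steps of one iteration can be sketched in plain Python as follows. This is a minimal illustration with $O(ndk)$ time per iteration; keeping the old center of an empty cluster is an assumption made here, since the text does not specify that case.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd_iteration(points, centers):
    """One iteration: assign every point to its nearest center, then recenter."""
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for p in points:                                   # step 1: nearest center
        nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
        clusters[nearest].append(p)
    new_centers = []
    for i, cluster in enumerate(clusters):             # step 2: center of gravity
        if cluster:
            d = len(cluster[0])
            new_centers.append(tuple(sum(p[j] for p in cluster) / len(cluster)
                                     for j in range(d)))
        else:
            new_centers.append(centers[i])             # assumption: keep old center
    return new_centers, clusters

def means_cost(points, centers):
    """k-means cost of the optimal partition for the given centers."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

By Lemma 8.1.1 and Lemma 8.1.2 neither step can increase the cost, so `means_cost` is non-increasing over the iterations.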
8.1.2 Algorithm KMHybrid
In the experiments we compare our algorithm to an algorithm called KMHybrid [104, 83].
KMHybrid combines Lloyd’s algorithm with swapping of centers (moving centers away to the
position of a random point to break local minima) and a variant of simulated annealing. The
algorithm does one swap followed by a sequence of Lloyd’s steps and an acceptance test. If the
current solution passes the test, the algorithm continues with the current solution. Otherwise, it
restores the old one. The acceptance test is based on a variant of simulated annealing. Addition-
ally, the algorithm uses a kD-tree to speed up nearest neighbor search in the Lloyd’s steps. For
more details see [104, 83].
8.1.3 The Silhouette Coefficient
In many applications the right number of clusters is not known in advance. Since the $k$-means
objective function drops monotonically as $k$ increases, one needs a different measure for the
quality of a clustering that is independent of $k$. Such a measure is provided by the average
silhouette coefficient [86] of the clustering. The silhouette coefficient of a point $p_i$ is computed
as follows. First compute the average distance $a_i$ of $p_i$ to the points in the same cluster as $p_i$. Then
for each cluster $C$ that does not contain $p_i$ compute the average distance from $p_i$ to all points in
$C$. Let $b_i$ denote the minimum average distance to these clusters. Then the silhouette coefficient
of $p_i$ is defined as $(b_i - a_i)/\max(a_i, b_i)$.
The value of the silhouette coefficient of a point varies between $-1$ and $1$. A value near $-1$
indicates that the point is clustered badly. A value near $1$ indicates that the point is well-clustered.
To evaluate the quality of a clustering one can compute the average silhouette coefficient of all
points.
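A direct implementation of this definition might look as follows. The sketch makes two assumptions that the text leaves open: the point itself is excluded when averaging over its own cluster, and a point in a singleton cluster (or with $\max(a_i, b_i) = 0$) gets the value $0$. It requires at least two clusters.

```python
import math

def silhouette(i, points, labels):
    """Silhouette coefficient of points[i] for the clustering given by labels."""
    p, own = points[i], labels[i]
    sums, counts = {}, {}
    for idx, (q, lab) in enumerate(zip(points, labels)):
        if idx == i:
            continue                       # assumption: skip the point itself
        sums[lab] = sums.get(lab, 0.0) + math.dist(p, q)
        counts[lab] = counts.get(lab, 0) + 1
    if own not in counts:
        return 0.0                         # assumption: singleton cluster
    a = sums[own] / counts[own]
    b = min(sums[lab] / counts[lab] for lab in counts if lab != own)
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

def average_silhouette(points, labels):
    """Average over all points; takes time quadratic in the number of points."""
    return sum(silhouette(i, points, labels) for i in range(len(points))) / len(points)
```

Each call to `silhouette` scans all points once, which is the source of the quadratic running time of `average_silhouette`.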
8.2 The Algorithm
We first provide a high level description of our algorithm and then we give some more details on
the implementation.
Our algorithm uses the observation that the first iterations of k-means algorithms or swapping
heuristics do not need the accuracy of the whole point set. Our algorithm uses the coreset con-
struction technique of Section 5.3 to reduce the complexity of the point set. The hope is that the
first iterations of iterative algorithms can be done much more efficiently on small coresets, still
significantly improving the objective function.
The algorithm starts by computing a coreset of size roughly $2k$ and chooses $k$ points from this
coreset as a starting solution. Then it repeats $\max\{40/k, 2\}$ times (see footnote 1) the following two steps,
which form the main loop of the algorithm.
• First it runs Lloyd’s algorithm for $d$ steps. After this, the current solution is compared to
the previously best and the algorithm continues with the better of these solutions.
• In the second step, the algorithm chooses a number $k'$ between $0$ and $k$ uniformly at
random. Then it picks centers from the current set of centers according to the following
probability distribution until $k'$ different centers are chosen. The probability that the center
of cluster $C_j$ is chosen is $\frac{1}{|C_j| \cdot \sum_{i=1}^{k} \frac{1}{|C_i|}}$, where $C_1, \ldots, C_k$ denotes the current clustering.
Therefore centers of smaller clusters are picked with higher probability. Finally, these $k'$
random centers are replaced by points chosen uniformly at random from the coreset.
Footnote 1: The value comes from the idea that the time to extract the coreset should be comparable to the time to run
iterations on the coreset. Without a dependency on $k$ here, the coreset extraction time dominates the algorithm
runtime for small values of $k$ at the beginning of the algorithm. The special value $40/k$ has been empirically
adjusted.
COREMEANS($P$, $k$)
    $m \leftarrow 2k$
    while $m \le |P|$ do
        Compute coreset of size $m$.
        if $m = 2k$ then
            $C \leftarrow k$ random points from coreset
            $K \leftarrow C$
        repeat $\max\{40/k, 2\}$ times (* Main loop *)
            Do $d$ iterations of Lloyd’s algorithm starting with $C$
            Let $C$ denote the current solution
            if Means$(P, K) <$ Means$(P, C)$ then
                $C \leftarrow K$
            $K \leftarrow C$
            Choose $k'$ randomly from $\{0, \ldots, k\}$
            Swap $k'$ centers from $C$ with points chosen uniformly from the coreset
        $m \leftarrow 2 \cdot m$
Figure 8.1: The COREMEANS algorithm
After the main loop is finished the algorithm doubles the size of the coreset and continues with
the main loop. This is done until the coreset is the whole point set. The algorithm is given in
Figure 8.1.
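The distribution used to select the centers that are swapped out can be sketched as follows. The sketch assumes that no cluster is empty; the function names are hypothetical, and the rejection loop is only one simple way to obtain $k'$ different centers.

```python
import random

def swap_probabilities(cluster_sizes):
    """Probability of picking center j: (1/|C_j|) / sum_i (1/|C_i|)."""
    inverse = [1.0 / s for s in cluster_sizes]   # assumes every cluster is non-empty
    total = sum(inverse)
    return [x / total for x in inverse]

def pick_centers_to_swap(cluster_sizes, k_prime, rng):
    """Draw from the distribution until k_prime different centers are chosen."""
    probs = swap_probabilities(cluster_sizes)
    chosen = set()
    while len(chosen) < k_prime:
        chosen.add(rng.choices(range(len(probs)), weights=probs)[0])
    return chosen
```

For cluster sizes 1, 2 and 4 the probabilities are 4/7, 2/7 and 1/7, so the center of the smallest cluster is the most likely one to be replaced.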
To support efficient computation of coresets and to speed up nearest neighbor queries in
Lloyd’s algorithm we use a quadtree or its higher dimensional equivalent. Our approach is
analogous to the kD-tree algorithm from [83]. The root of the quadtree corresponds to a bounding
box of the point set. With each occupied cell $B$ associated with a node of the quadtree we store:
• The number $n_B = |P \cap B|$ of points contained in the cell,
• A vector $b_B = (b_B^{(1)}, \ldots, b_B^{(d)})$ with $b_B^{(i)} := \sum_{p \in P \cap B} p^{(i)}$,
• The sum of the squared $\ell_2$-norms of the point vectors
$e_B := \sum_{p \in P \cap B} \|p\|_2^2 = \sum_{p \in P \cap B} \sum_{i=1}^{d} (p^{(i)})^2$.
This information can be used to quickly compute the exact cost of the partition of $P$ that
corresponds to a given partition of the coreset. The same precomputations have also been used in
[83, 124]. Since all points from one cell are assigned to the same coreset point and hence to the
same cluster, we can compute the cost of a cluster $C$ in the following way. Let $B_1, \ldots, B_l$ be the
disjoint cells assigned to $C$. We first compute the center of gravity $c := \mu(C)$ using the formula
\[ \mu(C)^{(i)} = \frac{1}{|C|} \cdot \sum_{p \in C} p^{(i)} = \frac{1}{|C|} \cdot \sum_{j=1}^{l} \sum_{p \in P \cap B_j} p^{(i)} = \frac{\sum_{j=1}^{l} b_{B_j}^{(i)}}{\sum_{j=1}^{l} n_{B_j}} . \]
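Then we have to compute
\[ \sum_{p \in C} d(p, c)^2 = \sum_{p \in C} \sum_{i=1}^{d} (p^{(i)} - c^{(i)})^2 = \sum_{p \in C} \sum_{i=1}^{d} \left( (p^{(i)})^2 - 2 p^{(i)} c^{(i)} + (c^{(i)})^2 \right) \]
\[ = \sum_{j=1}^{l} e_{B_j} - 2 \cdot \sum_{i=1}^{d} c^{(i)} \cdot \sum_{j=1}^{l} b_{B_j}^{(i)} + \sum_{j=1}^{l} n_{B_j} \cdot \left( \sum_{i=1}^{d} (c^{(i)})^2 \right) , \]
which can be done efficiently by using the information stored with each cell. Notice that the
runtime of this computation only depends on the number of cells, not on the number of points
inside the cells.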
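The computation from cell statistics can be sketched as follows. A cell is represented here by a triple $(n_B, b_B, e_B)$; this representation and the function names are assumptions made for the illustration.

```python
def cell_stats(points):
    """Statistics (n_B, b_B, e_B) of the points in one cell."""
    d = len(points[0])
    n = len(points)
    b = [sum(p[i] for p in points) for i in range(d)]
    e = sum(x * x for p in points for x in p)
    return n, b, e

def cluster_cost(cells):
    """Center of gravity and k-means cost of the cluster formed by the given cells."""
    d = len(cells[0][1])
    n = sum(c[0] for c in cells)
    b = [sum(c[1][i] for c in cells) for i in range(d)]
    e = sum(c[2] for c in cells)
    center = [x / n for x in b]
    # sum of e_B  -  2 * sum_i c_i * b_i  +  n * ||c||^2
    cost = e - 2 * sum(center[i] * b[i] for i in range(d)) + n * sum(x * x for x in center)
    return center, cost
```

For the cells $\{(0,0), (2,0)\}$ and $\{(4,0)\}$ this yields the center $(2, 0)$ and the cost $8$, the same value as summing the squared distances point by point.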
In our implementation we fixed the depth of the quadtree to 11 (see footnote 2). To build the tree we
proceeded bottom-up. We identify the non-empty cells in the grid corresponding to the 11-th level
of the tree. The non-empty cells are stored in a hash table with cell coordinates as keys. After
we have computed the non-empty cells in the finest grid we iterate over these cells and compute
all non-empty cells in the next coarser grid together with the corresponding point statistics.
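The bottom-up construction can be sketched with one dictionary per level. The sketch assumes that the points have already been scaled into $[0, 1]^d$ and stores the statistics $(n_B, b_B, e_B)$ per cell; the function name and the data layout are hypothetical and not taken from the implementation described here.

```python
def build_levels(points, depth):
    """Return {level: {cell coordinates: (n_B, b_B, e_B)}} for levels 0 .. depth."""
    d = len(points[0])
    side = 2 ** depth
    finest = {}
    for p in points:                                   # hash the points into the finest grid
        key = tuple(min(int(x * side), side - 1) for x in p)
        n, b, e = finest.get(key, (0, [0.0] * d, 0.0))
        finest[key] = (n + 1, [bi + xi for bi, xi in zip(b, p)],
                       e + sum(x * x for x in p))
    levels = {depth: finest}
    for lvl in range(depth - 1, -1, -1):               # aggregate into the coarser grids
        coarse = {}
        for key, (n, b, e) in levels[lvl + 1].items():
            parent = tuple(c // 2 for c in key)
            pn, pb, pe = coarse.get(parent, (0, [0.0] * d, 0.0))
            coarse[parent] = (pn + n, [x + y for x, y in zip(pb, b)], pe + e)
        levels[lvl] = coarse
    return levels
```

Only non-empty cells are ever created, so the memory usage is bounded by the number of points times the depth.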
To compute a coreset of around $m$ points we first have to identify a good guess for $\delta$. Larger
values of $\delta$ lead to a smaller coreset but also to a worse approximation (see Section 5.3). We find
a good value by setting $\delta$ to a large value and dividing it iteratively by 1.3. After each iteration
we compute the heavy cells from the cell statistics, and from that the size of the coreset (without
computing the actual coreset points). For high values of $\delta$ this is done very fast since the coreset
size can be computed using a few large cells. Altogether the time to compute the coreset size is
negligible compared to the coreset computation time.
The coreset for a given value of δis computed using a recursive depth first search function on
the quadtree cells. For the root cell we call a function COMPUTECORESETPOINTS (see Figure
8.2). This function has a cell as input parameter as well as statistics about points to be moved
into that cell. COMPUTECORESETPOINTS adds the input points to the cell statistics. Then for
each subcell it checks heaviness. If there is at least one heavy subcell, it calls COMPUTECORE-
SETPOINTS for all heavy subcells. The points given as function parameter and the points in
all light subcells are then moved into a heavy subcell by adding up their statistics and giving
these statistics to one call of COMPUTECORESETPOINTS as function parameters. If a cell has no
Footnote 2: During the computation of big coresets in later iterations it can happen that the coreset extraction needs point
information of deeper levels. However, since we used images as test instances with 8 bits per color channel, a
depth of 11 was enough. We believe that this also suffices for most other real-world scenarios.
Global variables used in functions COMPUTECORESETPOINTS, COMPUTEASSIGNMENTFORCELL, and COMPUTEASSIGNMENT:
  $L$: the current set of centers.
  For each center $c \in L$:
    $n_c \in \mathbb{N}$: the number of points assigned to the center.
    $b_c \in \mathbb{R}^d$: the sum of coordinates of points assigned to the center.
    $e_c \in \mathbb{R}$: the sum of squared $\ell_2$-norms of points assigned to the center.
  $R$: the coreset computed by COMPUTECORESETPOINTS.
  For each cell $C$ with points in it:
    $n_C \in \mathbb{N}$, $b_C \in \mathbb{R}^d$, $e_C \in \mathbb{R}$: the statistics for the points $P_C$ in cell $C$.
    $\tilde{n}_C \in \mathbb{N}$, $\tilde{b}_C \in \mathbb{R}^d$, $\tilde{e}_C \in \mathbb{R}$: the statistics for the points $P_C$ in cell $C$ taking
    into account the movement of points into cells during coreset construction.
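COMPUTECORESETPOINTS returns a set of coreset points, each being a triple from
$\mathbb{N} \times \mathbb{R}^d \times \mathbb{R}$.

COMPUTECORESETPOINTS(Cell $C$, $n \in \mathbb{N}$, $b \in \mathbb{R}^d$, $e \in \mathbb{R}$)
  $\tilde{n}_C \leftarrow n_C + n$.
  $\tilde{b}_C \leftarrow b_C + b$.
  $\tilde{e}_C \leftarrow e_C + e$.
  Let $H_1, \dots, H_h$ be the heavy subcells of $C$.
  Let $L_1, \dots, L_l$ be the light subcells of $C$.
  for $j = 1, \dots, l$ do:
    $\tilde{n}_{L_j} \leftarrow 0$. // Points are moved into a heavy cell.
    $\tilde{b}_{L_j} \leftarrow (0, \dots, 0) \in \mathbb{R}^d$.
    $\tilde{e}_{L_j} \leftarrow 0$.
  $N \leftarrow n + \sum_{j=1}^{l} n_{L_j}$. // Number of points to be assigned.
  $B \leftarrow b + \sum_{j=1}^{l} b_{L_j}$.
  $E \leftarrow e + \sum_{j=1}^{l} e_{L_j}$.
  if $C$ has at least one heavy subcell do:
    Coreset $\leftarrow$ COMPUTECORESETPOINTS($H_1$, $N$, $B$, $E$).
    for $j = 2, \dots, h$ do:
      Coreset $\leftarrow$ Coreset $\cup$ COMPUTECORESETPOINTS($H_j$, 0, 0, 0).
  else
    Coreset $\leftarrow \{(N, B, E)\}$. // Create new coreset point.
  RETURN Coreset.

Figure 8.2: The COMPUTECORESETPOINTS function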
heavy subcell, a coreset point is introduced from the statistics about cell points and the statistics
given as function input.
To speed up the later k-means algorithm we store the following statistics: For each cell
occupied by coreset points we store a pointer to a corresponding coreset point (if there is one)
and pointers to all subcells containing coreset points. For a cell $B$ let $P_B$ be the points from $P$
which are in cell $B$ after moving points during the coreset construction. During the construction
of coreset $R$ we additionally compute for each cell $B$ the corresponding coreset statistics (i.e. the
cell statistics taking into account that points are moved into cells during the coreset construction):
• The number $\tilde{n}_B = |P_B|$ of points contained in the cell (after moving points).
• A vector $\tilde{b}_B = (\tilde{b}_B^{(1)}, \dots, \tilde{b}_B^{(d)})$ with $\tilde{b}_B^{(i)} := \sum_{p \in P_B} p^{(i)}$.
• The sum of the squared $\ell_2$-norms of the point vectors
  $\tilde{e}_B := \sum_{p \in P_B} \|p\|_2^2 = \sum_{p \in P_B} \sum_{i=1}^{d} (p^{(i)})^2$.
See Figure 8.2 for details. The function COMPUTECORESETPOINTS returns a set of triples
from $\mathbb{N} \times \mathbb{R}^d \times \mathbb{R}$, where a triple $(n', b', e')$ stands for a coreset point representing $n'$ points from
the input instance with coordinate sum $b'$ and sum of squared $\ell_2$-norms $e'$.
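The three statistics are additive, and they suffice to evaluate the k-means cost of a whole group of points against a center via the standard expansion $\sum_p \|p - c\|_2^2 = e - 2\langle c, b\rangle + n\|c\|_2^2$. The following sketch illustrates this; the function names are chosen for this example only and are not taken from the implementation.

```python
def stats(points):
    """Return (n, b, e): point count, coordinate sums, sum of squared l2-norms."""
    d = len(points[0])
    n = len(points)
    b = [sum(p[i] for p in points) for i in range(d)]
    e = sum(x * x for p in points for x in p)
    return n, b, e

def merge(s1, s2):
    """Statistics of the union of two disjoint point sets."""
    n1, b1, e1 = s1
    n2, b2, e2 = s2
    return n1 + n2, [x + y for x, y in zip(b1, b2)], e1 + e2

def cost_to_center(s, c):
    """Sum of squared distances of the summarized points to center c."""
    n, b, e = s
    dot = sum(ci * bi for ci, bi in zip(c, b))
    norm_c = sum(ci * ci for ci in c)
    return e - 2 * dot + n * norm_c
```

Merging the statistics of two cells therefore gives exactly the statistics of their union, which is what moving the points of light subcells into a heavy subcell relies on.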
One iteration of the k-means algorithm is done as follows: Instead of searching the nearest
center for each coreset point separately we use an approach analogous to the kd-tree approach
in [83]. We start with a list $L$ of all centers as possible candidates for nearest centers and do
a depth-first walk on those quadtree cells which contain a coreset point. For each cell $C$ in the
quadtree we check whether we can rule out that some centers $q$ in $L$ are the nearest center to any point
in $C$. This is done by first computing the point $l$ from $L$ that is nearest to the center of $C$. Then
we check for each point $q \in L$ whether $C$ lies completely on the same side as $l$ of the bisector
between $l$ and $q$. All centers which cannot be nearest centers for coreset points in $C$ are evicted
from $L$ and the algorithm proceeds to the children of cell $C$.
If $|L| = 1$ we know the nearest center for all coreset points within the cell. Since we hold
statistics for all coreset points within each cell we can then assign all coreset points in one step
to the center $l \in L$ and stop.
If $|R_C| = 1$, where $R_C$ denotes the set of coreset points in cell $C$, we compute the distances of
the coreset point to all $l \in L$ directly, assign the coreset point and stop the depth-first walk.
See Figure 8.3 for details.
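The bisector test for an axis-parallel cell can be sketched as follows. The sketch assumes that a cell is given by its lower and upper corner (`lo`, `hi`); these names and the function names are illustrative. Since $\|x - q\|^2 - \|x - l\|^2$ is linear in $x$, it suffices to test the single cell corner that lies furthest in direction $q - l$.

```python
def sqdist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def closer_everywhere(lo, hi, l, q):
    """True if every point of the box [lo, hi] is strictly closer to l than to q."""
    # Corner of the box that is extreme in direction q - l.
    corner = [hi[i] if q[i] > l[i] else lo[i] for i in range(len(lo))]
    return sqdist(corner, l) < sqdist(corner, q)

def prune_centers(lo, hi, centers):
    """Keep only the centers that may be nearest to some point of the box."""
    mid = [(a + b) / 2 for a, b in zip(lo, hi)]
    l = min(centers, key=lambda c: sqdist(mid, c))
    # l itself always survives because the test is strict.
    return [q for q in centers if not closer_everywhere(lo, hi, l, q)]
```

If only one center survives the pruning, all coreset points of the cell can be assigned to it at once using the cell statistics.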
Computation of silhouette coefficients. The computation of silhouette coefficients for
each point $p_i$ is sped up in the following way: We first compute the average distance $a_i$ to
all points in the same cluster. To compute $b_i$, the minimum over average distances $b_{i,j}$ to points
in other clusters $C_j$, we identify the second nearest cluster center $c_l$ and compute the average
distance $b_{i,l}$ to all points in $C_l$. In most cases $b_{i,l}$ is the minimum of all $b_{i,j}$ for other clusters. To
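COMPUTEASSIGNMENTFORCELL(Cell $C$, list of possible centers $L$, a subset of the current set of centers)
  if $|R_C| = 1$ do: // Only one coreset point in C.
    Let $g \in R_C$ be the coreset point in cell $C$.
    Compute center $l \in L$ nearest to $g$.
    $n_l \leftarrow n_l + \tilde{n}_C$.
    $b_l \leftarrow b_l + \tilde{b}_C$.
    $e_l \leftarrow e_l + \tilde{e}_C$.
  else: // More than one coreset point in C.
    Let $a$ be the center of cell $C$.
    Compute the center $l \in L$ having smallest distance to $a$.
    for each $q \in L$:
      if each point of cell $C$ has smaller distance to $l$ than to $q$ do:
        $L \leftarrow L \setminus \{q\}$.
    if $|L| = 1$ do: // Then L = {l}.
      $n_l \leftarrow n_l + \tilde{n}_C$.
      $b_l \leftarrow b_l + \tilde{b}_C$.
      $e_l \leftarrow e_l + \tilde{e}_C$.
    else: // |L| >= 2.
      for each subcell $\tilde{C}$ of $C$ having a coreset point do:
        COMPUTEASSIGNMENTFORCELL($\tilde{C}$, $L$).

COMPUTEASSIGNMENT()
  for each center $l$ in the current set of centers $L$ do:
    $n_l \leftarrow 0$.
    $b_l \leftarrow (0, \dots, 0) \in \mathbb{R}^d$.
    $e_l \leftarrow 0$.
  COMPUTEASSIGNMENTFORCELL(largest cell $C$, $L$).

Figure 8.3: The COMPUTEASSIGNMENT function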
SILHOUETTE($K$)
  $m \leftarrow 5$.
  while $m \le K$ do:
    Compute coreset of size $m$.
    for each $k \in \{1, \dots, 100\}$ do
      Use main loop of CoreMeans to compute clustering.
      Compute average silhouette coefficients for the current coreset and centers.
    $m \leftarrow 2 \cdot m$.

Figure 8.4: The SILHOUETTE function
                                 Setup Times
Instance        Size          KMHybrid    CoreMeans
Tower           4,915,200     28.59       4.77
Bridge          3,145,728     18.13       2.95
PaSCo           3,145,728     19.41       4.29
Frymire         1,234,390      4.71       0.65
Clegg             716,320      2.76       1.05
Monarch           393,216      1.43       0.63
Artificial5D      300,000      2.27       1.49
Artificial10D     300,000      3.71       2.17
Artificial15D     300,000      4.71       2.70
Artificial20D     300,000      6.09       3.87

Table 8.1: Data sets and setup time.
get a certificate for this, we use the lower bound $d(p_i, c_j) \le b_{i,j}$. We check for all other clusters $C_j$
whether $d(p_i, c_j) \ge b_{i,l}$. If this inequality holds then $b_{i,l} \le b_{i,j}$ and $b_{i,j}$ cannot be the minimal one. In
that case we save the computation of all distances to points in cluster $C_j$.
8.3 Experiments
We implemented our algorithm in C++. The code was compiled with gcc version 3.4.4 using
optimization level 2 (-O2). We compare our algorithm to the implementation of KMHybrid from
[83]. KMHybrid was compiled using the same compiler and also with optimization level 2. We
ran it using the standard settings given by the developers.
We ran our experiments on an Intel Pentium D dual-core processor. Both
algorithms used only one core with a core frequency of 3 GHz. The computer has 2 GByte RAM.
8.3.1 Data Sets
We performed our experiments on two different types of instances. The first type of instance
consists of images and we want to cluster the RGB values of the pixels. Thus the input points
lie in 3D and the i-th input point corresponds to the RGB-values of the i-th pixel of the image.
Such a clustering has applications in lossy data compression, since one can reduce the palette of
colors used in the picture to the colors corresponding to the cluster centers.
Our test images consist of three large images (Tower, Bridge, and PaSCo) and three medium
size images (Monarch, Frymire, and Clegg). The latter images are commonly used to evaluate the
performance of image compression algorithms. The exact sizes of the test images can be found in
Table 8.1. The images are available at homepages.uni-paderborn.de/frahling/coremeans.html
Artificially Created Instances.
The second type of instance is artificially created. Instance ArtificialxD consists of 300,000
points in $x$ dimensions. The instance is generated by taking a random point from one of 20
Gaussian distributed clusters, whose centers are picked uniformly at random from the unit cube.
The standard deviation of the Gaussian distribution is $0.02 \cdot \sqrt{d}$, i.e. it is the product of
one-dimensional Gaussian distributions with standard deviation 0.02. An example of a sample of
points from instance Artificial2D is given in Figure 8.15.
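A generator for such instances might look as follows. This is a sketch and not the generator used for the experiments: in particular it assumes that each point picks its cluster uniformly at random, which the description above does not state, and the function name and the seed parameter are illustrative.

```python
import random

def make_artificial(n, d, clusters=20, sigma=0.02, seed=0):
    """Draw n points in d dimensions from `clusters` Gaussian clusters.

    Centers are uniform in the unit cube; every coordinate of a point is
    Gaussian around its center with standard deviation sigma.
    Assumption: the cluster of each point is chosen uniformly at random.
    """
    rng = random.Random(seed)
    centers = [[rng.random() for _ in range(d)] for _ in range(clusters)]
    points = []
    for _ in range(n):
        c = rng.choice(centers)
        points.append([rng.gauss(mu, sigma) for mu in c])
    return centers, points
```

For example, `make_artificial(300000, 5)` would produce an instance with the dimensions of Artificial5D.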
8.3.2 Comparison of CoreMeans and KMHybrid
To evaluate the performance of CoreMeans we compare our algorithm to KMHybrid. We first
compare the setup times for both algorithms, i.e. the time to construct the auxiliary data
structures. If one wants to compute a clustering for a fixed value of $k$ then the setup times often
dominate the running time of the algorithm. If a good value of $k$ is not known, then one often wants
to compute a clustering for multiple values of $k$. In this case, it is more interesting to compare
the running time of both algorithms without setup time (however, the time to extract the coresets
from the data structure is contained in the given running times). This is done in the remaining
paragraphs of this section: we first compare both algorithms for different input sizes, then we
focus on the performance with increasing dimension, and finally we investigate the
dependence on the number of clusters.
Setup time
The times to compute the auxiliary data structure are given in Table 8.1. The time to build these
structures does not depend on $k$. The setup time for KMHybrid is between 1.5 and 7 times higher
than that of CoreMeans. There is a tendency that the gap becomes larger for larger instances.
However, there also seems to be a dependence on the distribution of points, as the largest factor
was achieved for the medium size instance Frymire.
If one computes a clustering for one value of $k$ then the setup time is typically larger than the
computation time. Even for the larger instances both algorithms obtain a good clustering in a
few seconds (see also Section 8.3.2).
Dependence on Input Size
To evaluate the dependence on the input size we run both algorithms on the instances Monarch, Clegg,
Frymire, PaSCo, Bridge, and Tower. We used parameter $k = 50$. In general, CoreMeans performs
better for smaller $k$ (see Section 8.3.2) and tends to perform similarly to KMHybrid as $k$ increases.
The results are shown in Figures 8.5 to 8.10. The plots give the average performance of 10
runs. The vertical bars indicate the best and worst solution found within these runs. The relative
performance of CoreMeans increases slightly with the input size $n$. We would like to emphasize
that the difference between the best and worst solution found during the 10 runs is much smaller
for CoreMeans. Therefore, to guarantee a good solution we have to run KMHybrid more often
than CoreMeans. Another interesting observation is that CoreMeans achieves slightly better
approximations for larger instances.
Figure 8.5: Performance on data set Monarch for k=50 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans.)
Figure 8.6: Performance for Clegg for k=50 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans.)
Figure 8.7: Performance for Frymire for k=50 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans.)
Figure 8.8: Performance for PaSCo for k=50 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans.)
Figure 8.9: Performance for Bridge for k=50 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans.)
Figure 8.10: Performance for Tower for k=50 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans.)
Dependence on the Dimension
Next we are interested in the dependence on the dimension. To evaluate this dependence, we
compare the average performance of 10 runs of KMHybrid and CoreMeans for $k = 20$ on the
instances ArtificialxD for $x$ = 5, 10, 15, and 20. The graphs are shown in Figures 8.11 and 8.12.
CoreMeans performs better on all instances. The most significant difference in performance can
be found in the 5D instance, where CoreMeans performs a factor of 10 to 30 better. The higher
the dimension, the smaller the advantage of CoreMeans. In these experiments the deviation
of KMHybrid was much bigger than that of CoreMeans. Although CoreMeans shows a much
better average performance, the best solution found by KMHybrid was better than the best
solution found by CoreMeans. Overall, the performance of the algorithm for medium dimensions
was much better than theory predicts with an exponential dependence on the dimension.
Figure 8.11: Performance for ArtificialxD with x=5 and x=10 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans with d=5 and d=10.)
Figure 8.12: Performance for ArtificialxD with x=15 and x=20 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans with d=15 and d=20.)
Dependence on the Number of Clusters
To investigate the dependence on the number of cluster centers, we ran a number of
experiments on different inputs. Due to space limitations we only present results for $k$ = 10, 50, 100,
and 200 for instance Bridge. These results are typical for the performance we encountered. As
before, Figures 8.13 and 8.14 show the average performance of 10 runs excluding the setup
times. Typically, our algorithm performs significantly better for small values of $k$. For example,
for $k = 10$ CoreMeans often performs a factor of 10 to 100 better. Additionally, the quality of the
solutions computed by KMHybrid varies significantly. CoreMeans is less sensitive to the random
choices of the algorithm. As a consequence one must perform more runs of KMHybrid to obtain
a good solution with high probability. As $k$ grows larger the performance gap between the two
algorithms decreases. The reason for this is that the quality of the coreset decreases as $k$ grows.
Figure 8.13: Performance for Bridge with k=10 and k=50 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans with k=10 and k=50.)
Figure 8.14: Performance for Bridge with k=100 and k=200 excluding setup time. (Plot: objective value over running time in seconds; curves for KMLocalHybrid and CoreMeans with k=100 and k=200.)
8.3.3 Computing the Silhouette Coefficient
We computed the approximate average silhouette coefficient for $1 \le k \le 100$ for instances
Tower, Clegg, Monarch, and ArtificialxD with $x$ = 2, 10, 20 using coresets of different sizes.
Table 8.2 summarizes the running times of our tests. The column Time gives the overall running
time for the computation and the column Silhouette states the time spent to compute the silhouette
coefficients. Since the time to compute the silhouette coefficient is quadratic in the coreset size,
the fraction of time spent for this computation increases significantly with increasing coreset
size.
To show the effectiveness of our method we focus on instance Artificial2D. A sample of points
from this instance is shown in Figure 8.15. The average silhouette coefficient for this instance
and coreset sizes 427, 1616, and 6431 is given in Figure 8.16. We see that even the smallest
coreset suffices to approximate the coefficient quite well. The only problem is that the silhouette
coefficient increases slightly with $k$. A reason for this may be that the number of centers is
already relatively large compared to the number of coreset points. If some clusters contain only
one point, then they have silhouette coefficient exactly 1 and this may lead to a slightly increasing
coefficient if $k$ is large compared to the coreset size. For our applications a coreset of size
roughly 1600 will definitely suffice. There is almost no difference to one with more than 6000
points.
The highest silhouette coefficient value was achieved for 14 clusters (using the larger coreset)
by the cluster centers shown in Figure 8.15. The reason why only 14 clusters were found
(although we had 20 cluster centers) can be explained by the fact that some of the clusters were
very close to each other and so the silhouette coefficient is higher when one assigns only one
center to these clusters.
Instance         Coreset    Time      Silhouette
Tower                404      7.99      0.84
                    1607     19.24      6.43
Clegg                423      4.69      0.8
                    1720     15.07      6.58
Monarch              428      4.80      0.77
                    1626     15.37      6.11
Artificial2D         427      2.52      0.62
                    1616      7.73      4.3
                    6431     51.89     45.57
Artificial10D        400     43.34      1.88
                    1711    123.38     17.68
Artificial20D        408    139.58      4.18
                    1778    442.5      40.62

Table 8.2: Time to compute clusterings and approximate average silhouette coefficients. The
column Time contains the overall running time (including setup). The column Silhouette
gives the time required to compute the approximate average silhouette coefficient.
Figure 8.15: A sample of points from instance Artificial2D. The bold points are the centers that
achieve the best average silhouette coefficient. (Scatter plot in the unit square.)
Figure 8.16: The average silhouette coefficient of Artificial2D. (Plot: silhouette coefficient over k; curves for coreset sizes 427, 1616, and 6431.)
8.3.4 Summary
Summarizing, we can say that our algorithm CoreMeans performs very well compared to
KMHybrid [104] for small dimension and small to medium $k$. When we compare the computation time
of CoreMeans with KMHybrid we see that for one clustering the running time of both
algorithms is typically dominated by the setup time. The quality of the solutions varies less than that
of KMHybrid, which implies that we need fewer runs to guarantee a good solution.
The main strength of our algorithm is to quickly find relatively good approximations for many
values of $k$, for example when a good value for $k$ is not known in advance. In this case, we can
also use the coresets to compute the average silhouette coefficient and thus to find a good choice
of $k$.
9 Counting Motifs in Data Streams
In this chapter we present estimators for the number of triangles, the number of cliques of any
size, and the number of bipartite cliques $K_{3,3}$ with three nodes in each partition for graphs given
as a data stream of edges. Our data stream algorithms compute a $(1+\epsilon)$-approximation of the
respective value with probability $1-\delta$.
To estimate the number of triangles for a graph given as an adjacency stream, our algorithm
uses $O\left(\frac{1}{\epsilon^2} \cdot \log(\frac{1}{\delta}) \cdot \left(1 + \frac{|T_1|+|T_2|}{|T_3|}\right) \cdot \log(|V|)\right)$ memory bits, where $T_i$ denotes the set of node-triples
having $i$ edges in the induced subgraph (see Definition 9.1.1). This is always better than the naive
sampling algorithm that requires $O\left(\frac{1}{\epsilon^2} \log(\frac{1}{\delta}) \left(1 + \frac{|T_0|+|T_1|+|T_2|}{|T_3|}\right) \cdot \log(|V|)\right)$ memory bits, while it
strongly improves the $O\left(\frac{1}{\epsilon^2} \cdot \log(\frac{1}{\delta}) \cdot \left(1 + \frac{|T_1|+|T_2|}{|T_3|}\right)^3 \cdot \log |V|\right)$ solution provided in [117]. Comparing
our results in this model with the previous work in [81], we obtain a one-pass algorithm that
achieves the same space bound and better update time as the three-pass algorithm from [81]. The
two other algorithms in [81] either require bounded maximum degree or are incomparable to our
result because the space complexity depends on different parameters (e.g., the number of cycles
of length 4 and 6 in the graph).
The number of memory bits used by our algorithm still depends on the value of $|T_1|/|T_3|$, which
can be as large as $O(|E| \cdot |V|)$. Our method in the case of graphs in arbitrary order is therefore of
practical interest for networks with a large number of triangles.
We then develop a method of greater practical relevance for graphs given as an incidence
stream of edges which uses $O\left(\frac{1}{\epsilon^2} \log(\frac{1}{\delta}) \log^2(|V|) \left(1 + \frac{|T_2|}{|T_3|}\right)\right)$ bits. To give a flavor of the quality
of our result, observe that $\frac{|T_3|}{|T_2|}$ is exactly equal to 1/3 of the inverse of the transitivity coefficient
of the graph, a universal measure closely related to the clustering coefficient, whose value for
networks of practical interest is rarely bigger than $10^5$. Our algorithmic results improve the
result of [117] that requires $O\left(\frac{1}{\epsilon^2} \log(\frac{1}{\delta}) \log(|V|) \left(1 + \frac{|T_2|}{|T_3|}\right)^2 + d \log |V|\right)$ memory bits (where $d$
denotes the maximum degree of the graph) and improve over the naive sampling method.
Our method is suitable to be adapted to several other classes of subgraphs. As an example we
provide an algorithm to estimate the number of cliques of size $\alpha$ in the incidence stream model
that uses $O\left(\frac{1}{\epsilon^2} \log(\frac{1}{\delta}) \log^2(|V|) \left(1 + \frac{|S_\alpha|}{|K_\alpha|}\right)\right)$ memory bits, where $S_\alpha$ is the set of stars of $\alpha$ nodes
and $K_\alpha$ is the set of cliques of size $\alpha$ in the graph.
Denote by $K_{i,j}$ the set of complete bipartite cliques in the graph where each of $i$ vertices links
to all of $j$ other vertices. As a last contribution we provide a data stream algorithm that provides
an approximation of the number of $K_{3,3}$ of the graph in the incidence stream model ordered by
destination nodes with outdegree bounded by $\Delta$, which needs $O\left(\log^2(|V|) \cdot \frac{|K_{3,1}| \cdot \Delta^2 \ln(\frac{1}{\delta})}{|K_{3,3}| \cdot \epsilon^2}\right)$
memory bits.
In Section 9.1 we present our algorithm to count the number of triangles in the adjacency
stream model (as defined in Section 2.2). For input graphs given as an incidence stream (see
Section 2.2), we develop a better algorithm to count the number of triangles in Section 9.2.
This algorithm will be generalized to cliques of any size in Section 9.3. Section 9.4 presents an
algorithm to count bipartite cliques K3,3 in the incidence stream model.
9.1 Counting Triangles in Adjacency Streams
We consider an undirected graph $G = (V, E)$ without self-loops. Each edge is an unordered pair
of nodes $(v, w)$ such that $(v, w) = (w, v)$. We assume that $V = \{1, \dots, n\}$ and $n$ is known in
advance, and that $G$ is then given as an adjacency stream consisting of all edges in the graph as
defined in Section 2.2. The edges appear in arbitrary order and no edge is repeated in the stream.
There is no bound on the degree of the nodes.
Definition 9.1.1 (Node triples, $T_0, T_1, T_2, T_3$) We define a node triple as a set $\{v_1, v_2, v_3\} \subseteq V$
consisting of exactly 3 different nodes of $V$. We partition the set of node triples into four sets
$T_0, T_1, T_2$, and $T_3$. A node triple $\{v_1, v_2, v_3\}$ belongs to
• $T_0$ iff no edge exists between the nodes $v_1, v_2$, and $v_3$,
• $T_1$ iff exactly one of the edges $(v_1, v_2)$, $(v_2, v_3)$, and $(v_3, v_1)$ exists,
• $T_2$ iff exactly two of the edges $(v_1, v_2)$, $(v_2, v_3)$, and $(v_3, v_1)$ exist,
• $T_3$ iff all of the edges $(v_1, v_2)$, $(v_2, v_3)$, and $(v_3, v_1)$ exist (i.e. iff $\{v_1, v_2, v_3\}$ is a triangle).
Therefore $|T_3|$ denotes the number of triangles in $G$.
The algorithms we present here find a $(1 \pm \epsilon)$-approximation of $|T_3|$ using local memory of
size $\Theta\left(\frac{1}{\epsilon^2} \left(1 + \frac{|T_1|+|T_2|}{|T_3|}\right) \cdot \log |V|\right)$. In social graphs and the webgraph the value $(|T_1|+|T_2|)/|T_3|$
usually is $O(|V|)$.
9.1.1 3 Pass Algorithm
We will first present an algorithm which passes three times over the stream and computes a
$(1 \pm \epsilon)$-approximation of the number of triangles. A different algorithm with the same space
complexity has been presented in [81]. However, our algorithm has a significantly improved
update time and, as we show later, we can combine the three passes into a one-pass algorithm.
We introduce a streaming algorithm SAMPLETRIANGLE, which outputs a $\{0, 1\}$ variable $\beta$
with expected value $3|T_3| / (|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|)$. The algorithm is given in Figure 9.1.
It is easy to see that each of the passes can be implemented in a single pass over the set of
edges (i.e., the input stream) using $O(1)$ memory cells, each storing $O(\log |V|)$ bits.
SAMPLETRIANGLE
  1st pass:
    Count the number of edges $|E|$ in the stream.
  2nd pass:
    Sample an edge $e = (a, b)$ uniformly chosen from $E$.
    Choose a node $v$ uniformly from $V \setminus \{a, b\}$.
  3rd pass:
    if $(a, v) \in E \wedge (b, v) \in E$ then set $\beta \leftarrow 1$
    else set $\beta \leftarrow 0$
  return $\beta$

Figure 9.1: The 3-pass algorithm SAMPLETRIANGLE for adjacency streams
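The three passes can be mimicked on an in-memory edge list as in the following sketch. The list `edges` stands in for the stream (it is simply read three times), and the function name and the explicit random generator `rng` are illustrative assumptions.

```python
import random

def sample_triangle(edges, nodes, rng):
    """One run of the 3-pass sampling; returns beta in {0, 1}."""
    # 1st pass: count the edges.
    m = sum(1 for _ in edges)
    # 2nd pass: pick the edge with a uniformly chosen index and a third node.
    idx = rng.randrange(m)
    for i, e in enumerate(edges):
        if i == idx:
            a, b = e
    v = rng.choice([u for u in nodes if u != a and u != b])
    # 3rd pass: look for the two edges that close the triangle.
    needed = {frozenset((a, v)), frozenset((b, v))}
    seen = set()
    for u, w in edges:
        e = frozenset((u, w))
        if e in needed:
            seen.add(e)
    return 1 if len(seen) == 2 else 0
```

On a complete graph every sampled edge and node form a triangle, so the sketch always returns 1; on a path it always returns 0.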
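Lemma 9.1.2 Algorithm SAMPLETRIANGLE outputs a value $\beta$ with expected value
$$E[\beta] = \frac{3|T_3|}{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}.$$
Furthermore
$$|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3| = |E| \cdot (|V| - 2)$$
and
$$|T_3| = E[\beta] \cdot |E| \cdot (|V| - 2) / 3.$$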
Proof: We look at all node triples. Each triple belongs to one of the sets $T_0$, $T_1$, $T_2$, or $T_3$.
The algorithm chooses such a triple by choosing an edge $e = (a, b)$ together with one node
$v \in V \setminus \{a, b\}$. Therefore, no triple from $T_0$ is chosen.
We now count how many different choices of one edge and one node select a node triple
belonging to $T_1$ (resp. $T_2$, $T_3$). Since all pairs of one edge and one node have the same probability
of being chosen, we can then compute the probability of selecting a triangle.
Let $t = \{w_1, w_2, w_3\}$ denote a fixed triple from $T_1$. W.l.o.g. let $(w_1, w_2) \in E$ and so
$(w_2, w_3), (w_3, w_1) \notin E$. The algorithm chooses $t$ iff it samples edge $(w_1, w_2)$ and vertex $w_3$.
Therefore there are exactly $|T_1|$ choices of an edge and a node that select a triple from $T_1$.
Now assume $t \in T_2$. Then $t$ is chosen by SAMPLETRIANGLE iff one of the two edges in the
triple is sampled and $v$ equals the remaining node of the triple. Therefore there are exactly
$2 \cdot |T_2|$ choices of an edge and a node that select a triple from $T_2$.
For the same reason, a triple in $T_3$ is chosen whenever one of its three edges and the remaining
vertex are chosen. Therefore there are exactly $3 \cdot |T_3|$ choices of an edge and a node that select a
triple from $T_3$. These are exactly the choices that lead to $\beta = 1$.
We conclude that
$$E[\beta] = \frac{3|T_3|}{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}.$$
Since there are $|E| \cdot (|V| - 2) = |T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|$ choices to sample an edge and a node,
it follows that $|T_3| = E[\beta] \cdot |E| \cdot (|V| - 2) / 3$. □
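The counting identity of the lemma can be checked by brute force on small graphs. The sketch below enumerates all node triples and classifies them by the number of induced edges; the function name is illustrative.

```python
from itertools import combinations

def triple_counts(nodes, edges):
    """Return [|T0|, |T1|, |T2|, |T3|] for an undirected graph."""
    edge_set = {frozenset(e) for e in edges}
    counts = [0, 0, 0, 0]
    for triple in combinations(nodes, 3):
        k = sum(1 for pair in combinations(triple, 2)
                if frozenset(pair) in edge_set)
        counts[k] += 1
    return counts
```

For any graph, `t[1] + 2*t[2] + 3*t[3]` equals `len(edges) * (len(nodes) - 2)`, because every edge lies in exactly $|V| - 2$ node triples.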
A streaming algorithm COUNTTRIANGLES, which outputs an estimate of $|T_3|$, easily follows.
It can be adjusted using an input parameter $s$.

COUNTTRIANGLES($s \in \mathbb{N}$)
  Run $s$ instances of SAMPLETRIANGLE in parallel.
  Let $\beta_i$ be the value returned by the $i$-th instance.
  $\tilde{T}_3 \leftarrow \left(\frac{1}{s} \sum_{i=1}^{s} \beta_i\right) \cdot |E| \cdot (|V| - 2) / 3$.
  return $\tilde{T}_3$.

Figure 9.2: The 3-pass algorithm COUNTTRIANGLES for adjacency streams
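The estimator can be sketched as follows. For brevity this sketch does not simulate the passes over a stream: it assumes the whole edge set fits in memory and answers the membership queries of the third pass from a set. The helper names are illustrative.

```python
import random

def sample_beta(edge_list, edge_set, nodes, rng):
    """Sample an edge (a, b) and a node v; return 1 iff {a, b, v} is a triangle."""
    a, b = rng.choice(edge_list)
    v = rng.choice([u for u in nodes if u != a and u != b])
    closes = frozenset((a, v)) in edge_set and frozenset((b, v)) in edge_set
    return 1 if closes else 0

def count_triangles(edges, nodes, s, rng):
    """Average s samples of beta and rescale by |E| * (|V| - 2) / 3."""
    edge_list = list(edges)
    edge_set = {frozenset(e) for e in edge_list}
    hits = sum(sample_beta(edge_list, edge_set, nodes, rng) for _ in range(s))
    return hits / s * len(edge_list) * (len(nodes) - 2) / 3
```

On the complete graph with four nodes every sample yields 1, so the estimate is 6 · 2 / 3 = 4, which is the number of triangles of that graph.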
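Lemma 9.1.3 Algorithm COUNTTRIANGLES outputs a value $\tilde{T}_3$ having expected value $E[\tilde{T}_3] = |T_3|$.
If $s \ge \frac{1}{\epsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln(\frac{2}{\delta})$, then with probability $1 - \delta$ the algorithm outputs a value
$\tilde{T}_3$ satisfying
$$(1 - \epsilon) \cdot |T_3| \le \tilde{T}_3 \le (1 + \epsilon) \cdot |T_3|.$$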
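Proof: We use Chernoff bounds [58]:
$$\Pr\left[\frac{1}{s} \sum_{i=1}^{s} \beta_i \ge (1 + \epsilon) E[\beta]\right] < e^{-\epsilon^2 \cdot E[\beta] \cdot s / 3}$$
$$\Pr\left[\frac{1}{s} \sum_{i=1}^{s} \beta_i \le (1 - \epsilon) E[\beta]\right] < e^{-\epsilon^2 \cdot E[\beta] \cdot s / 2}$$
For $s \ge \frac{1}{\epsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln(\frac{2}{\delta})$ the sum of both probabilities is bounded by $\delta$. The lemma
follows now from Lemma 9.1.2 stating that $|T_3| = E[\beta] \cdot |E| \cdot (|V| - 2) / 3$. □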
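To get a feeling for the magnitude of $s$, the bound of Lemma 9.1.3 can be evaluated when the triple counts are known. This is for illustration only, since in a streaming setting $|T_1|$, $|T_2|$, and $|T_3|$ are not available; the function name is an assumption of this sketch.

```python
import math

def required_instances(t1, t2, t3, eps, delta):
    """Smallest integer s with s >= (1/eps^2) * (t1 + 2*t2 + 3*t3)/t3 * ln(2/delta)."""
    ratio = (t1 + 2 * t2 + 3 * t3) / t3
    return math.ceil(ratio * math.log(2 / delta) / eps ** 2)
```

For a graph in which every connected triple is a triangle (t1 = t2 = 0) the ratio is 3, so the number of instances depends only on the accuracy parameters.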
Lemma 9.1.3 guarantees that algorithm COUNTTRIANGLES outputs a value $\tilde{T}_3$ having the
right expectation.
However, there could be applications which need guaranteed approximation bounds. When
we run COUNTTRIANGLES with a predefined number of instances $s$, it outputs an estimate $\tilde{T}_3$
of $|T_3|$, but the requirement $s \ge \frac{1}{\epsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln(\frac{2}{\delta})$ of Lemma 9.1.3 is impossible to check.
We cannot find out whether we used enough instances to ensure a $(1 \pm \epsilon)$-approximation.
To overcome this problem we later develop a new algorithm COUNTTRIANGLESSAFE based
on COUNTTRIANGLES. It outputs a pair $(\tilde{T}_3, \tilde{\epsilon})$ and guarantees that $\tilde{T}_3$ is a $(1 \pm \tilde{\epsilon})$-approximation
of $|T_3|$ with probability $1 - \delta$.
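COUNTTRIANGLESSAFE($s \in \mathbb{N}$, $\psi \in (0, 1)$)
  Set $r \leftarrow \lceil 2 \cdot \log(2/\psi) \rceil$.
  Set $t \leftarrow \lceil s / (4 \log(2/\psi)) \rceil$.
  Run $r$ instances of COUNTTRIANGLES($t$) in parallel.
  Let $\tilde{T}_3^{(i)}$ be the value returned by the $i$-th instance.
  Set $\tilde{T}_3 \leftarrow \mathrm{median}_i(\tilde{T}_3^{(i)})$.
  Set $\tilde{\epsilon} \leftarrow \sqrt{\frac{176}{\tilde{T}_3} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}} = \sqrt{\frac{176}{\tilde{T}_3} \cdot \frac{|E| \cdot (|V| - 2)}{s} \cdot \log\frac{2}{\psi}}$.
  return $(\tilde{T}_3, \tilde{\epsilon})$.

Figure 9.3: The 3-pass algorithm COUNTTRIANGLESSAFE for adjacency streams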
Let us first analyse the update time of COUNTTRIANGLES. As before we analyse time in the
Real RAM model. Note that we could also show our results in a RAM model, assuming that all
stored values (each needing $O(\log |V|)$ memory bits to be stored) fit into one single register.
If we implement the different instances of our algorithm independently of each other, we
require $O(s)$ time to process each edge during the third pass. We show how to reduce this to
expected constant time. Before we invoke the third pass, we collect all edge-vertex pairs chosen
by different instances of the algorithm. For each pair with edge $e = (a, b)$ and vertex $v$ we
would like to find out whether $(a, v)$ and $(b, v)$ are in $E$. Therefore, we construct a set $M$ of
missing edges that for each such edge-vertex pair contains the edges $(a, v)$ and $(b, v)$. Next, we
construct a hash table for $M$ using a uniform hash function that requires linear space, as proposed
in [107]. Now we can implement the third pass in the following way. For each edge $e$, we look up
whether it is in the set $M$. If $e \in M$ we mark it. These steps can both be done in expected
constant time. In a postprocessing step we can then determine the edge-vertex pairs that are triangles.
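The batched third pass can be sketched as follows. Python's built-in set is used here as a stand-in for the hash table with a uniform hash function from [107]; this substitution, the function name, and the representation of samples as pairs `((a, b), v)` are assumptions of the sketch.

```python
def third_pass(stream, samples):
    """Answer all triangle queries with one hash lookup per streamed edge.

    samples: list of ((a, b), v) pairs chosen by the individual instances.
    Returns the list of beta values in the same order.
    """
    # The set M of missing edges, shared by all instances.
    missing = set()
    for (a, b), v in samples:
        missing.add(frozenset((a, v)))
        missing.add(frozenset((b, v)))
    # Single pass over the stream: mark every missing edge that appears.
    marked = set()
    for u, w in stream:
        e = frozenset((u, w))
        if e in missing:
            marked.add(e)
    # Postprocessing: a sample is a triangle iff both of its edges were marked.
    return [1 if frozenset((a, v)) in marked and frozenset((b, v)) in marked else 0
            for (a, b), v in samples]
```

The work per streamed edge is a single lookup, independent of the number of instances; the per-sample work happens only before and after the pass.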
We now develop an algorithm COUNTTRIANGLESSAFE based on COUNTTRIANGLES. It has
a number $s$ and a desired error probability $\psi$ as input parameters and outputs a pair $(\tilde{T}_3, \tilde{\epsilon})$.
The algorithm gives the guarantee that $\tilde{T}_3$ is a $(1 \pm \tilde{\epsilon})$-approximation of $|T_3|$ with probability
$1 - \psi$. It runs at most $s$ instances of SAMPLETRIANGLE in parallel. We show the pseudocode
of COUNTTRIANGLESSAFE in Figure 9.3.
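Lemma 9.1.4 Let $(\tilde{T}_3, \tilde{\epsilon})$ be the output of algorithm COUNTTRIANGLESSAFE. With probability
$1 - \psi$ the following statements are true:
• $(1 - \tilde{\epsilon}) \cdot |T_3| < \tilde{T}_3 < (1 + \tilde{\epsilon}) \cdot |T_3|$.
• If $s \ge \frac{352}{\epsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \log(2/\psi)$, then the algorithm outputs $(\tilde{T}_3, \tilde{\epsilon})$ with $\tilde{\epsilon} \le \epsilon$.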
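Proof: COUNTTRIANGLESSAFE outputs $(\tilde{T}_3, \tilde{\epsilon})$ with
$$\tilde{\epsilon} = \sqrt{\frac{176}{\tilde{T}_3} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}}.$$
Therefore we have $\tilde{T}_3 = \frac{176}{\tilde{\epsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}$. Because of the choice of $t$ it follows
$$\tilde{T}_3 \ge \frac{44}{\tilde{\epsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{t}. \qquad (9.1)$$
From Markov's inequality it follows that
$$\forall i \in \{1, \dots, r\}: \quad \Pr\left[\tilde{T}_3^{(i)} \ge 11 \cdot E[\tilde{T}_3^{(i)}]\right] \le \frac{1}{11}.$$
Since for all $i$ we have $E[\tilde{T}_3^{(i)}] = |T_3|$ it follows:
$$\forall i \in \{1, \dots, r\}: \quad \Pr\left[\tilde{T}_3^{(i)} \ge 11 \cdot |T_3|\right] \le \frac{1}{11}.$$
Because of $\tilde{T}_3 = \mathrm{median}_i(\tilde{T}_3^{(i)})$ we can only have $\tilde{T}_3 \ge 11 \cdot |T_3|$ if for at least $r/2$ values of $i$
we have $\tilde{T}_3^{(i)} \ge 11 \cdot |T_3|$. For each single value of $i$ the probability for that is at most $1/11$ as
shown above. The probability to have at least $r/2$ values of $i$ fulfilling that inequality is therefore
bounded by:
$$\Pr\left[\tilde{T}_3 \ge 11 \cdot |T_3|\right] \le \binom{r}{r/2} \cdot \left(\frac{1}{11}\right)^{r/2} \le \left(\frac{e \cdot r}{r/2}\right)^{r/2} \cdot \left(\frac{1}{11}\right)^{r/2} = \left(\frac{2e}{11}\right)^{r/2} \le \left(\frac{1}{2}\right)^{r/2} \le \psi/2.$$
From inequality (9.1) we conclude
$$\Pr\left[\frac{44}{\tilde{\epsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{t} \ge 11 \cdot |T_3|\right] \le \psi/2$$
and finally
$$\Pr\left[|T_3| \le \frac{4}{\tilde{\epsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{t}\right] \le \psi/2.$$
In the following we therefore condition on the event that $|T_3| > \frac{4}{\tilde{\epsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{t}$ (which
happens with probability at least $1 - \psi/2$). Then for $\delta := 1/11$ we have $\ln(2/\delta) \le 4$ and
$$t \ge \frac{1}{\tilde{\epsilon}^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \ln\frac{2}{\delta}.$$
Therefore by Lemma 9.1.3 each instance of COUNTTRIANGLES outputs a value $\tilde{T}_3^{(i)} \in (1 \pm \tilde{\epsilon}) \cdot |T_3|$
with probability at least $1 - \delta = 1 - \frac{1}{11}$.
Since $\tilde{T}_3$ is set to the median of all $\tilde{T}_3^{(i)}$, we can only have $\tilde{T}_3 \notin (1 \pm \tilde{\epsilon}) \cdot |T_3|$ if for at least $r/2$
of the $\tilde{T}_3^{(i)}$ we have $\tilde{T}_3^{(i)} \notin (1 \pm \tilde{\epsilon}) \cdot |T_3|$. The probability for that is bounded by:
$$\Pr\left[\tilde{T}_3 \notin (1 \pm \tilde{\epsilon}) \cdot |T_3|\right] \le \binom{r}{r/2} \cdot \left(\frac{1}{11}\right)^{r/2} \le \left(\frac{e \cdot r}{r/2}\right)^{r/2} \cdot \left(\frac{1}{11}\right)^{r/2} = \left(\frac{2e}{11}\right)^{r/2} \le \left(\frac{1}{2}\right)^{r/2} \le \psi/2.$$
The first statement of the lemma follows directly.
We now show the second statement. We condition on the event that $\tilde{T}_3 \le (1 + \tilde{\epsilon}) |T_3|$, which
happens with probability $1 - \psi$ by the first statement of the lemma.
If $s \ge \frac{352}{\epsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{|T_3|} \cdot \log\frac{2}{\psi}$ then we have
$$|T_3| \ge \frac{352}{\epsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}$$
and therefore
$$\tilde{T}_3 \ge \frac{|T_3|}{2} \ge \frac{176}{\epsilon^2} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}.$$
It follows directly:
$$\epsilon \ge \sqrt{\frac{176}{\tilde{T}_3} \cdot \frac{|T_1| + 2 \cdot |T_2| + 3 \cdot |T_3|}{s} \cdot \log\frac{2}{\psi}} = \tilde{\epsilon}.$$
□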
Notice that $r \cdot t \le s$ for $s \ge 8 \log(2/\psi)$. Therefore we have at most $s$ instances of
SAMPLETRIANGLE running in parallel. The time COUNTTRIANGLESSAFE($s$) needs to process an edge
in the stream is therefore at most the time COUNTTRIANGLES($s$) needs.
We summarize our results in the following theorem. We remark that we significantly improve
the update time over the previously best result from [81] while achieving the same space
complexity. The update time in [81] is roughly proportional to the space complexity, compared to
expected constant time for our algorithm.
Theorem 24 There is a 3-Pass streaming algorithm to count the number of triangles in a stream
of edges up to a multiplicative error of 1±with probability at least 1ψ, which needs
O(1
2·log(1
ψ)·(1+|T1|+|T2|
|T3|)·log |V|)memory bits and constant expected update time. ut
9.1.2 1 Pass Algorithm
In this section we show that the previous 3-pass algorithms can be implemented in one pass using
the same amount of space and constant expected amortized update time, if $|E|$ is significantly
larger than the number of instances we run.
We first show how to adapt algorithm SAMPLETRIANGLE. We observe that we can find a
random edge in one pass by reservoir sampling [120], i.e. choosing the first edge as a sample
edge and replacing this edge by the $i$th edge of the stream with probability $1/i$. It is known that
this method can be implemented in $O(\log|V|)$ expected time per sample (not counting the time
to read the stream) by randomly choosing the next index of the replacing edge according to an
appropriate probability distribution.
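The replacement rule described above can be sketched in a few lines of Python. This is only an illustration of the basic rule (one coin flip per edge), not the faster skip-ahead variant mentioned in the text; the function name `reservoir_sample_edge` and the `rng` parameter are assumptions made for this example.

```python
import random

def reservoir_sample_edge(stream, rng=random):
    """Return one edge chosen uniformly at random from an iterable of edges.

    Illustrative sketch: the i-th edge replaces the current sample with
    probability 1/i, so only one edge is stored at any time.
    """
    sample = None
    for i, edge in enumerate(stream, start=1):
        # rng.random() lies in [0, 1), so the first edge is always taken.
        if rng.random() < 1.0 / i:
            sample = edge
    return sample
```

After the whole stream has been read, each of the edges is the stored sample with the same probability, although the length of the stream was not known in advance.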
SAMPLETRIANGLEONEPASS
    i ← 1
    for each edge e = (u, w) in the stream do
        Flip a coin. With probability 1/i do
            a ← u; b ← w;
            v ← Node uniformly chosen from V \ {a, b}
            x ← false; y ← false
        end do
        if e = (a, v) then x ← true
        if e = (b, v) then y ← true
        i ← i + 1
    end for
    if x = true ∧ y = true then return β ← 1 else return β ← 0.
Figure 9.4: The 1-pass algorithm SAMPLETRIANGLEONEPASS for adjacency streams
We combine this with the third pass and obtain algorithm SAMPLETRIANGLEONEPASS as
shown in Figure 9.4. It may happen that we sample an edge e = (a, b) of the stream together
with a node v, but we do not see the edge (a, v) or (b, v) in the subsequent stream (because they
appeared before the edge e). In this case, we do not detect a, b, v as a triangle. However, we
detect a, b, v iff (a, b) is the first edge of the triangle that appears in the stream. This changes
the expected value of β by a factor of 3.
Lemma 9.1.5 Algorithm SAMPLETRIANGLEONEPASS outputs a value β having expected value
\[ E[\beta] = \frac{|T_3|}{|T_1|+2\cdot|T_2|+3\cdot|T_3|} . \]
Proof : The proof is similar to the proof of Lemma 9.1.2, taking into account that only 1/3 of
the choices that select a $T_3$-triple actually detect a triangle and lead to value β = 1. □
An algorithm COUNTTRIANGLESONEPASS outputting an estimation $\tilde{T}_3$ of $|T_3|$ can be developed
similarly to COUNTTRIANGLES (see Figure 9.5).
Lemma 9.1.6 Algorithm COUNTTRIANGLESONEPASS outputs a value $\tilde{T}_3$ having expected value
$E[\tilde{T}_3] = |T_3|$. If $s \ge \frac{3}{\epsilon^2}\cdot\frac{|T_1|+2\cdot|T_2|+3\cdot|T_3|}{|T_3|}\cdot\ln\left(\frac{2}{\delta}\right)$, then with probability $1-\delta$ the algorithm outputs
a value $\tilde{T}_3$ satisfying
\[ (1-\epsilon)\cdot|T_3| \le \tilde{T}_3 \le (1+\epsilon)\cdot|T_3| . \]
Proof : The proof is similar to the proof of Lemma 9.1.3 (with an additional factor of 3 in most
formulas). □
COUNTTRIANGLESONEPASS (s ∈ N)
    Run s instances of SAMPLETRIANGLEONEPASS in parallel.
    Let $\beta_i$ be the value returned by the $i$th instance.
    $\tilde{T}_3 \leftarrow \left(\frac{1}{s}\sum_{i=1}^{s}\beta_i\right)\cdot|E|\cdot(|V|-2)$.
    return $\tilde{T}_3$.
Figure 9.5: The 1-pass algorithm COUNTTRIANGLESONEPASS for adjacency streams
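The two figures above can be tried out with the following Python sketch. It is a simplified illustration, not the implementation analysed in the text: the s instances are simulated by re-reading an in-memory edge list instead of running in parallel over a single stream, each replacement scans the node list in O(|V|) time, and the graph is assumed to be simple with at least three nodes. The function names and the `seed` parameter are hypothetical.

```python
import random

def sample_triangle_one_pass(edges, nodes, rng):
    """One run of the sampler from Figure 9.4; returns beta in {0, 1}."""
    a = b = v = None
    x = y = False
    for i, (u, w) in enumerate(edges, start=1):
        if rng.random() < 1.0 / i:
            # Replace the sample edge and draw a fresh third node.
            a, b = u, w
            v = rng.choice([n for n in nodes if n != a and n != b])
            x = y = False
        # Edges are undirected, so compare them as unordered pairs.
        if {u, w} == {a, v}:
            x = True
        if {u, w} == {b, v}:
            y = True
    return 1 if (x and y) else 0

def count_triangles_one_pass(edges, nodes, s, seed=0):
    """Estimator from Figure 9.5: mean of beta times |E| * (|V| - 2)."""
    rng = random.Random(seed)
    edges = list(edges)
    mean_beta = sum(sample_triangle_one_pass(edges, nodes, rng)
                    for _ in range(s)) / s
    return mean_beta * len(edges) * (len(nodes) - 2)
```

On a graph without triangles the estimate is always 0; on a single triangle the sampler succeeds only when the first edge of the stream is the final sample, which is the factor 1/3 used in Lemma 9.1.5.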
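COUNTTRIANGLESONEPASSSAFE (s ∈ N, ψ ∈ (0, 1))
    Set $r \leftarrow \lceil 2\cdot\log(2/\psi)\rceil$.
    Set $t \leftarrow \lceil s/(4\log(2/\psi))\rceil$.
    Run r instances of COUNTTRIANGLESONEPASS(t) in parallel.
    Let $\tilde{T}_3^{(i)}$ be the value returned by the $i$th instance.
    $\tilde{T}_3 \leftarrow \mathrm{median}_i(\tilde{T}_3^{(i)})$.
    Set $\tilde{\epsilon} \leftarrow \sqrt{\frac{528}{\tilde{T}_3}\cdot\frac{|T_1|+2\cdot|T_2|+3\cdot|T_3|}{s}\cdot\log\frac{2}{\psi}} = \sqrt{\frac{528}{\tilde{T}_3}\cdot\frac{|E|\cdot(|V|-2)}{s}\cdot\log\frac{2}{\psi}}$.
    return $(\tilde{T}_3, \tilde{\epsilon})$.
Figure 9.6: The 1-pass algorithm COUNTTRIANGLESONEPASSSAFE for adjacency streams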
Similarly to COUNTTRIANGLESSAFE we can develop a one pass algorithm COUNTTRIANGLESONEPASSSAFE
(Figure 9.6) which outputs an approximation $\tilde{T}_3$ together with an approximation
guarantee.
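The structure shared by all "SAFE" variants in this chapter (split the budget s into r groups of size t, take the median, and report an error bound computed from the estimate) can be sketched as follows. This is a hedged illustration: `run_estimator`, `weight` and `c` are hypothetical parameters standing for the inner estimator, the scaling quantity (for example |E|·(|V|−2)) and the constant under the square root (for example 528); the logarithm is assumed to be base 2, and returning an infinite error bound for a zero estimate is a choice made only for this example.

```python
import math
import statistics

def safe_estimate(run_estimator, s, psi, weight, c):
    """Median-of-r wrapper; returns (estimate, eps_tilde).

    run_estimator(t) is assumed to return one estimate built from t samples.
    """
    log_term = math.log2(2.0 / psi)
    r = math.ceil(2 * log_term)            # number of independent groups
    t = math.ceil(s / (4 * log_term))      # samples per group
    # statistics.median averages the two middle values when r is even.
    estimate = statistics.median(run_estimator(t) for _ in range(r))
    if estimate <= 0:
        return estimate, math.inf          # no meaningful guarantee
    eps_tilde = math.sqrt(c / estimate * weight / s * log_term)
    return estimate, eps_tilde
```

The reported value of eps_tilde shrinks as the estimate grows or as more samples are used, which is how the algorithm signals the quality of its own output.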
Lemma 9.1.7 Let $(\tilde{T}_3, \tilde{\epsilon})$ be the output of algorithm COUNTTRIANGLESONEPASSSAFE. With
probability $1-\psi$ the following statements are true:
\[ (1-\tilde{\epsilon})\cdot|T_3| < \tilde{T}_3 < (1+\tilde{\epsilon})\cdot|T_3| \]
If $s \ge \frac{1056}{\epsilon^2}\cdot\frac{|T_1|+2\cdot|T_2|+3\cdot|T_3|}{|T_3|}\cdot\log(2/\psi)$, then the algorithm outputs $(\tilde{T}_3, \tilde{\epsilon})$ with $\tilde{\epsilon} \le \epsilon$.
Proof : The proof can be done similarly to the proof of Lemma 9.1.3 (with an additional factor
of 3 in most formulas). □
By applying the reservoir sampling algorithm from [120] to select the edge, the selection requires
$O(\log|V|)$ expected time for each instance of SAMPLETRIANGLEONEPASS for the whole
stream. Additionally we use the hash table approach from the previous chapter to efficiently find
instances of SAMPLETRIANGLEONEPASS which search for an edge in the stream. Altogether
we get expected $O\left(1+s\cdot\frac{\log|E|}{|E|}\right)$ update time per edge in the stream.
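Theorem 25 There is a 1-Pass streaming algorithm to count the number of triangles in a stream
of edges up to a multiplicative error of $1\pm\epsilon$ with probability at least $1-\psi$, which needs
$O\left(\frac{1}{\epsilon^2}\cdot\log\left(\frac{1}{\psi}\right)\cdot\left(1+\frac{|T_1|+|T_2|}{|T_3|}\right)\cdot\log|V|\right)$ memory bits and expected update time
$O\left(1+\frac{1}{\epsilon^2}\cdot\frac{|T_1|+|T_2|}{|T_3|}\cdot\frac{\log|E|}{|E|}\cdot\log\frac{1}{\psi}\right)$.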
9.2 Counting Triangles in Incidence Streams
When a graph G = (V, E) is coded as an incidence stream (see Section 2.2), all edges incident
to the same vertex appear consecutively in the stream. First arrive all edges incident to vertex
$v_1$, followed by all edges incident to $v_2$, and so on. The ordering $v_1, \ldots, v_n$ of the vertices can
be arbitrary, i.e. determined by an adversary. We consider undirected graphs, and so each edge
appears twice (within the incidence list of both incident nodes). There is no bound on the degree
of the nodes (in contrast to [117]).
Often large graphs (e.g. the webgraph) are stored on hard discs as incidence lists of edges.
Our methods can therefore be used to approximate the number of triangles in these graphs using
only sequential access to the data.
9.2.1 3 Pass Algorithm
We again will first develop a 3-pass algorithm, and later combine the passes to get a one pass
algorithm. Let $d_i$ denote the degree of node $v_i$. The 3-pass algorithm SAMPLETRIANGLE2 is
presented in Figure 9.7. The algorithm SAMPLETRIANGLE2 can be implemented using O(1)
memory cells, each consisting of $O(\log|V|)$ bits.
SAMPLETRIANGLE2
    1st. Pass:
        Count the number P of paths of length 2 in the graph G.
    2nd. Pass:
        Uniformly choose one of these paths using algorithm UNIFORMTWOPATH(P).
        Let (a, v, b) be this path.
    3rd. Pass:
        Test if edge (a, b) appears within the stream.
    if (a, b) ∈ E then set β ← 1
    else set β ← 0
    return β
Figure 9.7: The 3-pass algorithm SAMPLETRIANGLE2 for incidence streams
We observe that the number of paths of length 2 in the graph G is exactly
\[ P := |T_2|+3\cdot|T_3| = \sum_{i=1}^{|V|} d_i\cdot(d_i-1)/2 . \]
Thus we can easily count the number of paths of length 2 by determining the degree of each
node. This is possible because the edges appear as an incidence stream.
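A minimal Python sketch of this first pass is given below. It assumes, purely for illustration, that the incidence stream is an iterable of `(v, neighbour)` pairs in which all pairs with the same `v` are consecutive; the function name `count_two_paths` is hypothetical.

```python
def count_two_paths(incidence_stream):
    """Count paths of length 2 as the sum of d * (d - 1) / 2 over all nodes.

    Only the current node and its running degree are stored.
    """
    total = 0
    current, degree = None, 0
    for v, _ in incidence_stream:
        if v != current:
            # The incidence list of the previous node is complete.
            total += degree * (degree - 1) // 2
            current, degree = v, 0
        degree += 1
    return total + degree * (degree - 1) // 2
```

For the complete graph on four nodes every node has degree 3, so the function returns 4·3 = 12, which agrees with $|T_2|+3\cdot|T_3| = 0 + 3\cdot 4$.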
The second pass can be implemented using reservoir sampling. However, we propose a different
approach which achieves slightly better amortized running time and is based on the following
idea. If v is incident to the nodes $w_1, w_2, \ldots, w_d$, we define an order on the possible paths of
length 2 with v in the middle in the following way: $(w_1, v, w_2) < (w_1, v, w_3) < (w_2, v, w_3) <
(w_1, v, w_4) < \ldots$. The triples $(w_i, v, w_j)$ are ordered firstly by $\max\{i, j\}$. Ties are ordered by i.
We choose a value $k \in \{1, \ldots, P\}$ uniformly at random and want to select the k-th triple
$(w_i, v, w_j)$ in the order given above. i and j can be computed from k using the formulas given in
Figure 9.8. The k-th triple is chosen if the node v is in the middle of enough paths of length 2.
Otherwise we search for the $(k - d_v\cdot(d_v-1)/2)$-th path within the next incidence list.
The algorithm UNIFORMTWOPATH is presented in Figure 9.8.
UNIFORMTWOPATH(P)
    Select value k uniformly from the set {1, ..., P}
    For each node v in the incidence list do
        If k > 0 then
            Set $j \leftarrow \left\lceil\sqrt{2k+\frac{1}{4}}+\frac{1}{2}\right\rceil$
            Set $i \leftarrow j-\frac{j^2-j}{2}+k-1$
            Pass over the complete incidence list of node v.
            If incidence list of v contains more than j edges then
                a ← the $i$th node in the incidence list of v
                b ← the $j$th node in the incidence list of v
                w ← v
            end if
            d ← degree of node v
            $k \leftarrow k-\frac{d^2-d}{2}$
        end if
    end do
    return edges (a, w, b)
Figure 9.8: The 1-pass algorithm UNIFORMTWOPATH for incidence streams
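The two index formulas of Figure 9.8 can be checked with a short Python sketch. To avoid floating-point rounding, the ceiling of the square-root expression is evaluated here with integer arithmetic (`math.isqrt`), which is an implementation choice of this example rather than something prescribed by the text; the function name `kth_pair` is hypothetical.

```python
import math

def kth_pair(k):
    """Return (i, j) with i < j: the k-th pair (1-based) ordered by j, then i.

    j is the smallest integer with j * (j - 1) / 2 >= k, which equals
    ceil(sqrt(2k + 1/4) + 1/2); i is the rank of the pair among those
    sharing the same larger index j.
    """
    j = (1 + math.isqrt(8 * k - 7)) // 2 + 1
    i = j - (j * j - j) // 2 + k - 1
    return i, j
```

For k = 1, ..., 6 this yields (1,2), (1,3), (2,3), (1,4), (2,4), (3,4), which is the order on the paths $(w_i, v, w_j)$ defined in the text.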
Lemma 9.2.1 Algorithm SAMPLETRIANGLE2 outputs a value β with expected value
\[ E[\beta] = \frac{3\cdot|T_3|}{|T_2|+3\cdot|T_3|} . \]
Proof : We look at all triples of nodes in V. Each triple belongs to one of the sets $T_0$, $T_1$, $T_2$, or
$T_3$. The algorithm chooses such a triple by choosing a node v together with two adjacent edges.
Therefore the selected triples belong to the set $T_2 \cup T_3$. We select a triple from $T_2$ if we choose
the unique node adjacent to both edges and the corresponding edges. Therefore there are exactly
$|T_2|$ choices that choose a triple belonging to $T_2$.
A triple from set $T_3$ can be chosen in three different ways by selecting one of the three nodes
of the triple together with both adjacent edges. Since each choice of a path of length two has the
same probability, the probability of choosing a triple in $T_3$ is exactly $3\cdot|T_3|/(|T_2|+3\cdot|T_3|)$ as
stated. □
A streaming algorithm COUNTTRIANGLES2, which outputs an estimate of $|T_3|$, easily follows.
It can be adjusted using an input parameter s and is given in Figure 9.9.
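COUNTTRIANGLES2 (s ∈ N)
    Run s instances of SAMPLETRIANGLE2 in parallel.
    Let $\beta_i$ be the value returned by the $i$th instance.
    $\tilde{T}_3 \leftarrow \left(\frac{1}{s}\sum_{i=1}^{s}\beta_i\right)\cdot\frac{|T_2|+3\cdot|T_3|}{3} = \left(\frac{1}{s}\sum_{i=1}^{s}\beta_i\right)\cdot\sum_{v\in V} d_v\cdot(d_v-1)/6$.
    return $\tilde{T}_3$.
Figure 9.9: The 3-pass algorithm COUNTTRIANGLES2 for incidence streams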
Lemma 9.2.2 Algorithm COUNTTRIANGLES2 outputs a value $\tilde{T}_3$ having expected value $E[\tilde{T}_3] = |T_3|$.
If $s \ge \frac{1}{\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\ln\left(\frac{2}{\delta}\right)$, then with probability $1-\delta$ the algorithm outputs a value $\tilde{T}_3$
satisfying
\[ (1-\epsilon)\cdot|T_3| \le \tilde{T}_3 \le (1+\epsilon)\cdot|T_3| . \]
Proof : Equivalent to the proof of Lemma 9.1.3. □
We can again develop an algorithm COUNTTRIANGLESSAFE2 based on COUNTTRIANGLES2.
It has a number s and a desired error probability ψ as input parameters and outputs a pair $(\tilde{T}_3, \tilde{\epsilon})$
where $\tilde{\epsilon}$ signals that $\tilde{T}_3$ is a $(1\pm\tilde{\epsilon})$-approximation of $|T_3|$. It uses at most s parallel instances of
SAMPLETRIANGLE2. We show the pseudocode of COUNTTRIANGLESSAFE2 in Figure 9.10.
Lemma 9.2.3 Let $(\tilde{T}_3, \tilde{\epsilon})$ be the output of algorithm COUNTTRIANGLESSAFE2. With probability
$1-\psi$ the following statements are true:
\[ (1-\tilde{\epsilon})\cdot|T_3| < \tilde{T}_3 < (1+\tilde{\epsilon})\cdot|T_3| \]
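COUNTTRIANGLESSAFE2 (s ∈ N, ψ ∈ (0, 1))
    Set $r \leftarrow \lceil 2\cdot\log(2/\psi)\rceil$.
    Set $t \leftarrow \lceil s/(4\log(2/\psi))\rceil$.
    Run r instances of COUNTTRIANGLES2(t) in parallel.
    Let $\tilde{T}_3^{(i)}$ be the value returned by the $i$th instance.
    $\tilde{T}_3 \leftarrow \mathrm{median}_i(\tilde{T}_3^{(i)})$.
    Set $\tilde{\epsilon} \leftarrow \sqrt{\frac{176}{\tilde{T}_3}\cdot\frac{|T_2|+3\cdot|T_3|}{s}\cdot\log\frac{2}{\psi}} = \sqrt{\frac{88}{\tilde{T}_3}\cdot\frac{\sum_{v\in V} d_v\cdot(d_v-1)}{s}\cdot\log\frac{2}{\psi}}$.
    return $(\tilde{T}_3, \tilde{\epsilon})$.
Figure 9.10: The 3-pass algorithm COUNTTRIANGLESSAFE2 for incidence streams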
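If $s \ge \frac{352}{\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\log(2/\psi)$, then the algorithm outputs $(\tilde{T}_3, \tilde{\epsilon})$ with $\tilde{\epsilon} \le \epsilon$.
Proof : Equivalent to the proof of Lemma 9.1.4. □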
To get small amortized expected update time we proceed as follows. Each time the
incidence list of a new vertex starts, we compute the values i and j for every instance. Then we
insert the j-values into a global priority queue, keeping a pointer to the corresponding instance.
When we then process the incidence list of the current vertex we maintain a global counter for the
number of neighbors of the current vertex we have seen. If this number is equal to the smallest
value stored in the priority queue we remove it and process the corresponding instance. After the
incidence list has been processed, we empty the priority queue. This way, each instance of the
algorithm requires O(1) time per vertex. Additionally, we need $O(s\cdot\log|V|)$ time to process the
removal of the smallest element in the priority queue. Overall, the amortized cost of the second
pass is $O\left(1+s\cdot\frac{|V|}{|E|}\right)$, which is constant for moderately large values of $|E|$. To implement the
third pass we use hashing in a similar way as in the algorithm for adjacency lists. This leads to
expected constant update time for the third pass.
Theorem 26 There is a 3-Pass streaming algorithm to count the number of triangles in incidence
streams up to a multiplicative error of $1\pm\epsilon$ with probability at least $1-\psi$, which needs
\[ O\left(\frac{1}{\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\log\frac{1}{\psi}\cdot\log|V|\right) \]
memory bits and amortized expected update time
\[ O\left(1+\frac{1}{\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\log\frac{1}{\psi}\cdot\frac{|V|}{|E|}\right) . \]
9.2.2 1 Pass Algorithm
To get a one pass algorithm we will again combine the passes of SAMPLETRIANGLE2. The first
pass only counts the number P of paths of length 2 in the graph. Instead of counting this number
in advance, we will start an instance of the streaming algorithm for each guess $\tilde{P}$ of the number
of length-2-paths in the set $\{1, 2, 4, 8, \ldots, |V|^3\}$. In parallel we will count P. At the end we can
find one instance started with a value $\tilde{P}$ satisfying $P \le \tilde{P} < 2P$. We choose the result of this
instance as the result of our algorithm.
We have to develop a data stream algorithm which only relies on an estimation $\tilde{P}$ fulfilling
$P \le \tilde{P} < 2P$.
To combine the second and third pass we only test all edges seen after drawing the sample.
The algorithm SAMPLETRIANGLEONEPASS2 is given in Figure 9.11 and can be implemented
using $O(\log^2|V|)$ memory bits.
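The choice of the guess that is used at the end of the stream can be illustrated as follows. The sketch assumes that P is a positive integer and that logarithms are base 2; the function name `select_guess` is hypothetical.

```python
def select_guess(P):
    """Return (i, 2**i) with P <= 2**i < 2*P for an integer P >= 1.

    (P - 1).bit_length() equals ceil(log2(P)) for every P >= 1.
    """
    i = (P - 1).bit_length()
    return i, 1 << i
```

Since the selected guess is smaller than 2P, more than half of the values k in {1, ..., guess} satisfy k ≤ P, so the instance run with this guess selects a path with probability greater than 1/2.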
SAMPLETRIANGLEONEPASS2
    do the following things in parallel for $i = 0, 1, 2, \ldots, \lfloor\log(|V|^3)\rfloor$:
        Let $\tilde{P}_i := 2^i$.
        Uniformly choose one path of length 2 using algorithm UNIFORMTWOPATH($\tilde{P}_i$).
        If UNIFORMTWOPATH did not select a path until the end of the stream, then return ⊥.
        Let (a, v, b) be the selected path.
        After choosing the path, test if edge (a, b) appears within the rest of the stream.
        if (a, b) appears in the stream after the incidence list of v then set $\beta_i \leftarrow 1$
        else set $\beta_i \leftarrow 0$
    in parallel to the for loop do: count the number $P = \sum_v d_v(d_v-1)/2$.
    set $\beta \leftarrow \beta_{\lceil\log P\rceil}$
    return β
Figure 9.11: The 1-pass algorithm SAMPLETRIANGLEONEPASS2 for incidence streams
Lemma 9.2.4 Algorithm SAMPLETRIANGLEONEPASS2 outputs with probability at least 1/2 a
value β having expected value
\[ E[\beta] = \frac{2\cdot|T_3|}{|T_2|+3\cdot|T_3|} . \]
Otherwise it outputs the value ⊥.
Proof : We set $\beta = \beta_i$ with $i = \lceil\log P\rceil$ at the end of the algorithm. This value of $\beta_i$ has been
set to 0 or 1 if UNIFORMTWOPATH($\tilde{P}_i$) did select a path of length two. Since $\tilde{P}_i = 2^{\lceil\log P\rceil}$ we
have $P \le \tilde{P}_i < 2\cdot P$.
UNIFORMTWOPATH($\tilde{P}_i$) selects a path of length two by first choosing $k \in \{1, \ldots, \tilde{P}_i\}$ uniformly
at random and then selecting the $k$th path of length two in the stream. If $k \le \tilde{P}_i/2$ we
have $k \le P$ and therefore a path is selected. This happens with probability 1/2 and in that case
SAMPLETRIANGLEONEPASS2 does not output ⊥.
Let us now condition on the event that $\beta \ne \bot$ and analyse the expected value of β.
Let {a, b, c} be a fixed triangle. W.l.o.g. we assume that we see the incidence list of a first in
the stream, then the incidence list of b and then the incidence list of c.
In the algorithm SAMPLETRIANGLE2 we detected the triangle by selecting (a, b, c), (c, a, b),
or (b, c, a) as the path of length two.
Now only the selections of (a, b, c) or (c, a, b) lead to a detection of the triangle (because
the edge (c, a) resp. (c, b) appears in the incidence stream after selecting b resp. a). The
selection of (b, c, a) as a path of length two is done using the incidence list of c. Therefore the
incidence lists of a and b have passed and we don't detect the edge (a, b). We conclude that the
probability to output β = 1 (under the condition that $\beta \ne \bot$) is exactly 2/3 times the probability
of SAMPLETRIANGLE2 to output β = 1.
□
A streaming algorithm COUNTTRIANGLESONEPASS2, which outputs an estimate of $|T_3|$, easily
follows. It can be adjusted using an input parameter s and is given in Figure 9.12.
COUNTTRIANGLESONEPASS2 (s ∈ N)
    Run s instances of SAMPLETRIANGLEONEPASS2 in parallel.
    Let $s'$ be the number of instances not returning ⊥.
    Let $\beta_i$ be the value returned by the $i$th such instance (not returning ⊥).
    $\tilde{T}_3 \leftarrow \left(\frac{1}{s'}\sum_{i=1}^{s'}\beta_i\right)\cdot\frac{|T_2|+3\cdot|T_3|}{2} = \left(\frac{1}{s'}\sum_{i=1}^{s'}\beta_i\right)\cdot\sum_{v\in V} d_v\cdot(d_v-1)/4$.
    return $\tilde{T}_3$.
Figure 9.12: The 1-pass algorithm COUNTTRIANGLESONEPASS2 for incidence streams
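Lemma 9.2.5 Algorithm COUNTTRIANGLESONEPASS2 outputs a value $\tilde{T}_3$ having expected
value $E[\tilde{T}_3] = |T_3|$. If $\epsilon \le 1/2$ and $s \ge \frac{6}{\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\ln\left(\frac{4}{\delta}\right)$, then with probability $1-\delta$
the algorithm outputs a value $\tilde{T}_3$ satisfying
\[ (1-\epsilon)\cdot|T_3| \le \tilde{T}_3 \le (1+\epsilon)\cdot|T_3| . \]
Proof : The expected value of $\tilde{T}_3$ follows easily from Lemma 9.2.4.
Let $\epsilon \le 1/2$ and $s \ge \frac{6}{\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\ln\left(\frac{4}{\delta}\right)$. First we show that $s' \ge \frac{3}{2\cdot\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\ln\left(\frac{4}{\delta}\right)$
with probability at least $1-\delta/2$.
Let $c_j \in \{0, 1\}$ be the random indicator variable which is 1 if the $j$th instance of SAMPLETRIANGLEONEPASS2
returns 0 or 1 and which is 0 if the $j$th instance returns ⊥. By Lemma 9.2.4
we have $E[c_j] \ge 1/2$. By Chernoff Bounds [58]:
\[ \Pr\left[\sum_{j=1}^{s} c_j \le \frac{s}{2}\cdot E[c_j]\right] = \Pr\left[\frac{1}{s}\sum_{j=1}^{s} c_j \le \frac{1}{2}\cdot E[c_j]\right] < e^{-\frac{1}{4}\cdot E[c_j]\cdot s/2} . \]
Therefore we have:
\[ \Pr\left[s' \le \frac{3}{2\cdot\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\ln\left(\frac{4}{\delta}\right)\right] \le e^{-s/16} \le \delta/4 \]
for $\epsilon < 1/2$.
We now condition on the event that $s' \ge \frac{3}{2\cdot\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\ln\left(\frac{4}{\delta}\right)$.
We again use Chernoff Bounds [58]:
\[ \Pr\left[\frac{1}{s'}\sum_{i=1}^{s'}\beta_i \ge (1+\epsilon)E[\beta]\right] < e^{-\epsilon^2\cdot E[\beta]\cdot s'/3} \]
\[ \Pr\left[\frac{1}{s'}\sum_{i=1}^{s'}\beta_i \le (1-\epsilon)E[\beta]\right] < e^{-\epsilon^2\cdot E[\beta]\cdot s'/2} \]
For $s' \ge \frac{3}{2\cdot\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\ln\left(\frac{4}{\delta}\right)$ the sum of both probabilities is bounded by $\delta/2$.
□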
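The last step of the proof can be spelled out as a short calculation. It only combines the value $E[\beta] = 2\cdot|T_3|/(|T_2|+3\cdot|T_3|)$ from Lemma 9.2.4 with the lower bound on $s'$ that the proof conditions on; the intermediate steps are added here for illustration.

```latex
\[
\epsilon^2\cdot E[\beta]\cdot\frac{s'}{3}
\;\ge\;
\epsilon^2\cdot\frac{2\cdot|T_3|}{|T_2|+3\cdot|T_3|}\cdot\frac{1}{3}\cdot
\frac{3}{2\cdot\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\ln\frac{4}{\delta}
\;=\;\ln\frac{4}{\delta},
\]
\[
\text{hence}\quad
e^{-\epsilon^2\cdot E[\beta]\cdot s'/3}\le\frac{\delta}{4}
\quad\text{and}\quad
e^{-\epsilon^2\cdot E[\beta]\cdot s'/2}\le e^{-\epsilon^2\cdot E[\beta]\cdot s'/3}\le\frac{\delta}{4}.
\]
```

Adding the two tail bounds gives the stated bound of $\delta/2$.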
We can again develop an algorithm COUNTTRIANGLESONEPASSSAFE2 based on COUNTTRIANGLESONEPASS2.
It has a number s and a desired error probability ψ as input parameters
and outputs a pair $(\tilde{T}_3, \tilde{\epsilon})$. $\tilde{T}_3$ then is a $(1\pm\tilde{\epsilon})$-approximation of $|T_3|$ with probability $1-\psi$. It
uses at most s parallel instances of SAMPLETRIANGLEONEPASS2. We show the pseudocode of
COUNTTRIANGLESONEPASSSAFE2 in Figure 9.13.
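COUNTTRIANGLESONEPASSSAFE2 (s ∈ N, ψ ∈ (0, 1))
    Set $r \leftarrow \lceil 2\cdot\log(4/\psi)\rceil$.
    Set $t \leftarrow \lceil s/(4\log(4/\psi))\rceil$.
    Run r instances of COUNTTRIANGLESONEPASS2(t) in parallel.
    Let $\tilde{T}_3^{(i)}$ be the value returned by the $i$th instance.
    $\tilde{T}_3 \leftarrow \mathrm{median}_i(\tilde{T}_3^{(i)})$.
    Set $\tilde{\epsilon} \leftarrow \sqrt{\frac{1056}{\tilde{T}_3}\cdot\frac{|T_2|+3\cdot|T_3|}{s}\cdot\log\frac{4}{\psi}} = \sqrt{\frac{528}{\tilde{T}_3}\cdot\frac{\sum_{v\in V} d_v\cdot(d_v-1)}{s}\cdot\log\frac{4}{\psi}}$.
    return $(\tilde{T}_3, \tilde{\epsilon})$.
Figure 9.13: The 1-pass algorithm COUNTTRIANGLESONEPASSSAFE2 for incidence streams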
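Lemma 9.2.6 Let $(\tilde{T}_3, \tilde{\epsilon})$ be the output of algorithm COUNTTRIANGLESONEPASSSAFE2. With
probability $1-\psi$ the following statements are true:
\[ (1-\tilde{\epsilon})\cdot|T_3| < \tilde{T}_3 < (1+\tilde{\epsilon})\cdot|T_3| \]
If $s \ge \frac{2112}{\epsilon^2}\cdot\frac{|T_2|+3\cdot|T_3|}{|T_3|}\cdot\log(4/\psi)$, then the algorithm outputs $(\tilde{T}_3, \tilde{\epsilon})$ with $\tilde{\epsilon} \le \epsilon$.
Proof : Equivalent to the proof of Lemma 9.1.4. □
We use techniques to reduce the amortized update time as shown in the previous section.
Since we start $O(\log|V|)$ parallel instances for different guesses of $\tilde{P}$, the amortized update time
increases by a factor of $O(\log|V|)$.
Theorem 27 There is a 1-Pass streaming algorithm to count the number of triangles in incidence
streams up to a multiplicative error of $1\pm\epsilon$ with probability at least $1-\psi$, which needs
\[ O\left(\frac{1}{\epsilon^2}\cdot\left(1+\frac{|T_2|}{|T_3|}\right)\cdot\log\frac{1}{\psi}\cdot\log^2|V|\right) \]
memory bits and amortized expected update time
\[ O\left(\log(|V|)\cdot\left(1+\frac{1}{\epsilon^2}\cdot\left(1+\frac{|T_2|}{|T_3|}\right)\cdot\log\frac{1}{\psi}\cdot\frac{|V|}{|E|}\right)\right) . \]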
9.3 Counting Cliques of Arbitrary Size
Using the approach of the previous sections we can count cliques of α nodes in incidence streams
as well using one pass. We assume that α is a small constant. Let $S_\alpha$ be the set of α-stars (α
nodes $v_1, \ldots, v_\alpha$ and edges $(v_1, v_2), \ldots, (v_1, v_\alpha)$) in G and $K_\alpha$ be the set of cliques of size α in
G. Our memory bounds will depend on $|S_\alpha|/|K_\alpha|$. In network analysis we are interested in those
networks where this ratio is small, for example constant.
We use the method UNIFORMSTAR given in Figure 9.14 to uniformly choose an α-star. It uses
$O(\log|V|)$ memory bits and has expected running time $O(|V|\cdot\log|V|)$, not counting the time to
read the stream. The selected star is then used by SAMPLECLIQUEONEPASS given in Figure
9.15. Each instance of SAMPLECLIQUEONEPASS uses $O(\log^2|V|)$ memory bits. When we run
s parallel instances of SAMPLECLIQUEONEPASS, the whole method has amortized expected
running time $O\left(s\cdot\frac{|V|\cdot\log|V|}{|E|}+1\right)$ per edge.
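The quantity $|S_\alpha|$ on which the bounds depend can be computed from the node degrees alone, since a node of degree d is the middle node of $\binom{d}{\alpha-1}$ stars. The following sketch illustrates this; the function name `count_stars` and the list-of-degrees input are assumptions made for the example.

```python
from math import comb

def count_stars(degrees, alpha):
    """Number of alpha-stars: sum over all nodes of C(degree, alpha - 1)."""
    return sum(comb(d, alpha - 1) for d in degrees)
```

For α = 3 the stars are exactly the paths of length 2, so on the complete graph with four nodes the function returns 12, the same value as $|T_2|+3\cdot|T_3|$.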
Lemma 9.3.1 Algorithm SAMPLECLIQUEONEPASS with probability at least 1/2 outputs a value
β having expected value
\[ E[\beta] = \frac{2\cdot|K_\alpha|}{|S_\alpha|} . \]
Otherwise it outputs the value ⊥.
UNIFORMSTAR(P)
    Select value k uniformly from the set {1, ..., P}.
    For each node v in the stream do
        Use the reservoir sampling technique of [120] to obtain α − 1 sample nodes
        from the incidence list of v.
        Let d be the degree of node v.
        If $\binom{d}{\alpha-1} \ge k$ then
            return sample nodes.
        end if
        $k \leftarrow k-\binom{d}{\alpha-1}$
    end do
    return ⊥.
Figure 9.14: The 1-pass algorithm UNIFORMSTAR for incidence streams
Proof : We set $\beta = \beta_i$ with $i = \lceil\log P\rceil$ at the end of the algorithm. This value of $\beta_i$ has been
set to 0 or 1 if UNIFORMSTAR($\tilde{P}_i$) did select a star. Since $\tilde{P}_i = 2^{\lceil\log P\rceil}$ we have $P \le \tilde{P}_i < 2\cdot P$.
UNIFORMSTAR($\tilde{P}_i$) selects a star by first choosing $k \in \{1, \ldots, \tilde{P}_i\}$ uniformly at random and
then selecting the k-th star in the stream. If $k \le \tilde{P}_i/2$ we have $k \le P$ and therefore a star is
selected. This happens with probability 1/2 and in that case SAMPLECLIQUEONEPASS outputs
β = 0 or β = 1.
Let us now condition on the event that $\beta \ne \bot$ and analyse the expected value of β.
If the star is sampled from the incidence list of v and v is the first or the second of the star-nodes
within the stream, we can find all other edges after selecting the sample star. So if the star
belongs to a clique, this clique is detected.
However, if v is the third or a later node in the stream we miss the edge connecting the first and
second node of the star. When we fix a clique, the probability that the chosen node v is the first or
second node of the clique in the stream is 2/α. Therefore two choices of stars lead to a detection
of a fixed clique. It follows that the expected value of β is
\[ E[\beta] = \frac{2}{\alpha}\cdot\frac{\alpha\cdot|K_\alpha|}{|S_\alpha|} . \]
□
A streaming algorithm COUNTCLIQUESONEPASS, which outputs an estimate of $|K_\alpha|$, easily
follows. It can be adjusted using an input parameter s and is given in Figure 9.16.
Lemma 9.3.2 Algorithm COUNTCLIQUESONEPASS outputs a value $\tilde{K}_\alpha$ having expected value
$E[\tilde{K}_\alpha] = |K_\alpha|$.
SAMPLECLIQUEONEPASS
    Do the following things in parallel for $i = 0, 1, 2, \ldots, \lfloor\log(|V|^3)\rfloor$:
        Let $\tilde{P}_i = 2^i$.
        Uniformly choose one α-star using algorithm UNIFORMSTAR($\tilde{P}_i$).
        If UNIFORMSTAR did not select a star until the end of the stream, then return ⊥.
        Let $(v_1, v_2, \ldots, v_\alpha)$ be this star with $v_1$ as middle node.
        After choosing the star, test if each edge $(v_i, v_j)$ for $i, j \in \{2, \ldots, \alpha\}$ and $i \ne j$ appears
        within the rest of the stream.
        if all edges appear in the stream after the incidence list of $v_1$ then set $\beta_i \leftarrow 1$
        else set $\beta_i \leftarrow 0$
    in parallel do: count the number $P = \sum_{v\in V}\binom{d_v}{\alpha-1}$ of stars in the graph.
    set $\beta \leftarrow \beta_{\lceil\log P\rceil}$
    return β
Figure 9.15: The 1-pass algorithm SAMPLECLIQUEONEPASS for incidence streams
COUNTCLIQUESONEPASS (s ∈ N)
    Run s instances of SAMPLECLIQUEONEPASS in parallel.
    Let $s'$ be the number of instances not returning ⊥.
    Let $\beta_i$ be the value returned by the $i$th such instance (not returning ⊥).
    $\tilde{K}_\alpha \leftarrow \left(\frac{1}{s'}\sum_{i=1}^{s'}\beta_i\right)\cdot\frac{|S_\alpha|}{2} = \left(\frac{1}{s'}\sum_{i=1}^{s'}\beta_i\right)\cdot\frac{\sum_{v\in V}\binom{d_v}{\alpha-1}}{2}$.
    return $\tilde{K}_\alpha$.
Figure 9.16: The 1-pass algorithm COUNTCLIQUESONEPASS for incidence streams
If $\epsilon \le 1/2$ and $s \ge \frac{6}{\epsilon^2}\cdot\frac{|S_\alpha|}{|K_\alpha|}\cdot\ln\left(\frac{4}{\delta}\right)$, then with probability $1-\delta$ the algorithm outputs a value
$\tilde{K}_\alpha$ satisfying
\[ (1-\epsilon)\cdot|K_\alpha| \le \tilde{K}_\alpha \le (1+\epsilon)\cdot|K_\alpha| . \]
Proof : Equivalent to the proof of Lemma 9.2.5. □
Based on that we can again develop an algorithm COUNTCLIQUESONEPASSSAFE. It has a
number s and a desired error probability ψ as input parameters and outputs a pair $(\tilde{K}_\alpha, \tilde{\epsilon})$. The
algorithm gives the guarantee that $\tilde{K}_\alpha$ is a $(1\pm\epsilon)$-approximation of $|K_\alpha|$ with probability $1-\psi$
and uses at most s parallel instances of SAMPLECLIQUEONEPASS. We show the pseudocode of
COUNTCLIQUESONEPASSSAFE in Figure 9.17.
Lemma 9.3.3 Let $(\tilde{K}_\alpha, \tilde{\epsilon})$ be the output of algorithm COUNTCLIQUESONEPASSSAFE. With
probability $1-\psi$ the following statements are true:
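COUNTCLIQUESONEPASSSAFE (s ∈ N, ψ ∈ (0, 1))
    Set $r \leftarrow \lceil 2\cdot\log(4/\psi)\rceil$.
    Set $t \leftarrow \lceil s/(4\log(4/\psi))\rceil$.
    Run r instances of COUNTCLIQUESONEPASS(t) in parallel.
    Let $\tilde{K}_\alpha^{(i)}$ be the value returned by the $i$th instance.
    $\tilde{K}_\alpha \leftarrow \mathrm{median}_i(\tilde{K}_\alpha^{(i)})$.
    Set $\tilde{\epsilon} \leftarrow \sqrt{\frac{1056}{\tilde{K}_\alpha}\cdot\frac{|S_\alpha|}{s}\cdot\log\frac{4}{\psi}} = \sqrt{\frac{1056}{\tilde{K}_\alpha}\cdot\frac{\sum_{v\in V}\binom{d_v}{\alpha-1}}{s}\cdot\log\frac{4}{\psi}}$.
    return $(\tilde{K}_\alpha, \tilde{\epsilon})$.
Figure 9.17: The 1-pass algorithm COUNTCLIQUESONEPASSSAFE for incidence streams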
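\[ (1-\tilde{\epsilon})\cdot|K_\alpha| < \tilde{K}_\alpha < (1+\tilde{\epsilon})\cdot|K_\alpha| \]
If $s \ge \frac{2112}{\epsilon^2}\cdot\frac{|S_\alpha|}{|K_\alpha|}\cdot\log(4/\psi)$, then the algorithm outputs $(\tilde{K}_\alpha, \tilde{\epsilon})$ with $\tilde{\epsilon} \le \epsilon$.
Proof : Equivalent to the proof of Lemma 9.1.4. □
Theorem 28 There is a 1-Pass streaming algorithm to count the number of $K_\alpha$ in incidence
streams up to a multiplicative error of $1\pm\epsilon$ with probability at least $1-\psi$, which needs
\[ O\left(\frac{1}{\epsilon^2}\cdot\left(1+\frac{|S_\alpha|}{|K_\alpha|}\right)\cdot\log\left(\frac{1}{\psi}\right)\cdot\log^2|V|\right) \]
memory bits and has amortized expected update time
\[ O\left(1+\frac{1}{\epsilon^2}\cdot\left(1+\frac{|S_\alpha|}{|K_\alpha|}\right)\cdot\log\frac{1}{\psi}\cdot\log^2|V|\cdot\frac{|V|}{|E|}\right) . \]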
9.4 Counting K3,3 in Incidence Streams
We propose a method to estimate the number of $K_{3,3}$, when the graph is directed and given as
an incidence stream and the outdegree of each node is bounded by Δ. The stream of edges is
ordered by destination nodes (so we see for each destination node all source nodes after one
another). Our assumption is justified because in large social graphs and the webgraph there are
often only a small number of links going out of each node. The graphs are often stored on hard
disk(s) and for each node all incoming edges are precomputed and stored with the graph.
We do not assume any ordering by source nodes. Let $K_{3,3}$ denote the set of $K_{3,3}$ minors and
$K_{3,1}$ denote the set of $K_{3,1}$ minors as shown in Figure 9.18.
We will first show how we can choose a $K_{3,1}$ uniformly at random from the stream. This is
done similarly to choosing the length-2-paths in the triangle algorithm for incidence lists. We
start a number of different estimations on the number of $K_{3,1}$. In parallel we count the number
\[ |K_{3,1}| = \sum_{i=1}^{|V|} d_i\cdot(d_i-1)\cdot(d_i-2)/6 . \]
Figure 9.18: A $K_{3,1}$ (on the left) and a $K_{3,3}$ (on the right).
We will extend the method UNIFORMTWOPATH to a method UNIFORMK3,1 as shown in Figure
9.19. It has an estimation P of $|K_{3,1}|$ as input parameter and selects uniformly at random
one $K_{3,1}$ motif from the incidence stream. Using UNIFORMK3,1 we can develop a method
SAMPLEK3,3, outputting a variable β whose expectation is related to the number $|K_{3,3}|$. The
method SAMPLEK3,3 is given in Figure 9.20. It can be implemented using $O(\log^2|V|)$ memory
bits.
UNIFORMK3,1(P)
    Select value k uniformly from the set {1, ..., P}.
    For each node v in the incidence list do:
        If k > 0 then
            Set $h \leftarrow f^{-1}(k)$ (with $f(x) := \binom{x}{3}$)
            Set $k_2 \leftarrow k-f(h-1)$
            Set $i \leftarrow \left\lceil\sqrt{2k_2+\frac{1}{4}}+\frac{1}{2}\right\rceil$
            Set $j \leftarrow i-\frac{i^2-i}{2}+k_2-1$
            Pass over the complete incidence list of node v.
            If incidence list of v contains more than j edges then
                a ← the $h$th node in the incidence list of v
                b ← the $i$th node in the incidence list of v
                c ← the $j$th node in the incidence list of v
                u ← v
            end if
            d ← degree of node v
            $k \leftarrow k-\frac{d\cdot(d-1)\cdot(d-2)}{6}$
        end if
    end do
    return edges (a, u), (b, u) and (c, u)
Figure 9.19: The 1-pass algorithm UNIFORMK3,1 for directed incidence streams
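The index computation of Figure 9.19 can be illustrated with the following Python sketch. It rests on two assumptions that the figure does not spell out: $f^{-1}(k)$ is read as the smallest integer h with $\binom{h}{3} \ge k$, and the square-root step is evaluated with integer arithmetic to avoid rounding errors. The simple loop for h is chosen for readability only, and the name `kth_triple` is hypothetical.

```python
from math import comb, isqrt

def kth_triple(k):
    """Return (j, i, h) with j < i < h: the k-th triple (1-based).

    Triples are ordered by the largest index h first, then by the pair
    (j, i) in the same order as used for paths of length 2.
    """
    h = 3
    while comb(h, 3) < k:          # assumed reading of h = f^{-1}(k)
        h += 1
    k2 = k - comb(h - 1, 3)        # rank among the triples with largest index h
    i = (1 + isqrt(8 * k2 - 7)) // 2 + 1
    j = i - (i * i - i) // 2 + k2 - 1
    return j, i, h
```

For k = 1, ..., 5 the function returns (1,2,3), (1,2,4), (1,3,4), (2,3,4), (1,2,5), so every set of three positions in an incidence list is hit by exactly one value of k.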
SAMPLEK3,3
    Do the following things in parallel for $i = 0, 1, 2, \ldots, \lfloor\log(|V|^4)\rfloor$:
        Let $\tilde{P}_i = 2^i$.
        From all $K_{3,1}$ occurring in the stream choose one uniformly using UNIFORMK3,1($\tilde{P}_i$).
        If UNIFORMK3,1 did not select a $K_{3,1}$ until the end of the stream, then return ⊥.
        Let the three edges of the chosen $K_{3,1}$ be (a, u), (b, u) and (c, u)
        Select uniformly $x_1, x_2 \in \{a, b, c\}$
        Choose uniformly random variables $k_1, k_2 \in \{1, 2, \ldots, \Delta\}$
        If $k_1 = k_2 \wedge x_1 = x_2$ then set $\beta_i \leftarrow 0$
        else:
            Go on passing over the rest of the stream (the part behind the occurrence of the $K_{3,1}$).
            Select $(x_1, v)$ as the $k_1$-th edge $(x_1, \cdot)$ after selecting the $K_{3,1}$.
            Select $(x_2, w)$ as the $k_2$-th edge $(x_2, \cdot)$ after selecting the $K_{3,1}$.
            From the time of selecting $(x_1, v)$:
                check if (a, v), (b, v), (c, v) are present in the stream
            From the time of selecting $(x_2, w)$:
                check if (a, w), (b, w), (c, w) are present in the stream
            If both is the case, then set $\beta_i \leftarrow 1$ else set $\beta_i \leftarrow 0$.
    In parallel to the for loop count the number $|K_{3,1}| = P = \sum_{v\in V}\binom{d_v}{3}$ of $K_{3,1}$ in the graph.
    set $\beta \leftarrow \beta_{\lceil\log P\rceil}$
    return β
Figure 9.20: The 1-pass algorithm SAMPLEK3,3 for directed incidence streams
Lemma 9.4.1 Algorithm SAMPLEK3,3 outputs with probability at least 1/2 a random value β
having
\[ E[\beta] = \frac{2\cdot|K_{3,3}|}{9\cdot\Delta^2\cdot|K_{3,1}|} . \]
Otherwise it outputs the value ⊥.
Proof : We set $\beta = \beta_i$ with $i = \lceil\log P\rceil$ at the end of the algorithm. This value of $\beta_i$ has been
set to 0 or 1 if UNIFORMK3,1($\tilde{P}_i$) did select a $K_{3,1}$. Since $\tilde{P}_i = 2^{\lceil\log P\rceil}$ we have $P \le \tilde{P}_i < 2\cdot P$.
UNIFORMK3,1($\tilde{P}_i$) selects a $K_{3,1}$ by first choosing $k \in \{1, \ldots, \tilde{P}_i\}$ uniformly at random and
then selecting the k-th $K_{3,1}$ in the stream. If $k \le \tilde{P}_i/2$ we have $k \le P$ and therefore a $K_{3,1}$ is
selected. This happens with probability 1/2 and in that case SAMPLEK3,3 does not output ⊥.
Let us now condition on the event that $\beta \ne \bot$ and analyse the expected value of β. Let
(a, b, c, u, v, w) be an arbitrary fixed $K_{3,3}$ with edges directed from a, b, c to u, v and w. Let u
be the vertex whose incidence list appears first within the incidence stream, with v and w occurring after
u within the stream. The $K_{3,3}$ will be detected exactly when all of the following events occur:
• a, b, c, u are chosen as $K_{3,1}$ with u being the destination node
• v and w must be chosen
• $x_1$ must be the first within the incidence list of v.
• $x_2$ must be the first within the incidence list of w.
The probability of the first event is $1/|K_{3,1}|$.
Conditioned on the first event the probability to choose v and w is $2/\Delta^2$: Each edge $(x_1, \cdot)$
appearing after $(x_1, u)$ in the stream has a probability of $1/\Delta$ to be chosen by the algorithm. We
know that $(x_1, v)$ and $(x_1, w)$ appear after $(x_1, u)$ in the stream. Therefore each of these two
edges has a probability of $1/\Delta$ to be chosen. By a similar argument we have independently a
probability of $1/\Delta$ to choose the edge $(x_2, v)$ resp. $(x_2, w)$. We select the nodes v and w if we
either select $(x_1, v)$ and $(x_2, w)$ or $(x_1, w)$ and $(x_2, v)$. Therefore the probability to choose v and
w is $2/\Delta^2$.
Observe that the probability for v and w to be chosen does not depend on the choice of $x_1$ and
$x_2$. We can therefore exchange the order in which we analyse these events.
Conditioned on the first two events the probability for the third event is 1/3, and so is the probability
for the fourth event. Altogether we get a probability of $\frac{2}{9\cdot\Delta^2\cdot|K_{3,1}|}$ to choose the fixed $K_{3,3}$. □
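The probabilities of the four events can be multiplied explicitly. The block below only restates the argument of the proof as a worked equation, writing Δ for the bound on the outdegree; the final summation over all $K_{3,3}$ is the step that turns the detection probability into the expected value.

```latex
\[
\Pr[\text{a fixed } K_{3,3} \text{ is detected}]
= \underbrace{\frac{1}{|K_{3,1}|}}_{\text{first event}}
\cdot \underbrace{\frac{2}{\Delta^2}}_{\text{second event}}
\cdot \underbrace{\frac{1}{3}}_{\text{third event}}
\cdot \underbrace{\frac{1}{3}}_{\text{fourth event}}
= \frac{2}{9\cdot\Delta^2\cdot|K_{3,1}|},
\]
\[
E[\beta] = \sum_{K_{3,3}} \frac{2}{9\cdot\Delta^2\cdot|K_{3,1}|}
= \frac{2\cdot|K_{3,3}|}{9\cdot\Delta^2\cdot|K_{3,1}|}.
\]
```

Multiplying the average of the $\beta_i$ by the inverse factor $\frac{9}{2}\cdot\Delta^2\cdot|K_{3,1}|$ therefore gives an estimator whose expected value is $|K_{3,3}|$.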
A streaming algorithm COUNTK3,3, which outputs an estimate of $|K_{3,3}|$, easily follows. It can
be adjusted using an input parameter s. It is given in Figure 9.21.
COUNTK3,3 (s ∈ N)
    Run s instances of SAMPLEK3,3 in parallel.
    Let $s'$ be the number of instances not returning ⊥.
    Let $\beta_i$ be the value returned by the $i$th such instance (not returning ⊥).
    $\tilde{K}_{3,3} \leftarrow \left(\frac{1}{s'}\sum_{i=1}^{s'}\beta_i\right)\cdot\frac{9}{2}\cdot\Delta^2\cdot|K_{3,1}| = \left(\frac{1}{s'}\sum_{i=1}^{s'}\beta_i\right)\cdot\frac{9}{2}\cdot\Delta^2\cdot\sum_{v\in V}\binom{d_v}{3}$.
    return $\tilde{K}_{3,3}$.
Figure 9.21: The 1-pass algorithm COUNTK3,3 for directed incidence streams
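Lemma 9.4.2 Algorithm COUNTK3,3 outputs a value $\tilde{K}_{3,3}$ having expected value $E[\tilde{K}_{3,3}] = |K_{3,3}|$.
If $\epsilon \le 1/2$ and $s \ge \frac{54}{\epsilon^2}\cdot\frac{\Delta^2\cdot|K_{3,1}|}{|K_{3,3}|}\cdot\ln\left(\frac{2}{\delta}\right)$, then with probability $1-\delta$ the algorithm outputs
a value $\tilde{K}_{3,3}$ satisfying
\[ (1-\epsilon)\cdot|K_{3,3}| \le \tilde{K}_{3,3} \le (1+\epsilon)\cdot|K_{3,3}| . \]
Proof : Equivalent to the proof of Lemma 9.2.5. □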
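COUNTK3,3SAFE (s ∈ N, ψ ∈ (0, 1))
    Set $r \leftarrow \lceil 2\cdot\log(4/\psi)\rceil$.
    Set $t \leftarrow \lceil s/(4\log(4/\psi))\rceil$.
    Run r instances of COUNTK3,3(t) in parallel.
    Let $\tilde{K}_{3,3}^{(i)}$ be the value returned by the $i$th instance.
    Set $\tilde{K}_{3,3} \leftarrow \mathrm{median}_i(\tilde{K}_{3,3}^{(i)})$.
    Set $\tilde{\epsilon} \leftarrow \sqrt{\frac{9504\cdot\Delta^2}{\tilde{K}_{3,3}}\cdot\frac{|K_{3,1}|}{s}\cdot\log\frac{4}{\psi}} = \sqrt{\frac{9504\cdot\Delta^2}{\tilde{K}_{3,3}}\cdot\frac{\sum_{v\in V}\binom{d_v}{3}}{s}\cdot\log\frac{4}{\psi}}$.
    return $(\tilde{K}_{3,3}, \tilde{\epsilon})$.
Figure 9.22: The 1-pass algorithm COUNTK3,3SAFE for directed incidence streams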
We can develop an algorithm COUNTK3,3SAFE based on COUNTK3,3. It has a number s and a
desired error probability ψ as input parameters and outputs a pair $(\tilde{K}_{3,3}, \tilde{\epsilon})$. The algorithm gives
the guarantee that $\tilde{K}_{3,3}$ is a $(1\pm\epsilon)$-approximation of $|K_{3,3}|$ with probability $1-\psi$. It uses at most
s parallel instances of SAMPLEK3,3. We show the pseudocode of COUNTK3,3SAFE in Figure
9.22.
Lemma 9.4.3 Let $(\tilde{K}_{3,3}, \tilde{\epsilon})$ be the output of algorithm COUNTK3,3SAFE. With probability $1-\psi$
the following statements are true:
\[ (1-\tilde{\epsilon})\cdot|K_{3,3}| < \tilde{K}_{3,3} < (1+\tilde{\epsilon})\cdot|K_{3,3}| \]
If $s \ge \frac{19008\cdot\Delta^2}{\epsilon^2}\cdot\frac{|K_{3,1}|}{|K_{3,3}|}\cdot\log(4/\psi)$, then the algorithm outputs $(\tilde{K}_{3,3}, \tilde{\epsilon})$ with $\tilde{\epsilon} \le \epsilon$.
Proof : Equivalent to the proof of Lemma 9.1.4. □
Theorem 29 There is a 1-Pass streaming algorithm to count the number of $K_{3,3}$ in incidence
streams ordered by destination nodes with outdegree bounded by Δ up to a multiplicative error
of ε with probability at least $1-\psi$, which needs
\[ O\left(\frac{\log^2(|V|)\cdot|K_{3,1}|\cdot\Delta^2\cdot\ln\left(\frac{1}{\psi}\right)}{|K_{3,3}|\cdot\epsilon^2}\right) \]
memory bits.
10 Conclusions
In this thesis we developed new methods to analyse dynamic geometric data streams and obtain
structural information about the large data sets encoded in the streams.
In Chapter 3 we first developed a method to draw a uniform random sample from a multiset
when the multiset is given as a turnstile data stream. The method we propose is a building
block for various data stream algorithms. As examples, we showed in Chapter 4 some direct
consequences of the new sampling method, i.e. how to maintain $\epsilon$-nets and $\epsilon$-approximations
of points when the point set is given as a dynamic geometric data stream. We also developed
a method to estimate the weight of a minimum tree spanning all the points encoded in a
dynamic geometric data stream. As random sampling and $\epsilon$-approximations are powerful tools in
computational geometry, we believe that our techniques have many more applications.
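To illustrate what sampling under insertions and deletions involves, here is a simplified sketch. It is not the construction of Chapter 3 but a commonly used stand-in: items are subsampled by hash into nested levels, and each level keeps three counters that recognise when exactly one distinct item survives. The class names, the use of `hashlib.blake2b` as a hash function, and the restriction to integer ids with non-negative multiplicities are all assumptions of this sketch. It returns a distinct id (ignoring multiplicities) or `None` on failure.

```python
import hashlib
from typing import List, Optional


class SingletonDetector:
    """Three counters that recognise a multiset with exactly one distinct id."""

    def __init__(self) -> None:
        self.count = 0      # sum of multiplicities
        self.id_sum = 0     # sum of multiplicity * id
        self.id_sq_sum = 0  # sum of multiplicity * id^2

    def update(self, item: int, delta: int) -> None:
        self.count += delta
        self.id_sum += delta * item
        self.id_sq_sum += delta * item * item

    def singleton(self) -> Optional[int]:
        """Return the id if exactly one distinct id is present, else None.

        Assumes multiplicities never become negative. Equality in the
        Cauchy-Schwarz inequality then holds only for a single id.
        """
        if self.count <= 0 or self.id_sum % self.count != 0:
            return None
        if self.count * self.id_sq_sum != self.id_sum ** 2:
            return None
        return self.id_sum // self.count


class TurnstileSampler:
    """Level j keeps the ids whose hash ends in at least j zero bits."""

    def __init__(self, levels: int = 32, seed: int = 0) -> None:
        self.seed = seed
        self.levels: List[SingletonDetector] = [
            SingletonDetector() for _ in range(levels)
        ]

    def _depth(self, item: int) -> int:
        data = f"{self.seed}:{item}".encode()
        h = int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
        depth = 0
        while depth + 1 < len(self.levels) and h & 1 == 0:
            depth += 1
            h >>= 1
        return depth

    def update(self, item: int, delta: int) -> None:
        # delta = +1 for an insertion, -1 for a deletion.
        for level in self.levels[: self._depth(item) + 1]:
            level.update(item, delta)

    def sample(self) -> Optional[int]:
        # Scan from the sparsest level; the first singleton found is returned.
        for level in reversed(self.levels):
            found = level.singleton()
            if found is not None:
                return found
        return None
```

Because a single instance can fail, one would typically run several instances with different seeds and use the first that returns an id.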
The space used by the algorithm for $\epsilon$-approximations, i.e. roughly $O(1/\epsilon^2)$, is essentially
optimal as a function of $\epsilon$. This is because it is known that, for some range spaces, the size of
$\epsilon$-approximations tends to $1/\epsilon^2$ as the VC dimension tends to infinity. However, for some range
spaces smaller $\epsilon$-approximations can be constructed, even for points delivered in an
insertions-only stream. For example, [118] showed how to compute $\epsilon$-approximations for ranges defined
by halfspaces in $d$ dimensions of size roughly $O(1/\epsilon^{2-2/(d+1)})$. We do not know how to extend
this result to dynamic geometric data streams.
To estimate the weight of the minimum spanning tree of the points encoded in a dynamic
geometric data stream we used $O(\log^3(1/\delta) \cdot (\log(\Delta)/\epsilon)^{O(d)})$ space. Although this is the first
algorithm to estimate this value in space polylogarithmic in $\Delta$, we believe that one can
develop algorithms with much better memory bounds.
One of the central results of this thesis, a universal method to construct coresets for k-median,
k-means, MaxCut, and more problems, has been given in Chapter 5. Our method is much simpler
than previous coreset methods [8, 60, 61] and makes fewer assumptions about the distribution of
the points. The only information needed to construct coresets is the number of points in heavy
cells (cells containing a certain number of points) of certain square grids. The simplicity of our
method enabled us to develop the fastest known PTAS for Euclidean MaxCut in Section 5.5.3
and the first efficient methods to maintain coresets for k-median, k-means, MaxCut, and more
problems on dynamic geometric data streams in Chapter 6.
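The information the construction relies on, the number of points in heavy cells of a square grid, can be illustrated with a small sketch. This version keeps exact per-cell counters under insertions and deletions, so it shows what is being counted rather than how to do so in small space. The update encoding (`"+"` and `"-"`), the function names, and the integer coordinates are assumptions of this example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Point = Tuple[int, ...]
Update = Tuple[str, Point]  # ("+", p) inserts p, ("-", p) deletes p


def cell_counts(updates: Iterable[Update], side: int) -> Counter:
    """Exact number of points per grid cell after processing all updates."""
    counts: Counter = Counter()
    for op, point in updates:
        if op not in ("+", "-"):
            raise ValueError(f"unknown update type: {op!r}")
        cell = tuple(coordinate // side for coordinate in point)
        counts[cell] += 1 if op == "+" else -1
        if counts[cell] == 0:
            del counts[cell]  # forget cells that became empty
    return counts


def heavy_cells(counts: Counter, threshold: int) -> Dict[Point, int]:
    """Cells that contain at least `threshold` points."""
    return {cell: n for cell, n in counts.items() if n >= threshold}
```

Running `cell_counts` with several values of `side` (for instance 1, 2, 4, ...) gives the counts for a family of nested grids.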
The coreset we obtain is of size $O(k \cdot \log n/\epsilon^{d+1})$ for k-median, $O(k \cdot \log n/\epsilon^{d+2})$ for
k-means, and $O(\log n/\epsilon^{d+1})$ for all other problems we consider. Recently some methods have
been proposed to obtain smaller coresets for k-median and k-means. Har-Peled and Kushal [60]
showed how to compute $\epsilon$-coresets of size $O(k^2/\epsilon^d)$ for k-median and of size $O(k^3/\epsilon^{d+1})$ for
k-means (not dependent on $n$). However, their coreset construction does not apply to dynamic
geometric data streams. Their results indicate that the space dependency of our algorithms on $n$
could also be improved.
All space requirements of our coreset algorithms depend exponentially on $d$ (we did not state
this in the formulas because we assumed a constant dimension). Therefore our methods are not
suited for high dimensional data. Recently Chen [26] proposed a method to obtain coresets
using space which depends polynomially on $d$. His method is suited for streams of insertions of
points, but does not translate to streams of insertions and deletions. It remains an interesting
open problem to develop coreset methods for dynamic geometric data streams that have space
complexity polynomial in the dimension $d$.
In Chapter 7 we used our coreset technique to develop the first kinetic data structure for the
Euclidean MaxCut problem. Our KDS can be extended to MaxTSP, MaxMatching, and average
distance. However, the time to compute a solution from the coreset (which has to be done for
each query to the data structure, or, alternatively with each event) can differ significantly.
Extending our KDS to k-median and k-means clustering requires additional ideas. The
technical problem here is that one cannot get a lower bound on the solution from the width of the
bounding box. Hence, it is not clear how to get an upper bound on the number of events.
Developing kinetic data structures for k-median and k-means therefore remains an interesting open
problem.
In Chapter 8 we presented an efficient implementation of a k-means clustering algorithm using
coresets. Our algorithm performs very well compared to KMHybrid [105] for small dimension
and small to medium k. The quality of the solutions varies less than that of KMHybrid, which
implies that we need fewer runs to guarantee a good solution. The main strength of our algorithm
is to quickly find relatively good approximations for many values of k, for example when a good
value for k is not known in advance. In this case, we can also use the coresets to compute the
average clustering coefficient and thus to find a good choice of k.
As mentioned above, some constructions for smaller coresets [26, 60] have recently been
developed. It would be interesting to measure whether using these smaller coresets leads to
even faster convergence of k-means based algorithms.
We have proposed a methodology in Chapter 9 to find $(1 \pm \epsilon)$-approximations of the number of
frequent subgraphs of large graphs given as data streams. The number of samples and the number
of memory bits needed by our algorithms depend on the number of certain small structures in these graphs.
Recent results on the internal structure of the webgraph or large social graphs [17, 81, 89, 92,
78] suggest that the amount of space needed by our algorithms to count motifs is constant or at
most logarithmic in the number of nodes for these graphs. Recent tests [17] suggest that our
algorithms can compute good estimates of the number of triangles of real webgraph crawls in
time comparable to the time needed to read the graph from the hard disk.
Bibliography
[1] P. K. Agarwal, J. Gao, and L. J. Guibas. Kinetic Medians and kd-Trees. Proceedings of
the 10th Annual European Symposium on Algorithms (ESA), pp. 5–16, 2002.
[2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating Extent Measures of
Points. Journal of the ACM, 51(4):606–635, July 2004.
[3] P. Agarwal, S. Har-Peled, and K. Varadarajan. Geometric Approximation via Coresets.
Survey available at http://valis.cs.uiuc.edu/ sariel/research/papers/04/survey/survey.pdf
[4] N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Fre-
quency Moments. J. Comput. Syst. Sci., 58(1), pp. 137–147, 1999.
[5] K. Alsabti, S. Ranka, and V. Singh. An Efficient k-Means Clustering Algorithm. Proceed-
ings of the first Workshop on High Performance Data Mining, 1998.
[6] D. Arthur and S. Vassilvitskii. How Slow is the k-Means Method? Proceedings of the 22nd
Annual ACM Symposium on Computational Geometry (SoCG), pp. 144–153, 2006.
[7] M. Badoiu and K. Clarkson. Smaller Core-Sets for Balls. Proceedings of the 14th Sympo-
sium on Discrete Algorithms (SODA’03), pp. 801–802, 2003.
[8] M. Badoiu, S. Har-Peled, and P. Indyk. Approximate Clustering via Coresets. Proceedings
of the 34th Annual ACM Symposium on Theory of Computing (STOC’02), pp. 250–257,
2002.
[9] A. Bagchi, A. Chaudhary, D. Eppstein and M. T. Goodrich. Deterministic Sampling and
Range Counting in Geometric Data Streams. Proceedings of the 20th Annual Symposium
on Computational Geometry (SoCG), pp. 144–151, 2004.
[10] Z. Bar-Yossef, T.S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting Distinct
Elements in a Data Stream. Proceedings of the 6th International Workshop on Randomization
and Approximation Techniques, pages 1–10, 2002.
[11] J. Basch. Kinetic Data Structures. Ph.D. thesis, Stanford University, 1999.
[12] J. Basch, L. J. Guibas, and J. Hershberger. Data Structures for Mobile Data. J. Algorithms,
31(1):1–28, 1999.
[13] J. Basch, L. J. Guibas, and G. Ramkumar. Sweeping Lines and Line Segments with a Heap.
Proceedings of the 13th Annual ACM Symposium on Computational Geometry (SoCG), pp.
469–471, 1997.
[14] P. Berkhin. Survey of Clustering Data Mining Techniques. Available at ..., 2002.
[15] S. Bespamyatnikh, B. Bhattacharya, D. Kirkpatrick, and M. Segal. Mobile Facility Loca-
tion. Proceedings of the 4th DIAL M, pp. 46–53, 2000.
[16] G. S. Brodal and R. Jacob. Dynamic Planar Convex Hull. Proceedings of the 43rd IEEE
Symposium on Foundations of Computer Science (FOCS), pp. 617–626, 2002.
[17] L. Buriol, G. Frahling, S. Leonardi, A. Marchetti-Spaccamela, and C. Sohler. Counting
Triangles in Data Streams. Proceedings of the 25th ACM SIGMOD-SIGACT-SIGART Symposium
on Principles of Database Systems, pages 253–262, 2006.
[18] J. Carter and M. Wegman. Universal Classes of Hash Functions. Journal of Computer and
System Sciences, 18(2), pp. 143–154, 1979.
[19] T. M. Chan. Faster Coreset Constructions and Data Stream Algorithms in Fixed Dimen-
sions. Proceedings of the 20th Annual Symposium on Computational Geometry (SoCG),
pp. 152–159, 2004.
[20] T. Chan, B. Sadjad. Geometric Optimization Problems Over Sliding Windows. Proceedings
of the 15th Annual International Symposium on Algorithms and Computation (ISAAC), pp.
246–258, 2004.
[21] M. Charikar, L. O’Callaghan, and R. Panigrahy. Better Streaming Algorithms for Cluster-
ing Problems. Proceedings of the 35th Annual ACM Symposium on Theory of Computing
(STOC), pp. 30–39, 2003.
[22] M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards Estimation Error
Guarantees for Distinct Values. Proceedings of the 19th ACM SIGMOD Symposium on
Principles of Database Systems (PODS), 2000.
[23] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental Clustering and Dynamic
Information Retrieval. Proceedings of the 29th Annual ACM Symposium on Theory of
Computing (STOC), 626–635, 1997.
[24] M. Charikar, K. Chen, and M. Farach-Colton. Finding Frequent Items in Data Streams.
Proceedings of the 29th Annual International Colloquium on Automata, Languages and
Programming (ICALP), pp. 693–703, 2002.
[25] B. Chazelle, R. Rubinfeld, and L. Trevisan. Approximating the Minimum Spanning Tree
Weight in Sublinear Time. Proceedings of the 28th Annual International Colloquium on
Automata, Languages and Programming (ICALP), pages 190–200, 2001.
[26] K. Chen. On k-Median Clustering in High Dimensions. Proceedings of the 17th Annual
ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1177–1185, 2006.
[27] D. Coppersmith and S. Winograd. Matrix Multiplication via Arithmetic Progressions.
Journal of Symbolic Computation, 9(3):251–280, 1990.
[28] G. Cormode, M. Datar, P. Indyk. Comparing Data Streams using Hamming norms. Pro-
ceedings of the International Conference on Very Large Databases (VLDB), pp. 335–345,
2002.
[29] G. Cormode and S. Muthukrishnan. Radial Histograms for Spatial Streams. DIMACS
Technical Report 2003-11, 2003.
[30] G. Cormode and S. Muthukrishnan. Improved Data Stream Summaries: The Count-Min
Sketch and its Applications. Proceedings of the 6th Latin American Theoretical Informatics
(LATIN), pp. 29–38, 2004.
[31] G. Cormode and S. Muthukrishnan and I. Rozenbaum. Summarizing and Mining Inverse
Distributions on Data Streams via Dynamic Sampling. DIMACS Technical Report 2005-11,
2005.
[32] A. Czumaj, F. Ergün, L. Fortnow, A. Magen, I. Newman, R. Rubinfeld, and C. Sohler.
Sublinear-Time Approximation of Euclidean Minimum Spanning Tree. SIAM Journal on
Computing, 35(1):91–109, 2005.
[33] A. Czumaj and C. Sohler. Estimating the Weight of Metric Minimum Spanning Trees in
Sublinear-Time. Proceedings of the 36th Annual ACM Symposium on Theory of Computing
(STOC), pp. 175–183, 2004.
[34] A. Czumaj and C. Sohler. Sublinear-Time Approximation for Clustering via Random
Sampling. Proceedings of the 31st Annual International Colloquium on Automata, Languages
and Programming (ICALP’04), pp. 396–407, 2004.
[35] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing
Iceberg Queries Efficiently. Proceedings of the 1998 Intl. Conf. on Very Large Data Bases,
pp. 299–310, 1998.
[36] J. Feigenbaum, S. Kannan, and J. Zhang. Computing Diameter in the Streaming and Sliding
Window Models. Technical Report YALEU/DCS/TR-1245, Yale University, 2002.
[37] W. Fernandez de la Vega, M. Karpinski, C. Kenyon. Approximation Schemes for Metric
Minimum Bisection and Partitioning. Proceedings of the 15th Annual ACM-SIAM Sympo-
sium on Discrete Algorithms (SODA), 2004.
[38] W. Fernandez de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation
Schemes for Clustering Problems. Proceedings of the 35th Annual ACM Symposium on
Theory of Computing (STOC), pp. 50–58, 2003.
[39] W. Fernandez de la Vega and C. Kenyon. A Randomized Approximation Scheme for Metric
MAX-CUT. J. Comput. Syst. Sci., 63(4):531–541, 2001.
[40] P. Flajolet and G. Martin. Probabilistic Counting Algorithms for Data Base Applications.
Journal of Computer and System Sciences, 31:182–209, 1985.
[41] E. Forgey. Cluster Analysis of Multivariate Data: Efficiency vs. Interpretability of Classifi-
cation. Biometrics, 21:768, 1965.
[42] G. Frahling, P. Indyk, and C. Sohler. Sampling in Dynamic Data Streams and Applica-
tions. Proceedings of the 21st Annual Symposium on Computational Geometry (SoCG),
pages 142–149, 2005. Invited to the special issue of SoCG 2005, to appear in International
Journal of Computational Geometry and Applications (IJCGA).
[43] G. Frahling and C. Sohler. Coresets in Dynamic Geometric Data Streams. Proceedings of
the 37th Annual ACM Symposium on Theory of Computing (STOC), pp. 209–217, 2005.
[44] G. Frahling and C. Sohler. A Fast k-Means Implementation using Coresets. Proceedings of
the 22nd Annual Symposium on Computational Geometry (SoCG), pages 135–143, 2006.
Invited to the special issue of SoCG 2006, to appear in International Journal of Computa-
tional Geometry and Applications (IJCGA).
[45] H.N. Gabow. Data Structures for Weighted Matching and Nearest Common Ancestors with
Linking. Proceedings of the 1st Annual ACM-SIAM Symposium on Discrete Algorithms
(SODA), 434–443, 1990.
[46] S. Ganguly, M. Garofalakis and R. Rastogi. Tracking Set-Expression Cardinalities over
Continuous Update Streams. The VLDB Journal, 13(4), pp. 354–369, 2004.
[47] J. Gao, L. J. Guibas, J. Hershberger, L. Zhang, and A. Zhu. Discrete Mobile Centers.
Discrete & Computational Geometry, 30(1):45–63, 2003.
[48] A. Gilbert, S. Guha, Y. Kotidis, P. Indyk, S. Muthukrishnan, and M. Strauss. Fast, Small Space
Algorithm for Approximate Histogram Maintenance. Proceedings of the 34th Annual ACM
Symposium on Theory of Computing (STOC), pp. 389–398, 2002.
[49] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing Wavelets on Streams:
One-Pass Summaries for Approximate Aggregate Queries. VLDB, 2001, pp. 79–88.
[50] M. X. Goemans and D. P. Williamson. Improved Approximation Algorithms for Maximum
Cut and Satisfiability Problems using Semidefinite Programming. Journal of the ACM, 42:1115–1145,
1995.
[51] O. Goldreich, S. Goldwasser, D. Ron. Property Testing and its Connection to Learning and
Approximation. Journal of the ACM, 45(4):653–750, 1998.
[52] J. Goodman and J. O’Rourke. Handbook of Discrete and Computational Geometry. CRC
Press, 1997.
[53] S. Guha, N. Koudas, and K. Shim. Data-Streams and Histograms. Proceedings of the An-
nual ACM Symposium on Theory of Computing (STOC), 2001, pp. 471–475.
[54] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering Data Streams. Proceed-
ings of the 41st IEEE Symposium on Foundations of Computer Science (FOCS), 359–366,
2000.
[55] L. J. Guibas. Kinetic Data Structures: A State of the Art Report. Proceedings of the 3rd
Workshop on the Algorithmic Foundations of Robotics (WAFR), pp. 191–209, 1998.
[56] L. J. Guibas. Modeling Motion. In Handbook of Discrete and Computational Geometry,
edited by J. E. Goodman and J. O’Rourke, 2nd edition, Chapter 50, pp. 1117–1134, 2004.
[57] P. Haas, J. Naughton, S. Seshadri, and L. Stokes. Sampling-based Estimation of the Number
of Distinct Values of an Attribute. Proceedings of the 21st International Conference on Very
Large Data Bases (VLDB), pp.311–322, 1995.
[58] T. Hagerup and C. Rüb. A Guided Tour of Chernoff Bounds. Information Processing
Letters, 33:305–308, 1989/90.
[59] S. Har-Peled. Clustering Motion. Discrete & Computational Geometry, 31:545–565, 2004.
[60] S. Har-Peled and A. Kushal. Smaller Coresets for k-Median and k-Means Clustering.
[61] S. Har-Peled and S. Mazumdar. Coresets for k-Means and k-Medians and Their Applica-
tions. Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC),
291–300, 2004.
[62] S. Har-Peled and B. Sadri. On Lloyd’s k-Means Method. Proceedings of the 16th Annual
ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.
[63] S. Har-Peled and K. Varadarajan. Projective Clustering in High Dimensions using Coresets.
Proceedings 18th Annual ACM Symposium on Computational Geometry (SoCG’02), pp.
312–318, 2002.
[64] F. Harary and H. J. Kommel. Matrix Measures for Transitivity and Balance. Journal of
Mathematical Sociology (6), 199–210.
[65] J. Hartigan. Clustering Algorithms. John Wiley & Sons, New York, 1975.
[66] D. Haussler and E. Welzl. ε-Nets and Simplex Range Queries. Discrete and Computational
Geometry, 2:127–151, 1987.
[67] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on Data Streams. 1998.
[68] J. Hershberger. Smooth Kinetic Maintenance of Clusters. Computational Geometry, Theory
and Applications, 31(1–2):3–30, 2005.
[69] J. Hershberger and S. Suri. Convex Hulls and Related Problems in Data Streams. Proceed-
ings of the ACM/DIMACS Workshop on Management and Processing of Data Streams,
2003.
[70] J. Hershberger and S. Suri. Adaptive Sampling for Geometric Problems over Data Streams.
Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 2004.
[71] M. Inaba, N. Katoh, and H. Imai. Applications of Weighted Voronoi Diagrams and Ran-
domization to Variance-Based k-Clustering. Proceedings of the 10th Annual ACM Sympo-
sium on Computational Geometry (SoCG), pp. 332–339, 1994.
[72] P. Indyk. High-Dimensional Computational Geometry. Ph.D. thesis, Stanford University,
2000.
[73] P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream
Computation. Proceedings of the 41st IEEE Symposium on Foundations of Computer Sci-
ence (FOCS), pp. 189–197, 2000.
[74] P. Indyk. Better Algorithms for High-Dimensional Proximity Problems via Asymmetric
Embeddings. Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algo-
rithms (SODA), pp. 539–545, 2003.
[75] P. Indyk. Algorithms for Dynamic Geometric Problems over Data Streams. Proceedings of
the 36th Annual ACM Symposium on Theory of Computing (STOC), pp. 373–380, 2004.
[76] P. Indyk and D. Woodruff. Tight Lower Bounds for the Distinct Elements Problem. Annual
Symposium on Foundations of Computer Science, pages 283–290, 2003.
[77] P. Indyk and D. Woodruff. Optimal Approximations of the Frequency Moments of Data
Streams. Proceedings of the 37th Annual ACM Symposium on Theory of Computing
(STOC), 2005.
[78] S. Itzkovitz, N. Kashtan, D. Chklovskii, R. Milo, S. Shen-Orr, and U. Alon. Network
Motifs: Simple Building Blocks of Complex Networks. Science (298), no. 509, 824–827.
[79] H. Jagadish, N. Koudas, and S. Muthukrishnan. Mining Deviants in a Time Series Database.
Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 102–
113, 1999.
[80] A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, New Jersey, 1988.
[81] H. Jowhari and M. Ghodsi. New Streaming Algorithms for Counting Triangles
in Graphs. Proceedings of the COCOON, 2005, pp. 710–716.
[82] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. An Efficient
k-Means Clustering Algorithm: Analysis and Implementation. IEEE Trans. Pattern Anal.
Mach. Intell. 24(7):881–892, 2002.
[83] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A Local
Search Approximation Algorithm for k-Means Clustering. Proceedings of the 18th Annual
Symposium on Computational Geometry (SoCG’02), pp. 10–18, 2002.
[84] H. Kaplan, R. E. Tarjan, and K. Tsioutsiouliklis. Faster Kinetic Heaps and Their Use in
Broadcast scheduling. SODA, pp. 834–844, 2001.
[85] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal Inapproximability Results
for Max-Cut and Other 2-Variable CSPs? Proceedings of the 45th IEEE Symposium on
Foundations of Computer Science (FOCS), pp. 146–154, 2004.
[86] D. Knuth. The Art of Computer Programming: Sorting and Searching, Vol. 3, Addison-
Wesley, 1973.
[87] S. G. Kolliopoulos and S. Rao. A Nearly Linear-Time Approximation Scheme for the
Euclidean k-Median Problem. Proceedings of the 7th Annual European Symposium on
Algorithms (ESA), pp. 378–389, 1999.
[88] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for Emerging
Cyber Communities. (1999), 403–416.
[89] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Random
Graph Models for the Web Graph. Proceedings of the IEEE Symposium on Foundations of
Computer Science (FOCS), 2000, pp. 57–65.
[90] A. Kumar, Y. Sabharwal, and S. Sen. A Simple Linear Time (1+ε)-Approximation Algo-
rithm for k-Means Clustering in any Dimensions. Proceedings of the 45th IEEE Symposium
on Foundations of Computer Science (FOCS), pp. 454–462, 2004.
[91] A. Kumar, Y. Sabharwal, and S. Sen. Linear Time Algorithms for Clustering Problems in
any Dimensions. Proceedings of the 32nd Annual International Colloquium on Automata,
Languages and Programming (ICALP), pp. 1374–1385, 2005.
[92] L. Laura, S. Leonardi, S. Millozzi, and J. F. Sibeyn. Algorithms and Experiments for the
Webgraph. Proceedings of the Annual European Symposium on Algorithms (ESA), 2002.
[93] S. Leonardi, S. Millozzi, L. S. Buriol, and D. Donato. Link and Temporal Analysis of Wikigraphs.
Technical Report (2005).
[94] Y. Linde, A. Buzo, and R. Gray. An Algorithm for Vector Quantizer Design. IEEE
Transactions on Communications, 28(1), pp. 84–94, 1980.
[95] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory,
28: 129–137, 1982.
[96] P. Lyman and H. Varian. How much information. University of California, Berkeley,
http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/, 2003.
[97] J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations.
Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability,
volume 1, pp. 281–296, 1967.
[98] G. S. Manku, R. Motwani. Approximate Frequency Counts over Data Streams. Proceedings
of the 2002 Intl. Conf. on Very Large Data Bases, pp. 346–357, 2002.
[99] J. Matoušek. On Approximate Geometric k-Clustering. Discrete & Computational
Geometry, 24(1):61–84, 2000.
[100] R. Mettu and G. Plaxton. Optimal Time Bounds for Approximate Clustering. Machine
Learning, 56(1-3):35–60, 2004.
[101] A. Meyerson. Online Facility Location. Proceedings of the IEEE Symposium on Founda-
tions of Computer Science (FOCS), pp. 426–431, 2001.
[102] N. Mishra, D. Oblinger, and L. Pitt. Sublinear Time Approximate Clustering. Proceedings
of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 439–447,
2001.
[103] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press,
1995.
[104] D. Mount. KMlocal: A Testbed for k-means Clustering Algorithms. Available at
http://www.cs.umd.edu/ mount/Projects/KMeans/km-local-doc.pdf
[105] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends
in Theoretical Computer Science, Volume 1, Issue 2, 2005.
[106] N. Nisan. Pseudorandom Generators for Space-Bounded Computation. Proceedings of
the 22nd Annual ACM Symposium on Theory of Computing (STOC), 204–212, 1990.
[107] A. Ostlin and R. Pagh. Uniform Hashing in Constant Time and Linear Space. Proceedings
of the 35th Annual ACM Symposium on Theory of Computing (STOC), ACM Press, 2003,
pp. 622–628.
[108] D. Pelleg and A. Moore. Accelerating Exact k-Means Algorithms with Geometric Reason-
ing. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pp. 277–281, 1999.
[109] D. Pelleg and A. Moore. x-Means: Extending k-Means with Efficient Estimation of the
Number of Clusters. Proceedings of the 17th International Conference on Machine Learn-
ing, 2000.
[110] S. Phillips. Acceleration of k-Means and Related Clustering Problems. Proceedings of
Algorithms Engineering and Experiments (ALENEX’02), 2002.
[111] R. Prim. Shortest Connection Networks and some Generalizations. Bell Systems Technical
Journal, 36:1389–1401, 1957.
[112] T. Schank and D. Wagner. Finding, Counting, and Listing all Triangles in Large Graphs,
an Experimental Study. Proceedings of the WEA, 2005, pp. 606–609.
[113] J. Schmidt, A. Siegel, and A. Srinivasan. Chernoff-Hoeffding Bounds for Applications
with Limited Independence. SIAM Journal on Discrete Mathematics, 8(2):223–250, 1995.
[114] R. Seidel and C. Aragon. Randomized Search Trees. Algorithmica 16, 464–497, 1996.
[115] S. Selim and M. Ismail. k-Means-Type Algorithms: A Generalized Convergence Theorem
and Characterizations of Local Optimality. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6:81–87, 1984.
[116] D. Shasha, J. Tsong-Li Wang, and R. Giugno. Algorithmics and Applications of Tree
and Graph Searching. Proceedings of the 21st ACM SIGMOD Symposium on Principles of
Database Systems (PODS), 2002, pp. 39–52.
[117] D. Sivakumar, Z. Bar-Yossef, R. Kumar. Reductions in Streaming Algorithms, with an
Application to Counting Triangles in Graphs. Proceedings of the 13th Annual ACM-SIAM
Symposium on Discrete Algorithms (SODA) (2002), pp. 623–632.
[118] S. Suri, C. D. Toth, and Y. Zhou. Range Counting over Multidimensional Data Streams.
Proceedings of the 20th Annual Symposium on Computational Geometry, pp. 160–169,
2004.
[119] V. Vapnik and A. Chervonenkis. On the Uniform Convergence of Relative Frequencies of
Events to their Probabilities. Theory Probab. Appl., 16:264–280, 1971.
[120] J. S. Vitter. Random Sampling with a Reservoir. ACM Trans. Math. Softw. (11) (1985),
no. 1, 37–57.
[121] S. Valverde and R. Solé. Network Motifs in Computational Graphs: A Case Study in
Software Architecture. Physical Review E (72), 2005.
[122] D. J. Watts and S. H. Strogatz. Collective Dynamics of Small-World Networks. Nature
(393), 440–442.
[123] X. Yan, P. S. Yu, and J. Han. Graph Indexing: A Frequent Structure-Based Approach.
Proceedings of the SIGMOD, 2004, pp. 335–346.
[124] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: A New Data Clustering Algorithm
and its Applications. Journal of Data Mining and Knowledge Discovery, 1(2), pp. 141–182,
1997.
[125] J. Zhao. An Implementation of Min-Wise Independent Permutation Family. (2005),
http://www.icsi.berkeley.edu/ zhao/minwise/.