scieee Science in your language
[en] (orig)
PRACTICAL ALGORITHMS
FOR CLUSTERING AND MODELING
LARGE DATA SETS
ANALYSIS AND IMPROVEMENTS
Dissertation
zur Erlangung des Grades
„Doktor der Naturwissenschaften”
(doctor rerum naturalium)
der Fakultät für Elektrotechnik, Informatik und Mathematik
der Universität Paderborn
vorgelegt von
DIPL.-INFORM.
DANIEL KUNTZE
Paderborn, den 29. August 2013
(überarbeitete Fassung vom 28. Dezember 2013)
angenommen auf Vorschlag von
PROF. DR. JOHANNES BLÖMER
UNIVERSITÄT PADERBORN
PROF. DR. CHRISTIAN SOHLER
TECHNISCHE UNIVERSITÄT DORTMUND
verteidigt am
7. November 2013
DEDICATED TO MY WIFE
EVA REBECCA
ACKNOWLEDGMENTS
This thesis would not have been written without a set of people sup-
porting me. Clustering them leads to four groups of people I am
deeply thankful.
First of all, I would like to thank my advisors. Prof. Dr. Johannes
Blömer was never too busy to invest time in discussions that inspired
new trains of thought and his guidance always advanced my work.
Prof. Dr. Christian Sohler not only made me eat sushi but also fed
my thoughts with lots of valuable ideas for my research. My col-
leagues Claudia Jahn, Kathrin Bujna, Dr. Stefanie Mense, Dr. Marcel
Ackermann, Jonas Gefele, Peter Günther, Dr. Volker Krummel, Gen-
nadij Liske, David Teusner, Paul Wolf, and of course the IT crowd
(Ulrich Ahlers, Tobias Schultz-Friese, Thomas Thissen, and Michael
Utermöhle) helped me out when I was stuck and provided welcome
excuses from work. I learned a lot during my time with you.
The research presented in this thesis was supported by the Priority
Programme ’Algorithm Engineering’ of the German Research Foun-
dation (DFG), grants BL 314/6-1to BL 314/6-3, for which I am also
really thankful.
I also want to thank my dear colleague Kathrin Bujna, who tamed
R for me and made it draw beautiful pictures.
Last but not least, I want to express my appreciation for the family
and friends cluster. Without Emily and Ronja, I might have finished
more quickly but with less fun. You were the cutest to come into my
office for chocolate! Furthermore, I am very grateful for my parents
and parents-in-law. Their support, love, and prayers were a great
encouragement. My friend Peter Richert never stopped to cheer me
on thanks for believing in me, mate.
Most of all, I want to thank my wife for accompanying me on my
way, for sharing all the ups and downs, and for giving me unfailing
support. Without you, I would never have succeeded.
v
ABSTRACT
In times of big data, large-scale database applications are of increas-
ing importance. A central task in this field is to reduce the amount
of data that has to be processed. One way to approach this task is
to capture the structure of the given data. This thesis deals with two
particular data structuring techniques.
The first part of this thesis is about cluster analysis. More precisely,
we consider the diameter k-clustering problem. The subject of this
problem is to partition a finite subset of Rdinto ksubsets called clus-
ters such that the largest diameter of the clusters is minimized. One
early clustering algorithm that computes a hierarchy of approximate
solutions to this problem (for all values of k) is the agglomerative
clustering algorithm with the complete linkage strategy. For decades,
this algorithm has been widely used by practitioners. However, it
is not well studied theoretically. We provide the first analysis of the
agglomerative complete linkage clustering algorithm. Assuming that
the dimension dis a constant, we show that for any kthe solution
computed by this algorithm is an O(logk)-approximation to the di-
ameter k-clustering problem. Our analysis does not only hold for
the Euclidean distance but for any metric that is based on a norm.
Furthermore, we analyze the closely related k-center and discrete k-
center problem. For the corresponding agglomerative algorithms, we
deduce an approximation factor of O(logk)as well.
The second part of this thesis deals with data modeling by training
statistical models. More precisely, we focus on the parameter esti-
mation problem for general and Gaussian mixture distributions. We
analyze a probabilistic variant of the Expectation-Maximization (EM)
algorithm, known as the Stochastic EM or SEM algorithm. Unlike the
original work, we focus on the analysis of a single run of the algo-
rithm. First, we discuss the SEM algorithm for general mixture dis-
tributions. Second, we consider Gaussian mixture models and show
that with high probability the updates of the EM algorithm and its
probabilistic variant are almost the same, given that the data set is suf-
ficiently large. A series of experiments confirms that this still holds
for a large number of successive update steps. Furthermore, we ex-
plain why the SEM algorithm is considerably faster than the classical
EM algorithm. In particular, we show that the probabilistic variant
establishes a constant factor speedup for Gaussian mixture models.
vii
ZUSAMMENFASSUNG
In Zeiten von Big Data sind große Datenbankanwendungen von wach-
sender Bedeutung. Eine zentrale Aufgabe aus diesem Bereich ist die
Reduzierung der zu verarbeitenden Datenmenge. Diese Aufgabe
lässt sich beispielsweise lösen, indem die Struktur der gegebenen
Daten erfasst wird. In dieser Dissertation beschäftigen wir uns mit
zwei Techniken zur Strukturierung von großen Datenmengen.
Im ersten Teil der Arbeit geht es um Clusteranalyse, speziell um das
Durchmesser-k-Clustering-Problem. Dabei wird eine endliche Teil-
menge des Rdin kCluster partitioniert, so dass der größte Cluster-
durchmesser minimiert wird. Ein früher Clusteringalgorithmus, der
eine Hierarchie von Näherungslösungen (für alle Werte von k) zu
diesem Problem berechnet, ist der agglomerative Clusteringalgorith-
mus mit der Complete-Linkage-Strategie. Dieser Algorithmus erfährt
in der Praxis seit Jahrzehnten eine breite Anwendung, ist aber theo-
retisch nicht gut untersucht. Wir liefern die erste Analyse des ag-
glomerativen Complete-Linkage Clusteringalgorithmus. Wir zeigen,
dass die vom Algorithmus berechnete Lösung für konstante Dimen-
sion dfür jedes keine O(logk)-Approximation des Durchmesser-k-
Clustering-Problems ist. Unsere Analyse gilt dabei nicht nur für
den Euklidischen Abstand, sondern für jede Metrik, die auf einer
Norm basiert. Außerdem analysieren wir das k-Center-Problem und
das diskrete k-Center-Problem, welche eng mit dem Durchmesser-k-
Clustering-Problem verwandt sind. Für die dazugehörigen agglome-
rativen Algorithmen leiten wir ebenfalls einen Approximationsfaktor
von O(logk)her.
Der zweite Teil der Arbeit beschäftigt sich mit der Modellierung von
Daten durch das Trainieren von statistischen Modellen. Hier betrach-
ten wir die Parameterschätzung von allgemeinen und Gaußschen
Mixturverteilungen. Wir analysieren eine probabilistische Variante
des Expectation-Maximization (EM) Algorithmus, die als Stochastic
EM oder SEM Algorithmus bekannt ist. Im Gegensatz zu den ur-
sprünglichen Arbeiten zum SEM Algorithmus betrachten wir in un-
serer Analyse einen einzelnen Durchlauf des Algorithmus. Dabei
beschreiben wir den Algorithmus zunächst für allgemeine Mixtur-
verteilungen. Danach betrachten wir Gaußsche Mixturmodelle und
zeigen für diese, dass die Updates des EM Algorithmus und seiner
probabilistischen Variante mit hoher Wahrscheinlichkeit fast identisch
sind, wenn die Datenmenge gr genug ist. Eine Reihe von Expe-
rimenten bestätigt, dass dies auch für eine große Anzahl aufeinan-
derfolgender Updateschritte noch gilt. Darüber hinaus erörtern wir,
ix
warum der SEM Algorithmus schneller arbeitet als der klassische EM
Algorithmus. Insbesondere zeigen wir für Gaußsche Mixturmodelle,
dass die probabilistische Variante eine Beschleunigung um einen kon-
stanten Faktor erreicht.
x
CONTENTS
1 introduction 1
1.1Cluster analysis 2
1.2Parameter estimation 5
1.3Outline of the thesis 9
i on cluster analysis 11
2 clustering basics 13
2.1The notion of similarity 14
2.2Clustering quality 14
2.3Problem Definitions 16
3 analysis of agglomerative clustering 17
3.1Preliminaries 18
3.2Discrete k-center clustering 20
3.3k-center clustering 24
3.4Diameter k-clustering 34
3.5The one-dimensional case 41
3.6Lower bounds 45
ii on parameter estimation 55
4 statistical models 57
4.1The parameter estimation problem 58
4.2General mixture models 60
4.3Gaussian mixture models 61
4.4Handy notions from probability theory 67
5 the classical em algorithm 71
5.1Convergence of the EM algorithm 73
5.2The EM algorithm for mixture distributions 77
5.3The EM algorithm for Gaussian mixtures 78
6 the stochastic em algorithm 81
6.1The generic SEM algorithm 82
6.2The SEM algorithm for mixture distributions 83
6.3The SEM algorithm for Gaussian mixtures 86
7 experimental analysis 99
7.1Implementation 99
7.2Data sets 100
7.3Experiments 102
7.4Results 105
xi
1
INTRODUCTION
Recently, as part of the Human Brain Project [Walker, 2012], researchers
from Canada and Germany released BigBrain, a high-resolution 3D
model of the human brain [Amunts et al., 2013]. The aim of this
model is to provide “considerable neuroanatomical insight into the
human brain, thereby allowing the extraction of microscopic data for
modeling and simulation”. The long-term objective of the Human
Brain Project is the simulation of the human brain on supercomputers,
thus realizing a recent dream of mankind. The chances of success for
a venture like this were already discussed in the 1950s by great minds
as Alan Turing1and John von Neumann2.
Although science is not able to simulate the human brain yet, a lot
of effort is put into the imitation of its capabilities. One of these ca-
pabilities is to handle large amounts of data. The human retina alone
transmits approximately one megabyte of data to the brain every sec-
ond [Koch et al., 2006]. Fortunately, we do not have to consciously
process each bit of information that we see. Instead, our brain’s ab-
straction capacity reduces the amount of information. This enables
us to recognize a white area covered with black speckles as a page
of printed text. Moreover, we are able to classify the speckles as Ro-
man letters and recognize groups of letters as English words. Then,
by processing one word after another while ignoring the remaining
words on the page, we are able to read whole sentences.
For the sake of simplicity let’s wrongly assume that the information
of a sentence is simply the sum of information in the single words.
Furthermore, we assume that the amount of information in a single
word is simply the number of bits needed to identify one word of the
English language. If we estimate the size of the English vocabulary
by 500 000 words, then we need 19 bits per word. An average reader
is able to read about 4words per second which corresponds to 76 bits
per second. This rough approximation shows that the brain reduces
the information from the perception in the eye to the recognition of
the words by several orders of magnitude.
1In 1950, Turing wrote his well-known article “Computing Machinery and Intelli-
gence” [Turing, 1950].
2Shortly before his death in 1957, von Neumann began to write the unfinished and
posthumously published book “The Computer and the Brain” [von Neumann, 1958].
1
introduction
Furthermore, the brain is also able to store large amounts of data.
The number of neurons in the human brain is estimated to about
86 billion cells, 16 billion of which are located in the cerebral cortex
[Azevedo et al., 2009]. We may assume that each neuron in the cere-
bral cortex is connected to other neurons by 50 000 synapses on aver-
age [Brewer et al., 2009]. This yields a total number of approximately
800 trillion synapses. It is not known how the brain stores informa-
tion and which role the interaction between neurons and synapses
plays. However, recent work suggests that a synapse switches be-
tween discrete states [Montgomery and Madison, 2004]. If we assume
that each synapse can hold one bit of information, we get a total stor-
age capacity of approximately 100 terabytes in the cerebral cortex.
The combination of the capability to store large amounts of data
and the ability to process it resembles the requirements of large-scale
database applications. With continuously growing databases, it be-
comes increasingly important to reduce the amount of data that has
to be processed. That is, in order to work with very large data sets,
we need to be able to extract the essential information. This is quite
similar to the brain’s ability to extract text from the electrochemical
impulses received from the optic nerve. One step into this direction
is to try and capture the structure of a given data set. The approaches
to this task are as diverse as the underlying applications. This thesis
deals with two particular data structuring techniques introduced in
the following two sections.
1.1cluster analysis
One way of investigating the structure of a given data set is to divide
it into subsets of similar data elements. This approach is called cluster
analysis or simply clustering. There are many applications for clus-
tering, including data compression [Pereira et al., 1993], structuring
results of search engines [Broder et al., 1997], analysis of gene expres-
sion data [Eisen et al., 1998], anomaly detection [Lee et al., 2008], and
community detection [Meyerhenke and Staudt, 2013]. For every ap-
plication, a proper objective function is used to measure the quality
of a clustering. One particular objective function that is based on a
distance measure between data elements is the largest diameter of
the clusters. If the desired number of clusters kis given, we call the
problem of minimizing this objective function the diameter k-clustering
problem. In this thesis, we focus on the diameter k-clustering problem
and two closely related clustering problems.
One of the earliest and most widely used clustering algorithms
is agglomerative clustering. The history of agglomerative clustering
goes back to the 1950s at least [Florek et al., 1951, McQuitty, 1957].
Later, biological taxonomy became one of the driving forces of clus-
ter analysis. In [Sneath and Sokal, 1973] the authors, who where
2
1.1 cluster analysis
the first biologists using computers to classify organisms, discuss sev-
eral agglomerative clustering methods. Agglomerative clustering is a
bottom-up clustering process. At the beginning, every data element
forms its own cluster. In each subsequent step, the two ’closest’ clus-
ters will be merged until only one cluster remains.
In order to define the agglomerative algorithm properly, we have
to specify what it means for two clusters to be ’close’. In case of
the diameter clustering problem mentioned above, two clusters are
defined to be close when the distance between their farthest pair of
data elements is small. Then, the two ’closest’ clusters are the two
clusters that minimize the diameter of their union. Using this def-
inition of closeness is called the complete linkage strategy. Alternate
strategies are to minimize the distance between the closest pair of
data elements or to minimize the average distance between data ele-
ments from the two clusters. These are called single linkage strategy3
and average linkage strategy, respectively.
Agglomerative clustering algorithms create a hierarchy of clusters,
such that for any two different clusters Aand Bfrom different levels
of the hierarchy we either have AB=,AB, or BA. Such a
hierarchy is useful in many applications, for example, when one is
interested in hereditary properties of the clusters (as in some bioin-
formatics applications) or if the exact number of clusters is a priori
unknown.
1.1.1related work
In this thesis, we study the agglomerative clustering algorithm us-
ing the complete linkage strategy. Since the complete linkage strat-
egy minimizes the diameter in each step, this algorithm is a natural
choice to find hierarchical diameter k-clusterings. The running time
is obviously polynomial in the description length of the input data.
Therefore, our only goal in this thesis is to give an approximation
guarantee for the diameter k-clustering problem. The approximation
guarantee is given by a factor αsuch that the largest diameter of
the k-clustering computed by the algorithm is at most αtimes the
largest diameter of an optimal k-clustering. Although the agglom-
erative complete linkage clustering algorithm is widely used, there
are only few theoretical results considering the quality of the clus-
tering computed by this algorithm. In [Dasgupta and Long, 2005]
the authors provide a certain metric distance function such that this
algorithm computes a k-clustering with an approximation factor of
(logk).
The diameter k-clustering problem is closely related to the k-center
problem. In this problem, we search for kcenters and the objective is
3Using the single linkage strategy is equivalent to computing a minimum spanning
tree of the graph induced by the distance function using Kruskal’s algorithm.
3
introduction
to minimize the maximum distance of any data element to the nearest
center. When the centers are restricted to come from the input data
set, the problem is called the discrete k-center problem. It can easily be
verified that for metric distance functions the costs of optimal solu-
tions to all three problems are within a factor of 2from each other.
For the Euclidean case, we know that for fixed k, i.e., when we are
not interested in a hierarchy of clusterings, the diameter k-clustering
problem and the k-center problem are NP-hard. In fact, it is already
NP-hard to approximate both problems with an approximation factor
below 1.96 and 1.82, respectively [Feder and Greene, 1988]. These
non-approximability results are transferable to the discrete k-center
problem for approximation factors below 1.73 [Leder, 2013].
Furthermore, there exist provably good approximation algorithms
in this case. For the k-center problem, a simple 2-approximation al-
gorithm is known for metric spaces [Gonzalez, 1985]. Since this al-
gorithm chooses the cluster centers from the input data set, it is also
a2-approximation algorithm for the discrete k-center problem. The
analysis of this algorithm is immediately transferrable to the diame-
ter k-clustering problem and yields a 2-approximation algorithm for
this problem, too. For the k-center problem, a variety of results is
known. For example, for the Euclidean metric in [B¯
adoiu et al., 2002]
a(1+ϵ)-approximation algorithm with running time 2O(klogk
/ϵ2)dn
is shown. This implies a (2+ϵ)-approximation algorithm with the
same running time for the diameter k-clustering problem. Moreover,
for metric spaces a hierarchical clustering strategy is known with an
approximation guarantee of 8for the discrete k-center problem [Das-
gupta and Long, 2005]. This implies an algorithm with an approxi-
mation guarantee of 16 for the diameter k-clustering problem.
Part I of this thesis as well as all of the above mentioned work is
about static clustering, i. e., in the problem definition we are given
the whole input data set at once. An alternative model of the input
data is to consider sequences of data elements that are given one after
another. In [Charikar et al., 1997], the authors discuss clustering in
a so-called incremental clustering model. They give an algorithm with
constant approximation factor that maintains a hierarchical clustering
while new elements are added to the input data set. Furthermore,
they show a lower bound of (log k)for the agglomerative complete
linkage algorithm and the diameter k-clustering problem. However,
since their model differs from ours, their results have no bearing on
our results.
1.1.2our contribution
One of the open problems discussed in [Dasgupta and Long, 2005]
is to derive a non-trivial upper bound for the approximation guar-
antee of the classical agglomerative complete linkage clustering al-
4
1.2 parameter estimation
gorithm. In this thesis, we present such an analysis for the agglom-
erative complete linkage and related clustering algorithms. Our re-
sults hold for metrics on Rdthat are based on a norm as for exam-
ple the Euclidean metric. We prove that the agglomerative solution
to the diameter k-clustering problem is an O(logk)-approximation.
Here, the O-notation hides a constant that is doubly exponential in
d. This approximation guarantee holds for every level of the hierar-
chy computed by the algorithm. That is, we compare each computed
k-clustering with an optimal solution for that particular value of k.
These optimal k-clusterings do not necessarily form a hierarchy. In
fact, there are simple examples where optimal solutions have no hier-
archical structure.
Our analysis also yields that if we allow 2k instead of kclusters
and compare the cost of the computed 2k-clustering to an optimal
solution with kclusters, the approximation factor is independent of
kand depends only on d. Moreover, the techniques of our analysis
can be applied to prove stronger results for the k-center problem and
the discrete k-center problem. For the k-center problem, we derive
an approximation guarantee that is logarithmic in kand only singly
exponential in d. For the discrete k-center problem, we derive an
approximation guarantee that is logarithmic in kand the dependence
on dis only linear and additive.
Furthermore, we give almost matching upper and lower bounds
for the one-dimensional case. These bounds are independent of k.
For d2and the metric based on the L-norm, we provide a lower
bound that exceeds the upper bound for d=1. For d3, we give
a lower bound for the Euclidean case which is larger than the lower
bound for d=1. Finally, we construct data sets which provide lower
bounds for any metric based on an Lp-norm with 1p. How-
ever, the construction of these data sets needs the dimension to de-
pend on k.
Our analysis, presented in Chapter 3, was previously published in
[Ackermann et al., 2011, Ackermann et al., 2012].
1.2parameter estimation
Apart from the clustering approach outlined in Section 1.1, another
way of revealing the structure of large data sets is to train the parame-
ters of statistical models. This approach is discussed in Part II of this
thesis. To describe a given data set by the parameters of a statistical
model is a central task in the field of data mining and machine learn-
ing. Many applications in this field deal with real-world data that is
suitable to be modeled by stochastic processes. That is, the data is as-
sumed to be generated by a stochastic process which is governed by
some model parameters. Then, the goal is to optimize the parameters
such that the stochastic process is a likely source of the data. Presum-
5
introduction
ably, one of the first studies of this problem was done by Karl Pearson
for a data set consisting of measurements of the frontal breadth of the
carapace of 1 000 female shore crabs [Pearson, 1894]. Pearson showed
that the data was most likely a mixture of two Gaussian distributions.
This suggests that the crabs belonged to two different species. To
compute the parameters of the mixture, Pearson used the method of
moments to derive a ninth-degree polynomial by hand which gave
rise to a set of candidate parameters.
Today, we are able to work with much larger data sets and we do
not have to do any calculations by hand. Instead, we are equipped
with powerful algorithms for the parameter optimization. One way
to optimize the parameters is to maximize the probability of the given
data set over the choice of the parameters. As a function of the model
parameters, this probability is called the likelihood function. The
problem of maximizing the likelihood is called the parameter estima-
tion problem. The Expectation-Maximization (EM) algorithm intro-
duced in [Dempster et al., 1977] is a general scheme for finding maxi-
mum likelihood solutions for this parameter estimation problem. It is
used when the observed data Xcan be seen as incomplete and when
the parameter estimation problem given the corresponding complete
data (X,Z)is easy to solve. Starting with an initial set of model pa-
rameters, the algorithm repeatedly performs two steps to compute a
new set of parameters. The first step derives an optimal distribution
for the hidden values. The second step computes the new set of pa-
rameters by maximizing the expectation (over the hidden values) of
the complete-data likelihood.
Applications of the EM algorithm to special cases were already
known before the method was formulated in its general form in
[Dempster et al., 1977]. For example, the well known Baum-Welch
algorithm (see [Baum et al., 1970]) for the computation of parameters
for hidden Markov models is an instantiation of a generalized variant
of the EM algorithm. For exponential families, the method was stud-
ied in [Sundberg, 1974]. More examples can be found in [Dempster
et al., 1977].
1.2.1related work
During the decades since its first presentation, a lot of work has been
done to improve the EM algorithm. Most of these improvements deal
with one or more of three major drawbacks of the algorithm. First,
the convergence of the EM algorithm can be very slow. For a discus-
sion of several speed-up techniques see Chapter 4in [McLachlan and
Krishnan, 2008]. Second, the EM algorithm is prone to converge only
to a local maximum, or even worse, to a saddle point of the likelihood
function [Wu, 1983, Dias and Wedel, 2004]. One way to deal with this
problem is to compute ’good’ initial estimates [Biernacki et al., 2003]
6
1.2 parameter estimation
or to restart the algorithm with several different initializations in the
hope that one iteration leads to a satisfying solution. Finally, in some
cases the EM algorithm is not applicable at all since the maximiza-
tion step is computationally intractable. To overcome this problem,
the maximization step has to be substantially simplified.
There exist a number of stochastic variants of the EM algorithm
that try to deal with these drawbacks, including the Stochastic EM
or SEM algorithm [Celeux and Diebolt, 1985], the Stochastic Approx-
imation EM or SAEM algorithm [Celeux and Diebolt, 1992], and the
Monte Carlo EM or MCEM algorithm [Wei and Tanner, 1990]. The
MCEM algorithm replaces the distribution over the hidden values
from the first step of the EM algorithm by an empirical Monte Carlo
approximation. Thus, for continuous random variables Z, it replaces
analytic by numerical computation. Furthermore, due to the inherent
randomness, the algorithm is capable of escaping from a saddle point
or from a local maximum of the likelihood function.
The SEM algorithm is even simpler than the MCEM algorithm. In-
stead of maximizing the expectation as done in the second step of the
EM algorithm, it uses the distribution from the first step to simply
guess the missing values. That is, the algorithm samples an assign-
ment for the hidden values Z. Afterwards, it maximizes the complete-
data likelihood only for that fixed assignment. This considerably re-
duces the computational complexity and may lead to the applicability
of the EM scheme in the first place. The SEM algorithm can be seen
as a special case of the MCEM algorithm where only one sample is
drawn to approximate the distribution over the hidden values. It is
known that for particular classes of statistical models, the sequence of
models generated by the SEM iterations is an ergodic Markov chain
converging weakly to a stationary distribution over models. Further-
more, it is known that under appropriate assumptions and for the
number of input points going to infinity, the mean of this stationary
distribution converges to the maximum likelihood estimate [Ip, 1994].
A suitable application of the SEM algorithm is the parameter es-
timation problem for mixture distributions. In this model, the data
set consists of independent draws from a mixture distribution and the
hidden values are usually assumed to be the assignments of each data
element to the component distribution it was generated by. That is,
for each data element, the SEM algorithm guesses which component
distribution it was generated by. Obviously, if the number of data ele-
ments is small, then each wrong assignment has a large impact. And
indeed, a limitation of the SEM algorithm is that it does not work
for very small data sets. Therefore, the authors of the original SEM
work introduced the SAEM algorithm later. The SAEM algorithm
is a hybrid algorithm that mixes the computations of the SEM algo-
rithm with the computations of the classical EM algorithm. At the
beginning it works like a pure SEM algorithm and then it gradually
7
introduction
transforms until it finally does pure EM computations. For a compar-
ison of all three mentioned stochastic EM variants see [Celeux et al.,
1995].
1.2.2our contribution
As already mentioned in the previous section, the original line of
work on the SEM algorithm focused on stochastic properties of the
sequence of models generated by the SEM algorithm. However, to
approximate the above mentioned mean that converges to the max-
imum likelihood estimate, we would need to average over a large
number of runs of the SEM algorithm. In this thesis, we present
a new analysis of the SEM algorithm for Gaussian mixture models.
Unlike the previous work, our analysis studies only a single run of
the algorithm. Furthermore, in addition to our theoretical results, we
also provide experimental results for various artificial and real world
data sets.
Since practitioners are often satisfied with the solutions computed
by the classical EM algorithm, we analyze the SEM algorithm in re-
lation to the EM algorithm. More precisely, we examine the parame-
ter updates computed in an arbitrary iteration of the SEM algorithm
and compare them to the updates computed by the EM algorithm
for the same previous parameter estimate. We show that with high
probability the updates of the EM and SEM algorithm are almost the
same, given that the input data set is sufficiently large. Our results
are stated as three separate theorems. The first theorem gives a high
probability bound on the difference of the weight updates. This re-
sult does not only hold for Gaussian mixture models, but also for
arbitrary mixture models. The second theorem gives a high probabil-
ity bound on the distance between the means computed by the EM
and SEM iteration. Finally, the third theorem does the same for the
covariance updates.
Since the computed parameters generally differ from each other af-
ter a single round of the two algorithms, our analysis does not work
for a sequence of several EM and SEM iterations. However, our exper-
iments confirm that the similarity of the parameters mostly persists
even after a large number of successive update steps. Moreover, we
explain why the SEM algorithm is considerably faster than the classi-
cal EM algorithm. In particular, we show that the probabilistic variant
establishes a constant factor speedup for Gaussian mixture models.
Our theoretical and experimental analysis, presented in Chapter 6
and Chapter 7, was also published in [Blömer et al., 2013].
8
1.3 outline of the thesis
1.3outline of the thesis
This thesis is organized into two parts. In Part I we turn towards clus-
ter analysis. After giving a short introduction to clustering in Chap-
ter 2, we present the first analysis of a classical clustering algorithm
in Chapter 3. Part II is about statistical models and the parameter es-
timation problem. The basic concepts needed later are introduced
in Chapter 4. In Chapter 5, we discuss the classical Expectation-
Maximization algorithm. In Chapter 6, we analyze a probabilistic
variant of the EM algorithm. Finally, in Chapter 7, we present an
experimental comparison of both algorithms.
On some pages you will find cross-references and additional re-
marks written in the margins that may be useful to understand the
main text. Furthermore, some bibliographic references are only men-
tioned in these marginal notes.
9
Part I
ON CLUSTER ANALYSIS
The first part of this thesis is about cluster analysis. We
focus on so-called distance-based hard clustering. Using
this approach, the distances between the elements of a
given data set are used to compute a partition of the data
into subsets of elements that are close to each other.
The main result of Part I is the first analysis of the qual-
ity of agglomerative complete linkage clustering for the
diameter k-clustering problem.
2
CLUSTERING BASICS
Simply put, clustering or cluster analysis is the process of organizing a
given set of items into groups (called clusters) of similar items. The
term is used for the entire number of techniques to solve this task as
well as for the task itself. Furthermore, the term clustering is also used
for a resulting set of clusters. If a clustering consists of kclusters, it
is called a k-clustering.
Ideally, all elements of a particular cluster are similar while ele-
ments from different clusters are dissimilar. It is evident that the no-
tion of similarity or dissimilarity is a crucial ingredient for the formu-
lation of clustering problems. As a consequence, the different ways
to model a cluster are as numerous as the number of clustering appli-
cations. There is a broad range of (possibly overlapping) approaches,
including graph-based clustering [Schaeffer, 2007, Fortunato, 2010],
density-based clustering [Kriegel et al., 2011a], distribution-based clus-
tering (cf. Chapter 4), and distance-based clustering (see Section 2.1).
Another key differentiator between common clustering applications
is the certainty of membership in a cluster. The two main categories
are hard and soft clustering. In case of soft clustering, an element of the
input data set may be assigned to multiple clusters at the same time
and in addition this assignment can be weighted. For example, using
a distribution-based clustering approach, we may want to model the
input data by a mixture distribution (cf. Chapter 4.2). Then, we can
identify the mixture components with the clusters and assign the ele-
ments of the input data set to each cluster with a weight proportional
to the probability of the element in the corresponding mixture com-
ponent. In case of hard clustering however, each element is assigned
to exactly one cluster. That is, a hard clustering is actually a partition
of the given data set.
An important special case of hard clustering is the so-called hierar-
chical clustering. We call a collection of k-clusterings of the same finite
data set X, but for different values of k, hierarchical, if it fulfills the
following two properties. First, for any k, the collection contains at
most one k-clustering. Second, for any two of its clusterings Ci,Cj
with |Ci|=i < j =|Cj|, every cluster in Ciis the union of one or more
clusters from Cj. A hierarchical collection of clusterings is called a
hierarchical clustering.
13
clustering basics
In the following, we introduce the concepts of clustering that we
need to apply our analysis of agglomerative clustering in Chapter 3.
In particular, we consider distance-based hard clustering only. For a
thorough introduction to clustering see for example [Tan et al., 2005].
2.1the notion of similarity
The distance-based clustering approach establishes a notion of dis-
similarity based on the distances between the elements of the input
data. Therefore, we start our formal introduction to cluster analysis
with the definition of distance measures. We use the term distance mea-
sure to differentiate from the distance function of a metric since we
neither require symmetry nor the triangle inequality.
Definition 2.1.Let Xbe an arbitrary domain. A function D:X×XR
is called a distance measure, if the following conditions are satisfied for all
x,yX:
1.D(x,y)0(non-negativity)
2.D(x,y) = 0x=y(identity of indiscernibles)
Cluster analysis is established for a wide variety of distance mea-
sures. Although it is very common to embed the input data into a
metric space like the d-dimensional Euclidean vector space Rd, some
widely used distance measures do not form a metric. For example,
the squared Euclidean distance and the Mahalanobis distance both are no
metrics since they do not fulfill the triangle inequality. Other distance
measures as for example the Kullback-Leibler divergence or the Itakura-
Saito divergence (both defined for X=Rd
0) are not even symmetric.
For a discussion of clustering algorithms for non-metric distance mea-
sures see [Ackermann, 2009]. For the remainder of this chapter, we
consider metric distance measures only.
Definition 2.2.Let Xbe an arbitrary domain. A function D:X×XR
is called a metric, if Dis a distance measure and in addition, the following
conditions are satisfied for all x,y,zX:
3.D(x,y) = D(y,x)(symmetry)
4.D(x,y)D(x,z) + D(z,y)(triangle inequality)
2.2clustering quality
To distinguish good clusterings from bad clusterings we need to in-
troduce a measure of quality. It is obvious that the choice of the
right measure highly depends on the particular clustering application.
Generally, the desired properties of the clustering are transformed to
a cost function that maps clusterings to real numbers. Then, the goal
14
2.2 clustering quality
b
b b
b
(a) The diameter.
b
b b
b
(b) The radius.
b
b b
b
(c) The discrete radius.
Figure 1: Different measurements for the same cluster.
of the clustering process is to minimize the cost of the resulting clus-
tering.
Before introducing three popular hard clustering cost functions, we
give a formal definition of a k-clustering.
Definition 2.3.Let Xbe an arbitrary domain, kNand XXa finite
set with |X|k. Then, a k-clustering of Xis a partition Ck={C1,. . . ,Ck}
of Xinto knon-empty subsets. The subsets C1,. . . ,Ckare called clusters.
The first cost function we consider is the diameter cost of a cluster-
ing which is simply the largest diameter among its clusters.
Definition 2.4.Let Xbe an arbitrary domain with a distance measure D.
Furthermore, let XXbe a finite input data set and C={C1,. . . ,Ck}a
k-clustering of X.
Then, the diameter cost of Cis defined as
costdiam(C):= max
CCdiam(C),
where diam(C) := maxx,yCD(x,y)denotes the diameter of C.cf. Figure 1a
The diameter cost only depends on the distances between elements
of the input data set. For the definition of the radius cost, we addi-
15
clustering basics
tionally introduce so-called cluster centers. These centers come from
the same domain that the input data set is embedded into.
Definition 2.5.Let Xbe an arbitrary domain with a distance measure D.
Furthermore, let XXbe a finite input data set and C={C1,. . . ,Ck}a
k-clustering of X.
Then, the radius cost of Cis defined as
costrad(C):= max
CCrad(C),
where rad(C) := minyXmaxxCD(x,y)denotes the radius of C.cf. Figure 1b
The discrete radius cost is a special case of the radius cost where
the centers have to be elements of the input data set.
Definition 2.6.Let Xbe an arbitrary domain with a distance measure D.
Furthermore, let XXbe a finite input data set and C={C1,. . . ,Ck}a
k-clustering of X.
Then, the discrete radius cost of Cis defined as
costdrad(C):= max
CCdrad(C),
where drad(C) := minyCmaxxCD(x,y)denotes the discrete radius ofcf. Figure 1c
C.
That is, if we choose X=Xthe radius cost coincides with the
discrete radius cost.
2.3problem definitions
Based on the three cost functions introduced in Section 2.2we de-
fine the corresponding clustering problems. All three problems are
discussed in Chapter 3.
Problem 2.7(diameter k-clustering).Given kNand a finite set X
Rdwith |X|k, find a k-clustering Ckof Xwith minimal diameter cost.
Problem 2.8(k-center).Given kNand a finite set XRdwith |X|k,
find a k-clustering Ckof Xwith minimal radius cost.
Problem 2.9(discrete k-center).Given kNand a finite set XRd
with |X|k, find a k-clustering Ckof Xwith minimal discrete radius cost.
16
3
ANALYSIS OF AGGLOMERATIVE CLUSTERING
In this chapter we analyze the agglomerative clustering algorithms
for the (discrete) k-center problem and the diameter k-clustering prob-
lem. As mentioned in the introduction, an agglomerative algorithm
takes a bottom-up approach. It starts with the |X|-clustering that con-
tains one cluster for each element of a given data set and then suc-
cessively merges two of the remaining clusters such that the cost of
the resulting clustering is minimized. That is, in each merge step
the agglomerative algorithms for Problem 2.7, Problem 2.8and Prob-
lem 2.9minimize the diameter, the radius and the discrete radius of
the resulting cluster, respectively.
Our main objective is the agglomerative complete linkage cluster-
ing algorithm, which minimizes the diameter in every step. Never-
theless, we start with the analysis of the agglomerative algorithm for
the discrete k-center problem since it is the simplest one of the three.
Then we adapt our analysis to the k-center problem and finally to
the diameter k-clustering problem. In each case we need to introduce
further techniques to deal with the increased complexity of the given
problem.
We show that all three algorithms compute an O(logk)-approxima-
tion for the particular corresponding clustering problem. However,
the dependency on the dimension which is hidden in the O-notation
ranges from only linear and additive in case of the discrete k-center
problem to a factor that is doubly exponential in case of the diameter
k-clustering problem. As mentioned in the introduction, the cost of
optimal solutions to the three problems are within a factor of 2from
each other. That is, each algorithm computes an O(logk)-approxi-
mation for all three problems. However, we will analyze the proper
agglomerative algorithm for each problem.
Strictly speaking, the clusterings computed by the agglomerative
algorithms stated as Algorithm 3.1,3.2and 3.3are not uniquely de-
termined, since the minimization step may be ambiguous. However,
all our results hold for any particular tie-breaking strategy (even non-
deterministic ones). To keep the discussion simple, we ignore such
subtleties unless they are crucial.
17
analysis of agglomerative clustering
3.1preliminaries
Throughout this chapter, we confine ourselves to metric distance mea-
sures that are defined on Rdfor some dN. That is, we consider
input data sets that are finite subsets of Rd. Our results hold for arbi-
trary metrics that are based on a norm, i. e., we consider only distance
measures of the form
D(x,y) = xy
for an arbitrary norm ∥·∥. Readers who are not familiar with arbi-The figures in this
chapter assume the
Euclidean metric. trary metrics or are only interested in the Euclidean case, may assume
that the Euclidean norm ∥·∥2is used, i. e.,
xy=v
u
u
t
d
i=1
(xiyi)2.
Furthermore, for rRand yRd, we denote the closed d-dimen-
sional ball of radius rcentered at yby
Bd
r(y) := {x|xyr}.
For our analysis of agglomerative clustering, we repeatedly use the
volume argument stated in Lemma 3.2. This argument provides an
upper bound on the minimum distance between two points from a
finite set of points lying inside the union of finitely many balls. For
the application of this argument, the following definition is crucial.
Definition 3.1.Let kNand rR. Then, a set XRdis called
(k,r)-coverable, if there exist y1,...,ykRdwith
X
k
i=1
Bd
r(yi).
Lemma 3.2.Let kN,rRand PRdbe finite and (k,r)-coverable
with |P|> k. Then, there exist distinct p,qPsuch that
pq4r d
k
|P|.
Proof. Let ZRdwith |Z|=kand PzZBd
r(z). We define δto be
the minimum distance between two points of P, i. e.,
δ:= min
p,qP
p=qpq.
That is, we have to show that δ4r d
k
|P|. We assume for contra-
diction that u:= 4r d
k
|P|< δ. Since |P|> k there exists zZwith
18
3.1 preliminaries
z
p
r
r+u
/2
u
/2
Figure 2: The volume argument.
Bd
r(z)P2. It follows that δ2r and hence, u
2< r. Note that for
any yRd,τR, and any norm ·,
Vol(Bd
τ(y))=τd·Vd,
where Vdis the volume of the d-dimensional unit ball Bd
1(0)(see
Corollary 6.2.15 in [Webster, 1994]). Therefore,
Vol(
zZ
Bd
r+u
/2(z))<
zZ
Vol(Bd
2r(z))=k(2r)d·Vd.
Furthermore, since any pPis contained in a ball Bd
r(z)for some
zZ, we conclude that any ball Bd
u
/2(p)for pPis contained in a
ball Bd
r+u
/2(z)for some zZ(cf. Figure 2). Thus,
Vol
pP
Bd
u
/2(p)
Vol(
zZ
Bd
r+u
/2(z))< k(2r)d·Vd. (1)
Since u < δ, for any distinct p,qP, we conclude that
Bd
u
/2(p)Bd
u
/2(q) = .
Therefore, using the definition of u, the total volume of the |P|balls
Bd
u
/2(p)is given by
Vol
pP
Bd
u
/2(p)
=|P|(u
2)dVd=k(2r)d·Vd.
This however contradicts Inequality (1) and we obtain δu, which
proves the lemma.
Furthermore, we need the following covering lemma that is a corol-
lary of Theorem 1.2in [Naszódi, 2010].
Lemma 3.3.Let λ,τRwith λ,τ > 0. Then, for arbitrary metrics that are
based on a norm, a ball of radius τcan be covered by (3
λ)dballs of radius
λτ.
19
analysis of agglomerative clustering
Algorithm 3.1:AgglomerativeDiscreteRadius(X)
Input: finite set of input points XRd
C|X| {{x} | xX}
for i=|X|1,. . . ,1do
find distinct clusters A,BCi+1minimizing drad(AB)
Ci (Ci+1\ {A,B}){AB}
end
return C|X|,...,C1
3.2discrete k-center clustering
The agglomerative algorithm for the discrete k-center problem is sta-
ted as Algorithm 3.1. Given a finite set of input points XRd,
the algorithm computes hierarchical k-clusterings for all values of
kbetween 1and |X|. We denote them by C|X|,...,C1. Throughout
this section, cost always means discrete radius cost and optkrefers to
the cost of an optimal discrete k-center clustering of XRdwhere
kNwith k|X|. That is, optkdenotes the cost of an optimal
solution to Problem 2.9. Since any cluster Cis contained in a ball of
radius drad(C), we have that Xis (k,costdrad(Ck))-coverable for any
k-clustering Ckof X. It follows that Xis (k,optk)-coverable. This
fact, as well as the following observation about the greedy strategy of
Algorithm 3.1, will be used frequently in our analysis.
Observation 3.4.The cost of all computed clusterings is equal to the dis-
crete radius of the cluster created last. Furthermore, the discrete radius of
the union of any two clusters is always an upper bound for the cost of the
clustering to be computed next.
The following theorem states our result for the discrete k-center
problem.
Theorem 3.5.Let XRdbe a finite set of points. Then, for arbitrary
metrics that are based on a norm and for all kNwith k|X|, the
partition Ckof Xinto kclusters as computed by Algorithm 3.1satisfies
costdrad(Ck)<(20d +2log2(k) + 2)·optk,
where optkdenotes the cost of an optimal solution to Problem 2.9.
We prove Theorem 3.5in two steps. First, Proposition 3.6in Sec-
tion 3.2.1provides an upper bound to the cost of the intermediate
2k-clustering. This upper bound is independent of kand |X|, only lin-
ear in dand may be of independent interest. In its proof, we use the
volume argument from Lemma 3.2to bound the distance between the
centers of pairs of remaining clusters.
Second, in Section 3.2.2, we analyze the remaining kmerge steps of
Algorithm 3.1down to the computation of the k-clustering. There, we
20
3.2 discrete k-center clustering
no longer need to apply Lemma 3.2to bound the distance between
two cluster centers. Instead, the volume argument is replaced by a
very simple bound that is already sufficient to obtain our result.
3.2.1analysis of the 2k-clustering
Proposition 3.6.Let XRdbe finite. Then, for all kNwith 2k |X|,
the partition C2k of Xinto 2k clusters as computed by Algorithm 3.1satisfies
costdrad(C2k)< 20d ·optk,
where optkdenotes the cost of an optimal solution to Problem 2.9.
To prove Proposition 3.6, we divide the merge steps of Algorithm 3.1
into phases, each reducing the number of remaining clusters by one
fourth. The following lemma bounds the increase of the cost during
a single phase by an additive term.
Lemma 3.7.Let mNwith 2k < m |X|. Then,
costdrad(C3m
4)costdrad(Cm)+4d
2k
m·optk.
Proof. Let t:= 3m
4. Then, CmCt+1is the set of clusters from Cm
that still exist m
41 < m
4merge steps after the computation of Cm.
In each iteration of its loop, the algorithm can merge at most two
clusters from Cm. Thus,
|CmCt+1|> m 2m
4=m
2.
Let τ:= costdrad(Cm). From every cluster CCm, we fix a center
pCCwith CBd
τ(pC)and define P:= {pCCCmCt+1}. Then,
|P|=|CmCt+1|>m
2> k.
Since Xis (k,optk)-coverable, so is PX. Then, by Lemma 3.2, there
exist distinct A,BCmCt+1such that
pApB4d
2k
m·optk.
Therefore, the distance from pAto any qBis at most 4d
2k
m·optk+τ.
We conclude that merging Aand Bwould result in a cluster whose
discrete radius can be upper bounded by
cf. Figure 3
drad(AB)costdrad(Cm)+4d
2k
m·optk.
The result follows from A,BCt+1and Observation 3.4.
To prove Proposition 3.6, we apply Lemma 3.7for log4
3
|X|
2k con-
secutive phases.
21
analysis of agglomerative clustering
pA
pB
costdrad(Cm)
costdrad(Cm)
Figure 3: drad(AB)<costdrad(Cm)+pApB.
Proof of Proposition 3.6.Let u:= log4
3
|X|
2k and for i=0,. . . ,u, define
mi:= (3
4)i|X|.
Then, mu2k and mi> 2k for i=0,...,u1. Since
3mi
4=3
4(3
4)i|X|⌉⌋(3
4)i+1|X|+3
4(3
4)i+1|X|=mi+1
and Algorithm 3.1uses a greedy strategy,
costdrad(Cmi+1)costdrad(C3mi
4)
for i=0,...,u1. Combining this inequality with Lemma 3.7(ap-
plied to m=mi), we obtain
costdrad(Cmi+1)costdrad(Cmi)+4d
2k
mi·optk.
By repeatedly applying this inequality for i=0,...,u1and using
costdrad(C2k)costdrad(Cmu)and costdrad(Cm0)=0, we deduce
costdrad(C2k)
u1
i=0(4d
2k
mi·optk)
4d
2k
|X|·
u1
i=0
d
(4
3)i
·optk.
Solving the geometric series and using u1 < log4
3
|X|
2k leads to
costdrad(C2k)4d
2k
|X|·
d
(4
3)u1
d
4
31·optk<
4d
4
3
d
4
31·optk. (2)
By taking only the first two terms of the series expansion of the ex-
ponential function, we get d
4
3=eln 4
3
d> 1 +ln 4
3
d. Substituting this
bound into Inequality (2) yields
costdrad(C2k)<
4d
4
3
ln 4
3
d·optk< 20d ·optk.
22
3.2 discrete k-center clustering
3.2.2analysis of the remaining merge steps
The analysis of the remaining merge steps introduces the O(logk)
term to the approximation factor of our result. It is similar to the
analysis used in the proof of Proposition 3.6. Again, we divide the
merge steps into phases. This time however, one phase consists of
one half of the remaining merge steps. Furthermore, we replace the
volume argument from Lemma 3.2with a simpler bound. This bound
results from the observation that, as long as there are more than k
clusters left, we are able to find a pair of clusters whose centers lie
in the same cluster of an optimal k-clustering. That is, the distance
between these centers is at most two times the discrete radius of the
common cluster. The following lemma bounds the increase of the
cost during a single phase.
Lemma 3.8.Let mNwith k < m |X|. Then,
costdrad(Cm+k
2)costdrad(Cm)+2optk.
Proof. Let t:= m+k
2. Then, CmCt+1is the set of clusters from Cm
that still exist mk
21 < mk
2merge steps after the computation of
Cm. In each iteration of its loop, the algorithm can merge at most two
clusters from Cm. Thus,
|CmCt+1|> m 2mk
2=k.
Let τ:= costdrad(Cm). From every cluster CCm, we fix a center
pCCwith CBd
τ(pC)and define P:= {pCCCmCt+1}. Then,
|P|=|CmCt+1|> k.
Since Xis (k,optk)-coverable, so is PX. It follows that there exist
distinct A,BCmCt+1such that pAand pBare contained in the
same ball of radius optk, i.e.,
pApB2optk.
Then, the distance from pAto any qBis at most 2optk+τ. We con-
clude that merging Aand Bwould result in a cluster whose discrete
radius can be upper bounded by
cf. Figure 3
drad(AB)costdrad(Cm)+2optk.
The result follows using A,BCt+1and Observation 3.4.
To prove Theorem 3.5, we apply Lemma 3.8for about logkconsec-
utive phases.
23
analysis of agglomerative clustering
Algorithm 3.2:AgglomerativeRadius(X)
Input: finite set of input points XRd
C|X| {{x} | xX}
for i=|X|1,. . . ,1do
find distinct clusters A,BCi+1minimizing rad(AB)
Ci (Ci+1\ {A,B}){AB}
end
return C|X|,...,C1
Proof of Theorem 3.5.Let u:= log2(k)+1. Then,
log2k < u log2(k) + 1.
Furthermore, for i=0,...,uwe define
mi:= k+(1
2)ik.
Then, mu=kand mi> k for i=0,...,u1. Since
mi+k
2=1
2(2k +(1
2)ik⌋)⌋=k+1
2(1
2)ik⌋⌋
k+(1
2)i+1k=mi+1
and Algorithm 3.1uses a greedy strategy,
costdrad(Cmi+1)costdrad(Cmi+k
2)
for i=0,...,u1. Combining this inequality with Lemma 3.8(ap-
plied to m=mi), we obtain
costdiam(Cmi+1)costdrad(Cmi)+2optk.
By repeatedly applying this inequality for i=0,...,u1and using
m0=2k and mu=k, we deduce
costdrad(Ck)costdrad(C2k)+
u1
i=0
2optk
costdrad(C2k)+ (2log2(k) + 2)·optk.
Hence, the result follows using Proposition 3.6.
3.3k-center clustering
The agglomerative algorithm for the k-center problem is stated as Al-
gorithm 3.2. The only difference to Algorithm 3.1is the minimization
of the radius instead of the discrete radius in the minimization step.
In the following, cost always means radius cost and optkrefers to the
24
3.3k-center clustering
cost of an optimal k-center clustering of XRdwhere kNwith
k|X|. That is, optkdenotes the cost of an optimal solution to Prob-
lem 2.8. Analogously to the discrete case, any cluster Cis contained
in a ball of radius drad(C)and thus, the set Xis (k,optk)-coverable.
Observation 3.9.The cost of all computed clusterings is equal to the ra- This is analogous to
Observation 3.4.
dius of the cluster created last. Furthermore, the radius of the union of any
two clusters is always an upper bound for the cost of the clustering to be
computed next.
The following theorem states our result for the k-center problem.
Theorem 3.10.Let XRdbe a finite set of points. Then, for arbitrary
metrics that are based on a norm and for all kNwith k|X|, the partition
Ckof Xinto kclusters as computed by Algorithm 3.2satisfies
costrad(Ck)=O(logk)·optk,
where optkdenotes the cost of an optimal solution to Problem 2.8, and the
constant hidden in the O-notation is singly exponential in the dimension d.
As in the proof of Theorem 3.5, we first bound the cost of the inter-
mediate 2k-clustering. However, we have to apply a different analysis.
As a consequence, the dependency on the dimension increases from
linear and additive to a singly exponential factor.
3.3.1analysis of the 2k-clustering
Proposition 3.11.Let XRdbe finite. Then, for all kNwith 2k |X|,
the partition C2k of Xinto 2k clusters as computed by Algorithm 3.2satisfies
costrad(C2k)< 24d ·e24d ·optk,
where optkdenotes the cost of an optimal solution to Problem 2.8.
Just as in the analysis of Algorithm 3.1, we divide the merge steps
of Algorithm 3.2into phases, such that in each phase the number of
remaining clusters is reduced by one fourth. Like in the discrete case,
the input points are (k,optk)-coverable. However, centers correspond-
ing to an intermediate solution computed by Algorithm 3.2need not
be covered by the kballs induced by an optimal solution. As a conse-
quence, we are no longer able to apply Lemma 3.2on the centers as
in the discrete case.
To bound the increase of the cost during a single phase, we cover
the remaining clusters at the beginning of a phase by a set of over-
lapping balls. Each of the clusters is completely contained in one of
these balls that all have the same radius. Furthermore, the number of
remaining clusters will be at least twice the number of these balls. It
follows that there are many pairs of clusters that are contained in the
25
analysis of agglomerative clustering
yi
zC
optk
optk+costrad(Cm)
costrad(Cm)
Figure 4: Intermediate centers.
same ball. Then, as long as the existence of at least one such pair can
be guaranteed, the radius of the cluster created next can be bounded
by the radius of the covering balls. The following lemma will be used
to bound the increase of the cost during a single phase.
Lemma 3.12.Let mNwith 2k < m |X|. Then,
costrad(C3m
4)<(1+6d
2k
m)costrad(Cm)+6d
2k
moptk.
Proof. Let P={P1,...,Pk}be an optimal k-center clustering of X. For
each i{1,...,k}we fix a point yiRdsuch that
PiBoptk(yi).
Furthermore, for any CCmlet zCRdsuch that
CBcostrad(Cm)(zC).
With τ:= optk+costrad(Cm)it follows that each zCis contained in at
least one of the balls Bτ(yi)for i{1,...,k}as illustrated in Figure 4.
Applying Lemma 3.3with λ=3
d
m
2k yields that each of the balls
Bτ(yi)for i=1,...,kcan be covered by
:= m
2km
2k
balls of radius ε:=
d
m
2k . Therefore, there exist
kℓ m
2
centers v1,. . . ,vkℓ Rdsuch that each zCfor CCmis contained in
at least one of the balls Bε(v1),...,Bε(vkℓ). For i=1,...,kℓ, we define
Vi:= Bcostrad(Cm)+ε(vi).
Then, any cluster CCmis contained in at least one of the balls
V1,. . . ,Vkℓ (cf. Figure 5).
26
3.3k-center clustering
viε
costrad(Cm)+ε
zC1
costrad(Cm)
zC2
τ
Figure 5: Covering centers and clusters.
As in the proof of Lemma 3.7we define t:= 3m
4and deduce |Cm
Ct+1|>m
2. Since kℓ m
2, there exist two clusters A,BCmCt+1
that are contained in the same ball Viwith i{1,...,kℓ}. Therefore,
merging clusters Aand Bwould result in a cluster whose radius can
be upper bounded by
cf. Figure 5
rad(AB)costrad(Cm)+ε.
Using Observation 3.9and the fact that Aand Bare part of the clus-
tering Ct+1, we can upper bound the cost of Ctby
costrad(Ct)costrad(Cm)+ε.
It remains to show
ε < 6 d
2k
m(optk+costrad(Cm)).
However, since m
2k > 1, it follows m
2k < 2 m
2k . Thus,
d
m
2k <d
2m
2k2d
m
2k
and we conclude
3
d
m
2k
< 6 d
2k
m.
To prove Proposition 3.11, we apply Lemma 3.12 for log4
3
|X|
2k con-
secutive phases.
Proof of Proposition 3.11.Analogously to the proof of Proposition 3.6,
we define u:= log4
3
|X|
2k and mi:= (3
4)i|X|for i=0,...,u. Then,
mu2k and for i=0,...,u1,mi> 2k and 3mi
4mi+1. Using
Lemma 3.12, for i=0,...,u1we deduce
costrad(Cmi+1)<(1+6d
2k
mi)costrad(Cmi)+6d
2k
mi
optk.
27
analysis of agglomerative clustering
By repeatedly applying this inequality for i=0,...,u1and using
costrad(C2k)costrad(Cmu)and costrad(Cm0)=0, we get
costrad(C2k)<
u1
i=0
6d
2k
mi
u1
j=i+1(1+6d
2k
mj)
optk.
Using mi(3
4)i|X|, we obtain
costrad(C2k)
6optk
d
2k
|X|·
u1
i=0
(4
3)i
du1
j=i+1(1+6d
2k
|X|(4
3)j
d)
.
Substituting u1ifor iyields
costrad(C2k)
=6optk
d
2k
|X|(4
3)u1
d
·
u1
i=0
(3
4)i
du1
j=ui(1+6d
2k
|X|(4
3)u1
d(3
4)u1j
d)
.
Using u1 < log4
3
|X|
2k , we deduce
costrad(C2k)< 6 optk·
u1
i=0
(3
4)i
di1
j=0(1+6(3
4)j
d)
. (3)
By taking only the first two terms of the series expansion of the expo-
nential function, we get 1+6(3
4)j
d< e6(3
4)j
dand therefore,
i1
j=0(1+6(3
4)j
d)<
i1
j=0(e6(3
4)j
d)=e6i1
j=0(3
4)j
d. (4)
The sum in the exponent can be bounded by the infinite geometric
series
j=0(3
4)j
d
<1
1(3
4)1
d
4d. (5)
Here, the last inequality follows by upper bounding the convex func-
tion f(x) = (3
4)xin the interval [0,1]by the line through f(0) = 1
and f(1) = 3
4, i.e., (3
4)x1x
4. Putting Inequalities (3), (4) and (5)
together then gives
costrad(C2k)< 6 optk·
u1
i=0((3
4)i
d
e24d)< 24d ·e24d ·optk,
where the last inequality follows by using Inequality (5) again.
28
3.3k-center clustering
3.3.2connected subsets
The analysis of the remaining merge steps from the discrete k-center
case (cf. Section 3.2.2) is not transferable to the k-center case. Again,
as in the proof of Proposition 3.11, we are no longer able to derive
a simple additive bound on the increase of the cost when merging
two clusters. In order to preserve the logarithmic dependency of the
approximation factor on k, we show that it is sufficient to analyze
Algorithm 3.2on a subset YXsatisfying a certain connectivity
property. Using this property, we are able to apply a combinatorial
approach that relies on the number of merge steps left.
We start by defining the connectivity property that will be used to
relate clusters to an optimal k-clustering.
Definition 3.13.Let ZRdand rR. Two sets A,BRdare called
(Z,r)-connected if there exists a point zZwith
Bd
r(z)A=and Bd
r(z)B=.
Note that for any two (Z,r)-connected clusters A,B, we have
rad(AB)rad(A) + 2rad(B) + 2r. (6)
To see this, let yRdwith ABd
rad(A)(y),xaBd
r(z)Aand xb
Bd
r(z)B. Then, the distance between yand xais at most radA, the
distance between xaand xbis at most 2r and the distance between xb
and any point xBis at most 2rad(B).
Next, we show that for any data set Xwe can bound the cost of the
k-clustering computed by Algorithm 3.2by the cost of the -clustering
computed on a connected subset YXfor a proper k.
Lemma 3.14.Let XRdbe finite and kNwith k|X|. Then, there
exists a subset YX, a number Nwith kand |Y|, and a set
ZRdwith |Z|=such that:
1.Yis (,optk)-coverable;
2.costrad(Ck)costrad(P);
3. For all nNwith < n |Y|, every cluster in Pnis (Z,optk)-
connected to another cluster in Pn.
Here, the collection P1,. . . ,P|Y|denotes the hierarchical clustering computed
by Algorithm 3.2on input Y.
Proof. To define Y,Z, and we consider the (k+1)-clustering com-
puted by Algorithm 3.2on input X. We know that X=ACk+1A
is (k,optk)-coverable. Let ECk+1be a minimal subset such that
AEAis (|E|1,optk)-coverable, i. e., for all sets FCk+1with
|F|<|E|the union AFAis not (|F|1,optk)-coverable. Since a set F
of size 1cannot be (|F|1,optk)-coverable, we deduce |E|2.
29
analysis of agglomerative clustering
Let Y:= AEAand := |E|1. Then, kand Yis (,optk)-
coverable. This establishes the first property of the lemma. It follows
that there exists a set ZRdwith |Z|=and YzZBd
optk(z).
Since Yis the union of the clusters from ECk+1, each merge
step between the computation of C|X|and Ck+1merges either two
clusters A,BYor two clusters A,BX\Y. In the following, we
consider the restriction of the clusterings Ck+1,...,C|X|to the input
data set YX. We need to deal with the resulting clusterings as if
they were computed by Algorithm 3.2on input Y. This is the part
of our analysis where we have to be more precise on the tie-breaking
strategy as briefly mentioned at the beginning of this chapter. If on
input Y, the minimization step of Algorithm 3.2is not always unam-
biguous, the computed clusterings may differ from the restriction of
Ck+1,...,C|X|to Y. Thus, for the analysis of any fixed run on input X
we simply assume that on input YAlgorithm 3.2performs the same
sequence of merge steps that ignoring the merge steps inside X\Y
lead to Ck+1,...,C|X|in the fixed run on input X.
We let P1,...,P|Y|be the hierarchical clustering computed by Algo-
rithm 3.2on input Y. Then, we have P+1=E=Ck+12Y. Thus,2Ydenotes the
power set of Y.P+1Ck+1. To compute P, on input Y, Algorithm 3.2merges two
clusters from P+1that minimize the radius of the resulting cluster.
Analogously, on input X, Algorithm 3.2merges two clusters from
Ck+1to compute Ck. Since P+1Ck+1, Observation 3.9implies
costrad(Ck)costrad(P), thus proving the second property of the
lemma.
It remains to show that for all nNwith < n |Y|it holds
that every cluster in Pnis (Z,optk)-connected to another cluster in
Pn(the third property of the lemma). By the definition of Z, every
cluster in Pnintersects at least one ball Bd
optk(z)for zZ. Therefore,
it is enough to show that each ball Bd
optk(z)intersects at least two
clusters from Pn. We first show this property for n=+1. For =1,
this follows from the fact that Bd
optk(z)with Z={z}has to contain both
clusters from P2. For > 1, we would otherwise be able to remove
one cluster from P+1and get clusters whose union is (1,optk)-
coverable. However, this contradicts the definition of E=P+1as a
minimal subset with this property.
To show the third property of the lemma for general n, let A
Pnand zZwith Bd
optk(z)A=. There exists a unique cluster
˜
AP+1with A˜
A. Then, we have Bd
optk(z)˜
A=. However, we
have just shown that Bd
optk(z)has to intersect at least two clusters from
P+1. Thus, there exists another cluster ˜
BP+1with Bd
optk(z)˜
B=.
Since every cluster from P+1is a union of clusters from Pn, there
exists at least one cluster BPnwith B˜
Band Bd
optk(z)B=.
30
3.3k-center clustering
optk
costrad(Pm)
B
costrad(Pn)
A1
A2
Figure 6: Merging (Z,optk)-connected clusters.
3.3.3analysis of the remaining merge steps
Let Y,Z,, and P1,. . . ,P|Y|be as given by Lemma 3.14. Then, Proposi-
tion 3.11 can be used to obtain an upper bound for the cost of P2ℓ. In
the following, we analyze the merge steps leading from P2ℓ to P+1
and show how to obtain an upper bound for the cost of P+1. As
in Section 3.3.1, we analyze the merge steps in phases. The following
lemma is used to bound the increase of the cost during a single phase.
Note that optkstill refers to the cost of an optimal solution on input
X, not Y.
Lemma 3.15.Let m,nNwith n2ℓ and ℓ<mn|Y|. If there are
no two (Z,optk)-connected clusters in PmPn,
costrad(Pm+
2)costrad(Pm)+2costrad(Pn)+2optk.
Proof. Analogously to the proof of Lemma 3.8we define t:= m+
2
and deduce that there are strictly less than m
2merge steps between
the computations of Pmand Pt+1. We show that there exist at least
mdisjoint pairs of clusters from Pmsuch that the radius of their
union can be upper bounded by costrad(Pm)+2costrad(Pn)+2optk.
Then, at least one of these pairs still has to exist in Pt+1since in each
iteration of its loop the algorithm can destroy at most two of the pairs.
The lemma follows using Observation 3.9and from the fact that there
is only one merge step left.
To bound the number of the above mentioned cluster pairs, we start
with a structural observation. PmPnis the set of clusters from Pn
that still exist in Pm. By our definition of Y,Z, and , we conclude that
any cluster APmPnis (Z,optk)-connected to another cluster B
Pm. If we assume that there are no two (Z,optk)-connected clusters in
PmPn, this implies BPm\ Pn(cf. Figure 6). Thus, using APn,
BPm, and Inequality (6), the radius of ABcan be bounded by
rad(AB)costrad(Pm)+2costrad(Pn)+2optk. (7)
Moreover, using a similar argument, we derive the same bound for
two clusters A1,A2PmPnthat are (Z,optk)-connected to the same
cluster BPm\ Pn. That is,
rad(A1A2)costrad(Pm)+2costrad(Pn)+2optk. (8)
31
analysis of agglomerative clustering
Next, we show that there exist at least |PmPn|
2disjoint pairs of
clusters from Pmsuch that the radius of their union can be bounded
either by Inequality (7) or by Inequality (8). To do so, we first consider
the pairs of clusters from PmPnthat are (Z,optk)-connected to the
same cluster from Pm\ Pnuntil no candidates are left. For these
pairs, we can bound the radius of their union by Inequality (8). Then,
each cluster from Pm\ Pnis (Z,optk)-connected to at most one of
the remaining clusters from PmPn. Thus, each remaining cluster
APmPncan be paired with a different cluster BPm\ Pnsuch
that Aand Bare (Z,optk)-connected. For these pairs, we can bound
the radius of their union by Inequality (7). Since for all pairs either
one or both of the clusters come from the set PmPn, we can lower
bound the number of pairs by |PmPn|
2.
To complete the proof, we show
m|PmPn|
2.
In each iteration of its loop, the algorithm can merge at most two
clusters from Pn. Therefore, there are at least n|PmPn|
2merge
steps between the computations of Pnand Pm. Hence,
mnn|PmPn|
2n
2+|PmPn|
2.
Using n2ℓ, we deduce
m|PmPn|
2.
Lemma 3.16.Let nNwith n2ℓ and < n |Y|. Then,
costrad(P+1)< 2 (log2() + 2)(costrad(Pn)+optk).
Proof. For n=+1there is nothing to show. Hence, we assume
n>ℓ+1. By the definition of Z, there exist two (Z,optk)-connected
clusters in Pn. Let ˜nNwith ˜n < n be maximal such that no two
(Z,optk)-connected clusters exist in P˜nPn. The number ˜nis well-
defined since |P1|=1implies ˜n1. It follows that the same holds for
all mNwith m˜n. We conclude that Lemma 3.15 is applicable
for all mNwith ℓ<m˜n.
By the definition of ˜nthere still exist at least two (Z,optk)-connect-
ed clusters in P˜n+1Pn. Then, Observation 3.9implies
costrad(P˜n)2costrad(Pn)+optk. (9)
If ˜n+1, Inequality (9) proves the lemma. For ˜n > +1, define
u=log2(˜n)and
mi=(1
2)i
(˜n) + >
32
3.3k-center clustering
for i=0,...,u. Then, m0=˜nand mu=+1. Furthermore,
mi+
2=1
2(1
2)i(˜n) + +
21
2((1
2)i(˜n) + +1)+
2
=(1
2)i+1(˜n) + +1
2(1
2)i+1(˜n) + =mi+1.
Since Algorithm 3.2uses a greedy strategy,
costrad(Pmi+1)costrad(Pmi+
2)
for i=0,. . . ,u1. Combining this inequality with Lemma 3.15 (ap-
plied to m=mi), we obtain
costrad(Pmi+1)costrad(Pmi)+2costrad(Pn)+2optk.
By repeatedly applying this inequality for i=0,...,u1and sum-
ming up the costs, we get
costrad(Pmu)<costrad(P˜n)+2u ·(costrad(Pn)+optk)
(9)
< 2(u+1)·(costrad(Pn)+optk).
Since ˜n < 2ℓ, we get u < log2() + 1and the lemma follows using
mu=+1.
The following lemma finishes the analysis except for the last merge
step.
Lemma 3.17.Let YRdbe finite and |Y|such that Yis (,optk)-cov-
erable. Furthermore, let ZRdwith |Z|=such that for all nNwith
+1n|Y|, every cluster in Pnis (Z,optk)-connected to another cluster
in Pn, where P1,...,P|Y|denotes the hierarchical clustering computed by
Algorithm 3.2on input Y. Then,
costrad(P+1)< 2(log2() + 2)·(24d ·e24d +1)·optk.
Proof. Let n:= min(|Y|,2ℓ). Then, using Proposition 3.11,
costrad(Pn)< 24d ·e24d ·optk.
The result follows by using this bound together with Lemma 3.16.
3.3.4proof of theorem 3.10
By Lemma 3.14, there exists a subset YX, a number k, and a
hierarchical clustering P1,. . . ,P|Y|of Ywith
costrad(Ck)costrad(P).
Furthermore, there exists a set ZRdsuch that every cluster from
P+1is (Z,optk)-connected to another cluster in P+1. Thus, P+1
contains two clusters A,Bthat intersect the same ball of radius optk.
Hence,
costrad(Ck)rad(AB)2costrad(P+1)+optk.
The theorem follows using Lemma 3.17 and k.
33
analysis of agglomerative clustering
Algorithm 3.3:AgglomerativeCompleteLinkage(X)
Input: finite set of input points XRd
C|X| {{x} | xX}
for i=|X|1,. . . ,1do
find distinct clusters A,BCi+1minimizing diam(AB)
Ci (Ci+1\ {A,B}){AB}
end
return C|X|,...,C1
3.4diameter k-clustering
In this section, we analyze the agglomerative complete linkage clus-
tering algorithm for Problem 2.7stated as Algorithm 3.3. Again, the
only difference to Algorithm 3.1and Algorithm 3.2is the minimiza-
tion of the diameter in the minimization step.
Note that in this section cost always means diameter cost and optk
refers to the cost of an optimal diameter k-clustering of XRdwhere
kNwith k|X|. That is, optkdenotes the cost of an optimal
solution to Problem 2.7. Analogously to the (discrete) radius case,
any cluster Cis contained in a ball of radius diam(C)and thus, the
set Xis (k,optk)-coverable.
Observation 3.18.The cost of all computed clusterings is equal to the di-This is analogous to
Observation 3.4and
Observation 3.9.ameter of the cluster created last. Furthermore, the diameter of the union of
any two clusters is always an upper bound for the cost of the clustering to
be computed next.
The following theorem states our main result.
Theorem 3.19.Let XRdbe a finite set of points. Then, for arbitrary
metrics that are based on a norm and for all kNwith k|X|, the partition
Ckof Xinto kclusters as computed by Algorithm 3.3satisfies
costdiam(Ck)=O(logk)·optk,
where optkdenotes the cost of an optimal solution to Problem 2.7, and the
constant hidden in the O-notation is doubly exponential in the dimension d.
As in the proof of Theorem 3.5and Theorem 3.10, we first bound
the cost of the intermediate 2k-clustering. However, we have to apply
a different analysis again. This time, the new analysis results in a
bound that depends doubly exponential on the dimension.
3.4.1analysis of the 2k-clustering
Proposition 3.20.Let XRdbe finite. Then, for all kNwith 2k |X|,
the partition C2k of Xinto 2k clusters as computed by Algorithm 3.3satisfies
costdiam(C2k)< 2 (28d +6)·optk,
34
3.4 diameter k-clustering
where σ= (42d)dand optkdenotes the cost of an optimal solution to Prob-
lem 2.7.
In our analysis of the k-center problem, we made use of the fact
that merging two clusters lying inside a ball of some radius rresults
in a new cluster of radius at most r. This is no longer true for the
diameter k-clustering problem. We are not able to derive a bound for
the diameter of the new cluster that is significantly less than 2r. The
additional factor of 2makes our analysis from Section 3.3.1useless
for the diameter case since it would result in an approximation factor
that is linear in the size of Xrather than only depending on kand d.
To prove Proposition 3.20, we split the merge steps of Algorithm 3.3
into two stages. The first stage consists of the merge steps down to
a(22O(dlogd)k)-clustering. The analysis of the first stage is based on
the following notion of similarity. Two clusters are called similar if
one cluster can be translated such that every point of the translated
cluster is near a point of the second cluster. Then, by merging sim-
ilar clusters, the diameter essentially increases by the length of the
translation vector. During the first stage, we guarantee that there is a
sufficiently large number of similar clusters left. The cost of the inter-
mediate (22O(dlogd)k)-clustering can be upper bounded by O(d)·optk.
The second stage consists of the steps reducing the number of re-
maining clusters from 22O(dlogd)kto only 2k. Since the number of
remaining merge steps as well as the upper bound of the cost are in-
dependent of X, we are able to show an approximation factor that
only depends on kand d. In this stage, we are no longer able to
guarantee that a sufficiently large number of similar clusters exists.
Therefore, we analyze the merge steps of the second stage using a
weaker argument, very similar to the one used in the second step of
the analysis in the discrete k-center case (cf. Section 3.2.2). As long
as there are more than 2k clusters left, we are able to find sufficiently
many pairs of clusters that intersect the same cluster of an optimal
k-clustering. Therefore, we can bound the cost of merging such a
pair by the sum of the diameters of the two clusters plus the diame-
ter of the optimal cluster. We show that the cost of the intermediate
2k-clustering is upper bounded by 22O(dlog d)·optk. Let us remark that
we do not obtain our main result if we already use this argument for
the first stage.
Both stages are again subdivided into phases, such that in each
phase the number of remaining clusters is reduced by one fourth.
stage one
The following lemma will be used to bound the increase of the cost
during a single phase of the first stage.
35
analysis of agglomerative clustering
Lemma 3.21.Let λRwith 0<λ<1and ρ=(3
λ)d. Furthermore, let
mNwith 2ρ+1k < m |X|. Then,
costdiam(C3m
4)(1+)costdiam(Cm)+4d
2ρ+1k
moptk. (10)
Proof. Let τ:= costdiam(Cm). From every cluster CCm, we fix an
arbitrary point and denote it by pC. Then, the distance from pCto
any qCis at most τand
CpC:= {ppCpC}Bd
τ(0).
By Lemma 3.3, a ball of radius τcan be covered by ρballs of radius
λτ. Hence, we are able to fix y1,...,yρRdwith
Bd
τ(0)
ρ
i=1
Bd
λτ(yi).
For CCm, we define the configuration of Cby
Conf(C) := {yi|1iρand Bd
λτ(yi)(CpC)=}.
That is, we identify each cluster CCmwith the subset of the balls
Bd
λτ(y1),...,Bd
λτ(yρ)that intersect CpC. Note that no cluster from
CCmhas an empty configuration and the number of possible con-
figurations is upper bounded by 2ρ.
As in the proof of Lemma 3.7we define t:= 3m
4and deduce
|CmCt+1|>m
2. It follows that there exist
j > 1
2ρ·m
2=m
2ρ+1
distinct clusters C1,...,CjCmCt+1with the same configuration.
Using m > 2ρ+1k, we deduce j > k.
Let P:= {pC1,...,pCj}. Since Xis (k,optk)-coverable, so is PX.
Therefore, by Lemma 3.2, there exist distinct a,b{1,. . . ,j}such that
pCapCb4d
2ρ+1k
moptk.
Next, we derive a bound for the diameter of the union of the corre-
sponding clusters Caand Cb. That is, we need to bound the largest
distance uvbetween two points u,vCaCb. If u,vCaor
u,vCb, then uvcostdiam(Cm). Hence, let uCaand vCb.
Using the triangle inequality, for any wRd,
cf. Figure 7uvpCapCb+u+pCbpCaw+wv.
For pCapCb, we already derived an upper bound. To bound
u+pCbpCaw, let yConf(Ca) = Conf(Cb)such that
upCaBd
λτ(y).
36
3.4 diameter k-clustering
pCa
u
pCb
w
v
pCapCb
u+pCbpCaw
wv
Figure 7: Congruent configurations.
Furthermore, let wCbwith
wpCbBd
λτ(y).
Then,
u+pCbpCaw=upCa (wpCb)
·costdiam(Cm)
=2λτ.
Since v,wCb, their distance is bounded by
wvdiam(Cb)costdiam(Cm).
We conclude that merging the clusters Caand Cbresults in a cluster
whose diameter can be upper bounded by
diam(CaCb)(1+)costdiam(Cm)+4d
2ρ+1k
moptk.
Using Observation 3.18 and the fact that Caand Cbare part of the
clustering Ct+1, we can upper bound the cost of Ctby
costdiam(Ct)diam(CaCb).
Note that the parameter λfrom Lemma 3.21 establishes a trade-
off between the two terms on the right-hand side of Inequality (10).
For the analysis of the first stage, we have to choose λcarefully.
In the proof of the following lemma, we use λ=ln 4
3/4d and apply
Lemma 3.21 for log4
3
|X|
2σ+1kconsecutive phases, where σ= (42d)d.
Then, we are able to upper bound the cost of the clustering computed
after the first stage by a term that is linear in dand optkand indepen-
dent of |X|and k. The number of remaining clusters is independent
of the number of input points |X|and depends only on the dimension
dand the desired number of clusters k.
37
analysis of agglomerative clustering
Lemma 3.22.Let 2σ+1k < |X|for σ= (42d)d. Then, on input X, Algo-
rithm 3.3computes a clustering C2σ+1kwith
costdiam(C2σ+1k)<(28d +4)optk.
Proof. Let u:= log3
4
2σ+1k
|X|and define mi:= (3
4)i|X|for i=0,...,u.
Furthermore, let λ:= ln 4
3/4d. This implies ρσfor the parameter ρ
of Lemma 3.21. Then, mu2σ+1kand mi> 2σ+1k2ρ+1kfor
i=0,. . . ,u1. Since
3mi
4=3
4(3
4)i|X|⌉⌋(3
4)i+1|X|+3
4(3
4)i+1|X|=mi+1
and Algorithm 3.3uses a greedy strategy,
costdiam(Cmi+1)costdiam(C3mi
4)
for i=0,. . . ,u1. Combining this inequality with Lemma 3.21 (ap-
plied to m=mi), we obtain
costdiam(Cmi+1)(1+)costdiam(Cmi)+4d
2ρ+1k
mi
optk.
By repeatedly applying this inequality for i=0,...,u1and using
costdiam(C2σ+1k)costdiam(Cmu)and costdiam(Cm0)=0, we get
costdiam(C2σ+1k)
u1
i=0
(1+)i·4d
v
u
u
t
2σ+1k
(3
4)u1i|X|optk
=4d
v
u
u
t
2σ+1k
(3
4)u1|X|optk·
u1
i=0
(1+)i·d
(3
4)i
.
Using u1 < log3
4
2σ+1k
|X|,
costdiam(C2σ+1k)< 4 optk·
u1
i=0
1+
d
4
3
i
. (11)
By taking only the first two terms of the series expansion of the expo-
nential function, we get 1+ =1+ln 4
3
2d < eln 4
3
2d =2d
4
3. Substituting
this bound into Inequality (11) and extending the sum yields
costdiam(C2σ+1k)< 4 optk·
i=0
1
2d
4
3
i
< 4 optk·
i=0(1
1+)i
.
Solving the geometric series leads to
costdiam(C2σ+1k)< 4 (1
+1)·optk<(28d +4)·optk.
38
3.4 diameter k-clustering
r
A B
diam(A)
diam(B)
diam(AB)
Figure 8: Merging two clusters intersecting a ball of radius r.
stage two
The second stage covers the remaining merge steps until Algorithm 3.3
computes the clustering C2k. However, compared to stage one, the
analysis of a single phase yields a weaker bound. The following
lemma provides an analysis of a single phase of the second stage.
It is very similar to Lemma 3.7and Lemma 3.8in the analysis of the
discrete k-center problem.
Lemma 3.23.Let mNwith 2k < m |X|. Then,
costdiam(C3m
4)2·(costdiam(Cm)+optk).
Proof. As in the proof of Lemma 3.7, we define t:= 3m
4and deduce
|CmCt+1|>m
2> k. Since Xis (k,optk)-coverable, there exists a point
yRdsuch that Bd
optk(y)intersects two clusters A,BCmCt+1.
We conclude that merging Aand Bwould result in a cluster whose
diameter can be upper bounded by
cf. Figure 8
diam(AB)2costdiam(Cm)+2optk.
The result follows using A,BCt+1and Observation 3.18.
Lemma 3.24.Let nNwith n2σ+1kand 2k < n |X|for σ= (42d)d.
Then, on input X, Algorithm 3.3computes a clustering C2k with
costdiam(C2k)< 2 (costdiam(Cn)+2optk).
Proof. Let u:= log3
4
2k
nand define mi:= (3
4)infor i=0,...,u.
Then, mu2k and mi> 2k for i=0,. . . ,u1. Analogously to the
proof of Lemma 3.22, we get 3mi
4mi+1and using Lemma 3.23,
costdiam(Cmi+1)2(costdiam(Cmi)+optk)
for i=0,...,u1. By repeatedly applying this inequality for i=
0,. . . ,u1and using costdiam(C2k)costdiam(Cmu), we get
costdiam(C2k)2ucostdiam(Cn)+
u
i=1
2ioptk
< 2u(costdiam(Cn)+2optk).
Hence using ulog4
32σ< , the result follows.
39
analysis of agglomerative clustering
Proof of Proposition 3.20.The Proposition follows immediately by com-
bining Lemma 3.22 and Lemma 3.24.
3.4.2analysis of the remaining merge steps
We analyze the remaining merge steps analogously to the k-center
problem. Therefore, in this section we only discuss the differences,
most of which are slightly modified bounds for the cost of merging
two clusters.
The connectivity property from Section 3.3.2remains the same.
However, for any two (Z,r)-connected clusters A,B, we use
cf. Figure 8diam(AB)diam(A) + diam(B) + 2r (12)
as a replacement for Inequality (6). Furthermore, Lemma 3.14 also
holds for the diameter k-clustering problem, i.e., with
costdiam(Ck)costdiam(P).
Using Inequality (12) in the proof of Lemma 3.15, we get
diam(AB)costdiam(Pm)+costdiam(Pn)+2optk
as a replacement for Inequality (7) while Inequality (8) can be re-
placed by
diam(A1A2)costdiam(Pm)+2(costdiam(Pn)+2optk).
That is, for the diameter k-clustering problem the two upper bounds
are different. However, the second one is larger than the first one.
Using it in both cases, the inequality stated in Lemma 3.15 changes
slightly to
costdiam(Pm+
2)costdiam(Pm)+2(costdiam(Pn)+2optk).
Together with costdiam(P˜n)2costdiam(Pn)+2optkas a replacement
for Inequality (9), the bound stated in Lemma 3.16 becomes
costdiam(P+1)< 2 (log2() + 2)(costdiam(Pn)+2optk).
Thus, using Proposition 3.20, the upper bound for the cost of the
(+1)-clustering of Ystated in Lemma 3.17 becomes
costdiam(P+1)< 2 (log2() + 2)(2 (28d +6)+2)·optk
for σ= (42d)d. Analogously to Section 3.3.4, this proves Theorem 3.19.
40
3.5 the one-dimensional case
3.5the one-dimensional case
For d=1, we are able to show that Algorithm 3.3computes an ap-
proximation to the diameter k-clustering problem (Problem 2.7) with
an approximation factor of at most 3. We even know that for any
data set XRthe approximation factor of the computed solution is
strictly below 3. However, we do not show an approximation factor
of 3ϵfor some ϵ > 0. The proof of this upper bound makes use of
the total order of the real numbers and is certainly not generalizable
to higher dimensions.
Theorem 3.25.Let d=1. Then, Algorithm 3.3gives a solution to the
diameter k-clustering problem (Problem 2.7) with cost less than three times
the cost of an optimal solution.
For any set CRit holds that diam(C) = 2rad(C). That is, in the
one-dimensional case, the diameter and the radius of a cluster differ
by a constant factor. Thus, for any two clusters Aand B,
diam(A)<diam(B)rad(A)<rad(B).
It follows that Algorithm 3.3and Algorithm 3.2choose the same clus- We assume that
there are no ties.
ters in the merge step and we immediately get the following corollary.
Corollary 3.26.Let d=1. Then, Algorithm 3.2gives a solution to the
k-center problem (Problem 2.8) with cost less than three times the cost of an
optimal solution.
Before proving Theorem 3.25, we discuss some features of one-
dimensional clusterings and introduce a suitable terminology. Af-
terwards, we formulate the central argument used in the proof of
Theorem 3.25 as a separate lemma.
In the following, we consider solely clusterings of the finite input
data set XR. Since Algorithm 3.3uses a greedy strategy to choose
the clusters to be merged, it merges only clusters whose points lie
directly next to each other. That is, the created clusters are cohesive
in the sense that they contain all points of Xthat lie between their
smallest and their largest point. Without loss of generality, we may
assume that the same holds for the clusters of an optimal solution.
Obviously, any clustering can be transformed to have this property
without changing the largest diameter. It follows that all considered
clusters are uniquely determined by their smallest and their largest
point. Therefore, in the figures of this section, clusters are depicted by
line segments between their smallest and their largest point. We call
the smallest point of a cluster its left boundary and the largest point its
right boundary.
We start with considering the relative position of two clusters A,B
Xfrom possibly different clusterings of XR. We say that Alies in-
side of B, if AB. Furthermore, Alies to the left of Bif the right bound-
ary of Ais strictly less than the left boundary of B, i. e., maxxAx <
41
analysis of agglomerative clustering
minxBx. Analogously, Alies to the right of B, if maxxBx < minxAx.
For two distinct clusters A,Bfrom the same clustering C, it follows
that Alies either to the left or to the right of B. We say that such
clusters Aand Bare neighboring if no further cluster CClies in
between. A set of n > 2 clusters from the same clustering Cis called
neighboring if the clusters can be denoted as C1,. . . ,Cnsuch that for
i=1,. . . ,n1, the clusters Ciand Ci+1are neighboring. For two
neighboring clusters A,Bwith Alying to the left of B, we define the
gap between Aand Bas the open interval between the right bound-
ary of Aand the left boundary of B. It follows that a gap contains
no elements from X. Furthermore, for two clusters A,Bfrom different
clusterings we define Ato be left aligned to B, if Aand Bshare their left
boundary, i. e., minxAx=minxB. Analogously, Ais right aligned
to B, if maxxAx=maxxBx. If it is clear from the context or if it is
irrelevant which cluster a cluster Ais aligned to, we simply say Ais
left (or right) aligned.
For the remainder of this section, we fix an optimal diameter k-
clustering Sof X. That is, Sis an optimal solution to Problem 2.7
for d=1. Without loss of generality, we assume S={S1,...,Sk}
with Silying to the left of Si+1for i=1,. . . ,k1. Furthermore,
we denote the hierarchical clusterings computed by Algorithm 3.3on
input Xby C|X|,. . . ,C1. For our line of reasoning, we focus on two
particular clusterings Cand Cmwith m. Let mNbe theThe smaller the
index, the later the
clustering is
computed.
smallest index, such that all clusters of Cmhave a diameter of at most
optkand let Nbe the smallest index, such that all clusters of C
have a diameter strictly less than 3·optk. Since Algorithm 3.3uses
a greedy strategy, it is sufficient to show that is at most kto prove
Theorem 3.25. To do this, we use the following technical lemma.
Lemma 3.27.Let PSbe a set of neighboring clusters from the optimal
solution Swith the following properties:
1. Gaps between neighboring clusters from Phave length at most optk.
2. The rightmost cluster of Pis right aligned to a cluster CCmand it
is the only one with this property.
Furthermore, let LCbe such that:
3. For each cluster CL, the left boundary of Cis contained in one of
the clusters from P.
4. The left boundary of the leftmost cluster of Lis contained in the left-
most cluster of P.
Then, it holds that |L| |P|.
Proof. We prove the lemma by induction over the size of Pwith two
base cases. First, let |P| =1. We denote P={P}and assume |L| 2for
contradiction. By definition, each cluster from Lis the union of one or
42
3.5 the one-dimensional case
PSP
LC
CmA B
(a) Sketch of contradiction for |P| =1.
PSP1P2
LCA B C
(b) Sketch of contradiction for |P| =2.
Figure 9: The two base cases of the induction.
more clusters from Cmas illustrated in Figure 9a. It follows that there
exist at least two clusters A,BCmsuch that their left boundaries are
contained in the single cluster P. Then, the right boundaries of Aand
Balso have to be contained in P. Otherwise, there could not exist a
cluster CCmthat is right aligned to Pas required for the second
property. It follows that Aand Bboth lie inside of Pwhich has a
diameter of at most optk. That is, the union of Aand Bwould have
a diameter of at most optk, too. Thus, Algorithm 3.3would have
created a clustering Cm1with cost of at most optk. However, this
contradicts the definition of m.
Second, let |P| =2, denote P={P1,P2}with P1lying to the left of P2
and assume |L| 3for contradiction. Let CLbe the rightmost clus-
ter of Land recall that the gap between the two neighboring clusters
P1and P2contains no points from X. It follows that there have to ex-
ist at least two clusters A,BLto the left of C, which both lie inside
P1P2(cf. Figure 9b). Furthermore, the right boundaries of Aand
Bhave to be strictly less than the right boundary of P2. Otherwise,
the left boundary of Ccould not be contained in one of the clusters
of Pas required in the third property. Using the first property, it fol-
lows that the union of Aand Bhas a diameter of strictly less than
3·optk. However, analogously to the first base case, this contradicts
the definition of .
Finally, for the inductive step, let |P| 3and assume that for any
two sets Pand Las defined in the lemma but with |P|<|P|, it
holds that |L||P|. Let P1,P2Pbe the two leftmost clusters of
Pwith P1lying to the left of P2. Furthermore, let A,B,CLbe For |L| < 3, there is
nothing to show.
the three leftmost clusters of LCordered from left to right as
illustrated in Figure 10. Every cluster in Cis the union of one or
more clusters from Cm. Thus, the second property yields that the
rightmost cluster of Pis the only cluster from Pthat may be right
aligned to a cluster from C. However, this cluster from Ccan not
be the cluster B, since to the right of Bthere has to be room for the
cluster C. It follows that Bis not right aligned to a cluster from P.
43
analysis of agglomerative clustering
P1P2P
PS
LCA B C
Figure 10: The inductive step.
S
P1
z }| {
P2
z }| { ... Pt
z}|{
Cm
Figure 11: Partition of the optimal solution.
The definition of yields diam(AB)3·optkand the first property
provides diam(P1P2)3·optk. Since Bcan not be right aligned to
P2, the right boundary of Band the left boundary of Cboth have to
be contained in a cluster PPthat lies to the right of P2.
We define a new set Pby removing all clusters from Pthat lie
to the left of P. It follows that |P||P| 2. Furthermore, we define
L:= L \ {A,B}. Then, all four properties of Pand Las required by the
lemma are also satisfied by Pand L. Using the inductive hypothesis
|L||P|, we conclude
|L| =|L|+2|P|+2|P|.
Proof of Theorem 3.25.In the following, we denote Cm={C1,...,Cm}
with Cilying to the left of Ci+1for i=1,. . . ,m1. Since Sand Cm
are clusterings of the same set X, it follows that C1is left aligned to S1
and Cmis right aligned to Sk. Furthermore, it follows that if for any
1im1and 1jk1the cluster Ciis right aligned to the
cluster Sj, then Ci+1has to be left aligned to Sj+1(cf. Figure 11). That
is, Cmand Shave a common gap between the neighboring clusters
Ci,Ci+1and Sj,Sj+1, respectively. We use these common gaps of Cm
and Sto define a partition P1,. . . ,Ptof the optimal solution S. Let
tNbe the number of clusters in Sthat are left aligned to a cluster
of Cm(which is also the number of right aligned clusters). Then, for
i=1,...,tlet Libe the i-th cluster of S(counted from the left) that
is left aligned to a cluster of Cmand let Ribe the i-th right aligned
cluster. Using Liand Ri, we define Pito be the set of neighboring
clusters of Sfrom Lito Ri.
Based on the partition P1,. . . ,Ptof Swe define a partition of C.
To this, for each i{1,. . . ,t}, we pick all clusters from Cwhose left
boundary is contained in a cluster from Pi. This results in a partition
L1,...,Ltof Cwith
Li:= {CC(min
xCx)Pfor a cluster PPi}.
44
3.6 lower bounds
Then, for each i=1,. . . ,t, the sets Piand Lisatisfy the first three
properties required in Lemma 3.27 for the sets Pand L. In order to
also satisfy the fourth property, we need to modify the sets P1,...,Pt
slightly. To this, for each i=1,...,t, we define P
iby removing the
leftmost clusters from Pithat do not contain the left boundary of a
cluster CLi. Then, we are able to apply Lemma 3.27 with P=P
i
and L=Lifor i=1,...,t. Using |P
i||Pi|, it follows that |Li||Pi|.
Since, P1,...,Ptand L1,...,Ltare partitions of Sand C, respectively,
=
t
i=1
|Li|
t
i=1
|Pi|=k.
As mentioned above, this proves the theorem.
3.6lower bounds
In this section, we present constructions of several data sets for the
diameter k-clustering problem. The data sets yield lower bounds for
the approximation factor of Algorithm 3.3. Some of the data sets
are also suitable to prove lower bounds for Algorithm 3.2and Algo-
rithm 3.1for the k-center problem and the discrete k-center problem,
respectively. To keep the constructions simple, we make use of the
fact that we presented the clustering algorithms without a particular
tie-breaking strategy. That is, whenever one of the algorithms is able
to choose between several possible merge steps, we simply assume
that we can govern its choice.
In Section 3.6.1, we show that for data sets XR(i.e., d=1), Algo-
rithm 3.3has an approximation factor of at least 2.5. Recall that in in
Section 3.5we stated that in the one-dimensional case Algorithm 3.3
computes a solution to the diameter k-clustering problem with an
approximation factor strictly below 3. Hence, for d=1, we obtain
almost matching upper and lower bounds for the cost of the solution
computed by Algorithm 3.3. The same holds for Algorithm 3.2and
the k-center problem since for d=1the two problems as well as the
two algorithms are equivalent.
In Section 3.6.2, we show that the dimension dhas an impact on
the approximation factor of Algorithm 3.3. This follows from a two-
dimensional data set yielding a lower bound of 3for the metric based
on the L-norm. Note that this exceeds the upper bound from the
one-dimensional case.
In Section 3.6.4, we show that there exist data sets such that Al-
gorithm 3.3computes an approximation to the diameter k-clustering
problem (Problem 2.7) with an approximation factor of (p
logk)for
metrics based on an Lp-norm (1p < )and (logk)for the met-
ric based on the L-norm. In case of the L1- and the L-norm, this
matches an already known lower bound that has been shown using
a rather artificial metric [Dasgupta and Long, 2005]. However, the
45
analysis of agglomerative clustering
bound in [Dasgupta and Long, 2005] is derived from a two-dimen-
sional data set, while the dimension of our construction depends on
the number of clusters k.
Finally, in Section 3.6.4, we show that the lower bound of (p
logk)
for any Lp-norm and (logk)for the L-norm can be adapted to the
discrete k-center problem. In case of the L2-norm, we thus obtain al-
most matching upper and lower bounds for the cost of the solution
computed by Algorithm 3.1. Furthermore, we are able to restrict the
dependency of the approximation factor of Algorithm 3.1on dand k.
3.6.1arbitrary metrics and d=1
We first show a lower bound for the approximation factor of Algo-
rithm 3.3. To do this, we use a sequence of data sets from Rdwith
d=1. Since up to normalization there is only one norm for d=1,
without loss of generality we assume the Euclidean metric.
Proposition 3.28.For all ε>0and k4there exists a data set XR
such that Algorithm 3.3computes a solution to the diameter k-clustering
problem (Problem 2.7) with cost at least 5
2εtimes the cost of an optimal
solution.
Proof. It is sufficient to consider the case k=4. The construction can
easily be extended to k > 4 by adding separated dummy points. In
the following we construct a data set for any fixed nN. That is,
we actually obtain a series of data sets. For any nNwe show that
the computed solution has an approximation factor of 5
2f(n)for
f(n)> 0 and limnf(n) = 0.
For xR, let
V(x) := {x+i|iNand 0i2n1}.
That is, V(x)consists of 2nequidistant points, where the distance be-
tween neighboring points is 1and diam(V(x)) = 2n1. Furthermore,
we define
l(x) := x2n1,
r(x) := x+2n1+2n1=x+3·2n11,
W(x) := V(x){l(x),r(x)}.
It follows that diam(W(x)) = 2n+11as shown in Figure 12.
Based on the sets W(x), we define
X:=
4
i=1
W(xi)
where xi:= i·(7·2n12)for i=1,. . . ,4. Then, there is a gap of
3·2n11between Wn(xi)and Wn(xi+1), i.e.
diam({r(xi),l(xi+1)})=3·2n11for i=1,. . . ,3. (13)
46
3.6 lower bounds
l(x)x r(x)
2n1
2n12n1
Figure 12: Sketch of the set W(x).
l(xi)r(xi)
2n12n1
2n1
Figure 13: Part of the dendrogram for W(xi).
The optimal 4-clustering of Xis
Copt
4={W(x1),W(x2),W(x3),W(x4)}
and costdiam(Copt
4)=2n+11.
The solution computed by Algorithm 3.3may have larger cost. At
the beginning, the minimum distance between two points from Xis 1.
The possible pairs of points with distance 1come from the sets V(xi)
for i=1,...,4. Since the distance between V(xi)and l(xi)or r(xi)
is 2n1, we can assume that the algorithm merges all points of V(xi)
for i=1,...,4as shown in Figure 13. It follows that Algorithm 3.3
computes the 12-clustering
C12 ={{l(x1)},V(x1),{r(x1)},
{l(x2)},V(x2),{r(x2)},
{l(x3)},V(x3),{r(x3)},
{l(x4)},V(x4),{r(x4)}}.
For i=1,...,4, the diameters of {l(xi)}V(xi)and V(xi){r(xi)}
are equal to 3·2n11and these are the best possible merge steps.
Therefore, by (13), we can assume that Algorithm 3.3merges r(xi)
and l(xi+1)for i=1,...,3first. This results in the 7-clustering
C7={{l(x1)}V(x1),
{r(x1),l(x2)},V(x2),
{r(x2),l(x3)},V(x3),
{r(x3),l(x4)},V(x4){r(x4)}}
where V(x2)and V(x3)have a diameter of 2n1while the remaining
clusters have a diameter of 3·2n11(see Figure 14). Between two
neighboring clusters of C7, there is a gap of 2n1.
In the next step of Algorithm 3.3, the best possible choice is to
merge {r(x1),l(x2)}with V(x2),{r(x2),l(x3)}with V(x2)or V(x3), or
47
analysis of agglomerative clustering
C1C2C3C4
5·2n3 3·2n11 3·2n2 3·2n11
Figure 14: Part of the dendrogram for X.
{r(x3),l(x4)}with V(x3). We let the algorithm merge {r(x1),l(x2)}with
V(x2)and {r(x3),l(x4)}with V(x3). This results in a 5-clustering where
the clusters have alternating lengths of 3·2n11and 3·2n2with
gaps of 2n1between them. Then, in the step resulting in C4, Algo-
rithm 3.3has to create a cluster of diameter 5·2n3as shown in
Figure 14. Therefore, the computed solution has an approximation
factor of
costdiam(C4)
costdiam(Copt
4)=5·2n3
2n+11.
For ngoing to infinity this approximation factor converges from be-
low to 5
2.
Analogously to Section 3.5, the bound for the one-dimensional di-
ameter k-clustering problem immediately yields the same bound for
the one-dimensional k-center problem.
Corollary 3.29.For all ε>0and k4there exists a data set XRsuch
that Algorithm 3.2computes a solution to the k-center problem (Problem 2.8)
with cost at least 5
2εtimes the cost of an optimal solution.
3.6.2l-metric and d=2
In this section, we construct a data set that uses only eight points from
R2and yields a lower bound of 3for the metric based on the L-norm.
Recall that in Section 3.5, we showed that for d=1the approximation
factor of a computed solution is always strictly less than 3. Therefore,
the lower bound of 3for d=2implies that the dimension dhas an
impact on the approximation factor of Algorithm 3.3.
Proposition 3.30.For the metric based on the L-norm, there exists a data
set XR2such that Algorithm 3.3computes a solution to the diameter
k-clustering problem (Problem 2.7) with three times the cost of an optimal
solution.
48
3.6 lower bounds
A
B
C
D
E
F
G
H
(a) Lower bound for the metric based
on the L-norm.
A B
C D
E F
G H
(b) Lower bound for the metric based
on the L2-norm. The points
C,D,G,Hhave a z-coordinate of
0, while the points A,B,E,Fhave
az-coordinate of 2x.
Proof. We prove the proposition by constructing a data set for k=4.
Consider eight points A,. . . ,HR2with
A= (0,1),E= (−1,2),
B= (1,0),F= (2,1),
C= (0,1),G= (1,2),
D= (−1,0),H= (−2,1).
The optimal 4-clustering of these points is
Copt
4={{A,E},{B,F},{C,G},{D,H}}
which has a maximum L-diameter of 1(cf. Figure 15a). However,
it is also possible that Algorithm 3.3starts by merging A with B and
C with D. Then, in the third step, the algorithm merges Eor Fwith
{A,B},Gor Hwith {C,D}, or {A,B}with {C,D}. We assume the latter.
Thus, the fourth merge step creates a cluster of L-diameter 3.
3.6.3euclidean metric and d=3
For the Euclidean case, we are able to construct a 3-dimensional data
set that yields a lower bound of 2.56. This is below the upper bound
of 3from the one-dimensional case. Therefore, this data set does
not show an impact of the dimension din the Euclidean case as in
the previous section. However, this lower bound is still better than
the lower bound of 2.5from the one-dimensional case. This suggests
that in higher dimensions it might be easier to construct good lower
bounds.
Proposition 3.31.For the Euclidean metric there exists a data set XR3
such that Algorithm 3.3computes a solution to the diameter k-clustering
problem (Problem 2.7) with cost 2.56 times the cost of an optimal solution.
49
analysis of agglomerative clustering
Proof. We prove the proposition by constructing a data set for k=4.
For any fixed xRwith 0 < x < 2 consider eight points A,. . . ,HR2
with
A= (−1,1,2x),E= (−(1+x),1+4x2,2x),
B= ( 1,1,2x),F= ( 1+x,1+4x2,2x),
C= (−1,1,0),G= (−(1+x),−(1+4x2),0),
D= ( 1,1,0),H= ( 1+x,−(1+4x2),0).
The optimal 4-clustering of these points is
Copt
4={{A,E},{B,F},{C,G},{D,H}},
which has a maximum L2-diameter of 2(cf. Figure 15b). However,
since AB=CD=2it is possible that Algorithm 3.3starts
by merging Awith Band Cwith D. Then, the cheapest merge adds
one of the points E,Fto the cluster {A,B}or it adds one of the points
G,Hto the cluster {C,D}or it merges {A,B}with {C,D}. We assume
the latter. The resulting cluster {A,B,C,D}has a diameter of 22+x.
Then, in the fourth merge step, the algorithm will either merge one
of the pairs E,Fand G,Hor one of the pairs E,Gand F,H. The choice
depends on the parameter x. Note that Algorithm 3.3will not merge
the cluster {A,B,C,D}with one of the remaining four points, since
this is always more expensive. The diameter of the created cluster is
maximized for x1.56. If we fix x=1.56, the algorithm merges E
with For Gwith H. This results in a 4-clustering of cost 5.12, while
the optimal solution has cost 2.
3.6.4lp-metric (1p)in variable dimension
In the following, we consider the diameter k-clustering problem with
respect to the metric based on the L1-norm. We provide a data set
with dimension O(k)such that Algorithm 3.3computes a solution
with an approximation factor of (logk).
Proposition 3.32.For the metric based on the L1-norm, there exists a data
set XRdwith d=k+log2ksuch that Algorithm 3.3computes a solution
to the diameter k-clustering problem (Problem 2.7) with 1
2log2ktimes the
cost of an optimal solution.
Proof. For the sake of simplicity, assume kto be a power of 2. In the
following, we consider the (k+log2k)-dimensional set Xof |X|=k2
points defined by
X:= {[ei
b]1ikand b{0,1}log2k}.
50
3.6 lower bounds
Here, eiRkdenotes the i-th canonical unit vector. Consider the
k-clustering
C
k:= {Cbb{0,1}log2k},
where for each b{0,1}log2kthe cluster Cbis given by
Cb:= {[ei
b]1ik}.
The largest diameter of C
kis costdiam(C
k)=2. Hence, for the diameter
optkof an optimal solution, it follows that
optk2. (14)
To investigate the merge steps of Algorithm 3.3, note that
diam({[ei
b1],[ej
b2]})={h(b1,b2)if i=j
2+h(b1,b2)if i=j
where h(b1,b2)denotes the Hamming distance between the strings
b1,b2{0,1}log2k. Hence, we may assume that Algorithm 3.3starts
by merging the points [ei,0,b]and [ei,1,b]for all 1ikand
all b{0,1}log2(k)−1, thereby forming 1
2k2clusters of diameter 1.
Next, we show inductively that Algorithm 3.3keeps merging pairs
of clusters that agree on the first kcoordinates until the algorithm
halts. Assume that there is some number 1tlog2ksuch that the
clustering computed so far consists solely of the clusters
C(t)
i,b=
ei
b
b
b{0,1}t
for all 1ikand all b{0,1}log2(k)−t. Note that for t=1this is
the case after the first 1
2k2merges. Then,
diam(C(t)
i,b1C(t)
j,b2)={t+h(b1,b2)if i=j
2+t+h(b1,b2)if i=j.
Hence, as above, we may assume that in the next 1
2t+1k2steps Algo-
rithm 3.3merges the clusters C(t)
i,0band C(t)
i,1bfor all 1ikand all
b{0,1}log2(k)−(t+1). The resulting clusters have a diameter of t+1.
Furthermore,
C(t+1)
i,b=C(t)
i,0bC(t)
i,1b.
Algorithm 3.3keeps merging clusters in this way until after t=
log2krounds we end up with the k-clustering Ck={Ci1ik}
where
Ci={[ei
b]b{0,1}log2k}.
These clusters Ciare of diameter log2k. Comparing to (14), we de-
duce that Algorithm 3.3computes a solution to Problem 2.7with at
least 1
2log2ktimes the cost of an optimal solution.
51
analysis of agglomerative clustering
Considering the diameter k-clustering problem with respect to an
arbitrary Lp-metric (with 1p < ), note that the behavior of Al-
gorithm 3.3does not change if we consider the p-th power of the Lp-
distance instead of the Lp-distance. Also note that for all x,y{0,1}d
we have xyp
p=xy1. Since the data set Xfrom Proposition 3.32
is a subset of {0,1}d, we immediately obtain the following corollary.
Corollary 3.33.For the metric based on any Lp-norm with 1p < ,
there exists a data set XRdwith d=k+log2ksuch that Algorithm 3.3
computes a solution to the diameter k-clustering problem (Problem 2.7) with
p
1
2log2ktimes the cost of an optimal solution.
Additionally, considering the diameter k-clustering problem with
respect to the L-metric, it is known that subset of npoints from an
arbitrary metric space can be embedded isometrically into (Rn,L)
[Fréchet, 1910]. That is, the distance between the embedded points
is equal to the original distances. The construction of the embedded
set is simple. For 1i,jn, let dij be the distance between xiand
xjin the metric space. Then, for each point xiwith 1indefine
x
i:= (di1,...,din)T. It immediately follows x
ix
j=dij.
Hence, the data set of size n=k2from Proposition 3.32 yields a
data set in Rk2satisfying the same approximation bound with respect
to the L-distance. We obtain the following corollary.
Corollary 3.34.For the metric based on the L-norm, there exists a data
set XRdwith d=k2such that Algorithm 3.3computes a solution to the
diameter k-clustering problem (Problem 2.7) with 1
2log2ktimes the cost of
an optimal solution.
the discrete k-center problem
The data set Xfrom Proposition 3.32 also yields lower bounds for the
approximation factor of the agglomerative solution to the discrete k-
center problem. Just note that for the data set Xin every step of the
algorithm the minimal discrete radius of a cluster equals the diameter
of the cluster. We immediately obtain the following corollaries.
Corollary 3.35.For the metric based on any Lp-norm with 1p < ,
there exists a data set XRdwith d=k+log2ksuch that Algorithm 3.1
computes a solution to the discrete k-center problem (Problem 2.9) with
p
1
2log2ktimes the cost of an optimal solution.
Corollary 3.36.For the metric based on the L-norm, there exists a data
set XRdwith d=k2such that Algorithm 3.1computes a solution to the
discrete k-center problem (Problem 2.9) with 1
2log2ktimes the cost of an
optimal solution.
Moreover, in case of the L2-norm, we obtain the following corollary.
52
3.6 lower bounds
Corollary 3.37.For the metric based on the L2-norm, there exists a data set
XRdwith d=O(log3k)such that Algorithm 3.1computes a solution to
the discrete k-center problem (Problem 2.9) with (logk)times the cost
of an optimal solution.
Note that Corollary 3.37 follows by embedding the data set from
Corollary 3.35 into the O(log3k)-dimensional Euclidean space with-
out altering the behavior of the agglomerative algorithm or the lower
bound of (logk)(using the Johnson-Lindenstrauss embedding
[Johnson and Joram, 1984]). For this embedded data set, the bound
given in Section 3.2states an upper bound of 20d +2log(k) + 2=
O(log3k)times the cost of an optimal solution. Hence, in case of the
discrete k-center problem using the L2-metric, the upper bound from
our analysis almost matches the lower bound.
Furthermore, this implies that the approximation factor of Algo-
rithm 3.1cannot be simultaneously independent of dand logk. More
precisely, the approximation factor cannot be sublinear in 6
dand in
logk.
53
Part II
ON PARAMETER ESTIMATION
The second part of this thesis is about the parameter esti-
mation problem for statistical models. Statistical models
offer an alternate view on the structure of a given data
set that differs substantially from the hard clustering ap-
proach discussed in Part I. Using this approach, the data
set is assumed to be generated by a stochastic process.
Then, the parameters of this process can be interpreted as
a description of the structure of the data. While a partition
of the data as computed by hard clustering algorithms is
only valid for the given data set, the parameters of the
stochastic process may also be used to describe other data
sets generated by the same process.
The main result of Part II is the analysis of a probabilistic
variant of the classical EM algorithm for the parameter es-
timation problem of Gaussian mixture models. We show
that the probabilistic algorithm is considerably faster than
the classical algorithm while, with high probability, its pa-
rameter updates are close to the deterministic EM updates.
Furthermore, we confirm our theoretical results in an ex-
perimental analysis.
4
STATISTICAL MODELS
One way to deal with large data sets is to reduce the amount of in-
formation that has to be processed by investigating the distribution
of the data. That is, if we are merely interested in the correlation of
the data, then a loss of the information residing in the single data
elements is acceptable. In this chapter we deal with the description
of data by statistical models. That is, we assume that a given data set
was generated by a particular statistical process. Then, the parame-
ters of this process give a small description of the data set.
In this chapter we discuss the problem of obtaining suitable pa-
rameters for particular statistical models. We focus our attention on
the well understood Gaussian mixture models. Gaussian mixtures
are frequently used in practice since they are suitable for the expla-
nation of many real world data sets. One often cited example is the
Old Faithful data set. It is named after a famous geyser in the Yel-
lowstone National Park where its data is taken from. The data set
is two-dimensional and contains 272 entries. Each entry corresponds
to one observation of an eruption and consists of the eruption time
in minutes and the waiting time to the next eruption. The data set
is shown in Figure 15. We see that the eruptions seem to follow two
different time patterns. The measurements decompose into two ellip-
b
b
b
b
b
b
b
b
b
b
b
b
b
b
bb
bb
b
b
b
b
b
bb
b
b
b
b
bb
b
b
b
bb
b
b
bb
b
b
b
b
b
bbb
b
b
b
bb
bb
b
b
b
b
b
b
bb
b
b
b
b
b
b
b
b
b
b
b
b
bb
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
bb
b
b
b
b
b
b
b
b
b
b
b
b
b
b
bb
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
b
bb
b
b
b
b
b
b
b
b
b
b
b
bb
b
b
b
b
b
b
b b
b
b
b
b
b
b
bb
b
b
b
b
b
b
b
bb
b
b
b
bb b
bb
b
bb
bb
bb
b
b
b
b
b
b
b
b
b
b b
b
b
b
b
b
bb
b
bb
b
bb
b
b
bbb
b
b
b
b
b
bb
b
b
b
b
bb
bbb
b b
b
b
b
b
bbb
b
b
b
b
b
b
b
b
b
b
b b
b
b
b
b
bb
b
b
bb
bb
b
bb
bb
bb
b
bb
b
Figure 15: The Old Faithful data set.
57
statistical models
soidally shaped point clouds. As we will discuss in Section 4.3, this
makes the data set a suitable candidate for Gaussian mixture models.
For the understanding of this and the following chapters, some
basic knowledge of discrete and continuous probability theory is in-
dispensable. In the following, without introducing the necessary con-
cepts, we denote the probability of an observation Aby Pr(A)and we
denote the probability of Agiven an observation Bby Pr(A|B). For
observations from continuous domains, Pr(·)denotes the probability
density function. Although formally incorrect, we use the term prob-
ability in the discrete as well as in the continuous case. However, it is
always clear from the context whether a probability or density func-
tion is meant. Furthermore, we denote random variables by upper
case letters and concrete observations by lower case letters. That is,
Pr(X=x)denotes the probability that the random variable Xtakes on
the value x(in the continuous case, it denotes the density of Xat the
point x). If the focus does not lie on the density of concrete observa-
tions but on the density function, then we simply write Pr(X)which
strictly speaking is a random variable itself. A short introduction to
probability theory can be found in [Bishop, 2006]. For a profound
treatment of the matter see [Grimmett and Stirzaker, 2001].
4.1the parameter estimation problem
Astatistical model for some domain Xconsists of a probability dis-
tribution on Xand a set of associated parameters. The probability
distribution is given by a density function pθ:XRthat is gov-We consider the
parameters to be
given as a vector θ
from a proper
parameter domain.
erned by a parameter vector θ. Then, data from the domain Xis
modeled by a random variable XXwith
Pr(X=x|θ)=pθ(x)
for xX. If we assume that the data consists of nNelements
that are drawn independently from the same distribution, we denote
X= (X1,. . . ,Xn)Xnfor independent and identically distributed
random variables X1,...,XnX. Then, the density pθis extended
to the domain Xnby computing the product probabilities over the
single observations. More precisely,
Pr(X= (x1,. . . ,xn)|θ)=pθ(x1,...,xn) =
n
i=1
pθ(xi).
Note that we denote data sets by ordered tuples. Besides simpli-
fying the discussion (e.g., regarding repetitions) it is natural to look
at the independent draws from the same distribution as a sequen-
tial process. Since we consider the data to be the result of a random
process, we use the terms data (set) and observation interchangeably.
Using a statistical model to describe a given data set, the central
question is how to optimize the parameter vector θ. A natural ap-
58
4.1 the parameter estimation problem
proach is to try and maximize the probability that θis responsible for
a given observation X, i. e., to compute
ˆ
θ:= argmax
θ
Pr(θ|X).
Pr(θ|X)is called the posterior probability since it is obtained after the
observation of X. However, for this probability to be well defined, we
need to assume a probability distribution defined on the domain of
the parameters. Then, using the probability Pr(θ)(called prior proba-
bility), Bayes’ theorem yields
Pr(θ|X)=Pr(X|θ)Pr(θ)
Pr(X),
where Pr(X|θ)is given by the statistical model. Since the denomina-
tor on the right-hand side does not depend on θ, we can state
Pr(θ|X)Pr(X|θ)Pr(θ).
The approach to optimize θbased on this relationship is called the A resource for the
Bayesian approach is
[Bishop, 2006].
Bayesian approach.
In this thesis we consider an alternate approach called the frequen-
tist approach. Adopting the frequentist approach, no assumptions are
made on the parameters. The optimization of the parameter vector θ
is done by maximizing the probability of the observation Xgiven θ,
i.e., to compute
θ:= argmax
θ
Pr(X|θ).
It is important to note that this argumentum maximi is not guar-
anteed to be well defined. The probability Pr(X|θ)does not have to
have a global maximum (e.g., for Gaussian mixtures, cf. Section 4.3.2).
However, for a well posed problem there should at least exist a local
maximum that can be searched for.
Another well known pitfall of this approach is its vulnerability to For a detailed
treatment of
overfitting see
[Hawkins, 2004].
the problem of overfitting. This means that a parameter vector θat
a global or local maximum of Pr(X|θ)may be so well adapted to
the concrete observation Xthat the resulting statistical model is less
suitable to describe similar data from the same application.
If the probability Pr(X|θ)is viewed as a function of the parameter
vector θ, then it is called the likelihood of θgiven the observation X
and is denoted as
L(θ|X)=Pr(X|θ).
Therefore, θas defined above, is called the maximum likelihood esti-
mate. The problem of computing or approximating the maximum
likelihood estimate is called the parameter estimation problem.
Instead of maximizing the likelihood it is equivalent and some-
times analytically more convenient to maximize its logarithm (cf. Sec-
tion 4.3). The function lnL(θ|X)is called the log-likelihood of θgiven
59
statistical models
X. For observations X= (X1,. . . ,Xn)as described above, the log-
likelihood takes the form
ln
n
i=1
pθ(Xi) =
n
i=1
lnpθ(Xi) =
n
i=1
lnL(θ|Xi).
That is, each sub-observation Xicontributes an additive term to the
log-likelihood cost function. This contribution is simply the log-like-
lihood of the parameter vector θgiven the single observation Xi.
In order to be able to interpret the optimization of θas the min-
imization of a cost function, it is common to minimize the negative
log-likelihood, i.e., to compute
θ:= argmin
θ
lnL(θ|X).
4.2general mixture models
In this section, we study the parameter estimation problem for a par-
ticular class of statistical models, namely the mixture models or mix-
ture distributions. Mixture models are well suited to model data that
can be partitioned into several subsets which can be seen as to be
generated by different distributions over the same domain X. For a
mixture of kdistributions, the parameter vector takes the form
θ= (w1,...,wk,θ1,...,θk),
where w1,. . . ,wkR+with k
j=1wj=1are called the weights and
θjis the parameter vector for the j-th component distribution. We
denote the probability density for the corresponding mixture by
pθ(x) =
k
j=1
wjpθj(x),
where pθjdenotes the probability density of the j-th component dis-
tribution. Since the weights sum up to one, it follows that
X
pθ(x)∂x =
k
j=1
wjX
pθj(x)∂x =
k
j=1
wj=1.
Thus, pθ(x)is indeed a probability density. Drawing a single observa-
tion X=xfrom a mixture distribution can be described as a two step
process. First, choose one of the component distributions with prob-
ability proportional to its weight. Second, draw xfrom the chosen
component distribution.
For the likelihood of θgiven an observation X=(x1,. . . ,xn)it fol-
lows that
L(θ|X)=
n
i=1
pθ(xi) =
n
i=1
k
j=1
wjpθj(xi)
60
4.3 gaussian mixture models
and for the log-likelihood we get
lnL(θ|X)=
n
i=1
lnpθ(xi) =
n
i=1
ln
k
j=1
wjpθj(xi).
Note that this expression contains the logarithm of a sum. There-
fore, in the mixture case, the likelihood is difficult to optimize di-
rectly. However, in Chapter 5we discuss an algorithmic approach to
approximate the maximum likelihood estimate.
4.3gaussian mixture models
Gaussian distributions (or short, Gaussians) are statistical models for
points from the d-dimensional space Rd. A single Gaussian is an uni-
modal distribution. That is, points drawn from a single Gaussian are
concentrated around a single point, called the mean of the Gaussian.
Furthermore, the points form an ellipsoidal shape and their density
drops exponentially with the distance to the mean. By mixing several
Gaussian distributions, it is possible to model multimodal data sets.
4.3.1single gaussian distributions
The Gaussian distribution (also known as the normal distribution) is a
continuous unimodal probability distribution for real valued random
variables. In the one-dimensional case, it is called the univariate Gaus-
sian distribution and is given by the probability density Nµ,σ2:RR
with
Nµ,σ2(x) := 1
2πσ2exp((xµ)2
2),
where µ,σ2Rand σ2> 0. The expected value of a random variable
X Nµ,σ2is given by For a derivation of
these equalities see
Section 2.3in
[Bishop, 2006].
E[X]=
x·Nµ,σ2(x)∂x =µ
and the variance of Xis given by
E[(Xµ)2]=
(xµ)2·Nµ,σ2(x)∂x =σ2.
Therefore, the parameter µis called the mean of the Gaussian distri-
bution and the parameter σ2is called its variance.
The extension of the Gaussian distribution to the d-dimensional
space Rdis called the multivariate Gaussian distribution (cf. Figure 16)
and is given by the probability density Nµ,Σ:RdRwith
Nµ,Σ(x) := 1
()d|Σ|exp(1
2(xµ)TΣ1(xµ)), (15)
61
statistical models
...
.
Figure 16: Density of a two-dimensional Gaussian distribution.
where µRdand ΣRd×dis a symmetric positive-definite matrix.
Analogously to the univariate case, the expected value of a random
variable X Nµ,Σis given by
E[X]=Rd
x·Nµ,Σ(x)∂x =µ
and the covariance matrix of Xis given by
E[(Xµ)(Xµ)T]=Rd
(xµ)(xµ)T·Nµ,σ2(x)∂x =Σ.
Therefore, the parameter Σis called the covariance matrix of the Gaus-
sian distribution. In the following, we take a closer look at the geo-
metrical form of the Gaussian density for different types of covariance
matrices.
For Σbeing a multiple of the identity matrix, i.e.,
Σ=σ2IdRd×d
for σ2> 0, we deduce
Nµ,σ2Id(x) = 1
(2πσ2)dexp(1
2xµ2)(16)
=1
(2πσ2)dexp(1
2
d
i=1
(xiµi)2)
=
d
i=1
1
2πσ2exp(1
2(xiµi)2)
=
d
i=1
Nµi,σ2(xi),
where x= (x1,. . . ,xd)Tand µ= (µ1,. . . ,µd)T. That is, each coordi-
nate of a random variable X Nµ,σ2Idis distributed independently
according to a univariate Gaussian around the corresponding coordi-
nate of µand with variance σ2. Furthermore, Equation (16) yields
that points with the same density lie on a sphere centered at µ(cf.
Figure 17a).
62
4.3 gaussian mixture models
b
σ
σ
(a) Σ=σ2Id
b
σ1
σ2
(b) diagonal Σ
b
λ1
λ2
(c) sym.pos.-def. Σ
Figure 17: Points with constant density for two-dimensional Gaus-
sians with different types of covariance matrices.
If Σis a diagonal matrix with diagonal elements σ2
1,. . . ,σ2
dR+,
then Σ1=Diag(1
σ2
1,. . . ,1
σ2
1)and
Nµ,Diag(σ2
1,...,σ2
d)(x) = 1
()dd
i=1σ2
i
exp(d
i=1
(xiµi)2
2
i)
=
d
i=1
1
2πσ2
i
exp((xiµi)2
2
i)
=
d
i=1
Nµi,σ2
i(xi).
That is, the i-th coordinate of X Nµ,Diag(σ2
1,...,σ2
d)is distributed inde-
pendently according to a univariate Gaussian around µiagain. How-
ever, in the diagonal case the variance of the i-th coordinate is σ2
i.
That is, a Gaussian with diagonal covariance matrix can be thought
of to be derived from a Gaussian with covariance Idby shrinking
and stretching along the coordinate axes. More precisely, by mul-
tiplying the i-th coordinate of X= (X1,. . . ,Xd) Nµ,Idwith the
standard deviation σifor i=1,. . . ,d, we get a random variable
Y=(σ1X1,. . . ,σdXd)T Nµ,Diag(σ2
1,...,σ2
d)(cf. Figure 17b).
For a general symmetric positive-definite Σ, we consider its unique
Cholesky Factorization See Theorem 4.2.7in
[Golub and
Van Loan, 2012].
Σ=LLT,
where Lis an invertible lower triangular matrix with positive diago-
nal entries. The matrix Lcan be uniquely decomposed as
L=VD
for an orthonormal matrix Vand a diagonal matrix Dwith positive
entries. While the linear map x7→ Dx stretches (or shrinks) xalong
the coordinate axes, the linear map x7→ Vx is a rotation around and
reflection at the origin. That is, applying the linear map x7→ Lx to the
d-dimensional unit ball results in an ellipsoid centered at the origin.
The above decomposition of Σis unique. Furthermore, for any or-
thonormal matrix Vand any diagonal matrix Dwith positive entries,
the matrix LLTwith L=VD is symmetric positive-definite. Therefore,
63
statistical models
it is common to identify a symmetric positive-definite matrix with the
associated ellipsoid.
We use the decomposition of Σstated above to decompose Σ1as
Σ1=(LT)1L1=(L1)TL1
=((VD)1)T(VD)1
=(D1V1)TD1V1.
Then, for the density of the multivariate Gaussian distribution, we
deduce
Nµ,Σ(x) = 1
()d|Σ|exp(D1V1(xµ)2
2). (17)
That is, a Gaussian with general symmetric positive-definite covari-
ance matrix can be thought of to be derived from a Gaussian with
diagonal covariance by rotating around the mean (cf. Figure 17c).
By ignoring the normalization factor |Σ|1
2, Equation (17) offers an
alternate view on the density function. The evaluation of Nµ,Σat a
point xcan be interpreted as a two step process. In the first step, xis
rotated around µby the linear map
x7→ V1(xµ) + µ
and afterwards the coordinates of xare scaled relative to µby the
linear map
x7→ D1(xµ) + µ.
This yields the point
y=D1V1(xµ) + µ. (18)
In the second step, the point yis passed to the density Nµ,Id.
Points with constant density have to be mapped to a sphere around
µin the first step (cf. Equation (16)). To see which points satisfy this
property, let zbe an arbitrary point on the translation of the unit
sphere centered at µ. Then, the linear map
x7→ VD(xµ) + µ
maps zonto the translation of the ellipsoid associated to Σcentered
at µ. Substituting x=VD(zµ) + µin Equation 18 yields
y=D1V1VD(zµ) + µ=z.
This equality also holds for zlying on an arbitrary sphere centered
at µ. We conclude that points with constant density lie on concentric,
confocal ellipsoids. More precisely, these ellipsoids result from scal-
ing the ellipsoid associated to Σand translating it to be centered at µ.
64
4.3 gaussian mixture models
Note that the lengths of the semi-axes of the ellipsoid associated to Σ
are the square roots of the eigenvalues of Σ(cf. Figure 17c).
See Chapter 2.3in
[Bishop, 2006].
For a single Gaussian distribution, the parameter vector θtakes the
form
θ= (µ,Σ).
Then, for the likelihood of θgiven an observation X=(x1,...,xn), we
get
L(θ|X)=Pr(X|θ)=
n
i=1
Nµ,Σ(xi)
=
n
i=1
1
()d|Σ|exp(1
2(xiµ)TΣ1(xiµ)).
As indicated in the previous section, it is analytically much simpler
to consider the log-likelihood. The result of applying the logarithm
is that the product transforms into a sum and that the exponential
function disappears. For the negative log-likelihood of a parameter
vector θ= (µ,Σ)given an observation X=(x1,...,xn), we get
lnL(µ,Σ|X)=
n
i=1
lnNµ,Σ(xi)(19)
=
n
i=1(ln()d|Σ|+1
2(xiµ)TΣ1(xiµ))
=
n
i=1(d
2ln +1
2ln|Σ|+1
2(xiµ)TΣ1(xiµ))
=nd
2ln +n
2ln|Σ|+
n
i=1
1
2(xiµ)TΣ1(xiµ).
The parameter estimation problem for a single Gaussian distribu-
tion, i. e., the maximization of L(µ,Σ|X)with respect to µand Σ, is
easy to solve. There exist closed form formulas for the values µand
Σof the maximum likelihood estimate θ, namely For a derivation of
these equalities see
[Bilmes, 1997].
µ=1
n
n
i=1
xiand Σ=1
n
n
i=1
(xiµ)(xiµ)T.
4.3.2mixture of gaussian distributions
As mentioned above, single Gaussian distributions are well suited to
model ellipsoidally shaped data sets. However, if the data set splits
into several possibly separated ellipsoidally shaped parts, then we
need a more general type of distribution called Gaussian mixture dis-
tribution.
Gaussian mixtures are a special case of mixture distributions (cf.
Section 4.2), where all components are single Gaussian distributions.
65
statistical models
...
.
Figure 18: Density of a mixture of three Gaussian distributions.
For a mixture of kGaussians, the parameter vectors θ1,. . . ,θkof the
component Gaussians take the form
θj=(µj,Σj),
where µjis the mean of the j-th component and Σjis its covariance
matrix. Then, the parameter vector of the mixture takes the form
θ= (w1,. . . ,wk,µ1,. . . ,µk,Σ1,. . . ,Σk)
and the probability density Mθof the mixture (cf. Figure 18) is de-
fined as
Mθ(x) :=
k
j=1
wjNµj,Σj(x).
For the likelihood of θgiven an observation X=(x1,. . . ,xn)it fol-
lows that
L(θ|X)=
n
i=1
Mθ(xi) =
n
i=1
k
j=1
wjNµj,Σj(xi)
and for the log-likelihood we get
lnL(θ|X)=
n
i=1
lnMθ(xi) =
n
i=1
ln
k
j=1
wjNµj,Σj(xi). (20)
Note that in contrast to the case of a single Gaussian distribution,
we are not able to move the logarithm operator in front of the Gaus-
sian density function (cf. Equation (19)). The summation over the k
components thus prevents the derivation of a simple formula for the
log-likelihood.
In case of Gaussian mixture models, it is hopeless to try and de-
duce closed formulas for the maximum likelihood estimate anyway.
The problem is that for the mixture of at least two Gaussians the
maximum likelihood estimate is not well defined. On the contrary,
it is always possible to generate parameters with arbitrary large log-
likelihood.
66
4.4 handy notions from probability theory
To see this, we fix an arbitrary point of the observed data X=
(x1,. . . ,xn)and customize the parameters for one of the kcomponent
Gaussians. Without loss of generality, we consider the data point x1
and the parameters w1,µ1, and Σ1. While w1can be set to any con-
stant cRwith 0<c<1, we define µ1=x1. Furthermore, for any
αRwith α > 0 we define ΣαRd×dto be a symmetric positive
definite matrix with determinant |Σα|=α. Then, for αtending to
zero and x=x1, the density Nµ1,Σα(x)also tends to zero (cf. Equa-
tion (15)). This is because of the exponential drop of the Gaussian
density function. Thus, for Σ1=Σαand αsmall enough, the contri-
bution of the points xiwith xi=xto the log-likelihood is dominated
by the components 2to k(cf. Equation (20)). More precisely, for
α0their contribution converges to some constant that depends on
θ2,. . . ,θk. However, for αsmall enough, the contribution of x=x1to
the log-likelihood is dominated by the first component. For α0the
density Nµ1,Σα(x1)converges to . Thus, the contribution of x=x1
also converges to .
As a consequence, one should always keep in mind that in case
of the mixture of Gaussian distributions the parameter estimation
problem is an ill-posed problem.
4.4handy notions from probability theory
In this section, we introduce some notions from probability theory
which we need for the discussion of the EM algorithm in Chapter 5
and for the analysis of the SEM algorithm in Chapter 6.
4.4.1distinction between distribution and density
Probability distributions are usually given by the corresponding den-
sity functions. Therefore, it is common to identify a distribution with
its density. Although formally not correct, this convention generally
does not lead to any ambiguities. We join this common practice for
the remainder of this thesis since it simplifies the discussion.
4.4.2on information theory
In the following, we give a brief description of two concepts from
information theory, namely the entropy of a random variable and
the Kullback-Leibler divergence between two random variables. We
define both quantities for the discrete as well as for the continuous
case.
For the following definitions of entropy and differential entropy,
we use the convention that
0ln0=0
67
statistical models
which arises from continuity since xlnx0as x0.
Definition 4.1.Let Xbe a random variable over a discrete domain Xwith
probability mass function p:XR. Then, the entropy of Xis defined as
H(X):=
xX
p(x)lnp(x).
Since the entropy of Xdepends only on the probability mass p, it
is also denoted as H(p). The entropy of Xis a non-negative quantity
which can be interpreted as the amount of information that is ob-
tained by observing the value of X. We use the entropy only for nota-
tional purposes. For a detailed discussion of the entropy see Chapter
2.1in [Cover and Thomas, 2012]. The continuous correspondence of
the entropy is called the differential entropy.
Definition 4.2.Let Xbe a random variable over a continuous domain X
with continuous and differentiable probability density function p:XR.
Then, the differential entropy of Xis defined as
H(X):= X
p(x)lnp(x)∂x.
Again, we also denote the differential entropy of Xas H(p), since
it depends only on the probability density p. For a discussion of the
relation between entropy and differential entropy, see Chapter 8.3in
[Cover and Thomas, 2012].
For the following definitions of the Kullback-Leibler divergence, we
use the conventions that
0ln 0
0=0and pln p
0=
for p > 0.
Definition 4.3.Let X,Ybe random variables over a discrete domain X.
Furthermore, let pand qbe the probability mass functions of Xand Y, re-
spectively. Then, the Kullback-Leibler divergence between Xand Yis
defined as
KL(XY):=
xX
p(x)ln p(x)
q(x).
Note that the Kullback-Leibler divergence is not symmetric, which
is the reason why it should not be called Kullback-Leibler distance.
For a motivation and discussion of the Kullback-Leibler divergence
see Chapter 2.3in [Cover and Thomas, 2012]. Just as the entropy, the
Kullback-Leibler divergence can also be defined for the continuous
case.
Definition 4.4.Let X,Ybe random variables over a continuous domain
X. Let pand qbe the probability density functions of Xand Y, respec-
tively. Furthermore, let pand qbe continuous and differentiable. Then, the
Kullback-Leibler divergence between Xand Yis defined as
KL(XY):= X
p(x)ln p(x)
q(x)∂x.
68
4.4 handy notions from probability theory
A discussion of the Kullback-Leibler divergence for the continuous
case can be found in Chapter 8.5of [Cover and Thomas, 2012].
In the discrete as well as in the continuous case, we also denote
KL(XY)as KL(pq)since it depends only on pand q. Furthermore,
the Kullback-Leibler divergence between Xand Yis a non-negative
quantity that is zero if and only if p=q.
69
5
THE CLASSICAL EM ALGORITHM
Throughout this chapter, we deal with an algorithmic approach to
solving the parameter estimation problem for some fixed statistical
model. More precisely, we discuss a general scheme for the approxi-
mation of the corresponding maximum likelihood estimate (see Chap-
ter 4.1), called the Expectation-Maximization algorithm (or simply
EM algorithm). Since its introduction in [Dempster et al., 1977] a lot
of work has been done to analyze and improve the EM algorithm.
Today it is very well understood [McLachlan and Krishnan, 2008].
In the following, we explain the algorithm, discuss its convergence
properties, and apply it to general mixture distributions as well as
Gaussian mixtures. The discussion of the general EM algorithm and
its convergence properties follows the presentation in Chapter 2.3of
[Bishop, 2006]. The application to general and Gaussian mixtures is
based on [Bilmes, 1997].
The EM algorithm is used when the observed data can be seen as
incomplete and when the corresponding complete data helps signif-
icantly to solve the parameter estimation problem. The main scope
of the algorithm are statistical models for which direct optimization
is impossible. For example, let us assume that the observed data
was generated by a mixture of Gaussian distributions. Then the com-
putation of the maximum likelihood estimate is computationally in-
tractable (see Chapter 4.3.2). However, if we knew for each observed
point from which component distribution it was generated, then the
solution could be computed separately for each component by closed
form formulas (see Chapter 4.3.1). Therefore, for Gaussian mixture
models we define the origin of each data point as the so-called hidden
values.
In the following, we denote the incomplete observation by the ran-
dom variable Xand the hidden values by the random variable Z. Fur-
thermore, we denote the joint distribution of the complete data (X,Z)Recall that we
identify a
distribution with
its density.
by
pθ(x,z) = Pr(X=x,Z=z|θ),
where θdenotes a vector of model parameters. Then, for the distribu-
tion of the incomplete data we deduce
pθ(x) = pθ(x,Z)∂Z =Pr(X=x|θ). (21)
71
the classical em algorithm
Algorithm 5.1:Classical EM algorithm
Input: incomplete observation X, initial model θ
while <condition> do
1q(Z) Pr(Z|X,θ)// for hidden r.v. Z
2maximize EZq[lnPr(X,Z|θ)] w. r. t. θ
end
return θ
Note that for Gaussian mixtures the integral becomes a sum over
the discrete space of possible assignments of each point to one of the
component distributions. The complete data likelihood pθ(X,Z)often
has a much simpler form than the incomplete data likelihood pθ(X).
Using Bayes’ theorem,
pθ(x,z) = Pr(Z=z|θ)Pr(X=x|θ,Z=z).
In case of mixture distributions for example, this leads to simple up-
date equations of the EM algorithm (see Section 5.2and Section 5.3).
The parameter estimation problem given the incomplete observa-cf. Chapter 4.1
tion Xis to maximize the likelihood function
L(θ|X)=Pr(X|θ)=pθ(X,Z)∂Z
over the choice of θ. For the discussion of the EM algorithm we
furthermore need the distribution pX,θof the hidden values given the
incomplete observation. Using Bayes’ theorem,
pX,θ(z) = pθ(X,z)
pθ(X)=Pr(Z=z|X,θ). (22)
The EM algorithm is an iterative method that, given an initial es-In Chapter 7.3we
describe a simple
method of obtaining
initial parameters.
timate of the model parameters, tries to approximate the maximum
likelihood estimate by maximizing the expected complete data log-
likelihood. That is, it computes
θnew := argmax
θ
E
ZpX,θold
[lnPr(X,Z|θ)] .
The expectation is taken over the distribution of the hidden values
that is induced by the previous parameter estimate θold. The EM al-
gorithm in its general form is stated as Algorithm 5.1, where the con-
dition of the while loop may be filled in with any reasonable choice,
e.g., a convergence criterion or a fixed number of iterations.
Each iteration of the loop in Algorithm 5.1consists of two steps.
Step 1derives the distribution over the hidden values that is induced
by the previous parameter estimate. Fixing this distribution, the ex-
pectation over the hidden values that is to be maximized can be stated
as a function of the parameters θ. Therefore, the first step is called
72
5.1 convergence of the em algorithm
the expectation step. However, no particular expectation is computed
in this first step. In applications of the EM algorithm for concrete
statistical models, the derived distribution is often discrete as for ex-
ample for Gaussian mixtures. Then, the implementation of the expec-
tation step usually consists of precomputations of the corresponding
density function for possible values of Z. In case of Gaussian mixture
models, it is sufficient to precompute the probability that a certain
point was generated by a certain component for each point and each
component distribution (see Section 5.3). Step 2is called the maxi-
mization step and computes the new set of parameters by maximizing
the expectation over the choice of θ.
For the EM algorithm to be applicable, we have to assume that the
maximization in Step 2is computationally tractable. In practice, this
is often the case (e.g., for mixtures of distributions from the expo-
nential family). There also exist computationally simpler variants of
the EM algorithm that may be used if the maximization step is not
(efficiently) computable. The General EM (or GEM) algorithm for
example tries to only increase the expectation in the second step (cf.
[Bishop, 2006]). Furthermore, there exist probabilistic variants like
the SEM algorithm discussed in Chapter 6. The SEM algorithm sim-
plifies the computation by avoiding to deal with the expectation at
all. Instead, it simply guesses the hidden values which considerably
simplifies the maximization step.
To analyze the convergence properties of the EM algorithm in the
following section, we assume that two preconditions are met. First,
as mentioned above, the maximization step has to be computationally
tractable. Second, the log-likelihood function for the considered sta-
tistical model and the expectation from the maximization step both
have to be differentiable. The above mentioned and frequently used
mixtures of distributions from the exponential family satisfy these
preconditions.
5.1convergence of the em algorithm
The parameter vectors computed by the EM algorithm converge to a
stationary point of the likelihood function. To see this, we here adopt
the line of argument in Chapter 2.3of [Bishop, 2006].
In this section, we denote the log-likelihood function given an in-
complete observation Xby lnpθ(X)as a function of θ. Note that cf. Equation (21)
applying the logarithm does not change the stationary points.
Theorem 5.1.Let θ0be the initial parameter estimate and for i > 0 let θi
be the parameter estimate θcomputed by Algorithm 5.1in the i-th iteration
of Step 2. Then, for increasing ithe log-likelihood of θiconverges to a
stationary point of lnpθ(X).
At first sight, the significance of Theorem 5.1may appear low. It
does not even guarantee the convergence to a local maximum, let
73
the classical em algorithm
lnpθ(X)
EZq[lnpθ(X,Z)]+H(q)q=pX,θ
KL(qpX,θ)
Figure 19: Decomposition of lnpθ(X)over the choice of q.
alone the convergence to a global one. However, in general we can
not hope for more since it is known that the EM algorithm can get
stuck on a saddle point of the likelihood function (see Chapter 3.6of
[McLachlan and Krishnan, 2008]). Moreover, since a global maximum
does not even have to exist (cf. Chapter 4.3.2), this does not have to
be an undesirable property. In practice, the local maxima may be the
original target. And even a saddle point can be of great interest when
the local maxima fall victim to the problem of overfitting mentioned
in Chapter 4.1.
To prove Theorem 5.1, we use the following lemma, which decom-
poses the log-likelihood of the incomplete data as shown in Figure 19.
Note that the complete data log-likelihood is denoted by lnpθ(X,Z)
and recall that pX,θdenotes the distribution over the hidden valuescf. Equation (22)
that is induced by the incomplete observation Xand the parameter
estimate θ.
Lemma 5.2.Let qbe an arbitrary distribution over the hidden values Z.
Then, the log-likelihood of a parameter vector θgiven an observation Xcan
be written as
cf. Figure 19 lnpθ(X) = E
Zq[lnpθ(X,Z)]+H(q)+KL(qpX,θ).
Since the Kullback-Leibler divergence is always non-negative and
it is zero if and only if both densities are identical (see Chapter 4.4.2),
Lemma 5.2immediately yields the following corollary.
Corollary 5.3.Let qbe an arbitrary distribution over the hidden values Z.
Then, the log-likelihood of a parameter vector θgiven an observation Xcan
be lower bounded by
lnpθ(X)E
Zq[lnpθ(X,Z)]+H(q).
Furthermore, the bound holds with equality if and only if q=pX,θ.
74
5.1 convergence of the em algorithm
Proof of Lemma 5.2.Using the definition of the Kullback-Leibler diver-
gence (see Chapter 4.4.2),
lnpθ(X) KL(qpX,θ)
=lnpθ(X) + q(Z)ln(pX,θ(Z)
q(Z))∂Z
=q(Z)(lnpθ(X) + ln(pX,θ(Z)
q(Z)))∂Z
=q(Z)ln(ln(pθ(X)pX,θ(Z))
q(Z))∂Z.
From the definition of pX,θin Equation (22), we deduce
pθ(X)pX,θ(Z) = pθ(X,Z).
Thus,
lnpθ(X) KL(qpX,θ)
=q(Z)ln(lnpθ(X,Z)
q(Z))∂Z
=q(Z)lnpθ(X,Z)∂Z q(Z)lnq(Z)∂Z
=E
Zq[lnpθ(X,Z)]+H(q),
where the last equality follows using the definitions of the expectation
and of the entropy (see Chapter 4.4.2).
Proof of Theorem 5.1.In the following, we consider a single iteration of This proof also
follows Chapter 2.3
in [Bishop, 2006].
Algorithm 5.1and show that the parameter updates do not decrease
the likelihood. Furthermore, we argue that the likelihood remains
constant only if it already reached a stationary point. We denote
the parameters before the considered round by θold and the updated
parameters by θnew.
Each round of the EM algorithm starts with the selection of the
distribution q=pX,θold over the hidden values Z, where
pX,θold(Z) = Pr(ZX,θold)
as specified in Algorithm 5.1. Based on this distribution, we define
fX,θold(θ) := E
ZpX,θold
[lnpθ(X,Z)]+H(pX,θold ).
Then, Corollary 5.3yields
lnpθ(X)fX,θold (θ), (23)
i.e., fX,θold is a lower bound for lnpθ. Furthermore,
lnpθold (X) = fX,θold (θold)(24)
75
the classical em algorithm
lnpθ(X)
fX,θold(θ)
θold θnew
Figure 20: Optimization of lnpθ(X)over the choice of θ.
as sketched in Figure 20.
Since H(pX,θold)is independent of θ, in order to maximize fX,θold(θ)
over the choice of θ, it is sufficient to maximize EZpX,θold [lnpθ(X,Z)].
That is, the maximization step of Algorithm 5.1computes
θnew := argmax
θ
fX,θold(θ)
as shown in Figure 20. In particular,
fX,θold(θnew)fX,θold (θold) = lnpθold(X),
where the equality at the end follows from Equation (24). Using In-
equality (23) with θ=θnew,
lnpθnew(X)lnpθold (X).
It remains to show that
lnpθnew(X)>lnpθold (X)
if θold is not a stationary point of lnpθ(X)as a function of θ. Since
the gradient of g(θ) := lnpθ(X)is zero if and only if θis a stationary
point of g,
∂g(θ)
∂θ (θold)=0.
However, from Inequality (23) and Equation (24) it follows that g
touches fX,θold at the point θold as shown in Figure 20. Therefore, for
the gradient of fX,θold at the point θold we get
∂fX,θold(θ)
∂θ (θold) = ∂g(θ)
∂θ (θold)=0.
It follows that θold does not maximize fX,θold and thus
fX,θold(θnew)> fX,θold(θold).
Analogously to the derivation of non-decrease of the likelihood, we
deduce
lnpθnew(X)>lnpθold (X).
76
5.2 the em algorithm for mixture distributions
5.2the em algorithm for mixture distributions
In this section, we discuss the application of the EM algorithm to
general mixture distributions as introduced in Chapter 4.2. Let
θ= (w1,. . . ,wk,θ1,. . . ,θk)
be the parameter vector of the mixture, i.e., w1,. . . ,wkR+are the
weights and θ1,...,θkare the parameter vectors of the component
distributions. Then, the probability of a single observation X=xis
pθ(x) =
k
j=1
wjpθj(x).
In order to apply the EM algorithm, we have to decide on the hid-
den values Z. That is, we have to carefully choose which unobserved
information would help us to solve the parameter estimation problem.
Recall that drawing a single observation from the mixture can be de-
scribed as a two step process. First, draw the j-th component with
probability wj, and second, draw xwith probability pθj(x). Based on
this interpretation, we define Z:= (zj){0,1}kwith k
j=1zj=1to be
the vector of binary random variables indicating which component
was chosen. Then, for j=1,. . . ,kwe get Pr(zj=1θ)=wj.
The probability of an observation X= (x1,...,xn)consisting of n
Nindependent draws from the mixture distribution is
pθ(x1,...,xn) =
n
i=1
k
j=1
wjp(xi|θj).
We define Z:= (zij){0,1}n×kwith k
j=1zij =1for i=1,...,nto
be the binary matrix indicating which component is responsible for
each sub-observation. In the mixture case, this indicator matrix Zis
the natural choice for the hidden values.
For i=1,...,nand j=1,...,kwe get Pr(zij =1θ)=wj. If we
denote the i-th row of the indicator matrix by zi, then the probability
to observe a particular row zican be written as
Note that all but one
summand of this
sum are zero.
pθ(zi) =
k
j=1
zijwj.
Using this indicator matrix Z, the log-likelihood of θgiven the com-
plete data (X,Z)is given by
lnPr(X,Z|θ)=
n
i=1
lnpθ(xi,zi) =
n
i=1
ln(pzi,θ(xi)pθ(zi))
=
n
i=1
lnpzi,θ(xi) +
n
i=1
lnpθ(zi).
77
the classical em algorithm
Note that this decomposition is very useful since the two sums de-
pend on disjoint parts of the parameter vector θ. The first sum can be
written as
Here, all but one
summand of the
inner sum are zero.
n
i=1
lnpzi,θ(xi) =
n
i=1
ln
k
j=1
zijpθj(xi),
while the second sum can be written as
n
i=1
lnpθ(zi) =
n
i=1
ln
k
j=1
zijwj.
That is, the first sum only depends on θ1,. . . ,θkand the second sum
only depends on w1,...,wk. With the linearity of the expectation,
it follows that the maximization step of the EM algorithm can be
considered as two separate maximizations.
Obviously, without specifying the type of the component distribu-
tions, nothing can be said about the updates of the EM algorithm
for the parameter vectors θ1,. . . ,θk. However, there do exist closed
formulas for the updates of the weights w1,...,wk. Given an obser-
vation Xand a parameter vector θ, we define the responsibility rjof
the j-th component to be the expected number of observations drawn
from the j-th component distribution, i.e.,
rj:=
n
i=1
E
ZpX,θ[zij]=
n
i=1
Pr(zij =1X,θ).
By applying Bayes’ theorem twice, the responsibilities can be expressed
as
rj=
n
i=1
Pr(xi,zij =1θ)
Pr(xi|θ)
=
n
i=1
Pr(zij =1θ)Pr(xizij =1,θ)
Pr(xi|θ)
=
n
i=1
wjpθj(xi)
k
=1wpθ(xi).
That is, the responsibilities are computable by evaluating the compo-
nent distributions and using the current parameters θonly. The EM
updates for the weights are the normalized responsibilities, i. e., for
j=1,. . . ,k,For a derivation of
these updates see
[Bilmes, 1997]. wnew
j:= rj
n.
5.3the em algorithm for gaussian mixtures
We finish our introduction to the EM algorithm with an application
to Gaussian mixture models (cf. Chapter 4.3). Let
θ= (w1,. . . ,wk,µ1,. . . ,µk,Σ1,. . . ,Σk)
78
5.3 the em algorithm for gaussian mixtures
be the parameter vector of the Gaussian mixture, i. e., w1,. . . ,wk
R+are the weights, µ1,. . . ,µkRdare the means of the component
Gaussians and Σ1,. . . ,ΣkRd×dare the corresponding covariance
matrices. As discussed in Section 5.2, the EM updates of the weights
are given by
wnew
j:= rj
n=1
n
n
i=1
Pr(zij =1X,θ)
=1
n
n
i=1
wjNµj,Σj(xi)
k
=1wNµ,Σ(xi)
for j=1,. . . ,k, where rjis the responsibility of the j-th component as
defined in Section 5.2.
The derivation of the update equations for the means and the co-
variance matrices is rather elaborate and can for example be looked
up in [Bilmes, 1997]. For the j-th mean, the update equation is
µnew
j:= n
i=1Pr(zij =1X,θ)xi
n
i=1Pr(zij =1X,θ)
=1
rj
n
i=1
wjNµj,Σj(xi)xi
k
=1wNµ,Σ(xi).
That is, the contribution of each point to the j-th component mean
is proportional to its probability in the j-th component in relation
to its overall probability. Finally, for the j-th covariance, the update
equation is
Σnew
j:= n
i=1Pr(zij =1X,θ)(xiµnew
j)(xiµnew
j)T
n
i=1Pr(zij =1X,θ)
=1
rj
n
i=1
wjNµj,Σj(xi) (xiµnew
j)(xiµnew
j)T
k
=1wNµ,Σ(xi).
79
6
THE STOCHASTIC EM ALGORITHM
In this chapter, we study a probabilistic variant of the EM algorithm
that was originally proposed as the Stochastic EM or SEM algorithm
[Celeux and Diebolt, 1985]. The algorithm replaces the maximization
step of the EM algorithm with two simple steps. First, it guesses
the hidden values by sampling a fixed instance. Second, it maxi-
mizes the complete-data likelihood over the choice of the parameter
vector. Depending on the application, this simplification yields a con-
siderable speedup. Furthermore, there exist applications where the
maximization step of the classical EM algorithm is computationally
intractable (e.g., when it involves high dimensional integrations). In
these cases, the simplified maximization step may lead to practicable
update equations. However, the focus of this chapter lies on a com-
parison between the computations of the classical EM algorithm and
its probabilistic variant. We discuss only statistical models for which
both algorithms are applicable.
The models generated by the Stochastic EM algorithm are studied
in [Ip, 1994]. For mixtures of distributions from the exponential fam-
ily, the author shows that the sequence of models generated by the
SEM iterations is an ergodic Markov chain converging weakly to a
stationary distribution over models. Furthermore, the mean of this
stationary distribution is analyzed under appropriate assumptions.
It is shown that for the number of input points going to infinity, this
mean converges to the maximum likelihood estimate. However, the
mean of the stationary distribution usually can not be obtained by
a single run of the algorithm. Instead, a large number of restarts is
necessary to retrieve a reasonable approximation of the mean distri-
bution.
Practitioners are often satisfied with the solutions computed by the
EM algorithm. Therefore, we analyze the SEM algorithm in relation
to the classical EM algorithm. In contrast to the previous analysis of
the SEM algorithm that focused on stochastic properties of the above
mentioned Markov chain, we study a single run of the algorithm.
In our approach, we look at it as a probabilistic algorithm that tries
to imitate the behavior of a deterministic algorithm while using its
randomness to speed up the computations. Our results suggest that
in most cases where the classical EM algorithm is applicable, one
81
the stochastic em algorithm
Algorithm 6.1:Probabilistic EM algorithm
Input: incomplete observation X, initial model θ
while <condition> do
1q(Z) Pr(Z|X,θ)// for hidden r.v. Z
2adraw zq(Z=z)
2bmaximize Pr(X,Z=z|θ)w.r. t. θ
end
return θ
can use the SEM algorithm to obtain similar results more efficiently.
Moreover, in our experiments discussed in Chapter 7, we observe that
sometimes, compared with the EM algorithm, the SEM algorithm
computes parameters with a larger likelihood. This is most likely
due to its inherent randomness that allows the algorithm to escape
from a saddle point or an undesired local maximum of the likelihood
function (cf. Chapter 5.1). Note that the algorithm is hypothetically
able to even guess the ’true’ hidden values, although this generally
happens only with negligible probability.
In Section 6.1, we present the generic SEM algorithm. Afterwards,
in Section 6.2, we show that general mixture distributions are suitable
candidates for the application of the SEM algorithm. In Section 6.3,
we compare the updates of the EM and SEM algorithm for Gaussian
mixture models. We consider a single update step and show that with
high probability the algorithms perform almost the same computa-
tions for sufficiently large input data sets. Our experimental results
(see Chapter 7.4) confirm that this still holds for a large number of
successive steps. Moreover, we show that for Gaussian mixtures, the
simplified maximization step of the SEM algorithm leads to consid-
erably better running times compared to the classical EM algorithm.
This result is confirmed by our experiments as well.
6.1the generic sem algorithm
The SEM algorithm is stated as Algorithm 6.1. The algorithm differs
from the classical EM algorithm in the second step. It replaces the
maximization in Step 2of Algorithm 5.1with two alternate steps (cf.
Step 2a and Step 2b of Algorithm 6.1). Instead of operating with
the expectation over the choice of Z, it simply guesses the hidden
values. This is done in Step 2a by sampling an instance for Zusing the
distribution qfrom Step 1. Afterwards, the complete-data likelihood
given (X,Z)is maximized over the choice of θin Step 2b.
82
6.2 the sem algorithm for mixture distributions
Algorithm 6.2:SEM algorithm for mixtures
Input: incomplete observation X= (x1,...,xn),
initial model θ= (w1,...,wk,θ1,...,θk)
while <condition> do
1q(Z) Pr(Z|X,θ)// for indicator matrix Z
2adraw Z= (zij)q(Z)
2bfor j=1,. . . ,k do
wnew
j 1
nn
i=1zij
θnew
j argmaxθjPr(Yjθj)
where Yjis a vector of all xiwith zij =1
end
end
return θ
6.2the sem algorithm for mixture distributions
A crucial question concerning the SEM algorithm is for which class
of models it should be applied. In this section we show that mixture
distributions are suitable candidates. There is only one necessary
precondition, namely that for each component distribution separately,
the maximum likelihood estimate is computationally tractable.
The resulting SEM algorithm for mixture distributions is stated as
Algorithm 6.2. As for the EM algorithm, we denote the incomplete
observation by X= (x1,...,xn)and the model parameters by
θ= (w1,...,wk,θ1,...,θk).
Again, we choose the indicator matrix (zij){0,1}n×kas the hidden cf. Chapter 5.2
values Z. Just as in case of the EM algorithm, the maximization step
can be carried out separately for the weights w1,...,wkand for the
component parameters θ1,. . . ,θk. To do this, we define n1,...,nkto
be the number of data elements that are assigned to each component
in Step 2a, i.e.,
nj=
n
i=1
zij and
k
j=1
nj=n.
Then, for the weight updates we simply compute the fraction of the In Section 6.3.2, we
show that these
updates are good
approximations of
the EM updates.
input data that is assigned to each component, i.e.,
wnew
j:= nj
n.
Furthermore, by YjXnjwe denote the vector of elements xifor
i=1,...,nwith zij =1. Then, for the updates of θ1,...,θk, we
maximize the probability pθj(Yj)over the choice of θjfor j=1,...,k.
83
the stochastic em algorithm
6.2.1correctness
In the following, we show that Algorithm 6.2is a correct instantia-
tion of Algorithm 6.1for the mixture case. That is, we show that
computing
wnew
j:= 1
n
n
i=1
zij and θnew
j:= argmax
θj
Pr(Yjθj)
does indeed maximize Pr(X,Z=z|θ). To prove the correctness of the
weight updates, we use the following lemma. Note that for kN,
we denote the standard (k1)-simplex by
Sk1=
(p1,. . . ,pk)Rk
+
k
j=1
pj=1
.
Lemma 6.1.Let k,n,n1,...,nkNwith k
j=1nj=n. Then, the function
f:Sk1Rwith
f(p) = f(p1,...,pk) =
k
j=1
njln(pj)
is maximized by p= (n1
n,...,nk
n).
Proof. Let q= (q1,. . . ,qk)with qj=nj
nfor j=1,. . . ,k. Instead of f,
we maximize
f(p)
n+H(q)=
k
j=1
qjlnpj
k
j=1
qjlnqj
=
k
j=1
qjln pj
qj
= KL(qp)
where H(·)denotes the entropy and KL(··)denotes the Kullback-
Leibler divergence (see Chapter 4.4.2). Then, the lemma follows from
the non-negativity of the Kullback-Leibler divergence and the fact
that KL(qp)=0if and only if p=q.
Proposition 6.2.Let θ= (w1,. . . ,wk,θ1,. . . ,θk)be the parameter vector
of a mixture model. Then, for any incomplete observation X= (x1,. . . ,xn)
with the hidden indicator matrix Z= (zij), the complete-data likelihood
L(X,Z|θ)is maximized over the choice of θby setting
wj=1
n
n
i=1
zij and θj=argmax
θ
j
Pr(Yjθ
j)
for j=1,...,k, where Yjis a vector of all xiwith zij =1.
84
6.2 the sem algorithm for mixture distributions
Proof. We denote the new model computed by Algorithm 6.2in some
fixed round by θX,Z. Since
Pr(X,Z|θ)=Pr(Z|θ)Pr(X|Z,θ),
it is sufficient to show that θX,Zseparately maximizes Pr(Z|θ)as well
as Pr(X|Z,θ).
Recall that the i-th row of the indicator matrix (zij)is a binary unit
vector indicating which component is responsible for the point xi.
That is, zij {0,1}with k
j=1zij =1and
Pr(zij =1θ)=wj=
k
j=1
zijwj.
We conclude
Pr(Z|θ)=
n
i=1
k
j=1
zijwj=
k
j=1
wnj
j.
Then,
lnPr(Z|θ)=
k
j=1
njln(wj)
and applying Lemma 6.1yields that θX,Zmaximizes Pr(Z|θ).
It remains to show that θX,Zmaximizes Pr(X|Z,θ). However, due
to the independence of the observations,
lnPr(X|Z,θ)=
n
i=1
lnPr(xi|Z,θ)
=
n
i=1
k
j=1
zij lnPr(xiθj)
=
k
j=1
lnPr(Yjθj).
6.2.2limitations
Strictly speaking, for the maximization step to be well defined, we
have to assure that njζfor j=1,. . . ,kfor some model dependent
threshold ζN. That is, at least ζobservations have to be assigned
to each component distribution. Otherwise, it might not be possible
to determine the new parameters. This is obvious for nj=0at least,
since generally, without any data, there is no basis for parameter esti-
mation. For Gaussian mixture models, we discuss the threshold ζin
Section 6.3.
If nj< ζ, it might be a good idea to drop the under-determined
component and to replace it. This can be done by drawing new com-
ponent parameters from an a-priori distribution, for example.
85
the stochastic em algorithm
6.3the sem algorithm for gaussian mixtures
In this section we show that for mixtures of Gaussians, the EM algo-
rithm and the SEM algorithm compute almost the same parameter
updates. More precisely, we consider multivariate Gaussian mixtures
over the d-dimensional vector space Rd.
Recall from Chapter 4.3that for a mixture of kNmultivariate
Gaussians over Rdthe parameters of each component consists of a
mean µRdand a covariance matrix ΣRd×d. That is, the parame-
ter vector of the model takes the form
θ= (w1,...,wk,µ1,...,µk,Σ1,. . . ,Σk).
As usual, we denote the incomplete observation by X= (x1,. . . ,xn)
with xiRdand choose the hidden values Zto be the indicator
matrix (zij){0,1}n×k. Furthermore, based on the distribution q
derived in Step 1of Algorithm 6.2, we denote the probability that xi
was generated by the j-th component by
The expectation of
a binary random
variable is simply
the probability that
it becomes 1.
pij := Pr(zij =1X,θ)=E
Zq[zij]. (25)
Then, as discussed in Chapter 5.3, the classical EM algorithm in each
step deterministically computes the following updates for j=1,. . . ,k:
wEM
j:= 1
n
n
i=1
pij,µEM
j:= n
i=1pijxi
n
i=1pij
,
ΣEM
j:= n
i=1pij(xiµEM
j)(xiµEM
j)T
n
i=1pij
. (26)
After sampling the indicator matrix (zij){0,1}n×k, the Probabilis-
tic EM algorithm computes the following updates for j=1,...,k:
wSEM
j:= 1
n
n
i=1
zij,µSEM
k:= n
i=1zijxi
n
i=1zij
,
ΣSEM
j:= n
i=1zij(xiµSEM
j)(xiµSEM
j)T
n
i=1zij
, (27)
where µSEM
jand ΣSEM
jare the maximum likelihood solution for
the parameter estimation problem of a single Gaussian distribution
with respect to the points assigned to the j-th component (cf. Chap-
ter 4.3.1). Note that by writing zij as
zij =Pr(zij =1Z)=Pr(zij =1X,θ,Z),
the SEM update equations can be interpreted as to arise from the
EM update equations by incorporating the additional knowledge of
which component each point was generated by. In contrast to the EM
algorithm, the parameters computed by the SEM algorithm are in fact
random variables with respect to the random experiment of drawing
the zij. This offers the alternate interpretation that the EM updates
arise from the SEM updates by replacing the indicator matrix with its
expectation (cf. Equation (25)).
86
6.3 the sem algorithm for gaussian mixtures
6.3.1limitations
As already observed in Section 6.2.2, strictly speaking, we have to
assure that |Yj|=njis large enough for each j=1,. . . ,k. In case
of multivariate Gaussians, we need that |Yj|d+1at least. Other-
wise, the covariance matrix ΣSEM
jis not symmetric positive definite.
Assuming that the observations x1,. . . ,xnRdare given in general
linear position, assigning at least d+1points to each Gaussian is
sufficient, though.
6.3.2proximity of the update equations
In this section, we provide the main results of this chapter. The theo-
rems presented below provide bounds for the differences between the
parameter updates as computed by the EM and SEM algorithm. As
one would expect, it can not be shown that the differences between
the mean updates are small if the corresponding weight updates dif-
fer considerably. The same holds for the differences between the co-
variance updates. Furthermore, it can not be shown that the differ-
ences between the covariance updates are small if the corresponding
mean updates differ considerably. A difference only in one coordi-
nate of the mean updates already leads to useless bounds for the
corresponding entries of the covariance matrix. Therefore, we first
provide the bound for the weight updates. Afterwards, we state the
bound for the mean updates which depends on the differences be-
tween the weight updates. Finally, we close with the bound for the
covariance updates. This bound depends on the differences of the
weight updates as well as on the differences between the mean up-
dates.
Before we state the theorems, we provide some preliminary defi-
nitions. For vectors vRd, we denote the -th coordinate of vby
(v). For matrices MRd×d, we denote the (1,2)-th entry of Mby
(M)12. Furthermore, we define the spread of the -th coordinate of
the input data set by
:= max
i(xi)min
i(xi). (28)
As in Chapter 5.2, we define the overall responsibility of the j-th Gaus-
sian component by
rj:=
n
i=1
pij.
In the following, we develop the quantities that will be used to
measure the differences between the mean and covariance updates
computed by the EM and SEM algorithm. For j{1,...,k}, the differ-
87
the stochastic em algorithm
ence between the computed means for the j-th component Gaussian
is
µSEM
jµEM
j=n
i=1zij(xiµEM
j)
n
i=1zij
.
We ignore the normalization factor n
i=1zij in the denominator. Then,
each summand zij(xiµEM
j)of the numerator is a random variable
that corresponds to the contribution of xito the difference between
µSEM
jand µEM
j. For {1,...,d}, the variance in the -th coordinate
of this contribution is
Var (zij(xiµEM
j))=pij(1pij)(xiµEM
j)2
.
Since the zij are independent random variables, the variance in the
-th coordinate of the overall contribution of all points is
Var(n
i=1
zij(xiµEM
j))=
n
i=1
pij(1pij)(xiµEM
j)2
.
In our analysis of the proximity of the SEM updates, we measure
the difference in the -th coordinate of the j-th mean in terms of the
corresponding standard deviation
τjℓ := v
u
u
t
n
i=1
pij(1pij)(xiµEM
j)2
. (29)
Analogously, for the difference in the (1,2)-th entry of the j-th co-
variance, we use
ρjℓ12:= v
u
u
t
n
i=1
pij(1pij)(Υij ΣEM
j)2
12
, (30)
where Υij = (xiµEM
j)(xiµEM
j)T.
To bound the difference between the updates of the EM algorithm
and the SEM algorithm, we state three separate theorems. The first
one bounds the difference of the weight updates for a single compo-
nent distribution. The theorem holds for Gaussian mixtures as well
as for general mixture models (cf. Section 6.2and Chapter 5.2).
Theorem 6.3(Proximity of weights).Let j{1,...,k},δRwith
2erj/3δ1, and λw=3ln 2
/δ
rj. Then, with probability of at least 1δ,
wSEM
jwEM
jλwwEM
j.
The next theorem bounds the difference in a single coordinate of
the mean updates for a single component. One can not expect good
estimates for the means if the weights are not approximated well.
Thus, in the second theorem we assume that the corresponding weight
is already approximated with an accuracy of λw. The derived bound
for the computed mean depends on λwby a factor of 1
1λw.
88
6.3 the sem algorithm for gaussian mixtures
Theorem 6.4(Proximity of means).Let j{1,...,k}and let Abe the
event that
wSEM
jwEM
jλwwEM
j, (31)
where 0 < λw< 1.
Then, for {1,. . . ,d}and 0<δ<1, the conditional probability of
(µSEM
jµEM
j)λµ
1λw·τjℓ
rj
given the event Ais at least 1δfor
λµ=
2e ln2
/δif τjℓ
1
e2e ln2
/δ
2∆
τjℓ ln2
/δotherwise.
The third theorem is the equivalent of Theorem 6.4for a single
entry of the covariance matrix. Similar to Theorem 6.4, the derived
bound depends on the weight accuracy λw. Additionally, we assume
that the -th coordinate of the corresponding mean is already approxi-
mated with an accuracy λ
µ. The bound splits into a sum of two terms.
The first term is the analogon of the bound for the mean from Theo-
rem 6.4. The second term is the error that arises from the accuracy of
the mean estimates.
Theorem 6.5(Proximity of covariances).Let j{1,...,k}and let Abe
the event that
wSEM
jwEM
jλwwEM
j(32)
where 0<λw< 1. Furthermore, let 1,2{1,...,d}and let Bbe the event
that for =1,2,
(µSEM
jµEM
j)λ
µ
1λw·τjℓ
rj
(33)
where λ
µ> 0.
Then, for 0<δ<1, the conditional probability of
(ΣSEM
jΣEM
j)12λΣ
1λw·ρjℓ12
rj
+λ1
µλ2
µ
(1λw)2·τjℓ1τjℓ2
r2
j
given the events Aand Bis at least 1δfor
λΣ=
2e ln2
/δif ρjℓ12
12
1
e2e ln2
/δ
2∆12
ρjℓ12ln2
/δotherwise.
To derive a result for the complete update, one has to combine
all three theorems using the union bound. Note that all three bounds
depend on the responsibilities rj. As one would expect due to the law
of large numbers, the accuracy of our bounds improves with growing
rj.
89
the stochastic em algorithm
6.3.3limitations
Note that, generally speaking, it is not possible to approximate a mix-
ture component with too small a weight. Thus, we cannot expect
to derive proximity bounds for the SEM updates that do not use as-
sumptions on the weights, which correspond to the responsibilities
rj=n·wEM
j. The bounds from Theorem 6.4and Theorem 6.5both
depend on 1
1λw. That is, these bounds become arbitrarily large for
λwnear 1. Therefore, to get reasonable results, we need λwto be con-
siderably small. For instance, to ensure λw1
2, the requirements for
λwin Theorem 6.3yield rj12 ·ln(2
/δ). In other words, given a data
set of size n, all weights have to be at least 12·ln(2
/δ)
n.
6.3.4on chernoff bounds
In the following, we state and prove several Chernoff type bounds.
Chernoff bounds are a standard tool from probability theory for the
derivation of concentration results. Readers who are familiar with
this technique may skip this section or at least the given proofs.
We start with an elementary Chernoff bound. A proof can be foundSee Corollary 4.6in
[Mitzenmacher and
Upfal, 2005]. in [Mitzenmacher and Upfal, 2005], for example.
Lemma 6.6.Let X1,. . . ,Xnbe independent random variables over {0,1}and
let Y=n
i=1Xi. Then, for each 0λ1,
Pr(|YE[Y]|λ·E[Y]) 2eE[Y]λ2
3.
As a corollary, we get the following lemma which bounds the devi-
ation for a given probability of occurrence. This bound will be used
for the difference between the weight updates later.
Lemma 6.7.Let X1,. . . ,Xnbe independent random variables in {0,1}and
let Y=n
i=1Xi. Then, for 2eE[Y]
3δ1and λ=3ln2
/δ
E[Y],
Pr(|YE[Y]|λ·E[Y]) δ.
To bound the differences between the mean and between the co-
variance updates, we need a more elaborate Chernoff type bound.
Lemma 6.8.Let X1,...,Xnbe independent, discrete random variables withThis Lemma and its
proof are based on
[Levchenko, 2013]. E[Xi]=0and |Xi|Cfor some constant C > 0 and i=1,...,n. Further-
more, let Y=N
n=1Xi. Then, for any λ0,
Pr(|Y|λVar(Y))2eλ2
2ea
where a0such that λ=aeaVar(Y)
C.
Analogously to the relation between Lemma 6.6and Lemma 6.7,
we use Lemma 6.8to derive a bound for a given probability of occur-
rence.
90
6.3 the sem algorithm for gaussian mixtures
Lemma 6.9.Let X1,. . . ,Xnbe independent, discrete random variables with
E[Xi]=0and |Xi|Cfor some constant C > 0 and i=1,. . . ,n. Further-
more, let Y=N
n=1Xiand 0<δ<1. Then,
Pr(|Y|λVar(Y))δ
for
λ=
2e ln2
/δif Var(Y)
C1
e2e ln2
/δ
2C
Var(Y)ln2
/δotherwise.
In the remainder of this section, we prove Lemma 6.8and Lemma 6.9.
Proof of Lemma 6.8.First, note that we will naturally assume Var(Y) = For Var(Y) = 0,
there is no need
for Lemma 6.8.
n
i=1Var(Xi)> 0. Then, for any λ0, there exists an a0with
λ=aeaVar(Y)
C. Due to symmetry, we only prove
Pr(YλVar(Y))eλ2
2ea.
By Markov’s inequality, for any t > 0 we obtain
Pr(YλVar(Y))=Pr(etY eVar(Y))E[etY]
eVar(Y). (34)
Let xi1,xi2,. . . be the possible outcomes of the discrete random vari-
able Xi. We denote the corresponding probabilities by
πij := Pr(Xi=xij).
Then, by the definition of the expectation and the series expansion of
the exponential function,
E[etXi]=
j
πijetxij
=
j
πij (1+txij +1
2!(txij)2+1
3!(txij)3+...).
Separating the first two summands of the series expansion yields
E[etXi]
=
j
πij +t
j
πijxij +
j
πij (1
2!(txij)2+1
3!(txij)3+. . .)
=1+
j
πij(txij)2(1
2!+1
3!txij +1
4!(txij)2+...),
where the second equality follows from tjπijxij =tE[Xi]=0. Since
2
(m+2)!1
m!for mN,
E[etXi]1+
j
πij(txij)21
2etxij .
91
the stochastic em algorithm
In the following, we assume that tis chosen such that
atC. (35)
Then, etxij eaand it follows that
E[etXi]1+ea
2t2Var(Xi),
where we used
Var(Xi) = E[X2
i]E[Xi]2=E[X2
i]=
j
πij(txij)2.
Since X1,. . . ,Xnare independent,
E[etY]=
n
i=1
E[etXi]
n
i=1(1+ea
2t2Var(Xi)).
Using 1+αeαfor α0, it follows that
E[etY]
n
i=1
eea
2t2Var(Xi)=eea
2t2Var(Y). (36)
Finally, we fix t:= λ
eaVar(Y). Then, tC =aand Inequality (35) is
satisfied as required. Substituting tin Inequality (36) yields
E[etY]eλ2
2ea.
Furthermore,
eVar(Y)=eλ2
ea.
Thus, E[etY]
eVar(Y)eλ2
2ea
and Inequality (34) yields the claim.
Proof of Lemma 6.9.As in the proof of Lemma 6.8, we assume Var(Y)>
0. First, let
Var(Y)
C1
e2e ln2
/δ
and
λ=2e ln2
/δ. (37)
Then, λeVar(Y)
C. Thus, we are able to choose 0a1, such that
λ=aeaVar(Y)
C. Applying Lemma 6.8,
Pr(|Y|λVar(Y))2exp (λ2
2ea).
By substituting Equation (37),
Pr(|Y|λVar(Y))2exp (2e ln2
/δ
2ea)=2(δ
2)e
ea
δ.
92
6.3 the sem algorithm for gaussian mixtures
It remains to consider the case where
Var(Y)
C<1
e2e ln2
/δ(38)
and
λ=2C
Var(Y)ln2
/δ. (39)
Let aR, such that λ=aeaVar(Y)
C. Then,
aea=2C2
Var(Y)ln2
/δ. (40)
By Inequality (38), C
Var(Y)>e
2e ln2
/δand we deduce
aea>2e2
2e ln2
/δln2
/δ=e.
It follows that a > 1. Applying Lemma 6.8,
Pr(|Y|λVar(Y))2exp (λ2
2ea).
By substituting Equation (39),
Pr(|Y|λVar(Y))2exp(4C2
Var(Y)(ln2
/δ)21
2ea).
Using Equation (40),
Pr(|Y|λVar(Y))2exp(2aealn2
/δ1
2ea)=2exp(aln2
/δ).
Since a>1,
Pr(|Y|λVar(Y))< 2 exp(ln2
/δ)=2(δ
2)=δ.
6.3.5proof of the proximity bounds
Using Lemma 6.7, we are able to prove Theorem 6.3.
Proof of Theorem 6.3.Let W=n
i=1zij =n·wSEM
j. Then,
E
Zq[W]=
n
i=1
pij =rj=n·wEM
j.
Thus, applying Lemma 6.7yields
n·|wSEM
jwEM
j|n·λwEM
j
with probability at most δ.
93
the stochastic em algorithm
Using Lemma 6.9, we are able to prove Theorem 6.4and Theo-
rem 6.5.
Proof of Theorem 6.4.Let j{1,...,k},{1,...,d}and define the real
random variable
Mijℓ := (zij pij)(xiµEM
j).
Since EZq[zij]=pij, it follows that EZq[Mijℓ]=0and
Var(Mijℓ) = pij(1pij)(xiµEM
j)2
.
Moreover, since each µEM
jis a convex combination of x1,. . . ,xn,
|Mijℓ||zij pij|·(xiµEM
j).
Note that the definition of µEM
jyields n
i=1pij (xiµEM
j)=0. Thus,
for the random variable Mjℓ := n
i=1Mijℓ, we deduce
Mjℓ =
n
i=1
zij (xiµEM
j).
Furthermore, EZq[Mjℓ]=0and
Var(Mjℓ) =
n
i=1
pij(1pij)(xiµEM
j)2
=τ2
jℓ.
Note that the standard deviation of Mjℓ is τjℓ, which was introduced
in Section 6.3.2to measure the difference between the mean updates.
Applying Lemma 6.9with C=and λµas provided in the theorem
yields
Pr(
n
i=1
zij (xiµEM
j)
λµτjℓ)δ. (41)
The difference between the coordinates of the mean updates can be
written as
(µSEM
jµEM
j)=
(n
i=1zijxin
i=1zijµEM
j)
n
i=1zij
=n
i=1zij (xiµEM
j)
n·wSEM
j
.
Then, by combining Inequality (41) and Inequality (31), we conclude
that the conditional probability of
(µSEM
jµEM
j)λµτjℓ
n(1λw)wEM
j
=λµτjℓ
(1λw)rj
given the event Ais at least 1δ.
94
6.3 the sem algorithm for gaussian mixtures
Proof of Theorem 6.5.The covariance updates of the EM and SEM algo- cf. Equation (26)
and Equation (27)
rithm depend on the corresponding mean updates. Consequently, the
difference between the covariance updates depends on the difference
between the mean updates. To measure the impact of µSEM
jµEM
j
on ΣSEM
jΣEM
j, let ^
ΣSEM
jbe the covariance that would be computed
by the SEM algorithm, if µSEM
j=µEM
j. That is,
^
ΣSEM
j:= n
i=1zij(xiµEM
j)(xiµEM
j)T
n
i=1zij
.
By substituting xiµEM
j=xiµSEM
j+µSEM
jµEM
j, we deduce
^
ΣSEM
j=n
i=1zij (xiµSEM
j)(xiµSEM
j)T
n
i=1zij
+n
i=1zij (xiµSEM
j)(µSEM
jµEM
j)T
n
i=1zij
+n
i=1zij (µSEM
jµEM
j)(xiµSEM
j)T
n
i=1zij
+n
i=1zij (µSEM
jµEM
j)(µSEM
jµEM
j)T
n
i=1zij
.
Note that the first summand is equal to ΣSEM
jand the last summand
is equal to (µSEM
jµEM
j)(µSEM
jµEM
j)T. Furthermore, both sum-
mands in the middle are zero, since
n
i=1zij (xiµSEM
j)
n
i=1zij
=n
i=1zijxi
n
i=1zij
µSEM
j=0.
By denoting ν:= µSEM
jµEM
j, we conclude
^
ΣSEM
j=ΣSEM
j+ννT.
Analogously to the proof of Theorem 6.4, we bound the difference
between the entries of ^
ΣSEM
jand ΣEM
j. To do this, we define the real
random variable
Sijℓ12:= (zij pij)(Υij ΣEM
j)12
where Υij := (xiµEM
j)(xiµEM
j)T. Since EZq[zij]=pij, it follows
that EZq[Sijℓ12]=0and
Var(Sijℓ12) = pij(1pij)(Υij ΣEM
j)2
12.
Moreover, since each ΣEM
jis a convex combination of Υ1j,. . . ,Υnj,
|Sijℓ12||zij pij|·(Υij ΣEM
j)1212.
95
the stochastic em algorithm
Note that the definition of ΣEM
jyields n
i=1pij (Υij ΣEM
j)12
=0.
Thus, for the random variable Sjℓ12:= n
i=1Sijℓ12, we deduce
Sjℓ12=
n
i=1
zij (Υij ΣEM
j)12.
Furthermore, EZq[Sjℓ12]=0and
Var(Sjℓ12) =
n
i=1
pij(1pij)(Υij ΣEM
j)2
12=ρ2
jℓ12.
Note that the standard deviation of Sjℓ12is ρjℓ12, which was intro-
duced in Section 6.3.2to measure the difference between the covari-
ance updates. Applying Lemma 6.9with C=12and λΣas pro-
vided in the theorem yields
Pr(
n
i=1
zij (Υij ΣEM
j)12
λΣρjℓ12)δ. (42)
The difference between the entries of the covariance updates can be
written as
(ΣSEM
jΣEM
j)12=(^
ΣSEM
jννTΣEM
j)12
=(n
i=1zijΥij
n
i=1zij
ννTn
i=1zijΣEM
j
n
i=1zij )12
=
n
i=1zij (Υij ΣEM
j)
n·wSEM
j
ννT
12
.
Using the triangle inequality,
(ΣSEM
jΣEM
j)12
n
i=1zij (Υij ΣEM
j)
n·wSEM
j
12
+(µSEM
jµEM
j)1·(µSEM
jµEM
j)2.
Then, by combining Inequality (42), Inequality (32) and Inequality (33),
we conclude that the conditional probability of
(ΣSEM
jΣEM
j)12λΣρjℓ12
n(1λw)wEM
j
+λ1
µλ2
µ
(1λw)2·τjℓ1τjℓ2
r2
j
=λΣ
(1λw)·ρjℓ12
rj
+λ1
µλ2
µ
(1λw)2·τjℓ1τjℓ2
r2
j
.
given the events Aand Bis at least 1δ.
96
6.3 the sem algorithm for gaussian mixtures
6.3.6running time analysis
When the EM and SEM algorithm are both applicable for a particular
statistical model, it is natural to ask why the SEM algorithm should
be favored. Besides its ability to escape from a stationary point of
the likelihood function as mentioned at the beginning of this chapter,
the SEM algorithm is particularly faster in general. In the following,
we explain where the speedup stems from. First, we motivate the
speedup for general mixture models. Afterwards, we show a constant
factor speedup for Gaussian mixture models.
Let X= (x1,...,xn)be the input data set and assume that it was
generated by a general mixture of kdistributions. The EM algorithm
maximizes the expected complete data log-likelihood. Usually, this
implies that each data point is involved in the update of all compo-
nents (see for example Equation 26). Therefore, the running time of
the EM algorithm depends on the term kn. In contrast, the SEM algo-
rithm assigns each point to exactly one component. Hence, each point
is involved in the update of a single component only (see for example
Equation 27). This observation suggests that the speedup of the SEM
algorithm might be up to a factor of k. However, for the sampling
step (Step 2a), the SEM algorithm has to compute the probability that
a certain point was generated by a certain component for each point
and each component distribution. That is, the running time of the
SEM algorithm depends on the term kn as well. Nevertheless, we
are able to show that regarding the parameter estimation problem of
Gaussian mixture models, there is a constant factor speedup. To do
this, we discuss the number of computations in a single iteration of
both algorithms.
For the sake of simplicity, we abstain from precisely charging all
computations. Instead, we consider only the multiplications in R.
Furthermore, we consider only the leading term of the number of per-
formed multiplications. That is, we focus on the computations that
entail the most multiplications in terms of the parameters n,dand k
for considerably large values of n,dand k. Not surprisingly, these
are the computations involving the covariance matrices. Furthermore,
we assume that nd.
Both algorithms start with computing the probabilities Pr(xiµj,Σj)cf. Equation (25)
for i=1,. . . ,nand j=1,...,k. The number of multiplications for
the computation of these probabilities is dominated by matrix-vector
multiplications between the d×dcovariance matrices Σjand the d-
dimensional vectors xi. Therefore, the leading term of the number of
multiplications for the common computations is knd2.
The leading term of the number of multiplications for the remain-
ing computations of the EM algorithm is dominated by the update
of the covariances. Each point is involved in the update of each com- cf. Equation (26)
ponent by the outer product of the point with itself. Therefore, the
97
the stochastic em algorithm
leading term of the number of remaining multiplications is knd2as
well. Thus, the overall number of multiplications of the EM algorithm
is dominated by 2knd2.
In case of the SEM algorithm, again the updates of the covariances
determine the leading term of the number of remaining multiplica-
tions. However, due to the hard assignment of the points, the leadingcf. Equation (27)
term does not depend on k. Instead, the number of multiplications
needed to update the covariance matrix is dominated by nd2. It fol-
lows that for n,kand dlarge enough, the number of multiplications
during the remaining SEM computations is negligible in comparison
to the number of multiplications during the common computations.
Thus, the overall running time of the SEM algorithm is dominated by
knd2.
This implies a factor 2speedup. However, depending on details
of the implementation, the actual dependency of the multiplications
during the common computations may better be assumed to be dom-
inated by c1knd2for some constant c1. Analogously, for the remain-
ing computations of the EM algorithm we assume a leading term of
the dependency of the multiplications of c2knd2for some constant
c2. This results in a speedup of a factor of 1+c2
c1. Indeed, our exper-
imental results confirm a constant factor speedup (see Chapter 7.4).
Moreover, the measured speedup tends a factor of 3which implies
c2> c1.
98
7
EXPERIMENTAL ANALYSIS
To underpin our proximity and running time analysis from Chapter 6,
we performed a series of experiments. In our experimental analysis,
we compare the computations of the classical EM algorithm and the
probabilistic SEM algorithm on a number of different data sets. The
data sets divide into several artificial and real world data sets. As in
the theoretical analysis, we consider the parameter estimation prob-
lem for Gaussian mixture models only.
7.1implementation
To get comparable results, we implemented the EM algorithm and
the SEM algorithm from scratch in C++. For the linear algebra com-
putations, we use the free C++ template library Eigen (available at
http://eigen.tuxfamily.org/). For the generation of artificial data
sets, to compute initial model parameters, and for the probabilistic
steps of the SEM algorithm, we need an efficient pseudo-random
number generator. Our implementation uses the Mersenne twister
[Matsumoto and Nishimura, 1998] specified as std::mt19937 in the
current C++ standard [ISO, 2012].
As discussed in Chapter 6.3.1, it is possible that the SEM algorithm
assigns too few points to a particular component to be able to com-
pute a new covariance. Although we disregard these cases in our
analysis, we have to consider them in our implementation. If no point
is assigned to a particular component, we solve the problem by sam-
pling a new mean uniformly from the input data set and initializing
the covariance matrix with a scaled identity matrix. More precisely,
set the covariance matrix to σ2Id, where we compute
σ2:= 1
2d min
i=jµiµj2
as proposed in [Dasgupta and Schulman, 2007] for the initial param-
eter estimation. If not enough points are assigned to a particular
component, we try to mix the under-determined covariance with the
previous covariance matrix. If this fails to compute a symmetric pos-
itive definite matrix, we simply keep the old covariance matrix. Due
to numerical instability, the EM algorithm has to deal with similar
99
experimental analysis
problems if the responsibility rjof a component becomes too small.cf. Chapter 5.3
Therefore, we implemented a similar error handling procedure for
the EM algorithm. In our experiments, these problems hardly ever oc-
curred. However, it must be said that the error handling may lead to
substantially different solutions of the EM and SEM algorithm. Our
experimental results presented in Section 7.4are not affected by the
error handling procedures.
7.2data sets
For our experiments, we used artificial as well as real world data sets.
For the generation of the artificial data sets, we considered different
combinations of the dimension dNand the number of component
distributions kN. For each combination, we probabilistically com-
puted several parameter vectors θ= (w1,. . . ,wk,µ1,. . . ,µk,Σ1,. . . ,Σk)
for Gaussian mixtures with kcomponents over Rd.cf. Chapter 4.3.2
For the determination of the weights w1,...,wkwe drew kreal
numbers uniformly at random from the interval [0,1]. To compute
balanced as well as unbalanced weights, these numbers were after-
wards exponentiated with the same integer constant chosen between
0and 3. Finally, the resulting numbers were normalized to retrieve
the weights.
For the determination of the means µ1,...,µkwe created a Gaus-
sian meta distribution from which all means were drawn. The mean
of the meta distribution was set to the origin and the covariance ma-
trix of the meta distribution was created probabilistically itself. To
do this, we first created a matrix Mby drawing each entry from a
one-dimensional Gaussian distribution with mean 0and variance k
d.
Afterwards, to retrieve the covariance matrix of the meta distribution,
we computed the product of the matrix Mwith its own transpose.
This generally results in a symmetric positive-definite covariance ma-
trix. The variance k
dof the one-dimensional Gaussian used to deter-
mine the entries of the matrix Mwas chosen to control the separation
of the created means µ1,. . . ,µk. The separation increases with in-
creasing k(more components need more space) and it decreases with
increasing d(more dimensions allow less separation in each of them).
Proceeding this way, we ensured that the mixtures mainly consist of
interfusing Gaussians. To get reasonable results it is important that
the points generated by different components of a Gaussian mixture
are not pairwise well separated. Otherwise, the task of learning the
parameters is too easy. Furthermore, the difference in the computa-The distance-based
clustering approach
discussed in Part I,
is a suitable method,
for example.
tions of the EM and SEM algorithm would become too small to be
verified in our experiments.
Finally, to determine the covariance matrices Σ1,. . . ,Σk, we pro-
ceeded as for the covariance matrix of the meta distribution used for
the computation of the means. The only difference is that we use a
100
7.2 data sets
Figure 21: Projection of one of the artificial data sets with n=
1 000 000. The ellipsoids depict the standard deviation ar-
eas of each component distribution. The larger the weight
of the component, the brighter is the shade of the corre-
sponding ellipsoid.
one-dimensional Gaussian with variance 1to draw the entries of the
matrix M.
For each θcomputed as above, we drew several point sets of differ-
ent size from the corresponding Gaussian mixture. Our experiments
for the different combinations of dand kled to essentially similar re-
sults. Therefore, in the following we discuss artificial data sets with
d=k=10 only. To confirm our proximity results from Chapter 6, we
created several data sets of size n=1 000 000 (cf. Figure 21). To in-
vestigate the behavior for small data sets, we also created data sets of
size n=10 000. In the remainder of this section, we consider two par-
ticular data sets that led to characteristic results, one for n=1 000 000
which we denote by Art1Mand one for n=10 000 which we denote
by Art10K. For both artificial data sets we computed Gaussian mix-
tures with the proper number of component distributions, i.e., with
k=10.
As real world data we use three publicly available data sets shown
in Figure 7.2. The first one is the Forest Covertype data set which is part
of the UCI Machine Learning Library [Asuncion and Newman, 2007]
and contains 581 012 data points. The data set contains cartographic
information about four wilderness areas located in the Roosevelt Na-
tional Forest. Each data point consists of cartographic features of
an area of 900 square meters in size. To get data suitable for Gaus-
sian mixture parameter estimation, we used only the quantitative at-
101
experimental analysis
tributes located at the first 10 coordinates (see Figure 22a). That is,
we ignored the labels describing which tree species dominates the
respective area and 44 qualitative binary attributes. For the Forest
Covertype data set we computed Gaussian mixtures with k=10 com-
ponent distributions.
The second data set is based on the Amsterdam Library of Object
Images (ALOI) [Geusebroek et al., 2005]. This database consists of
110 250 images of 1 000 small objects, taken under various conditions.
We use a 27-dimensional feature vector set that is based on color
histograms in HSV color space (see Figure 22b). The features were
extracted from the database as described in [Kriegel et al., 2011b] but
using 2bins for hue, saturation and brightness each. The data set
is provided by the ELKI project [Achtert et al., 2012] and contains
110 250 data points. For the ALOI data set we computed Gaussian
mixtures with k=3component distributions.
The third data set is created from data provided by the GeoNames
geographical database (available at http://www.geonames.org/). The
original data set contained several features of 135 082 cities around
the world with a population of at least 1000. We extracted the geo-
graphic coordinates (latitude and longitude) for each city to compute
a projection onto the 3-dimensional unit sphere. The resulting data
set is a 3-dimensional model of the populated regions of the globe
(see Figure 22c). For the GeoNames data set we computed Gaussian
mixtures with k=20 component distributions.
The first two real world data sets were normalized before use. That
is, for each coordinate of the input data points the values were trans-
lated and scaled to fit into the interval [0,1]. Thus, the spread in each
dimension is =1(cf. Equation (28)). Without normalization, the dif-
ference between the means and covariances computed by the EM and
SEM algorithm might be dominated by a single or only few dimen-
sions of the input space. By introducing the normalization step, we
ensured that the parameter estimation problem does not effectively
reduce to a lower dimensional problem.
7.3experiments
For the comparison of the EM and SEM algorithm, we established
four different types of tests. In the first type of tests we compared the
negative log-likelihood of the parameters computed by the EM and
SEM algorithm. To do this, we ran both algorithms for 50 rounds and
computed the negative log-likelihood of the computed parameters
after each round.
The goal of the second type of tests was to compare the parameters
computed by the two algorithms. Again, we ran both algorithms for
50 rounds. After each round, we computed the differences between
the computed parameters. More precisely, for each component distri-
102
7.3 experiments
(a) Forest Covertype data set
with d=10 and n=110 250
(b) ALOI features data set
with d=27 and n=110 250
(c) GeoNames data set with d=3and n=135 082
Figure 22: Projections of the real world data sets.
bution we computed the absolute distance between the weights, the
Euclidean distance between the means and the Frobenius norm of the
difference between the covariance matrices.
The aim of the third type of tests was to evaluate our theoreti-
cal bounds on the proximity of the EM and SEM computations (cf.
Chapter 6.3.2). For the sake of simplicity, we compared only the
means computed by the two algorithms. To to this, in each round
of the SEM algorithm we also computed the mean updates the EM
algorithm would compute if it was given the same previous model
parameters. Then, we determined the actual Euclidean distance be-
tween each pair of means. Furthermore, we computed our theoretical
bounds for each component distribution and each coordinate of the
103
experimental analysis
corresponding mean. To get a bound that holds with high probability,
we chose
δ:= 1
100 ·k(d+1).
Applying Theorem 6.3for all kweight updates and Theorem 6.4for
all dcoordinates in all kmean updates yields k(d+1)bounds that
hold with probability 1δ. Using the union bound, the resulting
bound for the Euclidean distance between the means holds with prob-
ability of at least 11
100 .
Finally, in the fourth type of tests, we compared the running times
of the EM and SEM algorithm. Our analysis in Chapter 6.3.6pre-
dicts a constant factor speedup in case of Gaussian mixture models.
Furthermore, we motivated the overall speedup based on a factor k
speedup for a part of the computations of the parameter updates. To
investigate the dependency of the speedup on the number of com-
ponent distributions, we started both algorithm with the GeoNames
data set for several values of kbetween 1and 100. To reduce the
effects of single delayed runs and in case of the SEM algorithm to
consider the probabilistic behavior, we averaged the running times of
both algorithms over ten runs.
More precisely, we started both algorithms for each value of kwith
three different initial solutions. For each initial solution, we executed
both algorithms three times.
The EM algorithm, as well as the SEM algorithm, compute their
solution based on an initial parameter estimate. How to compute a
good initial set of model parameters is not a subject of this thesis.
Since we study the solutions of the SEM algorithm in relation to the
solutions of the classical EM algorithm, we only have to assure that
both algorithms are started with the same initial solutions. For our
comparison of the EM and SEM algorithm, the quality of the initial
solution is not critical. Therefore, we implemented a very simple ini-
tialization method that we used for all our experiments. To compute
initial parameters θ= (w1,. . . ,wk,µ1,...,µk,Σ1,...,Σk), we executed
the following steps:
1. Draw kpoints c1,. . . ,ckfrom the input data set XRduni-
formly at random.
2. Compute a partition C1,...,Ckof Xinto ksubsets by assign-
ing each xXto the nearest ciin terms of Euclidean distance
(breaking ties arbitrarily).
3. For i=1,. . . ,k:
a) Set the i-th weight to wi:= |Ci|
|X|.
b) Set µi,Σito the maximum likelihood estimate of a single
Gaussian for the data set Ci(see Chapter 4.3.1).
104
7.4 results
For each data set, we created 10 sets of initial model parameters.
Since the SEM algorithm is probabilistic, for each initial model we
performed 100 different runs. To ensure that the results of our sec-
ond and third type of test are not biased by an unlikely computation,
we averaged the results of the different runs for each initial solution.
Recall that the EM algorithm is deterministic. Therefore, we need to
execute the EM algorithm only once.
7.4results
In this section, we discuss the results of our four types of tests. The
focus of the following presentation lies on the second and third type
of test since these tests correspond to our analysis in Chapter 6. As
mentioned above, we discuss the results for five particular data sets,
two artificial data sets and three real world data sets. During our
experiments we observed that the results for different initial solutions
did not differ substantially. Therefore, we depict only results for some
selected initial solutions that led to typical behavior of the algorithms.
Some characteristic results of our first type of tests are shown in
Figure 23. The negative log-likelihood of the solutions computed by
the EM algorithm is plotted as a dashed black line. Note that for
each round of the EM algorithm we only have a single value to plot.
For the SEM algorithm, we have to plot the results of 100 different
runs of the algorithm. Therefore, we depict the results of the SEM
algorithm similar to box plots. The dark gray line marks the median,
the gray ribbon ranges from the lower to the upper quartile, and the
light gray ribbon ranges from the minimum to the maximum of the
negative log-likelihood of the computed solutions. For the majority
of our experiments for the artificial as well as the real world data
sets, the log-likelihood of the solutions of the EM and SEM algorithm
almost coincide. To make the differences visible at all in Figure 23, we
had to omit the first three rounds of the algorithms. For the Art1M
data set, most of the initial solutions still lead to results as depicted
in Figure 23a. However, for some initial solutions we observe small
differences as depicted in Figure 23c. Furthermore, individual initial
solutions lead to considerable differences as depicted in Figure 23e.
Generally speaking, for small values of none can not expect that
the SEM algorithm yields parameter estimates close to the EM com-
putations. This is due to the law of large numbers, i.e., the influence
of a single sampling step of the SEM algorithm is larger for smaller
n. Indeed, for Art10K data set we observe differences between the
log-likelihood of the computed solutions more often. Some examples
are depicted in Figure 23b to Figure 23f. Note that the SEM results
may lie above the EM results as well as below them and also both in
the same run.
105
experimental analysis
Some typical results of our second type of tests are shown in Fig-
ure 24 to Figure 28. For the interpretation of the results, we will re-
late the measured differences between the parameters computed by
the EM and SEM algorithm to a simple upper bound derived from
the dimension dand the largest spread of the corresponding data
set. That is, for a given data set, we assume that the spread in eachSee Chapter 6.3.2,
Equation (28).dimension is at most . Then, the Euclidean distance between the
computed means is at most
Γµ:= d.
The Frobenius norm of the difference between the computed covari-
ance matrices is at most
ΓΣ:= d∆2.
Furthermore, recall that the weights lie in the interval [0,1]. That is,
the difference between the computed weights is at most 1. In the
following, we compute the fraction of Γµand ΓΣwhich the computed
means and covariance matrices differ by, respectively. The differences
of the weights are always compared to 1.
For the artificial data sets, we have d=10 and 40. It follows
that Γµ130 and ΓΣ16 000. Regarding the Art1M data set (see
Figure 24), we observe that the parameter vectors computed by the
EM and SEM algorithm are very similar. More precisely, the weights
differ by less than 0.002 and the means differ by less than 0.002 ·Γµ.
The covariances even differ by less than 0.0002 ·ΓΣ. Moreover, we ob-
serve that after the first few rounds, the differences in the parameters
of many components are even substantially smaller than those of the
remaining components.
As mentioned in the discussion of our first type of tests, for small
values of none can not expect good parameter estimates. Indeed, for
the Art10K data set (see Figure 25), the differences between the means
and covariances are larger by a factor of 20. However, our first type
of tests show that this does not necessarily result in lower likelihood
of the computed parameters.
Recall that the two real world data sets corresponding to Figure 26
and Figure 27 are normalized. That is, =1.
For the Forest Covertype data set (see Figure 26), we have d=10.
It follows that Γµ3.2and ΓΣ=10. Then, the weights differ by less
than 0.004, the means differ by less than 0.004 ·Γµ, and the covariances
differ by less than 0.0002 ·ΓΣ. That is, the results are similar to the
results for the Art1M data set. One explanation for these good results
is that the Forest Covertype data set is the largest of the real world
data sets with a size of n=581 012.
For the ALOI data set (see Figure 27), we have d=27. It follows
that Γµ5.2and ΓΣ=27. For the weight updates, at least in the first
couple of rounds, we get a difference of up to 0.2. The means differ by
up to 0.009 ·Γµand the covariances differ by up to 0.0006 ·ΓΣ. These
106
7.4 results
results are slightly worse than the results for the Forest Covertype
data set. However, this can be explained by the smaller size of the
ALOI data set with n=110 250 only. Furthermore, except for the
difference of some weight updates during the first few rounds, the
differences are still very small.
For the GeoNames data set (see Figure 28), we have d=3and
=2. It follows that Γµ3.5and ΓΣ=12. For the weights, we get a
difference of less than 0.003. The means and the covariances differ by
a fraction of less than 0.01 ·Γµand 0.0015 ·ΓΣ, respectively. However,
for all but one component, the differences between the means and
between the covariances are lower by a factor of 3at least.
All three real world data sets have in common that after approxi-
mately 40 rounds, the differences in all parameters remain on a very
small level. Furthermore, we point out that all results confirm that
both algorithms compute almost the same parameter updates.
For our third type of tests, we depict a selection of results in Fig-
ure 29 to Figure 33. First, we observe that the actual difference mea-
sured in our experiments is significantly smaller than our high prob-
ability bound. Furthermore, the similar development of our bound
and the actual difference indicates the accuracy of our analysis. Note
that mostly, the largest differences exist during the first couple of
rounds. This matches our observations from the second type of tests.
Just as for our second type of test, we relate the bound for the differ-
ence of the weight updates to the largest possible distance determined
by the dimension and the spread of the data set. As mentioned in the
description of our tests in Section 7.3, the presented bounds hold with
probability 11
100 . For the Art1M data set (cf. Figure 29), the results
show that the bound is less than 0.013 ·Γµ. Again, due to the smaller
size of the data set, the results for the Art10K data set are worse (cf.
Figure 30). Here, we get bounds up to 0.12 ·Γµ. In Chapter 6.3.3, we
already discussed the limited applicability of our bounds for small
responsibilities rj. Since smaller values of nresult in smaller values
of rj=n·wEM
j, the larger bounds observed in Figure 30 are not sur-
prising.
For the real world data sets, we observed a larger variety of devel-
opments of our high probability bound. This may arise from the fact
that the real world data sets are less suitable for being modeled by
Gaussian mixtures. To avoid tedious repetitions, we prefer to present
a diverse choice of results rather than a characteristic selection. Nev-
ertheless, all presented bounds for all real world data sets are at most
0.1·Γµ.
Finally, the results of our fourth type of tests are depicted in Fig-
ure 34 and Figure 35. Figure 34 shows that the running times of both
algorithms linearly depend on the number of components k. The
slope of the curve for the SEM algorithm is considerably smaller as
predicted by our analysis in Chapter 6.3.6. Furthermore, the results
107
experimental analysis
show that for k=1the EM algorithm is faster than the SEM algo-
rithm. This is not surprising since for k=1both algorithms compute
the same updates, while the SEM processes the additional sampling
step. To approximate the factor of the speedup, we did further tests
for larger values of k. The results are depicted in Figure 35 and sug-
gest a factor of up to 3at least.
Note that although the theoretical and experimental running time
analyses seem to match, we have to treat the results with care. On
the one hand, our theoretical analysis simplifies the situation by only
counting the multiplications. On the other hand, our implementa-
tion may be subject to several optimizations emerging from the linear
algebra library, the compiler and the floating point hardware. Fur-
thermore, these optimizations of the implementation possibly differ
in their effectiveness for the two algorithms. Nevertheless, the mea-
sured speedup is real and thus, the SEM algorithm should be taken
into account when it comes to the parameter estimation problem for
Gaussian mixture models.
108
7.4 results
(a) Art1M data set (b) Art10K data set
(c) Art1M data set (d) Art10K data set
(e) Art1M data set (f) Art10K data set
Figure 23: Negative log-likelihood of parameters computed by the
EM and SEM algorithm for the selected artificial data sets
and different initial solutions. Figure 23a, 23c, and 23e all
depict results for the Art1M data set, but for different ini-
tial solutions. The same holds for Figure 23b, 23d, and 23f
with the Art10K data set.
109
experimental analysis
(a) Difference wEM
kwSEM
kbetween the weights.
(b) Difference µEM
kµSEM
kbetween the means.
(c) Difference ΣEM
kΣSEM
kbetween the covariances.
Figure 24: Comparison of intermediate solutions of the EM and SEM
algorithm given the Art1Mdata set with Γµ130 and ΓΣ
16 000 for k=10. All three figures depict the results for the
same initial solution. Each figure consists of ten graphs,
each depicting the difference for one component Gaussian
averaged over 100 SEM runs.
110
7.4 results
(a) Difference wEM
kwSEM
kbetween the weights.
(b) Difference µEM
kµSEM
kbetween the means.
(c) Difference ΣEM
kΣSEM
kbetween the covariances.
Figure 25: Comparison of intermediate solutions of the EM and SEM
algorithm given the Art10Kdata set with Γµ130 and
ΓΣ16 000 for k=10. All three figures depict the results
for the same initial solution. Each figure consists of ten
graphs, each depicting the difference for one component
Gaussian averaged over 100 SEM runs.
111
experimental analysis
(a) Difference wEM
kwSEM
kbetween the weights.
(b) Difference µEM
kµSEM
kbetween the means.
(c) Difference ΣEM
kΣSEM
kbetween the covariances.
Figure 26: Comparison of intermediate solutions of the EM and SEM
algorithm given the normalized Forest Covertype data set
with Γµ3.2and ΓΣ=10 for k=10. All three figures
depict the results for the same initial solution. Each figure
consists of ten graphs, each depicting the difference for one
component Gaussian averaged over 100 SEM runs.
112
7.4 results
(a) Difference wEM
kwSEM
kbetween the weights.
(b) Difference µEM
kµSEM
kbetween the means.
(c) Difference ΣEM
kΣSEM
kbetween the covariances.
Figure 27: Comparison of intermediate solutions of the EM and SEM
algorithm given the normalized ALOI data set with Γµ
5.2and ΓΣ=27 for k=3. All three figures depict the results
for the same initial solution. Each figure consists of three
graphs, each depicting the difference for one component
Gaussian averaged over 100 SEM runs.
113
experimental analysis
(a) Difference wEM
kwSEM
kbetween the weights.
(b) Difference µEM
kµSEM
kbetween the means.
(c) Difference ΣEM
kΣSEM
kbetween the covariances.
Figure 28: Comparison of intermediate solutions of the EM and SEM
algorithm given the GeoNames data set with Γµ3.5and
ΓΣ=12 for k=20. All three figures depict the results for
the same initial solution. Each figure consists of twenty
graphs, each depicting the difference for one component
Gaussian averaged over 100 SEM runs.
114
7.4 results
(a) This is one of the most frequent developments of the bound.
(b) Sometimes, the bound increases, albeit on a very low level.
(c) Several ups and downs occur only rarely.
Figure 29: Experimental difference and the theoretical bound on the
difference between means computed by the EM and SEM
algorithm given the Art1Mdata set with Γµ130 for k=
10. All three figures depict the results averaged over 100
SEM runs for the same initial solution but for the means of
different component Gaussians.
115
experimental analysis
(a) Similar to Figure 29a, but on a higher level.
(b) Although the bound is worsening, the actual differences remain small.
(c) Development of the bound contrary to the actual difference.
Figure 30: Experimental difference and the theoretical bound on the
difference between means computed by the EM and SEM
algorithm given the Art10Kdata set with Γµ130 for k=
10. All three figures depict the results averaged over 100
SEM runs for the same initial solution but for the means of
different component Gaussians.
116
7.4 results
Figure 31: Experimental difference and the theoretical bound on the
difference between means computed by the EM and SEM
algorithm given the normalized Forest Covertype data set
with Γµ3.2for k=10. All three figures depict the results
averaged over 100 SEM runs for the same initial solution
but for the means of different component Gaussians.
117
experimental analysis
Figure 32: Experimental difference and the theoretical bound on the
difference between means computed by the EM and SEM
algorithm given the normalized ALOI data set with Γµ
5.2for k=3. All three figures depict the results averaged
over 100 SEM runs for the same initial solution but for the
means of different component Gaussians.
118
7.4 results
Figure 33: Experimental difference and the theoretical bound on the
difference between means computed by the EM and SEM
algorithm given the GeoNames data set with Γµ3.5for k=
20. All three figures depict the results averaged over 100
SEM runs for the same initial solution but for the means of
different component Gaussians.
119
experimental analysis
.....
0
.
2
.
4
.
6
.
8
.
10
.
12
.
14
.
16
.
18
.
20
.
0.
5
.
10
.
15
.
number of components
.
seconds
.
. ..
EM algorithm
. ..
SEM algorithm
Figure 34: Average running times of the EM and SEM algorithm
started on the GeoNames data set for different numbers
of component Gaussians.
.....
0
.
10
.
20
.
30
.
40
.
50
.
60
.
70
.
80
.
90
.
100
.
0.
1
.
2
.
3
.
number of components
.
factor of speedup
Figure 35: Average speedup of the SEM algorithm started on the
GeoNames data set for different numbers of component
Gaussians.
120
LIST OF FIGURES
Figure 1Different measurements for the same cluster 15
Figure 2The volume argument 19
Figure 3drad(AB)<costdrad(Cm)+pApB22
Figure 4Intermediate centers 26
Figure 5Covering centers and clusters 27
Figure 6Merging (Z,optk)-connected clusters 31
Figure 7Congruent configurations 37
Figure 8Merging clusters intersecting a ball of radius r39
Figure 9The two base cases of the induction 43
Figure 10 The inductive step 44
Figure 11 Partition of the optimal solution 44
Figure 12 Sketch of the set W(x)47
Figure 13 Part of the dendrogram for W(xi)47
Figure 14 Part of the dendrogram for X48
Figure 15 The Old Faithful data set 57
Figure 16 Density of a two-dimensional Gaussian 62
Figure 17 Points with constant Gaussian density 63
Figure 18 Density of a Gaussian mixture distribution 66
Figure 19 Decomposition of lnpθ(X)over the choice of q74
Figure 20 Optimization of lnpθ(X)over the choice of θ76
Figure 21 Projection of one of the artificial data sets 101
Figure 22 Projections of the real world data sets 103
Figure 23 Negative log-likelihood for artificial data sets 109
Figure 24 Differences for the Art1M data set 110
Figure 25 Differences for Art10K data set 111
Figure 26 Differences for the Forest Covertype data set 112
Figure 27 Differences for the ALOI data set 113
Figure 28 Differences for the GeoNames data set 114
Figure 29 Theoretical bound for the large artificial data set 115
Figure 30 Theoretical bound for the small artificial data set 116
Figure 31 Theoretical bound for the Forest Covertype data set 117
Figure 32 Theoretical bound for the ALOI data set 118
Figure 33 Theoretical bound for the GeoNames data set 119
Figure 34 Average running times for EM and SEM 120
Figure 35 Average speedup of the SEM algorithm 120
121
LIST OF ALGORITHMS
3.1AgglomerativeDiscreteRadius(X)............. 20
3.2AgglomerativeRadius(X)................... 24
3.3AgglomerativeCompleteLinkage(X)........... 34
5.1Classical EM algorithm . . . . . . . . . . . . . . . . . . . . . 72
6.1Probabilistic EM algorithm . . . . . . . . . . . . . . . . . . 82
6.2SEM algorithm for mixtures . . . . . . . . . . . . . . . . . . 83
123
BIBLIOGRAPHY
[Achtert et al., 2012] Achtert, E., Goldhofer, S., Kriegel, H.-P., Schu-
bert, E., and Zimek, A. (2012). Evaluation of clusterings metrics
and visual support. Data Engineering, International Conference on,
0:12851288.
[Ackermann, 2009] Ackermann, M. R. (2009). Algorithms for the Breg-
man k-Median Problem. PhD thesis, University of Paderborn.
[Ackermann et al., 2011] Ackermann, M. R., Blömer, J., Kuntze, D.,
and Sohler, C. (2011). Analysis of Agglomerative Clustering. In
Proceedings of the 28th International Symposium on Theoretical Aspects
of Computer Science (STACS 11), pages 308319. Dagstuhl Publish-
ing.
[Ackermann et al., 2012] Ackermann, M. R., Blömer, J., Kuntze, D.,
and Sohler, C. (2012). Analysis of Agglomerative Clustering. Algo-
rithmica.
[Amunts et al., 2013] Amunts, K., Lepage, C., Borgeat, L., Mohlberg,
H., Dickscheid, T., Rousseau, M.-E., Bludau, S., Bazin, P.-L., Lewis,
L. B., Oros-Peusquens, A.-M., Shah, N. J., Lippert, T., Zilles, K., and
Evans, A. C. (2013). BigBrain: An Ultrahigh-Resolution 3D Human
Brain Model. Science,340(6139):14721475.
[Asuncion and Newman, 2007] Asuncion, A. and Newman, D. J.
(2007). UCI machine learning repository. Available at:
http://www.ics.uci.edu/mlearn/MLRepository.html.
[Azevedo et al., 2009] Azevedo, F. A., Carvalho, L. R., Grinberg, L. T.,
Farfel, J. M., Ferretti, R. E., Leite, R. E., Filho, W. J., Lent, R., and
Herculano-Houzel, S. (2009). Equal numbers of neuronal and non-
neuronal cells make the human brain an isometrically scaled-up
primate brain. The Journal of Comparative Neurology,513(5):532541.
[B¯
adoiu et al., 2002] B¯
adoiu, M., Har-Peled, S., and Indyk, P. (2002).
Approximate clustering via core-sets. In Proceedings of the thiry-
fourth annual ACM symposium on Theory of computing, STOC 02,
pages 250257, New York, NY, USA. ACM.
[Baum et al., 1970] Baum, L. E., Petrie, T., Soules, G., and Weiss, N.
(1970). A Maximization Technique Occurring in the Statistical Anal-
ysis of Probabilistic Functions of Markov Chains. The Annals of
Mathematical Statistics,41(1):164171.
125
Bibliography
[Biernacki et al., 2003] Biernacki, C., Celeux, G., and Govaert, G.
(2003). Choosing starting values for the EM algorithm for getting
the highest likelihood in multivariate Gaussian mixture models.
Computational Statistics & Data Analysis,41(3-4):561575.
[Bilmes, 1997] Bilmes, J. (1997). A Gentle Tutorial of the EM Algo-
rithm and its Application to Parameter Estimation for Gaussian
Mixture and Hidden Markov Models. Technical Report TR-97-021,
ICSI.
[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine
Learning (Information Science and Statistics). Springer-Verlag New
York, Inc., Secaucus, NJ, USA.
[Blömer et al., 2013] Blömer, J., Bujna, K., and Kuntze, D. (2013). A
Theoretical and Experimental Comparison of the EM and SEM
Algorithm. CoRR: Computing Research Repository, abs/1310.5034.
arXiv:1310.5034 [cs.LG].
[Brewer et al., 2009] Brewer, G. J., Boehler, M. D., Pearson, R. A., De-
Maris, A. A., Ide, A. N., and Wheeler, B. C. (2009). Neuron network
activity scales exponentially with synapse density. Journal of Neural
Engineering,6(1):014001.
[Broder et al., 1997] Broder, A. Z., Glassman, S. C., Manasse, M. S.,
and Zweig, G. (1997). Syntactic clustering of the web. Computer
Networks and ISDN Systems,29(813):11571166. Papers from the
Sixth International World Wide Web Conference.
[Celeux et al., 1995] Celeux, G., Chauveau, D., and Diebolt, J. (1995).
On stochastic versions of the em algorithm. Technical Report 2514,
Institut national de recherche en informatique et en automatique
(INRIA).
[Celeux and Diebolt, 1985] Celeux, G. and Diebolt, J. (1985). The
SEM algorithm: a probabilistic teacher algorithm derived from the
EM algorithm for the mixture problem. Computational Statistics
Quarterly,2:7382.
[Celeux and Diebolt, 1992] Celeux, G. and Diebolt, J. (1992). A
stochastic approximation type EM algorithm for the mixture prob-
lem. Stochastics and Stochastic Reports,41(1-2):119134.
[Charikar et al., 1997] Charikar, M., Chekuri, C., Feder, T., and Mot-
wani, R. (1997). Incremental clustering and dynamic information
retrieval. In Proceedings of the twenty-ninth annual ACM symposium
on Theory of computing, STOC 97, pages 626635, New York, NY,
USA. ACM.
[Cover and Thomas, 2012] Cover, T. M. and Thomas, J. A. (2012). El-
ements of information theory. John Wiley & Sons.
126
Bibliography
[Dasgupta and Long, 2005] Dasgupta, S. and Long, P. M. (2005). Per-
formance guarantees for hierarchical clustering. Journal of Computer
and System Sciences,70(4):555569. Special Issue on COLT 2002.
[Dasgupta and Schulman, 2007] Dasgupta, S. and Schulman, L.
(2007). A Probabilistic Analysis of EM for Mixtures of Separated,
Spherical Gaussians. Journal of Machine Learning Research,8:203
226.
[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B.
(1977). Maximum likelihood from incomplete data via the EM al-
gorithm. Journal of the Royal Statistical Society, Series B: Statistical
Methodology,39(1):138.
[Dias and Wedel, 2004] Dias, J. G. and Wedel, M. (2004). An empir-
ical comparison of EM, SEM and MCMC performance for prob-
lematic Gaussian mixture likelihoods. Statistics and Computing,
14(4):323332.
[Eisen et al., 1998] Eisen, M. B., Spellman, P. T., Brown, P. O., and
Botstein, D. (1998). Cluster analysis and display of genome-wide
expression patterns. Proceedings of the National Academy of Sciences,
95(25):1486314868.
[Feder and Greene, 1988] Feder, T. and Greene, D. (1988). Optimal
algorithms for approximate clustering. In Proceedings of the twenti-
eth annual ACM symposium on Theory of computing, STOC 88, pages
434444, New York, NY, USA. ACM.
[Florek et al., 1951] Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus,
H., and Zubrzycki, S. (1951). Sur la liaison et la division des points
d’un ensemble fini. Colloquium Mathematicae,2:282285.
[Fortunato, 2010] Fortunato, S. (2010). Community detection in
graphs. Physics Reports,486(3-5):75 174.
[Fréchet, 1910] Fréchet, M. (1910). Les dimensions d’un ensemble
abstrait. Mathematische Annalen,68(2):145168.
[Geusebroek et al., 2005] Geusebroek, J. M., Burghouts, G. J., and
Smeulders, A. W. M. (2005). The amsterdam library of object im-
ages. International Journal of Computer Vision,61(1):103112.
[Golub and Van Loan, 2012] Golub, G. and Van Loan, C. (2012). Ma-
trix Computations. Matrix Computations. Johns Hopkins University
Press.
[Gonzalez, 1985] Gonzalez, T. F. (1985). Clustering to minimize
the maximum intercluster distance. Theoretical Computer Science,
38(0):293306.
127
Bibliography
[Grimmett and Stirzaker, 2001] Grimmett, G. and Stirzaker, D. (2001).
Probability and Random Processes. Probability and Random Pro-
cesses. Clarendon Press.
[Hawkins, 2004] Hawkins, D. M. (2004). The problem of overfitting.
Journal of chemical information and computer sciences,44(1):112.
[Ip, 1994] Ip, E. H.-s. (1994). A stochastic EM estimator in the presence of
missing data theory and applications. PhD thesis, Stanford Univer-
sity.
[ISO, 2012] ISO (2012). ISO/IEC 14882:2011 Information technology
Programming languages C++. International Organization for Stan-
dardization, Geneva, Switzerland.
[Johnson and Joram, 1984] Johnson, W. B. and Joram, L. (1984). Ex-
tensions of Lipschitz mappings into a Hilbert space. In Conference
in Modern Analysis and Probability, volume 26 of Contemporary Math-
ematics, pages 189206. American Mathematical Society.
[Koch et al., 2006] Koch, K., McLean, J., Segev, R., Freed, M. A., Berry,
M. J., Balasubramanian, V., and Sterling, P. (2006). How much the
eye tells the brain. Current biology : CB,16(14):14281434.
[Kriegel et al., 2011a] Kriegel, H.-P., Kröger, P., Sander, J., and Zimek,
A. (2011a). Density-based clustering. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery,1(3):231240.
[Kriegel et al., 2011b] Kriegel, H.-P., Schubert, E., and Zimek, A.
(2011b). Evaluation of multiple clustering solutions. In Proceedings
of the 2nd MultiClust Workshop (in conjunction with ECML PKDD).
[Leder, 2013] Leder, L. (2013). Nichtapproximierbarkeitsresultate zu
Radius- und Durchmesserclustering unter Verwendung von Lp-
Metriken. Bachelor’s thesis. To appear.
[Lee et al., 2008] Lee, K., Kim, J., Kwon, K. H., Han, Y., and Kim, S.
(2008). DDoS attack detection method using cluster analysis. Expert
Systems with Applications,34(3):16591665.
[Levchenko, 2013] Levchenko, K. (2013). Chernoff bound. Available
at: http://www.cs.ucsd.edu/klevchen/techniques/chernoff.pdf.
[Matsumoto and Nishimura, 1998] Matsumoto, M. and Nishimura,
T. (1998). Mersenne twister: a 623-dimensionally equidistributed
uniform pseudo-random number generator. ACM Transactions on
Modeling and Computer Simulation,8(1):330.
[McLachlan and Krishnan, 2008] McLachlan, G. J. and Krishnan, T.
(2008). The EM Algorithm and Extensions (Wiley Series in Probability
and Statistics). Wiley-Interscience, 2edition.
128
Bibliography
[McQuitty, 1957] McQuitty, L. L. (1957). Elementary Linkage Analy-
sis for Isolating Orthogonal and Oblique Types and Typal Relevan-
cies. Educational and Psychological Measurement,17:207209.
[Meyerhenke and Staudt, 2013] Meyerhenke, H. and Staudt, C. L.
(2013). Engineering High-Performance Community Detection
Heuristics for Massive Graphs. Accepted for full paper presen-
tation at 42nd International Conference on Parallel Processing
(ICPP).
[Mitzenmacher and Upfal, 2005] Mitzenmacher, M. and Upfal, E.
(2005). Probability and Computing: Randomized Algorithms and Proba-
bilistic Analysis. Cambridge University Press, New York, NY, USA.
[Montgomery and Madison, 2004] Montgomery, J. M. and Madison,
D. V. (2004). Discrete synaptic states define a major mechanism of
synapse plasticity. Trends in Neurosciences,27(12):744 750.
[Naszódi, 2010] Naszódi, M. (2010). Covering a set with homothets
of a convex body. Positivity,14(1):6974.
[Pearson, 1894] Pearson, K. (1894). Contributions to the Mathemati-
cal Theory of Evolution. Philosophical Transactions of the Royal Society
of London. A,185:71110.
[Pereira et al., 1993] Pereira, F., Tishby, N., and Lee, L. (1993). Dis-
tributional clustering of english words. In Proceedings of the 31st
annual meeting on Association for Computational Linguistics, ACL 93,
pages 183190, Stroudsburg, PA, USA. Association for Computa-
tional Linguistics.
[Schaeffer, 2007] Schaeffer, S. E. (2007). Graph clustering. Computer
Science Review,1(1):2764.
[Sneath and Sokal, 1973] Sneath, P. H. A. and Sokal, R. R. (1973). Nu-
merical taxonomy: the principles and practice of numerical classification.
W. H. Freeman.
[Sundberg, 1974] Sundberg, R. (1974). Maximum Likelihood Theory
for Incomplete Data from an Exponential Family. Scandinavian Jour-
nal of Statistics,1(2):4958.
[Tan et al., 2005] Tan, P.-N., Steinbach, M., and Kumar, V. (2005). In-
troduction to Data Mining, (First Edition). Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA.
[Turing, 1950] Turing, A. M. (1950). Computing Machinery and Intel-
ligence. Mind, LIX:433460.
[von Neumann, 1958] von Neumann, J. (1958). The computer and the
brain. Mrs. Hepsa Ely Silliman memorial lectures. Yale University
Press.
129
Bibliography
[Walker, 2012] Walker, R., editor (2012). The Human Brain Project A
Report to the European Commission. The HBP-PS Consortium, Lau-
sanne.
[Webster, 1994] Webster, R. (1994). Convexity. Oxford Science Publi-
cations. Oxford University Press, USA.
[Wei and Tanner, 1990] Wei, G. C. G. and Tanner, M. A. (1990). A
Monte Carlo Implementation of the EM Algorithm and the Poor
Man’s Data Augmentation Algorithms. Journal of the American Sta-
tistical Association,85(411):699704.
[Wu, 1983] Wu, C. F. J. (1983). On the Convergence Properties of the
EM Algorithm. The Annals of Statistics,11(1):95103.
130