Practical algorithms for clustering and modeling large data sets: analysis and improvements [original]

PRACTICAL ALGORITHMS

FOR CLUSTERING AND MODELING

LARGE DATA SETS

ANALYSIS AND IMPROVEMENTS

Dissertation

zur Erlangung des Grades

„Doktor der Naturwissenschaften”

(doctor rerum naturalium)

der Fakultät für Elektrotechnik, Informatik und Mathematik

der Universität Paderborn

vorgelegt von

DIPL.-INFORM.

DANIEL KUNTZE

Paderborn, den 29. August 2013

(überarbeitete Fassung vom 28. Dezember 2013)

angenommen auf Vorschlag von

PROF. DR. JOHANNES BLÖMER

UNIVERSITÄT PADERBORN

PROF. DR. CHRISTIAN SOHLER

TECHNISCHE UNIVERSITÄT DORTMUND

verteidigt am

7. November 2013

DEDICATED TO MY WIFE

EVA REBECCA

ACKNOWLEDGMENTS

This thesis would not have been written without a set of people sup-

porting me. Clustering them leads to four groups of people I am

deeply thankful.

First of all, I would like to thank my advisors. Prof. Dr. Johannes

Blömer was never too busy to invest time in discussions that inspired

new trains of thought and his guidance always advanced my work.

Prof. Dr. Christian Sohler not only made me eat sushi but also fed

my thoughts with lots of valuable ideas for my research. My col-

leagues Claudia Jahn, Kathrin Bujna, Dr. Stefanie Mense, Dr. Marcel

Ackermann, Jonas Gefele, Peter Günther, Dr. Volker Krummel, Gen-

nadij Liske, David Teusner, Paul Wolf, and of course the IT crowd

(Ulrich Ahlers, Tobias Schultz-Friese, Thomas Thissen, and Michael

Utermöhle) helped me out when I was stuck and provided welcome

excuses from work. I learned a lot during my time with you.

The research presented in this thesis was supported by the Priority

Programme ’Algorithm Engineering’ of the German Research Foun-

dation (DFG), grants BL 314/6-1to BL 314/6-3, for which I am also

really thankful.

I also want to thank my dear colleague Kathrin Bujna, who tamed

R for me and made it draw beautiful pictures.

Last but not least, I want to express my appreciation for the family

and friends cluster. Without Emily and Ronja, I might have finished

more quickly but with less fun. You were the cutest to come into my

office for chocolate! Furthermore, I am very grateful for my parents

and parents-in-law. Their support, love, and prayers were a great

encouragement. My friend Peter Richert never stopped to cheer me

on — thanks for believing in me, mate.

Most of all, I want to thank my wife for accompanying me on my

way, for sharing all the ups and downs, and for giving me unfailing

support. Without you, I would never have succeeded.

ABSTRACT

In times of big data, large-scale database applications are of increas-

ing importance. A central task in this field is to reduce the amount

of data that has to be processed. One way to approach this task is

to capture the structure of the given data. This thesis deals with two

particular data structuring techniques.

The first part of this thesis is about cluster analysis. More precisely,

we consider the diameter k-clustering problem. The subject of this

problem is to partition a finite subset of Rdinto ksubsets called clus-

ters such that the largest diameter of the clusters is minimized. One

early clustering algorithm that computes a hierarchy of approximate

solutions to this problem (for all values of k) is the agglomerative

clustering algorithm with the complete linkage strategy. For decades,

this algorithm has been widely used by practitioners. However, it

is not well studied theoretically. We provide the first analysis of the

agglomerative complete linkage clustering algorithm. Assuming that

the dimension dis a constant, we show that for any kthe solution

computed by this algorithm is an O(logk)-approximation to the di-

ameter k-clustering problem. Our analysis does not only hold for

the Euclidean distance but for any metric that is based on a norm.

Furthermore, we analyze the closely related k-center and discrete k-

center problem. For the corresponding agglomerative algorithms, we

deduce an approximation factor of O(logk)as well.

The second part of this thesis deals with data modeling by training

statistical models. More precisely, we focus on the parameter esti-

mation problem for general and Gaussian mixture distributions. We

analyze a probabilistic variant of the Expectation-Maximization (EM)

algorithm, known as the Stochastic EM or SEM algorithm. Unlike the

original work, we focus on the analysis of a single run of the algo-

rithm. First, we discuss the SEM algorithm for general mixture dis-

tributions. Second, we consider Gaussian mixture models and show

that with high probability the updates of the EM algorithm and its

probabilistic variant are almost the same, given that the data set is suf-

ficiently large. A series of experiments confirms that this still holds

for a large number of successive update steps. Furthermore, we ex-

plain why the SEM algorithm is considerably faster than the classical

EM algorithm. In particular, we show that the probabilistic variant

establishes a constant factor speedup for Gaussian mixture models.

vii

ZUSAMMENFASSUNG

In Zeiten von Big Data sind große Datenbankanwendungen von wach-

sender Bedeutung. Eine zentrale Aufgabe aus diesem Bereich ist die

Reduzierung der zu verarbeitenden Datenmenge. Diese Aufgabe

lässt sich beispielsweise lösen, indem die Struktur der gegebenen

Daten erfasst wird. In dieser Dissertation beschäftigen wir uns mit

zwei Techniken zur Strukturierung von großen Datenmengen.

Im ersten Teil der Arbeit geht es um Clusteranalyse, speziell um das

Durchmesser-k-Clustering-Problem. Dabei wird eine endliche Teil-

menge des Rdin kCluster partitioniert, so dass der größte Cluster-

durchmesser minimiert wird. Ein früher Clusteringalgorithmus, der

eine Hierarchie von Näherungslösungen (für alle Werte von k) zu

diesem Problem berechnet, ist der agglomerative Clusteringalgorith-

mus mit der Complete-Linkage-Strategie. Dieser Algorithmus erfährt

in der Praxis seit Jahrzehnten eine breite Anwendung, ist aber theo-

retisch nicht gut untersucht. Wir liefern die erste Analyse des ag-

glomerativen Complete-Linkage Clusteringalgorithmus. Wir zeigen,

dass die vom Algorithmus berechnete Lösung für konstante Dimen-

sion dfür jedes keine O(logk)-Approximation des Durchmesser-k-

Clustering-Problems ist. Unsere Analyse gilt dabei nicht nur für

den Euklidischen Abstand, sondern für jede Metrik, die auf einer

Norm basiert. Außerdem analysieren wir das k-Center-Problem und

das diskrete k-Center-Problem, welche eng mit dem Durchmesser-k-

Clustering-Problem verwandt sind. Für die dazugehörigen agglome-

rativen Algorithmen leiten wir ebenfalls einen Approximationsfaktor

von O(logk)her.

Der zweite Teil der Arbeit beschäftigt sich mit der Modellierung von

Daten durch das Trainieren von statistischen Modellen. Hier betrach-

ten wir die Parameterschätzung von allgemeinen und Gaußschen

Mixturverteilungen. Wir analysieren eine probabilistische Variante

des Expectation-Maximization (EM) Algorithmus, die als Stochastic

EM oder SEM Algorithmus bekannt ist. Im Gegensatz zu den ur-

sprünglichen Arbeiten zum SEM Algorithmus betrachten wir in un-

serer Analyse einen einzelnen Durchlauf des Algorithmus. Dabei

beschreiben wir den Algorithmus zunächst für allgemeine Mixtur-

verteilungen. Danach betrachten wir Gaußsche Mixturmodelle und

zeigen für diese, dass die Updates des EM Algorithmus und seiner

probabilistischen Variante mit hoher Wahrscheinlichkeit fast identisch

sind, wenn die Datenmenge groß genug ist. Eine Reihe von Expe-

rimenten bestätigt, dass dies auch für eine große Anzahl aufeinan-

derfolgender Updateschritte noch gilt. Darüber hinaus erörtern wir,

warum der SEM Algorithmus schneller arbeitet als der klassische EM

Algorithmus. Insbesondere zeigen wir für Gaußsche Mixturmodelle,

dass die probabilistische Variante eine Beschleunigung um einen kon-

stanten Faktor erreicht.

CONTENTS

1 introduction 1

1.1Cluster analysis 2

1.2Parameter estimation 5

1.3Outline of the thesis 9

i on cluster analysis 11

2 clustering basics 13

2.1The notion of similarity 14

2.2Clustering quality 14

2.3Problem Definitions 16

3 analysis of agglomerative clustering 17

3.1Preliminaries 18

3.2Discrete k-center clustering 20

3.3k-center clustering 24

3.4Diameter k-clustering 34

3.5The one-dimensional case 41

3.6Lower bounds 45

ii on parameter estimation 55

4 statistical models 57

4.1The parameter estimation problem 58

4.2General mixture models 60

4.3Gaussian mixture models 61

4.4Handy notions from probability theory 67

5 the classical em algorithm 71

5.1Convergence of the EM algorithm 73

5.2The EM algorithm for mixture distributions 77

5.3The EM algorithm for Gaussian mixtures 78

6 the stochastic em algorithm 81

6.1The generic SEM algorithm 82

6.2The SEM algorithm for mixture distributions 83

6.3The SEM algorithm for Gaussian mixtures 86

7 experimental analysis 99

7.1Implementation 99

7.2Data sets 100

7.3Experiments 102

7.4Results 105

INTRODUCTION

Recently, as part of the Human Brain Project [Walker, 2012], researchers

from Canada and Germany released BigBrain, a high-resolution 3D

model of the human brain [Amunts et al., 2013]. The aim of this

model is to provide “considerable neuroanatomical insight into the

human brain, thereby allowing the extraction of microscopic data for

modeling and simulation”. The long-term objective of the Human

Brain Project is the simulation of the human brain on supercomputers,

thus realizing a recent dream of mankind. The chances of success for

a venture like this were already discussed in the 1950s by great minds

as Alan Turing1and John von Neumann2.

Although science is not able to simulate the human brain yet, a lot

of effort is put into the imitation of its capabilities. One of these ca-

pabilities is to handle large amounts of data. The human retina alone

transmits approximately one megabyte of data to the brain every sec-

ond [Koch et al., 2006]. Fortunately, we do not have to consciously

process each bit of information that we see. Instead, our brain’s ab-

straction capacity reduces the amount of information. This enables

us to recognize a white area covered with black speckles as a page

of printed text. Moreover, we are able to classify the speckles as Ro-

man letters and recognize groups of letters as English words. Then,

by processing one word after another while ignoring the remaining

words on the page, we are able to read whole sentences.

For the sake of simplicity let’s wrongly assume that the information

of a sentence is simply the sum of information in the single words.

Furthermore, we assume that the amount of information in a single

word is simply the number of bits needed to identify one word of the

English language. If we estimate the size of the English vocabulary

by 500 000 words, then we need 19 bits per word. An average reader

is able to read about 4words per second which corresponds to 76 bits

per second. This rough approximation shows that the brain reduces

the information from the perception in the eye to the recognition of

the words by several orders of magnitude.

1In 1950, Turing wrote his well-known article “Computing Machinery and Intelli-

gence” [Turing, 1950].

2Shortly before his death in 1957, von Neumann began to write the unfinished and

posthumously published book “The Computer and the Brain” [von Neumann, 1958].

introduction

Furthermore, the brain is also able to store large amounts of data.

The number of neurons in the human brain is estimated to about

86 billion cells, 16 billion of which are located in the cerebral cortex

[Azevedo et al., 2009]. We may assume that each neuron in the cere-

bral cortex is connected to other neurons by 50 000 synapses on aver-

age [Brewer et al., 2009]. This yields a total number of approximately

800 trillion synapses. It is not known how the brain stores informa-

tion and which role the interaction between neurons and synapses

plays. However, recent work suggests that a synapse switches be-

tween discrete states [Montgomery and Madison, 2004]. If we assume

that each synapse can hold one bit of information, we get a total stor-

age capacity of approximately 100 terabytes in the cerebral cortex.

The combination of the capability to store large amounts of data

and the ability to process it resembles the requirements of large-scale

database applications. With continuously growing databases, it be-

comes increasingly important to reduce the amount of data that has

to be processed. That is, in order to work with very large data sets,

we need to be able to extract the essential information. This is quite

similar to the brain’s ability to extract text from the electrochemical

impulses received from the optic nerve. One step into this direction

is to try and capture the structure of a given data set. The approaches

to this task are as diverse as the underlying applications. This thesis

deals with two particular data structuring techniques introduced in

the following two sections.

1.1cluster analysis

One way of investigating the structure of a given data set is to divide

it into subsets of similar data elements. This approach is called cluster

analysis or simply clustering. There are many applications for clus-

tering, including data compression [Pereira et al., 1993], structuring

results of search engines [Broder et al., 1997], analysis of gene expres-

sion data [Eisen et al., 1998], anomaly detection [Lee et al., 2008], and

community detection [Meyerhenke and Staudt, 2013]. For every ap-

plication, a proper objective function is used to measure the quality

of a clustering. One particular objective function that is based on a

distance measure between data elements is the largest diameter of

the clusters. If the desired number of clusters kis given, we call the

problem of minimizing this objective function the diameter k-clustering

problem. In this thesis, we focus on the diameter k-clustering problem

and two closely related clustering problems.

One of the earliest and most widely used clustering algorithms

is agglomerative clustering. The history of agglomerative clustering

goes back to the 1950s at least [Florek et al., 1951, McQuitty, 1957].

Later, biological taxonomy became one of the driving forces of clus-

ter analysis. In [Sneath and Sokal, 1973] the authors, who where

1.1 cluster analysis

the first biologists using computers to classify organisms, discuss sev-

eral agglomerative clustering methods. Agglomerative clustering is a

bottom-up clustering process. At the beginning, every data element

forms its own cluster. In each subsequent step, the two ’closest’ clus-

ters will be merged until only one cluster remains.

In order to define the agglomerative algorithm properly, we have

to specify what it means for two clusters to be ’close’. In case of

the diameter clustering problem mentioned above, two clusters are

defined to be close when the distance between their farthest pair of

data elements is small. Then, the two ’closest’ clusters are the two

clusters that minimize the diameter of their union. Using this def-

inition of closeness is called the complete linkage strategy. Alternate

strategies are to minimize the distance between the closest pair of

data elements or to minimize the average distance between data ele-

ments from the two clusters. These are called single linkage strategy3

and average linkage strategy, respectively.

Agglomerative clustering algorithms create a hierarchy of clusters,

such that for any two different clusters Aand Bfrom different levels

of the hierarchy we either have A∩B=∅,A⊂B, or B⊂A. Such a

hierarchy is useful in many applications, for example, when one is

interested in hereditary properties of the clusters (as in some bioin-

formatics applications) or if the exact number of clusters is a priori

unknown.

1.1.1related work

In this thesis, we study the agglomerative clustering algorithm us-

ing the complete linkage strategy. Since the complete linkage strat-

egy minimizes the diameter in each step, this algorithm is a natural

choice to find hierarchical diameter k-clusterings. The running time

is obviously polynomial in the description length of the input data.

Therefore, our only goal in this thesis is to give an approximation

guarantee for the diameter k-clustering problem. The approximation

guarantee is given by a factor αsuch that the largest diameter of

the k-clustering computed by the algorithm is at most αtimes the

largest diameter of an optimal k-clustering. Although the agglom-

erative complete linkage clustering algorithm is widely used, there

are only few theoretical results considering the quality of the clus-

tering computed by this algorithm. In [Dasgupta and Long, 2005]

the authors provide a certain metric distance function such that this

algorithm computes a k-clustering with an approximation factor of

Ω(logk).

The diameter k-clustering problem is closely related to the k-center

problem. In this problem, we search for kcenters and the objective is

3Using the single linkage strategy is equivalent to computing a minimum spanning

tree of the graph induced by the distance function using Kruskal’s algorithm.

introduction

to minimize the maximum distance of any data element to the nearest

center. When the centers are restricted to come from the input data

set, the problem is called the discrete k-center problem. It can easily be

verified that for metric distance functions the costs of optimal solu-

tions to all three problems are within a factor of 2from each other.

For the Euclidean case, we know that for fixed k, i.e., when we are

not interested in a hierarchy of clusterings, the diameter k-clustering

problem and the k-center problem are NP-hard. In fact, it is already

NP-hard to approximate both problems with an approximation factor

below 1.96 and 1.82, respectively [Feder and Greene, 1988]. These

non-approximability results are transferable to the discrete k-center

problem for approximation factors below 1.73 [Leder, 2013].

Furthermore, there exist provably good approximation algorithms

in this case. For the k-center problem, a simple 2-approximation al-

gorithm is known for metric spaces [Gonzalez, 1985]. Since this al-

gorithm chooses the cluster centers from the input data set, it is also

a2-approximation algorithm for the discrete k-center problem. The

analysis of this algorithm is immediately transferrable to the diame-

ter k-clustering problem and yields a 2-approximation algorithm for

this problem, too. For the k-center problem, a variety of results is

known. For example, for the Euclidean metric in [B¯

adoiu et al., 2002]

a(1+ϵ)-approximation algorithm with running time 2O(klogk

/ϵ2)dn

is shown. This implies a (2+ϵ)-approximation algorithm with the

same running time for the diameter k-clustering problem. Moreover,

for metric spaces a hierarchical clustering strategy is known with an

approximation guarantee of 8for the discrete k-center problem [Das-

gupta and Long, 2005]. This implies an algorithm with an approxi-

mation guarantee of 16 for the diameter k-clustering problem.

Part I of this thesis as well as all of the above mentioned work is

about static clustering, i. e., in the problem definition we are given

the whole input data set at once. An alternative model of the input

data is to consider sequences of data elements that are given one after

another. In [Charikar et al., 1997], the authors discuss clustering in

a so-called incremental clustering model. They give an algorithm with

constant approximation factor that maintains a hierarchical clustering

while new elements are added to the input data set. Furthermore,

they show a lower bound of Ω(log k)for the agglomerative complete

linkage algorithm and the diameter k-clustering problem. However,

since their model differs from ours, their results have no bearing on

our results.

1.1.2our contribution

One of the open problems discussed in [Dasgupta and Long, 2005]

is to derive a non-trivial upper bound for the approximation guar-

antee of the classical agglomerative complete linkage clustering al-

1.2 parameter estimation

gorithm. In this thesis, we present such an analysis for the agglom-

erative complete linkage and related clustering algorithms. Our re-

sults hold for metrics on Rdthat are based on a norm as for exam-

ple the Euclidean metric. We prove that the agglomerative solution

to the diameter k-clustering problem is an O(logk)-approximation.

Here, the O-notation hides a constant that is doubly exponential in

d. This approximation guarantee holds for every level of the hierar-

chy computed by the algorithm. That is, we compare each computed

k-clustering with an optimal solution for that particular value of k.

These optimal k-clusterings do not necessarily form a hierarchy. In

fact, there are simple examples where optimal solutions have no hier-

archical structure.

Our analysis also yields that if we allow 2k instead of kclusters

and compare the cost of the computed 2k-clustering to an optimal

solution with kclusters, the approximation factor is independent of

kand depends only on d. Moreover, the techniques of our analysis

can be applied to prove stronger results for the k-center problem and

the discrete k-center problem. For the k-center problem, we derive

an approximation guarantee that is logarithmic in kand only singly

exponential in d. For the discrete k-center problem, we derive an

approximation guarantee that is logarithmic in kand the dependence

on dis only linear and additive.

Furthermore, we give almost matching upper and lower bounds

for the one-dimensional case. These bounds are independent of k.

For d⩾2and the metric based on the L∞-norm, we provide a lower

bound that exceeds the upper bound for d=1. For d⩾3, we give

a lower bound for the Euclidean case which is larger than the lower

bound for d=1. Finally, we construct data sets which provide lower

bounds for any metric based on an Lp-norm with 1⩽p⩽∞. How-

ever, the construction of these data sets needs the dimension to de-

pend on k.

Our analysis, presented in Chapter 3, was previously published in

[Ackermann et al., 2011, Ackermann et al., 2012].

1.2parameter estimation

Apart from the clustering approach outlined in Section 1.1, another

way of revealing the structure of large data sets is to train the parame-

ters of statistical models. This approach is discussed in Part II of this

thesis. To describe a given data set by the parameters of a statistical

model is a central task in the field of data mining and machine learn-

ing. Many applications in this field deal with real-world data that is

suitable to be modeled by stochastic processes. That is, the data is as-

sumed to be generated by a stochastic process which is governed by

some model parameters. Then, the goal is to optimize the parameters

such that the stochastic process is a likely source of the data. Presum-

introduction

ably, one of the first studies of this problem was done by Karl Pearson

for a data set consisting of measurements of the frontal breadth of the

carapace of 1 000 female shore crabs [Pearson, 1894]. Pearson showed

that the data was most likely a mixture of two Gaussian distributions.

This suggests that the crabs belonged to two different species. To

compute the parameters of the mixture, Pearson used the method of

moments to derive a ninth-degree polynomial by hand which gave

rise to a set of candidate parameters.

Today, we are able to work with much larger data sets and we do

not have to do any calculations by hand. Instead, we are equipped

with powerful algorithms for the parameter optimization. One way

to optimize the parameters is to maximize the probability of the given

data set over the choice of the parameters. As a function of the model

parameters, this probability is called the likelihood function. The

problem of maximizing the likelihood is called the parameter estima-

tion problem. The Expectation-Maximization (EM) algorithm intro-

duced in [Dempster et al., 1977] is a general scheme for finding maxi-

mum likelihood solutions for this parameter estimation problem. It is

used when the observed data Xcan be seen as incomplete and when

the parameter estimation problem given the corresponding complete

data (X,Z)is easy to solve. Starting with an initial set of model pa-

rameters, the algorithm repeatedly performs two steps to compute a

new set of parameters. The first step derives an optimal distribution

for the hidden values. The second step computes the new set of pa-

rameters by maximizing the expectation (over the hidden values) of

the complete-data likelihood.

Applications of the EM algorithm to special cases were already

known before the method was formulated in its general form in

[Dempster et al., 1977]. For example, the well known Baum-Welch

algorithm (see [Baum et al., 1970]) for the computation of parameters

for hidden Markov models is an instantiation of a generalized variant

of the EM algorithm. For exponential families, the method was stud-

ied in [Sundberg, 1974]. More examples can be found in [Dempster

et al., 1977].

1.2.1related work

During the decades since its first presentation, a lot of work has been

done to improve the EM algorithm. Most of these improvements deal

with one or more of three major drawbacks of the algorithm. First,

the convergence of the EM algorithm can be very slow. For a discus-

sion of several speed-up techniques see Chapter 4in [McLachlan and

Krishnan, 2008]. Second, the EM algorithm is prone to converge only

to a local maximum, or even worse, to a saddle point of the likelihood

function [Wu, 1983, Dias and Wedel, 2004]. One way to deal with this

problem is to compute ’good’ initial estimates [Biernacki et al., 2003]

1.2 parameter estimation

or to restart the algorithm with several different initializations in the

hope that one iteration leads to a satisfying solution. Finally, in some

cases the EM algorithm is not applicable at all since the maximiza-

tion step is computationally intractable. To overcome this problem,

the maximization step has to be substantially simplified.

There exist a number of stochastic variants of the EM algorithm

that try to deal with these drawbacks, including the Stochastic EM

or SEM algorithm [Celeux and Diebolt, 1985], the Stochastic Approx-

imation EM or SAEM algorithm [Celeux and Diebolt, 1992], and the

Monte Carlo EM or MCEM algorithm [Wei and Tanner, 1990]. The

MCEM algorithm replaces the distribution over the hidden values

from the first step of the EM algorithm by an empirical Monte Carlo

approximation. Thus, for continuous random variables Z, it replaces

analytic by numerical computation. Furthermore, due to the inherent

randomness, the algorithm is capable of escaping from a saddle point

or from a local maximum of the likelihood function.

The SEM algorithm is even simpler than the MCEM algorithm. In-

stead of maximizing the expectation as done in the second step of the

EM algorithm, it uses the distribution from the first step to simply

guess the missing values. That is, the algorithm samples an assign-

ment for the hidden values Z. Afterwards, it maximizes the complete-

data likelihood only for that fixed assignment. This considerably re-

duces the computational complexity and may lead to the applicability

of the EM scheme in the first place. The SEM algorithm can be seen

as a special case of the MCEM algorithm where only one sample is

drawn to approximate the distribution over the hidden values. It is

known that for particular classes of statistical models, the sequence of

models generated by the SEM iterations is an ergodic Markov chain

converging weakly to a stationary distribution over models. Further-

more, it is known that under appropriate assumptions and for the

number of input points going to infinity, the mean of this stationary

distribution converges to the maximum likelihood estimate [Ip, 1994].

A suitable application of the SEM algorithm is the parameter es-

timation problem for mixture distributions. In this model, the data

set consists of independent draws from a mixture distribution and the

hidden values are usually assumed to be the assignments of each data

element to the component distribution it was generated by. That is,

for each data element, the SEM algorithm guesses which component

distribution it was generated by. Obviously, if the number of data ele-

ments is small, then each wrong assignment has a large impact. And

indeed, a limitation of the SEM algorithm is that it does not work

for very small data sets. Therefore, the authors of the original SEM

work introduced the SAEM algorithm later. The SAEM algorithm

is a hybrid algorithm that mixes the computations of the SEM algo-

rithm with the computations of the classical EM algorithm. At the

beginning it works like a pure SEM algorithm and then it gradually

introduction

transforms until it finally does pure EM computations. For a compar-

ison of all three mentioned stochastic EM variants see [Celeux et al.,

1995].

1.2.2our contribution

As already mentioned in the previous section, the original line of

work on the SEM algorithm focused on stochastic properties of the

sequence of models generated by the SEM algorithm. However, to

approximate the above mentioned mean that converges to the max-

imum likelihood estimate, we would need to average over a large

number of runs of the SEM algorithm. In this thesis, we present

a new analysis of the SEM algorithm for Gaussian mixture models.

Unlike the previous work, our analysis studies only a single run of

the algorithm. Furthermore, in addition to our theoretical results, we

also provide experimental results for various artificial and real world

data sets.

Since practitioners are often satisfied with the solutions computed

by the classical EM algorithm, we analyze the SEM algorithm in re-

lation to the EM algorithm. More precisely, we examine the parame-

ter updates computed in an arbitrary iteration of the SEM algorithm

and compare them to the updates computed by the EM algorithm

for the same previous parameter estimate. We show that with high

probability the updates of the EM and SEM algorithm are almost the

same, given that the input data set is sufficiently large. Our results

are stated as three separate theorems. The first theorem gives a high

probability bound on the difference of the weight updates. This re-

sult does not only hold for Gaussian mixture models, but also for

arbitrary mixture models. The second theorem gives a high probabil-

ity bound on the distance between the means computed by the EM

and SEM iteration. Finally, the third theorem does the same for the

covariance updates.

Since the computed parameters generally differ from each other af-

ter a single round of the two algorithms, our analysis does not work

for a sequence of several EM and SEM iterations. However, our exper-

iments confirm that the similarity of the parameters mostly persists

even after a large number of successive update steps. Moreover, we

explain why the SEM algorithm is considerably faster than the classi-

cal EM algorithm. In particular, we show that the probabilistic variant

establishes a constant factor speedup for Gaussian mixture models.

Our theoretical and experimental analysis, presented in Chapter 6

and Chapter 7, was also published in [Blömer et al., 2013].

1.3 outline of the thesis

1.3outline of the thesis

This thesis is organized into two parts. In Part I we turn towards clus-

ter analysis. After giving a short introduction to clustering in Chap-

ter 2, we present the first analysis of a classical clustering algorithm

in Chapter 3. Part II is about statistical models and the parameter es-

timation problem. The basic concepts needed later are introduced

in Chapter 4. In Chapter 5, we discuss the classical Expectation-

Maximization algorithm. In Chapter 6, we analyze a probabilistic

variant of the EM algorithm. Finally, in Chapter 7, we present an

experimental comparison of both algorithms.

On some pages you will find cross-references and additional re-

marks written in the margins that may be useful to understand the

main text. Furthermore, some bibliographic references are only men-

tioned in these marginal notes.

Part I

ON CLUSTER ANALYSIS

The first part of this thesis is about cluster analysis. We

focus on so-called distance-based hard clustering. Using

this approach, the distances between the elements of a

given data set are used to compute a partition of the data

into subsets of elements that are close to each other.

The main result of Part I is the first analysis of the qual-

ity of agglomerative complete linkage clustering for the

diameter k-clustering problem.

CLUSTERING BASICS

Simply put, clustering or cluster analysis is the process of organizing a

given set of items into groups (called clusters) of similar items. The

term is used for the entire number of techniques to solve this task as

well as for the task itself. Furthermore, the term clustering is also used

for a resulting set of clusters. If a clustering consists of kclusters, it

is called a k-clustering.

Ideally, all elements of a particular cluster are similar while ele-

ments from different clusters are dissimilar. It is evident that the no-

tion of similarity or dissimilarity is a crucial ingredient for the formu-

lation of clustering problems. As a consequence, the different ways

to model a cluster are as numerous as the number of clustering appli-

cations. There is a broad range of (possibly overlapping) approaches,

including graph-based clustering [Schaeffer, 2007, Fortunato, 2010],

density-based clustering [Kriegel et al., 2011a], distribution-based clus-

tering (cf. Chapter 4), and distance-based clustering (see Section 2.1).

Another key differentiator between common clustering applications

is the certainty of membership in a cluster. The two main categories

are hard and soft clustering. In case of soft clustering, an element of the

input data set may be assigned to multiple clusters at the same time

and in addition this assignment can be weighted. For example, using

a distribution-based clustering approach, we may want to model the

input data by a mixture distribution (cf. Chapter 4.2). Then, we can

identify the mixture components with the clusters and assign the ele-

ments of the input data set to each cluster with a weight proportional

to the probability of the element in the corresponding mixture com-

ponent. In case of hard clustering however, each element is assigned

to exactly one cluster. That is, a hard clustering is actually a partition

of the given data set.

An important special case of hard clustering is the so-called hierar-

chical clustering. We call a collection of k-clusterings of the same finite

data set X, but for different values of k, hierarchical, if it fulfills the

following two properties. First, for any k, the collection contains at

most one k-clustering. Second, for any two of its clusterings Ci,Cj

with |Ci|=i < j =|Cj|, every cluster in Ciis the union of one or more

clusters from Cj. A hierarchical collection of clusterings is called a

hierarchical clustering.

clustering basics

In the following, we introduce the concepts of clustering that we

need to apply our analysis of agglomerative clustering in Chapter 3.

In particular, we consider distance-based hard clustering only. For a

thorough introduction to clustering see for example [Tan et al., 2005].

2.1the notion of similarity

The distance-based clustering approach establishes a notion of dis-

similarity based on the distances between the elements of the input

data. Therefore, we start our formal introduction to cluster analysis

with the definition of distance measures. We use the term distance mea-

sure to differentiate from the distance function of a metric since we

neither require symmetry nor the triangle inequality.

Definition 2.1.Let Xbe an arbitrary domain. A function D:X×X→R

is called a distance measure, if the following conditions are satisfied for all

x,y∈X:

1.D(x,y)⩾0(non-negativity)

2.D(x,y) = 0⇔x=y(identity of indiscernibles)

Cluster analysis is established for a wide variety of distance mea-

sures. Although it is very common to embed the input data into a

metric space like the d-dimensional Euclidean vector space Rd, some

widely used distance measures do not form a metric. For example,

the squared Euclidean distance and the Mahalanobis distance both are no

metrics since they do not fulfill the triangle inequality. Other distance

measures as for example the Kullback-Leibler divergence or the Itakura-

Saito divergence (both defined for X=Rd

⩾0) are not even symmetric.

For a discussion of clustering algorithms for non-metric distance mea-

sures see [Ackermann, 2009]. For the remainder of this chapter, we

consider metric distance measures only.

Definition 2.2.Let Xbe an arbitrary domain. A function D:X×X→R

is called a metric, if Dis a distance measure and in addition, the following

conditions are satisfied for all x,y,z∈X:

3.D(x,y) = D(y,x)(symmetry)

4.D(x,y)⩽D(x,z) + D(z,y)(triangle inequality)

2.2clustering quality

To distinguish good clusterings from bad clusterings we need to in-

troduce a measure of quality. It is obvious that the choice of the

right measure highly depends on the particular clustering application.

Generally, the desired properties of the clustering are transformed to

a cost function that maps clusterings to real numbers. Then, the goal

2.2 clustering quality

b b

(a) The diameter.

b b

(b) The radius.

b b

Figure 1: Different measurements for the same cluster.

of the clustering process is to minimize the cost of the resulting clus-

tering.

Before introducing three popular hard clustering cost functions, we

give a formal definition of a k-clustering.

Definition 2.3.Let Xbe an arbitrary domain, k∈Nand X⊂Xa finite

set with |X|⩾k. Then, a k-clustering of Xis a partition Ck={C1,. . . ,Ck}

of Xinto knon-empty subsets. The subsets C1,. . . ,Ckare called clusters.

The first cost function we consider is the diameter cost of a cluster-

ing which is simply the largest diameter among its clusters.

Definition 2.4.Let Xbe an arbitrary domain with a distance measure D.

Furthermore, let X⊆Xbe a finite input data set and C={C1,. . . ,Ck}a

k-clustering of X.

Then, the diameter cost of Cis defined as

costdiam(C):= max

C∈Cdiam(C),

where diam(C) := maxx,y∈CD(x,y)denotes the diameter of C.cf. Figure 1a

The diameter cost only depends on the distances between elements

of the input data set. For the definition of the radius cost, we addi-

clustering basics

tionally introduce so-called cluster centers. These centers come from

the same domain that the input data set is embedded into.

Definition 2.5.Let Xbe an arbitrary domain with a distance measure D.

Furthermore, let X⊆Xbe a finite input data set and C={C1,. . . ,Ck}a

k-clustering of X.

Then, the radius cost of Cis defined as

costrad(C):= max

C∈Crad(C),

where rad(C) := miny∈Xmaxx∈CD(x,y)denotes the radius of C.cf. Figure 1b

The discrete radius cost is a special case of the radius cost where

the centers have to be elements of the input data set.

Definition 2.6.Let Xbe an arbitrary domain with a distance measure D.

Furthermore, let X⊆Xbe a finite input data set and C={C1,. . . ,Ck}a

k-clustering of X.

Then, the discrete radius cost of Cis defined as

costdrad(C):= max

C∈Cdrad(C),

where drad(C) := miny∈Cmaxx∈CD(x,y)denotes the discrete radius ofcf. Figure 1c

That is, if we choose X=Xthe radius cost coincides with the

discrete radius cost.

2.3problem definitions

Based on the three cost functions introduced in Section 2.2we de-

fine the corresponding clustering problems. All three problems are

discussed in Chapter 3.

Problem 2.7(diameter k-clustering).Given k∈Nand a finite set X⊂

Rdwith |X|⩾k, find a k-clustering Ckof Xwith minimal diameter cost.

Problem 2.8(k-center).Given k∈Nand a finite set X⊂Rdwith |X|⩾k,

find a k-clustering Ckof Xwith minimal radius cost.

Problem 2.9(discrete k-center).Given k∈Nand a finite set X⊂Rd

with |X|⩾k, find a k-clustering Ckof Xwith minimal discrete radius cost.

ANALYSIS OF AGGLOMERATIVE CLUSTERING

In this chapter we analyze the agglomerative clustering algorithms

for the (discrete) k-center problem and the diameter k-clustering prob-

lem. As mentioned in the introduction, an agglomerative algorithm

takes a bottom-up approach. It starts with the |X|-clustering that con-

tains one cluster for each element of a given data set and then suc-

cessively merges two of the remaining clusters such that the cost of

the resulting clustering is minimized. That is, in each merge step

the agglomerative algorithms for Problem 2.7, Problem 2.8and Prob-

lem 2.9minimize the diameter, the radius and the discrete radius of

the resulting cluster, respectively.

Our main objective is the agglomerative complete linkage cluster-

ing algorithm, which minimizes the diameter in every step. Never-

theless, we start with the analysis of the agglomerative algorithm for

the discrete k-center problem since it is the simplest one of the three.

Then we adapt our analysis to the k-center problem and finally to

the diameter k-clustering problem. In each case we need to introduce

further techniques to deal with the increased complexity of the given

problem.

We show that all three algorithms compute an O(logk)-approxima-

tion for the particular corresponding clustering problem. However,

the dependency on the dimension which is hidden in the O-notation

ranges from only linear and additive in case of the discrete k-center

problem to a factor that is doubly exponential in case of the diameter

k-clustering problem. As mentioned in the introduction, the cost of

optimal solutions to the three problems are within a factor of 2from

each other. That is, each algorithm computes an O(logk)-approxi-

mation for all three problems. However, we will analyze the proper

agglomerative algorithm for each problem.

Strictly speaking, the clusterings computed by the agglomerative

algorithms stated as Algorithm 3.1,3.2and 3.3are not uniquely de-

termined, since the minimization step may be ambiguous. However,

all our results hold for any particular tie-breaking strategy (even non-

deterministic ones). To keep the discussion simple, we ignore such

subtleties unless they are crucial.

analysis of agglomerative clustering

3.1preliminaries

Throughout this chapter, we confine ourselves to metric distance mea-

sures that are defined on Rdfor some d∈N. That is, we consider

input data sets that are finite subsets of Rd. Our results hold for arbi-

trary metrics that are based on a norm, i. e., we consider only distance

measures of the form

D(x,y) = ∥x−y∥

for an arbitrary norm ∥·∥. Readers who are not familiar with arbi-The figures in this

chapter assume the

Euclidean metric. trary metrics or are only interested in the Euclidean case, may assume

that the Euclidean norm ∥·∥2is used, i. e.,

∥x−y∥=v

∑

i=1

(xi−yi)2.

Furthermore, for r∈Rand y∈Rd, we denote the closed d-dimen-

sional ball of radius rcentered at yby

r(y) := {x|∥x−y∥⩽r}.

For our analysis of agglomerative clustering, we repeatedly use the

volume argument stated in Lemma 3.2. This argument provides an

upper bound on the minimum distance between two points from a

finite set of points lying inside the union of finitely many balls. For

the application of this argument, the following definition is crucial.

Definition 3.1.Let k∈Nand r∈R. Then, a set X⊂Rdis called

(k,r)-coverable, if there exist y1,...,yk∈Rdwith

X⊆

∪

i=1

r(yi).

Lemma 3.2.Let k∈N,r∈Rand P⊂Rdbe finite and (k,r)-coverable

with |P|> k. Then, there exist distinct p,q∈Psuch that

∥p−q∥⩽4r d

√k

|P|.

Proof. Let Z⊂Rdwith |Z|=kand P⊂∪z∈ZBd

r(z). We define δto be

the minimum distance between two points of P, i. e.,

δ:= min

p,q∈P

p=q∥p−q∥.

That is, we have to show that δ⩽4r d

√k

|P|. We assume for contra-

diction that u:= 4r d

√k

|P|< δ. Since |P|> k there exists z∈Zwith

3.1 preliminaries

r+u

Figure 2: The volume argument.

Bd

r(z)∩P⩾2. It follows that δ⩽2r and hence, u

2< r. Note that for

any y∈Rd,τ∈R, and any norm ∥·∥,

Vol(Bd

τ(y))=τd·Vd,

where Vdis the volume of the d-dimensional unit ball Bd

1(0)(see

Corollary 6.2.15 in [Webster, 1994]). Therefore,

Vol(∪

z∈Z

r+u

/2(z))<∑

z∈Z

Vol(Bd

2r(z))=k(2r)d·Vd.

Furthermore, since any p∈Pis contained in a ball Bd

r(z)for some

z∈Z, we conclude that any ball Bd

/2(p)for p∈Pis contained in a

ball Bd

r+u

/2(z)for some z∈Z(cf. Figure 2). Thus,

Vol

∪

p∈P

/2(p)

⩽Vol(∪

z∈Z

r+u

/2(z))< k(2r)d·Vd. (1)

Since u < δ, for any distinct p,q∈P, we conclude that

/2(p)∩Bd

/2(q) = ∅.

Therefore, using the definition of u, the total volume of the |P|balls

/2(p)is given by

Vol

∪

p∈P

/2(p)

=|P|(u

2)dVd=k(2r)d·Vd.

This however contradicts Inequality (1) and we obtain δ⩽u, which

proves the lemma.

Furthermore, we need the following covering lemma that is a corol-

lary of Theorem 1.2in [Naszódi, 2010].

Lemma 3.3.Let λ,τ∈Rwith λ,τ > 0. Then, for arbitrary metrics that are

based on a norm, a ball of radius τcan be covered by ⌈(3

λ)d⌉balls of radius

λτ.

analysis of agglomerative clustering

Algorithm 3.1:AgglomerativeDiscreteRadius(X)

Input: finite set of input points X⊂Rd

C|X|←− {{x} | x∈X}

for i=|X|−1,. . . ,1do

find distinct clusters A,B∈Ci+1minimizing drad(A∪B)

Ci←− (Ci+1\ {A,B})∪{A∪B}

end

return C|X|,...,C1

3.2discrete k-center clustering

The agglomerative algorithm for the discrete k-center problem is sta-

ted as Algorithm 3.1. Given a finite set of input points X⊂Rd,

the algorithm computes hierarchical k-clusterings for all values of

kbetween 1and |X|. We denote them by C|X|,...,C1. Throughout

this section, cost always means discrete radius cost and optkrefers to

the cost of an optimal discrete k-center clustering of X⊂Rdwhere

k∈Nwith k⩽|X|. That is, optkdenotes the cost of an optimal

solution to Problem 2.9. Since any cluster Cis contained in a ball of

radius drad(C), we have that Xis (k,costdrad(Ck))-coverable for any

k-clustering Ckof X. It follows that Xis (k,optk)-coverable. This

fact, as well as the following observation about the greedy strategy of

Algorithm 3.1, will be used frequently in our analysis.

Observation 3.4.The cost of all computed clusterings is equal to the dis-

crete radius of the cluster created last. Furthermore, the discrete radius of

the union of any two clusters is always an upper bound for the cost of the

clustering to be computed next.

The following theorem states our result for the discrete k-center

problem.

Theorem 3.5.Let X⊂Rdbe a finite set of points. Then, for arbitrary

metrics that are based on a norm and for all k∈Nwith k⩽|X|, the

partition Ckof Xinto kclusters as computed by Algorithm 3.1satisfies

costdrad(Ck)<(20d +2log2(k) + 2)·optk,

where optkdenotes the cost of an optimal solution to Problem 2.9.

We prove Theorem 3.5in two steps. First, Proposition 3.6in Sec-

tion 3.2.1provides an upper bound to the cost of the intermediate

2k-clustering. This upper bound is independent of kand |X|, only lin-

ear in dand may be of independent interest. In its proof, we use the

volume argument from Lemma 3.2to bound the distance between the

centers of pairs of remaining clusters.

Second, in Section 3.2.2, we analyze the remaining kmerge steps of

Algorithm 3.1down to the computation of the k-clustering. There, we

3.2 discrete k-center clustering

no longer need to apply Lemma 3.2to bound the distance between

two cluster centers. Instead, the volume argument is replaced by a

very simple bound that is already sufficient to obtain our result.

3.2.1analysis of the 2k-clustering

Proposition 3.6.Let X⊂Rdbe finite. Then, for all k∈Nwith 2k ⩽|X|,

the partition C2k of Xinto 2k clusters as computed by Algorithm 3.1satisfies

costdrad(C2k)< 20d ·optk,

where optkdenotes the cost of an optimal solution to Problem 2.9.

To prove Proposition 3.6, we divide the merge steps of Algorithm 3.1

into phases, each reducing the number of remaining clusters by one

fourth. The following lemma bounds the increase of the cost during

a single phase by an additive term.

Lemma 3.7.Let m∈Nwith 2k < m ⩽|X|. Then,

costdrad(C⌊3m

4⌋)⩽costdrad(Cm)+4d

√2k

m·optk.

Proof. Let t:= ⌊3m

4⌋. Then, Cm∩Ct+1is the set of clusters from Cm

that still exist ⌈m

4⌉−1 < m

4merge steps after the computation of Cm.

In each iteration of its loop, the algorithm can merge at most two

clusters from Cm. Thus,

|Cm∩Ct+1|> m −2m

4=m

Let τ:= costdrad(Cm). From every cluster C∈Cm, we fix a center

pC∈Cwith C⊂Bd

τ(pC)and define P:= {pCC∈Cm∩Ct+1}. Then,

|P|=|Cm∩Ct+1|>m

2> k.

Since Xis (k,optk)-coverable, so is P⊂X. Then, by Lemma 3.2, there

exist distinct A,B∈Cm∩Ct+1such that

∥pA−pB∥⩽4d

√2k

m·optk.

Therefore, the distance from pAto any q∈Bis at most 4d

√2k

m·optk+τ.

We conclude that merging Aand Bwould result in a cluster whose

discrete radius can be upper bounded by

cf. Figure 3

drad(A∪B)⩽costdrad(Cm)+4d

√2k

m·optk.

The result follows from A,B∈Ct+1and Observation 3.4.

To prove Proposition 3.6, we apply Lemma 3.7for ⌈log4

|X|

2k ⌉con-

secutive phases.

analysis of agglomerative clustering

costdrad(Cm)

Figure 3: drad(A∪B)<costdrad(Cm)+∥pA−pB∥.

Proof of Proposition 3.6.Let u:= ⌈log4

|X|

2k ⌉and for i=0,. . . ,u, define

mi:= ⌈(3

4)i|X|⌉.

Then, mu⩽2k and mi> 2k for i=0,...,u−1. Since

⌊3mi

4⌋=⌊3

4⌈(3

4)i|X|⌉⌋⩽⌊(3

4)i+1|X|+3

4⌋⩽⌈(3

4)i+1|X|⌉=mi+1

and Algorithm 3.1uses a greedy strategy,

costdrad(Cmi+1)⩽costdrad(C⌊3mi

4⌋)

for i=0,...,u−1. Combining this inequality with Lemma 3.7(ap-

plied to m=mi), we obtain

costdrad(Cmi+1)⩽costdrad(Cmi)+4d

√2k

mi·optk.

By repeatedly applying this inequality for i=0,...,u−1and using

costdrad(C2k)⩽costdrad(Cmu)and costdrad(Cm0)=0, we deduce

costdrad(C2k)⩽

u−1

∑

i=0(4d

√2k

mi·optk)

⩽4d

√2k

|X|·

u−1

∑

i=0

√(4

3)i

·optk.

Solving the geometric series and using u−1 < log4

|X|

2k leads to

costdrad(C2k)⩽4d

√2k

|X|·

√(4

3)u−1

√4

3−1·optk<

√4

3−1·optk. (2)

By taking only the first two terms of the series expansion of the ex-

ponential function, we get d

√4

3=eln 4

d> 1 +ln 4

d. Substituting this

bound into Inequality (2) yields

costdrad(C2k)<

√4

ln 4

d·optk< 20d ·optk.

3.2 discrete k-center clustering

3.2.2analysis of the remaining merge steps

The analysis of the remaining merge steps introduces the O(logk)

term to the approximation factor of our result. It is similar to the

analysis used in the proof of Proposition 3.6. Again, we divide the

merge steps into phases. This time however, one phase consists of

one half of the remaining merge steps. Furthermore, we replace the

volume argument from Lemma 3.2with a simpler bound. This bound

results from the observation that, as long as there are more than k

clusters left, we are able to find a pair of clusters whose centers lie

in the same cluster of an optimal k-clustering. That is, the distance

between these centers is at most two times the discrete radius of the

common cluster. The following lemma bounds the increase of the

cost during a single phase.

Lemma 3.8.Let m∈Nwith k < m ⩽|X|. Then,

costdrad(C⌊m+k

2⌋)⩽costdrad(Cm)+2optk.

Proof. Let t:= ⌊m+k

2⌋. Then, Cm∩Ct+1is the set of clusters from Cm

that still exist ⌈m−k

2⌉−1 < m−k

2merge steps after the computation of

Cm. In each iteration of its loop, the algorithm can merge at most two

clusters from Cm. Thus,

|Cm∩Ct+1|> m −2m−k

2=k.

Let τ:= costdrad(Cm). From every cluster C∈Cm, we fix a center

pC∈Cwith C⊂Bd

τ(pC)and define P:= {pCC∈Cm∩Ct+1}. Then,

|P|=|Cm∩Ct+1|> k.

Since Xis (k,optk)-coverable, so is P⊂X. It follows that there exist

distinct A,B∈Cm∩Ct+1such that pAand pBare contained in the

same ball of radius optk, i.e.,

∥pA−pB∥⩽2optk.

Then, the distance from pAto any q∈Bis at most 2optk+τ. We con-

clude that merging Aand Bwould result in a cluster whose discrete

radius can be upper bounded by

cf. Figure 3

drad(A∪B)⩽costdrad(Cm)+2optk.

The result follows using A,B∈Ct+1and Observation 3.4.

To prove Theorem 3.5, we apply Lemma 3.8for about logkconsec-

utive phases.

analysis of agglomerative clustering

Algorithm 3.2:AgglomerativeRadius(X)

Input: finite set of input points X⊂Rd

C|X|←− {{x} | x∈X}

for i=|X|−1,. . . ,1do

find distinct clusters A,B∈Ci+1minimizing rad(A∪B)

Ci←− (Ci+1\ {A,B})∪{A∪B}

end

return C|X|,...,C1

Proof of Theorem 3.5.Let u:= ⌊log2(k)⌋+1. Then,

log2k < u ⩽log2(k) + 1.

Furthermore, for i=0,...,uwe define

mi:= k+⌊(1

2)ik⌋.

Then, mu=kand mi> k for i=0,...,u−1. Since

⌊mi+k

2⌋=⌊1

2(2k +⌊(1

2)ik⌋)⌋=k+⌊1

2⌊(1

2)ik⌋⌋

⩽k+⌊(1

2)i+1k⌋=mi+1

and Algorithm 3.1uses a greedy strategy,

costdrad(Cmi+1)⩽costdrad(C⌊mi+k

2⌋)

for i=0,...,u−1. Combining this inequality with Lemma 3.8(ap-

plied to m=mi), we obtain

costdiam(Cmi+1)⩽costdrad(Cmi)+2optk.

By repeatedly applying this inequality for i=0,...,u−1and using

m0=2k and mu=k, we deduce

costdrad(Ck)⩽costdrad(C2k)+

u−1

∑

i=0

2optk

⩽costdrad(C2k)+ (2log2(k) + 2)·optk.

Hence, the result follows using Proposition 3.6.

3.3k-center clustering

The agglomerative algorithm for the k-center problem is stated as Al-

gorithm 3.2. The only difference to Algorithm 3.1is the minimization

of the radius instead of the discrete radius in the minimization step.

In the following, cost always means radius cost and optkrefers to the

3.3k-center clustering

cost of an optimal k-center clustering of X⊂Rdwhere k∈Nwith

k⩽|X|. That is, optkdenotes the cost of an optimal solution to Prob-

lem 2.8. Analogously to the discrete case, any cluster Cis contained

in a ball of radius drad(C)and thus, the set Xis (k,optk)-coverable.

Observation 3.9.The cost of all computed clusterings is equal to the ra- This is analogous to

Observation 3.4.

dius of the cluster created last. Furthermore, the radius of the union of any

two clusters is always an upper bound for the cost of the clustering to be

computed next.

The following theorem states our result for the k-center problem.

Theorem 3.10.Let X⊂Rdbe a finite set of points. Then, for arbitrary

metrics that are based on a norm and for all k∈Nwith k⩽|X|, the partition

Ckof Xinto kclusters as computed by Algorithm 3.2satisfies

costrad(Ck)=O(logk)·optk,

where optkdenotes the cost of an optimal solution to Problem 2.8, and the

constant hidden in the O-notation is singly exponential in the dimension d.

As in the proof of Theorem 3.5, we first bound the cost of the inter-

mediate 2k-clustering. However, we have to apply a different analysis.

As a consequence, the dependency on the dimension increases from

linear and additive to a singly exponential factor.

3.3.1analysis of the 2k-clustering

Proposition 3.11.Let X⊂Rdbe finite. Then, for all k∈Nwith 2k ⩽|X|,

the partition C2k of Xinto 2k clusters as computed by Algorithm 3.2satisfies

costrad(C2k)< 24d ·e24d ·optk,

where optkdenotes the cost of an optimal solution to Problem 2.8.

Just as in the analysis of Algorithm 3.1, we divide the merge steps

of Algorithm 3.2into phases, such that in each phase the number of

remaining clusters is reduced by one fourth. Like in the discrete case,

the input points are (k,optk)-coverable. However, centers correspond-

ing to an intermediate solution computed by Algorithm 3.2need not

be covered by the kballs induced by an optimal solution. As a conse-

quence, we are no longer able to apply Lemma 3.2on the centers as

in the discrete case.

To bound the increase of the cost during a single phase, we cover

the remaining clusters at the beginning of a phase by a set of over-

lapping balls. Each of the clusters is completely contained in one of

these balls that all have the same radius. Furthermore, the number of

remaining clusters will be at least twice the number of these balls. It

follows that there are many pairs of clusters that are contained in the

analysis of agglomerative clustering

optk

optk+costrad(Cm)

costrad(Cm)

Figure 4: Intermediate centers.

same ball. Then, as long as the existence of at least one such pair can

be guaranteed, the radius of the cluster created next can be bounded

by the radius of the covering balls. The following lemma will be used

to bound the increase of the cost during a single phase.

Lemma 3.12.Let m∈Nwith 2k < m ⩽|X|. Then,

costrad(C⌊3m

4⌋)<(1+6d

√2k

m)costrad(Cm)+6d

√2k

moptk.

Proof. Let P={P1,...,Pk}be an optimal k-center clustering of X. For

each i∈{1,...,k}we fix a point yi∈Rdsuch that

Pi⊂Boptk(yi).

Furthermore, for any C∈Cmlet zC∈Rdsuch that

C⊂Bcostrad(Cm)(zC).

With τ:= optk+costrad(Cm)it follows that each zCis contained in at

least one of the balls Bτ(yi)for i∈{1,...,k}as illustrated in Figure 4.

Applying Lemma 3.3with λ=3

√⌊m

2k ⌋yields that each of the balls

Bτ(yi)for i=1,...,kcan be covered by

ℓ:= ⌊m

2k⌋⩽m

balls of radius ε:= 3τ

√⌊m

2k ⌋. Therefore, there exist

kℓ ⩽m

centers v1,. . . ,vkℓ ∈Rdsuch that each zCfor C∈Cmis contained in

at least one of the balls Bε(v1),...,Bε(vkℓ). For i=1,...,kℓ, we define

Vi:= Bcostrad(Cm)+ε(vi).

Then, any cluster C∈Cmis contained in at least one of the balls

V1,. . . ,Vkℓ (cf. Figure 5).

3.3k-center clustering

viε

costrad(Cm)+ε

zC1

costrad(Cm)

zC2

Figure 5: Covering centers and clusters.

As in the proof of Lemma 3.7we define t:= ⌊3m

4⌋and deduce |Cm∩

Ct+1|>m

2. Since kℓ ⩽m

2, there exist two clusters A,B∈Cm∩Ct+1

that are contained in the same ball Viwith i∈{1,...,kℓ}. Therefore,

merging clusters Aand Bwould result in a cluster whose radius can

be upper bounded by

cf. Figure 5

rad(A∪B)⩽costrad(Cm)+ε.

Using Observation 3.9and the fact that Aand Bare part of the clus-

tering Ct+1, we can upper bound the cost of Ctby

costrad(Ct)⩽costrad(Cm)+ε.

It remains to show

ε < 6 d

√2k

m(optk+costrad(Cm)).

However, since m

2k > 1, it follows m

2k < 2 ⌊m

2k ⌋. Thus,

√m

2k <d

√2⌊m

2k⌋⩽2d

√⌊m

2k⌋

and we conclude

√⌊m

2k ⌋

< 6 d

√2k

To prove Proposition 3.11, we apply Lemma 3.12 for ⌈log4

|X|

2k ⌉con-

secutive phases.

Proof of Proposition 3.11.Analogously to the proof of Proposition 3.6,

we define u:= ⌈log4

|X|

2k ⌉and mi:= ⌈(3

4)i|X|⌉for i=0,...,u. Then,

mu⩽2k and for i=0,...,u−1,mi> 2k and ⌊3mi

4⌋⩽mi+1. Using

Lemma 3.12, for i=0,...,u−1we deduce

costrad(Cmi+1)<(1+6d

√2k

mi)costrad(Cmi)+6d

√2k

optk.

analysis of agglomerative clustering

By repeatedly applying this inequality for i=0,...,u−1and using

costrad(C2k)⩽costrad(Cmu)and costrad(Cm0)=0, we get

costrad(C2k)<

u−1

∑

i=0

6d

√2k

u−1

∏

j=i+1(1+6d

√2k

mj)

optk.

Using mi⩾(3

4)i|X|, we obtain

costrad(C2k)

⩽6optk

√2k

|X|·

u−1

∑

i=0

(4

3)i

du−1

∏

j=i+1(1+6d

√2k

|X|(4

3)j

d)

.

Substituting u−1−ifor iyields

costrad(C2k)

=6optk

√2k

|X|(4

3)u−1

u−1

∑

i=0

(3

4)i

du−1

∏

j=u−i(1+6d

√2k

|X|(4

3)u−1

d(3

4)u−1−j

d)

.

Using u−1 < log4

|X|

2k , we deduce

costrad(C2k)< 6 optk·

u−1

∑

i=0

(3

4)i

di−1

∏

j=0(1+6(3

4)j

d)

. (3)

By taking only the first two terms of the series expansion of the expo-

nential function, we get 1+6(3

4)j

d< e6(3

4)j

dand therefore,

i−1

∏

j=0(1+6(3

4)j

d)<

i−1

∏

j=0(e6(3

4)j

d)=e6∑i−1

j=0(3

4)j

d. (4)

The sum in the exponent can be bounded by the infinite geometric

series

∞

∑

j=0(3

4)j

1−(3

4)1

⩽4d. (5)

Here, the last inequality follows by upper bounding the convex func-

tion f(x) = (3

4)xin the interval [0,1]by the line through f(0) = 1

and f(1) = 3

4, i.e., (3

4)x⩽1−x

4. Putting Inequalities (3), (4) and (5)

together then gives

costrad(C2k)< 6 optk·

u−1

∑

i=0((3

4)i

e24d)< 24d ·e24d ·optk,

where the last inequality follows by using Inequality (5) again.

3.3k-center clustering

3.3.2connected subsets

The analysis of the remaining merge steps from the discrete k-center

case (cf. Section 3.2.2) is not transferable to the k-center case. Again,

as in the proof of Proposition 3.11, we are no longer able to derive

a simple additive bound on the increase of the cost when merging

two clusters. In order to preserve the logarithmic dependency of the

approximation factor on k, we show that it is sufficient to analyze

Algorithm 3.2on a subset Y⊆Xsatisfying a certain connectivity

property. Using this property, we are able to apply a combinatorial

approach that relies on the number of merge steps left.

We start by defining the connectivity property that will be used to

relate clusters to an optimal k-clustering.

Definition 3.13.Let Z⊆Rdand r∈R. Two sets A,B⊆Rdare called

(Z,r)-connected if there exists a point z∈Zwith

r(z)∩A=∅and Bd

r(z)∩B=∅.

Note that for any two (Z,r)-connected clusters A,B, we have

rad(A∪B)⩽rad(A) + 2rad(B) + 2r. (6)

To see this, let yRdwith A⊂Bd

rad(A)(y),xa∈Bd

r(z)∩Aand xb∈

r(z)∩B. Then, the distance between yand xais at most radA, the

distance between xaand xbis at most 2r and the distance between xb

and any point x∈Bis at most 2rad(B).

Next, we show that for any data set Xwe can bound the cost of the

k-clustering computed by Algorithm 3.2by the cost of the ℓ-clustering

computed on a connected subset Y⊆Xfor a proper ℓ⩽k.

Lemma 3.14.Let X⊂Rdbe finite and k∈Nwith k⩽|X|. Then, there

exists a subset Y⊆X, a number ℓ∈Nwith ℓ⩽kand ℓ⩽|Y|, and a set

Z⊂Rdwith |Z|=ℓsuch that:

1.Yis (ℓ,optk)-coverable;

2.costrad(Ck)⩽costrad(Pℓ);

3. For all n∈Nwith ℓ < n ⩽|Y|, every cluster in Pnis (Z,optk)-

connected to another cluster in Pn.

Here, the collection P1,. . . ,P|Y|denotes the hierarchical clustering computed

by Algorithm 3.2on input Y.

Proof. To define Y,Z, and ℓwe consider the (k+1)-clustering com-

puted by Algorithm 3.2on input X. We know that X=∪A∈Ck+1A

is (k,optk)-coverable. Let E⊆Ck+1be a minimal subset such that

∪A∈EAis (|E|−1,optk)-coverable, i. e., for all sets F⊆Ck+1with

|F|<|E|the union ∪A∈FAis not (|F|−1,optk)-coverable. Since a set F

of size 1cannot be (|F|−1,optk)-coverable, we deduce |E|⩾2.

analysis of agglomerative clustering

Let Y:= ∪A∈EAand ℓ:= |E|−1. Then, ℓ⩽kand Yis (ℓ,optk)-

coverable. This establishes the first property of the lemma. It follows

that there exists a set Z⊂Rdwith |Z|=ℓand Y⊂∪z∈ZBd

optk(z).

Since Yis the union of the clusters from E⊆Ck+1, each merge

step between the computation of C|X|and Ck+1merges either two

clusters A,B⊂Yor two clusters A,B⊂X\Y. In the following, we

consider the restriction of the clusterings Ck+1,...,C|X|to the input

data set Y⊂X. We need to deal with the resulting clusterings as if

they were computed by Algorithm 3.2on input Y. This is the part

of our analysis where we have to be more precise on the tie-breaking

strategy as briefly mentioned at the beginning of this chapter. If on

input Y, the minimization step of Algorithm 3.2is not always unam-

biguous, the computed clusterings may differ from the restriction of

Ck+1,...,C|X|to Y. Thus, for the analysis of any fixed run on input X

we simply assume that on input YAlgorithm 3.2performs the same

sequence of merge steps that ignoring the merge steps inside X\Y

lead to Ck+1,...,C|X|in the fixed run on input X.

We let P1,...,P|Y|be the hierarchical clustering computed by Algo-

rithm 3.2on input Y. Then, we have Pℓ+1=E=Ck+1∩2Y. Thus,2Ydenotes the

power set of Y.Pℓ+1⊆Ck+1. To compute Pℓ, on input Y, Algorithm 3.2merges two

clusters from Pℓ+1that minimize the radius of the resulting cluster.

Analogously, on input X, Algorithm 3.2merges two clusters from

Ck+1to compute Ck. Since Pℓ+1⊆Ck+1, Observation 3.9implies

costrad(Ck)⩽costrad(Pℓ), thus proving the second property of the

lemma.

It remains to show that for all n∈Nwith ℓ < n ⩽|Y|it holds

that every cluster in Pnis (Z,optk)-connected to another cluster in

Pn(the third property of the lemma). By the definition of Z, every

cluster in Pnintersects at least one ball Bd

optk(z)for z∈Z. Therefore,

it is enough to show that each ball Bd

optk(z)intersects at least two

clusters from Pn. We first show this property for n=ℓ+1. For ℓ=1,

this follows from the fact that Bd

optk(z)with Z={z}has to contain both

clusters from P2. For ℓ > 1, we would otherwise be able to remove

one cluster from Pℓ+1and get ℓclusters whose union is (ℓ−1,optk)-

coverable. However, this contradicts the definition of E=Pℓ+1as a

minimal subset with this property.

To show the third property of the lemma for general n, let A∈

Pnand z∈Zwith Bd

optk(z)∩A=∅. There exists a unique cluster

A∈Pℓ+1with A⊆˜

A. Then, we have Bd

optk(z)∩˜

A=∅. However, we

have just shown that Bd

optk(z)has to intersect at least two clusters from

Pℓ+1. Thus, there exists another cluster ˜

B∈Pℓ+1with Bd

optk(z)∩˜

B=∅.

Since every cluster from Pℓ+1is a union of clusters from Pn, there

exists at least one cluster B∈Pnwith B⊆˜

Band Bd

optk(z)∩B=∅.

3.3k-center clustering

optk

costrad(Pm)

costrad(Pn)

Figure 6: Merging (Z,optk)-connected clusters.

3.3.3analysis of the remaining merge steps

Let Y,Z,ℓ, and P1,. . . ,P|Y|be as given by Lemma 3.14. Then, Proposi-

tion 3.11 can be used to obtain an upper bound for the cost of P2ℓ. In

the following, we analyze the merge steps leading from P2ℓ to Pℓ+1

and show how to obtain an upper bound for the cost of Pℓ+1. As

in Section 3.3.1, we analyze the merge steps in phases. The following

lemma is used to bound the increase of the cost during a single phase.

Note that optkstill refers to the cost of an optimal solution on input

X, not Y.

Lemma 3.15.Let m,n∈Nwith n⩽2ℓ and ℓ<m⩽n⩽|Y|. If there are

no two (Z,optk)-connected clusters in Pm∩Pn,

costrad(P⌊m+ℓ

2⌋)⩽costrad(Pm)+2costrad(Pn)+2optk.

Proof. Analogously to the proof of Lemma 3.8we define t:= ⌊m+ℓ

2⌋

and deduce that there are strictly less than m−ℓ

2merge steps between

the computations of Pmand Pt+1. We show that there exist at least

m−ℓdisjoint pairs of clusters from Pmsuch that the radius of their

union can be upper bounded by costrad(Pm)+2costrad(Pn)+2optk.

Then, at least one of these pairs still has to exist in Pt+1since in each

iteration of its loop the algorithm can destroy at most two of the pairs.

The lemma follows using Observation 3.9and from the fact that there

is only one merge step left.

To bound the number of the above mentioned cluster pairs, we start

with a structural observation. Pm∩Pnis the set of clusters from Pn

that still exist in Pm. By our definition of Y,Z, and ℓ, we conclude that

any cluster A∈Pm∩Pnis (Z,optk)-connected to another cluster B∈

Pm. If we assume that there are no two (Z,optk)-connected clusters in

Pm∩Pn, this implies B∈Pm\ Pn(cf. Figure 6). Thus, using A∈Pn,

B∈Pm, and Inequality (6), the radius of A∪Bcan be bounded by

rad(A∪B)⩽costrad(Pm)+2costrad(Pn)+2optk. (7)

Moreover, using a similar argument, we derive the same bound for

two clusters A1,A2∈Pm∩Pnthat are (Z,optk)-connected to the same

cluster B∈Pm\ Pn. That is,

rad(A1∪A2)⩽costrad(Pm)+2costrad(Pn)+2optk. (8)

analysis of agglomerative clustering

Next, we show that there exist at least ⌈|Pm∩Pn|

2⌉disjoint pairs of

clusters from Pmsuch that the radius of their union can be bounded

either by Inequality (7) or by Inequality (8). To do so, we first consider

the pairs of clusters from Pm∩Pnthat are (Z,optk)-connected to the

same cluster from Pm\ Pnuntil no candidates are left. For these

pairs, we can bound the radius of their union by Inequality (8). Then,

each cluster from Pm\ Pnis (Z,optk)-connected to at most one of

the remaining clusters from Pm∩Pn. Thus, each remaining cluster

A∈Pm∩Pncan be paired with a different cluster B∈Pm\ Pnsuch

that Aand Bare (Z,optk)-connected. For these pairs, we can bound

the radius of their union by Inequality (7). Since for all pairs either

one or both of the clusters come from the set Pm∩Pn, we can lower

bound the number of pairs by ⌈|Pm∩Pn|

2⌉.

To complete the proof, we show

m−ℓ⩽⌈|Pm∩Pn|

2⌉.

In each iteration of its loop, the algorithm can merge at most two

clusters from Pn. Therefore, there are at least ⌈n−|Pm∩Pn|

2⌉merge

steps between the computations of Pnand Pm. Hence,

m⩽n−⌈n−|Pm∩Pn|

2⌉⩽n

2+|Pm∩Pn|

Using n⩽2ℓ, we deduce

m−ℓ⩽|Pm∩Pn|

Lemma 3.16.Let n∈Nwith n⩽2ℓ and ℓ < n ⩽|Y|. Then,

costrad(Pℓ+1)< 2 (log2(ℓ) + 2)(costrad(Pn)+optk).

Proof. For n=ℓ+1there is nothing to show. Hence, we assume

n>ℓ+1. By the definition of Z, there exist two (Z,optk)-connected

clusters in Pn. Let ˜n∈Nwith ˜n < n be maximal such that no two

(Z,optk)-connected clusters exist in P˜n∩Pn. The number ˜nis well-

defined since |P1|=1implies ˜n⩾1. It follows that the same holds for

all m∈Nwith m⩽˜n. We conclude that Lemma 3.15 is applicable

for all m∈Nwith ℓ<m⩽˜n.

By the definition of ˜nthere still exist at least two (Z,optk)-connect-

ed clusters in P˜n+1∩Pn. Then, Observation 3.9implies

costrad(P˜n)⩽2costrad(Pn)+optk. (9)

If ˜n⩽ℓ+1, Inequality (9) proves the lemma. For ˜n > ℓ +1, define

u=⌈log2(˜n−ℓ)⌉and

mi=⌈(1

2)i

(˜n−ℓ) + ℓ⌉> ℓ

3.3k-center clustering

for i=0,...,u. Then, m0=˜nand mu=ℓ+1. Furthermore,

⌊mi+ℓ

2⌋=⌊1

2⌈(1

2)i(˜n−ℓ) + ℓ⌉+ℓ

2⌋⩽⌊1

2((1

2)i(˜n−ℓ) + ℓ+1)+ℓ

2⌋

=⌊(1

2)i+1(˜n−ℓ) + ℓ+1

2⌋⩽⌈(1

2)i+1(˜n−ℓ) + ℓ⌉=mi+1.

Since Algorithm 3.2uses a greedy strategy,

costrad(Pmi+1)⩽costrad(P⌊mi+ℓ

2⌋)

for i=0,. . . ,u−1. Combining this inequality with Lemma 3.15 (ap-

plied to m=mi), we obtain

costrad(Pmi+1)⩽costrad(Pmi)+2costrad(Pn)+2optk.

By repeatedly applying this inequality for i=0,...,u−1and sum-

ming up the costs, we get

costrad(Pmu)<costrad(P˜n)+2u ·(costrad(Pn)+optk)

(9)

< 2(u+1)·(costrad(Pn)+optk).

Since ˜n < 2ℓ, we get u < log2(ℓ) + 1and the lemma follows using

mu=ℓ+1.

The following lemma finishes the analysis except for the last merge

step.

Lemma 3.17.Let Y⊂Rdbe finite and ℓ⩽|Y|such that Yis (ℓ,optk)-cov-

erable. Furthermore, let Z⊂Rdwith |Z|=ℓsuch that for all n∈Nwith

ℓ+1⩽n⩽|Y|, every cluster in Pnis (Z,optk)-connected to another cluster

in Pn, where P1,...,P|Y|denotes the hierarchical clustering computed by

Algorithm 3.2on input Y. Then,

costrad(Pℓ+1)< 2(log2(ℓ) + 2)·(24d ·e24d +1)·optk.

Proof. Let n:= min(|Y|,2ℓ). Then, using Proposition 3.11,

costrad(Pn)< 24d ·e24d ·optk.

The result follows by using this bound together with Lemma 3.16.

3.3.4proof of theorem 3.10

By Lemma 3.14, there exists a subset Y⊆X, a number ℓ⩽k, and a

hierarchical clustering P1,. . . ,P|Y|of Ywith

costrad(Ck)⩽costrad(Pℓ).

Furthermore, there exists a set Z⊂Rdsuch that every cluster from

Pℓ+1is (Z,optk)-connected to another cluster in Pℓ+1. Thus, Pℓ+1

contains two clusters A,Bthat intersect the same ball of radius optk.

Hence,

costrad(Ck)⩽rad(A∪B)⩽2costrad(Pℓ+1)+optk.

The theorem follows using Lemma 3.17 and ℓ⩽k.

analysis of agglomerative clustering

Algorithm 3.3:AgglomerativeCompleteLinkage(X)

Input: finite set of input points X⊂Rd

C|X|←− {{x} | x∈X}

for i=|X|−1,. . . ,1do

find distinct clusters A,B∈Ci+1minimizing diam(A∪B)

Ci←− (Ci+1\ {A,B})∪{A∪B}

end

return C|X|,...,C1

3.4diameter k-clustering

In this section, we analyze the agglomerative complete linkage clus-

tering algorithm for Problem 2.7stated as Algorithm 3.3. Again, the

only difference to Algorithm 3.1and Algorithm 3.2is the minimiza-

tion of the diameter in the minimization step.

Note that in this section cost always means diameter cost and optk

refers to the cost of an optimal diameter k-clustering of X⊂Rdwhere

k∈Nwith k⩽|X|. That is, optkdenotes the cost of an optimal

solution to Problem 2.7. Analogously to the (discrete) radius case,

any cluster Cis contained in a ball of radius diam(C)and thus, the

set Xis (k,optk)-coverable.

Observation 3.18.The cost of all computed clusterings is equal to the di-This is analogous to

Observation 3.4and

Observation 3.9.ameter of the cluster created last. Furthermore, the diameter of the union of

any two clusters is always an upper bound for the cost of the clustering to

be computed next.

The following theorem states our main result.

Theorem 3.19.Let X⊂Rdbe a finite set of points. Then, for arbitrary

metrics that are based on a norm and for all k∈Nwith k⩽|X|, the partition

Ckof Xinto kclusters as computed by Algorithm 3.3satisfies

costdiam(Ck)=O(logk)·optk,

where optkdenotes the cost of an optimal solution to Problem 2.7, and the

constant hidden in the O-notation is doubly exponential in the dimension d.

As in the proof of Theorem 3.5and Theorem 3.10, we first bound

the cost of the intermediate 2k-clustering. However, we have to apply

a different analysis again. This time, the new analysis results in a

bound that depends doubly exponential on the dimension.

3.4.1analysis of the 2k-clustering

Proposition 3.20.Let X⊂Rdbe finite. Then, for all k∈Nwith 2k ⩽|X|,

the partition C2k of Xinto 2k clusters as computed by Algorithm 3.3satisfies

costdiam(C2k)< 23σ (28d +6)·optk,

3.4 diameter k-clustering

where σ= (42d)dand optkdenotes the cost of an optimal solution to Prob-

lem 2.7.

In our analysis of the k-center problem, we made use of the fact

that merging two clusters lying inside a ball of some radius rresults

in a new cluster of radius at most r. This is no longer true for the

diameter k-clustering problem. We are not able to derive a bound for

the diameter of the new cluster that is significantly less than 2r. The

additional factor of 2makes our analysis from Section 3.3.1useless

for the diameter case since it would result in an approximation factor

that is linear in the size of Xrather than only depending on kand d.

To prove Proposition 3.20, we split the merge steps of Algorithm 3.3

into two stages. The first stage consists of the merge steps down to

a(22O(dlogd)k)-clustering. The analysis of the first stage is based on

the following notion of similarity. Two clusters are called similar if

one cluster can be translated such that every point of the translated

cluster is near a point of the second cluster. Then, by merging sim-

ilar clusters, the diameter essentially increases by the length of the

translation vector. During the first stage, we guarantee that there is a

sufficiently large number of similar clusters left. The cost of the inter-

mediate (22O(dlogd)k)-clustering can be upper bounded by O(d)·optk.

The second stage consists of the steps reducing the number of re-

maining clusters from 22O(dlogd)kto only 2k. Since the number of

remaining merge steps as well as the upper bound of the cost are in-

dependent of ∥X∥, we are able to show an approximation factor that

only depends on kand d. In this stage, we are no longer able to

guarantee that a sufficiently large number of similar clusters exists.

Therefore, we analyze the merge steps of the second stage using a

weaker argument, very similar to the one used in the second step of

the analysis in the discrete k-center case (cf. Section 3.2.2). As long

as there are more than 2k clusters left, we are able to find sufficiently

many pairs of clusters that intersect the same cluster of an optimal

k-clustering. Therefore, we can bound the cost of merging such a

pair by the sum of the diameters of the two clusters plus the diame-

ter of the optimal cluster. We show that the cost of the intermediate

2k-clustering is upper bounded by 22O(dlog d)·optk. Let us remark that

we do not obtain our main result if we already use this argument for

the first stage.

Both stages are again subdivided into phases, such that in each

phase the number of remaining clusters is reduced by one fourth.

stage one

The following lemma will be used to bound the increase of the cost

during a single phase of the first stage.

analysis of agglomerative clustering

Lemma 3.21.Let λ∈Rwith 0<λ<1and ρ=⌈(3

λ)d⌉. Furthermore, let

m∈Nwith 2ρ+1k < m ⩽|X|. Then,

costdiam(C⌊3m

4⌋)⩽(1+2λ)costdiam(Cm)+4d

√2ρ+1k

moptk. (10)

Proof. Let τ:= costdiam(Cm). From every cluster C∈Cm, we fix an

arbitrary point and denote it by pC. Then, the distance from pCto

any q∈Cis at most τand

C−pC:= {p−pCp∈C}⊂Bd

τ(0).

By Lemma 3.3, a ball of radius τcan be covered by ρballs of radius

λτ. Hence, we are able to fix y1,...,yρ∈Rdwith

τ(0)⊆

∪

i=1

λτ(yi).

For C∈Cm, we define the configuration of Cby

Conf(C) := {yi|1⩽i⩽ρand Bd

λτ(yi)∩(C−pC)=∅}.

That is, we identify each cluster C∈Cmwith the subset of the balls

λτ(y1),...,Bd

λτ(yρ)that intersect C−pC. Note that no cluster from

C∈Cmhas an empty configuration and the number of possible con-

figurations is upper bounded by 2ρ.

As in the proof of Lemma 3.7we define t:= ⌊3m

4⌋and deduce

|Cm∩Ct+1|>m

2. It follows that there exist

j > 1

2ρ·m

2=m

2ρ+1

distinct clusters C1,...,Cj∈Cm∩Ct+1with the same configuration.

Using m > 2ρ+1k, we deduce j > k.

Let P:= {pC1,...,pCj}. Since Xis (k,optk)-coverable, so is P⊂X.

Therefore, by Lemma 3.2, there exist distinct a,b∈{1,. . . ,j}such that

∥pCa−pCb∥⩽4d

√2ρ+1k

moptk.

Next, we derive a bound for the diameter of the union of the corre-

sponding clusters Caand Cb. That is, we need to bound the largest

distance ∥u−v∥between two points u,v∈Ca∪Cb. If u,v∈Caor

u,v∈Cb, then ∥u−v∥⩽costdiam(Cm). Hence, let u∈Caand v∈Cb.

Using the triangle inequality, for any w∈Rd,

cf. Figure 7∥u−v∥⩽∥pCa−pCb∥+∥u+pCb−pCa−w∥+∥w−v∥.

For ∥pCa−pCb∥, we already derived an upper bound. To bound

∥u+pCb−pCa−w∥, let y∈Conf(Ca) = Conf(Cb)such that

u−pCa∈Bd

λτ(y).

3.4 diameter k-clustering

pCa

pCb

∥pCa−pCb∥

∥u+pCb−pCa−w∥

∥w−v∥

Figure 7: Congruent configurations.

Furthermore, let w∈Cbwith

w−pCb∈Bd

λτ(y).

Then,

∥u+pCb−pCa−w∥=∥u−pCa− (w−pCb)∥

⩽2λ ·costdiam(Cm)

=2λτ.

Since v,w∈Cb, their distance is bounded by

∥w−v∥⩽diam(Cb)⩽costdiam(Cm).

We conclude that merging the clusters Caand Cbresults in a cluster

whose diameter can be upper bounded by

diam(Ca∪Cb)⩽(1+2λ)costdiam(Cm)+4d

√2ρ+1k

moptk.

Using Observation 3.18 and the fact that Caand Cbare part of the

clustering Ct+1, we can upper bound the cost of Ctby

costdiam(Ct)⩽diam(Ca∪Cb).

Note that the parameter λfrom Lemma 3.21 establishes a trade-

off between the two terms on the right-hand side of Inequality (10).

For the analysis of the first stage, we have to choose λcarefully.

In the proof of the following lemma, we use λ=ln 4

3/4d and apply

Lemma 3.21 for ⌈log4

|X|

2σ+1k⌉consecutive phases, where σ= (42d)d.

Then, we are able to upper bound the cost of the clustering computed

after the first stage by a term that is linear in dand optkand indepen-

dent of |X|and k. The number of remaining clusters is independent

of the number of input points |X|and depends only on the dimension

dand the desired number of clusters k.

analysis of agglomerative clustering

Lemma 3.22.Let 2σ+1k < |X|for σ= (42d)d. Then, on input X, Algo-

rithm 3.3computes a clustering C2σ+1kwith

costdiam(C2σ+1k)<(28d +4)optk.

Proof. Let u:= ⌈log3

2σ+1k

|X|⌉and define mi:= ⌈(3

4)i|X|⌉for i=0,...,u.

Furthermore, let λ:= ln 4

3/4d. This implies ρ⩽σfor the parameter ρ

of Lemma 3.21. Then, mu⩽2σ+1kand mi> 2σ+1k⩾2ρ+1kfor

i=0,. . . ,u−1. Since

⌊3mi

4⌋=⌊3

4⌈(3

4)i|X|⌉⌋⩽⌊(3

4)i+1|X|+3

4⌋⩽⌈(3

4)i+1|X|⌉=mi+1

and Algorithm 3.3uses a greedy strategy,

costdiam(Cmi+1)⩽costdiam(C⌊3mi

4⌋)

for i=0,. . . ,u−1. Combining this inequality with Lemma 3.21 (ap-

plied to m=mi), we obtain

costdiam(Cmi+1)⩽(1+2λ)costdiam(Cmi)+4d

√2ρ+1k

optk.

By repeatedly applying this inequality for i=0,...,u−1and using

costdiam(C2σ+1k)⩽costdiam(Cmu)and costdiam(Cm0)=0, we get

costdiam(C2σ+1k)⩽

u−1

∑

i=0

(1+2λ)i·4d

2σ+1k

4)u−1−i|X|optk



=4d

2σ+1k

4)u−1|X|optk·

u−1

∑

i=0

(1+2λ)i·d

√(3

4)i

.

Using u−1 < log3

2σ+1k

|X|,

costdiam(C2σ+1k)< 4 optk·

u−1

∑

i=0



1+2λ

√4





. (11)

By taking only the first two terms of the series expansion of the expo-

nential function, we get 1+2λ =1+ln 4

2d < eln 4

2d =2d

√4

3. Substituting

this bound into Inequality (11) and extending the sum yields

costdiam(C2σ+1k)< 4 optk·∞

∑

i=0



√4





< 4 optk·∞

∑

i=0(1

1+2λ)i

Solving the geometric series leads to

costdiam(C2σ+1k)< 4 (1

2λ +1)·optk<(28d +4)·optk.

3.4 diameter k-clustering

A B

diam(A)

diam(B)

diam(A∪B)

Figure 8: Merging two clusters intersecting a ball of radius r.

stage two

The second stage covers the remaining merge steps until Algorithm 3.3

computes the clustering C2k. However, compared to stage one, the

analysis of a single phase yields a weaker bound. The following

lemma provides an analysis of a single phase of the second stage.

It is very similar to Lemma 3.7and Lemma 3.8in the analysis of the

discrete k-center problem.

Lemma 3.23.Let m∈Nwith 2k < m ⩽|X|. Then,

costdiam(C⌊3m

4⌋)⩽2·(costdiam(Cm)+optk).

Proof. As in the proof of Lemma 3.7, we define t:= ⌊3m

4⌋and deduce

|Cm∩Ct+1|>m

2> k. Since Xis (k,optk)-coverable, there exists a point

y∈Rdsuch that Bd

optk(y)intersects two clusters A,B∈Cm∩Ct+1.

We conclude that merging Aand Bwould result in a cluster whose

diameter can be upper bounded by

cf. Figure 8

diam(A∪B)⩽2costdiam(Cm)+2optk.

The result follows using A,B∈Ct+1and Observation 3.18.

Lemma 3.24.Let n∈Nwith n⩽2σ+1kand 2k < n ⩽|X|for σ= (42d)d.

Then, on input X, Algorithm 3.3computes a clustering C2k with

costdiam(C2k)< 23σ (costdiam(Cn)+2optk).

Proof. Let u:= ⌈log3

n⌉and define mi:= ⌈(3

4)in⌉for i=0,...,u.

Then, mu⩽2k and mi> 2k for i=0,. . . ,u−1. Analogously to the

proof of Lemma 3.22, we get ⌊3mi

4⌋⩽mi+1and using Lemma 3.23,

costdiam(Cmi+1)⩽2(costdiam(Cmi)+optk)

for i=0,...,u−1. By repeatedly applying this inequality for i=

0,. . . ,u−1and using costdiam(C2k)⩽costdiam(Cmu), we get

costdiam(C2k)⩽2ucostdiam(Cn)+

∑

i=1

2ioptk

< 2u(costdiam(Cn)+2optk).

Hence using u⩽⌈log4

32σ⌉< 3σ, the result follows.

analysis of agglomerative clustering

Proof of Proposition 3.20.The Proposition follows immediately by com-

bining Lemma 3.22 and Lemma 3.24.

3.4.2analysis of the remaining merge steps

We analyze the remaining merge steps analogously to the k-center

problem. Therefore, in this section we only discuss the differences,

most of which are slightly modified bounds for the cost of merging

two clusters.

The connectivity property from Section 3.3.2remains the same.

However, for any two (Z,r)-connected clusters A,B, we use

cf. Figure 8diam(A∪B)⩽diam(A) + diam(B) + 2r (12)

as a replacement for Inequality (6). Furthermore, Lemma 3.14 also

holds for the diameter k-clustering problem, i.e., with

costdiam(Ck)⩽costdiam(Pℓ).

Using Inequality (12) in the proof of Lemma 3.15, we get

diam(A∪B)⩽costdiam(Pm)+costdiam(Pn)+2optk

as a replacement for Inequality (7) while Inequality (8) can be re-

placed by

diam(A1∪A2)⩽costdiam(Pm)+2(costdiam(Pn)+2optk).

That is, for the diameter k-clustering problem the two upper bounds

are different. However, the second one is larger than the first one.

Using it in both cases, the inequality stated in Lemma 3.15 changes

slightly to

costdiam(P⌊m+ℓ

2⌋)⩽costdiam(Pm)+2(costdiam(Pn)+2optk).

Together with costdiam(P˜n)⩽2costdiam(Pn)+2optkas a replacement

for Inequality (9), the bound stated in Lemma 3.16 becomes

costdiam(Pℓ+1)< 2 (log2(ℓ) + 2)(costdiam(Pn)+2optk).

Thus, using Proposition 3.20, the upper bound for the cost of the

(ℓ+1)-clustering of Ystated in Lemma 3.17 becomes

costdiam(Pℓ+1)< 2 (log2(ℓ) + 2)(23σ (28d +6)+2)·optk

for σ= (42d)d. Analogously to Section 3.3.4, this proves Theorem 3.19.

3.5 the one-dimensional case

3.5the one-dimensional case

For d=1, we are able to show that Algorithm 3.3computes an ap-

proximation to the diameter k-clustering problem (Problem 2.7) with

an approximation factor of at most 3. We even know that for any

data set X⊂Rthe approximation factor of the computed solution is

strictly below 3. However, we do not show an approximation factor

of 3−ϵfor some ϵ > 0. The proof of this upper bound makes use of

the total order of the real numbers and is certainly not generalizable

to higher dimensions.

Theorem 3.25.Let d=1. Then, Algorithm 3.3gives a solution to the

diameter k-clustering problem (Problem 2.7) with cost less than three times

the cost of an optimal solution.

For any set C⊂Rit holds that diam(C) = 2rad(C). That is, in the

one-dimensional case, the diameter and the radius of a cluster differ

by a constant factor. Thus, for any two clusters Aand B,

diam(A)<diam(B)⇔rad(A)<rad(B).

It follows that Algorithm 3.3and Algorithm 3.2choose the same clus- We assume that

there are no ties.

ters in the merge step and we immediately get the following corollary.

Corollary 3.26.Let d=1. Then, Algorithm 3.2gives a solution to the

k-center problem (Problem 2.8) with cost less than three times the cost of an

optimal solution.

Before proving Theorem 3.25, we discuss some features of one-

dimensional clusterings and introduce a suitable terminology. Af-

terwards, we formulate the central argument used in the proof of

Theorem 3.25 as a separate lemma.

In the following, we consider solely clusterings of the finite input

data set X⊂R. Since Algorithm 3.3uses a greedy strategy to choose

the clusters to be merged, it merges only clusters whose points lie

directly next to each other. That is, the created clusters are cohesive

in the sense that they contain all points of Xthat lie between their

smallest and their largest point. Without loss of generality, we may

assume that the same holds for the clusters of an optimal solution.

Obviously, any clustering can be transformed to have this property

without changing the largest diameter. It follows that all considered

clusters are uniquely determined by their smallest and their largest

point. Therefore, in the figures of this section, clusters are depicted by

line segments between their smallest and their largest point. We call

the smallest point of a cluster its left boundary and the largest point its

right boundary.

We start with considering the relative position of two clusters A,B⊂

Xfrom possibly different clusterings of X⊂R. We say that Alies in-

side of B, if A⊂B. Furthermore, Alies to the left of Bif the right bound-

ary of Ais strictly less than the left boundary of B, i. e., maxx∈Ax <

analysis of agglomerative clustering

minx∈Bx. Analogously, Alies to the right of B, if maxx∈Bx < minx∈Ax.

For two distinct clusters A,Bfrom the same clustering C, it follows

that Alies either to the left or to the right of B. We say that such

clusters Aand Bare neighboring if no further cluster C∈Clies in

between. A set of n > 2 clusters from the same clustering Cis called

neighboring if the clusters can be denoted as C1,. . . ,Cnsuch that for

i=1,. . . ,n−1, the clusters Ciand Ci+1are neighboring. For two

neighboring clusters A,Bwith Alying to the left of B, we define the

gap between Aand Bas the open interval between the right bound-

ary of Aand the left boundary of B. It follows that a gap contains

no elements from X. Furthermore, for two clusters A,Bfrom different

clusterings we define Ato be left aligned to B, if Aand Bshare their left

boundary, i. e., minx∈Ax=minx∈B. Analogously, Ais right aligned

to B, if maxx∈Ax=maxx∈Bx. If it is clear from the context or if it is

irrelevant which cluster a cluster Ais aligned to, we simply say Ais

left (or right) aligned.

For the remainder of this section, we fix an optimal diameter k-

clustering Sof X. That is, Sis an optimal solution to Problem 2.7

for d=1. Without loss of generality, we assume S={S1,...,Sk}

with Silying to the left of Si+1for i=1,. . . ,k−1. Furthermore,

we denote the hierarchical clusterings computed by Algorithm 3.3on

input Xby C|X|,. . . ,C1. For our line of reasoning, we focus on two

particular clusterings Cℓand Cmwith ℓ⩽m. Let m∈Nbe theThe smaller the

index, the later the

clustering is

computed.

smallest index, such that all clusters of Cmhave a diameter of at most

optkand let ℓ∈Nbe the smallest index, such that all clusters of Cℓ

have a diameter strictly less than 3·optk. Since Algorithm 3.3uses

a greedy strategy, it is sufficient to show that ℓis at most kto prove

Theorem 3.25. To do this, we use the following technical lemma.

Lemma 3.27.Let P⊂Sbe a set of neighboring clusters from the optimal

solution Swith the following properties:

1. Gaps between neighboring clusters from Phave length at most optk.

2. The rightmost cluster of Pis right aligned to a cluster C∈Cmand it

is the only one with this property.

Furthermore, let L⊂Cℓbe such that:

3. For each cluster C∈L, the left boundary of Cis contained in one of

the clusters from P.

4. The left boundary of the leftmost cluster of Lis contained in the left-

most cluster of P.

Then, it holds that |L| ⩽|P|.

Proof. We prove the lemma by induction over the size of Pwith two

base cases. First, let |P| =1. We denote P={P}and assume |L| ⩾2for

contradiction. By definition, each cluster from Lis the union of one or

3.5 the one-dimensional case

P⊂SP

L⊂Cℓ

CmA B

(a) Sketch of contradiction for |P| =1.

P⊂SP1P2

L⊂CℓA B C

(b) Sketch of contradiction for |P| =2.

Figure 9: The two base cases of the induction.

more clusters from Cmas illustrated in Figure 9a. It follows that there

exist at least two clusters A,B∈Cmsuch that their left boundaries are

contained in the single cluster P. Then, the right boundaries of Aand

Balso have to be contained in P. Otherwise, there could not exist a

cluster C∈Cmthat is right aligned to Pas required for the second

property. It follows that Aand Bboth lie inside of Pwhich has a

diameter of at most optk. That is, the union of Aand Bwould have

a diameter of at most optk, too. Thus, Algorithm 3.3would have

created a clustering Cm−1with cost of at most optk. However, this

contradicts the definition of m.

Second, let |P| =2, denote P={P1,P2}with P1lying to the left of P2

and assume |L| ⩾3for contradiction. Let C∈Lbe the rightmost clus-

ter of Land recall that the gap between the two neighboring clusters

P1and P2contains no points from X. It follows that there have to ex-

ist at least two clusters A,B∈Lto the left of C, which both lie inside

P1∪P2(cf. Figure 9b). Furthermore, the right boundaries of Aand

Bhave to be strictly less than the right boundary of P2. Otherwise,

the left boundary of Ccould not be contained in one of the clusters

of Pas required in the third property. Using the first property, it fol-

lows that the union of Aand Bhas a diameter of strictly less than

3·optk. However, analogously to the first base case, this contradicts

the definition of ℓ.

Finally, for the inductive step, let |P| ⩾3and assume that for any

two sets P′and L′as defined in the lemma but with |P′|<|P|, it

holds that |L′|⩽|P′|. Let P1,P2∈Pbe the two leftmost clusters of

Pwith P1lying to the left of P2. Furthermore, let A,B,C∈Lbe For |L| < 3, there is

nothing to show.

the three leftmost clusters of L⊂Cℓordered from left to right as

illustrated in Figure 10. Every cluster in Cℓis the union of one or

more clusters from Cm. Thus, the second property yields that the

rightmost cluster of Pis the only cluster from Pthat may be right

aligned to a cluster from Cℓ. However, this cluster from Cℓcan not

be the cluster B, since to the right of Bthere has to be room for the

cluster C. It follows that Bis not right aligned to a cluster from P.

analysis of agglomerative clustering

P1P2P

P⊂S

L⊂CℓA B C

Figure 10: The inductive step.

z }| {

z }| { ... Pt

z}|{

Figure 11: Partition of the optimal solution.

The definition of ℓyields diam(A∪B)⩾3·optkand the first property

provides diam(P1∪P2)⩽3·optk. Since Bcan not be right aligned to

P2, the right boundary of Band the left boundary of Cboth have to

be contained in a cluster P∈Pthat lies to the right of P2.

We define a new set P′by removing all clusters from Pthat lie

to the left of P. It follows that |P′|⩽|P| −2. Furthermore, we define

L′:= L \ {A,B}. Then, all four properties of Pand Las required by the

lemma are also satisfied by P′and L′. Using the inductive hypothesis

|L′|⩽|P′|, we conclude

|L| =|L′|+2⩽|P′|+2⩽|P|.

Proof of Theorem 3.25.In the following, we denote Cm={C1,...,Cm}

with Cilying to the left of Ci+1for i=1,. . . ,m−1. Since Sand Cm

are clusterings of the same set X, it follows that C1is left aligned to S1

and Cmis right aligned to Sk. Furthermore, it follows that if for any

1⩽i⩽m−1and 1⩽j⩽k−1the cluster Ciis right aligned to the

cluster Sj, then Ci+1has to be left aligned to Sj+1(cf. Figure 11). That

is, Cmand Shave a common gap between the neighboring clusters

Ci,Ci+1and Sj,Sj+1, respectively. We use these common gaps of Cm

and Sto define a partition P1,. . . ,Ptof the optimal solution S. Let

t∈Nbe the number of clusters in Sthat are left aligned to a cluster

of Cm(which is also the number of right aligned clusters). Then, for

i=1,...,tlet Libe the i-th cluster of S(counted from the left) that

is left aligned to a cluster of Cmand let Ribe the i-th right aligned

cluster. Using Liand Ri, we define Pito be the set of neighboring

clusters of Sfrom Lito Ri.

Based on the partition P1,. . . ,Ptof Swe define a partition of Cℓ.

To this, for each i∈{1,. . . ,t}, we pick all clusters from Cℓwhose left

boundary is contained in a cluster from Pi. This results in a partition

L1,...,Ltof Cℓwith

Li:= {C∈Cℓ(min

x∈Cx)∈Pfor a cluster P∈Pi}.

3.6 lower bounds

Then, for each i=1,. . . ,t, the sets Piand Lisatisfy the first three

properties required in Lemma 3.27 for the sets Pand L. In order to

also satisfy the fourth property, we need to modify the sets P1,...,Pt

slightly. To this, for each i=1,...,t, we define P∗

iby removing the

leftmost clusters from Pithat do not contain the left boundary of a

cluster C∈Li. Then, we are able to apply Lemma 3.27 with P=P∗

and L=Lifor i=1,...,t. Using |P∗

i|⩽|Pi|, it follows that |Li|⩽|Pi|.

Since, P1,...,Ptand L1,...,Ltare partitions of Sand Cℓ, respectively,

ℓ=

∑

i=1

|Li|⩽

∑

i=1

|Pi|=k.

As mentioned above, this proves the theorem.

3.6lower bounds

In this section, we present constructions of several data sets for the

diameter k-clustering problem. The data sets yield lower bounds for

the approximation factor of Algorithm 3.3. Some of the data sets

are also suitable to prove lower bounds for Algorithm 3.2and Algo-

rithm 3.1for the k-center problem and the discrete k-center problem,

respectively. To keep the constructions simple, we make use of the

fact that we presented the clustering algorithms without a particular

tie-breaking strategy. That is, whenever one of the algorithms is able

to choose between several possible merge steps, we simply assume

that we can govern its choice.

In Section 3.6.1, we show that for data sets X⊂R(i.e., d=1), Algo-

rithm 3.3has an approximation factor of at least 2.5. Recall that in in

Section 3.5we stated that in the one-dimensional case Algorithm 3.3

computes a solution to the diameter k-clustering problem with an

approximation factor strictly below 3. Hence, for d=1, we obtain

almost matching upper and lower bounds for the cost of the solution

computed by Algorithm 3.3. The same holds for Algorithm 3.2and

the k-center problem since for d=1the two problems as well as the

two algorithms are equivalent.

In Section 3.6.2, we show that the dimension dhas an impact on

the approximation factor of Algorithm 3.3. This follows from a two-

dimensional data set yielding a lower bound of 3for the metric based

on the L∞-norm. Note that this exceeds the upper bound from the

one-dimensional case.

In Section 3.6.4, we show that there exist data sets such that Al-

gorithm 3.3computes an approximation to the diameter k-clustering

problem (Problem 2.7) with an approximation factor of Ω(p

√logk)for

metrics based on an Lp-norm (1⩽p < ∞)and Ω(logk)for the met-

ric based on the L∞-norm. In case of the L1- and the L∞-norm, this

matches an already known lower bound that has been shown using

a rather artificial metric [Dasgupta and Long, 2005]. However, the

analysis of agglomerative clustering

bound in [Dasgupta and Long, 2005] is derived from a two-dimen-

sional data set, while the dimension of our construction depends on

the number of clusters k.

Finally, in Section 3.6.4, we show that the lower bound of Ω(p

√logk)

for any Lp-norm and Ω(logk)for the L∞-norm can be adapted to the

discrete k-center problem. In case of the L2-norm, we thus obtain al-

most matching upper and lower bounds for the cost of the solution

computed by Algorithm 3.1. Furthermore, we are able to restrict the

dependency of the approximation factor of Algorithm 3.1on dand k.

3.6.1arbitrary metrics and d=1

We first show a lower bound for the approximation factor of Algo-

rithm 3.3. To do this, we use a sequence of data sets from Rdwith

d=1. Since up to normalization there is only one norm for d=1,

without loss of generality we assume the Euclidean metric.

Proposition 3.28.For all ε>0and k⩾4there exists a data set X⊂R

such that Algorithm 3.3computes a solution to the diameter k-clustering

problem (Problem 2.7) with cost at least 5

2−εtimes the cost of an optimal

solution.

Proof. It is sufficient to consider the case k=4. The construction can

easily be extended to k > 4 by adding separated dummy points. In

the following we construct a data set for any fixed n∈N. That is,

we actually obtain a series of data sets. For any n∈Nwe show that

the computed solution has an approximation factor of 5

2−f(n)for

f(n)> 0 and limn→∞f(n) = 0.

For x∈R, let

V(x) := {x+i|i∈Nand 0⩽i⩽2n−1}.

That is, V(x)consists of 2nequidistant points, where the distance be-

tween neighboring points is 1and diam(V(x)) = 2n−1. Furthermore,

we define

l(x) := x−2n−1,

r(x) := x+2n−1+2n−1=x+3·2n−1−1,

W(x) := V(x)∪{l(x),r(x)}.

It follows that diam(W(x)) = 2n+1−1as shown in Figure 12.

Based on the sets W(x), we define

X:=

∪

i=1

W(xi)

where xi:= i·(7·2n−1−2)for i=1,. . . ,4. Then, there is a gap of

3·2n−1−1between Wn(xi)and Wn(xi+1), i.e.

diam({r(xi),l(xi+1)})=3·2n−1−1for i=1,. . . ,3. (13)

3.6 lower bounds

l(x)x r(x)

2n−1

2n−12n−1

Figure 12: Sketch of the set W(x).

l(xi)r(xi)

2n−12n−1

2n−1

Figure 13: Part of the dendrogram for W(xi).

The optimal 4-clustering of Xis

Copt

4={W(x1),W(x2),W(x3),W(x4)}

and costdiam(Copt

4)=2n+1−1.

The solution computed by Algorithm 3.3may have larger cost. At

the beginning, the minimum distance between two points from Xis 1.

The possible pairs of points with distance 1come from the sets V(xi)

for i=1,...,4. Since the distance between V(xi)and l(xi)or r(xi)

is 2n−1, we can assume that the algorithm merges all points of V(xi)

for i=1,...,4as shown in Figure 13. It follows that Algorithm 3.3

computes the 12-clustering

C12 ={{l(x1)},V(x1),{r(x1)},

{l(x2)},V(x2),{r(x2)},

{l(x3)},V(x3),{r(x3)},

{l(x4)},V(x4),{r(x4)}}.

For i=1,...,4, the diameters of {l(xi)}∪V(xi)and V(xi)∪{r(xi)}

are equal to 3·2n−1−1and these are the best possible merge steps.

Therefore, by (13), we can assume that Algorithm 3.3merges r(xi)

and l(xi+1)for i=1,...,3first. This results in the 7-clustering

C7={{l(x1)}∪V(x1),

{r(x1),l(x2)},V(x2),

{r(x2),l(x3)},V(x3),

{r(x3),l(x4)},V(x4)∪{r(x4)}}

where V(x2)and V(x3)have a diameter of 2n−1while the remaining

clusters have a diameter of 3·2n−1−1(see Figure 14). Between two

neighboring clusters of C7, there is a gap of 2n−1.

In the next step of Algorithm 3.3, the best possible choice is to

merge {r(x1),l(x2)}with V(x2),{r(x2),l(x3)}with V(x2)or V(x3), or

analysis of agglomerative clustering

C1C2C3C4

5·2n−3 3·2n−1−1 3·2n−2 3·2n−1−1

Figure 14: Part of the dendrogram for X.

{r(x3),l(x4)}with V(x3). We let the algorithm merge {r(x1),l(x2)}with

V(x2)and {r(x3),l(x4)}with V(x3). This results in a 5-clustering where

the clusters have alternating lengths of 3·2n−1−1and 3·2n−2with

gaps of 2n−1between them. Then, in the step resulting in C4, Algo-

rithm 3.3has to create a cluster of diameter 5·2n−3as shown in

Figure 14. Therefore, the computed solution has an approximation

factor of

costdiam(C4)

costdiam(Copt

4)=5·2n−3

2n+1−1.

For ngoing to infinity this approximation factor converges from be-

low to 5

Analogously to Section 3.5, the bound for the one-dimensional di-

ameter k-clustering problem immediately yields the same bound for

the one-dimensional k-center problem.

Corollary 3.29.For all ε>0and k⩾4there exists a data set X⊂Rsuch

that Algorithm 3.2computes a solution to the k-center problem (Problem 2.8)

with cost at least 5

2−εtimes the cost of an optimal solution.

3.6.2l∞-metric and d=2

In this section, we construct a data set that uses only eight points from

R2and yields a lower bound of 3for the metric based on the L∞-norm.

Recall that in Section 3.5, we showed that for d=1the approximation

factor of a computed solution is always strictly less than 3. Therefore,

the lower bound of 3for d=2implies that the dimension dhas an

impact on the approximation factor of Algorithm 3.3.

Proposition 3.30.For the metric based on the L∞-norm, there exists a data

set X⊂R2such that Algorithm 3.3computes a solution to the diameter

k-clustering problem (Problem 2.7) with three times the cost of an optimal

solution.

3.6 lower bounds

(a) Lower bound for the metric based

on the L∞-norm.

A B

C D

E F

G H

(b) Lower bound for the metric based

on the L2-norm. The points

C,D,G,Hhave a z-coordinate of

0, while the points A,B,E,Fhave

az-coordinate of 2√x.

Proof. We prove the proposition by constructing a data set for k=4.

Consider eight points A,. . . ,H∈R2with

A= (0,1),E= (−1,2),

B= (1,0),F= (2,1),

C= (0,−1),G= (1,−2),

D= (−1,0),H= (−2,−1).

The optimal 4-clustering of these points is

Copt

4={{A,E},{B,F},{C,G},{D,H}}

which has a maximum L∞-diameter of 1(cf. Figure 15a). However,

it is also possible that Algorithm 3.3starts by merging A with B and

C with D. Then, in the third step, the algorithm merges Eor Fwith

{A,B},Gor Hwith {C,D}, or {A,B}with {C,D}. We assume the latter.

Thus, the fourth merge step creates a cluster of L∞-diameter 3.

3.6.3euclidean metric and d=3

For the Euclidean case, we are able to construct a 3-dimensional data

set that yields a lower bound of 2.56. This is below the upper bound

of 3from the one-dimensional case. Therefore, this data set does

not show an impact of the dimension din the Euclidean case as in

the previous section. However, this lower bound is still better than

the lower bound of 2.5from the one-dimensional case. This suggests

that in higher dimensions it might be easier to construct good lower

bounds.

Proposition 3.31.For the Euclidean metric there exists a data set X⊂R3

such that Algorithm 3.3computes a solution to the diameter k-clustering

problem (Problem 2.7) with cost 2.56 times the cost of an optimal solution.

analysis of agglomerative clustering

Proof. We prove the proposition by constructing a data set for k=4.

For any fixed x∈Rwith 0 < x < 2 consider eight points A,. . . ,H∈R2

with

A= (−1,1,2√x),E= (−(1+x),1+√4−x2,2√x),

B= ( 1,1,2√x),F= ( 1+x,1+√4−x2,2√x),

C= (−1,−1,0),G= (−(1+x),−(1+√4−x2),0),

D= ( 1,−1,0),H= ( 1+x,−(1+√4−x2),0).

The optimal 4-clustering of these points is

Copt

4={{A,E},{B,F},{C,G},{D,H}},

which has a maximum L2-diameter of 2(cf. Figure 15b). However,

since ∥A−B∥=∥C−D∥=2it is possible that Algorithm 3.3starts

by merging Awith Band Cwith D. Then, the cheapest merge adds

one of the points E,Fto the cluster {A,B}or it adds one of the points

G,Hto the cluster {C,D}or it merges {A,B}with {C,D}. We assume

the latter. The resulting cluster {A,B,C,D}has a diameter of 2√2+x.

Then, in the fourth merge step, the algorithm will either merge one

of the pairs E,Fand G,Hor one of the pairs E,Gand F,H. The choice

depends on the parameter x. Note that Algorithm 3.3will not merge

the cluster {A,B,C,D}with one of the remaining four points, since

this is always more expensive. The diameter of the created cluster is

maximized for x≈1.56. If we fix x=1.56, the algorithm merges E

with For Gwith H. This results in a 4-clustering of cost 5.12, while

the optimal solution has cost 2.

3.6.4lp-metric (1⩽p⩽∞)in variable dimension

In the following, we consider the diameter k-clustering problem with

respect to the metric based on the L1-norm. We provide a data set

with dimension O(k)such that Algorithm 3.3computes a solution

with an approximation factor of Ω(logk).

Proposition 3.32.For the metric based on the L1-norm, there exists a data

set X⊂Rdwith d=k+log2ksuch that Algorithm 3.3computes a solution

to the diameter k-clustering problem (Problem 2.7) with 1

2log2ktimes the

cost of an optimal solution.

Proof. For the sake of simplicity, assume kto be a power of 2. In the

following, we consider the (k+log2k)-dimensional set Xof |X|=k2

points defined by

X:= {[ei

b]∀1⩽i⩽kand b∈{0,1}log2k}.

3.6 lower bounds

Here, ei∈Rkdenotes the i-th canonical unit vector. Consider the

k-clustering

C∗

k:= {Cbb∈{0,1}log2k},

where for each b∈{0,1}log2kthe cluster Cbis given by

Cb:= {[ei

b]∀1⩽i⩽k}.

The largest diameter of C∗

kis costdiam(C∗

k)=2. Hence, for the diameter

optkof an optimal solution, it follows that

optk⩽2. (14)

To investigate the merge steps of Algorithm 3.3, note that

diam({[ei

b1],[ej

b2]})={h(b1,b2)if i=j

2+h(b1,b2)if i=j

where h(b1,b2)denotes the Hamming distance between the strings

b1,b2∈{0,1}log2k. Hence, we may assume that Algorithm 3.3starts

by merging the points [ei,0,b′]⊤and [ei,1,b′]⊤for all 1⩽i⩽kand

all b′∈{0,1}log2(k)−1, thereby forming 1

2k2clusters of diameter 1.

Next, we show inductively that Algorithm 3.3keeps merging pairs

of clusters that agree on the first kcoordinates until the algorithm

halts. Assume that there is some number 1⩽t⩽log2ksuch that the

clustering computed so far consists solely of the clusters

C(t)

i,b′=













b′



b∈{0,1}t









for all 1⩽i⩽kand all b′∈{0,1}log2(k)−t. Note that for t=1this is

the case after the first 1

2k2merges. Then,

diam(C(t)

i,b1∪C(t)

j,b2)={t+h(b1,b2)if i=j

2+t+h(b1,b2)if i=j.

Hence, as above, we may assume that in the next 1

2t+1k2steps Algo-

rithm 3.3merges the clusters C(t)

i,0b′and C(t)

i,1b′for all 1⩽i⩽kand all

b′∈{0,1}log2(k)−(t+1). The resulting clusters have a diameter of t+1.

Furthermore,

C(t+1)

i,b′=C(t)

i,0b′∪C(t)

i,1b′.

Algorithm 3.3keeps merging clusters in this way until after t=

log2krounds we end up with the k-clustering Ck={Ci1⩽i⩽k}

where

Ci={[ei

b]b∈{0,1}log2k}.

These clusters Ciare of diameter log2k. Comparing to (14), we de-

duce that Algorithm 3.3computes a solution to Problem 2.7with at

least 1

2log2ktimes the cost of an optimal solution.

analysis of agglomerative clustering

Considering the diameter k-clustering problem with respect to an

arbitrary Lp-metric (with 1⩽p < ∞), note that the behavior of Al-

gorithm 3.3does not change if we consider the p-th power of the Lp-

distance instead of the Lp-distance. Also note that for all x,y∈{0,1}d

we have ∥x−y∥p

p=∥x−y∥1. Since the data set Xfrom Proposition 3.32

is a subset of {0,1}d, we immediately obtain the following corollary.

Corollary 3.33.For the metric based on any Lp-norm with 1⩽p < ∞,

there exists a data set X⊂Rdwith d=k+log2ksuch that Algorithm 3.3

computes a solution to the diameter k-clustering problem (Problem 2.7) with

√1

2log2ktimes the cost of an optimal solution.

Additionally, considering the diameter k-clustering problem with

respect to the L∞-metric, it is known that subset of npoints from an

arbitrary metric space can be embedded isometrically into (Rn,L∞)

[Fréchet, 1910]. That is, the distance between the embedded points

is equal to the original distances. The construction of the embedded

set is simple. For 1⩽i,j⩽n, let dij be the distance between xiand

xjin the metric space. Then, for each point xiwith 1⩽i⩽ndefine

x′

i:= (di1,...,din)T. It immediately follows ∥x′

i−x′

j∥∞=dij.

Hence, the data set of size n=k2from Proposition 3.32 yields a

data set in Rk2satisfying the same approximation bound with respect

to the L∞-distance. We obtain the following corollary.

Corollary 3.34.For the metric based on the L∞-norm, there exists a data

set X⊂Rdwith d=k2such that Algorithm 3.3computes a solution to the

diameter k-clustering problem (Problem 2.7) with 1

2log2ktimes the cost of

an optimal solution.

the discrete k-center problem

The data set Xfrom Proposition 3.32 also yields lower bounds for the

approximation factor of the agglomerative solution to the discrete k-

center problem. Just note that for the data set Xin every step of the

algorithm the minimal discrete radius of a cluster equals the diameter

of the cluster. We immediately obtain the following corollaries.

Corollary 3.35.For the metric based on any Lp-norm with 1⩽p < ∞,

there exists a data set X⊂Rdwith d=k+log2ksuch that Algorithm 3.1

computes a solution to the discrete k-center problem (Problem 2.9) with

√1

2log2ktimes the cost of an optimal solution.

Corollary 3.36.For the metric based on the L∞-norm, there exists a data

set X⊂Rdwith d=k2such that Algorithm 3.1computes a solution to the

discrete k-center problem (Problem 2.9) with 1

2log2ktimes the cost of an

optimal solution.

Moreover, in case of the L2-norm, we obtain the following corollary.

3.6 lower bounds

Corollary 3.37.For the metric based on the L2-norm, there exists a data set

X⊂Rdwith d=O(log3k)such that Algorithm 3.1computes a solution to

the discrete k-center problem (Problem 2.9) with Ω(√logk)times the cost

of an optimal solution.

Note that Corollary 3.37 follows by embedding the data set from

Corollary 3.35 into the O(log3k)-dimensional Euclidean space with-

out altering the behavior of the agglomerative algorithm or the lower

bound of Ω(√logk)(using the Johnson-Lindenstrauss embedding

[Johnson and Joram, 1984]). For this embedded data set, the bound

given in Section 3.2states an upper bound of 20d +2log(k) + 2=

O(log3k)times the cost of an optimal solution. Hence, in case of the

discrete k-center problem using the L2-metric, the upper bound from

our analysis almost matches the lower bound.

Furthermore, this implies that the approximation factor of Algo-

rithm 3.1cannot be simultaneously independent of dand logk. More

precisely, the approximation factor cannot be sublinear in 6

√dand in

√logk.

Part II

ON PARAMETER ESTIMATION

The second part of this thesis is about the parameter esti-

mation problem for statistical models. Statistical models

offer an alternate view on the structure of a given data

set that differs substantially from the hard clustering ap-

proach discussed in Part I. Using this approach, the data

set is assumed to be generated by a stochastic process.

Then, the parameters of this process can be interpreted as

a description of the structure of the data. While a partition

of the data as computed by hard clustering algorithms is

only valid for the given data set, the parameters of the

stochastic process may also be used to describe other data

sets generated by the same process.

The main result of Part II is the analysis of a probabilistic

variant of the classical EM algorithm for the parameter es-

timation problem of Gaussian mixture models. We show

that the probabilistic algorithm is considerably faster than

the classical algorithm while, with high probability, its pa-

rameter updates are close to the deterministic EM updates.

Furthermore, we confirm our theoretical results in an ex-

perimental analysis.

STATISTICAL MODELS

One way to deal with large data sets is to reduce the amount of in-

formation that has to be processed by investigating the distribution

of the data. That is, if we are merely interested in the correlation of

the data, then a loss of the information residing in the single data

elements is acceptable. In this chapter we deal with the description

of data by statistical models. That is, we assume that a given data set

was generated by a particular statistical process. Then, the parame-

ters of this process give a small description of the data set.

In this chapter we discuss the problem of obtaining suitable pa-

rameters for particular statistical models. We focus our attention on

the well understood Gaussian mixture models. Gaussian mixtures

are frequently used in practice since they are suitable for the expla-

nation of many real world data sets. One often cited example is the

Old Faithful data set. It is named after a famous geyser in the Yel-

lowstone National Park where its data is taken from. The data set

is two-dimensional and contains 272 entries. Each entry corresponds

to one observation of an eruption and consists of the eruption time

in minutes and the waiting time to the next eruption. The data set

is shown in Figure 15. We see that the eruptions seem to follow two

different time patterns. The measurements decompose into two ellip-

bbb

b b

bb b

b b

bbb

b b

bbb

b b

Figure 15: The Old Faithful data set.

statistical models

soidally shaped point clouds. As we will discuss in Section 4.3, this

makes the data set a suitable candidate for Gaussian mixture models.

For the understanding of this and the following chapters, some

basic knowledge of discrete and continuous probability theory is in-

dispensable. In the following, without introducing the necessary con-

cepts, we denote the probability of an observation Aby Pr(A)and we

denote the probability of Agiven an observation Bby Pr(A|B). For

observations from continuous domains, Pr(·)denotes the probability

density function. Although formally incorrect, we use the term prob-

ability in the discrete as well as in the continuous case. However, it is

always clear from the context whether a probability or density func-

tion is meant. Furthermore, we denote random variables by upper

case letters and concrete observations by lower case letters. That is,

Pr(X=x)denotes the probability that the random variable Xtakes on

the value x(in the continuous case, it denotes the density of Xat the

point x). If the focus does not lie on the density of concrete observa-

tions but on the density function, then we simply write Pr(X)which

strictly speaking is a random variable itself. A short introduction to

probability theory can be found in [Bishop, 2006]. For a profound

treatment of the matter see [Grimmett and Stirzaker, 2001].

4.1the parameter estimation problem

Astatistical model for some domain Xconsists of a probability dis-

tribution on Xand a set of associated parameters. The probability

distribution is given by a density function pθ:X→Rthat is gov-We consider the

parameters to be

given as a vector θ

from a proper

parameter domain.

erned by a parameter vector θ. Then, data from the domain Xis

modeled by a random variable X∈Xwith

Pr(X=x|θ)=pθ(x)

for x∈X. If we assume that the data consists of n∈Nelements

that are drawn independently from the same distribution, we denote

X= (X1,. . . ,Xn)∈Xnfor independent and identically distributed

random variables X1,...,Xn∈X. Then, the density pθis extended

to the domain Xnby computing the product probabilities over the

single observations. More precisely,

Pr(X= (x1,. . . ,xn)|θ)=pθ(x1,...,xn) =

∏

i=1

pθ(xi).

Note that we denote data sets by ordered tuples. Besides simpli-

fying the discussion (e.g., regarding repetitions) it is natural to look

at the independent draws from the same distribution as a sequen-

tial process. Since we consider the data to be the result of a random

process, we use the terms data (set) and observation interchangeably.

Using a statistical model to describe a given data set, the central

question is how to optimize the parameter vector θ. A natural ap-

4.1 the parameter estimation problem

proach is to try and maximize the probability that θis responsible for

a given observation X, i. e., to compute

θ:= argmax

Pr(θ|X).

Pr(θ|X)is called the posterior probability since it is obtained after the

observation of X. However, for this probability to be well defined, we

need to assume a probability distribution defined on the domain of

the parameters. Then, using the probability Pr(θ)(called prior proba-

bility), Bayes’ theorem yields

Pr(θ|X)=Pr(X|θ)Pr(θ)

Pr(X),

where Pr(X|θ)is given by the statistical model. Since the denomina-

tor on the right-hand side does not depend on θ, we can state

Pr(θ|X)∝Pr(X|θ)Pr(θ).

The approach to optimize θbased on this relationship is called the A resource for the

Bayesian approach is

[Bishop, 2006].

Bayesian approach.

In this thesis we consider an alternate approach called the frequen-

tist approach. Adopting the frequentist approach, no assumptions are

made on the parameters. The optimization of the parameter vector θ

is done by maximizing the probability of the observation Xgiven θ,

i.e., to compute

θ∗:= argmax

Pr(X|θ).

It is important to note that this argumentum maximi is not guar-

anteed to be well defined. The probability Pr(X|θ)does not have to

have a global maximum (e.g., for Gaussian mixtures, cf. Section 4.3.2).

However, for a well posed problem there should at least exist a local

maximum that can be searched for.

Another well known pitfall of this approach is its vulnerability to For a detailed

treatment of

overfitting see

[Hawkins, 2004].

the problem of overfitting. This means that a parameter vector θat

a global or local maximum of Pr(X|θ)may be so well adapted to

the concrete observation Xthat the resulting statistical model is less

suitable to describe similar data from the same application.

If the probability Pr(X|θ)is viewed as a function of the parameter

vector θ, then it is called the likelihood of θgiven the observation X

and is denoted as

L(θ|X)=Pr(X|θ).

Therefore, θ∗as defined above, is called the maximum likelihood esti-

mate. The problem of computing or approximating the maximum

likelihood estimate is called the parameter estimation problem.

Instead of maximizing the likelihood it is equivalent and some-

times analytically more convenient to maximize its logarithm (cf. Sec-

tion 4.3). The function lnL(θ|X)is called the log-likelihood of θgiven

statistical models

X. For observations X= (X1,. . . ,Xn)as described above, the log-

likelihood takes the form

∏

i=1

pθ(Xi) =

∑

i=1

lnpθ(Xi) =

∑

i=1

lnL(θ|Xi).

That is, each sub-observation Xicontributes an additive term to the

log-likelihood cost function. This contribution is simply the log-like-

lihood of the parameter vector θgiven the single observation Xi.

In order to be able to interpret the optimization of θas the min-

imization of a cost function, it is common to minimize the negative

log-likelihood, i.e., to compute

θ∗:= argmin

−lnL(θ|X).

4.2general mixture models

In this section, we study the parameter estimation problem for a par-

ticular class of statistical models, namely the mixture models or mix-

ture distributions. Mixture models are well suited to model data that

can be partitioned into several subsets which can be seen as to be

generated by different distributions over the same domain X. For a

mixture of kdistributions, the parameter vector takes the form

θ= (w1,...,wk,θ1,...,θk),

where w1,. . . ,wk∈R+with ∑k

j=1wj=1are called the weights and

θjis the parameter vector for the j-th component distribution. We

denote the probability density for the corresponding mixture by

pθ(x) =

∑

j=1

wjpθj(x),

where pθjdenotes the probability density of the j-th component dis-

tribution. Since the weights sum up to one, it follows that

∫X

pθ(x)∂x =

∑

j=1

wj∫X

pθj(x)∂x =

∑

j=1

wj=1.

Thus, pθ(x)is indeed a probability density. Drawing a single observa-

tion X=xfrom a mixture distribution can be described as a two step

process. First, choose one of the component distributions with prob-

ability proportional to its weight. Second, draw xfrom the chosen

component distribution.

For the likelihood of θgiven an observation X=(x1,. . . ,xn)it fol-

lows that

L(θ|X)=

∏

i=1

pθ(xi) =

∏

i=1

∑

j=1

wjpθj(xi)

4.3 gaussian mixture models

and for the log-likelihood we get

lnL(θ|X)=

∑

i=1

lnpθ(xi) =

∑

i=1

∑

j=1

wjpθj(xi).

Note that this expression contains the logarithm of a sum. There-

fore, in the mixture case, the likelihood is difficult to optimize di-

rectly. However, in Chapter 5we discuss an algorithmic approach to

approximate the maximum likelihood estimate.

4.3gaussian mixture models

Gaussian distributions (or short, Gaussians) are statistical models for

points from the d-dimensional space Rd. A single Gaussian is an uni-

modal distribution. That is, points drawn from a single Gaussian are

concentrated around a single point, called the mean of the Gaussian.

Furthermore, the points form an ellipsoidal shape and their density

drops exponentially with the distance to the mean. By mixing several

Gaussian distributions, it is possible to model multimodal data sets.

4.3.1single gaussian distributions

The Gaussian distribution (also known as the normal distribution) is a

continuous unimodal probability distribution for real valued random

variables. In the one-dimensional case, it is called the univariate Gaus-

sian distribution and is given by the probability density Nµ,σ2:R→R

with

Nµ,σ2(x) := 1

√2πσ2exp(−(x−µ)2

2σ2),

where µ,σ2∈Rand σ2> 0. The expected value of a random variable

X∼ Nµ,σ2is given by For a derivation of

these equalities see

Section 2.3in

[Bishop, 2006].

E[X]=∫∞

−∞

x·Nµ,σ2(x)∂x =µ

and the variance of Xis given by

E[(X−µ)2]=∫∞

−∞

(x−µ)2·Nµ,σ2(x)∂x =σ2.

Therefore, the parameter µis called the mean of the Gaussian distri-

bution and the parameter σ2is called its variance.

The extension of the Gaussian distribution to the d-dimensional

space Rdis called the multivariate Gaussian distribution (cf. Figure 16)

and is given by the probability density Nµ,Σ:Rd→Rwith

Nµ,Σ(x) := 1

√(2π)d|Σ|exp(−1

2(x−µ)TΣ−1(x−µ)), (15)

statistical models

...

Figure 16: Density of a two-dimensional Gaussian distribution.

where µ∈Rdand Σ∈Rd×dis a symmetric positive-definite matrix.

Analogously to the univariate case, the expected value of a random

variable X∼ Nµ,Σis given by

E[X]=∫Rd

x·Nµ,Σ(x)∂x =µ

and the covariance matrix of Xis given by

E[(X−µ)(X−µ)T]=∫Rd

(x−µ)(x−µ)T·Nµ,σ2(x)∂x =Σ.

Therefore, the parameter Σis called the covariance matrix of the Gaus-

sian distribution. In the following, we take a closer look at the geo-

metrical form of the Gaussian density for different types of covariance

matrices.

For Σbeing a multiple of the identity matrix, i.e.,

Σ=σ2Id∈Rd×d

for σ2> 0, we deduce

Nµ,σ2Id(x) = 1

√(2πσ2)dexp(−1

2σ2∥x−µ∥2)(16)

√(2πσ2)dexp(−1

2σ2

∑

i=1

(xi−µi)2)

∏

i=1

√2πσ2exp(−1

2σ2(xi−µi)2)

∏

i=1

Nµi,σ2(xi),

where x= (x1,. . . ,xd)Tand µ= (µ1,. . . ,µd)T. That is, each coordi-

nate of a random variable X∼ Nµ,σ2Idis distributed independently

according to a univariate Gaussian around the corresponding coordi-

nate of µand with variance σ2. Furthermore, Equation (16) yields

that points with the same density lie on a sphere centered at µ(cf.

Figure 17a).

4.3 gaussian mixture models

(a) Σ=σ2Id

σ1

σ2

(b) diagonal Σ

√λ1

√λ2

Figure 17: Points with constant density for two-dimensional Gaus-

sians with different types of covariance matrices.

If Σis a diagonal matrix with diagonal elements σ2

1,. . . ,σ2

d∈R+,

then Σ−1=Diag(1

σ2

1,. . . ,1

σ2

1)and

Nµ,Diag(σ2

1,...,σ2

d)(x) = 1

√(2π)d∏d

i=1σ2

exp(d

∑

i=1

−(xi−µi)2

2σ2

∏

i=1

√2πσ2

exp(−(xi−µi)2

2σ2

∏

i=1

Nµi,σ2

i(xi).

That is, the i-th coordinate of X∼ Nµ,Diag(σ2

1,...,σ2

d)is distributed inde-

pendently according to a univariate Gaussian around µiagain. How-

ever, in the diagonal case the variance of the i-th coordinate is σ2

That is, a Gaussian with diagonal covariance matrix can be thought

of to be derived from a Gaussian with covariance Idby shrinking

and stretching along the coordinate axes. More precisely, by mul-

tiplying the i-th coordinate of X= (X1,. . . ,Xd)∼ Nµ,Idwith the

standard deviation σifor i=1,. . . ,d, we get a random variable

Y=(σ1X1,. . . ,σdXd)T∼ Nµ,Diag(σ2

1,...,σ2

d)(cf. Figure 17b).

For a general symmetric positive-definite Σ, we consider its unique

Cholesky Factorization See Theorem 4.2.7in

[Golub and

Van Loan, 2012].

Σ=LLT,

where Lis an invertible lower triangular matrix with positive diago-

nal entries. The matrix Lcan be uniquely decomposed as

L=VD

for an orthonormal matrix Vand a diagonal matrix Dwith positive

entries. While the linear map x7→ Dx stretches (or shrinks) xalong

the coordinate axes, the linear map x7→ Vx is a rotation around and

reflection at the origin. That is, applying the linear map x7→ Lx to the

d-dimensional unit ball results in an ellipsoid centered at the origin.

The above decomposition of Σis unique. Furthermore, for any or-

thonormal matrix Vand any diagonal matrix Dwith positive entries,

the matrix LLTwith L=VD is symmetric positive-definite. Therefore,

statistical models

it is common to identify a symmetric positive-definite matrix with the

associated ellipsoid.

We use the decomposition of Σstated above to decompose Σ−1as

Σ−1=(LT)−1L−1=(L−1)TL−1

=((VD)−1)T(VD)−1

=(D−1V−1)TD−1V−1.

Then, for the density of the multivariate Gaussian distribution, we

deduce

Nµ,Σ(x) = 1

√(2π)d|Σ|exp(−∥D−1V−1(x−µ)∥2

2). (17)

That is, a Gaussian with general symmetric positive-definite covari-

ance matrix can be thought of to be derived from a Gaussian with

diagonal covariance by rotating around the mean (cf. Figure 17c).

By ignoring the normalization factor |Σ|−1

2, Equation (17) offers an

alternate view on the density function. The evaluation of Nµ,Σat a

point xcan be interpreted as a two step process. In the first step, xis

rotated around µby the linear map

x7→ V−1(x−µ) + µ

and afterwards the coordinates of xare scaled relative to µby the

linear map

x7→ D−1(x−µ) + µ.

This yields the point

y=D−1V−1(x−µ) + µ. (18)

In the second step, the point yis passed to the density Nµ,Id.

Points with constant density have to be mapped to a sphere around

µin the first step (cf. Equation (16)). To see which points satisfy this

property, let zbe an arbitrary point on the translation of the unit

sphere centered at µ. Then, the linear map

x7→ VD(x−µ) + µ

maps zonto the translation of the ellipsoid associated to Σcentered

at µ. Substituting x=VD(z−µ) + µin Equation 18 yields

y=D−1V−1VD(z−µ) + µ=z.

This equality also holds for zlying on an arbitrary sphere centered

at µ. We conclude that points with constant density lie on concentric,

confocal ellipsoids. More precisely, these ellipsoids result from scal-

ing the ellipsoid associated to Σand translating it to be centered at µ.

4.3 gaussian mixture models

Note that the lengths of the semi-axes of the ellipsoid associated to Σ

are the square roots of the eigenvalues of Σ(cf. Figure 17c).

See Chapter 2.3in

[Bishop, 2006].

For a single Gaussian distribution, the parameter vector θtakes the

form

θ= (µ,Σ).

Then, for the likelihood of θgiven an observation X=(x1,...,xn), we

get

L(θ|X)=Pr(X|θ)=

∏

i=1

Nµ,Σ(xi)

∏

i=1

√(2π)d|Σ|exp(−1

2(xi−µ)TΣ−1(xi−µ)).

As indicated in the previous section, it is analytically much simpler

to consider the log-likelihood. The result of applying the logarithm

is that the product transforms into a sum and that the exponential

function disappears. For the negative log-likelihood of a parameter

vector θ= (µ,Σ)given an observation X=(x1,...,xn), we get

−lnL(µ,Σ|X)=

∑

i=1

−lnNµ,Σ(xi)(19)

∑

i=1(ln√(2π)d|Σ|+1

2(xi−µ)TΣ−1(xi−µ))

∑

i=1(d

2ln2π +1

2ln|Σ|+1

2(xi−µ)TΣ−1(xi−µ))

=nd

2ln2π +n

2ln|Σ|+

∑

i=1

2(xi−µ)TΣ−1(xi−µ).

The parameter estimation problem for a single Gaussian distribu-

tion, i. e., the maximization of L(µ,Σ|X)with respect to µand Σ, is

easy to solve. There exist closed form formulas for the values µ∗and

Σ∗of the maximum likelihood estimate θ∗, namely For a derivation of

these equalities see

[Bilmes, 1997].

µ∗=1

∑

i=1

xiand Σ∗=1

∑

i=1

(xi−µ∗)(xi−µ∗)T.

4.3.2mixture of gaussian distributions

As mentioned above, single Gaussian distributions are well suited to

model ellipsoidally shaped data sets. However, if the data set splits

into several possibly separated ellipsoidally shaped parts, then we

need a more general type of distribution called Gaussian mixture dis-

tribution.

Gaussian mixtures are a special case of mixture distributions (cf.

Section 4.2), where all components are single Gaussian distributions.

statistical models

...

Figure 18: Density of a mixture of three Gaussian distributions.

For a mixture of kGaussians, the parameter vectors θ1,. . . ,θkof the

component Gaussians take the form

θj=(µj,Σj),

where µjis the mean of the j-th component and Σjis its covariance

matrix. Then, the parameter vector of the mixture takes the form

θ= (w1,. . . ,wk,µ1,. . . ,µk,Σ1,. . . ,Σk)

and the probability density Mθof the mixture (cf. Figure 18) is de-

fined as

Mθ(x) :=

∑

j=1

wjNµj,Σj(x).

For the likelihood of θgiven an observation X=(x1,. . . ,xn)it fol-

lows that

L(θ|X)=

∏

i=1

Mθ(xi) =

∏

i=1

∑

j=1

wjNµj,Σj(xi)

and for the log-likelihood we get

lnL(θ|X)=

∑

i=1

lnMθ(xi) =

∑

i=1

∑

j=1

wjNµj,Σj(xi). (20)

Note that in contrast to the case of a single Gaussian distribution,

we are not able to move the logarithm operator in front of the Gaus-

sian density function (cf. Equation (19)). The summation over the k

components thus prevents the derivation of a simple formula for the

log-likelihood.

In case of Gaussian mixture models, it is hopeless to try and de-

duce closed formulas for the maximum likelihood estimate anyway.

The problem is that for the mixture of at least two Gaussians the

maximum likelihood estimate is not well defined. On the contrary,

it is always possible to generate parameters with arbitrary large log-

likelihood.

4.4 handy notions from probability theory

To see this, we fix an arbitrary point of the observed data X=

(x1,. . . ,xn)and customize the parameters for one of the kcomponent

Gaussians. Without loss of generality, we consider the data point x1

and the parameters w1,µ1, and Σ1. While w1can be set to any con-

stant c∈Rwith 0<c<1, we define µ1=x1. Furthermore, for any

α∈Rwith α > 0 we define Σα∈Rd×dto be a symmetric positive

definite matrix with determinant |Σα|=α. Then, for αtending to

zero and x=x1, the density Nµ1,Σα(x)also tends to zero (cf. Equa-

tion (15)). This is because of the exponential drop of the Gaussian

density function. Thus, for Σ1=Σαand αsmall enough, the contri-

bution of the points xiwith xi=xto the log-likelihood is dominated

by the components 2to k(cf. Equation (20)). More precisely, for

α→0their contribution converges to some constant that depends on

θ2,. . . ,θk. However, for αsmall enough, the contribution of x=x1to

the log-likelihood is dominated by the first component. For α→0the

density Nµ1,Σα(x1)converges to ∞. Thus, the contribution of x=x1

also converges to ∞.

As a consequence, one should always keep in mind that in case

of the mixture of Gaussian distributions the parameter estimation

problem is an ill-posed problem.

4.4handy notions from probability theory

In this section, we introduce some notions from probability theory

which we need for the discussion of the EM algorithm in Chapter 5

and for the analysis of the SEM algorithm in Chapter 6.

4.4.1distinction between distribution and density

Probability distributions are usually given by the corresponding den-

sity functions. Therefore, it is common to identify a distribution with

its density. Although formally not correct, this convention generally

does not lead to any ambiguities. We join this common practice for

the remainder of this thesis since it simplifies the discussion.

4.4.2on information theory

In the following, we give a brief description of two concepts from

information theory, namely the entropy of a random variable and

the Kullback-Leibler divergence between two random variables. We

define both quantities for the discrete as well as for the continuous

case.

For the following definitions of entropy and differential entropy,

we use the convention that

0ln0=0

statistical models

which arises from continuity since xlnx→0as x→0.

Definition 4.1.Let Xbe a random variable over a discrete domain Xwith

probability mass function p:X→R. Then, the entropy of Xis defined as

H(X):= − ∑

x∈X

p(x)lnp(x).

Since the entropy of Xdepends only on the probability mass p, it

is also denoted as H(p). The entropy of Xis a non-negative quantity

which can be interpreted as the amount of information that is ob-

tained by observing the value of X. We use the entropy only for nota-

tional purposes. For a detailed discussion of the entropy see Chapter

2.1in [Cover and Thomas, 2012]. The continuous correspondence of

the entropy is called the differential entropy.

Definition 4.2.Let Xbe a random variable over a continuous domain X

with continuous and differentiable probability density function p:X→R.

Then, the differential entropy of Xis defined as

H(X):= − ∫X

p(x)lnp(x)∂x.

Again, we also denote the differential entropy of Xas H(p), since

it depends only on the probability density p. For a discussion of the

relation between entropy and differential entropy, see Chapter 8.3in

[Cover and Thomas, 2012].

For the following definitions of the Kullback-Leibler divergence, we

use the conventions that

0ln 0

0=0and pln p

0=∞

for p > 0.

Definition 4.3.Let X,Ybe random variables over a discrete domain X.

Furthermore, let pand qbe the probability mass functions of Xand Y, re-

spectively. Then, the Kullback-Leibler divergence between Xand Yis

defined as

KL(X∥Y):= ∑

x∈X

p(x)ln p(x)

q(x).

Note that the Kullback-Leibler divergence is not symmetric, which

is the reason why it should not be called Kullback-Leibler distance.

For a motivation and discussion of the Kullback-Leibler divergence

see Chapter 2.3in [Cover and Thomas, 2012]. Just as the entropy, the

Kullback-Leibler divergence can also be defined for the continuous

case.

Definition 4.4.Let X,Ybe random variables over a continuous domain

X. Let pand qbe the probability density functions of Xand Y, respec-

tively. Furthermore, let pand qbe continuous and differentiable. Then, the

Kullback-Leibler divergence between Xand Yis defined as

KL(X∥Y):= ∫X

p(x)ln p(x)

q(x)∂x.

4.4 handy notions from probability theory

A discussion of the Kullback-Leibler divergence for the continuous

case can be found in Chapter 8.5of [Cover and Thomas, 2012].

In the discrete as well as in the continuous case, we also denote

KL(X∥Y)as KL(p∥q)since it depends only on pand q. Furthermore,

the Kullback-Leibler divergence between Xand Yis a non-negative

quantity that is zero if and only if p=q.

THE CLASSICAL EM ALGORITHM

Throughout this chapter, we deal with an algorithmic approach to

solving the parameter estimation problem for some fixed statistical

model. More precisely, we discuss a general scheme for the approxi-

mation of the corresponding maximum likelihood estimate (see Chap-

ter 4.1), called the Expectation-Maximization algorithm (or simply

EM algorithm). Since its introduction in [Dempster et al., 1977] a lot

of work has been done to analyze and improve the EM algorithm.

Today it is very well understood [McLachlan and Krishnan, 2008].

In the following, we explain the algorithm, discuss its convergence

properties, and apply it to general mixture distributions as well as

Gaussian mixtures. The discussion of the general EM algorithm and

its convergence properties follows the presentation in Chapter 2.3of

[Bishop, 2006]. The application to general and Gaussian mixtures is

based on [Bilmes, 1997].

The EM algorithm is used when the observed data can be seen as

incomplete and when the corresponding complete data helps signif-

icantly to solve the parameter estimation problem. The main scope

of the algorithm are statistical models for which direct optimization

is impossible. For example, let us assume that the observed data

was generated by a mixture of Gaussian distributions. Then the com-

putation of the maximum likelihood estimate is computationally in-

tractable (see Chapter 4.3.2). However, if we knew for each observed

point from which component distribution it was generated, then the

solution could be computed separately for each component by closed

form formulas (see Chapter 4.3.1). Therefore, for Gaussian mixture

models we define the origin of each data point as the so-called hidden

values.

In the following, we denote the incomplete observation by the ran-

dom variable Xand the hidden values by the random variable Z. Fur-

thermore, we denote the joint distribution of the complete data (X,Z)Recall that we

identify a

distribution with

its density.

pθ(x,z) = Pr(X=x,Z=z|θ),

where θdenotes a vector of model parameters. Then, for the distribu-

tion of the incomplete data we deduce

pθ(x) = ∫pθ(x,Z)∂Z =Pr(X=x|θ). (21)

the classical em algorithm

Algorithm 5.1:Classical EM algorithm

Input: incomplete observation X, initial model θ

while <condition> do

1q(Z)←− Pr(Z|X,θ)// for hidden r.v. Z

2maximize EZ∼q[lnPr(X,Z|θ)] w. r. t. θ

end

return θ

Note that for Gaussian mixtures the integral becomes a sum over

the discrete space of possible assignments of each point to one of the

component distributions. The complete data likelihood pθ(X,Z)often

has a much simpler form than the incomplete data likelihood pθ(X).

Using Bayes’ theorem,

pθ(x,z) = Pr(Z=z|θ)Pr(X=x|θ,Z=z).

In case of mixture distributions for example, this leads to simple up-

date equations of the EM algorithm (see Section 5.2and Section 5.3).

The parameter estimation problem given the incomplete observa-cf. Chapter 4.1

tion Xis to maximize the likelihood function

L(θ|X)=Pr(X|θ)=∫pθ(X,Z)∂Z

over the choice of θ. For the discussion of the EM algorithm we

furthermore need the distribution pX,θof the hidden values given the

incomplete observation. Using Bayes’ theorem,

pX,θ(z) = pθ(X,z)

pθ(X)=Pr(Z=z|X,θ). (22)

The EM algorithm is an iterative method that, given an initial es-In Chapter 7.3we

describe a simple

method of obtaining

initial parameters.

timate of the model parameters, tries to approximate the maximum

likelihood estimate by maximizing the expected complete data log-

likelihood. That is, it computes

θnew := argmax

Z∼pX,θold

[lnPr(X,Z|θ)] .

The expectation is taken over the distribution of the hidden values

that is induced by the previous parameter estimate θold. The EM al-

gorithm in its general form is stated as Algorithm 5.1, where the con-

dition of the while loop may be filled in with any reasonable choice,

e.g., a convergence criterion or a fixed number of iterations.

Each iteration of the loop in Algorithm 5.1consists of two steps.

Step 1derives the distribution over the hidden values that is induced

by the previous parameter estimate. Fixing this distribution, the ex-

pectation over the hidden values that is to be maximized can be stated

as a function of the parameters θ. Therefore, the first step is called

5.1 convergence of the em algorithm

the expectation step. However, no particular expectation is computed

in this first step. In applications of the EM algorithm for concrete

statistical models, the derived distribution is often discrete as for ex-

ample for Gaussian mixtures. Then, the implementation of the expec-

tation step usually consists of precomputations of the corresponding

density function for possible values of Z. In case of Gaussian mixture

models, it is sufficient to precompute the probability that a certain

point was generated by a certain component for each point and each

component distribution (see Section 5.3). Step 2is called the maxi-

mization step and computes the new set of parameters by maximizing

the expectation over the choice of θ.

For the EM algorithm to be applicable, we have to assume that the

maximization in Step 2is computationally tractable. In practice, this

is often the case (e.g., for mixtures of distributions from the expo-

nential family). There also exist computationally simpler variants of

the EM algorithm that may be used if the maximization step is not

(efficiently) computable. The General EM (or GEM) algorithm for

example tries to only increase the expectation in the second step (cf.

[Bishop, 2006]). Furthermore, there exist probabilistic variants like

the SEM algorithm discussed in Chapter 6. The SEM algorithm sim-

plifies the computation by avoiding to deal with the expectation at

all. Instead, it simply guesses the hidden values which considerably

simplifies the maximization step.

To analyze the convergence properties of the EM algorithm in the

following section, we assume that two preconditions are met. First,

as mentioned above, the maximization step has to be computationally

tractable. Second, the log-likelihood function for the considered sta-

tistical model and the expectation from the maximization step both

have to be differentiable. The above mentioned and frequently used

mixtures of distributions from the exponential family satisfy these

preconditions.

5.1convergence of the em algorithm

The parameter vectors computed by the EM algorithm converge to a

stationary point of the likelihood function. To see this, we here adopt

the line of argument in Chapter 2.3of [Bishop, 2006].

In this section, we denote the log-likelihood function given an in-

complete observation Xby lnpθ(X)as a function of θ. Note that cf. Equation (21)

applying the logarithm does not change the stationary points.

Theorem 5.1.Let θ0be the initial parameter estimate and for i > 0 let θi

be the parameter estimate θcomputed by Algorithm 5.1in the i-th iteration

of Step 2. Then, for increasing ithe log-likelihood of θiconverges to a

stationary point of lnpθ(X).

At first sight, the significance of Theorem 5.1may appear low. It

does not even guarantee the convergence to a local maximum, let

the classical em algorithm

lnpθ(X)

EZ∼q[lnpθ(X,Z)]+H(q)q=pX,θ

KL(q∥pX,θ)

Figure 19: Decomposition of lnpθ(X)over the choice of q.

alone the convergence to a global one. However, in general we can

not hope for more since it is known that the EM algorithm can get

stuck on a saddle point of the likelihood function (see Chapter 3.6of

[McLachlan and Krishnan, 2008]). Moreover, since a global maximum

does not even have to exist (cf. Chapter 4.3.2), this does not have to

be an undesirable property. In practice, the local maxima may be the

original target. And even a saddle point can be of great interest when

the local maxima fall victim to the problem of overfitting mentioned

in Chapter 4.1.

To prove Theorem 5.1, we use the following lemma, which decom-

poses the log-likelihood of the incomplete data as shown in Figure 19.

Note that the complete data log-likelihood is denoted by lnpθ(X,Z)

and recall that pX,θdenotes the distribution over the hidden valuescf. Equation (22)

that is induced by the incomplete observation Xand the parameter

estimate θ.

Lemma 5.2.Let qbe an arbitrary distribution over the hidden values Z.

Then, the log-likelihood of a parameter vector θgiven an observation Xcan

be written as

cf. Figure 19 lnpθ(X) = E

Z∼q[lnpθ(X,Z)]+H(q)+KL(q∥pX,θ).

Since the Kullback-Leibler divergence is always non-negative and

it is zero if and only if both densities are identical (see Chapter 4.4.2),

Lemma 5.2immediately yields the following corollary.

Corollary 5.3.Let qbe an arbitrary distribution over the hidden values Z.

Then, the log-likelihood of a parameter vector θgiven an observation Xcan

be lower bounded by

lnpθ(X)⩾E

Z∼q[lnpθ(X,Z)]+H(q).

Furthermore, the bound holds with equality if and only if q=pX,θ.

5.1 convergence of the em algorithm

Proof of Lemma 5.2.Using the definition of the Kullback-Leibler diver-

gence (see Chapter 4.4.2),

lnpθ(X) − KL(q∥pX,θ)

=lnpθ(X) + ∫q(Z)ln(pX,θ(Z)

q(Z))∂Z

=∫q(Z)(lnpθ(X) + ln(pX,θ(Z)

q(Z)))∂Z

=∫q(Z)ln(ln(pθ(X)pX,θ(Z))

q(Z))∂Z.

From the definition of pX,θin Equation (22), we deduce

pθ(X)pX,θ(Z) = pθ(X,Z).

Thus,

lnpθ(X) − KL(q∥pX,θ)

=∫q(Z)ln(lnpθ(X,Z)

q(Z))∂Z

=∫q(Z)lnpθ(X,Z)∂Z −∫q(Z)lnq(Z)∂Z

Z∼q[lnpθ(X,Z)]+H(q),

where the last equality follows using the definitions of the expectation

and of the entropy (see Chapter 4.4.2).

Proof of Theorem 5.1.In the following, we consider a single iteration of This proof also

follows Chapter 2.3

in [Bishop, 2006].

Algorithm 5.1and show that the parameter updates do not decrease

the likelihood. Furthermore, we argue that the likelihood remains

constant only if it already reached a stationary point. We denote

the parameters before the considered round by θold and the updated

parameters by θnew.

Each round of the EM algorithm starts with the selection of the

distribution q=pX,θold over the hidden values Z, where

pX,θold(Z) = Pr(ZX,θold)

as specified in Algorithm 5.1. Based on this distribution, we define

fX,θold(θ) := E

Z∼pX,θold

[lnpθ(X,Z)]+H(pX,θold ).

Then, Corollary 5.3yields

lnpθ(X)⩾fX,θold (θ), (23)

i.e., fX,θold is a lower bound for lnpθ. Furthermore,

lnpθold (X) = fX,θold (θold)(24)

the classical em algorithm

lnpθ(X)

fX,θold(θ)

θold θnew

Figure 20: Optimization of lnpθ(X)over the choice of θ.

as sketched in Figure 20.

Since H(pX,θold)is independent of θ, in order to maximize fX,θold(θ)

over the choice of θ, it is sufficient to maximize EZ∼pX,θold [lnpθ(X,Z)].

That is, the maximization step of Algorithm 5.1computes

θnew := argmax

fX,θold(θ)

as shown in Figure 20. In particular,

fX,θold(θnew)⩾fX,θold (θold) = lnpθold(X),

where the equality at the end follows from Equation (24). Using In-

equality (23) with θ=θnew,

lnpθnew(X)⩾lnpθold (X).

It remains to show that

lnpθnew(X)>lnpθold (X)

if θold is not a stationary point of lnpθ(X)as a function of θ. Since

the gradient of g(θ) := lnpθ(X)is zero if and only if θis a stationary

point of g,

∂g(θ)

∂θ (θold)=0.

However, from Inequality (23) and Equation (24) it follows that g

touches fX,θold at the point θold as shown in Figure 20. Therefore, for

the gradient of fX,θold at the point θold we get

∂fX,θold(θ)

∂θ (θold) = ∂g(θ)

∂θ (θold)=0.

It follows that θold does not maximize fX,θold and thus

fX,θold(θnew)> fX,θold(θold).

Analogously to the derivation of non-decrease of the likelihood, we

deduce

lnpθnew(X)>lnpθold (X).

5.2 the em algorithm for mixture distributions

5.2the em algorithm for mixture distributions

In this section, we discuss the application of the EM algorithm to

general mixture distributions as introduced in Chapter 4.2. Let

θ= (w1,. . . ,wk,θ1,. . . ,θk)

be the parameter vector of the mixture, i.e., w1,. . . ,wk∈R+are the

weights and θ1,...,θkare the parameter vectors of the component

distributions. Then, the probability of a single observation X=xis

pθ(x) =

∑

j=1

wjpθj(x).

In order to apply the EM algorithm, we have to decide on the hid-

den values Z. That is, we have to carefully choose which unobserved

information would help us to solve the parameter estimation problem.

Recall that drawing a single observation from the mixture can be de-

scribed as a two step process. First, draw the j-th component with

probability wj, and second, draw xwith probability pθj(x). Based on

this interpretation, we define Z:= (zj)∈{0,1}kwith ∑k

j=1zj=1to be

the vector of binary random variables indicating which component

was chosen. Then, for j=1,. . . ,kwe get Pr(zj=1θ)=wj.

The probability of an observation X= (x1,...,xn)consisting of n∈

Nindependent draws from the mixture distribution is

pθ(x1,...,xn) =

∏

i=1

∑

j=1

wjp(xi|θj).

We define Z:= (zij)∈{0,1}n×kwith ∑k

j=1zij =1for i=1,...,nto

be the binary matrix indicating which component is responsible for

each sub-observation. In the mixture case, this indicator matrix Zis

the natural choice for the hidden values.

For i=1,...,nand j=1,...,kwe get Pr(zij =1θ)=wj. If we

denote the i-th row of the indicator matrix by zi, then the probability

to observe a particular row zican be written as

Note that all but one

summand of this

sum are zero.

pθ(zi) =

∑

j=1

zijwj.

Using this indicator matrix Z, the log-likelihood of θgiven the com-

plete data (X,Z)is given by

lnPr(X,Z|θ)=

∑

i=1

lnpθ(xi,zi) =

∑

i=1

ln(pzi,θ(xi)pθ(zi))

∑

i=1

lnpzi,θ(xi) +

∑

i=1

lnpθ(zi).

the classical em algorithm

Note that this decomposition is very useful since the two sums de-

pend on disjoint parts of the parameter vector θ. The first sum can be

written as

Here, all but one

summand of the

inner sum are zero.

∑

i=1

lnpzi,θ(xi) =

∑

i=1

∑

j=1

zijpθj(xi),

while the second sum can be written as

∑

i=1

lnpθ(zi) =

∑

i=1

∑

j=1

zijwj.

That is, the first sum only depends on θ1,. . . ,θkand the second sum

only depends on w1,...,wk. With the linearity of the expectation,

it follows that the maximization step of the EM algorithm can be

considered as two separate maximizations.

Obviously, without specifying the type of the component distribu-

tions, nothing can be said about the updates of the EM algorithm

for the parameter vectors θ1,. . . ,θk. However, there do exist closed

formulas for the updates of the weights w1,...,wk. Given an obser-

vation Xand a parameter vector θ, we define the responsibility rjof

the j-th component to be the expected number of observations drawn

from the j-th component distribution, i.e.,

rj:=

∑

i=1

Z∼pX,θ[zij]=

∑

i=1

Pr(zij =1X,θ).

By applying Bayes’ theorem twice, the responsibilities can be expressed

rj=

∑

i=1

Pr(xi,zij =1θ)

Pr(xi|θ)

∑

i=1

Pr(zij =1θ)Pr(xizij =1,θ)

Pr(xi|θ)

∑

i=1

wjpθj(xi)

∑k

ℓ=1wℓpθℓ(xi).

That is, the responsibilities are computable by evaluating the compo-

nent distributions and using the current parameters θonly. The EM

updates for the weights are the normalized responsibilities, i. e., for

j=1,. . . ,k,For a derivation of

these updates see

[Bilmes, 1997]. wnew

j:= rj

5.3the em algorithm for gaussian mixtures

We finish our introduction to the EM algorithm with an application

to Gaussian mixture models (cf. Chapter 4.3). Let

θ= (w1,. . . ,wk,µ1,. . . ,µk,Σ1,. . . ,Σk)

5.3 the em algorithm for gaussian mixtures

be the parameter vector of the Gaussian mixture, i. e., w1,. . . ,wk∈

R+are the weights, µ1,. . . ,µk∈Rdare the means of the component

Gaussians and Σ1,. . . ,Σk∈Rd×dare the corresponding covariance

matrices. As discussed in Section 5.2, the EM updates of the weights

are given by

wnew

j:= rj

n=1

∑

i=1

Pr(zij =1X,θ)

∑

i=1

wjNµj,Σj(xi)

∑k

ℓ=1wℓNµℓ,Σℓ(xi)

for j=1,. . . ,k, where rjis the responsibility of the j-th component as

defined in Section 5.2.

The derivation of the update equations for the means and the co-

variance matrices is rather elaborate and can for example be looked

up in [Bilmes, 1997]. For the j-th mean, the update equation is

µnew

j:= ∑n

i=1Pr(zij =1X,θ)xi

∑n

i=1Pr(zij =1X,θ)

∑

i=1

wjNµj,Σj(xi)xi

∑k

ℓ=1wℓNµℓ,Σℓ(xi).

That is, the contribution of each point to the j-th component mean

is proportional to its probability in the j-th component in relation

to its overall probability. Finally, for the j-th covariance, the update

equation is

Σnew

j:= ∑n

i=1Pr(zij =1X,θ)(xi−µnew

j)(xi−µnew

j)T

∑n

i=1Pr(zij =1X,θ)

∑

i=1

wjNµj,Σj(xi) (xi−µnew

j)(xi−µnew

j)T

∑k

ℓ=1wℓNµℓ,Σℓ(xi).

THE STOCHASTIC EM ALGORITHM

In this chapter, we study a probabilistic variant of the EM algorithm

that was originally proposed as the Stochastic EM or SEM algorithm

[Celeux and Diebolt, 1985]. The algorithm replaces the maximization

step of the EM algorithm with two simple steps. First, it guesses

the hidden values by sampling a fixed instance. Second, it maxi-

mizes the complete-data likelihood over the choice of the parameter

vector. Depending on the application, this simplification yields a con-

siderable speedup. Furthermore, there exist applications where the

maximization step of the classical EM algorithm is computationally

intractable (e.g., when it involves high dimensional integrations). In

these cases, the simplified maximization step may lead to practicable

update equations. However, the focus of this chapter lies on a com-

parison between the computations of the classical EM algorithm and

its probabilistic variant. We discuss only statistical models for which

both algorithms are applicable.

The models generated by the Stochastic EM algorithm are studied

in [Ip, 1994]. For mixtures of distributions from the exponential fam-

ily, the author shows that the sequence of models generated by the

SEM iterations is an ergodic Markov chain converging weakly to a

stationary distribution over models. Furthermore, the mean of this

stationary distribution is analyzed under appropriate assumptions.

It is shown that for the number of input points going to infinity, this

mean converges to the maximum likelihood estimate. However, the

mean of the stationary distribution usually can not be obtained by

a single run of the algorithm. Instead, a large number of restarts is

necessary to retrieve a reasonable approximation of the mean distri-

bution.

Practitioners are often satisfied with the solutions computed by the

EM algorithm. Therefore, we analyze the SEM algorithm in relation

to the classical EM algorithm. In contrast to the previous analysis of

the SEM algorithm that focused on stochastic properties of the above

mentioned Markov chain, we study a single run of the algorithm.

In our approach, we look at it as a probabilistic algorithm that tries

to imitate the behavior of a deterministic algorithm while using its

randomness to speed up the computations. Our results suggest that

in most cases where the classical EM algorithm is applicable, one

the stochastic em algorithm

Algorithm 6.1:Probabilistic EM algorithm

Input: incomplete observation X, initial model θ

while <condition> do

1q(Z)←− Pr(Z|X,θ)// for hidden r.v. Z

2adraw z∝q(Z=z)

2bmaximize Pr(X,Z=z|θ)w.r. t. θ

end

return θ

can use the SEM algorithm to obtain similar results more efficiently.

Moreover, in our experiments discussed in Chapter 7, we observe that

sometimes, compared with the EM algorithm, the SEM algorithm

computes parameters with a larger likelihood. This is most likely

due to its inherent randomness that allows the algorithm to escape

from a saddle point or an undesired local maximum of the likelihood

function (cf. Chapter 5.1). Note that the algorithm is hypothetically

able to even guess the ’true’ hidden values, although this generally

happens only with negligible probability.

In Section 6.1, we present the generic SEM algorithm. Afterwards,

in Section 6.2, we show that general mixture distributions are suitable

candidates for the application of the SEM algorithm. In Section 6.3,

we compare the updates of the EM and SEM algorithm for Gaussian

mixture models. We consider a single update step and show that with

high probability the algorithms perform almost the same computa-

tions for sufficiently large input data sets. Our experimental results

(see Chapter 7.4) confirm that this still holds for a large number of

successive steps. Moreover, we show that for Gaussian mixtures, the

simplified maximization step of the SEM algorithm leads to consid-

erably better running times compared to the classical EM algorithm.

This result is confirmed by our experiments as well.

6.1the generic sem algorithm

The SEM algorithm is stated as Algorithm 6.1. The algorithm differs

from the classical EM algorithm in the second step. It replaces the

maximization in Step 2of Algorithm 5.1with two alternate steps (cf.

Step 2a and Step 2b of Algorithm 6.1). Instead of operating with

the expectation over the choice of Z, it simply guesses the hidden

values. This is done in Step 2a by sampling an instance for Zusing the

distribution qfrom Step 1. Afterwards, the complete-data likelihood

given (X,Z)is maximized over the choice of θin Step 2b.

6.2 the sem algorithm for mixture distributions

Algorithm 6.2:SEM algorithm for mixtures

Input: incomplete observation X= (x1,...,xn),

initial model θ= (w1,...,wk,θ1,...,θk)

while <condition> do

1q(Z)←− Pr(Z|X,θ)// for indicator matrix Z

2adraw Z= (zij)∝q(Z)

2bfor j=1,. . . ,k do

wnew

j←− 1

n∑n

i=1zij

θnew

j←− argmaxθjPr(Yjθj)

where Yjis a vector of all xiwith zij =1

end

return θ

6.2the sem algorithm for mixture distributions

A crucial question concerning the SEM algorithm is for which class

of models it should be applied. In this section we show that mixture

distributions are suitable candidates. There is only one necessary

precondition, namely that for each component distribution separately,

the maximum likelihood estimate is computationally tractable.

The resulting SEM algorithm for mixture distributions is stated as

Algorithm 6.2. As for the EM algorithm, we denote the incomplete

observation by X= (x1,...,xn)and the model parameters by

θ= (w1,...,wk,θ1,...,θk).

Again, we choose the indicator matrix (zij)∈{0,1}n×kas the hidden cf. Chapter 5.2

values Z. Just as in case of the EM algorithm, the maximization step

can be carried out separately for the weights w1,...,wkand for the

component parameters θ1,. . . ,θk. To do this, we define n1,...,nkto

be the number of data elements that are assigned to each component

in Step 2a, i.e.,

nj=

∑

i=1

zij and

∑

j=1

nj=n.

Then, for the weight updates we simply compute the fraction of the In Section 6.3.2, we

show that these

updates are good

approximations of

the EM updates.

input data that is assigned to each component, i.e.,

wnew

j:= nj

Furthermore, by Yj∈Xnjwe denote the vector of elements xifor

i=1,...,nwith zij =1. Then, for the updates of θ1,...,θk, we

maximize the probability pθj(Yj)over the choice of θjfor j=1,...,k.

the stochastic em algorithm

6.2.1correctness

In the following, we show that Algorithm 6.2is a correct instantia-

tion of Algorithm 6.1for the mixture case. That is, we show that

computing

wnew

j:= 1

∑

i=1

zij and θnew

j:= argmax

θj

Pr(Yjθj)

does indeed maximize Pr(X,Z=z|θ). To prove the correctness of the

weight updates, we use the following lemma. Note that for k∈N,

we denote the standard (k−1)-simplex by

Sk−1=





(p1,. . . ,pk)∈Rk

∑

j=1

pj=1



.

Lemma 6.1.Let k,n,n1,...,nk∈Nwith ∑k

j=1nj=n. Then, the function

f:Sk−1→Rwith

f(p) = f(p1,...,pk) =

∑

j=1

njln(pj)

is maximized by p= (n1

n,...,nk

n).

Proof. Let q= (q1,. . . ,qk)with qj=nj

nfor j=1,. . . ,k. Instead of f,

we maximize

f(p)

n+H(q)=

∑

j=1

qjlnpj−

∑

j=1

qjlnqj

∑

j=1

qjln pj

= − KL(q∥p)

where H(·)denotes the entropy and KL(·∥·)denotes the Kullback-

Leibler divergence (see Chapter 4.4.2). Then, the lemma follows from

the non-negativity of the Kullback-Leibler divergence and the fact

that KL(q∥p)=0if and only if p=q.

Proposition 6.2.Let θ= (w1,. . . ,wk,θ1,. . . ,θk)be the parameter vector

of a mixture model. Then, for any incomplete observation X= (x1,. . . ,xn)

with the hidden indicator matrix Z= (zij), the complete-data likelihood

L(X,Z|θ)is maximized over the choice of θby setting

wj=1

∑

i=1

zij and θj=argmax

θ′

Pr(Yjθ′

for j=1,...,k, where Yjis a vector of all xiwith zij =1.

6.2 the sem algorithm for mixture distributions

Proof. We denote the new model computed by Algorithm 6.2in some

fixed round by θX,Z. Since

Pr(X,Z|θ)=Pr(Z|θ)Pr(X|Z,θ),

it is sufficient to show that θX,Zseparately maximizes Pr(Z|θ)as well

as Pr(X|Z,θ).

Recall that the i-th row of the indicator matrix (zij)is a binary unit

vector indicating which component is responsible for the point xi.

That is, zij ∈{0,1}with ∑k

j=1zij =1and

Pr(zij =1θ)=wj=

∑

j=1

zijwj.

We conclude

Pr(Z|θ)=

∏

i=1

∑

j=1

zijwj=

∏

j=1

wnj

Then,

lnPr(Z|θ)=

∑

j=1

njln(wj)

and applying Lemma 6.1yields that θX,Zmaximizes Pr(Z|θ).

It remains to show that θX,Zmaximizes Pr(X|Z,θ). However, due

to the independence of the observations,

lnPr(X|Z,θ)=

∑

i=1

lnPr(xi|Z,θ)

∑

i=1

∑

j=1

zij lnPr(xiθj)

∑

j=1

lnPr(Yjθj).

6.2.2limitations

Strictly speaking, for the maximization step to be well defined, we

have to assure that nj⩾ζfor j=1,. . . ,kfor some model dependent

threshold ζ∈N. That is, at least ζobservations have to be assigned

to each component distribution. Otherwise, it might not be possible

to determine the new parameters. This is obvious for nj=0at least,

since generally, without any data, there is no basis for parameter esti-

mation. For Gaussian mixture models, we discuss the threshold ζin

Section 6.3.

If nj< ζ, it might be a good idea to drop the under-determined

component and to replace it. This can be done by drawing new com-

ponent parameters from an a-priori distribution, for example.

the stochastic em algorithm

6.3the sem algorithm for gaussian mixtures

In this section we show that for mixtures of Gaussians, the EM algo-

rithm and the SEM algorithm compute almost the same parameter

updates. More precisely, we consider multivariate Gaussian mixtures

over the d-dimensional vector space Rd.

Recall from Chapter 4.3that for a mixture of k∈Nmultivariate

Gaussians over Rdthe parameters of each component consists of a

mean µ∈Rdand a covariance matrix Σ∈Rd×d. That is, the parame-

ter vector of the model takes the form

θ= (w1,...,wk,µ1,...,µk,Σ1,. . . ,Σk).

As usual, we denote the incomplete observation by X= (x1,. . . ,xn)

with xi∈Rdand choose the hidden values Zto be the indicator

matrix (zij)∈{0,1}n×k. Furthermore, based on the distribution q

derived in Step 1of Algorithm 6.2, we denote the probability that xi

was generated by the j-th component by

The expectation of

a binary random

variable is simply

the probability that

it becomes 1.

pij := Pr(zij =1X,θ)=E

Z∼q[zij]. (25)

Then, as discussed in Chapter 5.3, the classical EM algorithm in each

step deterministically computes the following updates for j=1,. . . ,k:

wEM

j:= 1

∑

i=1

pij,µEM

j:= ∑n

i=1pijxi

∑n

i=1pij

ΣEM

j:= ∑n

i=1pij(xi−µEM

j)(xi−µEM

j)T

∑n

i=1pij

. (26)

After sampling the indicator matrix (zij)∈{0,1}n×k, the Probabilis-

tic EM algorithm computes the following updates for j=1,...,k:

wSEM

j:= 1

∑

i=1

zij,µSEM

k:= ∑n

i=1zijxi

∑n

i=1zij

ΣSEM

j:= ∑n

i=1zij(xi−µSEM

j)(xi−µSEM

j)T

∑n

i=1zij

, (27)

where µSEM

jand ΣSEM

jare the maximum likelihood solution for

the parameter estimation problem of a single Gaussian distribution

with respect to the points assigned to the j-th component (cf. Chap-

ter 4.3.1). Note that by writing zij as

zij =Pr(zij =1Z)=Pr(zij =1X,θ,Z),

the SEM update equations can be interpreted as to arise from the

EM update equations by incorporating the additional knowledge of

which component each point was generated by. In contrast to the EM

algorithm, the parameters computed by the SEM algorithm are in fact

random variables with respect to the random experiment of drawing

the zij. This offers the alternate interpretation that the EM updates

arise from the SEM updates by replacing the indicator matrix with its

expectation (cf. Equation (25)).

6.3 the sem algorithm for gaussian mixtures

6.3.1limitations

As already observed in Section 6.2.2, strictly speaking, we have to

assure that |Yj|=njis large enough for each j=1,. . . ,k. In case

of multivariate Gaussians, we need that |Yj|⩾d+1at least. Other-

wise, the covariance matrix ΣSEM

jis not symmetric positive definite.

Assuming that the observations x1,. . . ,xn∈Rdare given in general

linear position, assigning at least d+1points to each Gaussian is

sufficient, though.

6.3.2proximity of the update equations

In this section, we provide the main results of this chapter. The theo-

rems presented below provide bounds for the differences between the

parameter updates as computed by the EM and SEM algorithm. As

one would expect, it can not be shown that the differences between

the mean updates are small if the corresponding weight updates dif-

fer considerably. The same holds for the differences between the co-

variance updates. Furthermore, it can not be shown that the differ-

ences between the covariance updates are small if the corresponding

mean updates differ considerably. A difference only in one coordi-

nate of the mean updates already leads to useless bounds for the

corresponding entries of the covariance matrix. Therefore, we first

provide the bound for the weight updates. Afterwards, we state the

bound for the mean updates which depends on the differences be-

tween the weight updates. Finally, we close with the bound for the

covariance updates. This bound depends on the differences of the

weight updates as well as on the differences between the mean up-

dates.

Before we state the theorems, we provide some preliminary defi-

nitions. For vectors v∈Rd, we denote the ℓ-th coordinate of vby

(v)ℓ. For matrices M∈Rd×d, we denote the (ℓ1,ℓ2)-th entry of Mby

(M)ℓ1ℓ2. Furthermore, we define the spread of the ℓ-th coordinate of

the input data set by

∆ℓ:= max

i(xi)ℓ−min

i(xi)ℓ. (28)

As in Chapter 5.2, we define the overall responsibility of the j-th Gaus-

sian component by

rj:=

∑

i=1

pij.

In the following, we develop the quantities that will be used to

measure the differences between the mean and covariance updates

computed by the EM and SEM algorithm. For j∈{1,...,k}, the differ-

the stochastic em algorithm

ence between the computed means for the j-th component Gaussian

µSEM

j−µEM

j=∑n

i=1zij(xi−µEM

∑n

i=1zij

We ignore the normalization factor ∑n

i=1zij in the denominator. Then,

each summand zij(xi−µEM

j)of the numerator is a random variable

that corresponds to the contribution of xito the difference between

µSEM

jand µEM

j. For ℓ∈{1,...,d}, the variance in the ℓ-th coordinate

of this contribution is

Var (zij(xi−µEM

j)ℓ)=pij(1−pij)(xi−µEM

j)2

ℓ.

Since the zij are independent random variables, the variance in the

ℓ-th coordinate of the overall contribution of all points is

Var(n

∑

i=1

zij(xi−µEM

j)ℓ)=

∑

i=1

pij(1−pij)(xi−µEM

j)2

ℓ.

In our analysis of the proximity of the SEM updates, we measure

the difference in the ℓ-th coordinate of the j-th mean in terms of the

corresponding standard deviation

τjℓ := v

∑

i=1

pij(1−pij)(xi−µEM

j)2

ℓ. (29)

Analogously, for the difference in the (ℓ1,ℓ2)-th entry of the j-th co-

variance, we use

ρjℓ1ℓ2:= v

∑

i=1

pij(1−pij)(Υij −ΣEM

j)2

ℓ1ℓ2

, (30)

where Υij = (xi−µEM

j)(xi−µEM

j)T.

To bound the difference between the updates of the EM algorithm

and the SEM algorithm, we state three separate theorems. The first

one bounds the difference of the weight updates for a single compo-

nent distribution. The theorem holds for Gaussian mixtures as well

as for general mixture models (cf. Section 6.2and Chapter 5.2).

Theorem 6.3(Proximity of weights).Let j∈{1,...,k},δ∈Rwith

2e−rj/3⩽δ⩽1, and λw=√3ln 2

/δ

rj. Then, with probability of at least 1−δ,

wSEM

j−wEM

j⩽λwwEM

The next theorem bounds the difference in a single coordinate of

the mean updates for a single component. One can not expect good

estimates for the means if the weights are not approximated well.

Thus, in the second theorem we assume that the corresponding weight

is already approximated with an accuracy of λw. The derived bound

for the computed mean depends on λwby a factor of 1

1−λw.

6.3 the sem algorithm for gaussian mixtures

Theorem 6.4(Proximity of means).Let j∈{1,...,k}and let Abe the

event that

wSEM

j−wEM

j⩽λwwEM

j, (31)

where 0 < λw< 1.

Then, for ℓ∈{1,. . . ,d}and 0<δ<1, the conditional probability of

(µSEM

j−µEM

j)ℓ⩽λµ

1−λw·τjℓ

given the event Ais at least 1−δfor

λµ=







√2e ln2

/δif τjℓ

∆ℓ⩾1

e√2e ln2

/δ

2∆ℓ

τjℓ ln2

/δotherwise.

The third theorem is the equivalent of Theorem 6.4for a single

entry of the covariance matrix. Similar to Theorem 6.4, the derived

bound depends on the weight accuracy λw. Additionally, we assume

that the ℓ-th coordinate of the corresponding mean is already approxi-

mated with an accuracy λℓ

µ. The bound splits into a sum of two terms.

The first term is the analogon of the bound for the mean from Theo-

rem 6.4. The second term is the error that arises from the accuracy of

the mean estimates.

Theorem 6.5(Proximity of covariances).Let j∈{1,...,k}and let Abe

the event that

wSEM

j−wEM

j⩽λwwEM

j(32)

where 0<λw< 1. Furthermore, let ℓ1,ℓ2∈{1,...,d}and let Bbe the event

that for ℓ=ℓ1,ℓ2,

(µSEM

j−µEM

j)ℓ⩽λℓ

1−λw·τjℓ

(33)

where λℓ

µ> 0.

Then, for 0<δ<1, the conditional probability of

(ΣSEM

j−ΣEM

j)ℓ1ℓ2⩽λΣ

1−λw·ρjℓ1ℓ2

+λℓ1

µλℓ2

(1−λw)2·τjℓ1τjℓ2

given the events Aand Bis at least 1−δfor

λΣ=







√2e ln2

/δif ρjℓ1ℓ2

∆ℓ1∆ℓ2

⩾1

e√2e ln2

/δ

2∆ℓ1∆ℓ2

ρjℓ1ℓ2ln2

/δotherwise.

To derive a result for the complete update, one has to combine

all three theorems using the union bound. Note that all three bounds

depend on the responsibilities rj. As one would expect due to the law

of large numbers, the accuracy of our bounds improves with growing

rj.

the stochastic em algorithm

6.3.3limitations

Note that, generally speaking, it is not possible to approximate a mix-

ture component with too small a weight. Thus, we cannot expect

to derive proximity bounds for the SEM updates that do not use as-

sumptions on the weights, which correspond to the responsibilities

rj=n·wEM

j. The bounds from Theorem 6.4and Theorem 6.5both

depend on 1

1−λw. That is, these bounds become arbitrarily large for

λwnear 1. Therefore, to get reasonable results, we need λwto be con-

siderably small. For instance, to ensure λw⩽1

2, the requirements for

λwin Theorem 6.3yield rj⩾12 ·ln(2

/δ). In other words, given a data

set of size n, all weights have to be at least 12·ln(2

/δ)

6.3.4on chernoff bounds

In the following, we state and prove several Chernoff type bounds.

Chernoff bounds are a standard tool from probability theory for the

derivation of concentration results. Readers who are familiar with

this technique may skip this section or at least the given proofs.

We start with an elementary Chernoff bound. A proof can be foundSee Corollary 4.6in

[Mitzenmacher and

Upfal, 2005]. in [Mitzenmacher and Upfal, 2005], for example.

Lemma 6.6.Let X1,. . . ,Xnbe independent random variables over {0,1}and

let Y=∑n

i=1Xi. Then, for each 0⩽λ⩽1,

Pr(|Y−E[Y]|⩾λ·E[Y]) ⩽2e−E[Y]λ2

As a corollary, we get the following lemma which bounds the devi-

ation for a given probability of occurrence. This bound will be used

for the difference between the weight updates later.

Lemma 6.7.Let X1,. . . ,Xnbe independent random variables in {0,1}and

let Y=∑n

i=1Xi. Then, for 2e−E[Y]

3⩽δ⩽1and λ=√3ln2

/δ

E[Y],

Pr(|Y−E[Y]|⩾λ·E[Y]) ⩽δ.

To bound the differences between the mean and between the co-

variance updates, we need a more elaborate Chernoff type bound.

Lemma 6.8.Let X1,...,Xnbe independent, discrete random variables withThis Lemma and its

proof are based on

[Levchenko, 2013]. E[Xi]=0and |Xi|⩽Cfor some constant C > 0 and i=1,...,n. Further-

more, let Y=∑N

n=1Xi. Then, for any λ⩾0,

Pr(|Y|⩾λ√Var(Y))⩽2e−λ2

2ea

where a⩾0such that λ=aea√Var(Y)

Analogously to the relation between Lemma 6.6and Lemma 6.7,

we use Lemma 6.8to derive a bound for a given probability of occur-

rence.

6.3 the sem algorithm for gaussian mixtures

Lemma 6.9.Let X1,. . . ,Xnbe independent, discrete random variables with

E[Xi]=0and |Xi|⩽Cfor some constant C > 0 and i=1,. . . ,n. Further-

more, let Y=∑N

n=1Xiand 0<δ<1. Then,

Pr(|Y|⩾λ√Var(Y))⩽δ

for

λ=









√2e ln2

/δif √Var(Y)

C⩾1

e√2e ln2

/δ

√Var(Y)ln2

/δotherwise.

In the remainder of this section, we prove Lemma 6.8and Lemma 6.9.

Proof of Lemma 6.8.First, note that we will naturally assume Var(Y) = For Var(Y) = 0,

there is no need

for Lemma 6.8.

∑n

i=1Var(Xi)> 0. Then, for any λ⩾0, there exists an a⩾0with

λ=aea√Var(Y)

C. Due to symmetry, we only prove

Pr(Y⩾λ√Var(Y))⩽e−λ2

2ea.

By Markov’s inequality, for any t > 0 we obtain

Pr(Y⩾λ√Var(Y))=Pr(etY ⩾etλ√Var(Y))⩽E[etY]

etλ√Var(Y). (34)

Let xi1,xi2,. . . be the possible outcomes of the discrete random vari-

able Xi. We denote the corresponding probabilities by

πij := Pr(Xi=xij).

Then, by the definition of the expectation and the series expansion of

the exponential function,

E[etXi]=∑

πijetxij

=∑

πij (1+txij +1

2!(txij)2+1

3!(txij)3+...).

Separating the first two summands of the series expansion yields

E[etXi]

=∑

πij +t∑

πijxij +∑

πij (1

2!(txij)2+1

3!(txij)3+. . .)

=1+∑

πij(txij)2(1

2!+1

3!txij +1

4!(txij)2+...),

where the second equality follows from t∑jπijxij =tE[Xi]=0. Since

(m+2)!⩽1

m!for m∈N,

E[etXi]⩽1+∑

πij(txij)21

2etxij .

the stochastic em algorithm

In the following, we assume that tis chosen such that

a⩾tC. (35)

Then, etxij ⩽eaand it follows that

E[etXi]⩽1+ea

2t2Var(Xi),

where we used

Var(Xi) = E[X2

i]−E[Xi]2=E[X2

i]=∑

πij(txij)2.

Since X1,. . . ,Xnare independent,

E[etY]=

∏

i=1

E[etXi]⩽

∏

i=1(1+ea

2t2Var(Xi)).

Using 1+α⩽eαfor α⩾0, it follows that

E[etY]⩽

∏

i=1

eea

2t2Var(Xi)=eea

2t2Var(Y). (36)

Finally, we fix t:= λ

ea√Var(Y). Then, tC =aand Inequality (35) is

satisfied as required. Substituting tin Inequality (36) yields

E[etY]⩽eλ2

2ea.

Furthermore,

etλ√Var(Y)=eλ2

ea.

Thus, E[etY]

etλ√Var(Y)⩽e−λ2

2ea

and Inequality (34) yields the claim.

Proof of Lemma 6.9.As in the proof of Lemma 6.8, we assume Var(Y)>

0. First, let

√Var(Y)

C⩾1

e√2e ln2

/δ

and

λ=√2e ln2

/δ. (37)

Then, λ⩽e√Var(Y)

C. Thus, we are able to choose 0⩽a⩽1, such that

λ=aea√Var(Y)

C. Applying Lemma 6.8,

Pr(|Y|⩾λ√Var(Y))⩽2exp (−λ2

2ea).

By substituting Equation (37),

Pr(|Y|⩾λ√Var(Y))⩽2exp (−2e ln2

/δ

2ea)=2(δ

2)e

⩽δ.

6.3 the sem algorithm for gaussian mixtures

It remains to consider the case where

√Var(Y)

C<1

e√2e ln2

/δ(38)

and

λ=2C

√Var(Y)ln2

/δ. (39)

Let a∈R, such that λ=aea√Var(Y)

C. Then,

aea=2C2

Var(Y)ln2

/δ. (40)

By Inequality (38), C

√Var(Y)>e

√2e ln2

/δand we deduce

aea>2e2

2e ln2

/δln2

/δ=e.

It follows that a > 1. Applying Lemma 6.8,

Pr(|Y|⩾λ√Var(Y))⩽2exp (−λ2

2ea).

By substituting Equation (39),

Pr(|Y|⩾λ√Var(Y))⩽2exp(−4C2

Var(Y)(ln2

/δ)21

2ea).

Using Equation (40),

Pr(|Y|⩾λ√Var(Y))⩽2exp(−2aealn2

/δ1

2ea)=2exp(−aln2

/δ).

Since a>1,

Pr(|Y|⩾λ√Var(Y))< 2 exp(−ln2

/δ)=2(δ

2)=δ.

6.3.5proof of the proximity bounds

Using Lemma 6.7, we are able to prove Theorem 6.3.

Proof of Theorem 6.3.Let W=∑n

i=1zij =n·wSEM

j. Then,

Z∼q[W]=

∑

i=1

pij =rj=n·wEM

Thus, applying Lemma 6.7yields

n·|wSEM

j−wEM

j|⩾n·λwEM

with probability at most δ.

the stochastic em algorithm

Using Lemma 6.9, we are able to prove Theorem 6.4and Theo-

rem 6.5.

Proof of Theorem 6.4.Let j∈{1,...,k},ℓ∈{1,...,d}and define the real

random variable

Mijℓ := (zij −pij)(xi−µEM

j)ℓ.

Since EZ∼q[zij]=pij, it follows that EZ∼q[Mijℓ]=0and

Var(Mijℓ) = pij(1−pij)(xi−µEM

j)2

ℓ.

Moreover, since each µEM

jis a convex combination of x1,. . . ,xn,

|Mijℓ|⩽|zij −pij|·(xi−µEM

j)ℓ⩽∆ℓ.

Note that the definition of µEM

jyields ∑n

i=1pij (xi−µEM

j)ℓ=0. Thus,

for the random variable Mjℓ := ∑n

i=1Mijℓ, we deduce

Mjℓ =

∑

i=1

zij (xi−µEM

j)ℓ.

Furthermore, EZ∼q[Mjℓ]=0and

Var(Mjℓ) =

∑

i=1

pij(1−pij)(xi−µEM

j)2

ℓ=τ2

jℓ.

Note that the standard deviation of Mjℓ is τjℓ, which was introduced

in Section 6.3.2to measure the difference between the mean updates.

Applying Lemma 6.9with C=∆ℓand λµas provided in the theorem

yields

Pr(

∑

i=1

zij (xi−µEM

j)ℓ

⩾λµτjℓ)⩽δ. (41)

The difference between the coordinates of the mean updates can be

written as

(µSEM

j−µEM

j)ℓ=

(∑n

i=1zijxi−∑n

i=1zijµEM

j)ℓ

∑n

i=1zij 

=∑n

i=1zij (xi−µEM

j)ℓ

n·wSEM

j

Then, by combining Inequality (41) and Inequality (31), we conclude

that the conditional probability of

(µSEM

j−µEM

j)ℓ⩽λµτjℓ

n(1−λw)wEM

=λµτjℓ

(1−λw)rj

given the event Ais at least 1−δ.

6.3 the sem algorithm for gaussian mixtures

Proof of Theorem 6.5.The covariance updates of the EM and SEM algo- cf. Equation (26)

and Equation (27)

rithm depend on the corresponding mean updates. Consequently, the

difference between the covariance updates depends on the difference

between the mean updates. To measure the impact of µSEM

j−µEM

on ΣSEM

j−ΣEM

j, let ^

ΣSEM

jbe the covariance that would be computed

by the SEM algorithm, if µSEM

j=µEM

j. That is,

ΣSEM

j:= ∑n

i=1zij(xi−µEM

j)(xi−µEM

j)T

∑n

i=1zij

By substituting xi−µEM

j=xi−µSEM

j+µSEM

j−µEM

j, we deduce

ΣSEM

j=∑n

i=1zij (xi−µSEM

j)(xi−µSEM

j)T

∑n

i=1zij

+∑n

i=1zij (xi−µSEM

j)(µSEM

j−µEM

j)T

∑n

i=1zij

+∑n

i=1zij (µSEM

j−µEM

j)(xi−µSEM

j)T

∑n

i=1zij

+∑n

i=1zij (µSEM

j−µEM

j)(µSEM

j−µEM

j)T

∑n

i=1zij

Note that the first summand is equal to ΣSEM

jand the last summand

is equal to (µSEM

j−µEM

j)(µSEM

j−µEM

j)T. Furthermore, both sum-

mands in the middle are zero, since

∑n

i=1zij (xi−µSEM

∑n

i=1zij

=∑n

i=1zijxi

∑n

i=1zij

−µSEM

j=0.

By denoting ν:= µSEM

j−µEM

j, we conclude

ΣSEM

j=ΣSEM

j+ννT.

Analogously to the proof of Theorem 6.4, we bound the difference

between the entries of ^

ΣSEM

jand ΣEM

j. To do this, we define the real

random variable

Sijℓ1ℓ2:= (zij −pij)(Υij −ΣEM

j)ℓ1ℓ2

where Υij := (xi−µEM

j)(xi−µEM

j)T. Since EZ∼q[zij]=pij, it follows

that EZ∼q[Sijℓ1ℓ2]=0and

Var(Sijℓ1ℓ2) = pij(1−pij)(Υij −ΣEM

j)2

ℓ1ℓ2.

Moreover, since each ΣEM

jis a convex combination of Υ1j,. . . ,Υnj,

|Sijℓ1ℓ2|⩽|zij −pij|·(Υij −ΣEM

j)ℓ1ℓ2⩽∆ℓ1∆ℓ2.

the stochastic em algorithm

Note that the definition of ΣEM

jyields ∑n

i=1pij (Υij −ΣEM

j)ℓ1ℓ2

=0.

Thus, for the random variable Sjℓ1ℓ2:= ∑n

i=1Sijℓ1ℓ2, we deduce

Sjℓ1ℓ2=

∑

i=1

zij (Υij −ΣEM

j)ℓ1ℓ2.

Furthermore, EZ∼q[Sjℓ1ℓ2]=0and

Var(Sjℓ1ℓ2) =

∑

i=1

pij(1−pij)(Υij −ΣEM

j)2

ℓ1ℓ2=ρ2

jℓ1ℓ2.

Note that the standard deviation of Sjℓ1ℓ2is ρjℓ1ℓ2, which was intro-

duced in Section 6.3.2to measure the difference between the covari-

ance updates. Applying Lemma 6.9with C=∆ℓ1∆ℓ2and λΣas pro-

vided in the theorem yields

Pr(

∑

i=1

zij (Υij −ΣEM

j)ℓ1ℓ2

⩾λΣρjℓ1ℓ2)⩽δ. (42)

The difference between the entries of the covariance updates can be

written as

(ΣSEM

j−ΣEM

j)ℓ1ℓ2=(^

ΣSEM

j−ννT−ΣEM

j)ℓ1ℓ2

=(∑n

i=1zijΥij

∑n

i=1zij

−ννT−∑n

i=1zijΣEM

∑n

i=1zij )ℓ1ℓ2

=

∑n

i=1zij (Υij −ΣEM

n·wSEM

−ννT

ℓ1ℓ2



Using the triangle inequality,

(ΣSEM

j−ΣEM

j)ℓ1ℓ2⩽

∑n

i=1zij (Υij −ΣEM

n·wSEM

j

ℓ1ℓ2



+(µSEM

j−µEM

j)ℓ1·(µSEM

j−µEM

j)ℓ2.

Then, by combining Inequality (42), Inequality (32) and Inequality (33),

we conclude that the conditional probability of

(ΣSEM

j−ΣEM

j)ℓ1ℓ2⩽λΣρjℓ1ℓ2

n(1−λw)wEM

+λℓ1

µλℓ2

(1−λw)2·τjℓ1τjℓ2

=λΣ

(1−λw)·ρjℓ1ℓ2

+λℓ1

µλℓ2

(1−λw)2·τjℓ1τjℓ2

given the events Aand Bis at least 1−δ.

6.3 the sem algorithm for gaussian mixtures

6.3.6running time analysis

When the EM and SEM algorithm are both applicable for a particular

statistical model, it is natural to ask why the SEM algorithm should

be favored. Besides its ability to escape from a stationary point of

the likelihood function as mentioned at the beginning of this chapter,

the SEM algorithm is particularly faster in general. In the following,

we explain where the speedup stems from. First, we motivate the

speedup for general mixture models. Afterwards, we show a constant

factor speedup for Gaussian mixture models.

Let X= (x1,...,xn)be the input data set and assume that it was

generated by a general mixture of kdistributions. The EM algorithm

maximizes the expected complete data log-likelihood. Usually, this

implies that each data point is involved in the update of all compo-

nents (see for example Equation 26). Therefore, the running time of

the EM algorithm depends on the term kn. In contrast, the SEM algo-

rithm assigns each point to exactly one component. Hence, each point

is involved in the update of a single component only (see for example

Equation 27). This observation suggests that the speedup of the SEM

algorithm might be up to a factor of k. However, for the sampling

step (Step 2a), the SEM algorithm has to compute the probability that

a certain point was generated by a certain component for each point

and each component distribution. That is, the running time of the

SEM algorithm depends on the term kn as well. Nevertheless, we

are able to show that regarding the parameter estimation problem of

Gaussian mixture models, there is a constant factor speedup. To do

this, we discuss the number of computations in a single iteration of

both algorithms.

For the sake of simplicity, we abstain from precisely charging all

computations. Instead, we consider only the multiplications in R.

Furthermore, we consider only the leading term of the number of per-

formed multiplications. That is, we focus on the computations that

entail the most multiplications in terms of the parameters n,dand k

for considerably large values of n,dand k. Not surprisingly, these

are the computations involving the covariance matrices. Furthermore,

we assume that n≫d.

Both algorithms start with computing the probabilities Pr(xiµj,Σj)cf. Equation (25)

for i=1,. . . ,nand j=1,...,k. The number of multiplications for

the computation of these probabilities is dominated by matrix-vector

multiplications between the d×dcovariance matrices Σjand the d-

dimensional vectors xi. Therefore, the leading term of the number of

multiplications for the common computations is knd2.

The leading term of the number of multiplications for the remain-

ing computations of the EM algorithm is dominated by the update

of the covariances. Each point is involved in the update of each com- cf. Equation (26)

ponent by the outer product of the point with itself. Therefore, the

the stochastic em algorithm

leading term of the number of remaining multiplications is knd2as

well. Thus, the overall number of multiplications of the EM algorithm

is dominated by 2knd2.

In case of the SEM algorithm, again the updates of the covariances

determine the leading term of the number of remaining multiplica-

tions. However, due to the hard assignment of the points, the leadingcf. Equation (27)

term does not depend on k. Instead, the number of multiplications

needed to update the covariance matrix is dominated by nd2. It fol-

lows that for n,kand dlarge enough, the number of multiplications

during the remaining SEM computations is negligible in comparison

to the number of multiplications during the common computations.

Thus, the overall running time of the SEM algorithm is dominated by

knd2.

This implies a factor 2speedup. However, depending on details

of the implementation, the actual dependency of the multiplications

during the common computations may better be assumed to be dom-

inated by c1knd2for some constant c1. Analogously, for the remain-

ing computations of the EM algorithm we assume a leading term of

the dependency of the multiplications of c2knd2for some constant

c2. This results in a speedup of a factor of 1+c2

c1. Indeed, our exper-

imental results confirm a constant factor speedup (see Chapter 7.4).

Moreover, the measured speedup tends a factor of 3which implies

c2> c1.

EXPERIMENTAL ANALYSIS

To underpin our proximity and running time analysis from Chapter 6,

we performed a series of experiments. In our experimental analysis,

we compare the computations of the classical EM algorithm and the

probabilistic SEM algorithm on a number of different data sets. The

data sets divide into several artificial and real world data sets. As in

the theoretical analysis, we consider the parameter estimation prob-

lem for Gaussian mixture models only.

7.1implementation

To get comparable results, we implemented the EM algorithm and

the SEM algorithm from scratch in C++. For the linear algebra com-

putations, we use the free C++ template library Eigen (available at

http://eigen.tuxfamily.org/). For the generation of artificial data

sets, to compute initial model parameters, and for the probabilistic

steps of the SEM algorithm, we need an efficient pseudo-random

number generator. Our implementation uses the Mersenne twister

[Matsumoto and Nishimura, 1998] specified as std::mt19937 in the

current C++ standard [ISO, 2012].

As discussed in Chapter 6.3.1, it is possible that the SEM algorithm

assigns too few points to a particular component to be able to com-

pute a new covariance. Although we disregard these cases in our

analysis, we have to consider them in our implementation. If no point

is assigned to a particular component, we solve the problem by sam-

pling a new mean uniformly from the input data set and initializing

the covariance matrix with a scaled identity matrix. More precisely,

set the covariance matrix to σ2Id, where we compute

σ2:= 1

2d min

i=j∥µi−µj∥2

as proposed in [Dasgupta and Schulman, 2007] for the initial param-

eter estimation. If not enough points are assigned to a particular

component, we try to mix the under-determined covariance with the

previous covariance matrix. If this fails to compute a symmetric pos-

itive definite matrix, we simply keep the old covariance matrix. Due

to numerical instability, the EM algorithm has to deal with similar

experimental analysis

problems if the responsibility rjof a component becomes too small.cf. Chapter 5.3

Therefore, we implemented a similar error handling procedure for

the EM algorithm. In our experiments, these problems hardly ever oc-

curred. However, it must be said that the error handling may lead to

substantially different solutions of the EM and SEM algorithm. Our

experimental results presented in Section 7.4are not affected by the

error handling procedures.

7.2data sets

For our experiments, we used artificial as well as real world data sets.

For the generation of the artificial data sets, we considered different

combinations of the dimension d∈Nand the number of component

distributions k∈N. For each combination, we probabilistically com-

puted several parameter vectors θ= (w1,. . . ,wk,µ1,. . . ,µk,Σ1,. . . ,Σk)

for Gaussian mixtures with kcomponents over Rd.cf. Chapter 4.3.2

For the determination of the weights w1,...,wkwe drew kreal

numbers uniformly at random from the interval [0,1]. To compute

balanced as well as unbalanced weights, these numbers were after-

wards exponentiated with the same integer constant chosen between

0and 3. Finally, the resulting numbers were normalized to retrieve

the weights.

For the determination of the means µ1,...,µkwe created a Gaus-

sian meta distribution from which all means were drawn. The mean

of the meta distribution was set to the origin and the covariance ma-

trix of the meta distribution was created probabilistically itself. To

do this, we first created a matrix Mby drawing each entry from a

one-dimensional Gaussian distribution with mean 0and variance k

Afterwards, to retrieve the covariance matrix of the meta distribution,

we computed the product of the matrix Mwith its own transpose.

This generally results in a symmetric positive-definite covariance ma-

trix. The variance k

dof the one-dimensional Gaussian used to deter-

mine the entries of the matrix Mwas chosen to control the separation

of the created means µ1,. . . ,µk. The separation increases with in-

creasing k(more components need more space) and it decreases with

increasing d(more dimensions allow less separation in each of them).

Proceeding this way, we ensured that the mixtures mainly consist of

interfusing Gaussians. To get reasonable results it is important that

the points generated by different components of a Gaussian mixture

are not pairwise well separated. Otherwise, the task of learning the

parameters is too easy. Furthermore, the difference in the computa-The distance-based

clustering approach

discussed in Part I,

is a suitable method,

for example.

tions of the EM and SEM algorithm would become too small to be

verified in our experiments.

Finally, to determine the covariance matrices Σ1,. . . ,Σk, we pro-

ceeded as for the covariance matrix of the meta distribution used for

the computation of the means. The only difference is that we use a

100

7.2 data sets

Figure 21: Projection of one of the artificial data sets with n=

1 000 000. The ellipsoids depict the standard deviation ar-

eas of each component distribution. The larger the weight

of the component, the brighter is the shade of the corre-

sponding ellipsoid.

one-dimensional Gaussian with variance 1to draw the entries of the

matrix M.

For each θcomputed as above, we drew several point sets of differ-

ent size from the corresponding Gaussian mixture. Our experiments

for the different combinations of dand kled to essentially similar re-

sults. Therefore, in the following we discuss artificial data sets with

d=k=10 only. To confirm our proximity results from Chapter 6, we

created several data sets of size n=1 000 000 (cf. Figure 21). To in-

vestigate the behavior for small data sets, we also created data sets of

size n=10 000. In the remainder of this section, we consider two par-

ticular data sets that led to characteristic results, one for n=1 000 000

which we denote by Art1Mand one for n=10 000 which we denote

by Art10K. For both artificial data sets we computed Gaussian mix-

tures with the proper number of component distributions, i.e., with

k=10.

As real world data we use three publicly available data sets shown

in Figure 7.2. The first one is the Forest Covertype data set which is part

of the UCI Machine Learning Library [Asuncion and Newman, 2007]

and contains 581 012 data points. The data set contains cartographic

information about four wilderness areas located in the Roosevelt Na-

tional Forest. Each data point consists of cartographic features of

an area of 900 square meters in size. To get data suitable for Gaus-

sian mixture parameter estimation, we used only the quantitative at-

101

experimental analysis

tributes located at the first 10 coordinates (see Figure 22a). That is,

we ignored the labels describing which tree species dominates the

respective area and 44 qualitative binary attributes. For the Forest

Covertype data set we computed Gaussian mixtures with k=10 com-

ponent distributions.

The second data set is based on the Amsterdam Library of Object

Images (ALOI) [Geusebroek et al., 2005]. This database consists of

110 250 images of 1 000 small objects, taken under various conditions.

We use a 27-dimensional feature vector set that is based on color

histograms in HSV color space (see Figure 22b). The features were

extracted from the database as described in [Kriegel et al., 2011b] but

using 2bins for hue, saturation and brightness each. The data set

is provided by the ELKI project [Achtert et al., 2012] and contains

110 250 data points. For the ALOI data set we computed Gaussian

mixtures with k=3component distributions.

The third data set is created from data provided by the GeoNames

geographical database (available at http://www.geonames.org/). The

original data set contained several features of 135 082 cities around

the world with a population of at least 1000. We extracted the geo-

graphic coordinates (latitude and longitude) for each city to compute

a projection onto the 3-dimensional unit sphere. The resulting data

set is a 3-dimensional model of the populated regions of the globe

(see Figure 22c). For the GeoNames data set we computed Gaussian

mixtures with k=20 component distributions.

The first two real world data sets were normalized before use. That

is, for each coordinate of the input data points the values were trans-

lated and scaled to fit into the interval [0,1]. Thus, the spread in each

dimension is ∆=1(cf. Equation (28)). Without normalization, the dif-

ference between the means and covariances computed by the EM and

SEM algorithm might be dominated by a single or only few dimen-

sions of the input space. By introducing the normalization step, we

ensured that the parameter estimation problem does not effectively

reduce to a lower dimensional problem.

7.3experiments

For the comparison of the EM and SEM algorithm, we established

four different types of tests. In the first type of tests we compared the

negative log-likelihood of the parameters computed by the EM and

SEM algorithm. To do this, we ran both algorithms for 50 rounds and

computed the negative log-likelihood of the computed parameters

after each round.

The goal of the second type of tests was to compare the parameters

computed by the two algorithms. Again, we ran both algorithms for

50 rounds. After each round, we computed the differences between

the computed parameters. More precisely, for each component distri-

102

7.3 experiments

(a) Forest Covertype data set

with d=10 and n=110 250

(b) ALOI features data set

with d=27 and n=110 250

Figure 22: Projections of the real world data sets.

bution we computed the absolute distance between the weights, the

Euclidean distance between the means and the Frobenius norm of the

difference between the covariance matrices.

The aim of the third type of tests was to evaluate our theoreti-

cal bounds on the proximity of the EM and SEM computations (cf.

Chapter 6.3.2). For the sake of simplicity, we compared only the

means computed by the two algorithms. To to this, in each round

of the SEM algorithm we also computed the mean updates the EM

algorithm would compute if it was given the same previous model

parameters. Then, we determined the actual Euclidean distance be-

tween each pair of means. Furthermore, we computed our theoretical

bounds for each component distribution and each coordinate of the

103

experimental analysis

corresponding mean. To get a bound that holds with high probability,

we chose

δ:= 1

100 ·k(d+1).

Applying Theorem 6.3for all kweight updates and Theorem 6.4for

all dcoordinates in all kmean updates yields k(d+1)bounds that

hold with probability 1−δ. Using the union bound, the resulting

bound for the Euclidean distance between the means holds with prob-

ability of at least 1−1

100 .

Finally, in the fourth type of tests, we compared the running times

of the EM and SEM algorithm. Our analysis in Chapter 6.3.6pre-

dicts a constant factor speedup in case of Gaussian mixture models.

Furthermore, we motivated the overall speedup based on a factor k

speedup for a part of the computations of the parameter updates. To

investigate the dependency of the speedup on the number of com-

ponent distributions, we started both algorithm with the GeoNames

data set for several values of kbetween 1and 100. To reduce the

effects of single delayed runs and in case of the SEM algorithm to

consider the probabilistic behavior, we averaged the running times of

both algorithms over ten runs.

More precisely, we started both algorithms for each value of kwith

three different initial solutions. For each initial solution, we executed

both algorithms three times.

The EM algorithm, as well as the SEM algorithm, compute their

solution based on an initial parameter estimate. How to compute a

good initial set of model parameters is not a subject of this thesis.

Since we study the solutions of the SEM algorithm in relation to the

solutions of the classical EM algorithm, we only have to assure that

both algorithms are started with the same initial solutions. For our

comparison of the EM and SEM algorithm, the quality of the initial

solution is not critical. Therefore, we implemented a very simple ini-

tialization method that we used for all our experiments. To compute

initial parameters θ= (w1,. . . ,wk,µ1,...,µk,Σ1,...,Σk), we executed

the following steps:

1. Draw kpoints c1,. . . ,ckfrom the input data set X⊂Rduni-

formly at random.

2. Compute a partition C1,...,Ckof Xinto ksubsets by assign-

ing each x∈Xto the nearest ciin terms of Euclidean distance

(breaking ties arbitrarily).

3. For i=1,. . . ,k:

a) Set the i-th weight to wi:= |Ci|

|X|.

b) Set µi,Σito the maximum likelihood estimate of a single

Gaussian for the data set Ci(see Chapter 4.3.1).

104

7.4 results

For each data set, we created 10 sets of initial model parameters.

Since the SEM algorithm is probabilistic, for each initial model we

performed 100 different runs. To ensure that the results of our sec-

ond and third type of test are not biased by an unlikely computation,

we averaged the results of the different runs for each initial solution.

Recall that the EM algorithm is deterministic. Therefore, we need to

execute the EM algorithm only once.

7.4results

In this section, we discuss the results of our four types of tests. The

focus of the following presentation lies on the second and third type

of test since these tests correspond to our analysis in Chapter 6. As

mentioned above, we discuss the results for five particular data sets,

two artificial data sets and three real world data sets. During our

experiments we observed that the results for different initial solutions

did not differ substantially. Therefore, we depict only results for some

selected initial solutions that led to typical behavior of the algorithms.

Some characteristic results of our first type of tests are shown in

Figure 23. The negative log-likelihood of the solutions computed by

the EM algorithm is plotted as a dashed black line. Note that for

each round of the EM algorithm we only have a single value to plot.

For the SEM algorithm, we have to plot the results of 100 different

runs of the algorithm. Therefore, we depict the results of the SEM

algorithm similar to box plots. The dark gray line marks the median,

the gray ribbon ranges from the lower to the upper quartile, and the

light gray ribbon ranges from the minimum to the maximum of the

negative log-likelihood of the computed solutions. For the majority

of our experiments for the artificial as well as the real world data

sets, the log-likelihood of the solutions of the EM and SEM algorithm

almost coincide. To make the differences visible at all in Figure 23, we

had to omit the first three rounds of the algorithms. For the Art1M

data set, most of the initial solutions still lead to results as depicted

in Figure 23a. However, for some initial solutions we observe small

differences as depicted in Figure 23c. Furthermore, individual initial

solutions lead to considerable differences as depicted in Figure 23e.

Generally speaking, for small values of none can not expect that

the SEM algorithm yields parameter estimates close to the EM com-

putations. This is due to the law of large numbers, i.e., the influence

of a single sampling step of the SEM algorithm is larger for smaller

n. Indeed, for Art10K data set we observe differences between the

log-likelihood of the computed solutions more often. Some examples

are depicted in Figure 23b to Figure 23f. Note that the SEM results

may lie above the EM results as well as below them and also both in

the same run.

105

experimental analysis

Some typical results of our second type of tests are shown in Fig-

ure 24 to Figure 28. For the interpretation of the results, we will re-

late the measured differences between the parameters computed by

the EM and SEM algorithm to a simple upper bound derived from

the dimension dand the largest spread ∆of the corresponding data

set. That is, for a given data set, we assume that the spread in eachSee Chapter 6.3.2,

Equation (28).dimension is at most ∆. Then, the Euclidean distance between the

computed means is at most

Γµ:= ∆√d.

The Frobenius norm of the difference between the computed covari-

ance matrices is at most

ΓΣ:= d∆2.

Furthermore, recall that the weights lie in the interval [0,1]. That is,

the difference between the computed weights is at most 1. In the

following, we compute the fraction of Γµand ΓΣwhich the computed

means and covariance matrices differ by, respectively. The differences

of the weights are always compared to 1.

For the artificial data sets, we have d=10 and ∆≈40. It follows

that Γµ≈130 and ΓΣ≈16 000. Regarding the Art1M data set (see

Figure 24), we observe that the parameter vectors computed by the

EM and SEM algorithm are very similar. More precisely, the weights

differ by less than 0.002 and the means differ by less than 0.002 ·Γµ.

The covariances even differ by less than 0.0002 ·ΓΣ. Moreover, we ob-

serve that after the first few rounds, the differences in the parameters

of many components are even substantially smaller than those of the

remaining components.

As mentioned in the discussion of our first type of tests, for small

values of none can not expect good parameter estimates. Indeed, for

the Art10K data set (see Figure 25), the differences between the means

and covariances are larger by a factor of 20. However, our first type

of tests show that this does not necessarily result in lower likelihood

of the computed parameters.

Recall that the two real world data sets corresponding to Figure 26

and Figure 27 are normalized. That is, ∆=1.

For the Forest Covertype data set (see Figure 26), we have d=10.

It follows that Γµ≈3.2and ΓΣ=10. Then, the weights differ by less

than 0.004, the means differ by less than 0.004 ·Γµ, and the covariances

differ by less than 0.0002 ·ΓΣ. That is, the results are similar to the

results for the Art1M data set. One explanation for these good results

is that the Forest Covertype data set is the largest of the real world

data sets with a size of n=581 012.

For the ALOI data set (see Figure 27), we have d=27. It follows

that Γµ≈5.2and ΓΣ=27. For the weight updates, at least in the first

couple of rounds, we get a difference of up to 0.2. The means differ by

up to 0.009 ·Γµand the covariances differ by up to 0.0006 ·ΓΣ. These

106

7.4 results

results are slightly worse than the results for the Forest Covertype

data set. However, this can be explained by the smaller size of the

ALOI data set with n=110 250 only. Furthermore, except for the

difference of some weight updates during the first few rounds, the

differences are still very small.

For the GeoNames data set (see Figure 28), we have d=3and

∆=2. It follows that Γµ≈3.5and ΓΣ=12. For the weights, we get a

difference of less than 0.003. The means and the covariances differ by

a fraction of less than 0.01 ·Γµand 0.0015 ·ΓΣ, respectively. However,

for all but one component, the differences between the means and

between the covariances are lower by a factor of 3at least.

All three real world data sets have in common that after approxi-

mately 40 rounds, the differences in all parameters remain on a very

small level. Furthermore, we point out that all results confirm that

both algorithms compute almost the same parameter updates.

For our third type of tests, we depict a selection of results in Fig-

ure 29 to Figure 33. First, we observe that the actual difference mea-

sured in our experiments is significantly smaller than our high prob-

ability bound. Furthermore, the similar development of our bound

and the actual difference indicates the accuracy of our analysis. Note

that mostly, the largest differences exist during the first couple of

rounds. This matches our observations from the second type of tests.

Just as for our second type of test, we relate the bound for the differ-

ence of the weight updates to the largest possible distance determined

by the dimension and the spread of the data set. As mentioned in the

description of our tests in Section 7.3, the presented bounds hold with

probability 1−1

100 . For the Art1M data set (cf. Figure 29), the results

show that the bound is less than 0.013 ·Γµ. Again, due to the smaller

size of the data set, the results for the Art10K data set are worse (cf.

Figure 30). Here, we get bounds up to 0.12 ·Γµ. In Chapter 6.3.3, we

already discussed the limited applicability of our bounds for small

responsibilities rj. Since smaller values of nresult in smaller values

of rj=n·wEM

j, the larger bounds observed in Figure 30 are not sur-

prising.

For the real world data sets, we observed a larger variety of devel-

opments of our high probability bound. This may arise from the fact

that the real world data sets are less suitable for being modeled by

Gaussian mixtures. To avoid tedious repetitions, we prefer to present

a diverse choice of results rather than a characteristic selection. Nev-

ertheless, all presented bounds for all real world data sets are at most

0.1·Γµ.

Finally, the results of our fourth type of tests are depicted in Fig-

ure 34 and Figure 35. Figure 34 shows that the running times of both

algorithms linearly depend on the number of components k. The

slope of the curve for the SEM algorithm is considerably smaller as

predicted by our analysis in Chapter 6.3.6. Furthermore, the results

107

experimental analysis

show that for k=1the EM algorithm is faster than the SEM algo-

rithm. This is not surprising since for k=1both algorithms compute

the same updates, while the SEM processes the additional sampling

step. To approximate the factor of the speedup, we did further tests

for larger values of k. The results are depicted in Figure 35 and sug-

gest a factor of up to 3at least.

Note that although the theoretical and experimental running time

analyses seem to match, we have to treat the results with care. On

the one hand, our theoretical analysis simplifies the situation by only

counting the multiplications. On the other hand, our implementa-

tion may be subject to several optimizations emerging from the linear

algebra library, the compiler and the floating point hardware. Fur-

thermore, these optimizations of the implementation possibly differ

in their effectiveness for the two algorithms. Nevertheless, the mea-

sured speedup is real and thus, the SEM algorithm should be taken

into account when it comes to the parameter estimation problem for

Gaussian mixture models.

108

7.4 results

(a) Art1M data set (b) Art10K data set

(e) Art1M data set (f) Art10K data set

Figure 23: Negative log-likelihood of parameters computed by the

EM and SEM algorithm for the selected artificial data sets

and different initial solutions. Figure 23a, 23c, and 23e all

depict results for the Art1M data set, but for different ini-

tial solutions. The same holds for Figure 23b, 23d, and 23f

with the Art10K data set.

109

experimental analysis

(a) Difference wEM

k−wSEM

kbetween the weights.

(b) Difference µEM

k−µSEM

kbetween the means.

k−ΣSEM

kbetween the covariances.

Figure 24: Comparison of intermediate solutions of the EM and SEM

algorithm given the Art1Mdata set with Γµ≈130 and ΓΣ≈

16 000 for k=10. All three figures depict the results for the

same initial solution. Each figure consists of ten graphs,

each depicting the difference for one component Gaussian

averaged over 100 SEM runs.

110

7.4 results

(a) Difference wEM

k−wSEM

kbetween the weights.

(b) Difference µEM

k−µSEM

kbetween the means.

k−ΣSEM

kbetween the covariances.

Figure 25: Comparison of intermediate solutions of the EM and SEM

algorithm given the Art10Kdata set with Γµ≈130 and

ΓΣ≈16 000 for k=10. All three figures depict the results

for the same initial solution. Each figure consists of ten

graphs, each depicting the difference for one component

Gaussian averaged over 100 SEM runs.

111

experimental analysis

(a) Difference wEM

k−wSEM

kbetween the weights.

(b) Difference µEM

k−µSEM

kbetween the means.

k−ΣSEM

kbetween the covariances.

Figure 26: Comparison of intermediate solutions of the EM and SEM

algorithm given the normalized Forest Covertype data set

with Γµ≈3.2and ΓΣ=10 for k=10. All three figures

depict the results for the same initial solution. Each figure

consists of ten graphs, each depicting the difference for one

component Gaussian averaged over 100 SEM runs.

112

7.4 results

(a) Difference wEM

k−wSEM

kbetween the weights.

(b) Difference µEM

k−µSEM

kbetween the means.

k−ΣSEM

kbetween the covariances.

Figure 27: Comparison of intermediate solutions of the EM and SEM

algorithm given the normalized ALOI data set with Γµ≈

5.2and ΓΣ=27 for k=3. All three figures depict the results

for the same initial solution. Each figure consists of three

graphs, each depicting the difference for one component

Gaussian averaged over 100 SEM runs.

113

experimental analysis

(a) Difference wEM

k−wSEM

kbetween the weights.

(b) Difference µEM

k−µSEM

kbetween the means.

k−ΣSEM

kbetween the covariances.

Figure 28: Comparison of intermediate solutions of the EM and SEM

algorithm given the GeoNames data set with Γµ≈3.5and

ΓΣ=12 for k=20. All three figures depict the results for

the same initial solution. Each figure consists of twenty

graphs, each depicting the difference for one component

Gaussian averaged over 100 SEM runs.

114

7.4 results

(a) This is one of the most frequent developments of the bound.

(b) Sometimes, the bound increases, albeit on a very low level.

Figure 29: Experimental difference and the theoretical bound on the

difference between means computed by the EM and SEM

algorithm given the Art1Mdata set with Γµ≈130 for k=

10. All three figures depict the results averaged over 100

SEM runs for the same initial solution but for the means of

different component Gaussians.

115

experimental analysis

(a) Similar to Figure 29a, but on a higher level.

(b) Although the bound is worsening, the actual differences remain small.

Figure 30: Experimental difference and the theoretical bound on the

difference between means computed by the EM and SEM

algorithm given the Art10Kdata set with Γµ≈130 for k=

10. All three figures depict the results averaged over 100

SEM runs for the same initial solution but for the means of

different component Gaussians.

116

7.4 results

Figure 31: Experimental difference and the theoretical bound on the

difference between means computed by the EM and SEM

algorithm given the normalized Forest Covertype data set

with Γµ≈3.2for k=10. All three figures depict the results

averaged over 100 SEM runs for the same initial solution

but for the means of different component Gaussians.

117

experimental analysis

Figure 32: Experimental difference and the theoretical bound on the

difference between means computed by the EM and SEM

algorithm given the normalized ALOI data set with Γµ≈

5.2for k=3. All three figures depict the results averaged

over 100 SEM runs for the same initial solution but for the

means of different component Gaussians.

118

7.4 results

Figure 33: Experimental difference and the theoretical bound on the

difference between means computed by the EM and SEM

algorithm given the GeoNames data set with Γµ≈3.5for k=

20. All three figures depict the results averaged over 100

SEM runs for the same initial solution but for the means of

different component Gaussians.

119

experimental analysis

.....

number of components

seconds

. ..

EM algorithm

. ..

SEM algorithm

Figure 34: Average running times of the EM and SEM algorithm

started on the GeoNames data set for different numbers

of component Gaussians.

.....

100

number of components

factor of speedup

Figure 35: Average speedup of the SEM algorithm started on the

GeoNames data set for different numbers of component

Gaussians.

120

LIST OF FIGURES

Figure 1Different measurements for the same cluster 15

Figure 2The volume argument 19

Figure 3drad(A∪B)<costdrad(Cm)+∥pA−pB∥22

Figure 4Intermediate centers 26

Figure 5Covering centers and clusters 27

Figure 6Merging (Z,optk)-connected clusters 31

Figure 7Congruent configurations 37

Figure 8Merging clusters intersecting a ball of radius r39

Figure 9The two base cases of the induction 43

Figure 10 The inductive step 44

Figure 11 Partition of the optimal solution 44

Figure 12 Sketch of the set W(x)47

Figure 13 Part of the dendrogram for W(xi)47

Figure 14 Part of the dendrogram for X48

Figure 15 The Old Faithful data set 57

Figure 16 Density of a two-dimensional Gaussian 62

Figure 17 Points with constant Gaussian density 63

Figure 18 Density of a Gaussian mixture distribution 66

Figure 19 Decomposition of lnpθ(X)over the choice of q74

Figure 20 Optimization of lnpθ(X)over the choice of θ76

Figure 21 Projection of one of the artificial data sets 101

Figure 22 Projections of the real world data sets 103

Figure 23 Negative log-likelihood for artificial data sets 109

Figure 24 Differences for the Art1M data set 110

Figure 25 Differences for Art10K data set 111

Figure 26 Differences for the Forest Covertype data set 112

Figure 27 Differences for the ALOI data set 113

Figure 28 Differences for the GeoNames data set 114

Figure 29 Theoretical bound for the large artificial data set 115

Figure 30 Theoretical bound for the small artificial data set 116

Figure 31 Theoretical bound for the Forest Covertype data set 117

Figure 32 Theoretical bound for the ALOI data set 118

Figure 33 Theoretical bound for the GeoNames data set 119

Figure 34 Average running times for EM and SEM 120

Figure 35 Average speedup of the SEM algorithm 120

121

LIST OF ALGORITHMS

3.1AgglomerativeDiscreteRadius(X)............. 20

3.2AgglomerativeRadius(X)................... 24

3.3AgglomerativeCompleteLinkage(X)........... 34

5.1Classical EM algorithm . . . . . . . . . . . . . . . . . . . . . 72

6.1Probabilistic EM algorithm . . . . . . . . . . . . . . . . . . 82

6.2SEM algorithm for mixtures . . . . . . . . . . . . . . . . . . 83

123

BIBLIOGRAPHY

[Achtert et al., 2012] Achtert, E., Goldhofer, S., Kriegel, H.-P., Schu-

bert, E., and Zimek, A. (2012). Evaluation of clusterings – metrics

and visual support. Data Engineering, International Conference on,

0:1285–1288.

[Ackermann, 2009] Ackermann, M. R. (2009). Algorithms for the Breg-

man k-Median Problem. PhD thesis, University of Paderborn.

[Ackermann et al., 2011] Ackermann, M. R., Blömer, J., Kuntze, D.,

and Sohler, C. (2011). Analysis of Agglomerative Clustering. In

Proceedings of the 28th International Symposium on Theoretical Aspects

of Computer Science (STACS ’11), pages 308–319. Dagstuhl Publish-

ing.

[Ackermann et al., 2012] Ackermann, M. R., Blömer, J., Kuntze, D.,

and Sohler, C. (2012). Analysis of Agglomerative Clustering. Algo-

rithmica.

[Amunts et al., 2013] Amunts, K., Lepage, C., Borgeat, L., Mohlberg,

H., Dickscheid, T., Rousseau, M.-E., Bludau, S., Bazin, P.-L., Lewis,

L. B., Oros-Peusquens, A.-M., Shah, N. J., Lippert, T., Zilles, K., and

Evans, A. C. (2013). BigBrain: An Ultrahigh-Resolution 3D Human

Brain Model. Science,340(6139):1472–1475.

[Asuncion and Newman, 2007] Asuncion, A. and Newman, D. J.

(2007). UCI machine learning repository. Available at:

http://www.ics.uci.edu/∼mlearn/MLRepository.html.

[Azevedo et al., 2009] Azevedo, F. A., Carvalho, L. R., Grinberg, L. T.,

Farfel, J. M., Ferretti, R. E., Leite, R. E., Filho, W. J., Lent, R., and

Herculano-Houzel, S. (2009). Equal numbers of neuronal and non-

neuronal cells make the human brain an isometrically scaled-up

primate brain. The Journal of Comparative Neurology,513(5):532–541.

[B¯

adoiu et al., 2002] B¯

adoiu, M., Har-Peled, S., and Indyk, P. (2002).

Approximate clustering via core-sets. In Proceedings of the thiry-

fourth annual ACM symposium on Theory of computing, STOC ’02,

pages 250–257, New York, NY, USA. ACM.

[Baum et al., 1970] Baum, L. E., Petrie, T., Soules, G., and Weiss, N.

(1970). A Maximization Technique Occurring in the Statistical Anal-

ysis of Probabilistic Functions of Markov Chains. The Annals of

Mathematical Statistics,41(1):164–171.

125

Bibliography

[Biernacki et al., 2003] Biernacki, C., Celeux, G., and Govaert, G.

(2003). Choosing starting values for the EM algorithm for getting

the highest likelihood in multivariate Gaussian mixture models.

Computational Statistics & Data Analysis,41(3-4):561–575.

[Bilmes, 1997] Bilmes, J. (1997). A Gentle Tutorial of the EM Algo-

rithm and its Application to Parameter Estimation for Gaussian

Mixture and Hidden Markov Models. Technical Report TR-97-021,

ICSI.

[Bishop, 2006] Bishop, C. M. (2006). Pattern Recognition and Machine

Learning (Information Science and Statistics). Springer-Verlag New

York, Inc., Secaucus, NJ, USA.

[Blömer et al., 2013] Blömer, J., Bujna, K., and Kuntze, D. (2013). A

Theoretical and Experimental Comparison of the EM and SEM

Algorithm. CoRR: Computing Research Repository, abs/1310.5034.

arXiv:1310.5034 [cs.LG].

[Brewer et al., 2009] Brewer, G. J., Boehler, M. D., Pearson, R. A., De-

Maris, A. A., Ide, A. N., and Wheeler, B. C. (2009). Neuron network

activity scales exponentially with synapse density. Journal of Neural

Engineering,6(1):014001.

[Broder et al., 1997] Broder, A. Z., Glassman, S. C., Manasse, M. S.,

and Zweig, G. (1997). Syntactic clustering of the web. Computer

Networks and ISDN Systems,29(8–13):1157–1166. Papers from the

Sixth International World Wide Web Conference.

[Celeux et al., 1995] Celeux, G., Chauveau, D., and Diebolt, J. (1995).

On stochastic versions of the em algorithm. Technical Report 2514,

Institut national de recherche en informatique et en automatique

(INRIA).

[Celeux and Diebolt, 1985] Celeux, G. and Diebolt, J. (1985). The

SEM algorithm: a probabilistic teacher algorithm derived from the

EM algorithm for the mixture problem. Computational Statistics

Quarterly,2:73–82.

[Celeux and Diebolt, 1992] Celeux, G. and Diebolt, J. (1992). A

stochastic approximation type EM algorithm for the mixture prob-

lem. Stochastics and Stochastic Reports,41(1-2):119–134.

[Charikar et al., 1997] Charikar, M., Chekuri, C., Feder, T., and Mot-

wani, R. (1997). Incremental clustering and dynamic information

retrieval. In Proceedings of the twenty-ninth annual ACM symposium

on Theory of computing, STOC ’97, pages 626–635, New York, NY,

USA. ACM.

[Cover and Thomas, 2012] Cover, T. M. and Thomas, J. A. (2012). El-

ements of information theory. John Wiley & Sons.

126

Bibliography

[Dasgupta and Long, 2005] Dasgupta, S. and Long, P. M. (2005). Per-

formance guarantees for hierarchical clustering. Journal of Computer

and System Sciences,70(4):555–569. Special Issue on COLT 2002.

[Dasgupta and Schulman, 2007] Dasgupta, S. and Schulman, L.

(2007). A Probabilistic Analysis of EM for Mixtures of Separated,

Spherical Gaussians. Journal of Machine Learning Research,8:203–

226.

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B.

(1977). Maximum likelihood from incomplete data via the EM al-

gorithm. Journal of the Royal Statistical Society, Series B: Statistical

Methodology,39(1):1–38.

[Dias and Wedel, 2004] Dias, J. G. and Wedel, M. (2004). An empir-

ical comparison of EM, SEM and MCMC performance for prob-

lematic Gaussian mixture likelihoods. Statistics and Computing,

14(4):323–332.

[Eisen et al., 1998] Eisen, M. B., Spellman, P. T., Brown, P. O., and

Botstein, D. (1998). Cluster analysis and display of genome-wide

expression patterns. Proceedings of the National Academy of Sciences,

95(25):14863–14868.

[Feder and Greene, 1988] Feder, T. and Greene, D. (1988). Optimal

algorithms for approximate clustering. In Proceedings of the twenti-

eth annual ACM symposium on Theory of computing, STOC ’88, pages

434–444, New York, NY, USA. ACM.

[Florek et al., 1951] Florek, K., Lukaszewicz, J., Perkal, J., Steinhaus,

H., and Zubrzycki, S. (1951). Sur la liaison et la division des points

d’un ensemble fini. Colloquium Mathematicae,2:282–285.

[Fortunato, 2010] Fortunato, S. (2010). Community detection in

graphs. Physics Reports,486(3-5):75 –174.

[Fréchet, 1910] Fréchet, M. (1910). Les dimensions d’un ensemble

abstrait. Mathematische Annalen,68(2):145–168.

[Geusebroek et al., 2005] Geusebroek, J. M., Burghouts, G. J., and

Smeulders, A. W. M. (2005). The amsterdam library of object im-

ages. International Journal of Computer Vision,61(1):103–112.

[Golub and Van Loan, 2012] Golub, G. and Van Loan, C. (2012). Ma-

trix Computations. Matrix Computations. Johns Hopkins University

Press.

[Gonzalez, 1985] Gonzalez, T. F. (1985). Clustering to minimize

the maximum intercluster distance. Theoretical Computer Science,

38(0):293–306.

127

Bibliography

[Grimmett and Stirzaker, 2001] Grimmett, G. and Stirzaker, D. (2001).

Probability and Random Processes. Probability and Random Pro-

cesses. Clarendon Press.

[Hawkins, 2004] Hawkins, D. M. (2004). The problem of overfitting.

Journal of chemical information and computer sciences,44(1):1–12.

[Ip, 1994] Ip, E. H.-s. (1994). A stochastic EM estimator in the presence of

missing data – theory and applications. PhD thesis, Stanford Univer-

sity.

[ISO, 2012] ISO (2012). ISO/IEC 14882:2011 Information technology —

Programming languages — C++. International Organization for Stan-

dardization, Geneva, Switzerland.

[Johnson and Joram, 1984] Johnson, W. B. and Joram, L. (1984). Ex-

tensions of Lipschitz mappings into a Hilbert space. In Conference

in Modern Analysis and Probability, volume 26 of Contemporary Math-

ematics, pages 189–206. American Mathematical Society.

[Koch et al., 2006] Koch, K., McLean, J., Segev, R., Freed, M. A., Berry,

M. J., Balasubramanian, V., and Sterling, P. (2006). How much the

eye tells the brain. Current biology : CB,16(14):1428–1434.

[Kriegel et al., 2011a] Kriegel, H.-P., Kröger, P., Sander, J., and Zimek,

A. (2011a). Density-based clustering. Wiley Interdisciplinary Reviews:

Data Mining and Knowledge Discovery,1(3):231–240.

[Kriegel et al., 2011b] Kriegel, H.-P., Schubert, E., and Zimek, A.

(2011b). Evaluation of multiple clustering solutions. In Proceedings

of the 2nd MultiClust Workshop (in conjunction with ECML PKDD).

[Leder, 2013] Leder, L. (2013). Nichtapproximierbarkeitsresultate zu

Radius- und Durchmesserclustering unter Verwendung von Lp-

Metriken. Bachelor’s thesis. To appear.

[Lee et al., 2008] Lee, K., Kim, J., Kwon, K. H., Han, Y., and Kim, S.

(2008). DDoS attack detection method using cluster analysis. Expert

Systems with Applications,34(3):1659–1665.

[Levchenko, 2013] Levchenko, K. (2013). Chernoff bound. Available

at: http://www.cs.ucsd.edu/∼klevchen/techniques/chernoff.pdf.

[Matsumoto and Nishimura, 1998] Matsumoto, M. and Nishimura,

T. (1998). Mersenne twister: a 623-dimensionally equidistributed

uniform pseudo-random number generator. ACM Transactions on

Modeling and Computer Simulation,8(1):3–30.

[McLachlan and Krishnan, 2008] McLachlan, G. J. and Krishnan, T.

(2008). The EM Algorithm and Extensions (Wiley Series in Probability

and Statistics). Wiley-Interscience, 2edition.

128

Bibliography

[McQuitty, 1957] McQuitty, L. L. (1957). Elementary Linkage Analy-

sis for Isolating Orthogonal and Oblique Types and Typal Relevan-

cies. Educational and Psychological Measurement,17:207–209.

[Meyerhenke and Staudt, 2013] Meyerhenke, H. and Staudt, C. L.

(2013). Engineering High-Performance Community Detection

Heuristics for Massive Graphs. Accepted for full paper presen-

tation at 42nd International Conference on Parallel Processing

(ICPP).

[Mitzenmacher and Upfal, 2005] Mitzenmacher, M. and Upfal, E.

(2005). Probability and Computing: Randomized Algorithms and Proba-

bilistic Analysis. Cambridge University Press, New York, NY, USA.

[Montgomery and Madison, 2004] Montgomery, J. M. and Madison,

D. V. (2004). Discrete synaptic states define a major mechanism of

synapse plasticity. Trends in Neurosciences,27(12):744 –750.

[Naszódi, 2010] Naszódi, M. (2010). Covering a set with homothets

of a convex body. Positivity,14(1):69–74.

[Pearson, 1894] Pearson, K. (1894). Contributions to the Mathemati-

cal Theory of Evolution. Philosophical Transactions of the Royal Society

of London. A,185:71–110.

[Pereira et al., 1993] Pereira, F., Tishby, N., and Lee, L. (1993). Dis-

tributional clustering of english words. In Proceedings of the 31st

annual meeting on Association for Computational Linguistics, ACL ’93,

pages 183–190, Stroudsburg, PA, USA. Association for Computa-

tional Linguistics.

[Schaeffer, 2007] Schaeffer, S. E. (2007). Graph clustering. Computer

Science Review,1(1):27–64.

[Sneath and Sokal, 1973] Sneath, P. H. A. and Sokal, R. R. (1973). Nu-

merical taxonomy: the principles and practice of numerical classification.

W. H. Freeman.

[Sundberg, 1974] Sundberg, R. (1974). Maximum Likelihood Theory

for Incomplete Data from an Exponential Family. Scandinavian Jour-

nal of Statistics,1(2):49–58.

[Tan et al., 2005] Tan, P.-N., Steinbach, M., and Kumar, V. (2005). In-

troduction to Data Mining, (First Edition). Addison-Wesley Longman

Publishing Co., Inc., Boston, MA, USA.

[Turing, 1950] Turing, A. M. (1950). Computing Machinery and Intel-

ligence. Mind, LIX:433–460.

[von Neumann, 1958] von Neumann, J. (1958). The computer and the

brain. Mrs. Hepsa Ely Silliman memorial lectures. Yale University

Press.

129

Bibliography

[Walker, 2012] Walker, R., editor (2012). The Human Brain Project – A

Report to the European Commission. The HBP-PS Consortium, Lau-

sanne.

[Webster, 1994] Webster, R. (1994). Convexity. Oxford Science Publi-

cations. Oxford University Press, USA.

[Wei and Tanner, 1990] Wei, G. C. G. and Tanner, M. A. (1990). A

Monte Carlo Implementation of the EM Algorithm and the Poor

Man’s Data Augmentation Algorithms. Journal of the American Sta-

tistical Association,85(411):699–704.

[Wu, 1983] Wu, C. F. J. (1983). On the Convergence Properties of the

EM Algorithm. The Annals of Statistics,11(1):95–103.

130