
One-class Classification in the presence of

Point, Collective, and Contextual Anomalies

submitted by Dipl.-Ing. Nico Görnitz, born in Berlin

to Faculty IV (Electrical Engineering and Computer Science)

of the Technische Universität Berlin

in fulfillment of the requirements for the academic degree of

Doktor der Naturwissenschaften

-Dr. rer. nat.-

approved dissertation

Doctoral committee:

Chair: Prof. Dr. Benjamin Blankertz

Reviewer: Prof. Dr. Klaus-Robert Müller

Reviewer: Prof. Dr. Manfred Opper

Reviewer: Prof. Dr. Marius Micha Kloft

Date of the scientific defense: 9 November 2018

Berlin 2019


To my family.

Acknowledgements

This work was carried out during the years 2010-2017 in the Computational Biology Group at the Friedrich Miescher Laboratory of the Max Planck Society in Tübingen, Germany, the eScience Group of Microsoft Research in Los Angeles, US, and, foremost, the Machine Learning Group at the Berlin Institute of Technology (TU Berlin), Germany.

These past years were a period of personal growth with a constant exchange of ideas and the love of learning in an environment that encouraged creativity and insight. I am grateful for the opportunity to travel and to meet smart and interesting people along this path. I’d like to express my deepest gratitude to my advisors Klaus-Robert Müller and Marius Kloft, who made it possible for me to carry out this work and who contributed to this thesis with their time, ideas and energy.

If it weren’t for Marius Kloft, who attracted me to the world of research with his love for teaching, wit and passion for machine learning research, I might not have pursued this path. Besides him, it was the mentorship of Shinichi Nakajima, Ulf Brefeld, Sören Sonnenburg, Gunnar Rätsch, and Konrad Rieck that helped form my scientific ideas, which eventually became my Ph.D. thesis. I’d like to address very special thanks to Klaus-Robert Müller who, in addition, was an avid supporter throughout the whole time.

This work would not have been possible without the fruitful collaboration of my dear colleagues. I took particular pleasure in working with Luiz Alberto Lima, Marina Vidovic, Alexander Bauer, and Seunghak Lee. Furthermore, I would like to thank Jonas Behr, Alexander Binder, Andreas Ziehe, Christian Widmer, Irene Dowding, Regina Bohnert, Georg Zeller, Vipin Sreedharan, Jamal Nasir, Philipp Drewe, Gregoire Montavon, Christoph Lippert, Anne Porbadnigk, Mikio Braun, André Kahles, Géraldine Jean, and Bettina Mieth.

I especially thank my spouse Christina and my son Max who suﬀered the most from

my obsession with unﬁnished chapters and last-minute changes to the manuscript, for their

patience and support. I would like to thank my parents and my siblings, Mandy and Linda,

for supporting me in every conceivable way throughout the years.

Finally, I acknowledge ﬁnancial support of the German Bundesministerium für Bildung

und Forschung (BMBF), under the consecutive projects ALICE I and II (01IB10003B and

01IB15001B respectively). I also acknowledge the support by the German Research Foun-

dation (DFG) through the grant DFG MU 987/6-1 and RA 1894/1-1.


Abstract

Anomaly detection has a prominent position in the processing pipeline of any real-world

data-driven application. Its central goal is to detect and separate valid data points from

malicious—anomalous—ones such that the cleaned data set can be processed further. In many

applications, anomalies are even the prime objects of interest and need to be exposed early

in order to avoid loss, e.g. in credit card fraud detection.

One-class classiﬁcation is a machine learning concept that is especially suited for the

anomaly detection problem. Intrinsically unsupervised, it aims at providing a concise de-

scription of a given data set such that data points generated by a diﬀerent process can be

detected accurately. Prominent machine learning models for one-class classiﬁcation are one-

class support vector machines and the closely related support vector data descriptions.

The contribution of this thesis is the extension of those methods to cope with diﬀerent

scenarios of anomalies:

Point Anomalies: Assuming that anomalies are scarce and occur independently of each other, methods for controlling the sparsity of the found solutions in terms of single independent features and groups of features are derived.

Collective Anomalies: In this scenario, anomalies are assumed to appear as groups of measurements instead of single entries. Techniques from structured output learning are (i) extended to cope with large-scale problems and (ii) employed to derive an unsupervised anomaly detector for groups of measurements that exhibit a latent dependency structure.

Contextual Anomalies: Anomalies appear only in specific contexts, and data is supposed to carry two signals that contain behavioral and contextual information. Contributions in this scenario consider latent class dependencies and are threefold: (i) the derivation of a method capable of detecting latent class contextual anomalies, (ii) theoretical insights that reveal k-means as a special case, and (iii) a method for learning with latent class dependencies when an additional structure is imposed on the latent variables.

The proposed methods are empirically analyzed on a variety of diﬀerent applications

ranging from gene ﬁnding to porosity estimation to brain computer interfaces showing promis-

ing performance when compared to baseline methods.


Summary

Anomaly detection holds an important position in the processing pipeline of every real-world data-driven application. Its central task is to recognize valid data and to separate them from invalid, anomalous data so that the cleaned data set can be processed further. In many applications, the anomalies are even the most interesting objects and should be detected as early as possible in order to prevent potential losses, for example in the prevention of credit card fraud.

One-class classification is a machine learning concept that is particularly suited to detecting anomalies. It comprises intrinsically unsupervised learning methods that aim to provide an accurate description of a given data set so that data points generated by a different process can be detected accurately. The most important representatives of this family are the one-class SVM and the closely related SVDD.

This thesis contributes extensions of these methods so that they can cope with the following general anomaly scenarios:

Point anomalies: We assume that anomalies are rare and occur independently of each other. We develop methods that control the sparsity, that is, the number of zeros in the solution, based on single features or groups of features.

Collective anomalies: In this scenario, anomalies are assumed to occur in groups of measurements rather than as isolated single measurements. We (i) extend techniques from structured learning to cope with large data sets and (ii) apply them to develop an unsupervised anomaly detector for groups of measurements that exhibit a latent dependency structure.

Contextual anomalies: Anomalies appear only in certain contexts, and the data are assumed to consist of behavioral and contextual information. Contributions in this scenario are restricted to latent class structure and are threefold: (i) a method for detecting anomalies with latent class structure is presented, (ii) theoretical insights showing that k-means is a special case are presented, and (iii) a method is developed that can handle latent class structure when this structure in turn has its own dependency structure.

The presented methods are analyzed empirically. The applications range from gene finding to brain-computer interfaces to porosity detection. In comparison with standard methods, the presented methods show promising results.

Contents

1 Introduction

1.1 A Roadmap through this Thesis

1.2 Own Contributions and Publications

I Background

2 Foundations of Machine Learning

2.1 Kernel Methods

2.2 Optimization

3 Anomaly Detection and One-class Classification

3.1 Introduction

3.2 Categorization

3.3 One-class Classification

3.4 Model Selection and Evaluation

3.5 Summary and Discussion

II Point Anomalies

4 Sparsity-inducing Regularization

4.1 Preliminaries

4.2 Sparsity-inducing One-class SVM

4.3 Inducing Group-sparsity

4.4 Applications

4.5 Summary and Discussion

III Collective Anomalies

5 Learning with Structured Data

5.1 Preliminaries

5.2 Large-scale Structured Output Learning

5.3 Latent Structure Anomaly Detection

5.4 Evaluation and Applications

5.5 Summary and Discussion

IV Contextual Anomalies

6 Learning with Latent Class Dependencies

6.1 Preliminaries

6.2 A Joint Feature Map Formulation

6.3 Direct Formulation includes k-means as Special Case

6.4 Extension to Non-independent Samples

6.5 Evaluation and Applications

6.6 Summary and Discussion

7 Conclusions

A Learning with Structured Data

B Learning with Latent Class Dependencies

Chapter 1

Introduction

‘I provide a service that is unique in this

world,’ said Dirk. ‘The term “holistic”

refers to my conviction that what we are

concerned with here is the fundamental

interconnectedness of all –’

Douglas Adams (Dirk Gently’s Holistic

Detective Agency)

With the abundance of data nowadays, automated tools that can handle the sheer amount of available and incoming data are a necessity. Over the past decade, machine learning concepts have become invaluable not only for researchers but also for practitioners in industry who tackle complex, data-driven problems. Broadly speaking, the goal of machine learning is to learn unknown concepts from data, typically with the help of provided label information.

The goal of anomaly detection, however, is to separate valid data points from malicious, anomalous ones. It has a prominent position in the processing pipeline of any real-world data-driven application. Unfortunately, whether data should be tagged as anomalous depends very much on the application at hand and cannot be generalized easily. Moreover, because anomalies are scarce and novel, label information is generally rare, incomplete, or missing altogether. Nevertheless, finding anomalies is of vital interest in many applications, as they often translate directly into actionable items, that is, situations that need an immediate response, such as engine failures. A common approach to detecting anomalies is to describe the normal behavior of a system and to find deviations from it [1–3].

One-class classiﬁcation [4,5] is a machine learning concept that is especially suited for

anomaly detection. Intrinsically unsupervised, it aims at providing a tight boundary—a con-

cise description—of a given data set such that data points generated by a diﬀerent process can

be detected accurately. Two of the most prominent machine learning models for one-class

classiﬁcation are one-class support vector machines and the closely related support vector

data descriptions [6–9]. These methods have been successfully applied to a large number of

problems including network intrusion detection [10–12], hyperspectral imagery [13], surface

modeling [14], and neurosciences [15,16].

Despite their success, most machine learning methods treat data points and hence, anoma-

lies, as independent events without taking into account the dependency structure even if

it is known. Analyzing data with dependency structure is a challenging eﬀort in machine

learning and signal processing that has many important applications; for example, in mobile

communication [17], earthquake prediction [18], geosciences data analysis [19,20], traﬃc

ﬂow modeling [21], and bioinformatics [22].

In this thesis, we study three different classes of anomalies, ranging from independent events to interconnected entities, and develop methods for various settings based on the one-class classification principle:


Point Anomalies: Assuming that anomalies are scarce and occur independently of each other, methods for controlling the sparsity of the found solutions in terms of single independent features and groups of features are derived.

Collective Anomalies: In this scenario, anomalies are assumed to appear as groups of measurements instead of single entries. Techniques from structured output learning are (i) extended to cope with large-scale problems and (ii) employed to derive an unsupervised anomaly detector for groups of measurements that exhibit a latent dependency structure.

Contextual Anomalies: Anomalies appear only in specific contexts, and data is supposed to carry two signals that contain behavioral and contextual information. Contributions in this scenario consider latent class dependencies and are threefold: (i) a method capable of detecting latent class contextual anomalies, (ii) theoretical insights that reveal k-means as a special case, and (iii) a method for learning with latent class dependencies when an additional structure is imposed on the latent variables.

This dissertation derives and discusses anomaly detectors based on the one-class classiﬁca-

tion paradigm for settings involving point, collective, and contextual anomalies.

1.1 A Roadmap through this Thesis

At a high level, this thesis is divided into four distinct parts. The compulsory Background part presents basic machine learning concepts from kernel machines and optimization as well as an overview of anomaly detection and one-class classification. The following three parts (Point Anomalies, Collective Anomalies, and Contextual Anomalies) contain the original contributions of this thesis.

Chapter 2: Foundations of Machine Learning. This chapter reviews fundamental concepts from machine learning that are necessary for the understanding of this thesis. Specifically, definitions and theorems for kernel methods and basic (convex) optimization concepts are introduced.

Chapter 3: Anomaly Detection and One-class Classification. This chapter gives a comprehensive introduction to anomaly detection and one-class classification. The history and definition of anomaly detection as well as trends and a categorization of anomaly detection techniques are presented. Evaluation strategies and corresponding measurements are discussed.

Chapter 4: Sparsity-inducing Regularization. The focus in this chapter is on feature regularization. Specifically, we describe techniques for controlling the sparsity of singleton features as well as groups of features. The proposed algorithms are applied to BCI-EEG data and to authorship attribution. No special requirements are imposed on the nature of anomalies.

Chapter 5: Learning with Structured Data In this chapter, anomalies are supposed to

appear in groups of measurements rather than single entries. This challenging problem is

tackled employing structured output learning concepts. We present a corresponding large-

scale optimization scheme for structured output learning and an unsupervised one-class clas-

siﬁer tailored to this scenario. Challenging applications on computational biology problems

are presented.


Chapter 6: Learning with Latent Class Dependencies This chapter assumes that anoma-

lies appear only in speciﬁc contexts. We develop methods that are capable of contextual

anomaly detection when the context comes as latent class dependencies and reveal impor-

tant special cases. Further, extensions to scenarios when contextual variables exhibit known

structural dependencies are proposed.

Chapter 7: Conclusions. This chapter concludes this thesis with a brief discussion and outlook.

1.2 Own Contributions and Publications

I had the pleasure of working closely with very skilled scientists who excel in their respective fields. Many times they had a specific problem for an application in mind, which enabled me to tailor methods specifically to their needs. Generally speaking, the idea, derivation, optimization, and implementation of the methods are my contributions. This also holds for the empirical evaluations on artificially generated data. In all of the presented work, I have been lead author or joint first author (with the exception of Nasir, Görnitz, and Brefeld). In the following, I detail the contributions.

Part II. The empirical results on authorship attribution, as well as their description, were produced by Jamal Nasir. The experiments on BCI-EEG data, including the analysis and the discussion of results, are based on the work of Anne Porbadnigk.

Görnitz, N., Kloft, M., Rieck, K., Brefeld, U., “Toward Supervised Anomaly Detection”,

Journal of Artiﬁcial Intelligence Research (JAIR), vol. 46, pp. 235–262, 2013

Porbadnigk, A., Görnitz, N., Kloft, M., Müller, K.-R., “Decoding Brain States during

Auditory Perception by Supervising Unsupervised Learning.”, Journal of Computing

Science and Engineering (JCSE), vol. 7, no. 2, pp. 112–121, 2013

Nasir, J. A., Görnitz, N., Brefeld, U., “An Oﬀ-the-shelf Approach to Authorship At-

tribution”, in International Conference on Computational Linguistics (COLING), 2014,

pp. 895–904

Part III. The preparation of the data set and the consulting on biological research questions were done by Georg Zeller. Marius Kloft derived the generalization bounds and the dual formulation presented in Section 5.3. All experiments were carried out by myself.

Görnitz, N., Braun, M., Kloft, M., “Hidden Markov Anomaly Detection”, in Interna-

tional Conference on Machine Learning (ICML), 2015, pp. 1833–1842

Görnitz, N., Widmer, C., Zeller, G., Kahles, A., Sonnenburg, S., Rätsch, G., “Hierarchi-

cal Multitask Structured Output Learning for Large-scale Sequence Segmentation”, in

Advances in Neural Information Processing Systems (NIPS), 2011, pp. 2690–2698

Zeller, G., Görnitz, N., Kahles, A., Behr, J., Mudrakarta, P., Sonnenburg, S., Rätsch, G.,

“mTim: rapid and accurate transcript reconstruction from RNA-Seq data”, ArXiv, 2013

Part IV. Marius Kloft derived the generalization bounds presented in Section 6.2. The application to simulated and real porosity prediction was done by Luiz A. Lima. The experiments on BCI data, including the analysis and the discussion of results, are based on the work of Anne Porbadnigk. The figures and the application of the proposed method to the data were done by myself.


Görnitz, N., Porbadnigk, A. K., Kloft, M., Binder, A., Sannelli, C., Braun, M., Müller,

K.-R., “When brain and behavior disagree: A novel ML approach for handling system-

atic label noise in EEG data”, in Machine Learning and Interpretation in Neuroimaging

Workshop (MLINI), 2013

Görnitz, N., Porbadnigk, A. K., Binder, A., Sannelli, C., Braun, M., Müller, K.-R., Kloft, M., “Learning and Evaluation in Presence of Non-i.i.d. Label Noise”, in International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 33, 2014, pp. 293–302

Porbadnigk, A. K., Görnitz, N., Sannelli, C., Binder, A., Braun, M., Kloft, M., Müller, K.-

R., “When Brain and Behavior Disagree: Tackling systematic label noise in EEG data

with Machine Learning”, in IEEE International Winter Workshop on Brain-Computer

Interface (BCI), 2014

Porbadnigk, A. K., Görnitz, N., Sannelli, C., Binder, A., Braun, M., Kloft, M., Müller,

K.-R., “Extracting latent brain states — Towards true labels in cognitive neuroscience

experiments”, NeuroImage, vol. 120, pp. 225–253, 2015

Görnitz, N., Lima, L. A., Varella, L. E., Müller, K.-R., Nakajima, S., “Transductive Regression for Data with Latent Dependency Structure”, IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2017

Görnitz, N., Lima, L. A., Müller, K.-R., Kloft, M., Nakajima, S., “Support vector data descriptions and k-means clustering: one class?”, IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2017

Lima, L. A., Görnitz, N., Varella, L. E., Vellasco, M., Müller, K.-R., Nakajima, S., “Poros-

ity Estimation by Semi-supervised Learning with Sparsely Available Labeled Samples”,

Computers & Geosciences, vol. 106, pp. 33–48, 2017

I Background

Chapter 2

Foundations of Machine Learning

2.1 Kernel Methods

2.2 Optimization

This chapter introduces the machine learning concepts needed for understanding the methods developed in this thesis. The overall focus is on classic concepts; some of the currently more prominent techniques (e.g. deep learning) are left out, even though they are likely to play an important role for one-class classification and anomaly detection in the near future. These techniques are discussed only marginally in this thesis. We start with kernel methods, the kernel trick in particular, and continue with the basics of (non-)convex optimization theory.

2.1 Kernel Methods

We start the discussion on kernel methods with a simple example: consider a linear model $f(x) = \langle w, \phi(x) \rangle$ with a possibly very high-dimensional parameter vector $w \in \mathcal{F}$ and a feature map $\phi: \mathcal{X} \to \mathcal{F}$ that assigns to each data point $x \in \mathcal{X}$ a feature vector of corresponding dimension. Here, instead of accessing our data points directly, we add a little flexibility by allowing an arbitrary transformation $\phi$ which maps the data points from the input space $\mathcal{X}$ to some feature space $\mathcal{F}$ (cf. Figure 2.1) and which, hopefully, makes the problem more accessible. Further, assume that we are given a sample of $n$ i.i.d. data points $x_i \in \mathcal{X}$, $i = 1, \ldots, n$, with corresponding labels $y_i \in \mathbb{R}$, which we use to fit the parameter vector such that it attains the least squared error on the sample,

\[
w^* = \operatorname*{argmin}_{w \in \mathcal{F}} \sum_{i=1}^{n} \ell(w, x_i, y_i)
\quad \text{with} \quad
\ell(w, x, y) := \frac{1}{2} \left( y - \langle w, \phi(x) \rangle \right)^2 .
\]

For the sake of simplicity, we employ stochastic gradient descent as an optimization technique. At time step $t$, we pick a data point $x_t$ and the corresponding label $y_t$ from our training sample and update the parameter vector according to the following formula (with $w_0 = 0$):

\[
w_{t+1} = w_t - \eta \frac{\partial \ell(w_t, x_t, y_t)}{\partial w}
= w_t + \eta \left( y_t - \langle w_t, \phi(x_t) \rangle \right) \phi(x_t)
= w_t + \alpha_t \phi(x_t) .
\]

Since our initial objective is convex, gradient descent will (with a carefully chosen step size $\eta$) find a local and hence global minimum. Rather surprisingly, it attains the optimal value without leaving the span of the data. Therefore, the optimal parameter vector $w^*$ can be expressed as a weighted sum of feature vectors,

\begin{equation}
w^* = \sum_{t=1}^{T} \alpha_t \phi(x_t), \tag{2.1}
\end{equation}

where $T$ denotes the number of update steps.

Figure 2.1 – The feature mapping $\phi$ maps the data points (red and blue dots) from the input space $\mathcal{X}$ (left) to some feature space $\mathcal{F}$ (right). The goal is to simplify the problem for subsequent analysis, e.g. classification.

Given this expansion, we can also rewrite the inner product between parameter vector and feature vector as

\[
\langle w, \phi(x) \rangle = \sum_{t} \alpha_t \langle \phi(x_t), \phi(x) \rangle = \sum_{t} \alpha_t k(x_t, x),
\]

where $k(x, x') = \langle \phi(x), \phi(x') \rangle$ is called a kernel. This gives an alternative view on the above optimization problem, where the key is to find weightings of similarities, as encoded by inner products, between data points. Moreover, we no longer need to know the inner workings of the feature map $\phi$. To summarize this little example, we can (i) express the optimal solution $w^*$ as a weighted sum of feature vectors and (ii) access the inner product in terms of similarities between data points. The former property is known as the representer theorem and the latter gives rise to the kernel trick [33–36]. In fact, these properties form the basis of many successful machine learning models such as support vector machines (SVMs) [37,38].
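The example above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the Gaussian kernel, the step size `eta`, the toy data, and the names `rbf_kernel` and `kernel_sgd` are assumptions made for this sketch. Instead of storing $w$, the code stores one coefficient per training point (updates from repeated visits to the same point are merged), so that predictions only require kernel evaluations.

```python
import math
import random

def rbf_kernel(x, z, sigma=0.5):
    """Gaussian kernel on real numbers (an illustrative choice, k(x, x) = 1)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))

def kernel_sgd(xs, ys, kernel, eta=0.5, epochs=100, seed=0):
    """SGD on the squared loss without ever forming w explicitly.

    Invariant: w = sum_i coef[i] * phi(xs[i]), starting from w_0 = 0, hence
    <w, phi(x)> = sum_i coef[i] * k(xs[i], x).
    """
    rng = random.Random(seed)
    coef = [0.0] * len(xs)

    def predict(x):
        return sum(c * kernel(xi, x) for c, xi in zip(coef, xs))

    for _ in range(epochs):
        # visit the training points in a fresh random order each epoch
        for t in rng.sample(range(len(xs)), len(xs)):
            # alpha_t = eta * (y_t - <w_t, phi(x_t)>)
            alpha_t = eta * (ys[t] - predict(xs[t]))
            coef[t] += alpha_t  # w_{t+1} = w_t + alpha_t * phi(x_t)
    return coef, predict

# Toy data (hypothetical): noiseless samples of a sine curve.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [math.sin(x) for x in xs]
coef, predict = kernel_sgd(xs, ys, rbf_kernel)
max_residual = max(abs(predict(x) - y) for x, y in zip(xs, ys))
```

On this toy sample the kernel matrix is well conditioned, so the residuals on the training points shrink quickly; for other kernels or data, the step size would have to be adapted.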

Let us now discuss these findings in more detail. Any discussion about kernel methods, and the kernel trick in particular, needs to be split into three parts:

i the feature map $\phi$, which maps a data point from its input space $\mathcal{X}$ into some higher-dimensional feature space $\mathcal{F}$;

ii the kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which encodes similarities between feature vectors of corresponding data points, $k(x, x') = \langle \phi(x), \phi(x') \rangle$;

iii the reproducing kernel Hilbert space (RKHS) $\mathcal{F}$, a space of functions endowed with a norm.

A very general definition of a kernel is given in Mohri, Rostamizadeh, and Talwalkar [39].

Definition 1 (Kernels [39]). A function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is called a kernel over $\mathcal{X}$.

Although this definition allows arbitrary functions to be encoded, ensuring that a decomposition into feature vectors $k(x, x') = \langle \phi(x), \phi(x') \rangle$ exists requires $k$ to satisfy the following condition:


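Theorem 1 (Mercer’s condition [39]). Let $\mathcal{X} \subset \mathbb{R}^N$ be a compact set and let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a continuous and symmetric function. Then, $k$ admits a uniformly convergent expansion of the form

\[
k(x, x') = \sum_{i=0}^{\infty} a_i \phi_i(x) \phi_i(x'),
\]

with $a_i > 0$ iff for any square integrable function $c$ ($c \in L_2(\mathcal{X})$), the following condition holds:

\[
\int\!\!\int_{\mathcal{X} \times \mathcal{X}} c(x) c(x') k(x, x') \, dx \, dx' \geq 0 .
\]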

We present another, slightly more general and approachable definition (cf. discussion in [39]) which ensures the existence of the decomposition.

Definition 2 (Positive definite symmetric kernels [39,40]). A kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is said to be positive definite symmetric (PDS) if for any $\{x_1, \ldots, x_n\} \subseteq \mathcal{X}$, the matrix $K = [k(x_i, x_j)]_{ij} \in \mathbb{R}^{n \times n}$ is symmetric positive semidefinite (SPSD).

The matrix $K$ is called the kernel matrix or the Gram matrix associated with $k$ and the sample $\{x_1, \ldots, x_n\}$. Hence, we have the following relation between feature maps and kernels: for a specific choice of PDS kernel $k$, the feature map $\phi$ is fixed (up to rotation). For a specific choice of feature map $\phi$, the corresponding kernel $k$ is fixed. Although many possibilities exist for defining a proper kernel $k$, the following kernels appear frequently:

Linear kernel $k(x, x') := \langle x, x' \rangle$;

Polynomial kernel $k(x, x') := (\langle x, x' \rangle + c)^d$ with $c > 0$ and $d \in \mathbb{N}$;

Radial basis function (RBF) kernel $k(x, x') := \exp\left( -\frac{1}{2\sigma^2} \|x - x'\|^2 \right)$ with $\sigma > 0$.
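The three kernels and the Gram matrix of Definition 2 can be written down directly. The following sketch is only illustrative: the function names, the default parameters, and the sample points are assumptions, and the RBF kernel is taken in its common Gaussian form with the squared Euclidean distance. The helper `looks_psd` evaluates random quadratic forms $c^\top K c$, which is a necessary-condition spot check and not a proof of positive semidefiniteness.

```python
import math
import random

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, c=1.0, d=2):
    return (linear_kernel(x, z) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(points, kernel):
    """Kernel (Gram) matrix K with K[i][j] = k(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]

def quadratic_form(K, c):
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

def looks_psd(K, trials=200, seed=0, tol=1e-9):
    """Spot check: symmetry plus c^T K c >= 0 for random vectors c."""
    n = len(K)
    if any(abs(K[i][j] - K[j][i]) > tol for i in range(n) for j in range(n)):
        return False
    rng = random.Random(seed)
    return all(
        quadratic_form(K, [rng.uniform(-1.0, 1.0) for _ in range(n)]) >= -tol
        for _ in range(trials)
    )

# Hypothetical sample of four points in R^2.
points = [(0.0, 1.0), (1.0, 0.5), (-1.0, 2.0), (0.3, -0.7)]
checks = {
    kernel.__name__: looks_psd(gram_matrix(points, kernel))
    for kernel in (linear_kernel, polynomial_kernel, rbf_kernel)
}
```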

Now that the relation between feature maps and kernels is clear, we only need to relate these entities to their respective reproducing kernel Hilbert space (RKHS). This relation comes in the form of the following theorem as given in Mohri, Rostamizadeh, and Talwalkar:

Theorem 2 (Reproducing kernel Hilbert space (RKHS) [39]). Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel. Then, there exists a Hilbert space $\mathcal{F}$ and a mapping $\phi$ from $\mathcal{X}$ to $\mathcal{F}$ such that:

\[
\forall x, x' \in \mathcal{X}, \quad k(x, x') = \langle \phi(x), \phi(x') \rangle .
\]

Furthermore, $\mathcal{F}$ has the following property known as the reproducing property:

\[
\forall f \in \mathcal{F}, \ \forall x \in \mathcal{X}, \quad f(x) = \langle f, k(x, \cdot) \rangle .
\]

$\mathcal{F}$ is called a reproducing kernel Hilbert space (RKHS) associated to $k$.

Finally, we can give a concise description of the existence of the expansion in Eq. (2.1). The original representer theorem was presented by Kimeldorf and Wahba [41] and later refined by, e.g., Schölkopf [42]. In general, the representer theorem states that if a given optimization problem can be rephrased in a specific (very general) form, then the optimal solution of this optimization problem must live in the span of the data.

Theorem 3 (Representer Theorem [39]). Let $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a PDS kernel and $\mathcal{F}$ its corresponding RKHS. Then, for any non-decreasing function $G: \mathbb{R} \to \mathbb{R}$ and any loss function $L: \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$, the optimization problem

\[
\operatorname*{argmin}_{f \in \mathcal{F}} \; G(\|f\|_{\mathcal{F}}) + L(f(x_1), \ldots, f(x_n))
\]

admits a solution of the form $f^* = \sum_{i=1}^{n} \alpha_i k(x_i, \cdot) = \sum_{i=1}^{n} \alpha_i \phi(x_i)$. If $G$ is further assumed to be increasing, then any solution has this form.
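As an illustration of the theorem, consider kernel ridge regression, i.e. $G(\|f\|_{\mathcal{F}}) = \lambda \|f\|_{\mathcal{F}}^2$ together with the squared loss. Plugging $f = \sum_i \alpha_i k(x_i, \cdot)$ into the objective yields the well-known closed form $\alpha = (K + \lambda I)^{-1} y$. The sketch below is an assumption-laden toy example: the kernel, the regularization constant `lam`, the data, and all function names are chosen here for illustration only, and the small linear solver merely keeps the block free of external dependencies.

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel for scalar inputs (bandwidth is an assumption)."""
    return math.exp(-((x - z) ** 2) / (2.0 * sigma ** 2))

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def kernel_ridge(xs, ys, kernel, lam=0.1):
    """Return alpha = (K + lam * I)^{-1} y and f(x) = sum_i alpha_i k(x_i, x)."""
    n = len(xs)
    A = [[kernel(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    alphas = solve_linear_system(A, ys)

    def f(x):
        return sum(a * kernel(xi, x) for a, xi in zip(alphas, xs))

    return alphas, f

# Hypothetical toy data: samples of a parabola.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]
alphas, f = kernel_ridge(xs, ys, rbf_kernel, lam=0.1)
```

The learned function is fully described by the $n$ coefficients in `alphas`, regardless of the (here infinite) dimension of the feature space.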

2.2 Optimization

We give a short introduction to optimization based on the great book of Boyd and Vandenberghe [43]. First, we introduce the necessary concepts while focusing on the general problem. We then turn to the special case of convex optimization and end this section with a discussion on non-convex problems.

At the core of every machine learning method there is an objective that needs to be optimized subject to some constraints. This means that whatever the intent of a method is, it should be expressible as an objective function of some adjustable parameters $x$:

\begin{equation}
\begin{aligned}
\min_{x} \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \ldots, m, \\
& h_j(x) = 0, \quad j = 1, \ldots, p .
\end{aligned}
\tag{2.2}
\end{equation}

We refer to the feasible domain of $x$ (assumed to be non-empty) as $\mathcal{D}$ and to the optimal value of the problem as $p^*$. The above problem, the primal optimization problem, is a constrained optimization problem, which is generally hard to handle. An unconstrained version can be obtained using the notion of Lagrangian functions with penalties on constraint violations.

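Definition 3 (Lagrangian). A function $L: \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ of the form

\[
L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)
\]

is called the Lagrangian associated with Problem (2.2).

The variables $\nu$ and $\lambda$ are called the Lagrange multipliers or dual variables.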

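Definition 4 (Lagrange dual function). Let $g: \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ be the minimum value of the Lagrangian over $x$. Then for any $\lambda \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^p$,

\[
g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu)
= \inf_{x \in \mathcal{D}} \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right) .
\]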

An important property of the above definition is that for any $\lambda \geq 0$ and any $\nu$ the Lagrange dual function is a lower bound on the primal optimum $p^*$, i.e. $g(\lambda, \nu) \leq p^*$. Maximizing $g(\lambda, \nu)$ with respect to $\lambda \geq 0$ and $\nu$ is referred to as the dual problem of Eq. (2.2); we denote its optimal value by $d^*$. The difference $p^* - d^*$ is called the duality gap. Weak duality, $d^* \leq p^*$, always holds; if the duality gap is zero, we speak of strong duality. There is a strong relation between the primal and the dual problem, which is condensed in the Karush-Kuhn-Tucker (KKT) conditions.
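As a small worked example, consider a toy problem chosen here purely for illustration: minimize $f_0(x) = x^2$ subject to the single inequality constraint $f_1(x) = 1 - x \leq 0$. The primal optimum is $p^* = 1$, attained at $x^* = 1$. Writing $d^*$ for the optimal value of the dual problem, the Lagrangian, the dual function, and the dual optimum are:

```latex
\begin{align*}
L(x, \lambda) &= x^2 + \lambda (1 - x), \\
g(\lambda) &= \inf_{x \in \mathbb{R}} L(x, \lambda)
            = \lambda - \frac{\lambda^2}{4}
            \quad \text{(attained at } x = \lambda / 2 \text{)}, \\
d^* &= \max_{\lambda \geq 0} g(\lambda) = g(2) = 1 = p^* .
\end{align*}
```

For every $\lambda \geq 0$ we have $g(\lambda) \leq 1$, which is the lower-bound property, and the bound is tight at $\lambda = 2$, so the duality gap is zero in this example.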

Theorem 4 (KKT conditions). Let $f_i$ and $h_j$ for i = 0, . . . , m and j = 1, . . . , p be differentiable.
Let further $x^*$ and the pair $(\lambda^*, \nu^*)$ be any primal and dual optimal points with zero
duality gap. Then
\begin{align*}
\nabla f_0(x^*) + \sum_{i=1}^{m} \lambda_i^* \nabla f_i(x^*) + \sum_{i=1}^{p} \nu_i^* \nabla h_i(x^*) &= 0 && \text{(stationarity)}, \\
f_i(x^*) &\le 0, \quad i = 1, \dots, m && \text{(primal feasibility)}, \\
h_i(x^*) &= 0, \quad i = 1, \dots, p && \text{(primal feasibility)}, \\
\lambda_i^* &\ge 0, \quad i = 1, \dots, m && \text{(dual feasibility)}, \\
\lambda_i^* f_i(x^*) &= 0, \quad i = 1, \dots, m && \text{(complementary slackness)}
\end{align*}
hold. These are called the Karush-Kuhn-Tucker (KKT) conditions.
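To make these definitions concrete, consider the following toy problem, which is chosen here purely for illustration and is not part of the original derivation: minimize $x^2$ subject to $1 - x \le 0$. A sketch of the corresponding Lagrangian, dual function, and KKT point reads:

```latex
\begin{align*}
L(x, \lambda) &= x^2 + \lambda (1 - x), \\
g(\lambda) &= \inf_{x} L(x, \lambda) = \lambda - \tfrac{1}{4} \lambda^2
  \quad (\text{attained at } x = \lambda / 2), \\
\max_{\lambda \ge 0} g(\lambda) &= g(2) = 1 = p^*, \\
\text{KKT at } (x^*, \lambda^*) = (1, 2): \quad
  & 2 x^* - \lambda^* = 0, \quad 1 - x^* \le 0, \quad \lambda^* \ge 0, \quad \lambda^* (1 - x^*) = 0 .
\end{align*}
```

In this example the maximum of the dual function coincides with the primal optimum, so the duality gap is zero.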

We further introduce the concept of the convex conjugate which sometimes helps to

generalize certain problems.

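Definition 5 (Conjugate function). Let $f : \mathbb{R}^n \to \mathbb{R}$. The function $f^* : \mathbb{R}^n \to \mathbb{R}$, defined as
\[
f^*(y) = \sup_{x \in \operatorname{dom} f} \left( \langle y, x \rangle - f(x) \right),
\]
is called the conjugate of the function f.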

It is also known as the convex conjugate or Legendre-Fenchel convex conjugate. If f is
differentiable, then f∗ is also called the Legendre transform.
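As a brief worked example (our own choice of function, added for illustration), the conjugate of the squared Euclidean norm can be computed directly from the definition, since the supremum is attained where the gradient of the bracketed term vanishes:

```latex
\begin{align*}
f(x) &= \tfrac{1}{2} \|x\|_2^2, \\
f^*(y) &= \sup_{x} \left( \langle y, x \rangle - \tfrac{1}{2} \|x\|_2^2 \right)
  \quad \Rightarrow \quad y - x = 0 \quad \Rightarrow \quad x = y, \\
f^*(y) &= \langle y, y \rangle - \tfrac{1}{2} \|y\|_2^2 = \tfrac{1}{2} \|y\|_2^2 .
\end{align*}
```

This function is therefore its own conjugate.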

The above formulations hold for any optimization problem. However, convex optimization
problems have certain very desirable properties, two of which stand out. First, every local
optimum is also a global optimum, so that descent methods can find the global optimum,
typically in a reasonable amount of time. This is what we ultimately care about. Second,
strong duality holds, provided the constraints fulfill some basic properties (called constraint
qualifications). This means that we can certify optimality, and it also allows us to optimize
the dual problem instead, which might give additional insights into the application at hand
or make the optimization more efficient. In order to qualify as a convex optimization
problem, the functions $f_i$ must be convex and the $h_j$ must be affine; hence, the feasible set
D is convex. We therefore need two more definitions about convex sets and convex functions.

Definition 6 (Convex set). A set C is said to be convex if the line segment between any two
points in C lies in C, i.e. if for any $x_1, x_2 \in C$ and any σ with 0 ≤ σ ≤ 1,
\[
\sigma x_1 + (1 - \sigma) x_2 \in C
\]
must hold.

Definition 7 (Convex function). A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if dom f is a convex set
and if for all $x_1, x_2 \in \operatorname{dom} f$, and σ with 0 ≤ σ ≤ 1,
\[
f(\sigma x_1 + (1 - \sigma) x_2) \le \sigma f(x_1) + (1 - \sigma) f(x_2).
\]

However, the question of how to find the optimal solution remains. A very simple and
general approach is given in Algorithm 1, where we need to (i) find a descent direction (e.g.
the negative gradient), (ii) find a step length θ ≥ 0 (e.g. by line search), and (iii) define a
stopping criterion.

Most famous, and probably most widely used, is gradient descent.

Algorithm 1 General descent method.
  while stopping criterion not satisfied do
    Determine a descent direction d
    Choose a step length θ ≥ 0
    Update x_{t+1} = x_t + θ d
  end while

There are, of course, many more elaborate optimization algorithms for convex optimization problems [43–48] which

might be general purpose or exploit certain special properties (e.g. sparsity). Broadly they

can be categorized into ﬁrst order methods (e.g. gradient descent) and higher order methods

(e.g. Newton's method), i.e. where the second derivative is needed. A broad class of algorithms

is contained in the proximal methods [49].
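The general descent method of Algorithm 1 can be sketched in a few lines of Python. The following is a minimal illustration under our own assumptions, not a reference implementation: the direction is the negative gradient, the step length is found by a backtracking (Armijo) line search, and the stopping criterion is a small squared gradient norm. The function names, the tolerance, the constant `c`, and the quadratic test objective are all hypothetical choices.

```python
def descent(f, grad, x0, tol=1e-8, max_iter=1000, c=1e-4):
    """General descent method with the negative gradient as descent direction."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        sq_norm = sum(gi * gi for gi in g)
        # (iii) stopping criterion: squared gradient norm below tolerance
        if sq_norm < tol:
            break
        # (i) descent direction: negative gradient
        d = [-gi for gi in g]
        # (ii) step length: halve theta until the Armijo condition holds
        theta = 1.0
        fx = f(x)
        while f([xi + theta * di for xi, di in zip(x, d)]) > fx - c * theta * sq_norm:
            theta *= 0.5
        x = [xi + theta * di for xi, di in zip(x, d)]
    return x


# Illustrative convex objective with minimizer (1, -2).
def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2


def grad(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]


x_star = descent(f, grad, [0.0, 0.0])
```

Swapping the direction for a Newton step or the line search for a fixed step length changes only the two marked lines, which is the sense in which the scheme is general.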

Some of those methods might even be applicable to non-convex optimization. However,
in this case the best one can generally get is a locally optimal solution. Certain methods that
are based on special problem structure, e.g. assuming that the objective function can be
decomposed into a difference of convex functions, showed promising results [50–52]. Despite
all the progress made, due to the increased attention to highly non-convex problems posed
by deep neural nets, the workhorse of today's optimization algorithms is, again, gradient
descent.

Chapter 3

Anomaly Detection and One-class Classification

3.1 Introduction
3.2 Categorization
3.3 One-class Classification
3.3.1 One-class Support Vector Machine
3.3.2 Support Vector Data Description
3.4 Model Selection and Evaluation
3.5 Summary and Discussion

In this chapter, we discuss the fundamentals of anomaly detection and subsequently one-

class classiﬁcation. We start by framing the historical context and state the basic deﬁnitions,

most fundamental work, the various types of anomalies and learning settings. Further, we

attempt to coarsely categorize models and settings before we turn to an in-depth discussion

on one-class classiﬁcation. We proceed by talking about the evaluation and interpretation of

anomaly detection and ﬁnally, we conclude with an outlook on current and future challenges.

3.1 Introduction

Traditionally, anomaly detection is a very application-driven research subject and methods
have been proposed and studied for several decades in statistics, machine learning, data
mining, and database systems [2]. Anomaly detection is used nowadays as an umbrella term
for a wide variety of techniques, settings, and approaches which all share a common goal.
This commonality is usually defined over the unusualness of observations in a given data set.
Hence, the goal is to find, remove, describe or extract (parts of) observations that deviate
significantly from the rest of the data set. A widely accepted, very general definition of what
an anomaly is was given by Hawkins [53], 1980:

An outlier [=anomaly] is an observation which deviates so much from the other

observations as to arouse suspicions that it was generated by a diﬀerent mechanism.

However, there have also been predecessors, e.g. Grubbs, 1969 [54]:

An outlying observation, or “outlier” [=anomaly], is one that appears to deviate

markedly from other members of the sample in which it occurs.

These quotations show that the field of anomaly detection is indeed quite old by the
standards of computer science and statistics. However, the importance of anomaly detection
skyrocketed only quite recently, fueled by the advent of the internet, online services, big data
and the corresponding economic impact. As of today, practically all online services rely
heavily on a mix of anomaly detection methods (e.g. fraud detection, intrusion detection, etc.).
Depending on the context and application, the term “anomaly” is often replaced by other
terms such as outlier, exception, peculiarity, surprise, noise, abnormality, deviant, or
discordant. Notably, anomaly and outlier are used interchangeably throughout most of the
literature whereas other names might indicate specialized settings (e.g. noise). Generally,
three different types of anomalies are considered:

• point anomalies are data points that appear isolated from the bulk of the data (cf.
Fig. 3.1)

• contextual anomalies (sometimes also conditional anomalies) are data points whose
values are anomalous only in a specific context (cf. Fig. 3.2)

• collective anomalies consist of a sequence of data points that can be tagged as
anomalous only as a group, and not as individual points (cf. Fig. 3.3)

Of the above, point anomalies have been studied more extensively and many methods readily
assume that data points come as independent instances. If, however, data exhibits strong
dependency structure, handling data points as independent instances might not suffice anymore.
Anomaly detection has tremendous scope for research especially in this area [25,55–57]. In
some situations, depending on the requirements and the analyst's understanding of the
problem, it might suffice to phrase collective and contextual anomalies as point anomalies.

Figure 3.1 – An example of point anomalies. O1 and

O2 are outliers which are well isolated from the large

data clusters G1 and G2 (distributed under CC BY-SA 4.0

license).

Most anomaly detection methods are loosely based on the notion of Hawkins' outlier and
approach the problem by finding strong deviations in data. Few approaches assume that
anomalies and/or nominal data can be modeled by the analyst (sometimes referred to as
well-defined anomaly distribution, WDAD). These are rare though.

In the end, the output of an anomaly detection method should give a clear answer of
whether or not a given instance is anomalous. This seems like a binary classification task
which, indeed, is handled this way in some cases [58]. However, the task of finding anomalies
is complex and in most realistic cases it cannot be expected to perfectly separate anomalous
data from the nominal data. Furthermore, there are often various degrees of anomalousness
within data (e.g. noise is considered a weak anomaly). To cope with more realistic scenarios,
most analysts favor a continuous output such that data points can be ranked from the most
nominal to the most anomalous data point.

Many methods inherently depend on continuous attributes, which often need further
pre-processing such as normalization to a specific range or whitening (with features having the
same standard deviation). There are, however, other data types which are frequently
encountered in data sets: foremost, binary attributes which can only take on the two values {0,1}
and (un-)ordered categorical attributes which can take $n \in \mathbb{N}^+$ possible values. If needed,
categorical attributes can be converted into a sparse binary vector x of dimension n with
$\|x\|_2 = 1$. Binary attributes can further be converted into continuous attributes by, e.g.
principal component analysis (PCA).
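The conversion of a categorical attribute into a sparse binary vector with unit Euclidean norm is commonly called one-hot encoding. A minimal sketch is given below; the helper `one_hot` and the example categories are hypothetical names introduced only for illustration.

```python
import math


def one_hot(value, categories):
    """Encode a categorical value as a binary vector of dimension n = len(categories)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # Exactly one entry is 1, all others are 0.
    return [1 if value == c else 0 for c in categories]


colors = ["red", "green", "blue"]        # n = 3 possible values
x = one_hot("green", colors)
norm = math.sqrt(sum(v * v for v in x))  # Euclidean norm of the encoding
```

Since exactly one entry is non-zero, the encoding satisfies the unit norm property regardless of n.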

Figure 3.2 – An example of a contextual anomaly. Note that the value of the data instance
at t1 is not anomalous, but in the context of t2 it is (distributed under CC BY-SA 4.0 license).

In search of a Hawkins' outlier, methods need to find deviations from the nominal data
while further knowledge about the process generating the anomalies cannot be expected.
However, in many practical applications there will be prior knowledge either in terms of
expert insights or in existing anomaly samples.

A fully supervised approach assumes that for each data point in the training set a
corresponding label is present. Standard binary classifiers like support vector machines or
logistic regression can be employed and tested accordingly. This approach works well if all
anomaly classes are well sampled. However, usual settings lead to very unbalanced data sets
with often >99% nominal data. In such cases, binary classifiers might return trivial solutions,
e.g. labeling every instance as nominal.

Figure 3.3 – An example of a collective anomaly taken from a human electrocardiogram.
Note that groups of nominal data points vary significantly in their values; only the absence
of a whole group of data points at t = 6000 forms an anomaly (distributed under CC BY-SA
4.0 license).

Half-way between supervised and unsupervised learning is semi-supervised learning,
which can be further split into a number of sub-settings. In machine learning, the general
definition for this setting is simply that, in addition to labeled examples, there are some
unlabeled examples. If inferring the labels for those unlabeled examples is the only goal, then
this can be tagged a transductive setting. However, in the anomaly detection community,
semi-supervised learning often refers to a setting where a data set containing only nominal
data is available. Learning with positive and unlabeled examples (LPUE, or PU learning) is
the extension to the case where nominal data as well as a contaminated data set are available.

Finally, we have a fully unsupervised scenario, where a contaminated data set is available

without labels. Unsupervised and semi-supervised learning settings are the most prevalent

in the literature.

3.2 Categorization

Although successful anomaly detectors depend on application-specific peculiarities, existing
approaches can be roughly categorized depending on their main idea. In the following,
we discuss a list of categories based upon the book of Aggarwal [2]. We would like to point
out that such a list is arbitrary and depends very much on the viewpoint and background
of the writer. E.g. locality preserving projections (LPP) [59] is a linear method that depends
on the proximity of data points and tries to find a lower-dimensional subspace of a
high-dimensional space, and hence fits several of the categories below. Nevertheless, it is a
convenient way of coarsely separating methods to understand their main approach to outlier
detection.

Extreme Value Theory Extreme value theory (EVT) is concerned with the limit behavior
(as the number of samples goes to infinity) of sample extremes, i.e. data points that have very
high or very low values. This is much like the central limit theorem (CLT), which models
the limit behavior of sample sums. Indeed, both theories have been developed roughly at the
same time. Extreme value theory was pioneered by Leonard Tippett in the first half of the
last century. Working with cotton, he noticed that the weakest fibres control the strength of
a thread. Together with R. A. Fisher, Tippett obtained three asymptotic limits describing the
distributions of extremes, which were later put down in a book by Emil Julius Gumbel.
Originally developed as a univariate theory, EVT was later extended to multivariate settings.
However, it was developed to find outliers at the borders of the data, which might be less
helpful as a direct method than as a post-processing step for anomaly scores. Samples for
learning the appropriate model (the generalized extreme value distribution or the generalized
Pareto distribution) are either selected based on the block-maxima (BM) approach or the
more recently proposed peaks over threshold (POT) approach.

Probabilistic and Statistical Models Statistical models for anomaly detection comprise
tail inequalities and tail confidence tests. For Bayesian probabilistic models, parameters (and
hyper-parameters) are modeled by probability distributions and the result itself is, again, a
probability distribution called the posterior distribution. This is (often) in contrast to
frequentist approaches, where we are usually interested in point estimates through, e.g.
maximum likelihood (ML) or maximum a posteriori (MAP), hence the best possible model. There
are two key challenges to overcome: (a) modeling the dependencies between random
variables and choosing the right probability distributions for the various parameters, and (b)
inference, i.e. deriving the posterior distribution given observations. On the positive side,
modeling data dependencies comes naturally and since the result is a probability distribution,
it can be neatly used to separate high-density regions from potential outlier regions.
The downside, however, is that probability distributions must be chosen carefully to fully
reflect reality. As simple as it sounds, this is often not the case. Because deriving the
posterior distribution is complex, simpler or matching (i.e. conjugate) distributions are often
used.

Linear Models Modeling the nominal class as a linear model and ﬁnding strong deviations

from this model is an appealing approach since many prominent and powerful methods used

in machine learning are based upon linear models. This list includes regression methods

such as (ordinary/regularized) least squares regression, least absolute shrinkage and selection

operator (LASSO) [60], and Gaussian processes as well as binary classifiers such as logistic

regression and support vector machines. The list continues with, e.g. principal component

analysis (PCA), independent component analysis (ICA), and, what will be very important

to this thesis, one-class support vector machines (OC-SVMs) [7,61]. Interestingly though,

support vector data description (SVDD), which models a hypersphere around the nominal

class, would technically not belong to this category as it is based on a quadratic model.

Figure 3.4 – Data was realized from isotropic Gaussian distributions (various clusters with random means
and variance) with increasing dimensionality and uniformly distributed query points. The distance gap ε + 1
is reported on three distinct ℓp-norm induced metrics (p ∈ {1, 2, ∞}, left figure) as well as the minimum,
maximum, and mean Euclidean distance for increasing number of dimensions (right figure). Both figures
show that with increasing number of dimensions minimum and maximum distances concentrate quickly.

Proximity-based Models Proximity-based approaches split into three groups: (a)
distance-based methods, (b) density-based methods, and (c) cluster-based methods. Distance-based
approaches measure similarity based upon the k-nearest neighbor distances. The rationale
behind this is that anomalies are data points with much larger distances to their nearest
neighbors than nominal data. Computationally, calculating k-nearest neighbors requires the
computation of a pairwise distance matrix. This can be prohibitive for a large number of data
points. Luckily though, newer techniques such as locality sensitive hashing reduce the effort
significantly. Approaches in (b) utilize the number of data instances within the proximity of a
data point to estimate its local density [3]. As usual, low density instances will be reported
as more anomalous than high density points. One of the most prominent representatives of
this group is the local outlier factor (LOF) [62] which was heavily used as a vehicle for other
methods, e.g. local outlier probabilities (LoOP) [63]. The goal of methods in (c) is to partition
the given data into subsets, where instances within a subset are similar. Here, approaches can
return hard assignments such as the famous k-means algorithm as introduced by MacQueen
in 1967 [64] or hierarchical clustering approaches such as single-linkage clustering, or soft
assignments, e.g. Gaussian mixture models. Anomalies can then be identified as data points
with large deviation from the nominal data within clusters.
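The distance-based idea can be illustrated with a brute-force sketch that scores every point by the mean distance to its k nearest neighbors. This is only one of several possible scoring rules and is our own simplification; the function name `knn_scores`, the choice of the mean, and the toy data are assumptions made for the example.

```python
import math


def knn_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # One row of the pairwise distance matrix; O(n^2) distances overall.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Four clustered points and one isolated point (hypothetical toy data).
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_scores(data, k=2)
most_anomalous = max(range(len(data)), key=lambda i: scores[i])
```

The quadratic number of distance computations in this sketch is exactly the cost that approximate techniques such as locality sensitive hashing aim to avoid.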

High-dimensional Models High-dimensional models must be seen in historical context.

In the past, researchers have noticed that traditional models for anomaly detection, namely

nearest-neighbor-based models, degrade in performance as the number of dimensions or

features increases. This is blamed on the infamous curse of dimensionality. In 1999, Beyer,

Goldstein, Ramakrishnan, and Shaft condensed the reason into a theorem stating that if cer-

tain conditions are met, then, ultimately, the minimum distance and the maximum distance

within a data set will concentrate (cf. Figure 3.4).

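In detail, Theorem 1 in [65] states that, under certain conditions, as the dimensionality of
data sets increases, the distance $D_{\min}$ between some query point and its nearest neighbor and
the distance $D_{\max}$ to its furthest neighbor become indistinguishable in relative terms,
\[
\lim_{\mathrm{dims} \to \infty} P\left( \frac{D_{\max}}{D_{\min}} - 1 \le \epsilon \right) = 1
\]
for any ε > 0.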
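The concentration effect is easy to reproduce in a small simulation. The sketch below loosely follows the setup described for Figure 3.4 but is our own simplification: a single standard Gaussian cluster, one uniformly drawn query point, and the Euclidean metric only. Sample size, seed, and the chosen dimensions are arbitrary assumptions.

```python
import math
import random


def relative_gap(dim, n_points=200, seed=0):
    """Return D_max / D_min - 1 for one uniform query point and Gaussian data."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    query = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    dists = [math.dist(query, x) for x in data]
    return max(dists) / min(dists) - 1.0


# Relative gap between furthest and nearest neighbor for growing dimensionality.
gaps = {dim: relative_gap(dim) for dim in (2, 20, 200, 2000)}
```

In such a run the relative gap is typically large in two dimensions and shrinks towards zero as the number of dimensions grows.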

Regularization-based methods circumvent those problems by eﬀectively restricting the

model class. However, it is still an area of active research [66,67] today and there is still a need

for models that reduce the dimensionality of the presented data set for, e.g. visualization and

summarization. Reducing the dimensionality can also increase the detection performance,

if misleading information, i.e. noise, is signiﬁcantly reduced. A large class of methods in

this category will extract a subspace from the original space. One of the most prominent

approaches is kernel principal component analysis (kernel PCA) [33,35].

Information Theoretic Models Information theoretic models are based on the minimum
description length principle (MDL) which was developed by Rissanen, 1978 [68]. A recent
introduction can be found in the book of Peter Grünwald [69] which is based on a previous
tutorial [70]. The fundamental idea is to view data compression as a way of learning
functions and, in the context of this work, as a way to identify outliers. Given a set of
hypotheses ℋ and a data set X ∈ 𝒳, one uses the hypothesis H ∈ ℋ that compresses X the most.
Finding outliers then translates to finding patterns or data points that cannot be compressed.

3.3 One-class Classiﬁcation

The term one-class classification first appeared in the works published by Moya and Hush¹.
The following quote is taken from their original research paper [5] and describes the essence
of a one-class classifier:

We call a classiﬁer that can recognize new examples of target patterns and distin-

guish those from non-target patterns a one-class classiﬁer.

Note that this deﬁnition is quite distinct from the binary (or multi-class) classiﬁcation setting,

where the classiﬁer is only required to distinguish between the target class and one speciﬁc

form of non-target class (the opposing class) and not all possible non-target classes.

Translated to the anomaly detection setting, a one-class classifier distinguishes between
the nominal data (=target pattern) and any sort of anomaly (=non-target pattern). It is this
behavior that makes a one-class classifier especially suited for anomaly detection. One-class
classifiers are usually trained either on nominal data only (semi-supervised) or on a
contaminated set (unsupervised) consisting of nominal data and some anomalies.

Interestingly, long before the work of Moya and Hush, T.C. Minter (1975) published his

work on single-class classiﬁcation. In his work [73],

(...) a Bayes classiﬁer will be presented which classiﬁes samples into the “class of

interest” or the “other” classes but requires only labeled training samples for the

“class of interest” to design the classiﬁer. Thus, this classiﬁer minimizes the need

for ground truth. For these reasons, the classiﬁer will be referred to as a single-class

classiﬁer.

He, therefore, presented a ﬁrst version of a semi-supervised one-class classiﬁer 2.

Given the definition of the Hawkins' outlier, anomalies will be quite distinct from the
nominal class. Hence, we presume that anomalies will occur in the tails of the nominal class
probability density, i.e. in the low probability regions. Successful training of a one-class
classifier, therefore, consists of learning a tight description around the high probability
regions of the presented data set. Hence, we would like to learn the density level set containing
most of the data instances. To do so, there are two possible options. First, one can estimate
the distribution and then cut off at the desired density level. Second, one can attempt to
estimate a binary classifier that tags whether or not a given data instance belongs to the
desired density level set. Approaches building on the first principle are called plug-in
estimators while approaches of the second principle are called direct estimators. One-class
classifiers as discussed here will be based on the second principle.

¹ Before 1996 as claimed by Wikipedia, cf. [4,71,72].

² The “other” classes must not contain the “class of interest”.

Let us start with some of the underlying theory. In 1992, Einmahl and Mason [74] presented
a generalization of the quantile function based on the estimation of minimum volume sets
(MVS) that has the following form:
\[
U(\alpha) = \inf\{\lambda(C) : P(C) \ge \alpha,\ C \in \mathcal{C}\}
\]
with P being the distribution, 𝒞 a class of measurable subsets of the input space, and λ a
measure (a real-valued function going from 𝒞 to R). Parameterized by 0 < α < 1, MVS are the
smallest volumes containing a probability mass of at least α, e.g. for α → 1 the corresponding
MVS approaches the support of the density (all non-zero probability elements). In the
empirical case ($P(C) = \frac{1}{n} \sum_{i=1}^{n} 1_C(x_i)$ and Lebesgue measure), α controls the fraction of
data points lying inside of the MVS. So, we get the smallest volume (a tight description) of
the data with a 1 − α fraction of data points that will not be included. Moreover, Wolfgang
Polonik showed that under some assumptions, minimum volume sets are indeed density level
sets [75,76].
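For one-dimensional data and the class of intervals, the empirical minimum volume set can be computed exactly by sliding a window over the sorted sample. The following sketch is an illustration under exactly these assumptions (intervals as the set class, interval length as the volume); the function name and the toy sample are hypothetical.

```python
import math


def minimum_volume_interval(xs, alpha):
    """Shortest interval [lo, hi] containing at least an alpha fraction of xs."""
    xs = sorted(xs)
    n = len(xs)
    m = math.ceil(alpha * n)  # number of points that must be covered
    # Slide a window over m consecutive order statistics and keep the shortest.
    best = min(range(n - m + 1), key=lambda i: xs[i + m - 1] - xs[i])
    return xs[best], xs[best + m - 1]


sample = [0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 9.0]
lo, hi = minimum_volume_interval(sample, alpha=0.9)
outside = [x for x in sample if not lo <= x <= hi]
```

With alpha = 0.9 and ten samples, nine points must be covered, so a single point is left outside the interval and would be reported as anomalous.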

More formally, we are given a set of input instances $x_1, \dots, x_n \in \mathcal{X}$, which are commonly
assumed to be realized from independent and identically distributed (i.i.d.) random variables
$X_1, \dots, X_n \sim P$, where P is a potentially unknown probability measure with density p. The
aim is to find a set containing the most typical instances under the measure P, and instances
lying outside of the set are declared as anomalies. The task of anomaly detection can be
formally phrased within the framework of density level set estimation [75,77,78] as follows.
Denoting by X another i.i.d. copy according to P, the theoretically optimal nominal set is
$L_\nu := \{x \in \mathcal{X} : p(x) \ge b_\nu\}$ for ν ∈ ]0,1[ and $b_\nu$ such that $P(X \notin L_\nu) = \nu$, which is called the ν
density level set and can be interpreted as follows: $L_\nu$ contains the most likely inputs under
the density p, while rare or untypical data (“anomalies”) are modeled to lie outside of $L_\nu$.
The parameter ν indicates the fraction of outliers in the model.

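The aim is to compute, based on the data $x_1, \dots, x_n \in \mathcal{X}$, a good approximation of $L_\nu$,
that is, to determine a function $f : \mathcal{X} \to \mathbb{R}$ giving rise to an estimated density level set
$\hat{L}_\nu := \{x \in \mathcal{X} : f(x) \ge 0\}$. It is desirable that $\hat{L}_\nu$ closely approximates the true density
level set $L_\nu$, i.e., $\hat{L}_\nu$ converges to $L_\nu$ in probability, that is,
\[
P\big( (\hat{L}_\nu \setminus L_\nu) \cup (L_\nu \setminus \hat{L}_\nu) \big) \to 0 \quad \text{for } n \to \infty.
\]
This implies that the complement of $\hat{L}_\nu$ asymptotically has probability mass ν, that is,
$P(X \notin \hat{L}_\nu) \to \nu$ for n → ∞.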
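As an illustration of the plug-in principle, one can first estimate the density and then threshold it such that roughly a ν fraction of the training data falls outside. The sketch below uses a one-dimensional Gaussian kernel density estimate; the bandwidth, the quantile rule for the threshold, and all function names are our own assumptions and not a prescribed estimator.

```python
import math


def kde(x, data, bandwidth=0.5):
    """Gaussian kernel density estimate at x for one-dimensional data."""
    z = len(data) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / z


def plug_in_level_set(data, nu, bandwidth=0.5):
    """Return f such that {x : f(x) >= 0} estimates the nu density level set."""
    densities = sorted(kde(xi, data, bandwidth) for xi in data)
    # Threshold b_nu: the floor(nu * n) lowest-density training points fall below it.
    b_nu = densities[int(nu * len(data))]
    return lambda x: kde(x, data, bandwidth) - b_nu


data = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.5, 0.6, 6.0]
f = plug_in_level_set(data, nu=0.1)
flagged = [x for x in data if f(x) < 0]  # instances outside the estimated level set
```

A direct estimator, in contrast, would learn the decision function f without the intermediate density estimate.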

In the following, we focus on the two most prominent kernel-based [33] one-class
classifiers. Other approaches include, e.g. Bayesian data description [79], Gaussian processes [80],
neural networks [5], and random forests [81]. Further, many existing approaches can be used,
with little changes, as one-class classifiers, e.g. kernel density estimation, Gaussian mixture
models, and k-means.

3.3.1 One-class Support Vector Machine

A, if not the, classic approach to kernel-based one-class classification is the one-class support
vector machine (OC-SVM) [7,8,82]. Even today, the OC-SVM is among the most prominent
and successful anomaly detectors. The OC-SVM is based on linear models
$f_{w,\rho}(x) = \langle w, \phi(x) \rangle - \rho$, where the data is mapped into a reproducing kernel Hilbert space
(RKHS) $\mathcal{H}$ via a feature map $\phi : \mathcal{X} \to \mathcal{H}$. It subsequently separates a fraction of 1 − ν of the
inputs from the origin with maximum margin:
\[
\begin{aligned}
\min_{w, \rho, \xi \ge 0} \quad & \frac{1}{2} \|w\|^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i && \text{(Primal OC-SVM)} \\
\text{subject to} \quad & \xi_i \ge -f_{w,\rho}(x_i) \quad \forall i = 1, \dots, n.
\end{aligned}
\]
The corresponding dual optimization problem (OP) has the following form:
\[
\begin{aligned}
\max_{0 \le \alpha \le \frac{1}{n \nu}} \quad & -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) && \text{(Dual OC-SVM)} \\
\text{subject to} \quad & \sum_{i=1}^{n} \alpha_i = 1
\end{aligned}
\]
with expansion $w = \sum_{i=1}^{n} \alpha_i \phi(x_i)$. Further properties will be discussed in this thesis where
appropriate.

Due to its success, there is a vast literature building on top of OC-SVMs. Rätsch et al. [83]
showed that a boosting-like algorithm can be constructed by solving an ℓ1-norm regularized
one-class SVM. Lee and Scott [84] gave an answer to the problem of calculating the one-class
SVM solution path when ν varies between 0 and 1. The relation to density estimation was
shown in [85]. Estimating minimum volume sets has been investigated by [86]. Vert and
Vert [87] actually showed that the one-class SVM with RBF kernel is a consistent density level
set estimator. The construction of hierarchical level sets has been investigated by [88,89].
Recent works comprise extensions towards group anomaly detection [90] and, of course,
deep learning [91].

3.3.2 Support Vector Data Description

The goal is to find a model $f : \mathcal{X} \to \mathbb{R}$ and a density level set $L := \{x : f(x) \le 0\}$ containing
most of the regular data points, while for anomalies and outliers $x \notin L$ holds. In case of the
support vector data description (SVDD) method, $f_{c,R}(x) = \|c - \phi(x)\|^2 - R^2$ and parameter
estimation corresponds to solving a quadratically constrained quadratic program (QCQP) as
originally proposed by Tax [92]:
\[
\begin{aligned}
\min_{R, c, \xi \ge 0} \quad & R^2 + C \sum_{i=1}^{n} \xi_i && \text{(Primal SVDD)} \\
\text{subject to} \quad & \xi_i \ge f_{c,R}(x_i) \quad \forall i = 1, \dots, n
\end{aligned}
\tag{3.1}
\]

This allows for the following simple geometric interpretation: a ball with minimum radius
R is computed that comprises most of the regular data points, while all points lying outside of
the normality radius are declared anomalous. This is a more direct link to the minimum
volume estimation discussion above. Here, the sets $C \in \mathcal{C}$ are hyper-spheres in the
reproducing kernel Hilbert space (RKHS). The corresponding dual problem of the above OP
is:
\[
\begin{aligned}
\max_{0 \le \alpha \le C} \quad & \sum_{i=1}^{n} \alpha_i k(x_i, x_i) - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(x_i, x_j) && \text{(Dual SVDD)} \\
\text{subject to} \quad & \sum_{i=1}^{n} \alpha_i = 1
\end{aligned}
\]
with expansion $c = \sum_{i=1}^{n} \alpha_i \phi(x_i)$.
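To give an impression of how the dual SVDD can be solved, the following sketch treats the special case C ≥ 1, in which the upper bound on α is inactive, no slack is used, and the problem reduces to finding the minimum enclosing ball in feature space. We use a plain Frank-Wolfe iteration here, which is a technique swapped in for illustration and not the solver used in this thesis; the step size rule, the number of iterations, and all names are assumptions.

```python
def svdd_dual_frank_wolfe(points, kernel, n_iter=2000):
    """Approximately solve the dual SVDD for C >= 1 (no slack) via Frank-Wolfe."""
    n = len(points)
    K = [[kernel(p, q) for q in points] for p in points]
    alpha = [1.0 / n] * n  # feasible start: alpha >= 0 and sum(alpha) = 1
    for t in range(n_iter):
        Ka = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        # Gradient of the dual objective: k(x_i, x_i) - 2 * sum_j alpha_j k(x_i, x_j).
        grad = [K[i][i] - 2.0 * Ka[i] for i in range(n)]
        # The maximizing vertex is the point farthest from the current center c.
        s = max(range(n), key=lambda i: grad[i])
        gamma = 2.0 / (t + 2.0)
        alpha = [(1.0 - gamma) * a for a in alpha]
        alpha[s] += gamma
    Ka = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
    cc = sum(alpha[i] * Ka[i] for i in range(n))  # <c, c> in feature space
    # Squared radius: largest squared feature-space distance to the center c.
    r2 = max(K[i][i] - 2.0 * Ka[i] + cc for i in range(n))
    return alpha, r2


def linear_kernel(p, q):
    return sum(a * b for a, b in zip(p, q))


pts = [(-1.0, 0.0), (1.0, 0.0), (0.0, 0.5)]
alpha, r2 = svdd_dual_frank_wolfe(pts, linear_kernel)
# With a linear kernel the center c = sum_i alpha_i x_i can be formed explicitly.
center = [sum(a * p[d] for a, p in zip(alpha, pts)) for d in range(2)]
```

For the three toy points the enclosing ball is determined by the two outer points, so the center should approach the origin with a squared radius close to one. For C < 1 the box constraint becomes active and a different solver is needed.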

As with the OC-SVM, more details and properties will be discussed in this thesis where
necessary, e.g. Chapter 6.3 presents an in-depth discussion of the properties of the SVDD and
some of the OC-SVM. It is noteworthy that a strong connection between both methods exists,
which is the reason why in the literature the term OC-SVM is sometimes used for the SVDD.

Due to its simple interpretation, the SVDD became very popular in the literature and many
applied as well as theoretical contributions have been made. The solution path of the
SVDD was analyzed by [93]. A Bayesian approach to data description was proposed by [79].

3.4 Model Selection and Evaluation

One of the most crucial parts, not only for anomaly detection but basically for any learning
machine in any setting, is evaluation and model selection. Selecting the best hyper-parameters
and measuring how well a trained model generalizes is essential before putting it into
production. Unfortunately, due to the nature of anomalies, only few, if any, will be
known. Hence, a serious problem in anomaly detection in general is the absence of sufficient
information about anomalous classes.

In unsupervised anomaly detection, models are often combined instead of selected (model
averaging vs. model selection), even though that still leaves the problem of estimating the
generalization error untouched. However, only in rare cases will absolutely no information
about existing anomalies be available, and evaluation can be attempted using the few
available labeled examples.

Even if enough labeled samples are available, the class balances will be extremely skewed
and some measures, e.g. classification error, will not be suited to reflect the state of
generalization error appropriately. To circumvent those problems, error measures that are better
suited to class imbalance are used, namely the area under the ROC curve (AUC or AUROC) and
the area under the precision recall curve (AUPR) [94].
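The effect of class imbalance on the classification error can be seen in a small example. The sketch below compares the accuracy and the AUROC of a trivial detector that never raises an alarm, on a hypothetical data set with 99% nominal data. The AUROC is computed through its rank interpretation, i.e. the probability that a random anomaly receives a higher score than a random nominal point, with ties counted as one half; the function names are our own.

```python
def auroc(scores, labels):
    """Probability that a random anomaly (1) outscores a random nominal point (0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins of anomalies over nominal points; ties count one half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


labels = [0] * 99 + [1]        # 99% nominal data, a single anomaly
trivial_scores = [0.0] * 100   # detector that never raises an alarm
trivial_acc = accuracy([0] * 100, labels)
trivial_auc = auroc(trivial_scores, labels)
```

The trivial detector reaches an accuracy of 99% although it finds no anomaly at all, whereas its AUROC stays at the chance level of one half.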

Few works consider estimation of performance measures with missing information. In

case of PU learning, Hajizadeh, Li, Dollevoet, and Tax [95] introduced a measure, PULP,

that works without explicitly given negative labels. PULP is based on the calculation of the

probability of true positives for some random positive predictions. Another very interesting

approach is given in [96], where the author shows that based on mass volume curves and

excess mass curves, evaluation of unsupervised anomaly detectors can be done without the

help of test samples.

Tax and Müller [97] tackle the problem of model selection for one-class classifiers by
considering only the nominal test samples. Thomas, Clémençon, Feuillard, and Gramfort [98]
generalize this solution for model selection and base their decision upon the estimation of
mass volume curves.

3.5 Summary and Discussion

The problem of anomaly detection arises in many application scenarios and is often a vital
part of the data processing pipeline. Moreover, anomalies themselves are a valuable information
source as they often translate to actionable items. Techniques for anomaly detection have
been investigated for more than half a century now with new challenges arising today, e.g.
due to the sheer amount of data (Big Data) or new types of complex data (Social Networks).
Especially when data has dependency structure, it often poses a complex task as anomalies
might not occur as isolated points (point anomalies); instead, they only appear in groups of
data points (collective anomalies) or within some context (contextual anomalies).


A speciﬁc approach to anomaly detection is one-class classiﬁcation where a classiﬁer is

trained on a single class (the nominal data) alone. A promising family of one-class classiﬁers

is the one-class support vector machine (OC-SVM) and the support vector data description

(SVDD).

In this thesis, we propose extensions of the OC-SVM and SVDD framework to tackle specific problems, with a special focus on leveraging dependency structure within features and samples.

In doing so, we rely on specialized formulations and slight deviations from the standard techniques. In order to avoid notational clutter, we chose to introduce the type of classifier and the corresponding formulation (e.g. constrained vs. unconstrained, primal vs. dual, generalized loss functions and/or regularizers) used within each part separately.

II Point Anomalies

Chapter 4

Sparsity-inducing Regularization

4.1 Preliminaries

4.2 Sparsity-inducing One-class SVM

4.3 Inducing Group-sparsity

4.4 Applications

4.4.1 Analysis of Brain States

4.4.2 Authorship Attribution

4.5 Summary and Discussion

One of the most famous and oldest disputes in statistics and machine learning concerns the origins of the ordinary least squares method. While Adrien Marie Legendre published his work in 1805 [99] and Carl Friedrich Gauss 4 years later in 1809 [100], the latter claimed that he developed his method in 1795, almost 10 years before Legendre published his work. Recent work [101] suggests that Gauss was indeed right, although a definitive answer to this centuries-old question will probably never be given. However, the method and its countless variants still remain the workhorse of the data science community.

One particularly successful extension, ridge regression [102], added a squared regularizer to the main objective. The result of this simple extension was a more stable optimization as well as more accurate predictions. However, these types of regularization also result in a dense representation, meaning that each and every variable is used for prediction even if its correlation with the effect is negligible.

Although dense representations generally have slightly higher prediction performance on real datasets, a parsimonious representation would have several advantages. Besides space and computational advantages due to the smaller number of variables involved in the process, such representations might unravel the driving forces, i.e. the main causes behind the examined problem, and would make the model and the data much more accessible to interpretation. The underlying assumption of sparse models is that target values can be accurately described using only a small subset of input variables. Hence, the model is supposed to learn a mapping with zero weights for variables that do not improve prediction performance. Robert Tibshirani [60] turned ridge regression into a sparse model by simply replacing the squared regularizer with an ℓ1 regularizer. The resulting model, the lasso, became very successful with applications in, e.g., computational biology [103].

Structured sparse models [48,104] are a natural extension of the independent regularization of single variables in that they consider dependencies among them. The kind of structure is usually defined a priori. A common setting is group sparsity where, instead of single variables, disjoint groups of variables are weighted against each other. Of course, there are numerous extensions towards overlapping groups as well as hierarchical or graph dependencies among groups of features.

Figure 4.1 – Grey lines represent density level sets of the objective function of interest. Blue lines indicate the $\ell_p$-norms for $p \in \{1, 2, 10\}$. Maximum values are achieved in the corners for $p = 1$ (sparse solution with $x_1 = 0$ and $x_2 = 1$), at about 80° for $p = 2$ (non-sparse but $x_1 \ll x_2$), and at almost exactly 45° for $p = 10$ ($x_1 \approx x_2$).

Group sparsity, moreover, has a nice interpretation in terms of multiple kernel learning [105–108]. As the name suggests, the goal is to learn a convex combination of multiple kernels to either select the most promising group of features (sparse setting) or to leverage all combined information of heterogeneous data representations (non-sparse setting). In this chapter, we aim at deriving a variant of the one-class SVM [82] that can smoothly transition into a sparse model using $\ell_p$-norm regularization, as well as an $\ell_p$-norm regularized multiple kernel learning variant of the semi-supervised one-class SVM as given in [12] that can be applied to sparse as well as non-sparse MKL settings. Here, the $\ell_p$-norm ensures that both high prediction accuracy and interpretability can be achieved.

In the following sections, after discussing the problem setting in more detail (cf. Section 4.1), we introduce a variant of the one-class SVM with $\ell_p$-norm regularization that also admits sparse reconstruction (cf. Section 4.2) and a particular variant that admits sparse reconstruction of (non-overlapping) groups of features (cf. Section 4.3). Before we conclude the chapter in Section 4.5, we apply the developed methods in Section 4.4.

4.1 Preliminaries

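We build our approaches on the paradigms of support vector learning [33,37] and one-class classification; that is, we are given $n$ data points $\mathbf{x}_1, \dots, \mathbf{x}_n$, where $\mathbf{x}_i$ lies in some input space $\mathbb{R}^d$, and the goal is to find a model $f: \mathbb{R}^d \to \mathbb{R}$ and a density level set $D_\rho = \{\mathbf{x} : f(\mathbf{x}) \ge \rho\}$ encompassing the normal data, i.e., $\mathbf{x} \in D_\rho$, while $\mathbf{x}' \notin D_\rho$ holds for outliers $\mathbf{x}'$. In this chapter, we consider linear models of the form

\[ f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle \tag{4.1} \]

for some feature function $\phi: \mathbb{R}^d \to \mathcal{F}$ mapping the data into some feature space.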

Our aim is a parsimonious representation of the model for single features (cf. Section 4.2) as well as disjoint groups of features (cf. Section 4.3) without compromising on accuracy. To this end, we introduce a tunable parameter that smoothly controls the sparseness of the solution. We consider the Minkowski $\ell_p$-norm $\|\mathbf{w}\|_p = \left(\sum_{i=1}^{d} |w_i|^p\right)^{1/p}$ with $p \ge 1$. An illustration of the impact of various norms on the optimization outcome is given in Figure 4.1.
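As a small numerical illustration of the Minkowski norm, the sketch below compares a sparse and a dense vector of equal Euclidean length under different values of $p$. The helper name `lp_norm` and the two example vectors are illustrative assumptions.

```python
def lp_norm(w, p):
    """Minkowski l_p-norm (sum_i |w_i|^p)^(1/p) for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1 for a proper norm")
    return sum(abs(v) ** p for v in w) ** (1.0 / p)


sparse = [0.0, 1.0]                 # all mass on one coordinate
dense = [0.5 ** 0.5, 0.5 ** 0.5]    # same l_2-norm, mass spread evenly

for p in (1, 2, 10):
    print(p, lp_norm(sparse, p), lp_norm(dense, p))
```

For $p = 1$ the dense vector is penalized more heavily than the sparse one, for $p = 2$ both are (numerically almost) equal, and for $p = 10$ the dense vector is the cheaper one, which mirrors the behavior sketched in Figure 4.1.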

4.2 Sparsity-inducing One-class SVM

In this section, we will introduce an $\ell_p$-norm regularization for the (primal) one-class SVM as given in Equation (Primal OC-SVM). This will allow us to smoothly transition from the dense solutions given by the $\ell_2$-norm regularized one-class SVM towards sparse solutions with an $\ell_1$ regularizer. A variant of the $\ell_1$-norm regularized one-class SVM was derived by Rätsch, Mika, Schölkopf, and Müller [109,110] in the context of boosting, relying on Ivanov regularization and barrier methods for optimization.

Another way to transition smoothly between sparse and dense solutions is the elastic net approach introduced in [111]. Here, two regularizers, one $\ell_1$ and the other $\ell_2$, are weighted against each other. This solution is slightly more complex and requires one additional hyper-parameter. Therefore, we resort to the $\ell_p$-norm solution. In detail, our contributions are the following:

• we propose a variant of the primal one-class SVM with $\ell_p$-norm regularization;

• we derive a corresponding unconstrained optimization problem and solver;

• for the special case of $p = 1$, we derive a re-formulation that allows very efficient optimization;

• we propose a semi-supervised learning extension for $p = 1$ that leverages negative labels.

In this section, we never leave the primal space for optimization. We therefore assume that all transformations have already been applied to the data, so that we can set $\phi = \mathrm{id}_{\mathbb{R}^d}$, which allows us to drop the feature function and avoid notational clutter.

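Let us begin with a slightly generalized version of the primal one-class SVM problem in Equation (Primal OC-SVM):

\[
\begin{aligned}
\min_{\mathbf{w},\rho,\boldsymbol{\xi}} \quad & \Omega(\mathbf{w}) + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \\
\text{s.t.} \quad & \langle \mathbf{w}, \mathbf{x}_i\rangle \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \forall i \in \{1,\dots,n\}
\end{aligned}
\tag{4.2}
\]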

where $\Omega(\mathbf{w})$ is a smooth regularizer and $\nu \in\, ]0,1]$ a hyper-parameter controlling the 'size' of the level set (the lower $\nu$, the larger the level set). Once the optimal parameters $\mathbf{w}^*$ and $\rho^*$ are found, these are plugged into (4.1), and new instances $\mathbf{x}$ are classified according to $\operatorname{sign}(f(\mathbf{x}) - \rho^*)$.

The learning machine (4.2) has been intensively studied for the choice of the regularizer $\Omega(\mathbf{w}) := \frac{1}{2}\|\mathbf{w}\|_2^2$, which leads to dense optimal weight vectors $\mathbf{w}^*$, i.e., the entries of $\mathbf{w}^*$ are strictly different from zero (except in pathological cases) and thus hinder feature selection and interpretability. In contrast, we build the methodology used in this section on more general regularizers of the form

\[ \Omega(\mathbf{w}) := \|\mathbf{w}\|_p, \]

where $\|\mathbf{w}\|_p = \left(\sum_{i=1}^{d} |w_i|^p\right)^{1/p}$ denotes the Minkowski $\ell_p$-norm. Solving optimization problem (4.2) can be tedious due to the various constraints and non-smooth terms. However, we can easily re-write the above optimization problem by substituting $\xi_i$. Note that $\xi_i \ge \rho - \langle \mathbf{w}, \mathbf{x}_i\rangle$ and $\xi_i \ge 0$, which leads to $\xi_i \ge \max(0, \rho - \langle \mathbf{w}, \mathbf{x}_i\rangle)$. Minimization ensures equality and hence, we arrive at

\[ \min_{\mathbf{w},\rho} \; f_p(\mathbf{w}, \rho) = \|\mathbf{w}\|_p - \rho + \frac{1}{\nu n}\sum_{i=1}^{n} \max(0, \rho - \langle \mathbf{w}, \mathbf{x}_i\rangle). \tag{4.3} \]


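Algorithm 2 Subgradient descent solver for $\ell_p$-norm one-class SVM (Eq. (4.2))

Require: $\{\mathbf{x}_i\}_{i=1}^{n}$, $\nu$, $p$, and step size/rule $\alpha$
  Initialize $\mathbf{w}_0$ and $\rho_0$
  Set $f_{best} = +\infty$, $\mathbf{w}_{best} = \mathbf{0}$, and $k = 0$
  while no convergence and $k \le$ max_iter do
    $f_k = f_p(\mathbf{w}_k, \rho_k)$ (cf. Eq. 4.3)
    if $f_k < f_{best}$ then
      $f_{best} = f_k$
      $\mathbf{w}_{best} = \mathbf{w}_k$
    end if
    $\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha \, \partial f_p(\mathbf{w}_k, \rho_k) / \partial \mathbf{w}_k$ (cf. Eq. 4.4)
    $\rho_{k+1} = \rho_k - \alpha \, \partial f_p(\mathbf{w}_k, \rho_k) / \partial \rho_k$ (cf. Eq. 4.5)
    $k = k + 1$
  end while
  return $\mathbf{w}_{best}$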

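The above optimization problem can be readily solved using standard techniques such as sub-gradient descent, where only sub-gradients w.r.t. $\mathbf{w}$ and $\rho$ need to be evaluated:

\[
\frac{\partial f_p(\mathbf{w}, \rho)}{\partial \mathbf{w}} = \frac{\mathbf{w} \odot |\mathbf{w}|^{p-2}}{\|\mathbf{w}\|_p^{p-1}} + \frac{1}{\nu n}\sum_{i=1}^{n}
\begin{cases} -\mathbf{x}_i & \text{for } 0 < \rho - \langle \mathbf{w}, \mathbf{x}_i\rangle \\ \mathbf{0} & \text{else} \end{cases},
\tag{4.4}
\]

\[
\frac{\partial f_p(\mathbf{w}, \rho)}{\partial \rho} = -1 + \frac{1}{\nu n}\sum_{i=1}^{n}
\begin{cases} 1 & \text{for } 0 < \rho - \langle \mathbf{w}, \mathbf{x}_i\rangle \\ 0 & \text{else} \end{cases}.
\tag{4.5}
\]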

Here, $\odot$ denotes the Hadamard product. The resulting sub-gradient descent solver is given in Alg. 2. We give an example of the impact of $1 \le p \le 4$ for a very simple setup consisting of 20-dimensional correlated Gaussian variables in Figure 4.2. To circumvent numerical issues, we report the sum of absolute values normalized by the respective maximum component, $\sum_i |w_i| / \max_j |w_j|$. As expected, the parsimony of the solution vector depends on $p$: the smaller $p$, the sparser the solution.
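The sub-gradient scheme described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used for the experiments: the function names, the diminishing step-size rule `step / sqrt(k + 1)`, the random initialization, and the toy data are all assumptions made for illustration. In contrast to Algorithm 2, the sketch also stores the offset that belongs to the best weight vector.

```python
import math
import random


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


def lp_norm(w, p):
    return sum(abs(v) ** p for v in w) ** (1.0 / p)


def objective(w, rho, X, nu, p):
    """Unconstrained objective f_p(w, rho) of Eq. (4.3)."""
    hinge = sum(max(0.0, rho - inner(w, x)) for x in X)
    return lp_norm(w, p) - rho + hinge / (nu * len(X))


def subgradient(w, rho, X, nu, p):
    """One sub-gradient of f_p w.r.t. w and rho (Eqs. (4.4) and (4.5))."""
    n = len(X)
    norm = lp_norm(w, p)
    g_w = [0.0] * len(w)
    if norm > 0.0:
        for j, v in enumerate(w):
            if v != 0.0:
                # j-th entry of (w * |w|^(p-2)) / ||w||_p^(p-1)
                g_w[j] = math.copysign(abs(v) ** (p - 1.0), v) / norm ** (p - 1.0)
    g_rho = -1.0
    for x in X:
        if rho - inner(w, x) > 0.0:  # point lies outside the level set
            for j, xj in enumerate(x):
                g_w[j] -= xj / (nu * n)
            g_rho += 1.0 / (nu * n)
    return g_w, g_rho


def solve(X, nu=0.5, p=1.5, step=0.05, max_iter=300, seed=0):
    """Sub-gradient descent that keeps the best iterate seen so far."""
    rng = random.Random(seed)
    w = [rng.uniform(0.1, 1.0) for _ in X[0]]  # assumed initialization
    rho = 0.0
    f_best, w_best, rho_best = float("inf"), list(w), rho
    for k in range(max_iter):
        f_k = objective(w, rho, X, nu, p)
        if f_k < f_best:
            f_best, w_best, rho_best = f_k, list(w), rho
        g_w, g_rho = subgradient(w, rho, X, nu, p)
        alpha = step / math.sqrt(k + 1.0)  # assumed diminishing step size
        w = [wj - alpha * gj for wj, gj in zip(w, g_w)]
        rho -= alpha * g_rho
    return w_best, rho_best, f_best


# Toy data (assumed): 40 two-dimensional Gaussian points.
data_rng = random.Random(1)
X = [[data_rng.gauss(1.0, 0.3), data_rng.gauss(0.0, 0.3)] for _ in range(40)]
w_best, rho_best, f_best = solve(X)
print(w_best, rho_best, f_best)
```

Since $f_p$ is positively homogeneous in $(\mathbf{w}, \rho)$, the sketch simply caps the number of iterations instead of relying on a convergence test.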

[Figure 4.2: plot; x-axis: norm parameter $p$ from 1.00 to 4.00; y-axis: $\sum_i |w_i| / \max_j |w_j|$.]

Figure 4.2 – Impact of $p$ on the sparseness of the solution. To avoid numerical issues, we report the sum of absolute values normalized by the maximum component, $\sum_i |w_i| / \max_j |w_j|$.

Very sparse solutions. Now, we focus on the limiting case $p = 1$, which is likely to lead to very sparse solutions: suppose we minimize an objective function $g(\mathbf{w})$ subject to $\|\mathbf{w}\|_1 \le 1$; then, the optimal solution is attained where the level sets of the objective function 'hit' the norm constraint. If the objective function is convex, the point of intersection is usually at one of the corners of the constraint and thus has sparse coordinates (cf. Figure 4.1). In linear methods, each dimension in the solution often corresponds to a measurable cause. The benefit of increasing the sparseness of the solution vector lies in the fact that the solution becomes interpretable. In particular, when no ground truth is available, interpretability is mandatory. An elegant way to solve (4.2) for $\Omega(\mathbf{w}) = \|\mathbf{w}\|_1$ is to set $\mathbf{w} = \mathbf{w}^+ - \mathbf{w}^-$, substituting $\|\mathbf{w}\|_1 = \sum_{j=1}^{d} (w_j^+ + w_j^-)$, and to optimize over $\mathbf{w}^+, \mathbf{w}^- \ge 0$ instead of $\mathbf{w}$.


To enhance numerical stability of sparse one-class learning, we propose to consider the following sparsity-inducing one-class learning formulation:

\[
\begin{aligned}
\min_{\mathbf{w},\boldsymbol{\xi}} \quad & \|\mathbf{w}\|_1 + C\sum_{i=1}^{n}\xi_i \\
\text{s.t.} \quad & \langle \mathbf{w}, \mathbf{x}_i\rangle \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i \in \{1,\dots,n\}
\end{aligned}
\tag{4.6}
\]

(which is reminiscent of the very well known 2-class C-SVM given by Cortes and Vapnik [37] or sparse Fisher by Mika et al. [112]). The following theorem shows that (4.2) is an exact re-formulation of (4.6).

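Theorem 5. Let $\Omega(\mathbf{w}) = \|\mathbf{w}\|_1$ and denote the optimal solutions of (4.2) and (4.6) by $(\mathbf{w}^*_\nu, \rho^*_\nu, \boldsymbol{\xi}^*_\nu)$ with $\rho^*_\nu > 0$ and $(\tilde{\mathbf{w}}^*_C, \tilde{\boldsymbol{\xi}}^*_C)$, respectively. Then, for any $\nu \in\, ]0,1]$, setting $C := \frac{1}{\nu n}$, it holds

\[ \mathbf{w}^*_\nu = \rho^*_\nu \, \tilde{\mathbf{w}}^*_C, \]

i.e., the weight vectors output by (4.2) and (4.6) are, besides a scaling factor, equivalent.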

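Proof. Let $(\mathbf{w}^*, \rho^*, \boldsymbol{\xi}^*)$ be optimal in (4.2). It follows that $(\mathbf{w}^*, \rho^*)$ is optimal in the corresponding unconstrained formulation:

\[ (\mathbf{w}^*, \rho^*) = \operatorname*{argmin}_{\mathbf{w},\rho} \; \|\mathbf{w}\|_1 + \frac{1}{\nu n}\sum_{i=1}^{n} \max(0, \rho - \langle \mathbf{w}, \mathbf{x}_i\rangle) - \rho. \]

Note that thus $\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} \|\mathbf{w}\|_1 + \frac{1}{\nu n}\sum_{i=1}^{n} \max(0, \rho^* - \langle \mathbf{w}, \mathbf{x}_i\rangle)$. Now denote $\tilde{\mathbf{w}}^* := \operatorname*{argmin}_{\tilde{\mathbf{w}}} \rho^*\|\tilde{\mathbf{w}}\|_1 + \frac{1}{\nu n}\sum_{i=1}^{n} \max(0, \rho^* - \langle \rho^*\tilde{\mathbf{w}}, \mathbf{x}_i\rangle)$. By the variable substitution $\mathbf{w} = \rho^*\tilde{\mathbf{w}}$ we observe that $\mathbf{w}^* = \rho^*\tilde{\mathbf{w}}^*$ and hence $\mathbf{w}^*/\rho^*$ is optimal in $\min_{\tilde{\mathbf{w}}} \|\tilde{\mathbf{w}}\|_1 + \frac{1}{\nu n}\sum_{i=1}^{n} \max(0, 1 - \langle \tilde{\mathbf{w}}, \mathbf{x}_i\rangle)$ (because $\rho^*$ is positive), which, setting $C := \frac{1}{\nu n}$, is the unconstrained version of (4.6) (and thus equivalent). Thus $\mathbf{w}^*/\rho^*$ is optimal in (4.6), which was to be shown.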
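The substitution step in the proof rests on the positive homogeneity of both the $\ell_1$-norm and the hinge term. Spelled out with the symbols of the proof only, for $\rho^* > 0$ one obtains:

```latex
\begin{aligned}
\rho^*\|\tilde{\mathbf{w}}\|_1
  + \frac{1}{\nu n}\sum_{i=1}^{n}\max\bigl(0,\ \rho^* - \langle \rho^*\tilde{\mathbf{w}}, \mathbf{x}_i\rangle\bigr)
&= \rho^*\|\tilde{\mathbf{w}}\|_1
  + \frac{1}{\nu n}\sum_{i=1}^{n}\rho^*\max\bigl(0,\ 1 - \langle \tilde{\mathbf{w}}, \mathbf{x}_i\rangle\bigr) \\
&= \rho^*\Bigl(\|\tilde{\mathbf{w}}\|_1
  + \frac{1}{\nu n}\sum_{i=1}^{n}\max\bigl(0,\ 1 - \langle \tilde{\mathbf{w}}, \mathbf{x}_i\rangle\bigr)\Bigr).
\end{aligned}
```

Multiplying an objective by the positive constant $\rho^*$ does not change its minimizers, so both expressions share the same minimizer $\tilde{\mathbf{w}}^*$.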

Semi-supervised Learning. In exploratory data analysis with large amounts of data, unsupervised methods are often applied in an iterative manner to reveal properties of interest. To incorporate accumulated knowledge, we would like to include labeled information in the optimization problem. For dense models, this has been done in Görnitz, Kloft, Rieck, and Brefeld [12]. However, we need to retain the sparseness of the solution and do not want to increase the number of hyper-parameters that need to be adjusted on the fly (of which there are four in [12]). We can achieve this by adding only negative labels, hence having unlabeled and negatively labeled data available. Moreover, in contrast to [12], there is no margin, and every data point has uniform influence.

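We wish to include negatively labeled instances $\mathbf{x}_{n+1}, \dots, \mathbf{x}_m$ (i.e., instances of which we already know that they are outliers) into the learning machine (4.6). A simple and effective way to do so is to constrain the negatively labeled instances to lie outside of the density level set: $\langle \mathbf{w}, \mathbf{x}_i\rangle \le 1 + \xi_i$, $\xi_i \ge 0$, $\forall i \in \{n+1, \dots, m\}$. The resulting linear program

\[
\begin{aligned}
\min_{\mathbf{w}^+,\mathbf{w}^-,\boldsymbol{\xi}} \quad & \sum_{j=1}^{d} (w_j^+ + w_j^-) + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.} \quad & \langle \mathbf{w}^+ - \mathbf{w}^-, \mathbf{x}_i\rangle \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad \forall i \in \{1,\dots,n\} \\
& \langle \mathbf{w}^+ - \mathbf{w}^-, \mathbf{x}_i\rangle \le 1 + \xi_i, \quad \xi_i \ge 0, \quad \forall i \in \{n+1,\dots,m\} \\
& \mathbf{w}^+ \ge 0, \quad \mathbf{w}^- \ge 0
\end{aligned}
\tag{4.7}
\]

can be efficiently solved using off-the-shelf solvers such as MOSEK¹.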
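To make the structure of the linear program (4.7) concrete, the sketch below assembles it in the generic form "minimize $c^\top z$ subject to $Az \le b$, $z \ge 0$" with $z = (\mathbf{w}^+, \mathbf{w}^-, \boldsymbol{\xi})$. The function names and the toy data are illustrative assumptions; the sketch only builds and checks the matrices, and the resulting triple could be handed to any LP solver.

```python
def build_lp(X_unlabeled, X_negative, C):
    """Assemble (c, A, b) for LP (4.7) with variables z = (w+, w-, xi) >= 0."""
    d = len(X_unlabeled[0])
    n = len(X_unlabeled)
    m = n + len(X_negative)
    c = [1.0] * (2 * d) + [C] * m
    A, b = [], []
    for i, x in enumerate(X_unlabeled):
        # <w+ - w-, x_i> >= 1 - xi_i   <=>   -<w+, x_i> + <w-, x_i> - xi_i <= -1
        slack = [0.0] * m
        slack[i] = -1.0
        A.append([-v for v in x] + list(x) + slack)
        b.append(-1.0)
    for k, x in enumerate(X_negative):
        # <w+ - w-, x_i> <= 1 + xi_i   <=>   <w+, x_i> - <w-, x_i> - xi_i <= 1
        slack = [0.0] * m
        slack[n + k] = -1.0
        A.append(list(x) + [-v for v in x] + slack)
        b.append(1.0)
    return c, A, b


def is_feasible(z, A, b, tol=1e-12):
    """Check z >= 0 and A z <= b."""
    if any(v < -tol for v in z):
        return False
    return all(sum(a * v for a, v in zip(row, z)) <= bi + tol
               for row, bi in zip(A, b))


# Toy data (assumed): two unlabeled points and one known outlier.
c, A, b = build_lp([[1.0, 0.0], [2.0, 0.0]], [[0.0, 1.0]], C=0.5)
z = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # w+ = (1, 0), w- = 0, xi = 0
print(is_feasible(z, A, b), sum(ci * zi for ci, zi in zip(c, z)))
```

Splitting $\mathbf{w}$ into two non-negative parts is what turns the non-smooth $\ell_1$ term into a linear objective.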

A note on $p < 1$. Very sparse solutions can be obtained by optimizing over the $\ell_0$-norm, which gives the number of non-zero elements, and other $\ell_p$-norms with $p < 1$. However, these are no longer proper norms, and the resulting optimization problem would be non-convex or even combinatorial (for $\ell_0$). Hence, applying Algorithm 2 no longer guarantees convergence to a meaningful optimum. Using $\ell_1$ instead can be viewed as a convex surrogate for the actual sparse solution.
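A short worked example shows why the case $p < 1$ does not yield a norm: the triangle inequality fails. Taking $p = \tfrac{1}{2}$ and the unit vectors $\mathbf{u} = (1, 0)$ and $\mathbf{v} = (0, 1)$ (chosen here purely for illustration):

```latex
\|\mathbf{u} + \mathbf{v}\|_{1/2}
  = \bigl(1^{1/2} + 1^{1/2}\bigr)^{2} = 4
  \;>\; 2
  = \bigl(1^{1/2} + 0\bigr)^{2} + \bigl(0 + 1^{1/2}\bigr)^{2}
  = \|\mathbf{u}\|_{1/2} + \|\mathbf{v}\|_{1/2}.
```

The corresponding unit ball is therefore not convex, which is the source of the non-convexity mentioned above.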

4.3 Inducing Group-sparsity

Group sparsity, which was first studied in the context of the lasso [113], is the problem of identifying and selecting groups of features instead of single features. Unlike in the standard lasso, features within groups are regularized using the $\ell_2$-norm, while an $\ell_1$ regularizer is used on the group level. This leads to a parsimonious selection of groups with dense features rather than a parsimonious representation of single features. A comprehensive survey on optimizing structured sparsity models using penalties can be found in [48].
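The mixed regularizer described above (an $\ell_2$-norm inside each group, an $\ell_1$-norm across groups) can be written down directly. The following is a minimal sketch; the function name `group_lasso_penalty` and the example weights are illustrative assumptions.

```python
import math


def group_lasso_penalty(w, groups):
    """Sum over groups of the l_2-norm of the weights in that group."""
    return sum(math.sqrt(sum(w[j] ** 2 for j in group)) for group in groups)


w = [3.0, 4.0, 0.0, 0.0]
groups = [[0, 1], [2, 3]]   # two disjoint groups of feature indices

print(group_lasso_penalty(w, groups))   # 5.0
print(sum(abs(v) for v in w))           # 7.0 (plain l_1 penalty for comparison)
```

A group contributes nothing to the penalty only if all of its weights vanish, which is why entire groups are switched off while the features within a selected group stay dense.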

[Figure 4.3: bar plot for $p = 1.00$; x-axis: kernel width parameter (0.001, 0.01, 1.0, 2.0, 4.0, 10.0, 100.0); y-axis: kernel weighting.]

Figure 4.3 – Impact of $p$ on the sparseness of the solution. Here, the weighting of seven RBF kernels with various widths has been learned setting $p = 1$.

Multiple kernel learning (MKL) is especially useful for heterogeneous data where a single similarity measure might not suffice to capture signals from the data efficiently. The goal of MKL is to combine various kernels and weight them appropriately to achieve improved performance. Here, each kernel represents a group of features, and MKL, when optimized to yield sparse weightings [107,114,115], can be viewed as a group sparsity optimization problem. In fact, the relation between MKL and group sparsity is well known [48,116].

Recently, non-sparse multiple kernel learning using $\ell_p$-norm regularization has been shown to outperform sparse counterparts by some margin [105,117,118]. The additional hyper-parameter $p$ can be adjusted to trade off accuracy and sparsity of the solution. In the limiting case of $p = 1$, equivalence to the group sparsity problem is ensured.

There are various works that extend the multiple kernel learning idea to one-class classifiers. Among the most prominent are Kloft, Brefeld, Düssel, Gehl, and Laskov [119], who extended the support vector data description (SVDD) to regularize groups of features with an $\ell_1$ loss in the context of network intrusion detection. The same extension had been applied to one-class SVMs in [107]. Our contribution, on the other hand, provides a trade-off parameter $p$ that controls the sparseness of the solution and, additionally, is able to handle labeled examples.

Our contributions in this section are:

• we extend the (convex) semi-supervised anomaly detector (SSAD) [12] to handle multiple kernels and to automatically adjust their weighting using $\ell_p$-norm regularization;

• we propose a corresponding block coordinate descent solver in the spirit of [105] that alternates between solving the SSAD problem using the most recent kernel mixture and analytically updating the mixture coefficients.

¹ http://www.mosek.com/

We illustrate the impact of the trade-off parameter $p$ on the sparseness of the solution in Figure 4.3 and Figure 4.4. Here, seven RBF kernels with various widths (x-axis) are generated from Gaussian data points. Figure 4.3 shows that setting $p = 1$ leads to a sparse solution in which a single kernel receives most of the weight. In Figure 4.4, however, the non-sparse solution using $p = 2$ still assigns most of the weight to the very same kernel while distributing large portions of the total weight to the other kernels as well.

[Figure 4.4: bar plot for $p = 2.00$; x-axis: kernel width parameter (0.001, 0.01, 1.0, 2.0, 4.0, 10.0, 100.0); y-axis: kernel weighting.]

Figure 4.4 – Impact of $p$ on the sparseness of the solution. Here, the weighting of seven RBF kernels with various widths has been learned setting $p = 2$.

In cases where few labeled examples are available, using only unlabeled data for estimating the parameters of the one-class SVM usually leads to less accurate models than fully exploiting the labeled information [12]. Therefore, in addition to $n$ unlabeled examples $\mathbf{x}_1, \dots, \mathbf{x}_n$, we include $m$ labeled examples $(\mathbf{x}_{n+1}, y_{n+1}), \dots, (\mathbf{x}_{n+m}, y_{n+m})$. Labels $y_i \in \{+1, -1\}$ are binary; that is, in case $y_i = +1$, the entry $\mathbf{x}_i$ belongs to the nominal class. To combine sums and hence improve readability, we introduce labels $y_i = +1$ for all unlabeled examples $i = 1, \dots, n$ and an indicator function $\mathbf{1}_c \equiv [c > n]$ to mask labeled examples; the function $\mathbf{1}_c$ simply returns 1 if $c > n$ and 0 otherwise.

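A semi-supervised generalization of the one-class SVM model is the convex version of the semi-supervised anomaly detection framework (SSAD) [12], which we will use with an $\ell_2$-regularizer together with the hinge loss. Let $\gamma$ be the margin for the labeled examples and $\kappa$, $\eta_u$, and $\eta_l$ trade-off parameters. To avoid notational clutter, we introduce the example-wise regularization hyper-parameters

\[ \eta_i = \begin{cases} \eta_u & \text{for } i = 1, \dots, n \\ \eta_l & \text{else} \end{cases} \]

which allows us to shorten the optimization problem to

\[
\begin{aligned}
\min_{\mathbf{w},\rho,\gamma \ge 0,\boldsymbol{\xi} \ge 0} \quad & \frac{1}{2}\|\mathbf{w}\|_2^2 - \rho - \kappa\gamma + \sum_{i=1}^{n+m} \eta_i \xi_i \\
\text{s.t.} \quad & y_i \langle \mathbf{w}, \phi(\mathbf{x}_i)\rangle \ge y_i \rho + \mathbf{1}_i \gamma - \xi_i \quad \forall i = 1, \dots, n+m.
\end{aligned}
\tag{4.8}
\]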

The solution $\mathbf{w}$ admits a dual representation and can be written as

\[ \mathbf{w} = \sum_{i=1}^{n+m} \alpha_i y_i \phi(\mathbf{x}_i) \]

and hence, the decision function depends only on inner products of the input examples, which paves the way for kernel functions $k_\phi(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}')\rangle$ (see [33] for an introduction to kernels). It holds

\[ f(\mathbf{x}) = \sum_{i=1}^{n+m} \alpha_i y_i \langle \phi(\mathbf{x}_i), \phi(\mathbf{x})\rangle - \rho = \sum_{i=1}^{n+m} \alpha_i y_i k_\phi(\mathbf{x}_i, \mathbf{x}) - \rho. \]

We omit the subscript $\phi$ in the remainder to not clutter notation unnecessarily.

A kernel represents a similarity measure for single features or groups of features. If inputs are of very distinct nature, e.g. continuous and discrete values, a single similarity measure might not be sufficient. In such cases, we would like to incorporate multiple feature descriptions into the learning problem, each represented by its respective kernel. To fully exploit the provided set of kernels, we aim to learn a weighted combination of $T$ kernels with mixing coefficients $\mathbf{d} = (\beta_1, \dots, \beta_T)$:

\[ k_{MKL}(\mathbf{x}, \mathbf{x}') := \sum_{t=1}^{T} \beta_t k_t(\mathbf{x}, \mathbf{x}') = \sum_{t=1}^{T} \beta_t \langle \phi_t(\mathbf{x}), \phi_t(\mathbf{x}')\rangle = \sum_{t=1}^{T} \langle \sqrt{\beta_t}\,\phi_t(\mathbf{x}), \sqrt{\beta_t}\,\phi_t(\mathbf{x}')\rangle. \]
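The weighted kernel combination can be illustrated on precomputed Gram matrices. The sketch below is a minimal example under assumptions: the helper names, the RBF parametrization $\exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2))$, the widths, and the mixing coefficients are chosen for illustration only.

```python
import math


def rbf_gram(X, width):
    """Gram matrix of k(x, x') = exp(-||x - x'||^2 / (2 * width^2)).

    The parametrization of the kernel width is an assumption of this sketch.
    """
    def k(a, b):
        sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-sq_dist / (2.0 * width ** 2))
    return [[k(a, b) for b in X] for a in X]


def combine_kernels(grams, betas):
    """Entry-wise weighted sum  sum_t beta_t * K_t  with beta_t >= 0."""
    if len(grams) != len(betas):
        raise ValueError("one mixing coefficient per kernel is required")
    if any(beta < 0.0 for beta in betas):
        raise ValueError("mixing coefficients must be non-negative")
    n = len(grams[0])
    return [[sum(beta * K[i][j] for beta, K in zip(betas, grams))
             for j in range(n)] for i in range(n)]


X = [[0.0], [1.0], [3.0]]                        # toy inputs (assumed)
grams = [rbf_gram(X, width) for width in (0.5, 2.0)]
K_mkl = combine_kernels(grams, betas=[0.25, 0.75])
print(K_mkl[0])
```

Because a non-negative combination of positive semi-definite matrices is again positive semi-definite, the combined matrix can be used wherever a single kernel matrix is expected.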

In general, properties of the mixing coefficients include (i) non-negativity, hence $\beta_t \ge 0$, and (ii) normalization, $\|\mathbf{d}\|_p = 1$. Recent work [105] suggests using the more general $p$-norm instead of the common 1-norm [107,120,121]. The latter usually leads to sparse mixing coefficients, whereas the $p$-norm with $1 \le p \le +\infty$ admits sparsity adjustments for the problem at hand and thus adds flexibility. Incorporating multiple feature representations in our model (4.8) leads to

\[ f_{MKL}(\mathbf{x}) = \sum_{t=1}^{T} \langle \hat{\mathbf{w}}_t, \sqrt{\beta_t}\,\phi_t(\mathbf{x})\rangle - \rho = \sum_{t=1}^{T} \sqrt{\beta_t}\,\langle \hat{\mathbf{w}}_t, \phi_t(\mathbf{x})\rangle - \rho. \]

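For technical reasons, i.e. to preserve convexity, we substitute the model parameters $\mathbf{w}_t = \sqrt{\beta_t}\,\hat{\mathbf{w}}_t$ and arrive at the revised primal MKL-SSAD optimisation problem:

\[
\begin{aligned}
\min_{\{\mathbf{w}_t\}_{t=1}^{T},\rho,\gamma \ge 0,\boldsymbol{\xi} \ge 0} \quad & \frac{1}{2}\sum_{t=1}^{T} \frac{\|\mathbf{w}_t\|_2^2}{\beta_t} - \rho - \kappa\gamma + \sum_{i=1}^{n+m} \eta_i \xi_i \\
\text{s.t.} \quad & y_i \sum_{t=1}^{T} \langle \mathbf{w}_t, \phi_t(\mathbf{x}_i)\rangle \ge y_i \rho + \mathbf{1}_i \gamma - \xi_i, \quad i = 1, \dots, n+m \\
& \|\mathbf{d}\|_p^2 \le 1, \quad \mathbf{d} \ge 0.
\end{aligned}
\tag{4.9}
\]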

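The authors of [105] prove the equivalence of Tikhonov and Ivanov regularisation, which allows us to move the regulariser on the mixing coefficients into the objective function. We will exploit this relation on various occasions in this section. Deriving the Lagrange dual problem, we arrive at the intermediate saddle point problem

\[
\begin{aligned}
\max_{\boldsymbol{\alpha}} \min_{\{\mathbf{w}_t\}_{t=1}^{T},\, \mathbf{d} \ge 0} \quad & \frac{\lambda}{2}\|\mathbf{d}\|_p^2 + \frac{1}{2}\sum_{t=1}^{T} \frac{\|\mathbf{w}_t\|_2^2}{\beta_t} - \sum_{i=1}^{n+m} \alpha_i y_i \sum_{t=1}^{T} \langle \mathbf{w}_t, \phi_t(\mathbf{x}_i)\rangle \\
\text{s.t.} \quad & \kappa \le \sum_{i=1}^{n+m} \mathbf{1}_i \alpha_i, \quad 1 = \sum_{i=1}^{n+m} y_i \alpha_i, \quad 0 \le \alpha_i \le \eta_i, \quad i = 1, \dots, n+m.
\end{aligned}
\tag{4.10}
\]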

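We solve the optimisation problem in a block-coordinate descent fashion by alternating between $\mathbf{w}$ and $\mathbf{d}$. This enables us to compute the latter analytically by assuming fixed variables $\mathbf{w}$ and setting the partial derivative with respect to $\beta_t$ to zero:

\[ \lambda \beta_t^{p-1} \|\mathbf{d}\|_p^{2-p} - \frac{1}{2}\frac{\|\mathbf{w}_t\|_2^2}{\beta_t^2} = 0. \]

Therefore, given $\Upsilon \ge 0$ we get

\[ \beta_t = \Upsilon \|\mathbf{w}_t\|_2^{\frac{2}{p+1}}. \]

Furthermore, it holds that at any optimal point $\|\mathbf{d}\|_p = 1$, and solving for $\Upsilon$ gives $\Upsilon = 1 / \bigl(\sum_{t=1}^{T} \|\mathbf{w}_t\|_2^{\frac{2p}{p+1}}\bigr)^{\frac{1}{p}}$. Putting things together gives the analytical update rule

\[ \beta_t = \frac{\|\mathbf{w}_t\|_2^{\frac{2}{p+1}}}{\Bigl(\sum_{t'=1}^{T} \|\mathbf{w}_{t'}\|_2^{\frac{2p}{p+1}}\Bigr)^{\frac{1}{p}}} \tag{4.11} \]

which, since only norms are involved, ensures non-negativity for the mixing coefficients.

Algorithm 3 Proposed optimization algorithm for MKL-SSAD (4.9)

Require: $\mathbf{x}$, $\mathbf{y}$, $\eta_u$, $\eta_l$, $\kappa$, and the norm parameter $p$
  Initialize kernel mixture coefficients such that $\|\mathbf{d}^{z=0}\|_p = 1$
  while not converged do
    Step 1: solve the convex SSAD problem as stated in Eqn. (4.12)
      $\boldsymbol{\alpha}^{z+1} = \operatorname{argmax}_{\boldsymbol{\alpha}: 0 \le \alpha_i \le \eta_i} J(\boldsymbol{\alpha}, \mathbf{d}^{z})$ s.t. $\kappa \le \sum_{i=1}^{n+m} \mathbf{1}_i \alpha_i$ and $1 = \sum_{i=1}^{n+m} y_i \alpha_i$
    Step 2: optimize the weights according to Eqn. (4.11)
      $\mathbf{d}^{z+1} = \operatorname{argmin}_{\mathbf{d} \ge 0} J(\boldsymbol{\alpha}^{z+1}, \mathbf{d})$ s.t. $\|\mathbf{d}\|_p^2 \le 1$
    $z = z + 1$
  end while
  return trained parameter vector $\boldsymbol{\alpha}^*$, weights $\mathbf{d}^*$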
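The analytical update rule (4.11) is easy to check numerically. The sketch below takes the per-kernel norms $\|\mathbf{w}_t\|_2$ as input and returns the mixing coefficients; the function name and the example norms are illustrative assumptions.

```python
def update_mixing_coefficients(w_norms, p):
    """beta_t = ||w_t||^(2/(p+1)) / (sum_t ||w_t||^(2p/(p+1)))^(1/p), cf. (4.11)."""
    if p < 1:
        raise ValueError("p must be >= 1")
    total = sum(v ** (2.0 * p / (p + 1.0)) for v in w_norms)
    if total <= 0.0:
        raise ValueError("at least one norm must be positive")
    denom = total ** (1.0 / p)
    return [v ** (2.0 / (p + 1.0)) / denom for v in w_norms]


w_norms = [3.0, 1.0, 0.5]          # assumed values of ||w_t||_2 for T = 3 kernels
for p in (1.0, 2.0, 4.0):
    betas = update_mixing_coefficients(w_norms, p)
    lp = sum(b ** p for b in betas) ** (1.0 / p)
    print(p, betas, lp)            # lp equals 1 up to rounding
```

For $p = 1$ the rule reduces to $\beta_t = \|\mathbf{w}_t\|_2 / \sum_{t'} \|\mathbf{w}_{t'}\|_2$; larger $p$ spreads the weight more evenly across kernels.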

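Substituting $\mathbf{w}_t$ using the representer theorem $\mathbf{w}_t = \beta_t \sum_{i=1}^{n+m} \alpha_i y_i \phi_t(\mathbf{x}_i)$ yields the final optimisation problem for MKL-SSAD:

\[
\begin{aligned}
\max_{\boldsymbol{\alpha}} \min_{\mathbf{d} \ge 0} \quad & \overbrace{-\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j \sum_{t=1}^{T} \beta_t k_t(\mathbf{x}_i, \mathbf{x}_j)}^{=:J(\boldsymbol{\alpha},\mathbf{d})} \\
\text{s.t.} \quad & \kappa \le \sum_{i=1}^{n+m} \mathbf{1}_i \alpha_i, \quad 1 = \sum_{i=1}^{n+m} y_i \alpha_i, \quad 0 \le \alpha_i \le \eta_i, \quad i = 1, \dots, n+m \\
& \|\mathbf{d}\|_p^2 \le 1.
\end{aligned}
\tag{4.12}
\]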

Following the block-coordinate descent scheme, we iteratively alternate between the two optimization blocks, and every limit point of Algorithm 3 is a globally optimal point (cf. [105]). Algorithm 3 summarizes the proposed optimization procedure. To be comparable, kernels need to be centered and normalized.
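Centering and normalizing a kernel matrix can be sketched as follows. Centering in feature space is standard; the text does not specify which normalization is meant, so the trace normalization used here (rescaling such that the trace equals the number of examples) is only one possible choice and an assumption of this sketch, as are the function names and the toy matrix.

```python
def center_gram(K):
    """Center a Gram matrix in feature space: K_c = H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(n)]


def normalize_trace(K):
    """Rescale K so that its trace equals n (assumed normalization)."""
    n = len(K)
    trace = sum(K[i][i] for i in range(n))
    if trace <= 0.0:
        raise ValueError("trace must be positive")
    scale = n / trace
    return [[v * scale for v in row] for row in K]


K = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]              # toy Gram matrix (assumed)
K_ready = normalize_trace(center_gram(K))
print(K_ready)
```

Bringing all kernels to a common scale keeps the learned mixing coefficients from merely reflecting differences in the magnitude of the individual kernels.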

4.4 Applications

We present two applications that show the benefits of our introduced methods. First, we will employ the techniques from Section 4.2 for exploratory data analysis in an EEG-BCI setting where no ground truth is available. However, the results obtained with our methodology were assessed and analyzed by a domain expert. In the second application, we will attempt to find the authorship of disputed documents. To achieve state-of-the-art performance, which will be tested on a related dataset where labels are known, various feature representations of text need to be mixed.

4.4.1 Analysis of Brain States

The last years have seen a rise in interest in using Electroencephalography-based Brain Com-

puter Interfacing (EEG-BCI) methodology for investigating non-medical questions beyond

the purpose of communication and control. One of these novel applications is to examine

how signal quality is being processed neurally, which is of particular interest for industry,

besides providing neuroscientiﬁc insights. As for most behavioural experiments in the neu-

rosciences, the assessment of a given stimulus by a subject is required. Based on an EEG

study on speech quality of phonemes, we will ﬁrst discuss the information contained in the

neural correlate of this judgement. Typically, this is done by analyzing the data along behav-

ioral responses/labels. However, participants in such complex experiments often guess at the

threshold of perception. This leads to labels that are only partly correct and oftentimes ran-

dom, which is a problematic scenario for using supervised learning. Therefore, we propose a

novel supervised-unsupervised learning scheme based on techniques from Section 4.2, that

aims to diﬀerentiate true labels from random ones in a data-driven way. We show that this

approach provides a more crisp view of the brain states that experimenters are looking for,

besides discovering additional brain states to which the classical analysis is blind.

EEG Experiment and Classical Analysis. Understanding which levels of quality loss are

still perceived by users is a crucial question for any provider of signal quality. Conventionally,

behavioral tests are used for this purpose, asking participants directly for their rating. Recent

work has proposed to complement this approach by also recording a user’s neural response

to a stimulus, as the neural response may diﬀer from the behavioral response [122–124].

Eleven participants (mean age 25) took part in this study; for each of them, both behavioral and neural responses were recorded using 64-channel EEG. Participants performed an auditory

discrimination task in which they had to press a button whenever they detected an audi-

tory stimulus of degraded quality (target). Stimuli were presented in an oddball paradigm,

using the undisturbed phoneme /a/ as non-target (NT, 70% of stimuli). Among these stimuli

of high quality, the participant had to ﬁnd instances when the phoneme was superimposed

with signal-correlated noise. Participants were instructed to indicate by button press, if they

noticed a deviation in the stimulus. Four noisy target stimuli were used, T1-T4, consisting of

the phoneme /a/ superimposed with decreasing levels of signal-correlated noise (targets; 6%

per class). In an additional 6% of trials, the phoneme /i/ was presented as control stimulus

(C; target). The noise levels of the target stimuli (T1-T4) were chosen separately for each

participant, in order to account for individual diﬀerences in sensitivity to noise, aiming at

perception rates of 100%, 75%, 25% and 0%, respectively. For this purpose, a pre-test was

run; the resulting signal-to-noise ratios (SNR) for the deviant stimuli were set to 5, 21, 24

and 28 dB on average (mean perception rate in the experiment: 99%, 46%, 22% and 7%). The

disturbed auditory stimuli were created using a Modulated Noise Reference Unit (MNRU).

Target stimuli that were detected by the participant are referred to as 'hits' (true positives) and the others as 'misses' (false negatives).

Each stimulus had a duration of 160 ms with 1000 ms stimulus onset asynchrony. Per par-

ticipant, 8 to 12 blocks were recorded with 300 stimuli each. The button presses of the par-

ticipants were recorded using a parallel port computer keyboard. For stimulus presentation,

in-ear headphones by Sennheiser were used. EEG was recorded using a Brain Products (Mu-

nich, Germany) EEG system with 64 electrodes (AF3-4, 7-8; FAF1-2; Fz, 3-10; Fp1-2; FFC1-2,


5-8; FT7-10; FCz, 1-6; CFC5-8; Cz, 3-6; CCP7-8; CP1-2, 5-6; T7-8; TP7-10; P3-4, Pz, 7-8; POz;

O1-2 and the right mastoid) and a BrainAmp EEG amplifier. Electrodes were placed according to the international 10-10 system. The tip of the nose served as the reference site, and a ground electrode was placed on the forehead. EEG data were sampled at a rate of 100 Hz. In the following, we

investigate event-related potentials (ERPs), i.e. the diﬀerential signal between the voltage at

a given electrode position and the reference electrode.

[Figure 4.5: scalp maps in four rows (T1:hit, T2:hit, T2:miss, T3:miss) and seven columns (time intervals 100–150, 150–200, 200–250, 250–300, 400–600, 600–800, and 800–1000 ms); color scale in µV.]

Figure 4.5 – Scalp distribution of ERPs for different stimuli in seven time intervals, grouped by their behavioral label [hit/miss] (participant vp=1). The maps represent a top view on the head with the nose pointing upwards.

The behavioral responses of the participants provide labels for each trial, seemingly in-

dicating whether the stimulus was perceived as disturbed or not. However, these labels can

be assumed to be confounded with label noise to a large degree, in particular at the threshold

of perception (stimulus T2). As a ﬁrst step, we take these spurious labels as ground truth and

analyze the event-related potentials in these groups. If the behavioral response indicates that

the quality degradation is processed (hits), the resulting ERP activation pattern can be char-

acterized by two components: early sensory and late cognitive processing stages. Figure 4.5

shows the spatial distribution of the ERPs as scalp distributions (head seen from above, nose

pointing upwards), averaged over seven time intervals. The ﬁgure shows data exemplar-

ily for one participant (vp 1). The top row shows the averaged neural response to a strong

degradation that was noticed behaviorally (T1 hit). The four early intervals represent sen-

sory processing of the stimulus (100–300ms post stimulus), which is reﬂected in a temporal

negativity above the auditory cortices. In contrast, the last three intervals can be assumed to

reﬂect cognitive processing (400–1000ms post stimulus). This elicits an occipital positivity,

commonly referred to as P3 component. This component is elicited as a neural reaction to

deviating stimuli in an oddball paradigm [125]. In our study, a P3 can be expected to occur

when a participant notices that the quality of a stimulus is degraded. Generally speaking, the

stronger the degradation, the higher the amplitude of the EEG signal, in particular that of the

P3 component. This eﬀect becomes obvious when comparing the ﬁrst two rows of the ﬁgure,


with a much weaker activation during late intervals for stimulus T2 (weak degradation) com-

pared to T1 (strong degradation). In contrast, the last row shows the neural processing of a

stimulus with a subtle degradation that is not noticed on a behavioral level (T3 miss). While

sensory processing still causes activity in the early intervals, there is no notable cognitive

component.

Exploratory Data-driven Analysis While the topography of the averaged ERPs seemed

to show a consistent picture so far, the presence of label noise becomes very obvious for the

stimulus at the threshold of perception (T2). As the ratings of participants become unreli-

able to the point of guessing, grouping according to behavioral labels becomes conspicuously

confounded, as can be seen in the second and third row of Figure 4.5. Even though the par-

ticipant gave diﬀerent ratings in these cases, the neural activation is strikingly similar. While

the presence of label noise is obvious for this stimulus, the labels of the other classes can be

expected to be confounded as well, just to a lesser degree. In the following, we will attempt

to infer the correct labels in a data-driven way by using a novel learning methodology, in

order to obtain an unbiased view of the EEG data.

We now turn to the details of the proposed supervised-unsupervised processing pipeline.

The motivation behind this approach: even though it may seem that other methods (e.g.

kernelized methods) could be more suitable for this problem, EEG data is well separable by

linear classiﬁcation (for a comparison of linear vs. nonlinear methods, cf. [126]). As discussed

previously, the missing ground truth compels us to rely solely on interpretability of the results

which can be achieved easily by applying linear and sparse methods. The inspection of the

results of applying step 1 shows that there is a high chance of ﬁnding trials confounded by

measurement noise (faulty electrodes) characterized by high amplitudes and/or drifts which

we denote as artifacts. Therefore, we deliberately force the method to exclude such examples

and search for other features by including the highest-ranked data points as outliers in a semi-

supervised manner. Typically we chose 5 examples of each end of the spectrum to explicitly

retain outlier labels (cf. Algorithm 4). We divide the trials into three classes: core, plateau and outlier

class. These classes occur naturally when applying the sparse one-class methods described

in the previous section. Examples belonging to the plateau class are orthogonal to the core

and outlier class and lie on the decision boundary. Hence, for division simple thresholding is

suﬃcient.

Algorithm 4 Processing Pipeline

1: Given $X = \{x_1, \ldots, x_n\}$, solve Eq. (4.7), preserving $w_1^*$

2: Calculate the anomaly score $f_1(x_i) = \langle w_1^*, x_i \rangle - 1$ for $i = 1, \ldots, n$

3: Select the subset $L_k \subseteq X$ with $|L_k| = k$ and $L_k = \{x_i \in X \mid |f_1(x_i)| \ge |f_1(x_j)| \;\; \forall x_j \in X \setminus L_k\}$, i.e. the $k$ examples with the largest absolute score

4: Semi-supervised learning with the elements of $L_k$ tagged as outliers, resulting in $w_2^*$

5: Again, calculate the anomaly score $f_2(x_i) = \langle w_2^*, x_i \rangle - 1$ for $i = 1, \ldots, n$

6: Select the most confident examples $S = \{x_i \in X \mid f_2(x_i) \ge 0, \; i = 1, \ldots, n\}$

7: Applying Eq. (4.7) to $S$ again returns the final solution $f_3(x) = \langle w_3^*, x \rangle$

8: Now, the sets $P_{\text{outlier}} = \{x_i \in X \mid f_3(x_i) < 0\}$, $P_{\text{plateau}} = \{x_i \in X \mid f_3(x_i) = 0\}$ and $P_{\text{core}} = \{x_i \in X \mid f_3(x_i) > 0\}$, $i = 1, \ldots, n$, can be analyzed
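The two selection steps of Algorithm 4 that do not involve solving Eq. (4.7) can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names, the toy scores and the tolerance `tol` for deciding that a score equals zero are assumptions made for the example.

```python
def top_k_by_abs_score(scores, k):
    """Step 3: indices of the k trials with the largest absolute score."""
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return sorted(order[:k])


def partition_by_score(scores, tol=1e-9):
    """Step 8: split trial indices into outlier / plateau / core by sign.

    `tol` is an assumed numerical tolerance for "score equals zero".
    """
    groups = {"outlier": [], "plateau": [], "core": []}
    for i, s in enumerate(scores):
        if s < -tol:
            groups["outlier"].append(i)
        elif s > tol:
            groups["core"].append(i)
        else:
            groups["plateau"].append(i)
    return groups


# Made-up anomaly scores for seven trials.
scores = [0.8, 0.0, -1.5, 0.3, 0.0, -0.2, 2.1]
candidates = top_k_by_abs_score(scores, k=2)   # trials to tag as outliers
groups = partition_by_score(scores)            # final three-way grouping
```

In floating-point arithmetic an exact zero is rare, so a small tolerance is one plausible way to realize the "simple thresholding" mentioned above.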

The feature inputs are based on the time series of the ERPs. We first reduced the dimensionality of the data (cf. [127]): we calculated the mean of the ERP signal within the seven neurophysiologically plausible intervals shown in Figure 4.5 (for each electrode and trial). For this, the EEG signal from 61 recorded electrodes was used (omitting the Fp and EO electrodes). Thus, the dimensionality of the data was reduced from 6100 (100 data points x 61 electrodes) to 427 features (7 data points x 61 electrodes). These features were then used as input for the processing pipeline.
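The interval averaging described above could look roughly as follows. The seven intervals are those of Figure 4.5; the 10 ms sample spacing, the half-open interval convention, the ordering of the features and all names are assumptions made for this sketch.

```python
# Seven intervals from Figure 4.5, in ms after stimulus onset.
INTERVALS_MS = [(100, 150), (150, 200), (200, 250), (250, 300),
                (400, 600), (600, 800), (800, 1000)]


def interval_means(epoch, times_ms, intervals=INTERVALS_MS):
    """Average each channel within each interval.

    epoch:    list of channels, each a list of samples (one trial)
    times_ms: time stamp of every sample in ms
    returns:  flat feature list of length len(intervals) * len(epoch)
    """
    features = []
    for lo, hi in intervals:
        # Samples falling into the half-open interval [lo, hi).
        idx = [t for t, ms in enumerate(times_ms) if lo <= ms < hi]
        for channel in epoch:
            features.append(sum(channel[t] for t in idx) / len(idx))
    return features


# Toy trial: 61 channels, 100 samples at an assumed 10 ms spacing (0..990 ms).
times_ms = [10 * t for t in range(100)]
epoch = [[float(c)] * len(times_ms) for c in range(61)]
features = interval_means(epoch, times_ms)   # 7 intervals x 61 channels
```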

The supervised-unsupervised learning approach groups the trials into three classes: a

core class, an outlier class and a plateau class. These three classes can be seen exemplarily

in Figure 4.6 for one participant (vp=1) and the stimulus at the threshold of perception (T2).

Again, the scalp distributions of ERPs are shown in the seven intervals that were also used as

input features. Remarkably, the core class (row 2) ﬁnds a very typical representation of hits

with distinct auditory processing (ﬁrst intervals) and a strong P3 component (last two inter-

vals), suggesting that the degradation was consciously processed. This pattern is subdued

in the plateau class (row 3), where the auditory cortices still show a strong activation, but

only a very subtle P3 is visible, indicating that the degradation was processed on a sensory

level, but not noticed by the participant. Finally, there is virtually no activation in early or

late components for the outlier class (row 4), suggesting at most subliminal processing of

the stimulus. This distinction is by far more cogent than that based on behavioral labels,

where two classes were assumed (hit/miss) that were obviously confounded (middle rows of

Figure 4.5). Not only does the algorithm ﬁnd plausible classes, it also does so on the basis

of neurophysiologically plausible features: As can be seen in the top row of the ﬁgure, the

active features reﬂect the bi-temporal neural activity in early processing stages (auditory)

and the occipital activity in late processing stages (cognitive).

[Figure 4.6: the top row shows the feature weights for the intervals 100–150, 150–200, 200–250, 250–300, 400–600, 600–800 and 800–1000 ms; the rows below show scalp maps for T2:core, T2:plat and T2:out in the same intervals; color scale in µV.]

Figure 4.6 – Top row: weights of features (ﬁlter) assigned in the last step of the Algorithm (4). Bottom

rows: scalp plots of the trials that are grouped into the core, plateau and outlier class (vp=1, T2).

Across all participants and stimuli, the trials grouped into the core class show a distinct

representation of how the stimulus is processed, including both sensory and cognitive com-

ponents (’neural hit’) or only sensory processing (’neural miss’). For obvious degradations

(C, T1), it is always the ’neural hit’ that is found, while the algorithm rather assigns ’neu-

ral misses’ to this class for subtle degradations. This is reasonable, as neural misses can be

assumed to be predominant in those classes (the same is true for hits). In almost all cases

(participants/stimuli), the outlier class represents trials that reﬂect a mental state other than

these clear hit/miss patterns. Mostly, these are trials with very subdued activation (60% of


trials show an amplitude lower than +/-5µV on average), which indicates that the stimulus

was processed at most at a subliminal level. Finally, the plateau class, where the EEG signal

is orthogonal to the features chosen by the algorithm, contains a cluster of trials that dif-

fer most widely among participants. These either reﬂect measurement noise or eye artifacts

(40%), a subdued pattern of neural hits/misses (30%) or a mental state other than that (20%).

Figure 4.7 summarizes these results based on visual inspection.

[Figure 4.7: grid over participants 1 to 11; legend: Neural Hit, Neural Miss, Artifact, Other.]

Figure 4.7 – Overview over all participants (x-axis) and stimuli (y-axis): neural pattern of core, plateau

and outlier classes (column 1-3), based on visual inspection.

The motivation behind our approach is to ﬁnd a coherent way to handle dependent label

noise that is composed of a mixture of random labels and accurate ones. Figure 4.8 provides

an insight into these ratios, as far as our approach can reveal them. First, it shows the

behavioral perception rate in black, i.e. the percentage of trials that were labeled as hits by

the participants. As can be seen, the perception rate is high (almost 100%) for stimuli C/T1

and then drops markedly for stimuli T2-4 (left to right). Underneath these values, the figure

shows which percentage of these behavioral hits are assigned to the core, plateau or outlier

class (ratios shown in gray, orange and white). This could be interpreted as the quantitative

mixture of random labels and accurate ones.

Figure 4.8 – Behavioral perception rate for all participants and target stimuli (C,T1-T4 from left to right),

with the ratio of how many of these trials are grouped into core, plateau and outlier class (gray, orange and

white box).


Application Outcome Analyzing EEG signals robustly, despite their high non-stationarity

(cf. [128–131]), their multimodal nature, and the obviously noisy signal characteristics [11],

is a major challenge that necessitates machine learning. However, in particular in complex

cognitive tasks, the behavioral ratings given by participants are often unreliable, thus intro-

ducing label noise. Although in practice, independent label noise can be handled by most

vanilla supervised learning algorithms, they can fail miserably in case of dependent label

noise. This set-up is rather common in behavioural experiments where a subject is required

to assess a given stimulus; in this work we have analyzed data from speech signal quality

judgements. Near perception threshold, the behavioral responses of subjects provide labels

that are noisy through a subjective assessment of the auditory signal. There are two reasons

for this: (a) the subjects guess, i.e. the labels are random, (b) a very weakly correlated percep-

tion of a change in audio signal quality is reported that gives rise to a faint structure in the

noisy labels. Computing the neural correlates of behaviour requires labels that reflect the task

as cleanly as possible. To achieve this, we proposed a novel supervised-unsupervised learning

procedure that first removes artifactual trials from the experiment and then infers which of

the remaining labels are reliable and which are random. Once these more reliable labels are

in place, a better and more meaningful experimental evaluation of the neural correlates in

our audio signal quality application can be performed. Moreover, our approach allows for

deﬁning groupings of trials that reﬂect more ﬁne-grained cognitive states. The interesting

point to note furthermore is that in this manner a neural correlate may occasionally be even

more sensitive than the conscious behavioural one.

4.4.2 Authorship Attribution

Automatically attributing a piece of text to its author is one of the oldest problems studied

in linguistics [132]. Despite being an old problem, authorship attribution is still highly topical,

and today's applications range from plagiarism detection [133], identifying the origin of

anonymous harassments in emails, blogs, and chat rooms [134] to copyright and estate issues

as well as resolving historical questions of disputed authorship [135,136].

Intrinsically, the goal of authorship detection is to identify the characteristic traits of an

author. The idea is that these traits distinguish 'her' from other authors in terms of writing

style, use of words, etc. Thus, prior work often focuses on designing and extracting features

from text to capture these traits. A great many features have been proposed for authorship

detection, including word or character n-grams [137,138], part-of-speech [139], probabilistic

context-free grammars [140], or linguistic features [141]. However, indicative features for

one author do not necessarily help to characterise another. A major problem in authorship

detection is therefore to ﬁnd the right set of features for a given task at hand [142].

Algorithmically, a variety of diﬀerent models have been studied in the context of author-

ship detection, ranging from probabilistic approaches [143] and similarity-based methods

[144] to vector space models [136,145]. The approaches either treat documents as indepen-

dent (instance-based) or concatenate documents by the same author (proﬁle-based). Intu-

itively, the latter is helpful if an author has a concise way of expressing herself so that the

concatenated document allows to extract a statistic that is suﬃcient for capturing her style.

On the other hand, instance-based approaches are better suited for expressive authors and

have advantages in sparse data scenarios.

Another aspect in authorship attribution is the application scenario of the ﬁnal model. In

transductive (in-sample) settings, the unlabeled documents of interest are already included

in the training process and the model does not necessarily perform well on new and unseen

texts. By contrast, inductive (out-of-sample) scenarios generally allow to learn models that

can be applied to any future text but require larger training samples to achieve accurate

performances.


Here, we employ the techniques developed in Section 4.3. This remedies the above

mentioned problems by fusing existing techniques: (i) We cast authorship attribution as an

anomaly detection problem where one model is learned for every author. The idea is to iden-

tify a concise region in feature space that contains (most of) the documents of the author of

interest while other documents are considered outliers. Thus, the model can be viewed as a

proﬁle-based approach in feature space while the data is treated on an instance-based level.

(ii) We remedy the in-sample / out-of-sample problem by using a semi-supervised extension

of the commonly unsupervised outlier detection framework. By doing so, we may include

authorship labels for the already known documents and leave the disputed ones unlabeled.

(iii) Finally, as the proposed approach is a member of the multiple kernel learning family,

it automatically includes a mathematically well-founded feature selection framework that

renders the method generally applicable. The optimal solution is given by a (possibly sparse)

linear mixture of kernel functions.

Empirically, we observe that our approach consistently outperforms baseline competitors

or conﬁrms common knowledge with respect to the authorship of disputed articles. The

main advantage of the method however lies in its simplicity. Practitioners do not need to

take critical design choices in terms of which features to use and which not. By contrast,

all features (kernels) can be used in the optimization and the method itself ﬁnds the optimal

combination for the problem at hand.

Related Work Authorship attribution using linguistic and stylistic features has a long tra-

dition and can be dated back to the nineteenth century. As a ﬁrst attempt, Mendenhall [132]

uses features based on word lengths to characterize the plays of Shakespeare. Later in the

ﬁrst half of the 20th century, diﬀerent textual statistics, such as Zipf’s distribution [146] and

Yule's k-statistic [147], have been proposed to quantify textual style. The study by Mosteller et

al. [135] is one of the most influential modern works in authorship attribution. They use a

Bayesian approach to analyze frequencies of a small set of function words. Until the late

1990s, research in stylometry has been dominated by feature engineering to quantify writing

style [148] and about 1,000 diﬀerent measures have been proposed [149].

Document representation is essential for author attribution tasks. Features aim to cap-

ture characteristic traits of authors that persist across topics. Traditional stylometric features

include function and high-frequency words, hapax legomena, Yule's k-statistic, syllable

distributions, sentence length, word length and word frequencies, vocabulary richness functions,

syntactic and analysis. Many studies combine features of diﬀerent types using multivariate

analyses. Some researchers use punctuation symbols while others experiment with n-grams

[150]. Grammatical style markers with natural language processing techniques are also used

to extract features from the documents.

Also in terms of technical approaches, authorship attribution has been studied with a

wide range of diﬀerent approaches. The deployed techniques can be broadly divided into

three categories: machine learning [150], multivariate/cluster analysis [151], and natural

language processing [152]. Principal components analysis (PCA) is one of the widely used

techniques for authorship studies, for instance, [153] apply PCA to identify the authorship of

unknown articles that have been attributed to Stephen Crane. In addition, machine learning-

based approaches, including support vector machines [150], are frequently used to discrimi-

nate documents by diﬀerent authors. An excellent survey on the diversity of approaches for

authorship detection is provided by [154].

Results on the Reuters 50-50 Data set We use a subset of the Reuters 50-50 data set² to evaluate the performance of the aforementioned approaches. The reduced data contains 1000 articles written by 10 authors: Aaron Pressman, Alan Crosby, Alexander Smith, Benjamin Kang Lim, Bernard Hickey, Brad Dorfman, Darren Schuettler, David Lawder, Edna Fernandes, and Eric Auchard.

² https://archive.ics.uci.edu/ml/datasets/Reuter_50_50

[Figure 4.9: ten panels (Classifier 1 to Classifier 10); x-axis: kernels Func, POS, Suf, BOW; y-axis ticks: 1.7783, 3.1623, 5.6234; color scale from 0.1 to 0.8.]

Figure 4.9 – Kernel mixture coefficients for the 10 classes

We split the data into training (90%) and test (10%) sets and conduct a 10-fold cross-

validation on the training set for model selection. The best performing models are then eval-

uated on the test set. We compare the performance of our approach with diﬀerent p-norms

to the SSAD which uses one kernel at a time. We use p∈ {1,1.7783,3.1623,5.6234,10}.

Note that p= 10 approximates the sum-kernel that would result by simply adding-up the

four kernels.
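The intermediate values of p look arbitrary at first sight, but they coincide with a grid that is evenly spaced on a logarithmic scale, p = 10^(k/4) for k = 0, ..., 4. This reading is an observation about the numbers rather than something stated in the text; a short sketch reproduces the grid:

```python
# Five values spaced evenly on a log10 scale between 10**0 and 10**1,
# rounded to four decimals as in the text.
p_grid = [round(10 ** (k / 4), 4) for k in range(5)]
```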

The results in terms of averaged micro- and macro-F1 measures are shown in Table 4.1.

MKL consistently outperforms the single-kernel baseline for all p-norms. That is, instead

of extensively experimenting with SSAD and diﬀerent kernel functions and parameter se-

lections, a single run with our MKL already leads to better performances in both metrics.

Figure 4.9 visualizes the resulting mixing coeﬃcients for the 10 authors/classiﬁers. While

the models are very similar at ﬁrst sight, small deviations indicate diﬀerences in the style of

the authors.

Table 4.1 – F-scores for the subset of Reuters 50-50

          p-norm MKL                                  SSAD
          1      1.7783  3.1623  5.6234  10      func-word  POS    Suffix-3  BOW
F_micro   73.46  73.08   73.84   73.89   74.23   63.08      54.62  70.01     72.85
F_macro   79.23  78.86   79.63   79.76   80.07   68.66      58.03  74.01     78.09
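For readers unfamiliar with the two averages reported in Table 4.1, the following sketch shows one common way to compute them from per-class counts. It assumes the standard definitions (macro: mean of per-class F1; micro: F1 of the pooled counts); the text does not spell out which variant was used, and the counts below are made up.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: one (tp, fp, fn) triple per class (here: per author model)."""
    # Macro: average the per-class F1 scores, every class weighs the same.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro: pool the counts over all classes first, then compute one F1.
    tp, fp, fn = (sum(col) for col in zip(*counts))
    micro = f1(tp, fp, fn)
    return micro, macro


# Made-up counts for three classes, purely for illustration.
counts = [(8, 2, 2), (5, 5, 0), (1, 0, 9)]
micro, macro = micro_macro_f1(counts)
```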

Revisiting the Federalist Papers The Federalist Papers are a series of 85 articles and es-

says written during 1787–1788. They were published anonymously to persuade the citizens

of the State of New York to ratify the Constitution. Later, these papers were credited to

Alexander Hamilton, John Jay, and James Madison; 73 of the documents are uniquely associ-

ated with one of the three authors while the remaining 12, also known as the disputed papers,

have been claimed by both Hamilton and Madison. Three of the 73 articles are considered

joint work by Hamilton and Madison. Previous studies often assign all 12 disputed papers to

Madison which we assume as ground-truth in the remainder [135,136].


To confirm or refute these previous findings, we conduct an experiment using the following

four kernels as document representation: the first kernel is made of 484 function

words taken from [155], the second contains part-of-speech (POS) tags, the third is assem-

bled by 3-letter suﬃxes, the last one simply a bag-of-words (BOW) kernel. We compare the

performance of our approach (MKL) with semi-supervised anomaly detection (SSAD) [12].

As before, the baseline cannot use all kernels at once and is evaluated on every kernel

separately. For simplicity, we show only the MKL results for parameter p= 2 as all other

p-norms that we tried out lead to the same result.

We randomly divide the undisputed papers into training (80%) and holdout (20%) and use

the 12 disputed papers for testing. We make sure that training sets contain at least three ex-

amples of every author and two articles written jointly by Hamilton and Madison. Otherwise

we discard and draw again. We repeat experiments ﬁve times with randomly drawn training

and holdout sets and report on averaged accuracies for the disputed test set.
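The "discard and draw again" procedure is a simple rejection sampling scheme. The sketch below illustrates it under stated assumptions: the per-author label counts are hypothetical placeholders chosen only so that the example runs, and the function name, seed and retry limit are not taken from the original experiments.

```python
import random
from collections import Counter


def constrained_split(labels, train_frac=0.8, min_counts=None, seed=0, max_tries=1000):
    """Draw random train/holdout index splits until the training part
    contains at least min_counts[label] examples of every listed label."""
    rng = random.Random(seed)
    n_train = round(train_frac * len(labels))
    indices = list(range(len(labels)))
    for _ in range(max_tries):
        rng.shuffle(indices)
        train, holdout = indices[:n_train], indices[n_train:]
        seen = Counter(labels[i] for i in train)
        if all(seen[lab] >= need for lab, need in min_counts.items()):
            return sorted(train), sorted(holdout)
    raise RuntimeError("no admissible split found")


# Hypothetical label counts for the 73 undisputed papers (placeholders only).
labels = ["H"] * 50 + ["M"] * 15 + ["J"] * 5 + ["H&M"] * 3
train, holdout = constrained_split(
    labels, min_counts={"H": 3, "M": 3, "J": 3, "H&M": 2})
```

Repeating the call with five different seeds would correspond to the five repetitions described above.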

Table 4.2 – Results for the disputed articles of the Federalist papers.

method  kernel    H&M  M   J  H
MKL     (all)     0    12  0  0
SSAD    484fw     0    12  0  0
SSAD    POS       9    0   3  0
SSAD    Suffix3   0    12  0  0
SSAD    BoW       0    0   0  12

The results are shown in Table 4.2. The one-class SVM and SVDD consistently credit the 12 disputed articles as joint work by Hamilton and Madison. The outcome of SSAD highly depends on the kernel function: while the part-of-speech kernel distributes the papers between Jay (3) and Hamilton and Madison jointly (9), the bag-of-words kernel assigns all documents to Hamilton. By contrast, SSAD with the function-word and suffix-3 kernels attributes the articles to Madison. The same outcome is observed for our novel MKL, which also credits the 12 papers to Madison. Thus, MKL and SSAD with the function-word and suffix-3 kernels confirm today's assumption that all 12 papers have been written by Madison. However, choosing SSAD as the base classifier in the absence of prior knowledge leaves much room for interpretation and leaves the user in need of deciding between three solutions, depending on which kernel she prefers. By using our MKL, selecting features and/or kernel functions is no longer necessary as the learning algorithm itself picks the right combination of kernels for the problem at hand. Thus, the more kernels are being used, the richer the decision space for the MKL.

Application Outcome Our empirical results show the robustness of our approach as it

consistently outperforms baseline competitors on a subset of Reuters 50-50 or confirms common

knowledge with respect to the authorship of disputed articles of the Federalist Papers. The main

advantage of the method however lies in its simplicity. Practitioners do not need to take crit-

ical design choices in terms of which features to use and which not. By contrast, all features

(kernels) can be used in the optimization and the method itself ﬁnds the optimal combination

for the problem at hand.

4.5 Summary and Discussion

In this chapter, we introduced two techniques that allow a smooth transition from dense

solutions towards sparse solutions for unsupervised and semi-supervised one-class support

vector machines based on ℓp-norm regularization. While in Section 4.2 we focused on con-

trolling individual features, in Section 4.3 we turned our attention to groups of features.

Further, we proposed corresponding optimization schemes and, in the case of individual features and p = 1, a highly optimized linear program formulation. We applied the proposed techniques to applications in EEG-BCI as well as authorship attribution where, in both cases, domain experts applied and analyzed the proposed methods.

While the results indicate that both methods work reasonably well in their respective applications, neither of them is perfect and there are a number of drawbacks that we discuss here.

Limits of ℓp-norm regularized OC-SVM In Section 4.2, the proposed solver based on the sub-gradient descent algorithm requires a number of parameters, including the choice of the step lengths. For diminishing and constant step lengths the algorithm has been shown to converge. In practice, however, this solver does not work very reliably and needs to be adjusted manually in order to converge in reasonable time or with reasonable error. More advanced methods based on proximal algorithms [49] are likely to be much faster and to require less manual tuning. Furthermore, the choice of p, besides p = 1 and p = 2, has no intrinsic rationale and can only be tested on a hold-out data set. Moreover, when a parsimonious solution is the goal, we would actually like to solve the respective problem using ℓ0-(pseudo)norm regularization. However, for p < 1 the optimization problem becomes non-convex and is no longer guaranteed to give meaningful results. Finally, one of the most important properties of the one-class SVM is the ability to employ kernels. This flexibility is unfortunately lost in our approach.

Limits of ℓp-norm MKL SSAD The proposed approach iterates between ﬁnding the op-

timal weighting between kernels and solving the SSAD optimization problem which is com-

putationally demanding especially when a large number of kernels is employed. However,

ﬁnding the optimal weighting for multiple kernels can also be attempted in a heuristic fash-

ion or using cross-validation techniques (in case of a smaller number of kernels). Moreover,

it seems that a uniform combination of kernels is a reasonable choice as indicated in Ta-

ble 4.1. However, the proposed MKL approach is a very systematic approach to incorporate,

e.g. continuous and discrete features into a single feature description.

Source code and resources for the proposed methods are available on GitHub (https://github.com/nicococo/tilitools). Parts of this chapter are based on:

Görnitz, N., Kloft, M., Rieck, K., Brefeld, U., "Toward Supervised Anomaly Detection", Journal of Artificial Intelligence Research (JAIR), vol. 46, pp. 235–262, 2013

Porbadnigk, A., Görnitz, N., Kloft, M., Müller, K.-R., "Decoding Brain States during Auditory Perception by Supervising Unsupervised Learning", Journal of Computing Science and Engineering (JCSE), vol. 7, no. 2, pp. 112–121, 2013

Nasir, J. A., Görnitz, N., Brefeld, U., "An Off-the-shelf Approach to Authorship Attribution", in International Conference on Computational Linguistics (COLING), 2014, pp. 895–904

III Collective Anomalies

Chapter 5

Learning with Structured Data

5.1 Preliminaries

5.2 Large-scale Structured Output Learning

5.3 Latent Structure Anomaly Detection

5.4 Evaluation and Applications

5.4.1 Transcript Identification for Eucaryotic Organisms

5.4.2 Hidden Markov Anomaly Detection

5.5 Summary and Discussion

Structured data is ubiquitous: time series, graphs such as social networks, or trees for dependency parsing. Learning with that kind of data is a challenge that every machine learner faces on a daily basis. However, there is a massive discrepancy between predicting structured outputs and using structured input.

Learning with structured input is a common setting and generally boils down to selecting an appropriate feature embedding for the structures in question. E.g. sequence data might be

embedded using n-gram techniques or a sliding window approach [156]. Once features are

ﬁxed, standard techniques, such as binary support vector machines, can be applied without

any changes.

[Figure 5.1: chain of latent labels y1, y2, y3, y4, each connected to its observation x1, x2, x3, x4.]

Figure 5.1 – A hidden Markov model as undirected graphical model (Markov random field, MRF). A sequence of observations x (grey circles) is coupled with corresponding latent labels y.

Structured output methods, on the other hand, make complex predictions on groups or collections of data points while exploiting structural information of the input data. Complex predictions might include sequence, tree, or graph-like outputs and hence multiple interdependent variables (cf. Figure 5.1). This is in stark contrast to standard prediction methods that output single labels, e.g. classification or regression methods. The most prominent approaches include structured output support vector machines (SSVMs) [157], conditional random fields (CRFs) [158], and structured perceptrons [159], and have been applied to applications ranging from computational biology [27,160] to speech recognition [161]. A comprehensive overview on the prediction of structured data is given in Bakir, Hofmann, Schölkopf, Smola, Taskar, and Vishwanathan [162].

Inferring interdependent labels ties a group of input data points together and provides a

rich information source that would be well suited for collective (group) anomaly detection

problems. However, such methods are generally supervised and require extensive labeling


on a very ﬁne-grained level which is prohibitive in anomaly detection settings. Furthermore,

they are notoriously hard to solve and only applicable for smaller and medium sized problems.

In this chapter, we derive an unsupervised anomaly detection method based on the structured

output learning paradigm that is able to reliably spot anomalous groups of data points

when those exhibit discrete label dependency structure (Section 5.3). Further, we introduce

the supervised structured output learning paradigm and derive an eﬃcient optimization

method that enables large-scale learning (Section 5.2). Empirical evaluations on challeng-

ing computational biology tasks are presented in Section 5.4. To further diﬀerentiate both

scenarios, we will use z∈ Z instead of y∈ Y in the unsupervised setting.

5.1 Preliminaries

In the supervised learning case, we assume that ground truth label structures and corresponding observations $\{(x_1, y_1), \ldots, (x_n, y_n)\} \subset \mathcal{X} \times \mathcal{Y}$ are given for training purposes and hence to adjust the parameter vector $w \in \mathcal{H}$. Furthermore, we are interested in finding a predictor that is able to output the corresponding label structure $\hat{y} \in \mathcal{Y}$ given the observations $x \in \mathcal{X}$ and a joint feature map $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ that captures the dependencies between input and output (e.g., [157,159]):

$$\hat{y}(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} \langle w, \Psi(x, y) \rangle. \tag{5.1}$$

While this approach can be applied to very general structures, e.g. sequences, trees, and

graphs, in this chapter, a special focus is set on large-scale sequence learning. That is, we

assume that each data entry x∈ X contains a large number of samples that form a hidden

Markov sequence. An example of such a model is given in Figure 5.1. Note that we are given

a set of such observations x∈ X of variable length.
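For a chain-structured model such as the one in Figure 5.1, the argmax in Eq. (5.1) does not require enumerating all label sequences; it can be computed by dynamic programming (the Viterbi algorithm). The sketch below assumes that the score decomposes into per-position emission weights and pairwise transition weights. This decomposition, the toy weights and all names are assumptions for illustration and are not taken from the thesis code.

```python
from itertools import product


def score(x, y, emit, trans):
    """<w, Psi(x, y)> for a first-order chain: emission plus transition weights."""
    s = sum(emit[yt][xt] for xt, yt in zip(x, y))
    s += sum(trans[a][b] for a, b in zip(y, y[1:]))
    return s


def viterbi(x, emit, trans):
    """argmax_y <w, Psi(x, y)> by dynamic programming in O(T * S^2)."""
    n_states = len(emit)
    best = [emit[s][x[0]] for s in range(n_states)]   # best score ending in s
    back = []                                         # back-pointers per step
    for xt in x[1:]:
        prev, best, ptr = best, [], []
        for s in range(n_states):
            p = max(range(n_states), key=lambda q: prev[q] + trans[q][s])
            ptr.append(p)
            best.append(prev[p] + trans[p][s] + emit[s][xt])
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    y = [max(range(n_states), key=lambda s: best[s])]
    for ptr in reversed(back):
        y.append(ptr[y[-1]])
    return y[::-1]


# Toy weights (assumed): 2 states, 3 observation symbols.
emit = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.5]]
trans = [[0.5, -0.5], [-0.5, 0.5]]
x = [0, 1, 2, 2, 0]

y_hat = viterbi(x, emit, trans)
# Brute-force check over all 2**5 label sequences (feasible only for toy sizes).
y_brute = max(product(range(2), repeat=len(x)),
              key=lambda y: score(x, y, emit, trans))
```

Both routes reach the same maximal score; if several sequences tie, they may return different maximizers.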

In the unsupervised anomaly detection setting, we will employ the same sort of technique but without given labels for training. That is, we are given only the observations $\{x_1, \ldots, x_n\} \subset \mathcal{X}$ and the desired output is, as is standard in anomaly detection, an anomaly score for each observation $x$ (each consisting of a group of data points). However, the proposed method, the latent structure anomaly detector, is built atop the structured output principle (cf. Eq. (5.1)).

5.2 Large-scale Structured Output Learning

In contrast to binary classification, elements from the output space $\mathcal{Y}$ (e.g., sequences, trees, or graphs) of structured output problems have an inherent structure, which makes more sophisticated, problem-specific loss functions desirable. The loss between the true label $y \in \mathcal{Y}$ and the predicted label $\hat{y} \in \mathcal{Y}$ is measured by a loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$. A widely used approach to predict $\hat{y} \in \mathcal{Y}$ is the use of a linearly parametrized model as given in Eq. (5.1). Generally, we are interested in finding the predictor with minimum risk given the data distribution $P$,

\[
R(\hat{y}) = \int_{\mathcal{X} \times \mathcal{Y}} \Delta(y, \hat{y}(x)) \, dP(x, y).
\]

The most common approaches to estimate the model parameters $w$ are based on structured output SVMs (e.g., [157,163]) and conditional random fields (e.g., [158]; see also [164]). Here we follow the approach taken in [157,160], where estimating the parameter vector $w$ using the margin rescaling variant amounts to solving the following optimization problem

\[
\begin{aligned}
\min_{w \in \mathcal{H},\, \xi \ge 0} \quad & \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & \langle w, \Psi(x_i, y_i) - \Psi(x_i, \bar{y}) \rangle \ge \Delta(y_i, \bar{y}) - \xi_i \quad \forall i,\ \bar{y} \in \mathcal{Y}.
\end{aligned}
\tag{5.2}
\]

Instead of taking all possible configurations $\bar{y} \in \mathcal{Y}$ into account (which would be computationally infeasible even for very small problems), a standard method for optimization iteratively computes the maximum violator $\max_{\bar{y} \in \mathcal{Y}} \Delta(y_i, \bar{y}) + \langle w, \Psi(x_i, \bar{y}) \rangle - \langle w, \Psi(x_i, y_i) \rangle$ based on the intermediate solution for $w \in \mathcal{H}$ and then solves the resulting optimization problem. This way, a new constraint is generated per iteration for each example until convergence (cf. Algorithm 5). These kinds of optimization algorithms are referred to as column generation. For our purposes, we reformulate the above problem as an unconstrained optimization problem and replace the loss function as well as the regularizer by placeholders:

\[
\min_{w \in \mathcal{H}} \; R(w) + C \sum_{i=1}^{n} \ell\Big( \max_{\bar{y} \in \mathcal{Y}} \big\{ \langle w, \Psi(x_i, \bar{y}) \rangle + \Delta(y_i, \bar{y}) \big\} - \langle w, \Psi(x_i, y_i) \rangle \Big), \tag{5.3}
\]

where $\ell$ is the loss function and $R(w)$ the regularizer. For $\ell(a) = \max(0, a)$ and $R(w) = \frac{1}{2}\|w\|_2^2$ we obtain the structured output support vector machine with margin rescaling and hinge loss as shown in Eq. (5.2).
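To make the margin-rescaled hinge loss concrete, the following minimal Python sketch evaluates one summand of Eq. (5.3) with $\ell(a) = \max(0, a)$ on a toy sequence by enumerating all labelings. The Hamming loss used for $\Delta$, the per-label toy feature map, and the names `hamming`, `joint_feature`, and `structured_hinge` are illustrative assumptions and not part of the method described here; brute-force enumeration is only feasible for tiny output spaces.

```python
from itertools import product


def hamming(y, y_bar):
    """Delta(y, y_bar): number of positions where two label sequences differ."""
    return sum(a != b for a, b in zip(y, y_bar))


def joint_feature(x, y, n_labels):
    """Toy joint feature map (assumption): per-label sums of scalar observations."""
    psi = [0.0] * n_labels
    for x_t, y_t in zip(x, y):
        psi[y_t] += x_t
    return psi


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def structured_hinge(w, x, y, n_labels):
    """max(0, max_ybar {<w, Psi(x, ybar)> + Delta(y, ybar)} - <w, Psi(x, y)>)."""
    # Loss-augmented inference by exhaustive enumeration of all labelings.
    best = max(
        dot(w, joint_feature(x, y_bar, n_labels)) + hamming(y, y_bar)
        for y_bar in product(range(n_labels), repeat=len(x))
    )
    return max(0.0, best - dot(w, joint_feature(x, y, n_labels)))


x, y = [1.0, -1.0], (1, 0)
loss_separated = structured_hinge([-2.0, 2.0], x, y, n_labels=2)
loss_zero_weights = structured_hinge([0.0, 0.0], x, y, n_labels=2)
```

With the zero weight vector the loss equals the largest possible Hamming distance, while a weight vector that separates the true labeling from all others by a sufficient margin incurs no loss.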

In [26] the authors propose a hierarchical multitask structured output method that showed promising results in computational biology tasks. It turns out that we can now combine the structured output formulation with hierarchical multitask learning in a straightforward way. We extend the regularizer $R(w)$ in (5.3) with a $\gamma$-parametrized convex combination of a multitask regularizer $\frac{1}{2}\|w - w_p\|_2^2$ with the original term. When $R(w) = \frac{1}{2}\|w\|_2^2$ and omitting constant terms, we arrive at $R_{p,\gamma}(w) = \frac{1}{2}\|w\|_2^2 - \gamma \langle w, w_p \rangle$. Thus we can apply the mentioned hierarchical multitask learning approach and solve for every node the following optimization problem:

\[
\min_{w \in \mathcal{H}} \; R_{p,\gamma}(w) + C \sum_{i=1}^{n} \ell\Big( \max_{\bar{y} \in \mathcal{Y}} \big\{ \langle w, \Psi(x_i, \bar{y}) \rangle + \Delta(y_i, \bar{y}) \big\} - \langle w, \Psi(x_i, y_i) \rangle \Big). \tag{5.4}
\]
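The step from the convex combination to $R_{p,\gamma}$ can be spelled out as follows. The weighting of the original term by $(1-\gamma)$ and of the multitask term by $\gamma$, with $\gamma \in [0, 1]$, is an assumption that is consistent with the stated result:

```latex
\begin{aligned}
(1-\gamma)\,\tfrac{1}{2}\|w\|_2^2 + \gamma\,\tfrac{1}{2}\|w - w_p\|_2^2
  &= (1-\gamma)\,\tfrac{1}{2}\|w\|_2^2
     + \gamma\,\tfrac{1}{2}\big(\|w\|_2^2 - 2\langle w, w_p\rangle + \|w_p\|_2^2\big) \\
  &= \tfrac{1}{2}\|w\|_2^2 - \gamma\,\langle w, w_p\rangle
     + \underbrace{\tfrac{\gamma}{2}\|w_p\|_2^2}_{\text{constant in } w}.
\end{aligned}
```

Dropping the last term, which does not depend on $w$, leaves $R_{p,\gamma}(w) = \frac{1}{2}\|w\|_2^2 - \gamma \langle w, w_p \rangle$.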

A major diﬃculty remains: solving the resulting optimization problems which now can

become considerably larger than for the single-task case.

Column generation techniques often converge slowly. Moreover, the size of the restricted

optimization problems grows steadily and solving them becomes more expensive in each

iteration. Simple gradient descent or second-order methods cannot be directly applied as

alternatives, because (5.4) is continuous but non-smooth. Our approach is instead based on

bundle methods for regularized risk minimization as proposed in [165,166] and [167]. In case

of SVMs, this further relates to the OCAS method introduced in [168]. In order to achieve

fast convergence, we use a variant of these methods adapted to structured output learning

that is suitable for hierarchical multitask learning.

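We consider the objective function $J(w) = R_{p,\gamma}(w) + L(w)$, where

\[
L(w) := C \sum_{i=1}^{n} \ell\Big( \max_{\bar{y} \in \mathcal{Y}} \big\{ \langle w, \Psi(x_i, \bar{y}) \rangle + \Delta(y_i, \bar{y}) \big\} - \langle w, \Psi(x_i, y_i) \rangle \Big).
\]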

Many machine learning methods adjust their parameters by directly minimizing the empirical loss. For structured output support vector machines, in contrast, evaluating the empirical loss $L(w)$ implies invoking the embedded argmax function. We argue that this is the most expensive step in general, so that otherwise effective methods such as line search become prohibitively expensive in practice. Instead, we propose to optimize an estimate of the empirical loss, $\hat{L}(w)$, which can be computed efficiently. We define the estimated empirical loss $\hat{L}(w)$ as

\[
\hat{L}(w) := C \sum_{i=1}^{n} \ell\Big( \max_{(\Psi, \Delta) \in \Gamma_i} \big\{ \langle w, \Psi \rangle + \Delta \big\} - \langle w, \Psi(x_i, y_i) \rangle \Big).
\]

Accordingly, we define the estimated objective function as $\hat{J}(w) = R_{p,\gamma}(w) + \hat{L}(w)$. It is easy to verify that $J(w) \ge \hat{J}(w)$. Each $\Gamma_i$ is a set of pairs $(\Psi(x_i, y), \Delta(y_i, y))$ defined by a suitably chosen, growing subset of $\mathcal{Y}$, such that $\hat{L}(w) \to L(w)$ (cf. Algorithm 6).

Algorithm 5 Column Generation Method for Structured Output Learning
  $w^{(1)} = w_p$
  $k = 1$ and $\Gamma_i = \emptyset \ \forall i$
  repeat
    for $i = 1, \dots, n$ do
      $y^* = \operatorname{argmax}_{y \in \mathcal{Y}} \{ \langle w^{(k)}, \Psi(x_i, y) \rangle + \Delta(y_i, y) \}$
      if $\langle w^{(k)}, \Psi(x_i, y^*) \rangle + \Delta(y_i, y^*) > \max_{(\Psi, \Delta) \in \Gamma_i} \{ \langle w^{(k)}, \Psi \rangle + \Delta \}$ then
        $\Gamma_i = \Gamma_i \cup \{ (\Psi(x_i, y^*), \Delta(y_i, y^*)) \}$
      end if
    end for
    $w^{(k+1)} = \operatorname{argmin}_{w \in \mathcal{H}} \; R_{p,\gamma}(w) + C \sum_{i=1}^{n} \ell\big( \max_{(\Psi, \Delta) \in \Gamma_i} \{ \langle w, \Psi \rangle + \Delta \} - \langle w, \Psi(x_i, y_i) \rangle \big)$
    $k = k + 1$
  until no changes in $\Gamma_i \ \forall i$

Figure 5.2 – Convergence behavior of the proposed method (red, cf. Alg. (6)) against the standard column generation approach (blue, cf. Alg. (5)). [Plot omitted: objective value (logarithmic axis, $10^0$ to $10^8$) over iterations 5 to 45; curves: Bundle Method Upper Bound, Bundle Method Lower Bound, Original OP Upper Bound, Original OP Lower Bound, Target.]

In general, bundle methods are extensions of cutting plane methods that employ a proximal operator [49] to stabilize the solution of the approximated function. In the framework of regularized risk minimization, a natural input to the proximal operator is given by the regularizer. We apply this approach to the objective $\hat{J}(w)$ and solve

\[
\min_{w \in \mathcal{H}} \; R_{p,\gamma}(w) + \max_{i \in I} \{ \langle a_i, w \rangle + b_i \}, \tag{5.5}
\]

where the set of cutting planes $(a_i, b_i)$ lower bounds $\hat{L}$. As proposed in [166,167], we use a set $I$ of limited size. Moreover, we calculate an aggregation cutting plane $(\bar{a}, \bar{b})$ that lower bounds the estimated empirical loss $\hat{L}$. To be able to solve the primal optimization problem in (5.5) in the dual space as proposed by [166,167], we adopt an elegant strategy described in [167] to obtain the aggregated cutting plane $(\bar{a}', \bar{b}')$ using the dual solution $\alpha$ of (5.5):

\[
\bar{a}' = \sum_{i \in I} \alpha_i a_i \quad \text{and} \quad \bar{b}' = \sum_{i \in I} \alpha_i b_i. \tag{5.6}
\]


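The following two formulations reach the same minimum when optimized with respect to $w$:

\[
\min_{w \in \mathcal{H}} \; R_{p,\gamma}(w) + \max_{i \in I} \{ \langle a_i, w \rangle + b_i \} \;=\; \min_{w \in \mathcal{H}} \big\{ R_{p,\gamma}(w) + \langle \bar{a}', w \rangle + \bar{b}' \big\}.
\]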

This new aggregated plane can be used as an additional cutting plane in the next iteration

step. We therefore have a monotonically increasing lower bound on the estimated empirical

loss and can remove previously generated cutting planes without compromising convergence

(see [167] for details).
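The lower-bounding property of cutting planes and of their aggregation can be illustrated on a one-dimensional toy loss. The sketch below is a minimal illustration under stated assumptions: the loss $\max(0, 1 - w)$ stands in for $\hat{L}$, the function names are hypothetical, and the weights passed to `aggregate` are arbitrary convex weights chosen for illustration rather than an actual dual solution of (5.5).

```python
def hinge(w):
    """Convex, non-smooth toy loss max(0, 1 - w) in one dimension."""
    return max(0.0, 1.0 - w)


def hinge_subgradient(w):
    """One valid subgradient of the toy loss at w."""
    return -1.0 if w < 1.0 else 0.0


def cutting_plane(w_k):
    """Plane (a, b) with a * w + b <= hinge(w) everywhere and equality at w_k."""
    a = hinge_subgradient(w_k)
    b = hinge(w_k) - a * w_k
    return a, b


def model(planes, w):
    """Piecewise-linear lower bound max_i {a_i * w + b_i}."""
    return max(a * w + b for a, b in planes)


def aggregate(planes, alphas):
    """Aggregated plane (sum_i alpha_i a_i, sum_i alpha_i b_i), cf. Eq. (5.6)."""
    a_bar = sum(al * a for al, (a, _) in zip(alphas, planes))
    b_bar = sum(al * b for al, (_, b) in zip(alphas, planes))
    return a_bar, b_bar


planes = [cutting_plane(w_k) for w_k in (-1.0, 0.5, 2.0)]
a_bar, b_bar = aggregate(planes, [0.2, 0.3, 0.5])
grid = [i / 10.0 for i in range(-30, 31)]
```

Because the aggregated plane is a convex combination of planes that each lower bound the loss, it is itself a lower bound and never exceeds the piecewise-linear model, which is why it can replace discarded planes.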

The algorithm is able to handle any (non-)smooth convex loss function ℓ, since only the

subgradient needs to be computed. This can be done eﬃciently for the hinge-loss, squared

hinge-loss, Huber-loss, and logistic-loss.

The resulting optimization algorithm is outlined in Algorithm 6. There are several improvements possible: For instance, one can bypass updating the empirical risk estimates in line 6 when $L(w^{(k)}) - \hat{L}(w^{(k)}) \le \epsilon$. Finally, while Algorithm 6 is formulated in primal space, it is easy to reformulate it in dual variables, making it independent of the dimensionality of $w \in \mathcal{H}$.

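Algorithm 6 Bundle Methods for Structured Output Algorithm
  $S \ge 1$: maximal size of the bundle set
  $\theta > 0$: line-search trade-off (cf. [168] for details)
  $w^{(1)} = w_p$
  $k = 1$ and $\bar{a} = 0$, $\bar{b} = 0$, $\Gamma_i = \emptyset \ \forall i$
  repeat
    for $i = 1, \dots, n$ do
      $y^* = \operatorname{argmax}_{y \in \mathcal{Y}} \{ \langle w^{(k)}, \Psi(x_i, y) \rangle + \Delta(y_i, y) \}$
      if $\ell\big( \max_{y \in \mathcal{Y}} \{ \langle w^{(k)}, \Psi(x_i, y) \rangle + \Delta(y_i, y) \} \big) > \ell\big( \max_{(\Psi, \Delta) \in \Gamma_i} \{ \langle w^{(k)}, \Psi \rangle + \Delta \} \big)$ then
        $\Gamma_i = \Gamma_i \cup \{ (\Psi(x_i, y^*), \Delta(y_i, y^*)) \}$
      end if
      Compute $a_k \in \partial_w \hat{L}(w^{(k)})$
      Compute $b_k = \hat{L}(w^{(k)}) - \langle w^{(k)}, a_k \rangle$
      $w^* = \operatorname{argmin}_{w \in \mathcal{H}} \; R_{p,\gamma}(w) + \max\big( \max_{(k-S)_+ < i \le k} \{ \langle a_i, w \rangle + b_i \},\ \langle \bar{a}, w \rangle + \bar{b} \big)$
      Update $\bar{a}, \bar{b}$ according to (5.6)
      $\eta^* = \operatorname{argmin}_{\eta \in \mathbb{R}} \hat{J}(w^* + \eta (w^* - w^{(k)}))$
      $w^{(k+1)} = (1 - \theta) w^* + \theta \eta^* (w^* - w^{(k)})$
      $k = k + 1$
    end for
  until $L(w^{(k)}) - \hat{L}(w^{(k)}) \le \epsilon$ and $J(w^{(k)}) - J_k(w^{(k)}) \le \epsilon$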

An illustration of the convergence behavior of the proposed bundle method is shown in

Fig. 5.2. Here, it can be clearly seen that column generation needs many more iterations and

hence, much more time in order to converge.

5.3 Latent Structure Anomaly Detection

Building upon the previously discussed structured output methods, we will now exploit these techniques for unsupervised collective anomaly detection. As always, we start with the standard formulation of the one-class SVM,

\[
\begin{aligned}
\min_{w, \rho, \xi} \quad & \frac{1}{2}\|w\|^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho \\
\text{s.t.} \quad & \langle w, x_i \rangle \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \forall i \in \{1, \dots, n\},
\end{aligned}
\tag{5.7}
\]

and take a closer look at the constraints. Here, the value of the linear function is forced to stay above the threshold $\rho$ for the bulk of the data. If we abstract from the linear function and allow arbitrary scoring functions $s_w : \mathcal{X} \to \mathbb{R}$, the constraints change to $s_w(x_i) \ge \rho - \xi_i$. Assume that an adequate (parameterized) model $p_w(x)$ of the probability of a given sample $x$ is available, and set $s_w(x) = \log p_w(x)$. As a result, the one-class SVM picks high-scoring, high-probability samples as nominal and low-scoring, low-probability samples as anomalies, which is exactly as intended. Extending the idea to structures $z \in \mathcal{Z}$ gives

\[
p_w(x, z) = p_w(x \mid z)\, p(z) = \frac{1}{Z(z)} \exp(\langle w, \Psi(x, z) \rangle) \cdot \eta \exp(\delta(z)), \tag{5.8}
\]

where we assume log-linear models for the conditional probability $p(x \mid z)$ and some Gaussian prior over pre-defined penalties $\delta : \mathcal{Z} \to \mathbb{R}_-$ with corresponding normalizations $\eta$. Of course, the structures are supposed to be unknown, and a standard way of dealing with this is to marginalize over all possible configurations of $z \in \mathcal{Z}$. However, we argue that maximum a posteriori estimation will be beneficial with respect to computational effort. Hence, taking the log and discarding unnecessary terms, we end up with the final scoring function

\[
s_w(x) = \max_{z \in \mathcal{Z}} \langle w, \Psi(x, z) \rangle + \delta(z). \tag{5.9}
\]

In the problem setting of latent structure anomaly detection, we extend the expressiveness of the one-class SVM as given in Eq. (5.7) by considering models of the form $f_{w,\rho}(x) = \max_{z \in \mathcal{Z}} \langle w, \Psi(x, z) \rangle + \delta(z) - \rho$, where $\Psi : \mathcal{X} \times \mathcal{Z} \to \mathcal{H}$ is a joint feature map into a reproducing kernel Hilbert space $\mathcal{H}$ that corresponds to a kernel function $k : (\mathcal{X} \times \mathcal{Z}) \times (\mathcal{X} \times \mathcal{Z}) \to \mathbb{R}$. This is a principled way of approaching the encoding problem for arbitrary dependencies between $x$ and $z$, as is common in the structured output literature [157]. Although it has already been used to encode hidden Markov and hidden semi-Markov models [26,160], it is not restricted to those and has been applied to Markov random fields [169], weighted context-free grammars and taxonomies [157]. Here, the maximization step for the latent variable $z$ acts as a frequentist's equivalent to marginalization in basic probability theory [169].

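Employing the above notation, we phrase the primal optimization problem of latent anomaly detection as follows:

Problem 6 (Primal latent anomaly detection optimization problem). Given a monotonically non-decreasing loss function $l : \mathbb{R} \to \mathbb{R}$, minimize, with respect to $w \in \mathcal{H}$ and $\rho \in \mathbb{R}$,

\[
\frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} l\Big( \rho - \max_{z \in \mathcal{Z}} \big( \langle w, \Psi(x_i, z) \rangle + \delta(z) \big) \Big). \tag{P}
\]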

The interpretation of the above formulation is as follows. The loss function could be, e.g., $l(t) = \max(0, t)$, in which case the above detection method extends the one-class support vector machine [7] to the latent domain (this is discussed extensively later in Section 5.3). Variants of this detection method can be obtained from the above general formulation by employing different loss functions, e.g., of logistic or exponential type ($l(t) = \log(1 + \exp(t))$ and $l(t) = \exp(t)$, respectively). It is important to note that, in contrast to the classical kernel-based hypothesis model $f_{w,\rho}(\phi(x)) = \langle w, \phi(x) \rangle - \rho$, the above detection method employs a latent hypothesis model of the form $f_{w,\rho}(x) = \max_{z \in \mathcal{Z}} \langle w, \Psi(x, z) \rangle + \delta(z) - \rho$, which allows for additional flexibility.

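To obtain a dual representation of Problem 6, we start by equivalently rewriting (P) as

\[
\begin{aligned}
\min_{w \in \mathcal{H},\, \rho \in \mathbb{R},\, \xi \in \mathbb{R}^n} \quad & \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} l(\xi_i) \\
\text{s.t.} \quad & \xi_i \ge \rho - \max_{z \in \mathcal{Z}} \big( \langle w, \Psi(x_i, z) \rangle + \delta(z) \big), \quad \forall i.
\end{aligned}
\]

Denote, for all $\alpha \in \mathbb{R}^n$ with $\alpha \ge 0$,¹ the Lagrangian by

\[
\mathcal{L}(w, \rho, \xi, \alpha) := \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} l(\xi_i) + \sum_{i=1}^{n} \alpha_i \Big( \rho - \xi_i - \max_{z \in \mathcal{Z}} \big( \langle w, \Psi(x_i, z) \rangle + \delta(z) \big) \Big).
\]

By weak duality (e.g., [43], Chapter 5),

\[
\begin{aligned}
\text{Eq. (P)} \;\ge\; & \max_{\alpha : \alpha \ge 0} \; \min_{w \in \mathcal{H},\, \rho \in \mathbb{R},\, \xi \in \mathbb{R}^n} \mathcal{L}(w, \rho, \xi, \alpha) \\
=\; & \max_{\alpha : \alpha \ge 0} \Bigg[ -\frac{1}{\nu n} \sum_{i=1}^{n} \max_{\xi_i \in \mathbb{R}} \big( \alpha_i \nu n \xi_i - l(\xi_i) \big) + \min_{\rho \in \mathbb{R}} \rho \Big( -1 + \sum_{i=1}^{n} \alpha_i \Big) \\
& \qquad - \underbrace{\max_{\substack{w \in \mathcal{H},\, z_i \in \mathcal{Z} \\ i = 1, \dots, n}} \Big( \sum_{i=1}^{n} \alpha_i \big( \langle w, \Psi(x_i, z_i) \rangle + \delta(z_i) \big) - \frac{1}{2}\|w\|^2 \Big)}_{(*)} \Bigg].
\end{aligned}
\]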

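Let $w^\alpha$ and $(z_i^\alpha)_{i=1,\dots,n}$ be the maximizing arguments in $(*)$. Thus $\max_{z_i \in \mathcal{Z}} \langle w^\alpha, \Psi(x_i, z_i) \rangle + \delta(z_i) = \langle w^\alpha, \Psi(x_i, z_i^\alpha) \rangle + \delta(z_i^\alpha)$, and $\max_{z_i \in \mathcal{Z}} \langle w, \Psi(x_i, z_i) \rangle + \delta(z_i) \ge \langle w, \Psi(x_i, z_i^\alpha) \rangle + \delta(z_i^\alpha)$ for all $w \in \mathcal{H}$ and $i = 1, \dots, n$. Hence, for all $\alpha \in \mathbb{R}^n$,

\[
(*) = \max_{w \in \mathcal{H}} \; \sum_{i=1}^{n} \alpha_i \big( \langle w, \Psi(x_i, z_i^\alpha) \rangle + \delta(z_i^\alpha) \big) - \frac{1}{2}\|w\|^2,
\]

from which it follows $w^\alpha = \sum_{i=1}^{n} \alpha_i \Psi(x_i, z_i^\alpha)$, and thus

\[
(*) = \max_{\substack{z_i \in \mathcal{Z} \\ i = 1, \dots, n}} \; \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j k\big( (x_i, z_i), (x_j, z_j) \big) + \sum_{i=1}^{n} \alpha_i \delta(z_i).
\]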

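Hence,

\[
\begin{aligned}
\max_{\alpha : \alpha \ge 0} \; \min_{w \in \mathcal{H},\, \rho \in \mathbb{R},\, \xi \in \mathbb{R}^n} \mathcal{L}(w, \rho, \xi, \alpha)
=\; & \max_{\alpha : \alpha \ge 0} \Bigg[ -\frac{1}{\nu n} \sum_{i=1}^{n} \max_{\xi_i \in \mathbb{R}} \big( \alpha_i \nu n \xi_i - l(\xi_i) \big) + \min_{\rho \in \mathbb{R}} \rho \Big( -1 + \sum_{i=1}^{n} \alpha_i \Big) \\
& \qquad - \max_{\substack{z_i \in \mathcal{Z} \\ i = 1, \dots, n}} \Big( \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j k\big( (x_i, z_i), (x_j, z_j) \big) + \sum_{i=1}^{n} \alpha_i \delta(z_i) \Big) \Bigg] \\
\overset{(\dagger)}{=}\; & \max_{\alpha : \alpha \ge 0,\ \sum_{i=1}^{n} \alpha_i = 1} \Bigg[ -\frac{1}{\nu n} \sum_{i=1}^{n} l^*(\alpha_i \nu n) \\
& \qquad - \max_{\substack{z_i \in \mathcal{Z} \\ i = 1, \dots, n}} \Big( \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j k\big( (x_i, z_i), (x_j, z_j) \big) + \sum_{i=1}^{n} \alpha_i \delta(z_i) \Big) \Bigg],
\end{aligned}
\]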

¹ For vectors $x \in \mathbb{R}^n$, we denote by $x \ge 0$ the component-wise inequalities $x_i \ge 0$, $i = 1, \dots, n$.


where for $(\dagger)$ we employ the notion of the Fenchel-Legendre convex conjugate function $f^*(a) := \sup_b \langle a, b \rangle - f(b)$ [170] and exploit that the function $w \mapsto \frac{1}{2}\|w\|^2$ is self-conjugate; we also observe that $\min_{\rho \in \mathbb{R}} \rho \big( -1 + \sum_{i=1}^{n} \alpha_i \big) = 0$ if $\sum_{i=1}^{n} \alpha_i = 1$ and $-\infty$ otherwise, which enforces the constraint $\sum_{i=1}^{n} \alpha_i = 1$ when maximizing with respect to $\alpha$. Thus we obtain the following dual optimization problem of (P).

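Problem 7 (Dual latent anomaly detection optimization problem). Given a monotonically non-decreasing loss function $l : \mathbb{R} \to \mathbb{R}$, and denoting by $l^* : \mathbb{R} \to \mathbb{R}$ the dual loss function, maximize, with respect to $\alpha \in \mathbb{R}^n$ and subject to $\alpha \ge 0$ and $\sum_{i=1}^{n} \alpha_i = 1$,

\[
-\max_{\substack{z_i \in \mathcal{Z} \\ i = 1, \dots, n}} \Big( \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j k\big( (x_i, z_i), (x_j, z_j) \big) + \sum_{i=1}^{n} \alpha_i \delta(z_i) \Big) - \frac{1}{\nu n} \sum_{i=1}^{n} l^*(\alpha_i \nu n). \tag{D}
\]

The maximization over $z \in \mathcal{Z}$ can be expanded into slack variables, so the above dual becomes a quadratically constrained quadratic program (QCQP) with $n \cdot |\mathcal{Z}|$ many quadratic constraints.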

By the above dualization the prediction function can be written as

\[
f(x) = \max_{z \in \mathcal{Z}} \Big( \sum_{i=1}^{n} \alpha_i k\big( (x_i, z_i^\alpha), (x, z) \big) + \delta(z) \Big) - \rho,
\]

where $\rho$ can be calibrated by line search such that exactly a fraction of $1 - \nu$ training points satisfy $f(x_i) \ge 0$. The corresponding estimated density-level set is given by $\hat{L}_\nu := \{ x \in \mathcal{X} : f(x) \ge 0 \}$.
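The calibration of $\rho$ can be sketched in a few lines of Python. This is a minimal illustration that swaps the line search for a sort-based quantile, assuming the raw scores (the prediction function before subtracting $\rho$) have already been computed for all training points; the function name `calibrate_rho` and the toy scores are hypothetical.

```python
import math


def calibrate_rho(raw_scores, nu):
    """Return rho such that about a fraction nu of the raw scores lie below it.

    raw_scores[i] stands for the value of the prediction function on x_i
    before rho is subtracted. Requires 0 <= nu < 1.
    """
    if not 0.0 <= nu < 1.0:
        raise ValueError("nu must lie in [0, 1)")
    ordered = sorted(raw_scores)
    n_outliers = math.floor(nu * len(ordered))
    # The n_outliers lowest scores stay strictly below rho (absent ties);
    # every other point obtains f(x_i) = score - rho >= 0.
    return ordered[n_outliers]


scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.6, 0.4, 0.2, 1.0]
rho = calibrate_rho(scores, nu=0.2)
n_outliers = sum(1 for s in scores if s - rho < 0)
```

With ties among the scores, fewer than a fraction $\nu$ of the points may fall strictly below the threshold, so the stated fraction is only met exactly when the scores are distinct.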

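For the theoretical analysis, we consider a slight variation of latent anomaly detection,

\[
\begin{aligned}
\min_{w \in \mathcal{H}} \quad & \sum_{i=1}^{n} l\Big( 1 - \max_{z \in \mathcal{Z}} \big( \langle w, \Psi(x_i, z) \rangle + \delta(z) \big) \Big) \\
\text{s.t.} \quad & \|w\| \le C.
\end{aligned}
\tag{5.10}
\]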

For the important choice of $l(t) = \max(0, t)$ studied in Section 5.3, the above reformulation is equivalent to the original problem (P), in the sense that for any choice of $\nu$ in (P), there exists a choice of $C > 0$ in (5.10) such that both problems have the same solution in the variable $w$. This is shown in Supplementary Material A.1.

To analyze (5.10) theoretically, note that (5.10) corresponds to performing empirical risk minimization (ERM), $\hat{f} := \operatorname{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} l(f(x_i))$, over the class $\mathcal{F} := \{ f_w = x \mapsto 1 - \max_{z \in \mathcal{Z}} ( \langle w, \Psi(x, z) \rangle + \delta(z) ) : \|w\| \le C \}$. In the following theorem, we show that the solution of (5.10) has asymptotically the same loss as the theoretically optimal quantity $f^* := \operatorname{argmin}_{f \in \mathcal{F}} \mathbb{E}\, l(f(X))$.

Theorem 8 (Latent anomaly detection generalization bound). The following generalization bound holds for the latent anomaly detection method (A.1). Let $l : \mathbb{R} \to \mathbb{R}$ be a non-negative and $L$-Lipschitz continuous loss function. Denote $A := \max_{z \in \mathcal{Z}} |\delta(z)|$ and $B := \max_{x \in \mathcal{X}, z \in \mathcal{Z}} \|\Psi(x, z)\|$. With probability at least $1 - \epsilon$ over the draw of the sample, the generalization error is bounded as:

\[
\mathbb{E}\, l(\hat{f}) - \mathbb{E}\, l(f^*) \le 8L\, \frac{1 + A + BC|\mathcal{Z}|}{\sqrt{n}} + L(1 + A + BC) \sqrt{\frac{2 \log(2/\epsilon)}{n}}.
\]

Proof. The full proof is shown in Supplementary Material A.1.

While the present analysis considers a worst-case bound that is independent of the structure of the latent space $\mathcal{Z}$, it would be interesting to analyze the bound also for special choices of the joint feature map and discrete loss functions. Such an analysis was presented in McAllester and Keshet [171], who showed asymptotic consistency of the update direction of a perceptron-like structured prediction algorithm.

Note that the requirements on the loss function are, in particular, fulfilled by the loss $l(t) = \max(0, t)$, which is employed both by the one-class SVM and by the proposed hidden Markov anomaly detector that is introduced in Section 5.3 below. Indeed, in that case, $l$ is non-negative and Lipschitz continuous with constant $L = 1$.

Hidden Markov Anomaly Detection. In this section, we derive the proposed hidden Markov anomaly detection (HMAD) methodology that is capable of dealing with sequence data that exhibits latent state structure. We therefore need to settle on an appropriate loss function $l$ and a joint feature map $\Psi(x, z)$.

Setting $l(t) := \max(0, t)$, we can derive a latent version of the one-class support vector machine (OC-SVM) [7]. Contrary to [172], structures need not be known. We derive the latent version of the OC-SVM as follows.

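Problem 9 (Primal latent OC-SVM optimization problem). Given the monotonically non-decreasing hinge loss function $l : \mathbb{R} \to \mathbb{R}$, $l(t) = \max(0, t)$, minimize, with respect to $w \in \mathcal{H}$ and $\rho \in \mathbb{R}$,

\[
\frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} \max\Big( 0,\ \rho - \max_{z \in \mathcal{Z}} \big( \langle w, \Psi(x_i, z) \rangle + \delta(z) \big) \Big). \tag{P′}
\]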

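It is easy to check that the dual loss of $l(t) = \max(0, t)$ is the function $l^*(t) = 0$ if $0 \le t \le 1$ and $\infty$ otherwise, and thus the corresponding dual optimization problem is as follows.

Problem 10 (Dual latent one-class SVM optimization problem). Given the monotonically non-decreasing hinge loss function $l : \mathbb{R} \to \mathbb{R}$, $l(t) = \max(0, t)$, and denoting by $l^* : \mathbb{R} \to \mathbb{R}$ the dual hinge loss function, maximize, with respect to $\alpha \in \mathbb{R}^n$ and subject to $0 \le \alpha \le \frac{1}{\nu n}$ and $\sum_{i=1}^{n} \alpha_i = 1$,

\[
-\max_{\substack{z_i \in \mathcal{Z} \\ i = 1, \dots, n}} \Big( \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j k\big( (x_i, z_i), (x_j, z_j) \big) + \sum_{i=1}^{n} \alpha_i \delta(z_i) \Big). \tag{D′}
\]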
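The conjugate of the hinge loss used above can be verified directly from the definition $l^*(a) = \sup_b \, (ab - l(b))$ by splitting the supremum at $b = 0$. The following short derivation is a worked check, not part of the original argument:

```latex
\begin{aligned}
l^*(a) &= \sup_{b \in \mathbb{R}} \big( ab - \max(0, b) \big)
        = \max\Big( \sup_{b \le 0} ab,\ \sup_{b \ge 0} (a - 1) b \Big), \\
\sup_{b \le 0} ab &=
  \begin{cases} 0 & \text{if } a \ge 0, \\ +\infty & \text{if } a < 0, \end{cases}
\qquad
\sup_{b \ge 0} (a - 1) b =
  \begin{cases} 0 & \text{if } a \le 1, \\ +\infty & \text{if } a > 1, \end{cases} \\
\Rightarrow\quad l^*(a) &=
  \begin{cases} 0 & \text{if } 0 \le a \le 1, \\ +\infty & \text{otherwise.} \end{cases}
\end{aligned}
```

Applied to the argument $\alpha_i \nu n$, finiteness of $l^*$ is equivalent to the box constraint $0 \le \alpha_i \le \frac{1}{\nu n}$, which is why the conjugate term disappears from the objective and reappears as a constraint.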

In hidden Markov anomaly detection, we are interested in inferring the hidden state sequence $z = (z_1, \dots, z_T) \in \mathcal{Z}$, with single entries $z_t \in \mathcal{Y}$, associated with an observed feature sequence $x = (x_1, \dots, x_T)$, i.e., each element of the sequence is a feature vector $x_t = (x_{t,l})_{l=1,\dots,d} \in \mathbb{R}^d$. Hidden Markov models have been introduced as a certain class of probability density functions $P$ with chain-like factorization [161] and parameters $w$:

\[
P(x, z \mid w) = \pi(x_1, z_1 \mid w) \prod_{t=2}^{T} P(z_t \mid z_{t-1}, w)\, P(x_t \mid z_t, w). \tag{5.11}
\]

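Based on the corresponding log-probability and conditioned on the inputs, $\log P(z \mid x) = \log \pi(z_1, x_1 \mid w) + \sum_{t=2}^{T} \log P(z_t \mid z_{t-1}, w) + \log P(z_t \mid x_t, w)$, we introduce the matching scoring function $G : \mathcal{X} \times \mathcal{Z} \times \mathcal{H} \to \mathbb{R}$ that decomposes into $G_{\text{trans}} : \mathcal{Y} \times \mathcal{Y} \times \mathcal{H} \to \mathbb{R}$ and $G_{\text{em}} : \mathcal{X} \times \mathcal{Y} \times \mathcal{H} \to \mathbb{R}$:

\[
\log P(z \mid x) = G(x, z, w) = \sum_{t=2}^{T} G_{\text{trans}}(z_t, z_{t-1}, w) + \sum_{t=1}^{T} G_{\text{em}}(x_t, z_t, w), \tag{5.12}
\]

such that $G(x, z, w) \propto \langle w, \Psi(x, z) \rangle$. This motivates defining a joint feature map as follows: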


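Definition 8 (Hidden Markov joint feature map). Given a feature map $\phi : \mathcal{X} \to \mathcal{F}$, define the hidden Markov joint feature map $\Psi : \mathcal{X} \times \mathcal{Z} \to \mathcal{H}$ as

\[
\Psi(x, z) = \begin{pmatrix}
\big( \sum_{t=2}^{T} \mathbb{1}[z_t = i \wedge z_{t-1} = j] \big)_{i,j \in \mathcal{Y}} \\[4pt]
\big( \sum_{t=1}^{T} \mathbb{1}[z_t = i]\, \phi(x_t) \big)_{i \in \mathcal{Y}}
\end{pmatrix}.
\]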

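To better understand the above feature map, observe that the weight vector $w = (w^{\text{trans}}, w^{\text{em}})$ decomposes into a transition vector $w^{\text{trans}} = (w^{\text{trans}}_{i,j})_{i,j \in \mathcal{Y}}$ and an emission vector $w^{\text{em}} = (w^{\text{em}}_i)_{i \in \mathcal{Y}}$, so the linear model becomes

\[
\langle w, \Psi(x, z) \rangle = \sum_{t=2}^{T} \sum_{i,j \in \mathcal{Y}} \mathbb{1}[z_t = i \wedge z_{t-1} = j]\, w^{\text{trans}}_{i,j} + \sum_{t=1}^{T} \sum_{i \in \mathcal{Y}} \mathbb{1}[z_t = i]\, \langle w^{\text{em}}_i, \phi(x_t) \rangle,
\]

which is reminiscent of the log-probability associated with HMMs and given by (5.12).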
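The joint feature map and the resulting linear model can be made concrete with a small Python sketch. It assumes $\phi$ is the identity on $\mathbb{R}^d$ and represents the transition and emission blocks as nested lists; the function names `hmm_joint_feature` and `linear_score` and the toy numbers are illustrative and not taken from the original implementation.

```python
def hmm_joint_feature(x, z, n_states):
    """Hidden Markov joint feature map with phi = identity (assumption).

    Returns (trans, em): trans[i][j] counts transitions into state i from
    state j, and em[i] sums the feature vectors x_t over positions with z_t = i.
    """
    d = len(x[0])
    trans = [[0.0] * n_states for _ in range(n_states)]
    em = [[0.0] * d for _ in range(n_states)]
    for t in range(1, len(z)):
        trans[z[t]][z[t - 1]] += 1.0
    for x_t, z_t in zip(x, z):
        for l in range(d):
            em[z_t][l] += x_t[l]
    return trans, em


def linear_score(w_trans, w_em, x, z):
    """<w, Psi(x, z)> as the sum of transition and emission contributions."""
    n_states = len(w_trans)
    trans, em = hmm_joint_feature(x, z, n_states)
    score = sum(
        w_trans[i][j] * trans[i][j]
        for i in range(n_states)
        for j in range(n_states)
    )
    score += sum(
        w_em[i][l] * em[i][l]
        for i in range(n_states)
        for l in range(len(em[i]))
    )
    return score


x = [[1.0], [2.0], [3.0]]          # three positions, d = 1
z = [0, 1, 1]                      # hidden state sequence
w_trans = [[0.1, 0.2], [0.3, 0.4]]
w_em = [[1.0], [-1.0]]
score = linear_score(w_trans, w_em, x, z)
```

In this toy case the two transitions contribute $0.3 + 0.4$ and the emissions contribute $1 \cdot 1 - (2 + 3)$, which matches the position-wise sum in the displayed linear model.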

Definition 9 (Hidden Markov anomaly detection (HMAD)). Hidden Markov anomaly detection (HMAD) is defined as the latent OC-SVM (Problems 9 and 10) together with the hidden Markov joint feature map (Definition 8).

Note that, because of the specific form of the joint feature map occurring in HMAD, the problem of maximizing over the latent variables in Eq. (P′) can be solved by finding the most probable state sequence of the corresponding hidden Markov model, which can be efficiently computed using, e.g., Viterbi's algorithm [161].
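The maximization over latent state sequences can be sketched with a standard Viterbi recursion. The following minimal Python example assumes scalar observations with $\phi(x_t) = x_t$ and omits the penalty term $\delta(z)$ (that is, $\delta \equiv 0$); the function names and toy weights are hypothetical, and a brute-force search is included only to check the dynamic program on a tiny input.

```python
from itertools import product


def path_score(w_trans, w_em, x, z):
    """<w, Psi(x, z)> for scalar observations and phi(x_t) = x_t."""
    score = sum(w_em[z_t] * x_t for x_t, z_t in zip(x, z))
    score += sum(w_trans[z[t]][z[t - 1]] for t in range(1, len(z)))
    return score


def viterbi(w_trans, w_em, x):
    """Return (best score, best state sequence) maximizing path_score."""
    n_states = len(w_em)
    # best[i]: score of the best partial path that ends in state i
    best = [w_em[i] * x[0] for i in range(n_states)]
    back = []
    for x_t in x[1:]:
        prev = best
        pointers, best = [], []
        for i in range(n_states):
            j_star = max(range(n_states), key=lambda j: prev[j] + w_trans[i][j])
            pointers.append(j_star)
            best.append(prev[j_star] + w_trans[i][j_star] + w_em[i] * x_t)
        back.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda i: best[i])
    score, path = best[state], [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return score, path[::-1]


def brute_force(w_trans, w_em, x):
    """Exhaustive maximization, feasible only for very short sequences."""
    n_states = len(w_em)
    return max(
        path_score(w_trans, w_em, x, z)
        for z in product(range(n_states), repeat=len(x))
    )


w_trans = [[0.5, -1.0], [-1.0, 0.5]]   # w_trans[i][j]: into state i from state j
w_em = [1.0, -1.0]
x = [2.0, -1.0, -3.0, 0.5]
score, path = viterbi(w_trans, w_em, x)
```

The recursion needs a number of operations proportional to the sequence length times the squared number of states, whereas the exhaustive search grows exponentially in the sequence length.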

Similar to its non-structured counterpart, the structured one-class SVM enjoys interesting properties, as we show below. Recall that for an input $x$ and prediction function $f$ the following cases can occur:

1. $f(x) > 0$ (then $x$ is strictly inside the density level set)
2. $f(x) = 0$ (then $x$ is right at the boundary of the set)
3. $f(x) < 0$ (then $x$ is outside of the density level set, i.e., $x$ is an outlier)

The following theorem shows that the parameter $\nu$ controls the number of outliers.

Theorem 11. The following statements hold for the structured one-class SVM and the induced decision function $f$:

(a) The fraction of outliers (inputs $x_i$ with $f(x_i) < 0$) is upper bounded by $\nu$.
(b) The fraction of inputs lying strictly inside the density level set (inputs $x_i$ with $f(x_i) > 0$) is upper bounded by $1 - \nu$.

The theorem is proven in Appendix A.2 and shows that the quantity $\nu$ can be interpreted as the fraction of outliers predicted by the learning algorithm. In particular, this shows, together with the theoretical analysis, that for well-behaved problems (where there is no probability mass exactly on the decision region and where the true decision boundary is contained in the hypothesis set, e.g., via the use of universal kernels [173]), the estimated density level set $\hat{L}_\nu$ asymptotically equals the truly underlying density level set $L_\nu$: $P(\hat{L}_\nu \setminus L_\nu \cup L_\nu \setminus \hat{L}_\nu) \to 0$ for $n \to \infty$.


Optimization Algorithm. A first difficulty occurring when trying to solve the optimization problem (P′) consists in the function $g_i : (w, \rho) \mapsto \rho - \max_{z \in \mathcal{Z}} \big( \langle w, \Psi(x_i, z) \rangle + \delta(z) \big)$, which is concave and thus renders the optimization problem non-convex. However, note that for any concave function $h : \mathbb{R} \to \mathbb{R}$, the term $\max(0, h(x))$ can be decomposed into convex and concave parts, $\max(0, h(x)) = \max(0, -h(x)) + h(x)$. Hence, we can write

\[
\text{Eq. (P′)} = \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} \Big( \max\big( 0, -g_i(w, \rho) \big) + g_i(w, \rho) \Big).
\]

The above decomposition consists of a convex term followed by a concave term, which admits the optimization framework of DC programming (difference of convex functions) [46]. Although the convex function $-g_i$ is not differentiable, it admits, at any point $(w_0, \rho_0) \in \mathcal{H} \times \mathbb{R}$, a subdifferential

\[
\partial(-g_i)(w_0, \rho_0) := \big\{ v \in \mathcal{H} \times \mathbb{R} : -g_i(w, \rho) + g_i(w_0, \rho_0) \ge \langle v, (w, \rho) - (w_0, \rho_0) \rangle,\ \forall (w, \rho) \in \mathcal{H} \times \mathbb{R} \big\}.
\]

One can verify, using the sub-differentiability of the maximum operator, that for any $z \in \mathcal{Z}$ attaining the maximum in $g_i(w_0, \rho_0)$, the point $(\Psi(x_i, z), -1)$ is contained in the subdifferential $\partial(-g_i)(w_0, \rho_0)$. Thus, we can linearly approximate $-g_i(w, \rho) \approx \langle w, \Psi(x_i, z) \rangle + \delta(z) - \rho$. In the optimization algorithm we will thus construct a sequence of variables $(w^t, \rho^t, z^t)$, $t = 1, 2, 3, \dots$, where we use this approximation with $z$ chosen as $z_i^t = \operatorname{argmax}_{z \in \mathcal{Z}} \langle w^{t-1}, \Psi(x_i, z) \rangle + \delta(z)$, where $w^{t-1}$ is conveniently computed by solving a regular one-class SVM problem. The resulting optimization algorithm is described in Algorithm 7.
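The decomposition identity used above holds for every real number $h$ and can be checked by a two-case analysis; concavity of $h$ is only needed to classify the two resulting terms. The following lines are a worked check rather than part of the original derivation:

```latex
\begin{aligned}
h \ge 0:&\quad \max(0, h) = h, \qquad \max(0, -h) + h = 0 + h = h, \\
h < 0:&\quad \max(0, h) = 0, \qquad \max(0, -h) + h = -h + h = 0. \\[4pt]
\text{Hence}&\quad \max(0, h) = \underbrace{\max(0, -h)}_{\text{convex if } h \text{ is concave}}
  \;+\; \underbrace{h}_{\text{concave}} .
\end{aligned}
```

The first term is convex because $-h$ is convex and the pointwise maximum of convex functions is convex, so replacing the concave term by a linear approximation at the current iterate yields a convex subproblem.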

Algorithm 7 Hidden Markov Anomaly Detection
  input data $x_1, \dots, x_n$
  put $t = 0$ and initialize $w^t$ (e.g., randomly)
  repeat
    $t := t + 1$
    for $i = 1, \dots, n$ do
      $z_i^t := \operatorname{argmax}_{z \in \mathcal{Z}} \langle w^{t-1}, \Psi(x_i, z) \rangle + \delta(z)$ (i.e., use the Viterbi algorithm)
    end for
    let $(w^t, \rho^t)$ be the optimal arguments when solving the one-class SVM with $\phi(x_i) := \Psi(x_i, z_i^t)$
  until $\forall i = 1, \dots, n : z_i^t = z_i^{t-1}$
  return optimal model parameters $w := w^t$, $\rho := \rho^t$, and $z_i := z_i^t \ \forall i = 1, \dots, n$

Despite the non-convex nature of the optimization problem, we found in our experiments that the algorithm often converges faster than the standard column-generation approach of the supervised structured SVM [157], since no storage of constraints is necessary, which in turn leads to constant time and space complexity for each iteration of Algorithm 7.

5.4 Evaluation and Applications

In the following section, we apply the derived methods to applications from computational biology. First, supervised large-scale structured output methods are applied to the transcript identification problem using RNA-Seq data from real eucaryotic model organisms and compared against state-of-the-art techniques. Unsupervised collective anomaly detection is then evaluated on controlled data and on procaryotic gene detection.


5.4.1 Transcript Identiﬁcation for Eucaryotic Organisms

High-throughput sequencing technology applied to cellular mRNA (RNA-Seq) has revolu-

tionized transcriptome studies [174–176] (among many others). In contrast to microarray

platforms, which it has replaced in many applications, RNA-Seq can not only be used to ac-

curately quantify known transcripts, but also to reveal the precise structure of transcripts at

single-nucleotide resolution. RNA-Seq based transcript reconstruction has therefore become

a valuable tool for the completion of genome annotations [177] (for instance) and further en-

abled subsequent analyses of diﬀerentially expressed genes [178], transcript isoforms [179,

180] and exons [181], all of which generally rely on correctly inferred transcript inventories.

De novo transcript reconstruction is thus a pivotal step in the analysis of RNA-Seq data.

There are two conceptually different strategies to approach this problem. One can either assemble transcripts directly from RNA-Seq reads using methodology that originated from genome assembly approaches [182,183], or decompose the problem into

two steps: RNA-Seq reads are ﬁrst aligned to the genome of origin followed by the actual

transcript reconstruction on the basis of these alignments. While the ﬁrst, assembly-based

strategy does not require a high-quality genome sequence and is thus applicable to non-

model organisms, it is arguably addressing a more diﬃcult problem than the latter, mapping-

based approach. Consequently, transcripts, in particular ones with low expression, may be

more accurately reconstructed by methods implementing the mapping-based approach [184,

185] (see also [182,183] for a comparison). The performance of mapping-based methods, however, strongly depends on the quality of the RNA-Seq read alignments. Considerable attention has therefore been paid to the problem of correctly aligning RNA fragments across splice junctions [186,187].

Following the mapping-based paradigm, we developed a novel machine learning-based

method, which we call mTim: margin-based transcript inference method. In contrast to

algorithmic transcript assembly [184,185], we formalize the problem as a supervised label

sequence learning task and apply state-of-the-art techniques, namely Hidden Markov sup-

port vector machines (HM-SVMs) [157]. This way of approaching the problem is similar to

recently developed gene ﬁnders [188], and mTim is indeed a hybrid method that can uti-

lize both, RNA-Seq read alignments and characteristic features of the genome sequence, e.g.

around splice sites [189]. However, mTim’s emphasis is on inference from aligned RNA-Seq

reads, and its model is only augmented by a few genic sequence motif sensors [188], which

can moreover be disabled. We thus make weak assumptions, if any, about the inferred tran-

scripts: importantly, we do not model protein-coding sequences (CDS) and are thus able to

predict noncoding transcripts as well as coding ones with similar expression.

The task of reconstructing the exon-intron structure of expressed genes can be converted

into a label sequence learning problem, where we attempt to label each nucleotide in the

genome as either intergenic, exonic or intronic. Our prior knowledge about what constitutes

a valid gene structure is incorporated into a state model to restrict the space of possible

labelings to valid ones.

Starting from a naive state model that would consist of a single state for each of the

atomic labels, exonic, intronic, and intergenic, we extended it as follows (see Figure 5.3): ﬁrst,

we devised a strand-speciﬁc model. Second, we created expression-dependent submodels.

This allows us to maintain several parameter sets, each of which is optimized for transcripts

with a certain read support. Due to non-uniform read coverage along transcripts, transitions

between expression levels proved useful in practice. Finally, the simple model was extended

by states that mark segment boundaries (e.g. when transitioning from exon to intron), as this

facilitates boundary recognition from features such as spliced reads (Fig. 5.3).


Feature Derivation. The inference of transcript structures is based on sequences of observations or features derived from RNA-Seq read alignments and predicted splice sites. Specifically, we derive the following position-wise features from RNA-Seq alignments:

• number of reads aligned at the given position, indicating an exon.

• a gradient of the read coverage; high absolute values correspond to sharp in- or de-

creases in coverage typical of the start and end of exonic regions, respectively

• number of reads that are spliced over the given position (strand-speciﬁc), thus indicat-

ing an intronic position.

• number of spliced reads supporting a donor splice site at the given position (strand-

speciﬁc).

• number of spliced reads supporting an acceptor splice site at the given position (strand-

speciﬁc).

• number of paired-read alignments for which the insert spanned the given position

(only used if read pair information is available, strand-speciﬁc), an indicator of tran-

script connectivity.
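The first three position-wise features listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: alignments are given as half-open intervals `[start, end)` on a single strand, the gradient is taken as a forward difference, and the function names are hypothetical rather than part of the mTim implementation.

```python
def coverage(intervals, length):
    """Number of intervals covering each position; intervals are [start, end)."""
    cov = [0] * length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, length)):
            cov[pos] += 1
    return cov


def coverage_gradient(cov):
    """Forward difference of the coverage; the last position is set to 0."""
    return [cov[i + 1] - cov[i] for i in range(len(cov) - 1)] + [0]


def intron_span_counts(spliced_introns, length):
    """Number of spliced reads whose intron gap [start, end) spans each position."""
    return coverage(spliced_introns, length)


# Two aligned read segments and one spliced read with an intron gap at 4..6.
read_cov = coverage([(0, 4), (2, 6)], length=8)
read_grad = coverage_gradient(read_cov)
intron_cov = intron_span_counts([(4, 6)], length=8)
```

Large positive or negative entries of the gradient mark positions where coverage rises or drops sharply, which is the signal the feature list associates with exon starts and ends.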

Figure 5.3 – State model used by mTim. The first and last nucleotide of introns and transcripts were modeled with particular care: The former are associated with splice site signals at exon-intron junctions (states denoted acc and don), whereas the latter correspond to transcript start and end (denoted p and t, respectively). The model is strand-specific and consists of expression-specific sub-models. [Diagram omitted: intergenic, exonic, and intronic states on the plus and minus strand, replicated per expression level (1st, 2nd, ...), with boundary states acc, don, and p.]

Additionally, we derive features from

the genome sequence around a given posi-

tion such as strand-speciﬁc donor and ac-

ceptor splice site prediction.

As a ground truth for guiding the su-

pervised training process, annotated gene

models with a portion of the surrounding

intergenic region are excised and converted

into label sequences by assigning one of the

above atomic labels to each nucleotide (see

color coding in Figure 5.3). In the presence

of alternative transcripts, this labeling was

based on a single isoform (the one that was

best supported by RNA-Seq reads), and addi-

tionally a mask of alternative transcript re-

gions was generated to avoid that learning

the correct alternatives is penalized during

training.

Data Preparation. For the following computational experiments we used RNA-Seq data from well-studied model organisms for which high-quality annotations exist, because these can not only be used for training, but also to assess the accuracy of the inferred transcripts.

We aligned RNA-Seq reads to the genome using the splice-aware alignment tool PalMapper [187].

Primary RNA-Seq alignments were filtered with the goal of reducing the number of alignment errors. To this end, we used a small subset of annotated introns to define an optimal choice of parameters for filtering criteria such as the maximal number of edit operations (mismatches, insertions, and deletions), the minimal length of the shortest aligned segment in a spliced alignment, and the minimal number of alignments supporting an intron. The chosen filter settings maximize the F-score (harmonic mean of precision and recall) between the annotation set and the introns contained in the filtered alignments.

Donor and acceptor splice sites were predicted from the genome sequence following a

published protocol [189]. In summary, this method cuts out genomic sequences around all potential splice donor and acceptor sites (exhibiting the two-nucleotide consensus sequence) and applies SVM classifiers with string kernels to recognize annotated splice sites. Trained classifiers are then used to generate whole-genome predictions, which are subsequently transformed into probabilistic confidence values [189].

From the RNA-seq read alignments we then generated the above-listed coverage and

splice-site features and derived a label sequence from the corresponding gene annotations

(see above for details).

[Figure 5.4: accuracy (F-score) of mTim and Cufflinks on unfiltered and filtered alignments; panels: intron evaluation and transcript evaluation; organisms: C. elegans, A. thaliana, D. melanogaster.]

Figure 5.4 – Comparison of mTim and Cufflinks. (a) Assessment of the total number of introns whose boundaries were correctly predicted at single-nucleotide precision. (b) Evaluation of the number of gene loci for which at least one transcript isoform was predicted correctly (all introns correct).

To be able to assess the impact of align-

ment quality on subsequent transcript in-

ference, we used unﬁltered alignments in a

ﬁrst set of experiments and subsequently re-

peated these using ﬁltered RNA-seq align-

ments as input to assess the improvement

of transcript inference with improved align-

ment quality.

To generate transcript models from

these read alignments, the mTim pipeline

proceeds through the following steps:

1. Deﬁnition of genome chunks; impor-

tantly, chunks are deﬁned based on

read coverage only without using any

annotation information.

2. Partitioning genome chunks into sub-

sets for cross validation.

3. Training on chunks from the train-

ing set using known (annotated) gene

models as ground truth.

4. Application of the trained mTim mod-

els to predict transcript structures on

test chunks.

Using cross-validation, we obtain unbi-

ased estimates of mTim’s transcript recon-

struction accuracy for data it had not seen

during training.
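The partitioning and train/test rotation in steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the mTim implementation: the function names, the number of folds and the integer chunk identifiers are assumptions made here, since the text does not specify them.

```python
import random

def partition_chunks(chunk_ids, n_folds, seed=0):
    """Shuffle chunk identifiers and deal them into n_folds disjoint subsets."""
    ids = list(chunk_ids)
    random.Random(seed).shuffle(ids)
    return [ids[k::n_folds] for k in range(n_folds)]

def cross_validation_splits(folds):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    for k, test in enumerate(folds):
        train = [c for j, fold in enumerate(folds) if j != k for c in fold]
        yield train, test

# Hypothetical chunk identifiers; in a real pipeline these would index genome chunks.
folds = partition_chunks(range(10), n_folds=3)
splits = list(cross_validation_splits(folds))
```

Because every chunk appears in exactly one test fold, predictions for a chunk always come from a model that did not see it during training.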

To compare mTim’s prediction to the state of the art in alignment-guided transcript in-

ference, we also applied Cuﬄinks with default parameter settings to the same unﬁltered and

ﬁltered RNA-seq alignment data.

Results and Discussion To evaluate its performance, we applied mTim to RNA-Seq data

from model species. We chose three organisms, Caenorhabditis elegans (nematode worm),

Arabidopsis thaliana (thale cress) and Drosophila melanogaster (fruit ﬂy), whose genomes

[Figure 5.5: three panels showing accuracy (F-score) for introns and transcripts as a function of (a) the number of training examples, (b) the number of expression submodels and (c) the number of training iterations, annotated with training times; panel (c) marks the point at which a sufficiently small duality gap was reached.]

Figure 5.5 – Optimizing mTim’s performance. (a) The HM-SVM learning algorithm utilized training data efficiently and accuracy quickly reached a plateau. (b) Expression-specific submodels (see Fig. 5.3 and Methods) improve reconstruction of complete transcripts. (c) Accuracy as a function of the number of training iterations (using 1000 examples). The duality gap was sufficiently small for termination after 78 iterations. All results were obtained using unfiltered RNA-Seq alignments for C. elegans. Empirical execution times in (a) and (c) were averaged across three HM-SVM trainings.

and transcriptomes have been extensively characterized [177], making it possible to use an-

notated gene models as a ground truth for evaluating the quality of transcripts reconstructed

from RNA-Seq data. Although these genome annotations were neither complete nor free of

errors, which only allowed for approximate evaluations, they were nonetheless useful for

assessing mTim’s transcript reconstruction accuracy relative to other methods.

We evaluated the accuracy of transcripts reconstructed by mTim in a whole-genome com-

parison to annotated protein-coding genes using cross-validation (see Methods for details).

Here we used two popular criteria that evaluate intron and transcript quality respectively.

The ﬁrst is an assessment of the total number of introns that are inferred correctly (with

single-nucleotide precision), whereas the second counts the number of gene loci for which at

least one transcript isoform has been reconstructed correctly (all introns predicted correctly).

Note that neither criterion evaluates transcript starts and ends at nucleotide resolution, because annotations are generally more uncertain for these than for intron boundaries; in transcript evaluation, however, predicted transcript fusions or splits are regarded as errors.

For both criteria we assessed the sensitivity and precision of predicted transcripts. The

former is deﬁned as the proportion of annotated introns (or transcripts) which were inferred

correctly, whereas the latter is deﬁned as the proportion of inferred introns (or transcripts)

which correctly matched an annotated intron (or transcript). The F-score is an aggregate

accuracy measure, deﬁned as the harmonic mean of sensitivity and precision:

\[ F = \frac{2 \cdot \text{sensitivity} \cdot \text{precision}}{\text{sensitivity} + \text{precision}} \]
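The three measures can be computed from sets of annotated and predicted introns as in the following sketch. The function name and the representation of an intron as a (chromosome, start, end) tuple are assumptions chosen for illustration; the toy coordinates are invented and are not results from this work.

```python
def f_score(annotated, predicted):
    """Return (sensitivity, precision, F) for two collections of introns.

    An intron counts as correct only if it matches an annotated intron
    exactly, i.e. with single-nucleotide precision at both boundaries.
    """
    annotated, predicted = set(annotated), set(predicted)
    correct = len(annotated & predicted)
    sensitivity = correct / len(annotated) if annotated else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    if sensitivity + precision == 0.0:
        return sensitivity, precision, 0.0
    f = 2.0 * sensitivity * precision / (sensitivity + precision)
    return sensitivity, precision, f

# Hypothetical toy data: the third prediction misses its start by 5 nucleotides.
annotated = {("I", 100, 200), ("I", 300, 400), ("I", 500, 600), ("I", 700, 800)}
predicted = {("I", 100, 200), ("I", 300, 400), ("I", 505, 600)}
sens, prec, f = f_score(annotated, predicted)
```

In this toy case two of four annotated introns are recovered (sensitivity 1/2) and two of three predictions are correct (precision 2/3), which gives F = 4/7.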

In initial assessments we veriﬁed the eﬀectiveness of mTim’s training algorithm and mod-

eling approach. We ﬁrst evaluated how eﬃciently the HM-SVM training exploits the available

training data. Intron accuracy quickly reached a level where additional training sequences

no longer led to substantial improvements: with as few as 80 training examples an intron

accuracy (F-score) of 0.75 was exceeded, which was only 6.5% below the maximum of 0.812

(Fig. 5.5a). Transcript reconstruction accuracy continued to improve with additional training

examples, although with 250 training sequences transcript accuracy was less than 10% below

the maximum of 0.373 (Fig. 5.5a). Second, we assessed the impact of expression-speciﬁc sub-

models (see Fig. 5.3 and Methods) on transcript reconstruction accuracy (Fig. 5.5b). While we

observed little eﬀect on intron reconstruction, we conﬁrmed that submodels were valuable

for correctly inferring whole transcripts: with ﬁve submodels, transcript accuracy increased

by 25% relative to the simple model without submodels (Fig. 5.5b). Since expression-speciﬁc

submodels provided an eﬀective means to group exons with similar expression levels into

one transcript and terminate it when expression changes dramatically, we used ﬁve submod-

els for all subsequent mTim experiments. Third, we assessed convergence speed of mTim’s

optimization approach. Results obtained for a training set consisting of 1,000 sequences

suggest that after about 80 iterations, completed in <2 CPU hours, prediction accuracy had

converged (Fig. 5.5c).

[Figure 5.6: sensitivity, precision and F-score for intron and transcript evaluation, comparing mTim without and with splice site prediction features.]

Figure 5.6 – Accuracy of mTim when trained with and without features derived from genomic sequence signals around splice sites. Both mTim instances were trained and evaluated on unfiltered RNA-Seq alignments from C. elegans.

To benchmark mTim’s transcript reconstruction performance in comparison to other methods, we extended our evaluations to include Cufflinks [184], a widely adopted method, applying the same assessment criteria as before. Comparative evaluations revealed that mTim inferred relatively accurate transcript structures, almost always as good as or better than those of Cufflinks (Fig. 5.4). Notably, mTim’s predictions were relatively robust against issues in the underlying read alignments (intron accuracy was unaffected by alignment filtering, and transcript F-score decreased by at most 16%). Cufflinks, in contrast, was found to be much more sensitive to these issues; without alignment filtering, its intron and transcript accuracy (F-score) dropped by 13–35% and 30–50%, respectively (Fig. 5.4). The quality of transcripts inferred by mTim appeared to be relatively high (Fig. 5.4) and consistently so across the diverse range of input data tested here; in particular, mTim maintained high precision (Table 5.1).

Due to its modular architecture and its general machine-learning approach, mTim can

easily be tailored to specific application requirements. For instance, features corresponding to genomic splice site predictions can be disabled, making mTim rely completely on RNA-Seq alignment features, thereby eliminating any potential bias against non-coding transcripts.

We assessed the extent to which this aﬀects transcript reconstruction accuracy and found the

eﬀect to be minor (Fig. 5.6).

Application Outcome Here, we have introduced mTim, a discriminative machine learning-

based method that reconstructs transcripts from RNA-Seq read alignments and splice site

predictions. We have shown that it is able to infer transcripts with high accuracy and that it

is more robust to errors in the underlying read alignments than Cufflinks. Pre-trained mTim predictors used

for this work are available within the Oqtans Galaxy webserver 2. Moreover, mTim is open-

source software provided via GitHub 3.

2 http://oqtans.org/

3 https://github.com/nicococo/mTIM

Table 5.1 – Sensitivity and precision of introns and transcripts reconstructed with mTim or Cufflinks applied to PalMapper alignments.

                     Alignm.     Sensitivity [%]       Precision [%]
                     filtered    mTim    Cufflinks     mTim    Cufflinks
Intron evaluation
  C. elegans         NO          75.4    58.6          88.1    86.4
                     YES         74.0    71.3          89.5    91.6
  A. thaliana        NO          69.4    30.9          87.8    77.9
                     YES         69.1    53.5          86.5    95.6
  D. melanogaster    NO          70.5    66.6          82.1    66.5
                     YES         68.1    70.7          88.0    88.6
Transcript evaluation
  C. elegans         NO          30.8    20.3          47.3    33.8
                     YES         31.5    30.4          51.2    45.2
  A. thaliana        NO          24.6     8.6          46.2    17.0
                     YES         23.9    21.2          44.2    25.2
  D. melanogaster    NO          28.0    24.7          49.4    28.7
                     YES         32.1    34.7          63.0    59.8

Accuracy values of the best-performing method in each category are in bold face. See main text for definitions of sensitivity and precision and details on alignment filtering.

5.4.2 Hidden Markov Anomaly Detection

[Figure 5.7: runtime in seconds (logarithmic axis) versus number of training examples (100 to 1000) for Bayes (Linear), HMAD, OC-SVM (RBF 1.0), OC-SVM (Hist 8) and OC-SVM (Linear).]

Figure 5.7 – Runtime performance for the controlled experiment. Results are shown for our hidden Markov anomaly detection (HMAD) as well as a set of competitors (using optimal kernel parameters) for an increasing number of training examples.

We conducted experiments for the scenario

of label sequence learning where we have

full access to the ground truth as well as a

real-world computational biology scenario.

Our interest is to assess the anomaly de-

tection performance of our hidden Markov

anomaly detection (HMAD) method for

groups of measurements. As baseline meth-

ods that excel in one-class classiﬁcation set-

tings, we chose one-class support vector

machines (OC-SVM) with appropriate ker-

nels. For initialization, we randomly choose

a vector w_0 for each run of our algorithm,

which is suﬃcient, since no initialization of

structures is needed, as those are deduced

from the parameter vector.

Controlled Experiment For the controlled experiments, we aim to gain insights into the behavior of our method. We investigate the anomaly detection performance for low to very high (up to 30%) fractions of anomalies. Furthermore, we are interested in the anomaly detection performance for an increasing amount of disorganization in the input sequences. Since HMAD exploits latent structure, it is not clear how it performs when less structure is present. Vanilla OC-SVMs do not exploit latent dependencies and should be unaffected by this. Additionally, we are interested in the runtime behavior for various training set sizes.

[Figure 5.8: detection accuracy (AUC) versus (left) percentage of anomalous data (2.5% to 30%) and (right) percentage of disorganization (0% to 100%).]

Figure 5.8 – Results for the controlled experiment: (left) anomaly detection performance for various fractions of anomalies in the training set and (right) anomaly detection performance for an increasing amount of disorganization. All settings show results for our hidden Markov anomaly detection (HMAD) as well as a set of competitors (using optimal kernel parameters). Notably, the detection performance of HMAD is not affected by increasing amounts of disorganization in the input data (right).

We generated Gaussian noise sequences of length 600 with unit variance for the nominal bulk of the data. Non-trivial anomalies (see Fig. 5.9) were induced as blocks of Gaussian noise with non-zero mean and a total, cumulative length of 120 per anomalous example. We vary either the fraction of anomalies in the training data set or the number of blocks, depending on the amount of structure that is modeled into the data (see Figure 5.9): from 120 sub-blocks of length 1 (100% disorganization) to a single block of length 120 (0% disorganization). We employ a binary state model consisting of 2 states and 4 possible transitions with a constant prior δ(·). We report the average area under the ROC curve (AUC) for the anomaly detection performance over 50 repetitions of the experiment. Since we know the underlying ground truth, we can exactly compute the Bayes classifier (see footnote 4), which in our case lies within the set of linear classifiers and serves as a hypothetical upper bound on the achievable detection performance.
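The data generation described above can be sketched as follows. The text fixes the sequence length (600), the unit variance and the cumulative anomaly length (120); the anomaly mean of 1.0, the slot-based block placement and all names below are assumptions made for this illustration only.

```python
import numpy as np

def make_sequence(rng, anomalous, n_blocks=1, length=600, total_len=120, mean=1.0):
    """Return (observations, states) for one synthetic example.

    states[t] == 1 marks an anomalous position. The sequence is split into
    n_blocks equal slots and one anomalous block is placed at a random
    offset inside each slot, so blocks never overlap (an assumed scheme).
    """
    if total_len % n_blocks or length % n_blocks:
        raise ValueError("n_blocks must divide both total_len and length")
    states = np.zeros(length, dtype=int)
    if anomalous:
        block_len = total_len // n_blocks
        slot = length // n_blocks
        for b in range(n_blocks):
            start = b * slot + rng.integers(0, slot - block_len + 1)
            states[start:start + block_len] = 1
    # Unit-variance Gaussian noise; the mean is shifted on anomalous positions.
    x = rng.normal(loc=mean * states, scale=1.0)
    return x, states

rng = np.random.default_rng(0)
x_nom, s_nom = make_sequence(rng, anomalous=False)
x_org, s_org = make_sequence(rng, anomalous=True, n_blocks=1)    # 0% disorganization
x_dis, s_dis = make_sequence(rng, anomalous=True, n_blocks=120)  # 100% disorganization
```

Both anomalous examples contain exactly 120 shifted positions; only their arrangement differs, which is the quantity varied along the disorganization axis.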

[Figure 5.9: noisy observations and true state sequence plotted over sequence position (0 to 600) for two levels of disorganization.]

Figure 5.9 – Examples of observation sequences for two extreme cases of our controlled experiments: even in the easy setting (top), the true state sequence is barely visible to the naked eye in the noisy observed sequence, while in the challenging setting (bottom) it is almost impossible for humans to extrapolate the truly underlying state sequence.

We compare the detection performance of our method to the one achieved by OC-SVMs with RBF kernels, histogram kernels, and linear kernels using l1- and l2-feature normalization, and optimal kernel parameters (1.0 for the RBF kernel, 8 for the histogram kernel, and l1 for the linear kernel). The results of the anomaly detection experiment are shown in Figure 5.8 and Figure 5.7. As can be seen in Figure 5.8, our method achieves substantially higher detection rates than the OC-SVMs using linear or RBF kernels, which perform about as poorly as random guessing. The most competitive baseline methods are OC-SVMs with histogram kernels and optimal bin size (8 bins). There exists a strong relation between our method HMAD and Fisher kernels [190] in the sense that the same representation is used. Unlike Fisher kernels, our methodology includes the parameter optimization procedure, and therefore, given the same model parameters, both

4 For data that is i.i.d. realized from a distribution (which is the case in our synthetic experiment), the Bayes

classiﬁer is deﬁned as the classiﬁer achieving the maximal accuracy among all measurable functions.

methods are on par. Remarkably, our method achieves stable on-par performance with the Bayes classifier for all levels of disorganization, even when there is no structure to be exploited in the data (see Figure 5.8 right), and significantly outperforms all competitors for varying fractions of anomalies (see Figure 5.8 left).

As an example, we depict two typical anomalous observation sequences of length 600

and anomalous block length 120 of the experiment in Fig. 5.9 for the 0% (top) and 100%

disorganization (bottom) settings. As can be seen, anomalies are not trivially detectable.

[Figure 5.10: runtime in seconds versus number of training examples for HMAD and SSVM.]

Figure 5.10 – Without the need for constraint generation, our hidden Markov anomaly detection easily outperforms the structured SVM.

We also conducted runtime experiments (Fig. 5.7) to compare the runtime of our method HMAD against that of the baseline methods. We used the same two-state model as in the previous controlled experiment, but with the training set size varying from 100 to 1000 examples. We used a fraction of 10% anomalies to ensure that there is a sufficient number of anomalies in the data. As expected, the absolute computational runtime is higher than for vanilla OC-SVMs. This is due to the iterative approach that includes Viterbi decoding of the sequences and solving a vanilla OC-SVM in each step. However, the computational cost grows with an increasing number of examples at a rate comparable to that of the OC-SVM, which gives a total complexity of O(OC-SVM) + O(c), where c is a constant.
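The Viterbi decoding step mentioned above can be sketched for a small state model as follows. This is a generic max-sum dynamic program and not the thesis implementation; the score arrays are hypothetical inputs which, in HMAD, would be derived from the current parameter vector.

```python
import numpy as np

def viterbi(emission, transition, start):
    """Return the highest-scoring state path and its score.

    emission:   (T, S) array, score of state s at position t
    transition: (S, S) array, score of moving from state i to state j
    start:      (S,)   array, score of starting in each state
    """
    T, S = emission.shape
    score = start + emission[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score of a path that is in state i at t-1 and moves to j
        cand = score[:, None] + transition
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(S)] + emission[t]
    # Trace the best path backwards from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path, float(np.max(score))

# Hypothetical two-state example: switching states costs 0.5.
emission = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
transition = np.array([[0.0, -0.5], [-0.5, 0.0]])
path, best = viterbi(emission, transition, start=np.zeros(2))
```

Decoding costs O(T S^2) per sequence, so for a fixed state model each iteration adds work that is linear in the total number of positions on top of the OC-SVM solve.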

We report run time comparisons of the structured output SVM (SSVM) and our hidden

Markov anomaly detection in Figure 5.10. Since the HMAD does not need to add constraints

in each iteration, it easily outperforms the SSVM. However, it does require multiple iterations

that include Viterbi decoding as well as solving a vanilla one-class SVM and therefore is

slower than the OC-SVM (for a comparison see Fig. 5.7).

[Figure 5.11: detection accuracy (AUC) versus percentage of disorganization (0% to 100%) for HMAD, the Fisher kernel upper bound and the Fisher kernel lower bound.]

Figure 5.11 – Comparison for an increasing amount of disorganization of our method HMAD (blue) against a variety of Fisher kernels (gray area), including a lower bound (magenta) based on random model parameters and an upper bound (red) that was trained on ground truth data.

Fisher kernels [190–192] have been proposed as a way of incorporating graphical models into the framework of kernel-based learning [33] and therefore benefit from the vast number of available kernel machines. A practical Fisher kernel is defined via the gradient of the log-likelihood of the probabilistic model with respect to its model parameters.
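For reference, the standard definitions from the Fisher kernel literature can be written as follows. The symbols U_x (Fisher score), θ (model parameters) and I (Fisher information matrix) are notation introduced here for illustration and are not taken from the text above; the practical variant replaces I by the identity.

```latex
U_x = \nabla_\theta \log p(x \mid \theta), \qquad
k(x, x') = U_x^\top I^{-1} U_{x'}, \qquad
k_{\text{practical}}(x, x') = U_x^\top U_{x'} .
```

Under this reading, the gradient itself is the feature representation and the kernel is the inner product between two such gradients.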

There is a strong connection between Fisher kernels and our HMAD, in the sense that we use the same representation of graphical models. However, our method HMAD includes the parameter optimization procedure. Specifically, given the same model parameters learned by our method, the corresponding Fisher kernel employed in a one-class SVM leads to the same solution. Of course, learning the right model parameters is the key to good performance.

To cope with a variety of parameter learning settings and hence obtain a realistic comparison against multiple parameter estimation methodologies for Fisher kernels, we derive an upper and a lower bound for the maximum likelihood estimation for Fisher kernels. Here, a

lower bound can be easily obtained by using random model parameters, whereas an upper

bound uses the ground truth latent states information for parameter estimation.

The results in Fig. 5.11 and Fig. 5.12 show the range of possible solutions for the Fisher kernel (gray area) with the upper bound (red) and the (unsurprisingly unstable) lower bound (magenta), in the same setting as in Section 5.4.2. Moreover, they show that our method HMAD performs nearly as well as the upper bound in the absence of any label information.

[Figure 5.12: detection accuracy (AUC) versus percentage of anomalous data (2.5% to 30%) for HMAD, the Fisher kernel upper bound and the Fisher kernel lower bound.]

Figure 5.12 – Comparison for an increasing amount of anomalies of our method HMAD (blue) against a variety of Fisher kernels (gray area), including a lower bound (magenta) based on random model parameters and an upper bound (red) that was trained on ground truth data.

To assess the stability of the found solution, we ran experiments with an increasing number of hidden states for our proposed method HMAD in the same setting as in Section 5.4.2. The results in Fig. 5.13 show that our method is not sensitive to the number of hidden states.

Procaryotic Gene Detection In prokaryotes (mostly bacteria and archaea), gene structures consist of the protein coding region that starts with a start codon (one out of three specific 3-mers in many prokaryotes), followed by a number of codon triplets (of three nucleotides each), and is terminated by a stop codon (one out of five specific 3-mers in many prokaryotes) [193]. Genic regions are first transcribed to RNA and then translated into a protein. Since genes are separated from one another by intergenic regions, the problem of identifying genes can be posed as a label sequence learning task, where one assigns a label (out of intergenic, start, stop, exonic) to each position in the genome [188].

[Figure 5.13: detection accuracy (AUC) of HMAD versus number of hidden states (2, 3, 4, 6, 8, 12).]

Figure 5.13 – Performance evaluation for an increasing number of hidden states of our method HMAD (blue).

We downloaded the genome of the widely studied bacterium Escherichia coli, which is publicly available (see footnote 5).

Genomic sequences were cut between

neighboring genes (splitting intergenic re-

gions equally), such that a minimum dis-

tance of 6 nucleotides between genes was

maintained. Intergenic regions have a min-

imum distance of 50 nucleotides to genic

regions. Features were derived from the

nucleotide sequence by transcoding it to

a numerical representation of triplets. All

examples have a minimum length of 500

nucleotides and do not exceed 1200 nu-

cleotides.

For the OC-SVM we use matching spectrum kernels of order 1, 2, and 3 (resp. 64, 4160, and 266,304 dimensions), while the SSVM and HMAD obtain a sequence of binary entries as input data. A description of the state model used, which is based on Görnitz, Widmer, Zeller, Kahles, Sonnenburg, and Rätsch [26], is given in Figure 5.15. Start and stop states use corresponding features that encode start and stop codons. All other states use all 64 binary input features. Furthermore, we choose

5 http://www.sanger.ac.uk/resources/downloads/bacteria/escherichia-coli.html

δ(z) to have a slightly higher probability towards the intergenic state. For a fairer comparison, OC-SVM and HMAD are given the true fraction of anomalies, which varies from 2.5% up to 30%. The training set contained 200 intergenic and genic examples with a total length of >170,000 nucleotides, while the test set contained 350 intergenic and 50 genic examples with a total length of >330,000 nucleotides, rendering this a computationally challenging experiment. The experiment was repeated 20 times, with training and test sets drawn randomly.
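The triplet transcoding and the feature dimensions quoted for the spectrum kernels can be illustrated with the sketch below. It assumes one overlapping triplet per sequence position, one-hot encoded over the 64 possible codons, and sequences over the alphabet A, C, G, T only; the text does not state these details, and all names are chosen here for illustration.

```python
from itertools import product
import numpy as np

# All 64 nucleotide triplets in a fixed (lexicographic) order.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}

def encode_triplets(sequence):
    """One-hot encode the triplet starting at each position: shape (len - 2, 64)."""
    n = len(sequence) - 2
    features = np.zeros((n, len(CODONS)), dtype=np.uint8)
    for t in range(n):
        features[t, CODON_INDEX[sequence[t:t + 3]]] = 1
    return features

def spectrum_dimension(order):
    """Number of features when codon k-mers of all orders 1..order are counted."""
    return sum(64 ** k for k in range(1, order + 1))

X = encode_triplets("ATGAAATAA")  # hypothetical 9-nucleotide toy sequence
dims = [spectrum_dimension(k) for k in (1, 2, 3)]
```

Counting codon k-mers cumulatively over all orders up to k reproduces the dimensions 64, 4160 and 266,304 given in the text, which suggests (but does not confirm) that the higher-order kernels include the lower-order features.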

[Figure 5.14: detection accuracy (AUC, 0.6 to 1.0) versus percentage of anomalous data (2.5% to 30%) for OC-SVM Spectrum (1), OC-SVM Spectrum (2), OC-SVM Spectrum (3), OC-SVM Spectrum (FS), HMAD (FS) and HMAD.]

Figure 5.14 – Detection performance for various fractions of outliers in terms of AUC for the procaryotic gene finding experiment. Clearly, the accuracy of our hidden Markov anomaly detection exceeds the vanilla one-class SVM performance even when using higher order (1, 2 & 3 codons = 64, 4160 and 266,304 dimensions) spectrum kernels.

We further employ a simple feature selection procedure where the 8 most distinctive genic and intergenic features are selected on a comparable labeled procaryote (Escherichia fergusonii), which increased performance for the OC-SVM by more than 10%. While performance for our HMAD remained unchanged, training and prediction times dropped to 15% of those of the full model.

The results in Figure 5.14 show a vastly superior performance of our method (HMAD) in terms of detection accuracy: HMAD achieves a perfect AUC of 1.00 (i.e., it identifies every sequence containing a gene with zero error) for all outlier fractions, while the classical one-class SVM shows much worse performance with an AUC of 0.85 at best and 0.66 in the worst case. Using higher order spectrum kernels increases the detection performance only marginally. This result is remarkable, as it has been reported that string kernels such as the spectrum kernel achieve state-of-the-art performance in this application [188].

[Figure 5.15: state diagram with intergenic (IGE), start, exonic (Ex1, Ex2, Ex3) and stop states.]

Figure 5.15 – State model of prokaryotic gene finding.

Application Outcome Here, we investigated various properties of our proposed method, hidden Markov anomaly detection (HMAD), on an artificially generated data set where we had full control over the ground truth. We further applied our method to real data from procaryotic bacteria and showed that unsupervised detection of genes is possible. In both settings, HMAD performed comparably to or better than the baseline competitors. Moreover, as no column generation is necessary, the runtime performance is much better than that of its supervised counterpart.

5.5 Summary and Discussion

The main contributions of this chapter are twofold. In the ﬁrst part, we took a closer look

at the supervised structured output SVM (SSVM), which serves as a foundation for the subsequently introduced unsupervised anomaly detection method. We re-formulated the margin rescaling variant of the SSVM and incorporated a more complex regularization that enables hierarchical

multitask learning. We then derived an optimization method based on bundle methods to

solve the speciﬁc problem of SSVM when a fair amount of samples is given and each sample

consists of large numbers of measurements. The speed improvements are shown in Fig. 5.2.

In Section 5.3, we derived an anomaly detection method that is based on the supervised

structured output principle but does not need label information. Due to its access to the latent

dependency structure of a group of measurements, it can be applied as a collective anomaly

detection method. This is veriﬁed in the experimental section that gives positive results when

compared against baseline methods.

Limits of Bundle Method Optimization for SSVMs The proposed optimization scheme

was designed with the speciﬁc application of Section 5.4.1 in mind. This setting includes a

low to medium number of samples with a low to medium number of features but where each

sample possibly contains thousands and even millions of measurements. On the other hand,

when this setting is not given (i.e., many samples with a small number of measurements each),

the optimization might take much longer than the standard column generation. Moreover,

due to the ﬁne-grained label information needed to train those models, SSVMs are generally

not suitable as anomaly detectors even if results tend to be very good.

Limits of Latent Structure Anomaly Detector Even though the proposed unsupervised

method is much faster than its supervised (SSVM) counterpart, it still needs several iterations of a standard one-class SVM until convergence and is hence multiple times slower than the standard formulation. Moreover, compared to the non-structured baseline, we lose

the important convexity property and might get stuck in low-quality minima. The possibly

biggest drawback, however, is its complexity. The kind of structure is pre-coded into a fairly

complicated joint feature map and needs to be chosen carefully when deploying to a speciﬁc

application. If done properly, the empirical evaluation suggests that it can be beneﬁcial.

Source code and resources for the proposed methods are available on GitHub (see footnotes a and b). Parts

of this chapter are based on:

Görnitz, N., Braun, M., Kloft, M., “Hidden Markov Anomaly Detection”, in

International Conference on Machine Learning (ICML), 2015, pp. 1833–1842

Görnitz, N., Widmer, C., Zeller, G., Kahles, A., Sonnenburg, S., Rätsch, G., “Hi-

erarchical Multitask Structured Output Learning for Large-scale Sequence Seg-

mentation”, in Advances in Neural Information Processing Systems (NIPS), 2011,

pp. 2690–2698

Zeller, G., Görnitz, N., Kahles, A., Behr, J., Mudrakarta, P., Sonnenburg, S.,

Rätsch, G., “mTim: rapid and accurate transcript reconstruction from RNA-Seq

data”, ArXiv, 2013

a https://github.com/nicococo/tilitools

b https://github.com/nicococo/mtim

IV Contextual Anomalies

Chapter 6

Learning with Latent Class Dependencies

6.1 Preliminaries
6.2 A Joint Feature Map Formulation
6.3 Direct Formulation includes k-means as Special Case
6.4 Extension to Non-independent Samples
6.5 Evaluation and Applications
    6.5.1 Extracting Latent Brain States
    6.5.2 Porosity Estimation
6.6 Summary and Discussion

In this chapter, we turn our attention to the contextual anomaly setting. That is, we

consider samples to be singletons but with some contextual connection. In this setting, data

points will be considered anomalous only if the contextual information admits it. Of course,

there are many diﬀerent forms of contextual information that can be examined. Here, we

restrict ourselves to latent class dependencies where the anomaly score for each data point

depends on the respective observation and some hidden class information.

According to Chandola, Banerjee, and Kumar [1], contextual anomaly detectors need two

kinds of attributes: (i) contextual attributes that embed the corresponding data point into a

neighborhood and (ii) behavioral attributes which determine its normality. Common ap-

proaches in the literature consider time-series [194] or spatial [195] contexts. In this chapter

we consider latent class dependencies.

We start in Section 6.2 by extending the support vector data description (SVDD) to incorporate latent class dependencies and a mechanism to infer the respective latent class on the fly. This is done employing techniques from structured output prediction (cf. Chapter 5).

In Section 6.3, we abandon the ﬂexibility of the joint feature map and make the latent class

dependency explicit which allows us to establish a—somewhat surprising—connection to k-

means. The third part (Section 6.4), however, starts with a certain regression setup in mind

where data points are connected to their respective latent class variables while, at the same

time, the latter exhibit connections among themselves. An extension towards contextual one-

class classiﬁcation using our previously developed methods is discussed. Here, the boundary

between contextual and collective outlier detection vanishes, and this method allows groups of data points to be bound together.

6.1 Preliminaries

Throughout this chapter, we consider support vector data description (SVDD) as our base

model. Here, the data is mapped from the input space into an RKHS feature space φ: X → F

that gives rise to a kernel k [33, 196]. The goal is to find a model f: X → R and a density level-set L := {x : f(x) ≤ R²} containing most of the regular data points, while x ∉ L holds for anomalies and outliers. In case of the support vector data description (SVDD) method, f_SVDD(x) = ‖c − φ(x)‖² and parameter estimation corresponds to solving a quadratically constrained quadratic program (QCQP),

\[
\begin{aligned}
\min_{R,\, c,\, \xi \geq 0} \quad & R^2 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & \| c - \phi(x_i) \|^2 \leq R^2 + \xi_i, \quad i = 1, \ldots, n .
\end{aligned}
\tag{6.1}
\]

That allows for the following simple geometric interpretation: a ball of radius $R$ is computed
that comprises most of the regular data points, while all points lying outside of the normality
radius are declared anomalous.
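The geometric interpretation can be illustrated with a minimal sketch. It assumes the identity feature map $\phi(x) = x$ and a center and radius that are already given; the function names `svdd_score` and `is_anomalous` are illustrative and not part of the thesis. The sketch only evaluates the decision rule and does not solve Problem (6.1).

```python
import numpy as np

def svdd_score(X, c, R):
    """Anomaly score ||c - x||^2 - R^2 per row of X (phi = identity is an assumption)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return np.sum((X - c) ** 2, axis=1) - R ** 2

def is_anomalous(X, c, R):
    """A point is declared anomalous if it lies strictly outside the ball."""
    return svdd_score(X, c, R) > 0.0

c = np.array([0.0, 0.0])                 # hypothetical center
X = np.array([[0.5, 0.0], [3.0, 4.0]])   # one point inside, one outside
flags = is_anomalous(X, c, R=1.0)
```

Positive scores correspond to points outside the normality radius, negative scores to points inside it.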

An important note on Section 6.4, where techniques for learning latent class dependency
structure are developed: the original motivation behind this section is different from
contextual anomaly detection and rather focuses on a semi-supervised regression setting.

However, due to its modularity, a straightforward extension towards one-class classiﬁcation

based on support vector data descriptions is discussed in detail.

6.2 A Joint Feature Map Formulation

One way of incorporating behavioral $x \in \mathcal{X}$ and contextual $z \in \mathcal{Z}$ information blocks
into standard methods is by employing joint feature maps $\Psi: \mathcal{X} \times \mathcal{Z} \to \mathcal{F}$ to encode the
information and embed it into a feature space, which can then be plugged in without further
changes to the method. As straightforward as it seems, we have not yet specified the structure of $\Psi$
and, most importantly, we do not assume that the context $z$ is given in advance; instead, it
must be inferred from the observations $x$. Hence, the contributions of this section are:

• we extend the support vector data description to handle contextual information based

on the notion of joint feature maps

• we present a simple embedding for the problem of latent classes

• a corresponding solver based on diﬀerence of convex (DC) functions programming is

proposed.

In this section, we extend the classical mapping $f_{\mathrm{SVDD}}$ by the inclusion of a latent variable
$z \in \mathcal{Z}$ in a joint feature map $\Psi: \mathcal{X} \times \mathcal{Z} \to \mathcal{F}$. As we try to find the tightest description
of the data, it makes sense to define the contextual information to correspond to finding a
minimizer of the hyper-sphere description. As a consequence, the resulting model

$$f: \mathcal{X} \to \mathbb{R}, \quad x \mapsto \min_{z \in \mathcal{Z}} \|c - \Psi(x, z)\|^2 \tag{6.2}$$

becomes more expressive (a similar idea appeared also recently in the context of supervised
learning [157]). The latent state variable $\hat{z}$ of a given data point $x$ can be inferred by
$\hat{z} = \operatorname{argmin}_{z \in \mathcal{Z}} \|\Psi(x, z)\|^2 - 2\langle c, \Psi(x, z)\rangle$. The extended model, which we call LatentSVDD,
leads to a modified optimization problem:

$$\begin{aligned} \min_{R,\,c,\,\xi \ge 0} \quad & R^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & \min_{z \in \mathcal{Z}} \|c - \Psi(x_i, z)\|^2 \le R^2 + \xi_i \quad \forall i. \end{aligned} \tag{LatentSVDD}$$


Because of the min operator in the constraints, the resulting optimization problem is no

longer convex, but we can derive an optimization strategy by decomposing the problem

into convex and concave parts and iteratively linearizing the concave part (DC Programming

[197]). In order to do so, we re-write the above problem in an equivalent, unconstrained

fashion as follows:

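$$\min_{c,\,R} \quad R^2 + C \sum_{i=1}^{n} \max\Big(0, \min_{z \in \mathcal{Z}} \|c - \Psi(x_i, z)\|^2 - R^2\Big). \tag{6.3}$$

Substituting $\Omega := R^2 - \|c\|^2$, this is equivalent to

$$\min_{c,\,\Omega} \quad \|c\|^2 + \Omega + C \sum_{i=1}^{n} \max\Big(0, -\Omega + \min_{z \in \mathcal{Z}} \|\Psi(x_i, z)\|^2 - 2\langle c, \Psi(x_i, z)\rangle\Big)$$

subject to the constraint $\|c\|^2 + \Omega \ge 0$, which can be dropped as it is not active in the optimal
point. Note that, for any $i$, the function

$$g_i(c, \Omega) := -\Omega + \min_{z \in \mathcal{Z}} \|\Psi(x_i, z)\|^2 - 2\langle c, \Psi(x_i, z)\rangle \tag{6.4}$$

is concave, so $-g_i$ is convex. Furthermore, note that for any $t \in \mathbb{R}$: $\max(0, t) = \max(0, -t) + t$.
Thus, we have the decomposition

$$\max(0, g_i(c, \Omega)) = \underbrace{\max(0, -g_i(c, \Omega))}_{\text{convex}} + \underbrace{g_i(c, \Omega)}_{\text{concave}}$$

because the maximum of two convex functions is convex, and we can equivalently re-write the
LatentSVDD optimization problem as a sum of a convex and a concave function as follows:
given the definition of $g_i$ in Eq. (6.4), solve

$$\min_{c,\,\Omega} \quad \underbrace{\|c\|^2 + \Omega + C \sum_{i=1}^{n} \max(0, -g_i(c, \Omega))}_{\text{convex}} + \underbrace{C \sum_{i=1}^{n} g_i(c, \Omega)}_{\text{concave}} \tag{LatentSVDD-DC}$$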

The above problem is an instance of the class of DC optimization problems. We propose to

solve the above problem with the simpliﬁed DC algorithm. That is, alternatingly, the concave

part is linearized and the resulting approximate problem solved. The resulting algorithm is

shown in Algorithm 8.

The proposed algorithm converges to a local optimum (typically in about 10 iterations,
as we found in our experiments). This follows from the following theorem that is taken

from [50], which is an extension of the convergence theorem in [198] to non-diﬀerentiable

objective functions.

Theorem 12 ([50], Theorem 3.3). Let $f, g$ be convex functions. Let $x_0$ be any feasible point,
and put

$$\forall t > 0: \quad x_t := \operatorname{argmin}_{x} \; f(x) - x^\top \nabla g(x_{t-1}).$$

If the non-smooth parts of $f$ and $g$ are piecewise-linear and the smooth part of $f$ is strictly convex
quadratic, then any limit point of the sequence $(x_t)$ is a stationary point.


Algorithm 8 Optimization Algorithm for LatentSVDD
input data $x_1, \dots, x_N$
initialize $c^{t=0}$ and $\hat{z}_i^{t=0}$ for all $i$ (e.g., randomly)
repeat
    $t := t + 1$
    for $i = 1, \dots, N$ do
        $\hat{z}_i^t := \operatorname{argmin}_{z \in \mathcal{Z}} \|c^{t-1} - \Psi(x_i, z)\|^2$
        overwriting the notation of $g_i$ in (6.4), we define
        $g_i(c, \Omega) := -\Omega + \|\Psi(x_i, \hat{z}_i^t)\|^2 - 2\langle c, \Psi(x_i, \hat{z}_i^t)\rangle$
    end for
    let $c^t$ and $\Omega^t$ be the optimal arguments when solving Problem (LatentSVDD-DC) with
    the $g_i$ set as above
until $\forall i: \hat{z}_i^t = \hat{z}_i^{t-1}$
return optimal model parameters $c := c^t$, $R := \sqrt{\|c^t\|^2 + \Omega^t}$, and $z_i := \hat{z}_i^t$ $\forall i = 1, \dots, N$

The proposed algorithm also admits a dual representation via the convex conjugate function
$f^*(x) := \sup_y \langle x, y\rangle - f(y)$. The dual of the LatentSVDD-DC problem is given by

$$\min_{c,\,\Omega} \quad \Big(-C \sum_{i=1}^{n} g_i(c, \Omega)\Big)^* - \Big(\|c\|^2 + \Omega + C \sum_{i=1}^{n} \max(0, -g_i(c, \Omega))\Big)^*$$

This completes the presentation of the ﬁrst step in our proposed methodology.

Joint feature map As we have described LatentSVDD in terms of a joint feature map
$\Psi(x, z)$, we want to fix the structure specifically for problems involving latent class dependencies.
Hence, we employ a variant of the joint feature map with latent space
$\mathcal{Z} := \{1, \dots, K\}$ that is similar to the multi-class joint feature map in [157]: let
$\Lambda(z) = (\delta(z_1, z), \delta(z_2, z), \dots, \delta(z_K, z)) \in \{0, 1\}^K$ and let $a \otimes b$ be the direct tensor product of vectors
$a$ and $b$. Given a data point $x$, we define our joint feature map as $\Psi(x, z) = \phi(x) \otimes \Lambda(z)$.
Let us assume the dimensionality of $\phi$ is $d$; then the total dimensionality is $d \cdot K$ with at least
$d \cdot (K - 1)$ zero entries.
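The following sketch illustrates the joint feature map and the inference of the latent state. It assumes an explicit finite-dimensional $\phi(x)$, uses 0-based class indices, and realizes the tensor product as a Kronecker product; the function names are hypothetical. The ordering of entries produced by `np.kron` is one possible layout; any fixed layout gives the same inner products.

```python
import numpy as np

def one_hot(z, K):
    """Lambda(z): indicator vector of latent class z in {0, ..., K-1} (0-based here)."""
    lam = np.zeros(K)
    lam[z] = 1.0
    return lam

def joint_feature_map(phi_x, z, K):
    """Psi(x, z) = phi(x) (x) Lambda(z), realized as a Kronecker product."""
    return np.kron(phi_x, one_hot(z, K))

def infer_latent_state(c, phi_x, K):
    """argmin_z ||Psi(x, z)||^2 - 2 <c, Psi(x, z)>, evaluated by enumeration."""
    costs = []
    for z in range(K):
        psi = joint_feature_map(phi_x, z, K)
        costs.append(psi @ psi - 2.0 * (c @ psi))
    return int(np.argmin(costs))

phi_x = np.array([1.0, 2.0, 3.0])        # d = 3
K = 4
psi = joint_feature_map(phi_x, 2, K)     # length d * K, only d non-zero entries
c = joint_feature_map(phi_x, 1, K)       # a center aligned with class 1
z_hat = infer_latent_state(c, phi_x, K)
```

Because $\|\Psi(x, z)\|^2 = \|\phi(x)\|^2$ for every $z$ under this map, the inferred state is determined by the inner product with the center alone.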

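Theoretical Analysis We conclude with a generalization analysis of this unsupervised
learning algorithm and define, for any $\lambda > 0$, the following hypothesis class

$$\mathcal{F}_{\mathrm{LatentSVDD}} := \Big\{ f_{c,\Omega,\mathcal{Z}} = x \mapsto \Omega + \max_{z \in \mathcal{Z}} 2\langle c, \Psi(x, z)\rangle - \|\Psi(x, z)\|^2 \; : \; 0 \le \|c\|^2 + \Omega \le \lambda \Big\},$$

and its corresponding loss class $\mathcal{G}_{\mathrm{LatentSVDD}} := l \circ \mathcal{F}_{\mathrm{LatentSVDD}}$, employing the loss function
$l(t) := \max(0, -t)$. It is not difficult to verify (e.g., [199], Proposition 12) that, by employing
the variable substitution $\Omega := R^2 - \|c\|^2$, for any $C > 0$ there is a $\lambda > 0$ such that
Problem (LatentSVDD) is equivalent to

$$\min_{f \in \mathcal{F}_{\mathrm{LatentSVDD}}} \sum_{i=1}^{n} l(f(x_i)) = \min_{g \in \mathcal{G}_{\mathrm{LatentSVDD}}} \sum_{i=1}^{n} g(x_i).$$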

Hence, we may analyze the proposed LatentSVDD within the proven framework of empir-

ical risk minimization.


Let us first briefly review the classical setup of empirical risk minimization. Let $x_1, \dots, x_n$
be an i.i.d. sample drawn from a probability distribution $P$ over $\mathcal{X}$. Let $\mathcal{F}$ be a class of functions
mapping from $\mathcal{X}$ to some set $\mathcal{Y}$, and let $l: \mathcal{Y} \to [0, b]$ be a bounded loss function, for
some $b > 0$. The goal is to find a function $f \in \mathcal{F}$ that has a low risk $E[l(f(x))]$. Denoting
the loss class by $\mathcal{G} := l \circ \mathcal{F}$, this is equivalent to finding a function $g$ with small $E[g]$. The best
function in $\mathcal{G}$ we can hope to learn is $g^* \in \operatorname{argmin}_{g \in \mathcal{G}} E[g]$. Since $g^*$ is unknown, we instead
compute a minimizer $\hat{g}_n \in \operatorname{argmin}_{g \in \mathcal{G}} \hat{E}[g]$, where $\hat{E}[g] := \frac{1}{n} \sum_{i=1}^{n} g(x_i)$. To compare the
prediction accuracies of $g^*$ and $\hat{g}_n$, it is known [200] that, with probability at least $1 - \delta$ over
the draw of the sample,

$$E[\hat{g}_n] - E[g^*] \le 4 R_n(\mathcal{G}) + b \sqrt{\frac{2 \log(2/\delta)}{n}}. \tag{6.5}$$

Here, $R_n(\mathcal{G}) := E \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i g(x_i)$ is the Rademacher complexity, where $\sigma_1, \dots, \sigma_n$
are i.i.d. Rademacher variables (random signs). Usually $R_n(\mathcal{G})$ is of the order $O(1/\sqrt{n})$
when we employ appropriate regularization, and thus so is (6.5). We will show that
LatentSVDD also enjoys this favorable rate:

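Theorem 13 (Generalization bound for LatentSVDD). Let $g^* \in \operatorname{argmin}_{g \in \mathcal{G}_{\mathrm{LatentSVDD}}} E[g]$
and $\hat{g}_n \in \operatorname{argmin}_{g \in \mathcal{G}_{\mathrm{LatentSVDD}}} \frac{1}{n} \sum_{i=1}^{n} g(x_i)$. Assume there is a real number $B > 0$ such that
$P(\|\Psi(x_i, z)\| \le B) = 1$. Denote the cardinality of $\mathcal{Z}$ by $|\mathcal{Z}|$. Then, the following generalization
bound holds:

$$E[\hat{g}_n] - E[g^*] \le 4 |\mathcal{Z}| \, \frac{\lambda + B\sqrt{\lambda}}{\sqrt{n}} + B \sqrt{\frac{2 \log(2/\delta)}{n}}.$$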

Sketch of Proof. For the proof, we proceed in three steps: ﬁrst, we prove a Rademacher bound

for the classic SVDD (cf. the lemma below). Next, we use Lemma 8.1 in [201] to conclude a

Rademacher bound for LatentSVDD. Finally, we conclude the claimed result by (6.5).

The complete proof of Theorem 13 is shown in Appendix B.1. It builds on the following

generalization bound for the classic SVDD, which is also proved in the supplement.

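Lemma 1 (Rademacher bound for SVDD). Put
$\mathcal{F}_{\mathrm{SVDD}}(z) := \big\{ f_{c,\Omega} = x \mapsto \Omega + 2\langle c, \Psi(x, z)\rangle - \|\Psi(x, z)\|^2 \; : \; 0 \le \|c\|^2 + \Omega \le \lambda \big\}$
and $\mathcal{G}_{\mathrm{SVDD}}(z) := l \circ \mathcal{F}_{\mathrm{SVDD}}(z)$ with $l(t) := \max(0, -t)$. Assume there is a real number $B > 0$ such that
$P(\|\Psi(x_i, z)\| \le B) = 1$. Then the Rademacher complexity of $\mathcal{G}_{\mathrm{SVDD}}$ is bounded as follows:

$$R(\mathcal{G}_{\mathrm{SVDD}}(z)) \le \frac{\lambda + B\sqrt{\lambda}}{\sqrt{n}}.$$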

As a result, the convergence rate can be slightly or even considerably slower than $O(\sqrt{1/n})$,

depending on the “degree of violation” of the independence assumption.

6.3 Direct Formulation includes k-means as Special Case

Using joint feature maps increases the ﬂexibility of the model. However, if it is known that

latent classes will be used, a more direct approach has various advantages, e.g. it is less
complex. Moreover, we show in this section that a direct formulation of the LatentSVDD
introduced above includes k-means as a special case. Hence, our contributions here

are

• we give a comprehensive review including proofs of the properties of support vector

data descriptions


• we introduce the direct formulation of the LatentSVDD (which we call ClusterSVDD)

and show that it contains k-means as well as the standard SVDD as a special case

• we propose a corresponding solver for primal and dual formulations.

We start by introducing k-means and then revisit SVDD, where we prove its most important
properties. Finally, we introduce our ClusterSVDD and prove that k-means and SVDD are
contained as special cases. We are given a set of input instances $x_1, \dots, x_\ell \in \mathcal{X}$, where $\mathcal{X}$ is an
arbitrary set, that is commonly assumed to be realized from a sequence of independent and
identically distributed (i.i.d.) random variables. Furthermore, $k$ denotes the number of clusters
and $z_i \in \{1, \dots, k\}$ the membership of the corresponding input instance $x_i$. Memberships
can be expressed by partition sets $\{S_j\}_{j=1}^{k}$, where $i \in S_j$ if and only if $z_i = j$. It holds that
$S_i \cap S_j = \emptyset$ for $i \ne j$ and $\cup_{j=1}^{k} S_j = \{1, \dots, \ell\}$.
Kernel based approaches [33,35] allow the input instances to be mapped into a reproducing
kernel Hilbert space (RKHS) $\mathcal{H}$ via a feature map $\phi: \mathcal{X} \to \mathcal{H}$.

k-means Clustering k-means clustering [64] (a recent overview is given in [202]) is usually
introduced as a (non-convex) optimization problem of finding a partition $\{S_j\}_{j=1}^{k}$, for a
pre-defined $k$, that minimizes the within cluster sum-of-squares (WCSS),

$$\min_{\{S_j\}_{j=1}^{k}} \; \sum_{j=1}^{k} \sum_{i \in S_j} \|x_i - c_j\|^2, \tag{6.6}$$

with $\{c_j \in \mathcal{X}\}_{j=1}^{k}$ being the means of the corresponding clusters. Solving this problem (at
least locally optimal) consists of three simple steps:
(1) Initialize the cluster centers $\{c_j\}_{j=1}^{k}$ and repeat steps (2) and (3) until no changes occur.
(2) Update the partitions $\{S_j\}_{j=1}^{k}$ by identifying the nearest cluster given the intermediate
cluster centers $c_\cdot$: $z_i = \operatorname{argmin}_{\hat{z} \in \{1, \dots, k\}} \|c_{\hat{z}} - x_i\|^2$.
(3) Update the cluster centers $c_j = \frac{1}{|S_j|} \sum_{i \in S_j} x_i$, $\forall j = 1, \dots, k$.
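The three steps can be sketched as follows. This is a minimal illustration with user-supplied initial centers; the function name `kmeans` and the handling of empty clusters (they keep their previous center) are choices made here, not prescriptions from the text.

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    """Repeat steps (2) and (3) until the assignment no longer changes."""
    X = np.asarray(X, dtype=float)
    centers = np.array(centers, dtype=float)
    z = None
    for _ in range(max_iter):
        # Step (2): squared distance of every point to every center, then nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        z_new = d2.argmin(axis=1)
        if z is not None and np.array_equal(z_new, z):
            break
        z = z_new
        # Step (3): each center becomes the mean of its cluster.
        for j in range(len(centers)):
            if np.any(z == j):
                centers[j] = X[z == j].mean(axis=0)
    wcss = ((X - centers[z]) ** 2).sum()   # within cluster sum-of-squares, cf. (6.6)
    return z, centers, wcss

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
z, centers, wcss = kmeans(X, centers=[[0.0, 0.0], [5.0, 5.0]])
```

On this toy input the two pairs of points are separated after one update, and the returned objective value is the WCSS of Eq. (6.6).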

Since its introduction, efforts have been made to increase the flexibility of the description [35,
203,204], e.g. through the use of kernels (see [33–35] for an introduction to kernel methods),
and to increase the robustness of the method [205–207] against outliers and the curse of
dimensionality. In this work, we address all of the above within a single framework.
Another line of research, which we do not investigate further in this work, deals with the
inference of the correct number of partitions $k$ [208–212].

Let us first begin by introducing an alternative way of stating the optimization problem
of k-means. Instead of stating that the cluster centers $\{c_j\}_{j=1}^{k}$ should be the means of input
instances corresponding to the cluster, we can re-write OP (6.6) more concisely as

$$\min_{\{S_j\}_{j=1}^{k}} \; \sum_{j=1}^{k} \min_{c_j} \sum_{i \in S_j} \|c_j - x_i\|^2.$$

This yields the same solution since it is a convex problem w.r.t. $c_j$ (fixing the partitions),
and we can analytically derive the optimal solution by $\partial \sum_{i \in S_j} \|c_j - x_i\|^2 / \partial c_j = 0$,
therefore $c_j^* = \frac{1}{|S_j|} \sum_{i \in S_j} x_i$. We can now define an equivalent constrained formulation
of OP (6.6).


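Definition 10 (k-means Constrained Problem). The constrained optimization problem for k-means
is given by

$$\begin{aligned} & \sum_{j=1}^{k} \min_{c_j} \sum_{i \in S_j} \|c_j - x_i\|^2 \\ \text{subject to} \quad & z_i = \operatorname{argmin}_{\hat{z} \in \{1, \dots, k\}} \|c_{\hat{z}} - x_i\|^2, \end{aligned} \tag{6.7}$$

where $i \in S_j$ if and only if $z_i = j$, $\forall i = 1, \dots, \ell$ and $\forall j = 1, \dots, k$.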

Revisiting SVDD As noted in the literature [213–215], there are some issues with the
original formulation of the SVDD as defined in Problem (6.1). First, the formulation is not
convex due to $R^2$ in the constraints and second, the primal-dual relation breaks down for
$0 < C < 1/\ell$. However, this can be fixed and we derive here a rigorous formulation of the
SVDD based on the work of Chang et al. [213].

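Definition 11 (Primal Constrained Problem). The primal SVDD optimization problem as a
quadratically constrained linear program (QCLP) is given by:

$$\begin{aligned} \min_{c,\,T \ge 0,\,\xi \ge 0} \quad & T + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \xi_i \\ \text{subject to} \quad & \|c - \phi(x_i)\|^2 \le T + \xi_i \quad \forall i = 1, \dots, \ell. \end{aligned} \tag{6.8}$$

For all $0 < \nu < 1$ the constraint $T \ge 0$ is dispensable (cf. Lemma 3). We will denote the
OP (6.8) as $\mathrm{Svdd}(\nu, \{x_i\}_{i=1}^{\ell})$.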

Note that $\xi_i$ in OP (6.8) can be substituted, which allows for an unconstrained formulation
of the SVDD.

Definition 12 (Primal Unconstrained Problem). The primal convex, non-smooth, and unconstrained
SVDD optimization problem is given by:

$$\min_{c,\,T \ge 0} \quad T + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \max(0, \|c - \phi(x_i)\|^2 - T). \tag{6.9}$$

This definition comes in handy, as solving SVDD in the primal form is sufficient and
sub-gradient based solvers can be applied.
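As an illustration of the sub-gradient approach, the sketch below runs projected sub-gradient descent on Problem (6.9). It assumes the identity feature map, a diminishing step size $\eta_t = \text{step}/\sqrt{t}$, and initialization at the data mean with $T = 0$; these choices and the function names are assumptions made for the example, not the solver used in the thesis. Since sub-gradient steps need not decrease the objective monotonically, the best iterate seen so far is returned.

```python
import numpy as np

def svdd_objective(X, c, T, nu):
    """T + 1/(l*nu) * sum_i max(0, ||c - x_i||^2 - T), with phi = identity."""
    d2 = np.sum((X - c) ** 2, axis=1)
    return T + np.maximum(0.0, d2 - T).sum() / (len(X) * nu)

def svdd_subgradient(X, nu, n_iter=500, step=0.1):
    """Projected sub-gradient descent on (6.9); returns (objective, c, T) of the best iterate."""
    X = np.asarray(X, dtype=float)
    l = len(X)
    c, T = X.mean(axis=0), 0.0                    # starting point (an arbitrary choice)
    best = (svdd_objective(X, c, T, nu), c.copy(), T)
    for t in range(1, n_iter + 1):
        diff = c - X
        active = np.sum(diff ** 2, axis=1) > T    # points currently outside the ball
        # Sub-gradients of the hinge terms: 2(c - x_i) w.r.t. c and -1 w.r.t. T.
        g_c = 2.0 * diff[active].sum(axis=0) / (l * nu)
        g_T = 1.0 - active.sum() / (l * nu)
        eta = step / np.sqrt(t)                   # diminishing step size
        c = c - eta * g_c
        T = max(0.0, T - eta * g_T)               # projection onto T >= 0
        obj = svdd_objective(X, c, T, nu)
        if obj < best[0]:
            best = (obj, c.copy(), T)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
obj, c, T = svdd_subgradient(X, nu=0.2)
```

The returned objective is never worse than that of the starting point; how close it gets to the optimum depends on the step size and the number of iterations.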

Deriving a linearly constrained quadratic program (QP) allows us to pin down the relation between
SVDDs and OC-SVMs.

Theorem 14 (Quadratic Program Formulation and Equivalence to One-class SVM). The
SVDD primal optimization problem, given by OP (6.8), can be transformed into the following
equivalent linearly constrained quadratic program (QP):

$$\begin{aligned} \min_{w,\,\rho,\,\xi \ge 0} \quad & \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \xi_i \\ \text{subject to} \quad & \langle w, \phi(x_i)\rangle \ge \rho + \frac{1}{2}\|\phi(x_i)\|^2 - \xi_i, \quad \forall i = 1, \dots, \ell, \end{aligned} \tag{6.10}$$

i.e. for L2-normalized feature vectors $\|\phi(x)\| = \text{const}$, the above formulation reduces to the
one-class SVM formulation as given in Chapter 3.


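Proof. Starting from the formulation of the primal SVDD in OP (6.8), we first expand the
constraints from $\|c - \phi(x_i)\|^2 \le T + \xi_i$ to $\|c\|^2 - 2\langle c, \phi(x_i)\rangle + \|\phi(x_i)\|^2 \le T + \xi_i$. Second,
we re-arrange terms and arrive at $\frac{\|c\|^2 - T}{2} + \frac{\|\phi(x_i)\|^2}{2} - \frac{\xi_i}{2} \le \langle c, \phi(x_i)\rangle$. In a third step,
we substitute $\rho = \frac{\|c\|^2 - T}{2} \in \mathbb{R}$, $\zeta_i = \frac{\xi_i}{2} \in \mathbb{R}_+$, and $c = w \in \mathcal{H}$, which changes the
objective function $T + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \xi_i$ to $\|w\|^2 - 2\rho + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} 2\zeta_i$. Without changing
the minimizer, we can multiply the objective by $\frac{1}{2}$ and arrive at the one-class SVM objective
$\frac{1}{2}\|w\|^2 - \rho + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \zeta_i$ with corresponding constraints $\langle w, \phi(x_i)\rangle \ge \rho + \frac{1}{2}\|\phi(x_i)\|^2 - \zeta_i$,
which proves the first part of the theorem. For the second part, a simple substitution
$\hat{\rho} = \rho + \frac{1}{2}\|\phi(x_i)\|^2 = \rho + \frac{\text{const}^2}{2} \in \mathbb{R}$ leads to the desired outcome.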

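Lemma 2. Assume $\nu \le 1/\ell$ is given, then OP (6.8) reduces to the minimum enclosing ball
(MEB) problem, i.e. it holds that $\{\xi_i\}_{i=1}^{\ell} = 0$ (hard margin).

Proof. Assume an optimal solution of OP (6.8) is given by $(T^*, c^*, \{\xi_i^*\}_{i=1}^{\ell})$. Assume another
solution $(T^* + \xi_m^*, c^*, \{0\}_{i=1}^{\ell})$, where $\xi_m^* = \max_{i \in \{1, \dots, \ell\}} \xi_i^*$, which is a feasible solution.
Therefore,

$$(T^* + \xi_m^*) + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} 0 = T^* + \xi_m^* \le T^* + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \xi_i^* \;\Rightarrow\; \nu \xi_m^* \le \frac{1}{\ell} \sum_{i=1}^{\ell} \xi_i^* = \frac{1}{\ell} \Big( \sum_{i \ne m} \xi_i^* + \xi_m^* \Big),$$

is strictly fulfilled for $\nu < 1/\ell$ and hence, any optimal solution must include $\{\xi_i^*\}_{i=1}^{\ell} = \{0\}_{i=1}^{\ell}$,
and for $\nu = 1/\ell$, the set of optimal solutions does include $\{\xi_i^*\}_{i=1}^{\ell} = \{0\}_{i=1}^{\ell}$.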

Lemma 3. Assume $0 < \nu < 1$ is given, then the non-negativity constraint in OP (6.8), $T \ge 0$,
can be omitted.

Proof. (According to [213], Theorem 3, Proof in Appendix A) Assume an optimal solution
of OP (6.8) is given by $(T^*, c^*, \{\xi_i^*\}_{i=1}^{\ell})$. Further, assume that $T^* = -|T^*|$ and another
feasible solution that does not change the constraints is given by $(0, c^*, \{\xi_i^* - |T^*|\}_{i=1}^{\ell})$, i.e.
$0 \le \|c^* - \phi(x_i)\|^2 \le -|T^*| + \xi_i^*$. It holds that

$$-|T^*| + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \xi_i^* \ge 0 + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} (\xi_i^* - |T^*|) = -\frac{|T^*|}{\nu} + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \xi_i^*$$

is true for $\nu < 1$ and hence, is a contradiction to the assumption that $-|T^*| = T^*$.

Lemma 4. Assume $\nu \ge 1$ is given, then due to the non-negativity constraint in OP (6.8), $T \ge 0$,
the optimal solution must have $T^* = 0$.

Proof. (According to [213], Theorem 3, Proof in Appendix A) Assume an optimal solution of
OP (6.8) is given by $(T^*, c^*, \{\xi_i^*\}_{i=1}^{\ell})$ and another feasible solution, that does not change the
constraints, is given by $(0, c^*, \{\xi_i^* + T^*\}_{i=1}^{\ell})$. It holds that

$$T^* + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \xi_i^* \ge 0 + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} (\xi_i^* + T^*) = \frac{T^*}{\nu} + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \xi_i^*$$

is true for $\nu \ge 1$ and hence, the optimal solution must have $T^* = 0$.

Therefore, we can now state precise primal and dual optimization problems.


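Theorem 15 (Primal Problem and Solution for $\nu \ge 1$). If $\nu \ge 1$ the primal optimization
problem reduces to

$$\min_{c} \; \sum_{i=1}^{\ell} \|c - \phi(x_i)\|^2, \tag{6.11}$$

and the optimal solution is given by $c = \frac{1}{\ell} \sum_{i=1}^{\ell} \phi(x_i)$.

Proof. According to Lemma 4, $T = 0$ and $\frac{1}{\ell\nu} > 0$ can be discarded, hence we arrive at

$$\begin{aligned} \min_{c,\,\xi \ge 0} \quad & \sum_{i=1}^{\ell} \xi_i \\ \text{subject to} \quad & \|c - \phi(x_i)\|^2 \le \xi_i \quad \forall i = 1, \dots, \ell. \end{aligned}$$

Further, $\xi_i \ge 0$ is always fulfilled due to the 2-norm, and minimization yields the smallest
possible $\xi_i = \|c - \phi(x_i)\|^2$, which reads unconstrained

$$\min_{c} \; L(c) = \sum_{i=1}^{\ell} \|c - \phi(x_i)\|^2.$$

This quadratic form has a unique optimum at $\partial L(c) / \partial c = 0$, which is $c = \frac{1}{\ell} \sum_{i=1}^{\ell} \phi(x_i)$.

For $\nu \ge 1$ the dual problem can be solved analytically by $\alpha = 1/\ell$.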
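A small numerical check can make the statement of Theorem 15 and Lemma 4 tangible. The sketch assumes the identity feature map and compares the primal value at the closed-form solution ($c$ equal to the mean, $T = 0$) against randomly perturbed candidates; the helper name `svdd_primal_value` and the perturbation scales are illustrative choices.

```python
import numpy as np

def svdd_primal_value(X, c, T, nu):
    """T + 1/(l*nu) * sum_i max(0, ||c - x_i||^2 - T), with phi = identity."""
    d2 = np.sum((X - c) ** 2, axis=1)
    return T + np.maximum(0.0, d2 - T).sum() / (len(X) * nu)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
nu = 1.5                                          # any value >= 1
c_star = X.mean(axis=0)                           # closed-form center (Theorem 15)
f_star = svdd_primal_value(X, c_star, 0.0, nu)    # T* = 0 (Lemma 4)

# Randomly perturbed candidates (c, T) with T >= 0 should never do better.
candidates = [
    svdd_primal_value(X,
                      c_star + rng.normal(scale=0.5, size=3),
                      rng.uniform(0.0, 2.0),
                      nu)
    for _ in range(200)
]
gap = min(candidates) - f_star
```

A non-negative `gap` is what the theorem predicts; the check is of course no substitute for the proof.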

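Theorem 16 (Dual Problem). For $0 < \nu \le 1$ and appropriately defined Mercer-kernel
$k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $(x, y) \mapsto \langle \phi(x), \phi(y)\rangle$, the dual problem is given by

$$\begin{aligned} \max_{0 \le \alpha \le \frac{1}{\ell\nu}} \quad & \sum_{i=1}^{\ell} \alpha_i k(x_i, x_i) - \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j k(x_i, x_j) \\ \text{subject to} \quad & \sum_{i=1}^{\ell} \alpha_i = 1 \end{aligned} \tag{6.12}$$

with expansions $c = \sum_{i=1}^{\ell} \alpha_i \phi(x_i)$.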

Proof. Due to Lemma 3, we can skip the non-negativity constraint $T \ge 0$ of the convex
OP (6.8). The resulting Lagrangian is

$$L(\alpha, \beta, c, T, \xi) = T + \frac{1}{\ell\nu} \sum_{i=1}^{\ell} \xi_i + \sum_{i=1}^{\ell} \alpha_i \big(\|c - \phi(x_i)\|^2 - T - \xi_i\big) - \sum_{i=1}^{\ell} \beta_i \xi_i,$$

and solving for the Lagrange dual function $g(\alpha, \beta)$ (with $\alpha \ge 0$, $\beta \ge 0$ and
$g(\alpha, \beta) = \min_{c, T, \xi} L(\alpha, \beta, c, T, \xi)$) yields
(1) $\frac{1}{\ell\nu} - \beta_i - \alpha_i = 0$ and hence, $0 \le \alpha \le \frac{1}{\ell\nu}$
(2) the expansion $c = \sum_{i=1}^{\ell} \alpha_i \phi(x_i)$
(3) the equality constraint $\sum_{i=1}^{\ell} \alpha_i = 1$.
Substitution and re-arrangement then gives us the dual optimization problem in OP (6.12).
In order for strong duality to hold, some constraint qualifications, such as Slater’s condition,
must be fulfilled (which holds trivially, cf. [213] Section 3.1). For any primal $(c^*, T^*, \xi^*)$
and dual optimal solution $(\alpha^*, \beta^*)$, the complementary slackness constraints are given by
$\alpha_i^* (\|c^* - \phi(x_i)\|^2 - T^* - \xi_i^*) = 0$ and $\beta_i^* \xi_i^* = 0$.
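The quantities appearing in the dual can be evaluated from kernel values alone. The sketch below is not a dual solver; it only computes the objective of (6.12), checks feasibility, and evaluates $\|c - \phi(x)\|^2$ through the expansion $c = \sum_m \alpha_m \phi(x_m)$. It assumes a linear kernel so that the result can be compared against an explicit computation; the function names are hypothetical.

```python
import numpy as np

def dual_objective(alpha, K):
    """sum_i alpha_i k(x_i, x_i) - sum_ij alpha_i alpha_j k(x_i, x_j), cf. (6.12)."""
    return alpha @ np.diag(K) - alpha @ K @ alpha

def is_dual_feasible(alpha, nu, tol=1e-9):
    """Box constraint 0 <= alpha_i <= 1/(l*nu) and sum_i alpha_i = 1."""
    upper = 1.0 / (len(alpha) * nu)
    return bool(np.all(alpha >= -tol) and np.all(alpha <= upper + tol)
                and abs(alpha.sum() - 1.0) <= tol)

def kernel_sq_dist(alpha, K, k_x, k_xx):
    """||c - phi(x)||^2 for c = sum_m alpha_m phi(x_m), using kernel values only."""
    return alpha @ K @ alpha - 2.0 * (alpha @ k_x) + k_xx

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))
K = X @ X.T                                   # linear kernel, i.e. phi = identity
alpha = np.full(len(X), 1.0 / len(X))         # uniform weights: c is the data mean
x_new = np.array([1.0, -2.0])
d_kernel = kernel_sq_dist(alpha, K, X @ x_new, x_new @ x_new)
d_explicit = np.sum((X.mean(axis=0) - x_new) ** 2)
```

With uniform weights the kernel expansion reproduces the distance to the data mean, and these weights are feasible for $\nu = 1$ but violate the box constraint for $\nu > 1$.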

Interestingly, the above formulation reduces to the dual one-class SVM optimization
problem [7], if $k(x, y)$ is a constant for $x = y$. Also, the dual formulation allows for a
neat interpretation of the $\nu$ parameter.

Theorem 17. Given $0 < \nu \le 1$, then $\lceil \ell\nu \rceil$ is a lower bound on the number of support vectors
and an upper bound on the number of outliers.

Proof. Due to the complementary slackness constraints (Thm. 16, cf. [213], Eq. (12,17)), we
know that constraints in Eq. (6.8) that are not strictly fulfilled yield $\alpha_i^* = 1/(\ell\nu)$
($\xi_i^* > 0 \Rightarrow \beta_i^* = 0$, and $\frac{1}{\ell\nu} - \beta_i^* - \alpha_i^* = 0$ must hold), whereas constraints that are strictly fulfilled
receive $\alpha_i^* = 0$ ($\xi_i^* = 0$, $\|c^* - \phi(x_i)\|^2 < T^*$, and complementary slackness must hold).
For data points lying exactly on the border, it holds that $0 \le \alpha_i^* \le 1/(\ell\nu)$ ($\xi_i^* = 0$ and
$\|c^* - \phi(x_i)\|^2 = T^*$). Therefore, in order to fulfill the equality constraint in Problem (6.12),
at most $\lceil \ell\nu \rceil$ data points can strictly lie outside and there must be at least that many support
vectors.

It therefore makes sense to restrict $\nu$ to the range $]0, 1]$.

ClusterSVDD In this section, we introduce our unifying formulation ClusterSVDD and
prove that k-means and SVDD can be recovered as special cases. The core idea is displayed
in Figure 6.1.

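Definition 13 (Primal Problem). Primal non-convex ClusterSVDD optimization problem (again
$0 < \nu \le 1$):

$$\begin{aligned} \min_{\{c_j\}_{j=1}^{k},\,T \ge 0,\,\xi \ge 0} \quad & \sum_{j=1}^{k} T_j + \sum_{j=1}^{k} \sum_{i=1}^{\ell} \frac{1[z_i = j]}{\sum_{l} 1[z_l = j]\,\nu}\, \xi_i \\ \text{subject to} \quad & \|c_{z_i} - \phi(x_i)\|_2^2 \le T_{z_i} + \xi_i, \quad \forall i = 1, \dots, \ell \\ \text{with} \quad & z_i = \operatorname{argmin}_{\hat{z} \in \{1, \dots, k\}} \|c_{\hat{z}} - \phi(x_i)\|^2 - T_{\hat{z}} \end{aligned} \tag{6.13}$$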

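Theorem 18 (Decomposability). The Problem (6.13) is decomposable into $k$ sub-problems with
$k$ disjoint sets of hypersphere constraints and $\ell$ global cluster membership constraints.

Proof. Notice that the data can be partitioned, that is, each datum $x_i$ can only belong to
a single set $S_j$ at any given time, where $i \in S_j$ for $j \in \{1, \dots, k\}$ iff $z_i = j$. It follows
that $S_i \cap S_j = \emptyset$ for $i \ne j$ and $\cup_{j=1}^{k} S_j = \{1, \dots, \ell\}$. Re-writing
$\sum_{i=1}^{\ell} \frac{1[z_i = j]}{\sum_{l} 1[z_l = j]\,\nu}\, \xi_i = \frac{1}{|S_j|\nu} \sum_{i \in S_j} \xi_i$ (in Problem (6.13)) and arranging terms accordingly yields

$$\begin{aligned} \min_{\{c_j\}_{j=1}^{k},\,T \ge 0,\,\xi \ge 0} \; \sum_{j=1}^{k} \Big( T_j + \frac{1}{|S_j|\nu} \sum_{i \in S_j} \xi_i \Big) = \; & \sum_{j=1}^{k} \min_{c_j,\,T_j \ge 0,\,\xi \ge 0} \; T_j + \frac{1}{|S_j|\nu} \sum_{i \in S_j} \xi_i \\ \text{subject to} \quad & \|c_j - \phi(x_i)\|^2 \le T_j + \xi_i, \quad \forall i \in S_j, \; j = 1 \\ & \qquad \vdots \\ & \|c_j - \phi(x_i)\|^2 \le T_j + \xi_i, \quad \forall i \in S_j, \; j = k \\ \text{with} \quad & z_i = \operatorname{argmin}_{\hat{z} \in \{1, \dots, k\}} \|c_{\hat{z}} - \phi(x_i)\|^2 - T_{\hat{z}}, \quad \forall i = 1, \dots, \ell. \end{aligned} \tag{6.14}$$

The above optimization problem is now decomposed into $k$ distinct SVDD optimization problems
that are coupled solely through the global cluster assignment constraint. Hence, by applying
the notation introduced in Def. 11, the ClusterSVDD optimization problem OP (6.14)
can be written as

$$\begin{aligned} & \sum_{j=1}^{k} \mathrm{Svdd}(\nu, \{x_i\}_{i \in S_j}) \\ \text{subject to} \quad & z_i = \operatorname{argmin}_{\hat{z} \in \{1, \dots, k\}} \|c_{\hat{z}} - \phi(x_i)\|^2 - T_{\hat{z}}, \quad \forall i = 1, \dots, \ell. \end{aligned} \tag{6.15}$$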

Figure 6.1 – The model: Fitting multiple hyperspheres simultaneously with a pre-defined outlier fraction is the core idea of our proposed method ClusterSVDD.

This is also an interesting result for the optimization in Section 6.3. Notably,
given the partitions $\{S_j\}_{j=1}^{k}$, OP (6.15) is just a sum of convex optimization problems,
which itself is a convex optimization problem [43]. Because the problem decomposes
neatly into SVDD sub-problems (with exact primal-dual relations where strong duality holds),
using kernels is straightforward by simply solving the dual SVDD Problem (6.12) instead of
the primal version. We now proceed further and show the equivalence to k-means when $\nu \ge 1$.

Theorem 19 (Equivalence I). Assume $\nu \ge 1$ is given and $\phi: \mathcal{X} \to \mathcal{X}$, $x \mapsto x$ is
the identity function $\mathrm{id}_{\mathcal{X}}$, then the ClusterSVDD optimization problem is identical to the k-means
optimization problem: OP (6.13) = OP (6.7).

Proof. Since the OP (6.13) can be decomposed into OP (6.15) and Thm. 15 holds for each sub-SVDD,

$$\begin{aligned} \sum_{j=1}^{k} \mathrm{Svdd}(\nu \ge 1, \{x_i\}_{i \in S_j}) = \; & \sum_{j=1}^{k} \min_{c_j} \sum_{i \in S_j} \|c_j - \phi(x_i)\|^2 \\ \text{subject to} \quad & z_i = \operatorname{argmin}_{\hat{z} \in \{1, \dots, k\}} \|c_{\hat{z}} - \phi(x_i)\|^2, \quad \forall i = 1, \dots, \ell \end{aligned}$$

which is identical to the k-means OP (6.7).

Theorem 20 (Equivalence II). Assume $k = 1$ is given, then the ClusterSVDD optimization problem
is identical to the SVDD optimization problem: OP (6.13) = OP (6.8).

Proof. Since the OP (6.13) can be decomposed into OP (6.15), the sum can be omitted as well
as the cluster membership constraints, as they always deliver
$1 = z_i = \operatorname{argmin}_{\hat{z} \in \{1, \dots, k\}} \|c_{\hat{z}} - \phi(x_i)\|^2$, $\forall i = 1, \dots, \ell$. The resulting optimization problem is
$\mathrm{Svdd}(\nu, \{x_i\}_{i=1}^{\ell})$, which is in fact the original SVDD formulation as defined in Def. 11.

Relation to Kernel k-means and Spectral Clustering From the decomposability theorem
(Thm. 18) a kernelized version of our ClusterSVDD can be derived using the dual of
the SVDD as given in Thm. 16. Due to the expansion $c = \sum_{i=1}^{\ell} \alpha_i \phi(x_i)$ of a single SVDD,
we can equivalently rewrite the global cluster membership constraint of OP (6.13) as

$$z_i := \operatorname{argmin}_{j \in \{1, \dots, k\}} \sum_{m,n \in S_j} \alpha_m \alpha_n k_j(x_m, x_n) - 2 \sum_{m \in S_j} \alpha_m k_j(x_m, x_i) + k_j(x_i, x_i) - T_j.$$

Moreover, a proper dual version of k-means can be derived as a special case due to Thm. 19,
which ensures the equivalence to kernel k-means [204]. Interestingly, Dhillon et al. [203]
showed that an explicit theoretical connection between kernel k-means and spectral clustering
[216] can be drawn under certain conditions. In turn, there is also a connection
between our ClusterSVDD and spectral clustering with kernel k-means being the link.

Algorithm 9 ClusterSVDD
input data $x_1, \dots, x_\ell$ and outlier fraction $\nu > 0$
put $t = 0$
choose $z_i^t \in \{1, \dots, k\}$ $\forall i \in \{1, \dots, \ell\}$ (e.g. randomly)
let $(c_j^t, T_j^t)$ be the optimal arguments when solving the SVDD optimization problem
OP (6.8) with subset $x_i, i \in S_j$, $\forall j = 1, \dots, k$.
repeat
    $t := t + 1$
    for $i = 1, \dots, \ell$ do
        $z_i^t := \operatorname{argmin}_{j \in \{1, \dots, k\}} \|c_j^{t-1} - \phi(x_i)\|^2 - T_j^{t-1}$
    end for
    let $(c_j^t, T_j^t)$ be the optimal arguments when solving the SVDD optimization problem
    OP (6.8) with subset $x_i, i \in S_j$, $\forall j = 1, \dots, k$.
until $\forall i = 1, \dots, \ell: z_i^t = z_i^{t-1}$
Return optimal model parameters $c_\cdot := c_\cdot^t$, $T_\cdot = T_\cdot^t$, the cluster memberships $z_i := z_i^t$
$\forall i = 1, \dots, \ell$, and the anomaly scores $s_i := \|c_{z_i}^t - \phi(x_i)\|^2 - T_{z_i}^t$
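The alternating structure of Algorithm 9 can be sketched as below. To keep the example short and exactly reproducible, the SVDD sub-problem is replaced by a simplified stand-in (`fit_ball`, a hypothetical helper): the center is fixed to the cluster mean, which is the SVDD optimum only for $\nu \ge 1$, and $T$ is then chosen as an exact minimizer of $T + \frac{1}{n\nu}\sum_i \max(0, d_i - T)$ for that center. A faithful implementation would call a real SVDD solver for OP (6.8) at this point. The identity feature map and 0-based cluster indices are further assumptions.

```python
import numpy as np

def fit_ball(X, nu):
    """Simplified stand-in for Svdd(nu, X): mean center (heuristic for nu < 1),
    and T as an exact minimizer of T + 1/(n*nu) * sum_i max(0, d_i - T) given that center."""
    c = X.mean(axis=0)
    d2 = np.sort(np.sum((X - c) ** 2, axis=1))
    n_out = int(np.floor(len(X) * nu))            # number of points allowed outside
    T = d2[len(X) - n_out - 1] if n_out < len(X) else 0.0
    return c, T

def cluster_svdd(X, k, nu, z_init, max_iter=50):
    X = np.asarray(X, dtype=float)
    z = np.array(z_init, dtype=int)
    centers = np.zeros((k, X.shape[1]))
    T = np.zeros(k)
    for _ in range(max_iter):
        # Refit one ball per non-empty cluster (empty clusters keep their parameters).
        for j in range(k):
            if np.any(z == j):
                centers[j], T[j] = fit_ball(X[z == j], nu)
        # Reassign: z_i = argmin_j ||c_j - x_i||^2 - T_j.
        scores = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) - T[None, :]
        z_new = scores.argmin(axis=1)
        if np.array_equal(z_new, z):
            break
        z = z_new
    s = scores[np.arange(len(X)), z]              # anomaly scores s_i
    return z, centers, T, s

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X = np.vstack([A, A + 10.0])
z, centers, T, s = cluster_svdd(X, k=2, nu=0.5, z_init=[0, 0, 0, 1, 1, 1, 1, 0])
```

In this toy run the two wrongly initialized points are reassigned after the first pass and the procedure stops once the memberships no longer change. With $\nu \ge 1$ the stand-in returns $T_j = 0$ and the loop coincides with k-means, in line with Theorem 19.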


Optimization Following the ideas of CCCP [197] (concave-convex procedure), a variant

of DC-programming [52] (diﬀerence of convex functions), which itself is a special instance

of MM (majorization-minimization), we separate the problem into two sub-problems:

(1) inferring the partition, and

(2) calculating the new hypersphere centers and radii.

This approach does not guarantee the globally optimal solution (except for k= 1), but will

provide locally optimal solutions. Due to Theorem (18), the optimization is similar to the

original k-means optimization, where the ﬁrst step also considers kernels, and the second

step can be solved using existing SVDD implementations. The resulting optimization algo-

rithm is described in Algorithm 9 for its primal form and in Algorithm 10 for its kernelized

counterpart. Despite the non-convex nature of the optimization problem, we found in our

experiments that the algorithm tends to converge fast.

Algorithm 10 Kernel ClusterSVDD
input data $x_1, \dots, x_\ell$ and outlier fraction $\nu > 0$
put $t = 0$
choose $z_i^t \in \{1, \dots, k\}$ $\forall i \in \{1, \dots, \ell\}$ (e.g. randomly), thereby fixing $S_j^t$, $\forall j = 1, \dots, k$
let $(\alpha_j^t, T_j^t)$ be the optimal arguments when solving the SVDD dual optimization problem
OP (6.12) with subset $x_i, i \in S_j^t$, $\forall j = 1, \dots, k$ for the corresponding kernel $k_j$.
repeat
    $t := t + 1$
    for $i = 1, \dots, \ell$ do
        $z_i^t := \operatorname{argmin}_{j \in \{1, \dots, k\}} \sum_{m,n \in S_j^{t-1}} \alpha_m^{t-1} \alpha_n^{t-1} k_j(x_m, x_n) - 2 \sum_{m \in S_j^{t-1}} \alpha_m^{t-1} k_j(x_m, x_i) + k_j(x_i, x_i) - T_j^{t-1}$
        Note: $z_i^t$, $\forall i = 1, \dots, \ell$ implies $S_j^t$, $\forall j = 1, \dots, k$
    end for
    let $(\alpha_j^t, T_j^t)$ be the optimal arguments when solving the SVDD dual optimization problem
    OP (6.12) with subset $x_i, i \in S_j^t$, $\forall j = 1, \dots, k$ for the corresponding kernel $k_j$.
until $\forall i = 1, \dots, \ell: z_i^t = z_i^{t-1}$
Return optimal model parameters $\alpha_\cdot := \alpha_\cdot^t$, $T_\cdot = T_\cdot^t$, the cluster memberships $z_i := z_i^t$
$\forall i = 1, \dots, \ell$, and the anomaly scores (with $j = z_i$)
$s_i := \sum_{m,n \in S_j^t} \alpha_m^t \alpha_n^t k_j(x_m, x_n) - 2 \sum_{m \in S_j^t} \alpha_m^t k_j(x_m, x_i) + k_j(x_i, x_i) - T_j^t$

6.4 Extension to Non-independent Samples

Inferring latent classes from observations imposes a restriction on our model that might
not be wanted: similar observations will always be grouped together. While this is reasonable,
there are settings where such properties are undesired. However, to overcome this
restriction, more information is necessary, e.g. dependency structure or additional label
information.

In detail, the requirements for our model to cope with this setting are the following:
(i) we are only interested in predicting the data points that we already have (transductive
setting); (ii) of those, none or only few carry actual labels that we are interested in (scarce
labeled data); (iii) data points with distinct labels can have the same behavioral values
(overlapping clusters); (iv) selection of cluster based on structure (inference of structures in latent
space); (v) labels are based on the inferred structured latent states as well as their respective
behavioral values. The resulting model is displayed in Figure 6.2.

Figure 6.2 – The model: data points are connected through latent variables. Observations $x$ are given but none or only few corresponding target values $y$ are known.

We designed this method with a specific application in mind (cf. Section 6.5.2) which is, rather
surprisingly, a regression setting, and the resulting method is called transductive conditional random
field regression (TCRFR). However, due to the sketched properties and its modularity, an extension
towards one-class classification using methods derived earlier in this chapter (LatentSVDD,
Section 6.2) will be discussed in detail.

Hence, our main contributions

in this section are

• we derive a transductive regression method that leverages latent class dependency

structure (transductive conditional random ﬁeld regression, TCRFR);

• we present a corresponding solver based on loopy belief propagation and linear pro-

gram approximations;

• we extend the methodology to contextual one-class classiﬁcation based on the La-

tentSVDD (cf. Section 6.2).

As we briefly leave the beaten track of one-class classification, we start by reviewing the

work that is related to the regression setting that we bear in mind.

Related Work According to the described properties, we grouped related work into 3 dis-

tinct classes:

Methods in group one can best be described as general purpose. These are algorithms
that are fast, easy to apply and make only a few assumptions about the data. This,
however, comes at the price of not leveraging all the information available and, therefore,
yields less accurate predictions.

Methods in the second group are technically most closely related to our method and can

be described as methods dealing with structured data. However, interestingly, none of these

methods can be applied to our setting. That is, each and every method assumes IID training

data. Further, of 7 methods, only 4 are regression methods [217–219] and only 3 consider
continuous labels [217,219]. All of these remaining 3 methods assume completely known
latent states for training, which our setting does not provide. Structure has been modeled
by conditional random fields (CRFs) [158], and extensions thereof comprise diverse continuous
methods [217,220]. For kernel machines, the classical structured output support vector
machines (SSVM) [157] allow learning on joint feature maps; for extensions to regression,
see [218,219,221]. Extensions to semi-supervised settings have been developed [222].

Finally, methods in the third group do make many more assumptions on the data to ex-

tract more information for higher prediction accuracies. Mostly, these methods are special-

ized, advanced versions of their general purpose counterparts in the ﬁrst group. Transductive

Regression [223,224] copes with the semi-supervised setting by inferring virtual labels for

unlabeled examples by superposition of information of labeled examples [223]. Here, interactions
between examples are imposed implicitly by choosing an appropriate metric. However,
those methods do not take latent dependency structure into account. Another line of research
is the mixture of experts model [20,225,226], where multiple regression models (experts) are
trained, and one (or a weighted sum) thereof is used to predict the output label of new
samples. Laplacian regularized learning machines [227–229] assume that data lies on a manifold
in transductive or semi-supervised settings. We will apply this technique to kernelized
support vector regression, which itself includes the function class of (kernel) ridge regression.

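Transductive Conditional Random Field Regression (TCRFR) Given a labeled sample
set $S = \{(x_i, y_i) \in \mathbb{R}^D \times \mathbb{R}\}_{i=1}^{n}$ and an unlabeled sample set $U = \{x_i \in \mathbb{R}^D\}_{i=n+1}^{n+m}$, consider
a regression model with Gaussian noise:

\[ y = f(x; w) + \epsilon, \qquad \epsilon = y - f(x; w) \sim \mathcal{N}(0, \sigma^2), \]
\[ p(y \mid x, w) \propto \exp\Big(-\frac{1}{2\sigma^2}\, |y - f(x; w)|^2\Big), \]

where $x \in \mathbb{R}^D$ and $y \in \mathbb{R}$ are input and output variables, respectively, and $f(x; w) = \langle w, x \rangle$
is a linear regression function with an unknown parameter $w \in \mathbb{R}^D$. $\sigma^2$ denotes the
noise variance. We assume a Gaussian prior for $w$: $p(w) \propto \exp\big(-\frac{\lambda'}{2}\|w\|^2\big)$. Then, the
maximum a posteriori (MAP) estimator is obtained by maximizing the joint distribution of
$\{y_i\}_{i=1}^{n}$ and $w$ (assuming IID data):

\[ \max_{w \in \mathbb{R}^D} p(\{y_i\}_{i=1}^{n} \mid \{x_i\}_{i=1}^{n}, w)\, p(w) = \max_{w \in \mathbb{R}^D} \prod_{i=1}^{n} p(y_i \mid x_i, w)\, p(w), \tag{6.16} \]

or, equivalently (up to additive constants and a positive factor), minimizing the negative
logarithm of the joint distribution, $\min_{w \in \mathbb{R}^D} L_0(w)$, where

\[ L_0(w) = \lambda' \|w\|_2^2 + \sum_{i=1}^{n} \frac{|y_i - \langle w, x_i \rangle|^2}{\sigma^2}. \tag{6.17} \]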

This is the standard ridge regression setting, which we extend threefold. First, in the spirit
of kernel ridge regression, we introduce feature functions $\phi: \mathbb{R}^D \to \mathcal{X}$ for the input data.
Moreover, we explicitly model the dependency of the regression function $f(x)$ on a latent
variable $\pi \in \mathcal{Z}$ using local joint feature maps $\Phi: \mathcal{X} \times \mathcal{Z} \to \mathcal{H}_1$ on the labeled sample set
$S$. Second, we focus on predicting labels on the unlabeled data set $U$ only. Finally, we
respect dependencies among the inputs of the labeled and unlabeled sample sets $S$ and $U$ that
can be exploited, e.g., shared spatial relations, which can be modeled by conditional random
fields (CRFs) using a global joint feature map $\Psi: \bigotimes_{i=1}^{n+m} \mathcal{X} \times \bigotimes_{i=1}^{n+m} \mathcal{Z} \to \mathcal{H}_2$. Note that both
the local features $\Phi$ and the global features $\Psi$ map the samples into reproducing kernel
Hilbert spaces $\mathcal{H}_{\cdot}$ that correspond to kernel functions [33]. This is a principled way of
encoding arbitrary dependencies between $x$ and $\pi$, as is common in the structured output
literature [157].

With these extensions, we tackle the problem of inferring latent variables under spatio-

temporal structure from few precise output measurements and many noisy input measure-

ments.

We propose transductive conditional random field regression (TCRFR, cf. Fig. 6.2), which
consists mainly of two parts: (a) a least-squares regression part with parameter $u$, conditioned
on the latent states and input instances, and (b) a conditional random field part with
parameter $v$ that explicitly models the dependencies of the latent variables and is conditioned
on the input instances only. Both parts receive a Gaussian prior for stabilization, and
we are only interested in maximum a posteriori estimates (starting from the ridge regression
likelihood, cf. Eq. (6.16)):

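\begin{align*}
& \max_{u}\; p(\{y_i\}_{i=1}^{n} \mid \{x_i\}_{i=1}^{n+m}, u)\, p(u) \\
& \quad \geq \max_{u, v, \{\pi_i\}_{i=1}^{n+m}} p(\{y_i\}_{i=1}^{n}, \{\pi_i\}_{i=1}^{n+m}, v \mid \{x_i\}_{i=1}^{n+m}, u)\, p(u) \\
& \quad = \max_{u, v, \{\pi_i\}_{i=1}^{n+m}} p(\{y_i\}_{i=1}^{n}, \{\pi_i\}_{i=1}^{n+m} \mid \{x_i\}_{i=1}^{n+m}, u, v)\, p(u)\, p(v) \\
& \quad = \max_{u, v, \{\pi_i\}_{i=1}^{n+m}} p(\{y_i\}_{i=1}^{n} \mid \{\pi_i\}_{i=1}^{n}, \{x_i\}_{i=1}^{n}, u)\, p(u)\; p(\{\pi_i\}_{i=1}^{n+m} \mid \{x_i\}_{i=1}^{n+m}, v)\, p(v) \\
& \quad = \max_{u, v, \{\pi_i\}_{i=1}^{n+m}} \prod_{i=1}^{n} p(y_i \mid \pi_i, x_i, u)\, p(u)\; p(\{\pi_i\}_{i=1}^{n+m} \mid \{x_i\}_{i=1}^{n+m}, v)\, p(v). \tag{6.18}
\end{align*}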

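The probabilities are defined accordingly:

\begin{align*}
p(y \mid \pi, x, u) &\propto \exp\Big(-\frac{|y - \langle u, \Phi(x, \pi) \rangle|^2}{2\sigma^2}\Big), \tag{6.19} \\
p(u) &\propto \exp\Big(-\frac{\lambda'}{2}\|u\|^2\Big), \tag{6.20} \\
p(\{\pi_i\}_{i=1}^{n+m} \mid \{x_i\}_{i=1}^{n+m}, v) &= \frac{1}{Z(\{x_i\}_{i=1}^{n+m}, v)} \exp\big(\langle v, \Psi(\{x_i\}_{i=1}^{n+m}, \{\pi_i\}_{i=1}^{n+m}) \rangle\big), \tag{6.21} \\
p(v) &\propto \exp\Big(-\frac{1}{2} v^\top \Gamma v\Big), \tag{6.22}
\end{align*}

where $\lambda'$ and $\Gamma \in S_+^{\dim \mathcal{H}_2}$ (positive semi-definite matrix) are regularization constants and
$Z(\{x_i\}_{i=1}^{n+m}, v) = \sum_{\hat{\Pi} \in \bigotimes_{i=1}^{n+m} \mathcal{Z}} \exp\big(\langle v, \Psi(\{x_i\}_{i=1}^{n+m}, \hat{\Pi}) \rangle\big)$ is the partition function. Thus,
the MAP estimator for all unknown variables, including the model parameters $u \in \mathcal{H}_1$ and
$v \in \mathcal{H}_2$, and the latent variables $\{\pi_i\}_{i=1}^{n+m}$, can be obtained by solving the following problem: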

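\[ \min_{u \in \mathcal{H}_1,\, v \in \mathcal{H}_2,\, \{\pi_i \in \mathcal{Z}\}_{i=1}^{n+m}} L(u, v, \{\pi_i\}_{i=1}^{n+m}), \tag{6.23} \]

where $L(u, v, \{\pi_i\}_{i=1}^{n+m})$ is a convex combination of the objectives of the regression model
and the conditional random field:

\[ L(u, v, \{\pi_i\}_{i=1}^{n+m}) = \theta\, L_{\mathrm{rr}}(u, \{\pi_i\}_{i=1}^{n}) + (1 - \theta)\, L_{\mathrm{crf}}(v, \{\pi_i\}_{i=1}^{n+m}), \tag{6.24} \]

where

\begin{align*}
L_{\mathrm{rr}}(u, \{\pi_i\}_{i=1}^{n}) &= \frac{\lambda}{2}\|u\|_2^2 + \frac{1}{2}\sum_{i=1}^{n} |y_i - \langle u, \Phi(x_i, \pi_i) \rangle|^2, \tag{6.25} \\
L_{\mathrm{crf}}(v, \{\pi_i\}_{i=1}^{n+m}) &= \frac{1}{2}\|v \Gamma^{\frac{1}{2}}\|_2^2 - \langle v, \Psi(\{x_i\}_{i=1}^{n+m}, \{\pi_i\}_{i=1}^{n+m}) \rangle + \log Z(\{x_i\}_{i=1}^{n+m}, v). \tag{6.26}
\end{align*}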

Here, we re-parameterize the noise and the regularization parameters as $\sigma^2 \to (1 - \theta)\theta^{-1}$ and
$\sigma^2 \lambda' \to \lambda$ for the regression part, so that the trade-off between the regression loss and the
latent structure loss is explicit. From Eq. (6.24), it is apparent that $0 \leq \theta \leq 1$ is the parameter
of a convex combination that weighs the CRF and the ridge regression objective functions.
Setting $\theta = 1$ therefore assigns 100% weight to the ridge regression part, omitting any CRF
input. In practice, $\theta$ will have to lie in the range $0 < \theta < 1$. Also, in most applications,
labeled data will be sparse, and hence it is expected that $\theta > 0.5$ will prove preferable in
those cases.
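To make the re-parameterization explicit, the following short derivation sketches how Eq. (6.24) arises from the MAP problem in Eq. (6.18). The intermediate objective $\tilde{L}$ is notation introduced here only for illustration, and additive constants are dropped. Taking the negative logarithm of the last line of Eq. (6.18) with the densities (6.19)–(6.22) gives:

```latex
\begin{align*}
\tilde{L}(u, v, \{\pi_i\}_{i=1}^{n+m})
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} |y_i - \langle u, \Phi(x_i, \pi_i) \rangle|^2
     + \frac{\lambda'}{2} \|u\|^2
     + L_{\mathrm{crf}}(v, \{\pi_i\}_{i=1}^{n+m}) \\
  &= \frac{1}{\sigma^2} \Big( \frac{\sigma^2 \lambda'}{2} \|u\|^2
     + \frac{1}{2} \sum_{i=1}^{n} |y_i - \langle u, \Phi(x_i, \pi_i) \rangle|^2 \Big)
     + L_{\mathrm{crf}}(v, \{\pi_i\}_{i=1}^{n+m}) \\
  &= \frac{\theta}{1 - \theta}\, L_{\mathrm{rr}}(u, \{\pi_i\}_{i=1}^{n})
     + L_{\mathrm{crf}}(v, \{\pi_i\}_{i=1}^{n+m})
     \qquad \text{with } \sigma^2 = \frac{1 - \theta}{\theta},\; \lambda = \sigma^2 \lambda' \\
  &= \frac{1}{1 - \theta} \Big( \theta\, L_{\mathrm{rr}}(u, \{\pi_i\}_{i=1}^{n})
     + (1 - \theta)\, L_{\mathrm{crf}}(v, \{\pi_i\}_{i=1}^{n+m}) \Big).
\end{align*}
```

Since $1/(1 - \theta) > 0$ for $0 < \theta < 1$, minimizing $\tilde{L}$ is equivalent to minimizing the convex combination in Eq. (6.24).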

Algorithm 11 Transductive Conditional Random Field Regression (TCRFR)

  input data $S$ and $U$
  put $t = 0$ and initialize $u^t$ and $v^t$ (e.g., randomly)
  repeat
    $t := t + 1$
    Update $\{\pi_i^t\}_{i=1}^{n+m}$ by Eq. (6.27) using the intermediate solutions $u^{t-1}$ and $v^{t-1}$
    Update $u^t$ by Eq. (6.28) and $\{\pi_i^t\}_{i=1}^{n+m}$
    Update $v^t$ by Eq. (6.29) and $\{\pi_i^t\}_{i=1}^{n+m}$
  until $\forall i = 1, \ldots, n+m: \pi_i^t = \pi_i^{t-1}$
  Predict unlabeled examples $U$ using the inferred states $\{\pi_i^t\}_{i=n+1}^{n+m}$ and regression
  parameter $u^t$: $y_i = \langle u^t, \Phi(x_i, \pi_i^t) \rangle$

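To solve problem (6.23), we adopt a CCCP-style scheme [51,197], a kind of majorization-minimization
scheme, which has been successfully used in structured output settings with
latent variables [230]. In each ($t$-th) iteration, we infer the most likely configuration $\{\pi_i\}$,
given $u$ and $v$, for all training examples,

\begin{align*}
\{\hat{\pi}_i\}_{i=1}^{n+m} &= \operatorname*{argmin}_{\{\pi_i \in \mathcal{Z}\}_{i=1}^{n+m}} L(u, v, \{\pi_i\}_{i=1}^{n+m}) \\
&= \operatorname*{argmin}_{\{\pi_i \in \mathcal{Z}\}_{i=1}^{n+m}} \frac{\theta}{2} \sum_{i=1}^{n} |y_i - \langle u, \Phi(x_i, \pi_i) \rangle|^2 - (1 - \theta) \langle v, \Psi(\{x_i\}_{i=1}^{n+m}, \{\pi_i\}_{i=1}^{n+m}) \rangle, \tag{6.27}
\end{align*}

and then update the ridge regression parameter $u$ and the CRF parameter $v$, respectively,

\begin{align*}
u &= \operatorname*{argmin}_{u \in \mathcal{H}_1} L(u, v, \{\pi_i\}_{i=1}^{n+m}) = \operatorname*{argmin}_{u \in \mathcal{H}_1} L_{\mathrm{rr}}(u, \{\pi_i\}_{i=1}^{n}), \tag{6.28} \\
v &= \operatorname*{argmin}_{v \in \mathcal{H}_2} L(u, v, \{\pi_i\}_{i=1}^{n+m}) = \operatorname*{argmin}_{v \in \mathcal{H}_2} L_{\mathrm{crf}}(v, \{\pi_i\}_{i=1}^{n+m}). \tag{6.29}
\end{align*}

Below (cf. Algorithm 11), we detail how to perform each of the steps (6.27)–(6.29).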

For Algorithm 11, we can show that each iteration monotonically decreases the objective
if certain assumptions are met.

Theorem 21 (Monotonicity of Convergence for Algorithm 11). Given a minimizer for
the inference problem in Eq. (6.27) that satisfies

\[ L(u^t, v^t, \{\pi_i^{t+1}\}_{i=1}^{n+m}) \leq L(u^t, v^t, \{\pi_i^t\}_{i=1}^{n+m}), \tag{6.30} \]

the objective in Eq. (6.23), i.e., the negative log-likelihood, decreases monotonically (never
increases) with the number of iterations $t$, i.e., $L(u^t, v^t, \{\pi_i^t\}_{i=1}^{n+m}) \leq L(u^{t-1}, v^{t-1}, \{\pi_i^{t-1}\}_{i=1}^{n+m})$.

Proof. We have

\begin{align*}
L(u^{t+1}, v^{t+1}, \{\pi_i^{t+1}\}_{i=1}^{n+m}) &\leq \min_{u} L(u, v^{t+1}, \{\pi_i^{t+1}\}_{i=1}^{n+m}) \\
&\leq \min_{v} L(u^t, v, \{\pi_i^{t+1}\}_{i=1}^{n+m}) \leq L(u^t, v^t, \{\pi_i^{t+1}\}_{i=1}^{n+m}) \leq L(u^t, v^t, \{\pi_i^t\}_{i=1}^{n+m}),
\end{align*}

since Assumption (6.30) must hold, and due to the convexity of $L_{\mathrm{rr}}$ and $L_{\mathrm{crf}}$ (for fixed $\pi$).

Choice of Joint Feature Maps Given an undirected graph $G = (V, E)$ with binary edges
$E$ and vertices $V$, where each vertex corresponds to a sample and the state space is $S = \mathcal{Z}$,
the global joint feature map is

\[ \Psi(\{x_i\}_{i=1}^{n+m}, \{\pi_i\}_{i=1}^{n+m}) = \Big( \big( \textstyle\sum_{(e_1, e_2) \in E} 1[\pi_{e_1} = s_1 \wedge \pi_{e_2} = s_2] \big)_{s_1, s_2 \in S},\; \big( \textstyle\sum_{v \in V} 1[\pi_v = s]\, \phi(x_v) \big)_{s \in S} \Big). \tag{6.31} \]

The graph model basically consists of two parts: a transition part and an emission part.
Accordingly, we fix the regression joint feature map to be

\[ \Phi(x, \pi) = \phi(x) \otimes \Lambda(\pi), \tag{6.32} \]

where $\Lambda(\pi) \in \{0, 1\}^K$ with entries $(\Lambda(\pi))_k = 1$ if $\pi = k$ and $0$ otherwise. $K \in \mathbb{N}_+$ is the
number of hidden states and $\phi: \mathbb{R}^D \to \mathcal{X}$ the feature function. Basically, the regression
map is a $K$-times replicated feature vector where all parts that do not correspond to the
currently active state $\pi$ are set to zero. For further information and examples of joint feature
maps, we refer to [157].
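The block structure of Eq. (6.32) can be illustrated with a few lines of NumPy. This is a minimal sketch, not the implementation used in the thesis: the function name `joint_feature_map` and the toy vector are hypothetical, and the blocks are laid out state by state, which equals $\phi(x) \otimes \Lambda(\pi)$ up to a fixed permutation of the coordinates.

```python
import numpy as np

def joint_feature_map(phi_x: np.ndarray, state: int, num_states: int) -> np.ndarray:
    """Copy phi(x) into the block of the active state; all other blocks are zero."""
    one_hot = np.zeros(num_states)
    one_hot[state] = 1.0
    # kron(one_hot, phi_x) produces K consecutive blocks of length len(phi_x).
    # This is a coordinate permutation of phi(x) (x) Lambda(pi) from Eq. (6.32).
    return np.kron(one_hot, phi_x)

phi_x = np.array([1.0, 2.0, 3.0])          # toy feature vector phi(x)
features = joint_feature_map(phi_x, state=1, num_states=3)
```

With this layout, the inner product $\langle u, \Phi(x, \pi) \rangle$ only touches the block of $u$ that belongs to the active state, so each latent state effectively owns its own linear regressor.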

Latent State Inference (Eq. (6.27)) Latent state inference is computationally hard in general.
While efficient global inference schemes exist for tree-like structures, this does not hold
for settings with loops. Since we focus on the latter, we rely on one of two
approximation methods.

We first discuss an approach based on linear program approximation. In our case, however,
an extension to quadratic programs is necessary, and hence we call the resulting approach
quadratic program approximation (QPA). This approach is inspired by the idea of
linear program approximations and marginal polytopes [231]. Instead of using
the explicit relation of the parameter vector with the joint feature map, we need to model the
explicit relation between the latent variables:

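\begin{align*}
L(u, v, \{\pi_i\}_{i=1}^{n+m}) &= \frac{\theta}{2} \sum_{i=1}^{n} |y_i - \langle u, \Phi(x_i, \pi_i) \rangle|^2 - (1 - \theta) \langle v, \Psi(\{x_i\}_{i=1}^{n+m}, \{\pi_i\}_{i=1}^{n+m}) \rangle \\
&= \frac{\theta}{2} \sum_{i=1}^{n} (\langle u, \Phi(x_i, \pi_i) \rangle)^2 - \theta \sum_{i=1}^{n} y_i \langle u, \Phi(x_i, \pi_i) \rangle \\
&\quad - (1 - \theta) \langle v, \Psi(\{x_i\}_{i=1}^{n+m}, \{\pi_i\}_{i=1}^{n+m}) \rangle + \mathrm{const.} \\
&= \frac{\theta}{2} \mu_l^\top B(u, \{x_i\}_{i=1}^{n})\, \mu_l - \theta \mu_l^\top c(u, \{x_i\}_{i=1}^{n}, \{y_i\}_{i=1}^{n}) \\
&\quad - (1 - \theta) \big( \mu^\top d(v, \{x_i\}_{i=1}^{n+m}) + \sigma^\top e(v) \big) + \mathrm{const.},
\end{align*}

with the variables $B$, $c$, $d$, $e$ defined as follows:

\begin{align*}
\forall i = 1, \ldots, n: \quad & B_{i|S| : (i+1)|S| - 1,\; i|S| : (i+1)|S| - 1}(u, \{x_i\}_{i=1}^{n}) = \big( \langle u_{s_1}, \phi(x_i) \rangle \langle u_{s_2}, \phi(x_i) \rangle \big)_{s_1, s_2 \in S}, \\
\forall i = 1, \ldots, n: \quad & c_{i|S| : (i+1)|S| - 1}(u, \{x_i\}_{i=1}^{n}, \{y_i\}_{i=1}^{n}) = \big( y_i \langle u_s, \phi(x_i) \rangle \big)_{s \in S}, \\
\forall i = 1, \ldots, n+m: \quad & d_{i|S| : (i+1)|S| - 1}(v, \{x_i\}_{i=1}^{n+m}) = \big( \langle v_s, \phi(x_i) \rangle \big)_{s \in S}, \\
\forall (e_1, e_2) \in E: \quad & e_{e_1, e_2}(v) = v_{e_1, e_2}.
\end{align*}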

Here, $\mu_i^s = (\Lambda(\pi_i))_s$ corresponds to the relaxed unary term and $\sigma_{(e_1, e_2)}^{(s_1, s_2)} = [\pi_{e_1} = s_1 \wedge \pi_{e_2} = s_2]$
to the relaxed pairwise term, which must satisfy the marginal (i.e., local) polytope
constraints $P(G)$ [232]. Furthermore, $\mu_l \subseteq \mu$ selects the labeled data points among all data
points $\mu$; $B(u, \{x_i\}_{i=1}^{n})$ is a block-diagonal matrix with $|S| \times |S|$ sub-matrices on its diagonal;
$c(u, \{x_i\}_{i=1}^{n}, \{y_i\}_{i=1}^{n})$ is the linear part of the quadratic regression model containing the labels;
$d(v, \{x_i\}_{i=1}^{n+m})$ contains the score of the parameter vector and the features for each vertex
and each state; and $e(v)$ is the vector of pairwise connection weights.

In our case, the regression term for labeled examples can be expressed through a positive
semi-definite matrix, which consequently leads to the following quadratic program formulation:

\begin{align*}
\{\mu_i^*\}_{i=1}^{n+m} = \operatorname*{argmin}_{(\mu, \sigma) \in P(G)}\; & \frac{\theta}{2} \mu_l^\top B(u, \{x_i\}_{i=1}^{n})\, \mu_l - \theta \mu_l^\top c(u, \{x_i\}_{i=1}^{n}, \{y_i\}_{i=1}^{n}) \\
& - (1 - \theta) \big( \mu^\top d(v, \{x_i\}_{i=1}^{n+m}) + \sigma^\top e(v) \big).
\end{align*}

While we found empirically that this approach is more reliable and stable than loopy
belief propagation, it is also computationally demanding and does not scale well, in particular
with the number of edges. Furthermore, we cannot ensure that Assumption (6.30) holds.

Another approach is based on loopy belief propagation approximation (LBPA) [233], where each
$\hat{\pi}_i$ is sequentially updated given the states of its neighbors. This approach is proven to
monotonically decrease the objective in each iteration, and therefore Assumption (6.30) holds even
in the presence of loops. Moreover, for tree-like structures, LBPA converges to the
global solution. The algorithm works by iteratively sending messages $M_{ij}(s)$ from node $i$ to
node $j$ (in state $s$) in the proximity of its location:

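\[ M_{ij}(s) \leftarrow \varepsilon + \max_{t} \Big( \iota_{ij}(s, t) + \vartheta_i(t) + \sum_{k \in N(i) \setminus j} M_{ki}(t) \Big), \]

where $\varepsilon$ is some normalization constant, $N(i)$ denotes the set of neighboring nodes of node
$i$, and

\begin{align*}
\iota_{ij}(s, t) &= (1 - \theta)\, v_{st}, \\
\vartheta_i(t) &= (1 - \theta) \langle v_t, \phi(x_i) \rangle - \underbrace{1[i \leq n]\, \frac{\theta}{2} |y_i - \langle u, \Phi(x_i, t) \rangle|^2}_{\text{regression part}}. \tag{6.33}
\end{align*}

After convergence, max-marginals $\mu_i(s)$ can be computed as follows,

\[ \mu_i(s) \leftarrow \varepsilon + \max_{t} \Big( \vartheta_i(t) + \sum_{k \in N(i)} M_{ki}(t) \Big). \]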

Finally, backtracking using the max-marginals reveals the latent states per node. We
empirically found that the quadratic approximation performs similarly but is time-consuming,
while the LBP approximation gives reasonable performance and is scalable.
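For intuition, the message-passing scheme can be sketched on the simplest tree-like structure, a chain, where it is exact. The code below is an illustrative sketch under stated assumptions rather than the thesis implementation: it uses the standard max-sum form in which the max-marginal of node $i$ at state $s$ is the unary score plus all incoming messages at $s$, it sets the normalization constant $\varepsilon$ to zero, and the names `max_sum_chain`, `unary` (playing the role of $\vartheta_i(t)$) and `pairwise` (playing the role of $\iota(s, t)$) are hypothetical. The result is compared against brute-force enumeration.

```python
import itertools
import numpy as np

def max_sum_chain(unary: np.ndarray, pairwise: np.ndarray) -> np.ndarray:
    """Max-sum message passing on a chain.

    unary:    (n, K) scores of node i being in state t.
    pairwise: (K, K) scores of consecutive nodes being in states (a, b).
    Returns the (n, K) max-marginals.
    """
    n, K = unary.shape
    fwd = np.zeros((n, K))  # fwd[i]: message from node i-1 into node i
    bwd = np.zeros((n, K))  # bwd[i]: message from node i+1 into node i
    for i in range(1, n):
        incoming = unary[i - 1] + fwd[i - 1]            # indexed by sender state t
        fwd[i] = np.max(pairwise + incoming[:, None], axis=0)
    for i in range(n - 2, -1, -1):
        incoming = unary[i + 1] + bwd[i + 1]            # indexed by sender state t
        bwd[i] = np.max(pairwise + incoming[None, :], axis=1)
    return unary + fwd + bwd

rng = np.random.default_rng(0)
unary = rng.normal(size=(5, 3))
pairwise = rng.normal(size=(3, 3))
max_marginals = max_sum_chain(unary, pairwise)
states = max_marginals.argmax(axis=1)   # valid decoding when the optimum is unique

def chain_score(assignment):
    """Total score of a full state assignment along the chain."""
    return sum(unary[i, s] for i, s in enumerate(assignment)) + sum(
        pairwise[a, b] for a, b in zip(assignment[:-1], assignment[1:]))

best = max(itertools.product(range(3), repeat=5), key=chain_score)
```

On graphs with loops the same updates are iterated until the messages stop changing, and the decoded states are then only approximate.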

Parameter Estimation (Eq. (6.29) and Eq. (6.28)) The optimization problem for $v$ (Eq. (6.29))
is convex, and therefore we apply a gradient-based solver, L-BFGS, the method of choice
for parameter estimation of CRFs. To perform the gradient descent, we need to compute the
objective $L_{\mathrm{crf}}$ and its gradient with respect to $v$, which is written as

\begin{align*}
\nabla_v L_{\mathrm{crf}}(v, \{\pi_i\}_{i=1}^{n+m}) = {} & \Gamma v - \Psi(\{x_i\}_{i=1}^{n+m}, \{\pi_i\}_{i=1}^{n+m}) \\
& + \mathbb{E}_{\hat{\pi} \sim p(\{\hat{\pi}_i\}_{i=1}^{n+m} \mid \{x_i\}_{i=1}^{n+m}, v)} \big[ \Psi(\{x_i\}_{i=1}^{n+m}, \{\hat{\pi}_i\}_{i=1}^{n+m}) \big]. \tag{6.34}
\end{align*}

The objective (6.24) contains the log-partition function $\log Z(\{x_i\}_{i=1}^{n+m}, v)$, and the gradient
(6.34) involves the expectation

\[ \mathbb{E}_{\hat{\pi} \sim p(\{\hat{\pi}_i\}_{i=1}^{n+m} \mid \{x_i\}_{i=1}^{n+m}, v)} \big[ \Psi(\{x_i\}_{i=1}^{n+m}, \{\hat{\pi}_i\}_{i=1}^{n+m}) \big]. \]

Computing the partition function with pairwise interactions is known to be hard. Therefore,
we approximate it with the pseudo-likelihood [234].
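As a reminder of what this approximation looks like, the pseudo-likelihood replaces the joint distribution in Eq. (6.21) by a product of local conditionals, each of which only requires a sum over the $|S|$ states of a single node. The sketch below assumes symmetric transition weights $v_{st} = v_{ts}$ and the transition and emission blocks of Eq. (6.31); the symbol $p_{\mathrm{PL}}$ is introduced here for illustration only.

```latex
\begin{align*}
p_{\mathrm{PL}}(\{\pi_i\}_{i=1}^{n+m} \mid \{x_i\}_{i=1}^{n+m}, v)
  &= \prod_{i=1}^{n+m} p(\pi_i \mid \{\pi_j\}_{j \in N(i)}, x_i, v), \\
p(\pi_i = s \mid \{\pi_j\}_{j \in N(i)}, x_i, v)
  &= \frac{\exp\big( \langle v_s, \phi(x_i) \rangle + \sum_{j \in N(i)} v_{s \pi_j} \big)}
          {\sum_{s' \in S} \exp\big( \langle v_{s'}, \phi(x_i) \rangle + \sum_{j \in N(i)} v_{s' \pi_j} \big)}.
\end{align*}
```

Each normalizer is a sum over $|S|$ terms, so the cost grows linearly in the number of nodes and edges instead of exponentially in $n + m$.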

The estimation of $u$, Eq. (6.28), is simply a ridge regression problem, whose solution
is available analytically:

\[ \frac{\partial L_{\mathrm{rr}}(u, \{\pi_i\}_{i=1}^{n})}{\partial u} = 0 \;\Rightarrow\; u = (\lambda I + \Phi \Phi^\top)^{-1} \Phi y, \]

with $I \in \{0, 1\}^{\dim \mathcal{H}_1 \times \dim \mathcal{H}_1}$ being the identity matrix, $\Phi \in \mathbb{R}^{\dim \mathcal{H}_1 \times n}$ the design matrix of
only the labeled samples, and $\Phi \Phi^\top$ the corresponding covariance matrix.
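The closed-form update can be checked numerically in a few lines. The sketch below is illustrative only: `ridge_solution` is a hypothetical helper, the data are random toy values, and the design matrix is stored with one column per labeled sample, matching the convention $\Phi \in \mathbb{R}^{\dim \mathcal{H}_1 \times n}$ above. It solves the linear system instead of forming the inverse explicitly, which is the numerically preferable route.

```python
import numpy as np

def ridge_solution(Phi: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Solve (lam * I + Phi Phi^T) u = Phi y, with Phi of shape (d, n)."""
    d = Phi.shape[0]
    return np.linalg.solve(lam * np.eye(d) + Phi @ Phi.T, Phi @ y)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(4, 50))                      # 4 features, 50 labeled samples
u_true = np.array([1.0, -2.0, 0.5, 3.0])
y = Phi.T @ u_true + 0.01 * rng.normal(size=50)     # noisy linear targets
u_hat = ridge_solution(Phi, y, lam=1e-3)

# Gradient of L_rr at the solution: lam * u - Phi (y - Phi^T u) should vanish.
gradient = 1e-3 * u_hat - Phi @ (y - Phi.T @ u_hat)
```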

One fundamental assumption in our application setting is the linearity of the regression

model within each latent state. For this setting, the above regression model is suﬃcient. It

is, however, quite easy to extend to non-linear settings. For that, kernel ridge regression can

be applied and solved analytically. Notably, maximum a posteriori estimation of the latent

states needs to be changed if no expansion of ucan be provided.

Towards One-class Classification Here, we extend the idea to one-class classification.
The resulting method, ContextualSVDD, employs results from Section 6.2 and the loopy belief
propagation derivations as given in Eq. (6.33). Instead of optimizing the ridge regression
problem, we replace it by a slightly modified version of the unconstrained LatentSVDD as
given in Eq. (6.3). All we need to do is substitute the local latent variable $z$ with our global
variable $\pi_i$ for each data point and remove the corresponding minimization,

\[ L_{\mathrm{ContextualSVDD}}(c, R, \{\pi_i\}_{i=1}^{n+m}) := R^2 + C \sum_{i=1}^{n+m} \max\big(0, \|c - \Psi(x_i, \pi_i)\|^2 - R^2\big). \]

When the latent variables are fixed, the resulting optimization problem is convex. However, this
formulation no longer represents the negative logarithm of some joint distribution, and
hence we lose the probabilistic interpretation. To obtain a fully functional version, the inference
methods need to be adjusted as well, which we do here for the more scalable loopy belief
propagation:

\[ \vartheta_i(t) = (1 - \theta) \langle v_t, \phi(x_i) \rangle - \underbrace{\theta\, C \max\big(0, \|c - \Psi(x_i, \pi_i)\|^2 - R^2\big)}_{\text{ContextualSVDD part}}. \tag{6.35} \]

As can be seen, all examples are considered as long as they receive slack, i.e., lie outside
of the hypersphere. Moreover, if some labels are available, a semi-supervised extension of
ContextualSVDD can be derived using ideas from [12].
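A small numerical example may help to read the ContextualSVDD objective. The sketch below is an assumption-laden illustration: `contextual_svdd_loss` is a hypothetical helper, and the rows of `features` stand in for the already computed joint feature vectors $\Psi(x_i, \pi_i)$ with fixed latent states.

```python
import numpy as np

def contextual_svdd_loss(center, radius, features, C):
    """R^2 + C * sum_i max(0, ||c - psi_i||^2 - R^2) for fixed latent states."""
    sq_dist = np.sum((features - center) ** 2, axis=1)
    slack = np.maximum(0.0, sq_dist - radius ** 2)   # zero for points inside the sphere
    return radius ** 2 + C * slack.sum()

# Three toy feature vectors: two inside the sphere of radius 2, one far outside.
features = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
loss = contextual_svdd_loss(center=np.zeros(2), radius=2.0, features=features, C=0.1)
```

Only the third point contributes slack (squared distance 25 against a squared radius of 4), which mirrors the remark that examples matter only when they lie outside the hypersphere.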

6.5 Evaluation and Applications

We test our derived methods (TCRFR and LatentSVDD) on challenging applications from

BCI and geoscience.

6.5.1 Extracting Latent Brain States

In many real-world applications, the simpliﬁed assumption of independent and identically

distributed noise breaks down, and labels can have structured, systematic noise. For ex-

ample, in brain-computer interface applications, training data is often the result of lengthy

experimental sessions, where the attention levels of participants can change over the course

of the experiment. In such application cases, structured label noise will cause problems because
most machine learning methods assume independent and identically distributed label
noise. Here, we present a novel methodology for learning and evaluation in the presence
of systematic label noise.

We are given a data set $D$ consisting of $N$ data points $x_1, \ldots, x_N$, lying in some input
space $\mathcal{X}$, and labels $y_1, \ldots, y_N \in \mathcal{Y}$. As mentioned above, we consider a learning scenario
where we have varying confidence in the labels (some $y_i$ are more trustworthy than others).
To this end, we propose a methodology for learning with non-i.i.d. label noise that consists
of four steps.

As a result, we obtain a learning methodology that outputs, for a training set $D$, an
inductive rule

\[ g_D: \mathcal{X} \times \mathcal{Y} \to \mathcal{Y}, \]

that lets us assign to any pair $(x, y)$ a denoised label $\hat{y} := g_D(x, y)$, which is our guess for the
true underlying label.

The various steps of the above methodology are detailed below.

Pipeline The ﬁrst step is an application of our proposed method, LatentSVDD in Sec-

tion 6.2.

To remove outliers in the second step, we divide the data set $D$ into two disjoint sets
$L_- := \{x : f(x) \leq \rho\}$, containing most of the regular data, and $L_+ := \{x : f(x) > \rho\}$,
consisting of the anomalies. Here, $f$ is defined as in Eq. (6.2). LatentSVDD provides us with
a natural choice of threshold, $\rho = R^2$, but usually we employ a small and thus conservative
radius $R \ll \|\psi(x, z)\|_\infty$, so that choosing $\rho = R^2$ would be too aggressive (too many
anomalies removed). As a remedy, we apply the following procedure to determine a good
threshold $\rho$. Set $f_i := f(x_i)$ and arrange the $f_i$ in non-decreasing order, $f_{(1)} \leq \ldots \leq f_{(N)}$.
Put

\[ \rho := \max\Big( R^2,\; \max_{i = 1, \ldots, N-1} f_{(i+1)} - f_{(i)} \Big). \]

Thus, intuitively, we determine the threshold where the anomaly score $f(x)$ has the steepest
slope. The motivation is that regular data is quite densely sampled and thus exhibits
a rather smooth increase of anomaly scores, so that a region with a steep slope of
anomaly scores corresponds to an anomalous region in input space. Indeed, we have observed
that this heuristic often leads to good results in practice. Finally, we output $W := L_-$ as our
(sanitized) working training set.
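The thresholding heuristic can be sketched as follows. Note that this is one possible reading, stated here as an assumption: following the verbal description (cut where the sorted anomaly scores jump most steeply), the sketch places the threshold at the last score before the largest gap and never below $R^2$, rather than using the gap size itself. The function name `gap_threshold` and the toy scores are hypothetical.

```python
import numpy as np

def gap_threshold(scores: np.ndarray, radius_sq: float) -> float:
    """Threshold at the last sorted score before the steepest jump, at least R^2."""
    sorted_scores = np.sort(scores)
    gaps = np.diff(sorted_scores)            # f_(i+1) - f_(i) for consecutive scores
    i_star = int(np.argmax(gaps))            # position of the steepest slope
    return max(radius_sq, float(sorted_scores[i_star]))

scores = np.array([0.10, 0.12, 0.15, 0.11, 0.90, 0.14, 0.95])  # toy anomaly scores
rho = gap_threshold(scores, radius_sq=0.05)
regular = scores <= rho                      # membership in the working set W = L_-
```

In this toy example the two scores near 0.9 are separated from the dense bulk by the largest gap and are therefore flagged as anomalies.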

In the third step, we aim at assigning a label $\hat{y}$ to each data point $x$ using the information
from the latent variable $z \in \mathcal{Z}$, as computed by LatentSVDD. We start by partitioning
the working data set $W$ into $m$ smaller sets $W_1, \ldots, W_m$, where $m := |\mathcal{Z}|$ denotes the
cardinality of the latent state space, by grouping all data points that have the same latent
state in the LatentSVDD model.

Then, we wish to ﬂip the labels of data points such that the data within each group Wi

has identical labels. To this end, we could simply perform a majority vote within each group.

We follow a diﬀerent, more sophisticated approach here: we determine each group’s joint

label by choosing the labels such that the working set’s kernel-target-alignment (KTA) score

is maximized after label assignment.

Kernel target alignment (KTA) [235,236] is a method that measures the fit between the
Gram matrix $K = (\langle \phi(x_i), \phi(x_j) \rangle)_{1 \leq i, j \leq n}$ and the label vector $y = (y_1, \ldots, y_n)$ as follows:

\[ \mathrm{KTA}(K, y) = \frac{\langle K, y y^\top \rangle_F}{\|K\|_F\, \|y y^\top\|_F}. \]

Here, $\langle A, B \rangle_F := \sum_{i,j=1}^{n} a_{ij} b_{ij}$ denotes the Frobenius inner product and $\|A\|_F := \langle A, A \rangle_F^{1/2}$
denotes its induced norm. This measure has been utilized for optimizing kernels or feature
representations [235,237]. Here, we reverse the perspective: instead of optimizing a
kernel to match the labels, we optimize the labels to match the kernel.
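The KTA score is straightforward to compute. The following minimal sketch uses a hypothetical helper `kernel_target_alignment` and a toy linear kernel on four points; it shows that labels agreeing with the cluster structure of the data obtain a higher alignment than labels that cut across it.

```python
import numpy as np

def kernel_target_alignment(K: np.ndarray, y: np.ndarray) -> float:
    """<K, y y^T>_F / (||K||_F * ||y y^T||_F)."""
    yyT = np.outer(y, y)
    return float(np.sum(K * yyT) / (np.linalg.norm(K, "fro") * np.linalg.norm(yyT, "fro")))

# Two points near (1, 0) and two points near (-1, 0), with a linear kernel.
X = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.8, -0.2]])
K = X @ X.T
aligned = kernel_target_alignment(K, np.array([1, 1, -1, -1]))
misaligned = kernel_target_alignment(K, np.array([1, -1, 1, -1]))
```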

Let $W = W_1 \cup \ldots \cup W_m$ be the partition of the working training set $W$ into disjoint sets
$W_i$ such that examples having the same latent state are grouped within the same $W_i$. Then
we compute the denoised label vector $\hat{y}$ as

\begin{align*}
\hat{y} := \operatorname*{argmax}_{y \in \{+1, -1\}^N}\; & \mathrm{KTA}(K, y) \\
\text{s.t.}\; & \forall i, j, k: x_i, x_j \in W_k \Rightarrow y_i = y_j.
\end{align*}

Here, the constraints require that all data points within a group $W_i$ are assigned the
same label. This ensures that we only have to optimize over a few possible label combinations,
e.g., over $2^5 = 32$ instead of $2^N$ if we have $m = 5$ groups. This renders the optimization
problem feasible.
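Because only $2^m$ label combinations remain, the constrained maximization can be carried out by plain enumeration. The sketch below is illustrative and makes several assumptions: `best_group_labels` and the integer array `groups` (the latent-state group index of each point) are hypothetical names, and the toy data use a linear kernel. KTA is invariant under a global sign flip of $y$, so the search returns one of two equivalent optima; how the overall sign is fixed is not specified in this section.

```python
import itertools
import numpy as np

def kta(K, y):
    """Kernel target alignment between Gram matrix K and label vector y."""
    yyT = np.outer(y, y)
    return np.sum(K * yyT) / (np.linalg.norm(K) * np.linalg.norm(yyT))

def best_group_labels(K, groups):
    """Enumerate one label per group and keep the assignment with maximal KTA."""
    m = int(groups.max()) + 1
    best_score, best_y = -np.inf, None
    for signs in itertools.product([+1, -1], repeat=m):   # 2^m candidates
        y = np.asarray(signs)[groups]                      # broadcast group labels
        score = kta(K, y)
        if score > best_score:
            best_score, best_y = score, y
    return best_y, best_score

# Groups 0 and 2 lie near (1, 0); group 1 lies near (-1, 0).
X = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1], [1.0, 0.1], [0.8, 0.0]])
groups = np.array([0, 0, 1, 1, 2, 2])
y_hat, score = best_group_labels(X @ X.T, groups)
```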

Fair evaluation of learning algorithms for label denoising is a major challenge and the

ﬁnal step in our pipeline: while we cannot trust the observed labels, we usually cannot access

the underlying ground truth of an experiment.

When evaluating our experiments on real-world data, we employ three indicators for
the prediction accuracy of an algorithm. First, note that it is our intrinsic interest that the
accuracy of a classifier increases after denoising the labels. For this purpose, we measure the
classification performance in terms of the area under the ROC curve (AUC) [238] before and
after denoising, and take the difference as an indicator of an algorithm's performance: a good
denoising algorithm should yield a substantially higher classification accuracy after denoising.
Second, we use kernel-target-alignment scores as an indicator of the fit between labels and
data before and after denoising. KTA scores are complementary to AUCs in the sense that
they capture how well the separability of the data correlates with the labels. Third, we invoke
expert opinions to ensure the quality of the delivered solution. This has the advantage that
we do not rely on labels in this case, but the disadvantage that the expert opinion is subjective
and might be biased. In summary, the combined application of the measures described above
lets us obtain a guess for the true performance of a denoising algorithm.

Motivation & Neuroscientiﬁc Background We evaluated our proposed learning method-

ology on the data of an EEG-BCI experiment, for which we recorded 20 participants. The

results are presented in this section.

In our EEG experiment, we address the question of whether or not the brain of a partic-

ipant processed a response error. Conventionally, the EEG data would be analyzed based on

the behavioral response of the participant, grouping all trials together where the behavioral

response is de facto correct or wrong (= behavioral labels). However, having committed a
mistake behaviorally does not equate to having processed it neurally [239]. While the neural
processing is what we are really interested in, these neural labels are unknown, as no ground
truth is available. We used LatentSVDD for finding these neural labels in a data-driven way,
with the goal of dividing the EEG trials into those where an error was processed neurally and
those where none was processed.

When participants recognize having committed a response error, two specific components
are evoked in the event-related potential (ERP) of the EEG signal: an error negativity
(Ne) and an error positivity (Pe). Out of these, only the Pe has been attributed to error or
post-error processing itself [240]. Therefore, we focus on the Pe in the following, which is
characterized by a centro-parietal maximum 200–500 ms after feedback [241–244].

Paradigm & Methods In our experiment, 20 participants were asked to perform a fast-

paced d2 test [245], a common test of visual selective attention. In this test, participants are

presented two types of visual stimuli and are asked to distinguish between these two stimuli

by pressing the corresponding button: the right hand should be used for the target stim-

ulus (20% of trials), the left hand for the non-target stimulus (80% of trials). In total, each

participant assessed 300 stimuli under time pressure. Feedback was given 500 ms after each

response, both on reaction time and correctness. Brain activity was recorded with multichan-

nel EEG ampliﬁers (BrainAmp DC by Brain Products, Munich, Germany) with 119 Ag/AgCl

electrodes placed according to an extended international 10-10 system, sampled at 1000 Hz

and band-pass ﬁltered between 0.05 Hz and 200 Hz.

We examined the neural response that was elicited by receiving feedback. For this, the
EEG data was divided into epochs of 500 ms, starting from the onset of feedback. These
epochs were baseline corrected (based on the 200 ms interval prior to feedback), and artifact
rejection was performed. As features for LatentSVDD and classification, we calculated 9 features
per epoch. For this purpose, the interval [0, 500] ms was divided into 10 non-overlapping
intervals of 50 ms length. We then calculated the mean signal in each of these intervals and,
subsequently, the gradient between consecutive means. In order to test class separability, we
classified the EEG data using shrinkage LDA, sampling 30 times from the data set and dividing
the data set into 75% training data and 25% test data. Classification was run using (a) behavioral
labels, (b) the 'neural' labels suggested by LatentSVDD, and, for comparison, those
derived by SVDD, LP and RDE. We expect the 'neural' classes to be better separable than
before (higher AUC values) and to have a better matching of labels and data (higher KTA
scores), compared to using behavioral labels (correct vs. incorrect responses).
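The feature extraction step can be sketched for a single channel as follows. This is an illustration under assumptions, not the original preprocessing code: the function name `gradient_features` is hypothetical, the "gradient" is read as the difference between consecutive window means, and the input is a synthetic ramp standing in for a baseline-corrected 500 ms epoch sampled at 1000 Hz.

```python
import numpy as np

def gradient_features(epoch: np.ndarray, sampling_rate_hz: int = 1000,
                      window_ms: int = 50) -> np.ndarray:
    """Mean signal per non-overlapping window, then differences of consecutive means."""
    samples_per_window = sampling_rate_hz * window_ms // 1000
    n_windows = len(epoch) // samples_per_window
    windows = epoch[: n_windows * samples_per_window].reshape(n_windows, samples_per_window)
    means = windows.mean(axis=1)      # 10 window means for a 500 ms epoch
    return np.diff(means)             # 9 gradient features

# Synthetic single-channel epoch: a linear ramp over 500 samples (500 ms at 1000 Hz).
epoch = np.linspace(0.0, 10.0, 500, endpoint=False)
features = gradient_features(epoch)
```

Ten window means yield nine differences, which matches the 9 features per epoch stated above.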

Class Re-Assignment and Anomalous Trials On average, LatentSVDD flipped the labels
for 35.94% of all trials. This resulted in a neural error rate of 31.18%, compared to the
lower behavioral error rate (18.05%). Based on the anomaly score that LatentSVDD returns
for each trial, we rejected a small percentage of trials for each participant (cf. section 2.2).
For the majority of participants, there are only a few trials with high anomaly scores, with a
steep drop-off compared to the other trials (cf. Figure 6.3). Visual inspection revealed that
the results also make sense neuroscientifically: the rejected trials show typical artifacts (eye
blinks, voltage drifts with respect to all electrodes or a single electrode) that have escaped
the conventional artifact rejection run prior to applying LatentSVDD, as well as trials with
unusually high amplitudes.

Figure 6.3 – Sorted anomaly scores for each data point of each participant (axes: sorted anomaly scores, participants).

Quantitative Assessment We quantified the benefits of LatentSVDD using KTA scores
and linear classification (LDA). Both measures confirm that the labels assigned by LatentSVDD
allow a much better separation of the data than behavioral labels for all 20 participants. As
can be seen in Figure 6.4.B, LatentSVDD renders the classes clearly more distinct from each
other, reflected in higher AUC values (0.95 ± 0.02 versus 0.60 ± 0.08). This is accompanied
by substantially higher KTA scores for all participants. As can be seen in Figure 6.4.A,
LatentSVDD is also superior compared to other denoising methods (SVDD, LP, RDE). SVDD
and LP lag far behind, both in AUC and KTA scores. In fact, applying these methods even
makes the separability of classes worse than before (no method: 0.60 ± 0.08, SVDD: 0.59 ± 0.07,
LP: 0.54 ± 0.17). In contrast, RDE proves to be a close competitor to LatentSVDD. However,
our approach shows better results for this EEG experiment, with a mean AUC score of
0.95 ± 0.02 (RDE: 0.90 ± 0.04) and a mean KTA score of 0.3911 (RDE: 0.2842).

Figure 6.4 – AUC and KTA results for all participants of the experiment.

Neuroscientific Assessment While AUC and KTA scores help quantify the positive effect
of LatentSVDD, the results are also neurophysiologically sound. In the following, we
discuss this for our methodology using the example of participant 5. The different steps of our
methodology are visualized in Figure 6.5. Each plot shows the same data (time course at
electrode Cz), yet grouped in different classes. The conventional approach is shown on the
far left (a), the superior results obtained by LatentSVDD on the far right (d), with classes
that are clearly better separable. Initially (Figure 6.5(a)), classes show great similarity (correct
responses in green, erroneous responses in red). Our methodology reveals four latent brain

states (Figure 6.5(b)). The state with the highest amplitude (purple) corresponds to typical

error processing, with a clear positive component Pe. A clear positivity also occurs in the

blue and pink state, yet less pronounced and with diﬀerent latencies. In contrast, no error

has been processed in the black state. Based on the latent variable, a subset of trials is then

re-assigned (Figure 6.5(c)). Red and green indicate labels that are retained, orange and light
green signify trials where the labels were switched (orange to red, light green to green). As
can be seen, the re-assignment makes sense intuitively. Finally, Figure 6.5(d) shows the
denoised data, which reveals a more pronounced error positivity Pe (red) than before. While the
latent states themselves are highly subject-specific, we find similar results, i.e., the recovery
of a stronger Pe component than before, for all other participants.

Figure 6.5 – Time course at electrode Cz: (a) before denoising (behavioral labels), (b) latent brain states
revealed by LatentSVDD, (c) resulting re-assignment of labels, (d) after denoising.

Application Outcome Finding the true label for data with systematic, non-i.i.d. label
noise is a common challenge in experimental disciplines such as the neurosciences. We
proposed a four-step methodology for learning and evaluation in the presence of non-i.i.d. label noise,
at the heart of which lies our novel learning algorithm, LatentSVDD, which captures
the hidden state of the label noise. We demonstrated the methodology in an extensive case
study of EEG-BCI data recorded during an attention test, where we observed that the labels
denoised by the proposed methodology lead to substantially better separability of the data
(assessed with linear classification; rise in the mean AUC from 0.60 to 0.95 for EEG data).
Visual inspection of the data by a domain expert shows that the resulting class assignments are
neurophysiologically plausible, leading to more easily interpretable brain states that
subsequently allow for a better and more meaningful experimental evaluation.

6.5.2 Porosity Estimation

Here, we will empirically evaluate our proposed method from Section 6.4, transductive con-

ditional random ﬁeld regression (TCRFR). First, we will verify various properties using arti-

ﬁcially generated data. In a second step, we will apply our method to realistically simulated

impedance data where ground truth porosity values are known as well as real data from a

Brazilian oﬀshore area.

Controlled Experiment In this section, we assess the various properties of our proposed

TCRFR model and compare it against baseline methods in a controlled environment. In all

experiments, we applied cross validation and hyper-parameter tuning on the training sam-

ples for all methods with 20 repetitions. The search range for each parameter is shown in

Table 6.1.

Method           Parameter   Range
SVR              C           0.1, 1.0, 10.0, 100.0
                 ε           1E-0, ..., 1E-5
SVR (RBF)        C           0.1, 1.0, 10.0, 100.0
                 ε           1E-0, ..., 1E-5
                 σ²          1.0, 0.1, 0.01
LapSVR (RBF)     C           0.1, 1.0, 10.0, 100.0
                 ε           1E-0, ..., 1E-5
                 σ²          1.0, 0.1, 0.01
                 γI/γA       0.01
RR               σ           1E-6, 1E-5, ..., 1E-1
TR               ε           1E-6, 1E-5, ..., 1E-1
                 C           0.1, 1.0, 10.0, 100.0
                 C′          0.1, 1.0, 10.0, 100.0
MoE (FlexMix)    iter.       2000
                 tol.        1E-4, 1E-3, ..., 0.1
k-means RR       ε           1E-5, 1E-4, 1E-3
TCRFR            θ           0.75, 0.85, 0.95
                 λ           1E-4, 1E-3, 1E-2
                 γ           0.1, 1.0, 10.0

Table 6.1 – Optimized hyper-parameters for support vector regression (SVR), Laplacian SVR (LapSVR),
ridge regression (RR), transductive regression (TR), mixture of experts (MoE), and our proposed
method (TCRFR).

We evaluate the performance with different criteria: for prediction, we show the mean absolute
error (MAE), the mean squared error (MSE), the root mean squared error (RMSE), the median
absolute error (MDAE), and the R²-score; for clustering (latent variable estimation) accuracy,
we show the adjusted Rand score (ARS).

We chose the following as the

competitors: ridge regression (RR);

support vector regression (SVR)

with linear and RBF kernel (SVR

RBF) and a Laplacian regularized

transductive SVR (with RBF ker-

nel); a RC (Regression and Cluster-

ing) approach for assessing latent

states by applying k-means and us-

ing ridge regression within each

cluster (k-means+RR); a mixture

of experts approach (MoE) [225,

226];1and the transductive regres-

sion (TR) [223]. We also plot the

lower bounds of the errors, which

are the prediction errors under the

assumption that the latent variable is known for all the test samples.
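The prediction criteria named above can be sketched in a few lines of plain Python. This is only an illustration of the standard definitions, not the evaluation code used for the experiments; the function name `regression_scores` is hypothetical, and the adjusted Rand score is omitted because it needs a clustering-specific implementation.

```python
import math
import statistics


def regression_scores(y_true, y_pred):
    """Compute MAE, MSE, RMSE, MDAE and the R2-score from two equal-length lists."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    squared_errors = [e * e for e in errors]

    mse = statistics.fmean(squared_errors)
    mean_true = statistics.fmean(y_true)
    ss_res = sum(squared_errors)                        # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    return {
        "MAE": statistics.fmean(abs_errors),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MDAE": statistics.median(abs_errors),
        "R2": 1.0 - ss_res / ss_tot,
    }


scores = regression_scores([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(scores)
```

The toy input has a single large error, which shows why MAE and MDAE can disagree strongly: the median absolute error ignores the one outlying prediction, while MAE and MSE do not.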

Synthetic structured data was created according to the sequence model illustrated in

Fig. 6.6. From 2 latent states with heavily overlapping inputs and additional Gaussian noise,

800 data points were generated. The data was randomly split into training, validation and

test data, and the experiment was repeated 20 times. We tested our approach against the

baseline methods.

We often face cases where the number of labeled samples is extremely small. In such

a case, we found that enhancing the propagation of information from the labeled samples

to the unlabeled samples signiﬁcantly boosts the performance, as shown below. For this

purpose, we treat the labeled and the unlabeled samples asymmetrically: the labeled samples

are connected to the neighbors lying in a radius-R-near ball, while the unlabeled samples are

connected only to the 2 nearest neighbors in the sequence (4 nearest neighbors in the lattice

grid). We set $R \sim P^{-1/2}$, where $P$ is the proportion of the labeled samples. This way, we can

ﬁx the ratio between the total number of edges between labeled and unlabeled samples and

the total number of edges between unlabeled samples. Also, we encourage the ferromagnetic

interactions (that is, we favor same states for neighboring latent variables) by reducing the

relative regularization parameter for v, i.e., Γ in Eq. (6.22) is diagonal with its elements equal to γ, except the ones corresponding to the ferromagnetic pairwise terms, which are equal to 0.01γ. Fig. 6.7 shows the performance of TCRFR(R). We can clearly see that optimizing R

signiﬁcantly improves the performance, which supports our strategy.
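The asymmetric neighborhood construction described above can be sketched for sequence data as follows. This is a minimal illustration under stated assumptions: distances are measured in sequence index, the proportionality constant in $R \sim P^{-1/2}$ is exposed as a hypothetical parameter `radius_scale`, and the function name `build_sequence_edges` is not from the thesis code.

```python
def build_sequence_edges(n_samples, labeled, radius_scale=1.0):
    """Build undirected edges (i, j) with i < j for a sequence of n_samples nodes.

    Labeled nodes connect to all nodes within `radius` index steps;
    unlabeled nodes connect only to their 2 nearest neighbors in the sequence.
    """
    labeled = set(labeled)
    if not labeled:
        raise ValueError("at least one labeled sample is required")
    proportion = len(labeled) / n_samples
    # R ~ P^(-1/2); the scale factor is an assumption of this sketch.
    radius = max(1, round(radius_scale * proportion ** -0.5))

    edges = set()
    for i in range(n_samples):
        reach = radius if i in labeled else 1
        for j in range(max(0, i - reach), min(n_samples, i + reach + 1)):
            if j != i:
                edges.add((min(i, j), max(i, j)))
    return edges, radius


edges, radius = build_sequence_edges(100, labeled=[10, 50, 90])
print(radius, len(edges))
```

Because the radius grows as the labeled proportion shrinks, each labeled sample reaches more unlabeled neighbors when labels are scarce, which is the effect the text relies on.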

¹ We use the FlexMix software package.

6.5. Evaluation and Applications 97

[Figure 6.6, left panel: observations (vertical axis) over time (horizontal axis), colored by latent state (State 1, State 2, State 3); legend: observations with neighborhood connection, observations with assigned regression target, latent state mean input values. Right panel: regression targets (vertical axis) against observations (horizontal axis).]

Figure 6.6 – Toy example for structured linear regression problem with few labeled (red) and many unla-

beled (gray) data. Left: sequence data (structure: temporal) was generated from three latent states. Right:

Considering the input observations (horizontal axis) only, clustering or inferring latent states is futile,

whereas harvesting label information (vertical axis), which are available for the red dots, and temporal

structure (edges) allows unique clustering.

Figure 6.8 – Runtime comparison of our two proposed approximate inference schemes, TCRFR-QPA and TCRFR-LBPA. LBPA is superior in runtime, although the accuracy (cf. Fig. 6.9) gives a slight advantage to QPA, which is the better candidate for a smaller number of data points.

We compare our two inference schemes,

TCRFR-QPA and TCRFR-LBPA, to assess the

diﬀerence in performance and, speciﬁcally,

runtime. To achieve a fair comparison (in

runtime), both methods share the same pa-

rameters and no model selection was done.

Fig. 6.8 compares the accuracy criteria and

the runtime of the two methods for diﬀer-

ent numbers of samples. Although Fig. 6.9

hints that TCRFR-QPA performs better in

settings with low fractions of labeled sam-

ples, TCRFR-LBPA has a big advantage in

computation time. Therefore, for settings

with a small to medium number of data

points, and especially a small fraction of la-

beled data points, TCRFR-QPA should be

preferred. However, due to the much larger

number of data points in subsequent ex-

periments, we adopt TCRFR-LBPA as our

method of choice.

Now we assess the accuracy when changing the fraction of labeled data. Figure 6.9 com-

pares the performance of our proposed TCRFR with the baseline methods. We clearly see that

our method outperforms all baseline methods under all accuracy criteria, and gives close per-

formance to the lower bound optimal strategy in some cases. Note that the ARS criterion for

latent variable estimation is reported only for TCRFR, k-means+RR, and MoE, since the other

methods do not provide a latent variable estimator.

RR, SVR, and SVR (RBF) do not consider the dependence of the regression model on the

latent state. TR and Laplacian SVR (RBF) consider the transductive setting, but also do not

have a latent variable. For this reason, these methods cannot accurately predict the


Figure 6.7 – Performance on synthetic structured data. MAE, MSE, RMSE, MDAE, R2-score, ARS, and the

runtime for diﬀerent fractions of the labeled samples are shown. Here, we assess the quality of the speciﬁc

grid construction.

labels of unlabeled samples when generated from multiple regression models dependent on

the latent state. k-means+RR and MoE consider multiple regression models, depending on

the latent variable. However, they do not take the structure, i.e., interaction between neighbors,

into account. For this reason, they tend to fail to infer the latent states of the unlabeled data,

which also results in poor label prediction performance.

Our TCRFR, which performs signiﬁcantly better than the others, is the only method that

identiﬁes multiple regression models from a limited number of labeled samples, and appro-

priately propagates the label information to the unlabeled data, by capturing the structure of

latent variables.

The number of assumed latent states is a crucial parameter. Here, we examine the impact

of choosing latent states diﬀering from the ground truth for all methods that are sensitive

to this parameter (our TCRFR, K-means+ridge regression, and the Mixture-of-Experts) and

ridge regression as a “calibration”. Figure 6.10 shows the dependence of the performance

on the assumed number of latent states. We can see that the performance of TCRFR is not

very sensitive to the assumed number of latent states, as long as it is larger than the true

number of latent states (2 in this dataset). This is because the redundant components tend to

be discarded if the regularization coefficient γ is optimized.

Porosity Prediction Porosity estimation is a crucial step in the analysis of petroleum

reservoirs for the oil industry. Although estimating porosity from seismic impedance is less

accurate than from drilled wells [19], plenty of measurements are available, typically on a 3D

grid covering over tens of square kilometers. The left panel of Figure 6.11 shows an example

of a seismic impedance data horizontal slice [246].

As stated earlier, the correlation between seismic impedance and porosity depends on

bodies (or units) of rock known as facies [247]. The segmentation of the reservoir into facies

allows local heterogeneity and strong contrasts in rock properties to be preserved between

diﬀerent geological layers [248]. The middle panel in Figure 6.11 shows the facies pattern

of the same data, adapted from [246]. In many cases, facies classiﬁcation is carried out by

hand, based on the data available from seismic surveys, well logs, and collected core samples.

For automation, cell-based geostatistical modeling, object-based stochastic modeling [247],

k-means [64] or Mixture of Gaussians [249] are often applied. Porosity estimation is then


Figure 6.9 – Performance on synthetic structured data for varying fractions of the labeled samples.

Figure 6.10 – Performance on synthetic structured data for a varying number of maximum latent states.

(a) (b) (c) (d)

Figure 6.11 – Porosity prediction problem. The goal is to estimate (c) porosity (unknown at most of the

locations) from (a) impedance (known) by using a linear relationship between them. However, this relation-

ship depends on the (b) facies (unknown), and accurate facies estimation requires porosity measurements

because of the overlapped marginal distribution of the impedance (d).


Method        MAE       MSE        RMSE      MDAE      R2
MoE           2.38477   8.44310    2.90562   1.57930   0.47237
k-means+RR    2.08030   6.27532    2.50489   1.93901   0.61407
SVR           1.84235   11.37484   3.37256   0.24478   0.28910
RR            2.05989   6.19819    2.48950   1.89004   0.61271
TR            2.05993   6.19791    2.48944   1.89106   0.61273
TCRFR         0.69878   3.55215    1.88422   0.14865   0.77804
L. bound      0.15237   0.03567    0.18885   0.13740   0.99777

Table 6.3 – Performance on synthetic seismic data for 5% of labeled data.

performed within each facies usually by kriging [250], an interpolation method for spatial

data based on Gaussian processes commonly used in geostatistics [251]. This whole pro-

cess is extremely time-consuming and requires the specialized knowledge of a geologist (see

Appendix B).

Method    Parameter   Range
MoE       iter.       300, 400, ..., 800
          tol.        1E-4, 1E-3, ..., 0.1
KMRR      ε           1E-5, 1E-4, 1E-3
SVR       C           1E-3, 1E-2, ..., 1.
          ε           0.1, 1., 10.
          kernel      linear
RR        tol.        1E-6, 1E-5, ..., 0.1
TR        ε           1E-6, 1E-5, 1E-4
          C           10., 100., ..., 1E4
          C′          0.001, 0.01, ..., 1
TCRFR     R           3, 4, ..., 8
          θ           0.7, 0.75, ..., 1.0
          λ           1E-4, 1E-3, 1E-2
          γ           0.1, 1., 10.

Table 6.2 – Optimized hyperparameters in the porosity prediction experiment.

In the following subsections,

we show the performance of TCRFR

and the baseline competitors on

synthetic and real data. In all ex-

periments, we applied 3-fold cross

validation on the training samples

to tune the hyper-parameters for

all methods. The search range for

each parameter is shown in Ta-

ble 6.2.

Synthetic Seismic Data We use the synthetic 3D reservoir benchmark data set [246] (150 × 200 × 40 voxels), which was created through realistic geological modeling. Figure 6.11 shows one horizontal slice of the data with 150 × 200 voxels. There are two facies, the sand channels (yellow in Fig. 6.11 (b)) and the background shale (blue). From the data, we observe the following trend: the sand channels have higher porosity (Fig. 6.11 (c)) than the background shale, and the impedance (Fig. 6.11 (a)) has a negative correlation with porosity (see also Fig. 6.11 (d)). Due to the low vertical resolution of the seismic acquisition process [247], we simplify our setting by only considering connections within the horizontal slices. So, from each of those volumes, we extract horizontal slices of 150 × 200 voxels, and assume that the whole impedance data and part of the porosity data are available as the input and the regression label (output), respectively. Our goal is to infer the latent structure (facies), and to predict the porosities at the unlabeled samples.

Among the 150 ×200 = 30,000 pixels, we randomly choose 5% of them as labeled

samples, and the others are treated as unlabeled samples. We iterate this process 10 times

and report the average performance.
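The experimental protocol for one slice can be sketched as follows: a 4-neighborhood lattice over the 150 × 200 pixels and a random 5% labeled subset, redrawn for each of the 10 repetitions. The function names, the flat pixel indexing, and the use of the repetition index as random seed are assumptions of this sketch, not details taken from the thesis code.

```python
import random

ROWS, COLS = 150, 200
LABELED_FRACTION = 0.05
N_REPETITIONS = 10


def lattice_edges(rows, cols):
    """Return the 4-neighborhood edges of a rows x cols grid using flat pixel indices."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            idx = r * cols + c
            if c + 1 < cols:
                edges.append((idx, idx + 1))      # right neighbor
            if r + 1 < rows:
                edges.append((idx, idx + cols))   # lower neighbor
    return edges


def sample_labeled(n_pixels, fraction, seed):
    """Draw a random subset of pixel indices that are treated as labeled."""
    rng = random.Random(seed)
    n_labeled = int(round(fraction * n_pixels))
    return sorted(rng.sample(range(n_pixels), n_labeled))


edges = lattice_edges(ROWS, COLS)
splits = [sample_labeled(ROWS * COLS, LABELED_FRACTION, seed) for seed in range(N_REPETITIONS)]
print(len(edges), len(splits[0]))
```

All remaining pixels in each split are treated as unlabeled, and the reported numbers would be averages over the repetitions.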

Table 6.3 summarizes the performance of TCRFR and the baseline methods. A clear ad-

vantage of TCRFR is found. To discuss the reason of the success of TCRFR, we show the


Figure 6.12 – Facies estimation results for 5% of labeled examples.

estimated facies and the predicted porosity for a single trial in Figure 6.12 and Figure 6.13,

respectively.

Figure 6.12 implies that TCRFR successfully recovers the facies structure, while MoE and

k-means+RR fail. The excellent facies estimation by TCRFR, despite the small fraction of la-

beled data, is because it acquires the facies structure with adequate strength of correlation

between neighbors, through the learning process of the conditional random field. This enables

appropriate propagation of the label information, which is necessary for good facies estima-

tion from only 5% of labeled samples. On the other hand, MoE and k-means+RR are not

capable of taking the structure of facies into account. Therefore, although equipped with

multiple regression models for each facies, they fail to identify the facies of the unlabeled

samples, because no information is propagated from labeled samples. In fact, we found that

the facies estimation by MoE is accurate on the labeled samples, and the bad performance is

only on the unlabeled samples.

Thanks to the high quality of facies estimation, TCRFR provides a signiﬁcantly better

porosity estimation result, as shown in Figure 6.13. SVR, RR, and TR are not capable of

dealing with multiple regression models and, therefore, do not perform as well as our TCRFR

method. Also note that they do not provide facies estimation results.

Figure 6.14 shows MAE, RMSE, and MDAE for a range of labeled samples fractions. For

any fraction in this range, our TCRFR outperforms all state-of-the-art competitors, which

again proves a clear advantage of our approach.

Last, Figure 6.15 shows the facies estimation results (top) and the porosity prediction re-

sults (bottom) by TCRFR for diﬀerent fractions of labeled samples. Notably, although degra-

dation is observed to some extent, TCRFR still provides reasonable facies estimation and

porosity prediction, even if only 1–2% of labeled samples are available. In fact, 1–2%

is still high for the porosity prediction application—we should assume an extremely small

number of labeled samples available only at the drilled wells. Nevertheless, we see the cur-

rent research as a good starting point, and will further improve our method by using domain

knowledge and other heuristics to cope with fewer labeled samples.

Real Data Experiment We apply our TCRFR to a real petroleum reservoir located off the coast of Brazil. It covers an area of approximately 100 square kilometers and extends over 460 meters in depth. The data in this region comprises a 3D volume with 313 × 549 × 74 voxels

containing acoustic impedance samples. This data contains truly labeled data from only three

wells, with which no general-purpose machine learning method can cope. Accordingly, we

use additional labeled samples, which were created by geoscientists through a handcrafted

procedure (see Appendix B.2 for details).

Table 6.4 shows the performance of TCRFR and the baseline methods on the real data for

5% of labeled samples (including additional handcrafted labels). Similarly to the experiment


Panels: (a) Ground Truth, (b) MoE, (c) k-means + RR, (d) SVR, (e) RR, (f) TR, (g) TCRFR.

Figure 6.13 – Porosity prediction results for 5% of labeled data.

[Plot "Performance vs number of labeled examples": root mean squared error (vertical axis) against the percentage of labeled examples (2%, 5%, 10%, 15%) for MoE, KMRR, SVR, RR, TR, and TCRFR.]

Figure 6.14 – MAE, RMSE, and MDAE on synthetic seismic data for a range of labeled data fractions.

Panels, top row: (a) Facies ground truth, (b) 15%, (c) 10%, (d) 5%, (e) 2%, (f) 1%. Bottom row: (g) Porosity ground truth, (h) 15%, (i) 10%, (j) 5%, (k) 2%, (l) 1%.

Figure 6.15 – Estimated facies and the predicted porosity by TCRFR for diﬀerent fractions of labeled sam-

ples.


Method        MAE       MSE       RMSE      MDAE      R2
MoE           0.42502   0.55195   0.74268   0.22591   0.88991
k-means+RR    0.45002   0.44259   0.66513   0.28474   0.90910
SVR           0.48028   0.46350   0.68055   0.35463   0.90757
RR            0.45716   0.45581   0.67490   0.28999   0.90909
TR            0.45717   0.45581   0.67490   0.29000   0.90909
TCRFR         0.24225   0.13712   0.37001   0.14571   0.97264

Table 6.4 – Porosity prediction performance on the real data with 5% of labeled examples.

Panels: (a) Ground Truth, (b) MoE, (c) k-means + RR, (d) SVR, (e) RR, (f) TR, (g) TCRFR.

Figure 6.16 – Predicted porosity on the real data.

on synthetic data in the previous subsection, our TCRFR compares highly favorably with the

baselines.

Figure 6.16 shows the predicted porosity by TCRFR and the baseline methods. Note that

the ground truth here is not the true labels available only at the wells, but the additional

handcrafted labels. Again, TCRFR provides excellent results and allows a ﬁrst and useful

assessment of geologically attractive regions for oil exploration (red and yellow regions).

Application Outcome Handling data under spatio-temporal structure with limited labels

and, therefore, a combination of certainty and vast uncertainty requires novel robust mod-

eling strategies. We tackled this challenging problem in time-series analysis and porosity

prediction for the oil industry. Experiments on toy time-series data and synthetic porosity

prediction data clearly showed successful inference. Finally, we have studied real world data

from an oﬀshore oil ﬁeld and could show remarkable performance of our new model, which

compares very favorably with the state-of-the-art competitors.


6.6 Summary and Discussion

In this chapter, we proposed extensions of the support vector data description to cope with

contextual anomalies. We focused on the speciﬁc setting of latent class dependencies. Our

ﬁrst contribution—LatentSVDD—leveraged joint feature maps to incorporate latent classes

into the objective function. While the notion of joint feature maps allows great ﬂexibility

beyond latent classes (cf. Chapter 5), by deriving a restricted version—ClusterSVDD—and

reviewing rigorously its properties, we were able to show that k-means is contained as a

special case. Finally, we imposed structural dependencies among the latent variables themselves,

eﬀectively relating the input samples. Although the proposed method—TCRFR—was derived

with a regression setting in mind, we discussed an extension to one-class classiﬁcation—

ContextualSVDD. Extensive applications from neurosciences and geosciences display the

beneﬁts of the proposed solutions.

Although the applications draw a positive picture of our developed methods, various

limitations exist. Besides the fact that all of the presented methods are non-convex extensions

of convex base models, all of them are also computationally much more demanding.

Limits of LatentSVDD LatentSVDD with the proposed joint feature map tends to merge clusters with small sample sizes, which is the reason why some clusters remain empty in our application in Section 6.5.1. While this might seem like a huge benefit, there is unfortunately no straightforward way of controlling this behavior. Furthermore, while the joint feature map adds a lot of flexibility to encode latent feature space and input space, it remains an open question how to leverage this in real-world applications. Finally, the proposed formulation cannot leverage the flexibility of kernels.

Limits of ClusterSVDD ClusterSVDD, on the other hand, does not show this tendency to concentrate clusters and can be seen as a k-means variant with inherent anomaly detection. However, there is no clear rationale why each cluster should assume a fraction of ν outliers and hence, the resulting clustering and anomaly detection become less interpretable. Moreover, when setting ν = 1, the k-means algorithm is recovered and optimal solutions can be calculated analytically, while for any other setting a computationally much more demanding quadratic problem needs to be solved.
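As a hedged illustration of the ν = 1 case, assume the standard SVDD dual with box constraints on the dual variables (the exact ClusterSVDD formulation may differ in details, and $n_k$, the number of points currently assigned to cluster $k$, is notation introduced here). Under that assumption, the constraints alone determine the solution:

```latex
\[
\sum_{i=1}^{n_k} \alpha_i = 1, \qquad 0 \le \alpha_i \le \frac{1}{\nu n_k}
\quad\overset{\nu = 1}{\Longrightarrow}\quad
\alpha_i = \frac{1}{n_k} \ \text{for all } i, \qquad
c_k = \sum_{i=1}^{n_k} \alpha_i x_i = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i .
\]
```

Each center is then the mean of its assigned points, which is the k-means centroid update. For ν < 1 the box constraints no longer fix the dual variables, so a quadratic program has to be solved for each cluster.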

Limits of TCRFR The most severe limitation of our TCRFR method is computational. Especially demanding are the inference steps for large numbers of nodes and edges, i.e., the calculation of the partition function. Even though fast (and crude) approximations of the partition function are employed, due to the sheer number of function calls, these calculations take the bulk of the time necessary to converge. The impact of the approximation quality on the overall solution needs to be examined more closely.

6.6. Summary and Discussion 105

Source code and resources for the proposed methods are available on GitHub (see footnotes a and b). Parts of this chapter are based on:

Görnitz, N., Porbadnigk, A. K., Kloft, M., Binder, A., Sannelli, C., Braun, M.,

Müller, K.-R., “When brain and behavior disagree: A novel ML approach for

handling systematic label noise in EEG data”, in Machine Learning and Interpre-

tation in Neuroimaging Workshop (MLINI), 2013

Görnitz, N., Porbadnigk, A. K., Binder, A., Sannelli, C., Braun, M., Müller, K.-R.,

Kloft, M., “Learning and Evaluation in Presence of Non-i.i.d. Label Noise”, in

International Conference on Artiﬁcial Intelligence and Statistics (AISTATS), vol. 33,

2014, pp. 293–302

Porbadnigk, A. K., Görnitz, N., Sannelli, C., Binder, A., Braun, M., Kloft, M.,

Müller, K.-R., “When Brain and Behavior Disagree: Tackling systematic label

noise in EEG data with Machine Learning”, in IEEE International Winter Work-

shop on Brain-Computer Interface (BCI), 2014

Porbadnigk, A. K., Görnitz, N., Sannelli, C., Binder, A., Braun, M., Kloft, M.,

Müller, K.-R., “Extracting latent brain states — Towards true labels in cognitive

neuroscience experiments”, NeuroImage, vol. 120, pp. 225–253, 2015

Görnitz, N., Lima, L. A., Varella, L. E., Müller, K.-R., Nakajima, S., “Transductive

Regression for Data with Latent Dependency Structure”, IEEE Transactions on

Neural Networks and Learning (TNNLS), 2017

Görnitz, N., Lima, L. A., Müller, K.-R., Kloft, M., Nakajima, S., “Support vec-

tor data descriptions and k-means clustering: one class?”, IEEE Transactions on

Neural Networks and Learning (TNNLS), 2017

Lima, L. A., Görnitz, N., Varella, L. E., Vellasco, M., Müller, K.-R., Nakajima,

S., “Porosity Estimation by Semi-supervised Learning with Sparsely Available

Labeled Samples”, Computers & Geosciences, vol. 106, pp. 33–48, 2017

a https://github.com/nicococo/tilitools

b https://github.com/nicococo/niidbox


Chapter 7

Conclusions

The world exploded into a whirling

network of kinships, where everything

pointed to everything else, everything

explained everything else.

Umberto Eco (Foucault’s Pendulum)

We have addressed the central research question of how to tie together various infor-

mation, such as labels, dependency structure, sparseness, to obtain better anomaly detection

models for the three diﬀerent classes of anomalies. To that end, we have presented exten-

sions to the one-class SVM as well as the related support vector data description (SVDD)

to incorporate various kinds of side information, e.g., dependency structure. We showed for

artiﬁcially generated data as well as for a variety of real-world applications that incorporat-

ing side information does help to increase detection performance when compared with the

respective base models. In detail, we presented the following extensions in the spirit of the

one-class classiﬁcation paradigm:

Point Anomalies Assuming that anomalies are scarce and occur independently of

each other, methods for controlling the sparsity of the found solutions in terms of single

independent features (Semi-supervised ℓp-norm regularized one-class SVM) and

groups of features (Semi-supervised ℓp-norm regularized multiple kernel learn-

ing one-class SVM) have been derived.

Collective Anomalies In this scenario anomalies are assumed to appear as groups of

measurements instead of single entries. Techniques from structured output learning

have been (i) extended to cope with large-scale problems (Bundle Methods Optimiza-

tion for SSVM), (ii) employed to derive an unsupervised anomaly detector (Latent

Structure Anomaly Detector) for groups of measurements that exhibit a latent de-

pendency structure.

Contextual Anomalies Anomalies appear only in speciﬁc contexts and are supposed

to carry two signals that contain behavioral and contextual information. Contributions

in this scenario consider latent class dependencies and are threefold: (i) we derived

a method capable of detecting contextual anomalies (LatentSVDD), (ii) theoretical

insights reveal k-means as a special case (ClusterSVDD), and (iii) a method for learning

with latent class dependencies when an additional structure is imposed on the latent

variables (TCRFR for regression and ContextualSVDD for anomaly detection).

However, we would like to emphasize that extending the basic model to incorporate side information such as data dependency structure is no silver bullet: it comes with higher complexity and thus more possibilities to fail. In many cases we had to give up on desired properties such as convexity in order to derive a solution.

108 Chapter 7. Conclusions

A research direction of growing importance, not only for anomaly detection, will be the

interpretability of complex methods [252–258]. Explanations of single decisions, models, or data sets greatly help the acceptance of those models in application domains. Furthermore, they can reveal, in a way a human can understand, important information that otherwise might have stayed hidden.

In a broader perspective, solving complicated real-world problems requires taking every single bit of information into account, and no other technique has been more successful at this, or has transformed machine learning more in recent years, than deep learning. To be successful, however, those methods need large-scale data and corresponding labels. There have been important attempts towards unsupervised and semi-supervised extensions, most importantly generative adversarial networks (GANs) [259] and (variational) autoencoders. Unsupervised learning of concise, meaningful and interpretable feature descriptions is the holy grail of anomaly detection. There have been attempts at combining techniques for deep learning and one-class classification [91] with promising results. However, there are a number of complex technical details that need to be solved in order to avoid trivial solutions, e.g., manifold collapse.


Appendix A

Learning with Structured Data

A.1 Proofs of Results in Section 5.3

We show the equivalence of (5.10) and (P) for the loss $l(t) = \max(0, t)$.

Proof of equivalence of (5.10) and (P) for $l(t) = \max(0, t)$. First note that for the loss $l(t) = \max(0, t)$ the problem (5.10) becomes the structured one-class SVM problem (P′) from Section 5.3. To see that (5.10) is equivalent to (P′), we employ the variable substitution $\tilde{w} := w/\rho^*$ in (5.10). This yields

\[
\text{Eq. (P′)} = -\rho^* + \rho^* \min_{w \in H} \frac{1}{2}\|\tilde{w}\|^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \max\Bigl(0,\, 1 - \max_{z \in Z} \bigl(\langle \tilde{w}, \Psi(x_i, z)\rangle + \delta(z)'\bigr)\Bigr), \tag{A.1}
\]

where $\delta(z)' = \delta(z)/\rho^*$ and $\rho^*$ is optimal in (P′). Thus, in order to solve (A.1) (and thus (P′)), it is sufficient to solve

\[
\min_{w \in H} \frac{1}{2}\|w\|^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \max\Bigl(0,\, 1 - \max_{z \in Z} \bigl(\langle w, \Psi(x_i, z)\rangle + \delta(z)\bigr)\Bigr). \tag{A.2}
\]

By Lemma 5 below, for each $\nu \in\, ]0, 1]$, there exists a $C > 0$ such that (A.2) is, indeed, equivalent to (5.10).

Lemma 5. Let $D \subset \mathbb{R}^d$ be a set, and let $f, g : D \to \mathbb{R}$ be arbitrary functions. Consider the optimization tasks

\[
\min_{x \in D} f(x) + \sigma g(x), \tag{A.3}
\]

\[
\min_{x \in D:\, g(x) \le \tau} f(x). \tag{A.4}
\]

Assume that the minima exist. Then we have that for each $\sigma > 0$ there exists $\tau > 0$ such that OP (A.3) is equivalent to OP (A.4), that is, each optimal solution $x^*$ of one is an optimal solution of the other, and vice versa.

Proof. The proof is similar to the one of Proposition 12 in [105]. Let $\sigma > 0$ and let $x^*$ be an optimal solution of (A.3). We have to show that there exists a $\tau > 0$ such that $x^*$ is optimal in (A.4).


We set $\tau = g(x^*)$. Suppose $x^*$ is not optimal in (A.4), that is, there exists $\tilde{x} \in D$ with $g(\tilde{x}) \le \tau$ such that $f(\tilde{x}) < f(x^*)$. Then we have

\[
f(\tilde{x}) + \sigma g(\tilde{x}) < f(x^*) + \sigma\tau,
\]

which by $\tau = g(x^*)$ translates to

\[
f(\tilde{x}) + \sigma g(\tilde{x}) < f(x^*) + \sigma g(x^*).
\]

This contradicts the optimality of $x^*$ in (A.3), and hence shows that $x^*$ is optimal in (A.4), which was to be shown.
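The construction used in this proof (take $\tau = g(x^*)$) can be checked numerically on a toy instance. The functions `f` and `g`, the value of σ, and the discretized domain below are arbitrary choices made for illustration only; they do not come from the thesis.

```python
def f(x):
    return (x - 2.0) ** 2


def g(x):
    return x ** 2


sigma = 1.0
# Discretized domain D = {-3.00, -2.99, ..., 3.00}.
domain = [i / 100.0 for i in range(-300, 301)]

# Solve the penalized problem (A.3) by exhaustive search over the grid.
penalized_idx = min(range(len(domain)), key=lambda i: f(domain[i]) + sigma * g(domain[i]))
x_star = domain[penalized_idx]

# Choose tau as in the proof and solve the constrained problem (A.4).
tau = g(x_star)
feasible = [i for i in range(len(domain)) if g(domain[i]) <= tau]
constrained_idx = min(feasible, key=lambda i: f(domain[i]))

print(x_star, tau, domain[constrained_idx])
```

On this instance the penalized objective is $2x^2 - 4x + 4$, which is minimized at $x = 1$, and the constrained problem with $\tau = 1$ has the same minimizer.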

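Proof of Theorem 8. By [200] we have that, if $l$ is $L$-Lipschitz and ranges in $[0, D]$, with probability at least $1 - \epsilon$ over the draw of the sample,

\[
\mathbb{E}\, l(\hat{f}) - \mathbb{E}\, l(f^*) \le 8 L R_n(F) + \frac{l(0)}{n} + D \sqrt{\frac{2 \log(2/\epsilon)}{n}}, \tag{A.5}
\]

where $R_n(F) := \mathbb{E} \sup_{f \in F} \bigl|\frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i)\bigr|$ is the Rademacher complexity of the class $F$ and $\sigma_1, \ldots, \sigma_n$ denote i.i.d. Rademacher variables (random signs). For many learning algorithms $R_n(G)$ is of the order $O(1/\sqrt{n})$ when employing appropriate regularization, and thus so is (A.5). We will show that the latent anomaly detection method of (5.10) enjoys this favorable rate, too. By definition of the Rademacher complexity of $F$,

\[
\begin{aligned}
R_n(F) &= \mathbb{E} \max_{f \in F} \Bigl|\frac{1}{n} \sum_{i=1}^{n} \sigma_i f(X_i)\Bigr| \\
&= \mathbb{E} \max_{w \in H:\, \|w\| \le C} \Bigl|\frac{1}{n} \sum_{i=1}^{n} \sigma_i \Bigl(1 - \max_{z \in Z} \bigl(\langle w, \Psi(X_i, z)\rangle + \delta(z)\bigr)\Bigr)\Bigr| \\
&\le \underbrace{\Bigl(1 + \max_{z \in Z} |\delta(z)|\Bigr)\, \mathbb{E} \Bigl|\frac{1}{n} \sum_{i=1}^{n} \sigma_i\Bigr|}_{(*)} + \underbrace{\mathbb{E} \max_{w \in H:\, \|w\| \le C} \Bigl|\frac{1}{n} \sum_{i=1}^{n} \sigma_i \max_{z \in Z} \langle w, \Psi(X_i, z)\rangle\Bigr|}_{(**)} .
\end{aligned}
\]

We bound the two summands in the above expression separately. On the one hand, by Jensen's inequality, $\mathbb{E} \bigl|\frac{1}{n} \sum_{i=1}^{n} \sigma_i\bigr| \le \sqrt{\mathbb{E} \frac{1}{n^2} \sum_{i,j=1}^{n} \sigma_i \sigma_j} = \frac{1}{\sqrt{n}}$ because $\mathbb{E} \sigma_i \sigma_j = 0$ when $i \ne j$, which shows $(*) \le \frac{1 + A}{\sqrt{n}}$. To bound the second summand, note that $(**) \le R_n(F')$ with $F'$ defined as $F' := \{ f_w = (x \mapsto \max_{z \in Z} \langle w, \Psi(x, z)\rangle) : \|w\| \le C \}$. Furthermore, put $F'' := \{ f_w = (x \mapsto \max_{z \in Z} f_z) : f_z \in F_z,\, z \in Z \}$ and $F_z := \{ f_w = (x \mapsto \langle w, \Psi(x, z)\rangle) : \|w\| \le C \}$. Clearly, $F' \subset F''$ and thus $R_n(F') \le R_n(F'')$. By Lemma 6 below, $R_n(F'')$ is itself bounded by $R_n(F'') \le \sum_{z \in Z} R_n(F_z)$, and the terms $R_n(F_z)$, for each $z \in Z$, are known from [200] to be bounded as $R_n(F_z) \le \frac{BC}{\sqrt{n}}$.¹ This shows $(**) \le \frac{BC|Z|}{\sqrt{n}}$. The result is then obtained from (A.5) by noting that $D$ can be chosen as $D := L(1 + A + BC)$.

¹ Again, this quickly follows from Jensen's inequality because $\mathbb{E} \sigma_i \sigma_j = 0$ when $i \ne j$.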
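The Jensen step used for the Rademacher averages, $\mathbb{E}\bigl|\frac{1}{n}\sum_i \sigma_i\bigr| \le 1/\sqrt{n}$, can be verified exactly for small $n$ by enumerating all sign vectors. This is a sanity check written for illustration; the function name `expected_abs_mean_sign` is hypothetical.

```python
import math
from itertools import product


def expected_abs_mean_sign(n):
    """Exact value of E|1/n * sum_i sigma_i| over all 2^n equally likely sign vectors."""
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        total += abs(sum(signs)) / n
    return total / 2 ** n


for n in range(1, 11):
    lhs = expected_abs_mean_sign(n)
    rhs = 1.0 / math.sqrt(n)
    print(n, round(lhs, 4), round(rhs, 4), lhs <= rhs + 1e-12)
```

The bound is attained with equality only for $n = 1$; for larger $n$ the expectation stays strictly below $1/\sqrt{n}$, consistent with Jensen's inequality being strict for non-degenerate random variables.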


In the proof of Theorem 8 above, we use the following result.

Lemma 6 (Lemma 8.1 in [201]). Let $F_1, \ldots, F_l$ be sets of functions $f : X \to \mathbb{R}$, and let $F := \{\max(f_1, \ldots, f_l) : f_i \in F_i,\, i \in \{1, \ldots, l\}\}$. Then,

\[
R_n(F) \le \sum_{j=1}^{l} R_n(F_j).
\]

Sketch of proof [201]. The idea of the proof is to write $\max(h_1, h_2) = \frac{1}{2}(h_1 + h_2 + |h_1 - h_2|)$, and then to show that

\[
\mathbb{E}\Bigl[\sup_{h_1 \in F_1,\, h_2 \in F_2} \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, |h_1(x_i) - h_2(x_i)|\Bigr] \le R_n(F_1) + R_n(F_2).
\]

This proof technique also generalizes to $l > 2$. For the complete proof see Section 8 in [201].

A.2 Proofs of Results in Section 5.3 II

Proof of Theorem 11. First observe that $\alpha_i^* \max(0, f(x_i)) = 0$ holds for all $i = 1, \ldots, n$ in the optimal point of the Lagrangian saddle point problem.² This implies that we have $f(x_i) \le 0$ if $x_i$ is a support vector (that is, $\alpha_i^* > 0$) [33, 196]. Since $\sum_{i=1}^{n} \alpha_i^* = 1$ and $\alpha_i^* \le \frac{1}{\nu n}$, there must be at least $\lceil \nu n \rceil$ such points (the function $\lceil \cdot \rceil$ rounds a real number up to the next larger integer). Hence there can be no more than $n - \lceil \nu n \rceil$ points with $f(x_i) > 0$, which corresponds to a fraction of $\frac{n - \lceil \nu n \rceil}{n} \le 1 - \nu$, and thus shows assertion (b). Next observe that if we have $f(x_i) < 0$ then $\alpha_i^* = \frac{1}{\nu n}$ (to see this, note that if $\alpha_i^* < \frac{1}{\nu n}$ we could increase the objective of the Lagrangian by increasing $\alpha_i^*$, which would contradict the optimality of $\alpha_i^*$). Since $\sum_{i=1}^{n} \alpha_i^* = 1$, there can be no more than $\lfloor \nu n \rfloor$ such points, which corresponds to a fraction of $\frac{\lfloor \nu n \rfloor}{n} \le \nu$, thus showing assertion (a).

² For convex problems, this statement is known as the KKT condition complementary slackness. The argument

holds, however, for the solution of the Lagrangian saddle point problem, regardless of whether or not the problem

is convex, and for arbitrary objective and constraint functions.
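The two counting arguments in the proof of Theorem 11 can be made concrete with a small calculation. The sketch below only evaluates the ceiling and floor expressions from the proof; the function name `nu_property_counts` is hypothetical, and ν is passed as an exact fraction to avoid floating-point rounding at integer boundaries.

```python
import math
from fractions import Fraction


def nu_property_counts(n, nu):
    """Return (minimum number of support vectors, maximum number of points with f(x_i) < 0)."""
    product_nu_n = Fraction(nu) * n
    min_support_vectors = math.ceil(product_nu_n)   # each alpha_i <= 1/(nu n) and they sum to 1
    max_strictly_negative = math.floor(product_nu_n)  # each such point has alpha_i = 1/(nu n)
    return min_support_vectors, max_strictly_negative


min_sv, max_neg = nu_property_counts(10, Fraction(1, 4))
print(min_sv, max_neg)
```

For $n = 10$ and $\nu = 1/4$ we have $\nu n = 2.5$, so at least 3 points are support vectors and at most 2 points can satisfy $f(x_i) < 0$.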


Appendix B

Learning with Latent Class Dependencies

B.1 Analysis of LatentSVDD

When bounding the Rademacher complexity for Lipschitz continuous loss classes (such as

the hinge loss or the squared loss), the following lemma is often very helpful.

Lemma 7 (Talagrand's lemma [260]). Let $l : \mathbb{R} \to \mathbb{R}$ be a loss function that is $L$-Lipschitz continuous with $l(0) = 0$. Let $F$ be a hypothesis class of real-valued functions and denote its loss class by $G := l \circ F$. Then the following inequality holds:

\[
R_n(G) \le 2 L R_n(F).
\]

We can use the above result to prove Lemma 1.

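Proof of Lemma 1. Since the LatentSVDD loss function is 1-Lipschitz with $l(0) = 0$, by Lemma 7 it is sufficient to bound $R(\mathcal{F}_{\mathrm{SVDD}}(z))$. To this end, it holds that

\[
\begin{aligned}
R(\mathcal{F}_{\mathrm{SVDD}}(z))
&\overset{\text{def.}}{=} \mathbb{E}\left[\sup_{c,\,\Omega:\; 0 \le \|c\|^2 + \Omega \le \lambda} \frac{1}{n}\sum_{i=1}^{n} \sigma_i \left(\Omega + 2\langle c, \Psi(x_i, z)\rangle - \|\Psi(x_i, z)\|^2\right)\right] \\
&\le \mathbb{E}\left[\sup_{\Omega:\; -\lambda \le \Omega \le \lambda} \frac{1}{n}\sum_{i=1}^{n} \sigma_i \Omega\right]
+ 2\,\mathbb{E}\left[\sup_{c:\; \|c\|^2 \le \lambda} \frac{1}{n}\sum_{i=1}^{n} \sigma_i \langle c, \Psi(x_i, z)\rangle\right] \\
&\quad + \underbrace{\mathbb{E}\left[-\frac{1}{n}\sum_{i=1}^{n} \sigma_i \|\Psi(x_i, z)\|^2\right]}_{=\,0 \text{ (by symmetry of } \sigma_i\text{)}}.
\end{aligned}
\tag{A.1.1}
\]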

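Note that the term to the right is zero because the Rademacher variables are random signs, independent of $x_1, \dots, x_n$. The term to the left can be bounded as follows:

\[
\mathbb{E}\left[\sup_{\Omega:\; -\lambda \le \Omega \le \lambda} \frac{1}{n}\sum_{i=1}^{n} \sigma_i \Omega\right]
= \lambda\, \mathbb{E}\left[\left|\frac{1}{n}\sum_{i=1}^{n} \sigma_i\right|\right]
\overset{(*)}{\le} \lambda \sqrt{\mathbb{E}\left[\frac{1}{n^2}\sum_{i,j=1}^{n} \sigma_i \sigma_j\right]}
= \frac{\lambda}{\sqrt{n}},
\tag{A.1.2}
\]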


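where for $(*)$ we employ Jensen’s inequality. Moreover, applying the Cauchy-Schwarz inequality and Jensen’s inequality, respectively, we obtain

\[
\begin{aligned}
\mathbb{E}\left[\sup_{c:\; \|c\|^2 \le \lambda} \frac{1}{n}\sum_{i=1}^{n} \sigma_i \langle c, \Psi(x_i, z)\rangle\right]
&\overset{\text{C.-S.}}{\le} \mathbb{E}\left[\sup_{c:\; \|c\|^2 \le \lambda} \|c\| \left\|\frac{1}{n}\sum_{i=1}^{n} \sigma_i \Psi(x_i, z)\right\|\right] \\
&\overset{\text{Jensen}}{\le} \sqrt{\lambda\, \mathbb{E}\left[\frac{1}{n^2}\sum_{i,j=1}^{n} \sigma_i \sigma_j \langle \Psi(x_i, z), \Psi(x_j, z)\rangle\right]} \\
&= \sqrt{\lambda\, \frac{1}{n^2}\sum_{i=1}^{n} \|\Psi(x_i, z)\|^2}
\le B\sqrt{\frac{\lambda}{n}}
\end{aligned}
\tag{A.1.3}
\]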

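because $P(\|\Psi(x_i, z)\| \le B) = 1$. Hence, inserting the results (A.1.2) and (A.1.3) into (A.1.1) yields the claimed result, that is,

\[
R(\mathcal{G}_{\mathrm{SVDD}}(z)) \overset{\text{Lemma 7}}{\le} R(\mathcal{F}_{\mathrm{SVDD}}(z)) \le \frac{\lambda}{\sqrt{n}} + B\sqrt{\frac{\lambda}{n}} = \frac{\lambda + B\sqrt{\lambda}}{\sqrt{n}}.
\tag{A.1.4}
\]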
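The two Jensen steps, (A.1.2) and (A.1.3), can be verified on a toy sample by enumerating all sign vectors. This is only a sketch under stated assumptions: the two-dimensional vectors in `features` stand in for $\Psi(x_i, z)$, the value of `lam` is arbitrary, and the supremum over $c$ is evaluated in closed form as $\sqrt{\lambda}\,\|\frac{1}{n}\sum_i \sigma_i \Psi(x_i, z)\|$, which is the case of equality in the Cauchy-Schwarz step.

```python
import math
from itertools import product

def expected_abs_mean_sign(n):
    """E | (1/n) sum_i sigma_i | over all 2^n sign vectors."""
    return sum(abs(sum(s)) / n for s in product((-1, 1), repeat=n)) / 2 ** n

def expected_norm_of_signed_mean(features):
    """E || (1/n) sum_i sigma_i * psi_i || by full enumeration."""
    n, dim = len(features), len(features[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        mean = [sum(s * psi[d] for s, psi in zip(signs, features)) / n
                for d in range(dim)]
        total += math.sqrt(sum(m * m for m in mean))
    return total / 2 ** n

lam = 2.0                                    # hypothetical value of lambda
features = [(1.0, 0.0), (0.6, 0.8), (-0.3, 0.4), (0.0, -1.0),
            (0.5, 0.5), (-0.8, 0.6), (0.2, -0.1), (0.9, 0.1)]
n = len(features)
B = max(math.sqrt(a * a + b * b) for a, b in features)   # norm bound

term_omega = lam * expected_abs_mean_sign(n)              # middle of (A.1.2)
bound_omega = lam / math.sqrt(n)                          # right side of (A.1.2)

term_c = math.sqrt(lam) * expected_norm_of_signed_mean(features)  # sup over c
bound_c = B * math.sqrt(lam / n)                          # right side of (A.1.3)
```

Both exact expectations stay below their bounds; the gap reflects the slack introduced by Jensen's inequality and by replacing each norm with $B$.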

Next, we invoke the following result, taken from [201] (Lemma 8.1).

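Lemma 8. Let $\mathcal{F}_1, \dots, \mathcal{F}_l$ be hypothesis sets in $\mathbb{R}^{\mathcal{X}}$, and let $\mathcal{F} := \{\max(f_1, \dots, f_l) : f_i \in \mathcal{F}_i,\ i \in \{1, \dots, l\}\}$. Then,

\[
R_n(\mathcal{F}) \le \sum_{j=1}^{l} R_n(\mathcal{F}_j).
\]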

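Sketch of proof [201]. The idea of the proof is to write $\max(h_1, h_2) = \frac{1}{2}\left(h_1 + h_2 + |h_1 - h_2|\right)$, and then to show that

\[
\mathbb{E}\left[\sup_{h_1 \in \mathcal{F}_1,\, h_2 \in \mathcal{F}_2} \frac{1}{n}\sum_{i=1}^{n} \sigma_i\,|h_1(x_i) - h_2(x_i)|\right] \le R_n(\mathcal{F}_1) + R_n(\mathcal{F}_2).
\]

This proof technique also generalizes to $l > 2$.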
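The bound for the class of pointwise maxima can also be checked exactly on a small example. The sketch below assumes two finite hypothesis classes given by made-up function values on six sample points and uses the empirical (fixed-sample) Rademacher complexity; `rademacher` and `pointwise_max` are illustrative helper names.

```python
from itertools import product

def rademacher(values):
    """Exact empirical Rademacher complexity of a finite function class."""
    n = len(values[0])
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=n):
        total += max(sum(s * v for s, v in zip(signs, vec)) for vec in values) / n
    return total / 2 ** n

def pointwise_max(h1, h2):
    # Uses the identity max(a, b) = (a + b + |a - b|) / 2 from the proof sketch.
    return [0.5 * (a + b + abs(a - b)) for a, b in zip(h1, h2)]

# Hypothetical function values of two small hypothesis classes.
F1 = [[0.4, -0.9, 1.3, 0.0, -0.5, 0.7],
      [-1.1, 0.2, 0.6, 0.8, -0.3, -0.4]]
F2 = [[0.1, 0.5, -0.7, 1.0, 0.9, -1.2],
      [-0.6, -0.2, 0.3, -0.9, 1.4, 0.5]]

# The class {max(h1, h2) : h1 in F1, h2 in F2}
F_max = [pointwise_max(h1, h2) for h1 in F1 for h2 in F2]

lhs = rademacher(F_max)
rhs = rademacher(F1) + rademacher(F2)
```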

We can use Lemma 8 and Lemma 1 to conclude the main theorem of this paper, that is, Theorem 13, which establishes generalization guarantees of the usual order $O(1/\sqrt{n})$ for the proposed LatentSVDD method.

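Proof of Theorem 13. First observe that, because $l$ is 1-Lipschitz,

\[
R_n(\mathcal{G}_{\mathrm{LatentSVDD}}) \le R_n(\mathcal{F}_{\mathrm{LatentSVDD}}).
\]

Next, note that we can write

\[
\mathcal{F}_{\mathrm{LatentSVDD}} = \left\{ \max_{z \in \mathcal{Z}} (f_z) : f_z \in \mathcal{F}_{\mathrm{SVDD}}(z) \right\}.
\]

Thus, by Lemma 2 and Lemma 4,

\[
R_n(\mathcal{F}_{\mathrm{LatentSVDD}}) \le |\mathcal{Z}| \max_{z \in \mathcal{Z}} R_n(\mathcal{F}_{\mathrm{SVDD}}(z)) \le |\mathcal{Z}|\, \frac{\lambda + B\sqrt{\lambda}}{\sqrt{n}}.
\]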


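Moreover, observe that the loss function in the definition of $\mathcal{G}_{\mathrm{LatentSVDD}}$ can only range in the interval $[0, B]$. Thus, Theorem 13 in the main paper gives the claimed result, that is,

\[
\mathbb{E}[\hat{g}_n] - \mathbb{E}[g^*] \le 4 R_n(\mathcal{G}_{\mathrm{LatentSVDD}}) + B\sqrt{\frac{2\log(2/\delta)}{n}} \le 4|\mathcal{Z}|\, \frac{\lambda + B\sqrt{\lambda}}{\sqrt{n}} + B\sqrt{\frac{2\log(2/\delta)}{n}}.
\]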
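To get a feeling for the $O(1/\sqrt{n})$ rate, the right-hand side of the final inequality can be evaluated for concrete values. This is a small illustrative sketch: the function name `latent_svdd_bound`, its parameter names, and the numbers plugged in are assumptions and do not come from the experiments of this work.

```python
import math

def latent_svdd_bound(n, num_latent, lam, B, delta):
    """Evaluate 4|Z|(lam + B*sqrt(lam))/sqrt(n) + B*sqrt(2*log(2/delta)/n)."""
    complexity = 4.0 * num_latent * (lam + B * math.sqrt(lam)) / math.sqrt(n)
    confidence = B * math.sqrt(2.0 * math.log(2.0 / delta) / n)
    return complexity + confidence

# Hypothetical setting: 3 latent states, lambda = 1, B = 1, confidence 95%.
bound_small = latent_svdd_bound(n=1_000, num_latent=3, lam=1.0, B=1.0, delta=0.05)
bound_large = latent_svdd_bound(n=4_000, num_latent=3, lam=1.0, B=1.0, delta=0.05)
ratio = bound_large / bound_small   # quadrupling n halves the bound
```

Both terms scale as $1/\sqrt{n}$, so the bound shrinks by a factor of two whenever the sample size is quadrupled, while it grows linearly in the number of latent states $|\mathcal{Z}|$.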

B.2 Handcrafted procedure for porosity estimation

The common procedure for porosity estimation involves many intermediate decisions based on domain knowledge, and it relies upon the interpolation method known as kriging [250]. The following are the main steps [247]:

1. First, the volume needs to be segmented into facies, which is usually accomplished by applying a combination of semi-automatic clustering methods and domain knowledge. The result is a facies model. All the following steps then need to be executed within each facies;

2. Determine the degree of correlation between the porosity, sampled in the drilled wells, and the seismic data, available at every node of the volume. Calibrate the seismic data to porosity from the well data samples;

3. Define a function describing the degree of spatial dependence of the seismic-derived porosity. This function is known as a variogram, and it is defined as the variance of the difference between a property value at two different locations in the reservoir. Three variograms must be created, one for each of the x, y, and z directions, due to the anisotropy usually present in the reservoir;

4. Create a variogram model consistent with the data as a result of the previous step and fit the model parameters;

5. Choose a kriging method (simple, ordinary, anisotropic, universal, etc.), passing the variogram model created in the previous step, and interpolate the data, generating the porosity volume.
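Step 3 above can be illustrated with a directional empirical variogram on a regular grid. The following is a hedged sketch rather than the procedure used in [247]: the grid layout, the toy porosity values, the function name `directional_semivariogram`, and the factor $\frac{1}{2}$ (the common semivariogram convention) are all assumptions made for illustration.

```python
def directional_semivariogram(grid, axis, max_lag):
    """Empirical semivariogram along one grid axis.

    grid: dict mapping integer nodes (ix, iy, iz) to a property value.
    Returns [gamma(1), ..., gamma(max_lag)], where gamma(h) is half the
    mean squared difference of all node pairs that are h steps apart
    along `axis`.
    """
    gammas = []
    for lag in range(1, max_lag + 1):
        squared_diffs = []
        for node, value in grid.items():
            neighbour = list(node)
            neighbour[axis] += lag
            other = grid.get(tuple(neighbour))
            if other is not None:
                squared_diffs.append((value - other) ** 2)
        if squared_diffs:
            gammas.append(0.5 * sum(squared_diffs) / len(squared_diffs))
        else:
            gammas.append(float("nan"))   # no pairs at this lag
    return gammas

# Toy volume: the value changes only along x, so the field is anisotropic.
grid = {(ix, iy, iz): 0.25 * ix
        for ix in range(4) for iy in range(3) for iz in range(3)}

gamma_x = directional_semivariogram(grid, axis=0, max_lag=2)
gamma_y = directional_semivariogram(grid, axis=1, max_lag=2)
gamma_z = directional_semivariogram(grid, axis=2, max_lag=2)
```

In this toy volume the variogram grows with the lag along x and is zero along y and z, which is the kind of anisotropy that motivates one variogram per direction.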

The procedure described above demands a great amount of specialized human effort, usually taking days or even weeks to be accomplished.
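For orientation, the interpolation in step 5 can be sketched for the ordinary kriging variant. The code below is a minimal illustration and not the workflow of [247] or [250]: the exponential variogram model, its parameters `sill` and `length`, the sample locations, and the porosity values are hypothetical, and the small linear solver is included only to keep the example self-contained.

```python
import math

def exp_variogram(h, sill=1.0, length=2.0):
    """Exponential variogram model without nugget (assumed for illustration)."""
    return sill * (1.0 - math.exp(-h / length))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def ordinary_kriging(points, values, target, variogram=exp_variogram):
    """Return (prediction, weights) at `target` from the sampled values."""
    n = len(points)
    # Variogram matrix bordered by the constraint that the weights sum to one.
    a = [[variogram(distance(points[i], points[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [variogram(distance(p, target)) for p in points] + [1.0]
    weights = solve(a, b)[:n]         # last entry is the Lagrange multiplier
    return sum(w * z for w, z in zip(weights, values)), weights

# Hypothetical well locations and porosity samples.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
values = [0.12, 0.18, 0.15, 0.22]
prediction, weights = ordinary_kriging(points, values, target=(1.0, 1.0))
at_sample, _ = ordinary_kriging(points, values, target=points[1])
```

Without a nugget term the interpolation is exact at the sampled locations, so `at_sample` reproduces the second sample value up to rounding, and the weights always sum to one.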


Bibliography

[1] Chandola, V., Banerjee, A., Kumar, V., “Anomaly detection: A survey”, ACM Computing Surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.

[2] Aggarwal, C. C., Outlier Analysis. Springer, 2013.

[3] Harmeling, S., Dornhege, G., Tax, D., Meinecke, F., Müller, K.-R., “From outliers to

prototypes: Ordering data”, Neurocomputing, vol. 69, no. 13-15, pp. 1608–1618, 2006.

[4] Moya, M. M., “A constrained second-order network with mean square error min-

imization and boundary size minimization for one-class classiﬁcation”, Sandia Na-

tional Labs., Albuquerque, NM (United States), Tech. Rep., 1993.

[5] Moya, M. M., Hush, D. R., “Network constraints and multi-objective optimization for one-class classification”, Neural Networks, vol. 9, no. 3, pp. 463–474, 1996.

[6] Laskov, P., Schäfer, C., Kotenko, I., Müller, K.-R., “Intrusion detection in Unlabeled

Data with Quarter-sphere Support Vector Machines”, Detection of Intrusions and Mal-

ware, and Vulnerability Assessment (DIMVA), vol. 27, pp. 71–82, 2004.

[7] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., Williamson, R. C., “Estimating

the Support of a High-dimensional Distribution”, Neural Computation, vol. 13, no. 7,

pp. 1443–1471, 2001.

[8] Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., Platt, J., “Support Vector Method for Novelty Detection”, in Advances in Neural Information Processing Systems (NIPS), 2000, pp. 582–588.

[9] Tax, D., Duin, R., “Support Vector Data Description”, Machine Learning, vol. 54, pp. 45–

66, 2004.

[10] Görnitz, N., Kloft, M., Rieck, K., Brefeld, U., “Active learning for network intru-

sion detection”, in ACM Workshop on Artiﬁcial Intelligence and Security (AISec), 2009,

pp. 47–54.

[11] Görnitz, N., Kloft, M., Brefeld, U., “Active and semi-supervised data domain description”, in European Conference on Machine Learning & Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Springer, 2009, pp. 407–422.

[12] Görnitz, N., Kloft, M., Rieck, K., Brefeld, U., “Toward Supervised Anomaly Detec-

tion”, Journal of Artiﬁcial Intelligence Research (JAIR), vol. 46, pp. 235–262, 2013.

[13] Banerjee, A., Burlina, P., Diehl, C., “A support vector method for anomaly detection

in hyperspectral imagery”, IEEE Transactions on Geoscience and Remote Sensing, vol.

44, no. 8, pp. 2282–2291, 2006.

[14] Schölkopf, B., Giesen, J., Spalinger, S., “Kernel methods for implicit surface modeling”, in Advances in Neural Information Processing Systems (NIPS), 2004, pp. 1193–1200.

[15] Görnitz, N., Porbadnigk, A. K., Binder, A., Sannelli, C., Braun, M., Müller, K.-R., Kloft, M., “Learning and Evaluation in Presence of Non-i.i.d. Label Noise”, in International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 33, 2014, pp. 293–302.


[16] Görnitz, N., Porbadnigk, A. K., Kloft, M., Binder, A., Sannelli, C., Braun, M., Müller,

K.-R., “When brain and behavior disagree: A novel ML approach for handling system-

atic label noise in EEG data”, in Machine Learning and Interpretation in Neuroimaging

Workshop (MLINI), 2013.

[17] Gao, H., Tang, J., Liu, H., “Mobile location prediction in spatio-temporal context”, Nokia Mobile Data Challenge Workshop, no. 2, pp. 1–4, 2012.

[18] Kanamori, H., “Earthquake prediction: An overview”, International Handbook of Earth-

quake and Engineering Seismology, vol. 81, pp. 1205–1216, 2003.

[19] Xu, W., Tran, T. T., Srivastava, R. M., Journel, A. G., “Integrating Seismic Data in

Reservoir Modeling: The Collocated Cokriging Alternative”, in Annual Technical Con-

ference and Exhibition of the Society of Petroleum Engineers, 1992, pp. 833–842.

[20] Pawelzik, K., Kohlmorgen, J., Müller, K.-R., “Annealed Competition of Experts for a

Segmentation and Classiﬁcation of Switching Dynamics”, Neural Computation, vol.

8, no. 2, pp. 340–356, 1996.

[21] Cetin, M., Comert, G., “Short-term traffic flow prediction with regime switching models”, Journal of the Transportation Research Board, pp. 23–31, 2006.

[22] Widmer, C., Kloft, M., Görnitz, N., Raetsch, G., “Eﬃcient Training of Graph-Regularized

Multitask SVMs”, in European Conference on Machine Learning & Principles and Prac-

tice of Knowledge Discovery in Databases (ECML PKDD), 2012, pp. 633–647.

[23] Nasir, J. A., Görnitz, N., Brefeld, U., “An Oﬀ-the-shelf Approach to Authorship At-

tribution”, in International Conference on Computational Linguistics (COLING), 2014,

pp. 895–904.

[24] Porbadnigk, A., Görnitz, N., Kloft, M., Müller, K.-R., “Decoding Brain States during Auditory Perception by Supervising Unsupervised Learning”, Journal of Computing Science and Engineering (JCSE), vol. 7, no. 2, pp. 112–121, 2013.

[25] Görnitz, N., Braun, M., Kloft, M., “Hidden Markov Anomaly Detection”, in Interna-

tional Conference on Machine Learning (ICML), 2015, pp. 1833–1842.

[26] Görnitz, N., Widmer, C., Zeller, G., Kahles, A., Sonnenburg, S., Rätsch, G., “Hierarchi-

cal Multitask Structured Output Learning for Large-scale Sequence Segmentation”, in

Advances in Neural Information Processing Systems (NIPS), 2011, pp. 2690–2698.

[27] Zeller, G., Görnitz, N., Kahles, A., Behr, J., Mudrakarta, P., Sonnenburg, S., Rätsch,

G., “mTim: rapid and accurate transcript reconstruction from RNA-Seq data”, ArXiv,

2013.

[28] Porbadnigk, A. K., Görnitz, N., Sannelli, C., Binder, A., Braun, M., Kloft, M., Müller,

K.-R., “When Brain and Behavior Disagree: Tackling systematic label noise in EEG

data with Machine Learning”, in IEEE International Winter Workshop on Brain-Computer

Interface (BCI), 2014.

[29] Porbadnigk, A. K., Görnitz, N., Sannelli, C., Binder, A., Braun, M., Kloft, M., Müller,

K.-R., “Extracting latent brain states — Towards true labels in cognitive neuroscience

experiments”, NeuroImage, vol. 120, pp. 225–253, 2015.

[30] Görnitz, N., Lima, L. A., Varella, L. E., Müller, K.-R., Nakajima, S., “Transductive Regression for Data with Latent Dependency Structure”, IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2017.

[31] Görnitz, N., Lima, L. A., Müller, K.-R., Kloft, M., Nakajima, S., “Support vector data descriptions and k-means clustering: one class?”, IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2017.


[32] Lima, L. A., Görnitz, N., Varella, L. E., Vellasco, M., Müller, K.-R., Nakajima, S., “Poros-

ity Estimation by Semi-supervised Learning with Sparsely Available Labeled Sam-

ples”, Computers & Geosciences, vol. 106, pp. 33–48, 2017.

[33] Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B., “An Introduction to Kernel-based Learning Algorithms”, IEEE Transactions on Neural Networks (TNN), vol. 12, no. 2, pp. 181–201, Jan. 2001.

[34] Schölkopf, B., Mika, S., Burges, C. J., Knirsch, P., Müller, K.-R., Rätsch, G., Smola, A. J.,

“Input Space Versus Feature Space in Kernel-Based Methods”, IEEE Transactions on

Neural Networks (TNN), vol. 10, no. 5, pp. 1000–1017, 1999.

[35] Schölkopf, B., Smola, A., Müller, K.-R., “Nonlinear Component Analysis as a Kernel

Eigenvalue Problem”, Neural Computation, vol. 10, no. 5, pp. 1299–1319, 1998.

[36] ——, “Kernel principal component analysis”, in Artiﬁcial Neural Networks (ICANN),

1997, pp. 583–588.

[37] Cortes, C., Vapnik, V., “Support-Vector Networks”, Machine Learning, vol. 20, no. 3,

pp. 273–297, 1995.

[38] Boser, B. E., Guyon, I. M., Vapnik, V. N., “A Training Algorithm for Optimal Margin

Classiﬁers”, in Proceedings of the 5th Annual Workshop on Computational Learning

Theory (COLT), 1992, pp. 144–152.

[39] Mohri, M., Rostamizadeh, A., Talwalkar, A., Foundations of Machine Learning. MIT

Press, 2012.

[40] Smola, A. J., Schölkopf, B., Müller, K.-R., “The connection between regularization op-

erators and support vector kernels”, Neural Networks, vol. 11, no. 4, pp. 637–649, 1998.

[41] Kimeldorf, G., Wahba, G., “Some results on Tchebycheﬃan spline functions”, Journal

of Mathematical Analysis and Applications, vol. 33, no. 1, pp. 82–95, 1971.

[42] Schölkopf, B., Herbrich, R., Williamson, R., Smola, A. J., “A Generalized Representer

Theorem”, in International Conference on Learning Theory (COLT), 2001, pp. 416–426.

[43] Boyd, S., Vandenberghe, L., Convex Optimization. Cambridge University Press, 2004.

[44] Schmidt, M., “Convergence Rates of Stochastic Optimization Algorithms”, University

of British Columbia, Tech. Rep., 2010.

[45] Shalev-Shwartz, S., “Online Learning and Online Convex Optimization”, Foundations

and Trends in Machine Learning, vol. 4, no. 2, pp. 107–194, 2011.

[46] Tao, P. D., An, L. T. H., “A D.C. Optimization Algorithm for Solving the Trust-Region

Subproblem”, SIAM Journal on Optimization, vol. 8, no. 2, pp. 476–505, 1998.

[47] Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., “Distributed Optimization and

Statistical Learning via the Alternating Direction Method of Multipliers”, Foundations

and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010.

[48] Bach, F., Jenatton, R., Mairal, J., Obozinski, G., “Optimization with Sparsity-Inducing

Penalties”, Foundations and Trends in Machine Learning, vol. 4, no. 1, pp. 1–106, 2012.

[49] Parikh, N., Boyd, S., “Proximal Algorithms”, Foundations and Trends in Optimization,

vol. 1, no. 3, pp. 127–239, 2014.

[50] Yen, E., Peng, N., Wang, P.-W., Lin, S.-D., “On convergence rate of concave-convex

procedure”, in NIPS Optimization Workshop, 2012.

[51] An, L. T. H., Tao, P. D., “The DC (Diﬀerence of Convex Functions) Programming and

DCA Revisited with DC Models of Real World Nonconvex Optimization Problems”,

Annals of Operations Research, vol. 133, no. 1-4, pp. 23–46, 2005.


[52] Horst, R., Thoai, N. V., “DC Programming: Overview”, Journal of Optimization Theory and Applications, vol. 103, no. 1, pp. 1–43, 1999.

[53] Hawkins, D. M., Identiﬁcation of outliers. Chapman and Hall, 1980, vol. 11.

[54] Grubbs, F. E., “Procedures for Detecting Outlying Observations in Samples”, Techno-

metrics, vol. 11, no. 1, pp. 1–21, 1969.

[55] Samek, W., Nakajima, S., Kawanabe, M., Müller, K.-R., “On robust parameter estima-

tion in brain-computer interfacing”, Journal of Neural Engineering (JNE), vol. 14, no.

6, 2017.

[56] Höhner, J., Nakajima, S., Bauer, A., Müller, K.-R., Görnitz, N., “Minimizing Trust

Leaks for Robust Sybil Detection”, in International Conference on Machine Learning

(ICML), Jul. 2017, pp. 1520–1528.

[57] Akoglu, L., Tong, H., Koutra, D., “Graph-based Anomaly Detection and Description:

A Survey”, Data Mining and Knowledge Discovery, p. 49, Apr. 2014.

[58] Stokes, J. W., Platt, J. C., Kravis, J., “ALADIN: Active Learning of Anomalies to Detect Intrusion”, Tech. Rep., 2008.

[59] He, X., Niyogi, P., “Locality Preserving Projections”, in Advances in Neural Information

Processing Systems (NIPS), 2003.

[60] Tibshirani, R., “Regression Shrinkage and Selection via the Lasso”, Journal of the Royal Statistical Society, vol. 58, no. 1, pp. 267–288, 1996.

[61] Tax, D., Duin, R., “Data domain description using support vectors”, in Proceedings of

the European Symposium on Artiﬁcial Neural Networks, vol. 256, 1999, pp. 251–256.

[62] Breunig, M. M., Kriegel, H.-P., Ng, R. T., Sander, J., “LOF: Identifying Density-Based

Local Outliers”, ACM SIGMOD International Conference on Management of Data, pp. 1–

12, 2000.

[63] Kriegel, H.-P., Kröger, P., Schubert, E., Zimek, A., “LoOP: Local Outlier Probabili-

ties”, in ACM Conference on Information and Knowledge Management (CIKM), 2009,

pp. 1649–1652.

[64] MacQueen, J., “Some methods for classiﬁcation and analysis of multivariate observa-

tions”, in Berkeley Symposium on Mathematical Statistics and Probability, University

of California Press, 1967, pp. 281–297.

[65] Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U., “When Is “Nearest Neighbor”

Meaningful?”, in International Conference on Database Theory (ICDT), 1999, pp. 217–

235.

[66] Angiulli, F., “Concentration Free Outlier Detection”, in European Conference on Ma-

chine Learning (ECML), 2017, pp. 3–19.

[67] Zimek, A., Schubert, E., Kriegel, H. P., “A survey on unsupervised outlier detection

in high-dimensional numerical data”, Statistical Analysis and Data Mining, vol. 5, no.

5, pp. 363–387, 2012.

[68] Rissanen, J., “Modeling by shortest data description”, Automatica, vol. 14, no. 5, pp. 465–

471, 1978.

[69] Grünwald, P., The Minimum Description Length Principle. MIT Press, 2007.

[70] ——, “A tutorial introduction to the minimum description length principle”, ArXiv,

2004.

[71] Moya, M. M., Koch, M. W., Hostetler, L. D., “One-class classiﬁer networks for target

recognition applications”, Sandia National Labs., Albuquerque, NM (United States),

Tech. Rep., 1993.


[72] Moya, M. M., Hostetler, L. D., “One-class generalization in second-order backpropa-

gation networks for image classiﬁcation”, Tech. Rep., 1989.

[73] Minter, T. C., “Single-Class Classiﬁcation”, Symposium on Machine Processing of Re-

motely Sensed Data, 1975.

[74] Einmahl, J. H. J., Mason, D. M., “Generalized Quantile Processes”, The Annals of Statistics, vol. 20, no. 2, pp. 1062–1078, 1992.

[75] Polonik, W., “Minimum Volume Sets in Statistics: Recent Developments”, in Annual Conference of the Gesellschaft für Klassifikation e.V., R. Klar and O. Opitz, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 1997, pp. 187–194.

[76] ——, “Minimum volume sets and generalized quantile processes”, Stochastic Processes

and their Applications, vol. 69, no. 1, pp. 1–24, Jul. 1997.

[77] Tsybakov, A. B., “On nonparametric estimation of density level sets”, Annals of Statis-

tics, vol. 25, pp. 948–969, 1997.

[78] Polonik, W., “Measuring Mass Concentrations and Estimating Density Contour Clus-

ters - An Excess Mass Approach”, The Annals of Statistics, vol. 23, no. 3, pp. 855–881,

1995.

[79] Ghasemi, A., Rabiee, H. R., Manzuri, M. T., Rohban, M. H., “A Bayesian Approach

to the Data Description Problem”, in AAAI Conference on Artiﬁcial Intelligence, 2012,

pp. 907–913.

[80] Xiao, Y., Wang, H., Xu, W., “Hyperparameter Selection for Gaussian Process One-Class Classification”, IEEE Transactions on Neural Networks and Learning Systems (TNNLS), vol. 26, no. 9, pp. 2182–2187, 2015.

[81] Hovelynck, M., Chidlovskii, B., “Multi-modality in one-class classiﬁcation”, in Pro-

ceedings of the 19th international conference on World wide web - WWW ’10, 2010,

p. 441.

[82] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., Williamson, R. C., “Estimating the Support of a High-dimensional Distribution”, Microsoft Research, Tech. Rep., 1999.

[83] Rätsch, G., Mika, S., Schölkopf, B., Müller, K.-R., “Constructing Boosting Algorithms

from SVMs: An Application to One-Class Classiﬁcation”, IEEE Transactions on Pattern

Analysis and Machine Intelligence (TPAMI), vol. 24, pp. 1184–1199, 2002.

[84] Lee, G., Scott, C. D., “The one class support vector machine solution path”, in IEEE

International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2,

2007.

[85] Munoz, A., Moguerza, J. M., “One-class support vector machines and density estima-

tion: The precise relation”, in Iberoamerican Congress on Pattern Recognition (CIARP),

vol. 9, Springer, 2004, pp. 216–223.

[86] Davenport, M. A., Baraniuk, R. G., Scott, C. D., “Learning minimum volume sets with

support vector machines”, in IEEE Signal Processing Society Workshop on Machine

Learning for Signal Processing (MLSP), 2007, pp. 301–306.

[87] Vert, R., Vert, J.-P., “Consistency and Convergence Rates of One-Class SVMs and Related Algorithms”, Journal of Machine Learning Research (JMLR), vol. 7, pp. 817–854, 2006.

[88] Glazer, A., Lindenbaum, M., Markovitch, S., “q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions”, in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 503–511.


[89] Lee, G., Scott, C. D., “Nested Support Vector Machines”, IEEE Transactions on Signal

Processing (TSP), vol. 58, no. 3, 2010.

[90] Muandet, K., Schölkopf, B., “One-Class Support Measure Machines for Group Anomaly

Detection”, ArXiv, 2013.

[91] Erfani, S. M., Rajasegarar, S., Karunasekera, S., Leckie, C., “High-Dimensional and

Large-Scale Anomaly Detection using a Linear One-Class SVM with Deep Learning”,

Pattern Recognition, vol. 58, pp. 121–134, 2016.

[92] Tax, D., “One-class classification”, PhD thesis, Delft University of Technology, 2001.

[93] Sjöstrand, K., Larsen, R., “The entire regularization path for the support vector domain

description”, International Conference on Medical Image Computing and Computer-

Assisted Intervention, vol. 9, pp. 241–248, 2006.

[94] Fawcett, T., “An introduction to ROC analysis”, Pattern Recognition Letters, vol. 27,

no. 8, pp. 861–874, 2006.

[95] Hajizadeh, S., Li, Z., Dollevoet, R. P.B. J., Tax, D. M. J., “Evaluating Classiﬁcation Per-

formance with only Positive and Unlabeled Samples”, in Structural, Syntactic, and

Statistical Pattern Recognition, 2014, pp. 233–242.

[96] Goix, N., “How to Evaluate the Quality of Unsupervised Anomaly Detection Algo-

rithms?”, in ICML 2016 Anomaly Detection Workshop, 2016.

[97] Tax, D., Müller, K.-R., “A Consistency-Based Model Selection for One-class Classiﬁ-

cation”, in International Conference on Pattern Recognition (ICPR), 2004, pp. 363–366.

[98] Thomas, A., Clémençon, S., Feuillard, V., Gramfort, A., “Learning Hyperparameters

for Unsupervised Anomaly Detection”, in ICML 2016 Anomaly Detection Workshop,

2016.

[99] Legendre, A. M., Nouvelles méthodes pour la détermination des orbites des comètes. F.

Didot, 1805.

[100] Gauss, C. F., Theoria motus corporum coelestium sectionibus conicis solem ambientium.

Hamburgi: sumtibus Frid. Perthes et IH Besser, 1809, pp. 1–227.

[101] Stigler, S. M., “Gauss and the Invention of Least Squares”, The Annals of Statistics, vol.

9, no. 3, pp. 465–474, 1981.

[102] Hoerl, A. E., Kennard, R. W., “Ridge Regression: Biased Estimation for Nonorthogonal

Problems”, Technometrics, vol. 12, no. 1, pp. 55–67, 1970.

[103] Lee, S., Zhu, J., Xing, E. P., “Adaptive Multi-Task Lasso: with Application to eQTL Detection”, in Advances in Neural Information Processing Systems (NIPS), 2010.

[104] Huang, J., Zhang, T., Metaxas, D., “Learning with Structured Sparsity”, Journal of Machine Learning Research (JMLR), vol. 12, pp. 3371–3412, 2011.

[105] Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A., “lp-Norm Multiple Kernel Learning”, Journal of Machine Learning Research (JMLR), vol. 12, pp. 953–997, 2011.

[106] Kloft, M., “Lp-Norm Multiple Kernel Learning”, PhD thesis, Berlin Institute of Tech-

nology (TU Berlin), 2011.

[107] Rakotomamonjy, A., Bach, F. R., Canu, S., Grandvalet, Y., “SimpleMKL”, Journal of Machine Learning Research (JMLR), vol. 9, pp. 2491–2521, 2008.

[108] Gönen, M., Alpaydın, E., “Multiple Kernel Learning Algorithms”, Journal of Machine

Learning Research (JMLR), vol. 12, pp. 2211–2268, 2011.


[109] Rätsch, G., Mika, S., Schölkopf, B., Müller, K.-R., “Constructing boosting algorithms

from SVMs: An application to one-class classiﬁcation”, IEEE Transactions on Pattern

Analysis and Machine Intelligence (TPAMI), vol. 24, no. 9, pp. 1184–1199, 2002.

[110] Rätsch, G., Schölkopf, B., Mika, S., Müller, K.-R., “SVM and boosting: One class”, GMD-

Forschungszentrum Informationstechnik, Tech. Rep., 2000.

[111] Zou, H., Hastie, T., “Regularization and variable selection via the elastic-net”, Journal

of the Royal Statistical Society, vol. 67, no. 2, pp. 301–320, 2005.

[112] Mika, S., Rätsch, G., Müller, K.-R., “A Mathematical Programming Approach to the

Kernel Fisher Algorithm”, in Advances in Neural Information Processing Systems (NIPS),

vol. 13, 2001, pp. 591–597.

[113] Yuan, M., Lin, Y., “Model selection and estimation in regression with group variables”,

Journal of the Royal Statistical Society, vol. 68, no. 1, pp. 49–67, 2006.

[114] Gehler, P. V., Nowozin, S., “Inﬁnite Kernel Learning”, in NIPS Workshop on Kernel

Learning: Automatic Selection of Optimal Kernels, 2008.

[115] Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B., “Large Scale Multiple Kernel Learning”, Journal of Machine Learning Research (JMLR), vol. 7, pp. 1531–1565, 2006.

[116] Bach, F., “Consistency of the group Lasso and multiple kernel learning”, Journal of

Machine Learning Research (JMLR), vol. 9, pp. 1179–1225, 2007.

[117] Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A., “Non-Sparse Regularization for Mul-

tiple Kernel Learning”, in NIPS Workshop on Kernel Learning: Automatic Selection of

Optimal Kernels, 2008.

[118] Kloft, M., Brefeld, U., Sonnenburg, S., Laskov, P., Müller, K.-R., Zien, A., “Eﬃcient

and Accurate Lp-Norm Multiple Kernel Learning”, in Advances in Neural Information

Processing Systems (NIPS), 2009, pp. 997–1005.

[119] Kloft, M., Brefeld, U., Düssel, P., Gehl, C., Laskov, P., “Automatic Feature Selection for Anomaly Detection”, in ACM Workshop on Artificial Intelligence and Security (AISec), 2008, pp. 71–76.

[120] Lanckriet, G. R., Cristianini, N., Bartlett, P., Ghaoui, L. E., Jordan, M. I., “Learning

the kernel matrix with semi-deﬁnite programming”, Journal of Machine Learning Re-

search (JMLR), vol. 5, pp. 27–72, 2004.

[121] Bach, F. R., Lanckriet, G. R. G., Jordan, M. I., “Multiple kernel learning, conic duality, and the SMO algorithm”, in International Conference on Machine Learning (ICML), 2004, pp. 6–14.

[122] Porbadnigk, A. K., Antons, J.-N., Blankertz, B., Treder, M. S., Schleicher, R., Möller, S.,

Curio, G., “Using ERPs for assessing the (sub)conscious perception of noise”, in IEEE

Engineering in Medicine and Biology Society (EMBC), 2010, pp. 2690–2693.

[123] Porbadnigk, A. K., Scholler, S., Blankertz, B., Ritz, A., Born, M., Scholl, R., Müller, K.-R.,

Curio, G., Treder, M. S., “Revealing the neural response to imperceptible peripheral

ﬂicker with machine learning”, in IEEE Engineering in Medicine and Biology Society

(EMBC), 2011, pp. 3692–3695.

[124] Scholler, S., Bosse, S., Treder, M. S., Blankertz, B., Curio, G., Müller, K.-R., Wiegand, T.,

“Towards a direct measure of video quality perception using EEG”, IEEE Transactions

on Image Processing, vol. 21, no. 5, pp. 2619–2629, 2012.

[125] Sutton, S., Braren, M., Zubin, J., John, E., “Evoked-potential correlates of stimulus

uncertainty”, Science, vol. 150, no. 3700, pp. 1187–1188, 1965.


[126] Müller, K.-R., Anderson, C. W., Birch, G. E., “Linear and non-linear methods for brain-

computer interfaces”, IEEE Transactions on Neural Systems and Rehabilitation Engi-

neering, vol. 11, no. 2, pp. 165–169, 2003.

[127] Blankertz, B., Lemm, S., Treder, M. S., Haufe, S., Müller, K.-R., “Single-trial analysis

and classiﬁcation of ERP components – a tutorial”, NeuroImage, vol. 56, pp. 814–825,

2011.

[128] Bünau, P., Meinecke, F. C., Király, F., Müller, K.-R., “Finding stationary subspaces in

multivariate time series”, Physical Review Letters, vol. 103, 2009.

[129] Shenoy, P., Krauledat, M., Blankertz, B., Rao, R. P. N., Müller, K.-R., “Towards adaptive

classiﬁcation for BCI”, Journal of Neural Engineering, vol. 3, no. 1, R13, 2006.

[130] Vidaurre, C., Sannelli, C., Müller, K.-R., Blankertz, B., “Machine-learning based co-

adaptive calibration”, Neural Computation, vol. 23, no. 3, pp. 791–816, 2011.

[131] Kübler, A., Müller, K.-R., “An introduction to brain computer interfacing”, in Toward

Brain-Computer Interfacing, G. Dornhege, J. del R. Millán, T. Hinterberger, D. McFar-

land, and K.-R. Müller, Eds., Cambridge, MA: MIT press, 2007, pp. 1–25.

[132] Mendenhall, T. C., “The characteristic curves of composition”, Science, vol. 9, no. 214,

pp. 237–246, 1887.

[133] Maurer, H., Kappe, F., Zaka, B., “Plagiarism – a survey”, Journal of Universal Computer

Science, vol. 12, no. 8, pp. 1050–1084, 2006.

[134] Tan, E., Guo, L., Chen, S., Zhang, X., Zhao, Y. E., “Unik: Unsupervised social network

spam detection”, in International Conference on Information & Knowledge Management

(CIKM), 2013, pp. 479–488.

[135] Mosteller, F., Wallace, D. L., Inference and disputed authorship: The federalist. New

York: Addison-Wesley, 1964.

[136] Fung, G., “The disputed federalist papers: Svm feature selection via concave mini-

mization”, in Conference on Diversity in Computing, 2003, pp. 42–46.

[137] Burrows, J. F., Computation into criticism: A study of jane austen’s novels and an ex-

periment in method. Oxford: Clarendon Press, 1987.

[138] Houvardas, J., Stamatatos, E., “N-gram feature selection for authorship identiﬁcation”,

in International Conference on Artiﬁcial Intelligence: Methodology, Systems, and Appli-

cations, 2006, pp. 77–86.

[139] Stamatatos, E., Fakotakis, N., Kokkinakis, G., “Computer-based authorship attribution

without lexical measures”, Computers and the Humanities, vol. 35, no. 2, pp. 193–214,

2001.

[140] Raghavan, S., Kovashka, A., Mooney, R., “Authorship attribution using probabilistic

context-free grammars”, in Annual Meeting of the Association for Computational Lin-

guistics (ACL), 2010, pp. 38–42.

[141] Koppel, M., Akiva, N., Dagan, I., “Feature instability as a criterion for selecting poten-

tial style markers: Special topic section on computational analysis of style”, Journal of

the American Society for Information Science and Technology, vol. 57, no. 11, pp. 1519–

1525, 2006.

[142] Forman, G., “An extensive empirical study of feature selection metrics for text clas-

siﬁcation”, Journal of Machine Learning Research (JMLR), vol. 3, pp. 1289–1305, 2003.

[143] Seroussi, Y., Zukerman, I., Bohnert, F., “Authorship attribution with latent dirichlet

allocation”, in International Conference on Computational Natural Language Learning,

2011, pp. 181–189.


[144] Koppel, M., Schler, J., Argamon, S., “Authorship attribution in the wild”, Language

Resources & Evaluation, vol. 45, no. 1, pp. 83–94, 2011.

[145] Li, J., Zheng, R., Chen, H., “From ﬁngerprint to writeprint”, Communications of the

ACM, vol. 49, no. 4, pp. 76–82, 2006.

[146] Zipf, G. K., Selective studies and the principle of relative frequency in language. Harvard

University Press, 1932.

[147] Yule, U., The statistical study of literary vocabulary. Cambridge University Press, 2014.

[148] Holmes, D. I., “The Evolution of Stylometry in Humanities Scholarship”, Literary and

Linguistic Computing, vol. 13, no. 3, pp. 111–117, 1998.

[149] Rudman, J., “The state of authorship attribution studies: Some problems and solu-

tions”, Computers and the Humanities, vol. 31, no. 4, pp. 351–365, 1997.

[150] Diederich, J., Kindermann, J., Leopold, E., Paass, G., “Authorship attribution with sup-

port vector machines”, Applied Intelligence, vol. 19, no. 1, pp. 109–123, 2003.

[151] Khmelev, D. V., “Disputed authorship resolution through using relative empirical en-

tropy for markov chains of letters in human language texts”, Journal of Quantitative

Linguistics, vol. 7, no. 3, pp. 201–207, 2000.

[152] Stamatatos, E., Fakotakis, N., Kokkinakis, G. K., “Automatic text categorization in

terms of genre and author”, Computational Linguistics, vol. 26, no. 4, pp. 471–495,

2000.

[153] Holmes, D. I., Crofts, D. W., “The diary of a public man: A case study in traditional

and non-traditional authorship attribution”, Literary and Linguistic Computing (LLC),

vol. 25, no. 2, pp. 179–197, 2010.

[154] Stamatatos, E., “A survey of modern authorship attribution methods”, Journal of the

American Society for Information Science and Technology, vol. 60, no. 3, pp. 538–556,

2009.

[155] Koppel, M., Schler, J., “Exploiting stylistic idiosyncrasies for authorship attribution”,

in International Joint Conference on Artiﬁcial Intelligence (IJCAI), vol. 69, 2003, pp. 72–

80.

[156] Dietterich, T., “Machine learning for sequential data: A review”, Structural, Syntactic, and Statistical Pattern Recognition, vol. 2396, pp. 227–246, 2002.

[157] Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., “Large Margin Methods for

Structured and Interdependent Output Variables”, Journal of Machine Learning Re-

search (JMLR), vol. 6, pp. 1453–1484, 2005.

[158] Laﬀerty, J., McCallum, A., Pereira, F., “Conditional random ﬁelds: Probabilistic models

for segmenting and labeling sequence data”, in International Conference on Machine

Learning (ICML), 2001, pp. 282–289.

[159] Collins, M., “Discriminative Training Methods for Hidden Markov Models: Theory

and Experiments with Perceptron Algorithms”, in Conference on Empirical Methods

in Natural Language Processing, 2002, pp. 1–8.

[160] Rätsch, G., Sonnenburg, S., “Large Scale Hidden Semi-Markov SVMs”, in Advances in

Neural Information Processing Systems (NIPS), B. Schölkopf, J. Platt, and T. Hoﬀman,

Eds., Cambridge, MA: MIT Press, 2006, pp. 1161–1168.

[161] Rabiner, L. R., “A Tutorial on Hidden Markov Models and Selected Applications in

Speech Recognition”, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[162] Bakir, G., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., Vishwanathan, S. V. N.,

Predicting Structured Data. MIT Press, 2007.

[163] Altun, Y., Tsochantaridis, I., Hofmann, T., “Hidden Markov Support Vector Machines”,

in International Conference on Machine Learning (ICML), 2003, pp. 3–10.

[164] Hazan, T., Urtasun, R., “A Primal-Dual Message-Passing Algorithm for Approximated

Large Scale Structured Prediction”, in Neural Information Processing Systems (NIPS),

2010, pp. 838–846.

[165] Smola, A. J., Vishwanathan, S. V. N., Le, Q. V., “Bundle methods for machine learning”,

in Advances in Neural Information Processing Systems (NIPS), vol. 20, 2008,

pp. 1377–1384.

[166] Teo, C. H., Vishwanathan, S. V. N., Smola, A., Le, Q. V., “Bundle Methods for Reg-

ularized Risk Minimization”, Journal of Machine Learning Research (JMLR), vol. 11,

pp. 311–365, 2010.

[167] Do, T. M. T., “Regularized bundle methods for large-scale learning problems with an

application to large margin training of hidden Markov models”, PhD thesis, Sorbonne

University, 2010.

[168] Franc, V., Sonnenburg, S., “Optimized cutting plane algorithm for support vector ma-

chines”, in International Conference on Machine Learning (ICML), New York, New

York, USA: ACM Press, 2008, pp. 320–327.

[169] Nowozin, S., Lampert, C. H., “Structured Learning and Prediction in Computer Vi-

sion”, Foundations and Trends in Computer Graphics and Vision, vol. 6, no. 3-4, pp. 185–

365, 2010.

[170] Rifkin, R., Lippert, R., “Value Regularization and Fenchel Duality”, Journal of Machine

Learning Research (JMLR), vol. 8, pp. 441–479, 2007.

[171] McAllester, D., Keshet, J., “Generalization Bounds and Consistency for Latent Structural

Probit and Ramp Loss”, in Advances in Neural Information Processing Systems

(NIPS), 2011, pp. 2205–2212.

[172] Lampert, C. H., Blaschko, M. B., “Structured prediction by joint kernel support esti-

mation”, Machine Learning, vol. 77, no. 2-3, pp. 249–269, Apr. 2009.

[173] Steinwart, I., Christmann, A., Support vector machines, 1st ed. Springer Science, 2008.

[174] Mortazavi, A., Williams, B. A., McCue, K., Schaeﬀer, L., Wold, B., “Mapping and quan-

tifying mammalian transcriptomes by RNA-seq”, Nature Methods, vol. 5, no. 7, p. 621,

2008.

[175] Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M., Gilad, Y., “RNA-seq: An as-

sessment of technical reproducibility and comparison with gene expression arrays”,

Genome Research, vol. 18, no. 9, pp. 1509–1517, 2008.

[176] Wang, Z., Gerstein, M., Snyder, M., “RNA-seq: A revolutionary tool for transcrip-

tomics”, Nature Reviews Genetics, vol. 10, no. 1, pp. 57–63, 2009.

[177] Gan, X., Stegle, O., Behr, J., Steﬀen, J. G., Drewe, P., Hildebrand, K. L., Lyngsoe, R.,

Schultheiss, S. J., Osborne, E. J., Sreedharan, V. T., Kahles, A., Bohnert, R., Jean, G.,

Derwent, P., Kersey, P., Belﬁeld, E. J., Harberd, N. P., Kemen, E., Toomajian, C., Kover,

P. X., Clark, R. M., Rätsch, G., Mott, R., “Multiple reference genomes and transcriptomes

for Arabidopsis thaliana”, Nature, vol. 477, no. 7365, pp. 419–423, 2011.

[178] Anders, S., Huber, W., “Diﬀerential expression analysis for sequence count data”,

Genome Biology, vol. 11, no. 10, 2010.

[179] Bohnert, R., Rätsch, G., “rQuant.web: A tool for RNA-Seq-based transcript quantita-

tion”, Nucleic Acids Research, vol. 38, pp. 348–351, 2010.

[180] Behr, J., Kahles, A., Zhong, Y., Sreedharan, V. T., Drewe, P., Rätsch, G., “MITIE: Si-

multaneous RNA-Seq-based transcript identiﬁcation and quantiﬁcation in multiple

samples”, Bioinformatics, vol. 29, no. 20, pp. 2529–2538, 2013.

[181] Anders, S., Reyes, A., Huber, W., “Detecting diﬀerential usage of exons from RNA-seq

data”, Genome Research, pp. 2008–2017, 2012.

[182] Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., Adi-

conis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N.,

Gnirke, A., Rhind, N., Palma, F., Birren, B. W., Nusbaum, C., Friedman, K. L.-T. N.,

Regev, A., “Full-length transcriptome assembly from RNA-Seq data without a reference

genome”, Nature Biotechnology, vol. 29, no. 7, pp. 644–652, 2011.

[183] Schulz, M. H., Zerbino, D. R., Vingron, M., Birney, E., “Oases: Robust de novo RNA-seq

assembly across the dynamic range of expression levels”, Bioinformatics, vol. 28, no.

8, pp. 1086–1092, 2012.

[184] Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., Baren, M. J., Salzberg,

S. L., Wold, B. J., Pachter, L., “Transcript assembly and quantiﬁcation by RNA-seq

reveals unannotated transcripts and isoform switching during cell diﬀerentiation”,

Nature Biotechnology, vol. 28, no. 5, pp. 511–515, 2010.

[185] Guttman, M., Garber, M., Levin, J. Z., Donaghey, J., Robinson, J., Adiconis, X., Fan, L.,

Koziol, M. J., Gnirke, A., Nusbaum, C., Rinn, J. L., Lander, E. S., Regev, A., “Ab initio

reconstruction of cell type-speciﬁc transcriptomes in mouse reveals the conserved

multi-exonic structure of lincRNAs”, Nature Biotechnology, vol. 28, no. 5, pp. 503–510,

2010.

[186] Trapnell, C., Pachter, L., Salzberg, S. L., “TopHat: Discovering Splice Junctions with

RNA-Seq”, Bioinformatics, vol. 25, no. 9, pp. 1105–1111, 2009.

[187] Jean, G., Kahles, A., Sreedharan, V. T., Bona, F. D., Rätsch, G., “RNA-seq read align-

ments with PALMapper”, Current Protocols in Bioinformatics, vol. 32, pp. 6–11, 2010.

[188] Schweikert, G., Zien, A., Zeller, G., Behr, J., Dieterich, C., Ong, C. S., Philips, P., Bona,

F. D., Hartmann, L., Bohlen, A., Krüger, N., Sonnenburg, S., Rätsch, G., “mGene: Ac-

curate SVM-based gene ﬁnding with an application to nematode genomes”, Genome

Research, vol. 19, no. 11, pp. 2133–2143, 2009.

[189] Sonnenburg, S., Schweikert, G., Philips, P., Behr, J., Rätsch, G., “Accurate splice site

prediction using support vector machines”, BMC Bioinformatics, vol. 8, no. 10, 2007.

[190] Jebara, T., Kondor, R., Howard, A., “Probability Product Kernels”, Journal of Machine

Learning Research (JMLR), vol. 5, pp. 819–844, 2004.

[191] Jaakkola, T, Diekhans, M, Haussler, D, “Using the Fisher kernel method to detect re-

mote protein homologies”, in International Conference on Intelligent Systems for Molec-

ular Biology (ISMB), 1999, pp. 149–158.

[192] Tsuda, K., Kawanabe, M., Rätsch, G., Sonnenburg, S., Müller, K.-R., “A New Discriminative

Kernel from Probabilistic Models”, Neural Computation, vol. 14, no. 10, pp. 2397–2414,

2002.

[193] Alberts, B., Bray, D., Lewis, J., Raﬀ, M., Roberts, K., Watson, J., Molecular Biology of

the Cell, 4th ed. Garland, 2002.

[194] Laptev, N., Amizadeh, S., Flint, I., “Generic and Scalable Framework for Automated

Time-series Anomaly Detection”, in International Conference on Knowledge Discovery

and Data Mining (KDD), 2015, pp. 1939–1947.

[195] Dereszynski, E. W., Dietterich, T. G., “Spatiotemporal Models for Data-Anomaly De-

tection in Dynamic Environmental Monitoring Campaigns”, ACM Transactions on

Sensor Networks, vol. 8, no. 1, 3:1–3:36, 2011.

[196] Schölkopf, B., Smola, A. J., Learning with Kernels: Support Vector Machines, Regular-

ization, Optimization, and Beyond. MIT Press, 2002.

[197] Yuille, A. L., Rangarajan, A., “The concave-convex procedure”, Neural Computation,

vol. 15, no. 4, pp. 915–936, Apr. 2003.

[198] Sriperumbudur, B. K., Lanckriet, G. R. G., “On the Convergence of the Concave-Convex

Procedure”, in Neural Information Processing Systems (NIPS), 2009, pp. 1–9.

[199] Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A., Laskov, P., Müller, K.-R., “Learning

Non-sparse Kernel Mixtures”, in PASCAL2 Workshop on Sparsity in Machine Learning

and Statistics, 2009.

[200] Bartlett, P., Mendelson, S., “Rademacher and Gaussian complexities: Risk bounds and

structural results”, Journal of Machine Learning Research (JMLR), vol. 3, pp. 463–482,

2002.

[201] Mohri, M., Rostamizadeh, A., Talwalkar, A., Foundations of machine learning. The MIT

Press, 2012.

[202] Jain, A. K., “Data clustering: 50 years beyond K-means”, Pattern Recognition Letters,

vol. 31, no. 8, pp. 651–666, Jun. 2010.

[203] Dhillon, I. S., Guan, Y., Kulis, B., “Kernel k-means, Spectral Clustering and Normalized

Cuts”, in ACM SIGKDD International Conference on Knowledge Discovery and Data

Mining, New York, New York, USA: ACM Press, Aug. 2004, p. 551.

[204] Girolami, M, “Mercer kernel-based clustering in feature space”, IEEE Transactions

on Neural Networks and Learning (TNNLS), vol. 13, no. 3, pp. 780–784, Jan. 2002.

[205] Forero, P. A., Kekatos, V., Giannakis, G. B., “Robust Clustering Using Outlier-Sparsity

Regularization”, IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4163–4177,

Apr. 2012.

[206] Kondo, Y., “Robustiﬁcation of the sparse K-means clustering algorithm”, PhD thesis,

The University of British Columbia, 2009.

[207] Kondo, Y., Salibian-Barrera, M., Zamar, R., “A robust and sparse K-means clustering

algorithm”, ArXiv, 2012.

[208] Hamerly, G., Elkan, C., “Learning the k in k-means”, in Advances in Neural Information

Processing Systems (NIPS), 2004, pp. 281–288.

[209] Ma, J., Wang, T., “A cost-function approach to rival penalized competitive

learning (RPCL)”, IEEE Transactions on Systems, Man and Cybernetics, Part B (Cyber-

netics), vol. 36, no. 4, pp. 722–737, Aug. 2006.

[210] Bacciu, D., Starita, A., “Competitive Repetition Suppression (CoRe) Clustering: A Bi-

ologically Inspired Learning Model With Application to Robust Clustering”, IEEE

Transactions on Neural Networks, vol. 19, no. 11, pp. 1922–1941, Nov. 2008.

[211] Cheung, Y.-M., “On rival penalization controlled competitive learning for clustering

with automatic cluster number selection”, IEEE Transactions on Knowledge and

Data Engineering, vol. 17, no. 11, pp. 1583–1588, Nov. 2005.

[212] Jia, H., Cheung, Y.-M., Liu, J., “Cooperative and penalized competitive learning with

application to kernel-based clustering”, Pattern Recognition, vol. 47, pp. 3060–3069,

2014.

[213] Chang, W.-C., Lee, C.-P., Lin, C.-J., “A Revisit to Support Vector Data Description

(SVDD)”, National Taiwan University, Tech. Rep., 2010.

[214] Chang, C.-C., Tsai, H.-C., “A Minimum Enclosing Balls Labeling Method for Support

Vector Clustering”, National Taiwan University of Science and Technology (Taipei,

Taiwan), Tech. Rep., 2007, pp. 1–27.

[215] Wang, X., Chung, F.-l., Wang, S., “Theoretical analysis for solution of support vector

data description”, Neural Networks, vol. 24, pp. 360–369, 2011.

[216] Ng, A. Y., Jordan, M. I., Weiss, Y., “On Spectral Clustering: Analysis and an algorithm”,

in Advances in Neural Information Processing Systems (NIPS), 2002, pp. 849–856.

[217] Baltrusaitis, T, Banda, N, Robinson, P, “Dimensional aﬀect recognition using Contin-

uous Conditional Random Fields”, IEEE Automatic Face and Gesture Recognition (FG),

pp. 1–8, 2013.

[218] Blaschko, M., Lampert, C., “Learning to localize objects with structured output regression”,

European Conference on Computer Vision (ECCV), pp. 2–15, 2008.

[219] Bo, L., Sminchisescu, C., “Structured output-associative regression”, 2009 IEEE Com-

puter Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR

Workshops 2009, pp. 2403–2410, 2009.

[220] Peng, J., Bo, L., Xu, J., “Conditional Neural Fields”, in Neural Information Processing

Systems (NIPS), vol. 9, 2009, pp. 1419–1427.

[221] Ratliﬀ, N. D., Bagnell, J. A., Zinkevich, M. A., “(Online) Subgradient Methods for Struc-

tured Prediction”, Carnegie Mellon University, Tech. Rep., 2007.

[222] Kim, M., “Semi-supervised learning of hidden conditional random ﬁelds for time-

series classiﬁcation”, Neurocomputing, vol. 119, pp. 339–349, 2013.

[223] Cortes, C., Mohri, M., “On Transductive Regression”, in Advances in Neural Information

Processing Systems (NIPS), 2007, pp. 305–312.

[224] Chapelle, O., Vapnik, V., Weston, J., “Transductive Inference for Estimating Values of

Functions”, in Advances in Neural Information Processing Systems (NIPS), 2000,

pp. 421–427.

[225] Leisch, F., “FlexMix: A General Framework for Finite Mixture Models and Latent

Class Regression in R”, Journal of Statistical Software, vol. 11, no. 8, pp. 1–18, 2004.

[226] Grün, B., Leisch, F., “FlexMix Version 2: Finite Mixtures with Concomitant Variables

and Varying and Constant Parameters”, Journal of Statistical Software, vol. 28, no. 4,

pp. 1–35, 2008.

[227] Sindhwani, V., Niyogi, P., Belkin, M., “Beyond the point cloud: from transductive to

semi-supervised learning”, in Proceedings of the International Conference on Machine

Learning (ICML), vol. 1, ACM, 2005, pp. 824–831.

[228] Belkin, M., Niyogi, P., Sindhwani, V., “Manifold regularization: A geometric frame-

work for learning from labeled and unlabeled examples”, Journal of Machine Learning

Research (JMLR), vol. 7, pp. 2399–2434, 2006.

[229] Belkin, M., Matveeva, I., Niyogi, P., “Regression and regularization on large graphs”,

Tech. Rep., 2003.

[230] Yu, C.-N. J., Joachims, T., “Learning Structural SVMs with Latent Variables”, in Inter-

national Conference on Machine Learning (ICML), 2009, pp. 1169–1176.

[231] Wainwright, M., Jordan, M. I., “Variational inference in graphical models: The view

from the marginal polytope”, in Allerton Conference on Control, Communication and

Computing, 2003.

[232] Wainwright, M. J., Jordan, M. I., “Graphical Models, Exponential Families, and Varia-

tional Inference”, Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–

305, 2008.

[233] Weiss, Y., Freeman, W. T., “Correctness of belief propagation in Gaussian models of

arbitrary topology”, Neural Computation, vol. 298, no. 0704, 2000.

[234] Besag, J., “Spatial Interaction and the Statistical Analysis of Lattice Systems”, Journal

of Royal Statistical Society, vol. 36, no. 2, pp. 192–236, 1974.

[235] Cristianini, N., Shawe-Taylor, J., Elisseeﬀ, A., Kandola, J., “On Kernel-Target Alignment”,

in Advances in Neural Information Processing Systems (NIPS), 2002, pp. 367–373.

[236] Braun, M. L., Buhmann, J. M., Müller, K.-R., “On Relevant Dimensions in Kernel Fea-

ture Spaces”, Journal of Machine Learning Research (JMLR), vol. 9, pp. 1875–1908,

2008.

[237] Cortes, C., Mohri, M., Rostamizadeh, A., “Algorithms for learning kernels based on

centered alignment”, Journal of Machine Learning Research (JMLR), vol. 13, no. 1,

pp. 795–828, 2012.

[238] Cortes, C., Mohri, M., “Conﬁdence intervals for the area under the ROC curve”, in

Advances in Neural Information Processing Systems (NIPS), 2004.

[239] Porbadnigk, A. K., Treder, M. S., Blankertz, B., Antons, J.-N., Schleicher, R., Möller,

S., Curio, G., Müller, K.-R., “Single-trial analysis of the neural correlates of speech

quality perception”, Journal of Neural Engineering, vol. 10, no. 5, 2013.

[240] Falkenstein, M., Hoormann, J., Christ, S., Hohnsbein, J., “ERP components on reaction

errors and their functional signiﬁcance: A tutorial”, Biological Psychology, vol. 51,

no. 2-3, pp. 87–107, 2000.

[241] Hohnsbein, J, Falkenstein, M, Hoormann, J, “Error processing in visual and auditory

choice reaction tasks”, Journal of Psychophysiology, vol. 3, 1998.

[242] Falkenstein, M., Hohnsbein, J., Hoormann, J., Blanke, L., “Eﬀects of errors in choice re-

action tasks on the ERP under focused and divided attention”, in Psychophysiological

Brain Research, C. H. M. Brunia, A. W. K. Gaillard, and A. Kok, Eds., Tilburg University

Press, Tilburg, 1990, pp. 192–195.

[243] Gehring, W. J., Coles, M. G. H., Meyer, D. E., Donchin, E., “The error-related negativity:

An event-related brain potential accompanying errors”, Journal of Psychophysiology,

vol. 27, pp. 385–390, 1990.

[244] Gehring, W. J., Goss, B., Coles, M. G. H., Meyer, D. E., Donchin, E., “A neural system

for error detection and compensation”, Psychological Science, vol. 4, pp. 385–390, 1993.

[245] Brickenkamp, R., Zillmer, E., D2 test of attention. Göttingen, Germany: Hogrefe &

Huber, 1998.

[246] Castro, S., Caers, J., Mukerji, T., “The Stanford VI Reservoir”, Annual Report of the

Stanford Center for Reservoir Forecasting (SCRF), pp. 153–154, 2005.

[247] Deutsch, C. V., Geostatistical Reservoir Modeling. Oxford University Press, 2002.

[248] Dvorkin, J., Gutierrez, M. A., Grana, D., Seismic Reﬂections of Rock Properties. Cam-

bridge University Press, 2014.

[249] Ratsaby, J., Venkatesh, S. S., “Learning from a Mixture of Labeled and Unlabeled

Examples with Parametric Side Information”, Annual Conference on Computational

Learning Theory, pp. 412–417, 1995.

[250] Matheron, G., Traité de géostatistique appliquée. Paris: Editions Technip, 1962, vol. 1.

[251] Deutsch, C. V., Journel, A. G., GSLIB - Geostatistical Software Library and User’s Guide,

2nd ed. Oxford University Press, 1998.

[252] Siddiqui, M. A., Fern, A., Dietterich, T. G., Wong, W.-K., “Sequential Feature Explana-

tions for Anomaly Detection”, ArXiv, Feb. 2015.

[253] Vidovic, M. M.-C., Görnitz, N., Müller, K.-R., Kloft, M., “Feature Importance Mea-

sure for Non-linear Learning Algorithms”, in NIPS Workshop on Interpretable Machine

Learning in Complex Systems, Nov. 2016.

[254] Vidovic, M. M.-C., Görnitz, N., Müller, K.-R., Rätsch, G., Kloft, M., “Opening the Black

Box: Revealing Interpretable Sequence Motifs in Kernel-Based Learning Algorithms”,

in European Conference on Machine Learning & Principles and Practice of Knowledge

Discovery in Databases (ECML PKDD), 2015, pp. 137–153.

[255] Vidovic, M. M.-C., Kloft, M., Müller, K.-R., Görnitz, N., “ML2Motif—Reliable extrac-

tion of discriminative sequence motifs from learning machines”, PloS one, vol. 12, no.

3, 2017.

[256] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W., “On pixel-

wise explanations for non-linear classiﬁer decisions by layer-wise relevance propa-

gation”, PloS one, vol. 10, no. 7, 2015.

[257] Montavon, G., Lapuschkin, S., Binder, A., Samek, W., Müller, K.-R., “Explaining non-linear

classiﬁcation decisions with deep Taylor decomposition”, Pattern Recognition,

vol. 65, pp. 211–222, 2017.

[258] Ribeiro, M. T., Singh, S., Guestrin, C., “Why Should I Trust You? Explaining the Pre-

dictions of Any Classiﬁer”, in Knowledge Discovery and Data Mining (KDD), 2016.

[259] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,

Courville, A., Bengio, Y., “Generative Adversarial Nets”, ArXiv, 2016.

[260] Talagrand, M., “Concentration of measure and isoperimetric inequalities in product

spaces”, Publications Mathématiques de l’Institut des Hautes Études Scientiﬁques, vol. 81,

no. 1, pp. 73–205, 1995.