scieee Science in your language
[en] (orig)
`p-Norm Multiple Kernel Learning
Making Learning with Multiple Kernels Effective
`p-Norm Multiple Kernel Learning
vorgelegt von Dipl.-Math.
Marius Kloft
Von der Fakult¨at IV - Elektrotechnik und Informatik
der Technischen Universit¨at Berlin zur Erlangung des akademischen Grades
Doktor der Naturwissenschaften
Dr. rer. nat.
genehmigte
Dissertation
Promotionsausschuss:
Vorsitzender: Prof. Dr. Manfred Opper
Fakult¨at IV Elektrotechnik und Informatik
Technische Universit¨at Berlin
Berichter: Prof. Dr. Klaus-Robert M¨uller
Fakult¨at IV Elektrotechnik und Informatik
Technische Universit¨at Berlin
Prof. Dr. Peter L. Bartlett
Department of EECS and Department of Statistics
University of California, Berkeley, USA
Prof. Dr. Gilles Blanchard
Institut ur Mathematik
Universit¨at Potsdam
Tag der m¨undlichen Aussprache: 26.09.2011
Berlin, 2011
D 83
To my parents
Acknowledgements
I am deeply grateful for the opportunity to have worked with so many brilliant and
creative minds over the last years. First of all, I would like to thank my PhD advisors
Prof. Dr. Klaus-Robert M¨uller, Prof. Dr. Peter L. Bartlett, and Prof. Dr. Gilles
Blanchard. Of course, I owe the utmost gratitude to Professor M¨uller, who initially
introduced me to the subject of machine learning in 2007; with his infectious optimism,
wit, and curiosity, he created the open and stimulating atmosphere that characterizes
his Machine Learning Laboratory at TU Berlin. It was also Professor M¨uller who gave
me advice and support in various respects from the very beginning of my PhD studies,
thus turning out to be much more than just a scientific advisor. I am deeply thankful
to him. By the same token, I would like to thank all members of his lab for creating
such a pleasurable working atmosphere in the office each day, most notably my former
and current office mates Pascal Lehwark, Nico ornitz, Gregoire Montavon, and Stefan
Haufe.
Furthermore, I would like to thank Professor Bartlett very much indeed for kindly
inviting me over to visit his learning theory group at UC Berkeley, a most enjoyable and
inspiring experience in many ways, and, most of all, for introducing me to the fascinat-
ing world of learning theory; his readiness to share his vast knowledge with me allowed
me to gain a deeper understanding of the field. I also thank all members of his group,
especially Alekh, for our stimulating discussions, not limited to
various mathematical subjects, but also in matters of the every-
day life. I also thank all office mates in Sutardja Dai Hall for
the nice atmosphere, and the UC Berkeley for providing such a
great office with an absolutely stunning view of the SF Bay, all of
which contributed to making my stay (Oct 2009–Sep 2010) such
a memorable one.
Likewise, I am very grateful to Professor Blanchard for taking so much of his valu-
able time for our extensive and fruitful discussions on various mathematical problems,
sharing his immense knowledge and deep insights with me. I remember us sitting in
caf´es for hours, trying to solve a mathematical puzzle arising from our joint research
III
projects. It is rare that a professor invests so much of his time in mentoring a student
and I owe him a big thanks!
My special thanks go to Dr. Ulf Brefeld for his caring mentoring and patient
teaching in the initial phase of my PhD, which considerably eased the start. His
encouragement gave me strength and confidence. Likewise, I thank Dr. Ulrich R¨uckert
for his mentoring while both of us were with the UC Berkeley, which substantially sped
up my introduction to learning theory. Moreover, I am indebted to Dr. Alexander
Zien for countless stimulating discussions and to Dr. oren Sonnenburg and Alexander
Binder for sharing their great insights in the SHOGUN toolbox with me. Furthermore,
I would like to thank all members of the REMIND project team, especially Dr. Konrad
Rieck and Dr. Pavel Laskov, for the warm and nice atmosphere that made working in
the team (2007–2009) such an effective and also fun experience.
I especially thank Claudia for so carefully proof-reading my manuscript in the final
stage with respect to grammar and spelling, and Helen, who suffered the most from my
obsession with unfinished chapters and last-minute changes to the manuscript, for her
patience and support. Most of all, I would like to thank my parents for supporting me
in every conceivable way.
Finally, I acknowledge financial support of the German Bundesministerium f¨ur
Bildung und Forschung (BMBF), under the project REMIND (FKZ 01-IS07007A), and
the European Community, under the PASCAL2 Network of Excellence of the FP7-ICT
program (ICT-216886). I acknowledge the Berlin Institute of Technology (TU Berlin)
and the German Academic Exchange Service (DAAD) for PhD student scholarships of
1-year runtime each.
IV
Abstract
The goal of machine learning is to learn unknown concepts from data. In real-world
applications such as bioinformatics and computer vision, data frequently arises from
multiple heterogeneous sources or is represented by various complementary views, the
right choice—or even combination—of which being unknown. To this end, the multiple
kernel learning (MKL) framework provides a mathematically sound solution. Previous
approaches to learning with multiple kernels promote sparse kernel combinations to
support interpretability and scalability. Unfortunately, classical approaches to learning
with multiple kernels are rarely observed to outperform trivial baselines in practical
applications.
In this thesis, I approach learning with multiple kernels from a unifying view which
shows previous works to be only particular instances of a much more general family
of multi-kernel methods. To allow for more effective kernel mixtures, I have developed
the `p-norm multiple kernel learning methodology, which, to sum it up, is both more
efficient and more accurate than previous approaches to multiple kernel learning, as
demonstrated on several data sets. In particular, I derive optimization algorithms that
are much faster than the commonly used ones, allowing to deal with up to ten thousands
of data points and thousands of kernels at the same time. Empirical applications of
`p-norm MKL to diverse, challenging problems from the domains of bioinformatics and
computer vision show that `p-norm MKL achieves accuracies that surpass the state-of-
the-art.
The proposed techniques are underpinned by deep foundations in the theory of
learning: I prove tight lower and upper bounds on the local and global Rademacher
complexities of the hypothesis class associated with `p-norm MKL, which yields excess
risk bounds with fast convergence rates, thus being tighter than existing bounds for
MKL, which only achieve slow convergence rates. I also connect the minimal values of
the bounds with the soft sparsity of the underlying Bayes hypothesis, proving that for
a large range of learning scenarios `p-norm MKL attains substantial stronger general-
ization guarantees than classical approaches to learning with multiple kernels. Using a
methodology based on the theoretical bounds, and exemplified by means of a controlled
toy experiment, I investigate why MKL is effective in real applications.
Data sets, source code and implementations of the algorithms, additional scripts
for model selection, and further information are freely available online.
V
Zusammenfassung
Ziel des Maschinellen Lernens ist das Erlernen unbekannter Konzepte aus Daten. In
vielen aktuellen Anwendungsbereichen des Maschinellen Lernens, wie zum Beispiel der
Bioinformatik oder der Computer Vision, sind die Daten auf vielf¨altige Art und Weise
in Merkmalsgruppierungen repr¨asentiert. Im Voraus ist allerdings die optimale Kom-
bination jener Merkmalsgruppen oftmals unbekannt. Die Methodologie des Lernens
mit mehreren Kernen bietet einen attraktiven und mathematisch fundierten Ansatz zu
diesem Problem. Existierende Modelle konzentrieren sich auf d¨unn besetzte Merkmals-
bzw. Kernkombinationen, um deren Interpretierbarkeit zu erleichtern. Allerdings er-
weisen sich solche klassischen Ans¨atze zum Lernen mit mehreren Kernen in der Praxis
als wenig effektiv.
In der vorliegenden Dissertation betrachte ich das Problem des Lernens mit meh-
reren Kernen aus einer neuartigen, generelleren Perspektive. In dieser Sichtweise sind
klassische Ans¨atze nur Spezialf¨alle eines wesentlich generelleren Systems des Lernens
mit mehreren Kernen. Um effektivere Kernmischungen zu erhalten, entwickle ich die
`p-norm multiple kernel learning Methodologie, die sich effizienter und effektiver als
vorherige osungsans¨atze erweist. Insbesondere leite ich Algorithmen zur Optimierung
des Problems her, die wesentlich schneller sind als existierende und es erlauben, gle-
ichzeitig Zehntausende von Trainingsbeispielen und Tausende von Kernen zu verar-
beiten. Ich analysiere die Effektivit¨at unserer Methodologie in einer Vielzahl von
schwierigen und hochzentralen Problemen aus den Bereichen Bioinformatik und Com-
puter Vision und zeige, dass `p-norm multiple kernel learning Vorhersagegenauigkeiten
erreicht, die den neuesten Stand der Forschung ¨ubertreffen.
Die entwickelten Techniken sind tief untermauert in der Theorie des Maschinellen
Lernens: Ich beweise untere und obere Schranken auf die Komplexit¨at der zugeh¨origen
Hypothesenklasse, was die Herleitung von Generalisierungsschranken erlaubt, die eine
schnellere Konvergenzgeschwindigkeit haben als vorherige Schranken. Des Weiteren
stelle ich den minimalen Wert der Schranken mit den geometrischen Eigenschaften der
Bayes-Hypothese in Verbindung. Darauf basierend beweise ich, dass f¨ur eine große An-
zahl von Szenarien `p-norm multiple kernel learning deutlich st¨arkere Generalisierungs-
garantien aufweist als vorherige Ans¨atze zum Lernen mit mehreren Kernen. Mit Hilfe
einer von mir vorgeschlagenen Methodik, basierend auf den theoretischen Schranken
und sogenannten kernel alignments, untersuche ich, warum sich `p-norm multiple ker-
nel learning als hocheffektiv in praktischen Anwendungsgebieten erweist.
Die eingesetzten Datens¨atze, der Quellcode und die Implementierungen der Algo-
rithmen sowie weitere Informationen zur Benutzung sind online frei verf¨ugbar.
VII
Contents
I Introduction and Overview 1
I Introduction and Overview 3
1 Introduction ................................ 3
1.1 AuthorsPhDThesis......................... 4
1.2 Organization of this Dissertation and Own Contributions . . . . 4
1.3 Multiple Kernel Learning in a Nutshell . . . . . . . . . . . . . . . 8
1.4 BasicNotation ............................ 12
2 A Unifying View of Multiple Kernel Learning ........... 13
2.1 A Regularized Risk Minimization Approach . . . . . . . . . . . . 13
2.2 DualProblem............................. 15
2.3 Recovering Prevalent MKL Formulations as Special Cases . . . . 16
2.4 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . 20
II `p-norm Multiple Kernel Learning 23
II `p-norm Multiple Kernel Learning 25
3 Algorithms ................................. 31
3.1 Block Coordinate Descent Algorithm . . . . . . . . . . . . . . . . 31
3.2 Large-Scale Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Implementation............................ 37
3.4 Runtime Experiments . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . 43
4 Theoretical Analysis ........................... 45
4.1 Global Rademacher Complexity . . . . . . . . . . . . . . . . . . . 46
4.2 Local Rademacher Complexity . . . . . . . . . . . . . . . . . . . 49
4.3 Excess Risk Bounds . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4 Why Can Learning Kernels Help Performance? . . . . . . . . . . 67
4.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . 70
5 Empirical Analysis and Applications ................. 71
5.1 Goal and Experimental Methodology . . . . . . . . . . . . . . . . 71
5.2 Case Study 1: Toy Experiment . . . . . . . . . . . . . . . . . . . 74
IX
Contents
5.3 Case Study 2: Real-World Experiment (TSS) . . . . . . . . . . . 78
5.4 Bioinformatics Experiments . . . . . . . . . . . . . . . . . . . . . 82
5.5 Computer Vision Experiment . . . . . . . . . . . . . . . . . . . . 92
5.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . 97
Conclusion and Outlook 101
Appendix 105
A Foundations ................................. 105
A.1 Why Using Kernels? . . . . . . . . . . . . . . . . . . . . . . . . . 105
A.2 Basic Learning Theory . . . . . . . . . . . . . . . . . . . . . . . . 107
A.3 Convex Optimization . . . . . . . . . . . . . . . . . . . . . . . . 110
B Relating Tikhonov and Ivanov Regularization ...........111
C Supplements to the Theoretical Analysis ..............112
D Cutting Plane Algorithm ........................ 115
Bibliography 121
Curriculum Vitae 131
Publications 133
X
Part I
Introduction and Overview
1
I Introduction and Overview
1 Introduction
The goal of machine learning is to learn unknown concepts from data. But the success of
a learning machine crucially depends on the quality of the data representation. At this
point, the paradigm of kernel-based learning (Scolkopf et al., 1998; M¨uller et al., 2001)
offers an elegant way for decoupling the learning and data representation processes in
a modular fashion. This allows to obtain complex learning machines from simple linear
ones in a canonical way. Nowadays, kernel machines are frequently employed in modern
application domains that are characterized by vast amounts of data along with highly
non-trivial learning tasks such as bioinformatics or computer vision, for their favorable
generalization performance while maintaining computational feasibility.
However, after more than a decade of research it still remains an unsolved problem
to find the best kernel for a task at hand. Most frequently, the kernel is selected
from a candidate set according to its generalization performance on a validation set,
which is held back at training time. Clearly, the performance of such an algorithm is
limited by the performance of the best kernel in the set and can be arbitrarily bad if
the kernel does not match the underlying learning task. Unfortunately, in the current
state of research, there is little hope that in the near future a machine will be able
to automatically engineer the perfect kernel for a particular problem at hand (Searle,
1980). However, by restricting ourselves to a less general problem, can we legitimately
hope to obtain a mathematically sound solution? And if so, which restrictions have to
be imposed?
A first step towards a more realistic model of learning the kernel was achieved
in Lanckriet et al. (2004a), who showed that, given a candidate set of kernels, it is
computationally feasible to simultaneously learn a support vector machine and a linear
kernel combination at the same time, if the so-formed kernel combinations are required
to be positive-definite and trace-norm normalized. This framework was entitled mul-
tiple kernel learning (MKL). Research in the following years focused on speeding up
the initially demanding optimization algorithms (e.g. Sonnenburg et al., 2006a; Rako-
tomamonjy et al., 2008)—ignoring the fact that empirical evidence for the superiority
of learning with multiple kernels over single-kernel baselines was missing.
By imposing an `1-norm regularizer on the kernel weights, classical approaches to
multiple kernel learning promote sparse kernel combinations to support interpretability
and scalability. Unfortunately, sparseness is not always beneficial and can be restrictive
in practice, for example, in the presence of complementary kernel sets. However, nega-
3
I Introduction and Overview
tive results are less often published in science than positive ones. It took until 2008 for
concerns regarding the effectiveness of multiple kernel learning in practical applications
to be raised, starting in the domains of bioinformatics (Noble, 2008) and computer
vision (Gehler and Nowozin, 2008): a multitude of researchers presented empirical evi-
dence showing that, in practice, multiple kernel learning is frequently outperformed by
a simple uniform kernel combination (Cortes et al., 2008, 2009a; Gehler and Nowozin,
2009; Yu et al., 2010). The whole discussion peaked in the provocative question “Can
learning kernels help performance?” posed by Corinna Cortes in an invited talk at
ICML 2009 (Cortes, 2009).
Consequently, despite all the substantial progress in the field of multiple kernel
learning, there still remains an unsatisfied need for an approach that is really useful for
practical applications: a model that has a good chance of improving the accuracy (over
a plain sum kernel) together with an implementation that matches today’s standards
(i.e., that can be trained on 10,000s of data points in a reasonable time). Even worse,
despite the recent attempts for clarification (Lanckriet et al., 2009), underlying reasons
for the empirical picture remain unclear. At this point, I argue that all of this is now
achievable, thus answering Corinna Cortes’s research question in the affirmative:
1.1 Author’s PhD Thesis
This dissertation concerns the validation of the following thesis:
`p-norm multiple kernel learning, a methodology that I developed and of
which I show that it enjoys favorable theoretical guarantees, is both faster
and more accurate than existing approaches to learning with multiple ker-
nels, finally making multiple kernel learning effective in practical applica-
tions.
1.2 Organization of this Dissertation and Own Contributions
Part I: Introduction and Overview
I start my dissertation in Chapter 1 with a short introduction to and motivation of
multiple kernel learning (MKL), containing, in a nutshell, the statement of the problem
to be solved and examples of practical applications where it arises, taken from the
domains of bioinformatics and computer vision.
In Chapter 2, I formally introduce a rigorous mathematical view of the prob-
lem, deferring mathematical preliminaries to Appendix A. Deviating from standard
introductions, I phrase MKL as a general optimization criterion based on structured
regularization, covering and also unifying existing formulations under a common um-
brella; from this point of view, classical MKL is only a particular instance of a more
general family of MKL methods. This allows to analyze a large variety of MKL meth-
ods jointly, as exemplified by deriving a general dual representation of the criterion,
4
1 Introduction
without making assumptions on the employed norm or the loss, beside being convex.
This not only delivers insights into connections between existing MKL formulations,
but also allows to derive new ones as special cases of the unifying view.
Previously Published Work
This thesis is based on the following selected publications.
The core framework was published in:
[1] M. Kloft, U. Brefeld, P. Laskov, and S. Sonnenburg. Non-sparse multiple kernel
learning. In Proc. of the NIPS 2008 Workshop on Kernel Learning: Automatic
Selection of Kernels, 2008.
[2] M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. M¨uller, and A. Zien.
Efficient and accurate lp-norm multiple kernel learning. In Advances in Neural Infor-
mation Processing Systems 22 (NIPS 2009), pages 997–1005. MIT Press, 2009.
[3] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel
learning. Journal of Machine Learning Research (JMLR), 12:953–997, 2011.
The main idea of `p-norm MKL was initially presented in [1]. The core framework was
subsequently published in [2]-[3].
Theoretical aspects were presented in:
[4] M. Kloft, U. R¨uckert, and P. L. Bartlett. A unifying view of multiple kernel
learning. In Proceedings of the European Conference on Machine Learning (ECML),
2010.
[5] M. Kloft and G. Blanchard. The local Rademacher Complexity of Multiple Kernel
Learning. ArXiv preprint 1103.0790v1, 2011. Short version submitted to NIPS 2011,
Jun 2011. Full version submitted to Journal of Machine Learning Research (JMLR),
Mar 2011.
Applications to computer vision were discussed in:
[6] A. Binder, S. Nakajima, M. Kloft, C. M¨uller, W. Wojcikiewicz, U. Brefeld, K.-R.
M¨uller, and M. Kawanabe. Classifying Visual Objects with Multiple Kernels. Submit-
ted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). A
preliminary version is published in Proceedings of the 12th Workshop on Information-
based Induction Sciences, 2010.
The content of this thesis is related to the above publications in the following way. Chapter 2
is based on [4]; Chapter 3 is based on [1]-[3]; Chapter 4 is based on [5]; Chapter 5 contains
material from [1]-[3] and [6].
Note: a complete list of all publications is shown at the end of the bibliography
section.
5
I Introduction and Overview
Part II: `p-norm Multiple Kernel Learning
In the main part of my dissertation, I introduce a novel instantiation of the proposed
general criterion, which I entitle `p-norm multiple kernel learning. Recognizing classical
approaches to learning with multiple kernels as the special case of deploying `1-norm
and hinge loss, I argue that, from the structured point of view, it is more natural to chose
an intermediate norm rather than an extreme `1-norm. I show the general connection of
the structured formulation to the learning-kernels formulation that is usually considered
in the literature. For classical MKL, this connection is known from the seminal paper
of Bach et al. (2004); here, I show that it applies to a whole family of MKL algorithms,
no matter which convex loss function or structured `p-norm regularizer is employed.
The remainder of my dissertation focuses on the analysis of `p-norm MKL in terms of
optimization algorithms, theoretical justification, empirical analysis, and applications
to bioinformatics and computer vision.
Chapter 3 is on optimization algorithms. Considering the gain in prediction accu-
racy achieved by `p-norm MKL, established later in this dissertation, one might expect
a substantial drawback with respect to execution time—the contrary is the case: the
presented algorithms allow us to deal with ten thousands of data points and thousands
of kernels at the same time, being up to two magnitudes faster than the state-of-the-
art in MKL research, including HessianMKL and SimpleMKL. Some of these, like the
cutting-plane strategy, which was designed by me and implemented with the help of
oren Sonnenburg and Alexander Zien, are based on previous work by Sonnenburg
et al. (2006a), others like the analytical solver are completely novel. The latter one is
based on a simple analytical formula that can be evaluated in micro seconds, and thus,
despite its efficiency, is even simpler than SimpleMKL, which requires a heuristic line
search. I show it being provably convergent, using the usual regularity assumptions. I
also wrote macro scripts, completely automating the whole process from training over
model selection to evaluation. Currently, MKL can be trained and validated by a single
line of code including random subsampling, model search for the optimal parameters
Cand p, and collection of results.1
In Chapter 4, the proposed techniques are justified from a theoretical point of view.
I prove tight lower and upper bounds on the local and global Rademacher complexities
of the hypothesis class associated with `p-norm MKL, which yields excess risk bounds
with fast convergence rates, thus being tighter than existing bounds for MKL, which
only achieved slow convergence rates. For the results on the local complexities to hold,
I find an assumption on the uncorrelatedness of the kernels; a similar assumption was
also recently used by Raskutti et al. (2010), but in the different context of sparse re-
covery.
Even the tightest previous theoretical analyses such as the one carried out by Cortes
et al. (2010a) for the special case of classical MKL were not able to answer the research
question “Can learning kernels help performance?” (Cortes, 2009). In contrast, beside
1Implementation freely available under the GPL license at http://doc.ml.tu-berlin.de/nonsparse_
mkl/.
6
1 Introduction
reporting on the worst-case bounds, I also connect the minimal values of the theoretical
bounds with the geometry of the underlying learning scenario (namely, the soft spar-
sity of the Bayes hypothesis), in particular, proving that for a large range of learning
scenarios `p-norm MKL attains a strictly “better” (i.e., lower) bound than classical
`1-norm MKL and the SVM using a uniform kernel combination. This theoretically
justifies using `p-norm MKL and multiple kernel learning in general.
Chapter 5 concerns the empirical analysis of `p-norm MKL and applications to
diverse, challenging problems from the domains of bioinformatics and computer vision.
From a practical point of view, this is the most important chapter of my dissertation as
I show here that `p-norm MKL works well in practice. For the experiments, problems
from the domains of bioinformatics and computer vision were chosen, not only because
they come with highly topical, challenging, small- and large-scale prediction tasks, but
also because researchers frequently encounter multiple kernels or data sources here.
This renders these domains especially appealing for the use of MKL.
At this point, it has to be admitted that other researchers also deployed MKL to those
domains: Lanckriet et al. (2004b) experimented on bioinformatics data and Varma and
Ray (2007) and Gehler and Nowozin (2009) on computer vision data. However, none
of those studies were able to prove the practical effectiveness of MKL. The first study
investigated whether MKL can help performance in genomic data fusion: indeed, MKL
outperformed the best single-kernel SVM as determined by model selection; however,
the uniform kernel combination was not investigated at that point. Subsequent inves-
tigations showed here that the latter outperforms MKL on the very same data set.2
In the second study, MKL was shown to substantially outperform the uniform kernel
combination on the caltech-101 object recogniction data set. This study turned out to
be incorrect due to a flaw in the kernel generation.3Subsequently, MKL was studied
on the very same data set by Gehler and Nowozin (2009) and found to be outperformed
by an SVM using a uniform kernel combination. To the best of my knowledge, the
only confirmed experiment concerning MKL outperforming the SVM using a uniform
kernel combination is the one undertaken by Zien and Ong (2007) in the context of
protein subcellular localization prediction.
In this thesis, I show that by considering `p-norms, MKL can in fact help performance
in both of the above applications (genomic data fusion and object recognition). Besides,
I study applications to gene transcription splice site detection, protein fold prediction,
and metabolic network reconstruction. While I observe that MKL helps performance
in some applications (including the ones mentioned in the above paragraph, where re-
searchers tried for years making MKL effective), I also show that sometimes MKL does
not increase the performance (this is, for example, the case for the metabolic network
reconstruction experiment). The raises the question why it sometimes helps and why
sometimes it does not. At this point, I introduce a methodology deploying both, the
2Personal correspondences with William S. Noble; see W. Noble’s talk at http://videolectures.
net/lkasok08_whistler/, June 20, 2011.
3See errata on the first author’s personal homepage, http://research.microsoft.com/en-us/um/
people/manik/projects/trade-off/caltech101.html, June 20, 2011.
7
I Introduction and Overview
bounds that I prove in the theoretical chapter of this thesis and the kernel alignment
techniques initially proposed in a different context by Cristianini et al. (2002). While
the theoretical bounds are used to investigate the optimal norm parameter p, showing
that the effectiveness of MKL is connected with the soft sparsity of the underlying
Bayes hypothesis, the alignments are used to study whether the kernels at hand are
complementary or rather redundant. The whole methodology is exemplified by means
of a toy experiment, where I artificially construct the Bayes hypothesis, controlling
the underlying soft sparsity of the problem. It is shown that MKL’s empirical per-
formance can crucially depend on the choice of the norm parameter pand that the
optimality of such a parameter can highly depend on the geometry of the underlying
Bayes hypothesis (it can make the difference between 4% or 43% test error, as shown
in the simulations). The chapter concludes with my study, carried out with the help
of Shinichi Nakajima, on object recogniction, the very same application unsuccessfully
studied earlier by Varma and Ray (2007) and Gehler and Nowozin (2009)—see dis-
cussion above—where classical MKL could not help performance. In contrast, I show
that, by deploying the proposed `p-norm multiple kernel learning and taking p= 1.11
in median, the prediction accuracy can be raised over the SVM baselines, regardless of
the class, by an AP score of 1.5 in average, and for 7 out of the 20 classes significantly
so, concluding the final Chapter 5 of my thesis.
1.3 Multiple Kernel Learning in a Nutshell
In this section, we introduce the problem of multiple kernel learning.
Problem setting In classical supervised machine learning we are given training exam-
ples x1, . . . , xnlying in some input space Xand labels y1, . . . , yn Y, in the simplest
case, X=Rdand Y={−1,1}. The goal in supervised learning is to find a predic-
tion function f:X Y from a given set H YXthat has a low error rate on new
data xn+1, yn+1,...,xn+l, yn+lstemming from the same data source but which is
unseen at training time. Clearly, we cannot learn anything useful at all if the training
data and the new data are not connected in any way. Therefore one usually assumes a
stochastic mechanism underlying the data generation process, most commonly, that all
of the xi, yiare drawn independently from one and the same probability distribution
P. In this case the quality of the prediction function fis measured by the expected
rate of false predictions E(x,y)P
1
{f(x)6=y}.
Regularized risk minimization An often-employed approach to this problem is regu-
larized risk minimization (RRM), where a minimizer
fargminfHΩ(f) + CLn(f)
is found. Hereby Ln(f) = Pn
i=1 l(f(xi), yi) is the (cumulative) empirical loss of a
hypothesis fwith respect to a function l:R×Y R(called loss function) that upper
bounds the 0-1 loss
1
{f(x)6=y}and : HRis a mapping (called regularizer). The
name risk minimization stems from the fact that Lnis n-times the empirical risk. We
can interpret RRM as minimizing a trade-off between the empirical loss (to classify the
8
1 Introduction
training data well) and a regularizer (to penalize the complexity of fand thus avoid
overfitting), where the trade-off is controlled by a positive parameter C.
Kernel methods Single-kernel approaches to RRM (Vapnik, 1998) simply use linear
models of the form fw,b(x) = hw, xi+bbut—this is the core idea of kernel methods—
replace all resulting inner products hx, ¯xiby so-called kernels k(x, ¯x). Roughly speak-
ing, a kernel kis a clever way to efficiently compute inner products in a possibly very
high-dimensional feature space. As outlined in the introduction, the ultimate goal of
kernel learning would be to find the best kernel for a problem at hand—but this is a
task too hard to allow for a general solution so that, in practice, the kernel is usually
either fixed a priori or selected from a small candidate set {k1, . . . , kM}according to
its prediction error on a validation set, which is held back at training time.
Multiple kernel learning A first step towards finding an optimal kernel is multiple
kernel learning. Here, instead of just picking a kernel from a set, a new kernel kis
constructed by combining kernels from a possibly large given set. For example, the
following combination rules give rise to valid kernel:
k=θ1k1+···+θMkM(sums)
k=kθ1
1·. . . ·kθM
M,(products)
where θRM
+. By searching for an optimal θ, we traverse an infinitely large set of
“combined kernels”. Unfortunately, except for very small M(typically M3), the
search space is too large to be traversed by standard methods such as grid search.
The core idea of multiple kernel learning is based on the insight that most machine
learning problems are formulated as solutions of optimization problems. What if we
include the parameter θas a variable into the optimization? For example, for the
support vector machine (SVM) this task becomes
min
θ0SVMM
X
m=1
θmkm.
A first difficulty we face is that, without any further restriction on θ, the optimization
problem may be unbounded or, for example, yield a trivial solution that does not
generalize to new, unseen data. In the past, this has been addressed by restricting the
search space to convex combinations (i.e., sums that add up to one). In that case the
above problem becomes (Lanckriet et al., 2004a; Bach et al., 2004)
min
θ0,kθk1=1 SVMM
X
m=1
θmkm.
For quite some time, the above regularization strategy was the prevalent one in multiple
kernel learning research.
`p-Norm Multiple Kernel Learning Although having been folklore among researchers
for quite a while already, it took until 2008 that criticism was made public concerning
9
I Introduction and Overview
the usefulness of the above approach in practical applications (see citations in the
introduction): it is frequently observed being outperformed by a simple, regular SVM
using a uniform kernel combination.
In this thesis, we propose to discard the restrictive convex combination requirement
(which corresponds to using an `1-norm regularizer on θ) and use a more flexible `p-
norm regularizer instead, leading to the optimization problem
min
θ0,kθkp=1 SVMM
X
m=1
θmkm.
The difference between the two ways of regularizing is illustrated in the following figure,
where the `1- and `2-norm regularized problems are compared. The blue line shows the
norm constraint and the green one the level sets of a quadratic function. The optimal
||x||1=1
quatric
||x||2=1
quatric
solution of the optimization problem is at-
tained where the level sets of the objective
function touch the norm constraint. If the
objective function is convex (illustrated here
for a quadratic function), and for `1-norm the
point of intersection is likely to be at one of
the corners of the square (shown on the figure to the left). Because the corners are
likely to have zero-entries, sparsity is to be expected. This is in contrast to `2-norm
regularization, where, in general, a non-sparse solution is to be expected (shown on the
figure to the right).
Alternative view of MKL Another way to view MKL is based on the insight that
a kernel kgives rise to a (possibly high-dimensional) feature map φso that We can
illustrate single kernel learning by the following diagram:
input xkernel feature map
φ(x)linear discrimination
hw, φ(x)i.
Correspondingly, when we are given multiple kernels k1, . . . , kM, we also obtain mul-
tiple feature maps φ1, . . . , φM, one for each kernel. Thus the combined kernel k=
PM
m=1 θmkmcorresponds to a “combined” feature map φθ=θ1×···×θMφM. This
is illustrated in Figure 1.1.
Figure 1.1: Illustration of multiple kernel learning in terms of weighted kernel feature spaces.
10
1 Introduction
Examples In real-world applications such as bioinformatics and computer vision, data
either frequently arises from multiple heterogeneous sources, describing different prop-
erties of one and the same object, or is represented by various complementary views,
the right choice—or even combination—of which being unknown. In this case, multiple
kernel learning (MKL) is especially appealing as it provides a mathematically sound
solution to the data fusion problem.
Figure 1.2: Figure taken from Sonnenburg (2008).
For example, in transcrip-
tion splice site detection, a bio-
formatics application, we can de-
scribe the properties of DNA by
its twistedness, its binding en-
ergy, the position of the first
exon, or the abstract string information obtained from the sequence of nucleotides.
Each characterization gives rise to a different kernel: the energy kernel, the angle ker-
nel, the first-exon kernel, and the string kernel. In Section 5.3 we show that this
application highly profits from employing MKL. By training our large-scale implemen-
tation of `p-norm multiple kernel learning on up to 60,000 training examples and testing
on 20,000 data points, we show that `p-norm MKL can significantly increase the predic-
tion accuracy `1-norm MKL and the SVM using a uniform kernel combination. This is
remarkable since the latter was recently confirmed to be the winner of an international
comparison of 19 splice site detectors.
Figure 1.3: Figure taken from Bach
(2008a).
Another example is object recognition, a com-
puter vision application. Here, the goal is to cat-
egorize images according to their information con-
tent, e.g., what kind of animal is shown in an im-
age. Usually, the image is represented as a vector
in some feature space; however, representations can
be built from various features, for example, color,
texture, and shape information. Clearly, there is
no representation that is optimal for all tasks si-
multaneously. For example, color information is
essential for the detection of stop signs in images
but it is superfluous for finding cars. In this work
we propose to let the learner figure out an optimal
combination of features for the task at hand. In
Section 5.5, we report on results of the well-known VOC 2008 challenge data set and
show that the proposed `p-norm MKL achieves higher prediction accuracies than both,
classical MKL and the SVM baseline.
11
I Introduction and Overview
1.4 Basic Notation
In this thesis, vectors are understood as column vectors and marked with boldface
letters or symbols. However, for structured elements
w=w>
1,...,w>
M>Rd1×···×RdM,
in slight deviation to this notation, the simpler expression w=w1,...,wMis used.
Likewise, the notation (u(m))M
m=1 for the element u=u(1),...,u(M) H =H1×
. . . ×HM, where H,H1,...,HMare Hilbert spaces, is frequently used.
Vectors in Rnof all zeros or ones are denoted by 0and 1, respectively (where
ndepends on the context). Generalized inequalities such as α0are understood
coordinate-wise, i.e., αi0 for all i. In the whole thesis, it is understood that x
0= 0 if
x= 0 and elsewise. We also employ the convention
:= 1. Finally, for p[1,]
we use the standard notation pto denote the conjugate of p, that is, p[1,] and
1
p+1
p= 1, and R+denotes nonnegative reals. Indicator functions are denoted by
1
.
We denote the set of nonnegative reals by R+.
12
2 A Unifying View of Multiple Kernel Learning
2 A Unifying View of Multiple Kernel Learning
In this chapter, we cast multiple kernel learning into a unified framework. We show that
it comprises many popular MKL variants currently discussed in the literature, including
seemingly different ones. Our approach is based on regularized risk minimization (Vap-
nik, 1998). We derive generalized dual optimization problems without making specific
assumptions regarding the norm or the loss function, beside that the latter is convex.
Our formulation covers binary classification and regression tasks and can easily be ex-
tended to multi-class classification and structural learning settings using appropriate
convex loss functions and joint kernel extensions. Prior knowledge on kernel mixtures
and kernel asymmetries can be incorporated by non-isotropic norm regularizers. This
chapter is based on mathematical preliminaries introduced in Appendix A.
The main contributions in this chapter are the following:
we present a novel, unifying view of MKL, subsuming prevalent MKL approaches
under a common umbrella
this allows us to analyze the existing approaches jointly and is exemplified by deriving
a unifying dual representation
we show how prevalent models are contained in the framework as special cases.
Parts of this chapter are based on:
M. Kloft, U. R¨uckert, and P. L. Bartlett. A unifying view of multiple kernel learning.
In Proceedings of the European Conference on Machine Learning (ECML), 2010.
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning.
Journal of Machine Learning Research (JMLR), 12:953–997, 2011.
2.1 A Regularized Risk Minimization Approach
We begin with reviewing the classical supervised learning setup, where we are given a
labeled sample D={(xi, yi)}i=1...,n with xilying in some input space Xand yiin some
output space Y R. The goal in supervised learning is to find a hypothesis fH
that has a low error rate on new and unseen data. An often-employed approach to this
problem is regularized risk minimization (RRM), where a minimizer
fargminfΩ(f) + CLn(f)
is found. Hereby Ln(f) = Pn
i=1 l(f(xi), yi) is the (cumulative) empirical loss of a
hypothesis fwith respect to a function l:R× Y R(called loss function); :
HRis a mapping (called regularizer), and λis a positive parameter. The name
risk minimization stems from the fact that Lnis n-times the empirical risk. We can
interpret RRM as minimizing a trade-off between the empirical loss (to classify the
training data well) and a regularizer (to penalize the complexity of fand thus avoid
overfitting), where the trade-off is controlled by λ.
Single-kernel approaches to RRM consider linear models of the form
fw,b(x) = hw, φ(x)i+b
13
I Introduction and Overview
together with a (possibly non-linear) mapping φ:X H to a Hilbert space Hand
regularizers
Ω(w) = 1
2kwk2
2(2.1)
(denoting by kwk2the Hilbert-Schmidt norm in H), which allows to “kernelize” (Sch¨ol-
kopf et al., 1998) the resulting models and algorithms, that is, formulating them solely
in terms of inner products k(x, x0) := hφ(x), φ(x0)iin H.
In multiple kernel learning, the feature mapping φdecomposes into Mdifferent
feature mappings φm:X Hm, m = 1,...M:
φ:X H
x7→ φ1(x), . . . , φM(x).
Thereby, each φmgives rise to a kernel kmso that the particular multiple kernels km
are connected with the “joint” kernel k(the one corresponding to the composite feature
map φ) by the simple equation
k=
M
X
m=1
km.(2.2)
As with every decomposition one can argue that nothing is won by writing the feature
map and the kernel as above. Indeed, in order to exploit the additional structure we
can extend the regularizer (2.1) to
mkl(w) = 1
2kwk2
2,O
where k·k2,O denotes the 2, O block-norm is defined by
kwk2,O := kw1k2,...,kwMk2O
and k·kOis an arbitrary norm on RM. This allows to kernelize the resulting models and
algorithms in terms of the multiple kernels kminstead of the joint kernel kas defined
in (2.2). The general multiple kernel learning RRM problem thus becomes
Problem 2.1 (Primal MKL Problem).
inf
w,b,t
1
2kwk2
2,O +C
n
X
i=1
lti, yi(P)
s.t. i:hw, φ(xi)i+b=ti.
The above primal generalizes multiple kernel learning to arbitrary convex loss functions
and norms. Note that if the loss function is continuous (e.g., hinge loss) and the
regularizer is such that its level sets form compact sets (e.g. `2,p-norm, p1), then
the supremum is in fact a maximum (this can be seen by rewriting the objective as a
constrained optimization problem).
14
2 A Unifying View of Multiple Kernel Learning
2.2 Dual Problem
In this section, we study the generalized MKL approach of the previous section in the
dual space. Dual optimization problems deliver insight into the nature of an opti-
mization problem. For example, they allow for computing the duality gap, which can
be used as a stopping criterion at optimization time or to evaluate the quality of the
numerical solutions by retrospect. Also, optimization can sometimes be considerably
easier in the dual space.
We start the derivation by introducing Lagrangian multipliers αRn. We assume
that the loss function is convex, so that (P) is a convex optimization problem (Boyd
and Vandenberghe, 2004), and thus by the strong Lagrangian duality principle the
optimal value of the primal (P) equals the one of the associated Lagrangian saddle
point problem (note that the given constraints are only linear equality ones, so that
constraint qualification trivially holds),
sup
α
inf
w,b,tLα,(w, b, t)(2.3)
with Lagrangian function
Lα,(w, b, t)=1
2kwk2
2,O +C
n
X
i=1
lti, yi+
n
X
i=1
αitihw, φ(xi)ib.
A standard approach in convex optimization and machine learning is to invoke the KKT
conditions to remove the dependancy on the primal variables in the above problem. But
because the objective is not differentiable for general norms and losses (except for b
where bL= 0 leads to 1>α= 0), we follow a different approach based on conjugate
functions.4We start by rewriting (2.3) as
sup
α:1>α=0 C
n
X
i=1
sup
tiαiti
Clti, yisup
wDw,
n
X
i=1
αiφ(xi)E1
2kwk2
2,O.
The Fenchel-Legendre conjugate function of a function g:RkRis defined by
g:t7→ supuu>tg(u). If gis a loss function then we consider the conjugate in the
first argument and call it dual loss. Equipped with this notation, we can rewrite the
above problem as
sup
α:1>α=0 C
n
X
i=1
lαi
C, yi1
2
n
X
i=1
αiφ(xi)
2
2,O.
We note that for any norm it holds 1
2k·k2=1
2k·k2
where
1
2k·k:w7→ sup
u:kuk≤1
u>w(2.4)
4Although we developed the approach for the purpose of MKL dualization, it might also be useful
outside the scope of MKL; for example, the RRM dualization approach given in Rifkin and Lippert
(2007) is contained in our framework as a special case but in contrast to them we can employ
arbitrary (i.e., not necessarily strictly) positive-semidefinite kernels.
15
I Introduction and Overview
denotes the dual norm (Boyd and Vandenberghe, 2004, , 3.26–3.27). Denoting the dual
norm of k·kOby k·kO, we can remark that the norm dual to k·k2,O is k·k2,O(e.g.,
Aflalo et al., 2011). Thus we can further rewrite the above, resulting in the following
dual MKL optimization problem which now solely depends on α:
Problem 2.2 (Dual MKL Problem).
sup
α:1>α=0 C
n
X
i=1
lαi
C, yi1
2
n
X
i=1
αiφ(xi)
2
2,O
.(D)
The above dual generalizes multiple kernel learning to arbitrary convex loss functions
and arbitrary block-structured norms where the inner norm is a Hilbert-Schmidt norm.
Discussion We note that (like in the primal) we have a decomposition of the above
dual into a loss term (on the left hand side) and a block-structured regularization term
(right hand side). A difference to the primal is that the decomposition is in terms of
the dual loss/regularizer. An advantage of the compact representation of the above
dual is that for a specific loss/regularizer pair (l, k·k) we just have to plug the dual loss
/ dual regularizer in (D) in order to obtain the dual MKL optimization problem. We
illustrate this by some examples of loss functions and regularizers in the next section.
2.3 Recovering Prevalent MKL Formulations as Special Cases
In this section, we show that existing MKL-based learners are subsumed by the gen-
eralized formulations in (P) and compute their dual representations (D). To do so, we
need to first compute the dual losses and dual regularizers. To this aim, we present
a table of loss functions and their duals (see Table 2.1). The table can be verified by
elementary calculations from the definition of the conjugate.
2.3.1 Support Vector Machines with Unweighted-Sum Kernels
Clearly, by considering the hinge loss l(t, y) = max(0,1ty) and the regularizer kwk2,2,
the support vector machine using a uniform combination of kernels is a special case of
our generalized formulation. It is instructive to compute the dual (D) for this simple
example.
To this aim, we first note that the dual loss of the hinge loss is l(t, y) = t
yif
1t
y0 and elsewise (see Table 2.1). Hence, for each ithe term lαi
C, yiof
the generalized dual (D) translates to αi
Cyiprovided that 0 αi
yiC. We can now
employ a variable substitution αnew
i=αi
yiso that (D) reads
max
α1>α1
2
n
X
i=1
αiyiφ(xi)
2
2,2
,s.t. y>α= 0 and 0αC1,(2.5)
Note that the regularizer kwk2,2is just kwk2. Hence, the right-hand side of the last
equation can be simply written as PM
m=1 α>Y KmYα, where Y= diag(y), and thus
16
2 A Unifying View of Multiple Kernel Learning
Table 2.1: Loss functions and regularizers used in this thesis and corresponding conjugate functions.
loss/regularizer g(t, y) conjugate g(t, y) used in5
hinge loss max(0,1ty)t
yif 1t
y0 and elsewise [1]
squared loss 1
2(yt)21
2t2+ty [2],[3]
unsigned hinge max(0,1t)tif 1t0 and elsewise [4]
`1-norm 1
2PM
m=1 kwmk221
2maxm∈{1,...,∞} kwk22[5]
`p-norm 1
2kwk2
p1
2kwk2
p[6]
`2,p block-norm 1
2kwk2
2,p 1
2kwk2
2,p[5],[7],[8]6
we obtain from (2.5):
sup
α
1>α1
2α>Y KY α
s.t. y>α= 0,0αC1.
This is the usual dual SVM optimization problem using a uniform kernel combination
K=PM
m=1 Km(M¨uller et al., 2001).
2.3.2 The Classical Quadratically Constrained Quadratic Program (QCQP)
A classical approach to multiple kernel learning, going back to the work of Lanckriet
et al. (2004a) and Bach et al. (2004), is to employ regularizers of the form
w7→ 1
2kwk2
2,1(2.6)
to promote (blockwise) sparse solutions (many wmare zero). Since wm= 0 means
that the corresponding kernel is “switched off and does not contribute to the decision
function, the so-obtained solutions are interpretable.
We can view the classical sparse MKL as a special case of our unifying framework;
to see this, note that the norm dual to `1is `(see Table 2.1). This means the right
hand side of (D) translates to maxm∈{1,...,M}α>Y KmYα; subsequently, the maximum
5[1] Lanckriet et al. (2004a); [2] Yuan and Lin (2006); [3] Bach (2008b); [4] Sonnenburg et al.
(2006a);
[5] Bach et al. (2004); [6] Kloft et al. (2009a); [7] Kloft et al. (2011); [8] Aflalo et al. (2011)
6Only for p= 1 in [5]; only for 1 < p < 2 in [7]; only for p > 2 in [8]
17
I Introduction and Overview
can be expanded into a slack variable ξ, resulting in
sup
α
1>αξ
s.t. m:1
2α>Y KmYαξ;y>α= 0 ; 0αC1,
which is the original QCQP formulation of MKL, first given by Lanckriet et al. (2004a).
2.3.3 A Smooth Variant of Group Lasso
Yuan and Lin (2006) studied the following regularized risk minimization problem known
as the group lasso,
min
w
C
2
n
X
i=1 yi
M
X
m=1 wm, φm(xi)!2
+1
2
M
X
m=1 kwmk2(2.7)
for Hm=Rdm. The above problem has been solved by active set methods in the
primal (Roth and Fischer, 2008). We sketch an alternative approach based on dual
optimization. First, we note that the dual of l(t, y) = 1
2(yt)2is l(t, y) = 1
2t2+ty
and thus the corresponding group lasso dual according to (D) can be written as
max
αy>α1
2Ckαk2
21
2α>Y KmYαM
m=1
,(2.8)
which can be expanded into the following QCQP
sup
α
y>α1
2Ckαk2
2ξ(2.9)
s.t. m:1
2α>Y KmYαξ.
For small n, the latter formulation can be handled rather efficiently by QCQP solvers.
However, the many quadratic constraints, caused by the non-smooth `-norm in the
objective, still are computationally demanding. As a remedy, we propose the following
unconstrained variant based on `p-norms (1 <p<2), given by
max
αy>α1
2Ckαk2
21
2α>Y KmYαM
m=1p
2
.
Because of the smoothness of the `p>1-norm, the above objective function is differ-
entiable in any αRn. Thus the above optimization problem can be solved very
efficiently by Newton descent methods (Nocedal and Wright, 2006) such as the limited
memory quasi-Newton method of (Liu and Nocedal, 1989).
18
2 A Unifying View of Multiple Kernel Learning
2.3.4 Density Level-Set Estimation
Density level-set estimators such as the one-class support vector machine (Scolkopf
et al., 2001) are frequently used for anomaly, novelty, and outlier detection tasks (see,
for example, Markou and Singh, 2003a,b). The one-class SVM can be cast into our
multi-kernel framework by employing loss functions of the form l(t) = max(0,1t):
inf
w,b,t
1
2kwk2
2,O +C
n
X
i=1
max(0,1ti)
s.t. i:hw, φ(xi)i+b=ti.
Noting that the dual loss is l(t) = tif 1t0 and elsewise, we obtain the
following generalized dual
sup
α:0αC1
1>α1
2
n
X
i=1
αiφ(xi)
2
2,O
.
This is studied in Sonnenburg et al. (2006a) for `2,1-norms and in Kloft et al. (2009b)
for `2,p-norms (1 p2).
2.3.5 Hierarchical Kernel Learning
Often multiple kernels come together with a block structure; this frequently happens,
for example, when multiple types of the kernels are given and each type is realized
for a number of parameter values (see the bioinformatics application in Section 5.4.1
and the computer vision application in Section 5.5 as examples). Hierarchical kernel
learning was recently studied using two levels of hierarchies: let I1, . . . , Ig {1, . . . , M}
be pairwise disjoint index sets; then the following penalty was studied
Ω(w) := 1
2k(kwik2)iI1kq,...,(kwik2)iIgq2
p.
This was considered in Szafranski et al. (2010) for p, q [1,2] and in Aflalo et al. (2011)
for p=[2,] and q= 1. Hierarchical kernel learning is contained in our unifying
framework: we obtain a dual representation by noting that the conjugate regularizer is
Ω(w)=1
2k(kwik2)iI1kq,...,(kwik2)iIgq2
p.
so that the general dual (D) translates into the following dual problem:
sup
α:1>α=0 C
n
X
i=1
lαi
C, yi1
2
n
X
i=1
αiφm(xi)2mIjq!g
j=1p
.
We can remark here that, from our unifying point of view, there is no need for the dis-
tinction of cases according to the ranges of pand qas imposed in Szafranski et al. (2010)
and Aflalo et al. (2011): the whole range p, q [1,] can be analyzed simultaneously
in our framework.
19
I Introduction and Overview
2.3.6 Non-Isotropic Norms
In practice, it is often desirable for an expert to incorporate prior knowledge about the
problem domain. For instance, an expert could provide estimates of the interactions of
the kernel matrices K1, . . . , KMin the form of an M×Mmatrix E. For example, if the
kernels are related by an underlying graph structure, Ecould be the graph Laplacian
(von Luxburg, 2007) encoding the similarities of the kernel matrices.
Alternatively, Ecould be estimated from data by computing the pairwise kernel
alignments Eij =<Ki,Kj>
kKik kKjk(given an inner product on the space of kernel matrices
such as the Frobenius dot product).
In a third scenario, Ecould be a diagonal matrix encoding the a priori importance
of kernels—it might be known from pilot studies that a subset of the employed kernels
is inferior to the remaining ones.
All those scenarios are subsumed by the proposed framework by considering non-
isotropic regularizers of the form
Ω(w) := 1
2kwk2
E1
where
kwkE1:= w>E1w
for some E0, where E1is the matrix inverse of E. The dual norm of k · kE1is
k·kE(this is easily verified by setting the gradient of the conjugate of 1
2k·k2
E1to zero)
so that we obtain from (D) the dual optimization problem
sup
α:1>α=0C
n
X
i=1
lαi
C, yi1
2
n
X
i=1
αiφ(xi)
2
E
,
which is a non-isotropic MKL problem. The usage of non-isotropic MKL was first
proposed in Varma and Ray (2007) for the very simple case of Ebeing a diagonal
matrix and generalized in Kloft et al. (2011).
2.4 Summary and Discussion
The standard view of multiple kernel learning (introduced in the introduction) has a
few limitations: first, obviously, it is incomplete since it does not account for p > 2
(Kloft et al., 2010a); second, there exist convex MKL variants (again, corresponding
to p > 2) that, in the standard view, cannot be represented in convex form, although
inherently being convex (Aflalo et al., 2011); third, as it will turn out, the standard
view is inconvenient for a theoretical generalization analysis such as the one carried out
in Chapter 4 of this thesis.
As a remedy, we developed a rigorous mathematical framework for the problem of
multiple kernel learning in this chapter. It comprises most existing lines of research in
that area, including very recent and seemingly different ones (e.g., the one considered
in Aflalo et al., 2011). Deviating from standard introductions, we phrased MKL as
20
2 A Unifying View of Multiple Kernel Learning
a general optimization criterion based on structured regularization, covering and also
unifying existing formulations under a common umbrella. By plugging arbitrary convex
loss functions and norms into the general framework, many existing approaches can be
recovered as instantiations of our model.
The unifying framework allows us to analyze a large variety of MKL methods
jointly, as exemplified by deriving a general dual representation of the criterion, without
making assumptions on the employed norms and losses, besides the latter being convex.
This delivers insights into connections between existing MKL formulations and, even
more importantly, can be used to derive novel MKL formulations as special cases of our
framework, as done in the next part of the thesis, where we propose `p-norm multiple
kernel learning. We note that in the most basic special case, the classical `1-norm
MKL formulation of Lanckriet et al. (2004a) is recovered by plugging the hinge loss
and the `1-norm into the framework. Historically, the structured view of classical MKL
is known since Bach et al. (2004). Here, we show that, more generally, the whole family
of MKL methods can be viewed as structured regularization (Obozinski et al., 2011), of
which we argue that it is a more elegant way to view MKL than the standard approach.
21
Part II
`p-norm Multiple Kernel Learning
23
II `p-norm Multiple Kernel Learning
In the previous chapter we presented a general view of MKL. However, at some point
we have to make a particular choice of a norm, for example, in order to employ MKL
in practical applications.
The main contribution of this thesis is a novel MKL formulation called `p-norm mul-
tiple kernel learning. Our interest in this stems from the fact that (in contrast to the
prevalent MKL variants) it has a good chance to improve on the trivial uniform-kernel-
combination baseline in practical applications (we show this later in the Chapter 5 of
this thesis).
We obtain `p-norm MKL from the unifying MKL framework by using the regularizer
mkl(w) = 1
2kwk2
2,p, p {1,...,∞},
where the `2,p-norm is defined as
kwk2,p := M
X
m=1 kwmkp
21
p
for p[1,[ and kwk2,p := supm=1,...,M kwmk2for p=. Plugging this into the
unifying optimization problems (Problems 2.1 and 2.2) we obtain the `p-norm MKL
primal and dual problems:
Problem II.3 (`p-norm MKL).For any p[1,],
inf
w,b,t
1
2kwk2
2,p +C
n
X
i=1
lti, yi(Primal)
s.t. i:hw, φ(xi)i+b=ti,
sup
α:1>α=0 C
n
X
i=1
lαi
C, yi1
2
n
X
i=1
αiφ(xi)
2
2,p
.(Dual)
The key difference of `p>1-norm MKL to previous MKL approaches, which are based
on `1-norms, is that the obtained weight vectors are unlikely to be sparse. This was
already indicated in the introduction and is now discussed in more detail here: the
optimal solution of the `p-norm MKL optimization problem is attained when the norm
25
II `p-norm Multiple Kernel Learning
||x||1=1
quatric
||x||2=1
quatric
Figure 2.1: Comparison of `1- (Left) and `2- (Right) regularized problems. The blue and green
lines show level sets of the `1-norm and a quadratic function, respectively.
constraint and the level sets of loss function touch each other (see Figure 2.1). If the
loss function is convex (illustrated here for a quadratic loss function) and the norm is
a`2,1-norm, the point of intersection is likely to be at one of the corners of the square
(shown on the figure to the left). Because the corners are likely to have zero-entries,
sparsity is to be expected. This is in contrast to `2,p-norm regularization with p > 1,
where, in general, a non-sparse solution is to be expected (shown on the figure to the
right).
Why `p-norms?
A naturally arising question is why (out of the set of all norms) focusing on an `p-norm?
A reason for this is that, without any prior knowledge, an `p-norm is a natural choice—
in contrast, for example, non-isotropic norms rely on prior knowledge of the relative
importance of the particular kernels or the interactions between the kernels. Also, our
optimization algorithms and the theoretical analysis presented later in this thesis make
use of particular properties of the `p-norm that are not valid for any arbitrary norm.
Alternative Formulation
In the previous chapter, we formulated MKL as a block-norm-regularized risk min-
imization problem. Here, we present an alternative but equivalent view of `p-norm
MKL as learning an “optimal” (linear) kernel combination
Kθ=X
m=1
θmKm
from a candidate set of kernels {K1, . . . , KM}subject to kθkp= 1. We now show
that this formulation naturally arises from the previously considered formulation of
Problem II.3. We need to treat the cases p[1,2] and p[2,] separately to ensure
that the occurring “norms” indeed are norms.
26
The case p[1,2[
We first deal with the case p[1,2[, which corresponds to the established standard
view of MKL. To this aim, we reconsider the regularizer of Problem II.3 and note that
for p[1,2] we can rewrite it as
n
X
i=1
αiφ(xi)2
2,p=α>KmαM
m=1p/2
(2.4)
= sup
θ:kθk(p/2)1
M
X
m=1
θmα>Kmα(II.1)
where we use the definition of the dual norm (Equation (2.4)). Note that the choice
of p[1,2] ensures that the above “norm” really is a norm: indeed, p[1,2] implies
that p/2 is in the valid interval [1,]. We also note that (p/2)=p
2pand that the
optimal θis nonnegative so that we can use (II.1) to rewrite Problem II.3 as
inf
θ:kθkp/(2p)1svm(Kθ),s.t. Kθ=
M
X
m=1
θmKm,(II.2)
where we use the shorthand
svm(Kθ) := sup
α:1>α=0 C
n
X
i=1
lαi
C, yi1
2α>Kθα(II.3)
to denote the optimal value of the SVM optimization problem. Note that we exchanged
the sequence of minimization and maximization to obtain (II.2), which is justified by
Sion’s Minimax theorem (Sion, 1958). Equation (II.2) gives an alternative formulation
of the `p-norm MKL problem: training a support vector machine which simultaneously
also optimizes over the optimal kernel combination (subject to a norm constraint on
the combination coefficients to avoid overfitting).
The case p]2,]
Unfortunately, we cannot use the above approach in the case p[2,] as then the
norm parameter p/(2 p) lies outside the valid range of [1,]. However, we can use
a similar argument as above by considering the primal of Problem II.3 instead of the
dual: the primal regularizer of Problem II.3 can be rewritten as
kwk2
2,p =kwmk2
2M
m=1p/2
(2.4)
= sup
θ:kθk(p/2)1
M
X
m=1
θmkwmk2
2
= sup
θ:Pmθ(p/2)
m1
M
X
m=1
θ1
mkwmk2
2.(II.4)
27
II `p-norm Multiple Kernel Learning
since p/2 is in the valid interval [1,] for p[2,]. We note that (p/2)=p/(2p)
and that the optimal θis nonnegative; hence, the primal of Problem II.3 translates
into
sup
θ:Pmθp/(2p)
m1
svm(Kθ),s.t. Kθ=
M
X
m=1
θmKm,
with svm(Kθ) := inf
w,b
1
2
M
X
m=1
θ1
mkwmk2
2+C
n
X
i=1
lhw, φ(xi)i+b, yi.
Remark
Note that in the above definition of the svm function there is no collision with the
definition in (II.3) since both formulations are dual to each other, for any fixed θ0.
One way to see this is by introducing slack variables ti:= hw, φ(xi)i+bto write the
above as
svm(Kθ) = inf
w,b,t
1
2
M
X
m=1
θ1
mkwmk2
2+C
n
X
i=1
lti, yi
and then incorporating the constraints by Lagrangian multipliers αi:
sup
α
inf
w,b,t
1
2
M
X
m=1
θ1
mkwmk2
2+C
n
X
i=1
lti, yi+
n
X
i=1
αitihw, φ(xi)ib.
The KKT conditions, which can be used to compute the decision function and to recover
the treshold b, hold for the pair (w,α) and yield
m= 1, . . . , M :kwmk2
2=θ2
mαKmα.(II.5)
Note, furthermore, that applying the KKT conditions to (2.3) for `p-norms yields:
c > 0 : kwmk=c(αKmα)1
2(p1) ,(II.6)
The remainder of the dualization is analogue to what we have seen above and yields
(II.3).
Putting the Pieces Together...
Putting things together, we obtain the following alternative MKL optimization problem.
Problem II.4 (`p-norm MKL, Alternative Formulation).The alternative for-
mulation of `p-norm multiple kernel learning is given by
inf
θ0svm(Kθ),( if p[1,2[ )
sup
θ0
svm(Kθ),( if p]2,] )
s.t. Kθ=
M
X
m=1
θmKm,X
m
θp/(2p)
m1,
28
where svm denotes the optimal objective value of the SVM optimization problem,
svm(Kθ) := inf
w,b
1
2
M
X
m=1
θ1
mkwmk2
2+C
n
X
i=1
lhw, φ(xi)i+b, yi
= sup
α:1>α=0 C
n
X
i=1
lαi
C, yi1
2α>Kθα.
The above problem can be interpreted as finding or “learning” an optimal kernel com-
bination from a set of kernels where the quality of the mixtures is evaluated in terms
of the SVM objective function. This alternative view is much in the original learning-
kernels spirit of Lanckriet et al. (2004a): “Learning the kernel matrix with semi-definite
programming”.
Remaining Contents of this Part
In the remainder of this part we derive optimization algorithms for `p-norm MKL and
apply them in order to empirically analyze the generalization performance of `p-norm
MKL in controlled artificial environments as well as real-world applications from the
domains of bioinformatics and computer vision. We also investigate `p-norm MKL
theoretically and show that for a large range of learning scenarios it enjoys stronger
generalization guarantees than classical MKL and the SVM using a uniform kernel
combination.
29
3 Algorithms
3 Algorithms
In this chapter, we present two optimization algorithms for the `p-norm MKL prob-
lem II.3. Each algorithm has its own advantage: while the first one is provably conver-
gent, easier to implement, and also to modify, due to its modular design, the second
one is expected to be faster in practice and also less memory-intensive. For the sake
of performance, both algorithms were implemented in C++ and made available as a
part of the open source machine learning toolbox SHOGUN (Sonnenburg et al., 2010),
which also contains interfaces to MATLAB, Octave, Phyton, and R.
Both algorithms are based on the alternative `p-norm MKL formulation (Prob-
lem II.4) in contrast to the original problem (Problem II.3). This is because the former
consists of an “inner” and an “outer” optimization problem, where the inner problem
is a standard SVM optimization problem. This has the advantage that existing (highly
optimized) SVM solvers can be exploited.
We remark that the proposed algorithms can, in particular, be used to optimize
the classical `p=1-norm MKL formulation. In the computational experiments at the
end of this chapter, we show that the proposed algorithms outperform the prevalent
state-of-the-art solvers by up to two magnitudes.
The main contributions in this chapter are the following:
We present a novel optimization method based on a simple, analytical formula, which,
in particular, can be used to optimize classical MKL.
We prove its convergence for p > 1.
We implement the algorithm as an interleaved chunking optimizer in C++ within the
SHOGUN toolbox with interfaces to MATLAB, Octave, Phyton, and R.
We show that our implementation allows for training with up to ten thousands of data
points and thousands of kernels, while the state-of-the-art approaches run already out
of memory with a few thousand of data points and hundreds of kernels.
Even for moderate sizes of the training and kernel sets, our approach outperforms
prevalent ones by up to two magnitudes.
Parts of this chapter are based on:
M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. M¨uller, and A. Zien. Efficient
and accurate lp-norm multiple kernel learning. In Advances in Neural Information
Processing Systems 22 (NIPS 2009), pages 997–1005. MIT Press, 2009.
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning.
Journal of Machine Learning Research (JMLR), 12:953–997, 2011.
3.1 Block Coordinate Descent Algorithm
The main idea of the first approach is to divide the set of optimization variables (which
is {θ1, . . . , θM, α1, . . . , αn}) into two groups—the set {θ1, . . . , θM}on one hand and the
set {α1, ..., αM}on the other—and then alternating the optimization with respect to θ
with the one with respect to α.
31
II `p-norm Multiple Kernel Learning
We observe that in the α-step this boils down to training a standard SVM. In
contrast, the θ-step can simply be performed by means of an analytical formula, as the
following proposition shows.
Proposition 3.1. Given any (possibly suboptimal) w6=0in the objective of Prob-
lem II.4, the optimal θis attained for
m= 1, . . . , M :θm=kwmk2p
2
PM
m0=1 kwm0kp
2(2p)/p .( if p[1,2[ )
Assume at least one of the kernel matrices Kmis strictly positive-definite. Then, given
any (possibly suboptimal) α6=0, the optimal θis attained for
m= 1, . . . , M :θm=α>Kmα2p
22p
PM
m0=1 α>Km0αp
22p(2p)/p .( if p]2,] )
Proof To start the proof, we consider Problem II.4, fix the variables w, b, and only
optimize w.r.t θ. By Lemma B.1 we can write this as Tikhonov regularized problem:
inf
θ0
1
2
M
X
m=1
θ1
mkwmk2
2+C
n
X
i=1
lhw, φ(xi)i+b, yi+µ
M
X
m=1
θp/(2p)
m.
for a suitable chosen constant µ > 0. Let us ignore the positivity constraint θ0for
the moment; the above objective is differentiable in θfor θ6=0so that the optimum
is attained for the gradient w.r.t. θbeing zero, i.e.,
m= 1, . . . , M :1
2θ2
mkwmk2+µ p/(2 p)θ(p/(2p))1
m= 0 .
Solving this for θ, we observe that the optimal θis indeed positive and we have the
proportionality
m= 1, . . . , M :θm kwmk2p
2.
Normalizing θto fulfill the constraint Pmθp/(2p)
m= 1 (which is possible because
w6=0) yields the first part of the proposition. Note that θm6= 0 is no restriction as
θm= 0 can only be when kwmk2=0so that the proposition trivially holds in that
case.
For the second part, we proceed similarly but use the dual formulation as a starting
point: we write
inf
θ0C
n
X
i=1
lαi
C, yi+1
2α>
M
X
m=1
θmKmα+µ
M
X
m=1
θp/(2p)
m
32
3 Algorithms
so that setting the gradient w.r.t. θto zero yields
m= 1, . . . , M :1
2α>Kmα+µ p/(2 p)θ(p/(2p))1
m= 0 .
Solving this for θresults in the proportionality
m= 1, . . . , M :θmα>Kmα2p
22p.
Again, we observe that the so-obtained θis nonnegative and normalizing θto fulfill
the constraint Pmθp/(2p)
m= 1 (which is possible because m:Km0) yields the
second part of the proposition.
We now have all ingredients to formulate a simple macro-wrapper algorithm for `p-norm
MKL training:
Algorithm 3.1 (Analytical Wrapper). Simple `p-norm MKL training algorithm.
SVM computations and analytical updates of θare alternated.
1: input: p[1,]\{2}
2: For all minitialize θm:= (1/M)(2p)/p
3: while optimality conditions are not satisfied do
4: Compute α:= arg(svm(Kθ))
5: if p[1,2[
6: For all mcompute kwmkaccording to Eq. (II.5)
7: end if
8: Update θaccording to Prop. 3.1
9: end while
10: output: αand θas sparse vectors
The above algorithm starts with initializing θuniformly (Line 2) and then alternates
between training an SVM in the dual using the actual kernel mixture Kθ(Line 4) and
updating θ(Line 8). Thereby, if p[1,2[, the θ-update is performed in the primal
using wcomputed from αby (II.5) (Line 6) and, if p]2,], the update is performed
in the dual. The algorithm can be stopped when, for example, the duality gap or the
change in objective function within subsequent iterations is less than a pre-specified
threshold.
Beside its simplicity, an advantage of the above algorithm is its modular form
which allows to use existing (efficient) SVM solvers in the α-step.
3.2 Large-Scale Algorithm
The above wrapper algorithm computes a full-blown SVM in each iteration. This
can be disadvantageous because it is likely that much computational time is spent
on suboptimal mixtures. Certainly, suboptimal α-solutions would already suffice to
improve far-from-optimal θin the θ-step.
33
II `p-norm Multiple Kernel Learning
This is becoming a very pressing problem especially in large-scale machine learning
applications.1As a remedy, we propose the following algorithm for large-scale MKL
optimization:
Algorithm 3.2 (Analytical Chunking). `p-Norm MKL chunking-based train-
ing algorithm via analytical update. The variables θand (signed) αare optimized
interleavingly. The algorithm is stated here for the hinge loss.
1: input: p[1,]\{2},QN, > 0
2: initialize: i, m :gm,i = ˆgi=αi= 0; L=S=−∞;θm:= (1/M)(2p)/p
3: iterate
4: Select lvariables αi1, . . . , αilbased on the gradient ˆ
gof svm
5: Store αold =αand then compute α:= arg(svm(Kθ)) w.r.t. the selected variables
6: Update gradient i, m :gm,i := gm,i +PQ
q=1(αiqαold
iq)km(xiq, xi)
7: Compute the quadratic terms m:Sm:= 1
2Pigm,iαi,kwmk2
2:= 2θ2
mSm
8: Lold =L,L=Piyiαi,Sold =S,S=PmθmSm
9: if |1(LS)/(Lold Sold)| ε
10: Update θaccording to Prop. 3.1
11: if p[1,2]
12: For all mcompute kwmkaccording to Eq. (II.5)
13: end if
14: else
15: break
16: end if
17: ˆgi=Pmθmgm,i for all i= 1, . . . , n
18: output: αand θas sparse vectors
The above algorithm starts with initializing θuniformly (Line 2) and then alternates be-
tween (incomplete) α- and θ-steps. Thereby, the α-step is given by standard chunking-
based SVM computations and carried out by the SVMlight module in SHOGUN (Lines
4-6). SVM-objective values are computed in Lines 7-8. Finally, the analytical θ-step
is carried out in Lines 10-13. The algorithm terminates (Line 15) if the maximal KKT
violation (c.f. Joachims, 1999) falls below a predetermined precision εand if it holds
|1ω
ωold |< εmkl for the normalized maximal constraint violation in the MKL-step,
where ωdenotes the MKL objective function value.
The main idea of the above algorithm is to perform each α-step solely with respect
to a small number Qof active variables. Thereby, Qis chosen as described in Joachims
(1999); in our computational experiments (see Section 3.4), we observed Q= 40 as a
typical value.
1We refer to machine learning applications as being large-scale when the data cannot be stored in
memory or the computation reaches a maintainable limit. Note that in the case of MKL this can
be due to both a large sample size or a high number of kernels.
34
3 Algorithms
3.2.1 Convergence Proof
In this section, we prove the convergence of Algorithm 3.1. To this aim, we invoke a
result from Bertsekas (1999) on the convergence of the block coordinate descent method:
Proposition 3.2 (Bertsekas, 1999, p. 268–269).Let S=NM
m=1 Smbe the Carte-
sian product of closed convex sets SmRdm, be f:S Ra continuously differentiable
function. Define the block coordinate descent method recursively by letting s0 S be
any feasible point, and be
sk+1
m= argminξsmfsk+1
1,··· ,sk+1
m1,ξ,sk
m+1,··· ,sk
M,m= 1, . . . , M. (3.1)
Suppose that for each mand s S, the minimum
min
ξ∈Sm
f(s1,··· ,sm1,ξ,sm+1,··· ,sM) (3.2)
is uniquely attained. Then every limit point of the sequence {sk}kNis a stationary
point.
The next proposition establishes convergence of the proposed `p-norm MKL training
algorithm 3.1.
Theorem 3.3. Let lbe the hinge loss and be p]1,]\{2}. Let the kernel matrices
K1, . . . , KMbe strictly positive-definite. Then every limit point of Algorithm 3.1 is a
globally optimal point of Problem II.4.
Proof Note that Algorithm 3.1 can be interpreted a block coordinate descent algo-
rithm, so we write Problem II.4 as a minimization problem
if p[1,2[ : inf
(w,b),θ
1
2
M
X
m=1
θ1
mkwmk2
2+C
n
X
i=1
lhw, φ(xi)i+b, yi(3.3)
if p]2,] : inf
α,θC
n
X
i=1
lαi
C, yi+1
2α>
M
X
m=1
θmKmα.(3.4)
subject to Pmθp/(2p)
m1 and θ0. In the following, we differentiate the two cases
p[1,2[ and p]2,].
1. The case p[1,2[.In the first case, we have to transform (3.3) into a form
such that the requirements for the application of Prop. 3.2 are fulfilled. We start by
expanding the hinge loss l(t) = max(0,1t) in (3.3) into
min
w,b,ξ,θC
n
X
i=1
ξi+1
2
M
X
m=1
θ1
mkwmk2
2,
s.t. i:
M
X
m=1hwm, φm(xi)iHm+b1ξi;ξ0; kθkp/(2p)1; θ0,
35
II `p-norm Multiple Kernel Learning
thereby extending the second block of variables (w, b) to (w, b, ξ). Moreover, we note
that after an application of the representer theorem2(Kimeldorf and Wahba, 1971) we
may, without loss of generality, assume Hm=Rn.
In the problem’s current form, the possibility of θm= 0 while kwmk 6= 0 renders
the objective function nondifferentiable. This hinders the application of Prop. 3.2.
However, such a pair (wm, θm) would yield an infinite objective and thus cannot be
optimal. Moreover, even kwmk=0 would, for p > 1, contradict the KKT condition (II.6)
because the kernels are assumed to be strictly positive-definite; in any case it thus holds
θm6= 0 (given p > 1). Therefore we can substitute the constraint θ0by θ>0for
all mwithout changing the optimal solution. In order to maintain the closeness of the
feasible set we subsequently apply a bijective coordinate transformation log : RM
+RM
so that θnew
m:= log(θold
m), resulting in the following equivalent problem,
inf
w,b,ξ,θC
n
X
i=1
ξi+1
2
M
X
m=1
exp(θm)kwmk2
2,(3.5)
s.t. i:
M
X
m=1hwm, φm(xi)i+b1ξi;ξ0; kexp(θ)kp/(2p)1,
where we employ the notation exp(θ) = (exp(θ1),··· ,exp(θM))>.
Applying the block coordinate descent method of Eq. (3.1) to the base problem
(3.3) and to the reparameterized problem (3.5) yields the same sequence of solutions
{(w, b, θ)k}kN0. The reparameterized problem now allows to apply Prop. 3.2 for the
two blocks of coordinates θ S1:= Rand (w, b, ξ) S2:= Rn(M+1)+1: the objective
is continuously differentiable and the sets S1and S2are closed and convex. To see the
latter, note that k·kp/(2p)exp is a convex function, since k·kp/(2p)is convex and non-
increasing in each argument (cf., e.g., Section 3.2.4 in Boyd and Vandenberghe, 2004).
Moreover, the minima in Eq. (3.1) are uniquely attained: the (w, b)-step amounts to
solving an SVM on a positive-definite kernel mixture, and the analytical θ-step clearly
yields unique solutions as well.
Hence, we conclude that every limit point of the sequence {(w, b, θ)k}kNis a
stationary point of (3.3). For a convex problem, this is equivalent to such a limit point
being globally optimal.
2. The case p]2,].In the second case there is nothing to do since for hinge loss
(3.4) trivially fulfills the conditions for the application of Proposition 3.2.
3.2.2 Practical Considerations
In practice, we are facing two problems. First, the standard Hilbert space setup nec-
essarily implies that kwmk 0 for all m. In practice, this assumption may often be
2Note that the coordinate transformation into Rncan be explicitly given in terms of the empirical
kernel map (Scolkopf et al., 1999).
36
3 Algorithms
violated, either due to numerical imprecision or because of using an indefinite “ker-
nel” function. However, for any kwmk 0 it follows that θ?
m= 0 as long as at
least one strictly positive kwm0k>0 exists. This is because for any λ < 0 we have
limh0,h>0λ
h=−∞. Thus, for any mwith kwmk 0, we can immediately set the
corresponding mixing coefficients θ?
mto zero. The remaining θmare then computed
according to Prop. 3.1 but with the normalization being adjusted accordingly.
Second, in practice, the SVM problem will only be solved up to finite precision,
which may lead to convergence problems. Moreover, we want to improve αonly a little
bit before recomputing θsince computing a high precision solution can be wasteful (cf.
Alg. 3.2). We can obtain a heuristic to overcome the potential convergence problem
by computing the SVM by a higher precision if needed. The main idea is that this
way it is likely that the primal objective decreases within each α-step. However, in our
computational experiments we find that this precaution is not even necessary: even
without it, the algorithm converges in all cases that we tried in our computational
experiments (shown later in this thesis).
Finally, we note that, of course, for linear kernels the SVM-step of the proposed
block coordinate descent algorithm could also be carried out in the primal. In the view
of efficient primal SVM optimizers such as LibLinear (Fan et al., 2008) or Ocas (Franc
and Sonnenburg, 2008) this is extremely appealing.
3.3 Implementation
Both of the algorithms described in the previous section were implemented in C++ for
regression, one-class classification, and two-class classification tasks. The implementa-
tion has been made available as a part of the open source machine learning toolbox
SHOGUN (Sonnenburg et al., 2010). SHOGUN, which contains interfaces to MATLAB,
Octave, Phyton, and R, can be freely obtained from http://www.shogun-toolbox.
org.
How to use?
The user can choose from a variety of options. First, in contrast to prevalent MKL
implementations, not only precomputed kernels can be used as an input to our soft-
ware but kernels can also be computed on-the-fly during operation (in that case kernel
caching is employed); on-the-fly kernel computation can be very appealing if the kernel
matrix is too large to be held in memory, as commonly encountered in large-scale ma-
chine learning applications. Second, one can choose the MKL optimization algorithm
to be used—valid options are any of the proposed analytical optimization schemes and
the cutting plane strategy described in the appendix. In case Algorithm 3.1 is used, the
α-step can be carried out by any of the SVM implementations contained in SHOGUN
(this includes, among many others, LIBSVM, LIBLINEAR, and SVMlight). The cut-
ting plane optimizer requires IBM ILOG CPLEX3; alternatively, but only for p= 1,
3http://www.ibm.com/software/integration/optimization/cplex/.
37
II `p-norm Multiple Kernel Learning
the free LP solver GLPK4can be chosen.
Implementation Details
Both, the cutting plane strategy and the analytical algorithm (Algorithm 3.2) perform
interleaved optimization and thus require modification of the core SVM optimization al-
gorithm. This is currently integrated into the chunking-based SVRlight and SVMlight.
To reduce the implementation effort, we implement a single function perform mkl -
step(lin,quad), that has two arguments: first, the linear term lin := Pn
i=1 αiand
second, an array quad =quadmM
m=1 containing the quadratic terms of the respective
kernels quadm:= 1
2αTKmα. This function is either, in the interleaved optimization
case, called as a callback function (after each chunking step), or it is called by the
wrapper algorithm (after each SVM optimization to full precision).
One-class classification is trivially implemented using lin = 0 and support vector
regression (SVR) is typically performed by internally translating the SVR problem
into a standard SVM classification problem with twice the number of examples (once
positively and once negatively labeled with corresponding αand α). Thus one needs
direct access to αand computes lin =Pn
i=1(αi+α
i)εPn
i=1(αiα
i)yi. Since this
requires modification of the core SVM solver, we implement SVR only for interleaved
optimization and SVMlight.
Note that the choice of the size of the kernel cache becomes crucial when applying
MKL to large-scale learning applications. While conventional wrapper algorithms only
solve a single kernel SVM in each iteration and thus use a single large kernel cache,
interleaved optimization methods must keep track of several partial MKL objectives
objmand thus require access to individual kernel rows. Therefore, the same cache size
should be used for all sub-kernels.
3.3.1 Kernel Normalization
In practice, the normalization of kernels is as important for MKL as the normalization
of features is for training regularized linear or single-kernel models. This is owed to the
bias introduced by the regularization: optimal feature / kernel weights are requested
to be small. This is easier to achieve for features (or entire feature spaces, as implied
by kernels) that are scaled to be of large magnitude, while downscaling them would
require a correspondingly upscaled weight for representing the same predictive model.
Upscaling (downscaling) features is thus equivalent to modifying regularizers so that
they penalize those features less (more). We here use isotropic regularizers, which
penalize all dimensions uniformly. This implies that the kernels have to be normalized
in a sensible way in order to represent an “uninformative prior” as to which kernels are
useful.
There exist several approaches to kernel normalization, of which we use two in the
computational experiments below. They are fundamentally different. The first one gen-
eralizes the common practice of standardizing features to entire kernels, thereby directly
4http://www.gnu.org/software/glpk/.
38
3 Algorithms
implementing the spirit of the discussion above. In contrast, the second normalization
approach rescales the data points to unit norm in feature space. Nevertheless, it can
have a beneficial effect on the scaling of kernels, as we argue below.
Both normalization approaches, as well as a number of additional ones, were imple-
mented and made available as a part of our SHOGUN implementation (see SHOGUN
documentation for a complete list of normalization methods).
Method 1: Multiplicative Normalization
As proposed in Ong and Zien (2008), we can multiplicatively normalize the kernels to
have uniform variance of data points in feature space. Formally, we find a positive
rescaling ρmof the kernel so that the rescaled kernel ˜
km(·,·) = ρmkm(·,·) and the
corresponding feature map ˜
Φm(·) = ρmΦm(·) satisfy
1
n
n
X
i=1 ˜
Φm(xi)˜
Φm(¯x)2= 1
for each m= 1, . . . , M, where ˜
Φm(¯x) := 1
nPn
i=1 ˜
Φm(xi) is the empirical mean of the
data in feature space. It is straightforward to verify that the above equation can be
equivalently be expressed in terms of kernel functions as
1
n
n
X
i=1
˜
km(xi, xi)1
n2
n
X
i=1
n
X
j=1
˜
km(xi, xj) = 1,
so that the final normalization rule is
k(x, ¯x)7− k(x, ¯x)
1
nPn
i=1 k(xi, xi)1
n2Pn
i,j=1, k(xi, xj).(3.6)
Note that in case the kernel is centered (i.e. the empirical mean of the data points lies
on the origin), the above rule simplifies to k(x, ¯x)7− k(x, ¯x)/1
ntr(K), where tr(K) :=
Pn
i=1 k(xi, xi) is the trace of the kernel matrix K.
Method 2: Spherical Normalization
Another approach, which is frequently applied in the MKL literature, is to normalize
k(x, ¯x)7− k(x, ¯x)
pk(x, x)k(¯x, ¯x).(3.7)
After this operation, kxk=k(x, x) = 1 holds for each data point x; this means that each
data point is rescaled to lie on the unit sphere. Still, this also may have an effect on the
scale of the features: a centered, spherically normalized kernel is also multiplicatively
normalized because, for centered kernels, the multiplicative normalization rule becomes
k(x, ¯x)7− k(x, ¯x)/1
ntr(K) = k(x, ¯x)/1.
39
II `p-norm Multiple Kernel Learning
Thus, spherical normalization may be seen as an approximation of multiplicative
normalization and may be used as a substitute for it. Note, however, that it changes
the data points themselves by eliminating length information; whether this is desired
or not depends on the learning task at hand (for example, in text classification, the
category of a text rather depends on its word frequency distribution than on its length
so that spherical normalization is desired).
3.4 Runtime Experiments
In this section, we analyze the efficiency of our implementation of `p-norm MKL. We
experiment on the MNIST data set, where the task is to separate odd and even dig-
its. The digits in this n= 60,000-elemental data set, which we downloaded from
http://yann.lecun.com/exdb/mnist/, are of size 28x28, leading to d= 784 dimen-
sional examples. We compare the proposed optimization approaches for `p-norm MKL
with the state-of-the-art for `1-norm MKL, namely, SimpleMKL (Rakotomamonjy
et al., 2008), HessianMKL (Chapelle and Rakotomamonjy, 2008), and the cutting plane
method (CPM) proposed in Sonnenburg et al. (2006a).5We also experiment with the
analytical method for p= 1, although convergence is only guaranteed by Theorem 3.3
for p > 1. In addition, we compare all approaches to a standard single-kernel SVM,
namely, SVMlight using a uniform kernel mixture (`2-norm MKL).
We experiment with MKL using precomputed kernels (excluding the kernel compu-
tation time from the timings) and MKL based on on-the-fly computed kernel matrices
measuring training time including kernel computations. Naturally, runtimes of on-the-
fly methods should be expected to be higher than the ones of the precomputed coun-
terparts; on the other hand, on-the-fly methods are not subject to memory constraints
while precomputation-based methods eventually run out of memory with increasing n.
We optimize all of our methods up to a precision of εsvm = 103in the α-step
and εmkl = 105in the θ-step (those values have proven to be adequate in practice).
Subsequently, we compute the relative duality gaps of the `1-cutting-plane method for
each run and input them as stopping criteria of SimpleMKL and HessianMKL. This
is to ensure that the comparison is fair (i.e., all methods are optimized to the same
precision). The SVM trade-off parameters are set to C= 1 for all methods (which is
a natural choice; but note that even a different value of Cshould affect all methods
equally as it only affects the α-step of the MKL optimizers). The runtime differences
between the p-norms were marginal for p]1,[ so that for non-sparse `p>1-norm
MKL we only plot the results for p=4
3(which is a natural value as it lies right in the
middle between the extreme cases `1-norm MKL and uniform-sum-kernel SVM).
Experiment 1: Runtime in the Number of Training Examples
Figure 3.1 (top) displays the results for varying sample sizes but a fixed number of 50
precomputed or on-the-fly computed Gaussian kernels with bandwidths 2σ21.20,...,49.
5The implementations were obtained from http://asi.insa-rouen.fr/enseignants/~arakotom/
code/ (SimpleMKL) and http://olivier.chapelle.cc/ams/hessmkl.tgz (HessianMKL).
40
3 Algorithms
100 1,000 10,000 100,000
1
10
100
1,00
10,000
number of training examples
training time (seconds)
MKL wrapper (1−norm CPM)
SimpleMKL
on−the−fly CPM chunking (1−norm)
analytical chunking (p−norm)
SVM
HessianMKL
analytical chunking (1−norm)
(p−norm)
CPM chunking (1−norm)
(p−norm)
SVM
proposed chunking methods
SVM
proposed on−the−fly methods
HessianMKL
SimpleMKL
out−of−memory
10 100 1,000 10,000
1
10
100
1,000
10,000
number of kernels
training time (seconds)
SVM
proposed on−the−fly methods
out−of−memory
proposed chunking methods
HessianMKL
SimpleMKL
Figure 3.1: Runtime of `p-norm MKL in the number of training examples (Top) and the number
of kernels (Bottom), respectively, and comparison to MKL baselines.
41
II `p-norm Multiple Kernel Learning
Error bars indicate standard errors over 5 repetitions. As expected, the SVM using
a precomputed uniform kernel mixture is the fastest method. The classical MKL-
wrapper-based methods, SimpleMKL and the CPM wrapper, are the slowest; they are
even slower than methods that compute kernels on-the-fly. Note that the on-the-fly
methods naturally have higher runtimes because they do not profit from precomputed
kernel matrices.
Notably, when considering 50 kernel matrices of size 8,000 ×8,000 (memory re-
quirements about 24GB for double precision numbers), SimpleMKL is the slowest
method: it is more than 120 times slower than the `1-norm CPM solver and the p-
norm analytical chunking. This is because SimpleMKL suffers from having to train
an SVM to full precision for each gradient evaluation. In contrast, kernel caching and
interleaved optimization still allow to train our algorithm on kernel matrices of size
20000 ×20000, which would usually not completely fit into memory since they require
about 149GB.
Non-sparse `p>1-norm MKL scales similarly as the `1-norm CPM for all chunking-
based optimization strategies: analytical and CPM chunking. Naturally, the general-
ized `p>1-norm CPM is slightly slower than its `1-norm counterpart as an additional
series of Taylor expansions has to be computed within each θ-step. HessianMKL ranks
in between on-the-fly and `p-norm chunking methods.
Experiment 2: Runtime in the Number of Kernels
Figure 3.1 (bottom) shows the execution times as a function of the number of RBF
kernels M. Thereby, the sample size is fixed to n= 1000. The bandwidths of the RBF
kernels are taken as 2σ220,...,M1. As expected, the SVM with the unweighted-sum
kernel is hardly affected by this setup and essentially has a constant training time.
The `1-norm CPM handles the increasing number of kernels best and is the fastest
MKL method. Non-sparse approaches to MKL show reasonable run-times, being just
slightly slower. Thereby, the analytical methods are somewhat faster than the CPM
approaches. The sparse analytical method performs substantially worse than its non-
sparse counterpart; this might be related to the fact that convergence of the analytical
method is only guaranteed for p > 1. The wrapper methods (CPM wrapper and
SimpleMKL) again perform worst.
However, in contrast to the previous experiment, SimpleMKL becomes more effi-
cient with an increasing number of kernels. We conjecture that this is in part owing
to the fact that the line search (which in the case of SimpleMKL is a full-blown SVM
computation) is relatively fast for moderate sample sizes. Nevertheless, the capacity
of SimpleMKL remains limited due to memory restrictions of the hardware. For ex-
ample, for storing 1,000 kernel matrices for 1,000 data points, about 7.4GB of memory
are required. On the other hand, our interleaved optimizers which allow for effective
caching can easily cope with 10,000 kernels of the same size (74GB). HessianMKL is
considerably faster than SimpleMKL but slower than the `p>1-norm chunking-based
methods and the `1-norm CPM; similar to SimpleMKL, it becomes more efficient with
an increasing number of kernels but eventually runs out of memory.
42
3 Algorithms
Overall, the proposed chunking-based optimization strategies (analytic and CPM)
achieve a speedup of up to one and two orders of magnitude over HessianMKL and
SimpleMKL, respectively. Using efficient kernel caching, they allow for truly large-scale
multiple kernel learning—well beyond the limits imposed by having to precompute and
store the complete kernel matrices.
3.5 Summary and Discussion
In this chapter, we derived two efficient algorithms for solving the `p-norm MKL op-
timization problem. Both algorithms were, for the sake of performance, implemented
in C++ and made available as a part of the open source machine learning toolbox
SHOGUN (Sonnenburg et al., 2010), which also contains interfaces to MATLAB, Oc-
tave, Phyton, and R. In our computational experiments, we found these large-scale
optimization algorithms allowing us to deal with ten thousands of data points and
thousands of kernels at the same time, as demonstrated on the MNIST data set. We
compared our algorithms to the state-of-the-art in MKL research, namely HessianMKL
(Chapelle and Rakotomamonjy, 2008) and SimpleMKL (Rakotomamonjy et al., 2008),
and found ours to be up to two magnitudes faster. Both algorithms are based on the
alternative, wrapper-based `p-norm MKL formulation (Problem II.4) in contrast to the
original problem (Problem II.3).
The first algorithm, the analytical wrapper, consists of an iterate 2-step proce-
dure, alternatingly performing the analytical MKL step and calling an SVM solver
(to this end, both, SVMlight and LIBSVM, are implemented in SHOGUN). Clearly,
AnalyticalMKL is even simpler than SimpleMKL since the latter requires a heuristic
line search in the MKL step. We showed AnalyticalMKL being provably convergent,
using the usual regularity assumptions. The second algorithm, the analytical chunking,
is directly integrated into the SVM light code (i.e., the MKL step is called after each
chunking iteration), allowing for truly large-scale MKL.
We remark that the proposed algorithms can, in particular, be used to optimize the
classical `p=1-norm MKL formulation. However, we found the algorithms in some cases
to converge considerably faster in the non-sparse case p > 1, where the smoothness
of the objective can be exploited. Considering the gain in prediction accuracy that is
also achieved by `p>1-norm MKL, established later in this dissertation, this might be
remarkable. Note that we also wrote scripts completely automating the whole process
from training over model selection to evaluation. Currently, MKL can be trained and
validated by a single line of MATLAB code including random subsampling, model
search for the optimal parameters Cand p, and collection of results. Our software is
freely available under the GPL license. We also remark that another fast algorithm
based on cutting planes is described in Appendix D. We found this algorithm to be
even faster in some cases but it requires a commercial QCQP solver so that it cannot
be released under the GPL license.
Finally, we note that performing MKL with 1,000 precomputed kernel matrices of
43
II `p-norm Multiple Kernel Learning
size 1,000 ×1,000 requires less than 3 minutes for the chunking methods. This suggests
to focus future research efforts rather on improving the accuracy of MKL models rather
than accelerating the optimization algorithms.
44
4 Theoretical Analysis
4 Theoretical Analysis
In this chapter, we derive upper bounds on the global and local Rademacher com-
plexities of `p-norm MKL, from which we deduce excess risk bounds that achieve fast
convergence rates of the order O(nα
1+α), where αis the minimum eigenvalue decay rate
of the individual kernels (previous bounds for `p-norm MKL only achieved O(n1
2)).
We also give a lower bound that, up to constants, matches the upper bounds, showing
that our results are tight. Finally, the generalization performance of `p-norm MKL, as
guaranteed by the excess risk bound, is studied for varying values of p, shedding light
on why learning kernels can help performance. This chapter is based on mathematical
preliminaries introduced in Appendix A.
The main contributions in this chapter are the following:
We derive an upper bound on the local Rademacher complexity of `p-norm multiple
kernel learning that yields an excess risk bound achieving fast convergence rates, while
previous approaches only yielded slow rates.
We also prove a lower bound that matches the upper one, showing that our result is
tight.
As a byproduct, we extend a previous upper bound of Cortes et al. (2010a) on the
global Rademacher complexity to the full range of p[1,], yet presenting a sub-
stantially easier proof and improved constants.
We exemplarily evaluate the bound for several scenarios that differ in the soft sparsity
of the underlying Bayes hypothesis, shedding light on why learning kernels can help
performance.
Parts of this chapter are based on:
M. Kloft and G. Blanchard. The local Rademacher Complexity of Multiple Kernel
Learning. ArXiv preprint 1103.0790v1. Submitted to Journal of Machine Learning
Research (JMLR) , Mar 2011.
A short version has been submitted to Twenty-Fifth Annual Conference on Neural
Information Processing Systems (NIPS 2011), Jun 2011.
Notation
For the upcoming analysis it is convenient to view MKL as an empirical minimization
algorithm acting on the hypothesis class
Hp,D,M =fw:x7→ hw, φ(x)iw= (w(1),...,w(M)),kwk2,p D(4.1)
(if Dor Mare clear from the context, we sometimes denote Hp=Hp,D =Hp,D,M ).
This is equivalent to the original formulation in Problem 2.1, where we viewed MKL as
regularized risk minimization problem. To see the equivalence between the regulariza-
tion and hypothesis-class views, note that we can equivalently translate the Tikhonov-
regularized MKL problem in an Ivanov-regularized problem by applying Lemma B.1
(presented in Appendix B).
45
II `p-norm Multiple Kernel Learning
Furthermore, let Pbe a probability measure on Xindependently generating the
data x1, . . . , xnand denote by Ethe corresponding expectation operator. Note that for
xPwe can view φ(x) and φm(x) as random variables taking values in the Hilbert
spaces Hand Hm, respectively. Correspondingly, we will work with covariance opera-
tors in Hilbert spaces. In a finite dimensional vector space, the (uncentered) covariance
operator can be defined in the usual vector/matrix notation as Eφ(x)φ(x)>. Since we
are working with potentially infinite-dimensional vector spaces, instead of φ(x)φ(x)>we
will use the tensor notation φ(x)φ(x)HS(H), which is a Hilbert-Schmidt operator
H 7→ H defined as (φ(x)φ(x))u=hφ(x),uiφ(x). The space HS(H) of Hilbert-
Schmidt operators on His itself a Hilbert space, and the expectation Eφ(x)φ(x)
is well-defined and belongs to HS(H) as soon as Ekφ(x)k2is finite, which will always
be assumed as such (as a matter of fact, we will often assume that kφ(x)kis bounded
a.s.). We denote by J=Eφ(x)φ(x) and Jm=Eφm(x)φm(x) the uncentered
covariance operators corresponding to variables φ(x) and φm(x), respectively; it holds
that tr(J) = Ekφ(x)k2
2and tr(Jm) = Ekφm(x)k2
2.
4.1 Global Rademacher Complexity
We first review global Rademacher complexities (GRC) in multiple kernel learning. Let
x1, . . . , xnbe an i.i.d. sample drawn from P. The global Rademacher complexity is
defined as R(Hp) = EsupfwHphw,1
nPn
i=1 σiφ(xi)i, where (σi)1inis an i.i.d. family
(independent of φ(xi)) of Rademacher variables (random signs). Its empirical counter-
part is denoted by b
R(Hp) = ER(Hp)x1, . . . , xn=EσsupfwHphw,1
nPn
i=1 σiφ(xi)i.
The interest in the global Rademacher complexity comes from the fact that, if known, it
can be used to bound the generalization error (Koltchinskii, 2001; Bartlett and Mendel-
son, 2002); see Appendix A.
In the recent paper of Cortes et al. (2010a) it was shown, using a combinatorial argu-
ment, that the empirical version of the global Rademacher complexity can be bounded
as
b
R(Hp)D
nrcptr(Km)M
m=1p
2
(for p[1,2] and pbeing an integer)
where c=23
22. We will now show a quite short proof of this result (extending it to
the whole range p[1,]) and then present a novel bound on the population version
of the GRC. The proof presented here is based on the Khintchine-Kahane inequality
(Kahane, 1985) using the constants taken from Lemma 3.3.1 and Proposition 3.4.1 in
Kwapi´en and Woyczy´nski (1992).
Lemma 4.1 (Khintchine-Kahane inequality).Let v1,...,vM H. Then, for
any q1, it holds
Eσ
n
X
i=1
σiviq
2c
n
X
i=1 vi2
2q
2,
where c= max(1, p1). In particular, the result holds for c=p.
46
4 Theoretical Analysis
Proposition 4.2 (Global Rademacher complexity bound, empirical ver-
sion).For any p1the empirical version of global Rademacher complexity of the
multi-kernel class Hpcan be bounded as
b
R(Hp)min
t[p,]Dst
n1
ntr(Km)M
m=1t
2
.
Proof First note that it suffices to prove the result for t=pas trivially kwk2,t kwk2,p
holds for all tpso that HpHtand therefore R(Hp)R(Ht). We can use a
block-structured version of older’s inequality (cf. Lemma C.1) and the Khintchine-
Kahane (K.-K.) inequality (cf. Lemma 4.1) to bound the empirical version of the global
Rademacher complexity as follows:
b
R(Hp)def.
=Eσsup
fwHphw,1
n
n
X
i=1
σiφ(xi)i
older
Eσsup
fwHpw2,p 1
n
n
X
i=1
σiφ(xi)2,p
(4.1)
DEσ1
n
n
X
i=1
σiφ(xi)2,p
Jensen
DEσ
M
X
m=1 1
n
n
X
i=1
σiφm(xi)p
21
p
K.-K.
Drp
nM
X
m=1 1
n
n
X
i=1 φm(xi)2
2
| {z }
=1
ntr(Km)
p
21
p
=Dsp
n1
ntr(Km)M
m=1p
2
,
which is what was to be shown.
Remark. Note that there is a very good reason to state the above bound in terms
of tpinstead of solely in terms of p: the Rademacher complexity b
R(Hp) is not
monotonic in pand thus it is not always the best choice to take t:= pin the above
bound. This can be readily seen, for example, for the simple case where all kernels have
the same trace—in that case the bound translates into b
R(Hp)D
nqtM2
ttr(K1).
Interestingly, the function s7→ sM2/s is not monotone and attains its minimum for
s= 2 log M, where log denotes the natural logarithm with respect to the base e.
This has interesting consequences: for any p(2 log M)we can take the bound
b
R(Hp)D
npelog(M)tr(K1), which has only a mild dependancy on the number of
47
II `p-norm Multiple Kernel Learning
kernels; note that, in particular, we can take this bound for the `1-norm class b
R(H1)
for all M > 1.
Despite the simplicity of the above proof, the constants are a little better than the ones
achieved in Cortes et al. (2010a). However, computing the population version of the
global Rademacher complexity of MKL is somewhat more involved and, to the best of
our knowledge, has not been addressed yet by the literature. To this end, note that from
the previous proof we obtain R(Hp) = EDpp/nPM
m=1 1
nPn
i=1 φm(xi)2
2p
21
p.
We can thus use Jensen’s inequality to move the expectation operator inside the root,
R(Hp) = Dpp/nM
X
m=1
E1
n
n
X
i=1 φm(xi)2
2p
21
p,(4.2)
but now need a handle on the p
2-th moments. To this aim, we use the inequalities of
Rosenthal (1970) and Young (e.g., Steele, 2004) to show the following Lemma.
Lemma 4.3 (Rosenthal + Young).Let X1, . . . , Xnbe independent nonnegative
random variables satisfying i:XiB < almost surely. Then, denoting cq=
(2qe)q, for any q1
2it holds
E1
n
n
X
i=1
Xiq
cqB
nq+1
n
n
X
i=1
EXiq.
The proof is deferred to Appendix C. It is now easy to show:
Corollary 4.4 (Global Rademacher complexity bound, population version).
Assume the kernels are uniformly bounded, that is, kkkB < , almost surely.
Then for any p1the population version of the global Rademacher complexity of the
multi-kernel class Hpcan be bounded as
R(Hp,D,M )min
t[p,]D tre
ntr(Jm)M
m=1t
2
+BeDM 1
tt
n.
For t2the right-hand term can be discarded and the result also holds for unbounded
kernels.
Proof As in the previous proof it suffices to prove the result for t=p. From (4.2) we
conclude by the previous Lemma
R(Hp)Drp
n M
X
m=1
(ep)p
2B
np
2+E1
n
n
X
i=1 φm(xi)2
2
| {z }
=tr(Jm)
p
2!1
p
D pre
ntr(Jm)M
m=1p
2
+BeDM 1
pp
n,
48
4 Theoretical Analysis
where for the last inequality we use the subadditivity of the root function. Note that
for p2 it is p/21 and thus it suffices to employ Jensen’s inequality instead of the
previous lemma so that we get by without the last term on the right-hand side.
Interpretation
When, for example, the traces of the kernels are bounded, the above bound is essentially
determined by OmintptM1
t
n, which indicates a dependancy on the number of
kernels that is monotone in the norm-parameter p. In the extreme case p= 1, by
setting t= (log(M)), the bound can be more compactly written R(H1) = Olog M
n.
4.2 Local Rademacher Complexity
Let x1, . . . , xnbe an i.i.d. sample drawn from P. We define the local Rademacher
complexity of Hpas Rr(Hp) = EsupfwHp:Pf2
wrhw,1
nPn
i=1 σiφ(xi)i, where Pf2
w:=
E(fw(φ(x)))2. Note that it subsumes the global RC as a special case for r=. Our
interest in the local Rademacher complexity stems from the fact that, if known, it can
be used to obtain fast convergence rates of the excess risk (see subsequent section).
We will state our local Rademacher bounds in terms of the spectra of the kernels:
this is possible since covariance operators (as self-adjoint, positive Hilbert-Schmidt
operators) enjoy discrete eigenvalue-eigenvector decompositions J=Eφ(x)φ(x) =
P
j=1 λjujujand Jm=Ex(m)x(m)=P
j=1 λ(m)
ju(m)
ju(m)
j, where (uj)j1and
(u(m)
j)j1form orthonormal bases of Hand Hm, respectively.
We will also use the following assumption in the bounds for the case p[1,2]:
Assumption (U) (no-correlation). Let xP. The Hilbert space valued random
variables φ1(x), . . . , φM(x)are said to be (pairwise) uncorrelated if for any m6=m0and
w Hm,w0 Hm0, the real variables hw, φm(x)iand hw0, φm0(x)iare uncorrelated.
Since Hm,Hm0are reproducing kernel Hilbert spaces with kernels km, km0, if we go
back to the input random variable in the original space x X, the above property is
equivalent to saying that for any fixed t, t0 X, the variables km(x, t) and km0(x, t0) are
uncorrelated. This is the case, for example, if the original input space Xis RM, the
original input variable x X has independent coordinates, and the kernels k1, . . . , kM
each act on a different coordinate. Such a setting was considered in particular by
Raskutti et al. (2010) in the setting of `1-penalized MKL. We discuss this assumption
in more detail in Section 4.3.2.
We are now equipped to state our main results:
Theorem 4.5 (Local Rademacher complexity bound, p[1,2] ).Assume that
the kernels are uniformly bounded (kkkB < ) and that Assumption (U) holds.
49
II `p-norm Multiple Kernel Learning
The local Rademacher complexity of the multi-kernel class Hpcan be bounded for any
p[1,2] as
Rr(Hp)min
t[p,2] v
u
u
t16
n
X
j=1
min rM12
t, ceD2t2λ(m)
jM
m=1t
2
+BeDM 1
tt
n.
Theorem 4.6 (Local Rademacher complexity bound, p[2,] ).The local
Rademacher complexity of the multi-kernel class Hpcan be bounded for any p[2,]
as
Rr(Hp)min
t[p,]v
u
u
t2
n
X
j=1
min(r, D2M2
t1λj).
Remark 1. Note that for the case p= 1, by using t= (log(M))in Theorem 4.5,
we obtain the bound
Rr(H1)v
u
u
t16
n
X
j=1
min rM, e3D2(log M)2λ(m)
jM
m=1
+Be3
2Dlog(M)
n,
for all Me2(see below after the proof of Theorem 4.5 for a detailed justification).
Remark 2. The result of Theorem 4.6 (for p[2,]) can be proved using con-
siderably less complex techniques and without imposing assumptions on boundedness
or on uncorrelation of the kernels. If, in addition, the variables (φm(x)) are centered
and uncorrelated, then the spectra are related as follows: spec(J) = SM
m=1 spec(Jm);
that is, {λi, i 1}=SM
m=1 nλ(m)
i, i 1o. In that case one can equivalently write
the bound of Theorem 4.6 as Rr(Hp)q2
nPM
m=1 P
j=1 min(r, D2M2
p1λ(m)
j) =
r2
nP
j=1 min(r, D2M2
p1λ(m)
j)M
m=11.However, the main focus of this thesis is
on the more challenging case 1 p2, which is more relevant in practice (see empir-
ical chapter of this thesis, i.e., Chapter 5).
Remark 3. It is interesting to compare the above bounds for the special case p= 2
with the ones of Bartlett et al. (2005). The main term of the bound of Theorem 4.6
(taking t=p= 2) is then essentially determined by Oq1
nPM
m=1 P
j=1 min r, λ(m)
j.
If the variables (φm(x)) are centered and uncorrelated, by the relation between the
spectra stated in Remark 2, this is equivalently of order Oq1
nP
j=1 min r, λj,
which is also what we obtain through Theorem 4.6, and which coincides with the rate
shown in Bartlett et al. (2005).
50
4 Theoretical Analysis
Proof of Theorem 4.5. The proof is based on first relating the complexity of the
class Hpto its centered counterpart, i.e., where all functions fwHpare centered
around their expected value. Then we compute the complexity of the centered class
by decomposing the complexity into blocks, applying the no-correlation assumption,
and using the inequalities of older and Rosenthal. Then we relate it back to the
original class, which, in the final step, we relate to a bound involving the truncation
of the particular spectra of the kernels. Note that it suffices to prove the result for
t=pas trivially kwk2,t kwk2,p holds for all tpso that HpHtand therefore
Rr(Hp)Rr(Ht).
Step 1: Relating the original class with the centered class. In order
to exploit the no-correlation assumption, we will work in large parts of the proof with
the centered class f
Hp=e
fwkwk2,p D, wherein e
fw:x7→ hw,e
φ(x)i, and e
φ(x) :=
φ(x)Eφ(x). We start the proof by noting that e
fw(x) = fw(x)hw,Eφ(x)i=fw(x)
Ehw, φ(x)i=fw(φ(x)) Efw(φ(x)), so that, by the bias-variance decomposition, it
holds that
Pf2
w=Efw(x)2=E(fw(x)Efw(x))2+ (Efw(φ(x)))2=Pe
f2
w+Pfw2.(4.3)
Furthermore, we note that by Jensen’s inequality
Eφ(x)2,p=M
X
m=1 Eφm(x)p
21
p
=M
X
m=1 Eφm(x),Eφm(x)p
21
p
Jensen
M
X
m=1
Eφm(x), φm(x)p
21
p
=str(Jm)M
m=1p
2
(4.4)
so that we can express the complexity of the centered class in terms of the uncentered
one as follows:
Rr(Hp) = Esup
fwHp,
Pf2
wrw,1
n
n
X
i=1
σiφ(xi)
Esup
fwHp,
Pf2
wrw,1
n
n
X
i=1
σie
φ(xi)+Esup
fwHp,
Pf2
wrw,1
n
n
X
i=1
σiEφ(x).
Concerning the first term of the above upper bound, using (4.3) we have Pe
f2
wPf2
w,
and thus
Esup
fwHp,
Pf2
wrw,1
n
n
X
i=1
σie
φ(xi)Esup
fwHp,
Pe
f2
wrw,1
n
n
X
i=1
σie
φ(xi)=Rr(e
Hp).
51
II `p-norm Multiple Kernel Learning
Now to bound the second term, we write
Esup
fwHp,
Pf2
wrw,1
n
n
X
i=1
σiEφ(x)=E
1
n
n
X
i=1
σisup
fwHp,
Pf2
wr
hw,Eφ(x)i
sup
fwHp,
Pf2
wrw,Eφ(x)
E 1
n
n
X
i=1
σi!2
1
2
=nsup
fwHp,
Pf2
wr
hw,Eφ(x)i.
We finally observe
hw,Eφ(x)iolder
kwk2,p kEφ(x)k2,p
(4.4)
kwk2,p rtr(Jm)M
m=1p
2
as well as
hw,Eφ(x)i=Efw(x)pPf2
w.
Putting the above steps together, we obtain as a result
Rr(Hp)Rr(e
Hp) + n1
2min r, D rtr(Jm)M
m=1p
2.(4.5)
This shows that, at the expense of the additional summand on the right-hand side, we
can work with the centered class instead of the uncentered one.
Step 2: Bounding the complexity of the centered class. Since the (cen-
tered) covariance operator Ee
φm(x)e
φm(x) is also a self-adjoint Hilbert-Schmidt oper-
ator on Hm, there exists an eigendecomposition
Ee
φm(x)e
φm(x) =
X
j=1 e
λ(m)
je
u(m)
je
u(m)
j,(4.6)
wherein (e
u(m)
j)j1is an orthogonal basis of Hm. Furthermore, the no-correlation as-
sumption (U) entails Ee
φl(x)e
φm(x) = 0for all l6=m. As a consequence,
Pe
f2
w=E(e
fw(x))2=EM
X
m=1 wm,e
φm(x)2=
M
X
l,m=1 Dwl,Ee
φl(x)e
φm(x)wmE
(U)
=
M
X
m=1 Dwm,Ee
φm(x)e
φm(x)wmE=
M
X
m=1
X
j=1 e
λ(m)
jDwm,e
u(m)
jE2(4.7)
52
4 Theoretical Analysis
and, for all jand m,
ED1
n
n
X
i=1
σie
φm(xi),e
u(m)
jE2=E1
n2
n
X
i,l=1
σiσlDe
φm(xi),e
u(m)
jEDe
φm(xl),e
u(m)
jE
σi.i.d.
=E1
n2
n
X
i=1 De
φm(xi),e
u(m)
jE2
=1
nDe
u(m)
j,1
n
n
X
i=1
Ee
φm(xi)e
φm(xi)
| {z }
=Ee
φm(x)e
φm(x)
e
u(m)
jE
=e
λ(m)
j
n.(4.8)
Let now h1, . . . , hMbe arbitrary nonnegative integers. We can express the local
Rademacher complexity in terms of the eigendecompositon (4.6) as follows
Rr(e
Hp)
=Esup
fwe
Hp:Pe
f2
wrDw,1
n
n
X
i=1
σie
φ(xi)E
=Esup
fwe
Hp:Pe
f2
wrDw(m)M
m=1,1
n
n
X
i=1
σie
φm(xi)M
m=1E
Esup
Pe
f2
wrDhm
X
j=1 qe
λ(m)
jhw(m),e
u(m)
jie
u(m)
jM
m=1,
hm
X
j=1 qe
λ(m)
j
1
h1
n
n
X
i=1
σie
φm(xi),e
u(m)
jie
u(m)
jM
m=1 E
+Esup
fwe
Hpw,
X
j=hm+1h1
n
n
X
i=1
σie
φm(xi),e
u(m)
jie
u(m)
jM
m=1
C.-S., Jensen
sup
Pe
f2
wr"
M
X
m=1
hm
X
j=1 e
λ(m)
jhw(m),e
u(m)
ji2
1
2
×
M
X
m=1
hm
X
j=1 e
λ(m)
j1E1
n
n
X
i=1
σie
φm(xi),e
u(m)
j2
1
2#
+Esup
fwe
Hpw,
X
j=hm+1h1
n
n
X
i=1
σie
φm(xi),e
u(m)
jie
u(m)
jM
m=1
53
II `p-norm Multiple Kernel Learning
so that (4.7) and (4.8) yield
Rr(e
Hp)
(4.7), (4.8)
srPM
m=1 hm
n+Esup
fwe
Hpw,
X
j=hm+1h1
n
n
X
i=1
σie
φm(xi),e
u(m)
jie
u(m)
jM
m=1
older
srPM
m=1 hm
n+DE
X
j=hm+1h1
n
n
X
i=1
σie
φm(xi),e
u(m)
jie
u(m)
jM
m=12,p
.
(4.9)
Step 3: Khintchine-Kahane’s and Rosenthal’s inequalities. We can now
use the Khintchine-Kahane (K.-K.) inequality (see Lemma 4.1 in Appendix C) to fur-
ther bound the right term in the above expression as follows
E
X
j=hm+1h1
n
n
X
i=1
σie
φm(xi),e
u(m)
jie
u(m)
jM
m=12,p
Jensen
EM
X
m=1
Eσ
X
j=hm+1h1
n
n
X
i=1
σie
φm(xi),e
u(m)
jie
u(m)
j
p
Hm1
p
K.-K.
rp
nEM
X
m=1
X
j=hm+1
1
n
n
X
i=1he
φm(xi),e
u(m)
ji2p
21
p
Jensen
rp
nM
X
m=1
E
X
j=hm+1
1
n
n
X
i=1he
φm(xi),e
u(m)
ji2p
21
p
.
Note that for p2 it holds that p/21, and thus it suffices to employ Jensen’s
inequality once again in order to move the expectation operator inside the inner term.
In the general case, we need a handle on the p
2-th moments and to this end employ
54
4 Theoretical Analysis
Lemma 4.3 (Rosenthal + Young), which yields
M
X
m=1
E
X
j=hm+1
1
n
n
X
i=1he
φm(xi),e
u(m)
ji2p
21
p
R+Y
M
X
m=1
(ep)p
2B
np
2+
X
j=hm+1
1
n
n
X
i=1
Ehe
φm(xi),e
u(m)
ji2
| {z }
=e
λ(m)
j
p
2!1
p
()
v
u
u
tep BM 2
p
n+M
X
m=1
X
j=hm+1 e
λ(m)
jp
22
p!
=v
u
u
tep BM 2
p
n+
X
j=hm+1 e
λ(m)
jM
m=1p
2!
v
u
u
tep BM 2
p
n+
X
j=hm+1
λ(m)
jM
m=1p
2!
where for () we used the subadditivity of p
·and in the last step we exploit the
eigenvalues of the centered covariance operator being smaller than or equal to those of
the centered one: j, m :e
λ(m)
jλ(m)
j. This follows from the decomposition Eφm(x)
φm(x) = Ee
φm(x)e
φm(x)+Eφm(x)Eφm(x) by the Lidskii-Mirsky-Wielandt theorem.
Thus putting the pieces together, we obtain from (4.9)
Rr(e
Hp)
srPM
m=1 hm
n+Dv
u
u
tep2
n BM 2
p
n+
X
j=hm+1
λ(m)
jM
m=1p
2!
=srPM
m=1 hm
n+v
u
u
tep2D2
n
X
j=hm+1
λ(m)
jM
m=1p
2
+BeDM 1
pp
n,(4.10)
again, using subadditivity for the last inequality.
Step 4: Bounding the complexity of the original class. Now note that for
all nonnegative integers hmwe either have
n1
2min r, D rtr(Jm)M
m=1p
2v
u
u
tep2D2
n
X
j=hm+1
λ(m)
jM
m=1p
2
(in case all hmare zero) or it holds
n1
2min r, D rtr(Jm)M
m=1p
2srPM
m=1 hm
n
55
II `p-norm Multiple Kernel Learning
(in case that at least one hmis nonzero) so that in any case we get
n1
2min r, D rtr(Jm)M
m=1p
2
srPM
m=1 hm
n+v
u
u
tep2D2
n
X
j=hm+1
λ(m)
jM
m=1p
2
.(4.11)
Thus the following preliminary bound follows from (4.5) by (4.10) and (4.11):
Rr(Hp)
s4rPM
m=1 hm
n+v
u
u
t4ep2D2
n
X
j=hm+1
λ(m)
jM
m=1p
2
+BeDM 1
pp
n,(4.12)
for all nonnegative integers hm0. We could stop here as the above bound is already
the one that will be used in the subsequent section for the computation of the excess
loss bounds. However, we can work a little more on the form of the above bound to
gain more insight into the properties—we will show that it is related to the truncation
of the spectra at the scale r.
Step 5: Relating the bound to the truncation of the spectra of the
kernels. To this end, notice that for all nonnegative real numbers A1, A2and any
a1,a2Rm
+it holds for all q1
pA1+pA2p2(A1+A2) (4.13)
ka1kq+ka2kq211
qka1+a2kq2ka1+a2kq(4.14)
(the first statement follows from the concavity of the square root function and the
second one is proved in appendix C; see Lemma C.3) and thus
Rr(Hp)
(4.13)
v
u
u
u
t8
rPM
m=1 hm
n+ep2D2
n
X
j=hm+1
λ(m)
jM
m=1p
2
+BeDM 1
pp
n
`1-to-`p
2
v
u
u
u
t8
n
rM12
phmM
m=1p
2
+ep2D2
X
j=hm+1
λ(m)
jM
m=1p
2
+BeDM 1
pp
n
(4.14)
v
u
u
t16
nrM12
phm+ep2D2
X
j=hm+1
λ(m)
jM
m=1p
2
+BeDM 1
pp
n,
56
4 Theoretical Analysis
where, in order to obtain the second inequality, we apply that for all non-negative
aRMand 0 < q < p it holds6
(`q-to-`pconversion) kakq=h1,aqi1
q
older
k1k(p/q)kaqkp/q1/q =M1
q1
pkakp.
(4.15)
Since the above holds for all nonnegative integers hm, it follows
Rr(Hp)v
u
u
t16
nmin
hm0rM12
phm+ep2D2
X
j=hm+1
λ(m)
jM
m=1p
2
+BeDM 1
pp
n
=v
u
u
t16
n
X
j=1
min rM12
p, ep2D2λ(m)
jM
m=1p
2
+BeDM 1
pp
n,
which completes the proof of the theorem.
Proof of Remark 1. To see that Remark 1 holds, notice that R(Hp=1)R(Ht) for
all t1 and thus, by choosing t= (log(M))in the bound of Theorem 4.5 (which is
valid only if t[1,2], i.e., Me2), we obtain
Rr(H1)v
u
u
t16
n
X
j=1
min rM12
t, et2D2λ(m)
jM
m=1t
2
+BeDM 1
tt
n
`t
2to`
v
u
u
t16
n
X
j=1
min rM, et2M2
tD2λ(m)
jM
m=1
+BeDM 1
tt
n
=v
u
u
t16
n
X
j=1
min rM, e3D2(log M)2λ(m)
jM
m=1
+Be3
2D(log M)
n,
which completes the proof.
Proof of Theorem 4.6. The eigendecomposition Eφ(x)φ(x) = P
j=1 λjujuj
yields
Pf2
w=E(fw(x))2=Ehw, φ(x)i2=w,(Eφ(x)φ(x))w=
X
j=1
λjhw,uji2,(4.16)
6We denote by aqthe vector with entries aq
iand by 1the vector with entries all 1.
57
II `p-norm Multiple Kernel Learning
and, for all j
ED1
n
n
X
i=1
σiφ(xi),ujE2=E1
n2
n
X
i,l=1
σiσlhφ(xi),ujihφ(xl),ujiσi.i.d.
=E1
n2
n
X
i=1 hφ(xi),uji2
=1
nDuj,1
n
n
X
i=1
Eφ(xi)φ(xi)
| {z }
=Eφ(x)φ(x)
ujE=λj
n.(4.17)
Therefore, we can use, for any nonnegative integer h, the Cauchy-Schwarz inequality
and a block-structured version of older’s inequality (see Lemma C.1) to bound the
local Rademacher complexity as follows:
Rr(Hp)
=Esup
fwHp:Pf2
wrw,1
n
n
X
i=1
σiφ(xi)
=Esup
fwHp:Pf2
wrh
X
j=1 pλjhw,ujiuj,
h
X
j=1 pλj1h1
n
n
X
i=1
σiφ(xi),ujiuj
+w,
X
j=h+1h1
n
n
X
i=1
σiφ(xi),ujiuj
C.-S., (4.16),(4.17)
rrh
n+Esup
fwHpw,
X
j=h+1h1
n
n
X
i=1
σiφ(xi),ujiuj
older
rrh
n+DE
X
j=h+1h1
n
n
X
i=1
σiφ(xi),ujiuj2,p
`p
2to`2
rrh
n+DM 1
p1
2E
X
j=h+1h1
n
n
X
i=1
σiφ(xi),ujiuj2
Jensen
rrh
n+DM 1
p1
2
X
j=h+1
Eh1
n
n
X
i=1
σiφ(xi),uji2
| {z }
(4.17)
λj
n
1
2
rrh
n+v
u
u
tD2M2
p1
n
X
j=h+1
λj.
Since the above holds for all h, the result now follows from A+Bp2(A+B)
for all nonnegative real numbers A, B (which holds by the concavity of the square root
58
4 Theoretical Analysis
function):
Rr(Hp)v
u
u
t2
nmin
0hnrh +D2M2
p1
X
j=h+1
λj=v
u
u
t2
n
X
j=1
min(r, D2M2
p1λj).
4.2.1 Lower Bound
In this subsection we investigate the tightness of our bound on the local Rademacher
complexity of Hp. To derive a lower bound, we consider the particular case where
variables φ1(x), . . . , φM(x) are i.i.d. For example, this happens if the original input
space Xis RM, the original input variable x X has i.i.d. coordinates, and the kernels
k1, . . . , kMare equal, but each acts on a different coordinate of x.
Lemma 4.7. Assume that the variables φ1(x), . . . , φM(x)are centered and identi-
cally independently distributed. Then, the following lower bound holds for the local
Rademacher complexity of Hpfor any p1:
Rr(Hp,D,M )RrM (H1,DM1/p,1).
Proof First note that, since the x(i)are centered and uncorrelated,
Pf2
w=M
X
m=1 wm, φm(x)2=
M
X
m=1 wm, φm(x)2.
59
II `p-norm Multiple Kernel Learning
Now it follows
Rr(Hp,D,M ) = Esup
w:Pf2
wr
kwk2,pDw,1
n
n
X
i=1
σiφ(xi)
=Esup
w:PM
m=1hw(m)m(x)i2r
kwk2,pD
*w, 1
n
n
X
i=1
σiφ(xi)+
Esup
w:m:hw(m)m(x)i2r/M
kw(m)k2,pD
kw(1)k=···=kw(M)k
*w,1
n
n
X
i=1
σiφ(xi)+
=Esup
w:m:hw(m)m(x)i2r/M
m:kw(m)k2DM1
p
M
X
m=1 *w(m),1
n
n
X
i=1
σiφm(xi)+
=
M
X
m=1
Esup
w(m):hw(m)m(x)i2r/M
kw(m)k2DM1
p
*w(m),1
n
n
X
i=1
σiφm(xi)+
so that we can use the i.i.d. assumption on φm(x) to equivalently rewrite the last term
as
Rr(Hp,D,M )
(φm(x))1mMi.i.d.
Esup
w(1):hw(1)1(x)i2r/M
kw(1)k2DM1
p
*Mw(1),1
n
n
X
i=1
σiφ1(xi)+
=Esup
w(1):hMw(1)1(x)i2rM
kMw(1)k2DM
1
p
*Mw(1),1
n
n
X
i=1
σiφ1(xi)+
=Esup
w(1):hw(1)1(x)i2rM
kw(1)k2DM
1
p
*w(1),1
n
n
X
i=1
σiφ1(xi)+
=RrM (H1,DM1/p,1).
60
4 Theoretical Analysis
In Mendelson (2003) it was shown that there is an absolute constant cso that if λ(1) 1
n
then for all r1
nit holds Rr(H1,1,1)qc
nP
j=1 min(r, λ(1)
j). Closer inspection of the
proof reveals that more generally it holds Rr(H1,D,1)qc
nP
j=1 min(r, D2λ(1)
j) if
λ(m)
11
nD2so that we can use that result together with the previous lemma to obtain:
Theorem 4.8 (Lower bound).Assume that the kernels are centered and identi-
cally independently distributed. Then the following lower bound holds for the local
Rademacher complexity of Hp. There is an absolute constant csuch that if λ(1) 1
nD2
then for all r1
nand p1,
Rr(Hp,D,M )v
u
u
tc
n
X
j=1
min rM, D2M2/pλ(1)
j.(4.18)
We would like to compare the above lower bound with the upper bound of Theorem 4.5.
To this end, note that for centered identical independent kernels the upper bound reads
Rr(Hp)v
u
u
t16
n
X
j=1
min rM, ceD2p2M2
pλ(1)
j+BeDM 1
pp
n,
which is of the order OqP
j=1 min rM, D2M2
pλ(1)
jand, disregarding the quickly
converging term on the right hand side and absolute constants, again matches the upper
bounds of the previous section. A similar comparison can be performed for the upper
bound of Theorem 4.6: by Remark 2 the bound reads
Rr(Hp)v
u
u
t2
n
X
j=1
min r, D2M2
p1λ(m)
jM
m=11,
which for i.i.d. kernels becomes q2/n P
j=1 min rM, D2M2
pλ(1)
jand thus, up to
constants, matches the lower bound. This shows that the upper bounds of the previous
section are tight.
4.3 Excess Risk Bounds
In this section, we show an application of our results to prediction problems, such as
classification or regression. To this aim, in addition to the data x1, . . . , xnintroduced
earlier in this thesis, let also a label sequence y1, . . . , yn[1,1] be given so that the
pairs (x1, y1),...,(xn, yn) are i.i.d. generated from a probability distribution. The goal
in statistical learning is to find a hypothesis ffrom a given class Fthat minimizes
the expected loss El(f(x), y), where l:R27→ [0,1] is a predefined loss function that
encodes the objective of the given learning/prediction task at hand. For example, the
61
II `p-norm Multiple Kernel Learning
hinge loss l(t, y) = max(0,1yt) and the squared loss l(t, y)=(ty)2are frequently
used in classification and regression problems, respectively.
Since the distribution generating the example/label pairs is unknown, the optimal
decision function
f:= argminf∈F El(f(x), y)
cannot be computed directly and a frequently used method consists in minimizing the
empirical loss instead
ˆ
f:= argminf∈F
1
n
n
X
i=1
l(f(xi), yi).
In order to evaluate the performance of this so-called empirical risk minimization al-
gorithm, we study the excess loss
P(lˆ
flf) := El(ˆ
f(x), y)El(f(x), y).
In Bartlett et al. (2005) and Koltchinskii (2006) it was shown that the rate of con-
vergence of the excess risk is basically determined by the fixed point of the local
Rademacher complexity. For example, the following result is a slight modification
of Corollary 5.3 in Bartlett et al. (2005) that is well-taylored to the class studied in
this thesis.7
Lemma 4.9 (Bartlett et al., 2005).Let Fbe an absolute convex class ranging
in the interval [a, b]and let lbe a Lipschitz continuous loss with constant L. Assume
there is a positive constant Fsuch that f F :P(ff)2F P(lflf). Then,
denoting by rthe fixed point of
2FL R r
4L2(F)
for all z > 0with probability at least 1ezthe excess loss can be bounded as
P(lˆ
flf)7r
F+(11L(ba) + 27F)z
n.
The above result shows that in order to obtain an excess risk bound on the multi-kernel
class Hpit suffices to compute the fixed point of our bound on the local Rademacher
complexity presented in Section 4.2. To this end, we show:
Lemma 4.10. Assume that kkkBalmost surely and let p[1,2]. For the fixed
point rof the local Rademacher complexity 2F LR r
4L2(Hp)it holds
r
min
0hm≤∞
4F2PM
m=1 hm
n+ 8FL v
u
u
tep2D2
n
X
j=hm+1
λ(m)
jM
m=1p
2
+4BeDFLM 1
pp
n.
7We exploit the improved constants from Theorem 3.3 in Bartlett et al. (2005) because an absolute
convex class is star-shaped. Compared to Corollary 5.3 in Bartlett et al. (2005) we also use a
slightly more general function class ranging in [a, b] instead of the interval [1,1]. This is justified
by Theorem 3.3 as well.
62
4 Theoretical Analysis
Proof For this proof we make use of the bound (4.12) on the local Rademacher
complexity. Defining a=4F2PM
m=1 hm
nand
b= 4FL v
u
u
tep2D2
n
X
j=hm+1
λ(m)
jM
m=1p
2
+2BeDFLM 1
pp
n,
in order to find a fixed point of (4.12), we need to solve for r=ar +b, which is
equivalent to solving r2(a+ 2b)r+b2= 0 for a positive root. Denote this solution
by r. It is then easy to see that ra+ 2b. Re-substituting the definitions of aand
byields the result.
We now address the issue of computing actual rates of convergence of the fixed point
runder the assumption of algebraically decreasing eigenvalues of the kernel matrices,
this means, we assume dm:λ(m)
jdmjαmfor some αm>1. This is a common
assumption and, for example, fulfilled by finite rank kernels and convolution kernels
(Williamson et al., 2001). Note that this implies
X
j=hm+1
λ(m)
jdm
X
j=hm+1
jαmdmZ
hm
xαmdx =dmh1
1αm
x1αmi
hm
=dm
1αm
h1αm
m.(4.19)
To exploit the above fact, first note that by `p-to-`qconversion
4F2PM
m=1 hm
n4FsF2MPM
m=1 h2
m
n24FsF2M22
ph2
m)M
m=12/p
n2
so that we can translate the result of the previous lemma by (4.13), (4.14), and (4.15)
into
rmin
0hm≤∞8Fv
u
u
t1
nF2M22
ph2
m
n+ 4ep2D2L2
X
j=hm+1
λ(m)
jM
m=1p
2
+4BeDFLM 1
pp
n.(4.20)
Inserting the result of (4.19) into the above bound and setting the derivative with
respect to hmto zero we find the optimal hmas
hm=4dmep2D2F2L2M2
p2n1
1+αm.
Re-substituting the above into (4.20) we note that
r=O sn2αm
1+αmM
m=1p
2!
63
II `p-norm Multiple Kernel Learning
so that we observe that the asymptotic rate of convergence in nis determined by the
kernel with the smallest decreasing spectrum (i.e., smallest αm).
Therefore, denoting dmax := maxm=1,...,M dm,αmin := minm=1,...,M αm, and hmax :=
4dmaxep2D2F2L2M2
p2n1
1+αmin , we can upper bound (4.20) by
r8Fr3αm
1αm
F2M2h2
maxn2+4BeDFLM 1
pp
n
8r3αm
1αm
F2Mhmaxn1+4BeDFLM 1
pp
n
16re3αm
1αm
(dmaxD2L2p2)1
1+αmin F
2αmin
1+αmin M1+ 2
1+αmin 1
p1nαmin
1+αmin
+4BeDFLM 1
pp
n.(4.21)
We have thus proved the following theorem, which follows by the above inequality,
Lemma 4.9, and the fact that our class Hpranges in BDM 1
p.
Theorem 4.11. Assume that kkkBand dmax >0and α:= αmin >1such that
for all m= 1, . . . , M it holds λ(m)
jdmaxjα. Let lbe a Lipschitz continuous loss with
constant Land assume there is a positive constant Fsuch that f F :P(ff)2
F P(lflf). Then for all z > 0with probability at least 1ezthe excess loss of the
multi-kernel class Hpcan be bounded for p[1,...,2] as
P(lˆ
flf)
min
t[p,2] 186r3αm
1αmdmaxD2L2t21
1+αFα1
α+1 M1+ 2
1+α1
t1nα
1+α
+47BDLM 1
tt
n+(22BDLM 1
t+ 27F)z
n.
We see from the above bound that convergence can be almost as slow as OpM1
pn1
2
(if at least one αm1 is small and thus αmin is small) and almost as fast as On1
(if αmis large for all mand thus αmin is large). For example, the latter is the case if
all kernels have finite rank and also the convolution kernel is an example of this type.
Note that, of course, we could repeat the above discussion to obtain excess risk
bounds for the case p2 as well, but since it is very questionable that this will lead
to new insights, it is omitted for the sake of simplicity.
4.3.1 Discussion of Bounds
In this section, we discuss the rates obtained from the bound in Theorem 4.11 for
the excess risk and compare them to the rates obtained using the global Rademacher
complexity bound of Corollary 4.4. To somewhat simplify the discussion, we assume
64
4 Theoretical Analysis
that the eigenvalues satisfy λ(m)
jdjα(with α > 1) for all mand concentrate on
the rates obtained as a function of the parameters n, α, M, D and p, while considering
other parameters fixed and hiding them in a big-O notation. Using this simplification,
the bound of Theorem 4.11 reads
t[p, 2] : P(lˆ
flf) = OtD2
1+αM1+ 2
1+α1
t1nα
1+α.(4.22)
On the other hand, the global Rademacher complexity directly leads to a bound on the
supremum of the centered empirical process indexed by F(see Appendix A) and thus
also provides a bound on the excess risk (see, e.g., Bousquet et al., 2004). Therefore,
using Corollary 4.4, wherein we upper bound the trace of each Jmby the constant B
(and subsume it under the O-notation), we have a second bound on the excess risk of
the form
t[p, 2] : P(lˆ
flf) = OtDM 1
tn1
2.(4.23)
To start the discussion, we first consider the case where p(log M), that is, the best
choice in (4.22) and (4.23) is t=p. Clearly, if we hold all other parameters fixed and
let ngrow to infinity, the rate obtained through the local Rademacher analysis is better
since α > 1. However, it is also of interest to consider what happens when the number
of kernels Mand the radius Dof the `p-ball can grow with n. In general, we have a
bound on the excess risk given by the minimum of (4.22) and (4.23); a straightforward
calculation shows that the local Rademacher analysis surpasses the global one whenever
M1
p
D=O(n).
Interestingly, we note that this “phase transition” does not depend on α(i.e. the
“complexity” of the individual kernels), but only on p.
If p(log M), the best choice in (4.22) and (4.23) is t= (log M). In this case
taking the minimum of the two bounds reads
P(lˆ
flf)Omin D(log M)n1
2,Dlog M2
1+αMα1
1+αnα
1+α,(4.24)
and the phase transition when the local Rademacher bound improves over the global
one occurs for M
Dlog M=O(n).
Finally, it is also interesting to observe the behavior of (4.22) and (4.23) as α .
In this case, it means that only one eigenvalue is nonzero for each kernel, that is, each
kernel space is one-dimensional. In other words, here we are in the case of “classical”
aggregation of Mbasis functions, and the minimum of the two bounds reads
P(lˆ
flf)Omin Mn1,min
t[p,2] tDM 1
tn1
2.(4.25)
65
II `p-norm Multiple Kernel Learning
In this configuration, observe that the local Rademacher bound is O(M/n) and does
not depend on D, nor p, any longer; in fact, it is the same bound that one would
obtain for the empirical risk minimization over the space of all linear combinations of
the Mbase functions, without any restriction on the norm of the coefficients—the `p-
norm constraint becomes void. The global Rademacher bound, on the other hand, still
depends crucially on the `p-norm constraint. This situation is to be compared to the
sharp analysis of the optimal convergence rates of convex aggregation of Mfunctions
obtained by Tsybakov (2003) in the framework of squared error loss regression, which
are shown to be
O min M
n,s1
nlog M
n!!.
This corresponds to the setting studied here with D= 1, p= 1 and α , and we
see that the bound (4.24) recovers (up to log factors) the above bound and the related
phase transition phenomenon.
4.3.2 Discussion of Assumption (U)
Assumption (U) is arguably quite a strong hypothesis for the validity of our results
(needed for 1 p2), which was not required for the global Rademacher bound.
A similar assumption was made in the recent work of Raskutti et al. (2010), where a
related MKL algorithm using an `1-type penalty is studied, and bounds are derived
that depend on the “sparsity pattern” of the Bayes function, i.e. how many coefficients
w
mare non-zero. If the kernel spaces are one-dimensional, in which case `1-penalized
MKL reduces qualitatively to standard lasso-type methods, this assumption can be
seen as a strong form of the so-called Restricted Isometry Property (RIP), which is
known to be necessary to grant the validity of bounds taking into account the sparsity
pattern of the Bayes function.
In the present work, our analysis stays deliberately “agnostic” (or worst-case) with
respect to the true sparsity pattern (in part because experimental evidence seems to
point towards the fact that the Bayes function is not strongly sparse), correspondingly
it could legitimately be hoped that the RIP condition, or Assumption (U), could be
substantially relaxed. Considering again the special case of one-dimensional kernel
spaces and the discussion about the qualitatively equivalent case α in the previous
section, it can be seen that Assumption (U) is indeed unnecessary for bound (4.25) to
hold, and more specifically for the rate of M/n obtained through local Rademacher
analysis in this case. However, as we discussed, what happens in this specific case is
that the local Rademacher analysis becomes oblivious to the `p-norm constraint, and
we are left with the standard parametric convergence rate in dimension M. In other
words, with one-dimensional kernel spaces, the two constraints (on the L2(P)-norm of
the function and on the `pblock-norm of the coefficients) appearing in the definition of
local Rademacher complexity are essentially not simultaneously active. Unfortunately,
it is clear that this property is not true anymore for kernels of higher complexity (i.e.
with a non-trivial decay rate of the eigenvalues). This is a specificity of the kernel setting
66
4 Theoretical Analysis
as compared to combinations of a dictionary of Msimple functions, and Assumption
(U) was in effect used to “align” the two constraints. To sum up, Assumption (U)
is used here for a different purpose from that of the RIP in sparsity analyses of `1
regularization methods; it is not clear to us at this point if this assumption is necessary
or if uncorrelated variables φ(x) constitutes a “worst case” for our analysis. We did
not succeed so far in relinquishing this assumption for p2, and this question remains
open.
To our knowledge, there is no previous existing analysis of the `p-MKL setting for
p > 1; the recent works of Raskutti et al. (2010) and Koltchinskii and Yuan (2010) focus
on the case p= 1 and on the sparsity pattern of the Bayes function. A refined analysis
of `p-regularized methods in the case of combination of Mbasis functions was laid out
by Koltchinskii (2009), also taking into account the possible soft sparsity pattern of
the Bayes function. Extending the ideas underlying the latter analysis into the kernel
setting is likely to open interesting developments.
4.4 Why Can Learning Kernels Help Performance?
In this section, we give a practical application of the bounds of the previous section. We
analyze the impact of the norm-parameter pon the accuracy of `p-norm MKL in various
learning scenarios. We show that, depending on the underlying truth, any value of p
can be optimal in practice. Since the trivial uniform-kernel-SVM baseline corresponds
to `p=2-norm MKL, this, in particular, shows that learning kernels can be beneficial. It
also shows that whether or not this is the case depends on the geometry of the learning
problem. This addresses the question “Can learning kernels help performance?” posed
by Corinna Cortes in an invited talk at ICML 2009 (Cortes, 2009). Our answer is thus
clearly affirmative: it is also in accordance with our empirical findings presented in the
upcoming chapter.
As will be shown in the application chapter of this thesis, there is empirical evidence
that the performance of `p-norm MKL crucially depends on the choice of the norm pa-
0 10,000 20,000 30,000 40,000 50,000 60,000
0.88
0.89
0.9
0.91
0.92
0.93
sample size
AUC
1−norm MKL
8/7−norm MKL
4/3−norm MKL
8/5−norm MKL
SVM
8/5−normuniform4/3−norm
8/7−norm1−norm
n=5,000
rameter p. For example, the figure shown
on the right-hand side, taken from Sec-
tion 5.3 of this thesis, shows the result of
a typical experiment with `p-norm MKL.
We first observe that, as expected, `p-norm
MKL enforces strong sparsity in the coeffi-
cients θmwhen q= 1, and no sparsity at all
for p= 2, which corresponds to the SVM
using a uniform kernel combination, while
intermediate values of penforce different
degrees of soft sparsity (understood as the
steepness of the decrease of the ordered co-
efficients θm). Crucially, the performance
(as measured by the AUC criterion) is not
67
II `p-norm Multiple Kernel Learning
w*
(a) β= 2
w*
(b) β= 1
w*
(c) β= 0.5
Figure 4.1: Two-dimensional illustration of the three analyzed learning scenarios, which differ in
the soft sparsity of the Bayes hypothesis w(parameterized by β). Left: A soft sparse w.
Center: An intermediate non-sparse w.Right: An almost-uniformly non-sparse w.
1.0 1.2 1.4 1.6 1.8 2.0
60 70 80 90 110
p
bound
(a) β= 2
1.0 1.2 1.4 1.6 1.8 2.0
40 45 50 55 60 65
p
bound
(b) β= 1
1.0 1.2 1.4 1.6 1.8 2.0
20 30 40 50 60
p
bound
(c) β= 0.5
Figure 4.2: Results of the simulation for the three analyzed learning scenarios (which were illustrated
in Figure 4.1). The value of the bound factor νtis plotted as a function of p. The minimum is
attained depending on the true soft sparsity of the Bayes hypothesis w(parameterized by β).
68
4 Theoretical Analysis
monotonic as a function of q;q= 1 (sparse MKL) yields significantly worse performance
than q=(regular SVM with sum kernel), but optimal performance is attained for
some intermediate value of q.
The aim of this section is to relate the theoretical analysis presented here to the
empirically observed phenomenon. This phenomenon can (at least partly) be explained
on base of our excess risk bound obtained in the last section. To this end, we will analyze
the dependancy of the excess risk bounds on the chosen norm parameter p. We will
show that the optimal pdepends on the geometrical properties of the learning problem
and that in general—depending on the true geometry—any pcan be optimal. Since
our excess risk bound is only formulated for p2, we will limit the analysis to the
range p[1,2].
To start with, note that the choice of ponly affects the excess risk bound in the
factor (cf. Theorem 4.11 and Equation (4.22))
νp:= min
t[p,2] Dpt2
1+αM1+ 2
1+α1
t1.
So we write the excess risk as P(lˆ
flf) = O(νp) and hide all variables and constants
in the O-notation for the whole section (in particular the sample size nis considered a
constant for the purposes of the present discussion). It might surprise the reader that
we consider the term depending on Dalthough it seems from the bound that it does
not depend on p. This stems from a inconspicuous fact that we have ignored in this
analysis so far: Dis related to the approximation properties of the class, i.e., its ability
to attain the Bayes hypothesis. For a “fair” analysis we should take the approximation
properties of the class into account.
To illustrate this, let us assume that the Bayes hypothesis belongs to the space
Hand can be represented by w; assume further that the block components satisfy
kw
mk2=mβ,m= 1, . . . , M, where β0 is a parameter parameterizing the “soft
sparsity” of the components. For example, the cases β {0.5,1,2}are shown in
Figure 4.1 for M= 2 and assuming that each kernel has rank 1 (thus being isomorphic
to R). If nis large, the best bias-complexity tradeoff for a fixed pwill correspond to a
vanishing bias, so that the best choice of Dwill be close to the minimal value so that
wHp,D, that is, Dp=||w||p. Plugging in this value for Dp, the bound factor νp
becomes
νp:= kwk
2
1+α
pmin
t[p,2] t2
1+αM1+ 2
1+α1
t1.
We can now plot the value νpas a function of pfor special choices of α,M, and β.
We realize this simulation for α= 2, M= 1000, and β {0.5,1,2}, which means we
generate three learning scenarios with different levels of soft sparsity parameterized by
β. The results are shown in Figure 4.2. Note that the soft sparsity of wis increased
from the left hand to the right hand side. We observe that in the “soft sparsest”
scenario (β= 2, shown on the left-hand side) the minimum is attained for a quite small
p= 1.2, while for the intermediate case (β= 1, shown at the center) p= 1.4 is optimal,
and finally in the uniformly non-sparse scenario (β= 2, shown on the right-hand side)
69
II `p-norm Multiple Kernel Learning
the choice of p= 2 is optimal (although even a higher pcould be optimal, but our
bound is only valid for p[1,2]).
This means that if the true Bayes hypothesis has an intermediately dense repre-
sentation, our bound gives the strongest generalization guarantees to `p-norm MKL
using an intermediate choice of p. This is also intuitive: if the truth exhibits some soft
sparsity but is not strongly sparse, we expect non-sparse MKL to perform better than
strongly sparse MKL or the unweighted-sum kernel SVM.
4.5 Summary and Discussion
We justified the proposed `p-norm multiple kernel learning methodology from a the-
oretical point of view. Our analysis is built upon the established framework of local
Rademacher complexities (Koltchinskii and Panchenko, 2002; Koltchinskii, 2006): we
derived tight upper bounds on the local Rademacher complexity of the hypothesis class
associated with `p-norm MKL, yielding excess risk bounds with fast convergence rates
of the order O(nα
1+α), where αis the minimum eigenvalue decay rate of the individ-
ual kernels, thus being tighter than existing bounds for `p>1-norm MKL, which only
achieved slow convergence rates of the order O(n1
2). We also showed a lower bound
on the local complexity, which, up to constants, matches the upper bounds, showing
that our results are tight. For the upper bounds to hold, we used an assumption on the
uncorrelatedness of the kernels; a similar assumption was also recently used by Raskutti
et al. (2010), but in the different context of sparse recovery. We also remark that it
would be interesting to even refine the analysis beyond local Rademacher complexities;
Bartlett and Mendelson (2006) would be a good starting point for such an undertaking.
Beside reporting on the worst-case bounds, we also connected the minimal values
of the bounds with the geometry of the underlying learning scenario (namely, the
soft sparsity of the Bayes hypothesis), in particular, demonstrating that for a large
range of learning scenarios `p-norm MKL attains a strictly “better” (i.e., lower) bound
than classical `1-norm MKL and the SVM using a uniform kernel combination. This
theoretically justifies using `p-norm MKL and multiple kernel learning in general. This
is notable since even the tightest previous analyses such as the one carried out by Cortes
et al. (2010a) were not able to answer the research question “Can learning kernels help
performance?” (Cortes, 2009).
70
5 Empirical Analysis and Applications
5 Empirical Analysis and Applications
In this chapter, we present an empirical analysis of `p-norm MKL. We first exemplify
our experimental methodology on artificial data and real-world data, shedding light
on the appropriateness of a high/low pin several learning scenarios that differ in the
sparseness of the underlying Bayes hypothesis. We then apply the proposed `p-norm
MKL on diverse, highly topical real-world problems from the domains of bioinformatics
and computer vision that traditionally come with various heterogeneous feature groups
/ kernels and thus are very attractive for multiple kernel learning.
The main contributions in this chapter are the following:
We investigate why learning kernels can help performance on artificial data, where we
find the bounds well predicting the optimal value of the parameter p.
We show that `p-norm MKL achieves accuracies in highly topical and central real-
world applications from the domains of bioinformatics and computer vision that go
beyond the state-of-the-art:
(a) we apply `p-norm MKL to protein fold prediction, experimenting on the recently
released dingshen data, yielding an accuracy improvement of 6%
(b) we apply `p-norm MKL to DNA transcription splice site (TSS) prediction, train-
ing our large-scale implementation on 60,000 training examples and outperform-
ing the uniform-sum kernel SVM (which recently was confirmed to be leading in
a comparison of 19 DNA TSS predictors)
(c) we apply `p-norm MKL to visual image classification, a computer vision applica-
tion, significantly increasing the predictive accuracy for all concept classes and
by 1.5 AP point in average.
We introduce a methodology based on kernel alignments and the theoretical bounds
to investigate/estimate our findings, i.e., the optimality of a particular p.
Parts of this chapter are based on:
M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. M¨uller, and A. Zien. Efficient
and accurate lp-norm multiple kernel learning. In NIPS 2009, pages 997–1005. MIT
Press, 2009.
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning.
Journal of Machine Learning Research (JMLR), 12:953–997, 2011.
A. Binder, S. Nakajima, M. Kloft, C. M¨uller, W. Wojcikiewicz, U. Brefeld, K.-R.
M¨uller, and M. Kawanabe. Multiple kernel learning for object classification. Submit-
ted to IEEE TPAMI. A preliminary version is published in IBIS 2010.
5.1 Goal and Experimental Methodology
Goal The goal of the application chapter of this thesis is to confirm the following
hypotheses:
1. In practical applications that are characterized by multiple heterogeneous kernels
/ feature mappings, `p-norm MKL often yields more accurate prediction models
71
II `p-norm Multiple Kernel Learning
than `1-norm MKL, the SVM using a uniform kernel combination, and the single-
kernel SVM with the kernel tuned by model selection.
2. For medium and large training set sizes n, the bounds presented in the previ-
ous chapter reflect the empirically best-performing parameter p(rendering them
attractive for model selection) while for small nthe Bayes hypothesis might be
insufficiently approximated, leading to an overestimation of the optimal p.
5.1.1 Methodology
To investigate the validity of the above hypotheses, we experiment on cutting-edge
data sets taken from diverse application domains. Thereby, we deploy the following
experimental protocol:
1. Application description and goal First, the application’s problem setting
is described and the goal is defined; usually this is maximizing the prediction
accuracy or an application-specific standard performance measure such as MCC,
AUC, or AP (see below for definitions of this measures).
2. Experimental setup Second, we report on the employed data set and the
experimental setup. Usually, we deploy random sampling of disjoint training,
validation, and test sets, where the number of repetitions is chosen such that the
standard errors indicate significance of the results (usually 100–250 repetitions).
We always compare our results to the ones achieved by `1-norm MKL and the
SVM using a uniform kernel combination. Often, the single-kernel-SVM perfor-
mance is known from the literature, where it was either compared to `1-norm
MKL or the SVM using an uniform kernel combination; in all of our experiments,
the latter two baselines are contained in the set of models, thus allowing to also
compare `p-norm MKL to the single-kernel SVM without actually recomputing
it. However, if the single-kernel performance is not known from the literature, we
compute this baseline. We compare various `p-norm MKL variants and deploy a
local search for the optimal p. In all experiments, we optimize the relative duality
gap up to a precision of 0.001.
3. Results The results of the experiment are shown in terms of an application-
specific standard performance measure along with corresponding standard errors.
In one case we use a benchmark data set that comes with a single test set; here,
we compute the standard deviation by repeatedly and randomly splitting the test
set in two parts of equal size.
4. Interpretation The kernel weights θas output by MKL are shown. This
indicates which kernels contribute most to the final MKL solution. We also report
on the alignments of the given kernel matrices with respect to the Frobenius scalar
product. This can be of help in identifying redundant and complementary kernels.
72
5 Empirical Analysis and Applications
5. Bound We compute the factor of the theoretical bound that depends on the
norm parameter p, thus being relevant for the estimation of the appropriateness of
a small/large p. We focus on global Rademacher complexity as it does not involve
the additional parameter αthat is unknown in practice, which corresponds to
taking α1 in the local complexity (this can be seen as a conservative choice:
the (empirical) global bound can directly be evaluated from the kernel matrices
at hand without knowledge of the true spectra of the kernels). As discussed in
Section 4.4, we require the Bayes hypothesis to be in the hypothesis set so that the
relevant bound factor is νglob
p= mintptkwBayesk2,p M1/t. Since this requires
knowledge of the Bayes hypothesis wBayes, we take the one hypothesis ˆ
wmkl
output by MKL that performed best in the experiments (using the empirically
optimal p) as an approximation of wBayes. Note that we expect this approximation
getting tighter with increasing n. For the computation of ˆ
wmkl, we take medians
over 100 runs; for the multi-class data sets the medians are also computed over
the classes.
6. Summary and discussion Each experiment is concluded by a short discussion.
5.1.2 Evaluation measures
To assess the effectiveness of the compared methods, we use the following evaluation
measures. Let (xt, yt)tTbe a set of example/label test pairs and f:X Y a
hypothesis function; let Y={−1,1}.
The test error is defined as TE = PtT
1
f(xt)6=yt, where
1
denotes the indicator
function.
The area under the receiver operating characteristic curve (AUC) is defined as
the integral of the receiver operating characteristic (ROC) curve. Thereby, the
ROC curve is the true positive rate (fraction of correctly positively classified data
points) as a function of the false positive rate (fraction of incorrectly positively
classified data points). For (multi-) kernel methods and linear methods the vari-
ous values of the false positive rate correspond to translating the bias b. The AUC
ranges in the interval [0,1], where 0.5 is attained by a completely random clas-
sifier (random label assignment); thus, reasonable classifiers should attain values
higher than 0.5.
The Matthews correlation coefficient (MCC) or φcoefficient is defined as
MCC = TP ×TN FP ×FN
p(TP + FP)(TP + FN)(TN + FP)(TN + FN) ,
where FP and TP are the number of correctly and incorrectly positively classified
data points, respectively, and FN and TN are defined analogously for negatively
classified data points. The MCC ranges in the interval [1,1] where MCC = 0
corresponds to a random prediction.
73
II `p-norm Multiple Kernel Learning
Average precision: let t1, . . . , tnbe the indexes corresponding to sorting the clas-
sifier outputs f(xt) in decreasing order. Denote the number of positive and
negative examples by n+and n, respectively; define the precision at kby
pk=1
kPk
i=1
1
yti=1 (the number of correctly positively classified examples among
the top-kranked ones). Then, the average precision is defined as
AP = 1
n+
n
X
k=1
pk
1
ytk=1
(thus it is the precision at kaveraged over all k). The AP ranges in the interval
[0,1] and somewhat depends on the skewness of the classes: a random classifier
attains AP = n+
n++n.
Probably the most standard measure in machine learning is the test error. However, in
applications with unbalanced class sizes it is recommended to use one of the alternative
measures, that is, MCC, AUC, or AP. Traditionally, some applications come with
certain standard evaluation measures. For example, in image categorization, as an
information retrieval application, AP is the standard measure and, in bioinformatics,
MCC and AUC are frequently used.
Kernel aligments As a tool for the explorative analysis of the obtained results, we also
employ the kernel alignment introduced by Cristianini et al. (2002), which measures
the similarity of two matrices and can be interpreted as a (hyper-) kernel acting on
the space of kernel functions (Ong et al., 2005). Let Kand ˜
Kbe two kernel matrices
corresponding to kernels kand ˜
k, respectively; then, the alignment of Kand ˜
Kis
defined as the cosine of the angle between Kand ˜
K:
A(K, ˜
K) := hK, ˜
KiF
kKkFk˜
KkF
,(5.1)
where hK, ˜
KiF:= Pn
i,j=1 k(xi, xj)˜
k(xi, xj) denotes the Frobenius scalar product and
kKkF:= phK, KiFthe corresponding norm.
In many applications, centering of the kernels is recommended before comput-
ing the alignment (Cortes et al., 2010b); for example, SVMs and many other kernel-
based learning algorithms are invariant against mean shifts in the corresponding Hilbert
spaces. Centering Kin the corresponding feature space can easily be achieved by com-
puting the product HKH, where H:= I1
n11>with Ibeing the identity matrix of
size nand 1a column vector of all ones.
5.2 Case Study 1: Toy Experiment
We now present a toy experiment that serves the following purposes:
empirically confirming some of our claims in a controlled environment, that is,
one where we know the underlying distribution generating the data
74
5 Empirical Analysis and Applications
irrelevant
feature w2
relevant
feature w1
µ+
µ
(a) Illustration
0
0.2
0.4
0.6
0.8
1
uniform scenario
0 10 20 30 40 50
0
0.2
0.4
0.6
0.8
1
Bayes wi
0 10 20 30 40 50
index i 0 10 20 30 40 50
sparse scenario
(b) Scenarios
Figure 5.1: The toy experiment: experimental design.
exemplary application of the experimental protocol described in the previous
section.
The claims we would like to confirm are the following:
1. The bounds reflect the empirically optimal pif the Bayes hypothesis is known.
2. The pairwise kernel alignments can be successfully used to explore dependancies
in the kernel set.
3. The choice of the norm parameter pcan be crucial for the generalization perfor-
mance of `p-norm MKL.
4. The optimality of a particular pdepends on the underlying geometry (i.e, the
sparsity) of the underlying Bayes hypothesis.
Experimental setup We construct six artificial data sets in which we vary the degree of
sparsity in the true Bayes hypothesis w. For each data set, we generate an n-element,
balanced sample D={(xi, yi)}n
i=1 from two d= 50-dimensional isotropic Gaussian
distributions with equal covariance matrices C=Id×dand equal, but opposite, means
µ+=ρ
kwk2wand µ=µ+. Thereby, wis a binary vector, i.e., i:wi {0,1},
encoding the true underlying data sparsity as follows. Zero components wi= 0 clearly
imply identical means of the two classes’ distributions in the ith feature set; hence, the
latter does not carry any discriminating information. In summary, the fraction of zero
components, sparsity(w) = 1 1
dPd
i=1 wi, is a measure for the feature sparsity of the
learning problem. This is illustrated in Figure 5.1 (a).
We generate six different wthat differ in their value of sparsity(w); see Fig-
ure 5.1 (b), which shows bar plots of the wof the various scenarios considered. For
each of the wwe generate m= 250 data sets D1,...,Dmfixing ρ= 1.75. Then, each
feature is input into a linear kernel and the resulting kernel matrices are multiplica-
tively normalized as described in Section 3.3.1. Hence, sparsity(w) gives the fraction
75
II `p-norm Multiple Kernel Learning
of noise kernels in the working kernel set. Next, classification models are computed
by training `p-norm MKL for p= 1,4/3,2,4,on each Di. Soft margin parame-
ters Care tuned on independent 1,000-elemental validation sets by grid search over
C10ii=4,3.5,...,0(optimal Cs are attained in the interior of the grid).
The relative duality gaps were optimized up to a precision of 103. The simulation is
realized for n= 50. We report on test errors evaluated on 1,000-elemental independent
test sets.
Results The results in terms of test errors are shown in Figure 5.2 (a). As expected,
`1-norm MKL performs best and reaches the Bayes error in the sparsest scenario,
where only a single kernel carries the whole discriminative information of the learning
problem. However, in the other scenarios it mostly performs worse than the other
MKL variants. This is remarkable because the underlying ground truth, i.e., the vector
w, is sparse in all but the uniform scenario. In other words, selecting this data set
may imply a bias towards `1-norm. In contrast, the vanilla SVM using a uniform
kernel combination performs best when all kernels are equally informative; however,
its performance does not approach the Bayes error. In contrast, the truly uniform `-
norm MKL variant succeeds in approaching the Bayes error although it does not reach
it completely. However, it should be kept in mind that such a uniform scenario might be
quite artificial. The non-sparse `4/3-norm MKL variants perform best in the balanced
scenarios, i.e., when the noise level is ranging in the interval 64%-92%. Intuitively, the
non-sparse `4/3-norm MKL is the most robust MKL variant, achieving test errors of
less than 12% in all scenarios. Tuning the sparsity parameter pfor each experiment,
`p-norm MKL achieves low test error across all scenarios.
Interpretation We also consider the weights ˆ
wmkl output by MKL and compare them to
the true underlying wBayes by computing the root mean `2(model) errors, ME( ˆ
wmkl) :=
kζ(ˆ
wmkl)ζ(wBayes)k2, where ζ(v) = v
kvk2. The results are shown in Figure 5.2 (b). We
observe that the model errors reflect the corresponding test errors well. This observation
can be explained by model selection favoring strongly regularized hypotheses for such
a small n, leading to the observed agreement between test error and model error.
We are also interested in whether pairwise kernel alignments can be used as a
measure for the dependancy of the kernels. For the toy experiment, we know the true
dependancies: each noise kernel is independent of any other kernel by construction (re-
gardless of whether the latter is a noise kernel as well or an information-carrying one);
on the other hand, we would expect the informative kernels being mutually aligned
since they are correlated via the labels (but only slightly so because the conditional
distributions of the features are independent; this is a specialty of our data generation
setup). The empirical kernel alignments are shown in Figure 5.2 (c). Indeed, as ex-
pected, we can see from the plot that overall the alignments are rather small although
we can observe moderate alignments between informative kernels. This indicates the
usefulness of this explorative method in practice.
Bound We evaluate the theoretical bound factor for the various scenarios, exploiting
the fact that the Bayes hypothesis is completely known to us. To analyze whether the p
that are minimizing the bound are reflected in the empirical result, we compute the test
76
5 Empirical Analysis and Applications
Uniform
(a) Test errors
0 44 64 82 92 98
0
0.5
1.0
1.5
sparsity(w) [in %]
root mean L2 errors
Uniform
Sparse
Uniform
(b) Model errors
uniform scenario
10
20
30
40
50
10 20 30 40 50
10
20
30
40
50
10 20 30 40 50
sparse scenario
10 20 30 40 50 0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
kernel id
kernel id
(c) Alignments
50
150
250
p
bound factor
50
100
150
40
60
80
1 1.5 2
25
35
45
1 1.5 2
14
18
22
1 1.5 2
4
8
12
Uniform
Sparse
empirically optimal p
theoretically optimal p empirically optimal p
(d) Bound
Figure 5.2: The toy experiment: results and analysis.
77
II `p-norm Multiple Kernel Learning
errors of the various MKL variants again, using the setup above except that we employ
a much finer grid for finding the optimal p. The results are shown in Figure 5.2 (d).
We can observe that the minima of the bounds clearly reflect the pthat are found to
also work well empirically: In the sparsest scenario (shown on the lower right-hand
side), the bound predicts p[1,1.14] to be optimal and indeed, in the experiments, all
p[1,1.15] performed best (and equally well) while the next higher pcontained in our
grid, namely p= 1.19, already has a slightly (but significantly) worse test error—in
striking match with our bounds. In the second sparsest scenario, the bound predicts
p= 1.25 and we empirically found p= 1.26. In the non-sparse scenarios, intermediate
values of p[1,2] are optimal (see Figure for details)—again we can observe a good
accordance of the empirical and theoretical values. In the extreme case, i.e., the uniform
scenario, the bound indicates a pthat lies well beyond the valid interval of the bound
(i.e., p > 2) and this is also what we observed empirically: p[4,] worked best in
our experiments.
Summary and discussion We can conclude that the results confirm our two claims
stated at the beginning of this subsection: first, the theoretical bounds reflect the em-
pirically observed optimal pin the idealized setup where we know the Bayes hypothesis;
second, we showed that kernel alignments reflect the dependancy structure of the kernel
set; third, the choice of the norm parameter pcan be crucial to obtaining high test set
accuracy (for example, in the sparsest scenario `-norm MKL has 43% test error while
`1-norm MKL reaches the Bayes error with an test error of 4%; cf. Figure 5.2(a)).
Last, we observed that the optimality of a particular pstrongly depends on the
geometry of the learning task: the sparsity of the underlying Bayes hypothesis w. If
wcontains many irrelevant features, a small pis beneficial whereas intermediate or
large potherwise. This raises the question into which scenario practical applications
fall. For example, do we rather encounter a “sparse” or non-sparse scenario in bioin-
formatics? This will be investigated in the upcoming real-world experiments. Our
answer is somewhat mixed: we will show that, for example, bioinformatics is a domain
too diverse to be categorized into a single learning scenario (e.g., sparse). In fact, we
will present diverse bioinformatics applications covering the whole range of scenarios:
sparse, intermediate non-sparse, and uniform (see Section 5.4). However, as observed
from the toy experiment, by appropriately tuning the norm parameter, `p-norm MKL
can prove robust in all scenarios.
5.3 Case Study 2: Real-World Experiment (TSS)
In this section, we study transcription splice site (TSS) detection, a highly topical,
large-scale bioinformatics application. The purposes of this section are the following:
1. studying the prediction accuracy of `p-norm MKL for TSS detection
2. exemplarily carrying out the proposed methodology for a real-world application
3. investigating the impact of the sample size non the performance of `p-norm MKL
78
5 Empirical Analysis and Applications
(as we have access to 93,500 data points for this application)
4. investigating the impact of the sample size on the estimation of the optimal pby
means of the theoretical bounds.
Figure 5.3: Figure taken from Alberts et al. (2002).
Application description and goal
This experiment aims at detecting
transcription start sites (TSS) of
RNA Polymerase II binding genes
in genomic DNA sequences. The
accurate detection of the tran-
scription start site (see the figure
to the right for an illustration of
the start site) is crucial to identi-
fying genes and their promoter regions and can be regarded as a first step in deci-
phering the key regulatory elements that determine transcription. Those transcription
start sites are located in the core promoter region and, for the majority of species, their
localization must be achieved without the help of massive sequencing: for some species,
including human, large scale sequencing projects of complete mRNAs have been under-
taken, but many low copy genes still evade being sequenced, leaving a huge demand for
accurate ab initio TSS prediction algorithms. Consequently, a fairly large number of
TSS finders, exploiting the fact that the features of promoter regions and the transcrip-
tion start sites are different from the features of other genomic DNA (Bajic et al., 2004),
have been developed, which called for a comparison. To this end, the recent study of
Abeel et al. (2009) compared 19 state-of-the-art promoter prediction programs in terms
of their predictive accuracy. Here, the winning program of Sonnenburg et al. (2006b),
entitled ARTS, was perceived to be leading.
Interestingly, ARTS deploys an SVM using a uniform combination of five het-
erogeneous kernels capturing various aspect of the promoter region, rendering ARTS
attractive for the application of `p-norm MKL. In this thesis, we study whether `p-norm
MKL can be used to even further improve this cutting-edge accuracy. Note that a com-
parison to single-kernel SVMs was already carried out in Sonnenburg et al., 2006b (see
Tables 2 and 3), showing that the uniform kernel combination is more accurate than
the best single-kernel SVM, rendering another comparison unnecessary.
Experimental setup For our experiments we use the data set from Sonnenburg et al.
(2006b), which contains a curated set of 8,508 TSS annotated genes utilizing dbTSS
version 4 (Suzuki et al., 2002) and refseq genes. These are translated into positive
training instances by extracting windows of size [1000,+1000] around the TSS. Similar
to Bajic et al. (2004), 85,042 negative instances are generated from the interior of the
gene using the same window size. Following Sonnenburg et al. (2006b), we employ five
different kernels representing the TSS signal (weighted degree with shift), the promoter
(spectrum), the 1st exon (spectrum), angles (linear), and energies (linear). Optimal
kernel parameters are determined by model selection in Sonnenburg et al. (2006b). The
kernel matrices are spherically normalized as described in Section 3.3.1. We reserve
79
II `p-norm Multiple Kernel Learning
13,000 and 20,000 randomly drawn instances for validation and test sets, respectively,
and use the remaining 60,000 as the training pool. Soft margin parameters Care tuned
on the validation set by grid search over C2[2,1,...,5] (optimal Cs are attained in
the interior of the grid). Figure 5.4 (a) shows test errors for varying training set sizes
drawn from the pool; training sets of the same size are disjoint. Error bars indicate
standard errors of repetitions for small training set sizes.
Results Regardless of the sample size, `1-norm MKL is significantly outperformed by
the SVM using a uniform kernel combination. In contrast, non-sparse MKL achieves
significantly higher AUC values than the SVM using a uniform combination of kernels
for sample sizes up to 40,000. The scenario is well suited for `4/3-norm MKL which per-
forms best. Finally, for 60,000 training instances, regularization becomes less and less
important so that all methods except `1-norm MKL perform similarly with, again, `4/3-
norm MKL performing best among all prediction models. This superior performance
of non-sparse MKL regardless of the sample size is remarkable and of significance for
the application domain: as indicated above, the method using the unweighted sum of
kernels has recently been confirmed to be leading in a comparison of 19 state-of-the-art
promoter prediction programs and our experiments suggest that this accuracy can be
further increased by non-sparse MKL, especially when data is spare. But obtaining
data for this application is costly: even the most modern sequencing techniques used in
the domain are not completely automatic and need human input by an biology expert.
Therefore, our method is especially appealing for genomes that are hardly sequenced.
Interpretation To further explore possible reasons for optimality of a non-sparse `p-
norm in the above experiments, we also recall the single-kernel performance of the
kernels: TSS signal 0.89, promoter 0.86, 1st exon 0.84, angles 0.55, and energies 0.74,
for fixed sample size n= 2000. This indicates that there are three highly and two
moderately informative kernels. While non-sparse MKL distributes the weights over
all kernels (see Figure 5.4 (b)), sparse MKL focuses on the best kernel. However, the
superior performance of non-sparse MKL means that dropping the remaining kernels
is detrimental, indicating that they may carry additional discriminative information.
To investigate this hypothesis, we compute the pairwise alignments8of the cen-
tered kernel matrices, i.e., A(i, j) = <Ki,Kj>F
kKikFkKjkF, with respect to the Frobenius dot
product (e.g., Golub and van Loan, 1996). The computed alignments are shown in
Figure 5.4 (c). One can observe that the three relevant kernels are highly aligned,
as was to be expected, since they are correlated via the labels. However, the energy
kernel shows only slight correlations with the remaining kernels, which is surprisingly
little compared to its single kernel performance (AUC=0.74). We conclude that this
kernel carries complementary information about the learning problem and should thus
be included in the resulting kernel mixture. This is precisely what is done by non-
sparse MKL, as can be seen in Fig. 5.4 (b), and the reason for the empirical success of
non-sparse MKL on this data set.
8The alignments can be interpreted as empirical estimates of the Pearson correlation of the kernels
(Cristianini et al., 2002).
80
5 Empirical Analysis and Applications
0 10,000 20,000 30,000 40,000 50,000 60,000
0.88
0.89
0.9
0.91
0.92
0.93
sample size
AUC
1−norm MKL
8/7−norm MKL
4/3−norm MKL
8/5−norm MKL
SVM
(a) Results (in AUC)
1−norm
n=5,000
8/7−norm 4/3−norm 8/5−norm unw.−sum
n=20,000n=60,000
(b) MKL weights θ
(c) Alignments
1 1.2 1.4 1.6 1.8 2
2.2
2.4
2.6
2.8
3
p
bound factor
empirically
optimal p
n= 50
n= 250
n= 1,000
n= 5,000
n=15,000
n=60,000
theoretically
optimal p
(d) Bound
50 250 1,000 5,000 15,000 60,000
0
0.2
0.4
0.6
0.8
1
n
||wm||
TSS (m=1)
promoter (m=2)
1st exon (m=3)
angle (m=4)
energy (m=5)
trend
lines
(e) MKL block weights kwmk
Figure 5.4: The transcription splice site (TSS) detection experiment: results and analysis.
81
II `p-norm Multiple Kernel Learning
Bounds For various training set sizes n, we compute the bound factor as a function of p.
The results are shown in Figure 5.4 (d). We can observe the optimal p, as predicted by
the theoretical bounds, apparently converging towards the empirically optimal p(which
is p= 4/3) when increasing the training set size n. This results from the approximation
of the Bayes hypothesis being tighter for large n: we compute ˆ
wmkl as output by MKL
for various n(shown in Figure 5.4 (e)) and observe that, with increasing n, MKL tends
to discard the angle and energy kernels; the latter two kernels are of finite rank. Their
limited complexity renders them less effective for large n. We thus conjecture that the
Bayes hypothesis is likely to focus on the first three kernels (with an emphasis on the
TSS kernel, which also empirically performs best). In this sense, it is not surprising
that the bound “overestimates” the optimal p: since ˆ
wmkl is “too uniform” compared
to the Bayes-w, the bound (which has ˆ
wmkl as an input) leads to overly high estimates
of the optimal p. However, it is interesting that already for n= 60,000 the bound gives
a good estimate (p= 1.48) of the empirically optimal p= 1.33.
Summary and discussion Exemplarily, we carried out the proposed experimental
methodology for transcription splice site detection, a cutting-edge bioinformatics ap-
plication. The data set we experimented on was also used in the previous study of
Sonnenburg et al. (2006a) and comprises 93,500 instances. We showed that regardless
of the sample size, `p-norm MKL can improve the prediction accuracy over `1-norm
MKL and the SVM using a uniform kernel combination (which itself outperforms the
single-kernel SVM using the kernel determined by model selection). This is remarkable
and of significance for the application domain: the method using a uniform kernel com-
bination was recently confirmed to be the most accurate TSS detector in a comparison
of 19 state-of-the-art models Abeel et al. (2009). Furthermore, we observed that the
choice of nhad no impact on the empirical optimality of a particular p. We also eval-
uated the theoretical bounds and found that with increasing nthey give a more and
more accurate estimation of the optimal pthrough an increasingly tight approximation
of the Bayes hypothesis.
5.4 Bioinformatics Experiments
In this section, we apply `p-norm MKL to diverse, highly topical applications taken
from the domain of bioinformatics. Many bioinformatics applications are too complex
to be described by a single type of feature descriptors; for example, finding splice
sites in genes can only be done moderately well on the raw sequence information—
incorporating complementary information such as binding energies or twistedness of
DNA increases the chance of finding a splice site. A challenge is here to deal with the
additional noise introduced by those so-called weak features, making it very appealing
for multiple kernel learning.
We recall the toy experiment, which showed that the amount by which `p-norm
MKL is beneficial in such applications crucially depends on the sparsity/uniformness
of the learning task at hand. We are therefore interested in finding out into which
82
5 Empirical Analysis and Applications
scenario particular bioinformatics applications fall. We show that bioinformatics is a
domain so diverse that we can encounter scenarios ranging from sparse over non-sparse
to uniform ones. Nevertheless, appropriately tuning the norm parameter p,`p-norm
MKL proves robust in all applications considered.
5.4.1 Protein Subcellular Localization—A Sparse Scenario
The prediction of the subcellular localization of proteins is one of the rare empirical
success stories of `1-norm-regularized MKL (Ong and Zien, 2008; Zien and Ong, 2007):
after defining 69 kernels that capture diverse aspects of protein sequences, classical, `1-
norm MKL could raise the predictive accuracy significantly above that of vanilla SVMs
using a uniform kernel combination or the best kernel determined by model selection—
thereby also surpassing established prediction systems for this problem. This has been
demonstrated on 4 data sets, corresponding to 4 different sets of organisms (plants,
non-plant eukaryotes, Gram-positive and Gram-negative bacteria) with differing sets
of relevant localizations. In this thesis, we investigate the performance of `p-norm MKL
on the same 4 data sets.
Figure 5.5: Figure taken from Scolkopf
et al. (2004).
Application description and goal There are
several possible locations for a protein to reside
in a cell: lysosome, mitochondrion, nucleus, etc.
(see the Figure to the right for a list of possible
locations). Predicting such a location is the goal
of protein subcellular localization. The accurate
localization of proteins in cells is important be-
cause the location of a protein is very closely
connected to its function. This is especially rel-
evant in pharmacology because the knowledge
of the function of proteins is crucial when de-
signing new drugs. Unfortunately, the manual
localization of the protein is a time-consuming
task. This renders bioinformatics approaches
appealing.
Experimental setup We downloaded the kernel matrices of all 4 data sets from
http://www.fml.tuebingen.mpg.de/raetsch/suppl/protsubloc/. The kernel ma-
trices are multiplicatively normalized as described in Section 3.3.1. The experimental
setup used here is related to that of Ong and Zien (2008), although it deviates from
it in several details. For each data set, we perform the following steps for each of the
30 predefined splits in training set and test set (downloaded from the same URL): We
consider norms p {1,1.01,1.03,1.07,1.14,1.33,1.6,2}and regularization constants
C {1/32,1/8,1/2,1,2,4,8,32,128}. For each parameter setting (p, C), we train `p-
norm MKL using a 1-vs-rest strategy on the training set. The predictions on the test
set are then evaluated w.r.t. average (over the classes) MCC (Matthews correlation
coefficient). As we are only interested in the influence of the norm on the performance,
83
II `p-norm Multiple Kernel Learning
we forbear proper cross-validation (the so-obtained systematical error affects all norms
equally). Instead, for each of the 30 data splits and for each p, the value of Cthat
yields the highest MCC is selected. Thus we obtain an optimized Cand MCC value
for each combination of data set, split, and norm p. For each norm, the final MCC
value is obtained by averaging over the data sets and splits (i.e., Cis selected to be
optimal for each data set and split).
Note that the results for the best single-kernel SVMs were reported in Ong and
Zien, 2008 (see Figure 3): the single-kernel SVM was shown to be outperformed by
both the SVM using a uniform kernel combination and `1-norm MKL on all four data
sets.
Results The results, shown in Figure 5.6 (a), indicate that, indeed, with proper choice
of a non-sparse regularizer, the accuracy of `1-norm can be recovered. On the other
hand, non-sparse MKL can approximate `1-norm MKL arbitrarily closely, and thereby
approach the same results. However, even when 1-norm is clearly superior to -norm,
as demonstrated for these 4 data sets, it is possible that intermediate norms perform
even better. As the figure shows, this is indeed the case for the PSORT data sets, albeit
only slightly and not significantly so.
Interpretation At this point, we can remark that the superior performance of `p1-
norm MKL in this setup is not surprising. As the kernel alignment plots indicate (see
Figure 5.6 (b)), there are four sets of 16 kernels each, in which each kernel picks up very
similar information: they only differ in number and placing of gaps in all substrings of
length 5 of a given part of the protein sequence. The situation is roughly analogous to
considering (inhomogeneous) polynomial kernels of different degrees on the same data
vectors. This means that they carry large parts of overlapping information. By con-
struction, also some kernels (those with less gaps) also have access to more information
(similar to higher degree polynomials including low degree polynomials). Furthermore,
Ong and Zien (2008) studied single kernel SVMs for each kernel individually and found
that in most cases the 16 kernels from the same subset perform very similarly. This
means that each set of 16 kernels is highly redundant and the excluded parts of informa-
tion are not very discriminative. This renders a non-sparse kernel mixture ineffective.
We conclude that `1-norm must be the best prediction model.
Bound Figure 5.6 (c) shows the results of the bound simulation. We observe from the
figure that the theoretical bounds predict p= 1.2, which is close to the empirically
measured optimal p= 1, although it does not exactly match that value due to the
approximation of the Bayes hypothesis involved (see discussion in the previous case
study (TSS)). However, we can see from the figure that the curvature of the bound
increases with nso that large values of pare more strongly excluded by the bound when
nis larger. This might indicate a trend towards a sparser theoretically optimal pwhen
increasing nas also observed in the TSS case study—however, to further investigate
this hypothesis, we would need access to additional data, which, unfortunately, is not
available for the application at hand.
Summary and discussion We applied the proposed `p-norm MKL to the subcellular
localization of proteins, which is one of most successful applications of computer systems
84
5 Empirical Analysis and Applications
`p-norm 1 1.01 1.03 1.07 1.14 1.33 1.6 1.8 SVM
plant 8.18 8.22 8.20 8.21 8.43 9.47 11.00 11.61 11.85
std. err. ±0.47 ±0.45 ±0.43 ±0.42 ±0.42 ±0.43 ±0.47 ±0.49 ±0.60
nonpl 8.97 9.01 9.08 9.19 9.24 9.43 9.77 10.05 10.33
std. err. ±0.26 ±0.25 ±0.26 ±0.27 ±0.29 ±0.32 ±0.32 ±0.32 ±0.31
psortNeg 9.99 9.91 9.87 10.01 10.13 11.01 12.20 12.73 13.33
std. err. ±0.35 ±0.34 ±0.34 ±0.34 ±0.33 ±0.32 ±0.32 ±0.34 ±0.35
psortPos 13.07 13.01 13.41 13.17 13.25 14.68 15.55 16.43 17.63
std. err. ±0.66 ±0.63 ±0.67 ±0.62 ±0.61 ±0.67 ±0.72 ±0.81 ±0.80
(a) Results (in MCC)
kernel id
kernel id
10 20 30 40 50 60
10
20
30
40
50
60 0.2
0.4
0.6
0.8
1
(b) Alignments
1 1.2 1.4 1.6 1.8 2
1
2
3
4
5
p
bound factor
empirically
optimal p
theoretically
optimal p
50
100
250
500
940
(c) Bound
Figure 5.6: The protein subcellular localization experiment: results and analysis.
85
II `p-norm Multiple Kernel Learning
to biology. We confirmed the results of Ong and Zien (2008) in that MKL greatly helps
performance in this application, which we found to fall into the category of a sparse
learning scenario: `1-norm MKL lead to an improvement of up to 4.5% points MCC
over the SVM using a uniform kernel combination. The optimality of a low pis also
reflected in the bounds, which predicted p= 1.2.
5.4.2 Protein Fold Prediction—A Non-Sparse Scenario
In this section, we experiment on the dingshen data which was used in the previous
study of Campbell and Ying (2011).
Figure 5.7: Figure taken from Alberts et al. (2002).
Application description and goal
Proteins are the functional molec-
ular components inside cells. The
so-called messenger RNA is tran-
scribed from a genetic sequence
that codes for a particular protein.
An important step in this process is
folding, in which the protein forms
its final three-dimensional struc-
ture. Understanding the three-
dimensional structure of a protein
can give insight into its function. For example, if the protein is a drug target, knowl-
edge of its structure is important in the design of small molecular inhibitors which
would bind to, and disable, the protein. Advances in gene sequencing technologies
have resulted in a large increase in the number of identified sequences that code for
proteins (Campbell and Ying, 2011, p. 63). However, there has been a much slower
increase in the number of known three-dimensional protein structures. This motivates
computer-aided prediction of the structure of a protein from sequence and other data.
In the present study, we consider the sub-problem of structure prediction, in which
the predicted label is over a set of fold classes. The fold classes are a set of structural
components, common across proteins, which give rise to the overall three-dimensional
structure.
Experimental setup We obtained the dingshen data set, including a fixed training and
test split, from Colin Campbell. The dingshen data set consists of 27 fold classes with
313 proteins used for training and 385 for testing. There are a number of observational
features relevant to predicting the fold class; in this study, we used 12 different informa-
tive data types. These included the RNA sequence and various physical measurements
such as hydrophobicity, polarity, and van-der-Waals volume, resulting in 12 kernels.
We precisely replicate the experimental setup of Campbell and Ying (2011): we use
the train/test split provided by Colin Campbell and carry out MKL via one-vs.-rest
SVMs to deal with the multiple classes; we report on test set accuracy. We perform
model selection by cross validation on the training set over C10[4,3.5,...,4]. Since
we only have a single test set at hand, we compute the standard deviation by 20 times
86
5 Empirical Analysis and Applications
randomly splitting the test set in two parts of equal size and computing the standard
deviations (shown as slim vertical bars).
Results The results are shown in Figure 5.8 (a). Flat vertical bars show the test set
accuracy of the single-kernel SVMs: for example, H is Hydrophobicity, P is Polarity,
V is van der Waals volume. Slim horizontal bars show the performances of `p-norm
and the SVM using a uniform kernel combination. We observe that the best single-
kernel SVM is the one using the SW2-kernel, having a test set accuracy of 64.0%; in
contrast, the SVM using a uniform kernel combination achieves the substantially higher
accuracy of 68.9%, which is slightly better than the 68.4% reached by `1-norm MKL.
Interestingly, there is a huge improvement in using non-sparse `p>1-norm MKL: the
best performing norm is p= 1.14, which has an impressive accuracy of 74.4% and
significantly outperforms the SVM baselines.
Interpretation Figure 5.8 (b) gives the values of the kernel coefficients θ. We observe
that `1-norm MKL emphasizes on the SW1 and SW2 kernels, which also have the high-
est single-kernel performance. The `p>1-norm variants yield precisely the same “rank-
ing” of weights θibut distribute the weights among the kernels more strongly. Generally,
the kernel combinations output by MKL nicely reflect the true performances as deter-
mined by the single-kernel SVMs. The superior performance of winning `1.14-norm
MKL compared to `1-norm MKL and the SVM using a uniform kernel combination
indicates that although all 12 types of data are relevant, they are not equally so. For
example, the features SW1 and SW2, which are based on sequence alignments, appear
to be more informative than the others.
To further analyze the result, we compute the pairwise kernel alignments shown
in Figure 5.8 (c). One can see from the figure that the Kernels L1–L30 and SW1–SW2
correlate quite strongly. This resembles the similarity in the generation process of those
kernels (they differ by different parameter values). However, the other kernels correlate
surprisingly little—this indicates that here complementary information is contained in
the kernels. Therefore, discarding or overly downgrading one of those kernels can be
disadvantageous, which explains the poor `1-norm MKL performance. On the other
hand, we know from the single-kernel performances that not all kernels are equally
informative, which explains the rather bad performance of the uniform-combination
SVM. We conclude that an intermediate norm must be optimal—and this is also what
we observe in terms of test errors.
Bound The bound factor, shown as a function of pin Figure 5.8 (d) indicates a theo-
retically optimal ptoo high compared to the empirically optimal one. This may stem
from the small sample that is employed here. Indeed, in the application at hand, we
face 313 training examples from 27 classes, that is, only 11.6 examples per class in
average. As shown in Case Study 2, small samples can result in an underestimation of
the true underlying sparsity of the task and this is probably also what happens here.
This hypothesis is further supported by the fact that we can observe the curvature of
the bound increasing with n. This might indicate that if we had access to a larger
sample size (resulting in a tighter approximation of the Bayes hypothesis), a lower p
might be optimal in the bound. However, for the limited data at hand, we cannot fully
87
II `p-norm Multiple Kernel Learning
C H P Z S V L1 L4 L14 L30 SW1SW2
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
Test Set Accuracy
SVM (single)
1−norm MKL
1.07−norm MKL
1.15−norm MKL
1.33−norm MKL
SVM (all)
(a) Results (accuracy)
C H P Z S V L1 L4 L14 L30 SW1SW2
0
0.1
0.2
0.3
0.4
0.5
0.6
Kernel Weights θi
1−norm MKL
1.14−norm MKL
(b) MKL weights θ
kernel
kernel
C H P Z S V L1 L4 L14 L30SW1SW2
C
H
P
Z
S
V
L1
L4
L14
L30
SW1
SW2
0.4
0.6
0.8
1
(c) Alignments
1 1.2 1.4 1.6 1.8 2
1
2
3
4
p
bound factor
empirically
optimal p n= 20
n= 50
n=125
n=281
theoretically
optimal p
(d) Bound
Figure 5.8: The protein fold prediction experiment: results and analysis.
88
5 Empirical Analysis and Applications
verify this hypothesis.
Summary and discussion We studied `p-norm MKL in the bioinformatics application
of protein fold prediction, which is a key step towards understanding the function of
proteins and thus crucial for drug design. We found that using non-sparse `p=1.14-norm
MKL significantly increases the predictive accuracy by over 5% points compared to
`1-norm MKL, the SVM using a uniform kernel combination, and the SVM using the
best kernel determined by model selection. The optimality of such an intermediate
pcan be explained by the presence of a large number of “weak” (but nevertheless
complementary, informative) kernels.
5.4.3 Metabolic Network Reconstruction—A Uniform Scenario
Figure 5.9: Part of an aminosugars
metabolic network (figure taken from
Bleakley et al., 2007).
Application description and goal Ametabolic
network is a schematic representation of the set
of enzymes residing in a cell together with their
functional interactions. As such it allows us to
understand the connections between the genome
and the physiology of an organism. We can think
of a metabolic network as a graph where each en-
zyme is represented by a node and there is an
edge between two enzymes if they interact (see
example shown in the figure to the right). Ex-
amples of metabolic networks include glycolysis,
Krebs cycle, and the pentose phosphate pathway.
The task in metabolic network reconstruction is,
given a partial network built from a subset of en-
zymes, to predict functional interactions of new
enzymes.
There is a growing interest in this application
within the bioinformatics community due to the accumulation of biological information
about enzymes such as gene expression data, phylogenetic data, and location data
of enzymes in the cell—in part this can be attributed to the sequencing of complete
genomes. In the study of Bleakley et al. (2007), a new method to predict potentially
new relationships was proposed and its effectiveness demonstrated. We recognize this
method as a regular SVM using a uniform combination of kernels and thus are in-
terested in studying whether `p-norm MKL can help in making even more accurate
reconstructions.
Experimental setup We use a data set that was originally studied in Yamanishi et al.
(2005): it consists of 668 enzymes of the yeast Saccharomyces cerevisiae and 2782
functional relationships extracted from the KEGG database (Kanehisa et al., 2004).
We employ the experimental setup of Bleakley et al. (2007), who phrase the task as
graph-based edge prediction with local models by learning a model for each of the 668
enzymes. The provided kernel matrices capture expression data (EXP), cellular local-
89
II `p-norm Multiple Kernel Learning
AUC ±stderr
EXP 71.69 ±1.1 (69.3±1.9)
LOC 58.35 ±0.7 (56.0±3.3)
PHY 73.35 ±1.9 (67.8±2.1)
INT (SVM) 82.94 ±0.8(82.1±2.2)
1-norm MKL 74.7±1.5
1.33-norm MKL 81.30 ±1.4
1.85-norm MKL 82.90 ±1.0
2.28-norm MKL 82.85 ±0.9
2.66-norm MKL 82.24 ±0.9
4-norm MKL 80.37 ±0.7
Recombined and product kernels
1-norm MKL 79.05 ±0.5
1.14-norm MKL 80.92 ±0.6
1.33-norm MKL 81.95 ±0.6
1.6-norm MKL 83.13 ±0.6
(a) Results (in AUC).
(b) Alignments
1 1.2 1.4 1.6 1.8 2
1.6
2
2.4
2.8
p
bound factor
empirically
optimal p
n= 50
n=100
n=250
n=668
theoretically
optimal p
(c) Bound (curves are equall for all n)
Figure 5.10: The metabolic network reconstruction experiment: results and analysis.
90
5 Empirical Analysis and Applications
ization (LOC), and the phylogenetic profile (PHY); in addition, we use the integration
of the former 3 kernels (INT), which matches our definition of an unweighted-sum ker-
nel. Following Bleakley et al. (2007), we employ a 5-fold cross validation; on average,
we train 534 enzyme-based models in each fold; however, in contrast to Bleakley et al.
(2007) we omit enzymes reacting with only one or two others to guarantee well-defined
problem settings. As Table 5.10 (c) shows, this results in slightly better AUC values
for single kernel SVMs; the results by Bleakley et al. (2007) are shown in brackets.
Results The results are shown in Figure 5.10 (a); the results by Bleakley et al. (2007)
for single-kernel SVM are shown in brackets. We can observe from the figure that the
SVM using a uniform kernel combination performs best, although its solution is well
approximated by `p-norm MKL using values that are slightly smaller or larger than
p= 2. Increasing the number of kernels by including recombined and product kernels
does improve the results obtained by MKL, especially for small values of p, but even
the highest AUC values are not significantly different from those of the uniform kernel
combination.
Interpretation It suggests that the performance of the SVM using a uniform kernel
combination can be explained by all three kernels performing well individually. Their
correlation is only moderate, as shown in Figure 5.10 (c), indicating that they contain
complementary information. Hence, downweighting one of those three kernels leads
to a decrease in performance, as observed in our experiments. This explains why the
uniform kernel combination yields the best prediction accuracy in this experiment.
Bound Figure 5.10 (c) shows bound factor as a function of pfor various values of
n. Note that regardless of nthe curves fall into the same regions, rendering them
impossible to differentiate in the plot. The optimal pin the interval p[1,2] is
determined as p= 2, although the plot indicates that even a higher pcould be op-
timal. In contrast to previous applications of the bound, we do not observe a trend
in the bound when increasing n. Inspecting the wsvm as output by the best per-
forming algorithm (namely, the SVM using a uniform kernel combination), we find
(kwsvm
mk)m=1,2,3= (0.332,0.330,0.338) for n= 50 and (0.337,0.326,0.337) for the
maximal n= 668. Clearly, there is hardly a development towards a more sparse w
when increasing nso that it is not surprising that the bounds coincide with varying n.
Summary and discussion We studied `p-norm MKL in the bioinformatics application
of metabolic network reconstruction. With the rapidly growing “industrialization” of
sequencing technologies there is growing demand for the reconstruction of metabolic
cycles such as the Krebs cycle. We found the uniform kernel combination working best
in this application because, as determined by our kernel alignment technique, all three
kernels encode strong and complementary features. The uniformness of this scenario
can also be seen from the bounds, which indicate a higher pto be optimal than in the
applications we have seen before.
91
II `p-norm Multiple Kernel Learning
5.5 Computer Vision Experiment: Object Recognition
Application description and goal In object recognition the task is to find an object in
an image. Traditionally, this is easily handled by humans, even if the object is rotated
or partially obstructed from view, but for machines it poses quite a challenge. This is
because objects can be shown from various view points and thus can come in various
shapes and scales.
Figure 5.11: Figure taken Bach (2008a).
A specialty of computer-based
object recognition approaches is that
a variety of feature descriptors are
given, varying in, for example, color,
texture, and shape information; each
combination of those descriptors gives
rise to a kernel. Cross-validation-
based model selection is known to
fail for this data as the descriptors
capture complementary information
(Gehler and Nowozin, 2009). In-
stead, researchers frequently resort
to heuristics such as a uniform ker-
nel combination. However, this does
not account for varying relevances of
the features that change across the
tasks/concepts/classes. For example, color information increases the detection rates
of stop signs substantially but is almost useless—or even counterproductive due to the
introduction of additional noise—for finding cars. Clearly, this is because stop signs
are colored red while cars can come in any color. As additional, non-essential features
not only slow down computation but may also harm the predictive performance, it is
necessary to combine the features in a more meaningful way to achieve state-of-the-
art object recognition performance in real systems. In this thesis, we thus investigate
whether `p-norm MKL can help in this task.
Data Set We experiment on the PASCAL Visual Object Classes Challenge 2008 (VOC
2008) data (Everingham et al., 2008), which consists of 8780 images that are officially
split by the organizers into 2113 training, 2227 validation, and 4340 test images. The
organizers also provide annotation of the images in 20 object classes, where each image
can be annotated by multiple objects (Figure 5.12 (a) shows examples of the occurring
object classes). The official task posed by the challenge organizers consists in solving
20 binary classification problems, i.e., predicting whether at least one object from a
class kis visible in a test image. The evaluation measure is the average precision (AP).
Feature Extraction We deploy 12 kernels that are inspired by the approaches of the
winners of the VOC 2007 challenge (Marszalek and Schmid, 2007; Bosch et al., 2007).
They are based on the following three types of features: Histogram of visual words
(HoW), Histogram of oriented gradients (HoG), and Histogram of pixel colors (HoC).
92
5 Empirical Analysis and Applications
aeroplane bicycle bird boat bottle
bus car cat chair cow
diningtable dog horse motorbike person
pottedplant sheep sofa train tvmonitor
(a) VOC 2008 data set: Exemplary images. Note that an image can be annotated by multiple
concepts (e.g., the above horse image is also annotated with person).
average aeroplane bicycle bird boat bottle
1-norm MKL 40.8±0.9 66.9±6.8 36.4±6.6 44.1±5.7 56.8±5.0 19.2±3.8
SVM 40.8±1.0 66.4±6.6 39.1±5.9 43.3±5.8 57.5±5.0 18.4±3.6
p-norm MKL 42.3±0.9 67.1±6.3 40.7±6.6 44.7±5.4 57.8±5.4 19.5±3.6
selected p1.07 1.06 1.11 1.11 1.11
bus car cat chair cow
1-norm MKL 39.3±10.6 49.0±2.8 47.7±3.8 44.1±4.9 10.8±3.5
SVM 42.3±9.1 48.9±3.3 46.1±3.2 43.0±4.7 8.2±2.9
p-norm MKL 41.7±9.5 50.3±3.4 48.9±3.7 44.9±3.8 10.3±3.1
selected p1.33 1.11 1.0588 1.0154 1.0448
diningtable dog horse motorbike person
1-norm 27.1±7.0 34.4±4.4 39.6±5.8 41.7±4.5 84.1±1.3
SVM 29.5±9.1 33.2±2.7 42.5±6.5 42.8±2.9 83.9±1.2
p-norm MKL 30.1±6.2 34.0±3.4 42.0±6.6 44.7±4.2 84.5±1.2
selected p1.1351 1.1351 1.1579 1.1111 1.0588
pottedplant sheep sofa train tvmonitor
1-norm 14.7±4.5 26.3±7.7 33.0±7.2 50.9±9.8 50.8±5.5
SVM 15.5±4.3 22.9±7.0 31.3±6.5 50.9±9.2 51.0±5.7
p-norm MKL 16.1±4.7 27.5±7.6 33.9±6.9 53.7±9.9 52.9±5.7
optimal p1.1111 1.0154 1.1351 1.1111 1.0588
(b) Results (in AP)
Figure 5.12: The computer vision experiment: data set and results.
93
II `p-norm Multiple Kernel Learning
1. The HoW features are constructed in a standard way as in Csurka et al. (2004):
first, the SIFT descriptors (Lowe, 2004) are calculated on a grid of 10 pixel pitches
for each image using a code book of size 1200, which is learned by k-means clustering
deploying the SIFT descriptors extracted from the training images. Then, all SIFT
descriptors are assigned to visual words (prototypes) and summarized into histograms
within entire images or sub-regions, deploying HoW features over the grey and hue
channels.
2. The HoG features over the grey channel are computed by discretizing the orientation
of the gradient vector at each pixel into 40 bins and by summarizing the discretized
orientations into histograms within image regions (Dalal and Triggs, 2005; Bosch et al.,
2007). Canny detectors (Canny, 1986) are used to discard contributions from pixels,
around which the image is almost uniform.
3. The HoC features over the hue channel are constructed by pixel-wise discretizing
color values and computing their histograms within image regions.
The above features are histograms containing no spatial information. We therefore
enrich the respective representations by a pyramidal approach (Lazebnik et al., 2006)
to also capture the spatial context of an image. Furthermore, we apply a χ2kernel
(Zhang et al., 2007) on top of the enriched histogram features, where the bandwidth of
the χ2kernel is heuristically chosen (proportional to the mean of the squared Euclidean
distance averaged over all training example pairs (Lampert and Blaschko, 2008)). In
total, we prepared 12 kernels by mutually combining 4 feature types/colors {HoWg,
HoWh, HoGg, HoCh}with 3 pyramid levels. All kernels were spherically normalized
according to Eq. (3.7).
Experimental setup The original data set contains a test set, but with the labels
being undisclosed. We therefore create 10 random splits of the unified training and
validation sets into new, smaller sets containing 2111 training, 1111 validation, and
1110 test images. For each of the 10 splits, the training images are used for learning
classifiers, while the SVM regularization parameter Cand the MKL norm parameter
pare chosen based on the maximal AP score on the validation images. Thereby, the
regularization constant Cand the norm parameter pare optimized class-wise from the
candidates {24,23,...,24}.
For a thorough evaluation, we would have to construct another codebook for each
cross-validation split. However, creating codebooks and assigning descriptors to visual
words is a time-consuming process. In our experiments, we therefore resort to the
common practice of using a single codebook created from all training images contained
in the official split. Although this could result in a slight overestimation of the AP
scores, this affects all methods equally and does not favor any classification method
over another—our focus lies on a relative comparison of the different classification
methods; therefore there is no loss in exploiting this computational shortcut.
Note that we exclude a comparison to single-kernel SVMs since it is clear for this
application that a single kernel alone cannot capture the relevant information needed
for this task (see, e.g., Gehler and Nowozin, 2009, and references therein).
94
5 Empirical Analysis and Applications
Results The results are shown in Table 5.12 (b). Boldface shows the best method as
well as all other ones that are not statistical-significantly worse according to a Wilcoxon
signed-ranks test at a significance level of 5%. Medians of the optimal norm parameter
pare shown. One can observe that `p>1-norm MKL is among the best algorithms
independent of the concept class, except for bird where it attains the maximal AP
score, but insignificantly so. For many concepts, the AP score of `p>1-norm MKL
is considerably higher than the one attained by 1-norm MKL or the SVM using the
uniform kernel combination: e.g., bicycle,car,cat,motorbike, and train are classes
where learning kernels substantially helps. On average, the AP scores increases by
about 1.5 AP Points by using non-sparse `p>1-norm MKL. It is also interesting to note,
that when comparing `1-norm MKL with the SVM using a uniform kernel combination,
there is no consistent picture over the classes: while `1-norm MKL is better for the
aeroplane,bottle,chair,sheep and sofa classes, it is outperformed by the regular sum-
kernel SVM for boat,bus,diningtable,horse and tvmonitor. By optimizing the non-
sparsity parameter pin between the two extreme cases, our procedures can always
outperform both baselines.
Interpretation We now analyze the kernel mixtures θas output by MKL. To this aim,
we first compute the pairwise kernel alignment scores of the 12 base kernels (shown in
Figure 5.13 (a)). One can see from the figure that the tree kernels constructed from
the same type/color features have high similarities, as easily identified from the four
diagonal blocks. Across different features, we observe mid-range similarities between
those with the same type (HoWgHoWh) and with the same colors (HoWgHoGg,
HoWhHoCh). As expected, the three different pyramid levels of each type/color
feature are very closely aligned. The pair of the HoW features are indeed located closer
to each other than to the other features.
Intuitively, the kernel weights matter only moderately when the base kernels are
similar. This situation holds for the three pyramid levels of each type/color feature.
Therefore, we also try combining these three kernels with uniform weights in advance
and optimized the four weights corresponding to feature types/colors afterwards. In
fact, this pre-combination procedure does not degrade the classification performance:
its mean AP is 42.3±0.7. Especially, when further increasing the number of employed
kernel parameters, this pre-averaging procedure becomes particularly appealing.
Based on the properties of the 12 base kernels, we check their kernel weights as
output by 1-norm and p-norm MKL. Figure 5.13 (c) shows the average kernel weights
over 10 repetitions of the experiment. We see that `1-norm MKL focuses on only a few
informative kernels such as HoW2g, HoW0gand HoW2h, almost completely neglecting
HoGgand HoCh(Left). By contrast, the solutions output by `p>1-norm MKL (Right)
are more balanced: while the HoW kernels get relatively high weights, the HoG and
HoC features get considerably lower weights but are not ignored completely. We can
also observe that `1-norm MKL tends to completely discard color information while it
is still incorporated into the final model by `p>1-norm MKL. But color information can
be crucial for object detection: for example, school buses tend to be yellow and horses
are more frequently found on green lawns than on gray roads; discarding the color
95
II `p-norm Multiple Kernel Learning
HoW_g HoW_h HoG_g HoC_h
H
oW_g
H
oW_h
HoG_g
HoC_h
0
0.
2
0.
4
0.
6
0.
8
1
(a) Alignments
1 1.2 1.4 1.6 1.8 2
1.5
2
2.5
3
3.5
4
p
bound factor
empirically
optimal p n= 60
n= 260
n=1000
n=4000
theoretically
optimal p
(b) Bound
0
0.1
0.2
0.3
0.4
0.5
0.6
HoW0_{g}
HoW1_{g}
HoW2_{g}
HoW0_{h}
HoW1_{h}
HoW2_{h}
HoG0_{g}
HoG1_{g}
HoG2_{g}
HoC0_{h}
HoC1_{h}
HoC2_{h}
aeroplane
bicycle
bird
boat
bottle
bus
car
cat
chair
cow
diningtable
dog
horse
motorbike
person
pottedplant
sheep
sofa
train
tvmonitor
0
0.1
0.2
0.3
0.4
0.5
0.6
HoW0_{g}
HoW1_{g}
HoW2_{g}
HoW0_{h}
HoW1_{h}
HoW2_{h}
HoG0_{g}
HoG1_{g}
HoG2_{g}
HoC0_{h}
HoC1_{h}
HoC2_{h}
aeroplane
bicycle
bird
boat
bottle
bus
car
cat
chair
cow
diningtable
dog
horse
motorbike
person
pottedplant
sheep
sofa
train
tvmonitor
(c) MKL weights θ:`1-norm (left) and `p-norm MKL (right).
bus:2008 004725
horse:2008 002860
(d) Examples: Images where MKL helped performance most. Left: a
bus ranked 39th (`1-norm) and 6th (`p-norm MKL), respectively. Right:
a horse ranked as 142th (`1-norm) and as 41st (`p-norm MKL).
Figure 5.13: The computer vision experiment: interpretation and analysis.
96
5 Empirical Analysis and Applications
information, as done by `1-norm MKL, hurts performance on concept classes that are
well characterized by colors—this might be one of the major reasons for the superior
performance of `p-norm MKL on these images compared to the `1-norm MKL.
Figure 5.13 (d) shows example images where `p-norm MKL greatly helps general-
ization performance. In both cases, `1-norm MKL tends to focus on the finest pyramid
layer, and tries to capture layout information. However, since the VOC dataset includes
a substantial number of not-so-well-aligned images (such as the examples shown in the
figure), focusing on layout might sometimes fail. In contrast, `p-norm MKL distributes
more weights to coarser pyramid layers, and thus successfully identifies these images.
Bound From the bound shown in Figure 5.13 (b), we observe an overestimation of
the optimal p. The empirical best pwas p= 1.11 in median. In contrast, the bound
predicts a pwith p2. This is very similar to what be have already observed in
the protein fold prediction experiment, presented in Section 5.4.2. Interestingly, both
experiments, the present one and the protein folding experiment, consist of multi-label
classification tasks with 20 and 27 classes, respectively. This indicates that, for small n,
the bounds are not reliable in multi-label classification problems. Nevertheless, we can
observe a trend towards a more accurate prediction of the optimal pwith increasing n,
which is similar to the previous experiments. It could thus be conjectured that with
increasing nthe bound’s prediction will eventually coincide with what is also measured
empirically.
Summary and discussion In this chapter, we applied `p-norm MKL to object recogni-
tion in computer vision. While humans can easily recognize objects in images, even if
they are rotated or partly covered, machines frequently fail in this task. Since images
are characterized by heterogeneous kernels arising from color, texture, or shape infor-
mation, but the usefulness of a particular kernel strongly depends on the object to be
discovered (for example, cows are frequently found in front of green lawns while cars
can come in any color), it is very appealing to use MKL in this application. Indeed,
in our experiments on the official VOC08 challenge data set, we observed a substantial
improvement in using `p-norm MKL—this holds over the whole range of object classes.
Using the alignment technique we connected this fact with the color information be-
ing discarded by `1-norm MKL while it is overly incorporated into the uniform kernel
combination. This is especially remarkable since the uniform SVM was shown to be a
tough competitor in the past (Gehler and Nowozin, 2009). The latter might also be
connected with prior knowledge that is incorporated by researchers when the set of
kernels is chosen a priori.
5.6 Summary and Discussion
We empirically analyzed the generalization performance of `p-norm MKL in terms of
the test error and application-specific evaluation measures, respectively. To this end,
we measured the accuracy of `p-norm MKL and compared it to the one achieved by
the SVM baselines, using either a uniform kernel combination or a single kernel tuned
by model selection. To analyze the results we developed a methodology based on
97
II `p-norm Multiple Kernel Learning
theoretical bounds (investigating the theoretically optimal p) and kernel alignments
(analyzing the complementariness of the kernel set). We exemplified the methodology
by means of a toy experiment, where we constructed data by gradually increasing
the true underlying sparsity (as measured by the size of the sparsity pattern of the
Bayes hypothesis), resulting in several constructed learning scenario that differ in their
underlying sparsity. First, we observed the theoretical bounds reflecting the empirically
observed optimal pin the idealized setup where we know the Bayes hypothesis; second,
we showed that kernel alignments reflect the dependancy structure of the kernel set;
third, we found that the choice of the norm parameter pcan be crucial to obtain a high
test set accuracy, concluding that in the presence of many irrelevant kernels, a small
pis expected to be beneficial and an intermediate/large pwill prove to be beneficial
otherwise.
We then applied `p-norm MKL to diverse, highly topical and challenging prob-
lems from the domains of bioinformatics and computer vision on small and large scales.
Data frequently arises in those domains from multiple heterogeneous sources or is repre-
sented by various complementary views, the right combination of which being unknown,
rendering them especially appealing for the use of MKL. Beside investigating the ef-
fectiveness of `p-norm MKL in terms of test set accuracy, we were also interested in
understanding the results, implementing the proposed methodology.
We exemplarily carried out the proposed experimental methodology for transcrip-
tion splice site detection, a cutting-edge bioinformatics application. The data set we
experimented on was also used in the previous study of Sonnenburg et al. (2006a) and
comprises 93,500 instances. We showed that regardless of the sample size, `p-norm
MKL can improve the prediction accuracy over `1-norm MKL and the SVM using a
uniform kernel combination (which itself outperforms the single-kernel SVM using the
kernel determined by model selection). This is remarkable and of significance for the
application domain: the method using a uniform kernel combination was recently con-
firmed to be the most accurate TSS detector in a comparison of 19 state-of-the-art
models Abeel et al. (2009). Furthermore, we observed that the choice of nhad no
impact on the empirical optimality of a particular p. We also evaluated the theoretical
bounds and found that with increasing nthey give a more and more accurate estimation
of the optimal pthrough an increasingly tight approximation of the Bayes hypothesis.
Moreover, we also studied other applications from the domain of bioinformatics:
protein subcellular localization, protein fold prediction, and metabolic network recon-
struction. In a nutshell, we found this domain being too diverse to be categorized into
a single learning scenario (e.g., sparse). In fact, we presented diverse bioinformatics
applications covering the whole range of scenarios: sparse, intermediate non-sparse,
and uniform. However, by appropriately tuning the norm parameter, `p-norm MKL
proved robust in all scenarios.
Finally, we applied `p-norm MKL to object recognition, a burning topic in com-
puter vision. The same setting was also studied earlier by Varma and Ray (2007) and
Gehler and Nowozin (2009), who found classical MKL to not increase the prediction
accuracy compared to the uniform kernel SVM baseline. We applied the developed `p-
98
5 Empirical Analysis and Applications
norm MKL to this application domain and showed that, by taking p= 1.11 in median,
the prediction accuracy could be raised over the SVM baselines, regardless of the object
class, by an AP score of 1.5 in average, and for 7 out of the 20 classes significantly so.
99
Conclusion and Outlook
To successfully cope with the most recent challenges imposed by application domains
such as bioinformatics and computer vision, machine learning systems need to effec-
tively deal with the multiple data representations—or kernels—naturally arising in
those domains. Previous approaches to learning with multiple kernels restrict the search
space by convex combinations, promoting sparse solutions to support interpretability
and scalability. Unfortunately, this is often too restrictive and hinders MKL being
effective in practice, as demonstrated in a variety of recent works (e.g. Cortes et al.,
2008; Gehler and Nowozin, 2008; Kloft et al., 2008b; Noble, 2008; Cortes, 2009; Gehler
and Nowozin, 2009; Kloft et al., 2009a).
In this thesis, we addressed the problem of learning with multiple kernels by propos-
ing the `p-norm multiple kernel learning methodology, deduced from a rigorous math-
ematical framework for learning with multiple kernels, unifying previous approaches
to learning with multiple kernels under a common umbrella. In an extensive study,
we evaluated `p-norm multiple kernel learning empirically in controlled environments
as well as in highly topical real-work applications from the domains of computer vi-
sion and bioinformatics, comparing our methodology to even its strongest competitors
such as the support vector machine using a uniform kernel combination. We found
`p-norm multiple kernel learning being more efficient and more accurate than previous
approaches to learning with multiple kernels, allowing us to deal with up to ten thou-
sands of data points and thousands of kernels at the same time, as demonstrated on
several data sets.
Our methodology is underpinned by deep foundations in the theory of learning:
we proved tight lower and upper bounds on the complexity of the hypothesis class
associated with `p-norm multiple kernel learning, giving theoretical guarantees on the
fit of our method and showing that, depending on the true underlying learning task,
our framework can attain stronger theoretical guarantees than classical approaches to
learning with multiple kernels. Stimulated by recent attempts to catalyze research in
the direction of understanding learning with multiple kernels (Lanckriet et al., 2009), we
showed that the optimality of `p-norm multiple kernel learning can be connected with
the stronger theoretical guarantees that it attains in practice, compared to classical
approaches.
The proposed framework does not generally require the incorporation of expert
knowledge; however, such application-specific knowledge can be used to further improve
its performance, for example, by pre-selecting or upscaling kernels with high information
content. In combination with this technique, `p-norm multiple kernel learning can even
cope with the most challenging problems posed by application domains.
101
II `p-norm Multiple Kernel Learning
Summary of Results
The main results of this thesis can be summarized as follows.
Chapter 2. We presented a mathematical framework for learning with multiple
kernels, comprising most existing lines of research in that area and rigorously gener-
alizing the classic approach of Bach et al. (2004). We unified previous formulations
under a common umbrella, which allowed us to analyze a large variety of MKL meth-
ods jointly, as exemplified by deriving a general dual representation of the criterion,
without making assumptions on the employed norms and losses, beside the latter being
convex. This delivered insights into connections between existing MKL formulations
and, even more importantly, can be used to derive novel MKL formulations as special
cases of the framework.
Chapter 3. We derived efficient algorithms to solve the `p-norm MKL optimization
problem. All algorithms, for the sake of performance, are implemented in C++, but for
usability, are equipped with interfaces to MATLAB, Octave, Phyton, and R. We also
wrote macro scripts completely automating the whole process from training over model
selection to evaluation. Our software is freely available under the GPL license. In our
computational experiments, we found these large-scale optimization algorithms allowing
us to deal with ten thousands of data points and thousands of kernels at the same
time, as demonstrated on the MNIST data set. We compared our algorithms to the
state-of-the-art in MKL research, namely HessianMKL (Chapelle and Rakotomamonjy,
2008) and SimpleMKL (Rakotomamonjy et al., 2008), and found ours to be up to two
magnitudes faster. We proved the convergence of our AnalyticalMKL using the usual
regularity assumptions.
Chapter 4. The proposed techniques are also theoretically founded: we proved
tight lower and upper bounds on the local and global Rademacher complexities of
the hypothesis class associated with `p-norm MKL, from which we derived excess risk
bounds with fast convergence rates, thus being tighter than existing bounds for `p-
norm MKL (Cortes et al., 2010a), which only achieved slow convergence rates. While
for our results to hold we employed an assumption on the independence of the kernels,
related works on sparse recovery required similar assumptions (Raskutti et al., 2010;
Koltchinskii and Yuan, 2010). We connected the minimal values of the bounds with
the structured soft sparsity of the underlying Bayes hypothesis, demonstrating that
for a large range of learning scenarios `p-norm MKL attains substantially stronger
generalization guarantees than classical MKL, justifying the use of `p-norm MKL and
multiple kernel learning in general.
Chapter 5. In an extensive evaluation, we empirically measured the generaliza-
tion performance of `p-norm MKL on diverse and relevant real-world applications from
bioinformatics and computer vision. We developed a methodology based on the theoret-
ical bounds and kernel alignments for analyzing the results, as exemplified by means of
a toy experiment, where we observed the theoretical bounds and the kernel alignments
well reflecting the empirically observed optimal p. We carried out the experimental
methodology for genomic transcription splice site detection. We showed that regard-
102
5 Empirical Analysis and Applications
less of the sample size, by taking p= 4/3, `p-norm MKL can improve the prediction
accuracy over `1-norm MKL and SVM using a uniform kernel combination, which was
recently confirmed to be the most accurate TSS detector in an international compar-
ison of 19 models Abeel et al. (2009). We also evaluated the theoretical bounds and
found them with increasing nincreasingly accurate reflecting the optimal p. We ap-
plied `p-norm MKL to protein fold prediction, achieving an significant improvement
of 6% in accuracy. Finally, we applied `p-norm MKL to object recognition, a burning
topic in computer vision, studied earlier by Varma and Ray (2007) and Gehler and
Nowozin (2009) who found no improvement in using classical MKL. In contrast, we
showed that, by taking p= 1.11 in median, `p-norm MKL outperforms even the best
SVM competitors, regardless of the object class.
Future Work
As recently argued in Kloft et al. (2010b), among the most important challenges in
multiple kernel learning is the exploration of new objectives and parameterizations.
To this end, an interesting approach was recently undertaken by Yan et al. (2010),
who studied a `2-norm MKL variant based on a Fisher discriminant objective function
with very promising results on object categorization tasks. Similar objectives were
also explored in Yu et al. (2010), who presented a more general formulation. First
steps towards new ways of combining kernels were undertaken in Varma and Babu
(2009) and Cortes et al. (2009b), respectively, where non-linear combinations (products
and polynomials, respectively) were studied. Unfortunately, in the current state of
research, both formulations in general lead to non-convex optimization problems and
the corresponding objectives are hard to optimize accurately.
Besides classification, which forms the core of machine learning and on which we
focused in this thesis, it is interesting to study the proposed methodology also in other
learning tasks such as regression or novelty detection tasks. A first step in this direction
was undertaken in Cortes et al. (2009a), where `p-norm MKL was applied to regression
tasks on Reuters and various sentiment analysis datasets, but restricted to the case
p= 4/3. Here it would be interesting to exploit the full power of our framework by
studying the whole range of p[1,]. Another approach towards different learning
tasks was undertaken in Kloft et al. (2009b), where density-level set estimation and
novelty detection were studied with promising first results. Likewise, semi-supervised
learning (Chapelle et al., 2006) is to be explored. Also, online learning recently gained
considerable attention (Orabona et al., 2010) and is subject to current work (Martins
et al., 2011; Orabona and Jie, 2011).
Future work could also aim at further exploring applications of `p-norm MKL.
Ongoing work in the field of computer vision is based on Nakajima et al. (2009b), in-
vestigating `p-norm multiple kernel learning on photo annotation tasks with promising
first results. Also, effectively learning taxonomies (Binder et al., 2011) has recently
gained attention from the scientific community and could benefit from the use of MKL.
Also, in the bioinformatics domain, we observed applications of `p-norm MKL, most
103
II `p-norm Multiple Kernel Learning
notably, the study by Yu et al. (2010) on two real-world genomic data sets for clinical
decision support in cancer diagnosis and disease relevant gene prioritization, respec-
tively, showing a substantial improvement in using `4/3-norm MKL compared to the
SVM baselines. Again, it would be interesting to refine the analysis by searching for an
optimal pand, moreover, evaluate our bounds and kernel alignments. Another interest-
ing application is computer security, where MKL was successfully applied to network
intrusion detection tasks (Kloft et al., 2008a), using appropriate loss functions.
It would also be interesting to further theoretically explore the `p-norm multiple
kernel learning framework, for example, beyond the traditional borders of the local
Rademacher complexities. Clearly, future work here could built upon Bartlett and
Mendelson (2006). Another theoretical challenge is to analyze more complex hypothesis
classes. The unifying framework presented in this thesis might serve as a good starting
point for such an undertaking, as exemplified for the case of generalized elastic-net-
style classes in Kloft et al. (2010a). This is becoming especially pressing with respect
to non-isotropic hypothesis classes that might also be profitable in practice since they
allow for the incorporation of expert knowledge. A considerable amount of work on
those non-standard classes would also have to focus on optimization algorithms.
Finally, an innovative approach was recently taken in Widmer et al. (2010), where
`p>1-norm MKL was exploited as a methodical tool to learn hierarchy structures in
multi-task learning, thus extending the work of Evgeniou and Pontil (2004) to the
structured domain. This indicates the many possibilities for further exploration of our
framework beyond the standard setting, for example, exploring the relations to learning
distance metrics or learning the covariance function of a Gaussian process—clearly, this
requires research that goes beyond the classical setup.
104
Appendix
A Foundations
In this appendix, we introduce the foundations of this thesis; the presentation in part
follows the textbooks of Shawe-Taylor and Cristianini (2004) and Boyd and Vanden-
berghe (2004). We introduce kernel methods, Rademacher theory, and Lagrangian
duality.
A.1 Why Using Kernels?
Consider a typical machine learning problem. Let us assume, we are given a set of
images and the task is to build a machine that categorizes the images of 256 ×256
pixels with respect to whether they contain a certain object, for example, a vehicle (this
models a scenario where the images are obtained from the visual perception module
of a driverless vehicle and the goal is to evade other vehicles). One possible method
to obtain features from such an image could be through considering the gray value of
each pixel. This way we can represent each image by a 2562-dimensional vector. We
thus aim at finding a partition of R256×256 into two sets: the one that consists of the
points representing images that are likely to contain cars and the one that is likely to
not contain cars.
As it turns out, this problem is to hard to be solved by a simple, linear partition
of the space. Kernel methods offer an effective and computationally efficient solution
to this problem. Kernel methods can be understood as the following scheme (Shawe-
Taylor and Cristianini, 2004, p. 26):
1. Data items are embedded into a vector space called feature space.
2. A hyperplane separating the images of the data items in the feature space is
sought.
3. The algorithms are implemented in such a way that the coordinates of the em-
bedded points are not needed, only their pairwise scalar products.
4. The pairwise scalar products can be efficiently computed directly from the original
data items using a kernel function.
The above scheme is illustrated in Figure A.1A.2. Formally, we define a kernel as
follows:
105
II `p-norm Multiple Kernel Learning
Figure A.1: Illustration of kernel methods: linear separation is not always possible (Left) or desired
(Right).
Figure A.2: Kernel methods address this by mapping the data into a higher dimensional space to
allow for linear separation. This a modification of a figure taken from uller et al. (2001).
106
A Foundations
Definition A.1 (Kernel).Given an input space X, a kernel is a function k : X×X R
such that there is a (Hilbert) feature space Hand a mapping
φ:x7− φ(x) H
so that for all x, ¯x
k(x, ¯x) = hφ(x), φ(¯x)i.
The definition of a kernel uses the notion of a Hilbert space, which can be thought of as a
continuous generalization of a Euclidean vector space and is formally defined as a vector
space together with a scalar product that is complete and separable. Completeness
refers in this case to every Cauchy sequence converging in Hand separability means
that Hadmits a countable orthonormal basis.
How can we check whether a function kis indeed a kernel? If we know that kis a
continuous function, then there is a nice characterization: given a sample {x1, . . . , xn},
the matrix K=k(xi, xj)n
i,j=1 formed from the pairwise kernel evaluations is called
kernel matrix; one can show that kis a kernel if and only if for any given sample the
kernel matrix Kis positive semi-definite.
A.2 Basic Learning Theory
Empirical risk minimization Arguably the most important property of successful
learning machines is that they find decision functions which not only characterize the
training data (x1, y1),...,(xn, yn) X × Y well but also characterize new and un-
seen data. Many such algorithms underlie the concept of empirical risk minimization
(ERM), that is, for a given hypothesis class H YXa minimizer
fargminfHLn(f)
is searched for, where Ln(f) = Pn
i=1 l(f(xi), yi) is the (cumulative) empirical loss of a
hypothesis fwith respect to a function l:R×Y R(called loss function). The name
empirical risk minimization stems from the fact that Ln(f) being ntimes the empirical
risk of f. Kernel approaches consider linear models of the form
fw(x) = hw, φ(x)i
together with a (possibly non-linear) mapping φ:X H, where His a Hilbert space.
Often the hypothesis class is isotropically parameterized by a parameter D:
HD=fw(x)kwk2D
(denoting by kwk2the Hilbert-Schmidt norm in H), which allows to “kernelize” (Sch¨ol-
kopf et al., 1998) the resulting models and algorithms, that is, formulating them solely
in terms of inner products k(x, x0) := hφ(x), φ(x0)iin H.
107
II `p-norm Multiple Kernel Learning
Rademacher theory A useful quantity to theoretically analyze ERM is the Rademacher
complexity; let x1, . . . , xn X be an i.i.d. sample drawn from a probability distribution
P; then, the global Rademacher complexity of a hypothesis class His defined as
R(H) = Esup
fwHDw,1
n
n
X
i=1
σiφ(xi)E,
where (σi)1inis an i.i.d. family (independent of φ(xi)) of Rademacher variables
(random signs). Its empirical counterpart is denoted by
b
R(H) = ER(H)x1, . . . , xn=Eσsup
fwHDw,1
n
n
X
i=1
σiφ(xi)E.
The interest in the global Rademacher complexity comes from the fact that, if known, it
can be used to bound the generalization error (Koltchinskii, 2001; Bartlett and Mendel-
son, 2002):
Theorem A.2 (Bartlett and Mendelson, 2002, Theorems 8 and 12.4).Let l:R
Y [0,1] be a L-Lipschitz continuous loss function. With probability larger than 1δ
it holds uniformly for all classifiers fH
E[l(yf(x))] 1
n
n
X
i=1
l(yif(xi)) + 2LR(H) + s8 ln 2
δ
n.
The above theorem says that in order to obtain a generalization bound for an ERM-
based learner it suffices to bound the global Rademacher complexity. We illustrate this
for the support vector machine (SVM) as an example.
Example (SVM) An illustrative example for such a learning algorithm is the support
vector machine which finds a linear function (eventually in a higher-dimensional feature
space by using a kernel function) with an additional stability property: having a (soft)
margin to a predefined fraction of training data; this property allows to well deal with
outliers in the data (see Figure A.3). Formally the SVM is defined as follows:
inf
w,t
1
n
n
X
i=1
max(0,1yihw, φ(xi)i
s.t. kwk2D . (svm)
We recognize the SVM as an ERM algorithm using the hypothesis class H=f:
x7→ hw, φ(x)i | kwk Dand loss function l(t) = max(0,1t); however, for the
theoretical analysis it is more convenient to consider the truncated loss class
lsvm(t) = min(1,max(0,1t)) .
108
A Foundations
Figure A.3: Illustration of support vector machine: finding a linear hyperplane with maximum
margin.
Subsequently, we can use the Cauchy-Schwarz inequality to bound the global Rademacher
complexity of Has follows:
R(H) = Esup
fwHDw,1
n
n
X
i=1
σiφ(xi)E
C.-S.
Esup
fwHkwk1
n
n
X
i=1
σiφ(xi)
=ED1
n
n
X
i=1
σiφ(xi)
Jensen
EDv
u
u
tEσ
1
n2
n
X
i,j=1
σiσjk(xi, xj)
σii.i.d.
=EDv
u
u
t1
n2
n
X
i=1
k(xi, xi)
If we impose additional assumptions, for example, on the boundedness of the kernel,
or boundedness of its moments, then we can use the above Rademacher bound to-
gether with the previous theorem, directly leading to a generalization error bound. For
example, if the kernel is uniformly bound kkkB, then we have
E[lsvm(yf(x))] 1
n
n
X
i=1
lsvm(yif(xi)) + 2DrB
n+s8 ln 2
δ
n
because lsvm is 1-Lipschitz.
109
II `p-norm Multiple Kernel Learning
A.3 Convex Optimization
Aconvex optimization problem (Boyd and Vandenberghe, 2004) is one of the form
min
vRnf(v)
s.t. fi(v)0, i = 1, . . . , m
hi(v)=0, i = 1, . . . , l, (A.1)
where the functions h1, . . . , hl:Rn7→ Rare affine linear and f0, . . . , fm:Rn7→ Rare
convex, i.e., satisfy fi(αv+β˜
v)αfi(v) + βfi(˜
v) for all v,˜
vRnand all α, β R
with α, β = 1, α0, and β0.
There is no general analytical formula for the solution of such an optimization
problem, but there are very effective methods for solving them, for example, interior-
point methods and limited-memory quasi-Newton methods (Nocedal and Wright, 2006).
If we can formulate a problem as a convex optimization problem, then we have solved
the problem as we can usually solve it efficiently. In contrast, general, non-linear
optimization problems can be very challenging, even with ten variables and intractable
with a few hundreds of variables (Boyd and Vandenberghe, 2004, p. 8–9).
Lagrangian duality An important concept in optimization theory is the Lagrangian
duality. The basic idea is to take the constraints in (A.1) into account by augmenting
the objective function with a weighted sum of the constraint functions: we define
the Lagrangian L:Rn×Rm
+×Rlassociated with the problem (A.1) as (Boyd and
Vandenberghe, 2004, p. 215)
L(v,α,β) = f0(v) +
m
X
i=1
αifi(v) +
l
X
i=1
βihi.
We refer to αias the Lagrangian multiplier associated with the ith inequality constraint
and to βias the Lagrangian multiplier associated with the ith equality constraint. The
Lagrangian dual function is defined as the minimum value of the Lagrangian over v:
g(α,β) := inf
vRnL(v,α, β).(A.2)
It is evident that the Lagrangian dual function yields a lower bound for the optimal
value of the original problem (A.1); to see this, note that violated constraints con-
tribute positively in (A.2). It is therefore recommended to maximize the lower bound.
This raises the question “what is the best lower bound that can be obtained from the
Lagrangian dual function?”, leading to the optimization problem
min
α,βg(α,β)
s.t. αi0, i = 1, . . . , m,
110
B Relating Tikhonov and Ivanov Regularization
which is known as the dual problem associated with (A.1). In this context the original
problem is called primal problem. But how are the optima of the primal and dual
problems related? For convex problems, we have the nice result that both optimal
values are identical, under rather mild assumptions, which are known as constraint
qualifications (CQs). For example, Slater’s condition is such a CQ: there exists a point
vsuch that hi(v) = 0 and fi(v)<0 for all iactive in the optimum v(Boyd and
Vandenberghe, 2004, p. 225-226).
B Relating Tikhonov and Ivanov Regularization
In this section, we show a useful result that justifies switching from Tikhonov to Ivanov
regularization and vice versa, if the bound on the regularizing constraint is tight. It is
a key ingredient of the proof of Theorem 3.1. We state the result for arbitrary convex
functions, so that it can be applied beyond the multiple kernel learning framework of
this thesis.
Lemma B.1. Let DRdbe a convex set, let f, g :DRbe convex functions.
Consider the convex optimization tasks
min
vDf(v) + σg(v),(B.1a)
min
vD:g(v)τf(v).(B.1b)
Assume that the minima exist and that a constraint qualification holds in (B.1b), which
gives rise to strong duality, e.g., that Slater’s condition is satisfied. Furthermore, as-
sume that the constraint is active at the optimal point, i.e.
inf
vDf(v)<inf
vD:g(v)τf(v).(B.2)
Then we have the case that for each σ > 0there exists τ > 0—and vice versa—such
that the optimization problem (B.1a) is equivalent to (B.1b), i.e., each optimal solution
of one is an optimal solution of the other, and vice versa.
Proof
(a). Let be σ > 0 and vbe the optimal of (B.1a). We have to show that there exists
aτ > 0 such that vis optimal in (B.1b). We set τ=g(v). Suppose vis not optimal
in (B.1b), i.e., it exists ˜
vD:g(˜
v)τsuch that f(˜
v)< f(v). Then we have
f(˜
v) + σg(˜
v)< f(v) + στ,
which by τ=g(v) translates to
f(˜
v) + σg(˜
v)< f(v) + σg(v).
This contradicts the optimality of vin (B.1a), and hence proves that vis optimal in
(B.1b), which was to be shown.
111
II `p-norm Multiple Kernel Learning
(b). Vice versa, let τ > 0 be voptimal in (B.1b). The Lagrangian of (B.1b) is given
by
L(σ) = f(v) + σ(g(v)τ), σ 0.
By strong duality vis optimal in the saddle point problem
σ:= argmaxσ0min
vDf(v) + σ(g(v)τ),
and by the strong max-min property (cf. (Boyd and Vandenberghe, 2004), p. 238) we
may exchange the order of maximization and minimization. Hence vis optimal in
min
vDf(v) + σ(g(v)τ).(B.3)
Removing the constant term στ, and setting σ=σ, we have that vis optimal in
(B.1a), which was to be shown. Moreover by (B.2) we have that
v6= argminvDf(v),
and hence we see from Eq. (B.3) that σ>0, which completes the proof of the propo-
sition.
C Supplements to the Theoretical Analysis
The following result gives a block-structured version of older’s inequality (e.g., Steele,
2004).
Lemma C.1 (Block-structured H¨
older inequality).Let v= (v1,...,vM),w=
(w1,...,wM) H =H1×···×HM. Then, for any p1, it holds
hv,wi≤kvk2,pkwk2,p.
Proof By the Cauchy-Schwarz inequality (C.-S.), we have for all x, y H:
hv,wi=
M
X
m=1hvm,wmiC.-S.
M
X
m=1 kvmk2kwmk2
=(kv1k2,...,kvMk2),(kw1k2,...,kwMk2).
older
kvk2,pkwk2,p
Proof of Lemma 4.3 (Rosenthal + Young) It is clear that the result trivially
holds for 1
2p1 with cq= 1 by Jensen’s inequality . In the case p1, we apply
112
C Supplements to the Theoretical Analysis
Rosenthal’s inequality (Rosenthal, 1970) to the sequence X1, . . . , Xnthereby using the
optimal constants computed in Ibragimov and Sharakhmetov (2001), that are, cq= 2
(q2) and cq=EZq(q2), respectively, where Zis a random variable distributed
according to a Poisson law with parameter λ= 1. This yields
E1
n
n
X
i=1
Xiq
cqmax 1
nq
n
X
i=1
EXq
i, 1
n
n
X
i=1
Xi!q!.(C.1)
By using the fact that XiBholds almost surely, we could readily obtain a bound of
the form Bq
nq1on the first term. However, this is loose and for q= 1 does not converge
to zero when n . Therefore, we follow a different approach based on Young’s
inequality (e.g. Steele, 2004):
1
nq
n
X
i=1
EXq
iB
nq11
n
n
X
i=1
EXi
Young
1
qB
nq(q1)
+1
q 1
n
n
X
i=1
EXi!q
=1
qB
nq
+1
q 1
n
n
X
i=1
EXi!q
.
It thus follows from (C.1) that for all q1
2
E1
n
n
X
i=1
Xiq
cq B
nq+1
n
n
X
i=1
EXiq!,
where cqcan be taken as 2 (q2) and EZq(q2), respectively, where Zis Poisson-
distributed. In the subsequent Lemma C.2 we show EZq(q+e)q. Clearly, for q1
2
it holds q+eqe +eq = 2eq so that in any case cqmax(2,2eq)2eq, which
completes the proof.
We use the following Lemma, which gives a handle on the q-th moment of a Poisson-
distributed random variable and is used in the previous Lemma.
Lemma C.2. For the q-moment of a random variable Zdistributed according to a
Poisson law with parameter λ= 1, the following inequality holds for all q1:
EZqdef.
=1
e
X
k=0
kq
k!(q+e)q.
113
II `p-norm Multiple Kernel Learning
Proof We start by decomposing EZqas follows:
Eq=1
e
0 +
q
X
k=1
kq
k!+
X
k=q+1
kq
k!
=1
e
q
X
k=1
kq1
(k1)! +
X
k=q+1
kq
k!
1
e
qq+
X
k=q+1
kq
k!
(C.2)
(C.3)
Note that by Stirling’s approximation it holds k! = 2πeτkkk
eqwith 1
12k+1 < τk<1
12k
for all q. Thus
X
k=q+1
kq
k!=
X
k=q+1
1
2πeτkkekk(kq)
=
X
k=1
1
2πeτk+q(k+q)ek+qkk
=eq
X
k=1
1
2πeτk+q(k+q)e
kk
()
eq
X
k=1
1
2πeτkke
kk
Stirling
=eq
X
k=1
1
k!
=eq+1
where for () note that eτkkeτk+q(k+q) can be shown by some algebra using
1
12k+1 < τk<1
12k. Now by (C.2)
EZq=1
eqq+eq+1qq+eq(q+e)q,
which was to show.
Lemma C.3. For any a,bRm
+it holds for all q1
kakq+kbkq211
qka+bkq2ka+bkq.
114
D Cutting Plane Algorithm
Proof Let a= (a1, . . . , am) and b= (b1, . . . , bm). Because all components of a,bare
nonnegative, we have
i= 1, . . . , m :aq
i+bq
iai+biq
and thus
kakq
q+kbkq
q ka+bkq
q.(C.4)
We conclude by `q-to-`1conversion (see (4.15))
kakq+kbkq=kakq,kbkq1
(4.15)
211
qkakq,kbkqq
= 211
qkakq
q+kbkq
q1
q
(C.4)
211
qka+bkq,
which completes the proof.
D Cutting Plane Algorithm
In this section, we present an alternative optimization strategy for the case p[1,2],
based on the cutting plane method; to this end, we denote q=p/(2 p) (note that
thus q[1,] since p[1,2]). Our algorithm is based on the MKL optimization
problem II.4. We focus on the hinge loss, i.e., l(t, y) = max(0,1ty). Thus, employing
a variable substitution of the form αnew
i=αiyi, Problem II.4 translates into
min
θmax
α1>α1
2α>
M
X
m=1
θmQmα
s.t. 0αC1;y>α= 0; θ0; kθkq
q1,
where Qj=Y KjYfor 1 jmand Y= diag(y). The above optimization problem is
asaddle point problem and, as we have seen, can be solved by a block coordinate descent
algorithm, using an analytical formula. We take a different approach and translate the
min-max problem into an equivalent semi-infinite program (SIP) as follows. Denote
the value of the target function by t(α,θ) and suppose αis optimal. Then, according
to the max-min inequality (Boyd and Vandenberghe, 2004, p. 115), we have t(α,θ)
t(α,θ) for all αand θ. Hence, we can equivalently minimize an upper bound ηon the
optimal value and arrive at the following semi-infinite program:
min
ηη
s.t. α A :η1>α1
2α>
M
X
m=1
θmQmα; (SIP)
θ0;kθkq
q1,
115
II `p-norm Multiple Kernel Learning
Algorithm D.1 (Chunking CPM) Chunking-based `p-Norm MKL cutting plane training
algorithm. It simultaneously optimizes the variables αand the kernel weighting θ.The accuracy
parameter and the subproblem size Qare assumed to be given to the algorithm. For simplicity,
a few speed-up tricks are not shown, e.g., hot-starts of the SVM and the QCQP solver.
1: gm,i = 0, ˆgi= 0, αi= 0, η=−∞,θm=q
p1/M for m= 1, . . . , M and i= 1, . . . , n
2: for t= 1,2, . . . and while SVM and MKL optimality conditions are not satisfied do
3: Select Q suboptimal variables αi1, . . . , αiQbased on the gradient ˆ
gand α; store αold =α
4: Solve SVM dual with respect to the selected variables and update α
5: Update gradient gm,i gm,i +PQ
l=1(αilαold
il)yilkm(xil, xi) for all m= 1, . . . , M and
i= 1, . . . , n
6: for m= 1, . . . , M do
7: St
m=1
2Pigm,iαiyi
8: end for
9: Lt=Piαi,St=PmθmSt
m
10: if |1LtSt
η|
11: while MKL optimality conditions are not satisfied do
12: θold =θ
13: (θ, η)argmin η
14: w.r.t. θRM, η R
15: s.t. 0θ1,
16: q(q1)
2Pm(θold
m)q2θ2
mPmq(q2)(θold
m)q1θmq(3q)
2and
17: LrPmθmSr
mηfor r= 1, . . . , t
18: θθ/||θ||q
q
19: Remove inactive constraints
20: end while
21: end if
22: ˆgi=Pmθmgm,i for all i= 1, . . . , n
23: end for
where A=αRn|0αC1,y>α= 0.
Sonnenburg et al. (2006a) optimize the above SIP for p= 1 with interleaving
cutting plane algorithms. The solution of a quadratic program (here the regular SVM)
generates the most strongly violated constraint for the actual mixture θ. The optimal
(θ, η) is then identified by solving a linear program with respect to the set of active
constraints. The optimal mixture is then used for computing a new constraint and so
on.
Unfortunately, for q > 1, a non-linearity is introduced by requiring kθkq
q1 and
such constraint is unlikely to be found in standard optimization toolboxes that often
handle only linear and quadratic constraints. As a remedy, we propose to approximate
116
D Cutting Plane Algorithm
the constraint kθkq
q1 by a sequence of second-order Taylor expansions9
||θ||q
q ||˜
θ||q
q+q˜
θq1>θ˜
θ+q(q1)
2θ˜
θ>diag ˜
θq2θ˜
θ
= 1 + q(q3)
2
M
X
m=1
q(q2)(˜
θm)q1θm+q(q1)
2
M
X
m=1
˜
θq2
mθ2
m,
where θqis defined element-wise, that is θq:= (θq
1, . . . , θq
M). The sequence (θ0,θ1,···)
is initialized with a uniform mixture satisfying ||θ0||q
q= 1 as a starting point. Succes-
sively, θt+1 is computed using ˜
θ=θt. Note that the Hessian of the quadratic term
in the approximation is diagonal, strictly positive and very-well conditioned, for which
reason the resulting quadratically constrained problem can be solved efficiently. In
fact, since there is only one quadratic constraint, its complexity should rather be com-
pared to that of a considerably easier quadratic program. Moreover, in order to ensure
convergence, we enhance the resulting sequential quadratically constrained quadratic
programming by projection steps onto the boundary of the feasible set, as given in
Line 18. Finally, note that this approach can be further sped-up by additional level-set
projections in the θ-optimization phase similar to Xu et al. (2009). In our case, the
level-set projection is a convex quadratic problem with `p-norm constraints and can
again be approximated by a successive sequence of second-order Taylor expansions.
Algorithm D.1 outlines the interleaved α,θMKL training algorithm. Lines 3-5
are standard in chunking based SVM solvers and carried out by SVMlight. Lines 6-9
compute (parts of) SVM objective values for each kernel independently. Finally lines
11 to 19 solve a sequence of semi-infinite programs with the `p-norm constraint being
approximated as a sequence of second-order constraints. The algorithm terminates if
the maximal KKT violation (see Joachims, 1999) falls below a predetermined precision
εsvm and if the normalized maximal constraint violation |1LSt
η|< εmkl for the MKL.
The following proposition shows the convergence of the semi-infinite programming loop
in Algorithm D.1.
Proposition D.1. Let the kernel matrices K1, . . . , KMbe positive-definite and be q >
1. Suppose that the SVM computation is solved exactly in each iteration. Moreover,
suppose there exists an optimal limit point of nested sequence of QCCPs. Then the
sequence generated by Algorithm D.1 has at least one point of accumulation that solves
(P).
Proof By assumption, the SVM is solved to infinite precision in each MKL step,
which simplifies our analysis in that the numerical details in Algorithm D.1 can be
ignored. We conclude that the outer loop of Alg. D.1 amounts to a cutting-plane algo-
rithm for solving the semi-infinite program of Optimization Problem (SIP). It is well
9We also tried out first-order Taylor expansions, whereby our algorithm basically boils down the
renowned sequential quadratic programming, but, empirically, it turned out to be inferior. Intu-
itively, second-order expansions work best when the approximated function is almost quadratic, as
given in our case.
117
II `p-norm Multiple Kernel Learning
known (Sonnenburg et al., 2006a) that this algorithm converges, in the sense that there
exists at least one point of accumulation that solves the primal problem (P). E.g.,
this can be seen by viewing the cutting plane algorithm as a special instance of the
class of so-called exchange methods and subsequently applying Theorem 7.2 in Hettich
and Kortanek (1993). A difference to the analysis in Sonnenburg et al. (2006a) is the
`p>1-norm constraint in our algorithm. However, according to our assumption that the
nonlinear subprogram is solved correctly, a quick inspection of the preliminaries of the
latter theorem clearly reveals, that they remain fulfilled when introducing an `p-norm
constraint.
In order to complete our convergence analysis, it remains to show that the inner loop
(lines 11-18), that is, the sequence of QCQPs, converges against an optimal point.
Existing analyses of this so-called sequential quadratically constrained quadratic pro-
gramming (SQCQP) can be divided into two classes. First, one class establishes local
convergence, i.e., convergence in an open neighborhood of the optimal point, at a rate
of O(n2), under relatively mild smoothness and constraint qualification assumptions
(Fern´andez and Solodov, 2008; Anitescu, 2002), whereas Anitescu (2002) addition-
ally requires quadratic growth of the nonlinear constraints. Those analyses basically
guarantee local convergence the nested sequences of QCQPs in our `q-norm training
algorithm, for all q(1,) (Fern´andez and Solodov, 2008) and q2 (Anitescu, 2002),
respectively.
A second class of papers additionally establishes global convergence (e.g. Solodov,
2004; Fukushima et al., 2002), so they need more restrictive assumptions. Moreover,
in order to ensure feasibility of the subproblems when the actual iterate is too far
away from the true solution, a modification of the algorithmic protocol is needed. This
is usually dealt with by performing a subsequent line search and downweighting the
quadratic term by a multiplicative adaptive constant Di[0,1]. Unfortunately, the
latter involves a complicated procedure to tune Di(Fukushima et al., 2002, p. 7).
Employing the above modifications, the analysis in Fukushima et al. (2002) together
with our Prop. D.1 would guarantee the convergence of our Alg. D.1.
However, due to the special form of our SQCQP, we chose to discard the com-
fortable convergence guarantees and to proceed with a much more simple and efficient
strategy, which renders both the expensive line search and the tuning of the constant
Diunnecessary. The idea of our method is that the projection of θonto the boundary
of the feasible set, given by line 18 in Alg. D.1, can be performed analytically. This
projection ensures the feasibility of the QCQP subproblems. Note that in general, this
projection can be as expensive as performing a QCQP step, which is why, to the best of
our knowledge, projection-type algorithms for solving SQCQPs have not been studied
yet by the optimization literature.
Although the projection procedure is appealingly simple and—as we found out
empirically—seemingly shares nice convergence properties (the sequence of SQCQPs
converged optimally in all cases we tried, usually after 3-4 iterations), it unfortunately
prohibits exploitation of existing analyses for global convergence. However, the discus-
118
D Cutting Plane Algorithm
sions in Fukushima et al. (2002) and Solodov (2004) identify the reason of occasional
divergence of the vanilla SQCQP as the infeasibility of the subproblems. But in con-
trast, our projection algorithm always ensures the feasibility of the subproblem. We
therefore conjecture that, based on the superior empirical results and the discussions in
Fukushima et al. (2002) and Solodov (2004), our algorithm is designated to converge.
The theoretical analysis of this new class of SQCQP projection algorithms is beyond
the scope of this thesis.
119
Bibliography
T. Abeel, Y. V. de Peer, and Y. Saeys. Towards a gold standard for promoter prediction
evaluation. Bioinformatics, 2009.
J. Aflalo, A. Ben-Tal, C. Bhattacharyya, J. S. Nath, and S. Raman. Variable sparsity kernel
learning—algorithms and applications. Journal of Machine Learning Research, 12:565–592,
Feb 2011.
B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter. Molecular biology of the
cell. Garland Science, 4 edition, 2002. ISBN 0815332181.
M. Anitescu. A superlinearly convergent sequential quadratically constrained quadratic pro-
gramming algorithm for degenerate nonlinear programming. SIAM J. on Optimization, 12
(4):949–978, 2002.
F. Bach. Eccv 2008 tutorial on supervised learning, 2008a.
F. R. Bach. Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res.,
9:1179–1225, 2008b.
F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and
the SMO algorithm. In Proc. 21st ICML. ACM, 2004.
V. B. Bajic, S. L. Tan, Y. Suzuki, and S. Sugano. Promoter prediction analysis on the whole
human genome. Nature Biotechnology, 22(11):1467–1473, 2004.
P. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and struc-
tural results. Journal of Machine Learning Research, 3:463–482, Nov. 2002.
P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related
Fields, 135(3):311–334, 2006.
P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of
Statistics, 33(4):1497–1537, 2005.
D. Bertsekas. Nonlinear Programming, Second Edition. Athena Scientific, Belmont, MA, 1999.
A. Binder, K.-R. M¨uller, and M. Kawanabe. On taxonomies for multi-class image categorization.
International Journal of Computer Vision, January 2011.
K. Bleakley, G. Biau, and J.-P. Vert. Supervised reconstruction of biological networks with
local models. Bioinformatics, 23:i57–i65, 2007.
A. Bosch, A. Zisserman, and X. Mu˜noz. Representing shape with a spatial pyramid kernel. In
Proceedings of the 6th ACM international conference on Image and video retrieval (CIVR
’07), pages 401–408, 2007. ISBN 978-1-59593-733-9.
121
Bibliography
O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In
O. Bousquet, U. von Luxburg, and G. atsch, editors, Advanced Lectures on Machine Learn-
ing, volume 3176 of Lecture Notes in Computer Science, pages 169–207. Springer Berlin /
Heidelberg, 2004.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambrigde University Press, Cambridge,
UK, 2004.
C. Campbell and Y. Ying. Learning with Support Vector Machines. Synthesis Lectures on
Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2011.
J. Canny. A computational approach to edge detection. IEEE Trans. on Pattern Analysis and
Machine Intelligence, 8:679–714, 1986.
O. Chapelle and A. Rakotomamonjy. Second order optimization of kernel parameters. In Proc.
of the NIPS Workshop on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.
O. Chapelle, B. Scolkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cam-
bridge, MA, 2006. URL http://www.kyb.tuebingen.mpg.de/ssl-book.
C. Cortes. Invited talk: Can learning kernels help performance? In Proceedings of the 26th An-
nual International Conference on Machine Learning, ICML ’09, pages 1:1–1:1, New York, NY,
USA, 2009. ACM. ISBN 978-1-60558-516-1. Video http://videolectures.net/icml09_
cortes_clkh/.
C. Cortes, A. Gretton, G. Lanckriet, M. Mohri, A. Rostamizadeh, and editors, editors. Proceed-
ings of the NIPS 2008 workshop on kernel learning: automatic selection of optimal kernels,
2008. URL http://www.cs.nyu.edu/learning_kernels, Video http://videolectures.
net/lkasok08_whistler/.
C. Cortes, M. Mohri, and A. Rostamizadeh. L2 regularization for learning kernels. In Proceed-
ings of the International Conference on Uncertainty in Artificial Intelligence, 2009a.
C. Cortes, M. Mohri, and A. Rostamizadeh. Learning non-linear combinations of kernels. In
Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances
in Neural Information Processing Systems 22, pages 396–404, 2009b.
C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. In
Proceedings, 27th ICML, 2010a.
C. Cortes, M. Mohri, and A. Rostamizadeh. Two-stage learning kernel algorithms. In Proceed-
ings of the 27th Conference on Machine Learning (ICML 2010), 2010b.
N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In
T. Fawcett and N. Mishra, editors, Advances in Neural Information Processing Systems,
volume 14, pages 367–373, 2002.
G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In
Workshop on Statistical Learning in Computer Vision, ECCV, volume 2, pages 1–22, Prague,
Czech Republic, May 2004.
N. Dalal and B. Triggs. Histograms of oriented gradientsfor human detection. In IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition, volume 1, pages
886–893, San Diego, USA, June 2005.
122
Bibliography
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PAS-
CAL Visual Object Classes Challenge 2008 (VOC2008) Results, 2008. URL http://www.
pascal-network.org/challenges/VOC/voc2008/workshop/index.html.
T. Evgeniou and M. Pontil. Regularized multi–task learning. In W. Kim, R. Kohavi, J. Gehrke,
and W. DuMouchel, editors, KDD, pages 109–117. ACM, 2004. ISBN 1-58113-888-1.
R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin. LIBLINEAR: A library for large linear
classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
D. Fern´andez and M. Solodov. On local convergence of sequential quadratically-constrained
quadratic-programming type methods, with an extension to variational problems. Comput.
Optim. Appl., 39(2):143–160, 2008.
V. Franc and S. Sonnenburg. OCAS optimized cutting plane algorithm for support vector
machines. In Proceedings of the 25nd International Machine Learning Conference. ACM
Press, 2008.
M. Fukushima, Z.-Q. Luo, and P. Tseng. A sequential quadratically constrained quadratic
programming method for differentiable convex minimization. SIAM J. on Optimization, 13
(4):1098–1119, 2002.
P. Gehler and S. Nowozin. Infinite kernel learning. In Proceedings of the NIPS 2008 Workshop
on Kernel Learning: Automatic Selection of Optimal Kernels, 2008.
P. V. Gehler and S. Nowozin. On feature combination for multiclass object classification. In
ICCV, pages 221–228, 2009.
G. Golub and C. van Loan. Matrix Computations. John Hopkins University Press, Baltimore,
London, 3rd edition, 1996.
N. ornitz, M. Kloft, and U. Brefeld. Active and semi-supervised data domain description.
In W. L. Buntine, M. Grobelnik, D. Mladenic, and J. Shawe-Taylor, editors, Proceedings
of the European Conference on Machine Learning and Knowledge Discovery in Databases
(ECML/PKDD), pages 407–422, 2009a.
N. ornitz, M. Kloft, K. Rieck, and U. Brefeld. Active learning for network intrusion detection.
In Proc. of CCS Workshop on Security and Artificial Intelligence (AISEC), pages 47–54, New
York, NY, USA, 2009b. ACM.
N. ornitz, M. Kloft, K. Rieck, and U. Brefeld. Active and semi-supervised domain descriptions
in theory and practice. Journal of Artificial Intelligence Research (JAIR), 2010. submitted
07/2010.
R. Hettich and K. O. Kortanek. Semi-infinite programming: theory, methods, and applications.
SIAM Rev., 35(3):380–429, 1993.
R. Ibragimov and S. Sharakhmetov. The best constant in the rosenthal inequality for non-
negative random variables. Statistics & Probability Letters, 55(4):367 376, 2001. ISSN
0167-7152.
R. Jenssen, M. Kloft, A. Zien, S. Sonnenburg, and K.-R. M¨uller. A multi-class support vector
machine based on scatter criteria. Technical Report 014-2009, Technische Universit¨at Berlin,
Jul 2009. Revised version submitted to Pattern Recognition, Apr 2011.
123
Bibliography
T. Joachims. Making large–scale SVM learning practical. In B. Scolkopf, C. Burges, and
A. Smola, editors, Advances in Kernel Methods Support Vector Learning, pages 169–184,
Cambridge, MA, 1999. MIT Press.
J.-P. Kahane. Some random series of functions. Cambridge University Press, 2nd edition, 1985.
M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. The KEGG resource for
deciphering the genome. Nucleic Acids Res, 32:D277–D280, 2004.
G. Kimeldorf and G. Wahba. Some results on tchebycheffian spline functions. J. Math. Anal. Ap-
plic., 33:82–95, 1971.
M. Kloft and G. Blanchard. The local rademacher complexity of lp-norm multiple kernel
learning. Technical report, Arxiv preprint 1103.0790v1, Mar 2011. URL http://arxiv.org/
abs/1103.0790v1. Full version submitted to Journal of Machine Learning Research, Mar
2011. Short Version submitted to Twenty-Fifth Annual Conference on Neural Information
Processing Systems (NIPS 2011).
M. Kloft and P. Laskov. Online anomaly detection under adversarial impact. In Y. W. Teh
and M. Titterington, editors, JMLR Workshop and Conference Proceedings, Volume 9: AIS-
TATS, pages 405–412. MIT Press, 2010a.
M. Kloft and P. Laskov. Security analysis of online centroid anomaly detection. Technical Re-
port UCB/EECS-2010-22, EECS Department, University of California, Berkeley, Feb 2010b.
URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-22.html. CoRR
abs/1003.0078. Revised version submitted to Journal of Machine Learning Resarch, Feb 2011.
M. Kloft, U. Brefeld, P. D¨ussel, C. Gehl, and P. Laskov. Automatic feature selection for anomaly
detection. In D. Balfanz and J. Staddon, editors, AISec, pages 71–76. ACM, 2008a.
M. Kloft, U. Brefeld, P. Laskov, and S. Sonnenburg. Non-sparse multiple kernel learning. In
Proc. of the NIPS Workshop on Kernel Learning: Automatic Selection of Kernels, Dec 2008b.
URL http://www.cs.nyu.edu/learning_kernels/abstracts/ws_mkl.pdf.
M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. M¨uller, and A. Zien. Efficient and
accurate lp-norm multiple kernel learning. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I.
Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22,
pages 997–1005. MIT Press, 2009a.
M. Kloft, S. Nakajima, and U. Brefeld. Feature selection for density level-sets. In W. L. Bun-
tine, M. Grobelnik, D. Mladenic, and J. Shawe-Taylor, editors, Proceedings of the European
Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD),
pages 692–704, 2009b.
M. Kloft, U. R¨uckert, and P. L. Bartlett. A unifying view of multiple kernel learning. In
Proceedings of the European Conference on Machine Learning and Knowledge Discovery in
Databases (ECML/PKDD), pages 66–81, 2010a.
M. Kloft, U. R¨uckert, C. S. Ong, A. Rakotomamonjy, S. Sonnenburg, and F. Bach, editors.
Proceedings of the NIPS 2010 workshop on new directions in multiple kernel learning, 2010b.
URL http://doc.ml.tu-berlin.de/mkl_workshop/, Video http://videolectures.net/
nipsworkshops2010_kernel_learning/.
124
Bibliography
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning. Journal
of Machine Learning Research, 12:953–997, Mar 2011.
V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions
on Information Theory, 47(5):1902–1914, 2001.
V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization.
Annals of Statistics, 34(6):2593–2656, 2006.
V. Koltchinskii. Sparsity in penalized empirical risk minimization. Annales de l’Institut Henri
Poincar´e, Probabilit´es et Statistiques, 45(1):7–57, 2009.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the general-
ization error of combined classifiers. Annals of Statistics, 30:1–50, 2002.
V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. Annals of Statistics, 38(6):
3660–3695, 2010.
S. Kwapi´en and W. A. Woyczy´nski. Random Series and Stochastic Integrals: Single and Mul-
tiple. Birkh¨auser, Basel and Boston, M.A., 1992.
C. Lampert and M. Blaschko. A multiple kernel learning approach to joint multi-class object
detection. In DAGM, pages 31–40, 2008.
G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel
matrix with semi-definite programming. JMLR, 5:27–72, 2004a.
G. Lanckriet, F. Bach, N. Srebro, and B. McFee, editors. Proceedings of the NIPS 2009 Work-
shop on Understanding Multiple Kernel Learning Methods, 2009. URL http://mkl.ucsd.
edu/workshop.
G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical
framework for genomic data fusion. Bioinformatics, 20:2626–2635, November 2004b. ISSN
1367-4803.
P. Laskov and M. Kloft. A framework for quantitative security analysis of machine learning. In
D. Balfanz and J. Staddon, editors, Proceedings of the 2nd ACM Workshop on Security and
Artificial Intelligence (AISEC), pages 1–4, 2009.
S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching
for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition (CVPR ’06), volume 2, pages 2169–
2178, New York, USA, 2006.
D. Liu and J. Nocedal. On the limited memory method for large scale optimization. Mathe-
matical Programming B, 45(3):503–528, 1989.
D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal
of Computer Vision, 60(2):91–110, 2004.
M. Markou and S. Singh. Novelty detection: a review part 1: statistical approaches. Signal
Processing, 83:2481–2497, 2003a.
125
Bibliography
M. Markou and S. Singh. Novelty detection: a review part 2: neural network based approaches.
Signal Processing, 83:2499–2521, 2003b.
M. Marszalek and C. Schmid. Learning representations for visual object class recognition. In
Proceedings of the PASCAL Visual Object Classes Challenge 2007 (VOC2007), 2007.
A. F. Martins, M. A. Figueiredo, P. M. Aguiar, N. A. Smith, and E. P. Xing. Online learning of
structured predictors with multiple kernels. In JMLR Workshop and Conference Proceedings:
AISTATS 2011. MIT Press, 2011. To appear.
S. Mendelson. On the performance of kernel classes. J. Mach. Learn. Res., 4:759–771, December
2003.
K.-R. M¨uller, S. Mika, G. atsch, K. Tsuda, and B. Scolkopf. An introduction to kernel-based
learning algorithms. IEEE Neural Networks, 12(2):181–201, May 2001.
S. Nakajima, A. Binder, C. M¨uller, W. Wojcikiewicz, M. Kloft, U. Brefeld, K.-R. M¨uller, and
M. Kawanabe. Multiple kernel learning for object classification. In Proceedings of the 12th
Workshop on Information-based Induction Sciences, 2009a.
S. Nakajima, A. Binder, C. M¨uller, W. Wojcikiewicz, M. Kloft, U. Brefeld, K.-R. M¨uller, and
M. Kawanabe. Multiple kernel learning for object classification. In Proceedings of the 12th
Workshop on Information-based Induction Sciences, 2009b.
S. Nakajima, A. Binder, C. M¨uller, W. Wojcikiewicz, M. Kloft, U. Brefeld, K.-R. M¨uller, and
M. Kawanabe. Multiple kernel learning for object classification. IEEE Transactions on
Pattern Analysis and Machine Intelligence (TPAMI), 2010. submitted 09/2010.
W. S. Noble. Multi-kernel learning for biology. In Proceedings of the NIPS Workshop on Kernel
Learning: Automatic Selection of Optimal Kernels, 2008. URL http://videolectures.net/
lkasok08_noble_mklfb/.
J. Nocedal and S. J. Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.
G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional
multivariate regression. Ann. Stat., 39(1):1–47, 2011.
C. S. Ong and A. Zien. An Automated Combination of Kernels for Predicting Protein Subcel-
lular Localization. In Proc. of the 8th Workshop on Algorithms in Bioinformatics, 2008.
C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal
of Machine Learning Research, 6:1043–1071, 2005.
F. Orabona and L. Jie. Ultra-fast optimization algorithm for sparse multi kernel learning. In
Proceedings of the 28th Annual International Conference on Machine Learning, ICML ’11,
2011. To appear.
F. Orabona, J. Luo, and B. Caputo. Online-batch strongly convex multi kernel learning. In
CVPR, pages 787–794. IEEE, 2010.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine
Learning Research, 9:2491–2521, 2008.
126
Bibliography
G. Raskutti, M. J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models
over kernel classes via convex programming. CoRR, abs/1008.3654, 2010.
R. M. Rifkin and R. A. Lippert. Value regularization and Fenchel duality. J. Mach. Learn.
Res., 8:441–479, 2007.
H. Rosenthal. On the subspaces of Lp(p > 2) spanned by sequences of independent random
variables. Israel J. Math., 8:273–303, 1970.
V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions
and efficient algorithms. In Proceedings of the Twenty-Fifth International Conference on
Machine Learning (ICML 2008), volume 307, pages 848–855. ACM, 2008.
U. R¨uckert and M. Kloft. Transfer learning with adaptive regularizers. In Proceedings
of the European Conference on Machine Learning and Knowledge Discovery in Databases
(ECML/PKDD), 2011.
B. Scolkopf, A. Smola, and K.-R. M¨uller. Nonlinear component analysis as a kernel eigenvalue
problem. Neural Computation, 10:1299–1319, 1998.
B. Scolkopf, S. Mika, C. Burges, P. Knirsch, K.-R. M¨uller, G. atsch, and A. Smola. Input
space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10
(5):1000–1017, September 1999.
B. Scolkopf, J. Platt, J. Shawe-Taylor, A. Smola, and R. Williamson. Estimating the support
of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
B. Scolkopf, K. Tsuda, and J. P. Vert, editors. Kernel Methods in Computational Biology.
MIT Press, 2004.
J. R. Searle. Minds, brains, and programs. Behavioral and Brain Sciences, 3(03):417–424, 1980.
J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University
Press, 2004.
M. Sion. On general minimax theorems. Pacific J. Math., 8(1):171–176, 1958.
M. V. Solodov. On the sequential quadratically constrained quadratic programming methods.
Math. Oper. Res., 29(1):64–79, 2004.
S. Sonnenburg. Machine Learning for Genomic Sequence Analysis. PhD thesis, Fraunhofer
Institute FIRST, December 2008.
S. Sonnenburg, G. atsch, C. Scafer, and B. Scolkopf. Large scale multiple kernel learning.
Journal of Machine Learning Research, 7:1531–1565, July 2006a.
S. Sonnenburg, A. Zien, and G. atsch. Arts: Accurate recognition of transcription starts in
human. Bioinformatics, 22(14):e472–e480, 2006b.
S. Sonnenburg, G. atsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder,
C. Gehl, and V. Franc. The SHOGUN Machine Learning Toolbox. Journal of Machine
Learning Research, 2010.
127
Bibliography
J. M. Steele. The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical
Inequalities. Cambridge University Press, New York, NY, USA, 2004. ISBN 052154677X.
Y. Suzuki, R. Yamashita, K. Nakai, and S. Sugano. dbTSS: Database of human transcriptional
start sites and full-length cDNAs. Nucleic Acids Research, 30(1):328–331, 2002.
M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy. Composite kernel learning. Mach.
Learn., 79(1-2):73–103, 2010.
A. Tsybakov. Optimal rates of aggregation. In B. Sch¨olkopf and M. Warmuth, editors, Compu-
tational Learning Theory and Kernel Machines (COLT-2003), volume 2777 of Lecture Notes
in Artificial Intelligence, pages 303–313. Springer, 2003.
V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, Sept. 1998.
M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. In Proceedings
of the 26th Annual International Conference on Machine Learning (ICML), pages 1065–1072,
New York, NY, USA, 2009. ACM.
M. Varma and D. Ray. Learning the discriminative power-invariance trade-off. In IEEE 11th
International Conference on Computer Vision (ICCV), pages 1–8, 2007.
U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416,
2007.
C. Widmer, N. Toussaint, Y. Altun, and G. Ratsch. Inferring latent task structure for multitask
learning by multiple kernel learning. BMC Bioinformatics, 11(Suppl 8):S5, 2010.
R. C. Williamson, A. J. Smola, and B. Sch¨olkopf. Generalization performance of regularization
networks and support vector machines via entropy numbers of compact operators. IEEE
Transactions on Information Theory, 47(6):2516–2532, 2001.
Z. Xu, R. Jin, I. King, and M. Lyu. An extended level method for efficient multiple kernel
learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in
Neural Information Processing Systems 21, pages 1825–1832, 2009.
Y. Yamanishi, , J.-P. Vert, and M. Kanehisa. Supervised enzyme network inference from the
integration of genomic data and chemical information. Bioinformatics, 21:i468–i477, 2005.
F. Yan, K. Mikolajczyk, M. Barnard, H. Cai, and J. Kittler. Lp norm multiple kernel fisher
discriminant analysis for object and image categorisation. In IEEE International Conference
on Computer Vision and Pattern Recognition, 2010.
S. Yu, T. Falck, A. Daemen, L.-C. Tranchevent, J. Suykens, B. De Moor, and Y. Moreau.
L2-norm multiple kernel learning and its application to biomedical data fusion. BMC Bioin-
formatics, 11(1):309, 2010. ISSN 1471-2105.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
J. Zhang, M. Marszalek, S.Lazebnik, and C. Schmid. Local features and kernels for classification
of texture and object categories: A comprehensive study. International Journal of Computer
Vision, 73(2):213–238, 2007.
128
Bibliography
A. Zien and C. S. Ong. Multiclass multiple kernel learning. In Proceedings of the 24th interna-
tional conference on Machine learning (ICML), pages 1191–1198. ACM, 2007.
129
Curriculum Vitae
Personal Information
Name Marius Kloft
Date of birth July 16, 1980
Place of birth Dernbach, Rhineland-Palatinate, Germany
Citizenship German
Education and Employment
since 01/2007 PhD studies in Computer Science at the Machine Learning Laboratory, Depart-
ment of EECS, Berlin Institute of Technology (TU Berlin). Advisor: Prof. Dr.
Klaus-Robert M¨uller.
since 10/2010 PhD scholarship by the Berlin Institute of Technology (TU Berlin).
10/2009–09/2010 Research stay at the Learning Theory Group, EECS department, University of
California, Berkeley, USA. Founded by a scholarship of the German Foreign
Exchange Service (DAAD). Advisor: Prof. Dr. Peter L. Bartlett.
06/2007–09/2009 Employment in the BMBF project Real-time Machine Learning for Intrusion
Detection (REMIND) in collaboration with the Fraunhofer FIRST Institute,
Berlin. Supervisor: Pavel Laskov, PhD.
01/2007–04/2007 Employment in the BMBF project FASOR.
07/2006 Diploma in mathematics (M.Sc. equivalent) with GPA 1.0 (1.0 is max).
Diploma thesis in algebraic geometry on Stable Base Loci and the Theorem of
Zariski-Fujita. Advisor: Prof. Dr. Thomas Bauer.
10/2001–07/2006 Studies in mathematic with minor computer science at the University of Mar-
burg, Germany.
Majors: Algebraic geometry, algebra, complex geometry, probability theory,
universal algebra.
10/2000–09/2001 Studies in physics at the University of Marburg, Germany.
09/2000 Internship at Kl¨ockner Pentaplast of America, Gordonsville, VA, USA.
07/2000 Award by the German physical society for achievements in the high school
major physics.
08/1989–07/2000 Mons-Tabor Gymnasium Montabaur (high school)
131
Publications
Publications related to this thesis are marked with , others with .
Preprints (submitted)
D. Deb, J. assig, and M. Kloft. Effects of Seedling Age and Land Type on Grain Yield in
the System of Rice Intensification. Submitted to Experimental Agriculture, Jun 2011. [27]
M. Kloft and G. Blanchard. The Local Rademacher Complexity of `p-Norm Multiple Kernel
Learning. ArXiv preprint 1103.0790, Mar 2011. Submitted to Journal of Machine Learning
Research (JMLR). Status: Accepted pending minor revision.[26]
R. Jenssen, M. Kloft, A. Zien, S. Sonnenburg, and K.-R. M¨uller. A Multi-Class Support
Vector Machine based on Scatter Criteria. Technical Report 14/2009, Technische Universit¨at
Berlin, Jun 2009. Submitted to Pattern Recognition. Status: revision. Revision submitted,
Apr 2011. [25]
M. Kloft and P. Laskov. Security Analysis of Online Centroid Anomaly Detection. Technical
Report UCB/EECS-2010-22, University of California, Berkeley, Feb 2010. Submitted to Jour-
nal of Machine Learning Research (JMLR). Status: Accepted pending minor revision.[24]
A. Binder, S. Nakajima, M. Kloft, C. M¨uller, W. Wojcikiewicz, U. Brefeld, K.-R. uller,
and M. Kawanabe. Classifying Visual Objects with many Kernels. Submitted to Transactions
of Pattern Analysis and Machine Intelligence (TPAMI). Status: major revision. Revision in
preparation. [23]
N. ornitz, M. Kloft, K. Rieck, and U. Brefeld. From Unsupervised towards Supervised
Anomaly Detection. Submitted to Journal of Artificial Intelligence Research (JAIR). Status:
major revision. Revision in preparation. [22]
S. Sonnenburg, C. S. Ong, F. Bach, M. Kloft, U. R¨uckert, and A. Rakotomamonjy. Intro-
duction to the JMLR Special Topic on Metric and Kernel Learning. Journal of Machine
Learning Research (JMLR). Editorial paper. Manuscript in preparation. [21]
2011
M. Kloft and G. Blanchard. The Local Rademacher Complexity of `p-Norm Multiple Kernel
Learning. Advances in Neural Information Processing Systems 24 (NIPS), to appear. [20]
133
A. Binder, S. Samek, M. Kloft, C. M¨uller, K.-R. M¨uller, and M. Kawanabe. The joint
submission of the TU Berlin and Fraunhofer FIRST (TUBFI) to the ImageCLEF2011 Photo
Annotation Task. Proceedings of the 12th Workshop of the Cross-Language Evaluation Forum
(CLEF 2011), (to appear) 2011. [19]
R. Jenssen, M. Kloft, A. Zien, S. Sonnenburg, and K.-R. uller. A New Scatter-Based Multi-
Class Support Vector Machine. Proceedings of the IEEE International Workshop on Machine
Learning for Signal Processing (MLSP), (to appear) 2011. [18]
U. R¨uckert and M. Kloft. Transfer Learning with Adaptive Regularizers. Proceedings of
the European Conference on Machine Learning (ECML), (to appear) 2011. [17]
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. `p-Norm Multiple Kernel Learning.
Journal of Machine Learning Research (JMLR), volume 12, pages 953-997, 2011. [16]
2010
M. Kloft, U. R¨uckert, C. S. Ong, A. Rakotomamonjy, S. Sonnenburg, and F. Bach,
editors, Proceedings of the NIPS 2010 Workshop on New Directions in Multiple Kernel Learn-
ing, 2010. [15]
M. Kloft, U. R¨uckert, and P. L. Bartlett. A Unifying View of Multiple Kernel Learning.
In Proceedings of the European Conference on Machine Learning (ECML), pages 66–81, 2010.
[14]
M. Kloft and P. Laskov. Online Anomaly Detection under Adversarial Impact. In JMLR
Workshop and Conference Proceedings 9 (AISTATS 2010), pages 405-412. MIT Press, 2010.
[13]
2009
M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Comparing Sparse and Non-sparse
Multiple Kernel Learning. Proceedings of the NIPS 2009 Workshop on Understanding Multiple
Kernel Learning Methods, 2009. [12]
A. Binder, M. Kawanabe, M. Kloft, and S. Nakajima. Enhancing Image Annotation with
Primitive Color Histograms via Non-sparse Multiple Kernel Learning. In Proceedings of the
NIPS 2009 Workshop on Understanding Multiple Kernel Learning Methods, 2009. [11]
M. Kloft, U. Brefeld, S. Sonnenburg, P. Laskov, K.-R. M¨uller, and A. Zien. Efficient and
Accurate `p-norm Multiple Kernel Learning. In Advances in Neural Information Processing
Systems 22 (NIPS), pages 997-1005, 2009.10 [10]
1054 citations, http://scholar.google.com, June 19, 2011.
134
S. Nakajima, A. Binder, C. M¨uller, W. Wojcikiewicz, M. Kloft, U. Brefeld, K.-R. M¨uller,
and M. Kawanabe. Multiple Kernel Learning for Object Classification. In Proceedings of the
12th Workshop on Information-based Induction Sciences (IBIS), 2009. [9]
N. ornitz, M. Kloft, K. Rieck, and U. Brefeld. Active Learning for Network Intrusion
Detection. In Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence
(AISec), pages 47-54. ACM, 2009. [8]
P. Laskov and M. Kloft. A Framework for Quantitative Security Analysis of Machine
Learning. In Proceedings of the 2nd ACM Workshop on Security and Artificial Intelligence
(AISec), pages 1-4. ACM, 2009. [7]
N. ornitz, M. Kloft, and U. Brefeld. Active and Semi-supervised Data Domain Description.
In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in
Databases (ECML/PKDD), pages 407-422, 2009. [6]
M. Kloft, S. Nakajima, and U. Brefeld. Feature selection for density level-sets. In Proceed-
ings of the European Conference on Machine Learning and Knowledge Discovery in Databases
(ECML/PKDD), pages 692-704, 2009. [5]
M. Kloft, U. Brefeld, S. Sonnenburg, A. Zien, P. Laskov, and K.-R. M¨uller. Learning Non-
sparse Kernel Mixtures. In Proceedings of the PASCAL2 Workshop on Sparsity in Machine
Learning and Statistics, 2009. [4]
2008
M. Kloft, U. Brefeld, P. Laskov, and S. Sonnenburg. Non-sparse Multiple Kernel Learning.
In Proceedings of the NIPS 2008 Workshop on Kernel Learning: Automatic Selection of Optimal
Kernels, 2008. [3]
M. Kloft, U. Brefeld, P. D¨ussel, C. Gehl, and P. Laskov. Automatic feature selection
for anomaly detection. In Proceedings of the 1st ACM Workshop on Security and Artificial
Intelligence (AISec), pages 71-76. ACM, 2008. [2]
M. Kloft and P. Laskov. A Poisoning Attack Against Online Anomaly Detection. In
Proceedings of the NIPS 2007 Workshop on Machine Learning in Adversarial Environments,
2008. [1]
135