Latent Variable Augmentation for
Approximate Bayesian Inference
Applications for Gaussian Processes
submitted by
M. Sc.
Théo Galy-Fajou
ORCID: 0000-0002-3528-3536
to Faculty IV - Electrical Engineering and Computer Science
of the Technische Universität Berlin
for the award of the academic degree
Doctor of Natural Sciences
-Dr. rer. nat.-
approved dissertation
Doctoral committee:
Chair: Prof. Dr. Marc Toussaint
Reviewer: Prof. Dr. Manfred Opper
Reviewer: Dr. Mark van der Wilk
Reviewer: Dr. Arno Solin
Date of the scientific defense: 7 July 2022
Berlin 2023
Zusammenfassung
Inference in probabilistic models can be challenging even for seemingly simple problems.
When working with non-conjugate Bayesian models, we need approximation methods such as
variational inference or sampling, each of which has its pitfalls and limits. For example,
heavy-tailed distributions are a challenge for sampling methods, and strongly correlated
variables quickly become a bottleneck for many inference algorithms. Instead of developing yet
another state-of-the-art sampler or optimizer, we focus on reinterpreting models so that
standard inference algorithms such as blocked Gibbs sampling, which are normally restricted to
more trivial models, become the best choice. In the first part, we derive model augmentations
for various Gaussian process models, such as classification and multi-class classification. We
focus on the effects on inference and develop a generalization for a particular class of
likelihoods. We show that the augmentations scale with the data and outperform all existing
methods in terms of speed and stability. The second part focuses on approximations based on a
Gaussian variational distribution. We show that by parametrizing the Gaussian distribution
through a set of particles instead of its parameters, we can avoid expensive computations,
increase the flexibility of the model, and prove theoretical convergence bounds. In addition to
the published papers, we discuss the impact of these different augmentations, including their
limitations. We also give an outlook on new research directions, including concrete advances.
In particular, we show ways to compensate for the problems raised in the presented papers, and
we introduce new augmentation models and new inference approaches that are compatible with
augmented models.
Abstract
Performing inference on probabilistic models can represent a challenge even in seemingly
simple problems. When working with non-conjugate Bayesian models, we need approximate
methods such as variational inference or sampling, each with its pitfalls and limits. For instance,
heavy-tailed distributions represent a challenge for sampling methods, and strongly correlated
variables quickly become a bottleneck for many inference algorithms. Instead of developing yet
another new state-of-the-art sampler or optimizer, we focus on reinterpreting models such that
standard inference algorithms like blocked Gibbs sampling, usually restricted to more trivial
models, become the best choice. In the first part, we derive model augmentations for different
Gaussian Process models such as classification and multi-class classification. We focus on the
effects on inference and develop a generalization for a given class of likelihoods. We show that
augmentations are scalable with data and outperform all existing methods in terms of speed
and stability. The second part focuses on approximations based on a Gaussian variational
distribution. We show that by parametrizing the Gaussian distribution by a set of particles
instead of its parameters, we avoid expensive computations, increase the model flexibility, and
prove theoretical convergence bounds. In addition to the published papers, we discuss the
impact of these different augmentations, including their limitations. We also outline new
research directions, including concrete advances. In particular, we present ways to
compensate for issues raised in the presented papers and introduce new augmentation models
and new inference approaches compatible with augmented models.
Dedicated to Manou.
Acknowledgements
I would like to thank Ena for her unconditional love and support since the beginning, and
especially her help to not lose myself in work.
Professor Opper for sharing his immense wisdom and knowledge, bearing with my stubbornness
and for believing in me.
My parents for supporting me in everything I have ever started.
"Les filous", for keeping me entertained at all times.
My main co-author and tutor Florian who taught me so much before and during my Ph.D.
The Julia community, from whom I learned so much and for their unfailing help during
hard programming times.
And of course all the people I shared lunch and good times with at the university.
Table of Contents
Title Page
Zusammenfassung
Abstract
1 Introduction
  1.1 Bayesian Machine Learning
  1.2 The underestimated power of representations choices
  1.3 Gaussian Processes
  1.4 Open-source projects
  1.5 Thesis Outline
2 Background
  2.1 Probabilistic Bayesian Modeling
    2.1.1 Posterior computations
  2.2 Gaussian Processes
    2.2.1 Gaussian Process Regression
    2.2.2 Non-Conjugate Gaussian Processes
    2.2.3 Sparse Gaussian Processes
  2.3 Approximate Bayesian Inference
    2.3.1 Sampling
    2.3.2 Variational Inference
    2.3.3 Scale mixtures and conditionally conjugate likelihoods
3 Efficient Gaussian Process Classification Using Pólya-Gamma Data Augmentation
4 Multi-Class Gaussian Process Classification Made Conjugate: Efficient Inference via Data Augmentation
5 Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models
6 Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation
7 Discussions and extensions
  7.1 Further generalizations and understanding
  7.2 Double bounds for intricate latent GPs
    7.2.1 Heteroscedastic Gaussian Likelihood
    7.2.2 Heteroscedastic Non-Gaussian Likelihood
  7.3 Using Hamilton Monte Carlo on the augmented model
  7.4 Improvements on the Multi-Class Classification
    7.4.1 Marginalizing out variables
    7.4.2 A new model for the multi-class classification
    7.4.3 Scaling the logistic-softmax link
  7.5 Sampling from a sparse augmented model
  7.6 Limitations
8 Conclusion
References
Appendix A Additional work
  A.1 Adaptive Inducing Points Selection for Gaussian Processes
Acronyms
GP Gaussian Process
GPs Gaussian Processes
MCMC Markov Chain Monte Carlo
VI Variational Inference
VFE Variational Free Energy
ELBO Evidence Lower BOund
KL Kullback-Leibler
MF Mean-Field
BMF Blocked Mean-Field
CAVI Coordinate Ascent Variational Inference
HMC Hamiltonian Monte Carlo
MH Metropolis-Hastings
ML Machine Learning
VGA Variational Gaussian Approximation
MGF Moment Generating Function
pdf probability density function
iid independent and identically distributed
NUTS No-U-turn sampling
ABI Approximate Bayesian Inference
1 Introduction
Machine Learning (ML) is a wide field of research with plenty of successful applications [29].
Some problems have specific requirements; for example, computing the probability of a
prediction is essential for decision-making algorithms. One of the best ways to incorporate
uncertainty in ML models is through the lens of probability theory. Probabilistic ML defines
quantities of interest as random variables and considers data-generative processes as stochastic.
By accounting for the intrinsic measurement uncertainty and unknown random processes, we can
produce models that are more robust and more faithful to reality. Additionally, stochastic
models return probabilistic predictions, allowing answers like "I don’t know."
1.1 Bayesian Machine Learning
In the Bayesian paradigm, parameters of ML models are random variables described by
probability distributions instead of point estimates. Bayesian models allow modeling
uncertainty in a principled way and help prevent overfitting in the low-data regime. We set a
prior distribution over the variables of interest, representing our initial belief. After
observing data, we update our belief about the model parameters to obtain the posterior
distribution. A typical example is medicine, where data is scarce but predictions (diagnosis,
prognosis) can have dramatic consequences. Providing uncertainties helps the practitioner make
a better decision given the model predictions.
Generally, Bayesian models have a higher computational cost: a probability distribution
contains more information than a point estimate and requires more parameters. Calculus with
random variables is a difficult art, and analytical solutions exist almost exclusively for
trivial models. Approximation methods allow working with more complex models at the cost of
potential bias or inaccuracy. Approximate Bayesian Inference (ABI) studies algorithms that
find a solution close to the true posterior.
Research in ABI goes in many directions, but some of the main questions are: How can we
compute a highly accurate posterior approximation as efficiently as possible? How can it scale
to large amounts of data and parameters? What are the guarantees of such algorithms? This
thesis aims to partially answer these questions for some given setups, mainly through a focus
on representations.
1.2 The underestimated power of representations choices
The leading thread of this thesis is model representation, alternatively called model
parameterization, and its use for solving problems faster and more efficiently without
compromising prediction quality.
When defining probabilistic models, one needs to define relations between variables (observed
and latent) and choose appropriate distributions to represent them. Some modeling choices are
conceptually equivalent but lead to drastically different inference problems. A neat example,
presented in Gorinova et al. [18], is the so-called Neal’s funnel [39]. There are two
equivalent representations, called centered and non-centered, shown respectively in Figures 1.1
and 1.2: the first leads to an inference nightmare, while the second reduces to a nice and easy
isotropic Gaussian distribution.
\[ z \sim \mathcal{N}(0, 3), \qquad x \sim \mathcal{N}(0, \exp(z/2)) \tag{1.1} \]
[Figure 1.1: Neal’s funnel - Centered representation. Axes: x (horizontal) and z (vertical), both from -5 to 5.]
\[ \tilde{z} \sim \mathcal{N}(0, 1), \quad z = 3\tilde{z}, \qquad \tilde{x} \sim \mathcal{N}(0, 1), \quad x = \exp(z/2)\,\tilde{x} \tag{1.2} \]
[Figure 1.2: Neal’s funnel - Non-centered representation. Axes: x (horizontal) and z (vertical), both from -5 to 5.]
While both parameterizations define the same model, the geometry of the centered distribution
$p(x, z)$ is less favorable to inference: $x$ and $z$ are strongly coupled for small $z$, and
the density function is highly non-smooth. These constraints matter when running a sampling
chain or fitting a variational distribution.
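The equivalence of the two representations can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not taken from the thesis code. It draws samples from the centered form (1.1), maps them back to the non-centered variables of (1.2), and checks that these behave like independent standard Gaussians. It assumes that the second argument of the Gaussian in (1.1) is a standard deviation, which is what makes (1.1) agree with (1.2).

```python
import math
import random


def sample_centered(rng):
    """Draw (x, z) from the centered form (1.1)."""
    z = rng.gauss(0.0, 3.0)                # z ~ N(0, 3); 3 is read as a standard deviation (assumption)
    x = rng.gauss(0.0, math.exp(z / 2.0))  # x | z ~ N(0, exp(z / 2)); the scale of x depends on z
    return x, z


def to_non_centered(x, z):
    """Invert (1.2): recover (x_tilde, z_tilde) from (x, z)."""
    return x * math.exp(-z / 2.0), z / 3.0


def moments(pairs):
    """Empirical means, variances and correlation of a list of 2D points."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    var_b = sum((b - mean_b) ** 2 for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    return mean_a, mean_b, var_a, var_b, cov / math.sqrt(var_a * var_b)


rng = random.Random(0)
centered = [sample_centered(rng) for _ in range(20000)]
whitened = [to_non_centered(x, z) for x, z in centered]

mean_x, mean_z, var_x, var_z, corr = moments(whitened)
print(f"non-centered variables: means=({mean_x:.3f}, {mean_z:.3f}), "
      f"variances=({var_x:.3f}, {var_z:.3f}), correlation={corr:.3f}")
```

In the non-centered variables, a sampler or a Gaussian variational family faces an isotropic target, whereas in the centered variables the scale of x changes by orders of magnitude with z.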
The use of different model representations has an often underestimated effect and is mostly
regarded as a collection of "tricks." For example, when working with Gaussian Processes, it is
generally preferable to use the so-called "whitened" representation, which corresponds to the
non-centered representation of Neal’s funnel (Figure 1.2). The different parts of this thesis
show that finding better representations can make inference easier, faster, and significantly
more stable. The first part uses basic inference methods by representing likelihoods as
(hierarchical) mixtures. Rewriting distributions as scale mixtures, defined later in Section 2.3.3,
has many advantages and interesting properties. The scale mixture representation involves
augmenting the model with new latent variables, making inference easier while keeping the
original model recoverable. This augmentation procedure brings the perhaps counter-intuitive
view that adding more variables simplifies the problem. The last work of the thesis focuses
on the representation of the variational Gaussian approximation. We avoid computational
bottlenecks and add flexibility by representing the distribution with particles instead of
its mean and covariance.
1.3 Gaussian Processes
The techniques mentioned above apply to many probabilistic models; however, we focus on
Gaussian-based models, and more particularly on Gaussian Processes (GPs) [46]. A GP is a
powerful non-parametric tool to approximate functions using probabilistic methods. GPs were
initially applied to regression problems with Gaussian noise, like the original kriging
problem [9]. However, they are also used as priors over latent functions for more complex
problems like classification, ordinal regression, and more. Compared to other general function
approximators like neural networks, they have the advantage of providing uncertainty on the
predictions they make. Most importantly, as their name suggests, they are based on Gaussian
distributions, making them ideal candidates for the presented work on augmentation. A full
technical introduction to basic GPs and their extensions is given in Section 2.2.
1.4 Open-source projects
All the works presented in this thesis, as well as additional tools, are backed by user-friendly
packages in Julia [4]. Throughout my time as a Ph.D. student, I have developed numerous
Julia packages and was involved in the JuliaGaussianProcesses organisation to develop a
flexible, efficient and easy-to-use framework to work with GPs, from low-level to high-level
interfaces, through a series of packages: KernelFunctions.jl [16], AbstractGPs.jl [61],
ApproximateGPs.jl, InducingPoints.jl and GPLikelihoods.jl. The particular strength of
our work is the one-to-one mapping between theory and code. For example, to define the
posterior for some given data, the code looks like:
    using AbstractGPs           # provides GP and posterior
    f = GP(mean_prior, kernel)  # define an infinite-dimensional prior
    fx = f(X, noise)            # project the prior onto the inputs X, with observation noise variance `noise`
    fpost = posterior(fx, y)    # create the posterior given the observations y
    rand(fpost(x_test))         # sample from the predictive posterior at some test inputs
Here, each computational object represents exactly its mathematical equivalent.
The work of this thesis is also available through the package AugmentedGPLikelihoods.jl,
which provides all the necessary tools to work with augmentations.
A key advantage of Julia is its strong interoperability. This allows using the augmentation
work with more specialized implementations such as temporal GPs, with a concrete example
given in TemporalGPs.jl (see examples/augmented_inference.jl).
Independently, I also developed AugmentedGaussianProcesses.jl [14] as a stand-alone GP
package providing the augmentation techniques presented in the thesis, additional likelihoods,
and standard inference approaches.
1.5 Thesis Outline
This thesis is structured as follows:
• Chapter 2 introduces in detail the common concepts of Bayesian inference and GPs. Each
published article introduces these concepts, but this chapter dives deeper into the
background theory. Bayesian inference, in particular, is introduced thoroughly, focusing on
variational inference and sampling.
• Chapter 3 contains the paper Efficient Gaussian Process Classification Using Pólya-Gamma
Data Augmentation, which was the first variable augmentation we explored.
• Chapter 4 introduces the paper Multi-Class Gaussian Process Classification Made
Conjugate: Efficient Inference via Data Augmentation. This paper brings new
augmentation concepts to a more complex problem: multi-class classification.
• Chapter 5 presents the paper Automated Augmented Conjugate Inference for Non-conjugate
Gaussian Process Models. This work presents a generic way to identify augmentations in
likelihoods and introduces a better understanding of the concepts behind it.
• Chapter 6 introduces the paper Flexible and Efficient Inference with Particles for the
Variational Gaussian Approximation, a completely different way of performing variational
inference with a Gaussian distribution by using a continuous flow and particles.
• Chapter 7 discusses the different papers presented as well as some concrete outlooks on
how to explore new models and new generalizations.
• Chapter 8 finishes this thesis with a general conclusion.
• Appendix A contains an additional workshop paper that does not fit the narrative of this
thesis.
For all papers, a simplified view of the Contributor Roles Taxonomy (CRediT) details
the contributions of each author.
2 Background
To fully understand the papers presented later, we first give a general overview of the
required concepts. A short introduction to the basic theory of Gaussian Processes, as well as
their extension to large datasets using inducing points [53], is given in Chapters 3, 4 and 5.
This chapter, however, presents a more thorough description starting from the basics. It also
covers the foundations of probabilistic Bayesian modeling, variational inference, and sampling
methods.
2.1 Probabilistic Bayesian Modeling
Bayes’ theorem is one of the simplest theorems in probability theory, and its proof fits in one
line, yet its implications are immeasurably important (pun intended).
Let us give a very general modeling setting that we will follow for the rest of this chapter.
Given a set of observed variables $X$, a set of latent (unobserved) variables $\theta$ with a
prior distribution $p(\theta)$, and a likelihood function $p(X|\theta)$, we obtain the posterior
distribution $p(\theta|X)$ via Bayes’ theorem:
\[ p(\theta|X) = \frac{p(X|\theta)p(\theta)}{p(X)} = \frac{p(X|\theta)p(\theta)}{\int p(X|\theta)p(\theta)\,d\theta}. \tag{2.1} \]
$p(X)$ is the so-called evidence and can be used to compare different models (the
dependency on the model used is implicit here). The posterior gives us a distribution over the
latent variables, including its uncertainty, given the prior $p(\theta)$ and the observed data $X$.
The posterior is used for computing all kinds of expectations of the form
$\mathbb{E}_{p(\theta|X)}[f(\theta)] = \int f(\theta)p(\theta|X)\,d\theta$. Expected values of
interest can be statistics of the posterior like the mean ($\mathbb{E}_{p(\theta|X)}[\theta]$) or
the predictive distribution of new data points $p(x'|X) = \mathbb{E}_{p(\theta|X)}[p(x'|\theta)]$.
Let’s take the simple example of linear logistic regression, a discriminative model. Given
an input $x \in \mathbb{R}^D$ and a binary label $y \in \{0, 1\}$, we model the process as:
\[ y \sim \mathrm{Bernoulli}\left(\sigma(\theta^\top x)\right), \]
where $\mathrm{Bernoulli}$ is the Bernoulli distribution, $\theta \in \mathbb{R}^D$ is a vector of
weights (our latent variable), and $\sigma : \mathbb{R} \to [0, 1]$ is the logistic function
$\sigma(x) = \frac{1}{1 + \exp(-x)}$. The likelihood function is given by:
\[ p(y_i|\theta, x_i) = \sigma(\theta^\top x_i)^{y_i}\,\sigma(-\theta^\top x_i)^{1 - y_i}. \]
Now let’s suppose that we have $N$ pairs of inputs $x_i$ and labels $y_i$, which we assume to
be independent and identically distributed (iid). We get a training set
$X = \{x_1, \ldots, x_N\}$, $y = \{y_1, \ldots, y_N\}$. With a prior distribution $p(\theta)$ on
$\theta$, we build the posterior as $p(\theta|y, X) \propto p(\theta)\prod_{i=1}^N p(y_i|\theta, x_i)$.
We can then compute the predictive distribution for a new input $x^*$:
\[ p(y^*|x^*, y, X) = \int p(y^*, \theta|x^*, y, X)\,d\theta = \int p(y^*|\theta, x^*)\,p(\theta|y, X)\,d\theta. \tag{2.2} \]
Note that the last term of Equation (2.2) directly involves the posterior distribution
$p(\theta|y, X)$. To solve this integral, we must either know the posterior distribution and
compute the integral numerically (or analytically), or sample from the posterior and estimate
the integral using Monte Carlo integration.
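The two ingredients of this example, the log-likelihood and a Monte Carlo estimate of the predictive integral (2.2), can be sketched as follows. This is a minimal illustration in plain Python with made-up data; the function names are hypothetical, and `theta_samples` is a placeholder for samples of the posterior that would in practice come from one of the methods of Section 2.3.

```python
import math


def log_sigmoid(a):
    """Numerically stable log(sigma(a)) for the logistic function sigma."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))


def dot(u, v):
    """Inner product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def log_likelihood(theta, X, y):
    """Sum over i of y_i log sigma(theta^T x_i) + (1 - y_i) log sigma(-theta^T x_i)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        a_i = dot(theta, x_i)
        total += y_i * log_sigmoid(a_i) + (1 - y_i) * log_sigmoid(-a_i)
    return total


def predictive_probability(x_star, theta_samples):
    """Monte Carlo estimate of p(y* = 1 | x*, y, X) from posterior samples, as in (2.2)."""
    probs = [math.exp(log_sigmoid(dot(theta, x_star))) for theta in theta_samples]
    return sum(probs) / len(probs)


# Toy data with D = 2 (values are made up for illustration).
X = [[1.0, 0.5], [-0.5, 1.5], [2.0, -1.0]]
y = [1, 0, 1]
print(log_likelihood([0.0, 0.0], X, y))  # theta = 0 gives sigma(0) = 0.5 for every point

# Placeholder "posterior samples" (assumption): not produced by an actual inference run.
theta_samples = [[1.0, -0.5], [0.8, -0.2], [1.2, -0.7]]
print(predictive_probability([1.0, 1.0], theta_samples))
```

Working with the logarithm of the logistic function avoids overflow for large negative or positive arguments, which matters as soon as the likelihood is evaluated inside a sampler or an optimizer.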
2.1.1 Posterior computations
Given a prior $p(\theta)$ and a likelihood $p(X|\theta)$, computing the posterior distribution
(2.1) in closed form requires the integral $p(X) = \int p(X|\theta)p(\theta)\,d\theta$. (Even if
this integral is known, it might not be enough to compute some expectations or statistics.)
For most non-trivial models, this integral is intractable, and approximations to the posterior
are needed. Such methods are introduced in Section 2.3.
However, in specific settings, computing the posterior in closed form is possible. When
the prior is conjugate to the likelihood, the posterior belongs to the same family of
probability distributions as the prior and is analytically tractable [49]. It is worth
emphasizing this seemingly trivial case since it will be exploited in Section 2.3.3. As a
general example, we consider a likelihood belonging to the exponential family:
\[ p(x|\theta) = h(x)\exp\left(\eta(\theta)^\top T(x) - A(\theta)\right), \tag{2.3} \]
where $\theta$ are the distribution parameters, $h(x)$ is the base measure, $\eta(\theta)$ are
the natural parameters, $T(x)$ are the sufficient statistics, and $A(\theta)$ is the
log-partition function. Formally, a conjugate prior to the likelihood (2.3) is defined as:
\[ p(\theta|\alpha) = h'(\theta)\exp\left(\eta'(\alpha)^\top T'(\theta) - A'(\alpha)\right), \tag{2.4} \]
where $T'(\theta) = \{\eta(\theta), -A(\theta)\}$ and where $\alpha$ represents the prior
distribution parameters. Given a factorizable likelihood
$p(X|\theta) = \prod_{i=1}^N p(x_i|\theta)$, the posterior will be proportional to
\[ p(\theta|X) \propto h'(\theta)\exp\left(\left(\left\{\sum_{i=1}^N T(x_i),\, N\right\} + \eta'(\alpha)\right)^\top T'(\theta)\right). \tag{2.5} \]
Note that the only dependence on $X$ is via the sufficient statistics $T(x)$ and the number of
observations $N$.
Conjugate models are very practical as the posterior can be found in one step, but they are
very constraining in the choice of the prior. They tend to be considered too simple for many
applications.
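As a concrete instance of such a one-step update, consider the standard textbook pair of a Bernoulli likelihood with a Beta prior (an illustration chosen here, not an example from the thesis). The sketch below, in plain Python with hypothetical function names, applies the closed-form update and compares the resulting posterior mean with a brute-force normalization of prior times likelihood on a grid.

```python
def beta_bernoulli_posterior(a, b, data):
    """Conjugate update: a Beta(a, b) prior with Bernoulli data gives Beta(a + s, b + N - s)."""
    s = sum(data)  # sufficient statistic: the sum of the observations
    n = len(data)  # number of observations N
    return a + s, b + n - s


def grid_posterior_mean(a, b, data, n_grid=20000):
    """Brute-force posterior mean: normalize prior x likelihood with the midpoint rule."""
    num = 0.0
    den = 0.0
    for k in range(n_grid):
        theta = (k + 0.5) / n_grid
        density = theta ** (a - 1) * (1 - theta) ** (b - 1)  # unnormalized Beta(a, b) prior
        for x in data:
            density *= theta if x == 1 else 1 - theta         # Bernoulli likelihood terms
        num += theta * density
        den += density
    return num / den


data = [1, 0, 1, 1, 0, 1, 1, 1]
a_post, b_post = beta_bernoulli_posterior(2.0, 2.0, data)
closed_form_mean = a_post / (a_post + b_post)
grid_mean = grid_posterior_mean(2.0, 2.0, data)
print(a_post, b_post, closed_form_mean, grid_mean)

# A dataset with the same sum and the same length leads to the same posterior.
same_statistics = beta_bernoulli_posterior(2.0, 2.0, [1, 1, 1, 1, 1, 1, 0, 0])
print(same_statistics == (a_post, b_post))
```

The closed-form update needs a single pass over the data, while the grid approach already becomes impractical for a handful of parameters; this gap is what makes conjugacy attractive.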
If the prior is not conjugate to the likelihood, an alternative is to look for conditional
conjugacy. A parameter $\theta_i$ with a conditionally conjugate prior has a full conditional
distribution of the same family as its prior. The full conditional distribution is defined as
$p(\theta_i|X, \theta_{/i})$, where
$\theta_{/i} = \{\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_D\}$. This notion of
full conditional also extends to blocks of variables.
2.2 Gaussian Processes
Gaussian Processes (GPs) are a class of stochastic processes used as non-parametric
probabilistic representations of functions. A GP is a stochastic process $\{f_t\}$ where the
joint distribution of any finite collection of the random variables $\{f_t\}$ is a
(multivariate) Gaussian distribution [46]. Since all the variables are Gaussian, we can perform
all linear operations analytically, making GPs computationally attractive. We can also compute
marginals exactly, and a product of Gaussian densities over the same variable is still
proportional to a Gaussian density.
A GP is uniquely specified by its mean function $\mu_0(x)$ and kernel function (also
called covariance function) $k(x, x')$. $\mu_0(x)$ can be any real-valued function, while
$k(x, x')$ needs to be a positive-definite function (also called a Mercer kernel). A symmetric
function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive-definite on
$\mathcal{X}$ if $w^\top K w \geq 0$, where $K_{ij} = k(x_i, x_j)$, for any $w \in \mathbb{R}^N$
and any $\{x_1, \ldots, x_N\} \subset \mathcal{X}$.
One interpretation of a $\mathcal{GP}(\mu_0, k)$ is as a prior on a function space. Given a
random function $f$ with a GP prior, we can project $f$ onto a finite-dimensional space by
evaluating it on a set of inputs $X = \{x_1, \ldots, x_N\}$, obtaining the finite-dimensional
vector $f$ with $f_i = f(x_i)$. The prior on the GP projected on $X$ is given by
$\mathcal{N}(\mu_0(X), K_X)$, where $\mu_0(X) = \{\mu_0(x_i)\}_{i=1}^N$ and
$K_X \in \mathbb{R}^{N \times N}$ is the kernel matrix, defined by $(K_X)_{ij} = k(x_i, x_j)$.
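To make the positive-definiteness condition concrete, the following minimal sketch builds a kernel matrix and evaluates the quadratic form on random vectors. It uses plain Python and assumes a squared-exponential kernel on scalar inputs; this kernel and the function names are illustrative choices, not something prescribed by the text.

```python
import math
import random


def sq_exp_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)) for scalar inputs."""
    return math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)


def kernel_matrix(xs, kernel):
    """Kernel matrix with entries k(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]


def quadratic_form(K, w):
    """Compute w^T K w."""
    n = len(w)
    return sum(w[i] * K[i][j] * w[j] for i in range(n) for j in range(n))


rng = random.Random(1)
xs = [rng.uniform(-3.0, 3.0) for _ in range(8)]
K = kernel_matrix(xs, sq_exp_kernel)

# Check w^T K w >= 0 on random draws of w (a numerical illustration, not a proof).
values = [quadratic_form(K, [rng.gauss(0.0, 1.0) for _ in xs]) for _ in range(1000)]
print(min(values))
```

A random check of this kind cannot certify a kernel, but it is a cheap way to catch a function that is not a valid covariance before using it in a model.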
2.2.1 Gaussian Process Regression
[Figure 2.1: Illustration of the realization of a Gaussian Process. The black line is the true
function $f$; the blue line is the mean of the prediction; the blue area represents the
confidence interval of 2 standard deviations; the orange points represent observed data. Left
(GP Prior): prediction on a grid given no observations. Right (GP Posterior): prediction on a
grid given a set of observations.]
Given our prior $p(f) = \mathcal{N}(f|\mu_0, K_X)$, we can add noisy observations
$y = \{y_i\}_{i=1}^N$ for each respective $x_i$ and model the process as:
\[ y_i = f(x_i) + \epsilon_i, \tag{2.6} \]
where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. This leads to the likelihood
$p(y_i|f_i) = \mathcal{N}(y_i|f_i, \sigma^2)$. Fortunately, adding a zero-mean Gaussian variable
to another Gaussian variable gives a Gaussian variable with increased variance, so the marginal
distribution of the observations is $p(y) = \mathcal{N}(y|\mu_0, K_X + \sigma^2 I)$, and the
posterior for $f$ is Gaussian as well:
$p(f|y) = \mathcal{N}\left(f \mid \mu_0 + K_X(K_X + \sigma^2 I)^{-1}(y - \mu_0),\ K_X - K_X(K_X + \sigma^2 I)^{-1}K_X\right)$.
The predictive distribution of $f^* = f(x^*)$ on a new input $x^*$ can be evaluated by computing:
\[ p(f^*|x^*, X, y) = \int p(f^*|f, x^*)\,p(f|X, y)\,df. \tag{2.7} \]
This integral is analytically tractable and, for a zero-mean prior ($\mu_0 = 0$), results in
\[ p(f^*|x^*, X, y) = \mathcal{N}(f^*|m^*, s^*), \tag{2.8} \]
where $m^* = K_{x^*,X}(K_X + \sigma^2 I)^{-1}y$ and
$s^* = K_{x^*} - K_{x^*,X}(K_X + \sigma^2 I)^{-1}K_{X,x^*}$, with
$(K_{x^*,X})_i = k(x^*, x_i)$ and $K_{x^*} = k(x^*, x^*)$. The predictive distribution for $f^*$
is Gaussian, with a known mean $m^*$ and a measure of uncertainty given by the variance $s^*$.
Note that $s^*$ depends directly on $K_{x^*,X}$: if $x^*$ is far from all points in $X$ (in the
sense of the distance used in the kernel $k$), then $K_{x^*,X}$ will be very small and the
variance $s^*$ will be close to its maximum, the prior variance $K_{x^*}$. The predictive
uncertainty is therefore high when new inputs $x^*$ are distant from the training data $X$. A
concrete example is shown in Figure 2.1.
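The predictive equations can be turned into a few lines of code. The sketch below is a minimal, dependency-free illustration and not the implementation used in the packages mentioned in Section 1.4: it assumes a zero-mean prior, scalar inputs and a squared-exponential kernel, and all function names are hypothetical. The linear systems are solved through a Cholesky factorization rather than an explicit matrix inverse.

```python
import math


def sq_exp_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel on scalar inputs (illustrative choice)."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(A):
    """Lower-triangular L with A = L L^T, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def chol_solve(L, b):
    """Solve (L L^T) v = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    v = [0.0] * n
    for i in reversed(range(n)):
        v[i] = (z[i] - sum(L[k][i] * v[k] for k in range(i + 1, n))) / L[i][i]
    return v


def gp_predict(X, y, x_star, noise_var=0.01):
    """Predictive mean m* and variance s* of f(x*) for a zero-mean GP prior."""
    n = len(X)
    K_noisy = [[sq_exp_kernel(X[i], X[j]) + (noise_var if i == j else 0.0)
                for j in range(n)] for i in range(n)]
    L = cholesky(K_noisy)
    k_star = [sq_exp_kernel(x_star, x_i) for x_i in X]
    alpha = chol_solve(L, y)        # (K_X + sigma^2 I)^{-1} y
    beta = chol_solve(L, k_star)    # (K_X + sigma^2 I)^{-1} K_{X, x*}
    m_star = sum(k * a for k, a in zip(k_star, alpha))
    s_star = sq_exp_kernel(x_star, x_star) - sum(k * b for k, b in zip(k_star, beta))
    return m_star, s_star


X = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [math.sin(x) for x in X]               # noise-free toy observations of sin(x)
m_near, s_near = gp_predict(X, y, 0.5)     # test input between training points
m_far, s_far = gp_predict(X, y, 10.0)      # test input far from the data
print(f"x*=0.5:  mean={m_near:.3f}, variance={s_near:.4f}")
print(f"x*=10.0: mean={m_far:.3f}, variance={s_far:.4f}")
```

Far from the data, the prediction falls back to the prior: the mean returns to zero and the variance to the prior variance, which is the behaviour described above.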
2.2.2 Non-Conjugate Gaussian Processes
A Gaussian prior is only conjugate to the mean parameter of a Gaussian likelihood. Therefore,
the GP posterior obtained in the previous section is only tractable for homoscedastic Gaussian
likelihoods, i.e., when the noise variance is independent of the input. For all other cases we
talk about non-conjugate GPs. Examples of non-conjugate GP problems are binary classification,
regression with non-Gaussian noise such as Student-t or Laplace noise, or Poisson regression.
Other examples, such as multi-class classification or heteroscedastic regression, can require
multiple latent GPs. Figure 2.2 shows an example of 1-dimensional binary classification with a
GP where the posterior was approximated using variational inference (see Section 2.3.2).
Although the GP does not recover the true process exactly, most of it lies in the GP’s 95%
confidence interval (blue band).
[Figure 2.2: Illustration of a latent Gaussian process used for a binary classification
problem. The Bernoulli likelihood is linked to the latent GP via the logistic function. On the
left (latent GP representation) is shown the optimal variational posterior $q(f)$ in blue,
compared to the true generation of $f$ in green. Similar to Figure 2.1, the blue band
represents one standard deviation. On the right, we show the expected predictive probability
$\mathbb{E}_{q(f)}[p(y|f)]$ for $y$ given the variational posterior $q(f)$ in blue.]
Posteriors of non-conjugate problems are not analytically tractable, and one needs to resort
to the approximation methods presented in Section 2.3. A strong focus of this thesis is to
take these non-conjugate likelihoods and find a representation where inference is simplified
and basic methods can be used.
2.2.3 Sparse Gaussian Processes
One of the largest drawbacks of GPs, regardless of the conjugacy of the likelihood, is their
poor scalability with the number of observed samples. When computing the predictive mean and
covariance, the matrix inversion in Equation (2.8) has a computational complexity of $O(N^3)$,
where $N$ is the number of samples. For one-dimensional inputs ($D = 1$), solutions exist for
specific kernels using state-space model representations [55, 52], leading to an $O(N)$
complexity. However, higher-dimensional problems require alternative solutions. The first
approach to reduce the complexity was to use a Nyström approximation [62]. Csató and Opper [11]
proposed to approximate the posterior using only a subset of the points, in the context of
online learning. Snelson and Ghahramani [51] extended this theory to the offline setting, and
Csató [10], followed by Titsias [53], developed an alternative approximation based on the KL
divergence where the "inducing points" are not necessarily a subset of the training data and do
not even have to belong to the same domain [31, 56]. For a unified view on sparse GPs, see
Quinonero-Candela and Rasmussen [45] and Bui et al. [7].
The works of this thesis relying on inducing points are based on Titsias’ approach [53]. The
sparse approximation is made by defining a set of inducing point locations
$Z = \{z_i\}_{i=1}^M$ and the realization of a GP $u$ on them: the vector $u$ with
$u_i = u(z_i)$. We proceed to use variational inference (see Section 2.3.2) and approximate
the posterior $p(u, f|y)$ by the variational distribution
\[ q(u, f) = q(u)\prod_{i=1}^N p(f_i|u), \tag{2.9} \]
minimizing $\mathrm{KL}\left(q(u, f)\,||\,p(u, f|y)\right)$. The assumption used is that all
components of the random vector $f$ are independent of each other given the random vector $u$.
It is a strong assumption, but the inference and prediction complexity reduces to
$O(NM^2 + M^3)$, where $N$ can be reduced to a smaller batch size $B$ with stochastic inference
approaches [23, 21]. Given $q(u) = \mathcal{N}(\mu, \Sigma)$, the predictive distribution of
$f^* = f(x^*)$ on a new input $x^*$ is given by
\[ p(f^*|y, X) = \int p(f^*|u)\,p(u|y, X)\,du \approx \int p(f^*|u)\,q(u)\,du = \mathcal{N}(f^*|m^*, s^*), \]
where $m^* = K_{x^*,Z}K_Z^{-1}\mu$ and
$s^* = K_{x^*} - K_{x^*,Z}K_Z^{-1}(K_Z - \Sigma)K_Z^{-1}K_{Z,x^*}$.
2.3 Approximate Bayesian Inference
The posterior distribution in Equation (2.1) cannot be computed in closed form for non-trivial
problems such as the ones presented in Sections 2.2.2 and 2.2.3. We can instead approximate the
posterior to obtain useful estimates of predictions and expected values of interest.
Approximate Bayesian Inference is a research field of its own, and this section focuses
specifically on sampling and Variational Inference, the most popular approximate inference
methods for GPs.
2.3.1 Sampling
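We can compute predictive estimates $\mathbb{E}_{p(\theta|X)}[f(\theta)]$ with Monte Carlo integration:
\[ \mathbb{E}_{p(\theta|X)}[f(\theta)] \approx \frac{1}{N}\sum_{i=1}^N f(\theta_i), \qquad \theta_i \sim p(\theta|X), \]
where the samples $\theta_i$ are iid.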
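A minimal sketch of this estimator, in plain Python, is given below. The Gaussian used as the "posterior" is a stand-in chosen because it can be sampled directly and has a known second moment; the function names are illustrative.

```python
import random


def monte_carlo_expectation(f, sampler, n_samples):
    """Estimate E[f(theta)] by averaging f over iid draws theta_i from `sampler`."""
    return sum(f(sampler()) for _ in range(n_samples)) / n_samples


rng = random.Random(0)


def draw_theta():
    """Stand-in for a posterior that can be sampled directly: theta ~ N(1, 0.5^2)."""
    return rng.gauss(1.0, 0.5)


def square(theta):
    return theta * theta


# Exact value for comparison: E[theta^2] = mean^2 + variance = 1.0 + 0.25 = 1.25.
for n in (100, 10_000, 100_000):
    estimate = monte_carlo_expectation(square, draw_theta, n)
    print(n, estimate)
```

The error of the estimate shrinks at the rate $1/\sqrt{N}$ regardless of the dimension of $\theta$, which is why the approach remains usable when numerical quadrature is not.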
Even if the posterior distribution $p(\theta|X)$ is not available in closed form or has no
direct sampler, there are many ways to draw samples from it. The advantage of sampling is its
asymptotic exactness: one obtains exact expectations in the limit of infinitely many samples.
Sampling is an art of its own, and the number of methods is too large to mention them all in
this thesis. Therefore, the scope is restricted to methods popular with or tailored to GPs. In
particular, we restrict ourselves to Markov Chain Monte Carlo (MCMC) methods.
Markov Chain Monte Carlo and Metropolis-Hastings
Markov Chain Monte Carlo (MCMC) methods generate a chain of variables $\theta^t$ satisfying the
Markov assumption, i.e., $\theta^t$ depends only on $\theta^{t-1}$, and such that the
stationary distribution of $\theta^t$ is the target distribution $\pi(\theta)$ (for our use
case, the posterior $p(\theta|X)$). MCMC methods require a transition probability
$t(\theta^{t+1}|\theta^t)$ which leaves the target distribution invariant, i.e.,
$\pi(\theta) = \int t(\theta|\theta')\,\pi(\theta')\,d\theta'$. Other properties such as
detailed balance and ergodicity need to be satisfied as well [6, 42].
One of the most common algorithms to run a Markov Chain on a distribution $\pi(\theta)$ is the Metropolis-Hastings (MH) algorithm. The MH algorithm relies on a proposal distribution $q(\theta'|\theta)$ suggesting a new sample. Each proposed sample $\theta'$ is randomly accepted with probability $\min(1, A)$ and rejected otherwise, where
\[ A = \frac{\pi(\theta')}{\pi(\theta)}\,\frac{q(\theta|\theta')}{q(\theta'|\theta)}. \]
The choice of the proposal distribution $q$ is the key to producing "good" chains with a high acceptance rate and a good exploration of the parameter space of $\theta$. Some common choices for the proposal $q$ are presented next.
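The accept/reject rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the proposal is a symmetric Gaussian random walk (so the ratio of $q$ terms cancels), the target is a standard 2D Gaussian, and the function name `metropolis_hastings` and the step size are hypothetical choices.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_steps, step_size, rng):
    """Random-walk MH: the Gaussian proposal is symmetric, so q cancels in A."""
    theta = np.asarray(theta0, dtype=float)
    log_p = log_target(theta)
    chain = np.empty((n_steps, theta.size))
    n_accepted = 0
    for t in range(n_steps):
        proposal = theta + step_size * rng.standard_normal(theta.size)
        log_p_prop = log_target(proposal)
        # Accept with probability min(1, A), evaluated in log space.
        if np.log(rng.uniform()) < log_p_prop - log_p:
            theta, log_p = proposal, log_p_prop
            n_accepted += 1
        chain[t] = theta
    return chain, n_accepted / n_steps

# Unnormalised log-density of a standard 2D Gaussian (illustrative target).
log_target = lambda th: -0.5 * np.sum(th**2)
rng = np.random.default_rng(1)
chain, acc_rate = metropolis_hastings(log_target, np.zeros(2), 20_000, 1.0, rng)
print(acc_rate, chain[2000:].mean(axis=0), chain[2000:].var(axis=0))
```

Only the unnormalised target is needed, since the normalisation constant cancels in the ratio $\pi(\theta')/\pi(\theta)$.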
Figure 2.3: 20 steps of the Gibbs sampler trajectory on the Rosenbrock distribution in 2 dimensions (axes $x_1$ and $x_2$).
Gibbs Sampling
Gibbs sampling is a particular MCMC method where we sample each component of the random vector one after another. The proposal distribution for each component is given by the full conditional $p(\theta_i|x,\theta_{/i})$, where $\theta_{/i} = \{\theta_1,\ldots,\theta_{i-1},\theta_{i+1},\ldots,\theta_D\}$. The most prominent feature of Gibbs sampling is its acceptance probability, guaranteed to be 1:
\[ A = \frac{p(\theta_i^{t+1},\theta_{/i}^{t}|x)}{p(\theta_i^{t},\theta_{/i}^{t}|x)}\,\frac{p(\theta_i^{t}|x,\theta_{/i}^{t})}{p(\theta_i^{t+1}|x,\theta_{/i}^{t})} = \frac{p(\theta_i^{t+1}|x,\theta_{/i}^{t})}{p(\theta_i^{t}|x,\theta_{/i}^{t})}\,\frac{p(\theta_{/i}^{t}|x)}{p(\theta_{/i}^{t}|x)}\,\frac{p(\theta_i^{t}|x,\theta_{/i}^{t})}{p(\theta_i^{t+1}|x,\theta_{/i}^{t})} = 1. \]
At every step, all proposed samples are therefore guaranteed to be accepted.
We illustrate the path of the sampler on a two-dimensional bimodal example in Figure 2.3.
The Gibbs sampling approach comes with a trade-off. On the one hand, sampling each component from its full conditional is easy since it only involves drawing a scalar. On the other hand, building a sampler for each full conditional at each step can be slow and costly, and the sampler can get stuck or move very slowly if the components are highly correlated with one another. We can mitigate these drawbacks with additional techniques like the blocked Gibbs sampler [28], where we sample groups of variables jointly, or collapsed Gibbs sampling [35], where we marginalize out some variables from the full conditional distributions. But blocked or collapsed updates are not always available and require heavier sampling machinery.
The augmentations proposed in this thesis allow using both the blocked and collapsed versions by deriving the blocked full conditionals for each group of variables analytically. Experiments show that the correlations between the groups of variables are very low, and that the sampler converges to the stationary distribution very fast.
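The effect of correlated components on a component-wise Gibbs sampler can be illustrated on a toy target. The sketch below assumes a zero-mean, unit-variance bivariate Gaussian with correlation `rho`, for which both full conditionals are scalar Gaussians; the function names are hypothetical and the example is unrelated to the models studied in the thesis.

```python
import numpy as np

def gibbs_bivariate_gaussian(rho, n_steps, rng):
    """Gibbs sampler for a zero-mean, unit-variance 2D Gaussian with correlation rho."""
    theta = np.zeros(2)
    chain = np.empty((n_steps, 2))
    cond_std = np.sqrt(1.0 - rho**2)
    for t in range(n_steps):
        # Each full conditional is a scalar Gaussian, so every draw is accepted.
        theta[0] = rho * theta[1] + cond_std * rng.standard_normal()
        theta[1] = rho * theta[0] + cond_std * rng.standard_normal()
        chain[t] = theta
    return chain

def lag1_autocorrelation(x):
    """Sample autocorrelation at lag 1 of a scalar chain."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

rng = np.random.default_rng(2)
weak = gibbs_bivariate_gaussian(0.1, 20_000, rng)
strong = gibbs_bivariate_gaussian(0.99, 20_000, rng)
print(lag1_autocorrelation(weak[:, 0]), lag1_autocorrelation(strong[:, 0]))
```

For this target the chain of the first component is an autoregressive process with coefficient `rho**2`, so strongly correlated components give almost perfectly correlated successive samples, which is the slow mixing described above. Sampling both components jointly, as a blocked sampler would, removes the issue in this toy case.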
Hamiltonian/Hybrid Monte Carlo
Hamiltonian Monte Carlo (HMC) or Hybrid Monte Carlo [13, 40, 3] is an MCMC method that uses Hamiltonian dynamics to make a new proposal. We augment $\theta^t$ with an extra momentum $p^t$ sampled randomly for every proposal from $\mathcal{N}(0, M)$, where $M$ is the mass matrix. Next we run the Hamiltonian dynamics based on the Hamiltonian $H(\theta,p) = -\log\pi(\theta) + \frac{1}{2}p^\top M^{-1}p$ over $L$ leapfrog steps with step size $\Delta t$. The proposal at time $L\Delta t$ is accepted or rejected based on the acceptance rate:
\[ A = \min\left(1, \frac{\exp(-H(\theta(L\Delta t), p(L\Delta t)))}{\exp(-H(\theta(0), p(0)))}\right). \]
Hamiltonian dynamics normally keep the Hamiltonian invariant. However, symplectic (volume preserving) integrators, like the leapfrog method, only keep $H$ approximately invariant [40]. The global error on $H$ grows as $O(L(\Delta t)^2)$. We get high acceptance rates while the dynamics lower the correlation between each sample by exploring the parameter space more freely than a basic random walk. We can tune the HMC algorithm parameters $M$, $L$, and $\Delta t$ by drawing a series of adaptive samples and by adjusting to the local geometry of the potential function $-\log\pi(\theta)$. Figure 2.4 illustrates the sampling process with the Hamiltonian dynamics paths drawn with gray lines.
HMC is very popular due to its plug-and-play characteristics but suffers from several issues. It is gradient-based and cannot sample discrete variables. The integration of the Hamiltonian dynamics requires $2L$ gradients per proposal. This computational cost can be prohibitively expensive for high-dimensional problems or for target distributions with costly computations.
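A single HMC proposal with leapfrog integration can be sketched as follows. The sketch assumes an identity mass matrix $M = I$ and a standard Gaussian target; the helper name `hmc_step`, the step size and the number of leapfrog steps are illustrative assumptions, not tuned values from the thesis. Adjacent half steps in the momentum are merged here, so fewer gradient evaluations are used than in a naive implementation.

```python
import numpy as np

def hmc_step(theta, log_target, grad_log_target, step_size, n_leapfrog, rng):
    """One HMC proposal with identity mass matrix M = I (an assumption)."""
    p0 = rng.standard_normal(theta.size)        # momentum p ~ N(0, I)
    theta_new, p = theta.copy(), p0.copy()
    # Leapfrog: half step in p, alternating full steps, final half step in p.
    p += 0.5 * step_size * grad_log_target(theta_new)
    for step in range(n_leapfrog):
        theta_new += step_size * p
        if step < n_leapfrog - 1:
            p += step_size * grad_log_target(theta_new)
    p += 0.5 * step_size * grad_log_target(theta_new)
    # H = -log pi(theta) + p^T p / 2; accept with probability min(1, exp(H0 - H1)).
    h0 = -log_target(theta) + 0.5 * p0 @ p0
    h1 = -log_target(theta_new) + 0.5 * p @ p
    if np.log(rng.uniform()) < h0 - h1:
        return theta_new, True
    return theta, False

log_target = lambda th: -0.5 * th @ th          # standard Gaussian, unnormalised
grad_log_target = lambda th: -th
rng = np.random.default_rng(3)
theta = np.zeros(2)
samples, accepts = [], 0
for _ in range(5_000):
    theta, accepted = hmc_step(theta, log_target, grad_log_target, 0.2, 10, rng)
    samples.append(theta)
    accepts += accepted
samples = np.array(samples)
acceptance_rate = accepts / len(samples)
print(acceptance_rate, samples.var(axis=0))
```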
Figure 2.4: Illustration of the HMC sampler (gray lines are the Hamiltonian dynamics; axes $x_1$ and $x_2$).
Other samplers
There are other solid choices for sampling from GPs. For example, elliptical slice sampling (Murray et al. [38]) is particularly well suited to Gaussian priors. The No-U-turn sampling (NUTS) algorithm [25] is an extension of HMC where $L$ is chosen automatically. We run the path integration with both $p$ and $-p$ until one of the particles goes backward or one of the Hamiltonian estimates becomes too inaccurate.⁴ The proposal is finally sampled randomly from both paths. NUTS is good at avoiding oscillatory dynamics and is particularly strong for quadratic problems, which appear regularly in GP problems. Finally, another orthogonal approach to sample from predictive distributions with a known GP posterior is pathwise sampling [63]. By taking a mix of random Fourier features, specific to a particular class of kernels, the sampling complexity can be reduced from $O(N^3)$ to $O(T^3)$, where $N$ is the number of test inputs and $T$ is the chosen number of basis functions.
2.3.2 Variational Inference
Variational Inference (VI), also called Variational Bayes, consists in approximating the posterior $p(\theta|X)$ with another distribution $q(\theta)$. Given a family of distributions $\mathcal{Q}$, parametrized by the variational parameters $\varphi$, one aims to solve the following optimization problem:
\[ \varphi^* = \arg\min_{\varphi} D(q_\varphi(\theta), p(\theta|x)), \tag{2.10} \]
where $D$ is a dissimilarity measure between two distributions and $q_\varphi$ is the distribution $q \in \mathcal{Q}$ parametrized by $\varphi$. One of the most used dissimilarity measures is the reverse Kullback-Leibler (KL) divergence, defined for continuous distributions as:
\[ \mathrm{KL}(q(x)\,||\,p(x)) = \int q(x)\log\frac{q(x)}{p(x)}\,dx. \tag{2.11} \]
The objective of Equation (2.10) or (2.11) is generally not directly tractable when the normalizer is not known. Since $p(\theta|x)$ involves the normalization constant $p(x)$, one resorts to a surrogate function, the Variational Free Energy (VFE) (or its negative counterpart, the Evidence Lower BOund (ELBO)):
\[ \begin{aligned} \mathrm{KL}(q_\varphi(\theta)\,||\,p(\theta|x)) &= \int q_\varphi(\theta)\left(\log q_\varphi(\theta) - \log p(\theta|x)\right)d\theta \\ &= \int q_\varphi(\theta)\left(\log q_\varphi(\theta) - \log p(\theta,x) + \log p(x)\right)d\theta \\ &= \underbrace{\log p(x)}_{:=C} + \int q_\varphi(\theta)\left(\log q_\varphi(\theta) - \log p(x|\theta) - \log p(\theta)\right)d\theta \\ &= C - \mathbb{E}_{q_\varphi}[\log p(x|\theta)] + \mathrm{KL}(q_\varphi(\theta)\,||\,p(\theta)) = F(\varphi) + C. \end{aligned} \tag{2.12} \]
Since $C$ does not depend on $\varphi$, minimizing $F(\varphi)$ yields the same optimum as minimizing the KL divergence in Equation (2.10).
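The decomposition of the KL divergence into the free energy plus a constant equal to $\log p(x)$ can be checked numerically on a conjugate toy model where everything is available in closed form. The sketch below assumes $\theta \sim \mathcal{N}(0,1)$ and $x|\theta \sim \mathcal{N}(\theta,1)$, so that the posterior is $\mathcal{N}(x/2, 1/2)$ and the evidence is $\mathcal{N}(x|0,2)$; the model, the observation and the helper names are illustrative assumptions.

```python
import numpy as np

def kl_gauss(m, s2, m0, s02):
    """KL( N(m, s2) || N(m0, s02) ) for scalar Gaussians with variances s2, s02."""
    return 0.5 * (np.log(s02 / s2) + (s2 + (m - m0) ** 2) / s02 - 1.0)

def free_energy(m, s2, x):
    """F = -E_q[log p(x|theta)] + KL(q || prior) with q = N(m, s2)."""
    expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    return -expected_loglik + kl_gauss(m, s2, 0.0, 1.0)

x = 1.3                                    # a single made-up observation
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0   # log N(x | 0, 2)
m, s2 = 0.2, 0.7                           # arbitrary variational parameters
kl_to_posterior = kl_gauss(m, s2, x / 2.0, 0.5)
# The two printed numbers agree: KL(q || posterior) = F + log p(x).
print(kl_to_posterior, free_energy(m, s2, x) + log_evidence)
```

At the exact posterior the KL divergence vanishes, so the free energy equals $-\log p(x)$, which is why its negative is a lower bound on the log evidence.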
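A standard way to find $\varphi^* = \arg\min_{\varphi} F(\varphi)$ is to perform gradient descent on the variational parameters $\varphi$:
\[ \varphi^{t+1} = \varphi^{t} - \epsilon_t\nabla_\varphi F(\varphi^{t}), \tag{2.13} \]
where $\epsilon_t > 0$ is the learning rate.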
⁴ Technical details are skipped here.
Computing the gradient $\nabla_\varphi F(\varphi)$ can be non-trivial. It involves derivatives over expectations, but "tricks" like reparametrization [54] help to reduce the cost of these computations.
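As a rough illustration of the reparametrization idea, the sketch below estimates the gradients of $\mathbb{E}_{\mathcal{N}(\theta|m,s^2)}[f(\theta)]$ with respect to $m$ and $s$ by writing $\theta = m + s\epsilon$ with $\epsilon \sim \mathcal{N}(0,1)$. The test function $f(\theta) = \theta^2$, the helper name `reparam_gradients` and the sample size are assumptions made for the example; it is a generic sketch rather than the estimator of [54].

```python
import numpy as np

def reparam_gradients(grad_f, m, s, n_samples, rng):
    """Estimate d/dm and d/ds of E_{N(theta|m, s^2)}[f(theta)] via theta = m + s * eps."""
    eps = rng.standard_normal(n_samples)
    theta = m + s * eps
    g = grad_f(theta)                   # df/dtheta evaluated at each sample
    # Chain rule: dtheta/dm = 1 and dtheta/ds = eps.
    return g.mean(), (g * eps).mean()

rng = np.random.default_rng(4)
m, s = 0.5, 1.5
grad_m, grad_s = reparam_gradients(lambda t: 2 * t, m, s, 200_000, rng)
# For f(theta) = theta^2, E[f] = m^2 + s^2, so the exact gradients are 2m and 2s.
print(grad_m, grad_s)
```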
The choice of the family $\mathcal{Q}$ is a trade-off. A richer, more complex family might be able to approximate the posterior better, but computing the KL and optimizing the variational parameters will be increasingly difficult. A standard example for continuous variables is the Variational Gaussian Approximation (VGA)⁵, where the variational distribution $q_\varphi$ is a Gaussian, i.e. $\mathcal{Q} = \{q = \mathcal{N}(m, S)\}$ and $\varphi = \{m, S\}$. Many expectations can be computed analytically under VGA, in particular when the prior on $\theta$ is Gaussian as well. The Gaussian distribution is easily reparametrizable, and it is straightforward to sample from it. Many operations will have a cost of $O(D^3)$, where $D$ is the dimensionality of $\theta$. Restricting $\mathcal{Q}$ further by constraining the covariance $S$ can reduce this cost. For example, setting $S$ to be diagonal will reduce the number of variational parameters and avoid matrix inversions.
Mean-Field Approximation
We need assumptions on $\mathcal{Q}$ to reduce the computational cost of variational inference and scale with high-dimensional $\theta$. The Mean-Field (MF) assumption imposes that all components of $\theta$ are independent of each other. A MF variational family can be specified as:
\[ \mathcal{Q}_{\mathrm{MF}} = \left\{ q = \prod_{i=1}^{D} q_{\varphi_i}(\theta_i) \right\}, \tag{2.14} \]
where $\varphi_i$ are the variational parameters for the variable $\theta_i$. Under the MF approximation, the number of variational parameters grows linearly with the dimensionality of $\theta$ instead of quadratically. Additionally, integrals in Equation (2.12) can become one-dimensional or sometimes analytically tractable (the KL for example), and therefore more easily solvable. However, MF cannot capture potential posterior correlations between the components of $\theta$.
An intermediate solution is to assume independence between blocks of variables instead, similarly to the blocked Gibbs sampler. Given $I = \{1, 2, \ldots, D\}$, the set of indices of $\theta$, we can partition it into $K$ disjoint subsets $I_k \subseteq I$ such that $I = \cup_{k=1}^{K} I_k$ and $I_i \cap I_j = \emptyset$ for $i \neq j$. The variational distribution based on this Blocked Mean-Field (BMF) approximation is then defined as
\[ q_\varphi^{\mathrm{BMF}}(\theta) = \prod_{k=1}^{K} q_{\varphi_k}(\theta_{I_k}), \tag{2.15} \]
where $\varphi_k$ are the variational parameters for the set of variables $\theta_{I_k}$. The BMF approximation can capture correlations inside blocks of variables but loses some of MF's computational attractiveness.
Coordinate Ascent VI
The Coordinate Ascent Variational Inference (CAVI)⁶ approach is an alternative to the gradient descent approach of Equation (2.13). Instead of moving all parameters at once in the gradient direction, we are interested in finding the optimal solution for each set of variational parameters $\varphi_i$ one after another by keeping the others fixed:
\[ \varphi_i^* = \arg\min_{\varphi_i} F(\varphi_i, \varphi_{/i}), \tag{2.16} \]
where $\varphi_{/i} = \{\varphi_j \,|\, j \neq i\}$. Using the BMF approximation, we can update blocks of variational parameters at once. The optimal $\varphi_i^*$ can be found by solving:
\[ \nabla_{\varphi_i} F(\varphi)\big|_{\varphi_i = \varphi_i^*} = 0, \tag{2.17} \]
or performing a partial version of the gradient descent from Equation (2.13). The solution to Equation (2.16) is always given by
\[ q^*_{\varphi_i}(\theta_i) \propto \exp\left(\mathbb{E}_{q_\varphi(\theta_{/i})}\left[\log p(\theta_i|\theta_{/i}, x)\right]\right), \tag{2.18} \]
where $\theta_{/i}$ represents the collection of variables $\theta_{/i} = \{\theta_j \,|\, j \neq i\}$ [37]. Even when the expectation involved in Equation (2.18) is available in closed form, the resulting distribution might not always be normalizable, but we are usually only interested in the different moments of $q_{\varphi_i}(\theta_i)$.
Algorithm 1 summarizes the CAVI algorithm. The order of the updates does not matter as long as the variational parameters $\varphi$ are initialized in their respective domain.
⁵ The VGA is explored in more detail in Chapter 6.
⁶ The word ascent is used since the scheme was originally derived using the negative VFE, i.e., the ELBO.
Algorithm 1 CAVI algorithm
while $|F^{t+1} - F^{t}| > \epsilon$ do
    for $i \in \{1, \ldots, D\}$ do
        $\varphi_i^{t+1} = \arg\min_{\varphi_i} F(\varphi^{t+1}_{1:(i-1)}, \varphi_i, \varphi^{t}_{(i+1):D})$
    end for
end while
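The coordinate-wise updates can be illustrated on a textbook case: a mean-field approximation of a Gaussian target $\mathcal{N}(\mu, \Lambda^{-1})$, for which the optimal factor $q_i$ is a Gaussian with variance $1/\Lambda_{ii}$ and a mean that depends on the current means of the other factors. The sketch below is an assumption-laden toy example (function name `cavi_gaussian`, target and tolerance are illustrative), and it stops on the change of the parameters rather than on the change of $F$ used in Algorithm 1.

```python
import numpy as np

def cavi_gaussian(mu, precision, n_sweeps=100, tol=1e-10):
    """Mean-field CAVI for a Gaussian target N(mu, precision^{-1}).

    Each factor q_i is N(m_i, 1 / precision[i, i]); only the means need iterating.
    """
    D = len(mu)
    m = np.zeros(D)                        # initial variational means
    for _ in range(n_sweeps):
        m_old = m.copy()
        for i in range(D):
            others = [j for j in range(D) if j != i]
            # Optimal q_i given the current means of the other factors.
            m[i] = mu[i] - precision[i, others] @ (m[others] - mu[others]) / precision[i, i]
        if np.max(np.abs(m - m_old)) < tol:
            break
    return m, 1.0 / np.diag(precision)

mu = np.array([1.0, -2.0])
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
m, variances = cavi_gaussian(mu, np.linalg.inv(cov))
print(m, variances)
```

In this example the means converge to the true mean, while the factor variances ($1 - 0.8^2 = 0.36$) are smaller than the true marginal variances of 1, which illustrates that MF cannot represent the posterior correlation.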
The CAVI and Gibbs sampling algorithms are very similar in nature. The observations on Gibbs sampling also apply: CAVI updates on a distribution with MF are easily computable but converge more slowly, while updates with the BMF approach are more complex to derive, avoid some MF pitfalls, and provide a richer distribution.
Natural Gradients. One interesting aspect of CAVI is that it implicitly uses natural gradients [1]. A natural gradient is a gradient preconditioned with the inverse Fisher information matrix, defined as
\[ I_\theta = \mathbb{E}_{p(x|\theta)}\left[(\nabla_\theta\log p(x|\theta))(\nabla_\theta\log p(x|\theta))^\top\right] = -\mathbb{E}_{p(x|\theta)}\left[H(\log p(x|\theta))\right], \tag{2.19} \]
where $H(f)$ is the Hessian matrix of the function $f$. The Fisher information matrix is a Riemannian metric that gives the direction of the steepest descent with respect to the KL divergence. The natural gradient is given by:
\[ \tilde{\nabla}_\varphi F(\varphi) = I^{-1}\nabla_\varphi F(\varphi). \]
The natural gradient works in a metric that maximizes the change of the infinitesimal KL divergence between the given distribution and its target [48]. The updates of the CAVI Algorithm 1 for exponential family distributions can be interpreted as natural gradient updates with learning rate 1 [60]:
\[ \varphi^{t+1} = \varphi^{t} - I_\theta^{-1}\nabla_\varphi F(\varphi^{t}), \]
i.e. natural gradient descent on the VFE $F$ or, equivalently, natural gradient ascent on the ELBO $-F$.
When working with constrained parameters like the covariance matrix of the Gaussian variational distribution, a step with a high learning rate might overshoot out of the cone of positive-definite matrices. Salimbeni et al. [48] propose a schedule to compensate, while Lin et al. [34] force the trajectory onto a geodesic. Both approaches are computationally expensive, while we get this feature automatically.
2.3.3 Scale mixtures and conditionally conjugate likelihoods
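We base a large part of this work on mixtures and use scale mixtures in particular. A scale mixture is a continuous mixture of a distribution with a varying scale parameter. A textbook example is the Student-T distribution, which is a Gaussian scale mixture with a Gamma prior on the precision:
\[ T_\nu(x) = \int_0^\infty \mathcal{N}(x|0, \omega^{-1})\,\mathrm{Ga}\left(\omega\,\Big|\,\frac{\nu}{2}, \frac{\nu}{2}\right)d\omega, \]
where $\mathrm{Ga}$ is a Gamma distribution (shape and rate parametrization). Another example is the Laplace distribution, which is also a Gaussian scale mixture, this time over the variance:
\[ \mathrm{La}(x|b) = \int_0^\infty \mathcal{N}(x|0, \omega)\,\mathrm{Exp}\left(\omega\,\Big|\,\frac{1}{2b^2}\right)d\omega, \]
where $\mathrm{Exp}$ is the exponential distribution with rate $1/(2b^2)$.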
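Both mixture representations can be checked by simulation. The sketch below assumes that the $\mathrm{Ga}(\nu/2, \nu/2)$ distribution (shape, rate) is placed on the precision of the Gaussian for the Student-T case, and that the exponential distribution with rate $1/(2b^2)$ is placed on the variance for the Laplace case; the values of `nu`, `b` and the sample size are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
n, nu, b = 400_000, 5.0, 1.5

# Student-t: precision omega ~ Gamma(shape=nu/2, rate=nu/2), then x | omega ~ N(0, 1/omega).
omega = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)    # NumPy uses scale = 1 / rate
x_t = rng.standard_normal(n) / np.sqrt(omega)
x_t_direct = rng.standard_t(nu, size=n)                   # reference draws

# Laplace: variance ~ Exp(rate=1/(2 b^2)), then x | variance ~ N(0, variance).
variance = rng.exponential(scale=2 * b**2, size=n)
x_laplace = np.sqrt(variance) * rng.standard_normal(n)

tail_mix = np.mean(np.abs(x_t) > 2.0)
tail_direct = np.mean(np.abs(x_t_direct) > 2.0)
mean_abs_laplace = np.mean(np.abs(x_laplace))
print(tail_mix, tail_direct)       # both estimate P(|x| > 2) under a Student-t
print(mean_abs_laplace, b)         # E|x| = b for a Laplace distribution with scale b
```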
These representations appear when computing predictive distributions. For example, when performing Gaussian linear regression with a fixed weight $\theta$ and a Gamma prior on the likelihood precision $1/\sigma^2$, the resulting posterior predictive distribution will be a Student-T distribution.
This thesis shows that we can use this connection the other way around. Certain likelihoods $p(x|\theta)$ can be defined as scale mixtures $\int p(x|\theta,\omega)\,p(\omega)\,d\omega$. We can "unmarginalize" the likelihood by adding the scale variable $\omega$ to our model: we augment $p(x|\theta)$ to $p(x,\omega|\theta)$. For example, we can augment a Student-T likelihood into a Gaussian likelihood with a Gamma prior on the precision. The advantage of the augmented model is that it produces conditionally conjugate likelihoods for all the model variables, as the next chapters will show.
3 Efficient Gaussian Process Classification Using Pólya-Gamma Data Augmentation
Before my doctoral studies, I worked on extending the work of Henao et al. [20] on Bayesian support vector machines to GPs as well as scaling them up to big data [59]. This paper is not included in this thesis as it was not published during my Ph.D. The approach proposed by Henao et al. [20] was the first step on the road of our research on augmentations. A natural continuation was to explore the binary classification problem with the logit link.
This paper extends the work of Polson et al. [44] on augmenting with Pólya-Gamma variables to GPs and sparse GPs. The main contributions of this paper are showing that the augmented model outperforms other state-of-the-art methods for GPs and deriving a remarkable equivalence between the variational bound derived by Jaakkola and Jordan [27] and the Pólya-Gamma augmentation.
Authors:
Florian Wenzel¹,*, Théo Galy-Fajou²,*, Christian Donner², Marius Kloft¹,³, Manfred Opper²
* Equal contribution, ¹ TU Kaiserslautern, Germany, ² TU Berlin, Germany, ³ University of Southern California, USA
Details:
Type: Conference article
Submitted: September 2018
Accepted: December 2018
DOI: https://doi.org/10.1609/aaai.v33i01.33015417
Conference: AAAI 2019
Contributions:
For an explanation of the terms see the Contributor Roles Taxonomy (CRediT).
F.W. T.G-F. C.D. M.K. M.O.
Conceptualization ✓ ✓ ✓ ✓
Methodology ✓ ✓
Formal Analysis ✓ ✓ ✓
Implementation ✓
Investigation ✓ ✓
Writing - Original Draft ✓ ✓ ✓
Writing - Review & Editing ✓ ✓ ✓ ✓
Supervision ✓
Funding Acquisition ✓ ✓
Efficient Gaussian Process Classification Using Pólya-Gamma Data Augmentation
Florian Wenzel,¹,* Théo Galy-Fajou,²,* Christian Donner,² Marius Kloft,¹,³ Manfred Opper²
*Contributed equally, ¹TU Kaiserslautern, Germany, ²TU Berlin, Germany, ³University of Southern California, USA
[email protected], manfred.opper@tu-berlin.de
Abstract
We propose a scalable stochastic variational approach to GP classification building on Pólya-Gamma data augmentation and inducing points. Unlike former approaches, we obtain closed-form updates based on natural gradients that lead to efficient optimization. We evaluate the algorithm on real-world datasets containing up to 11 million data points and demonstrate that it is up to two orders of magnitude faster than the state-of-the-art while being competitive in terms of prediction performance.
1 Introduction
Gaussian processes (GPs) (Rasmussen and Williams 2005) provide a popular Bayesian non-linear non-parametric method for regression and classification. Because of their ability to adapt accurately to data, and thus to achieve high prediction accuracy while providing well calibrated uncertainty estimates, GPs are a standard method in several application areas, including geospatial predictive modeling (Stein 2012) and robotics (Dragiev, Toussaint, and Gienger 2011).
However, recent trends in data availability in the sciences and technology have made it necessary to develop algorithms capable of processing massive data (John Walker 2014). Currently, GP classification has limited applicability to big data. Naive inference typically scales cubically in the number of data points, and exact computation of the posterior and marginal likelihood is intractable.
Nevertheless, the combination of so-called sparse Gaussian process techniques with approximate inference methods, such as expectation propagation (EP) or the variational approach, has enabled GP classification for datasets containing millions of data points (Hernández-Lobato and Hernández-Lobato 2016; Salimbeni, Eleftheriadis, and Hensman 2018).
While these results are already impressive, we will show in this paper that a speedup of up to two orders of magnitude can be achieved. Our approach is based on considering an augmented version of the original GP classification model and replacing the ordinary (stochastic) gradients for optimization by more efficient natural gradients, i.e. the standard Euclidean gradient multiplied by the inverse Fisher information matrix. Natural gradients have recently been used successfully in a variety of variational inference problems (Honkela et al. 2010; Wenzel et al. 2017; Jähnichen et al. 2018).
Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Unfortunately, an efficient computation of the natural gradient for the GP classification problem is not straightforward. The use of the probit link function in Dezfouli and Bonilla (2015); Hernández-Lobato and Hernández-Lobato (2016); Mandt et al. (2017); Salimbeni, Eleftheriadis, and Hensman (2018) leads to expectations in the variational objective functions that can only be computed by numerical quadrature, thus preventing efficient optimization.
We derive a natural-gradient approach to variational inference in GP classification based on the logit link. We exploit that the corresponding likelihood has an auxiliary variable representation as a continuous mixture of Gaussians involving Pólya-Gamma random variables (Polson, Scott, and Windle 2013).
Unlike former approaches, our natural gradient updates can be computed in closed form. Moreover, they have the advantage that they correspond to block-coordinate ascent updates and, therefore, learning rates close to one can be chosen. This leads to a fast and stable algorithm which is simple to implement. Our main contributions are as follows:
• We present a Gaussian process classification model using a logit link function that is based on Pólya-Gamma data augmentation and inducing points for Gaussian process inference.
• We derive an efficient inference algorithm based on stochastic variational inference and natural gradients. All natural gradient updates are given in closed form and do not rely on numerical quadrature methods or sampling approaches. Natural gradients have the advantage that they provide effective second-order optimization updates.
• In our experiments, we demonstrate that our approach drastically improves speed up to two orders of magnitude while being competitive in terms of prediction performance. We apply our method to massive real-world datasets up to 11 million points and demonstrate superior scalability.
The paper is organized as follows. In section 2 we discuss related work. In section 3 we introduce our novel scalable GP classification model and in section 4 we present an efficient variational inference algorithm. Section 5 concludes with experiments. Our code is available via GitHub.¹
2 Background and Related Work
Gaussian process classification. Hensman and Matthews (2015) consider Gaussian process classification with a probit inverse link function and suggest a variational Gaussian model that builds on inducing points. By employing automatic differentiation, Salimbeni, Eleftheriadis, and Hensman (2018) generalize this approach to use natural gradients in non-conjugate GP models. Khan and Nielsen (2018) consider natural gradient updates in the setting of variational inference with exponential families. Unlike our approach, these methods do not benefit from closed-form updates and have to resort to numerical approximations. Moreover, our approach has the advantage that a higher learning rate close to one can be chosen, leading to updates that can be interpreted as block-coordinate ascent updates.
Izmailov, Novikov, and Kropotov (2018) use tensor train decomposition to allow for the training of GP models with billions of inducing points. The updates are not computed in closed form and they do not use natural gradients.
Dezfouli and Bonilla (2015) propose a general automated variational inference approach for sparse GP models with non-conjugate likelihood. Since they follow a black box approach and do not exploit model specific properties, they do not employ efficient optimization techniques.
Hernández-Lobato and Hernández-Lobato (2016) follow an expectation propagation approach based on inducing points and have a similar computational cost as Hensman and Matthews (2015).
Pólya-Gamma data augmentation. Polson, Scott, and Windle (2013) introduced the idea of data augmentation in logistic models using the class of Pólya-Gamma distributions. This allows for exact inference via Gibbs sampling or approximate variational inference schemes (Scott and Sun 2013).
Linderman, Johnson, and Adams (2015) extend this idea to multinomial models and discuss the application for Gaussian processes with multinomial observations, but their approach does not scale to big datasets and they do not consider the concept of inducing points.
¹ https://github.com/theogf/AugmentedGaussianProcesses.jl
3 Model
The logit GP classification model is defined as follows. Let $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d\times n}$ be the $d$-dimensional training points with labels $y = (y_1, \ldots, y_n) \in \{-1, 1\}^n$. The likelihood of the labels is
\[ p(y|f, X) = \prod_{i=1}^{n}\sigma(y_i f(x_i)), \tag{1} \]
where $\sigma(z) = (1+\exp(-z))^{-1}$ is the logit link function and $f$ is the latent decision function. We place a GP prior over $f$ and obtain the joint distribution of the labels and the latent GP
\[ p(y, f|X) = p(y|f, X)\,p(f|X), \tag{2} \]
where $p(f|X) = \mathcal{N}(f|0, K_{nn})$ and $K_{nn}$ denotes the kernel matrix evaluated at the training points $X$. For the sake of clarity we omit the conditioning on $X$ in the following.
3.1 Pólya-Gamma data augmentation
Due to the analytically inconvenient form of the likelihood function, inference for logit GP classification is a challenging problem. We aim to remedy this issue by considering an augmented representation of the original model. Later we will see that the augmented model is indeed advantageous as it leads to efficient closed-form updates in our variational inference scheme.
Polson, Scott, and Windle (2013) introduced the class of Pólya-Gamma random variables and proposed a data augmentation strategy for inference in models with binomial likelihoods. The augmented model has the appealing property that the likelihood of the latent function $f$ is proportional to a Gaussian density when conditioned on the augmented Pólya-Gamma variables. This allows for Gibbs sampling methods, where model parameters and Pólya-Gamma variables can be sampled alternately from the posterior (Polson, Scott, and Windle 2013). Alternatively, the augmentation scheme can be utilized to derive an efficient approximate inference algorithm in the variational inference framework, which will be pursued here.
The Pólya-Gamma distribution is defined as follows. The random variable $\omega \sim \mathrm{PG}(b, 0)$, $b > 0$, is defined by the moment generating function
\[ \mathbb{E}_{\mathrm{PG}(\omega|b,0)}[\exp(-\omega t)] = \frac{1}{\cosh^{b}(\sqrt{t/2})}. \tag{3} \]
It can be shown that this is the Laplace transform of an infinite convolution of gamma distributions. The definition is related to our problem by the fact that the logit link can be written in a form that involves the cosh function, namely $\sigma(z_i) = \exp(\frac{1}{2}z_i)\,(2\cosh(\frac{z_i}{2}))^{-1}$. In the following we derive a representation of the logit link in terms of Pólya-Gamma variables.
First, we define the general $\mathrm{PG}(b, c)$ class, which is derived by an exponential tilting of the $\mathrm{PG}(b, 0)$ density. It is given by
\[ \mathrm{PG}(\omega|b, c) \propto \exp\left(-\frac{c^2}{2}\omega\right)\mathrm{PG}(\omega|b, 0). \]
From the moment generating function (3) the first moment can be directly computed:
\[ \mathbb{E}_{\mathrm{PG}(\omega|b,c)}[\omega] = \frac{b}{2c}\tanh\frac{c}{2}. \]
For the subsequently presented variational algorithm these properties suffice and the full representation of the Pólya-Gamma density $\mathrm{PG}(\omega|b, c)$ is not required.
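The closed-form first moment can be cross-checked against the representation of a Pólya-Gamma variable as an infinite weighted sum of gamma variables. The sketch below assumes the sum-of-gammas form given by Polson, Scott, and Windle (2013), $\omega = \frac{1}{2\pi^2}\sum_{k\geq 1} g_k / ((k-\frac{1}{2})^2 + \frac{c^2}{4\pi^2})$ with $g_k \sim \mathrm{Ga}(b, 1)$, and truncates the series; the function names and the truncation level are illustrative assumptions.

```python
import numpy as np

def pg_mean_series(b, c, n_terms=200_000):
    """E[omega] for omega ~ PG(b, c) from the (truncated) sum-of-gammas representation.

    Taking expectations replaces each g_k ~ Gamma(b, 1) by its mean b.
    """
    k = np.arange(1, n_terms + 1)
    return b / (2 * np.pi**2) * np.sum(1.0 / ((k - 0.5) ** 2 + c**2 / (4 * np.pi**2)))

def pg_mean_closed_form(b, c):
    """Closed form b / (2c) * tanh(c / 2), with the limit b / 4 at c = 0."""
    return b / 4.0 if c == 0 else b / (2 * c) * np.tanh(c / 2)

for b, c in [(1.0, 0.0), (1.0, 1.0), (2.0, 3.5)]:
    print(b, c, pg_mean_series(b, c), pg_mean_closed_form(b, c))
```

No sampler for the Pólya-Gamma distribution is needed for this check, in line with the remark that the first moment suffices for the variational algorithm.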
We now adapt the data augmentation strategy based on Pólya-Gamma variables for the GP classification model. To do this we write the non-conjugate logistic likelihood function (1) in terms of Pólya-Gamma variables
\[ \sigma(z_i) = (1 + \exp(-z_i))^{-1} = \frac{\exp(\frac{1}{2}z_i)}{2\cosh(\frac{z_i}{2})} = \frac{1}{2}\int\exp\left(\frac{z_i}{2} - \frac{z_i^2}{2}\omega_i\right)p(\omega_i)\,d\omega_i, \tag{4} \]
where $p(\omega_i) = \mathrm{PG}(\omega_i|1, 0)$ and by making use of (3). For more details see Polson, Scott, and Windle (2013). Using this identity and substituting $z_i = y_i f(x_i)$ we augment the joint density (2) with Pólya-Gamma variables
\[ p(y, \omega, f) \propto \exp\left(\frac{1}{2}y^\top f - \frac{1}{2}f^\top\Omega f\right)p(f)\,p(\omega), \tag{5} \]
where $\Omega = \mathrm{diag}(\omega)$ is the diagonal matrix of the Pólya-Gamma variables $\{\omega_i\}$. In contrast to the original model (2), the augmented model is conditionally conjugate, forming the basis for deriving closed-form updates in section 4.
Interestingly, employing a structured mean-field variational inference approach (cf. section 4) to the plain Pólya-Gamma augmented model (5) leads to the same bound for GP classification derived by Gibbs and MacKay (2000). This is an interesting new perspective on this bound since they do not employ a data augmentation approach. We provide a proof in appendix A.5. Our approach goes beyond Gibbs and MacKay (2000) by providing a fully Bayesian perspective, including a sparse GP prior (section 3.2) in the model and proposing a scalable inference algorithm based on natural gradients (section 4).
3.2 Sparse Gaussian process
Inference in GP models typically has the computational complexity $O(n^3)$. To obtain a scalable approximation of our model, we focus on inducing point methods (Snelson and Ghahramani 2006). We follow a similar approach as in Hensman and Matthews (2015) and reduce the complexity to $O(m^3)$, where $m$ is the number of inducing points.
We augment the latent GP $f$ with $m$ additional input-output pairs $(Z_1, u_1), \ldots, (Z_m, u_m)$, termed inducing inputs and inducing variables. The function values of the GP $f$ and the inducing variables $u = (u_1, \ldots, u_m)$ are connected via
\[ p(f|u) = \mathcal{N}\left(f\,|\,K_{nm}K_{mm}^{-1}u, \tilde{K}\right), \qquad p(u) = \mathcal{N}(u|0, K_{mm}), \tag{6} \]
where $K_{mm}$ is the kernel matrix resulting from evaluating the kernel function between all inducing inputs, $K_{nm}$ is the cross-kernel matrix between inducing inputs and training points and $\tilde{K} = K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}$. Including the inducing points in our model gives the augmented joint distribution
\[ p(y, \omega, f, u) = p(y|\omega, f)\,p(\omega)\,p(f|u)\,p(u). \tag{7} \]
Note that the original model (2) can be recovered by marginalizing $\omega$ and $u$.
4 Inference
The goal of Bayesian inference is to compute the posterior of the latent model variables. Because this problem is intractable for the model at hand, we employ variational inference to map the inference problem to a feasible optimization problem. We first choose a family of tractable variational distributions and select the best candidate by minimizing the Kullback-Leibler divergence between the variational distribution and the posterior. This is equivalent to optimizing a lower bound on the marginal likelihood, known as the evidence lower bound (ELBO) (Jordan et al. 1999; Wainwright and Jordan 2008).
In the following we develop a stochastic variational inference (SVI) algorithm that enables stochastic optimization based on natural gradient updates which are given in closed form.
4.1 Why use natural gradients?
Using the natural gradient over the standard Euclidean gradient is favorable since natural gradients are invariant to reparameterization of the variational family (Amari and Nagaoka 2007; Martens 2017) and provide effective second-order optimization updates (Amari 1998; Hoffman et al. 2013).
The superiority of using natural gradients in our approach can be explained by the following. We reformulate the GP classification model as an augmented model which is conditionally conjugate. When using a learning rate of one, the natural gradient updates correspond to block-coordinate ascent updates, i.e. in each iteration each parameter is set to its optimal value given the remaining parameters (see appendix A.4 and Hoffman et al. (2013)). In practice, we employ stochastic variational inference, i.e. we only use mini-batches of the data to obtain a noisy version of the natural gradient. In this setting, learning rates slightly less than one have to be chosen.
This is in contrast to former natural gradient based approaches, e.g. Salimbeni, Eleftheriadis, and Hensman (2018), that focus on the original non-conjugate GP classification model. Although they benefit from using natural gradients, they have the disadvantage that their updates do not correspond to coordinate-ascent updates. Thus, learning rates that are much smaller than one have to be used to ensure convergence.
Therefore, in our approach, we can use much higher learning rates, and optimization is faster and more stable, which we demonstrate in the experiments.
4.2 Variational approximation
We aim to approximate the posterior of the inducing points $p(u|y)$ and apply the methodology of variational inference to the marginal joint distribution $p(y, \omega, u) = p(y|\omega, u)\,p(\omega)\,p(u)$. Following a similar approach as Hensman and Matthews (2015), we apply Jensen's inequality to obtain a tractable lower bound on the log-likelihood of the labels
\[ \log p(y|\omega, u) = \log\mathbb{E}_{p(f|u)}[p(y|\omega, f)] \geq \mathbb{E}_{p(f|u)}[\log p(y|\omega, f)]. \tag{8} \]
By this inequality we construct a variational lower bound on the evidence
\[ \begin{aligned} \log p(y) &\geq \mathbb{E}_{q(u,\omega)}[\log p(y|u, \omega)] - \mathrm{KL}(q(u, \omega)\,||\,p(u, \omega)) \\ &\geq \mathbb{E}_{p(f|u)q(u)q(\omega)}[\log p(y|\omega, f)] - \mathrm{KL}(q(u, \omega)\,||\,p(u, \omega)) =: \mathcal{L}, \end{aligned} \]
where the first inequality is the usual evidence lower bound (ELBO) in variational inference and the second inequality is due to (8).
We follow a structured mean-field approach (Wainwright and Jordan 2008) and assume independence between the inducing variables $u$ and the Pólya-Gamma variables $\omega$, yielding a variational distribution of the form $q(u, \omega) = q(u)\,q(\omega)$. Setting the functional derivative of $\mathcal{L}$ w.r.t. $q(u)$ and $q(\omega)$ to zero, respectively, results in the following consistency condition for the maximum,
\[ q(u, \omega) = q(u)\prod_i q(\omega_i), \tag{9} \]
with $q(\omega_i) = \mathrm{PG}(\omega_i|1, c_i)$ and $q(u) = \mathcal{N}(u|\mu, \Sigma)$. Remarkably, we do not have to use the full Pólya-Gamma class $\mathrm{PG}(\omega_i|b_i, c_i)$, but instead consider the restricted class $b_i = 1$ since it already contains the optimal distribution.
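We use (9) as variational family, which is parameterized by the variational parameters $\{\mu, \Sigma, c\}$, and obtain a closed-form expression of the variational bound
$$\begin{aligned}
\mathcal{L}(c,\mu,\Sigma) &= \mathbb{E}_{p(f|u)q(u)q(\omega)}[\log p(y|\omega,f)] - \mathrm{KL}(q(u,\omega)\,\|\,p(u,\omega)) \\
&\overset{c}{=} \frac{1}{2}\Big(\log|\Sigma| - \log|K_{mm}| - \mathrm{tr}(K_{mm}^{-1}\Sigma) - \mu^\top K_{mm}^{-1}\mu \\
&\qquad + \sum_i \Big\{ y_i\kappa_i\mu - \theta_i\big(\widetilde{K}_{ii} + \kappa_i\Sigma\kappa_i^\top + \mu^\top\kappa_i^\top\kappa_i\mu\big) + c_i^2\theta_i - 2\log\cosh\frac{c_i}{2}\Big\}\Big),
\end{aligned} \tag{10}$$
where $\theta_i = \frac{1}{2c_i}\tanh\frac{c_i}{2}$ and $\kappa_i = K_{im}K_{mm}^{-1}$. Remarkably, all intractable terms involving expectations of $\log \mathrm{PG}(\omega_i|1,0)$ cancel out. Details are provided in appendix A.2.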
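As an illustration, the bound (10) can be evaluated with a few lines of NumPy. This is a minimal sketch and not the authors' Julia implementation: the function names `pg_mean` and `variational_bound`, the argument layout, and the convention that labels take values in {-1, +1} are assumptions made for this example. The sign convention inside the sum follows the derivation in appendix A.2.

```python
import numpy as np

def pg_mean(c):
    """theta = tanh(c/2) / (2c), the mean of PG(1, c); the limit at c = 0 is 1/4."""
    c = np.asarray(c, dtype=float)
    small = np.abs(c) < 1e-8
    safe = np.where(small, 1.0, c)  # avoid 0/0; the value is replaced below
    return np.where(small, 0.25, np.tanh(safe / 2.0) / (2.0 * safe))

def variational_bound(c, mu, Sigma, y, Kmm, Knm, Ktilde_diag):
    """Bound (10) up to an additive constant (hypothetical helper)."""
    kappa = np.linalg.solve(Kmm, Knm.T).T            # rows kappa_i = K_im Kmm^{-1}
    theta = pg_mean(c)
    _, logdet_S = np.linalg.slogdet(Sigma)
    _, logdet_K = np.linalg.slogdet(Kmm)
    gauss = (logdet_S - logdet_K
             - np.trace(np.linalg.solve(Kmm, Sigma))
             - mu @ np.linalg.solve(Kmm, mu))
    # Ktilde_ii + kappa_i Sigma kappa_i^T + mu^T kappa_i^T kappa_i mu
    quad = (Ktilde_diag
            + np.einsum("ij,jk,ik->i", kappa, Sigma, kappa)
            + (kappa @ mu) ** 2)
    data = (y * (kappa @ mu) - theta * quad
            + c ** 2 * theta - 2.0 * np.log(np.cosh(c / 2.0)))
    return 0.5 * (gauss + data.sum())
```

All terms are available in closed form, so no sampling or quadrature is needed to monitor the bound during training.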
4.3 Stochastic variational inference
Our algorithm alternates between updates of the local variational parameters $c$ and the global parameters $\mu$ and $\Sigma$. In each iteration we update the parameters based on a mini-batch of the data $S \subset \{1, \dots, n\}$ of size $s = |S|$.
We update the local parameters $c_S$ in the mini-batch $S$ by employing coordinate ascent. To this end, we fix the global parameters and analytically compute the unique maximum of (10) w.r.t. the local parameters, leading to the updates
$$c_i = \sqrt{\widetilde{K}_{ii} + \kappa_i\Sigma\kappa_i^\top + \mu^\top\kappa_i^\top\kappa_i\mu} \tag{11}$$
for $i \in S$.
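The local update (11) is a single vectorized expression. The sketch below assumes that the rows of `kappa_S` hold the vectors $\kappa_i$ for the points in the mini-batch and that `Ktilde_diag_S` holds the corresponding diagonal entries $\widetilde{K}_{ii}$; the function name is hypothetical.

```python
import numpy as np

def update_local(mu, Sigma, kappa_S, Ktilde_diag_S):
    """Closed-form maximizer (11) of the bound w.r.t. c_i for a mini-batch."""
    quad = np.einsum("ij,jk,ik->i", kappa_S, Sigma, kappa_S)  # kappa_i Sigma kappa_i^T
    mean_sq = (kappa_S @ mu) ** 2                              # mu^T kappa_i^T kappa_i mu
    return np.sqrt(Ktilde_diag_S + quad + mean_sq)
```

Only the diagonal of $\kappa_S \Sigma \kappa_S^\top$ is needed, which is why the sketch uses `einsum` instead of forming the full $s \times s$ product.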
We update the global parameters by employing stochastic optimization of the variational bound (10). The optimization is based on stochastic estimates of the natural gradients of the global parameters. We use the natural parameterization of the variational Gaussian distribution, i.e., the parameters $\eta_1 := \Sigma^{-1}\mu$ and $\eta_2 := -\frac{1}{2}\Sigma^{-1}$. Using the natural parameters results in simpler and more effective updates. The natural gradients based on the mini-batch $S$ are given by
$$\begin{aligned}
\widetilde{\nabla}_{\eta_1}\mathcal{L}_S &= \frac{n}{2s}\kappa_S^\top y_S - \eta_1, \\
\widetilde{\nabla}_{\eta_2}\mathcal{L}_S &= -\frac{1}{2}\Big(K_{mm}^{-1} + \frac{n}{s}\kappa_S^\top\Theta_S\kappa_S\Big) - \eta_2,
\end{aligned} \tag{12}$$
where $\Theta = \mathrm{diag}(\theta)$ and $\theta_i = \frac{1}{2c_i}\tanh\frac{c_i}{2}$. The factor $\frac{n}{s}$ is due to the rescaling of the mini-batches. The global parameters are updated according to a stochastic natural gradient ascent scheme. We employ the adaptive learning rate method described by Ranganath et al. (2013).
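One step of the natural gradient ascent scheme based on (12) could look as follows. This is a sketch under stated assumptions: a fixed step size `rho` replaces the adaptive learning rate of Ranganath et al. (2013), `Kmm_inv` is assumed to be precomputed, and the function name is hypothetical.

```python
import numpy as np

def natural_gradient_step(eta1, eta2, y_S, kappa_S, theta_S, Kmm_inv, n, rho):
    """One stochastic natural gradient ascent step in the natural parameters."""
    s = len(y_S)
    grad1 = (n / (2.0 * s)) * kappa_S.T @ y_S - eta1
    grad2 = -0.5 * (Kmm_inv + (n / s) * kappa_S.T @ (theta_S[:, None] * kappa_S)) - eta2
    eta1_new = eta1 + rho * grad1
    eta2_new = eta2 + rho * grad2
    # Map back to mean and covariance: Sigma = (-2 eta2)^{-1}, mu = Sigma eta1.
    Sigma = np.linalg.inv(-2.0 * eta2_new)   # O(m^3) inversion
    mu = Sigma @ eta1_new
    return eta1_new, eta2_new, mu, Sigma
```

For a step size in (0, 1], the new $-2\eta_2$ is a convex combination of the previous precision matrix and $K_{mm}^{-1} + \frac{n}{s}\kappa_S^\top\Theta_S\kappa_S$, so it stays positive definite whenever the starting point is.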
The natural gradient updates always lead to a positive definite covariance matrix², and in contrast to Hensman and Matthews (2015) our implementation does not require any additional safeguard to ensure positive definiteness of the variational covariance matrix $\Sigma$. Details for the derivation of the updates can be found in appendix A.3. The complexity of each iteration in the inference scheme is $O(m^3)$, due to the inversion of the matrix $\eta_2$.
² This follows directly since $K_{mm}$ and $\Theta$ are positive definite.
On the quality of the approximation. In other applications of variational inference to GP classification, one tries to approximate the posterior directly by a Gaussian $q^*(f)$ which minimizes the Kullback-Leibler divergence between the variational distribution and the true posterior (Hensman and Matthews, 2015). In our paper, by contrast, we apply variational inference to the augmented model, looking for the best distribution that factorizes in the Pólya-Gamma variables $\omega_i$ and the original function $f$. This approach also yields a Gaussian approximation $q(f)$ as a factor in the optimal density. Of course $q(f)$ will be different from the 'optimal' $q^*(f)$. We could however argue that asymptotically, in the limit of a large number of data, the predictions given by both densities may not be too different, as the posterior uncertainty for both densities should become small (Opper and Archambeau, 2009).
It would be interesting to see how the ELBOs of the two variational approaches, which both give a lower bound on the likelihood of the data, differ. Unfortunately, such a computation would require the knowledge of the optimal $q^*(f)$. However, we can obtain some estimate of this difference when we assume that we use the same Gaussian density $q(f)$ for both bounds as an approximation. In this case, we obtain
$$\mathcal{L}_{\mathrm{orig}} - \mathcal{L}_{\mathrm{augmented}} = \mathbb{E}_{q(f)}\big[\mathrm{KL}(q(\omega)\,\|\,p(\omega|f, y))\big].$$
This lower bound on the gap is small if on average the variational approximation $q(\omega)$ is close to the posterior $p(\omega|f, y)$. For the sake of simplicity we consider here the non-sparse case, i.e. the inducing points equal the training points ($f = u$). However, it is straightforward to extend the results to the sparse case.
We empirically investigate the quality of our approximation in experiment 5.1.
Predictions. The approximate posterior of the GP values and inducing variables is given by $q(f,u) = p(f|u)q(u)$, where $q(u) = \mathcal{N}(u|\mu,\Sigma)$ denotes the optimal variational distribution. To predict the latent function value $f_*$ at a test point $x_*$ we substitute our approximate posterior into the standard predictive distribution
$$\begin{aligned}
p(f_*|y) &= \int p(f_*|f,u)\,p(f,u|y)\,df\,du \\
&\approx \int p(f_*|f,u)\,p(f|u)\,q(u)\,df\,du \\
&= \int p(f_*|u)\,q(u)\,du = \mathcal{N}(f_*|\mu_*, \sigma_*^2),
\end{aligned} \tag{13}$$
where the prediction mean is $\mu_* = K_{*m}K_{mm}^{-1}\mu$ and the variance is $\sigma_*^2 = K_{**} + K_{*m}K_{mm}^{-1}(\Sigma K_{mm}^{-1} - I)K_{m*}$. The matrix $K_{*m}$ denotes the kernel matrix between the test point and the inducing points and $K_{**}$ the kernel value of the test point. The distribution of the test labels is easily computed by applying the logit link function to (13),
$$p(y_* = 1|y) = \int \sigma(f_*)\,p(f_*|y)\,df_*. \tag{14}$$
This integral is analytically intractable but can be computed numerically by quadrature methods. This is adequate and fast since the integral is only one-dimensional.
Computing the mean and the variance of the predictive distribution has complexity $O(m)$ and $O(m^2)$, respectively.
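The predictive equations (13) and (14) can be sketched as follows. Gauss-Hermite quadrature is one possible choice for the one-dimensional integral; the text only states that a quadrature method is used, so this choice, the function name and the number of nodes are assumptions of the example.

```python
import numpy as np

def predict_proba(K_star_m, K_star_star, Kmm, mu, Sigma, n_quad=20):
    """Predictive mean, variance and p(y* = 1 | y) for a batch of test points."""
    A = np.linalg.solve(Kmm, K_star_m.T).T                  # K_*m Kmm^{-1}
    mean = A @ mu
    # K_** + K_*m Kmm^{-1} (Sigma Kmm^{-1} - I) K_m*  =  K_** + diag(A (Sigma - Kmm) A^T)
    var = K_star_star + np.einsum("ij,jk,ik->i", A, Sigma - Kmm, A)
    var = np.maximum(var, 0.0)                              # guard against round-off
    # Gauss-Hermite: E[g(f)] for f ~ N(mean, var) is approx.
    # (1/sqrt(pi)) * sum_k w_k g(mean + sqrt(2 var) x_k)
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    f = mean[:, None] + np.sqrt(2.0 * var)[:, None] * x[None, :]
    sigmoid = 0.5 * (1.0 + np.tanh(f / 2.0))                # numerically stable logistic
    proba = (sigmoid * w[None, :]).sum(axis=1) / np.sqrt(np.pi)
    return mean, var, proba
```

If $K_{mm}^{-1}\mu$ and $K_{mm}^{-1}(\Sigma K_{mm}^{-1} - I)$ are cached after training, the per-point cost reduces to the $O(m)$ and $O(m^2)$ figures quoted above; the sketch recomputes them for clarity.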
Optimization of the hyperparameters. We select the optimal kernel hyperparameters by maximizing the marginal likelihood $p(y|h)$, where $h$ denotes the set of hyperparameters (this approach is called empirical Bayes; Maritz and Lwin, 1989). We follow an approximate approach and optimize the fitted variational lower bound $\mathcal{L}(h)$ (10) as a function of $h$ by alternating between optimization steps w.r.t. the variational parameters and the hyperparameters (Mandt, Hoffman, and Blei, 2016).
5 Experiments
We compare our proposed method, efficient Gaussian process classification (x-gpc), with the state-of-the-art methods svgpc (Salimbeni, Eleftheriadis, and Hensman, 2018), provided in the package GPflow³ (Matthews et al., 2017), which builds on TensorFlow, and the EP approach epgpc by Hernández-Lobato and Hernández-Lobato (2016), implemented in R. All methods are applied to real-world datasets containing up to 11 million data points.
In all experiments a squared exponential covariance function with a common length scale parameter for each dimension, an amplitude parameter and an additive noise parameter is used. The kernel hyperparameters are initialized to the same values and optimized using Adam (Kingma and Ba, 2014), while the inducing point locations are initialized via k-means++ (Arthur and Vassilvitskii, 2007) and kept fixed during training. The SVI-based methods, x-gpc and svgpc, use an adaptive learning rate. All algorithms are run on a single CPU.
We experiment on 12 datasets from the OpenML website and the UCI repository ranging from 768 to 11 million data points. In the first experiment (section 5.1), we examine the quality of the approximation provided by x-gpc. In the next experiment (section 5.2), we evaluate the prediction performance and run time of x-gpc, svgpc and epgpc on several real-world datasets. Finally, in section 5.3, we examine the sensitivity of all methods to the number of inducing points.
5.1 Quality of the approximation
We empirically examine the quality of the variational approximation provided by our method. In Fig. 1, we compare the approximations to the true posterior obtained by employing an asymptotically correct Gibbs sampler (Polson and Scott, 2011; Linderman, Johnson, and Adams, 2015). We compare the posterior mean and variance as well as the prediction probabilities with the ground truth. Since the Gibbs sampler does not scale to large datasets we experiment on the small Diabetes dataset. In Fig. 1 we plot the approximated values vs. the ground truth. We find that our approximation is very close to the true posterior.
³ We use GPflow version 1.2.0.
Figure 1: Posterior mean (µ), variance (σ) and predictive marginals (p) of the Diabetes dataset. Each plot shows the MCMC ground truth on the x-axis and the estimated value of our model on the y-axis. Our approximation is very close to the ground truth.
5.2 Numerical comparison
Dataset           Metric     X-GPC          SVGPC          EPGPC
aXa               Error      0.17 ± 0.07    0.17 ± 0.07    0.17 ± 0.07
n = 36,974        NLL        0.29 ± 0.13    0.36 ± 0.13    0.34 ± 0.13
d = 123           Time (s)   47 ± 2.2       451 ± 7.8      214 ± 4.8
Bank Market.      Error      0.14 ± 0.12    0.12 ± 0.12    0.12 ± 0.13
n = 45,211        NLL        0.27 ± 0.22    0.31 ± 0.26    0.33 ± 0.20
d = 43            Time (s)   9 ± 1.5        205 ± 6.6      46 ± 3.5
Click Pred.       Error      0.17 ± 0.00    0.17 ± 0.00    0.17 ± 0.01
n = 399,482       NLL        0.39 ± 0.07    0.46 ± 0.00    0.46 ± 0.01
d = 12            Time (s)   4.5 ± 1.3      102 ± 3.0      8.1 ± 0.45
Cod RNA           Error      0.04 ± 0.00    0.04 ± 0.00    0.04 ± 0.00
n = 343,564       NLL        0.11 ± 0.03    0.13 ± 0.00    0.12 ± 0.00
d = 8             Time (s)   3.7 ± 0.13     115 ± 4.3      869 ± 5.2
Diabetes          Error      0.23 ± 0.07    0.23 ± 0.06    0.24 ± 0.06
n = 768           NLL        0.47 ± 0.11    0.47 ± 0.10    0.48 ± 0.09
d = 8             Time (s)   8.8 ± 0.12     150 ± 5.1      8 ± 0.45
Electricity       Error      0.24 ± 0.06    0.26 ± 0.06    0.26 ± 0.06
n = 45,312        NLL        0.31 ± 0.17    0.53 ± 0.08    0.53 ± 0.06
d = 8             Time (s)   8.2 ± 0.48     356 ± 6.9      13.5 ± 1.50
German            Error      0.25 ± 0.12    0.25 ± 0.11    0.26 ± 0.13
n = 1,000         NLL        0.44 ± 0.17    0.51 ± 0.15    0.53 ± 0.11
d = 20            Time (s)   17 ± 0.42      374 ± 7.3      5.2 ± 0.03
Higgs             Error      0.33 ± 0.01    0.45 ± 0.01    0.38 ± 0.01
n = 11,000,000    NLL        0.55 ± 0.13    0.69 ± 0.00    0.66 ± 0.00
d = 28            Time (s)   23 ± 0.88      294 ± 54       8732 ± 867
IJCNN             Error      0.03 ± 0.01    0.06 ± 0.01    0.02 ± 0.01
n = 141,691       NLL        0.10 ± 0.03    0.15 ± 0.07    0.09 ± 0.04
d = 22            Time (s)   17 ± 0.44      1033 ± 45      756 ± 8.6
Mnist             Error      0.14 ± 0.01    0.44 ± 0.13    0.12 ± 0.01
n = 70,000        NLL        0.24 ± 0.10    0.66 ± 0.11    0.27 ± 0.01
d = 780           Time (s)   200 ± 5.5      991 ± 23       806 ± 5.2
Shuttle           Error      0.01 ± 0.01    0.01 ± 0.00    0.01 ± 0.01
n = 58,000        NLL        0.07 ± 0.01    0.07 ± 0.00    0.07 ± 0.01
d = 9             Time (s)   0.01 ± 0.00    7.5 ± 0.7      100 ± 0.63
SUSY              Error      0.21 ± 0.00    0.22 ± 0.00    0.22 ± 0.00
n = 5,000,000     NLL        0.31 ± 0.10    0.49 ± 0.01    0.50 ± 0.00
d = 18            Time (s)   14 ± 0.29      10,000         10,000
wXa               Error      0.03 ± 0.01    0.04 ± 0.01    0.03 ± 0.01
n = 34,780        NLL        0.27 ± 0.07    0.25 ± 0.07    0.19 ± 0.06
d = 300           Time (s)   66 ± 16        612 ± 11       1.4 ± 0.10
Table 1: Average test prediction error, negative test log-likelihood (NLL) and time in seconds along with one standard deviation. Best values are highlighted.
We evaluate the prediction performance and run time of
our method x-gpc and the competing methods svgpc and
epgpc. We experiment on a variety of different datasets
and report the resulting prediction error, negative test log-
likelihood and run time for each method in table 1.
The experiments are conducted as follows. For each dataset
we perform a 10-fold cross-validation and for datasets with
more than 1 million points, we limit the test set to 100,000
points. We report the average prediction error, the negative
test log-likelihood (14) and the run time along with one stan-
dard deviation. For all datasets, we use 100 inducing points
and a mini-batch size of 100 points.
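The two reported quality metrics can be computed from the predictive probabilities (14) as sketched below. The 0.5 decision threshold, the averaging of the log-likelihood over test points, the label convention {-1, +1} and the function name are assumptions of this example; the text does not spell them out.

```python
import numpy as np

def error_and_nll(p, y):
    """Prediction error and average negative test log-likelihood.

    p : predicted probabilities p(y* = 1 | y) for the test points
    y : true test labels in {-1, +1}
    """
    pred = np.where(p >= 0.5, 1, -1)
    error = np.mean(pred != y)
    p_true = np.where(y == 1, p, 1.0 - p)            # probability assigned to the true label
    nll = -np.mean(np.log(np.clip(p_true, 1e-12, 1.0)))
    return error, nll
```

The negative log-likelihood penalizes confident wrong predictions much more strongly than the error rate does, which is why the two metrics can rank methods differently.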
For x-gpc we find that the following simple convergence criterion on the global parameters leads to good results: a sliding window average being smaller than a threshold of $10^{-4}$. Unfortunately, the original implementations of svgpc and epgpc do not include a convergence criterion. We find that the trajectories of the global parameters of svgpc tend to be noisy, and using a convergence criterion on the global parameters often leads to poor results. To have a fair comparison, we therefore monitor the convergence of the prediction performance on a hold-out set and use a sliding window average of size 5 and threshold $10^{-3}$ as convergence criterion for all methods.
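A sliding-window criterion of this kind can be sketched as follows. The text does not specify which quantity is averaged; the sketch assumes it is the absolute change of the monitored value between consecutive evaluations, and the class name is hypothetical.

```python
from collections import deque

class SlidingWindowConvergence:
    """Declare convergence when the mean absolute change over the last
    `window` evaluations falls below `threshold` (assumed interpretation)."""

    def __init__(self, window=5, threshold=1e-3):
        self.changes = deque(maxlen=window)
        self.threshold = threshold
        self.previous = None

    def update(self, value):
        """Record a new monitored value; return True once converged."""
        if self.previous is not None:
            self.changes.append(abs(value - self.previous))
        self.previous = value
        window_full = len(self.changes) == self.changes.maxlen
        return window_full and sum(self.changes) / len(self.changes) < self.threshold
```

In a training loop, `update` would be called with the hold-out prediction performance after each evaluation, and training stops the first time it returns `True`.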
We observe that x-gpc is about one to two orders of mag-
nitude faster than svgpc and epgpc on most datasets. Only
on the dataset wXa, epgpc is slightly faster than x-gpc. The
prediction error is similar for all methods but x-gpc outper-
forms the competitors in terms of the test log-likelihood on
most datasets (aXa, Bank Marketing, Click Prediction, Cod
RNA, Diabetes, Electricity, German, Higgs, Mnist, SUSY).
This means that the confidence levels in the predictions are
better calibrated for x-gpc, i.e. when predicting a wrong
label svgpc and epgpc tend to be more confident than x-
gpc.
Performance as a function of time. Since all considered methods are based on an optimization scheme, there is a trade-off between the run time of the algorithm and the prediction performance. We make this trade-off transparent by plotting the prediction performance as a function of time on each dataset. For each method we monitor, over a 10-fold cross-validation, the average negative test log-likelihood and prediction error on a hold-out test set as a function of time.
The results are displayed in Fig. 2 for three selected datasets, while the results for the remaining datasets are deferred to appendix A.1. For all datasets we observe that after a few iterations x-gpc is already close to the optimum due to its efficient closed-form natural gradient updates. Both the prediction error and test log-likelihood converge around one to two orders of magnitude faster for x-gpc than for svgpc and epgpc. Moreover, the performance curves tend to be noisier for svgpc than for x-gpc and epgpc. For the datasets HIGGS and IJCNN, epgpc leads to slightly better final prediction performance, but at the cost of a run time up to four orders of magnitude longer than that of x-gpc (approx. 28 hours vs. 9 and 435 seconds, respectively).
Figure 2: Average negative test log-likelihood and average test prediction error as a function of training time (seconds on a log10 scale) on the datasets Electricity (45,312 points), Cod RNA (343,564 points) and SUSY (5 million points). x-gpc (proposed) reaches values close to the optimum after only a few iterations, whereas svgpc and epgpc are one to two orders of magnitude slower.
Figure 3: Prediction error as a function of training time (on a log10 scale) for the Shuttle dataset. Different numbers of inducing points are considered, M = 16, 32, 64, 128. x-gpc (proposed) converges the fastest in all settings of different numbers of inducing points. Using only 32 inducing points is enough for obtaining almost optimal prediction performance for all methods, but svgpc becomes unstable in settings of fewer than 128 inducing points.
All three methods are implemented in different programming frameworks: x-gpc in Julia, svgpc in TensorFlow and epgpc in R, leading to implementations of differing efficiency. However, we find that the main speed-up of our method is due to the efficient natural gradient updates and only marginally related to the usage of a different programming language. To check this we also implemented epgpc in Julia and obtained similar run times. Since svgpc is part of the highly optimized GPflow package we only used the original implementation.
5.3 Inducing points
We examine the effect of different numbers of inducing points on the prediction performance and run time. For all methods we compare different numbers of inducing points: M = 16, 32, 64, 128. For each setting, we perform a 10-fold cross-validation on the Shuttle dataset and plot the mean prediction error as a function of time. The results are displayed in Fig. 3. We observe that the higher the number of inducing points, the better the prediction performance, but the longer the run time. Throughout all settings of inducing points our method is consistently faster than the competitors by around one to two orders of magnitude. On the Shuttle dataset, using only M = 32 inducing points is enough, and for all methods the performance can only be marginally improved by using more inducing points. However, the performance curves of svgpc are unstable when using fewer than 128 inducing points.
6 Conclusions
We proposed an efficient Gaussian process classification method that builds on Pólya-Gamma data augmentation and inducing points. The experimental evaluation shows that our method is up to two orders of magnitude faster than the state-of-the-art approach while being competitive in terms of prediction performance. Speed improvements are due to the Pólya-Gamma data augmentation approach that enables efficient second-order optimization.
The presented work shows how data augmentation can speed up variational approximation of GPs. Our analysis may pave the way for using data augmentation to derive efficient stochastic variational algorithms also for variational Bayesian models other than GPs. Furthermore, future work may aim at extending the approach to multi-class and multi-label classification.
Acknowledgements. We thank Stephan Mandt, James Hensman and Scott W. Linderman for fruitful discussions. This work was partly funded by the German Research Foundation (DFG) awards KL 2698/2-1 and GRK1589/2 and by the Federal Ministry of Science and Education (BMBF) awards 031L0023A, 01IS18051A.
References
Amari, S., and Nagaoka, H. 2007. Methods of Information Geom-
etry. American Mathematical Society.
Amari, S. 1998. Natural gradient works efficiently in learning. Neural
Computation.
Arthur, D., and Vassilvitskii, S. 2007. k-means++: The advan-
tages of careful seeding. In Proceedings of the eighteenth annual
ACM-SIAM symposium on Discrete algorithms, 1027–1035. So-
ciety for Industrial and Applied Mathematics.
Dezfouli, A., and Bonilla, E. V. 2015. Scalable inference for gaus-
sian process models with black-box likelihoods. In NIPS, 1414–
1422.
Dragiev, S.; Toussaint, M.; and Gienger, M. 2011. Gaussian process
implicit surfaces for shape estimation and grasping. In Robotics
and Automation (ICRA), 2845–2850.
Gibbs, M. N., and MacKay, D. J. C. 2000. Variational Gaus-
sian process classifiers. IEEE Transactions on Neural Networks
11(6):1458–1464.
Hensman, J., and Matthews, A. 2015. Scalable Variational Gaus-
sian Process Classification. In AISTATS.
Hernández-Lobato, D., and Hernández-Lobato, J. M. 2016. Scal-
able gaussian process classification via expectation propagation.
In AISTATS.
Hoffman, M. D.; Blei, D. M.; Wang, C.; and Paisley, J. 2013.
Stochastic Variational Inference. Journal of Machine Learning
Research.
Honkela, A.; Raiko, T.; Kuusela, M.; Tornio, M.; and Karhunen,
J. 2010. Approximate riemannian conjugate gradient learning
for fixed-form variational bayes. Journal of Machine Learning
Research 11.
Izmailov, P.; Novikov, A.; and Kropotov, D. 2018. Scalable gaus-
sian processes with billions of inducing inputs via tensor train
decomposition. In AISTATS, 726–735.
Jähnichen, P.; Wenzel, F.; Kloft, M.; and Mandt, S. 2018. Scalable
generalized dynamic topic models. In AISTATS.
John Walker, S. 2014. Big data: A revolution that will transform
how we live, work, and think. Taylor & Francis.
Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999.
An Introduction to Variational Methods for Graphical Models.
Machine Learning.
Khan, M. E., and Nielsen, D. 2018. Fast yet simple natural-
gradient descent for variational inference in complex models.
Arxiv Preprint.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic
optimization. CoRR abs/1412.6980.
Linderman, S. W.; Johnson, M. J.; and Adams, R. P. 2015. De-
pendent multinomial models made easy: Stick-breaking with the
polya-gamma augmentation. In NIPS.
Mandt, S.; Wenzel, F.; Nakajima, S.; Cunningham, J. P.; Lippert,
C.; and Kloft, M. 2017. Sparse Probit Linear Mixed Model.
Machine Learning Journal.
Mandt, S.; Hoffman, M.; and Blei, D. 2016. A Variational Analysis
of Stochastic Gradient Algorithms. ICML.
Maritz, J., and Lwin, T. 1989. Empirical Bayes Methods with
Applications. Monographs on Statistics and Applied Probability.
Martens, J. 2017. New insights and perspectives on the natural
gradient method. Arxiv Preprint.
Matthews, A. G. d. G.; van der Wilk, M.; Nickson, T.; Fujii, K.;
Boukouvalas, A.; León-Villagrá, P.; Ghahramani, Z.; and Hens-
man, J. 2017. GPflow: A Gaussian process library using Ten-
sorFlow. Journal of Machine Learning Research.
Opper, M., and Archambeau, C. 2009. The variational gaussian
approximation revisited. Neural Comput. 21(3):786–792.
Polson, N. G., and Scott, S. L. 2011. Data augmentation for support
vector machines. Bayesian Anal.
Polson, N. G.; Scott, J. G.; and Windle, J. 2013. Bayesian inference
for logistic models using pólya–gamma latent variables. Journal
of the American Statistical Association 108(504):1339–1349.
Ranganath, R.; Wang, C.; Blei, D. M.; and Xing, E. P. 2013.
An Adaptive Learning Rate for Stochastic Variational Inference.
ICML.
Rasmussen, C. E., and Williams, C. K. I. 2005. Gaussian Pro-
cesses for Machine Learning (Adaptive Computation and Ma-
chine Learning). The MIT Press.
Salimbeni, H.; Eleftheriadis, S.; and Hensman, J. 2018. Natu-
ral gradients in practice: Non-conjugate variational inference in
gaussian process models. In AISTATS.
Scott, J. G., and Sun, L. 2013. Expectation-maximization for lo-
gistic regression. arXiv preprint arXiv:1306.0040.
Snelson, E., and Ghahramani, Z. 2006. Sparse GPs using Pseudo-
inputs. NIPS.
Stein, M. L. 2012. Interpolation of spatial data: some theory for
kriging. Springer Science & Business Media.
Wainwright, M. J., and Jordan, M. I. 2008. Graphical models,
exponential families, and variational inference. Found. Trends
Mach. Learn. 1–305.
Wenzel, F.; Galy-Fajou, T.; Deutsch, M.; and Kloft, M. 2017.
Bayesian nonlinear support vector machines for big data. In Pro-
ceedings of the European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases.
A Appendix
A.1 Additional performance plots
We show all time vs. prediction performance plots for the datasets presented in table 1 in section 5.2 which could not be included in the main paper due to space limitations.
Figure 4: Average negative test log-likelihood and average test prediction error as a function of training time measured in seconds (on a log10 scale).
Figure 5: Average negative test log-likelihood and average test prediction error as a function of training time measured in seconds (on a log10 scale). For the dataset Higgs, epgpc exceeded the time budget of $10^5$ seconds (≈ 28 h).
Figure 6: Average negative test log-likelihood and average test prediction error as a function of training time measured in seconds (on a log10 scale).
A.2 Variational bound
We provide details of the derivation of the variational bound (10), which is defined as
$$\mathcal{L}(c,\mu,\Sigma) = \mathbb{E}_{p(f|u)q(u)q(\omega)}[\log p(y|\omega,f)] - \mathrm{KL}(q(u,\omega)\,\|\,p(u,\omega)),$$
and the family of variational distributions
$$q(u,\omega) = q(u)\prod_i q(\omega_i) = \mathcal{N}(u|\mu,\Sigma)\prod_i \mathrm{PG}(\omega_i|1, c_i).$$
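Considering the likelihood term we obtain
$$\begin{aligned}
\mathbb{E}_{p(f|u)}[\log p(y|\omega, f)] &\overset{c}{=} \frac{1}{2}\mathbb{E}_{p(f|u)}\big[y^\top f - f^\top\Omega f\big] \\
&= \frac{1}{2}\Big(y^\top K_{nm}K_{mm}^{-1}u - \mathrm{tr}(\Omega\widetilde{K}) - u^\top K_{mm}^{-1}K_{mn}\Omega K_{nm}K_{mm}^{-1}u\Big).
\end{aligned}$$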
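Computing the expectations w.r.t. the variational distributions gives
$$\begin{aligned}
\mathbb{E}_{p(f|u)q(u)q(\omega)}[\log p(y|\omega,f)]
&\overset{c}{=} \frac{1}{2}\mathbb{E}_{q(u)q(\omega)}\Big[y^\top K_{nm}K_{mm}^{-1}u - \mathrm{tr}(\Omega\widetilde{K}) - u^\top K_{mm}^{-1}K_{mn}\Omega K_{nm}K_{mm}^{-1}u\Big] \\
&= \frac{1}{2}\mathbb{E}_{q(u)}\Big[y^\top K_{nm}K_{mm}^{-1}u - \mathrm{tr}(\Theta\widetilde{K}) - u^\top K_{mm}^{-1}K_{mn}\Theta K_{nm}K_{mm}^{-1}u\Big] \\
&= \frac{1}{2}\Big[y^\top K_{nm}K_{mm}^{-1}\mu - \mathrm{tr}(\Theta\widetilde{K}) - \mathrm{tr}(K_{mm}^{-1}K_{mn}\Theta K_{nm}K_{mm}^{-1}\Sigma) - \mu^\top K_{mm}^{-1}K_{mn}\Theta K_{nm}K_{mm}^{-1}\mu\Big] \\
&= \frac{1}{2}\sum_i\Big(y_i\kappa_i\mu - \theta_i\widetilde{K}_{ii} - \theta_i\kappa_i\Sigma\kappa_i^\top - \theta_i\mu^\top\kappa_i^\top\kappa_i\mu\Big),
\end{aligned}$$
where $\theta_i = \mathbb{E}_{q(\omega_i)}[\omega_i] = \frac{1}{2c_i}\tanh\frac{c_i}{2}$, $\Theta = \mathrm{diag}(\theta)$ and $\kappa_i = K_{im}K_{mm}^{-1}$.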
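The Kullback-Leibler divergence between the Gaussian distributions $q(u)$ and $p(u)$ is easily computed,
$$\mathrm{KL}(q(u)\,\|\,p(u)) \overset{c}{=} \frac{1}{2}\Big(\mathrm{tr}(K_{mm}^{-1}\Sigma) + \mu^\top K_{mm}^{-1}\mu - \log|\Sigma| + \log|K_{mm}|\Big).$$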
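The Kullback-Leibler divergence regarding the Pólya-Gamma variables can also be computed in closed form. Having $q(\omega_i) = \cosh\big(\frac{c_i}{2}\big)\exp\big(-\frac{c_i^2}{2}\omega_i\big)\mathrm{PG}(\omega_i|1,0)$ and $p(\omega_i) = \mathrm{PG}(\omega_i|1,0)$ we obtain
$$\begin{aligned}
\mathrm{KL}(q(\omega)\,\|\,p(\omega)) &= \mathbb{E}_{q(\omega)}[\log q(\omega) - \log p(\omega)] \\
&= \sum_i\Big(\mathbb{E}_{q(\omega_i)}\Big[\log\Big(\cosh\Big(\frac{c_i}{2}\Big)\exp\Big(-\frac{c_i^2}{2}\omega_i\Big)\mathrm{PG}(\omega_i|1,0)\Big)\Big] - \mathbb{E}_{q(\omega_i)}[\log\mathrm{PG}(\omega_i|1,0)]\Big) \\
&= \sum_i\Big(\log\cosh\frac{c_i}{2} - \frac{c_i}{4}\tanh\frac{c_i}{2} + \mathbb{E}_{q(\omega_i)}[\log\mathrm{PG}(\omega_i|1,0)] - \mathbb{E}_{q(\omega_i)}[\log\mathrm{PG}(\omega_i|1,0)]\Big) \\
&= \sum_i\Big(\log\cosh\frac{c_i}{2} - \frac{c_i}{4}\tanh\frac{c_i}{2}\Big).
\end{aligned}$$
Remarkably, the intractable expectations cancel out, which would not have been the case if we assumed $\mathrm{PG}(\omega_i|b_i, c_i)$ as variational family. In section 4.2 we have shown that the restricted family $b_i = 1$ contains the optimal distribution.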
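The closed-form Pólya-Gamma divergence derived above is cheap to evaluate. The following sketch (function name hypothetical) implements the final expression for a vector of local parameters `c`.

```python
import numpy as np

def kl_polya_gamma(c):
    """KL(q(omega) || p(omega)) for q = prod_i PG(1, c_i) and p = prod_i PG(1, 0)."""
    c = np.asarray(c, dtype=float)
    return np.sum(np.log(np.cosh(c / 2.0)) - (c / 4.0) * np.tanh(c / 2.0))
```

The expression vanishes at $c_i = 0$, where $q$ and $p$ coincide, and grows with $|c_i|$, as expected of a divergence.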
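Summing all terms results in the final lower bound
$$\begin{aligned}
\mathcal{L}(c,\mu,\Sigma) \overset{c}{=} \frac{1}{2}\Big(&\log|\Sigma| - \log|K_{mm}| - \mathrm{tr}(K_{mm}^{-1}\Sigma) - \mu^\top K_{mm}^{-1}\mu \\
&+ \sum_i\Big\{y_i\kappa_i\mu - \theta_i\widetilde{K}_{ii} - \theta_i\kappa_i\Sigma\kappa_i^\top - \theta_i\mu^\top\kappa_i^\top\kappa_i\mu + c_i^2\theta_i - 2\log\cosh\frac{c_i}{2}\Big\}\Big).
\end{aligned}$$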
A.3 Variational updates
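Local parameters. The derivative of the variational bound (10) w.r.t. the local parameter $c_i$ is
$$\begin{aligned}
\frac{d\mathcal{L}}{dc_i} &= \frac{1}{2}\frac{d}{dc_i}\Big\{\theta_i\big(-\widetilde{K}_{ii} - \kappa_i\Sigma\kappa_i^\top - \mu^\top\kappa_i^\top\kappa_i\mu + c_i^2\big) - 2\log\cosh\frac{c_i}{2}\Big\} \\
&= \frac{1}{2}\frac{d}{dc_i}\Big\{\frac{1}{2c_i}\tanh\Big(\frac{c_i}{2}\Big)\big(-\widetilde{K}_{ii} - \kappa_i\Sigma\kappa_i^\top - \mu^\top\kappa_i^\top\kappa_i\mu + c_i^2\big) - 2\log\cosh\frac{c_i}{2}\Big\} \\
&= \frac{d}{dc_i}\Big\{\frac{1}{4c_i}\tanh\Big(\frac{c_i}{2}\Big)\underbrace{\big(-\widetilde{K}_{ii} - \kappa_i\Sigma\kappa_i^\top - \mu^\top\kappa_i^\top\kappa_i\mu\big)}_{=:\,-A_i} + \frac{c_i}{4}\tanh\frac{c_i}{2} - \log\cosh\frac{c_i}{2}\Big\} \\
&= \Big(\frac{A_i}{4c_i^2} - \frac{1}{4}\Big)\tanh\frac{c_i}{2} - \frac{1}{2}\Big(\frac{A_i}{4c_i} - \frac{c_i}{4}\Big)\Big(1 - \tanh^2\frac{c_i}{2}\Big) \\
&= U(c_i)\Big(\tanh\frac{c_i}{2} - \frac{c_i}{2}\Big(1 - \tanh^2\frac{c_i}{2}\Big)\Big),
\end{aligned}$$
where $U(c_i) = \frac{A_i}{4c_i^2} - \frac{1}{4}$.
The gradient equals zero in two cases. First, in the case $U(c_i) = 0$, which leads to⁴
$$c_i = \sqrt{\widetilde{K}_{ii} + \kappa_i\Sigma\kappa_i^\top + \mu^\top\kappa_i^\top\kappa_i\mu},$$
which is always well defined since the expression under the square root is positive: $\widetilde{K}_{ii} \geq 0$, $\Sigma$ is positive definite and $\mu^\top\kappa_i^\top\kappa_i\mu = (\kappa_i\mu)^2 \geq 0$. The second case consists of the right-hand factor of the product being zero, which leads to $c_i = 0$. The second derivative reveals that the first case always corresponds to a maximum and the second case to a minimum.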
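The location of the maximum can be checked numerically. The sketch below evaluates the part of the bound that depends on a single $c_i$ on a grid and compares the grid maximizer with the closed-form solution $\sqrt{A_i}$, where $A_i = \widetilde{K}_{ii} + \kappa_i\Sigma\kappa_i^\top + \mu^\top\kappa_i^\top\kappa_i\mu$. The value `A = 2.5` and the grid are arbitrary choices for illustration.

```python
import numpy as np

def local_objective(c, A):
    """Terms of the bound that depend on one c_i, with A as defined in the lead-in."""
    theta = np.tanh(c / 2.0) / (2.0 * c)
    return 0.5 * (theta * (c ** 2 - A) - 2.0 * np.log(np.cosh(c / 2.0)))

A = 2.5
grid = np.linspace(0.05, 6.0, 20001)
c_best = grid[np.argmax(local_objective(grid, A))]
print(c_best, np.sqrt(A))   # the two values should agree up to the grid spacing
```

The objective increases for $c_i < \sqrt{A_i}$ and decreases afterwards, so the stationary point found above is the unique maximum on the positive axis.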
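Global parameters. We first compute the Euclidean gradients of the variational bound (10) w.r.t. the global parameters $\mu$ and $\Sigma$. We obtain
$$\begin{aligned}
\frac{d\mathcal{L}}{d\mu} &= \frac{1}{2}\frac{d}{d\mu}\Big(-\mu^\top K_{mm}^{-1}\mu + y^\top\kappa\mu - \mu^\top\kappa^\top\Theta\kappa\mu\Big) \\
&= \frac{1}{2}\Big(-2K_{mm}^{-1}\mu + \kappa^\top y - 2\kappa^\top\Theta\kappa\mu\Big) \\
&= -\big(K_{mm}^{-1} + \kappa^\top\Theta\kappa\big)\mu + \frac{1}{2}\kappa^\top y,
\end{aligned} \tag{15}$$
and
$$\begin{aligned}
\frac{d\mathcal{L}}{d\Sigma} &= \frac{1}{2}\frac{d}{d\Sigma}\Big(\log|\Sigma| - \mathrm{tr}(K_{mm}^{-1}\Sigma) - \mathrm{tr}(\kappa^\top\Theta\kappa\Sigma)\Big) \\
&= \frac{1}{2}\Big(\Sigma^{-1} - K_{mm}^{-1} - \kappa^\top\Theta\kappa\Big).
\end{aligned} \tag{16}$$
We now compute the natural gradients w.r.t. the natural parameterization of the variational Gaussian distribution, i.e. the parameters $\eta_1 := \Sigma^{-1}\mu$ and $\eta_2 := -\frac{1}{2}\Sigma^{-1}$. For a Gaussian distribution, properties of the Fisher information matrix expose the simplification that the natural gradient w.r.t. the natural parameters can be expressed in terms of the Euclidean gradient w.r.t. the mean and covariance parameters. It holds that
$$\widetilde{\nabla}_{(\eta_1,\eta_2)}\mathcal{L}(\eta) = \big(\nabla_\mu\mathcal{L}(\eta) - 2\nabla_\Sigma\mathcal{L}(\eta)\mu,\; \nabla_\Sigma\mathcal{L}(\eta)\big), \tag{17}$$
where $\widetilde{\nabla}$ denotes the natural gradient and $\nabla$ the Euclidean gradient. Substituting the Euclidean gradients (15) and (16) into equation (17) we obtain the natural gradients
$$\widetilde{\nabla}_{\eta_2}\mathcal{L} = \frac{1}{2}\Big(-2\eta_2 - K_{mm}^{-1} - \kappa^\top\Theta\kappa\Big) = -\eta_2 - \frac{1}{2}\Big(K_{mm}^{-1} + \kappa^\top\Theta\kappa\Big)$$
and
$$\begin{aligned}
\widetilde{\nabla}_{\eta_1}\mathcal{L} &= -\big(K_{mm}^{-1} + \kappa^\top\Theta\kappa\big)\Big(-\frac{1}{2}\eta_2^{-1}\eta_1\Big) + \frac{1}{2}\kappa^\top y - 2\Big(-\eta_2 - \frac{1}{2}\big(K_{mm}^{-1} + \kappa^\top\Theta\kappa\big)\Big)\Big(-\frac{1}{2}\eta_2^{-1}\eta_1\Big) \\
&= \frac{1}{2}\kappa^\top y - \eta_1.
\end{aligned}$$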
A.4 Natural gradient and coordinate ascent updates
If the full conditional distributions and the corresponding variational distribution belong to the same exponential family it is known in variational inference that "we can compute the natural gradient by computing the coordinate updates in parallel and subtracting the current setting of the parameter" (Hoffman et al., 2013). In our setting it is not clear if this relation holds since we do not consider the classic ELBO but a lower bound on it due to (8). Interestingly, the lower bound (8) does not break this property and our natural gradient updates correspond to coordinate ascent updates, as we show in the following. Setting the Euclidean gradient (16) to zero and using the natural parameterization gives
$$\eta_2 = -\frac{1}{2}\Sigma^{-1} = -\frac{1}{2}\Big(K_{mm}^{-1} + \kappa^\top\Theta\kappa\Big). \tag{18}$$
Setting (15) to zero yields
$$\mu = \frac{1}{2}\Big(K_{mm}^{-1} + \kappa^\top\Theta\kappa\Big)^{-1}\kappa^\top y.$$
Substituting the update (18) from above and using the natural parameterization results in
$$\eta_1 = \frac{1}{2}\kappa^\top y.$$
This shows that using learning rate one in our natural gradient ascent scheme corresponds to employing coordinate ascent updates in the Euclidean parameter space.
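This correspondence can be illustrated numerically: starting from an arbitrary point, a single full-batch natural gradient step with learning rate one should land on a point where both Euclidean gradients (15) and (16) vanish. The inputs below are random test values chosen only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
kappa = rng.normal(size=(n, m))
y = rng.choice([-1.0, 1.0], size=n)
theta = rng.uniform(0.05, 0.25, size=n)
L = rng.normal(size=(m, m))
Kmm_inv = np.linalg.inv(L @ L.T + np.eye(m))
B = Kmm_inv + kappa.T @ (theta[:, None] * kappa)

# Arbitrary starting point in natural parameters.
eta1 = rng.normal(size=m)
eta2 = -0.5 * np.eye(m)

# Natural gradient ascent step with learning rate one (full batch, s = n).
eta1_new = eta1 + (0.5 * kappa.T @ y - eta1)
eta2_new = eta2 + (-eta2 - 0.5 * B)

Sigma = np.linalg.inv(-2.0 * eta2_new)
mu = Sigma @ eta1_new

grad_mu = -B @ mu + 0.5 * kappa.T @ y              # (15) at the new point
grad_Sigma = 0.5 * (np.linalg.inv(Sigma) - B)      # (16) at the new point
print(np.abs(grad_mu).max(), np.abs(grad_Sigma).max())
```

Both printed values should be at the level of floating-point round-off, independent of the starting point.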
⁴ We omit the negative solution since $\mathrm{PG}(b, c) = \mathrm{PG}(b, -c)$.
A.5 Variational bound by Gibbs and MacKay
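When using the full GP representation in our model and not the sparse approximation, the bound in our model is equal to the bound used by Gibbs and MacKay (2000). We provide a proof in the following.
Applying our variational inference approach to the joint distribution (5) gives the variational bound
$$\begin{aligned}
\log p(y|f) &\geq \mathbb{E}_{q(\omega)}[\log p(y|f,\omega)] - \mathrm{KL}(q(\omega)\,\|\,p(\omega)) \\
&= \mathbb{E}_{q(\omega)}\Big[\frac{1}{2}y^\top f - \frac{1}{2}f^\top\Omega f\Big] - n\log 2 - \mathrm{KL}(q(\omega)\,\|\,p(\omega)) \\
&= \frac{1}{2}y^\top f - \frac{1}{2}f^\top\Theta f - n\log 2 + \sum_{i=1}^n\Big(\frac{c_i^2}{2}\theta_i - \log\cosh\frac{c_i}{2}\Big).
\end{aligned}$$
Gibbs and MacKay (2000) employ the following inequality on the logit link,
$$\sigma(z) \geq \sigma(c)\exp\Big(\frac{z - c}{2} - \frac{\sigma(c) - 1/2}{2c}(z^2 - c^2)\Big).$$
Using this bound in the setting of GP classification yields the following lower bound on the log-likelihood,
$$\begin{aligned}
\log p(y|f) &= \sum_{i=1}^n\log\sigma(y_if_i) \\
&\geq \sum_{i=1}^n\Big(\log\sigma(c_i) + \frac{y_if_i - c_i}{2} - \frac{\sigma(c_i) - 1/2}{2c_i}\big((y_if_i)^2 - c_i^2\big)\Big) \\
&= \sum_{i=1}^n\Big(-\log\cosh\frac{c_i}{2} - \log 2 + \frac{y_if_i}{2} - \frac{\sigma(c_i) - 1/2}{2c_i}\big(f_i^2 - c_i^2\big)\Big) \\
&= \sum_{i=1}^n\Big(-\log\cosh\frac{c_i}{2} - \log 2 + \frac{y_if_i}{2} - \frac{1}{4c_i}\tanh\Big(\frac{c_i}{2}\Big)\big(f_i^2 - c_i^2\big)\Big) \\
&= \sum_{i=1}^n\Big(-\log\cosh\frac{c_i}{2} - \log 2 + \frac{y_if_i}{2} - \frac{1}{2}\theta_i\big(f_i^2 - c_i^2\big)\Big) \\
&= \frac{1}{2}y^\top f - \frac{1}{2}f^\top\Theta f - n\log 2 + \sum_{i=1}^n\Big(\frac{c_i^2}{2}\theta_i - \log\cosh\frac{c_i}{2}\Big),
\end{aligned}$$
where we made use of the fact that $\sigma(x) - 1/2 = \tanh(x/2)/2$. This concludes the proof.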
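The inequality on the logit link used in this proof can be checked numerically. The sketch below compares $\log\sigma(z)$ with the logarithm of the lower bound on a grid of $z$ values for one arbitrary choice of the variational parameter, `c = 1.7`; the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function in a numerically stable tanh form."""
    return 0.5 * (1.0 + np.tanh(z / 2.0))

def log_lower_bound(z, c):
    """Logarithm of sigma(c) exp((z - c)/2 - (sigma(c) - 1/2)/(2c) (z^2 - c^2))."""
    lam = (sigmoid(c) - 0.5) / (2.0 * c)
    return np.log(sigmoid(c)) + (z - c) / 2.0 - lam * (z ** 2 - c ** 2)

c = 1.7
z = np.linspace(-6.0, 6.0, 241)
gap = np.log(sigmoid(z)) - log_lower_bound(z, c)
print(gap.min())   # non-negative up to round-off; the bound is tight at z = c and z = -c
```

Because the bound depends on $z$ only through a linear and a quadratic term, it is conjugate to a Gaussian prior on $f$, which is the property shared with the Pólya-Gamma augmentation.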
4 Multi-Class Gaussian Process Classification Made Conjugate: Efficient Inference via Data Augmentation
After the binary classification problem, a natural extension is the multi-class classification setting. Drawing inspiration from Donner and Opper [12], we use new augmentation methods to circumvent the problem of a much more complex likelihood function involving multiple latent GPs. More specifically, we introduce a new link, the logistic-softmax function. We turn the model into a fully conditionally conjugate model with three successive augmentations. A thorough analysis is made to compare this new model with other links and approaches, including standard choices like the softmax link.
Note that an extensive discussion about this model is given in Chapter 7 with potential model extensions and solutions to some problems faced in the paper.
Authors:
Théo Galy-Fajou¹,*, Florian Wenzel¹,*, Christian Donner¹, Manfred Opper¹
* Equal contribution, ¹ TU Berlin, Germany, ² TU Kaiserslautern, Germany
Details:
Type: Conference article Submitted: January 2019
Accepted: May 2019
URL: http://auai.org/uai2019/proceedings/papers/264.pdf
Conference: UAI 2019
License: Creative Commons Attribution (CC BY 4.0)
Contributions:
For an explanation of the terms see the Contributor Roles Taxonomy (CRediT)
T.G-F. F.W. C.D. M.O.
Conceptualization ✓ ✓ ✓
Methodology ✓
Formal Analysis ✓ ✓ ✓ ✓
Implementation ✓
Investigation ✓
Writing - Original Draft ✓ ✓ ✓
Writing - Review & Editing ✓ ✓ ✓ ✓
Supervision ✓
Funding Acquisition ✓
5 Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models
The larger question following the work on Pólya-Gamma variables and other augmentation works such as Nguyen and Wu [41] or Henao et al. [20] is: What likelihoods have a scale mixture representation? This article, extending the work of Palmer [43], partially answers this question by finding a class of functions, the positive-definite radial functions, guaranteed to be interpretable as scale mixtures of Gaussians. The paper also provides an algorithm to directly infer the CAVI updates and Gibbs sampling algorithm from the likelihood.
Authors:
Théo Galy-Fajou¹, Florian Wenzel², Manfred Opper¹
¹ TU Berlin, Germany, ² Google Research
Details:
Type: Conference article Submitted: October 2019
Accepted: December 2019
URL: https://proceedings.mlr.press/v108/galy-fajou20a.html
Conference: AISTATS 2020
License: Creative Commons Attribution (CC BY 4.0)
Contributions:
For an explanation of the terms see the Contributor Roles Taxonomy (CRediT)
T.G-F. F.W. M.O.
Conceptualization ✓ ✓ ✓
Methodology ✓
Formal Analysis ✓ ✓ ✓
Software ✓
Investigation ✓
Writing - Original Draft ✓ ✓
Writing - Review & Editing ✓ ✓ ✓
Supervision ✓
Funding Acquisition ✓
Automated Augmented Conjugate Inference
for Non-conjugate Gaussian Process Models
Théo Galy-Fajou (Technical University of Berlin), Florian Wenzel (Google Research∗), Manfred Opper (Technical University of Berlin)
Abstract
We propose automated augmented conjugate
inference, a new inference method for non-
conjugate Gaussian processes (GP) models. Our
method automatically constructs an auxiliary
variable augmentation that renders the GP model
conditionally conjugate. Building on the conju-
gate structure of the augmented model, we de-
velop two inference methods. First, a fast and
scalable stochastic variational inference method
that uses efficient block coordinate ascent up-
dates, which are computed in closed form. Sec-
ond, an asymptotically correct Gibbs sampler that
is useful for small datasets. Our experiments show that our methods are up to two orders of magnitude faster and more robust than existing state-of-the-art black-box methods.
1 INTRODUCTION
Developing automated yet efficient Bayesian inference
methods for Gaussian process (GP) models is a challeng-
ing problem that has attracted considerable attention within
the probabilistic machine learning community (Salimbeni et al., 2018; Wenzel et al., 2019). A GP defines a distribution over functions and can be used as a flexible building block to develop expressive probabilistic models. By
choosing an appropriate likelihood function on top of a la-
tent GP, a variety of interesting models is obtained, which
are successfully used in several application areas includ-
ing robotics (Beckers et al.,2019), facial behavior analy-
sis (Eleftheriadis et al.,2017) and electrical engineering
(Pandit and Infield, 2018). For instance, a logistic likelihood leads to a binary GP classification model, and a Student-t likelihood leads to a robust regression model.
Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).
The main challenge in these models is to infer the latent
GP given a general non-Gaussian likelihood. Methods that
are more generally applicable often treat the model as a
black box and are based on sampling or numerical quadrature, which prevents efficient optimization (Hensman et al., 2015; Salimbeni et al., 2018). On the other hand, many methods focus on special cases of GP models (i.e. special likelihood functions) by exploiting model-specific properties, e.g. binary classification (Polson et al., 2013).
In this work, we develop automated augmented conjugate
inference (aaci). aaci is an efficient inference framework, which is applicable to a large class of GP models that use a super-Gaussian likelihood¹. It automatically exploits specific properties of the likelihood, leading to an inference algorithm that is up to two orders of magnitude faster than the state of the art.
Our approach builds on an auxiliary variable augmentation
of the model: we add a latent variable to the model such that
the original model is recovered when this variable is inte-
grated out. We consider an augmentation that renders the
model conditionally conjugate. In a conditionally conjugate
model, all complete conditional distributions (the posterior
distribution of one random variable given all the others),
can be computed in closed form. Moreover, we show that
inference in the augmented conditionally conjugate model
is much easier than in the original model and demonstrate
superior performance over the state of the art.
Building on the conditionally conjugate augmentation,
aaci provides two options for inference: a scalable vari-
ational inference method based on efficient closed-form
coordinate ascent updates and an exact Gibbs sampling
method, which is useful on smaller datasets.
Our main contributions are as follows:
•We introduce aaci: an automated inference method
for GP models with a super-Gaussian likelihood.
•We propose two inference modules: augmented variational inference, which scales to large datasets containing millions of instances, and an exact Gibbs sampler, which is useful for small datasets.
•The experiments demonstrate that the augmented variational inference module of aaci outperforms the state of the art in terms of speed by up to two orders of magnitude while being competitive in terms of prediction performance. The Gibbs sampler module leads to a much better effective sample size while still being up to ten times faster than Hamiltonian Monte Carlo.
∗Work done while at TU Berlin
¹The definition of the family of super-Gaussian likelihoods is given in Section 3.
[Figure 1 diagram. Input: GP model (X, f, y). Step 1: Construct conjugate augmentation (auxiliary variable ω is added). Step 2: Compute complete conditionals p(ω|f,y) and p(f|ω,y). Step 3: Perform inference (Variational Inference or Gibbs Sampling).]
Figure 1. Automated augmented conjugate inference (aaci) performs automated efficient inference in non-conjugate Gaussian process models. In the first step, aaci translates the GP model into an augmented model that is conditionally conjugate. In the second step, the complete conditionals are computed in closed form. In the final step, aaci provides two options: (A) fast stochastic variational inference based on coordinate ascent updates, which easily scales to big datasets, and (B) an asymptotically exact Gibbs sampler, which provides high-quality samples from the true posterior but is limited to smaller datasets.
The paper is structured as follows: Section 2 gives a high-level overview of our novel inference method aaci. In Section 3, we provide a detailed discussion of the algorithm and prove that our approach indeed leads to conditionally conjugate models. We discuss related work in Section 4 and show our experimental results in Section 5. Finally, Section 6 concludes and lays out future research directions. Our source code for the experiments is included in a GitHub repository².
2 AUTOMATED AUGMENTED
CONJUGATE INFERENCE
Let $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times d}$ be a matrix of data points and $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ the corresponding target values. The goal is to learn a mapping from the input points to the target values via a latent function $f$. We assume a GP prior (with prior mean $\mu_0$ and covariance function $k(x, x')$) on the latent function, and the data labels $y = (y_1, \ldots, y_n)$ are connected to $f$ via a factorizable likelihood
$$p(f) = \mathcal{GP}(f \mid \mu_0, k), \qquad p(y \mid f, X) = \prod_{i=1}^{n} p(y_i \mid f(x_i)).$$
2https://github.com/theogf/AutoConjGP_Exp
The key inference challenge in GP models is to compute the posterior distribution of the latent function
$$p(f \mid y) = \frac{p(y \mid f)\, p(f)}{\int p(y \mid f)\, p(f)\, \mathrm{d}f}.$$
This is a challenging problem: inference in GP models scales cubically in the number of data points and is intractable for non-Gaussian likelihoods.
Ideally, we would like an efficient inference method that is
not hand-tailored to a specific type of likelihood and hence
allows for experimenting with different types of GP mod-
els on big datasets in a scalable manner. Thus, we need
a flexible inference method that works for a large class of likelihoods, is fast, and ideally does not involve inefficient black-box approaches such as approximating the objective by sampling.
2.1 Automated Augmented Conjugate Inference
We introduce automated augmented conjugate inference (aaci) to achieve this goal. aaci accelerates training of GP models whose likelihood is in the family of super-Gaussian likelihood functions.
aaci translates the intractable non-conjugate model into
an easier, conditionally conjugate model by adding auxil-
iary random variables to the model. Inference in condition-
ally conjugate models is a classic and well-studied problem
(Bishop,2006). Because of the special structure of condi-
tionally conjugate models, many efficient inference meth-
ods exist (Wang and Blei,2013). Based on the automat-
ically constructed augmentation, we propose an efficient
variational inference method using coordinate ascent up-
dates and a Gibbs sampler.
The inference pipeline of aaci. aaci consists of three steps. In the first step, a conjugate augmentation of the model is constructed by adding auxiliary variables $\omega$ to the model. Then, the complete conditional distributions of the latent function $f$ and auxiliary variables $\omega$ are computed. In the final step, we provide two options to perform inference.
The variational inference (VI) module of aaci performs
block coordinate ascent updates, computed in closed form.
The updates are much more efficient than ordinary Eu-
clidean gradient updates, which are used in most previous
approaches. The Gibbs sampling module of aaci builds on
the complete conditional distributions and provides exact
samples from the true posterior. For each type of likelihood,
the sampler is automatically constructed.
The inference pipeline of aaci is summarized in Fig. 1. In
the following, we give an overview of how each module of
our inference pipeline works and provide the details in Sec-
tion 3.
(1) Augmenting the model. The first step of our inference
framework constructs an auxiliary variable augmentation
that renders the model conditionally conjugate. Our aug-
mentation approach finds a Gaussian scale mixture repre-
sentation of the intractable likelihood
$$p(y_i \mid f_i) = \int p(y_i \mid f_i, \omega_i)\, p(\omega_i)\, \mathrm{d}\omega_i, \tag{1}$$
where $p(y_i \mid f_i, \omega_i)$ is an unnormalized Gaussian distribution in $f_i$ with precision $\omega_i$ and $p(\omega_i)$ is the prior distribution of the auxiliary variable. The construction of the distribution $p(\omega)$ is based on an inverse Laplace transformation and is discussed in Section 3.1.
Building on Eq. 1, we augment the GP model by a set of auxiliary variables $\omega = (\omega_1, \ldots, \omega_n)$, leading to the augmented joint distribution
$$p(y, f, \omega) = \left[\prod_{i} p(y_i \mid f_i, \omega_i)\, p(\omega_i)\right] p(f). \tag{2}$$
The auxiliary variable augmentation is constructed such that the augmented model is conditionally conjugate, i.e. the complete conditional distributions $p(\omega \mid f, y)$ and $p(f \mid \omega, y)$ are in the same family as their associated priors.
(2) Computing the complete conditionals. The complete conditionals of $f$ and the auxiliary variables $\omega_i$ are computed in closed form and are given by
$$p(f \mid y, \omega) = \mathcal{N}(f \mid \mu, \Sigma), \qquad p(\omega_i \mid f_i, y_i) = \pi_\phi(\omega_i \mid c_i),$$
where $\phi$ is a function determined by the type of the likelihood (see Eq. 4) and the parameters $\mu$, $\Sigma$, $c_i$ have closed-form expressions, which are described in Section 3.2. The distribution family $\pi_\phi(\omega \mid c)$ is derived by an exponential tilting of the prior distribution $p(\omega)$ and is discussed in Section 3.2.
(3a) Augmented variational inference. In step 3, aaci
provides two options to perform inference. We first discuss
the variational inference module, which approximates the
posterior by a variational distribution and easily scales to
big datasets.
We assume a mean-field variational distribution, where the latent GP $f$ and the auxiliary variables $\omega$ are decoupled, i.e. $q(f, \omega) = q(f)\, q(\omega)$. The optimal variational distribution of $\omega$ naturally factorizes, i.e. $q(\omega) = \prod_i q(\omega_i)$. Following standard results (Bishop, 2006), the variational distributions can be iteratively optimized by the block coordinate ascent updates
$$q(f) \propto \exp\left(\mathbb{E}_{q(\omega)}[\log p(f \mid \omega, y)]\right), \qquad q(\omega_i) \propto \exp\left(\mathbb{E}_{q(f)}[\log p(\omega_i \mid f, y)]\right). \tag{3}$$
In Section 3.3, we show that these updates are given in closed form and can be computed efficiently without resorting to numerical methods. To scale to big datasets, we employ SVI (Hoffman et al., 2013) and replace the original latent GP $f$ by the sparse approximation of Titsias (2009), which builds on inducing points.
(3b) Exact inference via Gibbs sampling. Building on the conditionally conjugate augmentation, it is straightforward to derive a Gibbs sampler. In order to sample from the exact posterior, we alternate between drawing a sample from each complete conditional distribution
$$\omega^{t} \sim p(\omega \mid f^{t-1}, y), \qquad f^{t} \sim p(f \mid \omega^{t}, y).$$
The augmented variables are naturally marginalized out and the latent GP samples $\{f^{t}\}$ will be from the true posterior $p(f \mid y)$. As we empirically show in Section 5.1, the Gibbs sampler leads to very fast mixing and outperforms standard Hamiltonian Monte Carlo sampling.
3 ALGORITHM DETAILS
Here we provide the details of the automated augmented conjugate inference (aaci) algorithm. We start by specifying the class of GP models that we consider in our framework. We then discuss the technical details of aaci and prove that the automatically constructed augmentation indeed leads to a conditionally conjugate model.
GP Models with a super-Gaussian likelihood. aaci can be applied to GP models where the likelihood is within the class of super-Gaussian likelihoods. A super-Gaussian likelihood is of the form
$$p(y \mid f; \theta) = C(\theta)\, e^{g(y;\theta)^\top f}\, \phi\left(\|h(f, y)\|_2^2\right), \tag{4}$$
where $\theta$ are hyperparameters of the likelihood, $C(\theta)$ is the normalizing constant, $g(y; \theta)$ is an arbitrary function, $\phi$ is a positive definite radial (pdr) function³, and $h$ is a linear function in $f$, such that we can write
$$\|h(f, y)\|_2^2 = \alpha(y, \theta) - \beta(y, \theta)^\top f + \gamma(y, \theta)\, \|f\|_2^2, \tag{5}$$
where $\alpha$, $\beta$, $\gamma$ are arbitrary functions. We omit $\theta$ in the later derivations for clarity.
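As an illustration of Eq. 5, the following worked expansion uses the Student-t row of Table 1 for a single observation. The symbols $\sigma$ and $\nu$ are the likelihood hyperparameters from that row; the identification of $\alpha$, $\beta$, $\gamma$ is our own reading of Eq. 5 and is not spelled out in the text.

```latex
% Student-t likelihood: g = 0, h(f, y) = (f - y) / \sigma
\|h(f, y)\|_2^2
  = \frac{(f - y)^2}{\sigma^2}
  = \underbrace{\frac{y^2}{\sigma^2}}_{\alpha(y)}
  - \underbrace{\frac{2y}{\sigma^2}}_{\beta(y)} f
  + \underbrace{\frac{1}{\sigma^2}}_{\gamma(y)} f^2 .
```

Note the sign convention of Eq. 5: the linear term enters with a minus sign, so $\beta(y) = 2y/\sigma^2$ is positive for positive $y$.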
Many interesting models are instances of super-Gaussian
likelihood GP models. In Table 1, we present several likeli-
hood functions with their corresponding parameter settings
of the super-Gaussian likelihood as given in Eq. 4.
Constructing new likelihoods. Using Eq. 4, we can also
construct novel likelihood functions based on existing ker-
nel functions. In this paper we propose the Matern 3/2 like-
lihood.
3.1 Step 1: Conjugate augmentation
Given the likelihood of the model, aaci constructs a conditionally conjugate auxiliary variable augmentation as follows. We first define a family of distributions $\pi_\phi(\omega \mid c)$, which will be useful for constructing the augmentation.
For the case $c = 0$, the distribution $\pi_\phi(\omega \mid 0)$ is defined by the inverse Laplace transform of $\phi(\cdot)$,
$$\pi_\phi(\omega \mid 0) = \mathcal{L}^{-1}\{\phi(\cdot)\}(\omega). \tag{6}$$
The inverse Laplace transform is the inverse mapping of the Laplace transformation and can be computed by the Bromwich integral formula⁴ (Debnath and Bhatta, 2014); it defines a valid density in our setting (see proof of Theorem 1). Remarkably, we will see that for the final updates of our algorithm, we do not need to compute the inverse Laplace transformation explicitly.
We generalize the base distribution $\pi_\phi(\omega \mid 0)$ by applying an exponential tilting:
$$\pi_\phi(\omega \mid c) = \frac{e^{-c^2 \omega}\, \pi_\phi(\omega \mid 0)}{\phi(c^2)}, \tag{7}$$
where $c \in \mathbb{R}$.
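To make Eqs. 6 and 7 concrete, here is a worked example for the Student-t pdr function of Table 1. It relies on the standard fact that the Laplace transform of a Gamma density with shape $a$ and rate $b$ is $(1 + s/b)^{-a}$; the shorthand $a = (\nu + 1)/2$ and the notation $\mathrm{Ga}(\omega \mid a, b)$ (shape, rate) are introduced here for illustration only.

```latex
% Base distribution (Eq. 6): phi(r) = (1 + r/\nu)^{-a}, with a = (\nu + 1)/2
\pi_\phi(\omega \mid 0)
  = \mathcal{L}^{-1}\{(1 + r/\nu)^{-a}\}(\omega)
  = \mathrm{Ga}(\omega \mid a, \nu)
  = \frac{\nu^{a}}{\Gamma(a)}\, \omega^{a-1} e^{-\nu \omega}.

% Exponential tilting (Eq. 7)
\pi_\phi(\omega \mid c)
  = \frac{e^{-c^2 \omega}\, \mathrm{Ga}(\omega \mid a, \nu)}{(1 + c^2/\nu)^{-a}}
  = \frac{(\nu + c^2)^{a}}{\Gamma(a)}\, \omega^{a-1} e^{-(\nu + c^2)\omega}
  = \mathrm{Ga}(\omega \mid a, \nu + c^2).
```

In this case the tilted family stays within the Gamma family and only the rate changes, which is what makes both the moments and the sampling step available in closed form for this likelihood.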
Theorem 1. A GP model with a super-Gaussian likelihood (of the form of Eq. 4) is rendered conditionally conjugate by the auxiliary variable augmentation $p(y, f, \omega; \theta) = p(y \mid f, \omega; \theta)\, p(f)\, p(\omega)$. The augmented likelihood is
$$p(y \mid f, \omega; \theta) = C(\theta) \exp\left(g(y; \theta)^\top f - \|h(f, y)\|_2^2\, \omega\right)$$
and the prior distribution of the auxiliary variables is $p(\omega) = \pi_\phi(\omega \mid 0)$.
³ $\phi$ is a positive definite radial function if $\phi(r)$ is completely monotone for all $r \geq 0$ and $\lim_{r \to 0} \phi(r) = 1$.
⁴ The inverse Laplace transformation of a function $\phi(\cdot)$ can be computed by $\mathcal{L}^{-1}\{\phi(\cdot)\}(\omega) = \lim_{T \to \infty} \frac{1}{2\pi i} \int_{b - iT}^{b + iT} e^{r\omega} \phi(r)\, \mathrm{d}r$, where $b$ can be arbitrarily chosen but has to be larger than the real part of all singularities of $\phi$.
Proof: We first apply Schoenberg's theorem (Schoenberg, 1938), which states that a function $\mathbb{R}^d \ni x \mapsto \phi(\|x\|_2^2)$ is a pdr function for any dimension $d > 0$ if and only if $\phi(r)$ is a completely monotone function on the domain $r \geq 0$.
A completely monotone function $\phi(\cdot)$ has the property that it is infinitely differentiable and its derivatives have an alternating sign (Bernstein et al., 1929), i.e.
$$(-1)^k \phi^{(k)}(r) > 0, \quad r \in [0, +\infty), \quad k = 0, 1, 2, \ldots \tag{8}$$
As a direct consequence, $\phi(\cdot)$ is a positive, decreasing, and convex function and the first derivative of $\phi(\cdot)$ is a concave function.
Building on these properties, Widder (1946) states that we can rewrite $\phi(\|h(f, y)\|_2^2)$ as a Gaussian scale mixture
$$\phi\left(\|h(f, y)\|_2^2\right) = \int_0^\infty e^{-\|h(f, y)\|_2^2\, \omega}\, \mathrm{d}\mu(\omega), \tag{9}$$
with respect to a Borel measure $\mu(\omega)$. We apply the monotone convergence theorem (Yeh, 2006), which gives that $\mu(\omega)$ is even a probability measure iff $\lim_{r \to 0} \phi(r) = 1$. Since we have a probability measure, we write $\mathrm{d}\mu(\omega) = p(\omega)\, \mathrm{d}\omega$, which leads to the equality $\phi(r) = \mathcal{L}\{p(\omega)\}(r)$, where $\mathcal{L}$ denotes the Laplace transformation. The inverse Laplace transformation gives the density of the auxiliary variable, $p(\omega) = \mathcal{L}^{-1}\{\phi(r)\}(\omega) = \pi_\phi(\omega \mid 0)$.
Therefore we can rewrite the super-Gaussian likelihood of Eq. 4 as
$$p(y \mid f) = C(\theta) \int_0^\infty e^{g(y)^\top f - \|h(f, y)\|_2^2\, \omega}\, p(\omega)\, \mathrm{d}\omega. \tag{10}$$
Adding the auxiliary variable $\omega$ with prior $p(\omega)$ to the model, we obtain the augmented likelihood
$$p(y \mid f, \omega; \theta) = C(\theta) \exp\left(g(y; \theta)^\top f - \|h(f, y)\|_2^2\, \omega\right).$$
Since the function $g(y; \theta)^\top f - \|h(f, y)\|_2^2\, \omega$ is by definition quadratic in $f$, the augmented likelihood is proportional to an (unnormalized) Gaussian distribution in $f$ and is hence conditionally conjugate in $f$.
For the augmented variable $\omega_i$, the likelihood $p(y \mid \omega, f)$ acts as an exponential tilting of $p(\omega)$, so the full conditional in $\omega$ stays in the same family of distributions. QED.
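The scale-mixture identity of Eq. 9 can be checked numerically for a case where the mixing density is known. The sketch below assumes the Student-t pdr function from Table 1 with $\nu = 3$, for which the mixing density is a Gamma density with shape $(\nu+1)/2$ and rate $\nu$ (a standard Laplace-transform pair, not stated in the text). The function names and the Simpson quadrature are illustrative choices, not part of the paper.

```python
import math

def gamma_pdf(w, shape, rate):
    """Density of a Gamma(shape, rate) distribution; assumes shape > 1 so pdf(0) = 0."""
    if w <= 0.0:
        return 0.0
    return math.exp(shape * math.log(rate) + (shape - 1.0) * math.log(w)
                    - rate * w - math.lgamma(shape))

def laplace_transform(density, r, upper=40.0, n=20000):
    """Composite Simpson approximation of int_0^upper exp(-r w) density(w) dw (n even)."""
    h = upper / n
    total = density(0.0) + math.exp(-r * upper) * density(upper)
    for k in range(1, n):
        w = k * h
        weight = 4.0 if k % 2 else 2.0
        total += weight * math.exp(-r * w) * density(w)
    return total * h / 3.0

nu = 3.0
shape = (nu + 1.0) / 2.0

def phi(r):
    """Student-t pdr function phi(r) = (1 + r / nu)^(-(nu + 1) / 2)."""
    return (1.0 + r / nu) ** (-shape)

def mixture(r):
    """Right-hand side of Eq. 9 with the assumed Gamma mixing density."""
    return laplace_transform(lambda w: gamma_pdf(w, shape, nu), r)

for r in (0.0, 0.5, 2.0, 10.0):
    print(r, phi(r), mixture(r))
```

The two printed columns should agree up to quadrature error, illustrating that the likelihood factor $\phi(\|h\|_2^2)$ is recovered when $\omega$ is integrated out.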
3.2 Step 2: Complete Conditionals
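Since the augmented model (Section 3.1) is conditionally conjugate, the complete conditional distributions are in the same family as their associated prior distributions and are given in closed form.

| Likelihood | Full form | $g(f, y)$ | $h(f, y)$ | $\phi(r)$ |
|---|---|---|---|---|
| Student-t | $\frac{\Gamma(\frac{\nu+1}{2})}{\sqrt{\nu\pi}\,\sigma\,\Gamma(\frac{\nu}{2})} \left(1 + \frac{(y-f)^2}{\nu\sigma^2}\right)^{-\frac{\nu+1}{2}}$ | $0$ | $\frac{f-y}{\sigma}$ | $\left(1 + \frac{r}{\nu}\right)^{-\frac{\nu+1}{2}}$ |
| Laplace | $\frac{1}{2\beta} \exp\left(-\frac{\lvert y-f \rvert}{\beta}\right)$ | $0$ | $f - y$ | $\exp\left(-\frac{\sqrt{r}}{\beta}\right)$ |
| Logistic | $\frac{1}{2} \exp\left(\frac{yf}{2}\right) \cosh^{-1}\left(\frac{\lvert yf \rvert}{2}\right)$ | $\frac{yf}{2}$ | $\frac{f}{2}$ | $\cosh^{-1}(\sqrt{r})$ |
| Bayesian SVM | $\exp\left((yf - 1) - \lvert 1 - yf \rvert\right)$ | $yf$ | $1 - yf$ | $\exp(-\sqrt{r})$ |
| Matern 3/2 | $\frac{\sqrt{3}}{4\rho} \left(1 + \frac{\sqrt{3}\lvert y-f \rvert}{\rho}\right) \exp\left(-\frac{\sqrt{3}\lvert y-f \rvert}{\rho}\right)$ | $0$ | $f - y$ | $\left(1 + \frac{\sqrt{3r}}{\rho}\right) \exp\left(-\frac{\sqrt{3r}}{\rho}\right)$ |

Table 1. Many interesting GP models are members of the super-Gaussian likelihood family introduced in Section 3. We display the full likelihood and the corresponding terms of the super-Gaussian likelihood as described in Eq. 4; here $\cosh^{-1}$ denotes the reciprocal $1/\cosh$. Some models were already considered independently, but our approach provides a unified view.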
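Theorem 2. The complete conditional distributions of the augmented model presented in Section 3.1 are given by
$$p(\omega_i \mid f_i, y_i) = \pi_\phi\left(\omega_i \mid \|h(f_i, y_i)\|_2\right), \qquad p(f \mid y, \omega) = \mathcal{N}(f \mid \mu, \Sigma), \tag{11}$$
where $\Sigma = \left(\mathrm{diag}(2\omega \circ \gamma(y)) + K^{-1}\right)^{-1}$ and $\mu = \Sigma\left(g(y) + \omega \circ \beta(y) + K^{-1}\mu_0\right)$, $\circ$ denotes the Hadamard product, and the function $h(\cdot)$ is given by the form of the likelihood (see Eq. 5).
The proof is given in Appendix A.1.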
3.3 Step 3: Efficient inference
In the final step of our inference pipeline, we leverage the conditionally conjugate structure of the augmented model and derive two inference methods. First, we propose a scalable stochastic variational inference (SVI) method that builds on efficient block coordinate ascent (CAVI) updates, computed in closed form. Second, we develop a Gibbs sampling scheme that generates samples from the exact posterior.
3.3.1 Augmented variational inference
We implement the classic stochastic variational inference
(SVI) algorithm for conditionally conjugate models de-
scribed by Hoffman et al. (2013), which builds on block
coordinate ascent updates. The updates can be interpreted
as natural gradient updates and are much more efficient than
ordinary Euclidean gradient updates (Amari,1998).
Variational approximation. We approximate the posterior distribution of the latent GP values by assuming a decoupling between $f$ and $\omega$. The family of the optimal variational distribution can be easily determined by averaging the complete conditionals in log-space, as given in Eq. 3 (see e.g. Blei et al., 2017). From the above decoupling assumption, it follows that the optimal variational posterior is in the variational family
$$q(f, \omega) = q(f) \prod_{i=1}^{N} q(\omega_i), \tag{12}$$
where $q(f) = \mathcal{N}(f \mid m, S)$ and $q(\omega_i) = \pi_\phi(\omega_i \mid c_i)$, and $m$, $S$ and $c$ are the variational parameters.
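Variational updates. We start by deriving the variational updates for the variational Gaussian distribution,
$$q(f) \propto \exp\left(\mathbb{E}_{q(\omega)}[\log p(f \mid \omega, y)]\right) \propto \exp\left[\sum_i g(y_i) f_i - \|h(f_i, y_i)\|_2^2\, \mathbb{E}_{q(\omega_i)}[\omega_i]\right] p(f).$$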
Computing the variational updates of $q(f)$ boils down to computing the first moment of $\omega$. Remarkably, the moments of $\pi_\phi$ can be computed without computing the closed-form density of $\pi_\phi$ explicitly, i.e. without evaluating the inverse Laplace transformation of $\phi$ (Eq. 6). The moments can be computed by differentiating the moment generating function, which is itself a Laplace transform. For our algorithm, we only need the first moment of $\omega$, which is given by
$$\mathbb{E}_{q(\omega)}[\omega] = \left.\frac{\mathrm{d}\,\mathcal{L}\{q(\omega)\}(-t)}{\mathrm{d}t}\right|_{t=0} = -\frac{\phi'(c^2)}{\phi(c^2)} = \bar{\omega},$$
which can be cheaply computed via automatic differentiation.
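The first-moment formula can be illustrated with a few lines of code. The sketch below assumes the Student-t pdr function from Table 1 and uses a central finite difference as a stand-in for automatic differentiation (the paper's actual implementation is not shown in the text). The closed form used for comparison is the mean of a Gamma distribution with shape $(\nu+1)/2$ and rate $\nu + c^2$, which is our own derivation for this specific likelihood; all function names are hypothetical.

```python
import math

def phi_student_t(r, nu=3.0):
    """pdr function of the Student-t likelihood (Table 1)."""
    return (1.0 + r / nu) ** (-(nu + 1.0) / 2.0)

def first_moment(phi, c, eps=1e-6):
    """E[omega] = -phi'(c^2) / phi(c^2) = -(d/dr) log phi(r) evaluated at r = c^2.

    A central finite difference stands in for automatic differentiation here.
    """
    r = c * c
    dlog = (math.log(phi(r + eps)) - math.log(phi(r - eps))) / (2.0 * eps)
    return -dlog

nu, c = 3.0, 1.5
numeric = first_moment(lambda r: phi_student_t(r, nu), c)
# Assumed closed form: mean of Gamma(shape=(nu + 1) / 2, rate=nu + c^2).
closed_form = (nu + 1.0) / (2.0 * (nu + c * c))
print(numeric, closed_form)
```

Only point evaluations of $\phi$ are needed, which is why the update does not require the density of $\pi_\phi$ itself.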
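The updates for the variational distribution of the auxiliary variables $q(\omega)$ are computed as follows:
$$\begin{aligned} q(\omega_i) &\propto \exp\left(-\mathbb{E}_{q(f_i)}\left[\|h(f_i, y_i)\|_2^2\right] \omega_i + \log p(\omega_i)\right) \\ &\propto \exp\left(-\mathbb{E}_{q(f_i)}\left[\|h(f_i, y_i)\|_2^2\right] \omega_i\right) p(\omega_i) \\ &= \pi_\phi\left(\omega_i \,\Big|\, \sqrt{\mathbb{E}_{q(f_i)}\left[h(f_i, y_i)^2\right]}\right). \end{aligned}$$
We then get the update $c_i = \sqrt{\mathbb{E}_{q(f_i)}\left[\|h(f_i, y_i)\|_2^2\right]}$, which can be easily computed in closed form since $\|h(f_i, y_i)\|_2^2$ is a quadratic function of $f_i$.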
The coordinate ascent variational inference (CAVI) method
is summarized in Algorithm 1.
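Algorithm 1 Augmented Variational Inference
Input: Data $(X, y)$, GP model $p(y \mid f)$, kernel $k$
Output: Approximate posterior $q(f) = \mathcal{N}(f \mid m, S)$
for iteration $t = 1, 2, \ldots$ do
    # Local updates:
    for $i \in 1:N$ do
        $c_i = \sqrt{\mathbb{E}_{q(f)}\left[h(f_i, y_i)^2\right]}$
        $\bar{\omega}_i = \mathbb{E}_{q(\omega_i)}[\omega_i] = -\phi'(c_i^2) / \phi(c_i^2)$
    end for
    # Coordinate ascent updates (CAVI):
    $S \leftarrow \left(\mathrm{diag}(2\bar{\omega} \circ \gamma(y)) + K^{-1}\right)^{-1}$
    $m \leftarrow S\left(K^{-1}\mu_0 + g(y) + \bar{\omega} \circ \beta(y)\right)$
end for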
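Algorithm 1 can be sketched for one concrete likelihood. The code below is a minimal illustration for Student-t regression with a full (non-sparse) GP, assuming zero prior mean, $g = 0$, $\beta = 2y/\sigma^2$, $\gamma = 1/\sigma^2$, and the closed-form moment $\bar{\omega}_i = (\nu+1)/(2(\nu + c_i^2))$; these specializations are our own reading of Eq. 5 and Table 1. The function name, the jitter term, and the toy data are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def cavi_student_t(K, y, nu=3.0, sigma=1.0, n_iter=50):
    """Sketch of Algorithm 1 for a Student-t likelihood (mu_0 = 0, g = 0)."""
    n = len(y)
    K_inv = np.linalg.inv(K + 1e-8 * np.eye(n))  # small jitter, an illustrative choice
    beta, gamma = 2.0 * y / sigma**2, 1.0 / sigma**2
    m, S = np.zeros(n), K.copy()
    omega_bar = np.ones(n)
    for _ in range(n_iter):
        # Local updates: c_i^2 = E_q[h(f_i, y_i)^2] and omega_bar_i = -phi'(c_i^2) / phi(c_i^2)
        c2 = ((m - y) ** 2 + np.diag(S)) / sigma**2
        omega_bar = (nu + 1.0) / (2.0 * (nu + c2))
        # CAVI updates for q(f) = N(m, S)
        S = np.linalg.inv(np.diag(2.0 * omega_bar * gamma) + K_inv)
        m = S @ (omega_bar * beta)
    return m, S, omega_bar

# Toy data: a sine curve with one outlier at index 3.
x = np.linspace(0.0, 1.0, 10)
y_clean = np.sin(2.0 * np.pi * x)
y = y_clean.copy()
y[3] += 5.0
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.2**2)
m, S, omega_bar = cavi_student_t(K, y, sigma=0.1)
print(np.round(omega_bar, 4))
```

In this toy run, the outlier receives a much smaller $\bar{\omega}_i$ than the other points, i.e. a much lower effective precision, which is the mechanism behind the robustness of the Student-t model.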
Sparse GP approximation. To scale our method to big datasets, we approximate the latent GP $f$ by a sparse Gaussian process building on inducing points. We introduce $M$ inducing points $u$ and connect the GP values with the inducing points via the joint prior distribution $p(f, u)$ given in Titsias (2009). The introduction of inducing points preserves conditional conjugacy and allows for mini-batch sampling of the data (stochastic variational inference). This scales the algorithm to big datasets and has computational complexity $\mathcal{O}(M^3)$. The SVI version of our algorithm only slightly changes the updates that are presented in Algorithm 1. The details are deferred to Appendix A.3.
3.3.2 Gibbs sampling
To sample from the exact posterior distribution, a Gibbs sampling scheme alternates between sampling from the complete conditional distributions. In the following, we propose a sampling scheme for the distribution family $\pi_\phi(\omega \mid c)$ that is automatically constructed given the pdr function $\phi(\cdot)$ of the likelihood.
The distribution class $\pi_\phi$ is defined in Eq. 6 and is based on the inverse Laplace transform of $\phi(\cdot)$. However, there is no general approach to compute the inverse Laplace transform in closed form (Cohen, 2007). We circumvent this issue by proposing an algorithm that only evaluates the inverse Laplace transformation point-wise but does not need access to its full analytical form. We apply the method proposed by Ridout (2009), which builds on the fact that the cumulative distribution function (cdf) $F_{\pi_\phi(\omega|c)}(\cdot)$ can be computed via the inverse Laplace transform of a scaled (forward) Laplace transform,
$$F_{\pi_\phi(\omega|c)}(x) = \mathcal{L}^{-1}\left\{\frac{\mathcal{L}\{\pi_\phi(\omega \mid c)\}(s)}{s}\right\}(x) = \mathcal{L}^{-1}\left\{\frac{\phi(s + c^2)}{s\, \phi(c^2)}\right\}(x).$$
To generate samples from $\pi_\phi(\omega \mid c)$, we first generate a uniform sample $u \sim \mathcal{U}[0, 1]$ and then push it through the inverse cdf, $\omega = F^{-1}_{\pi_\phi(\omega|c)}(u)$ (Devroye, 1986). Finally, to compute the inverse cdf, we solve a root-finding problem using the modified Newton-Raphson method described by Ridout (2009). We solve the equation $F_{\pi_\phi(\omega|c)}(\omega) = u$ by repeatedly setting $\omega \leftarrow \omega - \left(F_{\pi_\phi(\omega|c)}(\omega) - u\right) / \pi_\phi(\omega \mid c)$ until reaching convergence. We numerically approximate the (forward) cdf $F_{\pi_\phi(\omega|c)}(\omega)$ by the cheap trapezoidal method introduced in Abate et al. (2000), which has error guarantees. The cost of this process is negligible compared to the matrix inversion for sampling $f$. All steps are summarized in Algorithm 2.
Note that for some likelihood functions (e.g. the logistic likelihood function), the inverse Laplace transform can be derived analytically and the steps described above can be optimized by using an existing sampler for the corresponding complete conditional distribution.
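As an example of this analytic shortcut, the sketch below implements a Gibbs sampler for Student-t regression, where (by the Gamma Laplace-transform pair, our own specialization) the complete conditional of $\omega_i$ is a Gamma distribution with shape $(\nu+1)/2$ and rate $\nu + c_i^2$, so NumPy's Gamma sampler can replace the numerical inverse-cdf step. It assumes a zero prior mean and $g = 0$; the function name, jitter, and toy data are hypothetical.

```python
import numpy as np

def gibbs_student_t(K, y, nu=3.0, sigma=1.0, n_samples=500, seed=0):
    """Sketch of a Gibbs sampler for a GP with Student-t likelihood (mu_0 = 0, g = 0)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K_inv = np.linalg.inv(K + 1e-8 * np.eye(n))  # small jitter, an illustrative choice
    beta, gamma = 2.0 * y / sigma**2, 1.0 / sigma**2
    f = np.zeros(n)
    samples = np.empty((n_samples, n))
    for t in range(n_samples):
        # omega_i | f_i, y_i ~ Gamma(shape=(nu + 1) / 2, rate=nu + c_i^2), c_i^2 = h(f_i, y_i)^2
        c2 = (f - y) ** 2 / sigma**2
        omega = rng.gamma(shape=(nu + 1.0) / 2.0, scale=1.0 / (nu + c2))
        # f | omega, y ~ N(mu, Sigma) as in Theorem 2
        Sigma = np.linalg.inv(np.diag(2.0 * omega * gamma) + K_inv)
        Sigma = 0.5 * (Sigma + Sigma.T)  # symmetrize against round-off
        mu = Sigma @ (omega * beta)
        f = rng.multivariate_normal(mu, Sigma)
        samples[t] = f
    return samples

x = np.linspace(0.0, 1.0, 8)
y = np.sin(2.0 * np.pi * x)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.2**2)
samples = gibbs_student_t(K, y, sigma=0.1)
posterior_mean = samples[100:].mean(axis=0)  # discard the first 100 draws as burn-in
print(np.round(posterior_mean, 2))
```

Each sweep costs one matrix inversion for $f$, while the $\omega$ update is a vectorized Gamma draw.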
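Algorithm 2 Gibbs Sampling
Input: Data $(X, y)$, GP model $p(y \mid f)$, kernel $k$
Output: Posterior samples $\{f^t\} \sim p(f \mid y)$
for sample index $t = 1, 2, \ldots$ do
    # Sample $\omega \sim p(\omega \mid f, y)$:
    for $i \in 1:N$ do
        Compute $c_i = \|h(f_i, y_i)\|_2$
        Sample $u_i \sim \mathcal{U}[0, 1]$
        # Compute inverse cdf $\omega_i = F^{-1}_{\pi_\phi(c_i)}(u_i)$:
        Initialize $\omega_i > 0$
        while $|\tilde{F}_{\pi_\phi(c_i)}(\omega_i) - u_i| > \epsilon$ do
            Approximate $\tilde{F}_{\pi_\phi(c_i)}(\omega_i)$, $\tilde{\pi}_\phi(\omega_i \mid c_i)$ (see Sec. 3.3.2)
            $\omega_i \leftarrow \omega_i - \left(\tilde{F}_{\pi_\phi(c_i)}(\omega_i) - u_i\right) / \tilde{\pi}_\phi(\omega_i \mid c_i)$
        end while
    end for
    # Sample $f \sim p(f \mid \omega, y)$:
    $\Sigma = \left(\mathrm{diag}(2\omega \circ \gamma(y)) + K^{-1}\right)^{-1}$
    $\mu = \Sigma\left(K^{-1}\mu_0 + g(y) + \omega \circ \beta(y)\right)$
    Sample $f^t \sim \mathcal{N}(\mu, \Sigma)$
end for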
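The inner loop of Algorithm 2 can be sketched as follows. Note the substitution: instead of the trapezoidal inversion of Abate et al. (2000) used in the paper, this sketch uses the simpler real-valued Gaver-Stehfest formula for the point-wise inverse Laplace transform, and a crude safeguard (halving $\omega$ when a Newton step would leave the positive axis) instead of the modified Newton-Raphson scheme of Ridout (2009). All function names are hypothetical. The check uses $\phi(r) = 1/(1+r)$, for which $\pi_\phi(\omega \mid c)$ is an exponential distribution with rate $1 + c^2$.

```python
import math

def stehfest_coefficients(n_terms=12):
    """Gaver-Stehfest weights V_1..V_N for an even number of terms N."""
    half = n_terms // 2
    weights = []
    for k in range(1, n_terms + 1):
        total = 0.0
        for j in range((k + 1) // 2, min(k, half) + 1):
            total += (j ** half * math.factorial(2 * j)) / (
                math.factorial(half - j) * math.factorial(j) * math.factorial(j - 1)
                * math.factorial(k - j) * math.factorial(2 * j - k))
        weights.append((-1) ** (k + half) * total)
    return weights

_V = stehfest_coefficients()

def inverse_laplace(F, t):
    """Point-wise approximation of L^{-1}{F}(t) for t > 0."""
    a = math.log(2.0) / t
    return a * sum(v * F((k + 1) * a) for k, v in enumerate(_V))

def sample_quantile(phi, c, u, omega=1.0, tol=1e-8, max_iter=100):
    """Solve F(omega) = u for the cdf F of pi_phi(. | c) by safeguarded Newton steps."""
    c2 = c * c
    density_lt = lambda s: phi(s + c2) / phi(c2)  # Laplace transform of pi_phi(. | c)
    cdf_lt = lambda s: density_lt(s) / s          # Laplace transform of its cdf
    for _ in range(max_iter):
        err = inverse_laplace(cdf_lt, omega) - u
        if abs(err) < tol:
            break
        density = max(inverse_laplace(density_lt, omega), 1e-12)
        proposal = omega - err / density
        omega = proposal if proposal > 0.0 else 0.5 * omega
    return omega

phi = lambda r: 1.0 / (1.0 + r)
median = sample_quantile(phi, c=1.0, u=0.5)
print(median, math.log(2.0) / 2.0)  # numerical vs. exact median of Exponential(rate=2)
```

Drawing $u$ from a uniform distribution and calling `sample_quantile` then yields one sample of $\omega_i$; only point evaluations of $\phi$ are required.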
4 RELATED WORK
Inference for non-conjugate likelihoods is not a new topic, and many works have addressed the problem efficiently.
Scale mixtures of normals. The Gaussian scale-mixture formulation is well known in statistics and has been explored more recently by Gneiting (1997, 1999). Palmer (2006) and Palmer et al. (2006) started to generalize it for machine learning use but did not explore the probabilistic side of the augmentation.
Black-box variational inference. One of the most popular approaches for variational inference in recent years is to optimize the ELBO for an arbitrary model by computing gradient estimates via sampling or quadrature, e.g. Salimbeni et al. (2018); Mohamed et al. (2019). However, these methods do not exploit the structure of the model and can be less efficient.
Sampling methods. Sampling is not a popular method for GP models since $f$ is high-dimensional and the posterior is usually highly correlated (Lawrence et al., 2009). As for many Bayesian models, however, Hamiltonian Monte Carlo is a good candidate (Titsias et al., 2008).
Likelihood approximation. Jaakkola and Jordan (2000) propose a variational approach purely based on optimization, using the partial convexity of the likelihood. Our method recovers their results, but from a probabilistic perspective; we show the equivalence with their approach in Appendix A.5. Khan and Lin (2017) exploit existing partial conjugacy in the model and rely on the assumption that part of the joint posterior can be rewritten as an exponential family. Their approach is complementary to ours, and the two could be combined for solving more complex models.
Use cases of the augmented model. Different applications of the augmentation technique for specific likelihoods have been explored in multiple papers: Jylänki et al. (2011) applied the augmentation to the Student-t likelihood with Gaussian processes. Polson et al. (2013) developed an approach for the logistic likelihood, and this work was further expanded to big data by Wenzel et al. (2019). The augmentation of the Bayesian support vector machine by Polson et al. (2011), scaled up by Wenzel et al. (2017), is similar to our method but is based on a different augmentation approach. Note that our method covers all these cases exactly but does not rely on any manual derivations.
5 EXPERIMENTS
In this section we answer the following questions empiri-
cally:
•How does the Gibbs sampling scheme compare to
other sampling methods?
•What is lost in variational inference by approximating
an additional variable?
•And what is the gain in speed?
| Likelihood | Diagnostic | MH | HMC | Gibbs |
|---|---|---|---|---|
| Logistic | Time/Sample (s) | 0.001 | 0.041 | 0.01 |
| Logistic | Lag 1 | 0.996 | 0.53 | 0.11 |
| Logistic | Gelman | 1.38 | 1.00 | 1.00 |
| Student-t | Time/Sample (s) | 0.003 | 0.573 | 0.028 |
| Student-t | Lag 1 | 1.0 | 0.857 | 0.04 |
| Student-t | Gelman | 1.51 | 1.00 | 1.00 |
| Laplace | Time/Sample (s) | 0.002 | 0.082 | 0.028 |
| Laplace | Lag 1 | 0.995 | 0.931 | 0.26 |
| Laplace | Gelman | 1.44 | 1.01 | 1.00 |
| Matern 3/2 | Time/Sample (s) | 0.005 | 0.15 | 0.029 |
| Matern 3/2 | Lag 1 | 0.997 | 0.995 | 0.05 |
| Matern 3/2 | Gelman | 1.59 | 1.10 | 1.00 |

Table 2. Sampling time and diagnostics of Gibbs sampling, naive Metropolis-Hastings and Hamiltonian Monte Carlo. The Gelman test indicates the inter-chain correlation and should be close to 1.

We explore four different cases. We use three regression models with different likelihood functions (a Laplace likelihood, a Student-t likelihood, and a new likelihood inspired by the Matern 3/2 kernel (Rasmussen, 2003)) and one classification model with a logistic likelihood. All the mathematical details of these augmentations are deferred to Appendix A.6. For the first two experiments we use a full GP without inducing points to have a cleaner analysis of the effect of the augmentation. For all experiments we use a squared exponential kernel with automatic relevance determination: $k(x, x') = \exp\left(-\sum_{d=1}^{D} (x_d - x'_d)^2 / \theta_d^2\right)$. For the first two experiments we use datasets from the UCI repository (Dua and Graff, 2017): the Boston housing dataset ($N = 506$, $D = 14$) for regression and the Heart dataset ($N = 303$, $D = 14$) for classification. For the last experiment we use the Protein dataset ($N = 45730$, $D = 9$) and the Airline dataset ($N = 190\text{K}$, $D = 7$) for regression, and the Covtype dataset ($N = 581\text{K}$, $D = 54$) and the SUSY dataset ($N = 5\text{M}$, $D = 18$) for classification. We normalize the input features to mean 0 and variance 1.
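The kernel and the input normalization described in this setup can be sketched as follows. The kernel follows the formula as written in the text (no factor of 2 in the denominator and unit variance); the function names and the random toy inputs are hypothetical.

```python
import numpy as np

def standardize(X):
    """Normalize each input feature to mean 0 and variance 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ard_squared_exponential(X1, X2, theta):
    """k(x, x') = exp(-sum_d (x_d - x'_d)^2 / theta_d^2) with one lengthscale per dimension."""
    Z1, Z2 = X1 / theta, X2 / theta
    sq_dist = (np.sum(Z1**2, axis=1)[:, None] + np.sum(Z2**2, axis=1)[None, :]
               - 2.0 * Z1 @ Z2.T)
    return np.exp(-np.maximum(sq_dist, 0.0))  # clip tiny negative values from round-off

rng = np.random.default_rng(0)
X = standardize(rng.normal(size=(5, 3)))
theta = np.array([1.0, 2.0, 0.5])
K = ard_squared_exponential(X, X, theta)
print(np.round(K, 3))
```

A large $\theta_d$ makes the kernel nearly insensitive to dimension $d$, which is the relevance-determination effect.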
5.1 Gibbs sampling mixing
Our approach leads to a Gibbs sampling algorithm that provides samples from the true posterior of the original model. We compare our method (Gibbs) with a naive Metropolis-Hastings algorithm (MH) and a Hamiltonian Monte Carlo (HMC) sampler (where $\epsilon$ and $n_{\text{step}}$ are selected via a grid search, see Appendix A.7), both implemented in Turing.jl (Ge et al., 2018), with a whitening transformation on the kernel matrix for better mixing. We draw 5 independent chains of 10000 samples for each algorithm. We compare crucial sampling diagnostics among different models: we give the autocorrelation between consecutive samples (lag 1) to estimate the effective sample size (the autocorrelation plots for all lags are given in Appendix A.7), and the inter-chain correlation via the Gelman test, for which 1 is the optimum (Brooks and Gelman, 1998). The results are summarized in Table 2.
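The two diagnostics reported in Table 2 can be sketched as follows. This is a generic illustration of the lag-1 autocorrelation and of the basic potential scale reduction factor (the Gelman-Rubin statistic in its simplest form), not the code used to produce the table; the function names and synthetic chains are hypothetical.

```python
import numpy as np

def lag1_autocorrelation(chain):
    """Sample autocorrelation between consecutive draws of a scalar chain."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (m chains, n samples)."""
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between_over_n = chains.mean(axis=1).var(ddof=1)  # B / n: variance of the chain means
    pooled = (n - 1) / n * within + between_over_n
    return float(np.sqrt(pooled / within))

rng = np.random.default_rng(0)
iid_chains = rng.normal(size=(4, 2000))        # well-mixed chains
shifted_chains = iid_chains + np.array([[0.0], [5.0], [0.0], [5.0]])  # chains that disagree
print(lag1_autocorrelation(iid_chains[0]), gelman_rubin(iid_chains), gelman_rubin(shifted_chains))
```

A lag-1 autocorrelation near 0 and a statistic near 1 indicate good mixing, as for the Gibbs column of Table 2, whereas values near 1 for the autocorrelation (as for MH) indicate a small effective sample size.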
Figure 2. Converged negative ELBO and averaged negative log-likelihood on a held-out dataset as a function of the kernel lengthscale, training VI with and without augmentation. (a) Matern 3/2 likelihood on the Boston Housing dataset. (b) Logistic likelihood on the Heart dataset.
Figure 3. Test negative log-likelihood and test error (classification)/RMSE (regression) as a function of time for different likelihoods.
We find that our method has a very low intra-chain correlation, leading to a high sample efficiency, as well as a low inter-chain correlation, while still being faster than the HMC algorithm. This is even more evident for heavy-tailed likelihoods like Student-t or Laplace, where HMC can run into more trouble (Betancourt, 2017). Our approach is limited by the $\mathcal{O}(N^3)$ complexity for each sample.
5.2 Augmentation gap
To investigate the effect of augmenting the model when using variational inference, we train the original model using gradient descent and the augmented model until convergence. While we fix the kernel variance at 0.1, we vary the lengthscale $\theta$ from $10^{-2}$ to $10^{2}$. We compare the converged ELBOs as well as the predictive performance on a held-out test set. The results for the Matern 3/2 and logistic likelihoods are shown in Figure 2; the other likelihoods are shown in Appendix A.7. For both shown likelihoods, there is a visible ELBO gap between the augmented model and the original model. However, the predictive performance is nearly the same for both models. This suggests that, in these experiments, the difference in ELBO values does not affect the prediction performance.
5.3 Convergence speed
To scale our model to large datasets, we use the inducing points technique of Titsias (2009) together with the stochastic variational inference approach of Hoffman et al. (2013). We compare our variational approach (Algorithm 1) to natural gradient descent (Salimbeni et al., 2018) and ADAM (Hensman et al., 2015), both implemented in GPflow (Matthews et al., 2017). For all methods we use 200 inducing points determined by k-means++ (Arthur and Vassilvitskii, 2007) and minibatches of size 100, and we train the kernel hyperparameters using ADAM (Kingma and Ba, 2014); the inducing point locations are fixed. We show the predictive performance as a function of the training time for multiple likelihoods in Figure 3.
Our method is up to two orders of magnitude faster than the state of the art. Moreover, we find that the optimization in our method is more stable (smooth decrease of the loss).
6 CONCLUSION
We proposed a new efficient inference method for GP models that have a super-Gaussian likelihood. Our method builds on an auxiliary variable augmentation that renders the model conditionally conjugate. We showed that in the augmented model, variational inference is up to two orders of magnitude faster and more stable than the state of the art. For small datasets, we proposed a Gibbs sampler that outperforms Hamiltonian Monte Carlo sampling. Previous methods that build on auxiliary variable augmentations (e.g. Wenzel et al., 2019) manually derived the augmentation and inference methods, whereas in our approach the whole procedure is fully automated and works for a much more general class of models. Future work may aim at extending our approach to more general models by automatically constructing hierarchical augmentations inspired by Galy-Fajou et al. (2019) or Donner and Opper (2018).
References
Abate, J., Choudhury, G. L., and Whitt, W. (2000). An
introduction to numerical transform inversion and its ap-
plication to probability models. In Computational proba-
bility, pages 257–323. Springer.
Amari, S.-I. (1998). Natural gradient works efficiently in
learning. Neural computation, 10(2):251–276.
Arthur, D. and Vassilvitskii, S. (2007). k-means++: The
advantages of careful seeding. In Proceedings of the eigh-
teenth annual ACM-SIAM symposium on Discrete algo-
rithms, pages 1027–1035. Society for Industrial and Ap-
plied Mathematics.
Beckers, T., Kulić, D., and Hirche, S. (2019). Stable gaus-
sian process based tracking control of euler-lagrange sys-
tems. Automatica, (103):390–397.
Bernstein, S. et al. (1929). Sur les fonctions absolument
monotones. Acta Mathematica, 52:1–66.
Betancourt, M. (2017). A conceptual introduc-
tion to hamiltonian monte carlo. arXiv preprint
arXiv:1701.02434.
Bishop, C. M. (2006). Pattern recognition and machine
learning. springer.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017).
Variational inference: A review for statisticians. Journal
of the American Statistical Association, 112(518):859–
877.
Brooks, S. P. and Gelman, A. (1998). General methods for
monitoring convergence of iterative simulations. Journal
of computational and graphical statistics, 7(4):434–455.
Cohen, A. M. (2007). Numerical methods for Laplace
transform inversion, volume 5. Springer Science & Busi-
ness Media.
Debnath, L. and Bhatta, D. (2014). Integral transforms
and their applications. Chapman and Hall/CRC.
Devroye, L. (1986). Nonuniform random variate genera-
tion. Springer-Verlag.
Donner, C. and Opper, M. (2018). Efficient bayesian in-
ference of sigmoidal gaussian cox processes. The Journal
of Machine Learning Research, 19(1):2710–2743.
Dua, D. and Graff, C. (2017). UCI machine learning repos-
itory.
Eleftheriadis, S., Rudovic, O., Deisenroth, M. P., and Pan-
tic, M. (2017). Gaussian process domain experts for mod-
eling of facial affect. IEEE Transactions on Image Pro-
cessing, 26(10):4697–4711.
Galy-Fajou, T., Wenzel, F., Donner, C., and Opper, M.
(2019). Multi-class gaussian process classification made
conjugate: Efficient inference via data augmentation. Un-
certainty in Artificial Intelligence (UAI).
Ge, H., Xu, K., and Ghahramani, Z. (2018). Turing: a
language for flexible probabilistic inference. In Interna-
tional Conference on Artificial Intelligence and Statistics,
AISTATS, pages 1682–1690.
Gneiting, T. (1997). Normal scale mixtures and dual prob-
ability densities. Journal of Statistical Computation and
Simulation, 59(4):375–384.
Gneiting, T. (1999). Radial positive definite functions gen-
erated by euclid’s hat. Journal of Multivariate Analysis,
69(1):88–119.
Hensman, J., Matthews, A., and Ghahramani, Z. (2015).
Scalable variational gaussian process classification. The
Journal of Machine Learning Research.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J.
(2013). Stochastic variational inference. The Journal of
Machine Learning Research, 14(1):1303–1347.
Jaakkola, T. S. and Jordan, M. I. (2000). Bayesian pa-
rameter estimation via variational methods. Statistics and
Computing, 10(1):25–37.
Jylänki, P., Vanhatalo, J., and Vehtari, A. (2011). Robust
gaussian process regression with a Student-t likelihood.
Journal of Machine Learning Research, 12(Nov):3227–
3257.
Khan, M. E. and Lin, W. (2017). Conjugate-computation
variational inference: Converting variational inference in
non-conjugate models to inferences in conjugate mod-
els. International Conference on Artificial Intelligence and
Statistics, AISTATS.
Kingma, D. P. and Ba, J. (2014). Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980.
Lawrence, N. D., Rattray, M., and Titsias, M. K. (2009).
Efficient sampling for gaussian process inference using
control variables. In Advances in Neural Information Pro-
cessing Systems, pages 1681–1688.
Matthews, D. G., Alexander, G., Van Der Wilk, M., Nick-
son, T., Fujii, K., Boukouvalas, A., León-Villagrá, P.,
Ghahramani, Z., and Hensman, J. (2017). Gpflow: A gaus-
sian process library using tensorflow. The Journal of Ma-
chine Learning Research, 18(1):1299–1304.
Merkle, M. (2014). Completely monotone functions: a di-
gest. In Analytic Number Theory, Approximation Theory,
and Special Functions, pages 347–364. Springer.
Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A.
(2019). Monte carlo gradient estimation in machine learn-
ing. arXiv preprint arXiv:1906.10652.
Palmer, J., Kreutz-Delgado, K., Rao, B. D., and Wipf,
D. P. (2006). Variational em algorithms for non-gaussian
latent variable models. In Advances in neural information
processing systems, pages 1059–1066.
Palmer, J. A. (2006). Variational and scale mixture repre-
sentations of non-Gaussian densities for estimation in the
Bayesian linear model: Sparse coding, independent com-
ponent analysis, and minimum entropy segmentation. PhD
thesis, UC San Diego.
Pandit, R. K. and Infield, D. (2018). Comparative analysis
of binning and gaussian process based blade pitch angle
curve of a wind turbine for the purpose of condition mon-
itoring. Journal of Physics: Conference Series, 1102.
Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian
inference for logistic models using pólya–gamma latent
variables. Journal of the American statistical Association,
108(504):1339–1349.
Polson, N. G., Scott, S. L., et al. (2011). Data augmen-
tation for support vector machines. Bayesian Analysis,
6(1):1–23.
Rasmussen, C. E. (2003). Gaussian processes in machine
learning. Springer.
Ridout, M. S. (2009). Generating random numbers from
a distribution specified by its laplace transform. Statistics
and Computing, 19(4):439.
Salimbeni, H., Eleftheriadis, S., and Hensman, J. (2018).
Natural gradients in practice: Non-conjugate variational
inference in gaussian process models. Proceedings of the
International Conference on Artificial Intelligence and
Statistics (AISTATS).
Schoenberg, I. J. (1938). Metric spaces and completely
monotone functions. Annals of Mathematics, pages 811–
841.
Titsias, M. (2009). Variational learning of inducing vari-
ables in sparse gaussian processes. In Artificial Intelli-
gence and Statistics, pages 567–574.
Titsias, M. K., Lawrence, N., and Rattray, M. (2008).
Markov chain monte carlo algorithms for gaussian pro-
cesses. Inference and Estimation in Probabilistic Time-
Series Models, 9.
Wang, C. and Blei, D. M. (2013). Variational inference
in nonconjugate models. Journal of Machine Learning
Research, 14(Apr):1005–1031.
Wenzel, F., Galy-Fajou, T., Deutsch, M., and Kloft, M.
(2017). Bayesian nonlinear support vector machines for
big data. In Joint European Conference on Machine
Learning and Knowledge Discovery in Databases, pages
307–322. Springer.
Wenzel, F., Galy-Fajou, T., Donner, C., Kloft, M., and Op-
per, M. (2019). Efficient gaussian process classification
using Pólya-gamma data augmentation. In Proceedings of
the AAAI Conference on Artificial Intelligence, volume 33,
pages 5417–5424.
Widder, D. V. (1946). The Laplace transform. Princeton
university press.
Yeh, J. (2006). Real analysis: theory of measure and inte-
gration second edition. World Scientific Publishing Com-
pany.
A APPENDIX
A.1 Proof of theorem 2
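Theorem 2 states:
Theorem. The complete conditional distributions of the augmented model presented in Section 3.1 are given by
$$p(\omega_i \mid f_i, y_i) = \pi_\phi\left(\omega_i \mid \|h(f_i, y_i)\|_2\right),$$
$$p(f \mid y, \omega) = \mathcal{N}(f \mid \mu, \Sigma),$$
where $\Sigma = \left(\mathrm{diag}(2\omega \circ \gamma(y)) + K^{-1}\right)^{-1}$ and $\mu = \Sigma\left(g(y) + \omega \circ \beta(y) + K^{-1}\mu_0\right)$, $\circ$ denotes the Hadamard product
and the function $h(\cdot)$ is given by the form of the likelihood (see Eq. 5).
Proof: For the full conditional on $f$ (written with $\mu_0 = 0$ for brevity; a nonzero prior mean contributes the additional linear term $K^{-1}\mu_0$):
$$\begin{aligned}
p(f \mid y, \omega) &\propto p(y \mid f, \omega)\, p(f) \\
&\propto \exp\left( g(y)^\top f + (\beta(y) \circ \omega)^\top f - f^\top \mathrm{diag}(\gamma(y) \circ \omega) f - \tfrac{1}{2} f^\top K^{-1} f \right) \\
&\propto \exp\left( (g(y) + \beta(y) \circ \omega)^\top f - f^\top \left( \mathrm{diag}(\gamma(y) \circ \omega) + \tfrac{1}{2} K^{-1} \right) f \right).
\end{aligned}$$
We immediately recognize a multivariate normal distribution with $-\tfrac{1}{2}\Sigma^{-1} = -\left(\mathrm{diag}(\gamma(y) \circ \omega) + \tfrac{1}{2}K^{-1}\right)$ and $\Sigma^{-1}\mu = g(y) + (\beta(y) \circ \omega)$,
which corresponds to the result shown in equation (11).
For the augmented variable $\omega_i$:
$$\begin{aligned}
p(\omega_i \mid y_i, f_i) &\propto p(y_i \mid f_i, \omega_i)\, p(\omega_i) \\
&\propto \exp\left( -\|h(y_i, f_i)\|_2^2\, \omega_i \right) \pi_\phi(\omega_i \mid 0) \\
&\propto \pi_\phi\left(\omega_i \mid \|h(y_i, f_i)\|_2\right).
\end{aligned}$$
Note that equation 9 directly gives the normalization constant $\phi\left(\|h(y_i, f_i)\|_2^2\right)$. QED.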
A.2 Computation of the moments and cumulants for the augmentation variable
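Given the general class of distributions $\pi_\phi(\omega \mid c)$ described in Section 3.1, moments and cumulants can be easily computed.
The $k$-th moment of a distribution can be computed by taking the $k$-th derivative of the moment generating function (which is the
Laplace transform evaluated at $-t$) at $t = 0$. For example, for the first moment:
$$\begin{aligned}
\mathbb{E}_{\pi_\phi(\omega \mid c)}[\omega] &= \left. \frac{d\, \mathcal{L}\{\pi_\phi(\omega \mid c)\}(-t)}{dt} \right|_{t=0} \\
&= \left. \frac{d}{dt} \mathcal{L}\left[ \frac{e^{-c^2 \omega}\, \pi_\phi(\omega \mid 0)}{\phi(c^2)} \right](-t) \right|_{t=0} \\
&= -\frac{1}{\phi(c^2)} \left. \frac{d}{dt} \mathcal{L}\left[\pi_\phi(\omega \mid 0)\right](t + c^2) \right|_{t=0} \\
&= -\frac{1}{\phi(c^2)} \left. \frac{d \phi(t + c^2)}{dt} \right|_{t=0} \\
&= -\left. \frac{d \log \phi(t)}{dt} \right|_{t=c^2} \\
&= -\frac{\phi'(c^2)}{\phi(c^2)} = \bar{\omega}.
\end{aligned}$$
More generally, the $k$-th moment $m_k$ is given by:
$$m_k = (-1)^k \frac{1}{\phi(c^2)} \left. \frac{d^k \phi(t)}{dt^k} \right|_{t=c^2},$$
and the cumulants $\kappa_k$ are computed using the cumulant generating function (the log of the moment generating function):
$$\kappa_k = (-1)^k \left. \frac{d^k \log \phi(t)}{dt^k} \right|_{t=c^2}.$$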
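As a quick sanity check of the first-moment and cumulant formulas above, the following sketch evaluates them for the logistic entry of the table in A.6, $\phi(r) = \cosh^{-1}(\sqrt{r}/2)$. The function names are illustrative and not part of the original work; the closed form $\tanh(c/2)/(4c)$ is obtained by differentiating $\log\phi$ by hand, and the derivatives are otherwise approximated by finite differences.

```python
import numpy as np


def log_phi_logistic(r):
    """log phi(r) for phi(r) = 1 / cosh(sqrt(r) / 2) (logistic row of the table in A.6)."""
    return -np.log(np.cosh(np.sqrt(r) / 2.0))


def mean_omega_closed_form(c):
    """-phi'(c^2) / phi(c^2), differentiated by hand for the logistic phi."""
    return np.tanh(c / 2.0) / (4.0 * c)


def mean_omega_numeric(c, eps=1e-6):
    """First moment as -d log phi(t) / dt at t = c^2, via a central finite difference."""
    t = c**2
    return -(log_phi_logistic(t + eps) - log_phi_logistic(t - eps)) / (2.0 * eps)


def second_cumulant_numeric(c, eps=1e-4):
    """Second cumulant (variance) as d^2 log phi(t) / dt^2 at t = c^2."""
    t = c**2
    return (
        log_phi_logistic(t + eps) - 2.0 * log_phi_logistic(t) + log_phi_logistic(t - eps)
    ) / eps**2


c = 1.5
print(mean_omega_closed_form(c), mean_omega_numeric(c), second_cumulant_numeric(c))
```

The second cumulant is non-negative because a completely monotone $\phi$ is log-convex, which is the same property used in A.5.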
A.3 Algorithm for the sparse case
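Algorithm 3 Augmented Stochastic Variational Inference
Input: Data $(X, y)$, GP model $p(y \mid f, u)$, kernel $k$
Output: Approximate posterior $q(u) = \mathcal{N}(u \mid m, S)$
Find inducing point inputs $Z$ via k-means
Compute kernel matrices: $K_Z$, $\kappa = K_{XZ} K_Z^{-1}$
for iteration $t = 1, 2, \ldots$ do
    # Local updates:
    Sample minibatch $\mathcal{B} \subseteq \{1, \ldots, n\}$
    for $i \in \mathcal{B}$ do
        $c_i = \sqrt{\mathbb{E}_{q(f)}[h(f_i, y_i)^2]}$
        $\bar{\omega}_i = \mathbb{E}_{q(\omega_i)}[\omega_i] = -\phi'(c_i^2) / \phi(c_i^2)$
    end for
    # Natural gradient updates (CAVI):
    $\tilde{S} = \left( \kappa^\top \mathrm{diag}(2\bar{\omega} \circ \gamma(y))\, \kappa + K_Z^{-1} \right)^{-1}$
    $\tilde{m} = \tilde{S} \left( K_Z^{-1} \mu_0 + \kappa^\top (g(y) + \bar{\omega} \circ \beta(y)) \right)$
    $\{m, S\} \leftarrow (1 - \rho(t)) \{m, S\} + \rho(t) \{\tilde{m}, \tilde{S}\}$
end for
$\rho(t)$ is an arbitrary learning rate satisfying the Robbins-Monro conditions.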
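The updates of Algorithm 3 can be illustrated with a minimal full-batch sketch for the logistic likelihood, for which the table in A.6 gives $g(y) = y/2$, $\beta(y) = 0$, $\gamma(y) = 1$ and $h^2 = f^2$. This is not the authors' implementation: the function names, the RBF kernel, the evenly spaced inducing inputs (instead of k-means), $\mu_0 = 0$, the learning rate $\rho = 1$ and the use of the whole dataset instead of minibatches are all simplifying assumptions. The second moment $c_i^2$ follows the expression used in A.4.4.

```python
import numpy as np


def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)


def omega_mean_logistic(c):
    """E[omega] = -phi'(c^2) / phi(c^2) for phi(r) = 1 / cosh(sqrt(r) / 2)."""
    c = np.maximum(c, 1e-8)  # avoid 0 / 0; the limit at c -> 0 is 1 / 8
    return np.tanh(c / 2.0) / (4.0 * c)


def cavi_sparse_logistic(x, y, z, n_iter=50, rho=1.0, jitter=1e-6):
    """Full-batch version of the updates in Algorithm 3 (assumes mu_0 = 0)."""
    Kz = rbf_kernel(z, z) + jitter * np.eye(len(z))
    Kz_inv = np.linalg.inv(Kz)
    kappa = rbf_kernel(x, z) @ Kz_inv  # shape (n, M)
    m = np.zeros(len(z))
    S = Kz.copy()
    g = y / 2.0  # logistic likelihood: g(y) = y / 2
    for _ in range(n_iter):
        # Local updates: c_i^2 = E_q[f_i^2], as written in A.4.4
        mean_f = kappa @ m
        var_f = np.einsum("ij,jk,ik->i", kappa, S, kappa)
        omega = omega_mean_logistic(np.sqrt(mean_f**2 + var_f))
        # Global updates (gamma = 1, beta = 0 for the logistic likelihood)
        S_new = np.linalg.inv(kappa.T @ (2.0 * omega[:, None] * kappa) + Kz_inv)
        m_new = S_new @ (kappa.T @ g)
        m = (1.0 - rho) * m + rho * m_new
        S = (1.0 - rho) * S + rho * S_new
    return m, S, kappa


x = np.linspace(-3.0, 3.0, 40)
y = np.sign(x)  # labels in {-1, +1}
z = np.linspace(-3.0, 3.0, 8)  # inducing inputs (evenly spaced for simplicity)
m, S, kappa = cavi_sparse_logistic(x, y, z)
print("training accuracy:", np.mean(np.sign(kappa @ m) == y))
```

A stochastic variant would replace the sums over all data points by rescaled minibatch sums and use a decaying $\rho(t)$; that rescaling is not shown here.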
A.4 ELBO Analysis
A.4.1 Full ELBO
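$$\mathrm{ELBO} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i, \omega_i)}[\log p(y_i \mid f_i, \omega_i)] - \mathrm{KL}[q(f) \| p(f)] - \sum_{i=1}^{N} \mathrm{KL}[q(\omega_i) \| p(\omega_i)]$$
$$\begin{aligned}
\mathbb{E}_q[\log p(y_i \mid f_i, \omega_i, \theta)] &= \log C(\theta) + g(y_i, \theta)\, \mathbb{E}_{q(f)}[f_i] - \mathbb{E}_{q(f)}\left[h(f_i, y_i)^2\right] \mathbb{E}_{q(\omega_i)}[\omega_i] \\
&= \log C(\theta) + g(y_i, \theta)\, m_i - \left( \alpha(y_i) - \beta(y_i)\, m_i + \gamma(y_i)\left(m_i^2 + S_{ii}\right) \right) \bar{\omega}_i
\end{aligned}$$
$$\mathrm{KL}[q(f) \| p(f)] = \frac{1}{2} \left( \log \frac{|K|}{|S|} - N + \mathrm{tr}(K^{-1} S) + (\mu_0 - m)^\top K^{-1} (\mu_0 - m) \right)$$
$$\mathrm{KL}[q(\omega_i) \| p(\omega_i)] = -\mathbb{E}_{q(\omega_i)}\left[c_i^2 \omega_i\right] - \log \phi(c_i^2) = -c_i^2 \bar{\omega}_i - \log \phi(c_i^2)$$
Note that we can take the derivatives of the ELBO and set them to 0 to recover exactly the updates in Algorithm 1.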
A.4.2 Analysis of the optima
By setting $c_i^2$ as a function of $m$ and $S$ (and setting $\mu_0$ to 0 for simplicity), we obtain an ELBO that depends only on the
variational parameters of $f$:
$$\mathrm{ELBO}(m, S) = \underbrace{C + g^\top m + \frac{1}{2}\left( \log |S| - \mathrm{tr}(K^{-1} S) - m^\top K^{-1} m \right)}_{\mathrm{ELBO}_1} + \underbrace{\sum_i \log \phi\left(m_i^2 + S_{ii}\right)}_{\mathrm{ELBO}_2}$$
It is easy to show with a short matrix analysis that $\mathrm{ELBO}_1$ is jointly concave in $m$ and $S$. However, $\mathrm{ELBO}_2$ is more complex:
$m_i^2 + S_{ii}$ is jointly convex in $m$ and $S$, and $\phi(r)$ is by definition convex as well, but $\phi(m_i^2 + S_{ii})$ is neither jointly convex
nor concave in $m$ and $S$. It is therefore impossible to guarantee that there is a unique global optimum; however, the CAVI updates
guarantee convergence to a local optimum.
A.4.3 ELBO Gap
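For a fixed $q(f)$ we can compare the ELBO of the original model $\mathcal{L}_{\mathrm{std}}(q(f))$ and that of the augmented model $\mathcal{L}_{\mathrm{aug}}(q(f)q(\omega))$.
It is then straightforward to compute the difference between the two. The last steps are written for the logistic likelihood, where $h^2 = f^2$ and the augmentation prior is $\mathrm{PG}(\omega \mid 1, 0)$, and $m$ denotes $\mathbb{E}_{q(\omega)}[\omega]$ in the final line:
$$\begin{aligned}
\Delta\mathcal{L} &= \mathcal{L}_{\mathrm{std}}(q(f)) - \mathcal{L}_{\mathrm{aug}}(q(f)q(\omega)) \\
&= \mathbb{E}_{q(f)}\left[ \log p(y, f) - \log q(f) - \mathbb{E}_{q(\omega)}\left[ \log p(y, f, \omega) - \log q(f)q(\omega) \right] \right] \\
&= \mathbb{E}_{q(f)q(\omega)}\left[ -\log \frac{p(y, f, \omega)}{p(y, f)} + \log q(\omega) \right] \\
&= \mathbb{E}_{q(f)q(\omega)}\left[ -\log p(\omega \mid y, f) + \log q(\omega) \right] \\
&= \mathbb{E}_{q(\omega)}\left[ \log q(\omega) - \mathbb{E}_{q(f)}[\log p(\omega \mid y, f)] \right] \\
&= -c^2\, \mathbb{E}_{q(\omega)}[\omega] + \mathbb{E}_{q(\omega)}[\log \mathrm{PG}(\omega \mid 1, 0)] - \log \phi(c^2) \\
&\quad + \mathbb{E}_{q(f)}\left[f^2\right] \mathbb{E}_{q(\omega)}[\omega] - \mathbb{E}_{q(\omega)}[\log \mathrm{PG}(\omega \mid 1, 0)] + \mathbb{E}_{q(f)}\left[\log \phi(f^2)\right] \\
&= -c^2 m - \log \phi(c^2) + \mathbb{E}_{q(f)}\left[f^2\right] m + \mathbb{E}_{q(f)}\left[\log \phi(f^2)\right]
\end{aligned}$$
Replacing $q(\omega)$ with the optimal $q^*(\omega) = \frac{e^{-c^2 \omega} p(\omega)}{\phi(c^2)}$, with $c^2 = \mathbb{E}_{q(f)}\left[f^2\right]$, gives
$$\Delta\mathcal{L}^* = -\log \phi(c^2) + \mathbb{E}_{q(f)}\left[\log \phi(f^2)\right]$$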
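To get a feeling for the size of this gap, the sketch below evaluates $\Delta\mathcal{L}^*$ for a single latent value $f \sim \mathcal{N}(\text{mean}, \text{std}^2)$ and the logistic $\phi(r) = \cosh^{-1}(\sqrt{r}/2)$ from A.6, using Gauss-Hermite quadrature for the expectation. The function names and the chosen values are illustrative assumptions. Because $\log \phi$ is convex, Jensen's inequality implies that the gap is non-negative and vanishes when $q(f)$ collapses to a point.

```python
import numpy as np


def log_phi(r):
    """log phi(r) for the logistic likelihood, phi(r) = 1 / cosh(sqrt(r) / 2)."""
    return -np.log(np.cosh(np.sqrt(r) / 2.0))


def elbo_gap(mean, std, n_points=64):
    """Delta L* = -log phi(E[f^2]) + E[log phi(f^2)] for f ~ N(mean, std^2)."""
    # Probabilists' Gauss-Hermite rule: weight function exp(-x^2 / 2)
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalise to an N(0, 1) expectation
    f = mean + std * nodes
    c2 = mean**2 + std**2  # c^2 = E_q[f^2]
    expected_log_phi = np.sum(weights * log_phi(f**2))
    return -log_phi(c2) + expected_log_phi


for std in (0.0, 0.5, 1.0, 2.0):
    print(f"std = {std:3.1f}  gap = {elbo_gap(0.5, std):.5f}")
```

The gap grows with the variance of $q(f)$, which matches the intuition that the augmented bound is looser when the posterior over $f$ is broad.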
A.4.4 Sparse ELBO
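When using the inducing points approach, the ELBO becomes:
$$\mathrm{ELBO} = \sum_{i=1}^{N} \mathbb{E}_{q(f_i, u_i, \omega_i)}[\log p(y_i \mid f_i, u_i, \omega_i)] - \mathrm{KL}[q(u) \| p(u)] - \sum_{i=1}^{N} \mathrm{KL}[q(\omega_i) \| p(\omega_i)]$$
With $\kappa = K_{XZ} K_Z^{-1}$ as in Algorithm 3, the individual terms are
$$\begin{aligned}
\mathbb{E}_q[\log p(y_i \mid f_i, \omega_i, \theta)] &= \log C(\theta) + g(y_i, \theta)\, \mathbb{E}_{q(f, u)}[f_i] - \mathbb{E}_{q(f, u)}\left[h(f_i, y_i)^2\right] \mathbb{E}_{q(\omega_i)}[\omega_i] \\
&= \log C(\theta) + g(y_i, \theta)\, (\kappa m)_i - \left( \alpha(y_i) - \beta(y_i)\, (\kappa m)_i + \gamma(y_i)\left( (\kappa m)_i^2 + (\kappa S \kappa^\top)_{ii} \right) \right) \bar{\omega}_i
\end{aligned}$$
$$\mathrm{KL}[q(u) \| p(u)] = \frac{1}{2} \left( \log \frac{|K_Z|}{|S|} - M + \mathrm{tr}(K_Z^{-1} S) + (\mu_0 - m)^\top K_Z^{-1} (\mu_0 - m) \right),$$
where $M$ is the number of inducing points, and
$$\mathrm{KL}[q(\omega_i) \| p(\omega_i)] = -\mathbb{E}_{q(\omega_i)}\left[c_i^2 \omega_i\right] - \log \phi(c_i^2) = -c_i^2 \bar{\omega}_i - \log \phi(c_i^2)$$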
A.5 Proof of equivalence between Jaakkola bound and data augmentation
Jaakkola and Jordan (2000) proposed an approach purely based on optimization. They assume that $\log p(y \mid f)$ contains
a part that is convex in $f^2$: $\log p(y \mid f) = \log p_{\mathrm{convex}}(f) + \log p_{\mathrm{non\text{-}convex}}(f, y)$. Using convexity, they construct a
bound from a first-order Taylor expansion around an additional variable $c^2$:
$$\log p_c(f) \geq \log p_c(c) + \frac{d \log p_c(c)}{d c^2} \left(f^2 - c^2\right)$$
Inserting this bound into the full ELBO yields a term that is quadratic in $f$ and analytically differentiable, so that one only needs
to optimize the additional variables $\{c_i\}$. Merkle (2014) shows that any completely monotone function is log-convex,
i.e. $\log \phi(r)$ is convex. Therefore we can replace $\log p_c(c)$ by $\log \phi(r)$ to recover our model in the context of variational
inference. Note that the converse does not hold; complete monotonicity is therefore a stronger assumption.
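The tangent bound above can be checked numerically. The sketch below uses the logistic $\phi(r) = \cosh^{-1}(\sqrt{r}/2)$ from A.6 as the convex part, with a hand-derived derivative of $\log\phi$; the function names and the expansion point $c = 1.5$ are illustrative assumptions.

```python
import numpy as np


def log_phi(r):
    """log phi(r) for phi(r) = 1 / cosh(sqrt(r) / 2); convex in r."""
    return -np.log(np.cosh(np.sqrt(r) / 2.0))


def dlog_phi(r):
    """Derivative of log phi with respect to r (valid for r > 0)."""
    s = np.sqrt(r)
    return -np.tanh(s / 2.0) / (4.0 * s)


def tangent_bound(f, c):
    """First-order expansion of log phi in the variable f^2 around c^2."""
    return log_phi(c**2) + dlog_phi(c**2) * (f**2 - c**2)


f = np.linspace(-6.0, 6.0, 241)
c = 1.5
slack = log_phi(f**2) - tangent_bound(f, c)
print("smallest slack:", slack.min())  # non-negative, and close to zero near |f| = c
```

The bound is quadratic in $f$, which is what makes the resulting update for $q(f)$ Gaussian.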
A.6 Likelihoods used for the experiments
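We detail all likelihoods used for the experiments and their formulation as in equation (4).
Laplace likelihood: $\mathrm{Laplace}(y \mid f, \beta) = \frac{1}{2\beta} \exp\left( -\frac{|f - y|}{\beta} \right)$
Logistic likelihood: $p(y \mid f) = \sigma(yf) = \frac{e^{yf/2}}{2 \cosh(|f|/2)}$
Student-T likelihood: $p(y \mid f) = \frac{\Gamma((\nu + 1)/2)}{\Gamma(\nu/2)\sqrt{\pi\nu}} \left( 1 + \frac{(y - f)^2}{\nu} \right)^{-(\nu + 1)/2}$
Matern 3/2 likelihood: $p(y \mid f) = \frac{\sqrt{3}}{4\rho} \left( 1 + \frac{\sqrt{3 (y - f)^2}}{\rho} \right) \exp\left( -\frac{\sqrt{3 (y - f)^2}}{\rho} \right)$
Likelihood | $C(\theta)$ | $g(y, \theta)$ | $\|h(y, f, \theta)\|_2^2$ | $\alpha(y)$ | $\beta(y)$ | $\gamma(y)$ | $\phi(r)$
Laplace | $(2\beta)^{-1}$ | $0$ | $(y - f)^2$ | $y^2$ | $2y$ | $1$ | $e^{-\sqrt{r}/\beta}$
Logistic | $2^{-1}$ | $y/2$ | $f^2$ | $0$ | $0$ | $1$ | $\cosh^{-1}(\sqrt{r}/2)$
Student-T | $\Gamma((\nu + 1)/2) / (\Gamma(\nu/2)\sqrt{\pi\nu})$ | $0$ | $(y - f)^2$ | $y^2$ | $2y$ | $1$ | $(1 + r/\nu)^{-(\nu + 1)/2}$
Matern 3/2 | $\sqrt{3}/(4\rho)$ | $0$ | $(y - f)^2$ | $y^2$ | $2y$ | $1$ | $(1 + \sqrt{3r}/\rho)\, e^{-\sqrt{3r}/\rho}$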
A.7 Extra figures
A.7.1 Autocorrelation plots
Figure 4. Auto-correlation plots for differents with lags from 1 to 10
A.7.2 HMC Results
/nstep  Metric            1      2      5      10
0.01    Time/Sample (s)   0.037  0.045  0.077  0.133
        Lag 1             0.999  0.993  0.978  0.963
        Gelman            3.14   1.02   1.00   2.05
0.05    Time/Sample (s)   0.036  0.046  0.080  0.12
        Lag 1             0.999  0.998  0.931  0.948
        Gelman            1.72   1.18   1.01   3.25
0.1     Time/Sample (s)   0.033  0.042  0.073  0.13
        Lag 1             0.997  0.996  0.998  0.994
        Gelman            1.11   1.04   1.27   2.71
Table 3. HMC results for the Laplace likelihood
/nstep  Metric            1      2      5      10
0.01    Time/Sample (s)   0.675  0.110  0.177  0.251
        Lag 1             0.999  0.999  0.997  0.993
        Gelman            3.14   1.74   1.11   1.02
0.05    Time/Sample (s)   0.148  0.192  0.336  0.573
        Lag 1             0.997  0.993  0.962  0.857
        Gelman            1.10   1.02   1.00   1.00
0.1     Time/Sample (s)   0.142  0.193  0.337  NA
        Lag 1             0.993  0.976  0.864  NA
        Gelman            1.03   1.01   1.00   NA
Table 4. HMC results for the Student-T likelihood
/nstep  Metric            1      2      5      10
0.01    Time/Sample (s)   0.009  0.013  0.021  0.041
        Lag 1             0.999  0.999  0.998  0.994
        Gelman            3.19   1.68   1.12   1.02
0.05    Time/Sample (s)   0.011  0.014  0.025  0.41
        Lag 1             0.998  0.994  0.968  0.871
        Gelman            1.11   1.03   1.00   1.00
0.1     Time/Sample (s)   0.011  0.014  0.024  0.048
        Lag 1             0.994  0.979  0.875  0.532
        Gelman            1.02   1.01   1.00   1.00
Table 5. HMC results for the Logistic likelihood
A.7.3 ELBO difference
a) Student-T likelihood on the Boston Housing dataset
b) Laplace likelihood on the Boston Housing dataset
Figure 5. Converged negative ELBO and averaged negative log-likelihood on a held-out dataset as a function of the RBF
kernel lengthscale, training VI with and without augmentation.
A.7.4 Convergence speed
a) Logistic likelihood on the HIGGS dataset
b) Matern 3/2 likelihood on the Airline dataset
c) Student-T likelihood on the Protein dataset
Figure 6. Supplementary convergence plots
6
Flexible and Efficient Inference with
Particles for the Variational
Gaussian Approximation
This last published work is different from the previous chapters. Instead of focusing on the representation
of the model, we aim at changing the representation of the variational distribution. The original motivation
behind this work was to answer the question: Can we fit a full Gaussian variational distribution to a
target distribution without matrix inverses, log-determinants, or second-order derivative computations?
The answer resulted in a particle approach: we parametrize the distribution with an arbitrary number
of points in the variable domain instead of the mean and covariance. Although the method might not
be a state-of-the-art approach for variational inference, it brings insights concerning the convergence speed
and the accuracy of the obtained posterior.
Authors:
Théo Galy-Fajou,1, Valerio Perrone,2, Manfred Opper1,3
1TU Berlin, Germany, 2Amazon Web Services, 3University of Birmingham
Details:
Type: Journal article
Submitted: June 2021
Accepted: July 2021
DOI: https://doi.org/10.3390/e23080990
Journal: Entropy (Special edition on Approximate Bayesian Inference)
License: Creative Commons Attribution (CC BY 4.0)
Contributions:
For an explanation of the terms see the Contributor Roles Taxonomy (CRediT)
T.G-F. V.P. M.O.
Conceptualization ✓ ✓
Methodology ✓ ✓ ✓
Formal Analysis ✓
Software ✓
Investigation ✓
Writing - Original Draft ✓ ✓ ✓
Writing - Review & Editing ✓ ✓ ✓
Supervision ✓
Funding Acquisition ✓
entropy
Article
Flexible and Efficient Inference with Particles for the
Variational Gaussian Approximation
Théo Galy-Fajou 1,*, Valerio Perrone 2 and Manfred Opper 1,3
Citation: Galy-Fajou, T.; Perrone, V.;
Opper, M. Flexible and Efficient
Inference with Particles for the
Variational Gaussian Approximation.
Entropy 2021,23, 990. https://doi.org/
10.3390/e23080990
Academic Editor: Pierre Alquier
Received: 22 June 2021
Accepted: 21 July 2021
Published: 30 July 2021
Publisher’s Note: MDPI stays neutral
with regard to jurisdictional claims in
published maps and institutional affil-
iations.
Copyright: © 2021 by the authors.
Licensee MDPI, Basel, Switzerland.
This article is an open access article
distributed under the terms and
conditions of the Creative Commons
Attribution (CC BY) license (https://
creativecommons.org/licenses/by/
4.0/).
1Artificial Intelligence Group, Technische Universität Berlin, 10623 Berlin, Germany;
2Amazon Web Services, 10969 Berlin, Germany; [email protected]
3Centre for Systems Modelling and Quantitative Biomedicine, University of Birmingham,
Birmingham B15 2TT, UK
*Correspondence: [email protected]
Abstract:
Variational inference is a powerful framework, used to approximate intractable posteriors
through variational distributions. The de facto standard is to rely on Gaussian variational families,
which come with numerous advantages: they are easy to sample from, simple to parametrize,
and many expectations are known in closed-form or readily computed by quadrature. In this
paper, we view the Gaussian variational approximation problem through the lens of gradient flows.
We introduce a flexible and efficient algorithm based on a linear flow leading to a particle-based
approximation. We prove that, with a sufficient number of particles, our algorithm converges linearly
to the exact solution for Gaussian targets, and a low-rank approximation otherwise. In addition to
the theoretical analysis, we show, on a set of synthetic and real-world high-dimensional problems,
that our algorithm outperforms existing methods with Gaussian targets while performing on a par
with non-Gaussian targets.
Keywords: variational inference; Gaussian; particle flow; variable flow
1. Introduction
Representing uncertainty is a ubiquitous problem in machine learning. Reliable
uncertainties are key for decision making, especially in contexts where the trade-off between
exploitation and exploration plays a central role, such as Bayesian optimization [1], active
learning [2], and reinforcement learning [3]. While Bayesian inference is a principled tool to
provide uncertainty estimation, computing posterior distributions is intractable for many
problems of interest. Most sampling methods struggle to scale up to large datasets [4],
while the diagnosis of convergence is not always straightforward [5]. On the other hand,
Variational Inference (VI) methods can rely on well-understood optimization techniques
and scale well to large datasets, at the cost of an approximation quality depending heavily
on the assumptions made. The Gaussian family is by far the most popular variational
approximation used in VI [6,7]. This is for several reasons. First, Gaussian variational
families are easy to sample from, reparametrize, and marginalize. Second, they are easily
amenable to diagonal covariance approximations, making them scalable to high dimensions.
Third, most expectations are either easily computable by quadrature or Monte Carlo
integration, or known in closed-form.
A large body of work covers different approaches to optimize the Variational Gaussian
Approximation (VGA), with the speed of convergence and the scalability in dimensions
as the main concerns. From the perspective of convergence speed, the major bottleneck
when computing gradients with stochastic estimators is the estimator variance [8]. Particle-based
methods with deterministic paths do not have this issue, and have been proven to
be highly successful in many applications [9–11]. However, can we use a particle-based
algorithm to compute a VGA? If so, what are its properties and is it competitive with other
VGA methods?
In this paper, we attempt to answer these questions by introducing the Gaussian Particle
Flow (GPF), a framework to approximate a Gaussian variational distribution with particles.
GPF is derived from a continuous-time flow, where the necessary expectations over the
evolving densities are approximated by particles. The complexity of the method grows
quadratically with the number of particles but linearly with the dimension, remaining
compatible with other approximations such as structured mean-field approximations.
Using the same dynamics, we also derive a stochastic version of the algorithm, Gaussian
Flow (GF). To show convergence, we prove the decrease in an empirical version of the free
energy that is valid for a finite number of particles. For the special case of $D$-dimensional
Gaussian target densities, we show that $D + 1$ particles are enough to obtain convergence
to the true distribution. We also find, for this case, that convergence is exponentially fast.
Finally, we compare our approach with other VGA algorithms, both in fully controlled
synthetic settings and on a set of real-world problems.
2. Related Work
The goal of Bayesian inference is to carry out computations with the posterior distribution
of a latent variable $x \in \mathbb{R}^D$ given some observations $y$. By Bayes' theorem, the
posterior distribution is $p(x \mid y) = \frac{p(y \mid x) p(x)}{p(y)}$, where $p(y \mid x)$ and $p(x)$ are, respectively, the
likelihood and the prior distribution. Even if the likelihood and the prior are known analytically,
marginalizing out high-dimensional variables in the product $p(y \mid x) p(x)$ in order
to compute quantities such as $p(y)$ is typically intractable. Variational Inference (VI) aims to
simplify this problem by turning it into an optimization one. The intractable posterior is
approximated by the closest distribution within a tractable family, with closeness being
measured by the Kullback-Leibler (KL) divergence, defined by
$$\mathrm{KL}[q(x) \| p(x)] = \mathbb{E}_q[\log q(x) - \log p(x)],$$
where $\mathbb{E}_q[f(x)] = \int f(x) q(x)\, dx$ denotes the expectation of $f$ over $q$. Denoting by $\mathcal{Q}$ a
family of distributions, we look for
$$\arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(x) \| p(x \mid y)].$$
Since $p(y)$ is not computable in an efficient way, we equivalently minimize the upper
bound $\mathcal{F}$:
$$\mathrm{KL}[q(x) \| p(x \mid y)] \leq \mathcal{F}[q] = -\mathbb{E}_q[\log p(y \mid x) p(x)] - H_q, \quad (1)$$
where $H_q$ is the entropy of $q$ ($-\mathbb{E}_q[\log q(x)]$). Here, $\mathcal{F}$ is known as the variational free energy
and $-\mathcal{F}$ is known as the Evidence Lower BOund (ELBO). A diverse set of approaches to
perform VI with Gaussian families $\mathcal{Q}$ have been developed in the literature, which we
review in the following.
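The relation between the KL divergence and the free energy in Equation (1) can be made concrete on a toy conjugate model where everything is available in closed form. The sketch below assumes a scalar latent variable with prior $p(x) = \mathcal{N}(0, 1)$ and likelihood $p(y \mid x) = \mathcal{N}(y \mid x, \sigma^2)$; this model and all function names are illustrative choices, not taken from the paper. In this example $\mathcal{F}[q]$ and $\mathrm{KL}[q \| p(x \mid y)]$ differ only by the constant $-\log p(y)$, so both are minimized by the same $q$.

```python
import numpy as np


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, var_q) || N(mu_p, var_p)] for scalar Gaussians."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


def free_energy(mu_q, var_q, y, noise_var):
    """F[q] = -E_q[log p(y|x) p(x)] - H_q for the toy model described above."""
    neg_exp_log_lik = 0.5 * np.log(2 * np.pi * noise_var) + (
        (y - mu_q) ** 2 + var_q
    ) / (2 * noise_var)
    neg_exp_log_prior = 0.5 * np.log(2 * np.pi) + (mu_q**2 + var_q) / 2
    entropy = 0.5 * np.log(2 * np.pi * np.e * var_q)
    return neg_exp_log_lik + neg_exp_log_prior - entropy


def exact_posterior(y, noise_var):
    """Mean and variance of p(x | y) for the conjugate toy model."""
    var = 1.0 / (1.0 + 1.0 / noise_var)
    return var * y / noise_var, var


def neg_log_evidence(y, noise_var):
    """-log p(y), with p(y) = N(y | 0, 1 + noise_var)."""
    return 0.5 * np.log(2 * np.pi * (1 + noise_var)) + y**2 / (2 * (1 + noise_var))


y, noise_var = 0.7, 0.5
post_mean, post_var = exact_posterior(y, noise_var)
F = free_energy(0.2, 0.3, y, noise_var)
KL = gaussian_kl(0.2, 0.3, post_mean, post_var)
print(F - KL, neg_log_evidence(y, noise_var))  # the two numbers agree
```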
2.1. The Variational Gaussian Approximation
The VGA is the restriction of $\mathcal{Q}$ to be the family of multivariate Gaussian distributions
$q(x) = \mathcal{N}(m, C)$, where $m \in \mathbb{R}^D$ is the mean and
$C \in \{A \in \mathbb{R}^{D \times D} \mid x^\top A x \geq 0, \forall x \in \mathbb{R}^D\}$ is
the covariance matrix, for which the free energy is found to be
$$\mathcal{F}[q] = -\frac{1}{2} \log |C| + \mathbb{E}_q[\varphi(x)], \quad (2)$$
where $\varphi(x) = -\log(p(y \mid x) p(x))$. A standard descent algorithm based on gradients of
Equation (2) with respect to the variational parameters $m$, $C$ gives rise to some issues. First,
naively computing the gradient of the expectation with respect to the covariance matrix
$C$ involves unwanted second derivatives of $\varphi(x)$ [12], which may not be available or
may be computationally too expensive in a black-box setting. Second, the gradient of the
entropy term $H_q$ entails inverting a non-sparse matrix, which we would like to avoid
for higher-dimensional cases. Finally, the positive-definiteness of the covariance matrix
leads to non-trivial constraints on parameter updates, which can lead to a slowdown of
convergence or, if ignored, to instabilities in the algorithm.
To solve these issues, a variety of approaches have been proposed in the literature.
If we focus on factorizable models, we can make a simplification: for problems with
likelihoods that can be rewritten as $p(y \mid x) = \prod_{d=1}^{D} p(y \mid x_d)$, the number of independent
variational parameters is reduced to $2D$ [12,13]. In this special case, the Gaussian expectations
in the free energy (2) split into a sum of 1-dimensional integrals, which can be
efficiently computed by using numerical quadrature methods. To extend to the general
case, gradients of the free energy are estimated by a stochastic sampling approach, which
also forms the starting point of our method. This relies on the so-called reparametrization
trick, where the expectation over the parameter-dependent variational density $q_\theta$ is replaced
by an expectation over a fixed density $q_0$ instead. This facilitates the gradient computation
because unwanted derivatives of the type $\nabla_\theta q_\theta(x)$ are avoided. For the Gaussian case,
the reparametrization trick is a linear transformation expressing an arbitrary $D$-dimensional Gaussian
random variable $x \sim q_\theta(x)$ in terms of a $D$-dimensional Gaussian random variable
$x_0 \sim q_0 = \mathcal{N}(m_0, C_0)$:
$$x = \Gamma(x_0 - m_0) + m, \quad (3)$$
where $\Gamma \in \mathbb{R}^{D \times D}$ and $m \in \mathbb{R}^D$ are the variational parameters. We assume that the covariance
$C_0$ is not degenerate and, for simplicity, we set it as the identity. For instance,
the gradient with respect to the mean $m$ of the expectation of a function $f$ under $q$ becomes
$\nabla_m \mathbb{E}_q[f(x)] = \mathbb{E}_{q_0}\left[\nabla_m f(\Gamma(x_0 - m_0) + m)\right]$. This can be simply proved by using the
reparametrization (3) inside the integral and passing the gradient inside; for more details, see [14].
Given this representation, the free energy is easily obtained as a function of the
variational parameters:
$$\mathcal{F}(q) = -\log |\Gamma| + \mathbb{E}_{q_0}\left[\varphi(\Gamma(x_0 - m_0) + m)\right]. \quad (4)$$
[13]
and Ong et al.
[15]
use a different
reparametrization with a factorized structure of the covariance
C=Γ>Γ+diag(d)
, where
Γ∈RD×P
and
d∈RD
, with
P≤D
is the rank of
Γ>Γ
. Other representations assume
special structures of the precision matrix
Λ=C−1
, which allow you to enforce special
properties, such as sparsity in [16,17].
In general, these methods tend to scale poorly with the number of dimensions, as one
needs to optimize
D(D+
3
)/
2 parameters. The (structured) Mean-Field
(MF)
[
18
,
19
] approach
imposes independence between variables in the variational distribution. The number of
variational parameters is then 2
D
, but covariance information between dimensions is lost.
2.2. Natural Gradients
Besides the issue of expectations, more efficient optimization directions, beyond
ordinary gradient descent, have been considered. These can help to deal with constraints
such as those given for the covariance matrix. Natural gradients [20] are a special case of
Riemannian gradients and utilize the specific Riemannian manifold structure of variational
parameters. They can often deal with constraints of parameters (such as the positive
definiteness of the covariance), accelerate inference, and improve the convergence of
algorithms. The application of such advanced gradient methods typically requires an
estimate of the inverse Fisher information matrix as a preconditioner of ordinary gradients.
Khan and Nielsen [21] and Lin et al. [22] propose a solution that requires extra second
derivatives of the log-posteriors. Salimbeni et al. [23] developed an automatic process to
compute these without the second derivatives but with instability issues. Lin et al. [17]
solved these issues by using geodesics on the manifold of parameters, at the price of having
to compute inverse matrices as well as Hessians.
2.3. Particle-Based VI
Stochastic gradient descent methods compute expectations (and gradients) at each
time step with new independent Monte Carlo samples drawn from the current approximation
of the variational density. Particle-based methods for variational inference draw
samples only once at the beginning of the algorithm instead. They iteratively construct
transformations of an initial random variable (having a simple tractable density) where the
transformed density leads to the decrease and finally to the minimum of the variational free
energy. The iterative approach induces a deterministic temporal flow of random variables
which depends on the current density of the variable itself. Using an approximation by the
empirical density (which is represented by the positions of a set of 'particles') one obtains a
flow of interacting particles which converges asymptotically to an empirical approximation
of the desired optimal variational density.
The most popular approach is Stein Variational Gradient Descent (SVGD) [24], which
computes a nonparametric transformation based on the kernelized Stein discrepancy [9].
SVGD has the advantage of not being restricted to a parametric form of the variational
distribution. However, using standard distance-based kernels like the squared exponential
kernel ($k(x, y) = \exp(-\|x - y\|_2^2 / 2)$) can lead to underestimated covariances and poor
performance in high dimensions [11,25]. Hence, it is interesting to develop particle approaches
that approximate the VGA. We provide a more thorough comparison between our method
and SVGD in Section 3.6.
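For readers unfamiliar with SVGD, a minimal sketch of one update step with the squared exponential kernel quoted above is given below. It is a textbook-style illustration under simplifying assumptions (fixed unit bandwidth, constant step size, a standard normal target), not the implementation used in the experiments, and the function name and arguments are hypothetical.

```python
import numpy as np


def svgd_step(x, grad_log_p, step_size=0.1):
    """One SVGD update for particles x of shape (N, D), kernel k(x, y) = exp(-||x - y||^2 / 2)."""
    diff = x[:, None, :] - x[None, :, :]  # diff[j, i] = x_j - x_i
    k = np.exp(-0.5 * np.sum(diff**2, axis=-1))  # k[j, i] = k(x_j, x_i)
    # Driving term: sum_j k(x_j, x_i) * grad log p(x_j)
    drive = k.T @ grad_log_p(x)
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i) = -sum_j k(x_j, x_i) (x_j - x_i)
    repulse = -np.sum(k[:, :, None] * diff, axis=0)
    return x + step_size * (drive + repulse) / x.shape[0]


rng = np.random.default_rng(0)
particles = 3.0 + 0.5 * rng.standard_normal((50, 1))  # initial particles far from the target
for _ in range(500):
    particles = svgd_step(particles, lambda x: -x)  # target N(0, 1): grad log p(x) = -x
print("mean:", particles.mean(), "std:", particles.std())
```

The particles are drawn once and then moved deterministically, which is the feature shared by all the particle-based methods discussed in this section.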
2.4. GVA in Bayesian Neural Networks
There has been increased interest in making Bayesian Neural Networks (BNN) by adding
priors to neural network parameters. The true form of the posterior is unknown, but the
VGA has been used due to its ease of use and scalability with the number of dimensions
(typically $D \gg 10^5$). Most of the aforementioned methods apply to BNN, but techniques
have been specifically tailored with BNN in mind. Ref. [26] uses the low-rank structure of [13]
but exploits the Local Reparametrization Trick, where each datapoint $y_i$ gets a different sample
from $q$ in order to reduce the stochastic gradient estimator variance. In Stochastic Weight
Averaging-Gaussian (SWAG) [27], a set of particles obtained via stochastic gradient
descent represents a low-rank Gaussian distribution, approximating the true posterior under
a prior induced by the network's regularization. While easy to implement,
SWAG does not allow one to incorporate an explicit prior, and the resulting distribution
does not derive from a principled Bayesian approach.
2.5. Related Approaches
The closest approach to our proposed method is the Ensemble Kalman Filter (EKF) [28].
It assumes that the posterior is computed in a sequential way, where, at each time step, only
single data observations (or small batches of them), represented by their likelihoods, become
available. An ensemble of particles, representing a Gaussian distribution, is iteratively
updated with every new batch of observations. EKF allows us to work on high-dimensional
problems with a limited number of particles but is restricted to factorizable likelihoods for
which a sequential representation is possible. While EKF maintains a representation of a
Gaussian posterior, it is not clear how this relates to the goal of minimizing the free energy
or the KL divergence.
3. Gaussian (Particle) Flow
We introduce Gaussian Particle Flow (GPF) and Gaussian Flow (GF), two computationally
tractable approaches to obtain a Variational Gaussian Approximation (VGA). In the
following, we derive deterministic linear dynamics that decrease the variational free
energy. We additionally give some variants with a Mean-Field (MF) approach and prove
theoretical convergence guarantees.
In the following, $\frac{d(\cdot)}{dt}$ indicates the total derivative with respect to time,
$\frac{\partial(\cdot)}{\partial t}$ the partial derivative with respect to time, and
$\nabla_x(\cdot)$ the gradient with respect to a vector $x$.
3.1. Gaussian Variable Flows
We next discuss an alternative approach to generate the desired transformation of
random variables, leading from a simple (prior) Gaussian density to a more complex
Gaussian, which minimizes the variational free energy. It is based on the idea of variable
flows, i.e., recursive deterministic transformations of the random variables defined by a
mapping $x^{n+1} = x^n + \epsilon f^n(x^n)$, where $f^n: \mathbb{R}^D \to \mathbb{R}^D$.
Well-known examples of flows are Normalizing Flows [29], where $f^n$ are bijections, or
Neural ODEs [30], where $f^n = f$ is defined by a neural network and $x^0$ is the input. For
simplicity, we will consider small changes $\epsilon \to 0$ and work with flows in the
continuous-time limit ($t = n\epsilon$), which follow a system of Ordinary Differential
Equations (ODE). For the Gaussian case, in the spirit of the reparametrization trick (3), we
choose a corresponding linear map $f$ and write
$$\frac{dx^t}{dt} = f^t(x^t) = A^t(x^t - m^t) + b^t, \tag{5}$$
where $A^t$ is a matrix and $m^t \doteq \mathbb{E}_{q^t}[x]$ (which is no longer interpreted
as an independent variational parameter). When the initial random variable $x^0$ is Gaussian
distributed, the vectors $x^t$ are also Gaussian for any $t$. To construct a flow that
decreases the free energy over time, we can either compute the time derivative of the specific
free energy (2) induced by the ODE (5), or simply derive the general result valid for smooth
maps $f$ (see, e.g., [24]). To be self-contained, we briefly repeat the main steps. We first
compute the change of the free energy in terms of the time derivative of $q^t$:
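$$\begin{aligned}
\frac{dF[q^t]}{dt} &= \frac{d}{dt}\int q^t(x)\left[\log q^t(x) + \phi(x)\right]dx \\
&= \int \frac{\partial q^t(x)}{\partial t}\left[\log q^t(x) + \phi(x)\right]dx
 + \int q^t(x)\left[\frac{\partial q^t(x)}{\partial t}\frac{1}{q^t(x)} + \frac{\partial \phi(x)}{\partial t}\right]dx \\
&= \int \frac{\partial q^t(x)}{\partial t}\left[\log q^t(x) + \phi(x)\right]dx,
\end{aligned}$$
where we have used the fact that
$\int \frac{\partial q^t(x)}{\partial t}dx = \frac{d}{dt}\int q^t(x)dx = 0$ and
$\frac{\partial \phi(x)}{\partial t} = 0$. We next use the continuity equation for the density,
$$\frac{\partial q^t(x)}{\partial t} = -\nabla_x \cdot \left(q^t(x) f^t(x)\right),$$
related to the deterministic flow, to obtain
$$\begin{aligned}
\frac{dF[q^t]}{dt} &= -\int \nabla_x \cdot \left(q^t(x) f^t(x)\right)\left[\log q^t(x) + \phi(x)\right]dx \\
&= \int q^t(x) f^t(x) \cdot \nabla_x\left[\log q^t(x) + \phi(x)\right]dx \\
&= \int \left[q^t(x) f^t(x) \cdot \nabla_x \log q^t(x) + q^t(x) f^t(x) \cdot \nabla_x \phi(x)\right]dx \\
&= \int \left[\nabla_x q^t(x) \cdot f^t(x) + q^t(x) f^t(x) \cdot \nabla_x \phi(x)\right]dx \\
&= -\mathbb{E}_{q^t}\left[\nabla_x \cdot f^t(x) - f^t(x) \cdot \nabla_x \phi(x)\right],
\end{aligned}$$
where we have applied Green's identity twice and used the fact that
$\lim_{\|x\| \to \infty} q^t(x) = 0$.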
Specializing to the linear flow (5), we obtain
$$\frac{dF[q^t]}{dt} = -\mathrm{tr}\left[A^t (A^t_\star)^\top\right] - (b^t)^\top b^t_\star, \tag{6}$$
where
$$A^t_\star \doteq I - \mathbb{E}_{q^t}\left[\nabla_x \phi(x)(x - m^t)^\top\right], \qquad
b^t_\star \doteq -\mathbb{E}_{q^t}\left[\nabla_x \phi(x)\right]. \tag{7}$$
Equation (6) represents the change in the free energy $F$ for an infinitesimal change in the
variables $x$ given by the flow (5). Obviously, the simplest choices
$$A^t \equiv A^t_\star, \qquad b^t \equiv b^t_\star \tag{8}$$
lead to a decrease in the free energy, $\frac{dF[q^t]}{dt} \le 0$. More detailed derivations
are given in Appendix A. Additionally, equality only holds when
$$I - \mathbb{E}_{q}\left[\nabla_x \phi(x)(x - m)^\top\right] = 0, \qquad
\mathbb{E}_{q}\left[\nabla_x \phi(x)\right] = 0. \tag{9}$$
Using Stein's lemma [31], we can show that these fixed-point solutions are equal to the
conditions for the optimal variational Gaussian distribution solution given in [12]. In
Appendix C, we show that our parameter updates can be interpreted as a Riemannian
gradient descent method for the free energy (4). This is based on the metric introduced by
([20], Theorem 7.6) as an efficient technique for learning the mixing matrix in models of
blind source separation. This gradient should not be confused with the so-called natural
gradient obtained by pre-multiplying with the inverse Fisher information matrix.
Of course, there are other choices for $A^t$ and $b^t$ that lead to a decrease in the free
energy and to the same fixed-point equations. In Section 3.6, we discuss how SVGD with a
linear kernel can lead to the same fixed points but with different dynamics.
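As a sanity check on (6) to (9), the expectations can be evaluated in closed form for a Gaussian target. The worked example below is only an illustration: it assumes $\phi(x) = \frac{1}{2}(x-\mu)^\top \Lambda (x-\mu)$, where the target mean $\mu$, the target precision $\Lambda$ and the covariance $C^t$ of $q^t$ are symbols introduced here for the example.

```latex
\begin{aligned}
\nabla_x \phi(x) &= \Lambda (x - \mu), \\
b^t_\star &= -\mathbb{E}_{q^t}\left[\Lambda (x - \mu)\right] = -\Lambda (m^t - \mu), \\
A^t_\star &= I - \Lambda\, \mathbb{E}_{q^t}\left[(x - \mu)(x - m^t)^\top\right] = I - \Lambda C^t, \\
\left.\frac{dF[q^t]}{dt}\right|_{A^t = A^t_\star,\; b^t = b^t_\star}
  &= -\left\| I - \Lambda C^t \right\|_F^2 - \left\| \Lambda (m^t - \mu) \right\|_2^2 \;\le\; 0 .
\end{aligned}
```

Under this assumption, the right-hand side vanishes only for $m^t = \mu$ and $C^t = \Lambda^{-1}$, which is the fixed point (9) and coincides with the exact Gaussian target.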
3.2. From Variable Flows to Parameter Flows
Before we introduce the particle algorithm, we show that the results for the variable
flow can also be converted into a temporal change of the parameters $\Gamma^t$, $m^t$, as
defined for Equation (3). From this, a corresponding Gaussian Flow (GF) algorithm can be
easily derived. By differentiating the parametrisation $x^t = \Gamma^t(x^0 - m^0) + m^t$
(with $m^t$ now considered as a free variational parameter) with respect to time $t$ and
using (5), we obtain
$$\frac{dx^t}{dt} = \frac{d\Gamma^t}{dt}(x^0 - m^0) + \frac{dm^t}{dt} = A^t(x^t - m^t) + b^t. \tag{10}$$
By inserting $x^t = \Gamma^t(x^0 - m^0) + m^t$ into the right-hand side of (10), and using the
optimal parameters from (7), we obtain
$$\frac{d\Gamma^t}{dt} = \Gamma^t - \mathbb{E}_{q^0}\left[\nabla_x \phi(x^t)(x^0 - m^0)^\top\right](\Gamma^t)^\top \Gamma^t, \qquad
\frac{dm^t}{dt} = -\mathbb{E}_{q^0}\left[\nabla_x \phi(x^t)\right]. \tag{11}$$
Note that the expectations are over the probability distribution of the initial random
variable $x^0$. Discretizing Equations (11) in time, and estimating the expectations by
drawing independent samples from the fixed Gaussian $q^0$ at each time step, we obtain our
GF algorithm to minimize the variational free energy in the space of Gaussian densities.
We summarize the steps of GF in Algorithm 1. Remarkably, this scheme differs from
previous VGA algorithms with Riemannian gradients based on the Fisher information
metric (see, e.g., [17,32]) because no matrix inversions or second-order derivatives of the
function $\phi$ are required. GF also allows for the computation of a low-rank VGA by
enforcing $\Gamma \in \mathbb{R}^{D \times K}$ and $x^0 \in \mathbb{R}^K$. This algorithm
scales linearly in the number of dimensions and quadratically in the rank $K$ of the
covariance.
It is interesting to note that the reverse construction of a variable flow from a parameter
flow is, in general, not possible. This would require the ability to eliminate all variational
parameters and the initial variables $x^0$ in the resulting differential equation for $x^t$,
and replace them with functions of $x^t$ alone. For instance, if we eliminate the initial
variables $x^0$ in terms of $(\Gamma^t)^{-1}$ and $x^t$ in the algorithm of [14], the resulting
expression still depends on $\Gamma^t$.
3.3. Particle Dynamics
The main idea of the particle approach is to approximate the Gaussian density $q^t$ in (7)
by the empirical distribution
$$\hat{q}^t \doteq \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i^t) \tag{12}$$
computed from $N$ samples $x_i^t$, $i = 1, \ldots, N$. These are initially sampled from the
density $q^0$ at time $t = 0$ and are then propagated using the discretized dynamics of the
ODE (5):
$$\frac{dx_i^t}{dt} = -\eta_1^t\, \mathbb{E}_{\hat{q}^t}\left[\nabla_x \phi(x)\right] + \eta_2^t \hat{A}^t (x_i^t - \hat{m}^t), \tag{13}$$
where
$$\hat{A}^t = I - \frac{1}{N}\sum_{i=1}^{N} \nabla_x \phi(x_i^t)(x_i^t - \hat{m}^t)^\top, \qquad
\hat{b}^t = \frac{1}{N}\sum_{i=1}^{N} \nabla_x \phi(x_i^t), \qquad
\hat{m}^t = \frac{1}{N}\sum_{i=1}^{N} x_i^t,$$
and where $\eta_1^t$ and $\eta_2^t$ are learning rates (we further comment on the use of
different optimization schemes in Section 4.4). Note that although
$\mathbb{E}_{\hat{q}^t}\left[\nabla_x \phi(x)(x - \hat{m}^t)^\top\right]$ is a $D \times D$
matrix, changing the matrix multiplication order leads to a computational complexity of
$O(N^2 D)$ with a storage complexity of $O(N(N + D))$, since neither the empirical covariance
matrix nor $\hat{A}^t$ needs to be explicitly computed.
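The reordering mentioned above can be made concrete with a small NumPy sketch. It is an illustration rather than the authors' implementation: the function names, the row-wise particle layout `(N, D)` and the random stand-in gradients are assumptions made for the example. Both functions compute $\hat{A}^t(x_i^t - \hat{m}^t)$ for all particles; the second one only ever forms an $N \times N$ matrix.

```python
import numpy as np

def drift_explicit(X, G):
    """Build the D x D matrix A_hat explicitly, then apply it (O(N D^2) time)."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)              # centered particles, shape (N, D)
    A_hat = np.eye(D) - G.T @ Xc / N     # I - (1/N) sum_i g_i (x_i - m)^T
    return Xc @ A_hat.T                  # row i holds A_hat (x_i - m)

def drift_reordered(X, G):
    """Same quantity via the N x N Gram matrix (O(N^2 D) time, O(N (N + D)) memory)."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)
    K = Xc @ Xc.T / N                    # K[i, j] = (x_i - m)^T (x_j - m) / N
    return Xc - K @ G                    # no D x D matrix is ever formed

rng = np.random.default_rng(0)
N, D = 5, 200                            # few particles, many dimensions (illustrative)
X = rng.normal(size=(N, D))
G = rng.normal(size=(N, D))              # stand-in for the gradients of phi
same = np.allclose(drift_explicit(X, G), drift_reordered(X, G))
```

The two results agree because $\frac{1}{N}\sum_j g_j (x_j - \hat{m})^\top (x_i - \hat{m})$ only needs the scalar products between centered particles.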
Relaxation of Empirical Free Energy and Convergence
We have shown that the continuous-time dynamics (10) of the random variables leads
to a decay of the free energy $F(q^t)$ with time $t$. Assuming that the free energy is bounded
from below, one might conjecture that this property would imply the convergence of the
particle algorithm to a fixed point when learning rates are sufficiently small such that the
discrete-time dynamics are approximated well by the continuous limit. Unfortunately, the
finite number $N$ of particles poses an extra problem. The definition of the free energy
$F(q)$ by the KL-divergence (1) for continuous random variables assumes that both $q(\cdot)$
and $p(\cdot|y)$ are densities with respect to the Lebesgue measure. Hence, $F(\hat{q})$ is
not defined if we take $q \equiv \hat{q}$, with $\hat{q}$ the empirical distribution (12) of
the finite particle approximation. Nevertheless, we define a finite-$N$ approximation to the
Gaussian free energy, which is then also found to decay under the finite-$N$ dynamics. Let us
first assume that $N > D$ and define
$$\tilde{F}(\hat{q}^t) \doteq -\frac{1}{2}\log|\hat{C}^t| + \mathbb{E}_{\hat{q}^t}[\phi(x)] \tag{14}$$
with the empirical covariance matrix
$$\hat{C}^t = \frac{1}{N}\sum_{i=1}^{N}\left(x_i^t - \hat{m}^t\right)\left(x_i^t - \hat{m}^t\right)^\top. \tag{15}$$
The definition (14) is chosen in such a way that in the large-$N$ limit, when the empirical
distribution $\hat{q}^t$ converges to a Gaussian distribution $q^t$, the approximation (14)
also converges to $F(q^t)$. It can be shown (see Appendix B) that
$\frac{d\tilde{F}(\hat{q}^t)}{dt} \le 0$, with equality only at the fixed points of the
dynamics.
In applications of our particle method to high-dimensional problems, the limitations
of computational power may force us to restrict particle numbers to be smaller than the
dimensionality $D$. For $N < D + 1$, the empirical covariance $\hat{C}^t$ will be singular
and typically contain only $N - 1$ non-zero eigenvalues, which leads to
$-\log|\hat{C}^t| = \infty$ and makes Equation (14) meaningless. We resolve this issue through
a regularisation of the log-determinant term in (14), replacing all zero eigenvalues of
$\hat{C}^t$ by the value 1, i.e., $\lambda_i = 0 \to \tilde{\lambda}_i = 1$. We show in
Appendix B that the free energy still decays, provided that the dynamics of the particles stay
the same. This regularisation step can be formally stated as a replacement of the empirical
covariance (15) in (14) by
$$\hat{C}^t \to \hat{C}^t + \sum_{i:\,\lambda_i^t = 0} e_i^t (e_i^t)^\top,$$
where $e_i^t$ is the $i$th eigenvector of $\hat{C}^t$.
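The regularised surrogate can be sketched in a few lines of NumPy. This is a hedged illustration, not the paper's code: the function name, the relative tolerance `tol` used to decide which eigenvalues count as zero, and the quadratic test potential are assumptions introduced for the example. Replacing zero eigenvalues by 1 is the same as dropping them from the sum of log-eigenvalues.

```python
import numpy as np

def regularized_free_energy(X, phi, tol=1e-10):
    """Surrogate (14) where zero eigenvalues of the empirical covariance are replaced by 1."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)
    C_hat = Xc.T @ Xc / N                          # empirical covariance (15)
    eigvals = np.linalg.eigvalsh(C_hat)            # real eigenvalues, ascending order
    kept = eigvals[eigvals > tol * eigvals.max()]  # numerically non-zero eigenvalues
    log_det = np.sum(np.log(kept))                 # dropped ones contribute log(1) = 0
    return -0.5 * log_det + np.mean([phi(x) for x in X])

rng = np.random.default_rng(1)
phi = lambda x: 0.5 * x @ x                        # illustrative potential (standard Gaussian)
X_low = rng.normal(size=(3, 5))                    # N = 3 < D + 1 = 6, singular covariance
F_low = regularized_free_energy(X_low, phi)        # finite despite the rank deficiency
X_full = rng.normal(size=(20, 5))                  # N > D, no eigenvalue is dropped
F_full = regularized_free_energy(X_full, phi)
```

For $N > D$ the function reduces to the unregularised definition (14), so the same routine can be used to monitor convergence in both regimes.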
3.4. Algorithm and Properties
The algorithm we propose is to sample $N$ particles $\{x_1^0, \ldots, x_N^0\}$, where
$x_i^0 \in \mathbb{R}^D$, from $q^0$ (which can be centered around the MAP, for example), and
iteratively optimize their positions using Equation (13). Once convergence is reached, i.e.,
$\frac{dF}{dt} = 0$, we can easily make predictions using the converged empirical distribution
$\hat{q}(x) = \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i)$, where $\delta$ is the Dirac delta
function, or, alternatively, the Gaussian density it represents, i.e.,
$q(x) = \mathcal{N}(m, C)$, where $m = \frac{1}{N}\sum_{i=1}^{N} x_i$ and
$C = \frac{1}{N}\sum_{i=1}^{N}(x_i - m)(x_i - m)^\top$. To draw samples from $q$, no
inversions of the empirical covariance $C$ are needed, as we can obtain new samples by
computing
$$x = \frac{1}{\sqrt{N}}\sum_{i=1}^{N} \xi_i\,(x_i - m) + m, \tag{16}$$
where $\xi_i$ are i.i.d. standard normal scalars, $\xi_i \sim \mathcal{N}(0, 1)$. This can be
shown by defining the deviation matrix $\mathbf{D}$, whose columns are
$\mathbf{D}_i = \frac{x_i - m}{\sqrt{N}}$. We naturally have $\mathbf{D}\mathbf{D}^\top = C$,
so $\mathbf{D}$ plays the role of a Cholesky factor of $C$.
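The sampling rule can be checked numerically. The sketch below is a minimal illustration under assumptions: the helper names are hypothetical, particles are stored one per row, and the noise is taken as one standard normal scalar per particle so that the sample covariance equals the deviation matrix times its transpose.

```python
import numpy as np

def deviation_matrix(X):
    """Columns (x_i - m) / sqrt(N); X holds one particle per row, shape (N, D)."""
    N = X.shape[0]
    return (X - X.mean(axis=0)).T / np.sqrt(N)     # shape (D, N)

def sample_gaussian(X, n_samples, rng):
    """Draw from N(m, C) as x = Dev @ xi + m with xi ~ N(0, I_N); C is never factorized."""
    Dev = deviation_matrix(X)
    xi = rng.standard_normal((n_samples, Dev.shape[1]))
    return X.mean(axis=0) + xi @ Dev.T

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                        # N = 4 particles in D = 3 (illustrative)
Dev = deviation_matrix(X)
C = np.cov(X, rowvar=False, bias=True)             # empirical covariance, 1/N normalization
samples = sample_gaussian(X, 1000, rng)
```

Since the deviation matrix has $N$ columns, each new sample costs $O(ND)$ operations, independent of whether $C$ is full rank.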
All the inference steps are summarized in Algorithm 2, and an illustration in two
dimensions is provided in Figure 1.
We summarize the principal points of our approach:
• Gradients of expectations have zero variance, at the cost of a bias that decreases with
the number of particles and is equal to zero for Gaussian targets (see Theorem 1);
• It works with noisy gradients (when using subsampled data, for example);
• The rank of the approximated covariance $C$ is $\min(N - 1, D)$. When $N \le D$, the
algorithm can be used to obtain a low-rank approximation;
• The complexity of our algorithm is $O(N^2 D)$ and the storage complexity is
$O(N(N + D))$. By adjusting the number of particles used, we can control the performance
trade-off;
• GPF (and GF) are also compatible with any kind of structured MF (see Section 3.5);
• Despite working with an empirical distribution, we can compute a surrogate of the
free energy $F(q)$ to optimize hyper-parameters, compute the lower bound of the
log-evidence, or simply monitor convergence.
Figure 1. Illustration of the Gaussian Particle Flow algorithm, with $q^0(x)$ and $p(x)$
representing the initial and target distributions, respectively. Particles are iteratively
moved according to the gradient flow starting from $q^0(x)$, approximating a new Gaussian
distribution $q^t(x)$ at each iteration $t$.
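Algorithm 1: Gaussian Flow (GF)
Input: Number of samples $N$, initial distribution $q^0 = \mathcal{N}(\mu^0, \Gamma^0(\Gamma^0)^\top)$, target $p(x) \propto e^{-\phi(x)}$, learning rates $\eta_1^t, \eta_2^t$
Output: Variational dist. $q(x) = \mathcal{N}(\mu, \Gamma\Gamma^\top)$
for $t$ in $0:T$ do
    $\{x_i^0\}_{i=1}^N \sim q^0$    # Sample $N$ initial particles from $q^0$
    $x_i = \Gamma^t(x_i^0 - \mu^0) + \mu^t, \ \forall i$    # Reparametrize
    $g_i = \nabla_x \phi(x_i), \ \forall i$    # Compute gradients
    $\mu^{t+1} = \mu^t - \eta_1^t \frac{1}{N}\sum_{i=1}^N g_i$    # Update $\mu$
    $A = \frac{1}{N}\sum_i g_i (x_i^0 - \mu^0)^\top (\Gamma^t)^\top - I$    # Compute matrix
    $\Gamma^{t+1} = \Gamma^t - \eta_2^t A \Gamma^t$    # Update $\Gamma$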
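The GF iteration can be sketched in NumPy as follows. This is a sketch under stated assumptions, not the reference implementation (the experiments of the paper are in Julia): the function name and signature are hypothetical, the base variable is taken as standard normal (that is, $x^0 - \mu^0 = z \sim \mathcal{N}(0, I_K)$), and the $\Gamma$ update is written as $A\Gamma$ with $A = \mathbb{E}[g z^\top]\Gamma^\top - I$, which follows from inserting (7) into $d\Gamma^t/dt = A^t\Gamma^t$. The Gaussian target is chosen only so that the result can be compared with known parameters.

```python
import numpy as np

def gaussian_flow(grad_phi, mu, Gamma, n_samples, n_steps, eta1, eta2, rng):
    """Euler discretization of the parameter flow with fresh base samples at every step."""
    mu, Gamma = mu.copy(), Gamma.copy()
    K = Gamma.shape[1]                               # rank of the approximation
    for _ in range(n_steps):
        Z = rng.standard_normal((n_samples, K))      # samples of x0 - mu0 (assumed N(0, I))
        X = Z @ Gamma.T + mu                         # reparametrize: x = Gamma z + mu
        G = grad_phi(X)                              # gradients, shape (n_samples, D)
        S = G.T @ Z / n_samples                      # estimate of E[grad phi(x) z^T], (D, K)
        mu = mu - eta1 * G.mean(axis=0)
        # A Gamma = S (Gamma^T Gamma) - Gamma: only D x K and K x K products are needed
        Gamma = Gamma - eta2 * (S @ (Gamma.T @ Gamma) - Gamma)
    return mu, Gamma

mu_star = np.array([1.0, -1.0])                      # illustrative Gaussian target
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Lam = np.linalg.inv(Sigma)
grad_phi = lambda X: (X - mu_star) @ Lam             # row-wise Lam (x - mu_star)

mu_gf, Gamma_gf = gaussian_flow(grad_phi, np.zeros(2), np.eye(2), n_samples=500,
                                n_steps=3000, eta1=0.05, eta2=0.05,
                                rng=np.random.default_rng(0))
C_gf = Gamma_gf @ Gamma_gf.T                         # covariance of the fitted Gaussian
```

Because new samples are drawn at every step, the iterates keep fluctuating around the optimum; `mu_gf` and `C_gf` are therefore only expected to be close to `mu_star` and `Sigma`, not equal to them.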
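Algorithm 2: Gaussian Particle Flow (GPF)
Input: Number of particles $N$, initial distribution $q^0$, target $p(x) \propto e^{-\phi(x)}$, learning rates $\eta_1^t, \eta_2^t$
Output: Empirical dist. $q(x) = \frac{1}{N}\sum_{i=1}^N \delta_{x, x_i}$
Init: Sample $N$ particles from $q^0$: $\{x_i^0\}_{i=1}^N$
for $t$ in $0:T$ do
    $g_i = \nabla_x \phi(x_i^t), \ \forall i$    # Compute gradients
    $m = \frac{1}{N}\sum_i x_i^t, \quad g = \frac{1}{N}\sum_i g_i$    # Compute means
    $A = \frac{1}{N}\sum_i g_i (x_i^t - m)^\top - I$    # Compute matrix
    $x_i^{t+1} = x_i^t - \eta_1^t g - \eta_2^t A (x_i^t - m), \ \forall i$    # Update particles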
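A direct NumPy transcription of Algorithm 2 might look like the sketch below. It is illustrative only: the function name, the fixed learning rates and the hand-picked initial particles are assumptions, and the $D \times D$ matrix $A$ is formed explicitly for readability even though it can be avoided in high dimensions. The Gaussian target with $N = D + 1$ particles is used because the particle moments can then be compared with the known target parameters.

```python
import numpy as np

def gpf_step(X, grad_phi, eta1, eta2):
    """One iteration of Algorithm 2; X holds one particle per row, shape (N, D)."""
    N, D = X.shape
    G = grad_phi(X)                          # g_i for every particle
    m, g = X.mean(axis=0), G.mean(axis=0)    # particle mean and mean gradient
    A = G.T @ (X - m) / N - np.eye(D)        # (1/N) sum_i g_i (x_i - m)^T - I
    return X - eta1 * g - eta2 * (X - m) @ A.T

mu_star = np.array([1.0, -1.0])              # illustrative Gaussian target
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Lam = np.linalg.inv(Sigma)
grad_phi = lambda X: (X - mu_star) @ Lam     # row-wise Lam (x - mu_star)

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # N = D + 1 = 3 particles
for _ in range(3000):
    X = gpf_step(X, grad_phi, eta1=0.1, eta2=0.1)

m_hat = X.mean(axis=0)
C_hat = np.cov(X, rowvar=False, bias=True)   # empirical covariance with 1/N normalization
```

No random numbers are used after initialization, so repeated runs from the same particles give identical trajectories.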
3.4.1. Relaxation of Empirical Free Energy
The definition of the free energy $F(q)$ from the KL-divergence (1) for continuous
random variables assumes that both $q(\cdot)$ and $p(\cdot|y)$ are densities with respect to
the Lebesgue measure. Hence, it is not a priori clear that a specific approximation
$F(\hat{q}^t)$, based on an empirical distribution
$\hat{q}^t(x) \doteq \frac{1}{N}\sum_{i=1}^{N}\delta(x - x_i^t)$ with a finite number of
particles $N$, will decrease under the particle flow. Thus, we may not be able to guarantee
convergence to a fixed point for finite $N$. Luckily, as we show in Appendix D, we find that
$$\frac{dF(\hat{q}^t)}{dt} = \frac{d}{dt}\left(\mathbb{E}_{\hat{q}^t}[\phi(x)] - \frac{1}{2}\log|C^t|\right) \le 0. \tag{17}$$
For $N < D + 1$, the empirical covariance $C^t$ will typically contain only $N - 1$ non-zero
eigenvalues and lead to $-\log|C^t| = \infty$, making Equation (17) meaningless. We resolve
this issue by introducing a regularized free energy $\tilde{F}$, where $\log|C^t|$ is replaced
by $\sum_{i:\,\lambda_i > 0}\log\lambda_i$ and $\{\lambda_i\}_{i=1}^{D}$ are the eigenvalues
of $C^t$. We show in Appendix D that, given the dynamics from Equation (5), $\tilde{F}$ is
also guaranteed not to increase over time. It can, therefore, be used as a regularized proxy
for the true $F$ to optimize hyper-parameters or to monitor convergence. Note that similar
proofs exist for SVGD [33] and proved to be highly non-trivial.
3.4.2. Dynamics and Fixed Points for Gaussian Targets
We illustrate our method by some exact theoretical results for the dynamics and the
fixed points of our algorithm when the target is a multivariate Gaussian density. While such
targets may seem like a trivial application, our analysis could still provide some insight
into the performance for more complicated densities.
Theorem 1. If the target density $p(x)$ is a $D$-dimensional multivariate Gaussian, only
$D + 1$ particles are needed for Algorithm 2 to converge to the exact target parameters.
Proof. The proof is given in Appendix E.
Theorem 2. For a target $p(x) = \mathcal{N}(x|\mu, \Lambda^{-1})$, i.e., with precision
matrix $\Lambda$, where $x \in \mathbb{R}^D$, and $N \ge D + 1$ particles, the continuous-time
limit of Algorithm 2 will converge exponentially fast for both the mean and the trace of the
precision matrix:
$$m^t - \mu = e^{-\Lambda t}(m^0 - \mu), \qquad
\mathrm{tr}\left((C^t)^{-1} - \Lambda\right) = e^{-2t}\,\mathrm{tr}\left((C^0)^{-1} - \Lambda\right),$$
where $m^t$ and $C^t$ are the empirical mean and covariance matrix at time $t$ and
$\exp(-\Lambda t)$ is the matrix exponential.
Proof. The proof is given in Appendix F.
Our result shows that convergence of the mean $m^t$ directly depends on $\Lambda$. However,
we can also precondition the gradient on $m$ by $C^t$, i.e., using the natural gradient
approximation in the Fisher sense, and eventually get rid of the dependency on $\Lambda$ when
$(C^t)^{-1} \approx \Lambda$.
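The two rates in Theorem 2 can be made plausible with a short calculation. The lines below are only a sketch, assuming a full-rank $C^t$ and the continuous-time dynamics with $\nabla_x \phi(x) = \Lambda(x - \mu)$, so that $A^t_\star = I - \Lambda C^t$; the complete argument is the one referred to in Appendix F.

```latex
\begin{aligned}
\frac{dm^t}{dt} &= -\Lambda (m^t - \mu)
  \quad\Longrightarrow\quad m^t - \mu = e^{-\Lambda t}(m^0 - \mu), \\
\frac{dC^t}{dt} &= A^t_\star C^t + C^t (A^t_\star)^\top
  = 2 C^t - \Lambda C^t C^t - C^t C^t \Lambda, \\
\frac{d (C^t)^{-1}}{dt} &= -(C^t)^{-1} \frac{dC^t}{dt} (C^t)^{-1}
  = -2 (C^t)^{-1} + (C^t)^{-1} \Lambda C^t + C^t \Lambda (C^t)^{-1}, \\
\frac{d}{dt}\,\mathrm{tr}\!\left((C^t)^{-1} - \Lambda\right)
  &= -2\,\mathrm{tr}\!\left((C^t)^{-1} - \Lambda\right).
\end{aligned}
```

With the preconditioned mean update $\frac{dm^t}{dt} = -C^t \Lambda (m^t - \mu)$, the rate matrix $C^t\Lambda$ approaches the identity once $(C^t)^{-1} \approx \Lambda$, which is why the dependency on $\Lambda$ disappears in that regime.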
The exponential relaxation of fluctuations also manifests itself in the decay of the free
energy towards its minimum. For the Gaussian target, the free energy exactly separates
into two terms corresponding to the mean and fluctuations. We can write
$F(m^t, C^t) = \frac{1}{2}(m^t - \mu)^\top \Lambda (m^t - \mu) + \frac{D}{2} + F_{fl}(C^t)$,
where the nontrivial fluctuation part (with its minimum subtracted) is given by
$$F_{fl}(C^t) = -\frac{1}{2}\log|C^t| + \frac{1}{2}\mathrm{tr}(\Lambda C^t - I).$$
We can show that
$$-\lim_{t \to \infty}\frac{d \ln F_{fl}(C^t)}{dt} \ge 4,$$
indicating an asymptotic decrease in $F_{fl}(C^t)$ faster than $e^{-4t}$, independent of the
target. We can also prove the finite-time bound
$$F_{fl}(C^t) \le F_{fl}(C^0)\exp\left(-\frac{2t}{\mathrm{tr}(\Lambda^{-1})\left(\mathrm{tr}(\Lambda) + \left|\mathrm{tr}\left((C^0)^{-1} - \Lambda\right)\right|\right)}\right).$$
The degenerate case N<D+1
Additionally, we can show the following result for the fixed points:
Theorem 3. Given a $D$-dimensional multivariate Gaussian target density
$p(x) = \mathcal{N}(x|\mu, \Sigma)$, using Algorithm 2 with $N < D + 1$ particles, the
empirical mean converges to the exact mean $\mu$. The $N - 1$ non-zero eigenvalues of $C^t$
converge to a subset of the spectrum of the target covariance $\Sigma$. Furthermore, the
global minimum of the regularised version $\tilde{F}$ of the free energy (17) corresponds
to the largest eigenvalues of $\Sigma$.
Proof. The proof is given in Appendix G.
This result suggests that $C^t$ might typically converge to an optimal low-rank
approximation of $\Sigma$. We show an empirical confirmation of this conjecture in
Section 4.2. This suggests that it makes sense to apply our algorithm to high-dimensional
problems even when the number of particles is not large. If the target density has
significant support close to a low-dimensional submanifold, we might still obtain a
reasonable approximation.
3.5. Structured Mean-Field
For high-dimensional problems, it may be useful to restrict the variational Gaussian
approximation to the posterior to a specific structure via a structured mean-field
approximation. In this way, spurious dependencies between variables that are caused by
finite-sample effects could be explicitly removed from the algorithms. This is most easily
incorporated in our approach by splitting a given collection of latent variables $x$ into
$M$ disjoint subsets $x^{(i)}$. We reorder the vector indices in such a way that the first
components correspond to $x^{(1)}$, the next to $x^{(2)}$, and so on. Hence, we obtain
$x = \{x^{(1)}, x^{(2)}, \ldots, x^{(M)}\}$. A structured mean-field approach is enforced by
imposing a block matrix structure for the update matrix
$A_{MF} = A^{(1)} \oplus \cdots \oplus A^{(M)}$, where $\oplus$ is the direct sum operator.
It is easy to see that this construction corresponds to a related block structure of the
$\Gamma$ matrix in Equation (3). This means that the subsets of the random vectors are
modeled as independent. Hence, when the number of particles grows to infinity, one recovers
the fixed-point equations for the optimal MF structured Gaussian variational approximation
from our approach. Note that using a structured MF does not change the complexity of the
algorithm but requires fewer particles to obtain a full-rank solution.
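The block structure of the update matrix can be illustrated with a short NumPy sketch. The function name, the `blocks` argument and the random inputs are assumptions made for the example, and the matrix follows the sign convention of Algorithm 2. In practice one would apply each block to its own subset of coordinates rather than assemble the full $D \times D$ matrix; it is assembled here only to make the direct-sum structure visible.

```python
import numpy as np

def mean_field_update_matrix(X, G, blocks):
    """Block-diagonal A_MF; `blocks` lists the coordinate indices of x^(1), ..., x^(M)."""
    N, D = X.shape
    Xc = X - X.mean(axis=0)
    A = np.zeros((D, D))
    for idx in blocks:
        idx = np.asarray(idx)
        # per-block matrix: (1/N) sum_i g_i^(k) (x_i^(k) - m^(k))^T - I
        A_k = G[:, idx].T @ Xc[:, idx] / N - np.eye(len(idx))
        A[np.ix_(idx, idx)] = A_k
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                  # N = 6 particles in D = 4 (illustrative)
G = rng.normal(size=(6, 4))                  # stand-in for the gradients of phi
A_mf = mean_field_update_matrix(X, G, blocks=[[0, 1], [2, 3]])
A_full = G.T @ (X - X.mean(axis=0)) / 6 - np.eye(4)
```

Each diagonal block of `A_mf` coincides with the corresponding block of the unstructured matrix `A_full`, while the cross-block entries, which carry the dependencies between subsets, are set to zero.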
3.6. Comparison with SVGD
Given the similarities with the SVGD methods [24], one may ask how our approach differs.
The model proposed by [10] using a linear kernel $k(x, x') = x^\top x' + 1$ has similar
properties to our approach. The variable update becomes:
$$\frac{dx}{dt} = \frac{1}{N}\sum_{i=1}^{N}\left(-k(x_i, x)\nabla\phi(x_i) + \nabla_{x_i}k(x_i, x)\right)
= \mathbb{E}_{\hat{q}}\left[I - \nabla\phi(x)x^\top\right]x - \mathbb{E}_{\hat{q}}\left[\nabla\phi(x)\right].$$
The fixed points are
$$0 = \mathbb{E}_{\hat{q}}\left[\nabla\phi(x)\right], \qquad
I = \mathbb{E}_{\hat{q}}\left[\nabla\phi(x)x^\top\right] = \mathbb{E}_{\hat{q}}\left[\nabla\phi(x)(x - m)^\top\right],$$
where the last equality holds since $\mathbb{E}_{\hat{q}}[\nabla\phi(x)] = 0$. These are the
same as the fixed points (9) of our algorithm. Similarly to Theorem 1, $D + 1$ particles will
converge to the exact $D$-dimensional multivariate Gaussian target. However, the generated
flows are different. The main difference is that we normalize our flow via the $L_2$ norm,
whereas [10] rely on the reproducing kernel Hilbert space (RKHS) norm, i.e.,
$\|\phi\|_k^2 = \phi^\top K^{-1}\phi$, where $\phi_i = \phi(x_i)$ and $K_{ij} = k(x_i, x_j)$.
For a full introduction on RKHS, we recommend [34]. Remarkably, centering the particles
on the mean, namely, using the modified linear kernel
$k(x, x') = (x - m)^\top(x' - m) + 1$, leads to the same dynamics. Additionally, when using
SVGD, there is no direct possibility of computing the current KL divergence between the
variational distribution and the target, unless some values are accumulated [35]. There is
also no clear theory explaining what happens when the number of particles is smaller than the
number of dimensions, for both distance-based kernels and the linear kernel.
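The equality between the kernel form and the moment form of the linear-kernel SVGD update can be verified numerically. The sketch below is illustrative: the function names and the random stand-in gradients are assumptions, and particles are stored one per row.

```python
import numpy as np

def svgd_linear_kernel(X, G):
    """Kernel form: (1/N) sum_i [-k(x_i, x) g_i + grad_{x_i} k(x_i, x)], k(x, x') = x^T x' + 1."""
    N = X.shape[0]
    K = X @ X.T + 1.0                  # K[j, i] = k(x_j, x_i)
    return -(K @ G) / N + X            # grad_{x_i} k(x_i, x_j) = x_j for every i

def svgd_moment_form(X, G):
    """Moment form: E[I - g x^T] x - E[g], evaluated at every particle."""
    N, D = X.shape
    M = np.eye(D) - G.T @ X / N        # I - (1/N) sum_i g_i x_i^T
    return X @ M.T - G.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 3))            # N = 7 particles in D = 3 (illustrative)
G = rng.normal(size=(7, 3))            # stand-in for the gradients of phi
agree = np.allclose(svgd_linear_kernel(X, G), svgd_moment_form(X, G))
```

The uncentered product $x_i^\top x$ enters both forms, so the magnitude of the update grows with the distance of the particles from the origin.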
4. Experiments
We now evaluate the efficiency of GPF and GF. First, given a Gaussian target, we
compare the convergence of our approach with popular VGA methods, which are all
described in Section 2. Second, we evaluate the effect of varying the number of particles
for both Gaussian targets and non-Gaussian targets, especially with a low-rank covariance.
Then, we evaluate the efficiency of our algorithm on a range of real-world binary
classification problems through a Bayesian logistic regression model and a series of BNN on
the MNIST dataset.
All the Julia [36] code and data used to reproduce the experiments are available
at the GitHub repository: https://github.com/theogf/ParticleFlow_Exp (accessed on
27 July 2021).
4.1. Multivariate Gaussian Targets
We consider a 20-dimensional multivariate Gaussian target distribution. The mean is
sampled from a standard Gaussian, $\mu \sim \mathcal{N}(0, I_D)$, and the covariance is a
dense matrix defined as $\Sigma = U\Lambda U^\top$, where $U$ is a unitary matrix and
$\Lambda$ is a diagonal matrix. $\Lambda$ is constructed as
$\log_{10}(\Lambda_{ii}) = \log_{10}(\kappa)\frac{i - 1}{D - 1} - 1$, where $\kappa$ is the
condition number, i.e., $\kappa = \Lambda_{\max}/\Lambda_{\min}$. This means that, for
$\kappa = 1$, we obtain $\Sigma = 0.1 I$, and for $\kappa = 100$, we obtain eigenvalues
ranging uniformly in log-space from 0.1 to 10.
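The target construction can be sketched as follows. The eigenvalue formula is the one given above; how $U$ is drawn is not specified in the text, so taking the orthogonal factor of a QR decomposition of a Gaussian matrix is an assumption of this example, as are the function name and the seed.

```python
import numpy as np

def make_target_covariance(D, kappa, rng):
    """Sigma = U diag(lam) U^T with log10(lam_i) = log10(kappa) (i - 1) / (D - 1) - 1."""
    i = np.arange(1, D + 1)
    lam = 10.0 ** (np.log10(kappa) * (i - 1) / (D - 1) - 1.0)
    U, _ = np.linalg.qr(rng.standard_normal((D, D)))   # random orthogonal matrix (assumed)
    return (U * lam) @ U.T, lam                         # U * lam scales the columns of U

rng = np.random.default_rng(0)
Sigma_100, lam_100 = make_target_covariance(20, 100.0, rng)
Sigma_1, lam_1 = make_target_covariance(20, 1.0, rng)
mu = rng.standard_normal(20)                            # target mean, mu ~ N(0, I_D)
```

For $\kappa = 100$ the eigenvalues run from 0.1 to 10, and for $\kappa = 1$ the covariance reduces to $0.1 I$ regardless of $U$.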
We compare GPF and GF to the state-of-the-art methods for VGA described in Section 2,
namely Doubly Stochastic VI (DSVI) [14], Factor Covariance Structure (FCS) [15] with rank
$p = D$, iBayes Learning Rule (IBLR) [17] with a full-rank covariance and their Hessian
approach, and Stein Variational Gradient Descent with both a linear kernel (Linear SVGD)
[10] and a squared-exponential kernel (Sq. Exp. SVGD) [24]. For all methods, we set the
number of particles or, alternatively, the number of samples used by the estimator, to
$D + 1$. We use standard gradient descent ($x^{t+1} = x^t + \eta\phi^t(x^t)$) with a learning
rate of $\eta = 0.01$ for all particle methods and RMSProp [37] with a learning rate of 0.01
for all stochastic methods. We run each experiment 10 times with 30,000 iterations, and
plot the average error on the mean and the covariance with one standard deviation. For
GPF, we additionally evaluate the method with and without natural gradients for the mean
(i.e., pre-multiplying the averaged gradient with $C^t$), indicated, respectively, with a
dashed and solid line. Figure 2 reports the $L_2$ norm of the difference between the inferred
mean and covariance and those of the true posterior over time for target condition numbers
$\kappa \in \{1, 10, 100\}$.
Figure 2. $L_2$ norm of the difference between the target mean $\mu$ (left side) and target
covariance $\Sigma$ (right side) and the inferred variational parameters $m^t$ and $C^t$
against time for 20-dimensional Gaussian targets with condition number $\kappa$. We use
$D + 1$ particles/samples and show the mean over 10 runs as well as the 68% credible interval.
Methods with dashed curves use natural gradients on the mean. Note that DSVI, GF and FCS are
overlapping and are, at this scale, indistinguishable from one another.
As Theorem 1 predicts, GPF converges exactly to the true distribution, regardless of the
target. GF and other methods based on stochastic estimators cannot obtain the same precision,
as their accuracy is penalized by the gradient noise. IBLR approximates the covariance
perfectly, despite the stochasticity of its estimator; however, IBLR needs to compute the true
Hessian at each step. When using a Hessian approximation instead, IBLR performed just like
DSVI; the true benefit of IBLR appears when second-order derivatives are computed, which
is naturally intractable in high dimensions. SVGD with a linear kernel achieves a good
performance but is highly unstable: most of the runs (ignored here) diverge. This is due to
the dot product $x^\top x$, which can become extremely large, especially for non-centered
data. For this reason, we do not consider this method in the later experiments. SVGD with a
sq. exp. kernel obtains a good estimate for the mean but fails to approximate the covariance.
Perhaps surprisingly, GF does not perform much better than DSVI or FCS. This is
potentially due to the benefit of Riemannian gradients being canceled by the gradient noise
[38], which provides a strong argument for particle-based methods over stochastic estimators.
Remarkably, we also confirm Theorem 2: the convergence speed of $C^t$ is independent of the
target $\Sigma$, while the convergence speed of $m^t$ depends on it unless the natural
gradient is used (see the dashed curves). The case $\kappa = 1$ highlights that natural
gradients do not necessarily improve convergence speed.
4.2. Low-Rank Approximation for Full Gaussian Targets
We explore the effect of the number of particles for both Gaussian and non-Gaussian
targets. We use the same Gaussian target as in the previous experiment, in 50 dimensions,
with a full-rank covariance determined by its condition number
$\kappa = \lambda_{\max}/\lambda_{\min}$. The covariance eigenvalues $\lambda_i$ range
uniformly in log-space from 0.1 to $0.1\kappa$. For a given target multivariate Gaussian, we
vary the number of particles from 2 to $D + 1$ and look at the absolute trace error
$|\mathrm{tr}(C - \Sigma)|$. The results for $D = 50$, as well as the corresponding
predictions from Theorem 3 (in dashed black), are shown in Figure 3.
The empirical results perfectly match the theoretical predictions, confirming that, for
Gaussian targets, the particles determine a low-rank approximation whose spectrum is
equal to the largest eigenvalues of the target.
Figure 3. Trace error for a Gaussian target with $D = 50$ and condition numbers $\kappa$ for
a varying number of particles with GPF. Predictions from Theorem 3 are shown in dashed black.
4.3. High-Dimensional Low-Rank Gaussian Targets
We consider a typical low-rank target case where the dimensionality is high but the
effective rank of the covariance is unknown. The target is given by
$p(x) = \mathcal{N}(\mu, \Sigma)$, where $\mu \sim \mathcal{N}(0, I_D)$ and the covariance is
defined by $\Sigma = U\Lambda U^\top$, where $U$ is a $D \times D$ unitary matrix and
$\Lambda$ is a diagonal matrix defined by
$$\Lambda_{ii} = \begin{cases} \mathcal{N}(2, 1), & \text{if } i \le K \\ 10^{-8}, & \text{otherwise} \end{cases}$$
where $K$ is the effective rank of the target. We pick $D = 500$ and vary
$K \in \{10, 20, 30\}$ to simulate a real problem where the correct $K$ is not known. We test
all methods allowing for low-rank structure, namely, GPF, GF, FCS and SVGD (Linear and
Sq. Exp.). We fix the rank (or the number of particles) to be 20; therefore, we obtain three
cases where the rank is exact, under-estimated, and over-estimated. We use RMSProp [37] for
the stochastic methods and a diagonal version of it (see Section 4.4) for the particle
methods. The error of the mean and the covariance is shown in Figure 4. Note that the
difference in the initial error on the covariance is due to the difficulty of starting with
the same covariance for particle and stochastic methods.
Figure 4. Convergence plot of low-rank methods for a 500-dimensional multivariate Gaussian
target with effective rank $K \in \{10, 20, 30\}$. The rank of each method is fixed as 20. The
difference in the starting point for the covariance is due to the initialization difference
between each method. We show the mean over 10 runs for each method with shadowed areas
representing the 68% credible interval.
We observe once again that SVGD with a linear kernel fails to converge due to the
large gradients. All methods perform equally well in the estimation of the mean and are
unaffected by the rank of the target. As expected, the approximation quality for the
covariance degrades when the rank gets bigger, but all algorithms still converge to good
approximations. SVGD with a sq. exp. kernel performs much worse than the rest of the
methods. This is a known phenomenon where, in high dimensions, the covariance estimated by
SVGD is either over- or underestimated.
4.4. Non-Gaussian Target
We now investigate the behavior of our algorithm with non-Gaussian target distributions.
We built a two-dimensional banana distribution,
$p(x) \propto \exp\left(-0.5\left(0.01x_1^2 + 0.1\left(x_2 + 0.1x_1^2 - 10\right)^2\right)\right)$,
varied the number of particles used for GPF in $\{3, 5, 10, 20, 50\}$ and compared it with a
standard full-rank VGA approach. We also showed the impact of replacing a fixed $\eta$ with
the Adam [39] optimizer for 50 particles. The results are shown in Figure 5.
As expected, increasing the number of particles made the distribution obtained via GPF
progressively closer to the optimal standard VGA, even in a non-Gaussian setting. However,
using a momentum-based optimizer such as Adam breaks the linearity assumption of the
original flow (5) and leads to a twisted representation of the particles. (We observed the
same behavior with other momentum-based optimizers.) A simple modification of the
most common optimizers allows the linearity to be maintained while correctly adapting
the learning rate to the shape of the problem. Most optimisers accumulate momentum
or gradients element-wise, and end up modifying the updates as
$x^{t+1} = x^t + P^t \odot \phi^t(x^t)$, where $P^t \in \mathbb{R}^{D \times D}$ is the
preconditioner obtained via the optimiser and $\odot$ is the Hadamard product. By instead
taking the average over each dimension, we obtained the updates
$x^{t+1} = x^t + P^t\phi^t(x^t)$, where $P^t$ is a $D \times D$ diagonal matrix. The details
of the dimension-wise conditioners for ADAM, AdaGrad and AdaDelta are given in Appendix H.
Figure 5.
Two-dimensional Banana distribution. Comparison of
GPF
using an increasing number of
particles and a different optimizer (ADAM) with the standard VGA (rightmost plot).
4.5. Bayesian Logistic Regression
Finally, we considered a range of real-world binary classification problems modeled with a Bayesian logistic regression. Given some data $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1, 1\}$, we defined the model $y_i \sim \mathrm{Bernoulli}(\sigma(w^\top x_i))$ with weight $w \in \mathbb{R}^D$, and with $\sigma$ being the logistic function. We set a prior on $w$: $w \sim \mathcal{N}(0, 10 I_D)$. We benchmarked the competing approaches over four datasets from the UCI repository [40]: spam ($N = 4601$, $D = 104$), krkp ($N = 351$, $D = 111$), ionosphere ($N = 3196$, $D = 37$) and mushroom ($N = 8124$, $D = 95$). We ran all algorithms discussed in Section 4.1, both with and without a mean-field approximation; SVGD was omitted since it is too unstable. All algorithms were run with a fixed learning rate $\eta = 10^{-4}$, and we used mini-batches of size 100. We show alternative training settings in Appendix I. Note that FCS, for mean-field, simplifies to DSVI. Additionally, we did not consider full-rank IBLR, as it is too expensive, and we used their reparametrized gradient version for the Hessian. Figure 6 shows the average negative log-likelihood on 10-fold cross-validation with one standard deviation for each dataset. While, as expected, the advantages shown for Gaussian targets do not transfer to non-Gaussian targets, GPF and GF are consistently on par with competitors. On the other hand, IBLR tends to be outperformed. It is also interesting to note that mean-field does not seem to have a negative impact on these problems, and performance remains the same even with a full-rank matrix.
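The target of this experiment can be written down as a potential $\phi(w) = -\log p(y \mid X, w) - \log p(w)$. The sketch below is one possible NumPy implementation, not the authors' code: it assumes the common likelihood form $\sigma(y_i w^\top x_i)$ for labels in $\{-1, 1\}$, and the function names `neg_log_joint` and `grad_neg_log_joint` as well as the toy data are hypothetical.

```python
import numpy as np

def neg_log_joint(w, X, y, prior_var=10.0):
    """phi(w) = -log p(y | X, w) - log p(w), up to an additive constant."""
    margins = y * (X @ w)                       # y_i * w^T x_i, labels in {-1, +1}
    nll = np.sum(np.logaddexp(0.0, -margins))   # -sum_i log sigma(y_i w^T x_i)
    return nll + 0.5 * (w @ w) / prior_var      # Gaussian prior N(0, prior_var * I)

def grad_neg_log_joint(w, X, y, prior_var=10.0):
    """Gradient of phi with respect to w."""
    margins = y * (X @ w)
    # sigma(-m) written with tanh for numerical stability
    sig_neg = 0.5 * (1.0 - np.tanh(0.5 * margins))
    return -(X.T @ (y * sig_neg)) + w / prior_var

# Toy data (assumed for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.where(rng.random(50) < 0.5, -1.0, 1.0)
w = rng.normal(size=3)
```

The gradient is the quantity $g(x) = \nabla_x \phi(x)$ that the particle flow consumes; in a mini-batch setting the likelihood term would be rescaled by $N$ over the batch size.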
6. Flexible and Efficient Inference with Particles for the Variational Gaussian
Approximation
86
Entropy 2021,23, 990 17 of 34
(a) Mean-field approximation
(b) No mean-field approximation
Figure 6. Average negative log-likelihood on a test set, over 10 runs, against training time for a Bayesian logistic regression model applied to different datasets. Top plots use a mean-field approximation, while bottom plots use a low-rank structure for the covariance with rank $L = 100$.
4.6. Bayesian Neural Network
We ran our algorithm on a standard network with two hidden layers, each with $L = 200$ neurons and tanh activation functions (we additionally tried ReLU [41], but some baselines failed to converge). We trained on the MNIST dataset [42] ($N = 60{,}000$, $D = 784$) and used an isotropic prior on the weights $p(w) = \mathcal{N}(0, \alpha I_D)$ with $\alpha = 1.0$. We additionally compared these with Stochastic Weight Averaging-Gaussian (SWAG) [27] with an SGD learning rate of $10^{-6}$ (selected empirically) and Efficient Low-Rank Gaussian Variational Inference (ELRGVI) [26]. We varied the assumptions on the covariance matrix to be diagonal (Mean-Field), or to have rank $L \in \{5, 10\}$. Additionally, we showed, for GPF, the effect of using a structured mean-field assumption by imposing the independence of the weights between each layer (GPF (Layers)).
We trained each algorithm for 5000 iterations with a batch size of 128 ($\sim$10 epochs) and reported the final average negative log-likelihood, accuracy and expected calibration error [43] on the test set ($N = 10{,}000$) in Table 1. The predictive distribution is given by
$$p(y = k \mid x^*, \mathcal{D}) = \int p(y = k \mid x^*, w)\, p(w \mid \mathcal{D})\, dw \approx \int p(y = k \mid x^*, w)\, q(w)\, dw,$$
where $\mathcal{D}$ is the training data, and $x^*$ is a test sample. We computed the accuracy and the average negative test log-likelihood as:
$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{y_i}\!\left(\arg\max_k\, p(y = k \mid x_i^*, \mathcal{D})\right)$$
$$\mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y = y_i \mid x_i^*, \mathcal{D})$$
where $\mathbb{1}_y(x)$ is the indicator function (equal to 1 for $y = x$, 0 otherwise). For the definition of expected calibration error, we refer the reader to [43]. Additional convergence and uncertainty calibration plots can be found in Appendix I.
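As an illustration of the two metrics, the sketch below approximates the predictive distribution by averaging class probabilities over samples of $w$ drawn from $q(w)$ and then evaluates accuracy and average negative log-likelihood. The Monte Carlo averaging, the function names and the toy probabilities are assumptions made for this example; the paper does not specify how the integral was evaluated.

```python
import numpy as np

def predictive(probs_per_sample):
    """Average class probabilities over S weight samples: (S, N, K) -> (N, K)."""
    return probs_per_sample.mean(axis=0)

def accuracy(pred, y):
    """Fraction of test points whose most probable class equals the label."""
    return float(np.mean(np.argmax(pred, axis=1) == y))

def avg_nll(pred, y):
    """Average negative log-probability assigned to the true labels."""
    return float(-np.mean(np.log(pred[np.arange(len(y)), y])))

# Toy example: S = 2 weight samples, N = 3 test points, K = 2 classes
probs = np.array([[[0.9, 0.1], [0.4, 0.6], [0.6, 0.4]],
                  [[0.7, 0.3], [0.5, 0.5], [0.8, 0.2]]])
y = np.array([0, 1, 1])
pred = predictive(probs)
acc = accuracy(pred, y)
nll = avg_nll(pred, y)
```

Averaging probabilities before taking the logarithm matters: the NLL of the averaged predictive is not the average of per-sample NLLs.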
Table 1. Negative Log-Likelihood (NLL), Accuracy (Acc), and Expected Calibration Error (ECE) for a Bayesian Neural Network (BNN) on the MNIST dataset. We varied the rank of the variational covariance from mean-field (all variables are independent) to a low-rank structure with $L \in \{5, 10\}$. Bold numbers indicate the best performance, and italic bold numbers indicate the best performance when restricted to VGA methods. Convergence and calibration plots can be found in Appendix I.

Alg.             Mean-Field               L = 5                    L = 10
                 NLL    Acc    ECE        NLL    Acc    ECE        NLL    Acc    ECE
GPF              0.183  0.95   0.0384     0.166  0.96   0.0918     0.172  0.955  0.0869
GPF (Layers)     -      -      -          0.147  0.958  0.0181     0.178  0.952  0.0395
GF               0.178  0.953  0.0706     0.185  0.956  0.136      0.171  0.952  0.0455
DSVI             0.204  0.945  0.11       -      -      -          -      -      -
SVGD (Sq. Exp)   -      -      -          0.139  0.965  0.0732     0.133  0.967  0.0879
SWAG             -      -      -          0.257  0.957  0.0662     0.287  0.956  0.0878
ELRGVI           -      -      -          0.453  0.901  0.53       0.537  0.882  0.777
Overall, the SVGD method performed best in terms of both accuracy and negative log-likelihood. However, SVGD is not in the same category as the others, since it is not a VGA. For VGAs, we observed that a low-rank approximation improves upon mean-field methods. In particular, assuming independence between layers provides a large advantage to GPF. GPF and GF generally perform as well as or better than all the other VGA methods. Note that, although not reported here, all methods needed approximately the same time for the 5000 iterations, except for SWAG, which only needed the MAP and a few thousand iterations of SGD afterward, making it generally faster but also less controlled (a grid search was needed to find the appropriate learning rate for SGD).
5. Discussion
We introduced GPF, a general-purpose and theoretically grounded particle-based approach to perform inference with variational Gaussians, as well as GF, its parameter version. We were able to show the convergence of the particle algorithm based on an empirical approximation of the free energy. We also showed that we can approximate high-dimensional targets by allowing for low-rank approximations with a small number of particles. The results for Gaussian targets suggest that the convergence of the posterior covariance approximation may relax asymptotically fast, with small dependence on the target. This work is the first step in analyzing convergence speed and guarantees in inference with variational Gaussians, and future work could extend guarantees to non-Gaussian problems. One could also take advantage of existing particle-based VI methods to accelerate inference further or reach better optima [44,45].
Author Contributions: Conceptualization, T.G.-F. and M.O.; methodology, T.G.-F., V.P. and M.O.; software, T.G.-F.; validation, T.G.-F.; formal analysis, T.G.-F.; investigation, T.G.-F.; resources, T.G.-F. and V.P.; data curation, T.G.-F.; writing (original draft preparation), T.G.-F., V.P. and M.O.; writing (review and editing), T.G.-F., V.P. and M.O.; visualization, T.G.-F.; supervision, M.O.; project administration, T.G.-F.; funding acquisition, M.O. All authors have read and agreed to the published version of the manuscript.
Funding: We acknowledge the support of the German Research Foundation and the Open Access Publication Fund of TU Berlin.
Data Availability Statement: Datasets can be found on the UCI dataset website [40] and the MNIST dataset can be found on Yann LeCun's website [42].
Acknowledgments: We thank Fela Winkelmolen for his initial help on computations, Jannik Thümmel for his work on the linear SVGD and the reviewers for their insightful comments.
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A. Derivation of the Optimal Parameters
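In Section 3, we considered the optimization problem:
$$\min_{A^t, b^t \in \mathcal{B}} \frac{d\mathcal{F}[q^t]}{dt} \quad \text{where} \quad \mathcal{B} = \left\{A^t, b^t : \|A^t\|_F^2 = 1,\ \|b^t\|_2^2 = 1\right\},$$
where we have introduced $\|A\|_F^2 = \mathrm{tr}(AA^\top)$, the Frobenius norm, and $\|b^t\|_2$, the $L_2$ norm, and
$$\frac{d\mathcal{F}[q^t]}{dt} = -\mathrm{tr}\left[A^t (A^t_\star)^\top\right] - (b^t)^\top b^t_\star. \tag{A1}$$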
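To solve this problem, we used the Lagrange multiplier method. We write the Lagrangian as:
$$\mathcal{L}(A^t, b^t) = \frac{d\mathcal{F}[q^t]}{dt} - \lambda_A\, g(A^t) - \lambda_b\, h(b^t),$$
where $g(A) = \mathrm{tr}(AA^\top) - 1$ and $h(b) = \|b\|_2^2 - 1$. For simplicity we can divide the problem as:
$$\mathcal{L}(A^t) = -\mathrm{tr}\left[A^t (A^t_\star)^\top\right] - \lambda_A\, g(A^t)$$
$$\mathcal{L}(b^t) = -(b^t)^\top b^t_\star - \lambda_b\, h(b^t)$$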
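For $A^t$, we have the constraints:
$$\nabla_{A^t}\, \mathrm{tr}\left[A^t (A^t_\star)^\top\right] = \lambda_A \nabla_{A^t}\, g(A^t), \qquad g(A^t) = 0.$$
Computing the gradients is straightforward:
$$A^t_\star = 2\lambda_A A^t \;\Rightarrow\; A^t = \frac{A^t_\star}{2\lambda_A} \;\Rightarrow\; \frac{1}{4\lambda_A^2}\mathrm{tr}\left(A^t_\star (A^t_\star)^\top\right) = 1 \;\Rightarrow\; \lambda_A = \sqrt{\frac{\mathrm{tr}\left(A^t_\star (A^t_\star)^\top\right)}{4}},$$
which gives us the result $A^t = A^t_\star / \|A^t_\star\|_F$. Similarly for $b^t$:
$$\nabla_{b^t}\, (b^t)^\top b^t_\star = \lambda_b \nabla_{b^t}\, h(b^t), \qquad h(b^t) = 0.$$
Replacing the gradients gives:
$$b^t_\star = 2\lambda_b b^t \;\Rightarrow\; b^t = \frac{b^t_\star}{2\lambda_b} \;\Rightarrow\; \frac{1}{4\lambda_b^2}\|b^t_\star\|_2^2 = 1 \;\Rightarrow\; \lambda_b = \frac{\|b^t_\star\|_2}{2},$$
which gives us the result $b^t = b^t_\star / \|b^t_\star\|_2$.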
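The closed-form solution can be sanity-checked numerically. The sketch below draws arbitrary $A_\star$ and $b_\star$ (hypothetical values, chosen only for illustration), evaluates the rate (A1) at the normalised solution, and compares it with random feasible directions on the unit spheres.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
A_star = rng.normal(size=(D, D))   # stands in for A^t_star (arbitrary values)
b_star = rng.normal(size=D)        # stands in for b^t_star (arbitrary values)

# Closed-form solutions obtained from the Lagrangian conditions
A_opt = A_star / np.linalg.norm(A_star, "fro")
b_opt = b_star / np.linalg.norm(b_star)

def rate(A, b):
    """Right-hand side of Equation (A1)."""
    return -np.trace(A @ A_star.T) - b @ b_star

best = rate(A_opt, b_opt)

# Random feasible competitors with unit Frobenius norm and unit L2 norm
trials = []
for _ in range(1000):
    A = rng.normal(size=(D, D))
    A /= np.linalg.norm(A, "fro")
    b = rng.normal(size=D)
    b /= np.linalg.norm(b)
    trials.append(rate(A, b))
```

By the Cauchy-Schwarz inequality, the optimal value is $-\|A_\star\|_F - \|b_\star\|_2$, and no feasible pair can do better.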
Appendix B. Relaxation of the Empirical Free Energy
We prove the decrease in the empirical free energy (17) under the particle flow when the covariance $C$ is nonsingular. We define the empirical distribution $\hat{q}(x) = \frac{1}{N}\sum_{i=1}^{N} \delta_{x, x_i}$ with a finite number $N$ of particles. The empirical free energy is defined as
$$\mathcal{F}[\hat{q}] = \mathbb{E}_{\hat{q}}[\phi(x)] - \frac{1}{2}\log|C|.$$
We are interested in the temporal change of the free energy when particles move under general linear dynamics
$$\frac{dx_i}{dt} = b + A(x_i - m).$$
The induced dynamics for $\mathcal{F}$ are:
$$\frac{d\mathcal{F}}{dt} = \mathbb{E}_{q}\left[\nabla_x \phi(x)^\top \frac{dx}{dt}\right] - \frac{1}{2}\mathrm{tr}\left(C^{-1}\frac{dC}{dt}\right).$$
For notational simplicity, we introduce $g(x) = \nabla_x \phi(x)$ and $\dot{x} = \frac{dx}{dt}$ (similarly $\dot{m} = \frac{dm}{dt}$).
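$$\begin{aligned}
\frac{dC}{dt} &= \frac{d}{dt}\mathbb{E}_q\left[(x-m)(x-m)^\top\right] \\
&= \mathbb{E}_q\left[(\dot{x}-\dot{m})(x-m)^\top\right] + \mathbb{E}_q\left[(x-m)(\dot{x}-\dot{m})^\top\right] \\
&= \mathbb{E}_q\left[\dot{x}x^\top + x\dot{x}^\top - \dot{m}m^\top - m\dot{m}^\top\right] \\
&= \mathbb{E}_q\left[\dot{x}(x-m)^\top\right] + \mathbb{E}_q\left[(x-m)\dot{x}^\top\right]
\end{aligned}$$
$$\begin{aligned}
\frac{d\mathcal{F}}{dt} &= \mathbb{E}_q\left[g(x)^\top \dot{x}\right] - \frac{1}{2}\mathbb{E}_q\left[\mathrm{tr}\left(C^{-1}\dot{x}(x-m)^\top\right) + \mathrm{tr}\left(C^{-1}(x-m)\dot{x}^\top\right)\right] \\
&= \mathbb{E}_q\left[\dot{x}^\top\left(g(x) - C^{-1}(x-m)\right)\right]
\end{aligned} \tag{A2}$$
where we used the permutation properties of the trace.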
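Plugging the dynamics into Equation (A2), we obtain:
$$\frac{d\mathcal{F}}{dt} = b^\top \mathbb{E}_q[g(x)] + \mathbb{E}_q\left[(x-m)^\top A^\top g(x)\right] - \mathbb{E}_q\left[(x-m)^\top A^\top C^{-1}(x-m)\right] \tag{A3}$$
where we used the fact that $b^\top C^{-1}\mathbb{E}_q[x-m] = 0$.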
We next look for conditions on $b$ and $A$ under which $\frac{d\mathcal{F}}{dt} < 0$, i.e., the dynamics will lead to a decrease in the free energy. We pick $b = -\beta_1 \mathbb{E}_q[g(x)]$, where $\beta_1 > 0$, and we obtain, for the first term in (A3):
$$-\beta_1 \left\|\mathbb{E}_q[g(x)]\right\|^2 \le 0.$$
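For $A$, let us first define $\psi = \mathbb{E}_q\left[g(x)(x-m)^\top\right]$ and rewrite the second and last terms of Equation (A3) as:
$$\mathbb{E}_q\left[(x-m)^\top A^\top g(x)\right] = \mathrm{tr}\left(\mathbb{E}_q\left[A^\top g(x)(x-m)^\top\right]\right) = \mathrm{tr}\left(A^\top \psi\right)$$
$$\mathbb{E}_q\left[(x-m)^\top A^\top C^{-1}(x-m)\right] = \mathrm{tr}\left(A^\top C^{-1} C\right) = \mathrm{tr}(A)$$
Combining both, we get $\mathrm{tr}\left(A^\top(\psi - I)\right)$. Similarly to the previous step, we pick $A = -\beta_2(\psi - I)$, where $\beta_2 \ge 0$, which leads to another non-positive term:
$$-\beta_2\, \mathrm{tr}\left((\psi - I)^\top(\psi - I)\right) \le 0,$$
where we use the fact that $X^\top X$ is a positive semi-definite matrix for any real-valued $X$.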
Note that different forms of $A$ (e.g., with $\beta_2$ replaced by a positive definite matrix) could be used, as long as the trace of the product stays positive. Inserting $b$ and $A$, the free energy dynamics become
$$\frac{d\mathcal{F}}{dt} = -\beta_1 \left\|\mathbb{E}_q[g(x)]\right\|^2 - \beta_2\, \mathrm{tr}\left((\psi - I)^\top(\psi - I)\right).$$
The variable dynamics are given by
$$\begin{aligned}
\frac{dx}{dt} &= -\beta_1 \mathbb{E}_q[g(x)] - \beta_2(\psi - I)(x - m) \\
&= -\beta_1 \mathbb{E}_q[g(x)] - \beta_2\left(\mathbb{E}_q\left[g(x)(x-m)^\top\right] - I\right)(x - m),
\end{aligned}$$
which is equivalent to Equation (5) for $\beta_1 = \beta_2 = 1$. Our result shows that the empirical approximation of the free energy decreases under the particle flow.
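The flow derived in this appendix can be simulated directly. The following sketch applies explicit Euler steps with $\beta_1 = \beta_2 = 1$ to a small Gaussian target and tracks the empirical free energy. The target parameters, step size and number of particles are arbitrary choices for illustration, and the Euler discretisation is an assumption of this example rather than the paper's Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, eta, steps = 2, 10, 0.01, 2000
mu = np.array([1.0, -1.0])                   # target mean (arbitrary)
Lam = np.array([[2.0, 0.5], [0.5, 1.0]])     # target precision (arbitrary)
Sigma = np.linalg.inv(Lam)

def free_energy(X):
    """Empirical free energy E[phi] - 0.5 log|C| for phi(x) = 0.5 (x-mu)^T Lam (x-mu)."""
    R = X - mu[:, None]
    m = X.mean(axis=1, keepdims=True)
    C = (X - m) @ (X - m).T / X.shape[1]
    return 0.5 * np.mean(np.sum(R * (Lam @ R), axis=0)) - 0.5 * np.linalg.slogdet(C)[1]

X = rng.normal(size=(D, N))                  # columns are particles
F = [free_energy(X)]
for _ in range(steps):
    m = X.mean(axis=1, keepdims=True)
    Z = X - m                                # centered particles
    G = Lam @ (X - mu[:, None])              # g(x_i) for every particle
    psi = G @ Z.T / N                        # E[g(x)(x-m)^T]
    b = -G.mean(axis=1, keepdims=True)       # b = -E[g(x)]
    A = -(psi - np.eye(D))                   # A = -(psi - I)
    X = X + eta * (b + A @ Z)                # explicit Euler step of the linear flow
    F.append(free_energy(X))

m_final = X.mean(axis=1)
C_final = (X - m_final[:, None]) @ (X - m_final[:, None]).T / N
```

With $N > D$ particles the empirical mean and covariance approach $\mu$ and $\Sigma$, in line with the fixed-point analysis of Appendix E.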
Appendix C. Riemannian Gradient for Matrix Parameter Γ
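The parameter flow for the matrix $\Gamma$ in (11) is given by
$$\frac{d\Gamma^t}{dt} = \Gamma^t - \mathbb{E}_{q^0}\left[\nabla_x \phi(x^t)(x^0 - m^0)^\top\right]\Gamma^t(\Gamma^t)^\top.$$
This is easily rewritten in terms of the parameter gradient as $\frac{d\Gamma^t}{dt} = \frac{\partial \mathcal{F}}{\partial \Gamma}\Gamma\Gamma^\top$.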
Similar to natural gradients, which are defined by the metric induced by the Fisher matrix, we can rewrite the parameter change in terms of a different Riemannian gradient. This gradient is the direction of change $d\Gamma = \Gamma(t + dt) - \Gamma(t)$ which yields the steepest descent of the free energy over a small time interval $dt$. As an extra condition, one keeps the length of $d\Gamma$ (measured by a 'natural' metric, which has specific invariance properties) fixed. This is defined by an inner product (the squared length) $\langle d\Gamma, d\Gamma\rangle_\Gamma$ in the tangent space of small deviations $d\Gamma$ from the matrix $\Gamma$. Hence, $d\Gamma$ is found by minimising $\mathcal{F}(\Gamma(t) + d\Gamma, m)$ (for small $d\Gamma$) under the condition that $\langle d\Gamma, d\Gamma\rangle_{\Gamma(t)}$ is fixed.
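Following [20] (Theorem 6), a natural metric in the space of symmetric nonsingular matrices can be defined as
$$\langle d\Gamma, d\Gamma\rangle_\Gamma \doteq \mathrm{tr}\left((d\Gamma\, \Gamma^{-1})^\top d\Gamma\, \Gamma^{-1}\right).$$
This metric is invariant against multiplications of $\Gamma$ and $d\Gamma$ by matrices $Y$, i.e., $\langle d\Gamma, d\Gamma\rangle_\Gamma = \langle d\Gamma Y, d\Gamma Y\rangle_{\Gamma Y}$, and reduces to the Euclidean metric at the unit matrix $\Gamma = I$.
The direction of the natural gradient is obtained by expanding the free energy for small $d\Gamma$ and introducing a Lagrange multiplier $\lambda$ for the constraint. One ends up with the quadratic form
$$\frac{\partial \mathcal{F}}{\partial \Gamma} d\Gamma + \lambda\, \mathrm{tr}\left((d\Gamma\, \Gamma^{-1})^\top d\Gamma\, \Gamma^{-1}\right)$$
to be minimised by $d\Gamma$. By taking the derivative with respect to $d\Gamma$, one finds that the direction of $d\Gamma$ agrees with the right equation of the flow (11).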
Appendix D. Regularised Free Energy for N≤D
The problem of defining an empirical approximation for $N \le D$ particles is that the empirical covariance becomes singular and typically has $N - 1$ nonzero eigenvalues, and thus $|C| = 0$. Note that the extra zero eigenvalue comes from the fact that the empirical sum of fluctuations must be zero, which provides an additional linear constraint.
We can regularise the log-determinant term by replacing the zero eigenvalues of $C$: $\lambda_i = 0 \to \tilde{\lambda}_i = 1$. The log-determinant of the new covariance $\tilde{C}$ becomes
$$\log|\tilde{C}| = \sum_{i : \lambda_i > 0} \log \lambda_i,$$
since $\log 1 = 0$. The dynamics of the particles stay the same. To rewrite this formally in terms of matrices, we define
$$\tilde{C} = C + C_\perp,$$
where $C_\perp = \sum_{i : \lambda_i = 0} e_i e_i^\top$ and $e_i$ is the $i$-th eigenvector of $C$. This replaces all zero eigenvalues by 1. $C_\perp$ is a projector: $C_\perp^2 = C_\perp$ and $C_\perp(I - C_\perp) = 0$. We also have $\mathrm{tr}(C_\perp) = D - (N - 1)$. In the following, it is useful to introduce the $D \times N$ matrix of fluctuations $Z$, such that $C = ZZ^\top / N$. The column vectors of $Z$ span the subspace of eigenvectors $e_i$ with $\lambda_i > 0$. Hence, it follows that $C_\perp Z = 0$.
We want to show that the regularised free energy $\tilde{\mathcal{F}}$ decreases under the particle dynamics for $N \le D$. Since the part of the time derivative of $\tilde{\mathcal{F}}$ that depends on $\frac{dm}{dt}$ is not changed, we will only discuss the fluctuation part in the following.
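It is useful to introduce the matrix:
$$\tilde{A} \doteq I - C_\perp - gZ^\top / N = A - C_\perp,$$
where $g = \nabla_x \phi(x)$ is the $D \times N$ matrix of the gradients. Then
$$\begin{aligned}
\mathbb{E}_q\left[g(x)^\top \frac{dx}{dt}\right] &= \mathrm{tr}(A) - \mathrm{tr}(A^\top A) \\
&= \mathrm{tr}(\tilde{A} + C_\perp) - \mathrm{tr}\left((\tilde{A} + C_\perp)^\top(\tilde{A} + C_\perp)\right) \\
&= \mathrm{tr}(\tilde{A}) - \mathrm{tr}(\tilde{A}^\top \tilde{A}).
\end{aligned}$$
To obtain this result, we need
$$\mathrm{tr}(C_\perp \tilde{A}) = \mathrm{tr}(C_\perp \tilde{A}^\top) = \mathrm{tr}\left(C_\perp(I - C_\perp) - C_\perp Z g^\top / N\right) = 0.$$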
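We need to work out
$$-\frac{1}{2}\frac{d \ln|\tilde{C}|}{dt} = -\frac{1}{2}\mathrm{tr}\left(\frac{d\tilde{C}}{dt}\tilde{C}^{-1}\right) = -\frac{1}{2}\mathrm{tr}\left(\frac{dC}{dt}\tilde{C}^{-1}\right),$$
where we have used the fact that the eigenvalues $\lambda_i = 1$ of $\tilde{C}$ have a zero time derivative and can be omitted. We use the linear dynamics $\frac{dZ}{dt} = AZ$ to obtain:
$$\begin{aligned}
\frac{dC}{dt} &= CA^\top + AC \\
&= (\tilde{C} - C_\perp)(\tilde{A}^\top + C_\perp) + (\tilde{A} + C_\perp)(\tilde{C} - C_\perp) \\
&= \tilde{C}\tilde{A}^\top + \tilde{A}\tilde{C} + C_\perp\tilde{C} + \tilde{C}C_\perp - \tilde{A}C_\perp - C_\perp\tilde{A}^\top - 2C_\perp \\
&= \tilde{C}\tilde{A}^\top + \tilde{A}\tilde{C},
\end{aligned}$$
where we have used $C_\perp^2 = C_\perp$ and $C_\perp \tilde{A}^\top = 0$. Hence
$$-\frac{1}{2}\mathrm{tr}\left(\frac{d\tilde{C}}{dt}\tilde{C}^{-1}\right) = -\mathrm{tr}(\tilde{A}).$$
Finally, the temporal change in the free energy due to the fluctuations is given by
$$\frac{d\tilde{\mathcal{F}}}{dt} = -\mathrm{tr}(\tilde{A}^\top \tilde{A}) \le 0.$$
Note that this proof is not only valid for $N \le D$, but also for $N > D$, as the overall computations are simplified with $C_\perp = 0$. A more detailed proof for $N > D$ is given in Appendix B.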
Efficient Computation of $\log|\tilde{C}|$
A practical way to compute $\log|\tilde{C}|$ without performing an eigenvector expansion is to define the $N \times N$ matrix
$$R \doteq Z^\top Z / N + J_{N,N} / N,$$
where $J_{N,N}$ is the $N \times N$ all-ones matrix. $Z^\top Z / N$ shares the $N - 1$ nonzero eigenvalues with $C$ and has an additional eigenvalue 0 corresponding to the constant eigenvector $(e_N)_i = 1/\sqrt{N}$. Adding $J_{N,N}/N$ preserves all existing eigenvalues while replacing the zero eigenvalue with 1. This leads to the following result:
$$-\frac{1}{2}\log|R| = -\frac{1}{2}\sum_{i=1}^{N-1} \log \lambda_i.$$
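The identity can be checked numerically. The sketch below, with arbitrary dimensions and random particles chosen for illustration, compares the sum of log nonzero eigenvalues of the singular $D \times D$ covariance with the log-determinant of the $N \times N$ matrix $R$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 10, 4                                   # fewer particles than dimensions
X = rng.normal(size=(D, N))                    # columns are particles
Z = X - X.mean(axis=1, keepdims=True)          # fluctuations, columns sum to zero
C = Z @ Z.T / N                                # singular D x D empirical covariance

# Reference value: sum of logs of the N - 1 nonzero eigenvalues of C
eigvals = np.linalg.eigvalsh(C)
nonzero = np.sort(eigvals)[-(N - 1):]
logdet_eig = np.sum(np.log(nonzero))

# Shortcut: log-determinant of the N x N matrix R
R = Z.T @ Z / N + np.ones((N, N)) / N
sign, logdet_R = np.linalg.slogdet(R)
```

Working with $R$ costs $O(N^3)$ instead of $O(D^3)$, which is the relevant saving when $N \ll D$.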
Appendix E. Proof of Theorem 1: Fixed Points for a Gaussian Model ($N > D$)
Theorem A1 (1). If the target density $p(x)$ is a $D$-dimensional multivariate Gaussian, only $D + 1$ particles are needed for Algorithm 2 to converge to the exact target parameters.
The general fixed-point condition for the dynamics (13) of the position $x_i$ for particle $i$ is given by:
$$\left(I - \mathbb{E}_{\hat{q}}\left[g(x)(x-m)^\top\right]\right)(x_i - m) - \mathbb{E}_{\hat{q}}[g(x)] = 0$$
for $i = 1, \ldots, N$. By taking the expectation over all particles, we obtain:
$$\mathbb{E}_{\hat{q}}[g(x)] = 0, \tag{A4}$$
where $\hat{q}$ is the empirical distribution of particles at the fixed point. Note that this result is independent of $N$, i.e., it is also valid for $N = 1$.
For a $D$-dimensional Gaussian target $p(x) = \mathcal{N}(\mu, \Sigma)$, we will show that the empirical mean and covariance given by the particle algorithm converge to the true mean and covariance matrix of the Gaussian when we use $N \ge D + 1$ particles. In this setting, we have $\phi(x) = \frac{1}{2}x^\top \Sigma^{-1} x - x^\top \Sigma^{-1}\mu$. For simplification, we use the precision matrix $\Lambda = \Sigma^{-1}$ and get
$$\phi(x) = \frac{1}{2}x^\top \Lambda x - x^\top \Lambda \mu.$$
The gradient $g(x)$ becomes:
$$g(x) = \Lambda(x - \mu).$$
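At the fixed points, we have that $\frac{dm}{dt}$ and $\frac{d\Gamma}{dt}$ are equal to 0. For the mean $m$:
$$\begin{aligned}
\frac{dm}{dt} = \mathbb{E}_{\hat{q}}[g(x)] &= 0 \\
\Lambda\, \mathbb{E}_{\hat{q}}[x - \mu] &= 0 \\
\Lambda m &= \Lambda \mu \\
m &= \mu
\end{aligned}$$
For the matrix $\Gamma$, we have
$$\begin{aligned}
\frac{d\Gamma}{dt} = -A\Gamma &= 0 \\
\Gamma - \mathbb{E}_{q^0}\left[g(x)(x-m)^\top\right]\Gamma &= 0 \\
\mathbb{E}_{q^0}\left[\Lambda(x-\mu)(x-m)^\top\right]\Gamma &= \Gamma \\
-2\eta_2\, \mathbb{E}_{q^0}\left[(x-m)(x-m)^\top\right]\Gamma &= \Gamma \\
\Lambda C \Gamma &= \Gamma \\
\Lambda C^2 &= C
\end{aligned}$$
where we used the result for the mean $m = \mu$, the natural parameter $\eta_2 = -\frac{1}{2}\Sigma^{-1}$ (so that $-2\eta_2 = \Lambda$), and right-multiplied by $\Gamma^\top$, as $C = \Gamma\Gamma^\top$. We can simplify this to $C = \Lambda^{-1} = \Sigma$ only if $C$ is not singular. This is true only if its rank is equal to $D$, which requires $D + 1$ particles.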
Appendix F. Proof of Theorem 2: Rates of Convergence for Gaussian Targets
Theorem A2 (2). For a target $p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$, where $x \in \mathbb{R}^D$, and $N \ge D + 1$ particles, the continuous time limit of Algorithm 2 will converge exponentially fast for both the mean and the trace of the precision matrix:
$$m^t - \mu = e^{-\Lambda t}(m^0 - \mu),$$
$$\mathrm{tr}\left((C^t)^{-1} - \Lambda\right) = e^{-2t}\, \mathrm{tr}\left((C^0)^{-1} - \Lambda\right),$$
where $m^t$ and $C^t$ are the empirical mean and covariance matrix at time $t$ and $\exp(-\Lambda t)$ is the matrix exponential.
In the following, we assume the target $p(x) = \mathcal{N}(\mu, \Sigma)$. We use the notation $\Lambda \doteq \Sigma^{-1}$ and $\delta C^t = C^t - \Sigma$.
Appendix F.1. Convergence of the Mean
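Given our target $p(x)$, similarly to Appendix E we have $g(x) = \Lambda(x - \mu)$, where $\eta_1 = \Sigma^{-1}\mu$ and $\eta_2 = -\frac{1}{2}\Sigma^{-1}$. This transforms the first of Equations (11) into
$$\frac{dm}{dt} = -\Lambda\left(\mathbb{E}_{\hat{q}}[x] - \mu\right) = -\Lambda(m - \mu).$$
If we now consider the error on $m$, $\delta m = m - \mu$, we obtain:
$$\frac{d\,\delta m}{dt} = \frac{dm}{dt} = -\Lambda(m - \mu) = -\Lambda\, \delta m.$$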
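Therefore, the mean converges exponentially fast to the true mean. Since each eigen-direction of $\Lambda$ decays at a rate given by its own eigenvalue, the asymptotic rate is governed by the slowest mode, i.e., the smallest eigenvalue of $\Lambda$, which is the inverse of the largest eigenvalue of $\Sigma$, $\lambda_{\max}$.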
Appendix F.2. Convergence of the Covariance Matrix
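Letting $z = x - m$, we have from Equation (5) that
$$\frac{dz}{dt} = -Az,$$
where $A = \mathbb{E}_{q^0}\left[g(x)z^\top\right] - I$. This expectation can further be simplified as
$$\mathbb{E}_{\hat{q}}\left[\Lambda(x - \mu)z^\top\right] = \Lambda C, \tag{A5}$$
where $q \sim \mathcal{N}(m, C)$. Hence, we have the exact result
$$\frac{dC}{dt} = (I - \Lambda C)C + C(I - C\Lambda). \tag{A6}$$
We know that the optimal target is $C = \Sigma$. Therefore, we define the error $\delta C = C - \Sigma$. Linearizing Equation (A6) gives us
$$\begin{aligned}
\frac{d\,\delta C}{dt} = \frac{dC}{dt} &= \left(I - \Lambda(\delta C + \Sigma)\right)(\delta C + \Sigma) + (\delta C + \Sigma)\left(I - (\delta C + \Sigma)\Lambda\right) \\
&= -\Lambda\, \delta C(\delta C + \Sigma) - (\delta C + \Sigma)\delta C\, \Lambda \\
&\approx -\Lambda\, \delta C\, \Sigma - \Sigma\, \delta C\, \Lambda.
\end{aligned}$$
We were not yet able to find a general solution of this equation, but we can obtain a simple result for the trace $y_t \doteq \mathrm{tr}(\delta C)$ at time $t$:
$$\frac{dy_t}{dt} \simeq -2y_t.$$
We, therefore, have an asymptotic linear convergence, $y_t \propto e^{-2t}y_0$, which is independent of the parameters of the Gaussian model.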
We can also equivalently obtain a non-asymptotic estimate of a specific error measure for the precision matrix. Using Equation (A6), we have the following dynamics for the precision $C^{-1}$:
$$\frac{dC^{-1}}{dt} = -C^{-1}\frac{dC}{dt}C^{-1} = -C^{-1}(I - \Lambda C) - (I - C\Lambda)C^{-1}.$$
Taking the trace:
$$\frac{d\, \mathrm{tr}(C^{-1})}{dt} = -2\, \mathrm{tr}(C^{-1}) + 2\, \mathrm{tr}(\Lambda)$$
$$\frac{d\, \mathrm{tr}(C^{-1} - \Lambda)}{dt} = -2\, \mathrm{tr}(C^{-1} - \Lambda)$$
Hence we get the following exact result:
$$\mathrm{tr}\left((C^t)^{-1} - \Lambda\right) = e^{-2t}\, \mathrm{tr}\left((C^0)^{-1} - \Lambda\right),$$
which is again independent of the parameters of the Gaussian model.
Additionally, this tells us that if the covariance $C$ is non-singular at time $t = 0$, it will remain non-singular for all $t$ (otherwise $\mathrm{tr}(C^{-1})$ would be infinite). Hence, if we start with $N > D$ particles with a proper empirical covariance, they cannot collapse to make $C$ singular.
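The exact exponential decay of $\mathrm{tr}(C^{-1} - \Lambda)$ can be verified by integrating Equation (A6) numerically. The sketch below uses a classical fourth-order Runge-Kutta scheme; the precision matrix, the initial covariance and the step size are arbitrary values chosen for illustration.

```python
import numpy as np

Lam = np.array([[2.0, 0.5], [0.5, 1.0]])      # target precision (arbitrary)
C0 = np.array([[1.5, -0.3], [-0.3, 0.7]])     # initial covariance (arbitrary, positive definite)
I = np.eye(2)

def dC(C):
    """Right-hand side of Equation (A6)."""
    return (I - Lam @ C) @ C + C @ (I - C @ Lam)

def rk4(C, h):
    """One classical Runge-Kutta step of size h."""
    k1 = dC(C)
    k2 = dC(C + 0.5 * h * k1)
    k3 = dC(C + 0.5 * h * k2)
    k4 = dC(C + h * k3)
    return C + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

def err(C):
    """Error measure tr(C^{-1} - Lam)."""
    return np.trace(np.linalg.inv(C) - Lam)

h, T = 1e-3, 1.0
C = C0.copy()
for _ in range(int(round(T / h))):
    C = rk4(C, h)

predicted = np.exp(-2 * T) * err(C0)          # closed-form decay from Theorem 2
observed = err(C)                             # value obtained by integrating (A6)
```

The two quantities should agree up to the integration error of the scheme, whatever the choice of $\Lambda$.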
Appendix F.3. Convergence of the Trace of the Covariance
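The asymptotic result on traces obtained previously can be turned into an exact inequality. We have
$$\frac{d\,\delta C}{dt} = -\Lambda\, \delta C\, \Sigma - \Sigma\, \delta C\, \Lambda - \Lambda(\delta C)^2 - (\delta C)^2\Lambda.$$
Taking the trace, we get
$$\frac{d\, \mathrm{tr}(\delta C)}{dt} = -2\, \mathrm{tr}(\delta C) - 2\, \mathrm{tr}(\delta C\, \Lambda\, \delta C).$$
Since $\delta C\, \Lambda\, \delta C$ is positive semi-definite, we have $-2\, \mathrm{tr}(\delta C\, \Lambda\, \delta C) \le 0$ and thus
$$\frac{d\, \mathrm{tr}(\delta C)}{dt} \le -2\, \mathrm{tr}(\delta C),$$
leading to
$$\mathrm{tr}(\delta C^t) \le \mathrm{tr}(\delta C^0)\, e^{-2t}$$
by using Grönwall's lemma [46]: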
Lemma A1 (Grönwall). For an interval $I_0 = [0, \infty)$ and a given function $f$ differentiable everywhere in $I_0$ and satisfying
$$f'(t) \le \beta(t) f(t), \quad t \in I_0,$$
$f$ is bounded by the solution of the corresponding differential equation $g'(t) = \beta(t) g(t)$:
$$f(t) \le f(0)\exp\left(\int_0^t \beta(s)\, ds\right), \quad t \in I_0.$$
The bound is nontrivial only if $\mathrm{tr}(\delta C) \ge 0$. This would be a natural assumption for a Bayesian model, if $C^0$ is the prior covariance and the eigenvalues of $C^t$ at $t = \infty$ (corresponding to the posterior) are reduced by the data.
Appendix F.4. Decay of Fluctuation Part of the Free Energy
Still focusing on the Gaussian model, we can further derive a bound on the free energy. It is easy to see that for the Gaussian case, the free energy in Equation (4) separates into a sum of two terms. The first one depends on the mean $m^t$ only and the second one only on the fluctuations (i.e., $C^t$).
We will consider the second, nontrivial part only. We assume that the covariance matrix is nonsingular (corresponding to $N > D$). The fluctuation part of the free energy (minus its minimum) is given by
$$\mathcal{F}_{fl} = -\frac{1}{2}\ln|I - B| - \frac{1}{2}\mathrm{tr}(B),$$
where we have introduced the matrix $B \doteq I - \Lambda C$. One can show that its eigenvalues are real and are upper bounded by 1. First, we can show from the equations of motion that
$$\frac{d\mathcal{F}_{fl}}{dt} = -\mathrm{tr}(BB^\top). \tag{A7}$$
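Second, using the elementary bound $-\ln(1 - u) \le \frac{u}{1 - u}$, valid for $u \le 1$ and applied to the eigenvalues of $B$, yields
$$\begin{aligned}
\mathcal{F}_{fl} &\le \frac{1}{2}\mathrm{tr}\left(B(I - B)^{-1} - B\right) \\
&= \frac{1}{2}\mathrm{tr}\left(B(I - B)^{-1} - B(I - B)(I - B)^{-1}\right) \\
&= \frac{1}{2}\mathrm{tr}\left(B^2(I - B)^{-1}\right) \\
&= \frac{1}{2}\mathrm{tr}\left(B^2 C^{-1}\Lambda^{-1}\right) \le \frac{1}{2}\mathrm{tr}\left(B^\top \Lambda^{-1} B\, C^{-1}\right).
\end{aligned}$$
The last two steps used the definition $B = I - \Lambda C$. Since $B^\top \Lambda^{-1} B$ and $C^{-1}$ are both positive definite, we can bound the last term by (see [47], Theorem 6.5)
$$\mathcal{F}_{fl} \le \frac{1}{2}\mathrm{tr}\left(B^\top \Lambda^{-1} B\right)\mathrm{tr}(C^{-1}) \le \frac{1}{2}\mathrm{tr}(BB^\top)\, \mathrm{tr}(\Lambda^{-1})\, \mathrm{tr}(C^{-1}),$$
where, in the last step, we have bounded the trace of a product of p.d. matrices a second time.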
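Combining with Equation (A7), we show that
$$\frac{d\mathcal{F}_{fl}}{dt} \le -\frac{2\mathcal{F}_{fl}}{\mathrm{tr}(\Lambda^{-1})\, \mathrm{tr}(C^{-1})}.$$
We can plug in our result from Theorem 2:
$$\begin{aligned}
\mathrm{tr}(C^{-1}) &= \mathrm{tr}(\Lambda) + \mathrm{tr}(C^{-1} - \Lambda) \\
&= \mathrm{tr}(\Lambda) + e^{-2t}\, \mathrm{tr}\left((C^0)^{-1} - \Lambda\right) \\
&\le \mathrm{tr}(\Lambda) + e^{-2t}\left|\mathrm{tr}\left((C^0)^{-1} - \Lambda\right)\right| \\
&\le \mathrm{tr}(\Lambda) + \left|\mathrm{tr}\left((C^0)^{-1} - \Lambda\right)\right|.
\end{aligned}$$
We can plug this in and use Grönwall's Lemma A1 to get an exponential bound
$$\mathcal{F}_{fl}(C^t) \le \mathcal{F}_{fl}(C^0)\exp\left(-\frac{2t}{\mathrm{tr}(\Lambda^{-1})\left(\mathrm{tr}(\Lambda) + \left|\mathrm{tr}\left((C^0)^{-1} - \Lambda\right)\right|\right)}\right).$$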
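Appendix F.5. Asymptotic Decay of the Free Energy
For large times $t$, we can do better. Let us analyse the asymptotic decay constant $\lambda_{\mathrm{free}}$, with $\mathcal{F}_{fl} \simeq e^{-\lambda_{\mathrm{free}} t}$, defined by
$$\lambda_{\mathrm{free}} \doteq -\lim_{t \to \infty}\frac{d \ln(\mathcal{F}_{fl})}{dt} = -\lim_{t \to \infty}\frac{d\mathcal{F}_{fl}/dt}{\mathcal{F}_{fl}} = \lim_{t \to \infty}\frac{\mathrm{tr}(BB^\top)}{-\frac{1}{2}\ln|I - B| - \frac{1}{2}\mathrm{tr}(B)} \ge \lim_{t \to \infty}\frac{\mathrm{tr}(B^2)}{-\frac{1}{2}\ln|I - B| - \frac{1}{2}\mathrm{tr}(B)}.$$
In the last inequality, we used $\mathrm{tr}(BB^\top) \ge \mathrm{tr}(B^2)$. Everything is expressed by traces of functions of $B$, and thus by its eigenvalues. Since $B \to 0$ as $t \to \infty$ (this applies also to its eigenvalues $u$), we can use the Taylor expansion $\ln(1 - u) + u = -u^2/2 + O(u^3)$ to show that
$$\lambda_{\mathrm{free}} \ge 4,$$
which is independent of $\Lambda$.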
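The step from the Taylor expansion to the constant 4 can be spelled out as follows, writing $u_i$ for the eigenvalues of $B$. This is a sketch of the intermediate algebra implied by the argument above, not a formula taken from the original text.

```latex
-\tfrac{1}{2}\ln|I - B| - \tfrac{1}{2}\mathrm{tr}(B)
  = -\tfrac{1}{2}\sum_i \bigl(\ln(1 - u_i) + u_i\bigr)
  = \tfrac{1}{4}\sum_i u_i^2 + O(u^3)
  = \tfrac{1}{4}\mathrm{tr}(B^2) + O(u^3),
\qquad\Longrightarrow\qquad
\lim_{t\to\infty}\frac{\mathrm{tr}(B^2)}{\tfrac{1}{4}\mathrm{tr}(B^2) + O(u^3)} = 4 .
```

The first equality uses $\ln|I - B| = \sum_i \ln(1 - u_i)$ and $\mathrm{tr}(B) = \sum_i u_i$, which hold because the eigenvalues of $B$ are real.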
Appendix G. Proof of Theorem 3: Fixed-Points for Gaussian Model (N≤D)
Theorem A3 (3). Given a $D$-dimensional multivariate Gaussian target density $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$, using Algorithm 2 with $N < D + 1$ particles, the empirical mean converges to the exact mean $\mu$. The $N - 1$ non-zero eigenvalues of $C^t$ converge to a subset of the target covariance $\Sigma$ spectrum. Furthermore, the global minimum of the regularised version $\tilde{\mathcal{F}}$ of the free energy (17) corresponds to the largest eigenvalues of $\Sigma$.
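Applying Equation (A4) to our fixed-point equation, we obtain
$$\left(I - \mathbb{E}_{\hat{q}}\left[g(x)(x - m)^\top\right]\right)(x_i - m) = 0, \quad \forall i = 1, \ldots, N.$$
Hence, the centered positions of the particles, $S = \{x_i - m\}_{i=1}^N$, are all eigenvectors of the matrix $\mathbb{E}_{\hat{q}}\left[g(x)(x - m)^\top\right]$ with eigenvalue 1. $S$ spans an $(N - 1)$-dimensional space (we have $\sum_{i=1}^N (x_i - m) = 0$).
If we specialise to a Gaussian target $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$ (and $\Lambda = \Sigma^{-1}$), we have $g(x) = \Lambda(x - \mu)$ and can reuse the result from Equation (A5):
$$\mathbb{E}_{\hat{q}}\left[g(x)(x - m)^\top\right] = \Lambda\, \mathbb{E}_{\hat{q}}\left[(x - m)(x - m)^\top\right] = \Lambda C.$$
Using the equality above, we get:
$$\Lambda C(x_i - m) = (x_i - m) \;\Rightarrow\; C(x_i - m) = \Sigma(x_i - m), \quad \forall i = 1, \ldots, N,$$
which shows that the obtained low-rank covariance $C$ and the target covariance $\Sigma$ have $N - 1$ eigenvectors and eigenvalues in common.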
However, are these the largest ones? We look at the modified free energy (17) (ignoring the contribution of the mean):
$$\min \tilde{\mathcal{F}} = \min\left(-\frac{1}{2}\sum_{i : \lambda_i > 0} \ln \lambda_i + \mathrm{tr}(\Lambda C)\right),$$
where $\lambda_i$ are the eigenvalues of the empirical covariance $C$. We first note that $\mathrm{tr}(\Lambda C) = N - 1$, independent of which eigenvalues are obtained at the fixed point. This is easily seen by the following argument: if we use the index set $\mathcal{I}$ for the common eigenvectors $e_i$ and eigenvalues $\lambda_i$, $i \in \mathcal{I}$, we can write
$$C = \sum_{i \in \mathcal{I}} e_i \lambda_i e_i^\top, \qquad \Sigma = \sum_{i} e_i \lambda_i e_i^\top.$$
From this we obtain
$$\mathrm{tr}(\Lambda C) = \mathrm{tr}\left(\sum_{i \in \mathcal{I}} e_i \lambda_i^{-1}\lambda_i e_i^\top\right) = N - 1.$$
From this result we obtain
$$\min \tilde{\mathcal{F}} = -\max\left(\frac{1}{2}\sum_{i : \lambda_i > 0} \ln \lambda_i - (N - 1)\right).$$
The term $N - 1$ is a constant, but the first term makes a difference: the absolute minimum of $\tilde{\mathcal{F}}$ is achieved when the $\lambda_i$ are the $N - 1$ largest eigenvalues of $\Sigma$. Our simulations empirically show that the algorithm usually converges to the absolute minimum.
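Both claims can be illustrated by brute force on a small example: for every subset of $N - 1$ eigenpairs of $\Sigma$, build the corresponding low-rank $C$, check that $\mathrm{tr}(\Lambda C) = N - 1$, and find which subset minimises the regularised free energy. The target covariance below is random and purely illustrative, and the variable names are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
D, N = 5, 3                                      # N - 1 = 2 shared eigenpairs
M = rng.normal(size=(D, D))
Sigma = M @ M.T + D * np.eye(D)                  # target covariance (arbitrary, positive definite)
Lam = np.linalg.inv(Sigma)
lam, E = np.linalg.eigh(Sigma)                   # eigenvalues in ascending order

results = {}
for idx in itertools.combinations(range(D), N - 1):
    cols = list(idx)
    C = (E[:, cols] * lam[cols]) @ E[:, cols].T  # low-rank C sharing eigenpairs with Sigma
    trace_term = np.trace(Lam @ C)               # expected to equal N - 1 for every subset
    F_reg = -0.5 * np.sum(np.log(lam[cols])) + trace_term
    results[idx] = (trace_term, F_reg)

# Subset of eigenpairs that minimises the regularised free energy
best_subset = min(results, key=lambda k: results[k][1])
```

Because `eigh` returns eigenvalues in ascending order, the minimiser is expected to be the last $N - 1$ indices, i.e., the largest eigenvalues of $\Sigma$.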
Appendix H. Dimension-Wise Optimizers
Here, we list some of the most popular optimizers and their dimension-wise versions. In all algorithms, we consider $\phi$, the matrix created by the concatenation of the flow of each particle: $\phi = [\phi_1, \ldots, \phi_N]$, where $\phi_n = \phi(x_n)$. We additionally use the notation $\phi_{n,d}$ for the $d$-th dimension of the flow of the $n$-th particle. The original algorithms and their modified versions differ in the accumulator $v$, which is indexed by dimension only and averaged over particles in the dimension-wise versions (these differences were put in red in the original typesetting).
Appendix H.1. ADAM
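The ADAM algorithm is given by:
Algorithm A1: ADAM
Input: $\phi^t, m^{t-1}, v^{t-1}, \beta_1, \beta_2, \eta$
Output: $\Delta$
$m^t_{n,d} = \beta_1 m^{t-1}_{n,d} + (1 - \beta_1)\phi^t_{n,d}$
$v^t_{n,d} = \beta_2 v^{t-1}_{n,d} + (1 - \beta_2)\left(\phi^t_{n,d}\right)^2$
$\Delta_{n,d} = \dfrac{\eta\, m^t_{n,d}}{(1 - \beta_1^t)\left(\sqrt{v^t_{n,d}(1 - \beta_2^t)^{-1}} + \epsilon\right)}$

Algorithm A2: Dimension-wise ADAM
Input: $\phi^t, m^{t-1}, v^{t-1}, \beta_1, \beta_2, \eta$
Output: $\Delta$
$m^t_{n,d} = \beta_1 m^{t-1}_{n,d} + (1 - \beta_1)\phi^t_{n,d}$
$v^t_{d} = \beta_2 v^{t-1}_{d} + (1 - \beta_2)\frac{1}{N}\sum_{n=1}^{N}\left(\phi^t_{n,d}\right)^2$
$\Delta_{n,d} = \dfrac{\eta\, m^t_{n,d}}{(1 - \beta_1^t)\left(\sqrt{v^t_{d}(1 - \beta_2^t)^{-1}} + \epsilon\right)}$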
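The practical difference between the two variants can be seen on a single step. The sketch below is one possible NumPy reading of Algorithms A1 and A2, with particles stored as columns of a $D \times N$ matrix; the function names, default hyperparameters and the test flow are assumptions for illustration. It then fits each update as an affine function of the particle positions: a vanishing residual means the update is still a linear flow.

```python
import numpy as np

def adam_step(phi, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8, dimension_wise=True):
    """One ADAM update for a (D, N) flow matrix phi; t is the 1-based iteration.

    With dimension_wise=True, v has shape (D, 1) and accumulates the squared
    flow averaged over particles; otherwise v has shape (D, N).
    """
    m = b1 * m + (1 - b1) * phi
    sq = phi ** 2
    if dimension_wise:
        sq = sq.mean(axis=1, keepdims=True)      # average over the N particles
    v = b2 * v + (1 - b2) * sq
    delta = eta * m / ((1 - b1 ** t) * (np.sqrt(v / (1 - b2 ** t)) + eps))
    return delta, m, v

rng = np.random.default_rng(4)
D, N = 3, 6
X = rng.normal(size=(D, N))                      # columns are particles
A = rng.normal(size=(D, D))
phi = A @ (X - X.mean(axis=1, keepdims=True))    # a linear flow (mean term omitted for brevity)

delta_dw, _, _ = adam_step(phi, np.zeros((D, N)), np.zeros((D, 1)), t=1)
delta_ew, _, _ = adam_step(phi, np.zeros((D, N)), np.zeros((D, N)), t=1, dimension_wise=False)

def affine_residual(delta, X):
    """Least-squares residual of fitting delta as an affine function of the particles."""
    design = np.vstack([X, np.ones((1, X.shape[1]))]).T   # shape (N, D + 1)
    coef, *_ = np.linalg.lstsq(design, delta.T, rcond=None)
    return np.linalg.norm(design @ coef - delta.T)

res_dw = affine_residual(delta_dw, X)            # dimension-wise: update stays affine
res_ew = affine_residual(delta_ew, X)            # element-wise: update is no longer affine
```

On the first step the element-wise variant reduces to roughly $\eta\, \mathrm{sign}(\phi)$, which discards the linear structure of the flow, whereas the dimension-wise variant only rescales each dimension.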
Appendix H.2. AdaGrad
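The AdaGrad algorithm is given by:
Algorithm A3: AdaGrad
Input: $\phi^t, v^{t-1}, \eta$
Output: $\Delta$
$v^t_{n,d} = v^{t-1}_{n,d} + \left(\phi^t_{n,d}\right)^2$
$\Delta_{n,d} = \dfrac{\eta\, \phi^t_{n,d}}{\sqrt{v^t_{n,d}} + \epsilon}$

Algorithm A4: Dimension-wise AdaGrad
Input: $\phi^t, v^{t-1}, \eta$
Output: $\Delta$
$v^t_{d} = v^{t-1}_{d} + \frac{1}{N}\sum_{n=1}^{N}\left(\phi^t_{n,d}\right)^2$
$\Delta_{n,d} = \dfrac{\eta\, \phi^t_{n,d}}{\sqrt{v^t_{d}} + \epsilon}$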
Appendix H.3. RMSProp
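The RMSProp algorithm is given by:
Algorithm A5: RMSProp
Input: $\phi^t, v^{t-1}, \rho, \eta$
Output: $\Delta$
$v^t_{n,d} = \rho\, v^{t-1}_{n,d} + (1 - \rho)\left(\phi^t_{n,d}\right)^2$
$\Delta_{n,d} = \dfrac{\eta\, \phi^t_{n,d}}{\sqrt{v^t_{n,d}} + \epsilon}$

Algorithm A6: Dimension-wise RMSProp
Input: $\phi^t, v^{t-1}, \rho, \eta$
Output: $\Delta$
$v^t_{d} = \rho\, v^{t-1}_{d} + (1 - \rho)\frac{1}{N}\sum_{n=1}^{N}\left(\phi^t_{n,d}\right)^2$
$\Delta_{n,d} = \dfrac{\eta\, \phi^t_{n,d}}{\sqrt{v^t_{d}} + \epsilon}$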
Appendix I. Additional Figures
Appendix I.1. Bayesian Logistic Regression
Similarly to the previous section, we also show results with the RMSProp optimizer with learning rate $1 \times 10^{-4}$.
(a) Mean-field approximation (b) No mean-field approximation
Figure A1. Similarly to Figure 6, we show the average negative log-likelihood on a test set over 10 runs against training time on different datasets for a Bayesian logistic regression problem. The dashed curve represents the low-rank approximation with RMSProp for methods based on stochastic estimators.
Appendix I.2. Bayesian Neural Network
Figure A2.
Convergence of the classification error and average negative log-likelihood as a function
of time.
Figure A3. Accuracy vs. confidence. Every test sample is clustered according to its highest predictive probability. The accuracy of each cluster is then computed. A perfectly calibrated estimator would return the identity.
References
1. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 2016, 104, 148–175.
2. Settles, B. Active Learning Literature Survey; Computer Sciences Technical Report 1648; University of Wisconsin–Madison: Madison, WI, USA, 2009.
3. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018.
4. Bardenet, R.; Doucet, A.; Holmes, C. On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 2017, 18, 1515–1557.
5. Cowles, M.K.; Carlin, B.P. Markov chain Monte Carlo convergence diagnostics: A comparative review. J. Am. Stat. Assoc. 1996, 91, 883–904.
6. Barber, D.; Bishop, C.M. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1998; pp. 395–401.
7. Graves, A. Practical Variational Inference for Neural Networks. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; Volume 24, pp. 2348–2356.
8. Ranganath, R.; Gerrish, S.; Blei, D. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 814–822.
9. Liu, Q.; Lee, J.; Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 276–284.
10. Liu, Q.; Wang, D. Stein variational gradient descent as moment matching. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 32, pp. 8868–8877.
11. Zhuo, J.; Liu, C.; Shi, J.; Zhu, J.; Chen, N.; Zhang, B. Message Passing Stein Variational Gradient Descent. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 6018–6027.
12. Opper, M.; Archambeau, C. The variational Gaussian approximation revisited. Neural Comput. 2009, 21, 786–792.
13. Challis, E.; Barber, D. Gaussian Kullback-Leibler approximate inference. J. Mach. Learn. Res. 2013, 14, 2239–2286.
14. Titsias, M.; Lázaro-Gredilla, M. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1971–1979.
15. Ong, V.M.H.; Nott, D.J.; Smith, M.S. Gaussian variational approximation with a factor covariance structure. J. Comput. Graph. Stat. 2018, 27, 465–478.
16. Tan, L.S.; Nott, D.J. Gaussian variational approximation with sparse precision matrices. Stat. Comput. 2018, 28, 259–275.
17. Lin, W.; Schmidt, M.; Khan, M.E. Handling the Positive-Definite Constraint in the Bayesian Learning Rule. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; Volume 119, pp. 6116–6126.
18. Hinton, G.E.; van Camp, D. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; COLT '93; Association for Computing Machinery: New York, NY, USA, 1993; pp. 5–13.
19. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
20. Amari, S.I. Natural Gradient Works Efficiently in Learning. Neural Comput. 1998, 10, 251–276.
21. Khan, M.E.; Nielsen, D. Fast yet simple natural-gradient descent for variational inference in complex models. In Proceedings of the International Symposium on Information Theory and Its Applications (ISITA), Singapore, 28–31 October 2018; pp. 31–35.
22. Lin, W.; Khan, M.E.; Schmidt, M. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3992–4002.
23. Salimbeni, H.; Eleftheriadis, S.; Hensman, J. Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Lanzarote, Canary Islands, 9–11 April 2018; pp. 689–697.
24. Liu, Q.; Wang, D. Stein variational gradient descent: A general purpose Bayesian inference algorithm. arXiv 2016, arXiv:1608.04471.
25. Ba, J.; Erdogdu, M.A.; Ghassemi, M.; Suzuki, T.; Sun, S.; Wu, D.; Zhang, T. Towards Characterizing the High-dimensional Bias of Kernel-based Particle Inference Algorithms. In Proceedings of the 2nd Symposium on Advances in Approximate Bayesian Inference, Vancouver, BC, Canada, 8 December 2019.
26. Tomczak, M.; Swaroop, S.; Turner, R. Efficient Low Rank Gaussian Variational Inference for Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33.
27. Maddox, W.J.; Izmailov, P.; Garipov, T.; Vetrov, D.P.; Wilson, A.G. A simple baseline for Bayesian uncertainty in deep learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13153–13164.
28. Evensen, G. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J. Geophys. Res. Oceans 1994, 99, 10143–10162.
29.
Rezende, D.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on
Machine Learning, Lille, France, 7–9 July 2015; pp. 1530–1538.
30.
Chen, R.T.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural ordinary differential equations. In Proceedings of the 32nd
International Conference on Neural Information Processing, Montréal, QC, Canada, 3–8 December 2018; pp. 6572–6583.
31. Ingersoll, J.E. Theory of Financial Decision Making; Rowman & Littlefield: Lanham, MD, USA, 1987; Volume 3.
32.
Barfoot, T.D.; Forbes, J.R.; Yoon, D.J. Exactly sparse gaussian variational inference with application to derivative-free batch
nonlinear state estimation. Int. J. Robot. Res. 2020,39, 1473–1502. [CrossRef]
33.
Korba, A.; Salim, A.; Arbel, M.; Luise, G.; Gretton, A. A Non-Asymptotic Analysis for Stein Variational Gradient Descent. In
Proceedings of the 32nd International Conference on Neural Information Processing, Virtual, 6–12 December 2020; Volume 33.
pp. 4672–4682.
34.
Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer Science & Business Media:
Berlin/Heidelberg, Germany, 2011.
35.
Zaki, N.; Galy-Fajou, T.; Opper, M. Evidence Estimation by Kullback-Leibler Integration for Flow-Based Methods. In Proceedings
of the Third Symposium on Advances in Approximate Bayesian Inference, Virtual Event, January–February 2021.
36.
Bezanson, J.; Edelman, A.; Karpinski, S.; Shah, V.B. Julia: A fresh approach to numerical computing. SIAM Rev. 2017, 59, 65–98. [CrossRef]
37.
Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop, Coursera: Neural Networks for Machine Learning; Technical Report; University of
Toronto: Toronto, ON, USA, 2012.
38.
Zhang, G.; Li, L.; Nado, Z.; Martens, J.; Sachdeva, S.; Dahl, G.; Shallue, C.; Grosse, R.B. Which Algorithmic Choices Matter at
Which Batch Sizes? Insights From a Noisy Quadratic Model. In Advances in Neural Information Processing Systems; Wallach, H.,
Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA 2019;
Volume 32, pp. 8196–8207.
39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
40.
Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/datasets.php
(accessed on 28 July 2021).
41. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375.
42.
LeCun, Y. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 20
July 2021).
43.
Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International
Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330.
44.
Liu, C.; Zhuo, J.; Cheng, P.; Zhang, R.; Zhu, J. Understanding and accelerating particle-based variational inference. In Proceedings
of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 4082–4092.
45.
Zhu, M.H.; Liu, C.; Zhu, J. Variance Reduction and Quasi-Newton for Particle-Based Variational Inference. In Proceedings of the
37th International Conference on Machine Learning, Virtual, 13–18 July 2020.
46.
Gronwall, T.H. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Ann.
Math. 1919,20, 292–296. [CrossRef]
47. Zhang, F. Matrix Theory: Basic Results and Techniques; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
7 Discussions and extensions
This chapter presents discussions and extensions of the models and ideas presented in Chapters 3, 4, and 5. All figures are reproducible by running the examples provided in the GitHub repository https://github.com/theogf/Phd-Thesis. Section 7.1 considers how augmentations can be generalized further and what analysis is needed to fully understand the improvement they bring. Section 7.2 presents new augmented models for GP regression with heteroscedastic noise. Section 7.3 explores whether HMC can be used with augmented models. Section 7.4 shows how the multi-class model of Chapter 4 can be improved in multiple ways. Section 7.5 presents a way to combine inducing points and sampling using augmentations. Finally, Section 7.6 considers more broadly the limitations of our augmentation approach.
7.1 Further generalizations and understanding
The works presented in this thesis only scratched the surface of how helpful mixtures and representations
are.
Moment Generating Functions
We are still exploring ways to identify larger classes of functions that can be written as scale mixtures or hierarchical mixtures. As already mentioned in Chapters 4 and 5, the connection with the Moment Generating Function (MGF) of a distribution is a promising direction. In Chapter 5, we identified augmentable functions as being a transformed MGF of the augmented variables:
$$\varphi(x^2) = \int_0^\infty e^{-x^2\omega}\, p(\omega)\, d\omega,\ \forall x \in \mathbb{R} \quad\equiv\quad \mathrm{MGF}_{p(\omega)}(x) = \varphi(-\sqrt{x}),\ \forall x \geq 0.$$
However, this is limited to the MGF of continuous variables with a square transformation on the inputs. We can extend the notion of augmentable functions to the MGF of discrete and multivariate distributions, where the domain of $\omega$ is not always $\mathbb{R}^+$. For example, we used the MGF of a Poisson distribution in Chapter 4:
$$\exp\left(\lambda(e^x - 1)\right) = \sum_{n=0}^{\infty} e^{nx}\, \mathrm{Po}(n \mid \lambda).$$
It is not a scale mixture of Gaussians, but with the right variable transformations, it can still be useful. The MGF of a Poisson is known, but we could also consider arbitrary MGFs since we are able to sample from a distribution given its Laplace transform only [47].
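The Poisson MGF identity used here can be checked numerically. The following is a minimal sketch, not taken from the thesis code: the function names and the truncation at 100 terms are illustrative assumptions.

```python
import math


def poisson_mgf_series(x, lam, n_terms=100):
    """Truncated series sum_{n=0}^{n_terms-1} e^{n x} Po(n | lam).

    Each term is assembled in log space so that n! never overflows.
    """
    total = 0.0
    for n in range(n_terms):
        log_term = n * x + n * math.log(lam) - lam - math.lgamma(n + 1)
        total += math.exp(log_term)
    return total


def poisson_mgf_closed_form(x, lam):
    """Closed-form MGF of a Poisson variable: exp(lam * (e^x - 1))."""
    return math.exp(lam * (math.exp(x) - 1.0))


series = poisson_mgf_series(0.3, 2.5)
closed = poisson_mgf_closed_form(0.3, 2.5)
print("relative gap:", abs(series - closed) / closed)
```

The truncation level only needs to be large compared with $\lambda e^x$, since the terms decay factorially beyond that point.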
The MGF is also an interesting tool for creating hierarchical models. Since the MGF is of the form $\sum_x e^{tx} p(x)$ or $\int e^{tx} p(x)\, dx$, by setting $t = \log \sigma(f)$, we get scale mixtures of the form $\sum_x \sigma^x(f)$. Thanks to the property that $\sigma^n(f)$ is augmentable for any $n \in \mathbb{R}^+$, we can use Pólya-Gamma variables and obtain a conditionally conjugate model for a GP. Additional examples of such constructions are shown in this chapter in Sections 7.2 and 7.4.
Marginalizing out augmented variables
A potential improvement for augmented models is the identification of marginalizable augmented variables that keep the conditional conjugacy of the model. For example, in the multi-class model from Chapter 4, the augmented variable $\lambda$ can be marginalized out, as shown in Section 7.4. This reduces the dimensionality of the model and avoids tricky situations like the inner-loop updates in Chapter 4. The marginalization step can be avoided altogether by identifying the right MGF from the start. As shown in Section 7.2.2, switching between marginalized and augmented models gives great inference flexibility.
Convergence speed analysis
An unfinished piece of work, despite our attempts, is to establish convergence rates (error as a function of the number of iterations) for the CAVI algorithm and to derive theoretical bounds on the intra-chain correlation and the ergodicity of the Gibbs sampler. Experimental results indicate that the error on the variational free energy (and on the variational parameters) decreases as $\|F^* - F_t\| \propto C_0 e^{-ct}$, where $t$ is the number of iterations, but we did not manage to write a formal proof. We show the decay of both the variational free energy and the variational parameters for different examples in Figure 7.1.
[Figure 7.1 panels: one row per likelihood (Bernoulli, Student-T, Laplace, Poisson); columns $|m_t - m^*|$, $|S_t - S^*|$, and $|F_t - F^*|$ on a log scale against the iteration $t$. Fitted exponents annotated per row: $c = -2.28, -2.28, -3.62$; $c = -0.99, -0.99, -1.39$; $c = -0.53, -0.53, -1.22$; $c = -0.52, -0.52, -1.24$.]
Figure 7.1: Convergence plot of the CAVI updates for a one-dimensional toy example with different likelihoods (y-axis in log scale). The solid blue line shows the empirical error over the number of iterations and the dashed green line shows the fit of the function $C_0 \exp(ct)$. The exponential coefficient is written down explicitly for each likelihood.
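A fit of the form $C_0 \exp(ct)$ can be obtained by ordinary least squares on the logarithm of the error. The sketch below is one possible way to do this and is not necessarily the procedure used to produce the figure; the function name and the synthetic error sequence are assumptions made for illustration.

```python
import math


def fit_exponential_decay(errors):
    """Least-squares fit of log(error_t) = log(C0) + c * t for t = 0, 1, ...

    Returns the pair (C0, c). All errors must be strictly positive.
    """
    ts = list(range(len(errors)))
    logs = [math.log(e) for e in errors]
    n = len(ts)
    t_mean = sum(ts) / n
    log_mean = sum(logs) / n
    cov = sum((t - t_mean) * (l - log_mean) for t, l in zip(ts, logs))
    var = sum((t - t_mean) ** 2 for t in ts)
    c = cov / var
    c0 = math.exp(log_mean - c * t_mean)
    return c0, c


# Synthetic errors following an exact exponential decay (hypothetical values).
synthetic = [3.0 * math.exp(-0.99 * t) for t in range(16)]
c0, c = fit_exponential_decay(synthetic)
print(c0, c)
```

Fitting in log space weights all iterations equally on the log scale, which matches how the errors are displayed in a log-scale plot.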
7.2 Double bounds for intricate latent GPs
The multi-class model developed in Chapter 4 paves the way to work with multi-latent models and hierarchical augmentations. Based on this idea, we developed another multi-latent model on the heteroscedastic regression likelihood [58, 32]. It simultaneously models the mean and variance of a regression likelihood with two latent GPs $f$ and $g$. We consider both Gaussian and non-Gaussian likelihoods since we can stack augmentations. We start with the simplest model: the heteroscedastic Gaussian likelihood.
7.2.1 Heteroscedastic Gaussian Likelihood
A crucial model choice is the function mapping $g$ to the likelihood variance $\epsilon^2$. The exponential link, i.e. $\epsilon^2(x) = \exp(g(x))$, is the most popular; however, to be able to apply our augmentations, we use the link $\epsilon^2(x) = (\lambda\sigma(g(x)))^{-1}$. Let us look at the case of the heteroscedastic Gaussian likelihood, defined as:
$$p(y \mid f, g, \lambda) = \frac{\sqrt{\lambda\sigma(g)}}{\sqrt{2\pi}} \exp\left(-\frac{\lambda\sigma(g)(y-f)^2}{2}\right). \tag{7.1}$$
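The augmentations for this likelihood are straightforward and quite similar to the multi-class ones from Chapter 4:
$$\begin{aligned}
\exp\left(-\frac{\lambda\sigma(g)(y-f)^2}{2}\right) &= \exp\left(\lambda(\sigma(-g)-1)\frac{(y-f)^2}{2}\right) \\
&= \sum_{n=0}^{\infty} \sigma^n(-g)\, \mathrm{Po}\left(n \,\middle|\, \frac{\lambda(y-f)^2}{2}\right),
\end{aligned} \tag{7.2}$$
where we used the MGF of the Poisson distribution. Using the Pólya-Gamma augmentation and the additivity property of Pólya-Gamma variables, we get the final augmented likelihood:
$$p(y, n, \omega \mid f, g, \lambda) = \frac{\sqrt{\lambda}}{2^n\sqrt{\pi}} \exp\left(\frac{1}{2}\left(g\left(\frac{1}{2}-n\right) - g^2\omega\right)\right) \mathrm{PG}\left(\omega \,\middle|\, \frac{1}{2}+n, 0\right) \mathrm{Po}\left(n \,\middle|\, \frac{\lambda(y-f)^2}{2}\right) \tag{7.3}$$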
The interesting part about this augmented likelihood (7.3) is that although it is conditionally conjugate in $g$, $\omega$, and $n$, it is unclear how to infer $f$: it is quadratic in $g$ but not in $f$. It turns out that the Gibbs sampler for this model is very simple: we take the augmented likelihood $p(y, \omega, n \mid f, g, \lambda)$, marginalize out $n$ and $\omega$ and, as expected, we get the original likelihood (7.1), which is conditionally conjugate with $f$. The conditional $p(f \mid y, g, \lambda)$ on this likelihood is the collapsed conditional. In a Gibbs sampling scheme, this allows us to perform a collapsed step. We give all the Gibbs sampling steps in Algorithm 2. So far, we have excluded the $\lambda$ parameter from inference. By putting a Gamma prior $\mathrm{Ga}(\lambda \mid \alpha, \beta)$, where $\alpha$ is the shape and $\beta$ is the rate, the collapsed conditional is available in closed-form:
$$p(\lambda \mid f, g, y) = \mathrm{Ga}\left(\lambda \,\middle|\, \alpha + \frac{N}{2},\ \beta + \sum_{i=1}^{N} \frac{\sigma(g_i)}{2}(y_i - f_i)^2\right).$$
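As an illustration of this collapsed step, the sketch below computes the shape and rate of the Gamma conditional of $\lambda$ and draws one sample from it. The function names and toy values are assumptions for illustration only. Note that Python's `random.gammavariate` takes a scale parameter, so the rate is inverted.

```python
import math
import random


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def lambda_conditional_params(y, f, g, alpha, beta):
    """Shape and rate of Ga(lambda | alpha + N/2, beta + sum_i sigma(g_i) (y_i - f_i)^2 / 2)."""
    shape = alpha + len(y) / 2.0
    rate = beta + sum(
        0.5 * sigmoid(gi) * (yi - fi) ** 2 for yi, fi, gi in zip(y, f, g)
    )
    return shape, rate


def sample_lambda(y, f, g, alpha, beta, rng=random):
    """Draw lambda from its collapsed conditional (scale = 1 / rate)."""
    shape, rate = lambda_conditional_params(y, f, g, alpha, beta)
    return rng.gammavariate(shape, 1.0 / rate)


# Toy values (hypothetical): two observations, prior Ga(2, 1).
print(lambda_conditional_params([1.0, 2.0], [0.0, 0.0], [0.0, 0.0], 2.0, 1.0))
print(sample_lambda([1.0, 2.0], [0.0, 0.0], [0.0, 0.0], 2.0, 1.0))
```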
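As underlined in Section 2.3.2, the CAVI updates need the model's full conditionals and are not compatible with collapsed conditionals. To solve this problem, we need to reverse-engineer how CAVI updates are obtained and start with a first bound on the KL divergence:
$$\begin{aligned}
\mathrm{KL}\left(q(f)q(g) \,\|\, p(f, g \mid y)\right) &\leq \min_{q(g)} -\mathbb{E}_{q(g)}\left[\mathbb{E}_{q(f)}[\log p(y \mid f, g)]\right] + \mathrm{KL}\left(q(f)q(g) \,\|\, p(f)p(g)\right) - \log p(y) \\
&= \min_{q(g)} -\mathbb{E}_{q(g)}\left[\log p(y \mid g, \mu_f^*, \Sigma_f^*)\right] + \mathrm{KL}\left(q(g) \,\|\, p(g)\right) + \mathrm{KL}_f^* - \log p(y) = F_1.
\end{aligned}$$
$p(y \mid g, \mu_f^*, \Sigma_f^*)$ and $\mathrm{KL}_f^*$ are expectations computed with the optimal $q^*(f) = \mathcal{N}(f \mid \mu_f^*, \Sigma_f^*)$. We can now use the augmentations from Equation (7.3) on the expected log-likelihood, where we replaced $(y_i - f_i)^2$ by $(y_i - (\mu_f^*)_i)^2 + (\Sigma_f^*)_{ii}$, and build a second bound.
$$F_1 \leq \min_{q(g)q(\omega, n)} \mathbb{E}_{q(g)q(\omega, n)}\left[\log p(\omega, n, y \mid g, \mu_f^*, \Sigma_f^*)\right] + \mathrm{KL}\left(q(g) \,\|\, p(g)\right) + \mathrm{KL}_f^* = F_2 \tag{7.4}$$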
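The replacement of $(y_i - f_i)^2$ used in these bounds is the second moment of a Gaussian marginal. As a short worked step, with $q^*(f_i) = \mathcal{N}\left(f_i \mid (\mu_f^*)_i, (\Sigma_f^*)_{ii}\right)$:
$$\begin{aligned}
\mathbb{E}_{q^*(f_i)}\left[(y_i - f_i)^2\right] &= y_i^2 - 2 y_i\, \mathbb{E}[f_i] + \mathbb{E}[f_i^2] \\
&= y_i^2 - 2 y_i (\mu_f^*)_i + (\mu_f^*)_i^2 + (\Sigma_f^*)_{ii} \\
&= \left(y_i - (\mu_f^*)_i\right)^2 + (\Sigma_f^*)_{ii}.
\end{aligned}$$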
It is straightforward to find the optimal variational distributions $q^*(g)$ and $q^*(\omega, n)$ minimizing $F_2$, which allows us to use CAVI updates. Then, injecting the optimal distribution $q^*(g)q(\omega, n)$ in $F_2$, we can derive the optimal $\mu_f^*$ and $\Sigma_f^*$, obtainable in closed-form. The resulting CAVI updates are given in Algorithm 3. For $\lambda$, we can use the second bound (7.4) and obtain a closed-form maximum-likelihood estimate, given in Algorithm 3.
This double-bound approach is very similar to Lázaro-Gredilla and Titsias [32], although they use the exponential link and need some extra computations.
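Algorithm 2 Gibbs sampling for the Heteroscedastic Gaussian likelihood
input: $f$, $g$, $\lambda$, $y$, $p(f, g) = \mathcal{N}(f \mid \mu_f^0, K)\, \mathcal{N}(g \mid \mu_g^0, K)$, $p(\lambda \mid \alpha, \beta)$.
for $t$ in $1 : N_{\text{samples}}$ do
    Draw $\lambda \sim p(\lambda \mid f, g, y) = \mathrm{Ga}\left(\lambda \,\middle|\, \alpha + \frac{N}{2},\ \beta + \sum_{i=1}^{N} \frac{\sigma(g_i)}{2}(y_i - f_i)^2\right)$
    Draw $n_i \sim p(n_i \mid f_i, g_i, \lambda) = \mathrm{Po}\left(\frac{\lambda\sigma(-g_i)(y_i - f_i)^2}{2}\right)$
    Draw $\omega_i \sim p(\omega_i \mid n_i, g_i) = \mathrm{PG}(0.5 + n_i, |g_i|)$
    Draw $g \sim p(g \mid n, \omega) = \mathcal{N}(\mu_g, \Sigma_g)$
        where $\Sigma_g = \left(K^{-1} + \mathrm{diag}(\omega)\right)^{-1}$ and $\mu_g = \Sigma_g\left(K^{-1}\mu_g^0 + \frac{0.5 - n}{2}\right)$
    Draw $f \sim p(f \mid g, \lambda) = \mathcal{N}(\mu_f, \Sigma_f)$
        where $\Sigma_f = \left(K^{-1} + \lambda\, \mathrm{diag}(\sigma(g))\right)^{-1}$ and $\mu_f = \Sigma_f\left(K^{-1}\mu_f^0 + \frac{\lambda\, \mathrm{diag}(\sigma(g))\, y}{2}\right)$
end for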
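Algorithm 3 CAVI Updates for the Heteroscedastic Gaussian likelihood
input: $q(f, g) = \mathcal{N}(f \mid \mu_f, \Sigma_f)\, \mathcal{N}(g \mid \mu_g, \Sigma_g)$, $p(f, g) = \mathcal{N}(f \mid \mu_f^0, K)\, \mathcal{N}(g \mid \mu_g^0, K)$, $y$ and $\lambda$.
while convergence criterion is not met do
    $\psi_i = \tilde{\sigma}(q(g_i))$
    $\lambda = \frac{N}{\sum_{i=1}^{N} (1 - \psi_i)\sqrt{(y_i - \mu_f^i)^2 + \Sigma_f^{ii}}}$
    $\gamma_i = \frac{\lambda}{2}\psi_i\sqrt{(y_i - \mu_f^i)^2 + \Sigma_f^{ii}}$
    $c_i = \sqrt{(\mu_g^i)^2 + \Sigma_g^{ii}}$
    $\theta_i = \mathbb{E}_{q(\omega_i \mid n_i)q(n_i)}[\omega_i] = \frac{0.5 + \gamma_i}{2c_i}\tanh\left(\frac{c_i}{2}\right)$
    $\Sigma_f = \left(K^{-1} + \lambda\, \mathrm{diag}(1 - \psi)\right)^{-1}$
    $\mu_f = \Sigma_f\left(K^{-1}\mu_f^0 + \lambda\, \mathrm{diag}(1 - \psi)\, y\right)$
    $\Sigma_g = \left(K^{-1} + \mathrm{diag}(\theta)\right)^{-1}$
    $\mu_g = \Sigma_g\left(K^{-1}\mu_g^0 + \frac{0.5 + \gamma}{2}\right)$
end while
where $q(n, \omega) = \prod_{i=1}^{N} \mathrm{PG}(\omega_i \mid 0.5 + n_i, c_i)\, \mathrm{Po}(n_i \mid \gamma_i)$ and $\tilde{\sigma}(q(g_i)) = \frac{e^{-\mu_g^i/2}}{\sqrt{(\mu_g^i)^2 + \Sigma_g^{ii}}/2}$ can be seen as a close approximation to $\mathbb{E}_{q(g_i)}[\sigma(-g_i)]$.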
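The $\theta_i$ update in Algorithm 3 is the mean of a Pólya-Gamma variable $\mathrm{PG}(b, c)$, which is $\frac{b}{2c}\tanh\left(\frac{c}{2}\right)$. A minimal sketch of this computation follows; the function names are hypothetical, and the small-$c$ branch uses the limit $b/4$ to avoid a division by zero.

```python
import math


def polya_gamma_mean(b, c):
    """Mean of PG(b, c): b / (2c) * tanh(c / 2), with limit b / 4 as c -> 0."""
    if abs(c) < 1e-8:
        return b / 4.0
    return b / (2.0 * c) * math.tanh(c / 2.0)


def theta_update(gamma_i, mu_g_i, sigma_g_ii):
    """theta_i for one data point, given gamma_i and the marginal moments of q(g_i)."""
    c_i = math.sqrt(mu_g_i ** 2 + sigma_g_ii)
    return polya_gamma_mean(0.5 + gamma_i, c_i)


print(theta_update(1.5, 3.0, 16.0))
```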
A 1-dimensional toy example is shown in Figure 7.2 with the results of the inference algorithms.
[Figure 7.2 panels: output space $y \mid f, g$ (left) and latent GPs (right), for variational inference (top) and Gibbs sampling (bottom). Legend: $y$; $p(y \mid f, g)$; $\mathbb{E}_{q(f,g)}[p(y \mid f, g)]$; $\{p(y \mid f_s, g_s)\}_{s=1}^{S}$; $\{f_s, g_s\}_{s=1}^{S} \sim p(f, g \mid y)$; $q(f)q(g)$; $f$; $g$.]
Figure 7.2: Toy example of a heteroscedastic Gaussian regression problem and the resulting inference from Algorithm 2 (Gibbs sampling, bottom plots) and Algorithm 3 (Variational Inference, top plots). The left plots show the output space. The training data $y$ are in orange, the generating likelihood is shown in blue (mean as a solid line and one standard deviation as dashed lines). The green bands show the predictive distributions with one standard deviation obtained after posterior inference (one band for variational inference and cumulative bands for the sampling approach). The right plots show the true latent functions $f$ and $g$ used to generate $y$ as well as the inferred posteriors: variational on top (mean with one standard deviation) and samples at the bottom.
We can see that on this one-dimensional example, VI, and more particularly Gibbs sampling, manage to recover the original model. For VI, the variance on the latent $f$ is almost negligible since all the data variance is absorbed into the likelihood variance term. The samples obtained with Gibbs sampling, without any warmup, fit the true processes of $f$ and $g$ nicely.
An implementation as well as detailed derivations are in the AugmentedGPLikelihoods.jl package [15].
7.2.2 Heteroscedastic Non-Gaussian Likelihood
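This method extends to non-Gaussian likelihoods as well. We take the example of the heteroscedastic Student-t likelihood, where we have a local scale with standard deviation $\epsilon(x) = \lambda\sigma(g)$ with $\lambda \in \mathbb{R}^+$. Similar to the heteroscedastic Gaussian likelihood (7.1), we get the likelihood:
$$p(y \mid f, g, \lambda, \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)\sqrt{\lambda\sigma(g)}}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}}\left(1 + \frac{\lambda\sigma(g)(y-f)^2}{\nu}\right)^{-\frac{\nu+1}{2}} \tag{7.5}$$
To simplify the notation, we define the scaled residuals $\Delta = \Delta_\nu(f, y, \lambda) = \frac{\lambda(y-f)^2}{\nu}$, $\alpha = \frac{\nu+1}{2}$ and the normalization constant $Z = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}$. We can proceed with the first augmentation:
$$\begin{aligned}
(1 + \sigma(g)\Delta)^{-\alpha} &= (1 + \Delta(1 - \sigma(-g)))^{-\alpha} \\
&= (1 + \Delta - \Delta\sigma(-g))^{-\alpha} \\
&= (\Delta\sigma(-g))^{-\alpha}\left(\frac{\sigma(-g)\Delta}{1 + \Delta - \sigma(-g)\Delta}\right)^{\alpha} \\
&= \sum_{k=0}^{\infty} \Delta^k\, \mathrm{NB}(k \mid \sigma(-g), \alpha),
\end{aligned} \tag{7.6}$$
where we used the MGF of the Negative Binomial distribution.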
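We obtain the same result by first augmenting the Student-t with a Gamma variable:
$$\begin{aligned}
p(y \mid f, g, \lambda) &= \int_0^\infty \mathcal{N}\left(y \mid f, (\lambda\sigma(g)\gamma)^{-1}\right) \mathrm{IG}\left(\gamma \,\middle|\, \frac{\nu}{2}, \frac{\nu}{2}\right) d\gamma \\
p(y, \gamma \mid f, g, \lambda) &= \mathcal{N}\left(y \mid f, (\lambda\sigma(g)\gamma)^{-1}\right) \mathrm{IG}\left(\gamma \,\middle|\, \frac{\nu}{2}, \frac{\nu}{2}\right).
\end{aligned} \tag{7.7}$$
$\mathcal{N}\left(y \mid f, (\lambda\sigma(g)\gamma)^{-1}\right)$ is the same starting point as Equation (7.1) with an additional scaling $\gamma$. The next augmentation steps are the same as in Equation (7.2) with an augmentation with a Poisson variable. Marginalizing out the Gamma variable $\gamma$ results in a Negative Binomial distribution.
Back to Equation (7.6), we rework the likelihood by reorganizing the terms in the augmented likelihood.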
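$$\begin{aligned}
p(y, k \mid f, g, \lambda, \nu) &= Z\sqrt{\lambda}\, \sigma^{\frac{1}{2}}(g)\, \sigma(-g)^{-\alpha}\Delta^{-\alpha}\Delta^k \underbrace{C(k, \alpha)\sigma^\alpha(g)\sigma(-g)^k}_{\mathrm{NB}(k \mid \sigma(-g), \alpha)} \\
&= Z\, C(k, \alpha)\sqrt{\lambda}\, (\sigma(g))^{\frac{1}{2}+\alpha}(\sigma(-g))^{k-\alpha}\Delta^{k-\alpha}
\end{aligned}$$
where $C(k, \alpha) = \frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}$ is the normalization constant of the negative binomial. We set $Z' = Z\, C(k, \alpha)\sqrt{\lambda}$ as a constant independent of $f$ or $g$. The final step is the Pólya-Gamma augmentation:
$$p(y, k, \omega \mid f, g, \lambda, \nu) = Z' \Delta^{k-\alpha}\, 2^{-\left(\frac{1}{2}+k\right)} \exp\left(\frac{1}{2}\left(\left(\frac{1}{2} + 2\alpha - k\right)g - g^2\omega\right)\right) \mathrm{PG}\left(\omega \,\middle|\, \frac{1}{2}+k, 0\right). \tag{7.8}$$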
As for the heteroscedastic Gaussian likelihood, the augmented likelihood (7.8) is conjugate in $g$ but not in $f$. We can find the collapsed conditional for $f$ in closed-form.
The key to performing inference on this augmented model is to use the right augmented likelihood for each variable. For example, for $f$ and $\lambda$, we only want to use the Inverse Gamma augmentation described in Equation (7.7). For $\omega$, $g$, and $k$ (used as a mixture of inverse Gamma and Poisson) we will use the fully augmented likelihood (7.8). This gives a combination of collapsed conditionals and full conditionals directly usable in a Gibbs sampling scheme. For the CAVI updates, we reuse the double bound idea of Section 7.2.1.
The full derivations, resulting algorithms and implementation will be found in the AugmentedGPLikelihoods.jl package [15].
7.3 Using Hamiltonian Monte Carlo on the augmented model
The Gibbs sampler in the experiments of Chapters 3 and 5 outperforms the state-of-the-art HMC algorithm introduced in Section 2.3.1. A recurrent question I got is: is the performance gain due only to the augmentation or to the Gibbs sampling scheme? To answer this question, we try using the HMC algorithm on augmented models.
Before doing any experiments, let us consider the consequences that the augmented model has on the HMC sampler. First, the augmentation increases the dimensionality of the model. For $N$ observations, we need $KN$ more dimensions (where $K$ depends on the model); therefore, gradient computations and algorithm tuning should be more expensive. On the other hand, since the likelihood is simplified to a quadratic problem, the computational complexity of each step can decrease. The second issue with using HMC on the augmented model is that the probability density function (pdf) of the prior distribution on augmented variables is not always available in closed-form or not usable at all. For example, one approximates the density of a Pólya-Gamma variable with a truncated alternating series. Truncated series are computationally expensive and can also be biased and unstable. My experience with the Pólya-Gamma variables is that even when using tricks like "logsumexp" to improve numerical stability, the pdf approximation can be negative, breaking the computations. Finally, the critical problem with HMC is that it only works with continuous variables. Some augmentations directly involve discrete variables like the Poisson in the multi-class setting, making them incompatible with a scheme involving only HMC.
We try running HMC and NUTS with a compatible augmentation (augmented variable pdf known in closed-form, no discrete variables). Figure 7.3 shows the auto-correlation plots on a GP regression problem with a Student-t likelihood with $\nu = 3$ degrees of freedom applied on the Boston housing dataset (506 data points, 13 dimensions) [19]. We draw one chain of 2000 samples (plus 500 adaptation samples for HMC and NUTS) for both the original and augmented model.
At first glance, HMC applied to the augmented model has a lower auto-correlation. When using NUTS, the gain becomes less clear. Moreover, the algorithm produces antithetic chains, making a proper comparison harder. The Gibbs sampler has the smallest intra-chain correlation, but one could argue that negative correlations are desirable to compute expectations. However, HMC and NUTS turned out to be much slower than the Gibbs sampler: the Gibbs sampler took around 20 seconds to run against an average of 12 minutes for HMC and NUTS. This difference is due to HMC (and NUTS) needing to compute many gradients for every sample. Perhaps surprisingly, there was no significant time difference between the augmented and original models for HMC and NUTS.
Note that HMC is already, in a sense, making an augmentation of its own with the momentum variables, and it could be added to the list of successful types of augmentations improving inference.
We should only consider these results preliminary since we used a simple likelihood, and the dataset is relatively small and easy.
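For reference, the empirical auto-correlation of a single chain can be computed as in the sketch below. This is a generic estimator written for illustration, not the code used for the experiments; the function name and the toy chain are assumptions. An antithetic chain, which alternates around its mean, shows up as a negative value at lag 1.

```python
def autocorrelation(chain, max_lag):
    """Empirical auto-correlation of a scalar chain for lags 0..max_lag.

    Uses the biased estimator (normalization by the chain length),
    so that the value at lag 0 is exactly 1.
    """
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    variance = sum(c * c for c in centered) / n
    acf = []
    for lag in range(max_lag + 1):
        cov = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        acf.append(cov / variance)
    return acf


# Toy antithetic chain (hypothetical): consecutive samples have opposite signs.
antithetic_chain = [1.0, -1.0] * 50
print(autocorrelation(antithetic_chain, 2))
```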
[Figure 7.3: autocorrelation (y-axis, ticks from -0.25 to 0.25) against lag (x-axis, ticks from 5 to 20) for Gibbs Sampling, HMC (aug. model), HMC, NUTS (aug. model), and NUTS.]
Figure 7.3: Auto-correlation function of the Gibbs sampler, HMC and NUTS on the augmented model, and HMC and NUTS on the original model. The mean is shown with one standard deviation over all dimensions.
7.4 Improvements on the Multi-Class Classification
We recently found additional ways to improve the multi-class classification model and the associated inference. We present them here in three sections.
7.4.1 Marginalizing out variables
In the augmentation derived in Chapter 4, we add $2K+1$ new variables per observation: $\lambda$, $\{n_i\}_{i=1}^{K}$ and $\{\omega_i\}_{i=1}^{K}$. However, we can reduce this number to $2K$ and avoid unnecessary inner loops by marginalizing out $\lambda$. When deriving the augmentations, one ends up with the following augmented likelihood:
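$$p(y = k, \{n_j\}_{j=1}^{K}, \lambda \mid \{f_j\}_{j=1}^{K}) = \sigma(f_k)\prod_{j=1}^{K}\sigma(-f_j)^{n_j}\, \mathrm{Po}(n_j \mid \lambda), \tag{7.9}$$
where we omitted the improper prior $1_{[0,\infty)}$ on $\lambda$. We can marginalize out $\lambda$:
$$\begin{aligned}
\int_0^\infty \prod_{j=1}^{K}\sigma(-f_j)^{n_j}\, \mathrm{Po}(n_j \mid \lambda)\, d\lambda &= \frac{1}{\prod_{j=1}^{K} n_j!}\prod_{j=1}^{K}\sigma(-f_j)^{n_j}\int_0^\infty \lambda^{\sum_{j=1}^{K} n_j} e^{-K\lambda}\, d\lambda \\
&= \frac{K^{-\sum_{j=1}^{K} n_j}}{\prod_{j=1}^{K} n_j!}\prod_{j=1}^{K}\sigma(-f_j)^{n_j}\int_0^\infty (K\lambda)^{\sum_{j=1}^{K} n_j} e^{-K\lambda}\, d\lambda \\
&= \frac{1}{K}\prod_{j=1}^{K}\sigma(-f_j)^{n_j}\, \Gamma\left(1 + \sum_{j=1}^{K} n_j\right)\prod_{j=1}^{K}\left(\frac{1}{K}\right)^{n_j}\frac{1}{n_j!}.
\end{aligned} \tag{7.10}$$
This is proportional to a Negative Multinomial $\mathrm{NM}(x_0, p)$ defined by:
$$\mathrm{NM}(x \mid x_0, p) = \Gamma\left(\sum_{j=0}^{K} x_j\right)\frac{p_0^{x_0}}{\Gamma(x_0)}\prod_{j=1}^{K}\frac{p_j^{x_j}}{x_j!}$$
with parameters $x_0 = 1$, $p = \left\{\frac{\sigma(-f_j)}{K}\right\}_{j=1}^{K}$, and where $p_0 = 1 - \sum_{j=1}^{K} p_j$. Note that the normalization term $p_0$ is missing in Equation (7.10). However, we do not add it, as it would render the likelihood unusable. We keep the prior unnormalized, but this does not influence the inference, as in Chapter 4, since all full conditionals are available in closed-form and normalized.
These derivations could have been avoided by noticing that the MGF of a negative multinomial distribution is given by:
$$\mathrm{MGF}_{\mathrm{NM}(x_0, p)}(t) = \left(\frac{p_0}{1 - \sum_{j=1}^{K} p_j e^{t_j}}\right)^{x_0}.$$
Both the Gibbs sampling and CAVI updates based on this marginalization are described in Algorithms 4 and 5.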
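The marginalization of $\lambda$ in this section rests on the Gamma integral $\int_0^\infty \lambda^{S} e^{-K\lambda}\, d\lambda = \Gamma(S+1)/K^{S+1}$ with $S = \sum_j n_j$, which is the $\lambda$-dependent part of the product of $K$ Poisson terms. The sketch below checks this integral numerically with a composite Simpson rule. The function names, the integration range, and the example counts are assumptions chosen for illustration.

```python
import math


def simpson(func, a, b, n=20000):
    """Composite Simpson rule on [a, b] with n (even) sub-intervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0


def lambda_integral_numeric(counts, upper=60.0):
    """Numerical value of int_0^upper lambda^S exp(-K lambda) d lambda."""
    k = len(counts)
    s = sum(counts)
    return simpson(lambda lam: lam ** s * math.exp(-k * lam), 0.0, upper)


def lambda_integral_closed_form(counts):
    """Closed form Gamma(S + 1) / K^(S + 1) of the same integral on [0, inf)."""
    k = len(counts)
    s = sum(counts)
    return math.gamma(s + 1) / k ** (s + 1)


counts = [2, 0, 3]  # hypothetical Poisson counts n_j for K = 3 classes
print(lambda_integral_numeric(counts), lambda_integral_closed_form(counts))
```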
7.4.2 A new model for the multi-class classification
In Chapter 4, two concerns can be raised. First, the parametrization of a categorical distribution with $K$ categories requires only $K-1$ independent parameters $p$ due to the constraint $\sum_{j=1}^{K} p_j = 1$. However, in the original model, which we will call over-parametrized, we consider $K$ independent parameters. Second, the augmented variable $\lambda$ has the improper prior $p(\lambda) = 1_{[0,\infty)}$, which is a proper measure but is not normalizable. It is not an important concern since the posterior is normalizable despite the improper prior. Nevertheless, one might argue that improper priors should be avoided, as they do not allow model comparison.
On a side note, the fact that augmentations with improper priors still lead to valid inference is a good indication that scale mixtures for augmentation can be extended to non-normalizable measures. These two issues seem connected, but we do not have any proof for it.
We propose an alternative parametrization with $K-1$ latent GPs. The likelihood stays the same but with one latent being fixed:
$$p(y = k \mid \{f_j\}_{j=1}^{K-1}) = \begin{cases} \dfrac{\sigma(f_k)}{D + \sum_{j=1}^{K-1}\sigma(f_j)}, & \text{if } 1 \leq k \leq K-1 \\[2ex] \dfrac{D}{D + \sum_{j=1}^{K-1}\sigma(f_j)}, & \text{if } k = K, \end{cases} \tag{7.11}$$
where $D = \sigma(f_K) \in [0, 1]$. We call this version of the likelihood bijective since the dimensionality of the simplex output is the same as the inputs.
This likelihood comes with different properties. Unlike the softmax link, the logistic-softmax link is not translation invariant¹. We cannot freely exchange classes, and the "fixed" class has a different behavior than the rest. For example, since we fix $D$, the probability for classes other than $K$ will be upper bounded by $\frac{1}{D+1}$. For example, taking $D = 0.5$ ($f_K = 0$) leads to a maximum probability of 1 for the class $K$ and $2/3$ for all other classes. On the other hand, if $D = 0$, the probability of the class $K$ will always be 0. The bijective likelihood can still be practical if we do not care about one of the classes. Additionally, the scaled model presented in the next Section 7.4.3 can also help with the imbalance between classes.
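The bounds discussed for the bijective link can be illustrated with a small sketch. The function names are hypothetical, and the fixed class is placed last in the returned list by convention. With $f_K = 0$, i.e. $D = 0.5$, the probability of any free class approaches $1/(D+1) = 2/3$ when its latent is very large and the others are very negative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bijective_logistic_softmax(free_latents, fixed_latent=0.0):
    """Class probabilities for K - 1 free latents and one fixed latent.

    The last entry of the returned list is the probability of the fixed class.
    """
    d = sigmoid(fixed_latent)
    s = [sigmoid(f) for f in free_latents]
    norm = d + sum(s)
    return [v / norm for v in s] + [d / norm]


print(bijective_logistic_softmax([50.0, -50.0]))   # first class near its upper bound
print(bijective_logistic_softmax([-50.0, -50.0]))  # fixed class near probability one
```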
Starting from the likelihood in Equation (7.11), the first augmentation, which led to an improper prior in the over-parametrized model of Chapter 4:
$$\frac{1}{\sum_{j=1}^{K}\sigma(f_j)} = \int_0^\infty e^{-\lambda\sum_{j=1}^{K}\sigma(f_j)}\, d\lambda$$
is replaced by the known MGF of a Gamma distribution with the following mixture:
$$\frac{1}{D + \sum_{j=1}^{K-1}\sigma(f_j)} = \frac{1}{D}\, \frac{1}{1 + \frac{1}{D}\sum_{j=1}^{K-1}\sigma(f_j)} = \frac{1}{D}\int_0^\infty e^{-\lambda\sum_{j=1}^{K-1}\sigma(f_j)}\, \mathrm{Ga}\left(\lambda \,\middle|\, 1, \frac{1}{D}\right) d\lambda,$$
which is true for $D > 0$.
The next augmentation steps are the same for the bijective and over-parametrized models: we use the MGF of the Poisson distribution and finally the Pólya-Gamma augmentation. We give the resulting updates in Algorithms 4 and 5 and show 1-dimensional examples with 3 classes, with and without the bijection, in Figures 7.4 and 7.5.
Algorithm 4 Gibbs sampling updates: $K$ / $K-1$ latent GPs for $K$ classes
input: $F = \{f_k\}_{k=1}^{K}$, $p(F) = \prod_{k=1}^{K/K-1} p(f_k \mid \mu^0, K_X)$, $Y = \{y_i\}_{i=1}^{N}$ (one-hot encoded)
for $t$ in 1 : # samples do
    Draw $n^i \sim p(n^i \mid F) = \mathrm{NM}(1, p^i)$ where $p_k^i = \frac{\sigma(-f_k^i)}{K}$ / $\frac{\sigma(-f_k^i)}{D+K-1}$
    Draw $\omega_k^i \sim p(\omega_k^i \mid f_k^i, n_k^i, y_k^i) = \mathrm{PG}(y_k^i + n_k^i, |f_k^i|)$
    Draw $f_k \sim p(f_k \mid \omega_k, n_k, Y) = \mathcal{N}(m_k, S_k)$
        where $S_k = \left(K_X^{-1} + \mathrm{diag}(\omega_k)\right)^{-1}$ and $m_k = S_k\left(K_X^{-1}\mu^0 + \frac{y_k - n_k}{2}\right)$
end for
¹ There is no function $f(\Delta)$ such that $\sigma(x + \Delta) = f(\Delta)\sigma(x)$ for all $x$.
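Algorithm 5 CAVI updates: $K$ / $K-1$ latent GPs for $K$ classes
input: $q(F) = \prod_{k=1}^{K/K-1} q(f_k \mid \mu_k, \Sigma_k)$, $p(F) = \prod_{k=1}^{K/K-1} p(f_k \mid \mu^0, K)$, $Y = \{y_i\}_{i=1}^{N}$ (one-hot encoded)
while convergence criterion is not met do
    $c_k^i = \sqrt{(\mu_k^i)^2 + \Sigma_k^{ii}}$
    $p_k^i = \frac{\tilde{\sigma}(q(f_k^i))}{K}$ / $\frac{\tilde{\sigma}(q(f_k^i))}{D+K-1}$
    $\gamma^i = \mathbb{E}_{q(n^i)}[n^i] = \frac{p^i}{1 - \sum_{k=1}^{K} p_k^i}$
    $\theta_k^i = \mathbb{E}_{q(\omega_k^i)}[\omega_k^i] = \frac{y_k^i + \gamma_k^i}{2c_k^i}\tanh\left(\frac{c_k^i}{2}\right)$
    $\Sigma_k = \left(K_X^{-1} + \mathrm{diag}(\theta_k)\right)^{-1}$
    $\mu_k = \Sigma_k\left(K_X^{-1}\mu^0 + \frac{y_k - \gamma_k}{2}\right)$
end while
where $q(N, \Omega) = \prod_{i=1}^{N} \mathrm{PG}(\omega^i \mid y^i + n^i, c^i)\, \mathrm{NM}(n^i \mid 1, p^i)$ and $\tilde{\sigma}(q(f_k^i)) = \frac{e^{-\mu_k^i/2}}{\sqrt{(\mu_k^i)^2 + \Sigma_k^{ii}}/2}$ is an approximation to $\sigma(-f_k^i)$.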
[Figure 7.4 panels: class probabilities $y \mid \{f_j\}$ (left) and latent GPs (right), for variational inference (top) and Gibbs sampling (bottom). Legend: $y$; $p(y = k \mid \{f_j\})$; $\mathbb{E}_{q(f_j)}[p(y = k \mid \{f_j\})]$; $\{f_j\}$; $\{q(f_j)\}$; $\{p(y = k \mid \{f_j\}_s)\}_{s=1}^{S}$; $\{\{f_j\}_s\}_{s=1}^{S}$.]
Figure 7.4: Illustration of Algorithms 4 and 5 with the bijective link introduced in Section 7.4.2 and the marginalization of Section 7.4.1. Each color represents a class, and we compare the true process to the inferred one for both Gibbs sampling and variational inference. The solid lines represent the true probabilities and latent GPs. The plots on top show the variational inference results, with the expected predictive probability on the left and the variational posterior on the right. The plots at the bottom show the probabilities and latent GPs obtained via Gibbs sampling.
[Figure 7.5 panels: class probabilities $y \mid \{f_j\}$ (left) and latent GPs (right), for variational inference (top) and Gibbs sampling (bottom). Legend: $y$; $p(y = k \mid \{f_j\})$; $\mathbb{E}_{q(f_j)}[p(y = k \mid \{f_j\})]$; $\{f_j\}$; $\{q(f_j)\}$; $\{p(y = k \mid \{f_j\}_s)\}_{s=1}^{S}$; $\{\{f_j\}_s\}_{s=1}^{S}$.]
Figure 7.5: Illustration of Algorithms 4 and 5 with the over-parametrized link with the marginalization of Section 7.4.1. Each color represents a class, and we compare the true process to the inferred one for both Gibbs sampling and variational inference. The solid lines represent the true probabilities and latent GPs. The plots on top show the variational inference results, with the expected predictive probability on the left and the variational posterior on the right. The plots at the bottom show the probabilities and latent GPs obtained via Gibbs sampling.
Both the bijective and over-parametrized links correctly fit this one-dimensional example. The over-parametrized link in Figure 7.5 does not correctly approximate the fixed latent $f_K = 0$ but still returns good predictive distributions.
When repeatedly running these examples, we observe that the predictive probabilities for the bijective link are consistently more accurate, but the predictive log-likelihood for the correct class is higher on the over-parametrized link. To confirm this trend, we would need further experiments on real datasets and with a higher number of classes.
7.4.3 Scaling the logistic-softmax link
The logistic-softmax link has issues with the predictive probabilities, in particular with many classes. Because of the boundedness of the logistic function, the logistic-softmax link needs large values of $f_i$ to reach prediction probabilities close to 1. Even when the model should be very confident about a prediction and the latent GPs are correctly inferred, the predictive probability for the correct class will be around $(1-\epsilon)/((K-1)\epsilon + 1 - \epsilon)$ where $\epsilon$ is the minimum value taken by $\sigma(f)$. With a GP prior centered at 0 and a reasonable kernel variance, $f$ cannot take large values. For example, taking 10 classes, if we assume $f_y = 4$ for the correct class and $-4$ for the others, $\epsilon \approx 0.018$, which gives a probability of 0.858 with the logistic-softmax link against 0.996 for the softmax link.
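The 10-class example can be reproduced with a few lines. The sketch below compares the logistic-softmax and softmax links on the latent values quoted in the text; the function names are illustrative and not taken from any package.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_softmax(latents):
    """Logistic-softmax link: sigma(f_k) / sum_j sigma(f_j)."""
    s = [sigmoid(f) for f in latents]
    norm = sum(s)
    return [v / norm for v in s]


def softmax(latents):
    """Softmax link, shifted by the maximum for numerical stability."""
    m = max(latents)
    e = [math.exp(f - m) for f in latents]
    norm = sum(e)
    return [v / norm for v in e]


latents = [4.0] + [-4.0] * 9  # correct class first, nine other classes
print("epsilon:", sigmoid(-4.0))
print("logistic-softmax:", logistic_softmax(latents)[0])
print("softmax:", softmax(latents)[0])
```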
This can be solved by using a scaled logistic function. We add $K$ hyperparameters $\theta = \{\theta_i\}_{i=1}^{K}$ such that the likelihood becomes
$$p(y = k \mid \{f_j\}_{j=1}^{K}, \theta) = \frac{\theta_k\sigma(f_k)}{\sum_{j=1}^{K}\theta_j\sigma(f_j)}.$$
The $\theta$ parameters can be optimized using the ELBO with the other hyperparameters. These can also provide information about each class, a high $\theta_j$ meaning that the $j$-th class has zones of very high confidence. With the likelihood augmented with the variable $\lambda$, the collapsed conditional and the maximum-likelihood optimum of $\theta$ are available in closed-form. The maximum-likelihood optimizer is given by:
$$\theta_k^* = \frac{\sum_{i=1}^{N}\delta(y_i, k)}{\sum_{i=1}^{N}\mathbb{E}_{q(\lambda_n)}[\lambda_n]\left(1 - \tilde{\sigma}(q(f_k^i))\right)},$$
where $\delta(x, y)$ is the Kronecker delta function, equal to 1 if $x = y$ and 0 otherwise, and where $\tilde{\sigma}(q(f_k^i))$ is defined as in Algorithm 5. We used the model definition where $\lambda$ is not marginalized out.
By putting a prior $\mathrm{Ga}(\theta_k \mid \alpha, \beta)$, the collapsed conditional of each $\theta_k$ is given by:
$$p(\theta_k \mid f_k, \lambda) = \mathrm{Ga}\left(\theta_k \,\middle|\, \alpha + \sum_{i=1}^{N}\delta(y_i, k),\ \beta + \sum_{i=1}^{N}\lambda_i\sigma(f_k^i)\right)$$
A Julia implementation as well as detailed derivations can be found in the AugmentedGPLikelihoods.jl package [15].
7.5 Sampling from a sparse augmented model
Another work in progress concerns the sampling of sparse GP models. Sampling from the augmented model
proves to be very effective (see Chapter 5) while still producing samples from the posterior
$p(f \mid y)$ of the original model. Unfortunately, this property does not transfer when using sparse GPs
(for a reminder on sparse GPs, see Section 2.2.3) and the scalability is limited. Simply adding inducing
point locations $Z$ with realizations $u = f(Z)$ leads to a Gibbs sampling algorithm with a computational
complexity of $O((N + M)^3)$ per step and does not help with scalability. To solve this problem, we
propose to mix the Gibbs sampling approach we presented in Chapter 5 with variational inference.
We build on the work of Hensman et al. [22]. They make Titsias' assumption [53], i.e. setting the
variational distribution as $q(u, f) = q(u) p(f \mid u)$. Since they also assume a fully factorizable
likelihood $p(y \mid f) = \prod_i p(y_i \mid f_i)$, only the marginals $q(f_i)$ are required and the
computational complexity of the bound decreases to $O(NM^2 + M^3)$. Hensman et al. [22] show that the
optimal variational distribution of the inducing variables $u$ minimizing
$\mathrm{KL}(q(u, f) \,\|\, p(u) p(f \mid u) p(y \mid f))$ for a factorizable likelihood
$p(y \mid f) = \prod_i p(y_i \mid f_i)$ is given by:
$$\log q^*(u) = \sum_i E_{p(f_i \mid u)}[\log p(y_i \mid f_i)] + \log p(u) + C, \tag{7.12}$$
where $C$ is an intractable constant. $q^*(u)$ does not have a specific form in the general case, but we
can sample from it by using HMC and evaluating the integrals $E_{p(f_i \mid u)}[\log p(y_i \mid f_i)]$
numerically² as in [22].
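To make the numerical evaluation of these one-dimensional expectations concrete, here is a minimal Gauss-Hermite quadrature sketch. It assumes NumPy, a Bernoulli likelihood with logistic link and labels in $\{-1, +1\}$, and that the mean and variance of each Gaussian $p(f_i \mid u)$ have already been computed; the function names are hypothetical.

```python
import numpy as np

def gauss_hermite_expectation(g, mean, var, n_points=20):
    """Approximate E_{N(f | mean, var)}[g(f)] elementwise over arrays mean, var."""
    # Nodes and weights for the weight function exp(-x^2).
    x, w = np.polynomial.hermite.hermgauss(n_points)
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    # Change of variables f = mean + sqrt(2 var) x; result has shape (..., n_points).
    f = mean[..., None] + np.sqrt(2.0 * var)[..., None] * x
    return (g(f) * w).sum(axis=-1) / np.sqrt(np.pi)

def expected_log_lik(y, mean, var, n_points=20):
    """E[log sigma(y_i f_i)] under N(f_i | mean_i, var_i), with y_i in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    log_lik = lambda f: -np.logaddexp(0.0, -y[..., None] * f)  # log sigma(y f)
    return gauss_hermite_expectation(log_lik, mean, var, n_points)

# Toy usage: two data points with given conditional means and variances.
y = np.array([1.0, -1.0])
mean = np.array([0.5, -1.0])
var = np.array([2.0, 0.3])
print(expected_log_lik(y, mean, var))
```

The cost is one quadrature per data point and per evaluation of $\log q^*(u)$, which is the expense the sampling scheme proposed next tries to avoid.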
We propose instead to derive a variational Gibbs sampling algorithm to draw samples from the
variational distribution minimizing the Rényi divergence [57] defined as
$$D_\alpha(p, q) = \frac{1}{\alpha(\alpha - 1)} \log \int \big[\alpha p(x) + (1 - \alpha) q(x) - p^\alpha(x) q^{1 - \alpha}(x)\big] \, dx, \quad \alpha \in \mathbb{R}^+. \tag{7.13}$$
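The behaviour of this family of divergences at the two ends $\alpha \to 0$ and $\alpha \to 1$ can be checked numerically on a toy example. The sketch below is an assumption-laden illustration: it uses the standard Amari-type α-divergence for normalized discrete distributions, $(1 - \sum_x p^\alpha q^{1-\alpha}) / (\alpha(1 - \alpha))$, as used in Power-EP, which is not necessarily identical to the expression printed above; names and distributions are made up.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Amari-type alpha-divergence between normalized discrete distributions."""
    return (1.0 - np.sum(p**alpha * q**(1.0 - alpha))) / (alpha * (1.0 - alpha))

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

near_one = alpha_divergence(p, q, 1.0 - 1e-6)   # alpha close to 1
near_zero = alpha_divergence(p, q, 1e-6)        # alpha close to 0

print(near_one, kl(p, q))    # compare with the forward KL(p || q)
print(near_zero, kl(q, p))   # compare with the reverse KL(q || p)
```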
The Rényi divergence converges to the forward KL divergence $\mathrm{KL}(p \,\|\, q)$ for $\alpha = 1$
and to the reverse KL divergence $\mathrm{KL}(q \,\|\, p)$ for $\alpha = 0$ [57]. We define our
variational distribution as $q(u, f, \Omega) = q(u, \Omega) \prod_i p(f_i \mid u)$, and aim at minimizing
$D_\alpha(p(u, f, \Omega \mid y), q(u, f, \Omega))$. Note that we do not assume any independence between
$u$ and $\Omega$, only that every $f_i$ is conditionally independent given $u$. There is no parametric
closed form for the optimal distribution $q^*(u, \Omega)$ minimizing the divergence in Equation (7.13),
hence we take the approach of Hensman et al. [22] and sample from it instead. We draw $u$, $f$ and
$\Omega$ with a blocked Gibbs sampler, by sampling from the optimal variational distributions minimizing
the conditional Rényi divergences:
$$\Omega^i \sim q^*(\Omega) = \arg\min_q D_\alpha\big(p(\Omega \mid u^{i-1}, f^{i-1}, y), \, q(\Omega)\big) \tag{7.14}$$
$$u^i, f^i \sim q^*(u, f) = \arg\min_q D_\alpha\big(p(u, f \mid \Omega^i, y), \, q(u, f)\big) = \arg\min_q D_\alpha\Big(p(u) p(f \mid u) p(f \mid \Omega^i, y), \, q(u) \prod_i p(f_i \mid u)\Big). \tag{7.15}$$
² With quadrature in low dimensions.
For all $\alpha$, the minimizer $q^*(\Omega)$ is $p(\Omega \mid u^{i-1}, f^{i-1}, y)$, setting the
conditional divergence to 0. With the approach from Chapter 5, we know $p(\Omega \mid u, f, y)$ (which can
be simplified to $p(\Omega \mid f, y)$) in closed form and can sample from it with linear complexity with
respect to the number of data points.
Bui et al. [8] solved the optimization problem of Equation (7.15) for Gaussian likelihoods with the
Power-EP algorithm. Since $p(f \mid \Omega, y)$ is conjugate in $f$, the optimal $q^*(u)$ is a
multivariate normal distribution with the mean and variance known in closed form for all
$\alpha \in \mathbb{R}^+$. Each sampling step for $u$ and $f$ only has complexity $O(M^3 + M^2 N)$. As in
the Power-EP setting, $\alpha = 0$ corresponds to the variational approach of Titsias [53], while
$\alpha = 1$ corresponds to the Fully Independent Training Conditional (FITC) approach of Snelson and
Ghahramani [51], as shown in Bui et al. [8].
The only parameters left are the hyperparameters $\theta$, omitted in the previous equations, which can
represent a real challenge. For $\alpha = 0$, we could sample from $q^*(\theta)$ with the HMC algorithm in
a separate Gibbs sampling step. For other $\alpha$, we could optimize $q(\theta)$ with variational
inference methods [33, 24], and hot-start with the previous distribution. The complete variational Gibbs
sampler is described in Algorithm 6.
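Algorithm 6 Variational Gibbs Sampler for Sparse GPs
input: $y$, $u^0 \sim p(u)$, $f^0 \sim p(f \mid u^0)$, $\theta^0 \sim p(\theta)$
for $i$ in 1 : # samples do
    Draw $\Omega^i \sim p(\Omega \mid f^{i-1}, \theta^{i-1}, y)$ (in closed form)
    Draw $u^i, f^i \sim q^*(u, f) = \arg\min_q D_\alpha\big(p(u, f \mid \Omega^i, \theta^{i-1}, y), q(u, f)\big)$ (in closed form)
    Draw $\theta^i \sim q^*(\theta) = \arg\min_q D_\alpha\big(p(\theta \mid u^i, f^i, \Omega^i, y), q(\theta)\big)$ (HMC or optimization)
end for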
Our approach completely gets rid of expectation computations for $u$. It opens up more possibilities for
more complex likelihoods, like the multi-class or heteroscedastic ones, where computing expectations
numerically, as in Equation (7.12), is a limitation. For medium-sized datasets, this outperforms the CAVI
algorithm as it has the same convergence speed but does not suffer from the mean-field assumption on the
variational parameters. We show preliminary results in Figure 7.6 for a binary classification problem on
the Magic Telescope dataset (10 dimensions, 19020 data points) [5]. The experiment is run with 10-fold
cross-validation; we use $M = 50$ inducing points selected via the k-means++ algorithm [2], and we keep
the hyperparameters fixed. We compare our approach (VI-Gibbs) with $\alpha = 0$ against the HMC³
variational sampling method of Hensman et al. [22] mentioned earlier (VI-HMC), a standard VI method
optimized with an L-BFGS optimizer (Std. VI) and the augmented VI approach from Chapter 3 with CAVI
updates (Aug. VI). We show the classification error and test negative log-likelihood over time in
Figure 7.6.
³ HMC is run with a fixed step size of 0.1 and with 10 leapfrog steps.
[Figure 7.6 plot: two panels with Time [s] on a logarithmic axis (10⁻¹ to 10³), showing the average
predictive negative log-likelihood and the classification error for Aug. VI, Std. VI, VI-HMC and VI-Gibbs.]
Figure 7.6: Negative test log-likelihood and classification test error over time on the Magic
Telescope dataset. The mean with one standard deviation over 10 runs is shown for each algorithm.
These are first results, and there is still work to do on optimizing the implementation, but some first
impressions can already be drawn. In terms of iterations, VI-Gibbs is just as fast as the CAVI updates
but seems to reach a slightly better optimum. It also completely outperforms methods applied to the
original model.
These preliminary graphs look very promising, but adding hyperparameter sampling might slow
down the process. We also need to compare results with different likelihoods and different values of
$\alpha$.
7.6 Limitations
Unfortunately, augmentations are not a silver bullet for approximate Bayesian inference.
Augmentable functions
The largest issue is naturally the limited domain of application. Only a constrained set of functions can
be augmented. The idea of generalization using the MGF, as mentioned in Section 7.1, is promising but
limited nonetheless. When they exist, the identification of augmentable functions in a given model can
be tedious and may require lengthy derivations. We often need to rearrange terms and use mathematical
identities before applying procedures like the ones described in this thesis. This is accessible to
someone with expertise, but automating this derivation process is complicated. Current progress in
symbolic programming could eventually help in this direction. We could automate this process by having a
lookup table of augmentable functions and manipulating terms symbolically.
Mean-field approximation in VI
Another issue is that the variational distribution $q(f, \Omega)$ (or $q(u, \Omega)$) approximating the
posterior $p(f, \Omega \mid y)$ of the augmented model is not as accurate as the variational distribution
$q(f)$ (or $q(u)$) approximating the posterior $p(f \mid y)$ of the original model (see Section 2.3.2).
Although the original model can be recovered from the augmented model by marginalizing out the augmented
variables $\Omega$, the MF approximation loses information (the correlation between $\Omega$ and $f$) and
breaks this link. Marginalizing out $\Omega$ in $q^*(f, \Omega)$ will not return the optimal $q^*(f)$
trained on the original model. Interestingly, the bound difference comes exclusively from the mean-field
assumption between $q(f)$ and $q(\Omega)$. We can even identify these bound differences via the
interpretation of Jaakkola and Jordan [26] as missing terms from a Taylor series, as shown in Chapter 3.
When analyzing the quality of the predictive distributions, the variational distribution trained on the
augmented model proves to be almost as good as the variational distribution trained on the original model.
The difference of bounds mentioned earlier is often not significant at convergence but will create a
difference nonetheless. These empirical results give us an indication that $f$ and $\Omega$ are naturally
strongly decorrelated, which would explain why the Gibbs sampling and CAVI updates are so efficient.
8 Conclusion
With this thesis, I want to motivate the use of different representations to ease inference in
probabilistic models. The work on scale mixtures gets the best out of the blocked Gibbs sampling and the
blocked CAVI algorithms. Deriving these augmentations can be complicated and requires a certain
expertise. Finding more generalizations and rules will simplify this approach and make it more accessible.
We do not have a clear theoretical understanding of the reason for the fast convergence of these
algorithms. By exploring the properties of these likelihoods, we work on obtaining bounds on the
convergence speed of these algorithms. An intuition for why these augmentations work so well is the
notion of decoupling. Many inference bottlenecks come from highly correlated variables and heavy
tails of distributions [3]. By separating these components into different variables, all parts become
easier to model and do not suffer from the typical inference issues mentioned above. These ideas
do not represent an actual theory for now, and we need a thorough analysis. A better understanding
could give insights into how convergence speed and variable correlations are connected.
Another challenge, as pointed out in Chapter 7, is to widen the class of functions representable as
mixtures. The most promising lead is the use of Moment Generating Functions (MGFs), but there is little
theory on their properties. Schwartz [50] is one of the few people who developed a theory on distributions
and their Laplace transforms, but, to our knowledge, the relevant pieces are missing.
Regardless, one of the biggest challenges is to popularize the use of such models. The gradient descent
approach for VI of Hensman and Matthews [21] is by far the most popular, partly due to the success of
the GPflow library [36]. Implementing these augmentations in popular libraries would be a good step.
There has been an effort in the Julia programming language [4] with the AugmentedGPLikelihoods.jl
package [15], but implementations in GPyTorch [17] or GPflow would help the adoption of these techniques.
References
[1]
Amari, S. I. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251–276.
[2]
Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of
the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial
and Applied Mathematics.
[3]
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint
arXiv:1701.02434.
[4]
Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical
computing. SIAM Review, 59(1):65–98.
[5]
Bock, R., Chilingarian, A., Gaug, M., Hakl, F., Hengstebeck, T., Jiřina, M., Klaschka, J., Kotrč, E., Savický,
P., Towers, S., et al. (2004). Methods for multidimensional event classification: a case study using images
from a cherenkov gamma-ray telescope. Nuclear Instruments and Methods in Physics Research Section A:
Accelerators, Spectrometers, Detectors and Associated Equipment, 516(2-3):511–528.
[6]
Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of markov chain monte carlo. CRC
press.
[7]
Bui, T. D., Yan, J., and Turner, R. E. (2017a). A unifying framework for gaussian process pseudo-
point approximations using power expectation propagation. The Journal of Machine Learning Research,
18(1):3649–3720.
[8]
Bui, T. D., Yan, J., and Turner, R. E. (2017b). A Unifying Framework for Gaussian Process Pseudo-Point
Approximations using Power Expectation Propagation. arXiv:1605.07066 [cs, stat].
[9] Cressie, N. (1990). The origins of kriging. Mathematical geology, 22(3):239–252.
[10] Csató, L. (2002). Gaussian processes: iterative sparse approximations. PhD thesis.
[11]
Csató, L. and Opper, M. (2002). Sparse on-line Gaussian processes. Neural computation, 14(3):641–668.
[12]
Donner, C. and Opper, M. (2018). Efficient bayesian inference for a gaussian process density model. arXiv
preprint arXiv:1805.11494.
[13]
Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid monte carlo. Physics letters
B, 195(2):216–222.
[14] Galy-Fajou, T. (2021). theogf/AugmentedGaussianProcesses.jl.
[15] Galy-Fajou, T. (2022). JuliaGaussianProcesses/AugmentedGPLikelihoods.jl: v0.4.9.
[16]
Galy-Fajou, T., Widmann, D., Yalburgi, S., willtebbutt, st, Falk, I., Ridderbusch, S., Wright, T., david
vicente, Khan, S., Ge, H., Giersdorf, J., TagBot, J., Mones, L., Monticone, P., Viljoen, R., Schölly, S., and
Öcal, K. (2022). JuliaGaussianProcesses/KernelFunctions.jl.
[17]
Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. (2018). Gpytorch: Blackbox
matrix-matrix gaussian process inference with gpu acceleration. Advances in neural information processing
systems, 31.
[18]
Gorinova, M., Moore, D., and Hoffman, M. (2020). Automatic Reparameterisation of Probabilistic
Programs. In International Conference on Machine Learning, pages 3648–3657. PMLR.
[19]
Harrison Jr, D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal
of environmental economics and management, 5(1):81–102.
[20]
Henao, R., Yuan, X., and Carin, L. (2014). Bayesian Nonlinear Support Vector Machines and Discriminative
Factor Modeling. Nips, (Mcmc):1–9.
[21]
Hensman, J. and Matthews, A. (2015). Scalable Variational Gaussian Process Classification. Aistats,
38:1–9. arXiv:1411.2005.
[22]
Hensman, J., Matthews, A. G. d. G., Filippone, M., and Ghahramani, Z. (2015). MCMC for Variationally
Sparse Gaussian Processes. arXiv:1506.04000 [stat].
[23]
Hensman, J., Sheffield, U., Fusi, N., and Lawrence, N. (2013). Gaussian Processes for Big Data. Proceedings
of UAI 29, pages 282–290. arXiv:1309.6835.
[24]
Hernandez-Lobato, J., Li, Y., Rowland, M., Bui, T., Hernández-Lobato, D., and Turner, R. (2016). Black-
box alpha divergence minimization. In International Conference on Machine Learning, pages 1511–1520.
PMLR.
[25]
Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in
Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
[26]
Jaakkola, T. S. and Jordan, M. I. (1997). A Variational Approach to Bayesian Logistic Regression Models
and their Extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, pages 283–294.
PMLR.
[27]
Jaakkola, T. S. and Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. Statistics
and Computing, 10(1):25–37.
[28]
Jensen, C. S., Kjærulff, U., and Kong, A. (1995). Blocking gibbs sampling in very large probabilistic expert
systems. International Journal of Human-Computer Studies, 42(6):647–666.
[29]
Jordan, M. I. and Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science,
349(6245):255–260.
[30]
Kulesza, A. and Taskar, B. (2012). Determinantal point processes for machine learning. pages 1–120.
arXiv:1207.6083.
[31]
Lázaro-Gredilla, M. and Figueiras-Vidal, A. (2009). Inter-domain gaussian processes for sparse inference
using inducing features. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A., editors,
Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc.
[32]
Lázaro-Gredilla, M. and Titsias, M. K. (2011). Variational heteroscedastic gaussian process regression. In
ICML.
[33]
Li, Y. and Turner, R. E. (2016). Rényi divergence variational inference. Advances in neural information
processing systems, 29.
[34]
Lin, W., Schmidt, M., and Khan, M. E. (2020). Handling the Positive-Definite Constraint in the Bayesian
Learning Rule. arXiv:2002.10060 [cs, stat].
[35]
Liu, J. S. (1994). The collapsed gibbs sampler in bayesian computations with applications to a gene
regulation problem. Journal of the American Statistical Association, 89(427):958–966.
[36]
Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P.,
Ghahramani, Z., and Hensman, J. (2017). GPflow: A Gaussian process library using TensorFlow. Journal of
Machine Learning Research, 18(40):1–6.
[37]
Murphy, K. P. (2012). Machine learning: a probabilistic perspective. Adaptive computation and machine
learning series. MIT Press, Cambridge, MA.
[38]
Murray, I., Adams, R., and MacKay, D. (2010). Elliptical slice sampling. In Proceedings of the thirteenth
international conference on artificial intelligence and statistics, pages 541–548. JMLR Workshop and
Conference Proceedings.
[39]
Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31(3):705–741. arXiv:1003.3201v1.
[40]
Neal, R. M. et al. (2011). Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo,
2(11):2.
[41]
Nguyen, T. M. and Wu, Q. M. (2012). Robust student’s-t mixture model with spatial constraints and its
application in medical image segmentation. IEEE Transactions on Medical Imaging, 31(1):103–116.
[42]
O’Hagan, A. and Forster, J. J. (2004). Kendall’s advanced theory of statistics, volume 2B: Bayesian
inference, volume 2. Arnold.
[43]
Palmer, J. A. (2006). Variational and scale mixture representations of non-Gaussian densities for estimation
in the Bayesian linear model: Sparse coding, independent component analysis, and minimum entropy
segmentation. PhD thesis, UC San Diego.
[44]
Polson, N. G., Scott, J. G., and Windle, J. (2012). Bayesian inference for logistic models using Polya-Gamma
latent variables. pages 1–42. arXiv:1205.0310.
[45]
Quinonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate gaussian
process regression. The Journal of Machine Learning Research, 6:1939–1959.
[46]
Rasmussen, C. E. and Williams, C. K. I. (2018). Gaussian Processes for Machine Learning, volume 1. MIT
press Cambridge.
[47]
Ridout, M. S. (2009). Generating random numbers from a distribution specified by its Laplace transform.
Statistics and Computing, 19(4):439.
[48]
Salimbeni, H., Eleftheriadis, S., and Hensman, J. (2018). Natural Gradients in Practice: Non-Conjugate
Variational Inference in Gaussian Process Models. arXiv:1803.09151 [cs, stat].
[49] Schlaifer, R. and Raiffa, H. (1961). Applied statistical decision theory.
[50]
Schwartz, L. (1952). Transformation de laplace des distributions. Comm. Sém. Math. Univ. Lund [Medd.
Lunds Univ. Mat. Sem.], 1952(Tome Supplémentaire):196–206.
[51]
Snelson, E. and Ghahramani, Z. (2009). Sparse Gaussian Processes using Pseudo-inputs. Advances in
Neural Information Processing Systems 18, pages 1–24.
[52]
Solin, A., Hensman, J., and Turner, R. E. (2018). Infinite-Horizon Gaussian Processes. arXiv:1811.06588
[cs, stat].
[53]
Titsias, M. (2009). Variational Learning of Inducing Variables in Sparse Gaussian Processes. Aistats,
5:567–574.
[54]
Titsias, M. and Lázaro-Gredilla, M. (2014). Doubly stochastic variational bayes for non-conjugate inference.
In International conference on machine learning, pages 1971–1979. PMLR.
[55]
Turner, R., Deisenroth, M., and Rasmussen, C. (2010). State-space inference and learning with gaussian
processes. In Teh, Y. W. and Titterington, M., editors, Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 868–875,
Chia Laguna Resort, Sardinia, Italy. PMLR.
[56]
van der Wilk, M., Dutordoir, V., John, S., Artemev, A., Adam, V., and Hensman, J. (2020). A framework
for interdomain and multioutput gaussian processes.
[57]
Van Erven, T. and Harremos, P. (2014). Rényi divergence and kullback-leibler divergence. IEEE
Transactions on Information Theory, 60(7):3797–3820.
[58]
Wang, C. and Neal, R. M. (2012). Gaussian Process Regression with Heteroscedastic or Non-Gaussian
Residuals. arXiv:1212.6246 [cs, stat].
[59]
Wenzel, F., Galy-Fajou, T., Deutsch, M., and Kloft, M. (2017). Bayesian nonlinear support vector machines
for big data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases,
pages 307–322. Springer.
[60]
Wenzel, F., Galy-Fajou, T., Donner, C., Kloft, M., and Opper, M. (2018). Efficient Gaussian
Process Classification Using Polya-Gamma Data Augmentation. arXiv:1802.06383 [cs, stat].
[61]
Widmann, D., willtebbutt, Galy-Fajou, T., st, Yalburgi, S., Ge, H., david vicente, Bosch, N., Schmitz, N.,
Viljoen, R., Wright, T., and andreaskoher (2022). JuliaGaussianProcesses/AbstractGPs.jl.
[62]
Williams, C. K., Rasmussen, C. E., Schwaighofer, A., and Tresp, V. (2002). Observations on the nyström
method for gaussian process prediction.
[63]
Wilson, J. T., Borovitskiy, V., Terenin, A., Mostowsky, P., and Deisenroth, M. P. (2021). Pathwise
conditioning of gaussian processes. Journal of Machine Learning Research, 22(105):1–47.
A Additional work
The following work does not fit the storyline of the thesis and is therefore presented here only as a side
project.
A.1 Adaptive Inducing Points Selection for Gaussian Processes
Two important questions raised when using the sparse GPs presented in Section 2.2.3 are: How should
the inducing points be located? How many points does one need to reach a desired level of accuracy?
This work tries to answer these questions by proposing an adaptive algorithm, working in O(N) time
and also valid in an online setting.
Although the algorithm proves to be more efficient than standard methods and to have interesting
theoretical properties related to Determinantal Point Processes [30], it has serious tuning issues. The
parameters regulating the algorithm, i.e. how often one adds or removes a point, are tightly correlated
with the kernel hyperparameters. When optimizing hyperparameters during training, an unstable behavior
may lead to picking all points as inducing points or selecting none. I presented this work at the
Continual Learning Workshop of ICML 2020.
Authors:
Théo Galy-Fajou¹, Manfred Opper¹
¹ TU Berlin
Details:
Type: Workshop article
Submitted: June 2020
Accepted: July 2020
URL: https://arxiv.org/abs/2107.10066
Workshop: Continual Learning (ICML 2020)
Adaptive Inducing Points Selection for Gaussian Processes
Théo Galy-Fajou¹, Manfred Opper¹
¹ Technical University of Berlin
Abstract
Gaussian Processes (GPs) are flexible non-parametric models with strong probabilistic interpretation.
While being a standard choice for performing inference on time series, GPs have few techniques to work
in a streaming setting. Bui et al. (2017) developed an efficient variational approach to train online
GPs by using sparsity techniques: the whole set of observations is approximated by a smaller set of
inducing points (IPs), which are moved around with new data. Both the number and the locations of the
IPs greatly affect the performance of the algorithm. In addition to optimizing their locations, we
propose to adaptively add new points, based on the properties of the GP and the structure of the data.
1. Introduction
Gaussian Processes (GPs) are flexible non-parametric models with strong probabilistic interpretation.
They are particularly suited to time series (Roberts et al., 2013), but one of their biggest limitations
is that they scale cubically with the number of points (Williams & Rasmussen, 2006).
Quinonero-Candela & Rasmussen (2005) introduced the notion of sparse GPs, models approximating the
posterior by a smaller number $M$ of inducing points (IPs) and reducing the inference complexity from
$O(N^3)$ to $O(M^3)$. Titsias (2009) later introduced them in a variational setting, allowing their
locations to be optimized. Based on this idea, Bui et al. (2017) introduced a variational streaming
model relying on inducing points. One of their algorithm's features is that hyper-parameters can be
optimized and, more specifically, the number of inducing points can vary between batches of data.
However, in their work, the number of IPs is fixed and their locations are simply optimized against the
variational bound of the marginal likelihood. Having a fixed number of IPs limits the model's scope if
the total data size is unknown. A gradient-based approach leads to two problems:
- IP locations need to be optimized until convergence for every batch. Therefore batches need to be
sufficiently large to get a meaningful improvement. If the new data comes in very far from the original
positions of the IPs, the optimization will be extremely slow.
- The number of IPs being fixed, there is no way to know how many will be required to reach a desired
accuracy.
Finding the optimal number of IPs is also not an option as it is an ill-posed problem: the objective
will only decrease with more IPs, i.e. the optimum is obtained when every data point is an IP.
Figure 1: Illustration of the inducing point selection process. Blue points represent inducing points,
green points represent data, and the orange line represents the mean of the prediction from the GP model,
surrounded by one standard error. The dashed lines represent the space covered by the existing IPs; only
points seen outside those areas are selected as new IPs.
We propose a different approach to this problem with a simple algorithm, Online Inducing Points
Selection (OIPS), requiring only one parameter to automatically select both the number of inducing
points and their locations. OIPS naturally takes into account the structure of the data, while the
performance trade-off and the expected number of IPs can be inferred.
Our main contributions are as follows:
- We develop an efficient online algorithm to automatically select the number and location of inducing
points for a streaming GP.
- We give theoretical guarantees on the expected number of inducing points and the performance of the GP.
In Section 2 we present existing methods to select inducing points, as well as an online inference
method for GPs. We present our algorithm and its theoretical guarantees in Section 3. We show our
experiments in comparison with popular inducing points selection methods in Section 4. Finally, we
summarize our findings and explore outlooks in Section 5.
2. Background
2.1. Sparse Variational Gaussian Processes
Gaussian Processes: Given some training data $\mathcal{D} = \{X, y\}$, where $X = \{x_i\}_{i=1}^N$ are the
inputs $x_i \in \mathbb{R}^D$ and $y = \{y_i\}_{i=1}^N$ are the labels, we want to compute the predictive
distribution $p(y^* \mid \mathcal{D}, x^*)$ for new inputs $x^*$. In order to do this, we try to find an
optimal distribution over a latent function $f$. We set the latent vector $f$ as the realization of
$f(X)$, where $f_i = f(x_i)$, and put a GP prior $GP(\mu_0, k)$ on $f$, with $\mu_0$ the prior mean (set
to 0 without loss of generality) and $k$ a kernel function. In this work we use an isotropic squared
exponential kernel (SE kernel), $k(x, x') = \exp(-\|x - x'\|^2 / l^2)$, but the approach is generally
applicable to all translation-invariant kernels. We then compute the posterior:
$$p(f \mid \mathcal{D}) = \frac{\prod_{i=1}^N p(y_i \mid f_i) \, p(f)}{p(\mathcal{D})} \tag{1}$$
where $p(f) = \mathcal{N}(0, K_{XX})$ and $K_{XX}$ is the kernel matrix evaluated on $X$ (in later
notation we use $K_X$ instead of $K_{XX}$). For a Gaussian likelihood the posterior
$p(f \mid \mathcal{D})$ is known analytically in closed form. Prediction and inference nonetheless have a
complexity of $O(N^3)$.
Sparse Variational Gaussian Processes: When the likelihood is not Gaussian, there is no tractable
solution for the posterior. One possible approximation is to use variational inference: a family of
distributions over $f$ is selected, e.g. the multivariate Gaussian $q(f) = \mathcal{N}(m, S)$, and one
optimizes the variational parameters $m$ and $S$ by minimizing the negative ELBO, a proxy for the KL
divergence $\mathrm{KL}(q(f) \,\|\, p(f \mid \mathcal{D}))$. However, the computational complexity still
grows cubically with the number of samples, and is therefore inadequate for large datasets.
Quinonero-Candela & Rasmussen (2005) and Titsias (2009) introduced the notion of sparse variational GPs
(SVGP). One adds inducing variables $u$ and their inducing locations $Z = \{Z_i\}_{i=1}^M$ to the model.
In this work we restrict $Z_i$ to be in the same domain as $X_i$, but inter-domain approaches also exist
(Hensman et al., 2017). The relation between $u$ and $f$ is given by the distribution
$p(f, u) = p(f \mid u) p(u)$, where
$$p(f \mid u) = \mathcal{N}(f \mid K_{XZ} K_Z^{-1} u, \tilde{K}), \quad p(u) = \mathcal{N}(0, K_Z), \tag{2}$$
where $\tilde{K} = K_X - K_{XZ} K_Z^{-1} K_{ZX}$.
Then we approximate $p(f, u)$ with the variational distribution $q(f, u) = p(f \mid u) q(u)$, where
$q(u) = \mathcal{N}(\mu, \Sigma)$, by optimizing $\mathrm{KL}(q(f, u) \,\|\, p(f, u \mid \mathcal{D}))$.
Note that if the likelihood is Gaussian, the optimal variational parameters $\mu^*$ and $\Sigma^*$ are
known in closed form. The only parameters left to optimize are the kernel parameters, as well as
selecting the number and the location of the inducing variables.
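The conditional $p(f \mid u)$ of Equation (2) can be computed directly from kernel matrices. The following is a minimal sketch assuming NumPy and the SE kernel defined above; the function names and the jitter value are illustrative choices rather than part of the paper.

```python
import numpy as np

def se_kernel(A, B, l=1.0):
    """Isotropic SE kernel k(x, x') = exp(-||x - x'||^2 / l^2) between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / l**2)

def conditional_f_given_u(X, Z, u, l=1.0, jitter=1e-8):
    """Mean K_XZ K_Z^{-1} u and covariance K_X - K_XZ K_Z^{-1} K_ZX of p(f | u)."""
    Kz = se_kernel(Z, Z, l) + jitter * np.eye(len(Z))  # jitter for numerical stability
    Kxz = se_kernel(X, Z, l)
    L = np.linalg.cholesky(Kz)                         # O(M^3)
    # A = K_XZ K_Z^{-1}, obtained from two triangular systems, O(N M^2)
    A = np.linalg.solve(L.T, np.linalg.solve(L, Kxz.T)).T
    mean = A @ u
    cov = se_kernel(X, X, l) - A @ Kxz.T
    return mean, cov

# Toy usage: three inducing locations and two inputs on the real line.
Z = np.array([[0.0], [2.0], [4.0]])
u = np.array([1.0, -0.5, 0.3])
X = np.array([[1.0], [3.0]])
mean, cov = conditional_f_given_u(X, Z, u)
print(mean)
print(np.diag(cov))
```

When an input coincides with an inducing location, the conditional variance at that input collapses to (numerically) zero and the mean equals the corresponding entry of $u$.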
2.2. Inducing points selection methods
Titsias (2009) initially proposed to select the point locations via a greedy selection: a small batch of
data is randomly sampled, and each sample is successively tested by adding it to the set of inducing
points and evaluating the improvement on the ELBO. The sample bringing the best performance is added to
the set of inducing points and the operation is repeated until the desired number of inducing points is
reached. This greedy approach has the advantage of selecting a set which is already close to the optimal
set, but it is extremely expensive and is not applicable to non-conjugate likelihoods as it relies on
estimating the optimal bound.
The most popular approach currently is to use the k-means++ algorithm (Arthur & Vassilvitskii, 2007) and
take the optimized cluster centers as inducing point locations. The clustering nature of the algorithm
gives good coverage of the whole dataset. However, the k-means algorithm has a complexity of $O(NMDT)$
on the whole dataset, where $T$ is the number of k-means iterations. Another issue is that it might
allocate multiple centers in a region of high density, leading to very close inducing points and no
significant performance improvement. It is also not applicable online and does not solve the problem of
choosing the number of inducing points.
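For reference, the seeding step of k-means++ can be sketched as follows, assuming NumPy. This covers only the initialization by squared-distance sampling; the subsequent k-means iterations that produce the optimized centers are omitted, and the function name is hypothetical.

```python
import numpy as np

def kmeanspp_seeding(X, M, rng):
    """Pick M rows of X as initial centers with k-means++ (D^2) sampling.

    Assumes X contains at least M distinct points.
    """
    N = X.shape[0]
    idx = [rng.integers(N)]                       # first center: uniform at random
    d2 = ((X - X[idx[0]]) ** 2).sum(axis=1)       # squared distance to nearest center
    for _ in range(1, M):
        probs = d2 / d2.sum()                     # sample proportionally to squared distance
        new = rng.choice(N, p=probs)
        idx.append(new)
        d2 = np.minimum(d2, ((X - X[new]) ** 2).sum(axis=1))
    return X[idx]

# Toy usage on synthetic 2-D data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Z = kmeanspp_seeding(X, 10, rng)
print(Z.shape)
```

Each new center requires a pass over all $N$ points, and the number of centers $M$ has to be fixed beforehand, which illustrates the two limitations mentioned above.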
Another classical approach is to simply take a grid. For example, Moreno-Muñoz et al. (2019) use a grid
in an online setting by updating the bounds of a uniform grid. Using a grid is unfortunately limited to a
small number of dimensions and does not take into account the structure of the data.
2.3. Online Variational Gaussian Process Learning
Bui et al. (2017) developed a streaming algorithm for GPs (SSVGP) based on the inducing points approach of Titsias (2009). The method consists in recursively optimizing the variational distribution $q_t(u_t, f)$ for each new batch of data $\mathcal{D}_t$ given the previous variational distribution $q_{t-1}(u_{t-1}, f)$. Initially, $q_t$ approximates the posterior:
$$p(u_t, f \mid \mathcal{D}_{1:t}) = \frac{p(\mathcal{D}_t \mid f)\, p(\mathcal{D}_{1:(t-1)} \mid f)\, p(u_t, f \mid \theta_t)}{p(\mathcal{D}_{1:t})} \tag{3}$$
where $\theta_t$ is the set of hyper-parameters. Since $\mathcal{D}_{1:(t-1)}$ is no longer accessible, the likelihood on previously seen
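data is approximated using the previous variational approximation $q_{t-1}(u_{t-1})$ and the previous hyper-parameters $\theta_{t-1}$:
$$p(\mathcal{D}_{1:(t-1)} \mid f) \approx \frac{q_{t-1}(u_{t-1})\, p(\mathcal{D}_{1:(t-1)})}{p(u_{t-1} \mid \theta_{t-1})}.$$
The distribution approximated by $q_t$ is in the end:
$$q_t(u_t, f \mid \mathcal{D}_{1:t}) \approx \frac{p(\mathcal{D}_t \mid f)\, q_{t-1}(u_{t-1})\, p(u_t, f \mid \theta_t)}{p(u_{t-1} \mid \theta_{t-1})}\, \frac{p(\mathcal{D}_{1:(t-1)})}{p(\mathcal{D}_{1:t})} \tag{4}$$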
The optimization of the (bound on the) KL divergence be-
tween the two distributions for each new batch will pre-
serve the information of D1:(t−1) via qt−1and ensure a
smooth transition of the hyper-parameters, including the
number of inducing points. We give all technical details
including the hyper-parameter derivatives and the ELBO in
full form in appendix A.
3. Algorithm
The idea of our algorithm is that, to give a good approximation, a large majority of the samples should be "close" (in the reproducing kernel Hilbert space (RKHS)) to the set $Z$ of IP locations. Additionally, $Z$ should be as diverse as possible, since degenerate (near-duplicate) IPs will not improve the approximation. This intuition is supported by previous works:
- Bauer et al. (2016) showed that the most substantial improvement obtained by adding a new inducing point comes from the reduction of the uncertainty of $q(f)$, which decreases quadratically with $K_{XZ}$.
- Burt et al. (2019) showed that the quality of the approximation made with inducing points is bounded by the norm of $K_X - Q_X$, where $Q_X = K_{XZ} K_Z^{-1} K_{ZX}$.
Therefore, by ensuring that $K_{XZ}$ and $|K_Z|$ are sufficiently large, we can expect an improvement of the approximation of the non-sparse problem.
3.1. Adding New Inducing Points
A simple yet efficient strategy is to verify that, for each new data point $x$ seen during training, there exists a close inducing point. We first compute $K_{xZ} = [k(x, Z_1), \ldots, k(x, Z_M)]$. If the maximum value of $K_{xZ}$ is smaller than a threshold parameter $\rho$, the sample is added to the set of IPs $Z$. If not, the algorithm passes on to the next sample. We summarize all steps in Algorithm 1.
The streaming nature of the algorithm makes it well suited for an online learning setting: it needs to see samples only once, whereas other algorithms like k-means need to parse all the data multiple times before converging. It is fully deterministic for a given sequence of samples, and therefore convergence guarantees can be given under some conditions. This approach was previously explored in a different context by Csató & Opper (2002), but was limited to small datasets.

Algorithm 1 Online Inducing Point Selection (OIPS)
Input: sample $x$, set of inducing points $Z = \{Z_j\}_{j=1}^M$, acceptance threshold $0 < \rho < 1$, kernel function $k$
  $d \leftarrow \max_j k(x, Z_j)$
  if $d < \rho$ then
    $\{Z_j\} \leftarrow \{Z_j\} \cup \{x\}$
    $M \leftarrow M + 1$
  end if
  return $\{Z_j\}$
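The selection rule of Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `se_kernel`, `oips_step` and `oips` are hypothetical, the kernel is assumed to be a unit-variance squared exponential, and an empty set $Z$ is assumed to accept the first sample.

```python
import math

def se_kernel(x, z, lengthscale=1.0):
    """Unit-variance squared exponential kernel, k(x, z) = exp(-||x - z||^2 / h^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / lengthscale ** 2)

def oips_step(x, inducing_points, rho, kernel=se_kernel):
    """One call of Algorithm 1: add x to Z if no inducing point is close to it."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie in (0, 1)")
    # d = max_j k(x, Z_j); assumption: an empty Z gives d = 0, so the first sample is kept
    d = max((kernel(x, z) for z in inducing_points), default=0.0)
    if d < rho:
        inducing_points.append(x)
    return inducing_points

def oips(stream, rho, kernel=se_kernel):
    """Parse a stream of samples once and return the selected inducing points."""
    Z = []
    for x in stream:
        oips_step(x, Z, rho, kernel)
    return Z

# Toy 1D stream: near-duplicates of already selected points are skipped
stream = [(0.0,), (0.1,), (1.0,), (1.05,), (2.5,)]
Z = oips(stream, rho=0.5)
print(Z)
```

Each sample is visited exactly once and the only quantity needed is the vector of kernel values between the sample and the current set, which is what makes the rule cheap in a streaming setting.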
The algorithm comes at virtually no extra cost, since $K_{XZ}$ needs to be computed anyway for the variational updates of the model.
One of our claims is that our algorithm is model and data agnostic. The reason is that, as the kernel hyper-parameters are optimized, the acceptance condition changes as well.
Note that this method can be interpreted as a half-greedy
approach of a sequential sampling of a determinantal point
process (Kulesza & Taskar,2012). In appendix B, we show
that for the same number of points, the probability of our
selected set is higher than the one of a k-DPP.
3.2. Theoretical guarantees
The final size of $Z$ depends on many factors: the selected threshold $\rho$, the chosen kernel, the structure of the data (distribution, sparsity, etc.) and the number of points seen. However, under some weak assumptions on the data, we can prove a bound on the expected number of inducing points as well as on the quality of the variational approximation.
Expected number of inducing points: Since the selection process depends directly on the data, it is impossible to give a bound that holds for arbitrary data. However, by adding assumptions on the distribution of $x$, one can bound the expected number of inducing points:
Theorem 1. Given a dataset i.i.d and uniformly dis-
tributed, i.e. x∼ U(0, a)D, and a SE kernel with length-
scale lD1, the expected number of selected inducing
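points $M$ after parsing $N$ points is
$$\mathbb{E}[M \mid N] \le \frac{a^D - (a^D - \alpha)^{N+1}}{\alpha}, \tag{5}$$
where $\alpha = \left(\frac{l\sqrt{-D \log \rho}}{2}\right)^D$.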
The proof is given in appendix C. As $N \to \infty$, this bound converges to $a^D/\alpha$, which is the estimated number of overlapping hyper-spheres of radius $l\sqrt{-D \log \rho}$ needed to fill a hypercube of dimension $D$ with side length $a$. This can be used as an upper bound for any data lying in a compact domain. This confirms the intuition that the number
of selected inducing points will grow faster with larger dimensions and a larger $\rho$ and with smaller lengthscales.
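To get a feeling for how the bound of equation (5) reacts to its parameters, the right-hand side can be evaluated directly. The sketch below is an illustration under an explicit assumption: the side length defaults to $a = 1$, so that volumes can be read as probabilities. The helper names `alpha` and `expected_ips_bound` are hypothetical.

```python
import math

def alpha(lengthscale, rho, dim):
    """alpha = (l * sqrt(-D log rho) / 2)^D, as defined after equation (5)."""
    return (lengthscale * math.sqrt(-dim * math.log(rho)) / 2.0) ** dim

def expected_ips_bound(n_seen, lengthscale, rho, dim, side=1.0):
    """Right-hand side of equation (5); side = 1 is an assumption of this sketch."""
    vol = side ** dim
    a = alpha(lengthscale, rho, dim)
    return (vol - (vol - a) ** (n_seen + 1)) / a

# The bound grows with the threshold rho and shrinks with the lengthscale
for rho in (0.5, 0.7, 0.9):
    print(rho, expected_ips_bound(1000, lengthscale=0.3, rho=rho, dim=3))
for lengthscale in (0.1, 0.3, 0.5):
    print(lengthscale, expected_ips_bound(1000, lengthscale=lengthscale, rho=0.7, dim=3))
```

With no extra point seen (`n_seen = 0`) the expression reduces to 1, which matches the initial condition used in the proof.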
Expected performance on regression: Burt et al. (2019) derived a convergence bound for the inducing points approach of Titsias (2009). Although they derive this bound in an offline setting, it is still relevant for online problems. They show that when $Z$ is sampled via a k-DPP (Kulesza & Taskar, 2011), i.e. a determinantal point process conditioned on a fixed set size, the difference between the ELBO and the log evidence $\log p(\mathcal{D})$ is bounded by
$$\mathbb{E}_Z\left[\|K_X - Q_X\|\right] \le (M + 1) \sum_{i=M+1}^{N} \lambda_i(K_X) \tag{6}$$
where $\lambda_i(K_X)$ is the $i$-th largest eigenvalue of $K_X$ and $Q_X = K_{XZ} K_Z^{-1} K_{ZX}$ is the Nyström approximation of $K_X$.
We derive a similar bound when using our algorithm instead of k-DPP sampling:
Theorem 2. Let $Z$ be the set of inducing point locations of size $M$ selected via Algorithm 1 on the dataset $X$ of size $N$. Then
$$\|K_X - Q_X\| \le (N - M)\left(1 - \frac{\rho^2}{1 + M(M-1)\rho}\right) \tag{7}$$
where $K_X$ is the kernel matrix on $X$ and $Q_X$ is the Nyström approximation of $K_X$ using the subset $Z$.
The proof and an empirical comparison are given in appendix D.
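The bound of equation (7) can be checked numerically on toy data. The following sketch is an illustration, not the experiment of appendix D: it assumes a unit-variance SE kernel, uniform data in the unit cube, the Frobenius norm, and that the first sample seeds $Z$. The helper names `se_kernel_matrix` and `oips_select` are hypothetical.

```python
import numpy as np

def se_kernel_matrix(A, B, lengthscale):
    """Unit-variance SE kernel matrix, k(x, x') = exp(-||x - x'||^2 / h^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / lengthscale ** 2)

def oips_select(X, rho, lengthscale):
    """Indices of the points kept by Algorithm 1 when X is parsed in order."""
    selected = [0]  # assumption: the first sample seeds Z
    for i in range(1, len(X)):
        k_xz = se_kernel_matrix(X[i:i + 1], X[selected], lengthscale)
        if k_xz.max() < rho:
            selected.append(i)
    return selected

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 3))
rho, h = 0.7, 0.3

Z = X[oips_select(X, rho, h)]
N, M = len(X), len(Z)

K_X = se_kernel_matrix(X, X, h)
K_XZ = se_kernel_matrix(X, Z, h)
K_Z = se_kernel_matrix(Z, Z, h)
Q_X = K_XZ @ np.linalg.solve(K_Z, K_XZ.T)  # Nystrom approximation of K_X

error = np.linalg.norm(K_X - Q_X, "fro")
bound = (N - M) * (1.0 - rho ** 2 / (1.0 + M * (M - 1) * rho))
print(f"M = {M}, ||K_X - Q_X||_F = {error:.3f}, bound = {bound:.3f}")
```

Because the denominator $1 + M(M-1)\rho$ grows quickly with $M$, the bound stays close to $N - M$ in this regime, so the check mostly confirms that the inequality holds rather than that it is tight.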
4. Experiments
In this section we take a quick look at how our algorithm performs in different settings compared to the approaches described in section 2.2. We combine the online model SSVGP described in section 2 with different IP selection techniques: we select IPs from the first batch via k-means and then optimize them (k-means/opt), select them via our algorithm and optimize them (OIPS/opt), select them via our algorithm without optimizing them (OIPS), and finally create a grid that we adapt according to new bounds (Grid). We consider 3 different toy datasets, two of which are displayed in figure 2. Dataset A is a uniform time series whose output function is a noisy sine. Dataset B is an irregular time series, with a gap in the inputs; its output function is also a noisy sine. The inputs of dataset C are random samples from an isotropic multivariate 3D Gaussian, and the output function is given by $\sin(\|x\|)/\|x\|$. All datasets contain 200 training points and 200 test points. For all experiments we use an isotropic SE kernel with fixed parameters. For datasets A and B, Grid and k-means have 25 IPs while OIPS converged to around 20 IPs. For dataset C, Grid has $10^3$ IPs, k-means 50, and both OIPS variants converged to 10 IPs. Figure 2 shows the evolution of the average negative log-likelihood on test data after every batch has been seen. In a uniform time-series context all methods are roughly equivalent. The presence of a gap blocks the optimization of IP locations and impedes inference on future points, whereas the grid suffers from being in high dimensions. All details on the datasets, training methods, hyper-parameters and optimization parameters are given in appendix E.

Figure 2: Toy datasets A and B, divided in 4 batches. Average negative test log-likelihood on a test set as a function of the number of batches seen. In a uniform streaming setting all methods perform similarly, but having a gap blocks the convergence of a simple position optimization, whereas in a non-compact situation the adaptive grid suffers in performance.
5. Conclusion
We presented a new algorithm, OIPS, able to select inducing points automatically for a GP in an online setting. The theoretical bounds we derived improve on previous work based on DPPs. The selection process can still be improved to make it robust to outliers and to variations of the hyper-parameters. Using, for instance, a threshold on the median or on the mean over the k-nearest IPs could help to avoid picking adversarial points such as outliers. We have only considered regression, but our algorithm is also compatible with non-conjugate likelihoods: using augmentation approaches (Wenzel et al., 2019; Galy-Fajou et al., 2019), the same performance can be attained. Finally, the most interesting improvement would be to use a non-stationary kernel (Remes et al., 2017) and be able to automatically adapt the number of inducing points across the dataset.
References
Arthur, D. and Vassilvitskii, S. k-means++: The advan-
tages of careful seeding. In Proceedings of the eighteenth
annual ACM-SIAM symposium on Discrete algorithms,
pp. 1027–1035. Society for Industrial and Applied Math-
ematics, 2007.
Bauer, M., van der Wilk, M., and Rasmussen, C. E. Under-
standing probabilistic sparse gaussian process approxi-
mations. In Advances in neural information processing
systems, pp. 1533–1541, 2016.
Belabbas, M.-A. and Wolfe, P. J. Spectral methods in ma-
chine learning and new strategies for very large datasets.
Proceedings of the National Academy of Sciences, 106
(2):369–374, 2009.
Bui, T. D., Nguyen, C., and Turner, R. E. Streaming sparse
gaussian process approximations. In Advances in Neural
Information Processing Systems, pp. 3299–3307, 2017.
Burt, D., Rasmussen, C. E., and Van Der Wilk, M. Rates
of convergence for sparse variational gaussian process
regression. In International Conference on Machine
Learning, pp. 862–871, 2019.
Csató, L. and Opper, M. Sparse on-line gaussian processes.
Neural computation, 14(3):641–668, 2002.
Galy-Fajou, T., Wenzel, F., Donner, C., and Opper, M.
Multi-class gaussian process classification made conju-
gate: Efficient inference via data augmentation. arXiv
preprint arXiv:1905.09670, 2019.
Hensman, J., Durrande, N., and Solin, A. Variational
fourier features for gaussian processes. The Journal of
Machine Learning Research, 18(1):5537–5588, 2017.
Kulesza, A. and Taskar, B. k-dpps: Fixed-size determinan-
tal point processes. In Proceedings of the 28th Interna-
tional Conference on Machine Learning (ICML-11), pp.
1193–1200, 2011.
Kulesza, A. and Taskar, B. Determinantal point processes for machine learning. pp. 1–120, 2012. ISSN 1935-8237. doi: 10.1561/2200000044. URL http://arxiv.org/abs/1207.6083. arXiv: 1207.6083. ISBN: 9781601986283.
Moreno-Muñoz, P., Artés-Rodríguez, A., and Álvarez, M. A. Continual multi-task gaussian processes. arXiv preprint arXiv:1911.00002, 2019.
Quiñonero-Candela, J. and Rasmussen, C. E. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
Remes, S., Heinonen, M., and Kaski, S. Non-stationary
spectral kernels. In Advances in Neural Information Pro-
cessing Systems, pp. 4642–4651, 2017.
Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N., and Aigrain, S. Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550, February 2013. ISSN 1364-503X, 1471-2962. doi: 10.1098/rsta.2011.0550. URL https://royalsocietypublishing.org/doi/10.1098/rsta.2011.0550.
Stewart, G. W. and guang Sun, J. Matrix Perturbation The-
ory. Academic Press, 1990.
Titsias, M. Variational learning of inducing variables in
sparse gaussian processes. In Artificial Intelligence and
Statistics, pp. 567–574, 2009.
Wenzel, F., Galy-Fajou, T., Donner, C., Kloft, M., and Opper, M. Efficient gaussian process classification using Pólya-gamma data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5417–5424, 2019.
Williams, C. K. and Rasmussen, C. E. Gaussian processes
for machine learning, volume 2. MIT press Cambridge,
MA, 2006.
A. Derivations online GPs
A.1. ELBO
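Following Bui et al. (2017), the ELBO for variational inference is defined as:
$$\begin{aligned}
\mathcal{L} ={}& -\mathrm{KL}\left(q_t(u_t) \,\|\, p(u_t \mid \theta_t)\right) + \mathbb{E}_{q_t(u_t, f_t)}\left[\log p(y_t \mid f_t)\right] \\
& - \mathrm{KL}\left(q_t(u_{t-1}) \,\|\, q_{t-1}(u_{t-1})\right) + \mathrm{KL}\left(q_t(u_{t-1}) \,\|\, p(u_{t-1} \mid \theta_{t-1})\right)
\end{aligned}$$
The terms of the first line correspond to a classical SVGP problem, and the second line expresses the KL divergences involving the previous variational posterior. The distributions are defined as: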
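$$\begin{aligned}
q_t(u_t) &= \mathcal{N}(\mu_t, \Sigma_t) \\
p(u_t \mid \theta_t) &= \mathcal{N}(0, K_{Z_t}) \\
q_t(u_{t-1}) &= \int p(u_{t-1} \mid u_t)\, q_t(u_t)\, du_t = \mathcal{N}\left(\kappa_{Z_{t-1}Z_t}\mu_t,\ \tilde{K}_{Z_{t-1}}\right) \\
\tilde{K}_{Z_{t-1}} &= K_{Z_{t-1}} + \kappa_{Z_{t-1}Z_t}\Sigma_t\kappa_{Z_{t-1}Z_t}^\top - K_{Z_{t-1}Z_t}K_{Z_t}^{-1}K_{Z_tZ_{t-1}} \\
q_{t-1}(u_{t-1}) &= \mathcal{N}(\mu_{t-1}, \Sigma_{t-1}) \\
p(u_{t-1} \mid \theta_{t-1}) &= \mathcal{N}\big(0, \underbrace{K'_{Z_{t-1}}}_{\text{given } \theta_{t-1}}\big)
\end{aligned}$$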
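The first term is
$$\mathrm{KL}\left(q_t(u_t) \,\|\, p(u_t \mid \theta_t)\right) = \frac{1}{2}\left(\log|K_{Z_t}| - \log|\Sigma_t| - M_t + \mathrm{tr}(K_{Z_t}^{-1}\Sigma_t) + \mu_t^\top K_{Z_t}^{-1}\mu_t\right)$$
For $p(y_t \mid f_t) = \prod_{i=1}^{B} \mathcal{N}(y_i \mid f_i, \sigma^2)$, the expected log-likelihood is given by
$$\mathbb{E}_{q_t(u_t, f_t)}\left[\log p(y_t \mid f_t)\right] = -\frac{B}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{B}\left[(y_i - \kappa_{X_iZ_t}\mu_t)^2 + \tilde{K} + \kappa_{X_iZ_t}\Sigma_t\kappa_{X_iZ_t}^\top\right]$$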
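Writing out the two remaining KL terms in full, we get:
$$\begin{aligned}
\mathrm{KL}\left(q_t(u_{t-1}) \,\|\, p(u_{t-1} \mid \theta_{t-1})\right) ={}& \frac{1}{2}\Big(\log|K'_{Z_{t-1}}| - \log|\tilde{K}_{Z_{t-1}}| - M_{t-1} + \mathrm{tr}\big((K'_{Z_{t-1}})^{-1}\tilde{K}_{Z_{t-1}}\big) \\
& + (\kappa_{Z_{t-1}Z_t}\mu_t)^\top (K'_{Z_{t-1}})^{-1} \kappa_{Z_{t-1}Z_t}\mu_t\Big) \\
\mathrm{KL}\left(q_t(u_{t-1}) \,\|\, q_{t-1}(u_{t-1})\right) ={}& \frac{1}{2}\Big(\log|\Sigma_{t-1}| - \log|\tilde{K}_{Z_{t-1}}| - M_{t-1} + \mathrm{tr}\big(\Sigma_{t-1}^{-1}\tilde{K}_{Z_{t-1}}\big) \\
& + (\mu_{t-1} - \kappa_{Z_{t-1}Z_t}\mu_t)^\top \Sigma_{t-1}^{-1} (\mu_{t-1} - \kappa_{Z_{t-1}Z_t}\mu_t)\Big)
\end{aligned}$$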
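Subtracting the second term from the first, we get:
$$\begin{aligned}
\mathrm{KL}_{t:t-1} ={}& \mathrm{KL}\left(q_t(u_{t-1}) \,\|\, p(u_{t-1} \mid \theta_{t-1})\right) - \mathrm{KL}\left(q_t(u_{t-1}) \,\|\, q_{t-1}(u_{t-1})\right) \\
={}& \frac{1}{2}\Big(\log|K'_{Z_{t-1}}| - \log|\Sigma_{t-1}| - \mathrm{tr}\big((\Sigma_{t-1}^{-1} - (K'_{Z_{t-1}})^{-1})\tilde{K}_{Z_{t-1}}\big) \\
& - \mu_{t-1}^\top\Sigma_{t-1}^{-1}\mu_{t-1} + 2\mu_{t-1}^\top\Sigma_{t-1}^{-1}\kappa_{Z_{t-1}Z_t}\mu_t \\
& - (\kappa_{Z_{t-1}Z_t}\mu_t)^\top(\Sigma_{t-1}^{-1} - (K'_{Z_{t-1}})^{-1})(\kappa_{Z_{t-1}Z_t}\mu_t)\Big) \\
={}& \frac{1}{2}\Big(\log|K'_{Z_{t-1}}| - \log|\Sigma_{t-1}| - \mathrm{tr}\big(D_{t-1}^{-1}\tilde{K}_{Z_{t-1}}\big) \\
& - \mu_{t-1}^\top\Sigma_{t-1}^{-1}\mu_{t-1} + 2\mu_{t-1}^\top\Sigma_{t-1}^{-1}\kappa_{Z_{t-1}Z_t}\mu_t \\
& - (\kappa_{Z_{t-1}Z_t}\mu_t)^\top D_{t-1}^{-1}(\kappa_{Z_{t-1}Z_t}\mu_t)\Big)
\end{aligned}$$
where $D_t = \left(\Sigma_t^{-1} - K_{Z_t}^{-1}\right)^{-1}$, with $K_{Z_t}$ evaluated under the hyper-parameters $\theta_t$ (so that $D_{t-1}^{-1} = \Sigma_{t-1}^{-1} - (K'_{Z_{t-1}})^{-1}$).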
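Taking the derivative of $\mathcal{L}$ with respect to $\mu_t$ and $\Sigma_t$ directly gives the optimal solution for Gaussian regression:
$$\begin{aligned}
\Sigma_t^* &= \left(\sigma^{-2}\kappa_{X_tZ_t}^\top\kappa_{X_tZ_t} + \kappa_{Z_{t-1}Z_t}^\top D_{t-1}^{-1}\kappa_{Z_{t-1}Z_t} + K_{Z_t}^{-1}\right)^{-1} \\
\mu_t^* &= \Sigma_t^*\left(\kappa_{X_tZ_t}^\top\sigma^{-2}y_t + \kappa_{Z_{t-1}Z_t}^\top\Sigma_{t-1}^{-1}\mu_{t-1}\right)
\end{aligned}$$
Rewritten in terms of the natural parameters $\eta_1 = \Sigma^{-1}\mu$ and $\eta_2 = -\frac{1}{2}\Sigma^{-1}$:
$$\begin{aligned}
\eta_1^t &= \kappa_{X_tZ_t}^\top\sigma^{-2}y_t + \kappa_{Z_{t-1}Z_t}^\top\eta_1^{t-1} \\
\eta_2^t &= -\frac{1}{2}\left(\kappa_{X_tZ_t}^\top\sigma^{-2}I\,\kappa_{X_tZ_t} + \kappa_{Z_{t-1}Z_t}^\top\left(-2\eta_2^{t-1} - K_{Z_{t-1}}^{-1}\right)\kappa_{Z_{t-1}Z_t} + K_{Z_t}^{-1}\right)
\end{aligned}$$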
A.2. Hyper-parameter derivatives
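Given $\theta$ a kernel hyperparameter and $J = \frac{dK}{d\theta}$, the derivatives are given by:
$$\begin{aligned}
\frac{d\,\mathrm{KL}_{t:t-1}}{d\theta_t} ={}& -\frac{1}{2}\mathrm{tr}\left(D_{t-1}^{-1}\frac{d\tilde{K}_{Z_{t-1}}}{d\theta_t}\right) + \mu_{t-1}^\top\Sigma_{t-1}^{-1}\frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t}\mu_t \\
& - (\kappa_{Z_{t-1}Z_t}\mu_t)^\top D_{t-1}^{-1}\left(\frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t}\mu_t\right) \\
\frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t} ={}& \frac{dK_{Z_{t-1}Z_t}}{d\theta_t}K_{Z_t}^{-1} + K_{Z_{t-1}Z_t}\frac{dK_{Z_t}^{-1}}{d\theta_t} \\
={}& \left(J_{Z_{t-1}Z_t} - \kappa_{Z_{t-1}Z_t}J_{Z_t}\right)K_{Z_t}^{-1} = \iota_{Z_{t-1}Z_t} \\
\frac{d\tilde{K}_{Z_{t-1}}}{d\theta_t} ={}& \frac{dK_{Z_{t-1}}}{d\theta_t} + 2\frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t}\Sigma_t\kappa_{Z_{t-1}Z_t}^\top - \frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t}K_{Z_tZ_{t-1}} - \kappa_{Z_{t-1}Z_t}\frac{dK_{Z_tZ_{t-1}}}{d\theta_t} \\
={}& J_{Z_{t-1}} + 2\iota_{Z_{t-1}Z_t}\Sigma_t\kappa_{Z_{t-1}Z_t}^\top - \iota_{Z_{t-1}Z_t}K_{Z_tZ_{t-1}} - \kappa_{Z_{t-1}Z_t}J_{Z_tZ_{t-1}}
\end{aligned}$$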
dKL(qt(ut)||p(ut|θt)
dθt
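Special case of the derivative with respect to the kernel variance $v$:
$$\frac{d\,\mathrm{KL}_a}{dv} = -\frac{1}{2}\mathrm{tr}\left(D_a^{-1}\frac{1}{v}\left(K_{aa} - K_{ab}K_{bb}^{-1}K_{ba}\right)\right)$$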
A.3. Comparison with SVI
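If we take the special case where inducing points do not change between iterations, then $\kappa_{Z_{t-1}Z_t} = I$ and $K_{Z_{t-1}} = K_{Z_t}$. The updates become
$$\begin{aligned}
\eta_1^t &= \kappa_{X_tZ_t}^\top\sigma^{-2}y_t + \eta_1^{t-1} \\
\eta_2^t &= -\frac{1}{2}\left(\kappa_{X_tZ_t}^\top\sigma^{-2}\kappa_{X_tZ_t} + \left(-2\eta_2^{t-1} - K_{Z_t}^{-1}\right) + K_{Z_t}^{-1}\right) \\
&= -\frac{1}{2}\kappa_{X_tZ_t}^\top\sigma^{-2}\kappa_{X_tZ_t} + \eta_2^{t-1}
\end{aligned}$$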
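Compared to the SVI updates (where $\rho$ denotes the SVI step size, not the acceptance threshold of Algorithm 1):
$$\begin{aligned}
\eta_1^t &= \eta_1^{t-1} + \rho\left(\frac{N}{|B|}\kappa_{X_tZ_t}^\top\sigma^{-2}y_t - \eta_1^{t-1}\right) \\
\eta_2^t &= \eta_2^{t-1} + \rho\left(-\frac{1}{2}\left(\frac{N}{|B|}\kappa_{X_tZ_t}^\top\sigma^{-2}\kappa_{X_tZ_t} + K_{Z_t}^{-1}\right) - \eta_2^{t-1}\right)
\end{aligned}$$
If we ignore $\rho$ by setting it to 1:
$$\begin{aligned}
\eta_1^t &= \frac{N}{|B|}\kappa_{X_tZ_t}^\top\sigma^{-2}y_t \\
\eta_2^t &= -\frac{1}{2}\left(\frac{N}{|B|}\kappa_{X_tZ_t}^\top\sigma^{-2}\kappa_{X_tZ_t} + K_{Z_t}^{-1}\right)
\end{aligned}$$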
Figure 3: Histogram of $p(Z \mid k = M)$ for the OIPS algorithm and k-DPP sampling.

In this case the previous $\eta_1$ is completely forgotten.
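To make it directly comparable to streaming:
SVI
$$\begin{aligned}
\eta_1^{t+1} &= (1 - \rho)\eta_1^t + \rho\frac{N}{|B|}\kappa_f^\top\sigma^{-2}y \\
\eta_2^{t+1} &= (1 - \rho)\eta_2^t - \frac{1}{2}\rho\left(\frac{N}{|B|}\kappa_f^\top\sigma^{-2}\kappa_f + K_{bb}^{-1}\right) \\
\eta_1^t &= (1 - \rho)^t\eta_0 + \sum_{i=1}^{t}(1 - \rho)^{i-1}\rho\frac{N}{|B|}\kappa_f^\top\sigma^{-2}y_i
\end{aligned}$$
Streaming
$$\begin{aligned}
\eta_1^{t+1} &= \eta_1^t + \kappa_f^\top\sigma^{-2}y \\
\eta_2^{t+1} &= \eta_2^t - \frac{1}{2}\kappa_f^\top\sigma^{-2}\kappa_f
\end{aligned}$$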
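The difference between the two update rules can be made concrete with a small numerical sketch. It assumes fixed inducing points, a Gaussian likelihood, and a variational distribution initialized at the prior ($\eta_1 = 0$, $\eta_2 = -\frac{1}{2}K_Z^{-1}$); the matrices `K_Z` and `kappa` are random stand-ins rather than actual kernel quantities, and `svi_step` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
M, sigma2 = 4, 0.1

# Stand-in for the prior covariance K_Z: any symmetric positive definite matrix
A = rng.normal(size=(M, M))
K_Z_inv = np.linalg.inv(A @ A.T + M * np.eye(M))

# Three batches of (kappa, y), with kappa playing the role of kappa_{X_t Z}
batches = [(rng.normal(size=(5, M)), rng.normal(size=5)) for _ in range(3)]

# Streaming updates, starting from the prior
eta1 = np.zeros(M)
eta2 = -0.5 * K_Z_inv
for kappa, y in batches:
    eta1 = eta1 + kappa.T @ y / sigma2
    eta2 = eta2 - 0.5 * kappa.T @ kappa / sigma2

# Natural parameters obtained by treating all batches as a single dataset
kappa_all = np.vstack([k for k, _ in batches])
y_all = np.concatenate([y for _, y in batches])
eta1_batch = kappa_all.T @ y_all / sigma2
eta2_batch = -0.5 * (kappa_all.T @ kappa_all / sigma2 + K_Z_inv)

def svi_step(eta1, eta2, kappa, y, n_total, rho):
    """One SVI step with step size rho on a minibatch of size len(y)."""
    scale = n_total / len(y)
    eta1_hat = scale * kappa.T @ y / sigma2
    eta2_hat = -0.5 * (scale * kappa.T @ kappa / sigma2 + K_Z_inv)
    return (1 - rho) * eta1 + rho * eta1_hat, (1 - rho) * eta2 + rho * eta2_hat

# With rho = 1 the SVI step ignores whatever state it starts from
kappa, y = batches[-1]
svi_a = svi_step(eta1, eta2, kappa, y, n_total=len(y_all), rho=1.0)
svi_b = svi_step(np.zeros(M), -0.5 * K_Z_inv, kappa, y, n_total=len(y_all), rho=1.0)

print(np.allclose(eta1, eta1_batch), np.allclose(eta2, eta2_batch))
print(np.allclose(svi_a[0], svi_b[0]), np.allclose(svi_a[1], svi_b[1]))
```

Under these assumptions the streaming recursion accumulates the contribution of every batch, while an SVI step with $\rho = 1$ depends only on the last minibatch, rescaled by $N/|B|$.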
B. Deterministic algorithm as a DPP half-greedy sampling
We run a simple experiment in which, given a dataset, Abalone ($N = 4177$, $D = 7$), we repeatedly shuffle the data. We apply Algorithm 1, parsing all the data, to get the subset $Z_{OIPS}$. We use the resulting number of inducing points $k$ as a parameter to sample from a k-DPP and obtain $Z_{kDPP}$. We compute the log-probabilities $\log p(Z_{OIPS} \mid M = k)$ and $\log p(Z_{kDPP} \mid M = k)$ and report their histograms in figure 3. One can observe that the probability given by the OIPS algorithm is consistently higher, as well as more narrowly distributed, than that of the k-DPP samples. This can be explained by the fact that we deterministically constrain all the points to have a certain distance from each other, and therefore put a deterministic limit on the determinant of $K_Z$.
C. Proof of Theorem 1: Bound on the number of points
Algorithm 1 can be interpreted as filling a domain with closed balls, where intersections between balls are allowed but no center can lie inside another ball. For a SE kernel we can compute the radius $r$ (in Euclidean space) of these balls:
$$\begin{aligned}
k(x, x') &= \rho \\
\exp\left(-\frac{\|x - x'\|^2}{h^2}\right) &= \rho \\
\|x - x'\|^2 &= -h^2\log\rho \\
r &= h\sqrt{-\log\rho}
\end{aligned}$$
We can bound the volume of the union of the balls by the union of inscribed hypercubes. The side length of a hypercube inscribed in a hypersphere of radius $r$ is $l = r\sqrt{D}/2$. Since the volume of the hypercube is by definition smaller, this gives us an upper bound on the expected number of inducing points. Defining $K_n$ as the number of inducing points at time $n$, the probability of having a point outside of the union of all $k$ hypercubes is
$$\begin{aligned}
p(K_{n+1} = k + 1 \mid K_n = k) &= \max\left(a^D - \sum_{i=1}^{k} l^D,\ 0\right) = \max\left(a^D - k l^D,\ 0\right) \\
p_k^+ &= \max\left(a^D - k\alpha,\ 0\right)
\end{aligned}$$
where $\alpha = \left(\frac{r\sqrt{D}}{2}\right)^D$ is the volume of one hypercube and therefore the probability that a new sample falls in it.
The probability of keeping the same number of points is
$$\begin{aligned}
p(K_{n+1} = k \mid K_n = k) &= \min\left(\sum_{i=1}^{k} l^D,\ 1\right) \\
p_k^= &= \min(k\alpha,\ 1)
\end{aligned}$$
We now consider the problem as a Markov chain where the state $p$ is represented by a vector $\{p_i\}_{i=1}^{N}$, where $p_i = 1$ if there are $i$ inducing points. The transition matrix $P$ is given by:
$$P = \begin{pmatrix}
p_1^= & 0 & 0 & 0 \\
p_1^+ & p_2^= & 0 & 0 \\
0 & p_2^+ & \ddots & 0 \\
0 & 0 & p_{N-1}^+ & p_N^=
\end{pmatrix}$$
If we start with one inducing point, the initial state is $p_1 = (1, 0, \ldots, 0)^\top$, the probability of having $k$ balls after $n$ steps is $p(K_n = k \mid p_1) = (P^n p_1)_k$, and the expected number of points is given by $\sum_k k \cdot p(K_n = k \mid p_1)$.
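This sequence can be complex to compute. Instead, we can approximate the final expectation by recursively computing the update given the expectation at the previous step:
$$\begin{aligned}
\mathbb{E}_{p(K_{n+1} \mid K_n = \mathbb{E}[K_n])}[K_{n+1}] &= \mathbb{E}[K_n]\,\mathbb{E}[K_n]\alpha + (\mathbb{E}[K_n] + 1)(a^D - \mathbb{E}[K_n]\alpha) \\
&= a^D\mathbb{E}[K_n] + a^D - \mathbb{E}[K_n]\alpha = a^D + \mathbb{E}[K_n](a^D - \alpha)
\end{aligned}$$
This is an arithmetico-geometric sequence and, given the initial condition $\mathbb{E}[K_0] = 1$ and since $\alpha < a^D$, we can get a closed-form solution for $\mathbb{E}[K_n]$:
$$\begin{aligned}
\mathbb{E}[K_n] &= (a^D - \alpha)^n\left(1 - \frac{a^D}{\alpha}\right) + \frac{a^D}{\alpha} \\
&= \frac{a^D - (a^D - \alpha)^{n+1}}{\alpha}
\end{aligned}$$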
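The relation between the Markov chain and the closed form can be checked numerically. The sketch below assumes a unit hypercube ($a^D = 1$, so that volumes are probabilities) and propagates the full state distribution of the chain; the function names are hypothetical. In the regime where $k\alpha \le 1$ for every reachable $k$, the transition probabilities are linear in $k$ and the recursion on the expectation coincides with the exact expectation.

```python
def exact_expected_count(n_steps, alpha):
    """E[K_n] for the chain of this appendix, started at K_0 = 1, with a^D = 1."""
    n_states = n_steps + 1              # K_n can never exceed n_steps + 1
    probs = [0.0] * (n_states + 2)      # probs[k] = p(K_n = k); index 0 is unused
    probs[1] = 1.0
    for _ in range(n_steps):
        new = [0.0] * (n_states + 2)
        for k in range(1, n_states + 1):
            p_add = max(1.0 - k * alpha, 0.0)   # p_k^+
            p_stay = min(k * alpha, 1.0)        # p_k^=
            new[k] += probs[k] * p_stay
            new[k + 1] += probs[k] * p_add
        probs = new
    return sum(k * p for k, p in enumerate(probs))

def closed_form_count(n_steps, alpha):
    """Closed-form solution of the recursion with a^D = 1."""
    return (1.0 - (1.0 - alpha) ** (n_steps + 1)) / alpha

alpha = 0.0037   # illustrative value, of the order obtained for D = 3, rho = 0.7, l = 0.3
for n in (1, 10, 50):
    print(n, exact_expected_count(n, alpha), closed_form_count(n, alpha))
```

Once $k\alpha$ can exceed 1, the clipping in $p_k^+$ and $p_k^=$ makes the chain non-linear and the two quantities are no longer guaranteed to agree.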
C.1. Empirical Comparison
We show the realization of this bound on uniform data with 3 dimensions, $\rho = 0.7$ and $l = 0.3$ in figure 4.
Figure 4: Bound on the number of inducing points accepted $M$ given the number of seen points $N$ vs the empirical estimation.
D. Proof theorem 2: Bounding the ELBO
We follow the approach of Burt et al. (2019) and Belabbas & Wolfe (2009). Burt et al. (2019) showed that the error between the ELBO and the log evidence is bounded by $\|K_X - K_{XZ} K_Z^{-1} K_{ZX}\|$, where $\|\cdot\|$ is the Frobenius norm. Using k-DPP sampling (Kulesza & Taskar, 2011), they were able to show a bound on the expectation of this norm. We follow similar calculations with our deterministic algorithm for fixed kernel parameters. Let $K_X$ be the kernel matrix of the full dataset and $K_Z$ the submatrix given
the set of points $\{Z_i\}_{i=1}^{M}$. The Schur complement of $K_{ZZ}$ in $K_{XX}$, $SC(K_{ZZ})$, is given by $K_X - K_{XZ} K_Z^{-1} K_{ZX}$. Following a similar approach to Belabbas & Wolfe (2009), we bound the norm by the trace:
$$\|SC(K_{ZZ})\| = \sqrt{\sum_{j=1}^{N-M}\lambda_j^2} \le \sum_{j=1}^{N-M}\lambda_j = \mathrm{tr}(SC(K_{ZZ}))$$
Using the definition of $SC(K_{ZZ})$ we get:
$$\mathrm{tr}(SC(K_{ZZ})) = \sum_{i=1}^{N-M}\left(K_{X_i} - K_{X_iZ}K_Z^{-1}K_{ZX_i}\right)$$
where every element of the sum is a scalar. Taking $W^\top\Lambda W$ the eigendecomposition of $K_Z^{-1}$, $w_i = W K_{ZX_i}$, and assuming a kernel variance $v$ of 1 (although generalizable to all variances) and a translation invariant kernel such that $k(x, x) = 1$, we get:
$$\begin{aligned}
K_{X_i} - K_{X_iZ}K_Z^{-1}K_{ZX_i} &= 1 - w_i^\top\Lambda w_i = 1 - \sum_{j=1}^{M}\lambda_j (w_i)_j^2 \\
&\le 1 - \lambda_{\min}\|w_i\|^2 = 1 - \lambda_{\min}\|K_{X_iZ}\|^2 \le 1 - \lambda_{\min}\rho^2
\end{aligned}$$
where we used the fact that every $X_i$ that was not selected is close enough to at least one $Z_j$, i.e. $k(X_i, Z_j) \ge \rho$. For clarity we replace $\lambda_{\min} = \lambda_{\max}^{-1}$, where $\lambda_{\max}$ is the largest eigenvalue of $K_Z$. Summing over the trace, we get the bound:
$$\|K_X - K_{XZ}K_Z^{-1}K_{ZX}\| \le (N - M)\left(1 - \frac{\rho^2}{\lambda_{\max}}\right)$$
Now, by construction, all off-diagonal terms of $K_Z$ are smaller than $\rho$. Using the inequality (Stewart & Sun, 1990)
$$|\lambda_i(A) - \lambda_i(B)| \le \|A - B\|, \quad \forall i = 1, \ldots, N$$
we get that
$$|\lambda_{\max}(K_Z) - 1| \le \|K_Z - I\|_2 = \sqrt{\sum_{i \ne j}(K_Z)_{ij}^2} \le M(M - 1)\rho$$
Assuming $\lambda_{\max}(K_Z) \ge 1$, we get
$$\lambda_{\max}(K_Z) \le 1 + M(M - 1)\rho$$
which gives the final bound:
$$\|K_X - Q_X\| \le (N - M)\left(1 - \frac{\rho^2}{1 + M(M - 1)\rho}\right)$$
Figure 5: Evaluation of $\|K_X - Q_X\|$ given the OIPS algorithm, and computation of the bound from Burt et al. (2019) given in equation 6 and of our bound given in equation 7.
D.1. Empirical Comparison
These bounds are difficult to compare due to the different parameters characterizing them. Nevertheless, we give an example by comparing the bounds and the empirical value on toy data drawn uniformly in 3 dimensions in figure 5. For each $N$ we ran our algorithm and used the resulting number of selected inducing points as the value of $M$ required by the bounds. We show in section 4 the empirical effect of the choice of $\rho$ on the accuracy and on the number of points.
E. Experiments parameters
For every problem we use an isotropic squared exponential kernel:
$$k(x, x') = v\exp\left(-\frac{\|x - x'\|^2}{h^2}\right)$$
where $h$ is initialized by taking the median of the lower triangular part of the pairwise distance matrix of the first subset of points, and is then kept fixed for the rest of the training. Future work will involve kernel parameter optimization as well. We fix the noise of the Gaussian likelihood to $\sigma^2 = 0.01$.
IPs were optimized via ADAM ($\alpha = 10^{-2}$).
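The lengthscale initialization described above can be sketched as follows. This is an illustrative reading of the text, assuming that the "pairwise distance matrix" holds Euclidean (not squared) distances and that the strict lower triangle is used; the helper names `median_lengthscale` and `se_kernel` are hypothetical.

```python
import math
from itertools import combinations
from statistics import median

def median_lengthscale(points):
    """Median of the pairwise Euclidean distances (strict lower triangle)."""
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return median(dists)

def se_kernel(x, y, h, v=1.0):
    """Isotropic SE kernel, k(x, x') = v * exp(-||x - x'||^2 / h^2)."""
    return v * math.exp(-math.dist(x, y) ** 2 / h ** 2)

# First batch of points: the four corners of the unit square
first_batch = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
h = median_lengthscale(first_batch)
print(h, se_kernel(first_batch[0], first_batch[1], h))
```

Tying $h$ to the typical distance between points of the first batch keeps the kernel values between neighbouring points away from both 0 and 1, which matters for a threshold rule such as $d < \rho$.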