Latent variable augmentation for approximate Bayesian inference

Latent Variable Augmentation for

Approximate Bayesian Inference

Applications for Gaussian Processes

submitted by
M. Sc.
Théo Galy-Fajou
ORCID: 0000-0002-3528-3536

to Faculty IV - Electrical Engineering and Computer Science
of the Technische Universität Berlin
for the award of the academic degree
Doctor of Natural Sciences
-Dr. rer. nat.-
approved dissertation

Doctoral committee:
Chair: Prof. Dr. Marc Toussaint
Reviewer: Prof. Dr. Manfred Opper
Reviewer: Dr. Mark van der Wilk
Reviewer: Dr. Arno Solin

Date of the scientific defense: 7 July 2022

Berlin 2023

Zusammenfassung (Summary)

Inference in probabilistic models can be challenging even for seemingly simple problems. When working with non-conjugate Bayesian models, we need approximate methods such as variational inference or sampling, each with its own pitfalls and limits. For example, heavy-tailed distributions pose a challenge for sampling methods, and strongly correlated variables quickly become a bottleneck for many inference algorithms. Instead of developing yet another state-of-the-art sampler or optimizer, we focus on reinterpreting models so that standard inference algorithms such as blocked Gibbs sampling, which are normally restricted to more trivial models, become the best choice. In the first part, we derive model augmentations for various Gaussian Process models such as classification and multi-class classification. We focus on the effects on inference and develop a generalization for a particular class of likelihoods. We show that the augmentations scale with the data and outperform all existing methods in terms of speed and stability. The second part focuses on approximations based on a Gaussian variational distribution. We show that by parametrizing the Gaussian distribution by a set of particles instead of its parameters, we can avoid expensive computations, increase the flexibility of the model, and prove theoretical convergence bounds. In addition to the published papers, we discuss the impact of these different augmentations, including their limitations. We also give an outlook on new research directions, including concrete advances. In particular, we show ways to compensate for the problems raised in the presented papers, and we introduce new augmentation models and new inference approaches that are compatible with augmented models.

Abstract

Performing inference on probabilistic models can represent a challenge even in seemingly

simple problems. When working with non-conjugate Bayesian models, we need approximate

methods such as variational inference or sampling, each with its pitfalls and limits. For instance,

heavy-tailed distributions represent a challenge for sampling methods, and strongly correlated

variables quickly become a bottleneck for many inference algorithms. Instead of developing yet

another new state-of-the-art sampler or optimizer, we focus on reinterpreting models such that

standard inference algorithms like blocked Gibbs sampling, usually restricted to more trivial

models, become the best choice. In the first part, we derive model augmentations for different

Gaussian Process models such as classification and multi-class classification. We focus on the

effects on inference and develop a generalization for a given class of likelihoods. We show that

augmentations are scalable with data and outperform all existing methods in terms of speed

and stability. The second part focuses on approximations based on a Gaussian variational

distribution. We show that by parametrizing the Gaussian distribution by a set of particles

instead of its parameters, we avoid expensive computations, increase the model flexibility, and

prove theoretical convergence bounds. In addition to the published papers, we discuss the impact of these different augmentations, including their limitations. We also give an outlook on new research directions, including concrete advances. In particular, we present ways to compensate for issues raised in the presented papers and introduce new augmentation models and new inference approaches compatible with augmented models.

Dedicated to Manou.

Acknowledgements

I would like to thank Ena for her unconditional love and support since the beginning, and especially for helping me not to lose myself in work.

Professor Opper for sharing his immense wisdom and knowledge, bearing with my stubbornness, and believing in me.

My parents for supporting me in everything I have ever started.

"Les filous", for keeping me entertained at all times.

My main co-author and tutor Florian, who taught me so much before and during my Ph.D.

The Julia community, from whom I learned so much, for their unfailing help during hard programming times.

And of course all the people I shared lunch and good times with at the university.

Table of Contents

Title Page
Zusammenfassung
Abstract
1 Introduction
1.1 Bayesian Machine Learning
1.2 The underestimated power of representations choices
1.3 Gaussian Processes
1.4 Open-source projects
1.5 Thesis Outline
2 Background
2.1 Probabilistic Bayesian Modeling
2.1.1 Posterior computations
2.2 Gaussian Processes
2.2.1 Gaussian Process Regression
2.2.2 Non-Conjugate Gaussian Processes
2.2.3 Sparse Gaussian Processes
2.3 Approximate Bayesian Inference
2.3.1 Sampling
2.3.2 Variational Inference
2.3.3 Scale mixtures and conditionally conjugate likelihoods
3 Efficient Gaussian Process Classification Using Pólya-Gamma Data Augmentation
4 Multi-Class Gaussian Process Classification Made Conjugate: Efficient Inference via Data Augmentation
5 Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models
6 Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation
7 Discussions and extensions
7.1 Further generalizations and understanding
7.2 Double bounds for intricate latent GPs
7.2.1 Heteroscedastic Gaussian Likelihood
7.2.2 Heteroscedastic Non-Gaussian Likelihood
7.3 Using Hamilton Monte Carlo on the augmented model
7.4 Improvements on the Multi-Class Classification
7.4.1 Marginalizing out variables
7.4.2 A new model for the multi-class classification
7.4.3 Scaling the logistic-softmax link
7.5 Sampling from a sparse augmented model
7.6 Limitations
8 Conclusion
References
Appendix A Additional work
A.1 Adaptive Inducing Points Selection for Gaussian Processes

Acronyms

GP Gaussian Process
GPs Gaussian Processes
MCMC Markov Chain Monte Carlo
VI Variational Inference
VFE Variational Free Energy
ELBO Evidence Lower BOund
KL Kullback-Leibler
MF Mean-Field
BMF Blocked Mean-Field
CAVI Coordinate Ascent Variational Inference
HMC Hamiltonian Monte Carlo
MH Metropolis-Hastings
ML Machine Learning
VGA Variational Gaussian Approximation
MGF Moment Generating Function
pdf probability distribution function
iid independent and identically distributed
NUTS No-U-turn sampling
ABI Approximate Bayesian Inference

1 Introduction

Machine Learning (ML) is a wide field of research with plenty of successful applications [ ]. Some problems have specific requirements; for example, computing the probability of a prediction is essential for decision-making algorithms. One of the best ways to incorporate uncertainty in ML models is through the lens of probability theory. Probabilistic ML defines quantities of interest as random variables and considers data-generative processes as stochastic. By accounting for the intrinsic measurement uncertainty and unknown random processes, we can produce models that are more robust and more faithful to reality. Additionally, stochastic models return probabilistic predictions, allowing answers like "I don’t know."

1.1 Bayesian Machine Learning

In the Bayesian paradigm, the parameters of ML models are random variables described by probability distributions instead of point estimates. Bayesian models allow modeling uncertainty in a principled way and prevent overfitting in the low-data regime. We set a prior distribution over the variables of interest representing our original belief. After observing data, we update our belief about our model parameters to the posterior distribution. A typical example is in medicine, where data is scarce, but the predictive outcome can have dramatic effects (diagnosis, prognosis). Providing uncertainties helps the practitioner make a better decision given the model predictions.

Generally, Bayesian models have a higher computational cost: a probability distribution contains more information than a point estimate and requires more parameters. Calculus with random variables is a difficult art, and analytical solutions exist almost exclusively for trivial models. Approximation methods allow working with more complex models at the cost of a potential bias or inaccuracies. Approximate Bayesian Inference (ABI) focuses on algorithms that find a solution close to the true posterior.

Research in ABI goes in many directions, but some of the main questions are: How to compute a highly accurate posterior approximation as efficiently as possible? How can it scale to large amounts of data and parameters? What are the guarantees of such algorithms? This thesis aims to partially answer these questions for some given setups, mainly through a focus on representations.

1.2 The underestimated power of representations choices

The common thread of this thesis is model representation, alternatively called model parameterization, and its use for solving problems more efficiently without compromising prediction quality.

When defining probabilistic models, one needs to define relations between variables (observed and latent) and choose appropriate distributions to represent them. Some modeling choices are conceptually equivalent but lead to drastic differences in inference. A neat example, presented in Gorinova et al. [18], is the so-called Neal’s funnel [ ]. There are two equivalent representations, called centered and non-centered, shown respectively in Figures 1.1 and 1.2, where one leads to an inference nightmare while the other is a nice and easy isotropic Gaussian distribution.

$$z \sim \mathcal{N}(0, 3), \qquad x \sim \mathcal{N}(0, \exp(z/2)) \tag{1.1}$$

Figure 1.1: Neal’s funnel - Centered representation

$$\tilde{z} \sim \mathcal{N}(0, 1), \quad z = 3\tilde{z}, \qquad \tilde{x} \sim \mathcal{N}(0, 1), \quad x = \exp(z/2)\,\tilde{x} \tag{1.2}$$

Figure 1.2: Neal’s funnel - Non-centered representation (axes: x, z)

While both parameterizations describe the same model, the geometry of the distribution of $(x, z)$ is less favorable to inference than that of $(\tilde{x}, \tilde{z})$. $x$ and $z$ are strongly correlated for small $z$, and the density function is highly non-smooth. These constraints matter when running a sampling chain or fitting a variational distribution.
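As a minimal sketch of Equations (1.1) and (1.2), the following Python snippet draws from the funnel in both representations. The function names and the use of the standard `random` module are illustrative assumptions and not part of the thesis code; as suggested by $z = 3\tilde{z}$, the second argument of $\mathcal{N}$ is read here as a standard deviation.

```python
import math
import random

def sample_centered(rng):
    """Draw (x, z) directly: z ~ N(0, 3), then x ~ N(0, exp(z/2))."""
    z = rng.gauss(0.0, 3.0)
    x = rng.gauss(0.0, math.exp(z / 2.0))
    return x, z

def to_centered(x_tilde, z_tilde):
    """Map the non-centered variables (both standard normal) to (x, z)."""
    z = 3.0 * z_tilde
    x = math.exp(z / 2.0) * x_tilde
    return x, z

def sample_noncentered(rng):
    """Draw isotropic standard normals, then transform deterministically."""
    x_tilde = rng.gauss(0.0, 1.0)
    z_tilde = rng.gauss(0.0, 1.0)
    return to_centered(x_tilde, z_tilde)

rng = random.Random(0)
draws = [sample_noncentered(rng) for _ in range(5)]
```

In the second representation, an inference algorithm only ever sees $(\tilde{x}, \tilde{z})$, whose joint density is an isotropic Gaussian; the funnel shape is recovered afterwards through `to_centered`.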

The use of different model representations has an often underestimated effect and is mainly considered a collection of "tricks." For example, when working with Gaussian Processes, it is generally preferable to use the so-called "whitened" representation, which corresponds to the non-centered representation of Neal’s funnel (Figure 1.2). The different parts of this thesis show that finding better representations can make inference easier, faster, and significantly more stable. The first part uses basic inference methods by representing likelihoods as (hierarchical) mixtures. Rewriting distributions as scale mixtures, defined later in Section 2.3.3, has many advantages and interesting properties. The scale mixture representation involves augmenting the model with new latent variables, making inference easier while keeping the original model recoverable. This augmentation procedure brings the perhaps counter-intuitive view that adding more variables simplifies the problem. The last work of the thesis focuses on the representation of the variational Gaussian approximation. We avoid computational bottlenecks and add flexibility by representing the distribution with particles instead of using the mean and the covariance.

1.3 Gaussian Processes

The techniques mentioned above apply to many probabilistic models; however, we focus on Gaussian-based models, and more particularly Gaussian Processes (GPs) [ ]. A GP is a strong non-parametric tool to approximate functions using probabilistic methods. GPs were initially applied to regression problems with Gaussian noise, like the original kriging problem [ ]. However, they are also used as priors over latent functions for more complex problems like classification, ordinal regression, and more. Compared to other general function approximators like neural networks, they have the advantage of providing uncertainty on the predictions they make. Most importantly, as their name suggests, they are based on Gaussian distributions, making them the best candidates for the presented work on augmentation. A full technical introduction to basic GPs and their extensions is given in Section 2.2.

1.4 Open-source projects

All the works presented in this thesis, as well as additional tools, are backed by user-friendly packages in Julia [ ]. Throughout my time as a Ph.D. student, I have developed numerous Julia packages and was involved in the JuliaGaussianProcesses organisation to develop a flexible, efficient and easy-to-use framework for working with GPs, from low-level to high-level interfaces, through a series of packages: KernelFunctions.jl [ ], AbstractGPs.jl [ ], ApproximateGPs.jl, InducingPoints.jl and GPLikelihoods.jl. The particular strength of our work is the one-to-one mapping between theory and code. For example, to define the posterior for some given data, the code looks like:

using AbstractGPs
f = GP(mean_prior, kernel)   # define an infinite-dimensional prior
fx = f(X, noise)             # finite-dimensional projection of the prior on the data X
fpost = posterior(fx, y)     # create the posterior given the observations y
rand(fpost(x_test))          # sample from the predictive posterior of some test data

Here, each computational object represents exactly its mathematical equivalent.

The work of this thesis is also represented by the package AugmentedGPLikelihoods.jl, which provides all the necessary tools to work with augmentations. One of Julia’s advantages is its strong interoperability. This makes it possible to use the augmentation work with more specialized implementations such as temporal GPs, with a concrete example given in TemporalGPs.jl (see examples/augmented_inference.jl).

Independently, I also developed AugmentedGaussianProcesses.jl [ ] as a stand-alone package providing the augmentation techniques presented in the thesis, additional likelihoods, and standard inference approaches.

1.5 Thesis Outline

This thesis is structured as follows:

• Chapter 2 introduces in detail all the common concepts of Bayesian inference and GPs. Each published article introduces these concepts, but this chapter dives deeper into the background theory. Bayesian inference, in particular, is properly introduced, focusing on variational inference and sampling.

• Chapter 3 contains the paper Efficient Gaussian Process Classification Using Pólya-Gamma Data Augmentation, which was the first variable augmentation we explored.

• Chapter 4 introduces the paper Multi-Class Gaussian Process Classification Made Conjugate: Efficient Inference via Data Augmentation. This paper brings new augmentation concepts to a more complex problem: multi-class classification.

• Chapter 5 presents the paper Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process Models. This work presents a generic way to identify augmentations in likelihoods and introduces a better understanding of the concepts behind it.

• Chapter 6 introduces the paper Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation, a completely different way of performing variational inference with a Gaussian distribution by using a continuous flow and particles.

• Chapter 7 discusses the different papers presented as well as some concrete outlooks on how to explore new models and new generalizations.

• Chapter 8 finishes this thesis with a general conclusion.

• Appendix A contains an additional workshop paper which does not fit the narrative of this thesis.

For all papers, a simplified view of the Contributor Roles Taxonomy (CRediT) details the contributions of each author.

2 Background

To make the presented papers fully comprehensible, we give a general overview of the required concepts. A short introduction to the basic theory of Gaussian Processes, as well as their extension to large datasets using inducing points [ ], is given in Chapters 3, 4 and 5. However, this chapter presents a more thorough description starting from the basics. Additionally, this chapter dives deeper into the basics of probabilistic Bayesian modeling, variational inference, and sampling methods.

2.1 Probabilistic Bayesian Modeling

Bayes’ theorem is one of the simplest theorems in probability theory, and its proof fits in one line, yet its implications are immeasurably¹ important.

Let us give a very general modeling setting that we will follow for the rest of this chapter. Given a set of observed variables $X$, a set of latent (unobserved) variables $\theta$ with a prior distribution $p(\theta)$, and a likelihood function $p(X \mid \theta)$, we obtain the posterior distribution $p(\theta \mid X)$ via Bayes’ theorem:

$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}. \tag{2.1}$$

$p(X)$ represents the so-called evidence and can be used to compare different models (the dependency on the used model is implicit here). The posterior allows us to obtain a distribution of the latent variables, with its uncertainty, given the prior $p(\theta)$ and the observed data $X$. The posterior is used for computing all kinds of expectations of the form $\mathbb{E}_{p(\theta \mid X)}[f(\theta)] = \int f(\theta)\, p(\theta \mid X)\, d\theta$. Expected values of interest can be statistics of the posterior like the mean ($\mathbb{E}_{p(\theta \mid X)}[\theta]$) or the predictive distribution of new data points $p(x' \mid X) = \mathbb{E}_{p(\theta \mid X)}[p(x' \mid \theta)]$.

¹ Pun intended.

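Let’s take the simple example of linear logistic regression, a discriminative model. Given an input $\mathbf{x} \in \mathbb{R}^D$ and a binary label $y \in \{0, 1\}$, we model the process as:

$$y \sim \mathrm{Bernoulli}\left(\sigma(\theta^\top \mathbf{x})\right),$$

where $\mathrm{Bernoulli}$ is the Bernoulli distribution, $\theta \in \mathbb{R}^D$ is a vector of weights (our latent variable), and $\sigma: \mathbb{R} \to [0, 1]$ is the logistic function $\sigma(x) = \frac{1}{1 + \exp(-x)}$. The likelihood function is given by:

$$p(y_i \mid \theta, \mathbf{x}_i) = \sigma\left(\theta^\top \mathbf{x}_i\right)^{y_i} \sigma\left(-\theta^\top \mathbf{x}_i\right)^{1 - y_i}.$$

Now let’s suppose that we have $N$ pairs of input $\mathbf{x}_i$ and label $y_i$, which we assume to be independent and identically distributed (iid). We get a training set $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, $\mathbf{y} = \{y_1, \ldots, y_N\}$. With a prior distribution $p(\theta)$ on $\theta$, we build the posterior as $p(\theta \mid \mathbf{y}, X) \propto p(\theta) \prod_{i=1}^N p(y_i \mid \theta, \mathbf{x}_i)$. We can then compute the predictive distribution for a new data input $\mathbf{x}^*$:

$$p(y^* \mid \mathbf{x}^*, \mathbf{y}, X) = \int p(y^*, \theta \mid \mathbf{x}^*, \mathbf{y}, X)\, d\theta = \int p(y^* \mid \theta, \mathbf{x}^*)\, p(\theta \mid \mathbf{y}, X)\, d\theta. \tag{2.2}$$

Note that the last term of Equation (2.2) directly involves the posterior distribution $p(\theta \mid \mathbf{y}, X)$. To solve this integral, we must either know the posterior distribution and compute the integral numerically (or analytically), or sample from the posterior and estimate the integral using Monte Carlo integration.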
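The unnormalized posterior of the logistic regression example is simple to evaluate even though its normalizing constant is not. The sketch below computes $\log p(\theta) + \sum_i \log p(y_i \mid \theta, \mathbf{x}_i)$ in plain Python. The isotropic Gaussian prior and all function names are assumptions made for illustration, since the text leaves the prior $p(\theta)$ unspecified.

```python
import math

def log_sigmoid(a):
    """Numerically stable log(sigma(a)) for the logistic function sigma."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))

def log_likelihood(theta, X, y):
    """Sum over i of y_i log sigma(theta^T x_i) + (1 - y_i) log sigma(-theta^T x_i)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        a = sum(t * v for t, v in zip(theta, x_i))
        total += y_i * log_sigmoid(a) + (1 - y_i) * log_sigmoid(-a)
    return total

def log_prior(theta, prior_std=1.0):
    """Assumed prior: independent N(0, prior_std^2) on each weight."""
    return sum(
        -0.5 * (t / prior_std) ** 2 - math.log(prior_std * math.sqrt(2 * math.pi))
        for t in theta
    )

def log_unnormalized_posterior(theta, X, y):
    """log p(theta) + log p(y | theta, X); the evidence term is omitted."""
    return log_prior(theta) + log_likelihood(theta, X, y)

# Tiny made-up dataset, for illustration only.
X = [[1.0, 0.5], [-1.0, 2.0], [0.3, -0.7]]
y = [1, 0, 1]
value_at_zero = log_unnormalized_posterior([0.0, 0.0], X, y)
```

Both sampling and variational methods introduced later only need this unnormalized quantity, which is why the intractable evidence is not an obstacle for them.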

2.1.1 Posterior computations

Given a prior $p(\theta)$ and a likelihood $p(X \mid \theta)$, computing the posterior distribution function (2.1) in closed-form requires the integral $p(X) = \int p(X \mid \theta)\, p(\theta)\, d\theta$.² For most non-trivial models, this integral is intractable, and approximations to the posterior are needed. Such methods are introduced in Section 2.3.

² Even if the integral is known, it might not be enough to compute some expectations or statistics.

However, in specific settings, computing the posterior in closed-form is possible. When the prior is said to be conjugate to the likelihood, the posterior is of the same probability distribution family as the prior and is analytically tractable [ ]. It is worth emphasizing this seemingly trivial case since it will be exploited in Section 2.3.3. For a general example, we consider a likelihood that is part of the exponential family:

$$p(x \mid \theta) = h(x) \exp\left(\eta(\theta)^\top T(x) - A(\theta)\right), \tag{2.3}$$

where $\theta$ are the distribution parameters, $h(x)$ is the base measure, $\eta(\theta)$ corresponds to the natural parameters, $T(x)$ are the sufficient statistics and $A(\theta)$ is the log-partition. Formally, a conjugate prior to the likelihood (2.3) is defined as:

$$p(\theta \mid \alpha) = h'(\theta) \exp\left(\eta'(\alpha)^\top T'(\theta) - A'(\alpha)\right), \tag{2.4}$$

where $T'(\theta) = \{\eta(\theta), -A(\theta)\}$ and where $\alpha$ represents the prior distribution parameters. Given a factorizable likelihood $p(X \mid \theta) = \prod_{i=1}^N p(x_i \mid \theta)$, the posterior will be proportional to

$$p(\theta \mid X) \propto h'(\theta) \exp\left( \left( \left\{ \sum_{i=1}^N T(x_i),\, N \right\} + \eta'(\alpha) \right)^\top T'(\theta) \right). \tag{2.5}$$

Note that the only dependence on $X$ is via the sufficient statistics $T(x)$.
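As a concrete instance of such a conjugate update (an illustrative choice, not an example taken from the thesis), consider a Bernoulli likelihood with a Beta prior. The sufficient statistic is $T(x) = x$, so the posterior depends on the data only through $\sum_i x_i$ and $N$. The function names below are hypothetical.

```python
def beta_bernoulli_posterior(alpha, beta, data):
    """Conjugate update for x_i ~ Bernoulli(theta), theta ~ Beta(alpha, beta).

    Returns the two parameters of the Beta posterior.
    """
    n = len(data)
    successes = sum(data)  # sum of the sufficient statistics T(x_i) = x_i
    return alpha + successes, beta + n - successes

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

data = [1, 0, 1, 1, 0, 1]
post_alpha, post_beta = beta_bernoulli_posterior(1.0, 1.0, data)
posterior_mean = beta_mean(post_alpha, post_beta)
```

The posterior is obtained in one step, by adding counts to the prior parameters, with no integral to evaluate.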

Conjugate models are very practical as the posterior can be found in one step, but they are very constraining in the choice of the prior. They tend to be considered too simple for many applications.

If the prior is not conjugate to the likelihood, an alternative is to look for conditional conjugacy. A parameter $\theta_i$ with a conditionally conjugate prior will have a full conditional distribution of the same family. The full conditional distribution is defined as $p(\theta_i \mid X, \theta_{/i})$ where $\theta_{/i} = \{\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_D\}$. This notion of full conditional also extends to blocks of variables.

2.2 Gaussian Processes

Gaussian Processes (GPs) are a class of stochastic processes used as non-parametric probabilistic representations of functions. A GP is a stochastic process $\{f_t\}$, where the joint distribution on any finite collection of random variables $\{f_t\}$ follows a (multivariate) Gaussian distribution [ ]. Since all the variables are Gaussian, we can perform all linear operations analytically, making them computationally attractive. We can also compute marginals exactly, and a product of Gaussian densities of the same variable is still proportional to another Gaussian density.

A GP is uniquely specified by its mean function $\mu_0(\mathbf{x})$ and kernel function (also called covariance function) $k(\mathbf{x}, \mathbf{x}')$. $\mu_0(\mathbf{x})$ can be any real-valued function while $k(\mathbf{x}, \mathbf{x}')$ needs to be a positive-definite function (also called a Mercer kernel). A symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive-definite on $\mathcal{X}$ if $\mathbf{w}^\top K \mathbf{w} \geq 0$, where $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, for any $\mathbf{w} \in \mathbb{R}^N$ and any $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\} \in \mathcal{X}$.

One of the interpretations of a $\mathcal{GP}(\mu_0, k)$ is as a prior on the function space. Given a random function $f$ with a GP prior, we can project $f$ into a finite space by evaluating it on a set of data inputs $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ such that we obtain the finite-dimensional vector $\mathbf{f}$ where $f_i = f(\mathbf{x}_i)$. The prior on the projected $\mathbf{f}$ is given by $\mathcal{N}(\mu_0(X), K_X)$, where $\mu_0(X) = \{\mu_0(\mathbf{x}_i)\}_{i=1}^N$ and $K_X \in \mathbb{R}^{N \times N}$ is the kernel matrix, defined by $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$.
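To make the finite-dimensional projection concrete, the sketch below builds the kernel matrix for one-dimensional inputs and evaluates the quadratic form $\mathbf{w}^\top K \mathbf{w}$ from the definition of positive-definiteness. The squared-exponential kernel and the helper names are assumptions chosen for illustration; the text does not fix a particular kernel.

```python
import math

def sq_exp_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    return math.exp(-0.5 * ((x1 - x2) / lengthscale) ** 2)

def kernel_matrix(X, kernel=sq_exp_kernel):
    """Kernel matrix with entries K_ij = k(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in X] for xi in X]

def quadratic_form(w, K):
    """Compute w^T K w for a vector w and a square matrix K."""
    n = len(w)
    return sum(w[i] * K[i][j] * w[j] for i in range(n) for j in range(n))

X = [-1.0, 0.0, 0.5, 2.0]
K = kernel_matrix(X)
q = quadratic_form([1.0, -2.0, 0.5, 3.0], K)
```

Checking one vector `w` does not prove positive-definiteness, which must hold for every `w` and every set of inputs, but it shows what the condition asks of the matrix `K`.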

2.2.1 Gaussian Process Regression

Given our prior $p(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid \mu_0, K)$, we can add noisy observations $\{y_i\}_{i=1}^N$ for each respective $\mathbf{x}_i$ and model the process as:

$$y_i = f(\mathbf{x}_i) + \epsilon_i, \tag{2.6}$$

where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. This leads to the likelihood $p(y_i \mid f_i) = \mathcal{N}(y_i \mid f_i, \sigma^2)$. Fortunately, adding a zero-mean Gaussian variable to another gives another Gaussian variable with increased variance: the marginal distribution of the observations is $p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mu_0(X), K_X + \sigma^2 I)$, and the posterior $p(\mathbf{f} \mid \mathbf{y})$ is Gaussian as well.

Figure 2.1: Illustration of the realization of a Gaussian Process (left panel: GP Prior; right panel: GP Posterior). The black line is the true function $f$; the blue line is the mean of the prediction; the blue area represents the confidence interval of 2 standard deviations; the orange points represent observed data. Left: prediction on a grid given no observations. Right: prediction on a grid given a set of observations.

The predictive distribution of $f^* = f(\mathbf{x}^*)$ on a new input $\mathbf{x}^*$ can be evaluated by computing:

$$p(f^* \mid \mathbf{x}^*, X, \mathbf{y}) = \int p(f^* \mid \mathbf{f}, \mathbf{x}^*)\, p(\mathbf{f} \mid X, \mathbf{y})\, d\mathbf{f}. \tag{2.7}$$

This integral is analytically tractable and results in

$$p(f^* \mid \mathbf{x}^*, X, \mathbf{y}) = \mathcal{N}(f^* \mid m^*, s^*), \tag{2.8}$$

where $m^* = K_{\mathbf{x}^*, X} \left(K_X + \sigma^2 I\right)^{-1} \mathbf{y}$ and $s^* = K_{\mathbf{x}^*} - K_{\mathbf{x}^*, X} \left(K_X + \sigma^2 I\right)^{-1} K_{X, \mathbf{x}^*}$, with $(K_{\mathbf{x}^*, X})_i = k(\mathbf{x}^*, \mathbf{x}_i)$. The predictive distribution for $f^*$ is Gaussian, with a known mean $m^*$ and a measure of uncertainty given by the variance $s^*$. Note that $s^*$ depends directly on $K_{\mathbf{x}^*, X}$: if $\mathbf{x}^*$ is far from all points in $X$ (in the sense of the distance used in the kernel $k$), then $K_{\mathbf{x}^*, X}$ will be very small and the variance $s^*$ maximized. The predictive uncertainty will therefore be high when new inputs $\mathbf{x}^*$ are distant from the training data $X$. A concrete example is shown in Figure 2.1.
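The predictive mean and variance of Equation (2.8) can be sketched in a few lines of plain Python for one-dimensional inputs. This is only an illustration under stated assumptions: a zero prior mean (as implied by the form of $m^*$), a squared-exponential kernel, and a small hand-written linear solver; none of the names below come from the thesis packages.

```python
import math

def rbf(a, b, lengthscale=1.0):
    """Assumed squared-exponential kernel on scalars."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_predict(X, y, x_star, noise_var=0.1):
    """Return (m*, s*) of Equation (2.8) for a zero-mean GP prior."""
    n = len(X)
    A = [[rbf(X[i], X[j]) + (noise_var if i == j else 0.0) for j in range(n)]
         for i in range(n)]                      # K_X + sigma^2 I
    k_star = [rbf(x_star, xi) for xi in X]       # K_{x*, X}
    alpha = solve(A, y)                          # (K_X + sigma^2 I)^{-1} y
    v = solve(A, k_star)                         # (K_X + sigma^2 I)^{-1} K_{X, x*}
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, var

near_mean, near_var = gp_predict([-1.0, 0.0, 1.0], [0.5, 1.0, 0.5], 0.0)
far_mean, far_var = gp_predict([-1.0, 0.0, 1.0], [0.5, 1.0, 0.5], 50.0)
```

Far from the data, `k_star` vanishes, so the prediction falls back to the prior mean and prior variance, which is the behaviour described above.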

2.2.2 Non-Conjugate Gaussian Processes

A Gaussian prior is only conjugate to the mean parameter of a Gaussian likelihood. Therefore, the GP posterior obtained in the previous section is only tractable for homoscedastic³ Gaussian likelihoods. For all other cases, we talk about non-conjugate GPs. Examples of non-conjugate problems are binary classification, regression using non-Gaussian noise such as Student-t or Laplace noise, or Poisson regression. Other examples, such as multi-class classification or heteroscedastic regression, can require multiple latent GPs. Figure 2.2 shows an example of 1-dimensional binary classification with a GP where the posterior was approximated using variational inference (see Section 2.3.2). Although the GP does not recover exactly the true process, most of it lies in the GP’s 95% confidence interval (blue band).

³ The noise variance is independent of the input.

Figure 2.2: Illustration of a latent Gaussian process used for a binary classification problem (left panel: Latent GP representation). The Bernoulli likelihood is linked to the latent GP $f$ via the logistic function. On the left, the optimal variational posterior $q(f)$ is shown in blue, compared to the true generating $f$ in green. Similar to Figure 2.1, the blue band represents one standard deviation. On the right, we show the expected predictive probability for $y$ given the variational posterior $q(f)$ in blue.

Posteriors of non-conjugate problems are not analytically tractable, and one needs to resort to the approximation methods presented in Section 2.3. A strong focus of this thesis is to take these non-conjugate likelihoods and find a representation where inference is simplified and basic methods can be used.

2.2.3 Sparse Gaussian Processes

One of the largest drawbacks of GPs, regardless of the conjugacy of the likelihood, is their scalability with the number of observed samples. When computing the predictive mean and covariance, the matrix inversion in Equation (2.8) has a computational complexity of $\mathcal{O}(N^3)$, where $N$ is the number of samples. For one-dimensional inputs ($D = 1$), solutions exist for specific kernels using a state-space model representation [ ], leading to an $\mathcal{O}(N)$ complexity. However, higher-dimensional problems require alternative solutions. The first approach to reduce the complexity was to use a Nyström approximation [ ]. Csató and Opper [11] proposed to create an approximation of the posterior using only a subset of the points in the context of online learning. Snelson and Ghahramani [51] expanded this theory to the offline framework, and Csató [10] followed by Titsias [53] developed an alternative approximation based on the KL divergence where the "inducing points" are not necessarily a subset of the training data and do not even have to belong to the same domain [ ]. For a unified view on sparse GPs, see Quiñonero-Candela and Rasmussen [45] and Bui et al. [7].

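The works of this thesis relying on inducing points are based on Titsias’ approach [ ]: The sparse approximation is made by defining a set of inducing point locations $Z = \{\mathbf{z}_i\}_{i=1}^M$ and the realization $\mathbf{u}$ of a GP on them, where $u_i = f(\mathbf{z}_i)$. We proceed to use variational inference (see Section 2.3.2) and approximate the posterior $p(\mathbf{u}, \mathbf{f} \mid \mathbf{y})$ by the variational distribution

$$q(\mathbf{u}, \mathbf{f}) = q(\mathbf{u}) \prod_{i=1}^N p(f_i \mid \mathbf{u}), \tag{2.9}$$

minimizing $\mathrm{KL}\left(q(\mathbf{u}, \mathbf{f}) \,\|\, p(\mathbf{u}, \mathbf{f} \mid \mathbf{y})\right)$. The assumption used is that all components of the random vector $\mathbf{f}$ are independent of each other given the random vector $\mathbf{u}$. It is a strong assumption, but the inference and prediction complexity reduces to $\mathcal{O}(NM^2)$, where $N$ can be reduced to a smaller batch size with stochastic inference approaches [ ]. Given $q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mu, \Sigma)$, the predictive distribution of $f^* = f(\mathbf{x}^*)$ on a new input $\mathbf{x}^*$ is given by

$$\begin{aligned} p(f^* \mid \mathbf{y}, X) &= \int p(f^* \mid \mathbf{u})\, p(\mathbf{u} \mid \mathbf{y}, X)\, d\mathbf{u} \\ &\approx \int p(f^* \mid \mathbf{u})\, q(\mathbf{u})\, d\mathbf{u} \\ &= \mathcal{N}(f^* \mid m^*, s^*), \end{aligned}$$

where $m^* = K_{\mathbf{x}^*, Z} K_Z^{-1} \mu$ and $s^* = K_{\mathbf{x}^*} - K_{\mathbf{x}^*, Z} K_Z^{-1} (K_Z - \Sigma) K_Z^{-1} K_{Z, \mathbf{x}^*}$.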

2.3 Approximate Bayesian Inference

The posterior distribution in Equation (2.1) cannot be computed in closed-form for non-trivial problems such as the ones presented in Sections 2.2.2 and 2.2.3. We can approximate the posterior to obtain a valuable estimator for predictions and expected values of interest. Approximate Bayesian Inference is a research field of its own, and this chapter will focus specifically on sampling and Variational Inference, the most popular approximate inference methods for GPs.

2.3.1 Sampling

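We can compute predictive estimates $\mathbb{E}_{p(\theta \mid X)}[f(\theta)]$ with Monte Carlo integration:

$$\mathbb{E}_{p(\theta \mid X)}[f(\theta)] \approx \frac{1}{S} \sum_{i=1}^{S} f(\theta_i), \qquad \theta_i \sim p(\theta \mid X),$$

where the $S$ samples $\theta_i$ are iid.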
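A minimal sketch of this estimator is given below. Since no real posterior is available here, a standard normal stands in for $p(\theta \mid X)$; this stand-in and the function names are assumptions for illustration only.

```python
import random

def monte_carlo_expectation(f, sampler, num_samples, rng):
    """Estimate E[f(theta)] by averaging f over iid draws from `sampler`."""
    total = 0.0
    for _ in range(num_samples):
        total += f(sampler(rng))
    return total / num_samples

def standard_normal(rng):
    """Stand-in for a posterior that we can sample from directly."""
    return rng.gauss(0.0, 1.0)

rng = random.Random(42)
# The second moment of N(0, 1) is 1, so the estimate should be close to 1.
second_moment = monte_carlo_expectation(lambda t: t * t, standard_normal, 20000, rng)
```

The error of such an estimate shrinks as the number of samples grows, at a rate that does not depend on the dimension of $\theta$ for iid samples.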

Even if the posterior distribution $p(\theta \mid X)$ is not available in closed-form or has no direct sampler, there are many alternatives to draw samples from it. The advantage of sampling is its unbiasedness: one obtains exact expectations in the limit of infinitely many samples. Sampling is an art of its own, and the number of methods is too large to mention them all in this thesis. Therefore, the scope is restricted to methods popular with or tailored to GPs. In particular, we restrict ourselves to Markov Chain Monte Carlo (MCMC) methods.

Markov Chain Monte Carlo and Metropolis-Hastings

Markov Chain Monte Carlo (MCMC) methods generate a chain of variables $\theta^t$ with the Markov assumption: $\theta^t$ depends only on $\theta^{t-1}$, and the stationary distribution of $\theta^t$ is the same as the target distribution $\pi(\theta)$ (for our use case the posterior $p(\theta|X)$). MCMC methods require a transition probability $t(\theta^{t+1}|\theta^t)$ which leaves the target stationary distribution invariant, i.e. $\pi(\theta) = \int t(\theta|\theta')\,\pi(\theta')\,d\theta'$. Other properties such as detailed balance and ergodicity need to be satisfied as well [6, 42].

One of the most common algorithms to run a Markov Chain on a distribution $\pi(\theta)$ is the Metropolis-Hastings (MH) algorithm. The MH algorithm consists in having a proposal distribution $q(\theta'|\theta)$ suggesting a new sample. Each proposed sample $\theta'$ is randomly accepted with probability $A(\theta', \theta) = \min\left(1, \frac{\pi(\theta')}{\pi(\theta)}\frac{q(\theta|\theta')}{q(\theta'|\theta)}\right)$ and rejected otherwise. The choice of the proposal distribution is the key to producing "good" chains with a high acceptance rate and a good exploration of $\theta$'s parameter space. Some categories of choice for the proposal $q$ are presented next.
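The accept/reject rule can be sketched with a random-walk proposal. This is a minimal illustration under stated assumptions: the target is a standard normal known only up to a constant, the Gaussian proposal is symmetric (so the ratio of proposal densities cancels), and the step size 2.4 is an arbitrary choice rather than a recommendation from the text.

```python
import math
import random


def metropolis_hastings(log_target, theta0, step_size, num_steps, rng):
    """Random-walk MH; the Gaussian proposal is symmetric, so q cancels in A."""
    theta = theta0
    chain, num_accepted = [], 0
    for _ in range(num_steps):
        proposal = theta + rng.gauss(0.0, step_size)
        # log of pi(theta') / pi(theta); the normalizer of pi is never needed.
        log_ratio = log_target(proposal) - log_target(theta)
        accept_prob = math.exp(min(0.0, log_ratio))  # A = min(1, ratio)
        if rng.random() < accept_prob:
            theta = proposal
            num_accepted += 1
        chain.append(theta)  # a rejected proposal repeats the current state
    return chain, num_accepted / num_steps


def log_unnormalized_normal(theta):
    # Illustrative target: standard normal, up to an additive constant.
    return -0.5 * theta ** 2


rng = random.Random(1)
chain, acceptance_rate = metropolis_hastings(
    log_unnormalized_normal, theta0=3.0, step_size=2.4, num_steps=50_000, rng=rng
)
samples = chain[1_000:]  # discard a short burn-in
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"acceptance rate {acceptance_rate:.2f}, mean {mean:.2f}, variance {variance:.2f}")
```

A very small step size would raise the acceptance rate but slow down exploration, which is the trade-off the choice of proposal has to balance.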

Figure 2.3: 20 steps of the Gibbs sampler trajectory on the Rosenbrock distribution in 2 dimensions (axes: x₁, x₂).

Gibbs Sampling

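Gibbs sampling is a particular MCMC method where we sample each component of the random vector one after another. The proposal distribution for each component is given by the full conditional $p(\theta_i|x, \theta_{/i})$, where $\theta_{/i} = \{\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_D\}$. The most prominent feature of Gibbs sampling is its acceptance probability, guaranteed to be 1:

$$A = \frac{p(\theta_i^{t+1}, \theta_{/i}^t|x)}{p(\theta_i^t, \theta_{/i}^t|x)}\,\frac{p(\theta_i^t|x, \theta_{/i}^t)}{p(\theta_i^{t+1}|x, \theta_{/i}^t)} = \frac{p(\theta_i^{t+1}|x, \theta_{/i}^t)\,p(\theta_{/i}^t|x)}{p(\theta_i^t|x, \theta_{/i}^t)\,p(\theta_{/i}^t|x)}\,\frac{p(\theta_i^t|x, \theta_{/i}^t)}{p(\theta_i^{t+1}|x, \theta_{/i}^t)} = 1.$$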

At every step, all proposed samples are therefore guaranteed to be accepted.
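The component-wise updates can be sketched on a target whose full conditionals are known in closed form. The zero-mean, unit-variance bivariate normal with correlation `rho` used here is an assumption chosen for illustration; it is not the distribution shown in the figure.

```python
import math
import random


def gibbs_bivariate_normal(rho, num_sweeps, rng, start=(0.0, 0.0)):
    """Gibbs sampler for a zero-mean, unit-variance bivariate normal.

    Each full conditional is the scalar Gaussian
    theta_i | theta_j ~ N(rho * theta_j, 1 - rho^2).
    """
    theta1, theta2 = start
    cond_std = math.sqrt(1.0 - rho ** 2)
    samples = []
    for _ in range(num_sweeps):
        theta1 = rng.gauss(rho * theta2, cond_std)  # always accepted
        theta2 = rng.gauss(rho * theta1, cond_std)  # uses the fresh theta1
        samples.append((theta1, theta2))
    return samples


def empirical_correlation(samples):
    n = len(samples)
    m1 = sum(a for a, _ in samples) / n
    m2 = sum(b for _, b in samples) / n
    cov = sum((a - m1) * (b - m2) for a, b in samples) / n
    v1 = sum((a - m1) ** 2 for a, _ in samples) / n
    v2 = sum((b - m2) ** 2 for _, b in samples) / n
    return cov / math.sqrt(v1 * v2)


rng = random.Random(2)
samples = gibbs_bivariate_normal(rho=0.8, num_sweeps=50_000, rng=rng)
estimated_rho = empirical_correlation(samples)
print(f"estimated correlation: {estimated_rho:.2f} (target: 0.80)")
```

As `rho` approaches 1 the conditional standard deviation shrinks, so each sweep moves the chain only slightly; this is the slow mixing under strong correlations that blocked updates aim to avoid.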

We illustrate the path of the sampler on a two-dimensional bimodal example in Figure 2.3.

The Gibbs sampling approach is a conundrum. On the one hand, sampling each component using the full conditional is easy since it only involves drawing a scalar. On the other hand, building a sampler for each full conditional at each step can be slow and costly. The sampler can also get stuck or move very slowly if the components are highly correlated with one another. We can mitigate these drawbacks by using additional techniques like the blocked Gibbs sampler [ ], where we sample groups of variables jointly, or collapsed Gibbs sampling [ ], where we marginalize out some variables from the full conditional distributions. But blocked or collapsed updates are not always available and require heavier sampling machinery.

The augmentations proposed in this thesis allow using both the blocked and collapsed versions by deriving the blocked full conditionals for each group of variables analytically. Experiments show that the correlations are very low between each group of variables, and that the sampler converges to the stationary distribution very fast.


Hamiltonian/Hybrid Monte Carlo

Hamiltonian Monte Carlo (HMC) or Hybrid Monte Carlo [ ] is an MCMC method that uses Hamiltonian dynamics to make a new proposal. We augment $\theta^t$ with an extra momentum $p$ sampled randomly for every proposal from $\mathcal{N}(0, M)$, where $M$ is the mass matrix. Next we run the Hamiltonian dynamics based on the Hamiltonian $H(\theta, p) = -\log \pi(\theta) + \frac{1}{2}p^\top M^{-1} p$ over $L$ leapfrog steps with step size $\Delta t$. The proposal at time $L\Delta t$ is accepted or rejected based on the acceptance rate:

$$A = \min\left(1, \frac{\exp(-H(\theta(L\Delta t), p(L\Delta t)))}{\exp(-H(\theta(0), p(0)))}\right).$$

Hamiltonian dynamics normally keep the Hamiltonian invariant. However, symplectic (volume preserving) integrators, like the leapfrog method, only keep $H$ approximately invariant [ ]. The global error on $H$ grows as $\mathcal{O}((\Delta t)^2)$. We get high acceptance rates while the dynamics lower the correlation between each sample by exploring the parameter space more freely than a basic random walk. We can tune the HMC algorithm parameters $M$, $L$, and $\Delta t$ by drawing a series of adaptive samples and by adjusting to the local geometry of the potential function $-\log \pi(\theta)$. Figure 2.4 illustrates the sampling process with the Hamiltonian dynamics paths drawn with gray lines.

HMC is very popular due to its plug-and-play characteristics but suffers from different issues. It is gradient-based and cannot sample discrete variables. The integration of the Hamiltonian dynamics requires $2L$ gradients per proposal. This computational cost can be prohibitively expensive for high-dimensional problems or for target distributions with costly computations.
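A one-dimensional sketch of the proposal mechanism described above, under simplifying assumptions: unit mass ($M = 1$), a standard normal target, and hand-picked values for the step size and the number of leapfrog steps. The function names are hypothetical.

```python
import math
import random


def leapfrog(theta, p, grad_potential, step_size, num_steps):
    """Leapfrog integration with unit mass (M = 1)."""
    p = p - 0.5 * step_size * grad_potential(theta)  # initial half step
    for step in range(num_steps):
        theta = theta + step_size * p  # full position step
        if step < num_steps - 1:
            p = p - step_size * grad_potential(theta)  # full momentum step
    p = p - 0.5 * step_size * grad_potential(theta)  # final half step
    return theta, p


def hmc(potential, grad_potential, theta0, step_size, num_steps, num_samples, rng):
    theta, chain = theta0, []
    for _ in range(num_samples):
        p0 = rng.gauss(0.0, 1.0)  # fresh momentum from N(0, M) with M = 1
        h0 = potential(theta) + 0.5 * p0 ** 2
        theta_new, p_new = leapfrog(theta, p0, grad_potential, step_size, num_steps)
        h1 = potential(theta_new) + 0.5 * p_new ** 2
        # A = min(1, exp(-H_new) / exp(-H_old)), evaluated in log space.
        if rng.random() < math.exp(min(0.0, h0 - h1)):
            theta = theta_new
        chain.append(theta)
    return chain


# Illustrative target: standard normal, so U(theta) = -log pi(theta) = theta^2 / 2.
rng = random.Random(3)
chain = hmc(lambda t: 0.5 * t * t, lambda t: t, theta0=2.0,
            step_size=0.15, num_steps=10, num_samples=20_000, rng=rng)
mean = sum(chain) / len(chain)
variance = sum((s - mean) ** 2 for s in chain) / len(chain)
print(f"mean {mean:.2f}, variance {variance:.2f}")
```

Because the leapfrog integrator nearly conserves $H$, the difference `h0 - h1` stays close to zero and almost every proposal is accepted, even though each proposal lands far from the current state.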

Figure 2.4: Illustration of the HMC sampler (gray lines are the Hamiltonian dynamics; axes: x₁, x₂)

Other samplers

There are other solid choices for sampling from GPs. For example, elliptical slice sampling (Murray et al. [38]) is particularly well-fitted for Gaussian priors. The No-U-turn sampling (NUTS) algorithm [ ] is an extension of HMC where $L$ is chosen automatically. We run the path integration with both $p$ and $-p$ until one of the particles goes backward or one of the Hamiltonian estimates becomes too inaccurate. The proposal is finally sampled randomly from both paths. NUTS is good at avoiding oscillatory dynamics and is particularly strong for quadratic problems, which appear regularly in GP problems. Finally, another orthogonal approach to sample from predictive distributions with a known GP posterior is pathwise sampling [ ]. By taking a mix of random Fourier features, specific to a particular class of kernels, the sampling complexity can be reduced from cubic to linear in the number of test inputs, for a chosen number $T$ of basis functions.

2.3.2 Variational Inference

Variational Inference (VI), also called Variational Bayes, consists in approximating the posterior $p(\theta|X)$ with another distribution $q(\theta)$. Given a family of distributions $\mathcal{Q}$, parametrized by the variational parameters $\varphi$, one aims to solve the following optimization problem:

$$\varphi^* = \arg\min_{\varphi} D(q_\varphi(\theta), p(\theta|x)), \qquad (2.10)$$

where $D$ is a dissimilarity measure between two distributions and $q_\varphi$ is the distribution $q \in \mathcal{Q}$ parametrized by $\varphi$. One of the most used dissimilarity measures is the reverse Kullback-Leibler (KL) divergence, defined for continuous distributions as:

$$\mathrm{KL}(q(x)\,\|\,p(x)) = \int q(x) \log \frac{q(x)}{p(x)}\,dx. \qquad (2.11)$$

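The objective of Equation (2.10) with the KL divergence of Equation (2.11) is generally not directly tractable when the normalizer is not known. Since $p(\theta|x)$ involves the normalization constant $p(x)$, one resorts to a surrogate function, the Variational Free Energy (VFE) (or its negative counterpart the Evidence Lower BOund (ELBO)):

$$\mathrm{KL}(q_\varphi(\theta)\,\|\,p(\theta|x)) = \int q_\varphi(\theta)\left(\log q_\varphi(\theta) - \log p(\theta|x)\right)d\theta$$
$$= \int q_\varphi(\theta)\left(\log q_\varphi(\theta) - \log p(\theta, x) + \log p(x)\right)d\theta$$
$$= \underbrace{\log p(x)}_{:=C} + \int q_\varphi(\theta)\left(\log q_\varphi(\theta) - \log p(x|\theta) - \log p(\theta)\right)d\theta$$
$$= C - \mathbb{E}_{q_\varphi}[\log p(x|\theta)] + \mathrm{KL}(q_\varphi(\theta)\,\|\,p(\theta)) = F(\varphi) + C. \qquad (2.12)$$

Since $C$ does not depend on $\varphi$, minimizing $F(\varphi)$ instead of the KL divergence yields the optimum of the problem stated in Equation (2.10).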

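A standard way to find $\varphi^* = \arg\min_\varphi F(\varphi)$ is to perform gradient descent on the variational parameters $\varphi$:

$$\varphi^{t+1} = \varphi^t - \epsilon_t \nabla_\varphi F(\varphi^t), \qquad (2.13)$$

where $\epsilon_t > 0$ is the learning rate.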
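The update rule can be illustrated on a toy model where the free energy has a closed form. The setup is an assumption chosen so that the answer can be checked: prior $\theta \sim \mathcal{N}(0, 1)$, likelihood $x|\theta \sim \mathcal{N}(\theta, 1)$, and a Gaussian $q = \mathcal{N}(m, s^2)$. The exact posterior is then $\mathcal{N}(x/2, 1/2)$.

```python
def free_energy_gradient(m, s, x):
    """Gradient of F(m, s) for prior N(0, 1), likelihood N(x | theta, 1), q = N(m, s^2).

    Up to a constant, F(m, s) = 0.5 * ((x - m)^2 + s^2) + 0.5 * (s^2 + m^2 - 1 - log s^2),
    i.e. the negative expected log-likelihood plus KL(q || prior).
    """
    grad_m = 2.0 * m - x
    grad_s = 2.0 * s - 1.0 / s
    return grad_m, grad_s


def fit_gaussian_vi(x, learning_rate=0.1, num_steps=500):
    m, s = 0.0, 1.0  # initial variational parameters phi = (m, s)
    for _ in range(num_steps):
        grad_m, grad_s = free_energy_gradient(m, s, x)
        m -= learning_rate * grad_m  # phi_{t+1} = phi_t - eps * grad F(phi_t)
        s -= learning_rate * grad_s
    return m, s


x_observed = 2.0
m_opt, s_opt = fit_gaussian_vi(x_observed)
# The exact posterior is N(x / 2, 1 / 2), so q should match it here.
print(f"q mean {m_opt:.4f}, q variance {s_opt ** 2:.4f}")
```

In this conjugate example the Gaussian family contains the true posterior, so the minimizer of $F$ recovers it exactly; in non-conjugate models the gradient typically has to be estimated.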

4Technical details are skipped here.


Computing the gradient $\nabla_\varphi F(\varphi)$ can be non-trivial. It involves derivatives over expectations, but "tricks" like reparametrization [ ] help to reduce the cost of these computations.

The choice of the family $\mathcal{Q}$ is a trade-off decision. A richer, more complex family might be able to approximate the posterior better, but computing the VFE and optimizing the variational parameters will be increasingly difficult. A standard example for continuous variables is the Variational Gaussian Approximation (VGA), where the variational distribution $q_\varphi$ is a Gaussian, i.e. $\mathcal{Q} = \{q \sim \mathcal{N}(m, S)\}$, and $\varphi = \{m, S\}$. Many expectations can be computed analytically under VGA, in particular when the prior on $\theta$ is Gaussian as well. The Gaussian distribution is easily reparametrizable, and it is straightforward to sample from it. Many operations will be of the cost $\mathcal{O}(D^3)$, where $D$ is the dimensionality of $\theta$. Restricting $\mathcal{Q}$ further by constraining the covariance $S$ can reduce this cost. For example, setting $S$ to be diagonal will reduce the number of variational parameters and avoid inverse matrix operations.

Mean-Field Approximation

We need assumptions on $\mathcal{Q}$ to reduce the computational cost of variational inference and scale with high-dimensional $\theta$. The Mean-Field (MF) assumption imposes that every component of $\theta$ is independent of each other. A MF variational family can be specified as:

$$\mathcal{Q}_{MF} = \left\{q = \prod_{i=1}^{D} q_{\varphi_i}(\theta_i)\right\}, \qquad (2.14)$$

where $\varphi_i$ are the variational parameters for the variable $\theta_i$. Under the MF approximation, the number of variational parameters grows linearly with the dimensionality of $\theta$ instead of quadratically. Additionally, integrals in Equation (2.12) can become one-dimensional or sometimes analytically tractable (the KL for example), and therefore more easily solvable. However, MF cannot capture potential posterior correlations between the components of $\theta$.

An intermediate solution is to assume independence between blocks of variables instead, similarly to the blocked Gibbs sampler. Given $\mathcal{I} = \{1, \ldots, D\}$, the set of indices of $\theta$, we can partition it into $K$ independent subsets $I_k \subseteq \mathcal{I}$ such that $\cup_{k=1}^{K} I_k = \mathcal{I}$ and $I_i \cap I_j = \emptyset$ for $i \neq j$. The variational distribution based on this Blocked Mean-Field (BMF) approximation is then defined as

$$q_\varphi^{BMF}(\theta) = \prod_{k=1}^{K} q_{\varphi_k}(\theta_{I_k}), \qquad (2.15)$$

where $\varphi_k$ are the variational parameters for the set of variables $\theta_{I_k}$. The BMF approximation can capture correlations inside blocks of variables but loses some of MF's computational attractiveness.

Coordinate Ascent VI

The Coordinate Ascent Variational Inference (CAVI) approach is an alternative to the gradient descent approach of Equation (2.13). Instead of moving all parameters at once in the gradient direction, we are interested in finding the optimal solution for each set of variational parameters $\varphi_i$ one after another by keeping the others fixed:

$$\varphi_i^* = \arg\min_{\varphi_i} F(\varphi_i, \varphi_{/i}), \qquad (2.16)$$

where $\varphi_{/i} = \{\varphi_j \,|\, j \neq i\}$. Using the BMF approximation, we can update blocks of variational parameters at once. The optimal $\varphi_i^*$ can be found by solving:

$$\nabla_{\varphi_i} F(\varphi)\big|_{\varphi_i = \varphi_i^*} = 0, \qquad (2.17)$$

or performing a partial version of the gradient descent from Equation (2.13). The solution to Equation (2.16) is always given by

$$q_{\varphi_i}^*(\theta_i) \propto \exp\left(\mathbb{E}_{q_\varphi(\theta_{/i})}\left[\log p(\theta_i|\theta_{/i}, x)\right]\right), \qquad (2.18)$$

where $\theta_{/i}$ represents the collection of variables $\theta_{/i} = \{\theta_j \,|\, j \neq i\}$ [ ]. Even when the expectation involved in Equation (2.18) is available in closed-form, the resulting distribution might not always be normalizable, but we are usually only interested in the different moments of $q_{\varphi_i}(\theta_i)$. Algorithm 1 summarizes the CAVI algorithm. The order of the updates does not matter as long as the variational parameters $\varphi$ are initialized in their respective domain.

5 The VGA is explored in more detail in Chapter 6.

6 The word ascent is used since the scheme was originally derived using the negative VFE, i.e., the ELBO.

Algorithm 1 CAVI algorithm

while $|F^{t+1} - F^t| > \epsilon$ do
    for $i \in \{1, \ldots, D\}$ do
        $\varphi_i^{t+1} = \arg\min_{\varphi_i} F(\varphi_{1:(i-1)}^{t+1}, \varphi_i, \varphi_{(i+1):D}^t)$
    end for
end while
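The loop of Algorithm 1 can be sketched on a target where each coordinate update is available in closed form. The example below is an assumption chosen for illustration: a bivariate Gaussian target $\mathcal{N}(\mu, \Lambda^{-1})$ with a mean-field $q(\theta_1)q(\theta_2)$, for which Equation (2.18) gives Gaussian factors with precision $\Lambda_{ii}$. The stopping rule monitors the change in the parameters as a stand-in for $|F^{t+1} - F^t|$.

```python
def cavi_bivariate_gaussian(mu, precision, tol=1e-10, max_sweeps=1_000):
    """CAVI for the target N(mu, precision^{-1}) with q(t1, t2) = q1(t1) q2(t2).

    Each optimal factor is Gaussian, q_i = N(m_i, 1 / precision[i][i]), with
    m_1 = mu_1 - precision[0][1] / precision[0][0] * (m_2 - mu_2), and
    symmetrically for m_2.
    """
    m = [0.0, 0.0]
    for sweep in range(max_sweeps):
        previous = list(m)
        m[0] = mu[0] - precision[0][1] / precision[0][0] * (m[1] - mu[1])
        m[1] = mu[1] - precision[1][0] / precision[1][1] * (m[0] - mu[0])
        # Stand-in stopping rule: parameter change instead of |F_{t+1} - F_t|.
        if max(abs(a - b) for a, b in zip(m, previous)) < tol:
            break
    variances = [1.0 / precision[0][0], 1.0 / precision[1][1]]
    return m, variances, sweep + 1


# Target covariance [[1, 0.8], [0.8, 1]]; its inverse is the precision below.
rho = 0.8
scale = 1.0 / (1.0 - rho ** 2)
precision = [[scale, -rho * scale], [-rho * scale, scale]]
means, variances, num_sweeps = cavi_bivariate_gaussian([1.0, -1.0], precision)
print(means, variances, num_sweeps)
```

The means converge to the true mean, while the factor variances equal $1 - \rho^2 = 0.36$ instead of the true marginal variance 1, which illustrates how mean-field ignores posterior correlations.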

The CAVI and Gibbs sampling algorithms are very similar in nature. The observations on Gibbs sampling also apply: CAVI updates on a distribution with MF are easily computable but have slower convergence, while updates with the BMF approach are more complex to derive, avoid some MF pitfalls, and provide a richer distribution.

Natural Gradients. One interesting aspect of CAVI is that it implicitly uses natural gradients [ ]. A natural gradient is a gradient preconditioned with the inverse Fisher information matrix, defined as

$$\mathcal{I} = -\mathbb{E}_{q_\varphi(\theta)}\left[H(\log q_\varphi(\theta))\right],$$

where $H(f)$ is the Hessian matrix of the function $f$ with respect to $\varphi$. The Fisher information matrix is a Riemannian metric that gives the direction of the steepest descent with respect to the KL divergence. The natural gradient is given by:

$$\widetilde{\nabla}_\varphi F(\varphi) = \mathcal{I}^{-1}\nabla_\varphi F(\varphi).$$

The natural gradient works in a metric that maximizes the change of the infinitesimal KL divergence between the given distribution and its target [ ]. The updates of the CAVI algorithm (Algorithm 1) for exponential-family distributions can be interpreted as natural gradient updates with learning rate 1 [60], i.e. ascent on the ELBO or, equivalently, descent on $F$:

$$\varphi^{t+1} = \varphi^t - \mathcal{I}^{-1}\nabla_\varphi F(\varphi^t).$$

When working with constrained parameters like the covariance matrix of the Gaussian variational distribution, a step with a high learning rate might overshoot out of the cone of positive-definite matrices. Salimbeni et al. [48] propose a given schedule to compensate, while Lin et al. [34] force a trajectory on a geodesic. Both approaches are computationally expensive, while we get this feature automatically.

2.3.3 Scale mixtures and conditionally conjugate likelihoods

We base a large part of this work on mixtures and use scale mixtures in particular. A scale mixture is a continuous mixture of a distribution with a varying scale parameter. A textbook example is the Student-T distribution, which is a Gaussian scale mixture with a Gamma prior on the precision:

$$T_\nu(x) = \int_0^\infty \mathcal{N}(x|0, \omega^{-1})\,\mathrm{Ga}\left(\omega \,\middle|\, \tfrac{\nu}{2}, \tfrac{\nu}{2}\right)d\omega,$$

where Ga is a Gamma distribution in its shape-rate parametrization. Another example is the Laplace distribution, which is also a Gaussian scale mixture, this time over the variance:

$$\mathrm{La}(x|\beta) = \int_0^\infty \mathcal{N}(x|0, \omega)\,\mathrm{Exp}\left(\omega \,\middle|\, \tfrac{1}{2\beta^2}\right)d\omega,$$

where Exp is the exponential distribution.

These representations appear when computing predictive distributions. For example, when performing Gaussian linear regression with fixed weights and a Gamma prior on the likelihood precision $\sigma^{-2}$, the resulting posterior predictive distribution will be a Student-T distribution.
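The Student-T mixture can be checked numerically with a small sketch. It assumes the precision form of the mixture, i.e. a precision $\omega \sim \mathrm{Ga}(\nu/2, \nu/2)$ (shape and rate) followed by $x \sim \mathcal{N}(0, 1/\omega)$, and compares the sample variance with the known value $\nu/(\nu - 2)$. The function name is hypothetical.

```python
import math
import random


def sample_student_t_via_scale_mixture(nu, rng):
    """Draw x ~ T_nu by first drawing a Gamma-distributed precision."""
    # random.gammavariate takes (shape, scale); rate nu/2 means scale 2/nu.
    omega = rng.gammavariate(nu / 2.0, 2.0 / nu)
    return rng.gauss(0.0, 1.0 / math.sqrt(omega))


rng = random.Random(4)
nu = 10.0
draws = [sample_student_t_via_scale_mixture(nu, rng) for _ in range(200_000)]
sample_variance = sum(d * d for d in draws) / len(draws)
# A Student-T with nu > 2 degrees of freedom has variance nu / (nu - 2).
print(f"sample variance {sample_variance:.3f}, exact {nu / (nu - 2.0):.3f}")
```

Conditioned on $\omega$ every draw is Gaussian; the heavy tails only appear once $\omega$ is integrated out, which is the property the augmentations rely on.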

This thesis shows that we can use this connection the other way around. Certain likelihoods $p(x|\theta)$ can be defined as scale mixtures $\int p(x|\theta, \omega)\,p(\omega)\,d\omega$. We can "unmarginalize" the likelihood by adding the scale variable $\omega$ to our model. We augment $p(x|\theta)$ to $p(x, \omega|\theta)$. For example, we can augment a Student-T likelihood into a Gaussian likelihood with a Gamma prior on the precision. The advantage of the augmented model is to produce conditionally conjugate likelihoods for all the model variables, as the next chapters will show.

Efficient Gaussian Process

Classification Using Pólya-Gamma

Data Augmentation

Before my doctoral studies, I worked on extending the work of Henao et al. [20] on Bayesian support vector machines to GPs as well as scaling them up to big data [ ]. This paper is not included in this thesis as it did not get published during my Ph.D. The approach proposed by Henao et al. [20] was the first step on the road of our research on augmentations. A natural continuation was to explore the binary classification problem with the logit link.

This paper extends the work of Polson et al. [44] on augmenting with Pólya-Gamma variables to GPs and sparse GPs. The main contributions of this paper are to show that the augmented model outperforms other state-of-the-art methods for GPs, and to derive a remarkable equivalence between the variational bound derived by Jaakkola and Jordan [27] and the Pólya-Gamma augmentation.

Authors:

Florian Wenzel,1,∗ Théo Galy-Fajou,2,∗ Christian Donner,2 Marius Kloft,1,3 Manfred Opper2

∗ Equal Contribution, 1 TU Kaiserslautern, Germany, 2 TU Berlin, Germany, 3 University of Southern California, USA

Details:

Type: Conference article Submitted: September 2018

Accepted: December 2018

DOI: https://doi.org/10.1609/aaai.v33i01.33015417

Conference: AAAI 2019

3. Efficient Gaussian Process Classification Using Pólya-Gamma Data Augmentation

Contributions:

For an explanation of the terms see the Contributor Roles Taxonomy (CRediT)

F.W. T.G-F. C.D. M.K. M.O.

Conceptualization ✓ ✓ ✓ ✓

Methodology ✓ ✓

Formal Analysis ✓ ✓ ✓

Implementation ✓

Investigation ✓ ✓

Writing - Original Draft ✓ ✓ ✓

Writing - Review & Editing ✓ ✓ ✓ ✓

Supervision ✓

Funding Acquisition ✓ ✓

Efficient Gaussian Process Classification Using Pólya-Gamma Data Augmentation

Florian Wenzel,1,* Théo Galy-Fajou,2,* Christian Donner,2 Marius Kloft,1,3 Manfred Opper2

*Contributed equally, 1TU Kaiserslautern, Germany, 2TU Berlin, Germany, 3University of Southern California, USA

w[email protected], galy-fajou@tu-berlin.de, [email protected],

[email protected], manfred.opper@tu-berlin.de

Abstract

We propose a scalable stochastic variational approach to GP classification building on Pólya-Gamma data augmentation and inducing points. Unlike former approaches, we obtain closed-form updates based on natural gradients that lead to efficient optimization. We evaluate the algorithm on real-world datasets containing up to 11 million data points and demonstrate that it is up to two orders of magnitude faster than the state-of-the-art while being competitive in terms of prediction performance.

1 Introduction

Gaussian processes (GPs) (Rasmussen and Williams 2005) provide a popular Bayesian non-linear non-parametric method for regression and classification. Because of their ability to accurately adapt to data and thus achieve high prediction accuracy while providing well-calibrated uncertainty estimates, GPs are a standard method in several application areas, including geospatial predictive modeling (Stein 2012) and robotics (Dragiev, Toussaint, and Gienger 2011).

However, recent trends in data availability in the sciences and technology have made it necessary to develop algorithms capable of processing massive data (John Walker 2014). Currently, GP classification has limited applicability to big data. Naive inference typically scales cubically in the number of data points, and exact computation of the posterior and marginal likelihood is intractable.

Nevertheless, the combination of so-called sparse Gaussian process techniques with approximate inference methods, such as expectation propagation (EP) or the variational approach, has enabled GP classification for datasets containing millions of data points (Hernández-Lobato and Hernández-Lobato 2016; Salimbeni, Eleftheriadis, and Hensman 2018).

While these results are already impressive, we will show in this paper that a speedup of up to two orders of magnitude can be achieved. Our approach is based on considering an augmented version of the original GP classification model and replacing the ordinary (stochastic) gradients for optimization by more efficient natural gradients, which are the standard Euclidean gradients multiplied by the inverse Fisher information matrix. Natural gradients have recently been used successfully in a variety of variational inference problems (Honkela et al. 2010; Wenzel et al. 2017; Jähnichen et al. 2018).

Unfortunately, an efficient computation of the natural gradient for the GP classification problem is not straightforward. The use of the probit link function in Dezfouli and Bonilla (2015); Hernández-Lobato and Hernández-Lobato (2016); Mandt et al. (2017); Salimbeni, Eleftheriadis, and Hensman (2018) leads to expectations in the variational objective functions that can only be computed by numerical quadrature, thus preventing efficient optimization.

We derive a natural-gradient approach to variational inference in GP classification based on the logit link. We exploit that the corresponding likelihood has an auxiliary variable representation as a continuous mixture of Gaussians involving Pólya-Gamma random variables (Polson, Scott, and Windle 2013).

Unlike former approaches, our natural gradient updates can be computed in closed-form. Moreover, they have the advantage that they correspond to block-coordinate ascent updates and, therefore, learning rates close to one can be chosen. This leads to a fast and stable algorithm which is simple to implement. Our main contributions are as follows:

• We present a Gaussian process classification model using a logit link function that is based on Pólya-Gamma data augmentation and inducing points for Gaussian process inference.

• We derive an efficient inference algorithm based on stochastic variational inference and natural gradients. All natural gradient updates are given in closed-form and do not rely on numerical quadrature methods or sampling approaches. Natural gradients have the advantage that they provide effective second-order optimization updates.

• In our experiments, we demonstrate that our approach drastically improves speed, by up to two orders of magnitude, while being competitive in terms of prediction performance. We apply our method to massive real-world datasets of up to 11 million points and demonstrate superior scalability.

The paper is organized as follows. In section 2 we discuss related work. In section 3 we introduce our novel scalable GP classification model and in section 4 we present an efficient variational inference algorithm. Section 5 concludes with experiments. Our code is available via GitHub¹.

2 Background and Related Work

Gaussian process classification. Hensman and Matthews (2015) consider Gaussian process classification with a probit inverse link function and suggest a variational Gaussian model that builds on inducing points. By employing automatic differentiation, Salimbeni, Eleftheriadis, and Hensman (2018) generalize this approach to use natural gradients in non-conjugate GP models. Khan and Nielsen (2018) consider natural gradient updates in the setting of variational inference with exponential families. Unlike our approach, these methods do not benefit from closed-form updates and have to resort to numerical approximations. Moreover, our approach has the advantage that a higher learning rate close to one can be chosen, leading to updates that can be interpreted as block-coordinate ascent updates.

Izmailov, Novikov, and Kropotov (2018) use tensor train decomposition to allow for the training of GP models with billions of inducing points. The updates are not computed in closed-form and they do not use natural gradients.

Dezfouli and Bonilla (2015) propose a general automated variational inference approach for sparse GP models with non-conjugate likelihood. Since they follow a black box approach and do not exploit model specific properties, they do not employ efficient optimization techniques.

Hernández-Lobato and Hernández-Lobato (2016) follow an expectation propagation approach based on inducing points and have a similar computational cost as Hensman and Matthews (2015).

Pólya-Gamma data augmentation. Polson, Scott, and Windle (2013) introduced the idea of data augmentation in logistic models using the class of Pólya-Gamma distributions. This allows for exact inference via Gibbs sampling or approximate variational inference schemes (Scott and Sun 2013).

Linderman, Johnson, and Adams (2015) extend this idea to multinomial models and discuss the application for Gaussian processes with multinomial observations, but their approach does not scale to big datasets and they do not consider the concept of inducing points.

¹ https://github.com/theogf/AugmentedGaussianProcesses.jl

3 Model

The logit GP Classification model is defined as follows. Let $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$ be the $d$-dimensional training points with labels $y = (y_1, \ldots, y_n) \in \{-1, 1\}^n$. The likelihood of the labels is

$$p(y|f, X) = \prod_{i=1}^{n} \sigma(y_i f(x_i)), \qquad (1)$$

where $\sigma(z) = (1 + \exp(-z))^{-1}$ is the logit link function and $f$ is the latent decision function. We place a GP prior over $f$ and obtain the joint distribution of the labels and the latent GP

$$p(y, f|X) = p(y|f, X)\,p(f|X), \qquad (2)$$

where $p(f|X) = \mathcal{N}(f|0, K_{nn})$ and $K_{nn}$ denotes the kernel matrix evaluated at the training points $X$. For the sake of clarity we omit the conditioning on $X$ in the following.

3.1 Pólya-Gamma data augmentation

Due to the analytically inconvenient form of the likelihood function, inference for logit GP classification is a challenging problem. We aim to remedy this issue by considering an augmented representation of the original model. Later we will see that the augmented model is indeed advantageous as it leads to efficient closed-form updates in our variational inference scheme.

Polson, Scott, and Windle (2013) introduced the class of Pólya-Gamma random variables and proposed a data augmentation strategy for inference in models with binomial likelihoods. The augmented model has the appealing property that the likelihood of the latent function $f$ is proportional to a Gaussian density when conditioned on the augmented Pólya-Gamma variables. This allows for Gibbs sampling methods, where model parameters and Pólya-Gamma variables can be sampled alternately from the posterior (Polson, Scott, and Windle 2013). Alternatively, the augmentation scheme can be utilized to derive an efficient approximate inference algorithm in the variational inference framework, which will be pursued here.

The Pólya-Gamma distribution is defined as follows. The random variable $\omega \sim \mathrm{PG}(b, 0)$, $b > 0$, is defined by the moment generating function

$$\mathbb{E}_{\mathrm{PG}(\omega|b,0)}[\exp(-\omega t)] = \frac{1}{\cosh^b(\sqrt{t/2})}. \qquad (3)$$

It can be shown that this is the Laplace transform of an infinite convolution of gamma distributions. The definition is related to our problem by the fact that the logit link can be written in a form that involves the cosh function, namely $\sigma(z_i) = \exp(\frac{1}{2}z_i)\left(2\cosh(\frac{z_i}{2})\right)^{-1}$. In the following we derive a representation of the logit link in terms of Pólya-Gamma variables.

First, we define the general $\mathrm{PG}(b, c)$ class, which is derived by an exponential tilting of the $\mathrm{PG}(b, 0)$ density. It is given by

$$\mathrm{PG}(\omega|b, c) \propto \exp\left(-\frac{c^2}{2}\omega\right)\mathrm{PG}(\omega|b, 0).$$

From the moment generating function (3) the first moment can be directly computed:

$$\mathbb{E}_{\mathrm{PG}(\omega|b,c)}[\omega] = \frac{b}{2c}\tanh\frac{c}{2}.$$

For the subsequently presented variational algorithm these properties suffice and the full representation of the Pólya-Gamma density $\mathrm{PG}(\omega|b, c)$ is not required.
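The first-moment formula can be checked numerically with a rough sketch. It relies on an assumption that is not spelled out in the text above: the sum-of-gammas representation of Polson, Scott, and Windle (2013), $\omega = \frac{1}{2\pi^2}\sum_{k \geq 1} g_k / \left((k - \frac{1}{2})^2 + \frac{c^2}{4\pi^2}\right)$ with $g_k \sim \mathrm{Ga}(b, 1)$, truncated here to a finite number of terms. The function names and the truncation level are illustrative choices.

```python
import math
import random


def sample_polya_gamma(b, c, rng, num_terms=200):
    """Approximate draw from PG(b, c) by truncating the infinite gamma sum."""
    total = 0.0
    for k in range(1, num_terms + 1):
        g_k = rng.gammavariate(b, 1.0)  # Gamma(shape=b, scale=1)
        total += g_k / ((k - 0.5) ** 2 + c ** 2 / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)


def polya_gamma_mean(b, c):
    """First moment b / (2c) * tanh(c / 2), as stated in the text."""
    return b / (2.0 * c) * math.tanh(c / 2.0)


rng = random.Random(5)
b, c = 1.0, 2.0
draws = [sample_polya_gamma(b, c, rng) for _ in range(2_000)]
empirical_mean = sum(draws) / len(draws)
print(f"empirical mean {empirical_mean:.3f}, closed form {polya_gamma_mean(b, c):.3f}")
```

The truncation introduces a small downward bias, so this sampler is only meant as a sanity check; the variational algorithm itself needs the closed-form mean alone.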

We now adapt the data augmentation strategy based on Pólya-Gamma variables for the GP classification model. To do this we write the non-conjugate logistic likelihood function (1) in terms of Pólya-Gamma variables

$$\sigma(z_i) = (1 + \exp(-z_i))^{-1} = \frac{\exp(\frac{1}{2}z_i)}{2\cosh(\frac{z_i}{2})} = \frac{1}{2}\int \exp\left(\frac{z_i}{2} - \frac{z_i^2}{2}\omega_i\right)p(\omega_i)\,d\omega_i, \qquad (4)$$

where $p(\omega_i) = \mathrm{PG}(\omega_i|1, 0)$ and by making use of (3). For more details see Polson, Scott, and Windle (2013). Using this identity and substituting $z_i = y_i f(x_i)$ we augment the joint density (2) with Pólya-Gamma variables

$$p(y, \omega, f) \propto \exp\left(\frac{1}{2}y^\top f - \frac{1}{2}f^\top \Omega f\right)p(f)\,p(\omega), \qquad (5)$$

where $\Omega = \mathrm{diag}(\omega)$ is the diagonal matrix of the Pólya-Gamma variables $\{\omega_i\}$. In contrast to the original model (2), the augmented model is conditionally conjugate, forming the basis for deriving closed-form updates in section 4.
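To make the conditional conjugacy concrete, the two full conditionals implied by (5) can be written out. This is a short worked derivation, not taken verbatim from the paper; the symbol $\Sigma_f$ is introduced here only for illustration. Collecting the terms of (5) that depend on $f$, with $p(f) = \mathcal{N}(f|0, K_{nn})$, and then those that depend on each $\omega_i$ (using $y_i^2 = 1$ and the tilting definition of $\mathrm{PG}(b, c)$), one obtains

```latex
\begin{align*}
p(f \mid y, \omega) &\propto \exp\left(\tfrac{1}{2} y^\top f
    - \tfrac{1}{2} f^\top \left(K_{nn}^{-1} + \Omega\right) f\right)
  = \mathcal{N}\left(f \,\middle|\, \tfrac{1}{2}\Sigma_f\, y,\ \Sigma_f\right),
  \qquad \Sigma_f = \left(K_{nn}^{-1} + \Omega\right)^{-1}, \\
p(\omega_i \mid f, y) &\propto \exp\left(-\tfrac{f_i^2}{2}\,\omega_i\right)
    \mathrm{PG}(\omega_i \mid 1, 0)
  = \mathrm{PG}(\omega_i \mid 1, |f_i|).
\end{align*}
```

Both conditionals belong to known families, which is what allows a Gibbs sampler or closed-form coordinate updates on the augmented model.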

Interestingly, employing a structured mean-field variational inference approach (cf. section 4) to the plain Pólya-Gamma augmented model (5) leads to the same bound for GP classification derived by Gibbs and MacKay (2000). This is an interesting new perspective on this bound since they do not employ a data augmentation approach. We provide a proof in appendix A.5. Our approach goes beyond Gibbs and MacKay (2000) by providing a fully Bayesian perspective, including a sparse GP prior (section 3.2) in the model and proposing a scalable inference algorithm based on natural gradients (section 4).

3.2 Sparse Gaussian process

Inference in GP models typically has the computational complexity $\mathcal{O}(n^3)$. We obtain a scalable approximation of our model and focus on inducing point methods (Snelson and Ghahramani 2006). We follow a similar approach as in Hensman and Matthews (2015) and reduce the complexity to $\mathcal{O}(m^3)$, where $m$ is the number of inducing points.

We augment the latent GP $f$ with $m$ additional input-output pairs $(Z_1, u_1), \ldots, (Z_m, u_m)$, termed inducing inputs and inducing variables. The function values of the GP $f$ and the inducing variables $u = (u_1, \ldots, u_m)$ are connected via

$$p(f|u) = \mathcal{N}\left(f \,\middle|\, K_{nm}K_{mm}^{-1}u, \widetilde{K}\right), \qquad p(u) = \mathcal{N}(u|0, K_{mm}), \qquad (6)$$

where $K_{mm}$ is the kernel matrix resulting from evaluating the kernel function between all inducing inputs, $K_{nm}$ is the cross-kernel matrix between inducing inputs and training points and $\widetilde{K} = K_{nn} - K_{nm}K_{mm}^{-1}K_{mn}$. Including the inducing points in our model gives the augmented joint distribution

$$p(y, \omega, f, u) = p(y|\omega, f)\,p(\omega)\,p(f|u)\,p(u). \qquad (7)$$

Note that the original model (2) can be recovered by marginalizing $\omega$ and $u$.

4 Inference

The goal of Bayesian inference is to compute the posterior of the latent model variables. Because this problem is intractable for the model at hand, we employ variational inference to map the inference problem to a feasible optimization problem. We first choose a family of tractable variational distributions and select the best candidate by minimizing the Kullback-Leibler divergence between the variational distribution and the posterior. This is equivalent to optimizing a lower bound on the marginal likelihood, known as the evidence lower bound (ELBO) (Jordan et al. 1999; Wainwright and Jordan 2008).

In the following we develop a stochastic variational inference (SVI) algorithm that enables stochastic optimization based on natural gradient updates which are given in closed-form.

4.1 Why use natural gradients?

Using the natural gradient over the standard Euclidean gradient is favorable since natural gradients are invariant to reparameterization of the variational family (Amari and Nagaoka 2007; Martens 2017) and provide effective second-order optimization updates (Amari 1998; Hoffman et al. 2013).

The superiority of using natural gradients in our approach can be explained by the following. We reformulate the GP classification model as an augmented model which is conditionally conjugate. When using a learning rate of one, the natural gradient updates correspond to block-coordinate ascent updates, i.e. in each iteration each parameter is set to its optimal value given the remaining parameters (see appendix A.4 and Hoffman et al. (2013)). In practice, we employ stochastic variational inference, i.e. we only use mini-batches of the data to obtain a noisy version of the natural gradient. In this setting, learning rates slightly less than one have to be chosen.

This is in contrast to former natural gradient based approaches, e.g. Salimbeni, Eleftheriadis, and Hensman (2018), that focus on the original non-conjugate GP classification model. Although they benefit from using natural gradients, they have the disadvantage that their updates do not correspond to coordinate-ascent updates. Thus, learning rates that are much smaller than one have to be used to ensure convergence.

Therefore, in our approach, we can use much higher learning rates, and optimization is faster and more stable, which we demonstrate in the experiments.

4.2 Variational approximation

We aim to approximate the posterior of the inducing points $p(u|y)$ and apply the methodology of variational inference to the marginal joint distribution $p(y, \omega, u) = p(y|\omega, u)\,p(\omega)\,p(u)$. Following a similar approach as Hensman and Matthews (2015), we apply Jensen's inequality to obtain a tractable lower bound on the log-likelihood of the labels

$$\log p(y|\omega, u) = \log \mathbb{E}_{p(f|u)}[p(y|\omega, f)] \geq \mathbb{E}_{p(f|u)}[\log p(y|\omega, f)]. \qquad (8)$$

By this inequality we construct a variational lower bound on the evidence

$$\log p(y) \geq \mathbb{E}_{q(u,\omega)}[\log p(y|u, \omega)] - \mathrm{KL}(q(u, \omega)\,\|\,p(u, \omega))$$
$$\geq \mathbb{E}_{p(f|u)q(u)q(\omega)}[\log p(y|\omega, f)] - \mathrm{KL}(q(u, \omega)\,\|\,p(u, \omega)) =: \mathcal{L},$$

where the first inequality is the usual evidence lower bound (ELBO) in variational inference and the second inequality is due to (8).

We follow a structured mean-field approach (Wainwright and Jordan 2008) and assume independence between the inducing variables $u$ and Pólya-Gamma variables $\omega$, yielding a variational distribution of the form $q(u, \omega) = q(u)\,q(\omega)$. Setting the functional derivative of $\mathcal{L}$ w.r.t. $q(u)$ and $q(\omega)$ to zero, respectively, results in the following consistency condition for the maximum,

$$q(u, \omega) = q(u)\prod_{i} q(\omega_i), \qquad (9)$$

with $q(\omega_i) = \mathrm{PG}(\omega_i|1, c_i)$ and $q(u) = \mathcal{N}(u|\mu, \Sigma)$. Remarkably, we do not have to use the full Pólya-Gamma class $\mathrm{PG}(\omega_i|b_i, c_i)$, but can instead consider the restricted class $b_i = 1$ since it already contains the optimal distribution.

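We use (9) as variational family which is parameterized by the variational parameters {µ, Σ, c} and obtain a closed-form expression of the variational bound

$$
\begin{aligned}
\mathcal{L}(c, \mu, \Sigma)
&= \mathbb{E}_{p(f \mid u) q(u) q(\omega)}[\log p(y \mid \omega, f)] - \mathrm{KL}(q(u, \omega) \,\|\, p(u, \omega)) \\
&\overset{c}{=} \frac{1}{2} \Big( \log|\Sigma| - \log|K_{mm}| - \mathrm{tr}(K_{mm}^{-1} \Sigma) - \mu^\top K_{mm}^{-1} \mu \\
&\qquad + \sum_i \Big\{ y_i \kappa_i \mu - \theta_i \big( \widetilde{K}_{ii} + \kappa_i \Sigma \kappa_i^\top + \mu^\top \kappa_i^\top \kappa_i \mu \big) + c_i^2 \theta_i - 2 \log\cosh\frac{c_i}{2} \Big\} \Big),
\end{aligned}
\tag{10}
$$

where $\theta_i = \frac{1}{2 c_i} \tanh\frac{c_i}{2}$, $\kappa_i = K_{im} K_{mm}^{-1}$, and $\overset{c}{=}$ denotes equality up to an additive constant. Remarkably, all intractable terms involving expectations of log PG(ωi|1, 0) cancel out. Details are provided in appendix A.2.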

4.3 Stochastic variational inference

Our algorithm alternates between updates of the local variational parameters c and the global parameters µ and Σ. In each iteration we update the parameters based on a mini-batch of the data S ⊂ {1, ..., n} of size s = |S|.

We update the local parameters c_S in the mini-batch S by employing coordinate ascent. To this end, we fix the global parameters and analytically compute the unique maximum of (10) w.r.t. the local parameters, leading to the updates

$$
c_i = \sqrt{\widetilde{K}_{ii} + \kappa_i \Sigma \kappa_i^\top + \mu^\top \kappa_i^\top \kappa_i \mu} \tag{11}
$$

for i ∈ S.
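As an illustration of the local update (11), the following Python sketch computes c_i and θ_i for a mini-batch with NumPy. It is a minimal sketch, not the authors' Julia implementation: the function name `local_update` and its argument names are hypothetical, and it assumes the usual sparse-GP definition K̃_ii = K_ii − κ_i K_mi for the diagonal of the conditional covariance.

```python
import numpy as np

def local_update(K_mm, K_nm_batch, k_diag_batch, mu, Sigma):
    """Closed-form update (11) for the local parameters c_i of a mini-batch.

    K_mm:          (m, m) kernel matrix of the inducing points
    K_nm_batch:    (s, m) kernel matrix between batch points and inducing points
    k_diag_batch:  (s,)   prior variances K_ii of the batch points
    mu, Sigma:     current variational mean (m,) and covariance (m, m)
    """
    # kappa_i = K_im K_mm^{-1}, computed with a linear solve instead of an inverse
    kappa = np.linalg.solve(K_mm, K_nm_batch.T).T
    # Assumed definition: K~_ii = K_ii - kappa_i K_mi
    k_tilde = k_diag_batch - np.sum(kappa * K_nm_batch, axis=1)
    # c_i^2 = K~_ii + kappa_i Sigma kappa_i^T + (kappa_i mu)^2
    c_sq = k_tilde + np.sum((kappa @ Sigma) * kappa, axis=1) + (kappa @ mu) ** 2
    c = np.sqrt(c_sq)
    # theta_i = tanh(c_i / 2) / (2 c_i), the mean of omega_i under q
    theta = np.tanh(c / 2.0) / (2.0 * c)
    return c, theta, kappa
```

The returned `kappa` and `theta` are exactly the quantities that enter the global natural gradients, so in an implementation along these lines they would be reused rather than recomputed.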

We update the global parameters by employing stochastic optimization of the variational bound (10). The optimization is based on stochastic estimates of the natural gradients of the global parameters. We use the natural parameterization of the variational Gaussian distribution, i.e., the parameters $\eta_1 := \Sigma^{-1} \mu$ and $\eta_2 = -\frac{1}{2} \Sigma^{-1}$. Using the natural parameters results in simpler and more effective updates. The natural gradients based on the mini-batch S are given by

$$
\begin{aligned}
\widetilde{\nabla}_{\eta_1} \mathcal{L}_S &= \frac{n}{2s} \kappa_S^\top y_S - \eta_1, \\
\widetilde{\nabla}_{\eta_2} \mathcal{L}_S &= -\frac{1}{2} \Big( K_{mm}^{-1} + \frac{n}{s} \kappa_S^\top \Theta_S \kappa_S \Big) - \eta_2,
\end{aligned}
\tag{12}
$$

where Θ = diag(θ) and $\theta_i = \frac{1}{2 c_i} \tanh\frac{c_i}{2}$. The factor n/s is due to the rescaling of the mini-batches. The global parameters are updated according to a stochastic natural gradient ascent scheme. We employ the adaptive learning rate method described by Ranganath et al. (2013).

The natural gradient updates always lead to a positive definite covariance matrix² and, in contrast to Hensman and Matthews (2015), our implementation does not require any additional safeguard to ensure positive-definiteness of the variational covariance matrix Σ. Details for the derivation of the updates can be found in appendix A.3. The complexity of each iteration in the inference scheme is O(m³), due to the inversion of the matrix η2.

² This follows directly since Kmm and Θ are positive definite.
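The global update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `natural_gradient_step` is hypothetical, the learning rate `rho` is passed in by the caller instead of being set by the adaptive scheme of Ranganath et al. (2013), and `K_mm_inv` is assumed to be precomputed.

```python
import numpy as np

def natural_gradient_step(eta1, eta2, K_mm_inv, kappa_S, y_S, theta_S, n, rho):
    """One stochastic natural-gradient ascent step on (eta1, eta2), following (12).

    kappa_S: (s, m) rows kappa_i of the mini-batch
    y_S:     (s,)   labels in {-1, +1}
    theta_S: (s,)   values theta_i of the mini-batch
    n:       total number of data points, rho: learning rate
    """
    s = len(y_S)
    scale = n / s  # rescaling factor of the mini-batch
    grad1 = 0.5 * scale * kappa_S.T @ y_S - eta1
    # (kappa_S.T * theta_S) scales column i by theta_i, i.e. kappa^T Theta
    grad2 = -0.5 * (K_mm_inv + scale * (kappa_S.T * theta_S) @ kappa_S) - eta2
    eta1_new = eta1 + rho * grad1
    eta2_new = eta2 + rho * grad2
    # Back to mean and covariance: Sigma = (-2 eta2)^{-1}, mu = Sigma eta1
    Sigma = np.linalg.inv(-2.0 * eta2_new)  # the O(m^3) step
    mu = Sigma @ eta1_new
    return eta1_new, eta2_new, mu, Sigma
```

With `rho = 1` and the full data set as the batch, the step lands directly on Σ = (Kmm⁻¹ + κᵀΘκ)⁻¹ and µ = ½ Σ κᵀ y, which is the coordinate ascent update discussed in appendix A.4.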


On the quality of the approximation In other applications of variational inference to GP classification, one tries to approximate the posterior directly by a Gaussian q∗(f) which minimizes the Kullback-Leibler divergence between the variational distribution and the true posterior Hensman and Matthews (2015). On the other hand, in our paper, we apply variational inference to the augmented model, looking for the best distribution that factorizes in the Pólya-Gamma variables ωi and the original function f. This approach also yields a Gaussian approximation q(f) as a factor in the optimal density. Of course q(f) will be different from the 'optimal' q∗(f). We could however argue that asymptotically, in the limit of a large number of data, the predictions given by both densities may not be too different, as the posterior uncertainty for both densities should become small Opper and Archambeau (2009).

It would be interesting to see how the ELBOs of the two variational approaches, which both give a lower bound on the likelihood of the data, differ. Unfortunately, such a computation would require the knowledge of the optimal q∗(f). However, we can obtain some estimate of this difference when we assume that we use the same Gaussian density q(f) for both bounds as an approximation. In this case, we obtain

$$
\mathcal{L}_{\mathrm{orig}} - \mathcal{L}_{\mathrm{augmented}} = \mathbb{E}_{q(f)}\big[\mathrm{KL}(q(\omega) \,\|\, p(\omega \mid f, y))\big].
$$

This lower bound on the gap is small if on average the variational approximation q(ω) is close to the posterior p(ω|f, y).

For the sake of simplicity we consider here the non-sparse

case, i.e. the inducing points equal the training points (f=

u). However, it is straight-forward to extend the results also

to the sparse case.

We empirically investigate the quality of our approximation

in experiment 5.1.

Predictions The approximate posterior of the GP values and inducing variables is given by q(f, u) = p(f|u) q(u), where q(u) = N(u|µ, Σ) denotes the optimal variational distribution. To predict the latent function values f∗ at a test point x∗ we substitute our approximate posterior into the standard predictive distribution

$$
\begin{aligned}
p(f_* \mid y) &= \int p(f_* \mid f, u) \, p(f, u \mid y) \, df \, du \\
&\approx \int p(f_* \mid f, u) \, p(f \mid u) \, q(u) \, df \, du \\
&= \int p(f_* \mid u) \, q(u) \, du = \mathcal{N}(f_* \mid \mu_*, \sigma_*^2),
\end{aligned}
\tag{13}
$$

where the prediction mean is $\mu_* = K_{*m} K_{mm}^{-1} \mu$ and the variance is $\sigma_*^2 = K_{**} + K_{*m} K_{mm}^{-1} (\Sigma K_{mm}^{-1} - I) K_{m*}$. The matrix K∗m denotes the kernel matrix between the test point and the inducing points and K∗∗ the kernel value of the test point. The distribution of the test labels is easily computed by applying the logit link function to (13),

$$
p(y_* = 1 \mid y) = \int \sigma(f_*) \, p(f_* \mid y) \, df_*. \tag{14}
$$

This integral is analytically intractable but can be computed

numerically by quadrature methods. This is adequate and

fast since the integral is only one-dimensional.

Computing the mean and the variance of the predictive distribution has complexity O(m) and O(m²), respectively.
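The prediction step (13) and the one-dimensional integral (14) can be illustrated as follows. This is a hedged sketch: the text only states that the integral is computed "by quadrature methods", so the choice of Gauss-Hermite quadrature, the number of nodes `n_quad`, and the function name `predict_proba` are assumptions made for this example.

```python
import numpy as np

def predict_proba(K_star_m, k_star_star, K_mm, mu, Sigma, n_quad=32):
    """Predictive mean and variance (13) and class probability (14).

    K_star_m:    (t, m) kernel matrix between test points and inducing points
    k_star_star: (t,)   prior variances K_** of the test points
    """
    A = np.linalg.solve(K_mm, K_star_m.T).T  # rows K_*m K_mm^{-1}
    mean = A @ mu
    # sigma_*^2 = K_** + K_*m K_mm^{-1} (Sigma K_mm^{-1} - I) K_m*
    var = k_star_star + np.sum((A @ Sigma) * A, axis=1) - np.sum(A * K_star_m, axis=1)
    # Gauss-Hermite rule: integral of exp(-x^2) g(x) dx ~ sum_k w_k g(x_k)
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    f = mean[:, None] + np.sqrt(2.0 * var)[:, None] * nodes[None, :]
    proba = (1.0 / (1.0 + np.exp(-f))) @ weights / np.sqrt(np.pi)
    return mean, var, proba
```

Because the integrand is smooth and one-dimensional, a few dozen nodes are typically sufficient; the cost per test point is dominated by the solve against Kmm, which can be shared across test points.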

Optimization of the hyperparameters We select the optimal kernel hyperparameters by maximizing the marginal likelihood p(y|h), where h denotes the set of hyperparameters (this approach is called empirical Bayes Maritz and Lwin (1989)). We follow an approximate approach and optimize the fitted variational lower bound L(h) (10) as a function of h by alternating between optimization steps w.r.t. the variational parameters and the hyperparameters Mandt, Hoffman, and Blei (2016).

5 Experiments

We compare our proposed method, efficient Gaussian process classification (x-gpc), with the state-of-the-art methods svgpc Salimbeni, Eleftheriadis, and Hensman (2018), provided in the package GPflow³ Matthews et al. (2017), which builds on TensorFlow, and the EP approach epgpc by Hernández-Lobato and Hernández-Lobato (2016), implemented in R. All methods are applied to real-world datasets containing up to 11 million data points.

In all experiments a squared exponential covariance function

with a common length scale parameter for each dimension,

an amplitude parameter and an additive noise parameter is

used. The kernel hyperparameters are initialized to the same

values and optimized using Adam Kingma and Ba (2014),

while the inducing point locations are initialized via k-means++

Arthur and Vassilvitskii (2007) and kept fixed during train-

ing. The SVI based methods, x-gpc and svgpc, use an adap-

tive learning rate. All algorithms are run on a single CPU.
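For concreteness, a kernel of the kind described above could look as follows in Python. The exact parameterization is not given in the text, so this sketch assumes the common form k(x, x') = a · exp(−‖x − x'‖² / (2ℓ²)) with the noise added to the diagonal; the function name `se_kernel` and the argument names are hypothetical.

```python
import numpy as np

def se_kernel(X1, X2, lengthscale, amplitude, noise=0.0):
    """Squared exponential kernel with one shared length scale (assumed form).

    The additive noise term is applied only when X1 and X2 are the same array,
    i.e. when the full covariance matrix of one set of inputs is requested.
    """
    sq_dist = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clip tiny negative values caused by floating point cancellation
    K = amplitude * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale ** 2)
    if X1 is X2:
        K = K + noise * np.eye(X1.shape[0])
    return K
```

Calling `se_kernel(Z, Z, ...)` on the inducing inputs gives Kmm, and `se_kernel(X, Z, ...)` gives the cross-covariance Knm used throughout section 4.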

We experiment on 12 datasets from the OpenML website

and the UCI repository ranging from 768 to 11 million data

points. In the first experiment (section 5.1), we examine the

quality of the approximation provided by x-gpc. In the next
experiment (section 5.2), we evaluate the prediction performance and run
time of x-gpc, svgpc and epgpc on several real-world

datasets. Finally, in 5.3, we examine the sensitivity of all

methods to the number of inducing points.

5.1 Quality of the approximation

We empirically examine the quality of the variational approx-

imation provided by our method. In Fig. 1, we compare the

approximations to the true posterior obtained by employing

an asymptotically correct Gibbs sampler Polson and Scott

(2011); Linderman, Johnson, and Adams (2015). We com-

pare the posterior mean and variance as well as the prediction

probabilities with the ground truth. Since the Gibbs sampler does not scale to large datasets, we experiment on the small Diabetes dataset. In Fig. 1 we plot the approximated values vs. the ground truth. We find that our approximation is very close to the true posterior.

³ We use GPflow version 1.2.0.

Figure 1: Posterior mean (µ), variance (σ) and predictive marginals (p) of the Diabetes dataset. Each plot shows the MCMC ground truth on the x-axis and the estimated value of our model on the y-axis. Our approximation is very close to the ground truth.

5.2 Numerical comparison

Dataset (n, d)                        Metric   X-GPC          SVGPC          EPGPC
aXa (n = 36,974, d = 123)             Error    0.17 ± 0.07    0.17 ± 0.07    0.17 ± 0.07
                                      NLL      0.29 ± 0.13    0.36 ± 0.13    0.34 ± 0.13
                                      Time     47 ± 2.2       451 ± 7.8      214 ± 4.8
Bank Market. (n = 45,211, d = 43)     Error    0.14 ± 0.12    0.12 ± 0.12    0.12 ± 0.13
                                      NLL      0.27 ± 0.22    0.31 ± 0.26    0.33 ± 0.20
                                      Time     9 ± 1.5        205 ± 6.6      46 ± 3.5
Click Pred. (n = 399,482, d = 12)     Error    0.17 ± 0.00    0.17 ± 0.00    0.17 ± 0.01
                                      NLL      0.39 ± 0.07    0.46 ± 0.00    0.46 ± 0.01
                                      Time     4.5 ± 1.3      102 ± 3.0      8.1 ± 0.45
Cod RNA (n = 343,564, d = 8)          Error    0.04 ± 0.00    0.04 ± 0.00    0.04 ± 0.00
                                      NLL      0.11 ± 0.03    0.13 ± 0.00    0.12 ± 0.00
                                      Time     3.7 ± 0.13     115 ± 4.3      869 ± 5.2
Diabetes (n = 768, d = 8)             Error    0.23 ± 0.07    0.23 ± 0.06    0.24 ± 0.06
                                      NLL      0.47 ± 0.11    0.47 ± 0.10    0.48 ± 0.09
                                      Time     8.8 ± 0.12     150 ± 5.1      8 ± 0.45
Electricity (n = 45,312, d = 8)       Error    0.24 ± 0.06    0.26 ± 0.06    0.26 ± 0.06
                                      NLL      0.31 ± 0.17    0.53 ± 0.08    0.53 ± 0.06
                                      Time     8.2 ± 0.48     356 ± 6.9      13.5 ± 1.50
German (n = 1,000, d = 20)            Error    0.25 ± 0.12    0.25 ± 0.11    0.26 ± 0.13
                                      NLL      0.44 ± 0.17    0.51 ± 0.15    0.53 ± 0.11
                                      Time     17 ± 0.42      374 ± 7.3      5.2 ± 0.03
Higgs (n = 11,000,000, d = 28)        Error    0.33 ± 0.01    0.45 ± 0.01    0.38 ± 0.01
                                      NLL      0.55 ± 0.13    0.69 ± 0.00    0.66 ± 0.00
                                      Time     23 ± 0.88      294 ± 54       8732 ± 867
IJCNN (n = 141,691, d = 22)           Error    0.03 ± 0.01    0.06 ± 0.01    0.02 ± 0.01
                                      NLL      0.10 ± 0.03    0.15 ± 0.07    0.09 ± 0.04
                                      Time     17 ± 0.44      1033 ± 45      756 ± 8.6
Mnist (n = 70,000, d = 780)           Error    0.14 ± 0.01    0.44 ± 0.13    0.12 ± 0.01
                                      NLL      0.24 ± 0.10    0.66 ± 0.11    0.27 ± 0.01
                                      Time     200 ± 5.5      991 ± 23       806 ± 5.2
Shuttle (n = 58,000, d = 9)           Error    0.01 ± 0.01    0.01 ± 0.00    0.01 ± 0.01
                                      NLL      0.07 ± 0.01    0.07 ± 0.00    0.07 ± 0.01
                                      Time     0.01 ± 0.00    7.5 ± 0.7      100 ± 0.63
SUSY (n = 5,000,000, d = 18)          Error    0.21 ± 0.00    0.22 ± 0.00    0.22 ± 0.00
                                      NLL      0.31 ± 0.10    0.49 ± 0.01    0.50 ± 0.00
                                      Time     14 ± 0.29      10,000         10,000
wXa (n = 34,780, d = 300)             Error    0.03 ± 0.01    0.04 ± 0.01    0.03 ± 0.01
                                      NLL      0.27 ± 0.07    0.25 ± 0.07    0.19 ± 0.06
                                      Time     66 ± 16        612 ± 11       1.4 ± 0.10

Table 1: Average test prediction error, negative test log-

likelihood (NLL) and time in seconds along with one stan-

dard deviation. Best values are highlighted.

We evaluate the prediction performance and run time of

our method x-gpc and the competing methods svgpc and

epgpc. We experiment on a variety of different datasets

and report the resulting prediction error, negative test log-

likelihood and run time for each method in table 1.

The experiments are conducted as follows. For each dataset

we perform a 10-fold cross-validation and for datasets with

more than 1 million points, we limit the test set to 100,000

points. We report the average prediction error, the negative

test log-likelihood (14) and the run time along with one stan-

dard deviation. For all datasets, we use 100 inducing points

and a mini-batch size of 100 points.

For x-gpc we find that the following simple convergence criterion on the global parameters leads to good results: a sliding window average being smaller than a threshold of 10⁻⁴. Unfortunately, the original implementations of svgpc and epgpc do not include a convergence criterion. We find that the trajectories of the global parameters of svgpc tend to be noisy, and using a convergence criterion on the global parameters often leads to poor results. To have a fair comparison, we therefore monitor the convergence of the prediction performance on a hold-out set and use a sliding window average of size 5 and threshold 10⁻³ as convergence criterion for all methods.
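A sliding-window criterion of this kind can be sketched as below. The text does not specify which quantity is averaged, so this example assumes the absolute change of the monitored value between successive evaluations; the class name `SlidingWindowConvergence` is hypothetical.

```python
from collections import deque

class SlidingWindowConvergence:
    """Signal convergence when the windowed mean change drops below a threshold.

    Assumption: the averaged quantity is |value_t - value_{t-1}| of a monitored
    scalar, e.g. the hold-out prediction error.
    """

    def __init__(self, window=5, threshold=1e-3):
        self.changes = deque(maxlen=window)
        self.threshold = threshold
        self.previous = None

    def update(self, value):
        """Record a new monitored value; return True once converged."""
        if self.previous is not None:
            self.changes.append(abs(value - self.previous))
        self.previous = value
        window_full = len(self.changes) == self.changes.maxlen
        return window_full and sum(self.changes) / len(self.changes) < self.threshold
```

Requiring a full window before reporting convergence avoids stopping on the first few evaluations, where a single small change could otherwise trigger the criterion.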

We observe that x-gpc is about one to two orders of mag-

nitude faster than svgpc and epgpc on most datasets. Only

on the dataset wXa, epgpc is slightly faster than x-gpc. The

prediction error is similar for all methods but x-gpc outper-

forms the competitors in terms of the test log-likelihood on

most datasets (aXa, Bank Marketing, Click Prediction, Cod

RNA, Diabetes, Electricity, German, Higgs, Mnist, SUSY).

This means that the confidence levels in the predictions are

better calibrated for x-gpc, i.e. when predicting a wrong

label svgpc and epgpc tend to be more confident than x-

gpc.

Performance as a function of time Since all considered

methods are based on optimization schemes, there is a

trade-off between the run time of the algorithm and the pre-

diction performance. We make this trade-off transparent

by plotting the prediction performance as function of time

on each dataset. For each method we monitor on a 10-

fold cross-validation the average negative test log-likelihood

and prediction error on a hold-out test set as a function of

time.

The results are displayed in Fig. 2 for three selected datasets,

while the results for the remaining datasets are deferred to

appendix A.1. For all datasets we observe that after a few

iterations x-gpc is already close to the optimum due to its

efficient closed form natural gradient updates. Both the pre-

diction error and test log-likelihood converge around one to

two orders of magnitude faster for x-gpc than for svgpc and

epgpc. Moreover, the performance curves tend to be nois-

ier for svgpc than for x-gpc and epgpc. For the datasets

HIGGS and IJCNN, epgpc leads to slightly better final prediction performance, but at the cost of a runtime up to 4 orders of magnitude longer than that of x-gpc (approx. 28 hours vs. 9 and 435 seconds, respectively).


Figure 2: Average negative test log-likelihood and average test prediction error as a function of training time (seconds on a log10 scale) on the datasets Electricity (45,312 points), Cod RNA (343,564 points) and SUSY (5 million points). x-gpc (proposed) reaches values close to the optimum after only a few iterations, whereas svgpc and epgpc are one to two orders of magnitude slower.

Figure 3: Prediction error as a function of training time (on a log10 scale) for the Shuttle dataset. Different numbers of inducing points are considered, M = 16, 32, 64, 128. x-gpc (proposed) converges the fastest in all settings. Using only 32 inducing points is enough to obtain almost optimal prediction performance for all methods, but svgpc becomes unstable with fewer than 128 inducing points.

All three methods are implemented in different programming frameworks (x-gpc in Julia, svgpc in TensorFlow and epgpc in R), which leads to implementations of differing efficiency.

However, we find that the main speed-up of our method is due

to the efficient natural gradient updates and only marginally

related to the usage of a different programming language.

To check this we implemented epgpc also in Julia and ob-

tained similar runtimes. Since svgpc is part of the highly

optimized GPflow package we only used the original imple-

mentation.

5.3 Inducing points

We examine the effect of different numbers of inducing

points on the prediction performance and run time. For all

methods we compare different numbers of inducing points:

M= 16,32,64,128. For each setting, we perform a 10-fold

cross validation on the Shuttle dataset and plot the mean pre-

diction error as function of time. The results are displayed

in Fig. 3. We observe that the higher the number of inducing

points, the better the prediction performance, but the longer

the run time. Throughout all settings of inducing points our method is consistently around one to two orders of magnitude faster than the competitors. On the Shuttle dataset, using only M = 32 inducing points is enough, and performance improves only marginally with more inducing points for all methods. However, the performance curves of svgpc are unstable when using fewer than 128 inducing points.

6 Conclusions

We proposed an efficient Gaussian process classification

method that builds on Pólya-Gamma data augmentation and

inducing points. The experimental evaluation shows that

our method is up to two orders of magnitude faster than the

state-of-the-art approach while being competitive in terms

of prediction performance. Speed improvements are due to

the Pólya-Gamma data augmentation approach that enables

efficient second order optimization.

The presented work shows how data augmentation can speed

up variational approximation of GPs. Our analysis may

pave the way for using data augmentation to derive effi-

cient stochastic variational algorithms also for variational

Bayesian models other than GPs. Furthermore, future work

may aim at extending the approach to multi-class and multi-

label classification.

Acknowledgements We thank Stephan Mandt, James

Hensman and Scott W. Linderman for fruitful discussions.

This work was partly funded by the German Research Foundation (DFG) awards KL 2698/2-1 and GRK1589/2 and by the Federal Ministry of Science and Education (BMBF) awards 031L0023A, 01IS18051A.

References

Amari, S., and Nagaoka, H. 2007. Methods of Information Geom-

etry. American Mathematical Society.

Amari, S. 1998. Natural gradient works efficiently in learning. Neural
Computation.

Arthur, D., and Vassilvitskii, S. 2007. k-means++: The advan-

tages of careful seeding. In Proceedings of the eighteenth annual

ACM-SIAM symposium on Discrete algorithms, 1027–1035. So-

ciety for Industrial and Applied Mathematics.

Dezfouli, A., and Bonilla, E. V. 2015. Scalable inference for gaus-

sian process models with black-box likelihoods. In NIPS, 1414–

1422.

Dragiev, S.; Toussaint, M.; and Gienger, M. 2011. Gaussian process

implicit surfaces for shape estimation and grasping. In Robotics

and Automation (ICRA), 2845–2850.

Gibbs, M. N., and MacKay, D. J. C. 2000. Variational Gaus-

sian process classifiers. IEEE Transactions on Neural Networks

11(6):1458–1464.

Hensman, J., and Matthews, A. 2015. Scalable Variational Gaus-

sian Process Classification. In AISTATS.

Hernández-Lobato, D., and Hernández-Lobato, J. M. 2016. Scal-

able gaussian process classification via expectation propagation.

In AISTATS.

Hoffman, M. D.; Blei, D. M.; Wang, C.; and Paisley, J. 2013.

Stochastic Variational Inference. Journal of Machine Learning

Research.

Honkela, A.; Raiko, T.; Kuusela, M.; Tornio, M.; and Karhunen,

J. 2010. Approximate riemannian conjugate gradient learning

for fixed-form variational bayes. Journal of Machine Learning

Research 11.

Izmailov, P.; Novikov, A.; and Kropotov, D. 2018. Scalable gaus-

sian processes with billions of inducing inputs via tensor train

decomposition. In AISTATS, 726–735.

Jähnichen, P.; Wenzel, F.; Kloft, M.; and Mandt, S. 2018. Scalable

generalized dynamic topic models. In AISTATS.

John Walker, S. 2014. Big data: A revolution that will transform

how we live, work, and think. Taylor & Francis.

Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1999.

An Introduction to Variational Methods for Graphical Models.

Machine Learning.

Khan, M. E., and Nielsen, D. 2018. Fast yet simple natural-

gradient descent for variational inference in complex models.

Arxiv Preprint.

Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic

optimization. CoRR abs/1412.6980.

Linderman, S. W.; Johnson, M. J.; and Adams, R. P. 2015. De-

pendent multinomial models made easy: Stick-breaking with the

polya-gamma augmentation. In NIPS.

Mandt, S.; Wenzel, F.; Nakajima, S.; Cunningham, J. P.; Lippert,

C.; and Kloft, M. 2017. Sparse Probit Linear Mixed Model.

Machine Learning Journal.

Mandt, S.; Hoffman, M.; and Blei, D. 2016. A Variational Analysis

of Stochastic Gradient Algorithms. ICML.


Maritz, J., and Lwin, T. 1989. Empirical Bayes Methods with

Applications. Monographs on Statistics and Applied Probability.

Martens, J. 2017. New insights and perspectives on the natural

gradient method. Arxiv Preprint.

Matthews, A. G. d. G.; van der Wilk, M.; Nickson, T.; Fujii, K.;

Boukouvalas, A.; León-Villagrá, P.; Ghahramani, Z.; and Hens-

man, J. 2017. GPflow: A Gaussian process library using Ten-

sorFlow. Journal of Machine Learning Research.

Opper, M., and Archambeau, C. 2009. The variational gaussian

approximation revisited. Neural Comput. 21(3):786–792.

Polson, N. G., and Scott, S. L. 2011. Data augmentation for support

vector machines. Bayesian Anal.

Polson, N. G.; Scott, J. G.; and Windle, J. 2013. Bayesian inference

for logistic models using pólya–gamma latent variables. Journal

of the American Statistical Association 108(504):1339–1349.

Ranganath, R.; Wang, C.; Blei, D. M.; and Xing, E. P. 2013.

An Adaptive Learning Rate for Stochastic Variational Inference.

ICML.

Rasmussen, C. E., and Williams, C. K. I. 2005. Gaussian Pro-

cesses for Machine Learning (Adaptive Computation and Ma-

chine Learning). The MIT Press.

Salimbeni, H.; Eleftheriadis, S.; and Hensman, J. 2018. Natu-

ral gradients in practice: Non-conjugate variational inference in

gaussian process models. In AISTATS.

Scott, J. G., and Sun, L. 2013. Expectation-maximization for lo-

gistic regression. arXiv preprint arXiv:1306.0040.

Snelson, E., and Ghahramani, Z. 2006. Sparse GPs using Pseudo-

inputs. NIPS.

Stein, M. L. 2012. Interpolation of spatial data: some theory for

kriging. Springer Science & Business Media.

Wainwright, M. J., and Jordan, M. I. 2008. Graphical models,

exponential families, and variational inference. Found. Trends

Mach. Learn. 1–305.

Wenzel, F.; Galy-Fajou, T.; Deutsch, M.; and Kloft, M. 2017.

Bayesian nonlinear support vector machines for big data. In Pro-

ceedings of the European Conference on Machine Learning and

Principles and Practice of Knowledge Discovery in Databases.

A Appendix

A.1 Additional performance plots

We show all time vs. prediction performance plots for the datasets presented in table 1 in section 5.2 which could not be included in

the main paper due to space limitations.

Figure 4: Average negative test log-likelihood and average test prediction error as function of training time measured in seconds

(on a log10 scale).


Figure 5: Average negative test log-likelihood and average test prediction error as function of training time measured in seconds

(on a log10 scale). For the dataset Higgs, epgpc exceeded the time budget of 10⁵ seconds (≈ 28 h).

Figure 6: Average negative test log-likelihood and average test prediction error as function of training time measured in seconds

(on a log10 scale).

A.2 Variational bound

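We provide details of the derivation of the variational bound (10) which is defined as

$$
\mathcal{L}(c, \mu, \Sigma) = \mathbb{E}_{p(f \mid u) q(u) q(\omega)}[\log p(y \mid \omega, f)] - \mathrm{KL}(q(u, \omega) \,\|\, p(u, \omega)),
$$

and the family of variational distributions

$$
q(u, \omega) = q(u) \prod_i q(\omega_i) = \mathcal{N}(u \mid \mu, \Sigma) \prod_i \mathrm{PG}(\omega_i \mid 1, c_i).
$$

Considering the likelihood term we obtain

$$
\begin{aligned}
\mathbb{E}_{p(f \mid u)}[\log p(y \mid \omega, f)]
&\overset{c}{=} \frac{1}{2} \mathbb{E}_{p(f \mid u)}\big[ y^\top f - f^\top \Omega f \big] \\
&= \frac{1}{2} \Big( y^\top K_{nm} K_{mm}^{-1} u - \mathrm{tr}(\Omega \widetilde{K}) - u^\top K_{mm}^{-1} K_{mn} \Omega K_{nm} K_{mm}^{-1} u \Big).
\end{aligned}
$$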


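Computing the expectations w.r.t. the variational distributions gives

$$
\begin{aligned}
\mathbb{E}_{p(f \mid u) q(u) q(\omega)}[\log p(y \mid \omega, f)]
&\overset{c}{=} \frac{1}{2} \mathbb{E}_{q(u) q(\omega)}\Big[ y^\top K_{nm} K_{mm}^{-1} u - \mathrm{tr}(\Omega \widetilde{K}) - u^\top K_{mm}^{-1} K_{mn} \Omega K_{nm} K_{mm}^{-1} u \Big] \\
&= \frac{1}{2} \mathbb{E}_{q(u)}\Big[ y^\top K_{nm} K_{mm}^{-1} u - \mathrm{tr}(\Theta \widetilde{K}) - u^\top K_{mm}^{-1} K_{mn} \Theta K_{nm} K_{mm}^{-1} u \Big] \\
&= \frac{1}{2} \Big[ y^\top K_{nm} K_{mm}^{-1} \mu - \mathrm{tr}(\Theta \widetilde{K}) - \mathrm{tr}(K_{mm}^{-1} K_{mn} \Theta K_{nm} K_{mm}^{-1} \Sigma) - \mu^\top K_{mm}^{-1} K_{mn} \Theta K_{nm} K_{mm}^{-1} \mu \Big] \\
&= \frac{1}{2} \sum_i \Big( y_i \kappa_i \mu - \theta_i \widetilde{K}_{ii} - \theta_i \kappa_i \Sigma \kappa_i^\top - \theta_i \mu^\top \kappa_i^\top \kappa_i \mu \Big),
\end{aligned}
$$

where $\theta_i = \mathbb{E}_{q(\omega_i)}[\omega_i] = \frac{1}{2 c_i} \tanh\frac{c_i}{2}$, Θ = diag(θ) and $\kappa_i = K_{im} K_{mm}^{-1}$.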
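The value of θi can be checked with a short calculation. The sketch below assumes the Laplace transform of the PG(1, 0) distribution, $\mathbb{E}_{\mathrm{PG}(\omega \mid 1, 0)}[e^{-t\omega}] = 1/\cosh\sqrt{t/2}$, which is the identity behind the normalising factor cosh(ci/2) of q(ωi); the shorthand t = ci²/2 is introduced only for this derivation.

```latex
\begin{aligned}
\mathbb{E}_{q(\omega_i)}[\omega_i]
&= -\frac{d}{dt} \log \mathbb{E}_{\mathrm{PG}(\omega_i \mid 1, 0)}\big[ e^{-t \omega_i} \big] \Big|_{t = c_i^2/2}
 = \frac{d}{dt} \log\cosh\sqrt{t/2} \, \Big|_{t = c_i^2/2} \\
&= \frac{dc_i}{dt} \, \frac{d}{dc_i} \log\cosh\frac{c_i}{2}
 = \frac{1}{c_i} \cdot \frac{1}{2} \tanh\frac{c_i}{2}
 = \frac{1}{2 c_i} \tanh\frac{c_i}{2}.
\end{aligned}
```

The first equality holds because q(ωi) is an exponentially tilted PG(1, 0) density, so its mean is the negative derivative of the log-normaliser with respect to t; the chain rule uses dt = ci dci.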

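The Kullback-Leibler divergence between the Gaussian distributions q(u) and p(u) is easily computed

$$
\mathrm{KL}(q(u) \,\|\, p(u)) \overset{c}{=} \frac{1}{2} \Big( \mathrm{tr}(K_{mm}^{-1} \Sigma) + \mu^\top K_{mm}^{-1} \mu - \log|\Sigma| + \log|K_{mm}| \Big).
$$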

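The Kullback-Leibler divergence regarding the Pólya-Gamma variables can also be computed in closed form. With $q(\omega_i) = \cosh\big(\tfrac{c_i}{2}\big) \exp\big(-\tfrac{c_i^2}{2} \omega_i\big) \mathrm{PG}(\omega_i \mid 1, 0)$ and $p(\omega_i) = \mathrm{PG}(\omega_i \mid 1, 0)$ we obtain

$$
\begin{aligned}
\mathrm{KL}(q(\omega) \,\|\, p(\omega))
&= \mathbb{E}_{q(\omega)}[\log q(\omega) - \log p(\omega)] \\
&= \sum_i \Big( \mathbb{E}_{q(\omega_i)}\Big[ \log\Big( \cosh\big(\tfrac{c_i}{2}\big) \exp\big(-\tfrac{c_i^2}{2} \omega_i\big) \mathrm{PG}(\omega_i \mid 1, 0) \Big) \Big] - \mathbb{E}_{q(\omega_i)}[\log \mathrm{PG}(\omega_i \mid 1, 0)] \Big) \\
&= \sum_i \Big( \log\cosh\frac{c_i}{2} - \frac{c_i}{4} \tanh\frac{c_i}{2} + \mathbb{E}_{q(\omega_i)}[\log \mathrm{PG}(\omega_i \mid 1, 0)] - \mathbb{E}_{q(\omega_i)}[\log \mathrm{PG}(\omega_i \mid 1, 0)] \Big) \\
&= \sum_i \Big( \log\cosh\frac{c_i}{2} - \frac{c_i}{4} \tanh\frac{c_i}{2} \Big).
\end{aligned}
$$

Remarkably, the intractable expectations cancel out, which would not have been the case if we had assumed PG(ωi|bi, ci) as variational family. In section 4.2 we have shown that the restricted family bi = 1 contains the optimal distribution.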

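Summing all terms results in the final lower bound

$$
\begin{aligned}
\mathcal{L}(c, \mu, \Sigma) \overset{c}{=} \frac{1}{2} \Big( & \log|\Sigma| - \log|K_{mm}| - \mathrm{tr}(K_{mm}^{-1} \Sigma) - \mu^\top K_{mm}^{-1} \mu \\
& + \sum_i \Big\{ y_i \kappa_i \mu - \theta_i \widetilde{K}_{ii} - \theta_i \kappa_i \Sigma \kappa_i^\top - \theta_i \mu^\top \kappa_i^\top \kappa_i \mu + c_i^2 \theta_i - 2 \log\cosh\frac{c_i}{2} \Big\} \Big).
\end{aligned}
$$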

A.3 Variational updates

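Local parameters The derivative of the variational bound (10) w.r.t. the local parameter ci is

$$
\begin{aligned}
\frac{d\mathcal{L}}{dc_i}
&= \frac{1}{2} \frac{d}{dc_i} \Big\{ \theta_i \big( -\widetilde{K}_{ii} - \kappa_i \Sigma \kappa_i^\top - \mu^\top \kappa_i^\top \kappa_i \mu + c_i^2 \big) - 2 \log\cosh\frac{c_i}{2} \Big\} \\
&= \frac{1}{2} \frac{d}{dc_i} \Big\{ \frac{1}{2 c_i} \tanh\frac{c_i}{2} \big( -\widetilde{K}_{ii} - \kappa_i \Sigma \kappa_i^\top - \mu^\top \kappa_i^\top \kappa_i \mu + c_i^2 \big) - 2 \log\cosh\frac{c_i}{2} \Big\} \\
&= \frac{d}{dc_i} \Big\{ \frac{1}{4 c_i} \tanh\frac{c_i}{2} \underbrace{\big( -\widetilde{K}_{ii} - \kappa_i \Sigma \kappa_i^\top - \mu^\top \kappa_i^\top \kappa_i \mu \big)}_{=: -A_i} + \frac{c_i}{4} \tanh\frac{c_i}{2} - \log\cosh\frac{c_i}{2} \Big\} \\
&= \Big( \frac{A_i}{4 c_i^2} - \frac{1}{4} \Big) \tanh\frac{c_i}{2} - \frac{1}{2} \Big( \frac{A_i}{4 c_i} - \frac{c_i}{4} \Big) \Big( 1 - \tanh^2\frac{c_i}{2} \Big) \\
&= U(c_i) \Big( \tanh\frac{c_i}{2} - \frac{c_i}{2} \Big( 1 - \tanh^2\frac{c_i}{2} \Big) \Big),
\end{aligned}
$$

where $U(c_i) = \frac{A_i}{4 c_i^2} - \frac{1}{4}$ (in the non-sparse case f = u this reduces to $A_i = \Sigma_{ii} + \mu_i^2$).

The gradient equals zero in two cases. First, in the case U(ci) = 0, which leads to⁴

$$
c_i = \sqrt{\widetilde{K}_{ii} + \kappa_i \Sigma \kappa_i^\top + \mu^\top \kappa_i^\top \kappa_i \mu},
$$

which is always well defined since Σ and K̃ are positive definite matrices and $\mu^\top \kappa_i^\top \kappa_i \mu = (\kappa_i \mu)^2 \ge 0$. The second case is that the second factor of the product is zero, which leads to ci = 0. The second derivative reveals that the first case always corresponds to a maximum and the second case to a minimum.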
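The stationary point can be checked numerically on a grid. The following Python sketch evaluates the ci-dependent part of the bound (10) for a single data point and compares the grid maximiser with the closed form; the value chosen for A_i (the sum K̃ii + κiΣκiᵀ + µᵀκiᵀκiµ) is arbitrary and the helper name `local_objective` is hypothetical.

```python
import math

def local_objective(c, A):
    """c-dependent part of the bound (10) for one data point, given A = A_i."""
    theta = math.tanh(c / 2.0) / (2.0 * c)
    return 0.5 * (theta * (c ** 2 - A) - 2.0 * math.log(math.cosh(c / 2.0)))

A = 2.7                                   # arbitrary positive value of A_i
c_star = math.sqrt(A)                     # closed-form maximiser, update (11)
grid = [0.05 * k for k in range(1, 200)]  # candidate values of c in (0, 10)
c_best = max(grid, key=lambda c: local_objective(c, A))
```

On this grid, `c_best` is the grid point next to `c_star`, and no grid value exceeds the objective at `c_star`, in line with the sign analysis of the derivative above.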

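Global parameters We first compute the Euclidean gradients of the variational bound (10) w.r.t. the global parameters µ and Σ. We obtain

$$
\begin{aligned}
\frac{d\mathcal{L}}{d\mu}
&= \frac{1}{2} \frac{d}{d\mu} \Big( -\mu^\top K_{mm}^{-1} \mu + y^\top \kappa \mu - \mu^\top \kappa^\top \Theta \kappa \mu \Big) \\
&= \frac{1}{2} \Big( -2 K_{mm}^{-1} \mu + \kappa^\top y - 2 \kappa^\top \Theta \kappa \mu \Big) \\
&= -\big( K_{mm}^{-1} + \kappa^\top \Theta \kappa \big) \mu + \frac{1}{2} \kappa^\top y,
\end{aligned}
\tag{15}
$$

and

$$
\begin{aligned}
\frac{d\mathcal{L}}{d\Sigma}
&= \frac{1}{2} \frac{d}{d\Sigma} \Big( \log|\Sigma| - \mathrm{tr}(K_{mm}^{-1} \Sigma) - \mathrm{tr}(\kappa^\top \Theta \kappa \Sigma) \Big) \\
&= \frac{1}{2} \Big( \Sigma^{-1} - K_{mm}^{-1} - \kappa^\top \Theta \kappa \Big).
\end{aligned}
\tag{16}
$$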

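We now compute the natural gradients w.r.t. the natural parameterization of the variational Gaussian distribution, i.e. the parameters $\eta_1 := \Sigma^{-1} \mu$ and $\eta_2 = -\frac{1}{2} \Sigma^{-1}$. For a Gaussian distribution, properties of the Fisher information matrix expose the simplification that the natural gradient w.r.t. the natural parameters can be expressed in terms of the Euclidean gradient w.r.t. the mean and covariance parameters. It holds that

$$
\widetilde{\nabla}_{(\eta_1, \eta_2)} \mathcal{L}(\eta) = \big( \nabla_\mu \mathcal{L}(\eta) - 2 \nabla_\Sigma \mathcal{L}(\eta) \mu, \; \nabla_\Sigma \mathcal{L}(\eta) \big), \tag{17}
$$

where $\widetilde{\nabla}$ denotes the natural gradient and ∇ the Euclidean gradient. Substituting the Euclidean gradients (15) and (16) into equation (17) we obtain the natural gradients

$$
\widetilde{\nabla}_{\eta_2} \mathcal{L} = \frac{1}{2} \Big( -2 \eta_2 - K_{mm}^{-1} - \kappa^\top \Theta \kappa \Big) = -\eta_2 - \frac{1}{2} \big( K_{mm}^{-1} + \kappa^\top \Theta \kappa \big)
$$

and

$$
\begin{aligned}
\widetilde{\nabla}_{\eta_1} \mathcal{L}
&= -\big( K_{mm}^{-1} + \kappa^\top \Theta \kappa \big) \big( -\tfrac{1}{2} \eta_2^{-1} \eta_1 \big) + \frac{1}{2} \kappa^\top y - 2 \Big( -\eta_2 - \frac{1}{2} \big( K_{mm}^{-1} + \kappa^\top \Theta \kappa \big) \Big) \big( -\tfrac{1}{2} \eta_2^{-1} \eta_1 \big) \\
&= \frac{1}{2} \kappa^\top y - \eta_1.
\end{aligned}
$$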

A.4 Natural gradient and coordinate ascent updates

If the full conditional distributions and the corresponding variational distribution belong to the same exponential family it is known in variational inference that "we can compute the natural gradient by computing the coordinate updates in parallel and subtracting the current setting of the parameter" Hoffman et al. (2013). In our setting it is not clear if this relation holds since we do not consider the classic ELBO but a lower bound on it due to (8). Interestingly, the lower bound (8) does not break this property and our natural gradient updates correspond to coordinate ascent updates as we show in the following. Setting the Euclidean gradient (16) to zero and using the natural parameterization gives

$$
\eta_2 = -\frac{1}{2} \Sigma^{-1} = -\frac{1}{2} \big( K_{mm}^{-1} + \kappa^\top \Theta \kappa \big). \tag{18}
$$

Setting (15) to zero yields

$$
\mu = \frac{1}{2} \big( K_{mm}^{-1} + \kappa^\top \Theta \kappa \big)^{-1} \kappa^\top y.
$$

Substituting the update (18) from above and using the natural parameterization results in

$$
\eta_1 = \frac{1}{2} \kappa^\top y.
$$

This shows that using learning rate one in our natural gradient ascent scheme corresponds to employing coordinate ascent updates in the Euclidean parameter space.

⁴ We omit the negative solution since PG(b, c) = PG(b, −c).


A.5 Variational bound by Gibbs and MacKay

When using the full GP representation in our model and not the sparse approximation, the bound in our model is equal to the bound used by

Gibbs and MacKay (2000). We provide a proof in the following.

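Applying our variational inference approach to the joint distribution (5) gives the variational bound

$$
\begin{aligned}
\log p(y \mid f)
&\ge \mathbb{E}_{q(\omega)}[\log p(y \mid f, \omega)] - \mathrm{KL}(q(\omega) \,\|\, p(\omega)) \\
&= \mathbb{E}_{q(\omega)}\Big[ \frac{1}{2} y^\top f - \frac{1}{2} f^\top \Omega f \Big] - n \log(2) - \mathrm{KL}(q(\omega) \,\|\, p(\omega)) \\
&= \frac{1}{2} y^\top f - \frac{1}{2} f^\top \Theta f - n \log(2) + \sum_{i=1}^n \Big( \frac{c_i^2}{2} \theta_i - \log\cosh(c_i/2) \Big).
\end{aligned}
$$

Gibbs and MacKay (2000) employ the following inequality on the logit link

$$
\sigma(z) \ge \sigma(c) \exp\Big( \frac{z - c}{2} - \frac{\sigma(c) - 1/2}{2c} (z^2 - c^2) \Big).
$$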

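Using this bound in the setting of GP classification yields the following lower bound on the log-likelihood,

$$
\begin{aligned}
\log p(y \mid f) &= \sum_{i=1}^n \log \sigma(y_i f_i) \\
&\ge \sum_{i=1}^n \Big( \log \sigma(c_i) + \frac{y_i f_i - c_i}{2} - \frac{\sigma(c_i) - 1/2}{2 c_i} \big( (y_i f_i)^2 - c_i^2 \big) \Big) \\
&= \sum_{i=1}^n \Big( -\log\cosh(c_i/2) - \log(2) + \frac{y_i f_i}{2} - \frac{\sigma(c_i) - 1/2}{2 c_i} \big( f_i^2 - c_i^2 \big) \Big) \\
&= \sum_{i=1}^n \Big( -\log\cosh(c_i/2) - \log(2) + \frac{y_i f_i}{2} - \frac{1}{4 c_i} \tanh(c_i/2) \big( f_i^2 - c_i^2 \big) \Big) \\
&= \sum_{i=1}^n \Big( -\log\cosh(c_i/2) - \log(2) + \frac{y_i f_i}{2} - \frac{1}{2} \theta_i \big( f_i^2 - c_i^2 \big) \Big) \\
&= \frac{1}{2} y^\top f - \frac{1}{2} f^\top \Theta f - n \log(2) + \sum_{i=1}^n \Big( \frac{c_i^2}{2} \theta_i - \log\cosh(c_i/2) \Big),
\end{aligned}
$$

where we made use of the fact that σ(x) − 1/2 = tanh(x/2)/2. This concludes the proof.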
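The inequality used in the proof is easy to sanity-check numerically. The Python sketch below evaluates the right-hand side of the bound on σ(z) for a few values of the variational parameter c and records the gap to σ(z); the grid of z and c values and the function names are arbitrary choices made for this illustration.

```python
import math

def sigmoid(x):
    """Logistic function sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def gibbs_mackay_bound(z, c):
    """Right-hand side of the lower bound on sigmoid(z) with parameter c > 0."""
    lam = (sigmoid(c) - 0.5) / (2.0 * c)
    return sigmoid(c) * math.exp((z - c) / 2.0 - lam * (z ** 2 - c ** 2))

zs = [-4.0 + 0.25 * k for k in range(33)]  # z in [-4, 4]
cs = [0.5, 1.0, 2.0, 3.0]
# Gap sigma(z) - bound(z, c); it should be non-negative everywhere
gaps = [sigmoid(z) - gibbs_mackay_bound(z, c) for z in zs for c in cs]
```

The gap vanishes at z = ±c, which is why optimising c for each data point makes the bound tight at the current value of the latent function.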

Multi-Class Gaussian Process

Classification Made Conjugate:

Efficient Inference via Data

Augmentation

After the binary classification problem, a natural extension is the multi-class classification setting. By drawing inspiration from Donner and Opper [12], we use new augmentation methods to circumvent the problem of a much more complex likelihood function involving multiple latent GPs. More specifically, we introduce a new link, the logistic-softmax function. We turn the model into a fully conditionally-conjugate model with three successive augmentations. A thorough analysis is made to compare this new model with other links and approaches, including standard choices like the softmax link.

Note that an extensive discussion about this model is given in Chapter 7 with potential model

extensions and solutions to some problems faced in the paper.

Authors:

Théo Galy-Fajou¹,∗, Florian Wenzel¹,∗, Christian Donner¹, Manfred Opper¹

∗ Equal Contribution, ¹ TU Berlin, Germany, ² TU Kaiserslautern, Germany

Details:

Type: Conference article Submitted: January 2019

Accepted: May 2019

URL: http://auai.org/uai2019/proceedings/papers/264.pdf

Conference: UAI 2019

License: Creative Commons Attribution (CC BY 4.0)


Contributions:

For an explanation of the terms see the Contributor Roles Taxonomy (CRediT)

T.G-F. F.W. C.D. M.O.

Conceptualization ✓ ✓ ✓

Methodology ✓

Formal Analysis ✓ ✓ ✓ ✓

Implementation ✓

Investigation ✓

Writing - Original Draft ✓ ✓ ✓

Writing - Review & Editing ✓ ✓ ✓ ✓

Supervision ✓

Funding Acquisition ✓


Automated Augmented Conjugate

Inference for Non-conjugate

Gaussian Process Models

The larger question following the work on Pólya-Gamma variables and other augmentation works such as Nguyen and Wu [41] or Henao et al. [20] is: What likelihoods have a scale mixture representation? This article, extending the work of Palmer [43], partially answers this question by finding a class of functions, the positive-definite radial functions, guaranteed to be interpretable as scale mixtures of Gaussians. The paper also provides an algorithm to directly infer the CAVI updates and Gibbs sampling algorithm from the likelihood.

Authors:

Théo Galy-Fajou1, Florian Wenzel2, Manfred Opper1

1TU Berlin, Germany, 2Google Research

Details:

Type: Conference article Submitted: October 2019

Accepted: December 2019

URL: https://proceedings.mlr.press/v108/galy-fajou20a.html

Conference: AISTATS 2020

License: Creative Commons Attribution (CC BY 4.0)

5. Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process

Models

Contributions:

For an explanation of the terms, see the Contributor Roles Taxonomy (CRediT).

T.G-F. F.W. M.O.

Conceptualization ✓ ✓ ✓

Methodology ✓

Formal Analysis ✓ ✓ ✓

Software ✓

Investigation ✓

Writing - Original Draft ✓ ✓

Writing - Review & Editing ✓ ✓ ✓

Supervision ✓

Funding Acquisition ✓

Automated Augmented Conjugate Inference

for Non-conjugate Gaussian Process Models

Théo Galy-Fajou (Technical University of Berlin), Florian Wenzel (Google Research∗), Manfred Opper (Technical University of Berlin)

Abstract

We propose automated augmented conjugate inference, a new inference method for non-conjugate Gaussian process (GP) models. Our method automatically constructs an auxiliary variable augmentation that renders the GP model conditionally conjugate. Building on the conjugate structure of the augmented model, we develop two inference methods. First, a fast and scalable stochastic variational inference method that uses efficient block coordinate ascent updates, which are computed in closed form. Second, an asymptotically correct Gibbs sampler that is useful for small datasets. Our experiments show that our method is up to two orders of magnitude faster and more robust than existing state-of-the-art black-box methods.

1 INTRODUCTION

Developing automated yet efficient Bayesian inference methods for Gaussian process (GP) models is a challenging problem that has attracted considerable attention within the probabilistic machine learning community (Salimbeni et al., 2018; Wenzel et al., 2019). A GP defines a distribution over functions and can be used as a flexible building block to develop expressive probabilistic models. By choosing an appropriate likelihood function on top of a latent GP, a variety of interesting models is obtained, which are successfully used in several application areas including robotics (Beckers et al., 2019), facial behavior analysis (Eleftheriadis et al., 2017) and electrical engineering (Pandit and Infield, 2018). For instance, a logistic likelihood leads to a binary GP classification model, and a Student-t likelihood leads to a robust regression model.

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR:

The main challenge in these models is to infer the latent GP given a general non-Gaussian likelihood. Methods that are more generally applicable often treat the model as a black box and are based on sampling or numerical quadrature, thus preventing efficient optimization (Hensman et al., 2015; Salimbeni et al., 2018). On the other hand, many methods focus on special cases of GP models (i.e. special likelihood functions) by exploiting model-specific properties, e.g. binary classification (Polson et al., 2013).

In this work, we develop automated augmented conjugate inference (aaci). aaci is an efficient inference framework that is applicable to a large class of GP models that use a super-Gaussian likelihood¹. It automatically exploits specific properties of the likelihood, leading to an inference algorithm that is up to two orders of magnitude faster than the state of the art.

Our approach builds on an auxiliary variable augmentation

of the model: we add a latent variable to the model such that

the original model is recovered when this variable is inte-

grated out. We consider an augmentation that renders the

model conditionally conjugate. In a conditionally conjugate

model, all complete conditional distributions (the posterior

distribution of one random variable given all the others),

can be computed in closed form. Moreover, we show that

inference in the augmented conditionally conjugate model

is much easier than in the original model and demonstrate

superior performance over the state of the art.

Building on the conditionally conjugate augmentation,

aaci provides two options for inference: a scalable vari-

ational inference method based on efficient closed-form

coordinate ascent updates and an exact Gibbs sampling

method, which is useful on smaller datasets.

Our main contributions are as follows:

• We introduce aaci: an automated inference method for GP models with a super-Gaussian likelihood.

• We propose two inference modules: augmented variational inference, which scales to large datasets containing millions of instances, and an exact Gibbs sampler, which is useful for small datasets.

• The experiments demonstrate that the augmented variational inference module of aaci outperforms the state of the art in terms of speed by up to two orders of magnitude while being competitive in terms of prediction performance. The Gibbs sampler module leads to a much better effective sample size while still being up to ten times faster than Hamiltonian Monte Carlo.

∗ Work done while at TU Berlin.

¹ The definition of the family of super-Gaussian likelihoods is given in Section 3.

[Figure 1: pipeline diagram. Input: GP model → Step 1: Construct conjugate augmentation → Step 2: Compute complete conditionals p(ω|f,y) and p(f|ω,y) → Step 3: Perform inference (Variational Inference or Gibbs Sampling).]

Figure 1. Automated augmented conjugate inference (aaci) performs automated efficient inference in non-conjugate Gaussian process models. In the first step, aaci translates the GP model into an augmented model that is conditionally conjugate. In the second step, the complete conditionals are computed in closed form. In the final step, aaci provides two options: (A) fast stochastic variational inference based on coordinate ascent updates, which easily scales to big datasets and (B) an asymptotically exact Gibbs sampler, which provides high quality samples from the true posterior but is limited to smaller datasets.

The paper is structured as follows: Section 2 gives a high-level overview of our novel inference method aaci. In Section 3, we provide a detailed discussion of the algorithm and prove that our approach indeed leads to conditionally conjugate models. We discuss related work in Section 4 and show our experimental results in Section 5. Finally, Section 6 concludes and lays out future research directions. Our source code for the experiments is available in a GitHub repository².

2 AUTOMATED AUGMENTED

CONJUGATE INFERENCE

Let $X = (x_1, \ldots, x_n)^\top \in \mathbb{R}^{n \times d}$ be a matrix of data points and $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ the corresponding target values. The goal is to learn a mapping from the input points to the target values via a latent function $f$. We assume a GP prior (with prior mean $\mu_0$ and covariance function $k(x, x')$) on the latent function, and the data labels $y = (y_1, \ldots, y_n)$ are connected to $f$ via a factorizable likelihood:

$$p(f) = \mathcal{GP}(f \mid \mu_0, k), \qquad p(y \mid f, X) = \prod_{i=1}^{n} p(y_i \mid f(x_i)).$$

2https://github.com/theogf/AutoConjGP_Exp

The key inference challenge in GP models is to compute the posterior distribution of the latent function

$$p(f \mid y) = \frac{p(y \mid f)\, p(f)}{\int p(y \mid f)\, p(f)\, df}.$$

This is a challenging problem. Inference in GP models scales cubically in the number of data points and is intractable for non-Gaussian likelihoods.

Ideally, we would like an efficient inference method that is not hand-tailored to a specific type of likelihood and hence allows for experimenting with different types of GP models on big datasets in a scalable manner. Thus, we need a flexible inference method that works for a large class of likelihoods, is fast, and ideally does not involve inefficient black-box approaches such as approximating the objective by sampling.

2.1 Automated Augmented Conjugate Inference

We introduce automated augmented conjugate inference (aaci) to achieve this goal. aaci accelerates training of GP models whose likelihood is in the family of super-Gaussian likelihood functions.

aaci translates the intractable non-conjugate model into an easier, conditionally conjugate model by adding auxiliary random variables to the model. Inference in conditionally conjugate models is a classic and well-studied problem (Bishop, 2006). Because of the special structure of conditionally conjugate models, many efficient inference methods exist (Wang and Blei, 2013). Based on the automatically constructed augmentation, we propose an efficient variational inference method using coordinate ascent updates and a Gibbs sampler.

The inference pipeline of aaci. aaci consists of three steps. In the first step, a conjugate augmentation of the model is constructed by adding auxiliary variables ω to the


model. Then, the complete conditional distributions of the latent function f and auxiliary variables ω are computed. In the final step, we provide two options to perform inference.

The variational inference (VI) module of aaci performs

block coordinate ascent updates, computed in closed form.

The updates are much more efficient than ordinary Eu-

clidean gradient updates, which are used in most previous

approaches. The Gibbs sampling module of aaci builds on

the complete conditional distributions and provides exact

samples from the true posterior. For each type of likelihood,

the sampler is automatically constructed.

The inference pipeline of aaci is summarized in Fig. 1. In

the following, we give an overview of how each module of

our inference pipeline works and provide the details in Sec-

tion 3.

(1) Augmenting the model. The first step of our inference framework constructs an auxiliary variable augmentation that renders the model conditionally conjugate. Our augmentation approach finds a Gaussian scale mixture representation of the intractable likelihood

$$p(y_i \mid f_i) = \int p(y_i \mid f_i, \omega_i)\, p(\omega_i)\, d\omega_i, \tag{1}$$

where $p(y_i \mid f_i, \omega_i)$ is an unnormalized Gaussian distribution in $f_i$ with precision $\omega_i$ and $p(\omega_i)$ is the prior distribution of the auxiliary variable. The construction of the distribution $p(\omega)$ is based on an inverse Laplace transformation and is discussed in Section 3.1.

Building on Eq. 1, we augment the GP model by a set of auxiliary variables $\omega = (\omega_1, \ldots, \omega_n)$, leading to the augmented joint distribution

$$p(y, f, \omega) = \left[\prod_{i=1}^{n} p(y_i \mid f_i, \omega_i)\, p(\omega_i)\right] p(f). \tag{2}$$

The auxiliary variable augmentation is constructed in such a way that the augmented model is conditionally conjugate, i.e. the complete conditional distributions $p(\omega \mid f, y)$ and $p(f \mid \omega, y)$ are in the same family as their associated priors.

(2) Computing the complete conditionals. The complete conditionals of $f$ and the auxiliary variables $\omega_i$ are computed in closed form and are given by

$$p(f \mid y, \omega) = \mathcal{N}(f \mid \mu, \Sigma), \qquad p(\omega_i \mid f_i, y_i) = \pi_\phi(\omega_i \mid c_i),$$

where $\phi$ is a function determined by the type of the likelihood (see Eq. 4) and the parameters $\mu$, $\Sigma$, $c_i$ have closed-form expressions and are described in Section 3.2. The distribution family $\pi_\phi(\omega \mid c)$ is derived by an exponential tilting of the prior distribution $p(\omega)$ and is discussed in Section 3.2.

(3a) Augmented variational inference. In step 3, aaci

provides two options to perform inference. We first discuss

the variational inference module, which approximates the

posterior by a variational distribution and easily scales to

big datasets.

We assume a mean-field variational distribution, where the latent GP $f$ and the auxiliary variables $\omega$ are decoupled, i.e. $q(f, \omega) = q(f)\, q(\omega)$. The optimal variational distribution of $\omega$ naturally factorizes, i.e. $q(\omega) = \prod_i q(\omega_i)$. Following standard results (Bishop, 2006), the variational distributions can be iteratively optimized by the block-coordinate ascent updates:

$$q(f) \propto \exp\left(\mathbb{E}_{q(\omega)}[\log p(f \mid \omega, y)]\right), \qquad q(\omega_i) \propto \exp\left(\mathbb{E}_{q(f)}[\log p(\omega_i \mid f, y)]\right). \tag{3}$$

In Section 3.3, we show that these updates are given in closed form and can be computed efficiently without resorting to numerical methods. To scale to big datasets, we employ SVI (Hoffman et al., 2013) and replace the original latent GP $f$ by the sparse approximation of Titsias (2009), which builds on inducing points.

(3b) Exact inference via Gibbs sampling. Building on the conditionally conjugate augmentation, it is straightforward to derive a Gibbs sampler. In order to sample from the exact posterior, we alternate between drawing a sample from each complete conditional distribution

$$\omega^{t} \sim p(\omega \mid f^{t-1}, y), \qquad f^{t} \sim p(f \mid \omega^{t}, y).$$

The augmented variables are naturally marginalized out, and the latent GP samples $\{f^{t}\}$ will be from the true posterior $p(f \mid y)$. As we empirically show in Section 5.1, the Gibbs sampler leads to very fast mixing and outperforms standard Hamiltonian Monte Carlo sampling.

3 ALGORITHM DETAILS

Here we provide the details of the automated augmented conjugate inference (aaci) algorithm. We start by specifying the class of GP models that we consider in our framework. We then discuss the technical details of aaci and prove that the automatically constructed augmentation indeed leads to a conditionally conjugate model.

GP Models with a super-Gaussian likelihood. aaci can be applied to GP models where the likelihood is within the class of super-Gaussian likelihoods. A super-Gaussian likelihood is of the form

$$p(y \mid f; \theta) = C(\theta)\, e^{g(y;\theta)^\top f}\, \phi\left(\|h(f, y)\|_2^2\right), \tag{4}$$

where $\theta$ are hyperparameters of the likelihood, $C(\theta)$ is the normalizing constant, $g(y;\theta)$ is an arbitrary function, $\phi$ is a positive definite radial (pdr) function³, and $h$ is a linear function in $f$, such that we can write

$$\|h(f, y)\|_2^2 = \alpha(y, \theta) - \beta(y, \theta)^\top f + \gamma(y, \theta)\, \|f\|_2^2, \tag{5}$$

where $\alpha$, $\beta$, $\gamma$ are arbitrary functions. We omit $\theta$ in the later derivations for clarity.

Many interesting models are instances of super-Gaussian

likelihood GP models. In Table 1, we present several likeli-

hood functions with their corresponding parameter settings

of the super-Gaussian likelihood as given in Eq. 4.
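The decomposition in Eq. 4 can be checked numerically for individual likelihoods. The following is a minimal sketch for the scalar case, not code from the paper: the helper `super_gaussian` and all function names are hypothetical, labels are assumed to satisfy y ∈ {-1, +1}, and the constants C = 1/2 (logistic) and C = exp(-1) (Bayesian SVM) are obtained by matching the full forms listed in Table 1.

```python
import math


def super_gaussian(C, linear_term, h, phi):
    """Evaluate C * exp(linear_term) * phi(h^2), the scalar form of Eq. 4."""
    return C * math.exp(linear_term) * phi(h * h)


def logistic_direct(y, f):
    """Logistic likelihood sigma(y * f) for a label y in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(-y * f))


def logistic_as_super_gaussian(y, f):
    # C = 1/2, linear term y*f/2, h = y*f/2, phi(r) = 1 / cosh(sqrt(r))
    return super_gaussian(0.5, y * f / 2.0, y * f / 2.0,
                          lambda r: 1.0 / math.cosh(math.sqrt(r)))


def svm_direct(y, f):
    """Bayesian SVM pseudo-likelihood exp(-2 * max(1 - y*f, 0))."""
    return math.exp(-2.0 * max(1.0 - y * f, 0.0))


def svm_as_super_gaussian(y, f):
    # C = exp(-1), linear term y*f, h = 1 - y*f, phi(r) = exp(-sqrt(r))
    return super_gaussian(math.exp(-1.0), y * f, 1.0 - y * f,
                          lambda r: math.exp(-math.sqrt(r)))


# Compare both forms on a few (label, latent value) pairs.
for y, f in [(1, 0.3), (-1, 2.0), (1, -1.7)]:
    print(y, f, logistic_direct(y, f), logistic_as_super_gaussian(y, f))
```

Both representations agree up to floating-point error, which is all that the augmentation in Section 3.1 requires of a likelihood.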

Constructing new likelihoods. Using Eq. 4, we can also

construct novel likelihood functions based on existing ker-

nel functions. In this paper we propose the Matern 3/2 like-

lihood.

3.1 Step 1: Conjugate augmentation

Given the likelihood of the model, aaci constructs a conditionally conjugate auxiliary variable augmentation as follows. We first define a family of distributions $\pi_\phi(\omega \mid c)$, which will be useful for constructing the augmentation.

For the case $c = 0$, the distribution $\pi_\phi(\omega \mid 0)$ is defined by the inverse Laplace transform of $\phi(\cdot)$,

$$\pi_\phi(\omega \mid 0) = \mathcal{L}^{-1}\{\phi(\cdot)\}(\omega). \tag{6}$$

The inverse Laplace transform is the inverse mapping of the Laplace transformation and can be computed by the Bromwich integral formula⁴ (Debnath and Bhatta, 2014). It defines a valid density in our setting (see the proof of Theorem 1). Remarkably, we will see that for the final updates of our algorithm, we do not need to compute the inverse Laplace transformation explicitly.

We generalize the base distribution $\pi_\phi(\omega \mid 0)$ by applying an exponential tilting:

$$\pi_\phi(\omega \mid c) = \frac{e^{-c^2 \omega}\, \pi_\phi(\omega \mid 0)}{\phi(c^2)}, \tag{7}$$

where $c \in \mathbb{R}$.
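As an illustration of Eqs. 6 and 7, the sketch below uses the Laplace-likelihood case $\phi(r) = \exp(-\sqrt{r})$ (with $\beta = 1$), for which the inverse Laplace transform is known in closed form: $\pi_\phi(\omega \mid 0) = \omega^{-3/2} e^{-1/(4\omega)} / (2\sqrt{\pi})$. This is a standard Laplace-transform pair, not a formula taken from the paper. The function names and the log-grid quadrature are assumptions made for this example; the code checks numerically that the mixture reproduces $\phi$, that the tilted density is normalized, and that its mean equals $-\phi'(c^2)/\phi(c^2) = 1/(2c)$.

```python
import math


def phi(r):
    """pdr function of the Laplace likelihood with beta = 1."""
    return math.exp(-math.sqrt(r))


def base_density(omega):
    """pi_phi(omega | 0): closed-form inverse Laplace transform of phi."""
    return math.exp(-1.0 / (4.0 * omega)) / (2.0 * math.sqrt(math.pi) * omega ** 1.5)


def tilted_density(omega, c):
    """pi_phi(omega | c) from Eq. 7 (exponential tilting of the base density)."""
    return math.exp(-c * c * omega) * base_density(omega) / phi(c * c)


def integrate_on_log_grid(func, t_min=-12.0, t_max=12.0, n=24000):
    """Trapezoid rule for the integral of func over (0, inf) after omega = exp(t)."""
    step = (t_max - t_min) / n
    total = 0.0
    for k in range(n + 1):
        omega = math.exp(t_min + k * step)
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * func(omega) * omega  # extra omega is the Jacobian
    return total * step


def mixture(r):
    """Right-hand side of the scale-mixture identity phi(r) = L{pi_phi(.|0)}(r)."""
    return integrate_on_log_grid(lambda w: math.exp(-r * w) * base_density(w))


c = 1.5
print(mixture(1.0), phi(1.0))
print(integrate_on_log_grid(lambda w: tilted_density(w, c)))
print(integrate_on_log_grid(lambda w: w * tilted_density(w, c)), 1.0 / (2.0 * c))
```

The identity is only evaluated for r > 0 here, because the heavy tail of the base density makes the truncated quadrature inaccurate at r = 0.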

Theorem 1. A GP model with a super-Gaussian likelihood (of the form of Eq. 4) is rendered conditionally conjugate by the auxiliary variable augmentation $p(y, f, \omega; \theta) = p(y \mid f, \omega; \theta)\, p(f)\, p(\omega)$. The augmented likelihood is

$$p(y \mid f, \omega; \theta) = C(\theta) \exp\left(g(y;\theta)^\top f - \|h(f, y)\|_2^2\, \omega\right)$$

and the prior distribution of the auxiliary variables is $p(\omega) = \pi_\phi(\omega \mid 0)$.

³ $\phi$ is a positive definite radial function if $\phi(r)$ is completely monotone for all $r \geq 0$ and $\lim_{r \to 0} \phi(r) = 1$.

⁴ The inverse Laplace transformation of a function $\phi(\cdot)$ can be computed by $\mathcal{L}^{-1}\{\phi(\cdot)\}(\omega) = \lim_{T \to \infty} \frac{1}{2\pi i} \int_{b - iT}^{b + iT} e^{r\omega} \phi(r)\, dr$, where $b$ can be arbitrarily chosen but has to be larger than the real part of all singularities of $\phi$.

Proof: We first apply Schoenberg's theorem (Schoenberg, 1938), which states that a function $\mathbb{R}^d \ni x \mapsto \phi(\|x\|_2^2)$ is a pdr function for any dimension $d > 0$ if and only if $\phi(r)$ is a completely monotone function on the domain $r \geq 0$. A completely monotone function $\phi(\cdot)$ has the property that it is infinitely differentiable and its derivatives have an alternating sign (Bernstein et al., 1929), i.e.

$$(-1)^k \phi^{(k)}(r) > 0, \quad r \in [0, +\infty), \quad k = 0, 1, 2, \ldots \tag{8}$$

As a direct consequence, $\phi(\cdot)$ is a positive, decreasing, and convex function, and the first derivative of $\phi(\cdot)$ is a concave function.

Building on these properties, Widder (1946) states that we can rewrite $\phi(\|h(f, y)\|_2^2)$ as a Gaussian scale mixture

$$\phi\left(\|h(f, y)\|_2^2\right) = \int_0^\infty e^{-\|h(f, y)\|_2^2\, \omega}\, d\mu(\omega), \tag{9}$$

with respect to a Borel measure $\mu(\omega)$. We apply the monotone convergence theorem (Yeh, 2006), which gives that $\mu(\omega)$ is even a probability measure iff $\lim_{r \to 0} \phi(r) = 1$. Since we have a probability measure, we write $d\mu(\omega) = p(\omega)\, d\omega$, which leads to the equality $\phi(r) = \mathcal{L}\{p(\omega)\}(r)$, where $\mathcal{L}$ denotes the Laplace transformation. The inverse Laplace transformation gives the density of the auxiliary variable, $p(\omega) = \mathcal{L}^{-1}\{\phi(r)\}(\omega) = \pi_\phi(\omega \mid 0)$.

Therefore, we can rewrite the super-Gaussian likelihood of Eq. 4 as:

$$p(y \mid f) = C(\theta) \int_0^\infty e^{g(y;\theta)^\top f - \|h(f, y)\|_2^2\, \omega}\, p(\omega)\, d\omega. \tag{10}$$

Adding the auxiliary variable $\omega$ with prior $p(\omega)$ to the model, we obtain the augmented likelihood $p(y \mid f, \omega; \theta) = C(\theta) \exp\left(g(y;\theta)^\top f - \|h(f, y)\|_2^2\, \omega\right)$. Since the function $g(y;\theta)^\top f - \|h(f, y)\|_2^2\, \omega$ is by definition quadratic in $f$, the augmented likelihood is proportional to an (unnormalized) Gaussian distribution in $f$ and hence conditionally conjugate in $f$.

For the augmented variable $\omega_i$, the likelihood $p(y \mid \omega, f)$ acts as an exponential tilting of $p(\omega)$, and the full conditional in $\omega$ will stay in the same family of distributions. QED.

3.2 Step 2: Complete Conditionals

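Since the augmented model (Section 3.1) is conditionally conjugate, the complete conditional distributions are in the same family as their associated prior distributions and are given in closed form.

Likelihood | Full form | g(f, y) | h(f, y) | ϕ(r)
Student-t | $\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\sigma\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{(y-f)^2}{\nu\sigma^2}\right)^{-\frac{\nu+1}{2}}$ | $0$ | $\frac{f-y}{\sigma}$ | $\left(1 + \frac{r}{\nu}\right)^{-\frac{\nu+1}{2}}$
Laplace | $\frac{1}{2\beta} \exp\left(-\frac{\lvert y-f \rvert}{\beta}\right)$ | $0$ | $f-y$ | $\exp\left(-\frac{\sqrt{r}}{\beta}\right)$
Logistic | $\frac{1}{2} \exp\left(\frac{yf}{2}\right) \cosh^{-1}\left(\frac{\lvert yf \rvert}{2}\right)$ | $\frac{yf}{2}$ | $\frac{yf}{2}$ | $\cosh^{-1}(\sqrt{r})$
Bayesian SVM | $\exp\left((yf-1) - \lvert 1-yf \rvert\right)$ | $yf$ | $1-yf$ | $\exp(-\sqrt{r})$
Matern 3/2 | $\frac{\sqrt{3}}{4\rho} \left(1 + \frac{\sqrt{3}\lvert y-f \rvert}{\rho}\right) \exp\left(-\frac{\sqrt{3}\lvert y-f \rvert}{\rho}\right)$ | $0$ | $f-y$ | $\left(1 + \frac{\sqrt{3r}}{\rho}\right) \exp\left(-\frac{\sqrt{3r}}{\rho}\right)$

Table 1. Many interesting GP models are members of the super-Gaussian likelihood family introduced in Section 3. We display the full likelihood and the corresponding terms of the super-Gaussian likelihood as described in Eq. 4. Here $\cosh^{-1}$ denotes the reciprocal $1/\cosh$. Some models were already considered independently, but our approach provides a unified view.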

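Theorem 2. The complete conditional distributions of the augmented model presented in Section 3.1 are given by

$$p(\omega_i \mid f_i, y_i) = \pi_\phi\left(\omega_i \mid \|h(f_i, y_i)\|_2\right), \qquad p(f \mid y, \omega) = \mathcal{N}(f \mid \mu, \Sigma), \tag{11}$$

where $\Sigma = \left(\mathrm{diag}(2\omega \circ \gamma(y)) + K^{-1}\right)^{-1}$ and $\mu = \Sigma\left(g(y) + \omega \circ \beta(y) + K^{-1}\mu_0\right)$, $\circ$ denotes the Hadamard product, and the function $h(\cdot)$ is given by the form of the likelihood (see Eq. 5).

The proof is given in Appendix A.1.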
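The appendix proof is not reproduced in this section. As a hedged sketch of why the stated parameters are plausible, one can complete the square in the augmented joint distribution using Eq. 5 coordinate-wise (so that $\|h(f_i, y_i)\|_2^2 = \alpha(y_i) - \beta(y_i) f_i + \gamma(y_i) f_i^2$) and the GP prior evaluated at the inputs, $p(f) = \mathcal{N}(f \mid \mu_0, K)$:

```latex
\begin{align*}
\log p(f \mid y, \omega)
  &= \text{const} + \sum_{i=1}^{n} \Big[ g(y_i) f_i
     - \omega_i \big( \alpha(y_i) - \beta(y_i) f_i + \gamma(y_i) f_i^2 \big) \Big]
     - \tfrac{1}{2} (f - \mu_0)^\top K^{-1} (f - \mu_0) \\
  &= \text{const} - \tfrac{1}{2} f^\top
     \underbrace{\big( \mathrm{diag}(2\omega \circ \gamma(y)) + K^{-1} \big)}_{\Sigma^{-1}} f
     + f^\top \underbrace{\big( g(y) + \omega \circ \beta(y) + K^{-1} \mu_0 \big)}_{\Sigma^{-1} \mu}, \\
p(\omega_i \mid f_i, y_i)
  &\propto e^{-\|h(f_i, y_i)\|_2^2\, \omega_i}\, \pi_\phi(\omega_i \mid 0)
   \;\propto\; \pi_\phi\big(\omega_i \mid \|h(f_i, y_i)\|_2\big).
\end{align*}
```

Reading off the quadratic and linear coefficients in $f$ gives $\Sigma$ and $\mu$ as in Eq. 11, and the last line is the exponential tilting of Eq. 7 with $c = \|h(f_i, y_i)\|_2$.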

3.3 Step 3: Efficient inference

In the final step of our inference pipeline, we leverage the conditionally conjugate structure of the augmented model and derive two inference methods. First, we propose a scalable stochastic variational inference (SVI) method that builds on efficient block coordinate ascent (CAVI) updates, computed in closed form. Second, we develop a Gibbs sampling scheme that generates samples from the exact posterior.

3.3.1 Augmented variational inference

We implement the classic stochastic variational inference

(SVI) algorithm for conditionally conjugate models de-

scribed by Hoffman et al. (2013), which builds on block

coordinate ascent updates. The updates can be interpreted

as natural gradient updates and are much more efficient than

ordinary Euclidean gradient updates (Amari,1998).

Variational approximation. We approximate the posterior distribution of the latent GP values by assuming a decoupling between $f$ and $\omega$. The family of the optimal variational distribution can be easily determined by averaging the complete conditionals in log-space, as given in Eq. 3 (see e.g. Blei et al., 2017). From the above decoupling assumption, it follows that the optimal variational posterior is in the variational family

$$q(f, \omega) = q(f) \prod_{i=1}^{n} q(\omega_i), \tag{12}$$

where $q(f) = \mathcal{N}(f \mid m, S)$ and $q(\omega_i) = \pi_\phi(\omega_i \mid c_i)$, and $m$, $S$ and $c$ are the variational parameters.

Variational updates. We start by deriving the variational updates for the variational Gaussian distribution,

$$q(f) \propto \exp\left(\mathbb{E}_{q(\omega)}[\log p(f \mid \omega, y)]\right) \propto \exp\left[\sum_{i} \left( g(y_i) f_i - \|h(f_i, y_i)\|_2^2\, \mathbb{E}_{q(\omega_i)}[\omega_i] \right)\right] p(f).$$

Computing the variational updates of $q(f)$ boils down to computing the first moment of $\omega$. Remarkably, the moments of $\pi_\phi$ can be computed without computing the closed-form density of $\pi_\phi$ explicitly, i.e. without evaluating the inverse Laplace transformation of $\phi$ (Eq. 6).

The moments can be computed by differentiating the moment generating function, which is itself a Laplace transform. For our algorithm, we only need the first moment of $\omega$, which is given by

$$\mathbb{E}_{q(\omega)}[\omega] = \left.\frac{d\, \mathcal{L}\{q(\omega)\}(-t)}{dt}\right|_{t=0} = -\frac{\phi'(c^2)}{\phi(c^2)} = \bar{\omega},$$

which can be cheaply computed via automatic differentiation.
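The first-moment formula can be illustrated for the pdr functions of Table 1. The sketch below is an assumption-laden stand-in rather than the paper's implementation: it replaces automatic differentiation with a central finite difference, uses hypothetical function names, and compares the result against closed forms derived by hand ($1/(2\beta c)$ for Laplace, $(\nu+1)/(2(\nu + c^2))$ for Student-t, $\tanh(c)/(2c)$ for logistic).

```python
import math


def first_moment(phi, c, eps=1e-6):
    """omega_bar = -phi'(c^2) / phi(c^2), with phi' from a central difference."""
    r = c * c
    dphi = (phi(r + eps) - phi(r - eps)) / (2.0 * eps)
    return -dphi / phi(r)


def phi_laplace(r, beta=1.0):
    return math.exp(-math.sqrt(r) / beta)


def phi_student_t(r, nu=3.0):
    return (1.0 + r / nu) ** (-(nu + 1.0) / 2.0)


def phi_logistic(r):
    return 1.0 / math.cosh(math.sqrt(r))


c = 1.2
print(first_moment(phi_laplace, c), 1.0 / (2.0 * c))
print(first_moment(phi_student_t, c), (3.0 + 1.0) / (2.0 * (3.0 + c * c)))
print(first_moment(phi_logistic, c), math.tanh(c) / (2.0 * c))
```

In practice an automatic differentiation library would replace `first_moment`; the finite difference is used here only to keep the example dependency-free, and it requires c² > eps.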

The updates for the variational distribution of the auxiliary variables $q(\omega)$ are computed as follows:

$$q(\omega_i) \propto \exp\left(-\mathbb{E}_{q(f_i)}\left[\|h(f_i, y_i)\|_2^2\right] \omega_i + \log p(\omega_i)\right) \propto \exp\left(-\mathbb{E}_{q(f_i)}\left[\|h(f_i, y_i)\|_2^2\right] \omega_i\right) p(\omega_i) = \pi_\phi\left(\omega_i \,\Big|\, \sqrt{\mathbb{E}_{q(f_i)}\left[\|h(f_i, y_i)\|_2^2\right]}\right).$$

We then get the update $c_i = \sqrt{\mathbb{E}_{q(f_i)}\left[\|h(f_i, y_i)\|_2^2\right]}$, which can be easily computed in closed form since $\|h(f_i, y_i)\|_2^2$ is a quadratic function of $f_i$.

The coordinate ascent variational inference (CAVI) method is summarized in Algorithm 1.
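To make the closed form concrete, here is a short worked expression, assuming the scalar marginal $q(f_i) = \mathcal{N}(f_i \mid m_i, S_{ii})$ and the quadratic form of Eq. 5. The Student-t line is an illustrative special case with $h = (f - y)/\sigma$ taken from Table 1.

```latex
\begin{align*}
c_i^2 = \mathbb{E}_{q(f_i)}\big[\|h(f_i, y_i)\|_2^2\big]
  &= \alpha(y_i) - \beta(y_i)\, \mathbb{E}_{q}[f_i] + \gamma(y_i)\, \mathbb{E}_{q}[f_i^2] \\
  &= \alpha(y_i) - \beta(y_i)\, m_i + \gamma(y_i) \big( m_i^2 + S_{ii} \big), \\
\text{Student-t:}\quad c_i^2
  &= \frac{(m_i - y_i)^2 + S_{ii}}{\sigma^2},
  \qquad \alpha = \frac{y_i^2}{\sigma^2},\;
         \beta = \frac{2 y_i}{\sigma^2},\;
         \gamma = \frac{1}{\sigma^2}.
\end{align*}
```

Only the marginal means and variances of $q(f)$ enter the local update, so the diagonal of $S$ is sufficient for this step.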

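Algorithm 1 Augmented Variational Inference

Input: Data $(X, y)$, GP model $p(y \mid f)$, kernel $k$
Output: Approximate posterior $q(f) = \mathcal{N}(f \mid m, S)$
for iteration $t = 1, 2, \ldots$ do
    # Local updates:
    for $i \in 1 : N$ do
        $c_i = \sqrt{\mathbb{E}_{q(f)}[h(f_i, y_i)^2]}$
        $\bar{\omega}_i = \mathbb{E}_{q(\omega_i)}[\omega_i] = -\phi'(c_i^2) / \phi(c_i^2)$
    end for
    # Coordinate ascent updates (CAVI):
    $S \leftarrow \left(\mathrm{diag}(2\bar{\omega} \circ \gamma(y)) + K^{-1}\right)^{-1}$
    $m \leftarrow S\left(K^{-1}\mu_0 + g(y) + \bar{\omega} \circ \beta(y)\right)$
end for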
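The updates of Algorithm 1 can be sketched in a few lines of NumPy for one concrete likelihood. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes a Student-t likelihood with $h = (f - y)/\sigma$ (so $g = 0$, $\beta = 2y/\sigma^2$, $\gamma = 1/\sigma^2$), a zero prior mean, a full (non-sparse) GP, and fixed hyperparameters. The function name `cavi_student_t` and the toy data are hypothetical.

```python
import numpy as np


def cavi_student_t(K, y, nu=3.0, sigma=0.5, n_iter=50):
    """Algorithm 1 for a Student-t likelihood, assuming a zero prior mean."""
    n = len(y)
    gamma = np.full(n, 1.0 / sigma**2)   # Eq. 5 with h = (f - y) / sigma
    beta = 2.0 * y / sigma**2
    m, S = np.zeros(n), K.copy()         # initialise q(f) at the prior
    for _ in range(n_iter):
        # Local updates: c_i^2 = E_q[h(f_i, y_i)^2], then omega_bar_i.
        c2 = ((m - y) ** 2 + np.diag(S)) / sigma**2
        omega_bar = (nu + 1.0) / (2.0 * (nu + c2))   # -phi'(c^2) / phi(c^2)
        # CAVI update of q(f). The identity (D + K^{-1})^{-1} = (I + K D)^{-1} K
        # avoids forming K^{-1} explicitly.
        d = 2.0 * omega_bar * gamma
        S = np.linalg.solve(np.eye(n) + K * d[None, :], K)
        m = S @ (omega_bar * beta)       # g(y) = 0 and mu_0 = 0 here
    return m, S, omega_bar


# Toy data: a sine curve with one gross outlier at index 7.
x = np.linspace(0.0, 6.0, 15)
K = np.exp(-(x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(15)
y = np.sin(x)
y[7] += 8.0
m, S, omega_bar = cavi_student_t(K, y)
print(m[7], omega_bar[7])
```

The outlier receives a small expected precision $\bar{\omega}_7$, so the posterior mean stays close to the underlying curve; this is the robustness mechanism of the Student-t likelihood expressed through the auxiliary variables.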

Sparse GP approximation. To scale our method to big datasets, we approximate the latent GP $f$ by a sparse Gaussian process building on inducing points. We introduce $M$ inducing points $u$ and connect the GP values with the inducing points via the joint prior distribution $p(f, u)$ given in Titsias (2009). The introduction of inducing points preserves conditional conjugacy and allows for mini-batch sampling of the data (stochastic variational inference). This scales the algorithm to big datasets and has the computational complexity $O(M^3)$. The SVI version of our algorithm only slightly changes the updates that are presented in Algorithm 1. It is deferred to Appendix A.3.

3.3.2 Gibbs sampling

To sample from the exact posterior distribution, a Gibbs sampling scheme alternates between sampling from the complete conditional distributions. In the following, we propose a sampling scheme for the distribution family $\pi_\phi(\omega \mid c)$ that is automatically constructed given the pdr function $\phi(\cdot)$ of the likelihood.

The distribution class $\pi_\phi$ is defined in Eq. 6 and is based on the inverse Laplace transform of $\phi(\cdot)$. However, there is no general approach to compute the inverse Laplace transform in closed form (Cohen, 2007). We circumvent this issue by proposing an algorithm that only evaluates the inverse Laplace transformation point-wise and does not need access to its full analytical form. We apply the method proposed by Ridout (2009), which builds on the fact that the cumulative distribution function (cdf) $F_{\pi_\phi(\omega \mid c)}(\cdot)$ can be computed via the inverse Laplace transform of a scaled (forward) Laplace transform,

$$F_{\pi_\phi(\omega \mid c)}(x) = \mathcal{L}^{-1}\left\{\frac{\mathcal{L}\{\pi_\phi(\omega \mid c)\}(s)}{s}\right\}(x) = \mathcal{L}^{-1}\left\{\frac{\phi(s + c^2)}{s\, \phi(c^2)}\right\}(x).$$

To generate samples from $\pi_\phi(\omega \mid c)$, we first generate a uniform sample $u \sim \mathcal{U}[0, 1]$ and then push it through the inverse cdf, $\omega = F^{-1}_{\pi_\phi(\omega \mid c)}(u)$ (Devroye, 1986). Finally, to compute the inverse cdf, we solve a fixed-point problem using the modified Newton-Raphson method described by Ridout (2009). We solve the equation $F_{\pi_\phi(c)}(\omega) = u$ by repeatedly setting $\omega \leftarrow \omega - \left(F_{\pi_\phi(c)}(\omega) - u\right) / \pi_\phi(\omega \mid c)$ until reaching convergence. We numerically approximate the (forward) cdf $F_{\pi_\phi(c)}(\omega)$ by the cheap trapezoidal method introduced in Abate et al. (2000), which has error guarantees. The cost of this process is negligible compared to the matrix inversion required for sampling $f$. All steps are summarized in Algorithm 2.

Note that for some likelihood functions (e.g. the logistic likelihood function), the inverse Laplace transform can be derived analytically, and the steps described above can be optimized by using an existing sampler for the corresponding complete conditional distribution.
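The inverse-cdf step can be illustrated on a case where the cdf is available analytically. The sketch below is a plain Newton-Raphson iteration, not Ridout's modified variant, and it does not perform numerical Laplace inversion. As an assumed test case it uses the Student-t pdr function with $\nu = 1$, for which $\pi_\phi(\omega \mid c)$ is an exponential distribution with rate $\nu + c^2$ (because $(1 + r/\nu)^{-1}$ is the Laplace transform of an exponential density with rate $\nu$). All names are hypothetical.

```python
import math


def inverse_cdf_newton(u, cdf, pdf, omega0=0.0, tol=1e-12, max_iter=100):
    """Solve cdf(omega) = u by iterating omega <- omega - (cdf(omega) - u) / pdf(omega)."""
    omega = omega0
    for _ in range(max_iter):
        residual = cdf(omega) - u
        if abs(residual) < tol:
            break
        omega -= residual / pdf(omega)
    return omega


def make_exponential(rate):
    """cdf and pdf of pi_phi(omega | c) for the Student-t case with nu = 1."""
    def cdf(omega):
        return 1.0 - math.exp(-rate * omega)

    def pdf(omega):
        return rate * math.exp(-rate * omega)

    return cdf, pdf


nu, c = 1.0, 1.0
cdf, pdf = make_exponential(nu + c * c)
u = 0.5
print(inverse_cdf_newton(u, cdf, pdf), -math.log(1.0 - u) / (nu + c * c))
```

Starting at omega0 = 0 places the iterate to the left of the root, from where Newton steps increase monotonically for this concave cdf. In the general setting of Algorithm 2, `cdf` and `pdf` would be replaced by numerical approximations.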

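Algorithm 2 Gibbs Sampling

Input: Data $(X, y)$, GP model $p(y \mid f)$, kernel $k$
Output: Posterior samples $\{f^t\} \sim p(f \mid y)$
for sample index $t = 1, 2, \ldots$ do
    # Sample $\omega \sim p(\omega \mid f, y)$:
    for $i \in 1 : N$ do
        Compute $c_i = \|h(f_i, y_i)\|_2$
        Sample $u_i \sim \mathcal{U}[0, 1]$
        # Compute inverse cdf $\omega_i = F^{-1}_{\pi_\phi(c_i)}(u_i)$:
        Initialize $\omega_i > 0$
        while $|\tilde{F}_{\pi_\phi(c_i)}(\omega_i) - u_i| > \epsilon$ do
            Approximate $\tilde{F}_{\pi_\phi(c_i)}(\omega_i)$, $\tilde{\pi}_\phi(\omega_i \mid c_i)$ (see Sec. 3.3.2)
            $\omega_i \leftarrow \omega_i - \left(\tilde{F}_{\pi_\phi(c_i)}(\omega_i) - u_i\right) / \tilde{\pi}_\phi(\omega_i \mid c_i)$
        end while
    end for
    # Sample $f \sim p(f \mid \omega, y)$:
    $\Sigma = \left(\mathrm{diag}(2\omega \circ \gamma(y)) + K^{-1}\right)^{-1}$
    $\mu = \Sigma\left(K^{-1}\mu_0 + g(y) + \omega \circ \beta(y)\right)$
    Sample $f^t \sim \mathcal{N}(\mu, \Sigma)$
end for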
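For likelihoods whose auxiliary conditional has a known form, the inner Newton loop of Algorithm 2 can be replaced by a direct draw. The sketch below assumes a Student-t likelihood with $h = (f - y)/\sigma$: since $(1 + r/\nu)^{-(\nu+1)/2}$ is the Laplace transform of a Gamma density with shape $(\nu+1)/2$ and rate $\nu$, the tilted distribution $\pi_\phi(\omega \mid c)$ is Gamma with shape $(\nu+1)/2$ and rate $\nu + c^2$. This is an illustrative special case with hypothetical names, a zero prior mean, and fixed hyperparameters, not the general sampler of the paper.

```python
import numpy as np


def gibbs_student_t(K, y, nu=3.0, sigma=0.5, n_samples=300, seed=0):
    """Blocked Gibbs sampler for GP regression with a Student-t likelihood."""
    rng = np.random.default_rng(seed)
    n = len(y)
    f = np.zeros(n)
    samples = np.empty((n_samples, n))
    for t in range(n_samples):
        # omega_i | f_i, y_i ~ Gamma(shape=(nu + 1) / 2, rate=nu + c_i^2)
        c2 = (f - y) ** 2 / sigma**2
        omega = rng.gamma(shape=(nu + 1.0) / 2.0, scale=1.0 / (nu + c2))
        # f | omega, y ~ N(mu, Sigma) with Sigma = (diag(2 omega gamma) + K^{-1})^{-1},
        # computed as (I + K D)^{-1} K to avoid inverting K.
        d = 2.0 * omega / sigma**2
        Sigma = np.linalg.solve(np.eye(n) + K * d[None, :], K)
        Sigma = 0.5 * (Sigma + Sigma.T)             # symmetrise against round-off
        mu = Sigma @ (omega * 2.0 * y / sigma**2)   # g(y) = 0 and mu_0 = 0 here
        f = rng.multivariate_normal(mu, Sigma)
        samples[t] = f
    return samples


# Toy data: a sine curve with one gross outlier at index 7.
x = np.linspace(0.0, 6.0, 15)
K = np.exp(-(x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(15)
y = np.sin(x)
y[7] += 8.0
samples = gibbs_student_t(K, y)
print(samples[100:].mean(axis=0)[7])
```

Each sweep costs one dense linear solve, which corresponds to the cubic per-sample cost of the full-GP sampler.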

4 RELATED WORK

Inference for non-conjugate likelihoods is not a new topic

and there have been many works to deal efficiently with the

problem.


Scale mixtures of normals. The Gaussian scale-mixture formulation is well known in statistics and has been explored more recently by Gneiting (1997, 1999). Palmer (2006) and Palmer et al. (2006) started to generalize it for use in machine learning but did not explore the probabilistic side of the augmentation.

Black-box variational inference. One of the most popular approaches for variational inference in recent years is to optimize the ELBO for an arbitrary model by computing gradient estimates via sampling or quadrature, e.g. Salimbeni et al. (2018); Mohamed et al. (2019). However, these methods do not exploit the structure of the model and can be less efficient.

Sampling methods. Sampling is not a popular method for GP models since $f$ is high-dimensional and the posterior is usually highly correlated (Lawrence et al., 2009). But as for many Bayesian models, Hamiltonian Monte Carlo is a good candidate (Titsias et al., 2008).

Likelihood approximation. Jaakkola and Jordan (2000) propose a variational approach purely based on optimization, using the partial convexity of the likelihood. Our method recovers their results, but from a probabilistic perspective. We show the equivalence with their approach in Appendix A.5. Khan and Lin (2017) exploit existing partial conjugacy in the model and rely on the assumption that part of the joint posterior can be rewritten as an exponential family. Their approach is complementary to ours and could be combined with it for solving more complex models.

Use cases of the augmented model. Different applications of the augmentation technique for specific likelihoods have been explored in multiple papers: Jylänki et al. (2011) applied the augmentation to the Student-t likelihood with Gaussian processes. Polson et al. (2013) developed an approach for the logistic likelihood; this work was further extended to big data by Wenzel et al. (2019). The augmentation of the Bayesian support vector machine by Polson et al. (2011), scaled up by Wenzel et al. (2017), is similar to our method but is based on a different augmentation approach. Note that our method covers all these cases exactly but does not rely on any manual derivations.

5 EXPERIMENTS

In this section we answer the following questions empiri-

cally:

•How does the Gibbs sampling scheme compare to

other sampling methods?

•What is lost in variational inference by approximating

an additional variable?

•And what is the gain in speed?

We explore four different cases. We use three regression models with different likelihood functions: a Laplace likelihood, a Student-t likelihood, and a new likelihood inspired by the Matern 3/2 kernel (Rasmussen, 2003), as well as one classification model with a logistic likelihood. All the mathematical details of these augmentations are deferred to Appendix A.6. For the first two experiments we use a full GP without inducing points to have a cleaner analysis of the effect of the augmentation. For all experiments we use a squared exponential kernel with automatic relevance determination: $k(x, x') = \exp\left(-\sum_{d=1}^{D} (x_d - x'_d)^2 / \theta_d^2\right)$. For the first two experiments we use datasets from the UCI repository (Dua and Graff, 2017): the Boston housing dataset (N = 506, D = 14) for regression and the Heart dataset (N = 303, D = 14) for classification. For the last experiment we use the Protein dataset (N = 45730, D = 9) and the Airline dataset (N = 190K, D = 7) for regression, and the Covtype dataset (N = 581K, D = 54) and the SUSY dataset (N = 5M, D = 18) for classification. We normalize the input features to mean 0 and variance 1.

Likelihood | Metric | MH | HMC | Gibbs
Logistic | Time/Sample (s) | 0.001 | 0.041 | 0.01
Logistic | Lag 1 | 0.996 | 0.53 | 0.11
Logistic | Gelman | 1.38 | 1.00 | 1.00
Student-t | Time/Sample (s) | 0.003 | 0.573 | 0.028
Student-t | Lag 1 | 1.0 | 0.857 | 0.04
Student-t | Gelman | 1.51 | 1.00 | 1.00
Laplace | Time/Sample (s) | 0.002 | 0.082 | 0.028
Laplace | Lag 1 | 0.995 | 0.931 | 0.26
Laplace | Gelman | 1.44 | 1.01 | 1.00
Matern 3/2 | Time/Sample (s) | 0.005 | 0.15 | 0.029
Matern 3/2 | Lag 1 | 0.997 | 0.995 | 0.05
Matern 3/2 | Gelman | 1.59 | 1.10 | 1.00

Table 2. Sampling time and diagnostics of Gibbs sampling, naive Metropolis-Hastings and Hamiltonian Monte Carlo. The Gelman test indicates the inter-chain correlation and should be close to 1.
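The kernel and the feature normalization used in the experiments are simple to state in code. The following is a minimal pure-Python sketch with hypothetical function names; it follows the kernel formula above literally (no factor 1/2 in the exponent, unit kernel variance) and standardizes one feature column with the population standard deviation, which is an assumption since the text does not say which estimator was used.

```python
import math
import statistics


def ard_squared_exponential(x, x_prime, lengthscales):
    """k(x, x') = exp(-sum_d (x_d - x'_d)^2 / theta_d^2) with one lengthscale per dimension."""
    if not (len(x) == len(x_prime) == len(lengthscales)):
        raise ValueError("x, x_prime and lengthscales must have the same length")
    return math.exp(-sum(((a - b) / theta) ** 2
                         for a, b, theta in zip(x, x_prime, lengthscales)))


def standardize(column):
    """Rescale one feature column to mean 0 and variance 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


print(ard_squared_exponential([0.0, 0.0], [1.0, 2.0], [1.0, 2.0]))
print(standardize([1.0, 2.0, 3.0]))
```

A large lengthscale $\theta_d$ makes the kernel insensitive to dimension $d$, which is how automatic relevance determination down-weights uninformative features.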

5.1 Gibbs sampling mixing

Our approach leads to a Gibbs sampling algorithm that provides samples from the true posterior of the original model. We compare our method (Gibbs) with a naive Metropolis-Hastings algorithm (MH) and a Hamiltonian Monte Carlo (HMC) sampler (where $\epsilon$ and $n_{\text{step}}$ are selected via a grid search, see Appendix A.7), both implemented in Turing.jl (Ge et al., 2018), with a whitening transformation on the kernel matrix for better mixing. We draw 5 independent chains of 10000 samples for each algorithm. We compare crucial sampling diagnostics among different models: we give the autocorrelation between consecutive samples (lag 1), as well as the autocorrelation plots for all lags in Appendix A.7, to estimate the effective sample size, and the inter-chain correlation via the Gelman test (1 is the optimum) (Brooks and Gelman, 1998). The results are summarized in Table 2.
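The two diagnostics can be sketched as follows. This is a basic illustration for a scalar quantity, with hypothetical function names: the lag-1 autocorrelation of a single chain and the classical potential scale reduction factor computed from several chains. Whether the numbers reported in the experiments use exactly this variant of the Gelman statistic (for example, without degrees-of-freedom corrections) is an assumption.

```python
import math
import random
import statistics


def lag1_autocorrelation(chain):
    """Sample autocorrelation between consecutive draws of one chain."""
    mean = statistics.fmean(chain)
    num = sum((a - mean) * (b - mean) for a, b in zip(chain[:-1], chain[1:]))
    den = sum((a - mean) ** 2 for a in chain)
    return num / den


def gelman_rubin(chains):
    """Potential scale reduction factor for equally long chains of a scalar."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between_over_n = statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between_over_n
    return math.sqrt(pooled / within)


random.seed(0)
iid_chains = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(5)]
print(lag1_autocorrelation(iid_chains[0]), gelman_rubin(iid_chains))
```

A lag-1 value near 0 indicates nearly independent consecutive samples, while a Gelman statistic well above 1 signals that the chains have not yet mixed over the same distribution.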

Figure 3. Test negative log-likelihood and test error (classification)/RMSE (regression) as a function of time for different likelihoods.

[Figure 2 panels: a) Matern 3/2 likelihood on the Boston Housing dataset; b) Logistic likelihood on the Heart dataset.]

Figure 2. Converged negative ELBO and averaged negative log-likelihood on a held-out dataset as a function of the kernel lengthscale, training VI with and without augmentation.

We find that our method has a very low intra-chain correlation, leading to a high sample efficiency, as well as a low inter-chain correlation, while still being faster than the HMC algorithm. This is even more evident for heavy-tailed likelihoods like Student-t or Laplace, where HMC can have more difficulty (Betancourt, 2017). Our approach is limited by the $O(N^3)$ complexity for each sample.

5.2 Augmentation gap

To investigate the effect of augmenting the model when using variational inference, we train the original model using gradient descent and the augmented model until convergence. While we fix the kernel variance at 0.1, we vary the lengthscale $\theta$ from $10^{-2}$ to $10^{2}$. We compare the converged ELBOs as well as the predictive performance on a held-out test set. The results for the Matern 3/2 and logistic likelihoods are shown in figure 2; the other likelihoods are shown in appendix A.7. For both shown likelihoods, there is a visible ELBO gap between the augmented model and the original model. However, the predictive performance is nearly the same for both models. We can conclude that a potential difference in ELBO values does not affect the prediction performance.

5.3 Convergence speed

To scale our model to large datasets, we use the inducing points technique of Titsias (2009) and the stochastic gradient descent approach of Hoffman et al. (2013). We compare our variational approach (Algorithm 1) to natural gradient descent (Salimbeni et al., 2018) and ADAM (Hensman et al., 2015), both implemented in GPflow (Matthews et al., 2017). For all methods we use 200 inducing points determined by k-means++ (Arthur and Vassilvitskii, 2007) and minibatches of size 100, and we train the kernel hyperparameters using ADAM (Kingma and Ba, 2014) (the inducing point locations are fixed). We show the predictive performance as a function of the training time for multiple likelihoods in figure 3.

Our method is up to two orders of magnitude faster than the state of the art. Moreover, we find that the optimization in our method is more stable (smooth decrease of the loss).

6 CONCLUSION

We proposed a new efficient inference method for GP models that have a super-Gaussian likelihood. Our method builds on an auxiliary variable augmentation that renders the model conditionally conjugate. We showed that in the augmented model, variational inference is up to two orders of magnitude faster and more stable than the state of the art. For small datasets, we proposed a Gibbs sampler that outperforms Hamiltonian Monte Carlo sampling. Previous methods that build on auxiliary variable augmentations (e.g. Wenzel et al., 2019) manually derived the augmentation and inference methods, whereas in our approach the whole procedure is fully automated and works for a much more general class of models. Future work may aim at extending our approach to more general models by automatically constructing hierarchical augmentations inspired by Galy-Fajou et al. (2019) or Donner and Opper (2018).

5. Automated Augmented Conjugate Inference for Non-conjugate Gaussian Process

Models

Théo Galy-Fajou, Florian Wenzel, Manfred Opper

References

Abate, J., Choudhury, G. L., and Whitt, W. (2000). An

introduction to numerical transform inversion and its ap-

plication to probability models. In Computational proba-

bility, pages 257–323. Springer.

Amari, S.-I. (1998). Natural gradient works efficiently in

learning. Neural computation, 10(2):251–276.

Arthur, D. and Vassilvitskii, S. (2007). k-means++: The

advantages of careful seeding. In Proceedings of the eigh-

teenth annual ACM-SIAM symposium on Discrete algo-

rithms, pages 1027–1035. Society for Industrial and Ap-

plied Mathematics.

Beckers, T., Kulić, D., and Hirche, S. (2019). Stable gaus-

sian process based tracking control of euler-lagrange sys-

tems. Automatica, (103):390–397.

Bernstein, S. et al. (1929). Sur les fonctions absolument

monotones. Acta Mathematica, 52:1–66.

Betancourt, M. (2017). A conceptual introduc-

tion to hamiltonian monte carlo. arXiv preprint

arXiv:1701.02434.

Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017).

Variational inference: A review for statisticians. Journal

of the American Statistical Association, 112(518):859–

877.

Brooks, S. P. and Gelman, A. (1998). General methods for

monitoring convergence of iterative simulations. Journal

of computational and graphical statistics, 7(4):434–455.

Cohen, A. M. (2007). Numerical methods for Laplace

transform inversion, volume 5. Springer Science & Busi-

ness Media.

Debnath, L. and Bhatta, D. (2014). Integral transforms

and their applications. Chapman and Hall/CRC.

Devroye, L. (1986). Nonuniform random variate genera-

tion. Springer-Verlag.

Donner, C. and Opper, M. (2018). Efficient bayesian in-

ference of sigmoidal gaussian cox processes. The Journal

of Machine Learning Research, 19(1):2710–2743.

Dua, D. and Graff, C. (2017). UCI machine learning repos-

itory.

Eleftheriadis, S., Rudovic, O., Deisenroth, M. P., and Pan-

tic, M. (2017). Gaussian process domain experts for mod-

eling of facial affect. IEEE Transactions on Image Pro-

cessing, 26(10):4697–4711.

Galy-Fajou, T., Wenzel, F., Donner, C., and Opper, M.

(2019). Multi-class gaussian process classification made

conjugate: Efficient inference via data augmentation. Un-

certainty in Artificial Intelligence (UAI).

Ge, H., Xu, K., and Ghahramani, Z. (2018). Turing: a

language for flexible probabilistic inference. In Interna-

tional Conference on Artificial Intelligence and Statistics,

AISTATS, pages 1682–1690.

Gneiting, T. (1997). Normal scale mixtures and dual prob-

ability densities. Journal of Statistical Computation and

Simulation, 59(4):375–384.

Gneiting, T. (1999). Radial positive definite functions gen-

erated by euclid’s hat. Journal of Multivariate Analysis,

69(1):88–119.

Hensman, J., Matthews, A., and Ghahramani, Z. (2015).

Scalable variational gaussian process classification. The

Journal of Machine Learning Research.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J.

(2013). Stochastic variational inference. The Journal of

Machine Learning Research, 14(1):1303–1347.

Jaakkola, T. S. and Jordan, M. I. (2000). Bayesian pa-

rameter estimation via variational methods. Statistics and

Computing, 10(1):25–37.

Jylänki, P., Vanhatalo, J., and Vehtari, A. (2011). Robust

gaussian process regression with a Student-t likelihood.

Journal of Machine Learning Research, 12(Nov):3227–

3257.

Khan, M. E. and Lin, W. (2017). Conjugate-computation

variational inference: Converting variational inference in

non-conjugate models to inferences in conjugate mod-

els. International Conference on Artificial Intelligence and

Statistics, AISTATS.

Kingma, D. P. and Ba, J. (2014). Adam: A method for

stochastic optimization. arXiv preprint arXiv:1412.6980.

Lawrence, N. D., Rattray, M., and Titsias, M. K. (2009).

Efficient sampling for gaussian process inference using

control variables. In Advances in Neural Information Pro-

cessing Systems, pages 1681–1688.

Matthews, D. G., Alexander, G., Van Der Wilk, M., Nick-

son, T., Fujii, K., Boukouvalas, A., León-Villagrá, P.,

Ghahramani, Z., and Hensman, J. (2017). Gpflow: A gaus-

sian process library using tensorflow. The Journal of Ma-

chine Learning Research, 18(1):1299–1304.

Merkle, M. (2014). Completely monotone functions: a di-

gest. In Analytic Number Theory, Approximation Theory,

and Special Functions, pages 347–364. Springer.

Mohamed, S., Rosca, M., Figurnov, M., and Mnih, A.

(2019). Monte carlo gradient estimation in machine learn-

ing. arXiv preprint arXiv:1906.10652.

Palmer, J., Kreutz-Delgado, K., Rao, B. D., and Wipf, D. P. (2006). Variational EM algorithms for non-Gaussian latent variable models. In Advances in Neural Information Processing Systems, pages 1059–1066.

Palmer, J. A. (2006). Variational and scale mixture repre-

sentations of non-Gaussian densities for estimation in the

Bayesian linear model: Sparse coding, independent com-

ponent analysis, and minimum entropy segmentation. PhD

thesis, UC San Diego.

Pandit, R. K. and Infield, D. (2018). Comparative analysis

of binning and gaussian process based blade pitch angle

curve of a wind turbine for the purpose of condition mon-

itoring. Journal of Physics: Conference Series, 1102.

Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian

inference for logistic models using pólya–gamma latent

variables. Journal of the American statistical Association,

108(504):1339–1349.

Polson, N. G., Scott, S. L., et al. (2011). Data augmen-

tation for support vector machines. Bayesian Analysis,

6(1):1–23.

Rasmussen, C. E. (2003). Gaussian processes in machine

learning. Springer.

Ridout, M. S. (2009). Generating random numbers from

a distribution specified by its laplace transform. Statistics

and Computing, 19(4):439.

Salimbeni, H., Eleftheriadis, S., and Hensman, J. (2018). Natural gradients in practice: Non-conjugate variational inference in Gaussian process models. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).

Schoenberg, I. J. (1938). Metric spaces and completely

monotone functions. Annals of Mathematics, pages 811–

841.

Titsias, M. (2009). Variational learning of inducing vari-

ables in sparse gaussian processes. In Artificial Intelli-

gence and Statistics, pages 567–574.

Titsias, M. K., Lawrence, N., and Rattray, M. (2008).

Markov chain monte carlo algorithms for gaussian pro-

cesses. Inference and Estimation in Probabilistic Time-

Series Models, 9.

Wang, C. and Blei, D. M. (2013). Variational inference

in nonconjugate models. Journal of Machine Learning

Research, 14(Apr):1005–1031.

Wenzel, F., Galy-Fajou, T., Deutsch, M., and Kloft, M.

(2017). Bayesian nonlinear support vector machines for

big data. In Joint European Conference on Machine

Learning and Knowledge Discovery in Databases, pages

307–322. Springer.

Wenzel, F., Galy-Fajou, T., Donner, C., Kloft, M., and Opper, M. (2019). Efficient Gaussian process classification using Pólya-gamma data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5417–5424.

Widder, D. V. (1946). The Laplace transform. Princeton

university press.

Yeh, J. (2006). Real analysis: theory of measure and inte-

gration second edition. World Scientific Publishing Com-

pany.


A APPENDIX

A.1 Proof of theorem 2

Theorem 2 states:

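Theorem. The complete conditional distributions of the augmented model presented in Section 3.1 are given by

$$p(\omega_i \mid f_i, y_i) = \pi_\phi\left(\omega_i \mid \|h(f_i, y_i)\|_2\right),$$

$$p(f \mid y, \omega) = \mathcal{N}(f \mid \mu, \Sigma),$$

where $\Sigma = \left(\mathrm{diag}(2\omega \circ \gamma(y)) + K^{-1}\right)^{-1}$ and $\mu = \Sigma\left(g(y) + \omega \circ \beta(y) + K^{-1}\mu_0\right)$, $\circ$ denotes the Hadamard product and the function $h(\cdot)$ is given by the form of the likelihood (see Eq. 5).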

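Proof: For the full conditional on $f$ (written here for a zero prior mean; a non-zero $\mu_0$ adds the term $K^{-1}\mu_0$ to $\Sigma^{-1}\mu$):

$$\begin{aligned}
p(f \mid y, \omega) &\propto p(y \mid f, \omega)\, p(f) \\
&\propto \exp\left(g(y)^\top f + (\beta(y) \circ \omega)^\top f - f^\top \mathrm{diag}(\gamma(y) \circ \omega) f - \tfrac{1}{2} f^\top K^{-1} f\right) \\
&\propto \exp\left((g(y) + \beta(y) \circ \omega)^\top f - f^\top \left(\mathrm{diag}(\gamma(y) \circ \omega) + \tfrac{1}{2} K^{-1}\right) f\right).
\end{aligned}$$

We immediately get a multivariate normal distribution with $-\tfrac{1}{2}\Sigma^{-1} = -\left(\mathrm{diag}(\gamma(y) \circ \omega) + \tfrac{1}{2} K^{-1}\right)$ and $\Sigma^{-1}\mu = g(y) + (\beta(y) \circ \omega)$, which corresponds to the result shown in equation (11).

For the augmented variable $\omega_i$:

$$\begin{aligned}
p(\omega_i \mid y_i, f_i) &\propto p(y_i \mid f_i, \omega_i)\, p(\omega_i) \\
&\propto \exp\left(-\|h(y_i, f_i)\|_2^2\, \omega_i\right) \pi_\phi(\omega_i \mid 0) \\
\Rightarrow\; p(\omega_i \mid y_i, f_i) &= \pi_\phi\left(\omega_i \mid \|h(y_i, f_i)\|_2\right).
\end{aligned}$$

Note that equation (9) directly gives the normalization constant $\phi\left(\|h(y_i, f_i)\|_2^2\right)$. QED.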

A.2 Computation of the moments and cumulants for the augmentation variable

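Given the general class of distributions $\pi_\phi(\omega \mid c)$ described in Section 3.1, moments and cumulants can be easily computed. The $k$-th moment of a distribution can be computed by taking the $k$-th derivative of the moment generating function (the Laplace transform $\mathcal{L}$ evaluated at $-t$) at $t = 0$. For example, for the first moment:

$$\begin{aligned}
E_{\pi_\phi(\omega \mid c)}[\omega] &= \left.\frac{d\, \mathcal{L}\{\pi_\phi(\omega \mid c)\}(-t)}{dt}\right|_{t=0} \\
&= \left.\frac{d}{dt}\, \mathcal{L}\left[\frac{e^{-c^2\omega}\pi_\phi(\omega \mid 0)}{\phi(c^2)}\right](-t)\right|_{t=0} \\
&= -\frac{1}{\phi(c^2)} \left.\frac{d}{dt}\, \mathcal{L}\left[\pi_\phi(\omega \mid 0)\right](t + c^2)\right|_{t=0} \\
&= -\frac{1}{\phi(c^2)} \left.\frac{d\phi(t + c^2)}{dt}\right|_{t=0} \\
&= -\left.\frac{d \log \phi(t)}{dt}\right|_{t=c^2} \\
&= -\frac{\phi'(c^2)}{\phi(c^2)} = \bar{\omega}.
\end{aligned}$$

More generally, the $k$-th moment $m_k$ is given by

$$m_k = (-1)^k \frac{1}{\phi(c^2)} \left.\frac{d^k \phi(t)}{dt^k}\right|_{t=c^2},$$

and the cumulants $\kappa_k$ are computed using the cumulant generating function (the log of the moment generating function):

$$\kappa_k = (-1)^k \left.\frac{d^k \log \phi(t)}{dt^k}\right|_{t=c^2}.$$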

A.3 Algorithm for the sparse case

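Algorithm 3 Augmented Stochastic Variational Inference

Input: Data $(X, y)$, GP model $p(y \mid f, u)$, kernel $k$

Output: Approximate posterior $q(u) = \mathcal{N}(u \mid m, S)$

Find inducing point inputs $Z$ via k-means

Compute kernel matrices: $K_Z$, $\kappa = K_{XZ} K_Z^{-1}$

for iteration $t = 1, 2, \ldots$ do

    # Local updates:

    Sample minibatch $\mathcal{B} \subseteq \{1, \ldots, n\}$

    for $i \in \mathcal{B}$ do

        $c_i = \sqrt{E_{q(f)}[h(f_i, y_i)^2]}$

        $\bar{\omega}_i = E_{q(\omega_i)}[\omega_i] = -\phi'(c_i^2)/\phi(c_i^2)$

    end for

    # Natural gradient updates (CAVI):

    $\tilde{S} = \left(\kappa^\top \mathrm{diag}(2\bar{\omega} \circ \gamma(y))\, \kappa + K_Z^{-1}\right)^{-1}$

    $\tilde{m} = \tilde{S}\left(K_Z^{-1}\mu_0 + \kappa^\top (g(y) + \bar{\omega} \circ \beta(y))\right)$

    $\{m, S\} \leftarrow (1 - \rho(t))\{m, S\} + \rho(t)\{\tilde{m}, \tilde{S}\}$

end for

$\rho(t)$ is an arbitrary learning rate respecting the Robbins-Monro condition.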
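To make the alternating local and global updates concrete, the following sketch runs them on a toy problem. It is not the authors' implementation and simplifies the algorithm in several ways that are assumptions of this example: it uses the full (non-sparse) model, i.e. no inducing points and no minibatches, a zero prior mean, a learning rate of 1, and the logistic likelihood, for which the table in appendix A.6 gives $g(y) = y/2$, $\beta = 0$, $\gamma = 1$ and $\phi(r) = \cosh^{-1}(\sqrt{r}/2)$, so that $\bar{\omega}_i = \tanh(c_i/2)/(4c_i)$. The covariance update is written in an algebraically equivalent Woodbury form to avoid inverting the kernel matrix.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix for 1-D inputs of shape (n, 1)."""
    sq_dist = (X[:, None, 0] - X[None, :, 0]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

def cavi_logistic(K, y, n_iter=100):
    """Coordinate-ascent updates for the augmented logistic GP (full model)."""
    g = 0.5 * y                                   # g(y) = y / 2
    m, S = np.zeros(len(y)), K.copy()             # start q(f) at the prior
    for _ in range(n_iter):
        # Local updates: c_i^2 = E_q[f_i^2], omega_i = -phi'(c_i^2) / phi(c_i^2)
        c = np.sqrt(m**2 + np.diag(S))
        omega = np.tanh(0.5 * c) / (4.0 * c)
        # Global updates: S = (diag(2 omega) + K^{-1})^{-1}, via Woodbury
        noise = np.diag(1.0 / (2.0 * omega))
        S = K - K @ np.linalg.solve(K + noise, K)
        S = 0.5 * (S + S.T)                       # remove rounding asymmetry
        m = S @ g
    return m, S, omega

# Toy data: two separated 1-D clusters with labels -1 and +1.
X = np.concatenate([np.linspace(-3, -1, 20), np.linspace(1, 3, 20)])[:, None]
y = np.concatenate([-np.ones(20), np.ones(20)])
m, S, omega = cavi_logistic(rbf_kernel(X), y)
print(np.mean(np.sign(m) == y))
```

Because both blocks of updates are available in closed form, no step size has to be tuned in this full-batch setting; the learning rate $\rho(t)$ only becomes necessary once minibatches make the global update noisy.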

A.4 ELBO Analysis

A.4.1 Full ELBO

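$$\mathrm{ELBO} = \sum_{i=1}^{N} E_{q(f_i, \omega_i)}[\log p(y_i \mid f_i, \omega_i)] - \mathrm{KL}[q(f)\,\|\,p(f)] - \sum_{i=1}^{N} \mathrm{KL}[q(\omega_i)\,\|\,p(\omega_i)]$$

With $\bar{\omega}_i = E_{q(\omega_i)}[\omega_i]$, the individual terms are

$$\begin{aligned}
E_q[\log p(y_i \mid f_i, \omega_i, \theta)] &= \log C(\theta) + g(y_i, \theta)\, E_{q(f)}[f_i] - E_{q(f)}\left[h(f_i, y_i)^2\right] E_{q(\omega_i)}[\omega_i] \\
&= \log C(\theta) + g(y_i, \theta)\, m_i - \left(\alpha(y_i) - \beta(y_i)\, m_i + \gamma(y_i)\left(m_i^2 + S_{ii}\right)\right)\bar{\omega}_i
\end{aligned}$$

$$\mathrm{KL}[q(f)\,\|\,p(f)] = \frac{1}{2}\left(\log\frac{|K|}{|S|} - N + \mathrm{tr}(K^{-1}S) + (\mu_0 - m)^\top K^{-1} (\mu_0 - m)\right)$$

$$\mathrm{KL}[q(\omega_i)\,\|\,p(\omega_i)] = -E_{q(\omega_i)}\left[c_i^2 \omega_i\right] - \log\phi(c_i^2) = -c_i^2\, \bar{\omega}_i - \log\phi(c_i^2)$$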

Note that we can take the derivatives of the ELBO and set them to 0 to recover exactly the updates in algorithm 1.


A.4.2 Analysis of the optima

By setting $c_i^2$ as a function of $m$ and $S$ (and setting $\mu_0$ to 0 for simplicity) we can get an ELBO depending only on the variational parameters of $f$:

$$\mathrm{ELBO}(m, S) = C + g^\top m + \frac{1}{2}\underbrace{\left(\log|S| - \mathrm{tr}(K^{-1}S) - m^\top K^{-1} m\right)}_{\mathrm{ELBO}_1} + \underbrace{\sum_i \log\phi\left(m_i^2 + S_{ii}\right)}_{\mathrm{ELBO}_2}$$

It is easy to show with a short matrix analysis that $\mathrm{ELBO}_1$ is jointly concave in $m$ and $S$. However, $\mathrm{ELBO}_2$ is more complex: $m_i^2 + S_{ii}$ is jointly convex in $m$ and $S$, and $\phi(r)$ is by definition convex as well, but $\phi(m_i^2 + S_{ii})$ is neither jointly convex nor jointly concave in $m$ and $S$. It is therefore impossible to guarantee that there is a global optimum; however, the CAVI updates guarantee convergence to a local optimum.

A.4.3 ELBO Gap

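For a fixed $q(f)$ we can compare the ELBO of the original model $\mathcal{L}_{\mathrm{std}}(q(f))$ and that of the augmented model $\mathcal{L}_{\mathrm{aug}}(q(f)q(\omega))$. It is then straightforward to compute the difference between the two. Writing $\bar{\omega} = E_{q(\omega)}[\omega]$ and taking the logistic case, where $p(\omega) = \mathrm{PG}(\omega \mid 1, 0)$ and $h(f, y)^2 = f^2$:

$$\begin{aligned}
\Delta\mathcal{L} &= \mathcal{L}_{\mathrm{std}}(q(f)) - \mathcal{L}_{\mathrm{aug}}(q(f)q(\omega)) \\
&= E_{q(f)}\left[\log p(y, f) - \log q(f)\right] - E_{q(f)q(\omega)}\left[\log p(y, f, \omega) - \log q(f)q(\omega)\right] \\
&= E_{q(f)q(\omega)}\left[-\log\frac{p(y, f, \omega)}{p(y, f)} + \log q(\omega)\right] \\
&= E_{q(f)q(\omega)}\left[-\log p(\omega \mid y, f) + \log q(\omega)\right] \\
&= E_{q(\omega)}\left[\log q(\omega) - E_{q(f)}[\log p(\omega \mid y, f)]\right] \\
&= -c^2 E_{q(\omega)}[\omega] + E_{q(\omega)}[\log \mathrm{PG}(\omega \mid 1, 0)] - \log\phi(c^2) \\
&\quad + E_{q(f)}[f^2]\, E_{q(\omega)}[\omega] - E_{q(\omega)}[\log \mathrm{PG}(\omega \mid 1, 0)] + E_{q(f)}[\log\phi(f^2)] \\
&= -c^2 \bar{\omega} - \log\phi(c^2) + E_{q(f)}[f^2]\, \bar{\omega} + E_{q(f)}[\log\phi(f^2)]
\end{aligned}$$

Replacing with the optimal $q^*(\omega) = \frac{e^{-c^2\omega} p(\omega)}{\phi(c^2)}$ with $c^2 = E_{q(f)}[f^2]$:

$$\Delta\mathcal{L}^* = -\log\phi(c^2) + E_{q(f)}[\log\phi(f^2)]$$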

A.4.4 Sparse ELBO

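When using the inducing points approach the ELBO becomes:

$$\mathrm{ELBO} = \sum_{i=1}^{N} E_{q(f_i, u_i, \omega_i)}[\log p(y_i \mid f_i, u_i, \omega_i)] - \mathrm{KL}[q(u)\,\|\,p(u)] - \sum_{i=1}^{N} \mathrm{KL}[q(\omega_i)\,\|\,p(\omega_i)]$$

With $\bar{\omega}_i = E_{q(\omega_i)}[\omega_i]$, the individual terms are

$$\begin{aligned}
E_q[\log p(y_i \mid f_i, \omega_i, \theta)] &= \log C(\theta) + g(y_i, \theta)\, E_{q(f,u)}[f_i] - E_{q(f,u)}\left[h(f_i, y_i)^2\right] E_{q(\omega_i)}[\omega_i] \\
&= \log C(\theta) + g(y_i, \theta)\,(\kappa^\top m)_i - \left(\alpha(y_i) - \beta(y_i)\,(\kappa^\top m)_i + \gamma(y_i)\left((\kappa^\top m)_i^2 + (\kappa^\top S \kappa)_{ii}\right)\right)\bar{\omega}_i
\end{aligned}$$

$$\mathrm{KL}[q(f)\,\|\,p(f)] = \frac{1}{2}\left(\log\frac{|K|}{|S|} - N + \mathrm{tr}(K^{-1}S) + (\mu_0 - m)^\top K^{-1} (\mu_0 - m)\right)$$

$$\mathrm{KL}[q(\omega_i)\,\|\,p(\omega_i)] = -E_{q(\omega_i)}\left[c_i^2 \omega_i\right] - \log\phi(c_i^2) = -c_i^2\, \bar{\omega}_i - \log\phi(c_i^2)$$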

A.5 Proof of equivalence between Jaakkola bound and data augmentation

Jaakkola and Jordan (2000) proposed an approach purely based on optimization. They assume that $\log p(y \mid f)$ contains a part that is convex in $f^2$: $\log p(y \mid f) = \log p_{\mathrm{convex}}(f) + \log p_{\mathrm{non\text{-}convex}}(f, y)$. Writing $p_c$ for $p_{\mathrm{convex}}$ and using convexity, they create a bound with a first-order Taylor expansion around an additional variable $c^2$:

$$\log p_c(f) \geq \log p_c(c) + \frac{d \log p_c(c)}{dc^2}\left(f^2 - c^2\right)$$

Putting it back in the full ELBO, they obtain a part that is quadratic in $f$ and analytically differentiable, and they only need to optimize the additional variables $\{c_i\}$. Merkle (2014) shows that any completely monotone function is log-convex, i.e. $\log\phi(r)$ is convex. Therefore we can replace $\log p_c(c)$ by $\log\phi(r)$ to recover our model in the context of variational inference. Note that the converse does not hold; complete monotonicity is therefore a stronger assumption.
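As a worked instance of this equivalence, consider the logistic likelihood with $\phi(r) = \cosh^{-1}(\sqrt{r}/2)$ from appendix A.6. The derivation below is a sketch added for illustration; it only uses the definitions above and the first-moment formula of appendix A.2.

```latex
\begin{aligned}
\log\phi(r) &= -\log\cosh\!\left(\tfrac{\sqrt{r}}{2}\right),
\qquad
\frac{d\log\phi(r)}{dr} = -\frac{\tanh(\sqrt{r}/2)}{4\sqrt{r}}, \\
\log\phi(f^2) &\geq \log\phi(c^2) - \frac{\tanh(c/2)}{4c}\left(f^2 - c^2\right), \\
\log\sigma(yf) &= \frac{yf}{2} - \log 2 + \log\phi(f^2)
\;\geq\; \frac{yf}{2} - \log 2 - \log\cosh\!\left(\tfrac{c}{2}\right)
 - \frac{\tanh(c/2)}{4c}\left(f^2 - c^2\right).
\end{aligned}
```

The coefficient of $f^2$ in the bound, $\tanh(c/2)/(4c)$, equals $-\phi'(c^2)/\phi(c^2)$, i.e. the expectation $E_{q(\omega)}[\omega]$ used in the augmented updates, which is how the two viewpoints coincide for this likelihood.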

A.6 Likelihoods used for the experiments

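We detail all likelihoods used for the experiments and their formulation as in equation (4).

Laplace likelihood: $\mathrm{Laplace}(y \mid f, \beta) = \frac{1}{2\beta}\exp\left(-\frac{|f - y|}{\beta}\right)$

Logistic likelihood: $p(y \mid f) = \sigma(yf) = \frac{e^{yf/2}}{2\cosh(|f|/2)}$

Student-T likelihood: $p(y \mid f) = \frac{\Gamma((\nu+1)/2)}{\Gamma(\nu/2)\sqrt{\pi\nu}}\left(1 + \frac{(y - f)^2}{\nu}\right)^{-(\nu+1)/2}$

Matern 3/2 likelihood: $p(y \mid f) = \frac{4\rho}{\sqrt{3}}\left(1 + \frac{\sqrt{3(y - f)^2}}{\rho}\right)\exp\left(-\frac{\sqrt{3(y - f)^2}}{\rho}\right)$

| Likelihood | $C(\theta)$ | $g(y, \theta)$ | $\|h(y, f, \theta)\|_2^2$ | $\alpha(y)$ | $\beta(y)$ | $\gamma(y)$ | $\phi(r)$ |
|---|---|---|---|---|---|---|---|
| Laplace | $(2\beta)^{-1}$ | $0$ | $(y - f)^2$ | $y^2$ | $2y$ | $1$ | $e^{-\sqrt{r}/\beta}$ |
| Logistic | $2^{-1}$ | $y/2$ | $f^2$ | $0$ | $0$ | $1$ | $\cosh^{-1}(\sqrt{r}/2)$ |
| Student-T | $\Gamma((\nu+1)/2)/(\Gamma(\nu/2)\sqrt{\pi\nu})$ | $0$ | $(y - f)^2$ | $y^2$ | $2y$ | $1$ | $(1 + r/\nu)^{-(\nu+1)/2}$ |
| Matern 3/2 | $4\rho/\sqrt{3}$ | $0$ | $(y - f)^2$ | $y^2$ | $2y$ | $1$ | $(1 + \sqrt{3r}/\rho)\, e^{-\sqrt{3r}/\rho}$ |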


A.7 Extra figures

A.7.1 Autocorrelation plots

Figure 4. Auto-correlation plots for differents with lags from 1 to 10


A.7.2 HMC Results

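| Step size | Metric | nstep = 1 | nstep = 2 | nstep = 5 | nstep = 10 |
|---|---|---|---|---|---|
| 0.01 | Time/Sample (s) | 0.037 | 0.045 | 0.077 | 0.133 |
| 0.01 | Lag 1 | 0.999 | 0.993 | 0.978 | 0.963 |
| 0.01 | Gelman | 3.14 | 1.02 | 1.00 | 2.05 |
| 0.05 | Time/Sample (s) | 0.036 | 0.046 | 0.080 | 0.12 |
| 0.05 | Lag 1 | 0.999 | 0.998 | 0.931 | 0.948 |
| 0.05 | Gelman | 1.72 | 1.18 | 1.01 | 3.25 |
| 0.1 | Time/Sample (s) | 0.033 | 0.042 | 0.073 | 0.13 |
| 0.1 | Lag 1 | 0.997 | 0.996 | 0.998 | 0.994 |
| 0.1 | Gelman | 1.11 | 1.04 | 1.27 | 2.71 |

Table 3. HMC results for the Laplace likelihood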

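| Step size | Metric | nstep = 1 | nstep = 2 | nstep = 5 | nstep = 10 |
|---|---|---|---|---|---|
| 0.01 | Time/Sample (s) | 0.675 | 0.110 | 0.177 | 0.251 |
| 0.01 | Lag 1 | 0.999 | 0.999 | 0.997 | 0.993 |
| 0.01 | Gelman | 3.14 | 1.74 | 1.11 | 1.02 |
| 0.05 | Time/Sample (s) | 0.148 | 0.192 | 0.336 | 0.573 |
| 0.05 | Lag 1 | 0.997 | 0.993 | 0.962 | 0.857 |
| 0.05 | Gelman | 1.10 | 1.02 | 1.00 | 1.00 |
| 0.1 | Time/Sample (s) | 0.142 | 0.193 | 0.337 | NA |
| 0.1 | Lag 1 | 0.993 | 0.976 | 0.864 | NA |
| 0.1 | Gelman | 1.03 | 1.01 | 1.00 | NA |

Table 4. HMC results for the Student-T likelihood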

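| Step size | Metric | nstep = 1 | nstep = 2 | nstep = 5 | nstep = 10 |
|---|---|---|---|---|---|
| 0.01 | Time/Sample (s) | 0.009 | 0.013 | 0.021 | 0.041 |
| 0.01 | Lag 1 | 0.999 | 0.999 | 0.998 | 0.994 |
| 0.01 | Gelman | 3.19 | 1.68 | 1.12 | 1.02 |
| 0.05 | Time/Sample (s) | 0.011 | 0.014 | 0.025 | 0.41 |
| 0.05 | Lag 1 | 0.998 | 0.994 | 0.968 | 0.871 |
| 0.05 | Gelman | 1.11 | 1.03 | 1.00 | 1.00 |
| 0.1 | Time/Sample (s) | 0.011 | 0.014 | 0.024 | 0.048 |
| 0.1 | Lag 1 | 0.994 | 0.979 | 0.875 | 0.532 |
| 0.1 | Gelman | 1.02 | 1.01 | 1.00 | 1.00 |

Table 5. HMC results for the Logistic likelihood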


A.7.3 ELBO difference

a) Student-T likelihood on the Boston Housing dataset

b) Laplace likelihood on the Boston Housing dataset

Figure 5. Converged negative ELBO and averaged negative log-likelihood on a held-out dataset as a function of the RBF kernel lengthscale, training VI with and without augmentation.


A.7.4 Convergence speed

a) Logistic likelihood on the HIGGS dataset

b) Matern 3/2 likelihood on the Airline dataset

c) Student-T likelihood on the Protein dataset

Figure 6. Supplementary convergence plots


Flexible and Efficient Inference with

Particles for the Variational

Gaussian Approximation

This last published work is different from the previous chapters. Instead of focusing on the representation

of the model, we aim at changing the variational distribution representation. The original motivation

behind this work was to answer the question: Can we fit a full Gaussian variational distribution to a

target distribution without matrix inverses, log-determinant, or second-order derivative computations?

The answer resulted in a particle approach: we parametrize the distribution with an arbitrary number

of points in the variable domain instead of the mean and covariance. Although the method might not

be a state-of-the-art approach for variational inference, it brings insights concerning convergence speed

and accuracy of the given posterior.

Authors:

Théo Galy-Fajou (1), Valerio Perrone (2), Manfred Opper (1, 3)

1TU Berlin, Germany, 2Amazon Web Services, 3University of Birmingham

Details:

Type: Journal article

Submitted: June 2021

Accepted: July 2021

DOI: https://doi.org/10.3390/e23080990

Journal: Entropy (Special edition on Approximate Bayesian Inference)

License: Creative Commons Attribution (CC BY 4.0)

6. Flexible and Efficient Inference with Particles for the Variational Gaussian

Approximation

Contributions:

For an explanation of the terms, see the Contributor Roles Taxonomy (CRediT).

T.G-F. V.P. M.O.

Conceptualization ✓ ✓

Methodology ✓ ✓ ✓

Formal Analysis ✓

Software ✓

Investigation ✓

Writing - Original Draft ✓ ✓ ✓

Writing - Review & Editing ✓ ✓ ✓

Supervision ✓

Funding Acquisition ✓

entropy

Article

Flexible and Efficient Inference with Particles for the

Variational Gaussian Approximation

Théo Galy-Fajou 1,*, Valerio Perrone 2and Manfred Opper 1,3





Citation: Galy-Fajou, T.; Perrone, V.;

Opper, M. Flexible and Efficient

Inference with Particles for the

Variational Gaussian Approximation.

Entropy 2021,23, 990. https://doi.org/

10.3390/e23080990

Academic Editor: Pierre Alquier

Received: 22 June 2021

Accepted: 21 July 2021

Published: 30 July 2021


Licensee MDPI, Basel, Switzerland.

This article is an open access article

distributed under the terms and

conditions of the Creative Commons

Attribution (CC BY) license (https://

creativecommons.org/licenses/by/

4.0/).

1Artificial Intelligence Group, Technische Universität Berlin, 10623 Berlin, Germany;

[email protected]

2Amazon Web Services, 10969 Berlin, Germany; [email protected]

3Centre for Systems Modelling and Quantitative Biomedicine, University of Birmingham,

Birmingham B15 2TT, UK

*Correspondence: [email protected]

Abstract:

Variational inference is a powerful framework, used to approximate intractable posteriors

through variational distributions. The de facto standard is to rely on Gaussian variational families,

which come with numerous advantages: they are easy to sample from, simple to parametrize,

and many expectations are known in closed-form or readily computed by quadrature. In this

paper, we view the Gaussian variational approximation problem through the lens of gradient flows.

We introduce a flexible and efficient algorithm based on a linear flow leading to a particle-based

approximation. We prove that, with a sufficient number of particles, our algorithm converges linearly

to the exact solution for Gaussian targets, and a low-rank approximation otherwise. In addition to

the theoretical analysis, we show, on a set of synthetic and real-world high-dimensional problems,

that our algorithm outperforms existing methods with Gaussian targets while performing on a par

with non-Gaussian targets.

Keywords: variational inference; Gaussian; particle flow; variable flow

1. Introduction

Representing uncertainty is a ubiquitous problem in machine learning. Reliable uncertainties are key for decision making, especially in contexts where the trade-off between exploitation and exploration plays a central role, such as Bayesian optimization [ ], active learning [ ], and reinforcement learning [ ]. While Bayesian inference is a principled tool to provide uncertainty estimation, computing posterior distributions is intractable for many problems of interest. Most sampling methods struggle to scale up to large datasets [ ], while the diagnosis of convergence is not always straightforward [ ]. On the other hand, Variational Inference (VI) methods can rely on well-understood optimization techniques and scale well to large datasets, at the cost of an approximation quality depending heavily on the assumptions made. The Gaussian family is by far the most popular variational approximation used in VI [ ]. This is for several reasons. First, Gaussian variational families are easy to sample from, reparametrize, and marginalize. Second, they are easily amenable to diagonal covariance approximations, making them scalable to high dimensions. Third, most expectations are either easily computable by quadrature or Monte Carlo integration, or known in closed form.

A large body of work covers different approaches to optimize the Variational Gaussian Approximation (VGA), with the speed of convergence and the scalability in dimensions as the main concerns. From the perspective of convergence speed, the major bottleneck when computing gradients with stochastic estimators is the estimator variance [ ]. Particle-based methods with deterministic paths do not have this issue, and have been proven to be highly successful in many applications [ – ]. However, can we use a particle-based algorithm to compute a VGA? If so, what are its properties and is it competitive with other VGA methods?

In this paper, we attempt to answer these questions by introducing the Gaussian Particle Flow (GPF), a framework to approximate a Gaussian variational distribution with particles. GPF is derived from a continuous-time flow, where the necessary expectations over the evolving densities are approximated by particles. The complexity of the method grows quadratically with the number of particles but linearly with the dimension, remaining compatible with other approximations such as structured mean-field approximations. Using the same dynamics, we also derive a stochastic version of the algorithm, Gaussian Flow (GF). To show convergence, we prove the decrease in an empirical version of the free energy that is valid for a finite number of particles. For the special case of $D$-dimensional Gaussian target densities, we show that $D + 1$ particles are enough to obtain convergence to the true distribution. We also find, for this case, that convergence is exponentially fast. Finally, we compare our approach with other VGA algorithms, both in fully controlled synthetic settings and on a set of real-world problems.

2. Related Work

The goal of Bayesian inference is to carry out computations with the posterior distribution of a latent variable $x \in \mathbb{R}^D$ given some observations $y$. By Bayes theorem, the posterior distribution is $p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$, where $p(y \mid x)$ and $p(x)$ are, respectively, the likelihood and the prior distribution. Even if the likelihood and the prior are known analytically, marginalizing out high-dimensional variables in the product $p(y \mid x)\, p(x)$ in order to compute quantities such as $p(y)$ is typically intractable. Variational Inference (VI) aims to simplify this problem by turning it into an optimization one. The intractable posterior is approximated by the closest distribution within a tractable family, with closeness being measured by the Kullback-Leibler (KL) divergence, defined by

$$\mathrm{KL}[q(x)\,\|\,p(x)] = E_q[\log q(x) - \log p(x)],$$

where $E_q[f(x)] = \int f(x)\, q(x)\, dx$ denotes the expectation of $f$ over $q$. Denoting by $\mathcal{Q}$ a family of distributions, we look for

$$\arg\min_{q \in \mathcal{Q}} \mathrm{KL}[q(x)\,\|\,p(x \mid y)].$$

Since $p(y)$ is not computable in an efficient way, we equivalently minimize the upper bound $\mathcal{F}$:

$$\mathrm{KL}[q(x)\,\|\,p(x \mid y)] \leq \mathcal{F}[q] = -E_q[\log p(y \mid x)\, p(x)] - H_q, \qquad (1)$$

where $H_q$ is the entropy of $q$ ($-E_q[\log q(x)]$). Here, $\mathcal{F}$ is known as the variational free energy and $-\mathcal{F}$ is known as the Evidence Lower BOund (ELBO). A diverse set of approaches to perform VI with Gaussian families has been developed in the literature, which we review in the following.

2.1. The Variational Gaussian Approximation

The VGA is the restriction of $\mathcal{Q}$ to be the family of multivariate Gaussian distributions $q(x) = \mathcal{N}(m, C)$, where $m \in \mathbb{R}^D$ is the mean and $C \in \{A \in \mathbb{R}^{D \times D} \mid x^\top A x \geq 0,\ \forall x \in \mathbb{R}^D\}$ the covariance matrix, for which the free energy is found to be

$$\mathcal{F}[q] = -\frac{1}{2}\log|C| + E_q[\phi(x)], \qquad (2)$$

where $\phi(x) = -\log(p(y \mid x)\, p(x))$. A standard descent algorithm based on gradients of Equation (2) with respect to the variational parameters gives rise to some issues. First, naively computing the gradient of the expectation with respect to the covariance matrix involves unwanted second derivatives of $\phi(x)$ [ ], which may not be available or may be computationally too expensive in a black-box setting. Second, the gradient of the entropy term entails inverting a non-sparse matrix, which we would like to avoid for higher-dimensional cases. Finally, the positive-definiteness of the covariance matrix leads to non-trivial constraints on parameter updates, which can lead to a slowdown of convergence or, if ignored, to instabilities in the algorithm.

To solve these issues, a variety of approaches have been proposed in the literature. If we focus on factorizable models, we can make a simplification: for problems with likelihoods that can be rewritten as $p(y \mid x) = \prod_{d=1}^{D} p(y \mid x_d)$, the number of independent variational parameters is reduced to $2D$ [ ]. In this special case, the Gaussian expectations in the free energy (2) split into a sum of 1-dimensional integrals, which can be efficiently computed by using numerical quadrature methods. To extend to the general case, gradients of the free energy are estimated by a stochastic sampling approach, which also forms the starting point of our method. This relies on the so-called reparametrization trick, where the expectation over the parameter-dependent variational density $q_\theta$ is replaced by an expectation over a fixed density $q^0$ instead. This facilitates the gradient computation because unwanted derivatives of the type $\nabla_\theta q_\theta(x)$ are avoided. For the Gaussian case, the reparametrization trick is a linear transformation of an arbitrary $D$-dimensional Gaussian random variable $x \sim q_\theta(x)$ in terms of a $D$-dimensional Gaussian random variable $x^0 \sim q^0 = \mathcal{N}(m^0, C^0)$:

$$x = \Gamma(x^0 - m^0) + m, \qquad (3)$$

where $\Gamma \in \mathbb{R}^{D \times D}$ and $m \in \mathbb{R}^D$ are the variational parameters. We assume that the covariance $C^0$ is not degenerate and, for simplicity, we set it as the identity. For instance, the gradient with respect to the mean $m$ of the expectation of a function $f$ over $q$ becomes $\nabla_m E_q[f(x)] = E_{q^0}\left[\nabla_m f(\Gamma(x^0 - m^0) + m)\right]$. This can be simply proved by using the reparametrization (3) inside the integral and passing the gradient inside; for more details, see [14].

Given this representation, the free energy is easily obtained as a function of the variational parameters:

$$\mathcal{F}(q) = -\log|\Gamma| + E_{q^0}\left[\phi(\Gamma(x^0 - m^0) + m)\right]. \qquad (4)$$
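Equations (3) and (4) translate directly into a Monte Carlo estimator. The following sketch is an illustration, not the method proposed in this paper: it assumes a base density with zero mean and identity covariance, a hypothetical quadratic target $\phi(x) = \frac{1}{2}x^\top A x$ chosen because its expectation is known in closed form, and hypothetical function names. It estimates both the free energy and its gradient with respect to the mean, and compares them with the exact values.

```python
import numpy as np

def sample_q(m, Gamma, n_samples, rng):
    """Draw x = Gamma x0 + m with x0 ~ N(0, I), i.e. Equation (3) with m0 = 0."""
    x0 = rng.standard_normal((n_samples, len(m)))
    return x0 @ Gamma.T + m

def free_energy(m, Gamma, phi, n_samples, rng):
    """Monte Carlo estimate of Equation (4): -log|Gamma| + E_q0[phi(x)]."""
    x = sample_q(m, Gamma, n_samples, rng)
    _, log_abs_det = np.linalg.slogdet(Gamma)
    return float(-log_abs_det + np.mean(phi(x)))

# Hypothetical Gaussian-like target: phi(x) = 0.5 * x^T A x, grad phi(x) = A x.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
phi = lambda x: 0.5 * np.einsum("ni,ij,nj->n", x, A, x)
grad_phi = lambda x: x @ A            # A is symmetric

m = np.array([1.0, -1.0])
Gamma = np.array([[1.0, 0.0], [0.5, 0.8]])
rng = np.random.default_rng(0)

F_mc = free_energy(m, Gamma, phi, 200_000, rng)
F_exact = -np.log(np.linalg.det(Gamma)) + 0.5 * (m @ A @ m + np.trace(A @ Gamma @ Gamma.T))

# Reparametrized gradient w.r.t. the mean: average of grad phi over the samples.
grad_m_mc = grad_phi(sample_q(m, Gamma, 200_000, rng)).mean(axis=0)
grad_m_exact = A @ m
print(F_mc, F_exact, grad_m_mc, grad_m_exact)
```

The estimator only requires first derivatives of $\phi$, but its accuracy is governed by the Monte Carlo variance, which is the bottleneck mentioned in the introduction.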

Other representations are possible. Challis and Barber [13] and Ong et al. [15] use a different reparametrization with a factorized structure of the covariance $C = \Gamma^\top\Gamma + \mathrm{diag}(d)$, where $\Gamma \in \mathbb{R}^{D \times P}$ and $d \in \mathbb{R}^D$, and where $P \leq D$ is the rank of $\Gamma^\top\Gamma$. Other representations assume special structures of the precision matrix $\Lambda = C^{-1}$, which make it possible to enforce special properties, such as sparsity in [16,17].

In general, these methods tend to scale poorly with the number of dimensions, as one needs to optimize $O(D^2)$ parameters. The (structured) Mean-Field (MF) [ ] approach imposes independence between variables in the variational distribution. The number of variational parameters is then $2D$, but covariance information between dimensions is lost.

2.2. Natural Gradients

Besides the issue of expectations, more efficient optimizations directions, beyond

ordinary gradient descent, have been considered. These can help to deal with constraints

such as those given for the covariance matrix. Natural gradients [

] are a special case of

Riemannian gradients and utilize the specific Riemannian manifold structure of variational

parameters. They can often deal with constraints of parameters (such as the positive

definiteness of the covariance), accelerate inference, and improve the convergence of

algorithms. The application of such advanced gradient methods typically requires an

estimate of the inverse Fisher information matrix as a preconditioner of ordinary gradients.

Khan and Nielsen

[21]

and Lin et al.

[22]

propose a solution that requires extra second

derivatives of the log–posteriors. Salimbeni et al.

[23]

developed an automatic process to

compute these without the second derivatives but with instability issues. Lin et al.

[17]

solved these issues by using geodesics on the manifold of parameters, at the price of having

to compute inverse matrices as well as Hessians.

2.3. Particle-Based VI

Stochastic gradient descent methods compute expectations (and gradients) at each

time step with new independent Monte Carlo samples drawn from the current approxi-

mation of the variational density. Particle-based methods for variational inference draw

samples only once at the beginning of the algorithm instead. They iteratively construct

transformations of an initial random variable (having a simple tractable density) where the

transformed density leads to the decrease and finally to the minimum of the variational free

energy. The iterative approach induces a deterministic temporal flow of random variables

which depends on the current density of the variable itself. Using an approximation by the

empirical density (which is represented by the positions of a set of ’particles’) one obtains a

flow of interacting particles which converges asymptotically to an empirical approximation

of the desired optimal variational density.

The most popular approach is Stein Variational Gradient Descent (SVGD) [ ], which computes a nonparametric transformation based on the kernelized Stein discrepancy [ ]. SVGD has the advantage of not being restricted to a parametric form of the variational distribution. However, using standard distance-based kernels like the squared exponential kernel ($k(x, y) = \exp(-\|x - y\|_2^2)$) can lead to underestimated covariances and poor performance in high dimensions [ ]. Hence, it is interesting to develop particle approaches that approximate the VGA. We provide a more thorough comparison between our method and SVGD in Section 3.6.

2.4. GVA in Bayesian Neural Networks

There has been increased interest in making Bayesian Neural Networks

(BNN)

by adding

priors to Neural Networks parameters. The true form of the posterior is unknown but

VGA

has been used due to its ease of use and scalability with the number of dimensions

(typically

D

). Most of the aforementioned methods apply to

BNN

, but techniques

have been specifically tailored with

BNN

in mind. [ ] use the low-rank structure of [ ] but exploit the Local Reparametrization Trick, where each datapoint gets a different sample from $q$ in order to reduce the variance of the stochastic gradient estimator. In Stochastic Weight Averaging-Gaussian (SWAG) [ ], a set of particles obtained via stochastic gradient descent represents a low-rank Gaussian distribution, approximating the true posterior with a prior implicitly produced by the network's regularization. While easy to implement, SWAG does not allow an explicit prior to be incorporated, and the resulting distribution does not derive from a principled Bayesian approach.

2.5. Related Approaches

The closest approach to our proposed method is the Ensemble Kalman Filter (EKF) [ ]. It assumes that the posterior is computed in a sequential way, where, at each time step, only single data observations (or small batches of them), represented by their likelihoods, become available. An ensemble of particles, representing a Gaussian distribution, is iteratively updated with every new batch of observations. EKF allows us to work on high-dimensional problems with a limited number of particles but is restricted to factorizable likelihoods for which a sequential representation is possible. While EKF maintains a representation of a Gaussian posterior, it is not clear how this relates to the goal of minimizing the free energy or the KL divergence.

3. Gaussian (Particle) Flow

We introduce Gaussian Particle Flow (GPF) and Gaussian Flow (GF), two computationally tractable approaches, to obtain a Variational Gaussian Approximation (VGA). In the following, we derive deterministic linear dynamics, which decrease the variational free energy. We additionally give some variants with a Mean-Field (MF) approach and prove theoretical convergence guarantees.

In the following, $\frac{d(\cdot)}{dt}$ indicates the total derivative with respect to time, $\frac{\partial(\cdot)}{\partial t}$ partial derivatives with respect to time, and $\nabla_x(\cdot)$ gradients with respect to a vector $x$.

3.1. Gaussian Variable Flows

We next discuss an alternative approach to generate the desired transformation of random variables, leading from a simple (prior) Gaussian density to a more complex Gaussian, which minimizes the variational free energy. It is based on the idea of variable flows, i.e., recursive deterministic transformations of the random variables defined by a mapping $x^{n+1} = x^n + \epsilon f^n(x^n)$, where $f^n : \mathbb{R}^D \to \mathbb{R}^D$. Well-known examples of flows are Normalizing Flows [ ], where the $f^n$ are bijections, or Neural ODEs [ ], where $f^n = f$ is defined by a neural network and $x^0$ is the input. For simplicity, we will consider small changes $\epsilon \to 0$ and work with flows in the continuous-time limit ($t = n\epsilon$), which follow a system of Ordinary Differential Equations (ODE). For the Gaussian case, in the spirit of the reparametrization trick (3), we choose a corresponding linear map $f$ and write

$$\frac{dx^t}{dt} = f^t(x^t) = A^t(x^t - m^t) + b^t, \tag{5}$$

where $A^t$ is a matrix and $m^t \doteq E_{q^t}[x]$ (which is no longer interpreted as an independent variational parameter). When the initial random variable $x^0$ is Gaussian distributed, the vectors $x^t$ are also Gaussian for any $t$. To construct a flow that decreases the free energy over time, we can either compute the time derivative of the specific free energy (2) induced by the ODE (5), or simply derive the general result valid for smooth maps $f$ (see, e.g., [ ]). To be self-contained, we briefly repeat the main steps. We first compute the change of the free energy in terms of the time derivative of $q^t$:

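$$\begin{aligned}
\frac{d\mathcal{F}[q^t]}{dt} &= \frac{d}{dt}\int q^t(x)\left(\log q^t(x) + \phi(x)\right)dx \\
&= \int \frac{\partial q^t(x)}{\partial t}\left(\log q^t(x) + \phi(x)\right)dx + \int q^t(x)\left(\frac{1}{q^t(x)}\frac{\partial q^t(x)}{\partial t} + \frac{\partial \phi(x)}{\partial t}\right)dx \\
&= \int \frac{\partial q^t(x)}{\partial t}\left(\log q^t(x) + \phi(x)\right)dx,
\end{aligned}$$

where we have used the fact that $\int \frac{\partial q^t(x)}{\partial t}dx = \frac{d}{dt}\int q^t(x)dx = 0$ and $\frac{\partial \phi(x)}{\partial t} = 0$. We next use the continuity equation for the density

$$\frac{\partial q^t(x)}{\partial t} = -\nabla_x \cdot \left(q^t(x) f^t(x)\right),$$

related to the deterministic flow, to obtain

$$\begin{aligned}
\frac{d\mathcal{F}[q^t]}{dt} &= -\int \nabla_x \cdot \left(q^t(x) f^t(x)\right)\left(\log q^t(x) + \phi(x)\right)dx \\
&= \int q^t(x) f^t(x) \cdot \nabla_x\left(\log q^t(x) + \phi(x)\right)dx \\
&= \int \left(\nabla_x q^t(x) \cdot f^t(x) + q^t(x) f^t(x) \cdot \nabla_x \phi(x)\right)dx \\
&= -E_{q^t}\left[\nabla_x \cdot f^t(x) - f^t(x) \cdot \nabla_x \phi(x)\right],
\end{aligned}$$

where we have applied Green's identity twice and used the fact that $\lim_{x \to \infty} q^t(x) = 0$.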

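Specializing to the linear flow (5), we obtain

$$\frac{d\mathcal{F}[q^t]}{dt} = -\mathrm{tr}\left[A^t (A^t_\star)^\top\right] - (b^t)^\top b^t_\star, \tag{6}$$

where

$$A^t_\star = I - E_{q^t}\left[\nabla_x \phi(x)(x - m^t)^\top\right], \qquad b^t_\star = -E_{q^t}\left[\nabla_x \phi(x)\right]. \tag{7}$$

Equation (6) represents the change in the free energy for an infinitesimal change in the variables $x$ given by the flow (5). Obviously, the simplest choices

$$A^t \equiv A^t_\star, \qquad b^t \equiv b^t_\star \tag{8}$$

lead to a decrease in the free energy, $\frac{d\mathcal{F}[q^t]}{dt} \le 0$. More detailed derivations are given in Appendix A. Additionally, equality only happens when

$$I - E_q\left[\nabla_x \phi(x)(x - m)^\top\right] = 0, \qquad E_q\left[\nabla_x \phi(x)\right] = 0. \tag{9}$$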

Using Stein's lemma [ ], we can show that these fixed-point solutions are equal to the conditions for the optimal variational Gaussian distribution solution given in [ ]. In Appendix C, we show that our parameter updates can be interpreted as a Riemannian gradient descent method for the free energy (4). This is based on the metric introduced by ([ ], Theorem 7.6) as an efficient technique for learning the mixing matrix in models of blind source separation. This gradient should not be confused with the so-called natural gradient obtained by pre-multiplying with the inverse Fisher information matrix.

Of course, there are other choices for $A^t$ and $b^t$ which lead to a decrease in the free energy and the same fixed-point equations. In Section 3.6, we discuss how SVGD, with a linear kernel, can lead to the same fixed points but with different dynamics.
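The step from the general expression $-E_{q^t}[\nabla_x \cdot f^t(x) - f^t(x) \cdot \nabla_x \phi(x)]$ to Equation (6) can be spelled out as follows. This is a short sketch that only uses the linear form (5) and the cyclic and transpose invariance of the trace; Appendix A remains the reference derivation.

```latex
\begin{aligned}
\nabla_x \cdot f^t(x) &= \operatorname{tr}(A^t), \\
E_{q^t}\!\left[f^t(x) \cdot \nabla_x \phi(x)\right]
  &= \operatorname{tr}\!\left((A^t)^\top E_{q^t}\!\left[\nabla_x \phi(x)(x - m^t)^\top\right]\right)
   + (b^t)^\top E_{q^t}\!\left[\nabla_x \phi(x)\right], \\
\frac{d\mathcal{F}[q^t]}{dt}
  &= -\operatorname{tr}\!\left((A^t)^\top \left(I - E_{q^t}\!\left[\nabla_x \phi(x)(x - m^t)^\top\right]\right)\right)
     + (b^t)^\top E_{q^t}\!\left[\nabla_x \phi(x)\right] \\
  &= -\operatorname{tr}\!\left[A^t (A^t_\star)^\top\right] - (b^t)^\top b^t_\star .
\end{aligned}
```

With the choices (8), the right-hand side becomes $-\|A^t_\star\|_F^2 - \|b^t_\star\|_2^2$, which is non-positive and vanishes exactly under the conditions (9).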

3.2. From Variable Flows to Parameter Flows

Before we introduce the particle algorithm, we show that the results for the variable flow can also be converted into a temporal change of the parameters $\Gamma^t$ and $m^t$, as defined for Equation (3). From this, a corresponding Gaussian Flow (GF) algorithm can be easily derived. By differentiating the parametrisation $x^t = \Gamma^t(x^0 - m^0) + m^t$ (with $m^t$ now considered as a free variational parameter) with respect to time $t$ and using (5), we obtain

$$\frac{dx^t}{dt} = \frac{d\Gamma^t}{dt}(x^0 - m^0) + \frac{dm^t}{dt} = A^t(x^t - m^t) + b^t. \tag{10}$$

By inserting $x^t = \Gamma^t(x^0 - m^0) + m^t$ into the right-hand side of (10), and using the optimal parameters from (7), we obtain

$$\frac{d\Gamma^t}{dt} = \Gamma^t - E_{q^0}\left[\nabla_x \phi(x^t)(x^0 - m^0)^\top\right](\Gamma^t)^\top \Gamma^t, \qquad \frac{dm^t}{dt} = -E_{q^0}\left[\nabla_x \phi(x^t)\right]. \tag{11}$$

Note that the expectations are over the probability distribution of the initial random variable $x^0$. Discretizing Equations (11) in time, and estimating the expectations by drawing independent samples from the fixed Gaussian $q^0$ at each time step, we obtain our GF algorithm to minimize the variational free energy in the space of Gaussian densities. We summarize the steps of GF in Algorithm 1. Remarkably, this scheme differs from VGA algorithms with Riemannian gradients based on the Fisher information metric (see, e.g., [ ]) because no matrix inversions or second-order derivatives of the function $\phi$ are required. GF also allows for the computation of a low-rank VGA by enforcing $\Gamma \in \mathbb{R}^{D \times K}$ and $x^0 \in \mathbb{R}^K$. This algorithm scales linearly in the number of dimensions and quadratically in the rank $K$ of the covariance.

It is interesting to note that the reverse construction of a variable flow from a parameter flow is, in general, not possible. This would require the ability to eliminate all variational parameters and the initial variables $x^0$ in the resulting differential equation for $x^t$, and replace them with functions of $x^t$ alone. For instance, if we eliminate the initial variables in terms of $(\Gamma^t)^{-1}$ and $x^t$ in the algorithm of [ ], the resulting expression still depends on $\Gamma^t$.

3.3. Particle Dynamics

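The main idea of the particle approach is to approximate the Gaussian density $q^t$ by the empirical distribution

$$\hat{q}^t \doteq \frac{1}{N}\sum_{i=1}^N \delta(x - x_i^t) \tag{12}$$

computed from $N$ samples $x_1^t, \dots, x_N^t$. These are initially sampled from the density $q^0$ at time $t = 0$ and are then propagated using the discretized dynamics of the ODE (5):

$$\frac{dx_i^t}{dt} = -\eta_1^t E_{\hat{q}^t}\left[\nabla_x \phi(x)\right] + \eta_2^t \hat{A}^t (x_i^t - \hat{m}^t), \tag{13}$$

where

$$\hat{A}^t = I - \frac{1}{N}\sum_{i=1}^N \nabla_x \phi(x_i^t)(x_i^t - \hat{m}^t)^\top, \qquad \hat{b}^t = -\frac{1}{N}\sum_{i=1}^N \nabla_x \phi(x_i^t), \qquad \hat{m}^t = \frac{1}{N}\sum_{i=1}^N x_i^t,$$

so that the first term of (13) equals $\eta_1^t \hat{b}^t$, and where $\eta_1^t$ and $\eta_2^t$ are learning rates (we further comment on the use of different optimization schemes in Section 4.4). Note that although $E_{\hat{q}^t}\left[\nabla_x \phi(x)(x - \hat{m}^t)^\top\right]$ is a $D \times D$ matrix, changing the matrix multiplication order leads to a computational complexity of $O(N^2 D)$ with a storage complexity of $O(N(N + D))$, since neither the empirical covariance matrix nor $\hat{A}^t$ needs to be explicitly computed.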

Relaxation of Empirical Free Energy and Convergence

We have shown that the continuous-time dynamics (10) of the random variables leads to a decay of the free energy $\mathcal{F}(q^t)$ with time $t$. Assuming that the free energy is bounded from below, one might conjecture that this property would imply the convergence of the particle algorithm to a fixed point when learning rates are sufficiently small such that the discrete-time dynamics are approximated well by the continuous limit. Unfortunately, the finite number $N$ of particles poses an extra problem. The definition of the free energy $\mathcal{F}(q)$ by the KL-divergence (1) for continuous random variables assumes that both $q(\cdot)$ and $p(\cdot|y)$ are densities with respect to the Lebesgue measure. Hence, $\mathcal{F}(\hat{q})$ is not defined if we take $q \equiv \hat{q}$ (12) as the empirical distribution of the finite particle approximation.

Nevertheless, we define a finite-$N$ approximation to the Gaussian free energy, which is also then found to decay under the finite-$N$ dynamics. Let us first assume that $N > D$ and define

$$\tilde{\mathcal{F}}(\hat{q}^t) \doteq -\frac{1}{2}\log|\hat{C}^t| + E_{\hat{q}^t}[\phi(x)] \tag{14}$$

with the empirical covariance matrix

$$\hat{C}^t = \frac{1}{N}\sum_{i=1}^N (x_i^t - \hat{m}^t)(x_i^t - \hat{m}^t)^\top. \tag{15}$$

The definition (14) is chosen in such a way that in the large-$N$ limit, when the empirical distribution $\hat{q}^t$ converges to a Gaussian distribution $q^t$, we will also obtain the convergence of the approximation (14) to $\mathcal{F}(q^t)$. It can be shown (see Appendix B) that $\frac{d\tilde{\mathcal{F}}(\hat{q}^t)}{dt} \le 0$, with equality only at the fixed points of the dynamics.

In applications of our particle method to high-dimensional problems, the limitations of computational power may force us to restrict particle numbers to be smaller than the dimensionality $D$. For $N < D + 1$, the empirical covariance $\hat{C}^t$ will be singular, and typically contain only $N - 1$ non-zero eigenvalues, which leads to $-\log|\hat{C}^t| = \infty$ and makes Equation (14) meaningless. We resolve this issue through a regularisation of the log-determinant term in (14), replacing all zero eigenvalues of $\hat{C}^t$ by the value 1, i.e., $\lambda_i = 0 \to \tilde{\lambda}_i = 1$. We show in Appendix B that the free energy still decays, provided that the dynamics of the particles stay the same. This regularisation step can be formally stated as a replacement of the empirical covariance (15) in (14) by

$$\hat{C}^t \to \hat{C}^t + \sum_{i:\lambda_i^t = 0} e_i^t (e_i^t)^\top,$$

where $e_i^t$ is the $i$th eigenvector of $\hat{C}^t$.
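To see what this replacement does to the log-determinant, one can write the empirical covariance in its eigenbasis. The following is a brief sketch, assuming orthonormal eigenvectors $e_i^t$ (which is always possible for a symmetric matrix):

```latex
\begin{aligned}
\hat{C}^t &= \sum_{i=1}^{D} \lambda_i^t \, e_i^t (e_i^t)^\top
\quad\Longrightarrow\quad
\hat{C}^t + \sum_{i:\lambda_i^t = 0} e_i^t (e_i^t)^\top
  = \sum_{i:\lambda_i^t > 0} \lambda_i^t \, e_i^t (e_i^t)^\top
  + \sum_{i:\lambda_i^t = 0} 1 \cdot e_i^t (e_i^t)^\top, \\
\log\Bigl|\hat{C}^t + \sum_{i:\lambda_i^t = 0} e_i^t (e_i^t)^\top\Bigr|
  &= \sum_{i:\lambda_i^t > 0} \log \lambda_i^t + \sum_{i:\lambda_i^t = 0} \log 1
  = \sum_{i:\lambda_i^t > 0} \log \lambda_i^t .
\end{aligned}
```

The regularised log-determinant is therefore finite and only involves the non-zero part of the spectrum.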

3.4. Algorithm and Properties

The algorithm we propose is to sample $N$ particles $\{x_1^0, \dots, x_N^0\}$, where $x_i^0 \in \mathbb{R}^D$, from $q^0$ (which can be centered around the MAP, for example), and iteratively optimize their positions using Equation (13). Once convergence is reached, i.e., $\frac{d\mathcal{F}}{dt} = 0$, we can easily make predictions using the converged empirical distribution $q(x) = \frac{1}{N}\sum_{i=1}^N \delta(x - x_i)$, where $\delta$ is the Dirac delta function, or, alternatively, the Gaussian density it represents, i.e., $q(x) = \mathcal{N}(m, C)$, where $m = \frac{1}{N}\sum_{i=1}^N x_i$ and $C = \frac{1}{N}\sum_{i=1}^N (x_i - m)(x_i - m)^\top$. To draw samples from $q$, no inversions of the empirical covariance $C$ are needed, as we can obtain new samples by computing:

$$x = \frac{1}{\sqrt{N}}\sum_{i=1}^N (x_i - m)\,\xi_i + m, \tag{16}$$

where the $\xi_i$ are i.i.d. scalar standard normal variables, $\xi_i \sim \mathcal{N}(0, 1)$. This can be shown by defining the deviation matrix $D$, whose columns are $D_i = \frac{x_i - m}{\sqrt{N}}$. We naturally have $DD^\top = C$, which makes $D$ a square-root factor of $C$ that plays the role of the Cholesky factor in standard Gaussian sampling.
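The sampling rule can be sketched in a few lines. The example below assumes that each particle receives one scalar standard normal weight, which is what makes the covariance of the draw equal to the product of the deviation matrix with its transpose; the function name `sample_from_particles` and the three toy particles are illustrative.

```python
import math
import random

def sample_from_particles(particles, rng):
    """Draw one sample from N(m, C), with m and C the particles' mean and covariance."""
    n, d = len(particles), len(particles[0])
    m = [sum(p[k] for p in particles) / n for k in range(d)]
    x = list(m)
    for p in particles:
        xi = rng.gauss(0.0, 1.0)  # one scalar standard normal weight per particle
        for k in range(d):
            # add xi times the corresponding column of the deviation matrix
            x[k] += xi * (p[k] - m[k]) / math.sqrt(n)
    return x

rng = random.Random(0)
particles = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # toy converged particles (assumption)
draws = [sample_from_particles(particles, rng) for _ in range(20_000)]
```

For these three particles the empirical mean is $(1/3, 1/3)$ and the empirical covariance has $2/9$ on the diagonal and $-1/9$ off the diagonal, which the draws reproduce up to Monte Carlo error without any matrix factorization.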

All the inference steps are summarized in Algorithm 2 and an illustration in two dimensions is provided in Figure 1.

We summarize the principal points of our approach:

• Gradients of expectations have zero variance, at the cost of a bias decreasing with the number of particles and equal to zero for Gaussian targets (see Theorem 1);

• It works with noisy gradients (when using subsampled data, for example);

• The rank of the approximated covariance $C$ is $\min(N - 1, D)$. When $N \le D$, the algorithm can be used to obtain a low-rank approximation;

• The complexity of our algorithm is $O(N^2 D)$ and the storage complexity is $O(N(N + D))$. By adjusting the number of particles used, we can control the performance trade-off;

• GPF (and GF) are also compatible with any kind of structured MF (see Section 3.5);

• Despite working with an empirical distribution, we can compute a surrogate of the free energy $\mathcal{F}(q)$ to optimize hyper-parameters, compute the lower bound of the log-evidence, or simply monitor convergence.

Figure 1. Illustration of the Gaussian Particle Flow algorithm, with $q^0(x)$ and $p(x)$ representing the initial and target distribution, respectively. Particles are iteratively moved according to the gradient flow starting from $q^0(x)$, approximating a new Gaussian distribution $q^t(x)$ at each iteration $t$.

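Algorithm 1: Gaussian Flow (GF)

Input: Number of samples $N$, initial distribution $q^0 = \mathcal{N}(\mu^0, \Gamma^0(\Gamma^0)^\top)$, target $p(x) \propto e^{-\phi(x)}$, learning rates $\eta_1^t, \eta_2^t$

Output: Variational dist. $q(x) = \mathcal{N}(\mu, \Gamma\Gamma^\top)$

for $t$ in $0 : T$ do

    $\{x_i^0\}_{i=1}^N \sim q^0$    # Sample $N$ initial particles from $q^0$

    $x_i = \Gamma^t(x_i^0 - \mu^0) + \mu^t, \ \forall i$    # Reparametrize

    $g_i = \nabla_x \phi(x_i), \ \forall i$    # Compute gradients

    $\mu^{t+1} = \mu^t - \frac{\eta_1^t}{N}\sum_{i=1}^N g_i$    # Update $\mu$

    $A = \frac{1}{N}\sum_i g_i (x_i^0 - \mu^0)^\top (\Gamma^t)^\top - I$    # Compute matrix

    $\Gamma^{t+1} = \Gamma^t - \eta_2^t A \Gamma^t$    # Update $\Gamma$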
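The loop of Algorithm 1 can be illustrated in one dimension, where $\Gamma$ and $\mu$ are scalars. This is a minimal sketch rather than the authors' implementation: it assumes $q^0 = \mathcal{N}(0, 1)$, a Gaussian toy target, and includes the identity contribution from Equation (11) in the matrix `a` (the `- 1.0` term) so that the update has the expected fixed point. The function name `gaussian_flow_1d` and all step sizes are illustrative choices.

```python
import random

def gaussian_flow_1d(grad_phi, n_samples=500, n_iters=500, eta1=0.05, eta2=0.05, seed=0):
    """One-dimensional sketch of Gaussian Flow with q0 = N(0, 1), i.e. mu0 = 0."""
    rng = random.Random(seed)
    mu, gamma = 0.0, 1.0  # initial variational parameters
    for _ in range(n_iters):
        x0 = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # fresh samples from q0
        x = [gamma * z + mu for z in x0]                      # reparametrize
        g = [grad_phi(v) for v in x]                          # gradients of phi
        # a = E[g * (x0 - mu0)] * gamma - 1, the 1-D analogue of the matrix A
        a = sum(gi * z for gi, z in zip(g, x0)) / n_samples * gamma - 1.0
        mu -= eta1 * sum(g) / n_samples                       # update the mean
        gamma -= eta2 * a * gamma                             # update the scale
    return mu, gamma

# Toy target N(2, 1/4): phi(x) = 0.5 * 4 * (x - 2)^2, so grad phi(x) = 4 * (x - 2).
mu, gamma = gaussian_flow_1d(lambda x: 4.0 * (x - 2.0))
```

For this target the iterates settle close to $\mu = 2$ and $\Gamma^2 = 1/4$, with a small residual fluctuation caused by the fresh samples drawn at every step.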

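Algorithm 2: Gaussian Particle Flow (GPF)

Input: Number of particles $N$, initial distribution $q^0$, target $p(x) \propto e^{-\phi(x)}$, learning rates $\eta_1^t, \eta_2^t$

Output: Empirical dist. $q(x) = \frac{1}{N}\sum_{i=1}^N \delta_{x, x_i}$

Init: Sample $N$ particles from $q^0$: $\{x_i^0\}_{i=1}^N$

for $t$ in $0 : T$ do

    $g_i = \nabla_x \phi(x_i^t), \ \forall i$    # Compute gradients

    $m = \frac{1}{N}\sum_i x_i^t, \quad g = \frac{1}{N}\sum_i g_i$    # Compute means

    $A = \frac{1}{N}\sum_i g_i (x_i^t - m)^\top - I$    # Compute matrix

    $x_i^{t+1} = x_i^t - \eta_1^t g - \eta_2^t A (x_i^t - m), \ \forall i$    # Update particles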
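The particle update of Algorithm 2 can be sketched with plain Python lists. The example below is an illustration under stated assumptions, not the authors' code: it uses a two-dimensional Gaussian toy target with mean `MU` and precision `LAM`, three hand-picked starting particles, constant learning rates, and the helper names `gpf_step` and `mean_and_cov`, none of which come from the original text.

```python
def gpf_step(xs, grad_phi, eta1, eta2):
    """One update of Gaussian Particle Flow for particles xs (each a list of D floats)."""
    n, d = len(xs), len(xs[0])
    grads = [grad_phi(x) for x in xs]
    m = [sum(x[k] for x in xs) / n for k in range(d)]        # particle mean
    g_bar = [sum(g[k] for g in grads) / n for k in range(d)]  # mean gradient
    # A = (1/N) sum_i g_i (x_i - m)^T - I
    a = [[sum(g[r] * (x[c] - m[c]) for g, x in zip(grads, xs)) / n
          - (1.0 if r == c else 0.0) for c in range(d)] for r in range(d)]
    new_xs = []
    for x in xs:
        dev = [x[k] - m[k] for k in range(d)]
        new_xs.append([x[r] - eta1 * g_bar[r]
                       - eta2 * sum(a[r][c] * dev[c] for c in range(d))
                       for r in range(d)])
    return new_xs

def mean_and_cov(xs):
    """Empirical mean and covariance (normalised by N) of the particles."""
    n, d = len(xs), len(xs[0])
    m = [sum(x[k] for x in xs) / n for k in range(d)]
    c = [[sum((x[r] - m[r]) * (x[s] - m[s]) for x in xs) / n for s in range(d)]
         for r in range(d)]
    return m, c

MU = [1.0, -1.0]                 # toy target mean (assumption)
LAM = [[2.0, 0.5], [0.5, 1.0]]   # toy target precision matrix (assumption)

def grad_phi(x):
    # phi(x) = 0.5 (x - MU)^T LAM (x - MU), so grad phi(x) = LAM (x - MU)
    return [sum(LAM[r][c] * (x[c] - MU[c]) for c in range(2)) for r in range(2)]

particles = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # N = D + 1 = 3 particles
for _ in range(3000):
    particles = gpf_step(particles, grad_phi, eta1=0.05, eta2=0.05)
m, c = mean_and_cov(particles)
```

With $N = D + 1$ particles the final `m` and `c` match the target mean and the inverse of `LAM` to numerical precision, in line with the statement of Theorem 1 for Gaussian targets. The updates are deterministic once the initial particles are fixed.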

3.4.1. Relaxation of Empirical Free Energy

The definition of the free energy $\mathcal{F}(q)$ from the KL-divergence (1) for continuous random variables assumes that both $q(\cdot)$ and $p(\cdot|y)$ are densities with respect to the Lebesgue measure. Hence, it is not a priori clear that a specific approximation $\mathcal{F}(\hat{q}^t)$, based on an empirical distribution $\hat{q}^t(x) \doteq \frac{1}{N}\sum_{i=1}^N \delta(x - x_i^t)$ with a finite number of particles, will decrease under the particle flow. Thus we may not be able to guarantee convergence to a fixed point for finite $N$. Luckily, as we show in Appendix D, we find that:

$$\frac{d\mathcal{F}(\hat{q}^t)}{dt} = \frac{d}{dt}\left(E_{\hat{q}^t}[\phi(x)] - \frac{1}{2}\log|\hat{C}^t|\right) \le 0. \tag{17}$$

For $N < D + 1$, the empirical covariance $\hat{C}^t$ will typically contain $N - 1$ non-zero eigenvalues and lead to $-\log|\hat{C}^t| = \infty$, making Equation (17) meaningless. We resolve this issue by introducing a regularized free energy $\tilde{\mathcal{F}}$, where $\log|\hat{C}^t|$ is replaced by $\sum_{i:\lambda_i > 0}\log\lambda_i$, where $\{\lambda_i\}_{i=1}^D$ are the eigenvalues of $\hat{C}^t$. We show in Appendix D that, given the dynamics from Equation (5), $\tilde{\mathcal{F}}$ is also guaranteed to not increase over time. It can, therefore, be used as a regularized proxy for the true $\mathcal{F}$ and used to optimize over hyper-parameters or to monitor convergence. Note that similar proofs exist for SVGD [ ] and were proven to be highly non-trivial.

3.4.2. Dynamics and Fixed Points for Gaussian Targets

We illustrate our method by some exact theoretical results for the dynamics and the

fixed points of our algorithm when the target is a multivariate Gaussian density. While such

targets may seem like a trivial application, our analysis could still provide some insight

into the performance for more complicated densities.

Theorem 1. If the target density $p(x)$ is a $D$-dimensional multivariate Gaussian, only $D + 1$ particles are needed for Algorithm 2 to converge to the exact target parameters.

Proof. The proof is given in Appendix E.

Theorem 2. For a target $p(x) = \mathcal{N}(x|\mu, \Lambda^{-1})$, i.e., with precision matrix $\Lambda$, where $x \in \mathbb{R}^D$, and $N \ge D + 1$ particles, the continuous-time limit of Algorithm 2 will converge exponentially fast for both the mean and the trace of the precision matrix:

$$m^t - \mu = e^{-\Lambda t}(m^0 - \mu),$$

$$\mathrm{tr}\left((C^t)^{-1} - \Lambda\right) = e^{-2t}\,\mathrm{tr}\left((C^0)^{-1} - \Lambda\right),$$

where $m^t$ and $C^t$ are the empirical mean and covariance matrix at time $t$ and $\exp(-\Lambda t)$ is the matrix exponential.

Proof. The proof is given in Appendix F.

Our result shows that convergence of the mean $m^t$ directly depends on $\Lambda$. However, we can also precondition the gradient on $C^t$, i.e., using the natural gradient approximation in the Fisher sense, and eventually get rid of the dependency on $\Lambda$ when $(C^t)^{-1} \approx \Lambda$.

The exponential relaxation of fluctuations also manifests itself in the decay of the free energy towards its minimum. For the Gaussian target, the free energy exactly separates into two terms corresponding to the mean and fluctuations. We can write $\mathcal{F}(m^t, C^t) = \frac{1}{2}(m^t - \mu)^\top \Lambda (m^t - \mu) + \frac{D}{2} + \mathcal{F}_{fl}(C^t)$, where the nontrivial fluctuation part (subtracted by its minimum) is given by

$$\mathcal{F}_{fl}(C^t) = -\frac{1}{2}\log|C^t| + \frac{1}{2}\mathrm{tr}(\Lambda C^t - I).$$

We can show that

$$-\lim_{t \to \infty}\frac{d \ln \mathcal{F}_{fl}(C^t)}{dt} \ge 4,$$

indicating an asymptotic decrease in $\mathcal{F}_{fl}(C^t)$ faster than $e^{-4t}$, independent of the target. We can also prove the finite-time bound

$$\mathcal{F}_{fl}(C^t) \le \mathcal{F}_{fl}(C^0)\exp\left(-\frac{2t}{\mathrm{tr}(\Lambda^{-1})\left(\mathrm{tr}(\Lambda) + |\mathrm{tr}((C^0)^{-1} - \Lambda)|\right)}\right).$$

The degenerate case N<D+1

Additionally, we can show the following result for the fixed points:

Theorem 3. Given a $D$-dimensional multivariate Gaussian target density $p(x) = \mathcal{N}(x|\mu, \Sigma)$, using Algorithm 2 with $N < D + 1$ particles, the empirical mean converges to the exact mean $\mu$. The $N - 1$ non-zero eigenvalues of $C^t$ converge to a subset of the spectrum of the target covariance $\Sigma$. Furthermore, the global minimum of the regularised version $\tilde{\mathcal{F}}$ of the free energy (17) corresponds to the largest eigenvalues of $\Sigma$.

Proof. The proof is given in Appendix G.

This result suggests that GPF might typically converge to an optimal low-rank approximation of $\Sigma$. We show an empirical confirmation of this conjecture in Section 4.2. This suggests that it makes sense to apply our algorithm to high-dimensional problems even when the number of particles is not large. If the target density has significant support close to a low-dimensional submanifold, we might still obtain a reasonable approximation.

3.5. Structured Mean-Field

For high-dimensional problems, it may be useful to restrict the variational Gaussian approximation to the posterior to a specific structure via a structured mean-field approximation. In this way, spurious dependencies between variables that are caused by finite-sample effects could be explicitly removed from the algorithms. This is most easily incorporated in our approach by splitting a given collection of latent variables $x$ into $M$ disjoint subsets $x^{(i)}$. We reorder the vector indices in such a way that the first components correspond to $x^{(1)}$, followed by $x^{(2)}$, and so on. Hence, we obtain $x = \{x^{(1)}, x^{(2)}, \dots, x^{(M)}\}$. A structured mean-field approach is enforced by imposing a block matrix structure for the update matrix $A_{MF} = A^{(1)} \oplus \dots \oplus A^{(M)}$, where $\oplus$ is the direct sum operator. It is easy to see that this construction corresponds to a related block structure of the $\Gamma$ matrix in Equation (3). This means that the subsets of the random vectors are modeled as independent. Hence, when the number of particles grows to infinity, one recovers the fixed-point equations for the optimal MF structured Gaussian variational approximation from our approach. Note that using a structured MF does not change the complexity of the algorithm but requires fewer particles to obtain a full-rank solution.

3.6. Comparison with SVGD

Given the similarities with the SVGD methods [ ], one could question the differences of our approach. The model proposed by [ ] using a linear kernel $k(x, x') = x^\top x' + 1$ has similar properties to our approach. The variable update becomes:

$$\frac{dx}{dt} = \frac{1}{N}\sum_{i=1}^N \left(-k(x_i, x)\nabla\phi(x_i) + \nabla_{x_i} k(x_i, x)\right) = E_{\hat{q}}\left[I - \nabla\phi(x)x^\top\right]x - E_{\hat{q}}\left[\nabla\phi(x)\right].$$

The fixed points are

$$0 = E_{\hat{q}}\left[\nabla\phi(x)\right], \qquad I = E_{\hat{q}}\left[\nabla\phi(x)x^\top\right] = E_{\hat{q}}\left[\nabla\phi(x)(x - m)^\top\right],$$

where the last equality holds since $E_{\hat{q}}[\nabla\phi(x)] = 0$. This is the same as our algorithm's fixed points (9). Similarly to Theorem 1, $D + 1$ particles will converge to the exact $D$-dimensional multivariate Gaussian target. However, the generated flows are different. The main difference is that we normalize our flow via the $L_2$ norm, whereas [ ] rely on the reproducing kernel Hilbert space (RKHS) norm, i.e., $\|\phi\|_k^2 = \phi^\top K^{-1}\phi$, where $\phi_i = \phi(x_i)$ and $K_{ij} = k(x_i, x_j)$. For a full introduction on RKHS, we recommend [ ]. Remarkably, centering the particles on the mean, namely, using the modified linear kernel $k(x, x') = (x - m)^\top(x' - m) + 1$, leads to the same dynamics. Additionally, when using SVGD, there is no direct possibility of computing the current KL divergence between the variational distribution and the target, unless some values are accumulated [ ]. There is also no clear theory explaining what happens when the number of particles is smaller than the number of dimensions, for both distance-based kernels and the linear kernel.
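The closed form of the linear-kernel update can be verified numerically. The sketch below compares the particle-average expression with the closed form at an arbitrary point; the quartic toy potential, the particle positions and the function names are illustrative assumptions, chosen only to show that the identity does not rely on a Gaussian target.

```python
def svgd_linear_update(x, particles, grad_phi):
    """SVGD direction at x for k(x, x') = x^T x' + 1, written as a particle average."""
    n, d = len(particles), len(x)
    out = [0.0] * d
    for p in particles:
        k = sum(pk * xk for pk, xk in zip(p, x)) + 1.0  # kernel value k(p, x)
        g = grad_phi(p)
        for r in range(d):
            out[r] += (-k * g[r] + x[r]) / n            # gradient of k(p, x) w.r.t. p is x
    return out

def closed_form_update(x, particles, grad_phi):
    """(I - E[grad_phi(x') x'^T]) x - E[grad_phi(x')] under the empirical distribution."""
    n, d = len(particles), len(x)
    grads = [grad_phi(p) for p in particles]
    out = []
    for r in range(d):
        gx = sum(sum(g[r] * p[c] for g, p in zip(grads, particles)) / n * x[c]
                 for c in range(d))
        g_bar = sum(g[r] for g in grads) / n
        out.append(x[r] - gx - g_bar)
    return out

def grad_quartic(x):
    # toy potential phi(x) = sum_k x_k^4 / 4, so grad phi(x) = x^3 element-wise
    return [v ** 3 for v in x]

particles = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
point = [0.4, -0.7]
lhs = svgd_linear_update(point, particles, grad_quartic)
rhs = closed_form_update(point, particles, grad_quartic)
```

Both expressions agree up to floating-point rounding, since the only ingredients are the bilinearity of the kernel and the fact that its gradient with respect to the first argument is the second argument.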

4. Experiments

We now evaluate the efficiency of GPF and GF. First, given a Gaussian target, we compare the convergence of our approach with popular VGA methods, which are all described in Section 2. Second, we evaluate the effect of varying the number of particles for both Gaussian targets and non-Gaussian targets, especially with a low-rank covariance. Then, we evaluate the efficiency of our algorithm on a range of real-world binary classification problems through a Bayesian logistic regression model and a series of BNN on the MNIST dataset.

All the Julia [ ] code and data used to reproduce the experiments are available at the GitHub repository: https://github.com/theogf/ParticleFlow_Exp (accessed on 27 July 2021).

4.1. Multivariate Gaussian Targets

We consider a 20-dimensional multivariate Gaussian target distribution. The mean is sampled from a standard Gaussian, $\mu \sim \mathcal{N}(0, I_D)$, and the covariance is a dense matrix defined as $\Sigma = U\Lambda U^\top$, where $U$ is a unitary matrix and $\Lambda$ is a diagonal matrix. $\Lambda$ is constructed as $\log_{10}(\Lambda_{ii}) = \log_{10}(\kappa)\frac{i - 1}{D - 1} - 1$, where $\kappa$ is the condition number, i.e., $\kappa = \Lambda_{max}/\Lambda_{min}$. This means that, for $\kappa = 1$, we obtain $\Sigma = 0.1\,I_D$, and for $\kappa = 100$, we obtain eigenvalues ranging uniformly from 0.1 to 10 in log-space.
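The eigenvalue schedule is easy to reproduce. The following sketch only computes the diagonal of $\Lambda$ from the formula above; the function name `target_eigenvalues` is an illustrative choice, and the construction of the unitary matrix $U$ is left out.

```python
import math

def target_eigenvalues(dim, kappa):
    """Eigenvalues with log10(lambda_i) = log10(kappa) * (i - 1) / (dim - 1) - 1."""
    return [10.0 ** (math.log10(kappa) * i / (dim - 1) - 1.0) for i in range(dim)]

flat = target_eigenvalues(20, 1.0)      # condition number 1: constant spectrum
spread = target_eigenvalues(20, 100.0)  # condition number 100: log-uniform spectrum
```

For a condition number of 1 all eigenvalues equal 0.1, and for 100 they run from 0.1 up to 10 with a constant ratio between neighbours.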

We compare GPF and GF to the state-of-the-art methods for VGA described in Section 2, namely Doubly Stochastic VI (DSVI) [ ], Factor Covariance Structure (FCS) [ ] with rank $p = D$, iBayes Learning Rule (IBLR) [ ] with a full-rank covariance and their Hessian approach, and Stein Variational Gradient Descent with both a linear kernel (Linear SVGD) [ ] and a squared-exponential kernel (Sq. Exp. SVGD) [ ]. For all methods, we set the number of particles or, alternatively, the number of samples used by the estimator, to $D + 1$, and use standard gradient descent ($x^{t+1} = x^t + \eta\,\phi^t(x^t)$) with a learning rate $\eta = 0.01$ for all particle methods. We use RMSProp [ ] with a learning rate of 0.01 for all stochastic methods. We run each experiment 10 times with 30,000 iterations, and plot the average error on the mean and the covariance with one standard deviation. For GPF, we additionally evaluate the method with and without using natural gradients for the mean (i.e., pre-multiplying the averaged gradient with $C^t$), indicated, respectively, with a dashed and solid line. Figure 2 reports the $L_2$ norm of the difference between the mean and covariance with the true posterior over time for the target condition number $\kappa \in \{1, 10, 100\}$.

Figure 2. $L_2$ norm of the difference between the target mean $\mu$ (left side) and target covariance $\Sigma$ (right side) with the inferred variational parameters $m^t$ and $C^t$ against time for 20-dimensional Gaussian targets with condition number $\kappa$. We use $D + 1$ particles/samples and show the mean over 10 runs as well as the 68% credible interval. Methods with dashed curves use natural gradients on the mean. Note that DSVI and FCS are overlapping and are, at this scale, indistinguishable from one another.

As Theorem 1 predicts, GPF converges exactly to the true distribution, regardless of the target. GF and other methods based on stochastic estimators cannot obtain the same precision, as their accuracy is penalized by the gradient noise. IBLR approximates the covariance perfectly, despite the stochasticity of its estimator; however, IBLR needs to compute the true Hessian at each step. When using a Hessian approximation instead, IBLR performed just like DSVI; the true benefit of IBLR appears when second-order functions are computed, which is naturally intractable in high dimensions. SVGD with a linear kernel achieves a good performance but is highly unstable: most of the runs (ignored here) diverge. This is due to the dot product $x^\top x$, which can become extremely large, especially for non-centered data. For this reason, we do not consider this method for the later experiments. SVGD with a sq. exp. kernel obtains a good estimate for the mean but fails to approximate the covariance.

Perhaps surprisingly, GF does not perform much better than DSVI and FCS. This is potentially due to the benefit of Riemannian gradients being canceled by the gradient noise [ ], providing a strong argument for particle-based methods over stochastic estimators.

Remarkably, we also confirm Theorem 2, that the convergence speed of $C^t$ is independent of the target $\Sigma$, while the convergence speed of $m^t$ has this dependency unless the natural gradient is used (see the dashed curves). The case $\kappa = 1$ highlights that natural gradients do not necessarily improve convergence speed.

4.2. Low-Rank Approximation for Full Gaussian Targets

We explore the effect of the number of particles for both Gaussian and non-Gaussian targets. We use the same Gaussian target from the previous experiment in 50 dimensions with a full-rank covariance determined by its condition number $\kappa = \lambda_{max}/\lambda_{min}$. The covariance eigenvalues $\lambda_i$ in log-space range uniformly from 0.1 to $0.1\kappa$. For a given target multivariate Gaussian, we vary the number of particles from 2 to $D + 1$ and look at the absolute difference $|\mathrm{tr}(C - \Sigma)|$. The results for $D = 50$, as well as the corresponding predictions (in dashed-black) from Theorem 3, are shown in Figure 3.

The empirical results perfectly match the theoretical predictions, confirming that, for Gaussian targets, the particles determine a low-rank approximation whose spectrum is equal to the largest eigenvalues from the target.

Figure 3. Trace error for a Gaussian target with $D = 50$ and condition numbers $\kappa$ for a varying number of particles with GPF. Predictions from Theorem 3 are shown in dashed-black.

4.3. High-Dimensional Low-Rank Gaussian Targets

We consider a typical low-rank target case where the dimensionality is high but the

effective rank of the covariance is unknown. The target is given by

p(x) = N(µ

Σ)

where

µ∼ N(0, ID)

, the covariance is defined by

Σ=UΛU>

, where

is a

D×D

unitary matrix

and Λis a diagonal matrix defined by

Λii =(N(2, 1), if i≤K

10−8, otherwise

where

is the effective rank of the target. We pick

500 and vary

K∈ {

10, 20, 30

}

simulate a true problem where the correct

is not known. We test all methods allowing

6. Flexible and Efficient Inference with Particles for the Variational Gaussian

Approximation

Entropy 2021,23, 990 15 of 34

for low-rank structure, namely,

GPF

FCS

and

SVGD

(Linear and Sq. Exp.). We fix the

rank (or the number of particles) to be 20; therefore, we obtain three cases where the rank is

exact, under-estimated, and over-estimated. For all methods, we use RMSProp [

] for the

stochastic methods, or a diagonal version of it (see Section 4.4) for the particle ones. The

error of the mean and the covariance is shown in Figure 4. Note that the difference in the

initial error on the covariance is due to the difficulty of starting with the same covariance

between particle and stochastic methods.

Figure 4. Convergence plot of low-rank methods for a 500-dimensional multivariate Gaussian target with effective rank $K \in \{10, 20, 30\}$. The rank of each method is fixed as 20. The difference in the starting point for the covariance is due to the initialization difference between each method. We show the mean over 10 runs for each method with shadowed areas representing the 68% credible interval.

We observe once again that SVGD with a linear kernel fails to converge due to the large gradients. All methods perform equally well in the estimation of the mean and are unaffected by the rank of the target. As expected, the approximation quality for the covariance degrades when the rank gets bigger, but all algorithms still converge to good approximations. SVGD with a sq. exp. kernel performs much worse than the rest of the methods. This is a known phenomenon where, for high dimensions, the SVGD covariance is either over- or underestimated.

4.4. Non-Gaussian Target

We now investigate the behavior of our algorithm with non-Gaussian target distributions. We built a two-dimensional banana distribution: $p(x) \propto \exp\left(-0.5\left(0.01\, x_1^2 + 0.1\,(x_2 + 0.1\, x_1^2 - 10)^2\right)\right)$

, varied the number of particles used for GPF in $\{3, 5, 10, 20, 50\}$ and compared it with a standard full-rank VGA approach. We also showed the impact of replacing a fixed learning rate with the Adam [ ] optimizer for 50 particles. The results are shown in Figure 5.

As expected, increasing the number of particles brought the distribution obtained via GPF increasingly closer to the optimal standard VGA, even in a non-Gaussian setting. However, using a momentum-based optimizer such as Adam breaks the linearity assumption of the original flow (5) and leads to a twisted representation of the particles. (We observed the same behavior with other momentum-based optimizers.) A simple modification of the most common optimizers allows the linearity to be maintained while correctly adapting the learning rate to the shape of the problem. Most optimisers accumulate momentum or gradients element-wise, and end up modifying the updates as $x_{t+1} = x_t + P_t \odot \phi_t(x_t)$, where $P_t \in \mathbb{R}^{D \times D}$ is the preconditioner obtained via the optimiser and $\odot$ is the Hadamard product. By instead taking the average over the particles for each dimension, we obtained the updates $x_{t+1} = x_t + P_t \phi_t(x_t)$, where $P_t$ is a $D \times D$ diagonal matrix. The details of the dimension-wise conditioners for ADAM, AdaGrad and RMSProp are given in Appendix H.

Figure 5.

Two-dimensional Banana distribution. Comparison of

GPF

using an increasing number of

particles and a different optimizer (ADAM) with the standard VGA (rightmost plot).

4.5. Bayesian Logistic Regression

Finally, we considered a range of real-world binary classification problems modeled with a Bayesian logistic regression. Given some data $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1, 1\}$, we defined the model $y_i \sim \mathrm{Bernoulli}(\sigma(w^\top x_i))$ with weight $w \in \mathbb{R}^D$ and with $\sigma$ being the logistic function. We set a prior on $w$: $w \sim \mathcal{N}(0, 10\, I_D)$

. We benchmarked the competing approaches over four datasets from the UCI repository [ ]: spam ($N = 4601$, $D = 104$), ionosphere ($N = 351$, $D = 111$), krkp ($N = 3196$, $D = 37$) and mushroom ($N = 8124$, $D = 95$). We ran all algorithms discussed in Section 4.1, both with

and without a mean-field approximation; SVGD was omitted since it is too unstable. All algorithms were run with a fixed learning rate $\eta = 10^{-4}$, and we used mini-batches of size 100. We show alternative training settings in Appendix I. Note that FCS, for mean-field, simplifies to DSVI. Additionally, we did not consider full-rank IBLR, as it is too expensive, and we used their reparametrized gradient version for the Hessian. Figure 6 shows the

average negative log-likelihood on 10-fold cross-validation with one standard deviation for each dataset. While, as expected, the advantages shown for Gaussian targets do not transfer to non-Gaussian targets, GPF and GF are consistently on par with competitors. On the other hand, IBLR tends to be outperformed. It is also interesting to note that mean-field does not seem to have a negative impact on these problems, and performance remains the same even with a full-rank matrix.

(a) Mean-field approximation

(b) No mean-field approximation

Figure 6. Average negative log-likelihood on a test set over 10 runs against training time for a Bayesian logistic regression model applied to different datasets. Top plots use a mean-field approximation, while bottom plots use a low-rank structure for the covariance with rank $L = 100$.

4.6. Bayesian Neural Network

We ran our algorithm on a standard network with two hidden layers, each with 200 neurons and tanh activation functions (we additionally tried ReLU [ ], but some baselines failed to converge). We trained on the MNIST dataset [42] ($N = 60{,}000$, $D = 784$) and used an isotropic prior on the weights $p(w) = \mathcal{N}(0, \alpha I_D)$ with $\alpha = 1.0$. We additionally compared these with Stochastic Weight Averaging-Gaussian (SWAG) [ ] with an SGD learning rate of $10^{-6}$ (selected empirically) and Efficient Low-Rank Gaussian Variational Inference (ELRGVI) [ ]. We varied the assumptions on the covariance matrix to be diagonal (Mean-Field), or to have rank $L \in \{5, 10\}$. Additionally, we showed, for GPF, the effect of using a structured mean-field assumption by imposing the independence of the weights between each layer (GPF (Layers)).

We trained each algorithm for 5000 iterations with a batch size of 128 ($\sim$10 epochs) and reported the final average negative log-likelihood, accuracy and expected calibration error [43] on the test set ($N = 10{,}000$) in Table 1. The predictive distribution is given by

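$$p(y = k \mid x^*, \mathcal{D}) = \int p(y = k \mid x^*, w)\, p(w \mid \mathcal{D})\, dw \approx \int p(y = k \mid x^*, w)\, q(w)\, dw,$$

where $\mathcal{D}$ is the training data, and $x^*$ is a test sample. We computed the accuracy and the average negative test log-likelihood as:

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{y_i}\!\left(\arg\max_k\, p(y = k \mid x_i^*, \mathcal{D})\right), \qquad \mathrm{NLL} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y = y_i \mid x_i^*, \mathcal{D}),$$

where $\mathbb{1}_y(x)$ is the indicator function (equal to 1 for $y = x$, 0 otherwise). For the definition of the expected calibration error, we refer the reader to [43]. Additional convergence and uncertainty calibration plots can be found in Appendix I.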
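As an illustration of the two metrics above, the following minimal sketch approximates the predictive distribution by averaging class probabilities over a few weight samples and then evaluates accuracy and average negative log-likelihood. The function names and the toy probabilities are hypothetical and only serve the example; they are not taken from the experiments.

```python
import math

def predictive_probs(per_sample_probs):
    """Monte Carlo estimate of p(y = k | x*, D): average the class
    probabilities obtained for S weight samples w_s ~ q(w)."""
    n_samples = len(per_sample_probs)
    n_classes = len(per_sample_probs[0])
    return [sum(p[k] for p in per_sample_probs) / n_samples
            for k in range(n_classes)]

def accuracy_and_nll(pred_probs, labels):
    """Accuracy and average negative log-likelihood on a test set."""
    correct = 0
    nll = 0.0
    for probs, y in zip(pred_probs, labels):
        k_hat = max(range(len(probs)), key=probs.__getitem__)  # arg max_k
        correct += int(k_hat == y)
        nll -= math.log(probs[y])
    n = len(labels)
    return correct / n, nll / n

# Toy data (illustrative only): 2 test points, 2 weight samples, 3 classes.
samples_point_1 = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
samples_point_2 = [[0.2, 0.5, 0.3], [0.4, 0.1, 0.5]]
preds = [predictive_probs(samples_point_1), predictive_probs(samples_point_2)]
labels = [0, 1]
acc, nll = accuracy_and_nll(preds, labels)
```

In this toy case the first point is classified correctly and the second is not, so the accuracy is one half, while the NLL averages the negative log of the probability assigned to each true label.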

Table 1. Negative Log-Likelihood (NLL), Accuracy (Acc), and Expected Calibration Error (ECE) for a Bayesian Neural Network (BNN) on the MNIST dataset. We varied the rank of the variational covariance from mean-field (all variables are independent) to a low-rank structure with $L \in \{5, 10\}$. Bold numbers indicate the best performance, and italic bold numbers indicate the best performance when restricted to VGA methods. Convergence and calibration plots can be found in Appendix I.

Alg.             Mean-Field               L = 5                    L = 10
                 NLL    Acc    ECE        NLL    Acc    ECE        NLL    Acc    ECE
GPF              0.183  0.95   0.0384     0.166  0.96   0.0918     0.172  0.955  0.0869
GPF (Layers)     -      -      -          0.147  0.958  0.0181     0.178  0.952  0.0395
GF               0.178  0.953  0.0706     0.185  0.956  0.136      0.171  0.952  0.0455
DSVI             0.204  0.945  0.11       -      -      -          -      -      -
SVGD (Sq. Exp)   -      -      -          0.139  0.965  0.0732     0.133  0.967  0.0879
SWAG             -      -      -          0.257  0.957  0.0662     0.287  0.956  0.0878
ELRGVI           -      -      -          0.453  0.901  0.53       0.537  0.882  0.777

Overall, the SVGD method performed best in terms of both accuracy and negative log-likelihood. However, SVGD is not in the same category as the others, since it is not a VGA. For VGAs, we observed that a low-rank approximation improves upon mean-field methods. In particular, assuming independence between layers provides a large advantage. GPF and GF generally perform equally or better than all the other VGA methods. Note that, although not reported here, all methods needed approximately the same time for the 5000 iterations, except for SWAG, which only needed the MAP and a few thousand iterations of SGD afterward, making it generally faster but also less controlled (a grid search was needed to find the appropriate learning rate for SGD).

5. Discussion

We introduced GPF, a general-purpose and theoretically grounded particle-based approach to perform inference with variational Gaussians, as well as GF, its parameter version. We were able to show the convergence of the particle algorithm based on an empirical approximation of the free energy. We also showed that we can approximate high-dimensional targets by allowing for low-rank approximations with a small number of particles. The results for Gaussian targets suggest that the posterior covariance approximation may converge asymptotically fast, with little dependence on the target. This work is a first step in analyzing convergence speed and guarantees in inference with variational Gaussians, and future work could extend guarantees to non-Gaussian problems. One could also take advantage of existing particle-based VI methods to accelerate inference further or reach better optima [44,45].

Author Contributions:

Conceptualization, T.G.-F. and M.O.; methodology, T.G.-F., V.P. and M.O.; software, T.G.-F.; validation, T.G.-F.; formal analysis, T.G.-F.; investigation, T.G.-F.; resources, T.G.-F. and

V.P.; data curation, T.G.-F.; writing—original draft preparation, T.G.-F., V.P. and M.O.; writing—review

and editing, T.G.-F., V.P. and M.O.; visualization, T.G.-F.; supervision, M.O.; project administration,

T.G.-F.; funding acquisition, M.O. All authors have read and agreed to the published version of

the manuscript.

Funding:

We acknowledge the support of the German Research Foundation and the Open Access

Publication Fund of TU Berlin.

Data Availability Statement:

Datasets can be found on the UCI dataset website [ ] and the MNIST dataset can be found on Yann LeCun's website [42].

Acknowledgments:

We thank Fela Winkelmolen for his initial help on computations, Jannik Thümmel for his work on the linear SVGD and the reviewers for their insightful comments.

Conflicts of Interest: The authors declare no conflict of interest.

Appendix A. Derivation of the Optimal Parameters

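In Section 3, we considered the optimization problem:

$$\min_{A^t, b^t \in \mathcal{B}} \frac{dF[q^t]}{dt}, \quad \text{where } \mathcal{B} = \{A^t, b^t : \|A^t\|_F^2 = 1,\ \|b^t\|_2 = 1\},$$

where we have introduced $\|A\|_F^2 = \mathrm{tr}(AA^\top)$, the Frobenius norm, and $\|b^t\|_2$, the Euclidean norm, and

$$\frac{dF[q^t]}{dt} = -\mathrm{tr}\left[A^t (A^t_\star)^\top\right] - (b^t)^\top b^t_\star. \tag{A1}$$

To solve this problem, we used the Lagrange multiplier method. We write the Lagrangian as:

$$\mathcal{L}(A^t, b^t) = \frac{dF[q^t]}{dt} - \lambda_A\, g(A^t) - \lambda_b\, h(b^t),$$

where $g(A) = \mathrm{tr}(AA^\top) - 1$ and $h(b) = \|b\|_2^2 - 1$. For simplicity we can divide the problem as:

$$\mathcal{L}(A^t) = -\mathrm{tr}\left[A^t (A^t_\star)^\top\right] - \lambda_A\, g(A^t), \qquad \mathcal{L}(b^t) = -(b^t)^\top b^t_\star - \lambda_b\, h(b^t).$$

For $A^t$, we have the constraints:

$$\nabla_{A^t} \mathrm{tr}\left[A^t (A^t_\star)^\top\right] = \lambda_A \nabla_{A^t} g(A^t), \qquad g(A^t) = 0.$$

Computing the gradients is straightforward:

$$A^t_\star = 2\lambda_A A^t \;\Rightarrow\; A^t = \frac{A^t_\star}{2\lambda_A} \;\Rightarrow\; \frac{1}{4\lambda_A^2}\, \mathrm{tr}\left(A^t_\star (A^t_\star)^\top\right) = 1 \;\Rightarrow\; \lambda_A = \frac{1}{2}\sqrt{\mathrm{tr}\left(A^t_\star (A^t_\star)^\top\right)},$$

which gives us the result $A^t = A^t_\star / \|A^t_\star\|_F$. Similarly for $b^t$:

$$\nabla_{b^t} (b^t)^\top b^t_\star = \lambda_b \nabla_{b^t} h(b^t), \qquad h(b^t) = 0.$$

Replacing the gradients gives:

$$b^t_\star = 2\lambda_b b^t \;\Rightarrow\; b^t = \frac{b^t_\star}{2\lambda_b} \;\Rightarrow\; \frac{1}{4\lambda_b^2}\, \|b^t_\star\|_2^2 = 1 \;\Rightarrow\; \lambda_b = \frac{\|b^t_\star\|_2}{2},$$

which gives us the result $b^t = b^t_\star / \|b^t_\star\|_2$.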
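The normalised solution can be sanity-checked numerically. The sketch below, which uses a randomly drawn stand-in for $A^t_\star$ (an assumption made only for illustration), compares the trace objective $\mathrm{tr}(A (A_\star)^\top)$ at the normalised optimum against many random unit-norm candidates; by the Cauchy-Schwarz inequality none of them can exceed it.

```python
import math
import random

def frob_inner(a, b):
    """tr(A B^T) for two matrices stored as flat lists of equal length."""
    return sum(x * y for x, y in zip(a, b))

def normalise(a):
    """Rescale a flattened matrix to unit Frobenius norm."""
    norm = math.sqrt(frob_inner(a, a))
    return [x / norm for x in a]

random.seed(0)
# Hypothetical stand-in for A_star: a random 2 x 3 matrix, flattened.
a_star = [random.gauss(0.0, 1.0) for _ in range(6)]

# Candidate from the derivation: A = A_star / ||A_star||_F.
a_opt = normalise(a_star)
best = frob_inner(a_opt, a_star)

# Objective values for random feasible (unit Frobenius norm) matrices.
trials = [
    frob_inner(normalise([random.gauss(0.0, 1.0) for _ in range(6)]), a_star)
    for _ in range(1000)
]
```

Maximising $\mathrm{tr}(A (A_\star)^\top)$ is the same as minimising the right-hand side of (A1), and the value attained at the optimum equals $\|A_\star\|_F$.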

Appendix B. Relaxation of the Empirical Free Energy

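We prove the decrease in the empirical free energy (17) under the particle flow when the covariance $C$ is nonsingular. We define the empirical distribution $\hat q(x) = \frac{1}{N}\sum_{i=1}^N \delta_{x, x_i}$ with a finite number $N$ of particles. The empirical free energy is defined as

$$F[\hat q] = \mathbb{E}_{\hat q}[\phi(x)] - \frac{1}{2} \log |C|.$$

We are interested in the temporal change of the free energy, when particles move under a general linear dynamics

$$\frac{dx_i}{dt} = b + A(x_i - m).$$

The induced dynamics for $F$ are:

$$\frac{dF}{dt} = \mathbb{E}_{\hat q}\left[\nabla_x \phi(x)^\top \frac{dx}{dt}\right] - \frac{1}{2}\mathrm{tr}\left(C^{-1}\frac{dC}{dt}\right).$$

For notational simplicity, we introduce $g(x) = \nabla_x \phi(x)$ and $\dot x = \frac{dx}{dt}$ (similarly $\dot m = \frac{dm}{dt}$).

$$\begin{aligned}
\frac{dC}{dt} &= \frac{d}{dt}\mathbb{E}_{\hat q}\left[(x - m)(x - m)^\top\right] \\
&= \mathbb{E}_{\hat q}\left[(\dot x - \dot m)(x - m)^\top\right] + \mathbb{E}_{\hat q}\left[(x - m)(\dot x - \dot m)^\top\right] \\
&= \mathbb{E}_{\hat q}\left[\dot x x^\top + x \dot x^\top - \dot m m^\top - m \dot m^\top\right] \\
&= \mathbb{E}_{\hat q}\left[\dot x (x - m)^\top\right] + \mathbb{E}_{\hat q}\left[(x - m)\dot x^\top\right]
\end{aligned}$$

$$\begin{aligned}
\frac{dF}{dt} &= \mathbb{E}_{\hat q}\left[g(x)^\top \dot x\right] - \frac{1}{2}\mathbb{E}_{\hat q}\left[\mathrm{tr}\left(C^{-1}\dot x (x - m)^\top\right) + \mathrm{tr}\left(C^{-1}(x - m)\dot x^\top\right)\right] \\
&= \mathbb{E}_{\hat q}\left[\dot x^\top \left(g(x) - C^{-1}(x - m)\right)\right]
\end{aligned} \tag{A2}$$

where we used the permutation properties of the trace.

Plugging the dynamics into Equation (A2), we obtain:

$$\frac{dF}{dt} = b^\top \mathbb{E}_{\hat q}[g(x)] + \mathbb{E}_{\hat q}\left[(x - m)^\top A^\top g(x)\right] - \mathbb{E}_{\hat q}\left[(x - m)^\top A^\top C^{-1}(x - m)\right] \tag{A3}$$

where we used the fact that $b^\top C^{-1}\mathbb{E}_{\hat q}[x - m] = 0$.

We next look for conditions on $b$ and $A$, under which $\frac{dF}{dt} < 0$, i.e., the dynamics will lead to a decrease in the free energy. We pick $b = -\beta_1 \mathbb{E}_{\hat q}[g(x)]$, where $\beta_1 > 0$, and we obtain, for the first term in (A3):

$$-\beta_1 \|\mathbb{E}_{\hat q}[g(x)]\|^2 \le 0.$$

For $A$, let us first define $\psi = \mathbb{E}_{\hat q}\left[g(x)(x - m)^\top\right]$ and rewrite the second and last term of Equation (A3) as:

$$\mathbb{E}_{\hat q}\left[(x - m)^\top A^\top g(x)\right] = \mathrm{tr}\left(\mathbb{E}_{\hat q}\left[A^\top g(x)(x - m)^\top\right]\right) = \mathrm{tr}\left(A^\top \psi\right),$$

$$\mathbb{E}_{\hat q}\left[(x - m)^\top A^\top C^{-1}(x - m)\right] = \mathrm{tr}\left(A^\top C^{-1} C\right) = \mathrm{tr}(A).$$

Combining both, we get $\mathrm{tr}\left(A^\top(\psi - I)\right)$. Similarly to the previous step, we pick $A = -\beta_2(\psi - I)$, where $\beta_2 \ge 0$, which leads to another negative term:

$$-\beta_2\, \mathrm{tr}\left((\psi - I)^\top(\psi - I)\right) \le 0,$$

where we use the fact that $X^\top X$ is a positive semi-definite matrix for any real-valued $X$. Note that different forms of $A$ (e.g., $\beta_2$ replaced by a positive definite matrix) could be used, as long as the trace of the product stays positive. Inserting $b$ and $A$, the free energy dynamics become

$$\frac{dF}{dt} = -\beta_1 \|\mathbb{E}_{\hat q}[g(x)]\|^2 - \beta_2\, \mathrm{tr}\left((\psi - I)^\top(\psi - I)\right).$$

The variable dynamics are given by

$$\begin{aligned}
\frac{dx}{dt} &= -\beta_1 \mathbb{E}_{\hat q}[g(x)] - \beta_2(\psi - I)(x - m) \\
&= -\beta_1 \mathbb{E}_{\hat q}[g(x)] - \beta_2\left(\mathbb{E}_{\hat q}\left[g(x)(x - m)^\top\right] - I\right)(x - m),
\end{aligned}$$

which is equivalent to Equation (5), for $\beta_1 = \beta_2 = 1$. Our result shows that the empirical approximation of the free energy decreases under the particle flow.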
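The decrease of the empirical free energy can be observed on a toy problem. The following minimal sketch applies a forward Euler discretisation of the flow with $\beta_1 = \beta_2 = 1$ to four particles and a one-dimensional Gaussian target; the target parameters, step size, initial particles and function names are illustrative assumptions, and a one-dimensional example is of course much simpler than the general setting treated above.

```python
import math

def gpf_step(xs, mu, var, dt):
    """One Euler step of the particle flow for a 1-D target N(mu, var),
    where g(x) = (x - mu) / var is the gradient of phi(x) = -log p(x)."""
    n = len(xs)
    m = sum(xs) / n
    g = [(x - mu) / var for x in xs]
    mean_g = sum(g) / n                                   # E[g(x)]
    psi = sum(gi * (x - m) for gi, x in zip(g, xs)) / n   # E[g(x)(x - m)]
    return [x + dt * (-mean_g - (psi - 1.0) * (x - m)) for x in xs]

def free_energy(xs, mu, var):
    """Empirical free energy E[phi(x)] - 0.5 * log C, up to a constant."""
    n = len(xs)
    m = sum(xs) / n
    c = sum((x - m) ** 2 for x in xs) / n
    e_phi = sum((x - mu) ** 2 for x in xs) / (2.0 * var * n)
    return e_phi - 0.5 * math.log(c)

mu, var, dt = 2.0, 0.25, 0.01
particles = [-1.0, 0.0, 1.0, 4.0]
energies = [free_energy(particles, mu, var)]
for _ in range(2000):
    particles = gpf_step(particles, mu, var, dt)
    energies.append(free_energy(particles, mu, var))

final_mean = sum(particles) / len(particles)
final_cov = sum((x - final_mean) ** 2 for x in particles) / len(particles)
```

With this small step size the recorded free energy is non-increasing along the trajectory, and the empirical mean and variance approach those of the target.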

Appendix C. Riemannian Gradient for Matrix Parameter Γ

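The parameter flow for the matrix $\Gamma$ in (11) is given by

$$\frac{d\Gamma^t}{dt} = \Gamma^t - \mathbb{E}_{q^0}\left[\nabla_x \phi(x^t)(x^0 - m^0)^\top\right]\Gamma^t (\Gamma^t)^\top.$$

This is easily rewritten in terms of the parameter gradient as $\frac{d\Gamma^t}{dt} = -\frac{\partial F}{\partial \Gamma}\Gamma\Gamma^\top$.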

Similar to natural gradients, which are defined by the metric induced by the Fisher matrix, we can rewrite the parameter change in terms of a different Riemannian gradient. This gradient is the direction of change $d\Gamma = \Gamma(t + dt) - \Gamma(t)$, which yields the steepest descent of the free energy over a small time interval $dt$. As an extra condition, one keeps the length of $d\Gamma$ (measured by a 'natural' metric, which has specific invariance properties) fixed. This is defined by an inner product (the squared length) $\langle d\Gamma, d\Gamma\rangle_\Gamma$ in the tangent space of small deviations $d\Gamma$ from the matrix $\Gamma$. Hence, $d\Gamma$ is found by minimising $F(\Gamma(t) + d\Gamma)$ (for small $d\Gamma$) under the condition that $\langle d\Gamma, d\Gamma\rangle_{\Gamma(t)}$ is fixed.

Following [20] (Theorem 6), a natural metric in the space of symmetric nonsingular matrices can be defined as

$$\langle d\Gamma, d\Gamma\rangle_\Gamma \doteq \mathrm{tr}\left((d\Gamma\, \Gamma^{-1})^\top d\Gamma\, \Gamma^{-1}\right).$$

This metric is invariant against multiplications of $\Gamma$ and $d\Gamma$ by matrices $Y$, i.e., $\langle d\Gamma, d\Gamma\rangle_\Gamma = \langle d\Gamma\, Y, d\Gamma\, Y\rangle_{\Gamma Y}$, and reduces to the Euclidean metric at the unit matrix $\Gamma = I$.

The direction of the natural gradient is obtained by expanding the free energy for small $d\Gamma$ and introducing a Lagrange multiplier $\lambda$ for the constraint. One ends up with the quadratic form

$$\frac{\partial F}{\partial \Gamma}\, d\Gamma + \lambda\, \mathrm{tr}\left((d\Gamma\, \Gamma^{-1})^\top d\Gamma\, \Gamma^{-1}\right)$$

to be minimised by $d\Gamma$. By taking the derivative with respect to $d\Gamma$, one finds that the direction of $d\Gamma$ agrees with the right equation of the flow (11).
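The invariance of this metric under right multiplication follows from $(d\Gamma\, Y)(\Gamma Y)^{-1} = d\Gamma\, \Gamma^{-1}$, and it can be checked numerically. The sketch below does so for arbitrary $2 \times 2$ matrices; the matrices and helper names are chosen purely for illustration.

```python
def matmul(a, b):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(a):
    """Inverse of a nonsingular 2 x 2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def metric(d_gamma, gamma):
    """Squared length tr((dG G^-1)^T dG G^-1) of dG at the point G."""
    m = matmul(d_gamma, inv2(gamma))
    return sum(m[i][j] ** 2 for i in range(2) for j in range(2))

gamma = [[2.0, 0.5], [0.5, 1.0]]
d_gamma = [[0.1, -0.2], [0.3, 0.05]]
y = [[1.0, 2.0], [0.0, 3.0]]      # any nonsingular matrix
identity = [[1.0, 0.0], [0.0, 1.0]]

length_original = metric(d_gamma, gamma)
length_transformed = metric(matmul(d_gamma, y), matmul(gamma, y))
length_at_identity = metric(d_gamma, identity)
euclidean = sum(v ** 2 for row in d_gamma for v in row)
```

The two lengths agree up to rounding, and at $\Gamma = I$ the metric coincides with the squared Frobenius norm of $d\Gamma$.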

Appendix D. Regularised Free Energy for N≤D

The problem of defining an empirical approximation for $N \le D$ particles is that the empirical covariance becomes singular and typically has $N - 1$ nonzero eigenvalues, and thus $|C| = 0$. Note that the extra 0 eigenvalue is derived from the fact that the empirical sum of fluctuations must be zero, which provides an additional linear constraint.

We can regularise the log determinant term by replacing the zero eigenvalues of $C$: $\lambda_i = 0 \to \tilde\lambda_i = 1$. For the new covariance $\tilde C$, the log determinant becomes

$$\log |\tilde C| = \sum_{i : \lambda_i > 0} \log \lambda_i,$$

since $\log 1 = 0$. The dynamics of the particles stays the same. To rewrite this formally in terms of matrices, we define

$$\tilde C = C + C_\perp,$$

where $C_\perp = \sum_{i : \lambda_i = 0} e_i e_i^\top$ and $e_i$ is the $i$-th eigenvector of $C$. This replaces all 0 eigenvalues by 1. $C_\perp$ is a projector: $C_\perp^2 = C_\perp$ and $C_\perp(I - C_\perp) = 0$. We also have $\mathrm{tr}(C_\perp) = D - (N - 1)$. In the following, it is useful to introduce the $D \times N$ matrix of fluctuations $Z$, such that $C = ZZ^\top / N$. The column vectors of $Z$ span the subspace of eigenvectors $e_i$ with $\lambda_i > 0$. Hence, it follows that $C_\perp Z = 0$.

We want to show that the regularised free energy $\tilde F$ decreases under the particle dynamics for $N \le D$. Since the part of the time derivative of $\tilde F$ that depends on $m$ is not changed, we will only discuss the fluctuation part in the following.

It is useful to introduce the matrix:

$$\tilde A = I - C_\perp - gZ^\top / N = A - C_\perp,$$

where $g = \nabla_x \phi(x)$ is the $D \times N$ matrix of the gradients. Then

$$\begin{aligned}
\mathbb{E}_{\hat q}\left[g(x)^\top \frac{dx}{dt}\right] &= \mathrm{tr}(A) - \mathrm{tr}(A^\top A) \\
&= \mathrm{tr}(\tilde A + C_\perp) - \mathrm{tr}\left((\tilde A + C_\perp)^\top(\tilde A + C_\perp)\right) \\
&= \mathrm{tr}(\tilde A) - \mathrm{tr}(\tilde A^\top \tilde A).
\end{aligned}$$

To obtain this result, we need

$$\mathrm{tr}(C_\perp \tilde A) = \mathrm{tr}(C_\perp \tilde A^\top) = \mathrm{tr}\left(C_\perp(I - C_\perp) - C_\perp Z g^\top / N\right) = 0.$$

We need to work out

$$-\frac{1}{2}\frac{d \ln |\tilde C|}{dt} = -\frac{1}{2}\mathrm{tr}\left(\frac{d\tilde C}{dt}\tilde C^{-1}\right) = -\frac{1}{2}\mathrm{tr}\left(\frac{dC}{dt}\tilde C^{-1}\right),$$

where we have used the fact that the eigenvalues $\tilde\lambda_i = 1$ of $\tilde C$ have a zero time derivative and can be omitted. We use the linear dynamics $\frac{dZ}{dt} = AZ$ to obtain:

$$\begin{aligned}
\frac{dC}{dt} &= CA^\top + AC \\
&= (\tilde C - C_\perp)(\tilde A^\top + C_\perp) + (\tilde A + C_\perp)(\tilde C - C_\perp) \\
&= \tilde C \tilde A^\top + \tilde A \tilde C + C_\perp \tilde C + \tilde C C_\perp - \tilde A C_\perp - C_\perp \tilde A^\top - 2C_\perp \\
&= \tilde C \tilde A^\top + \tilde A \tilde C,
\end{aligned}$$

where we have used $C_\perp^2 = C_\perp$ and $C_\perp \tilde A^\top = 0$. Hence

$$-\frac{1}{2}\mathrm{tr}\left(\frac{d\tilde C}{dt}\tilde C^{-1}\right) = -\mathrm{tr}(\tilde A).$$

Finally, the temporal change in the free energy due to the fluctuations is given by

$$\frac{d\tilde F}{dt} = -\mathrm{tr}(\tilde A^\top \tilde A) \le 0.$$

Note that this proof is not only valid for $N \le D$, but also for $N > D$, as the overall computations are simplified with $C_\perp = 0$. A more detailed proof for $N > D$ is, furthermore, given in Appendix B.

Efficient Computation of $\log |\tilde C|$

A practical way to compute $\log |\tilde C|$ without performing an eigenvector expansion is to define the $N \times N$ matrix

$$R = Z^\top Z / N + J_{N,N} / N,$$

where $J_{N,N}$ is the $N \times N$ all-ones matrix. $Z^\top Z / N$ shares the $N - 1$ nonzero eigenvalues with $C$ and has an additional eigenvalue 0 corresponding to the constant eigenvector $(e_N)_i = 1/\sqrt{N}$. Adding the scaled all-ones matrix preserves all existing eigenvalues while replacing the 0 one with 1. This leads to the following result:

$$-\frac{1}{2}\log |R| = -\frac{1}{2}\sum_{i=1}^{N-1} \log \lambda_i.$$
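The determinant identity can be verified on a small example. In the sketch below, the fluctuation matrix $Z$ is hand-built (an assumption for illustration) with orthogonal rows that each sum to zero, so that the nonzero eigenvalues of $C = ZZ^\top/N$ can be read off directly and compared with $\log|R|$; the determinant is computed with a plain Gaussian elimination helper.

```python
import math

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def log_det_regularised(z):
    """log|R| with R = Z^T Z / N + J / N, for Z given as D rows of length N."""
    d, n = len(z), len(z[0])
    r = [[(sum(z[k][i] * z[k][j] for k in range(d)) + 1.0) / n
          for j in range(n)] for i in range(n)]
    return math.log(det(r))

# D = 3, N = 3: rows are orthogonal and sum to zero (centred fluctuations),
# so C = Z Z^T / N is diagonal with entries 2*s1^2/3, 6*s2^2/3 and 0.
s1, s2 = 1.5, 0.5
z = [[s1, -s1, 0.0],
     [s2, s2, -2.0 * s2],
     [0.0, 0.0, 0.0]]
nonzero_eigs = [2.0 * s1 ** 2 / 3.0, 6.0 * s2 ** 2 / 3.0]

log_det_r = log_det_regularised(z)
expected = sum(math.log(lam) for lam in nonzero_eigs)
```

Only an $N \times N$ determinant is needed, which is the practical benefit when $N$ is much smaller than $D$.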

Appendix E. Proof of Theorem 1: Fixed Points for a Gaussian Model (N > D)

Theorem A1 (1). If the target density $p(x)$ is a $D$-dimensional multivariate Gaussian, only $D + 1$ particles are needed for Algorithm 2 to converge to the exact target parameters.

The general fixed-point condition for the dynamics (13) of the position $x_i$ for particle $i$ is given by:

$$\left(I - \mathbb{E}_{\hat q}\left[g(x)(x - m)^\top\right]\right)(x_i - m) - \mathbb{E}_{\hat q}[g(x)] = 0$$

for $i = 1, \ldots, N$. By taking the expectation over all particles, we obtain:

$$\mathbb{E}_{\hat q}[g(x)] = 0, \tag{A4}$$

where $\hat q$ is the empirical distribution of particles at the fixed point. Note that this result is independent of $N$, i.e., it is also valid for $N = 1$.

For a $D$-dimensional Gaussian target $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$, we will show that the empirical mean and covariance given by the particle algorithm converge to the true mean and covariance matrix of the Gaussian when we use $N \ge D + 1$ particles. In this setting, we have $\phi(x) = \frac{1}{2}x^\top \Sigma^{-1} x - x^\top \Sigma^{-1}\mu$. For simplification, we use the precision matrix $\Lambda = \Sigma^{-1}$ and get

$$\phi(x) = \frac{1}{2}x^\top \Lambda x - x^\top \Lambda \mu.$$

The gradient $g(x)$ becomes:

$$g(x) = \Lambda(x - \mu).$$

At the fixed points, we have that $\frac{dm}{dt}$ and $\frac{d\Gamma}{dt}$ are equal to 0. For the mean $m$:

$$\frac{dm}{dt} = -\mathbb{E}_{\hat q}[g(x)] = 0 \;\Rightarrow\; \Lambda\, \mathbb{E}_{\hat q}[x - \mu] = 0 \;\Rightarrow\; \Lambda m = \Lambda \mu \;\Rightarrow\; m = \mu.$$

For the matrix $\Gamma$, we have

$$\begin{aligned}
\frac{d\Gamma}{dt} = -A\Gamma &= 0 \\
\Gamma - \mathbb{E}_{q^0}\left[g(x)(x - m)^\top\right]\Gamma &= 0 \\
\mathbb{E}_{q^0}\left[\Lambda(x - \mu)(x - m)^\top\right]\Gamma &= \Gamma \\
-2\eta_2\, \mathbb{E}_{q^0}\left[(x - m)(x - m)^\top\right]\Gamma &= \Gamma \\
\Lambda C \Gamma &= \Gamma \\
\Lambda C^2 &= C
\end{aligned}$$

where $\eta_2 = -\frac{1}{2}\Lambda$, and where we used the result for the mean $m = \mu$ and right-multiplied by $\Gamma^\top$, using $C = \Gamma\Gamma^\top$. Now, we can only simplify this to $C = \Lambda^{-1} = \Sigma$ if $C$ is not singular. This is true only if its rank is equal to $D$, needing $D + 1$ particles.

Appendix F. Proof of Theorem 2: Rates of Convergence for Gaussian Targets

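Theorem A2 (2). For a target $p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1})$, where $x \in \mathbb{R}^D$, and $N \ge D + 1$ particles, the continuous time limit of Algorithm 2 will converge exponentially fast for both the mean and the trace of the precision matrix:

$$m^t - \mu = e^{-\Lambda t}(m^0 - \mu),$$

$$\mathrm{tr}\left((C^t)^{-1} - \Lambda\right) = e^{-2t}\, \mathrm{tr}\left((C^0)^{-1} - \Lambda\right),$$

where $m^t$ and $C^t$ are the empirical mean and covariance matrix at time $t$ and $\exp(-\Lambda t)$ is the matrix exponential.

In the following, we assume the target $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$. We use the notation $\Lambda \doteq \Sigma^{-1}$ and $\delta C^t = C^t - \Sigma$.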

Appendix F.1. Convergence of the Mean

Given our target $p(x)$, similarly to Appendix E we have $g(x) = \Lambda(x - \mu)$, where $\eta_1 = \Sigma^{-1}\mu$ and $\eta_2 = -\frac{1}{2}\Sigma^{-1}$. This transforms the first of Equations (11) into

$$\frac{dm}{dt} = -\Lambda\left(\mathbb{E}_{\hat q}[x] - \mu\right) = -\Lambda(m - \mu).$$

If we now consider the error on $m$, $\delta m = m - \mu$, we obtain:

$$\frac{d\,\delta m}{dt} = \frac{dm}{dt} = -\Lambda(m - \mu) = -\Lambda\, \delta m.$$

Therefore, the mean converges exponentially fast to the true mean. The asymptotic rate is governed by the largest eigenvalue of $\Lambda$, i.e., the inverse of the smallest eigenvalue of $\Sigma$, $\lambda_{\min}$.

Appendix F.2. Convergence of the Covariance Matrix

Let $z = x - m$. We have, from Equation (5), that

$$\frac{dz}{dt} = -Az,$$

where $A = \mathbb{E}_{q^0}\left[g(x) z^\top\right] - I$. This expectation can further be simplified as

$$\mathbb{E}_{\hat q}\left[\Lambda(x - \mu) z^\top\right] = \Lambda C, \tag{A5}$$

where $q \sim \mathcal{N}(m, C)$. Hence, we have the exact result

$$\frac{dC}{dt} = (I - \Lambda C)C + C(I - C\Lambda). \tag{A6}$$

We know that the optimal target is $C = \Sigma$. Therefore, we define the error $\delta C = C - \Sigma$. Linearizing Equation (A6) gives us

$$\begin{aligned}
\frac{d\,\delta C}{dt} = \frac{dC}{dt} &= \left(I - \Lambda(\delta C + \Sigma)\right)(\delta C + \Sigma) + (\delta C + \Sigma)\left(I - (\delta C + \Sigma)\Lambda\right) \\
&= -\Lambda\, \delta C\, (\delta C + \Sigma) - (\delta C + \Sigma)\, \delta C\, \Lambda \\
&\approx -\Lambda\, \delta C\, \Sigma - \Sigma\, \delta C\, \Lambda.
\end{aligned}$$

We were not yet able to find a general solution of this equation, but we can obtain a simple result for the trace $y^t \doteq \mathrm{tr}(\delta C)$ at time $t$:

$$\frac{dy^t}{dt} \simeq -2y^t.$$

We, therefore, have an asymptotic linear convergence, $y^t \propto e^{-2t} y^0$, which is independent of the parameters of the Gaussian model.

We can also equivalently obtain a non-asymptotic estimate of a specific error measure for the precision matrix. Using Equation (A6), we have the following dynamics for the precision $C^{-1}$:

$$\frac{dC^{-1}}{dt} = -C^{-1}\frac{dC}{dt}C^{-1} = -C^{-1}(I - \Lambda C) - (I - C\Lambda)C^{-1}.$$

Taking the trace,

$$\frac{d\,\mathrm{tr}(C^{-1})}{dt} = -2\,\mathrm{tr}(C^{-1}) + 2\,\mathrm{tr}(\Lambda), \qquad \frac{d\,\mathrm{tr}(C^{-1} - \Lambda)}{dt} = -2\,\mathrm{tr}(C^{-1} - \Lambda).$$

Hence we get the following exact result:

$$\mathrm{tr}\left((C^t)^{-1} - \Lambda\right) = e^{-2t}\, \mathrm{tr}\left((C^0)^{-1} - \Lambda\right),$$

which is again independent of the parameters of the Gaussian model.

Additionally, this tells us that if the covariance $C$ is non-singular at time $t = 0$, it will remain non-singular for all $t$ (otherwise $\mathrm{tr}(C^{-1})$ would be infinite). Hence, if we start with $N > D$ particles with a proper empirical covariance, they cannot collapse to make $C$ singular.
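The exact decay of the precision error can be checked in the scalar case, where Equation (A6) reduces to $\frac{dc}{dt} = 2c(1 - \lambda c)$ for a variance $c$ and target precision $\lambda$. The following minimal sketch integrates this ODE with a classical fourth-order Runge-Kutta scheme and compares the result against $e^{-2t}(1/c^0 - \lambda)$; the numerical values and the choice of integrator are illustrative assumptions.

```python
import math

def dc_dt(c, lam):
    """Scalar version of (A6): dC/dt = (1 - lam*C)*C + C*(1 - C*lam)."""
    return 2.0 * c * (1.0 - lam * c)

def rk4(c, lam, dt, steps):
    """Integrate the scalar covariance ODE with classical Runge-Kutta."""
    for _ in range(steps):
        k1 = dc_dt(c, lam)
        k2 = dc_dt(c + 0.5 * dt * k1, lam)
        k3 = dc_dt(c + 0.5 * dt * k2, lam)
        k4 = dc_dt(c + dt * k3, lam)
        c += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return c

lam, c0 = 3.0, 2.0          # target precision and initial variance
dt, steps = 1e-3, 1500
t = dt * steps

c_t = rk4(c0, lam, dt, steps)
precision_error = 1.0 / c_t - lam
predicted = math.exp(-2.0 * t) * (1.0 / c0 - lam)
```

The decay rate of the precision error is 2 regardless of the value chosen for $\lambda$, which is the parameter independence stated above.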

Appendix F.3. Convergence of the Trace of the Covariance

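The asymptotic result on traces obtained previously can be turned into an exact inequality. We have

$$\frac{d\,\delta C}{dt} = -\Lambda\, \delta C\, \Sigma - \Sigma\, \delta C\, \Lambda - \Lambda(\delta C)^2 - (\delta C)^2\Lambda.$$

Taking the trace, we get

$$\frac{d\,\mathrm{tr}(\delta C)}{dt} = -2\,\mathrm{tr}(\delta C) - 2\,\mathrm{tr}(\delta C\, \Lambda\, \delta C).$$

Since $\delta C\, \Lambda\, \delta C$ is positive semi-definite, we have $-2\,\mathrm{tr}(\delta C\, \Lambda\, \delta C) \le 0$ and thus

$$\frac{d\,\mathrm{tr}(\delta C)}{dt} \le -2\,\mathrm{tr}(\delta C),$$

leading to:

$$\mathrm{tr}(\delta C^t) \le \mathrm{tr}(\delta C^0)\, e^{-2t}$$

by using Grönwall's lemma [46]:

Lemma A1 (Grönwall). For an interval $I_0 = [0, \infty)$ and a given function $f$ differentiable everywhere in $I_0$ and satisfying

$$f'(t) \le \beta(t) f(t), \quad t \in I_0,$$

$f$ is bounded by the solution of the corresponding differential equation $g'(t) = \beta(t) g(t)$:

$$f(t) \le f(0) \exp\left(\int_0^t \beta(s)\, ds\right), \quad t \in I_0.$$

The bound is nontrivial only if $\mathrm{tr}(\delta C) \ge 0$. This would be a natural assumption for a Bayesian model, if $C^0$ is the prior covariance and the eigenvalues of $C^{t=\infty}$ (corresponding to the posterior) are reduced by the data.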

Appendix F.4. Decay of Fluctuation Part of the Free Energy

Still focusing on the Gaussian model, we can further derive a bound on the free energy. It is easy to see that for the Gaussian case, the free energy in Equation (4) separates into a sum of two terms. The first one depends on the mean $m$ only and the second one only on the fluctuations (i.e., $C^t$).

We will consider the second, nontrivial part only. We assume that the covariance matrix is nonsingular (corresponding to $N > D$). The fluctuation part of the free energy (minus its minimum) is given by

$$F_{fl} = -\frac{1}{2}\ln |I - B| - \frac{1}{2}\mathrm{tr}(B),$$

where we have introduced the matrix $B = I - \Lambda C$. One can show that its eigenvalues are real and are upper bounded by 1. First, we can show from the equations of motion that

$$\frac{dF_{fl}}{dt} = -\mathrm{tr}(BB^\top). \tag{A7}$$

Second, using the elementary bound $-\ln(1 - u) \le \frac{u}{1 - u}$, valid for $u < 1$ and applied to the eigenvalues of $B$, yields

$$\begin{aligned}
F_{fl} &\le \frac{1}{2}\mathrm{tr}\left(B(I - B)^{-1} - B\right) \\
&= \frac{1}{2}\mathrm{tr}\left(B(I - B)^{-1} - B(I - B)(I - B)^{-1}\right) \\
&= \frac{1}{2}\mathrm{tr}\left(B^2(I - B)^{-1}\right) \\
&= \frac{1}{2}\mathrm{tr}\left(B^2 C^{-1}\Lambda^{-1}\right) \le \frac{1}{2}\mathrm{tr}\left(B^\top \Lambda^{-1} B C^{-1}\right).
\end{aligned}$$

The last two steps used the definition $B = I - \Lambda C$. Since $B^\top \Lambda^{-1} B$ and $C^{-1}$ are both positive definite, we can bound the last term by (see [47], Theorem 6.5)

$$F_{fl} \le \frac{1}{2}\mathrm{tr}\left(B^\top \Lambda^{-1} B\right)\mathrm{tr}(C^{-1}) \le \frac{1}{2}\mathrm{tr}(BB^\top)\, \mathrm{tr}(\Lambda^{-1})\, \mathrm{tr}(C^{-1}),$$

where, in the last line, we have bounded the trace of a product of p.d. matrices a second time.

Combining with Equation (A7) we show that

$$\frac{dF_{fl}}{dt} \le -\frac{2F_{fl}}{\mathrm{tr}(\Lambda^{-1})\, \mathrm{tr}(C^{-1})}.$$

We can plug in our result from Theorem 2:

$$\begin{aligned}
\mathrm{tr}(C^{-1}) &= \mathrm{tr}(\Lambda) + \mathrm{tr}(C^{-1} - \Lambda) \\
&= \mathrm{tr}(\Lambda) + e^{-2t}\, \mathrm{tr}\left((C^0)^{-1} - \Lambda\right) \\
&\le \mathrm{tr}(\Lambda) + e^{-2t}\left|\mathrm{tr}\left((C^0)^{-1} - \Lambda\right)\right| \\
&\le \mathrm{tr}(\Lambda) + \left|\mathrm{tr}\left((C^0)^{-1} - \Lambda\right)\right|.
\end{aligned}$$

We can plug this in and use Grönwall's Lemma A1 to get an exponential bound

$$F_{fl}(C^t) \le F_{fl}(C^0) \exp\left(-\frac{2t}{\mathrm{tr}(\Lambda^{-1})\left(\mathrm{tr}(\Lambda) + \left|\mathrm{tr}\left((C^0)^{-1} - \Lambda\right)\right|\right)}\right).$$

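Appendix F.5. Asymptotic Decay of the Free Energy

For large times $t$, we can do better. Let us analyse the asymptotic decay constant $\lambda_{free}$, with $F_{fl} \simeq e^{-\lambda_{free} t}$, defined by

$$\lambda_{free} \doteq -\lim_{t \to \infty} \frac{d \ln(F_{fl})}{dt} = -\lim_{t \to \infty} \frac{dF_{fl}/dt}{F_{fl}} = \lim_{t \to \infty} \frac{\mathrm{tr}(BB^\top)}{-\frac{1}{2}\ln |I - B| - \frac{1}{2}\mathrm{tr}(B)} \ge \lim_{t \to \infty} \frac{\mathrm{tr}(B^2)}{-\frac{1}{2}\ln |I - B| - \frac{1}{2}\mathrm{tr}(B)}.$$

In the last inequality, we used $\mathrm{tr}(BB^\top) \ge \mathrm{tr}(B^2)$. Everything is expressed by traces of functions of $B$, and thus by its eigenvalues. Since $B \to 0$ as $t \to \infty$ (this applies also to its eigenvalues $u$), we can use the Taylor expansion $\ln(1 - u) + u = -u^2/2 + O(u^3)$ to show that

$$\lambda_{free} \ge 4,$$

which is independent of $\Lambda$.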
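The step from the Taylor expansion to the constant 4 can be written out explicitly. Writing $u_i$ for the eigenvalues of $B$ (a symbol introduced here only for this worked step), both the numerator and the denominator of the limit are sums over eigenvalues:

$$\mathrm{tr}(B^2) = \sum_i u_i^2, \qquad -\frac{1}{2}\ln |I - B| - \frac{1}{2}\mathrm{tr}(B) = -\frac{1}{2}\sum_i \left[\ln(1 - u_i) + u_i\right] = \frac{1}{4}\sum_i u_i^2 + O(u^3),$$

so that, as all $u_i \to 0$,

$$\frac{\mathrm{tr}(B^2)}{-\frac{1}{2}\ln |I - B| - \frac{1}{2}\mathrm{tr}(B)} = \frac{\sum_i u_i^2}{\frac{1}{4}\sum_i u_i^2 + O(u^3)} \longrightarrow 4.$$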

Appendix G. Proof of Theorem 3: Fixed-Points for Gaussian Model (N≤D)

Theorem A3 (3). Given a $D$-dimensional multivariate Gaussian target density $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$, using Algorithm 2 with $N < D + 1$ particles, the empirical mean converges to the exact mean $\mu$. The $N - 1$ non-zero eigenvalues of $C$ converge to a subset of the target covariance $\Sigma$ spectrum. Furthermore, the global minimum of the regularised version $\tilde F$ of the free energy (17) corresponds to the largest eigenvalues of $\Sigma$.

Applying Equation (A4) to our fixed point equation, we obtain

$$\left(I - \mathbb{E}_{\hat q}\left[g(x)(x - m)^\top\right]\right)(x_i - m) = 0, \quad \forall i = 1, \ldots, N.$$

Hence, the centered positions of the particles, $S = \{x_i - m\}_{i=1}^N$, are all eigenvectors of the matrix $\mathbb{E}_{\hat q}\left[g(x)(x - m)^\top\right]$ with eigenvalue 1. $S$ spans an $(N - 1)$-dimensional space (we have $\sum_{i=1}^N (x_i - m) = 0$).

If we specialise to a Gaussian target $p(x) = \mathcal{N}(x \mid \mu, \Sigma)$ (and $\Lambda = \Sigma^{-1}$), we have $g(x) = \Lambda(x - \mu)$ and can reuse the result from Equation (A5):

$$\mathbb{E}_{\hat q}\left[g(x)(x - m)^\top\right] = \Lambda\, \mathbb{E}_{\hat q}\left[(x - m)(x - m)^\top\right] = \Lambda C.$$

Using the equality above, we get:

$$\Lambda C(x_i - m) = (x_i - m) \;\Rightarrow\; C(x_i - m) = \Sigma(x_i - m), \quad \forall i = 1, \ldots, N,$$

which shows that the obtained low-rank covariance $C$ and the target covariance $\Sigma$ have $N - 1$ eigenvectors and eigenvalues in common.

However, are these the largest ones? We look at the modified free energy (17) (ignoring the contribution of the mean):

$$\min \tilde F = \min\left(-\frac{1}{2}\sum_{i : \lambda_i > 0} \ln \lambda_i + \mathrm{tr}(\Lambda C)\right),$$

where $\lambda_i$ are the eigenvalues of the empirical covariance $C$. We first note that $\mathrm{tr}(\Lambda C) = N - 1$, independent of which eigenvalues are obtained at the fixed point. This is easily seen by the following argument: if we use the index set $\mathcal{I}$ for the common eigenvectors $e_i$ and eigenvalues $\lambda_i$, $i \in \mathcal{I}$, we can write

$$C = \sum_{i \in \mathcal{I}} e_i \lambda_i e_i^\top, \qquad \Sigma = \sum_{i=1}^{D} e_i \lambda_i e_i^\top.$$

From this we obtain

$$\mathrm{tr}(\Lambda C) = \mathrm{tr}\left(\sum_{i \in \mathcal{I}} e_i \lambda_i^{-1}\lambda_i e_i^\top\right) = N - 1.$$

From this result we obtain

$$\arg\min \tilde F = \arg\max\left(\frac{1}{2}\sum_{i : \lambda_i > 0} \ln \lambda_i - (N - 1)\right).$$

The term $N - 1$ is a constant, but the first term makes a difference: the absolute minimum is achieved when the $\lambda_i$ are the $N - 1$ largest eigenvalues of $\Sigma$. Our simulations empirically show that the algorithm usually converges to the absolute minimum.

Appendix H. Dimension-Wise Optimizers

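Here, we list some of the most popular optimizers and their dimension-wise versions. In all algorithms, we consider $\phi$, the matrix created by the concatenation of the flow of each particle: $\phi = [\phi_1, \ldots, \phi_N]$, where $\phi_n = \phi(x_n)$. We additionally use the notation $\phi_{n,d}$ for the $d$-th dimension of the flow of the $n$-th particle. The main difference between the original algorithms and their modified versions is that the accumulator $v$ is averaged over the particles and indexed by dimension only (highlighted in red in the typeset version).

Appendix H.1. ADAM

The ADAM algorithm is given by:

Algorithm A1: ADAM

Input: $\phi^t, m^{t-1}, v^{t-1}, \beta_1, \beta_2, \eta$

Output: $\Delta$

$m^t_{n,d} = \beta_1 m^{t-1}_{n,d} + (1 - \beta_1)\phi^t_{n,d}$

$v^t_{n,d} = \beta_2 v^{t-1}_{n,d} + (1 - \beta_2)\left(\phi^t_{n,d}\right)^2$

$\Delta_{n,d} = \dfrac{\eta\, m^t_{n,d}}{(1 - \beta_1^t)\left(\sqrt{v^t_{n,d}(1 - \beta_2^t)^{-1}} + \epsilon\right)}$

Algorithm A2: Dimension-wise ADAM

Input: $\phi^t, m^{t-1}, v^{t-1}, \beta_1, \beta_2, \eta$

Output: $\Delta$

$m^t_{n,d} = \beta_1 m^{t-1}_{n,d} + (1 - \beta_1)\phi^t_{n,d}$

$v^t_{d} = \beta_2 v^{t-1}_{d} + (1 - \beta_2)\frac{1}{N}\sum_{n=1}^{N}\left(\phi^t_{n,d}\right)^2$

$\Delta_{n,d} = \dfrac{\eta\, m^t_{n,d}}{(1 - \beta_1^t)\left(\sqrt{v^t_{d}(1 - \beta_2^t)^{-1}} + \epsilon\right)}$

Appendix H.2. AdaGrad

The AdaGrad algorithm is given by:

Algorithm A3: AdaGrad

Input: $\phi^t, v^{t-1}, \eta$

Output: $\Delta$

$v^t_{n,d} = v^{t-1}_{n,d} + \left(\phi^t_{n,d}\right)^2$

$\Delta_{n,d} = \dfrac{\eta\, \phi^t_{n,d}}{\sqrt{v^t_{n,d}} + \epsilon}$

Algorithm A4: Dimension-wise AdaGrad

Input: $\phi^t, v^{t-1}, \eta$

Output: $\Delta$

$v^t_{d} = v^{t-1}_{d} + \frac{1}{N}\sum_{n=1}^{N}\left(\phi^t_{n,d}\right)^2$

$\Delta_{n,d} = \dfrac{\eta\, \phi^t_{n,d}}{\sqrt{v^t_{d}} + \epsilon}$

Appendix H.3. RMSProp

The RMSProp algorithm is given by:

Algorithm A5: RMSProp

Input: $\phi^t, v^{t-1}, \rho, \eta$

Output: $\Delta$

$v^t_{n,d} = \rho\, v^{t-1}_{n,d} + (1 - \rho)\left(\phi^t_{n,d}\right)^2$

$\Delta_{n,d} = \dfrac{\eta\, \phi^t_{n,d}}{\sqrt{v^t_{n,d}} + \epsilon}$

Algorithm A6: Dimension-wise RMSProp

Input: $\phi^t, v^{t-1}, \rho, \eta$

Output: $\Delta$

$v^t_{d} = \rho\, v^{t-1}_{d} + (1 - \rho)\frac{1}{N}\sum_{n=1}^{N}\left(\phi^t_{n,d}\right)^2$

$\Delta_{n,d} = \dfrac{\eta\, \phi^t_{n,d}}{\sqrt{v^t_{d}} + \epsilon}$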
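To illustrate why the dimension-wise variant keeps the update linear in the flow, the following minimal sketch implements one step of Algorithms A5 and A6 on a small hand-picked flow matrix. The hyperparameter values, the zero initial accumulators and the function names are assumptions made for this example only.

```python
import math

def rmsprop_elementwise(phi, v_prev, rho=0.9, eta=1e-2, eps=1e-8):
    """Algorithm A5: one accumulator per particle n and dimension d."""
    v = [[rho * vp + (1.0 - rho) * p * p for p, vp in zip(p_row, v_row)]
         for p_row, v_row in zip(phi, v_prev)]
    delta = [[eta * p / (math.sqrt(vv) + eps) for p, vv in zip(p_row, v_row)]
             for p_row, v_row in zip(phi, v)]
    return delta, v

def rmsprop_dimensionwise(phi, v_prev, rho=0.9, eta=1e-2, eps=1e-8):
    """Algorithm A6: the squared flow is averaged over particles, so there
    is a single accumulator (and a single step size) per dimension d."""
    n, dim = len(phi), len(phi[0])
    v = [rho * v_prev[d] + (1.0 - rho) * sum(phi[i][d] ** 2 for i in range(n)) / n
         for d in range(dim)]
    delta = [[eta * phi[i][d] / (math.sqrt(v[d]) + eps) for d in range(dim)]
             for i in range(n)]
    return delta, v

# Flow of N = 3 particles in D = 2 dimensions (illustrative values).
phi = [[1.0, -2.0], [0.5, 4.0], [-3.0, 1.0]]

delta_el, _ = rmsprop_elementwise(phi, [[0.0, 0.0] for _ in phi])
delta_dw, _ = rmsprop_dimensionwise(phi, [0.0, 0.0])

# Effective per-entry step sizes Delta_{n,d} / phi_{n,d}.
scales_el = [[delta_el[i][d] / phi[i][d] for i in range(3)] for d in range(2)]
scales_dw = [[delta_dw[i][d] / phi[i][d] for i in range(3)] for d in range(2)]
```

In the dimension-wise version every particle shares the same step size within a dimension, so the update is a diagonal matrix applied to the flow; the element-wise version rescales each particle differently, which is the source of the distortion discussed in Section 4.4.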

Appendix I. Additional Figures

Appendix I.1. Bayesian Logistic Regression

Similarly to the previous section, we also show results with the RMSProp optimizer with learning rate $1 \times 10^{-4}$.

(a) Mean-field approximation (b) No mean-field approximation

Figure A1. Similarly to Figure 6, we show the average negative log-likelihood on a test set over 10 runs against training time on different datasets for a Bayesian logistic regression problem. The dashed curve represents the low-rank approximation with RMSProp for methods based on stochastic estimators.

Appendix I.2. Bayesian Neural Network

Figure A2. Convergence of the classification error and average negative log-likelihood as a function of time.

Figure A3. Accuracy vs. confidence. Every test sample is binned according to its highest predictive probability. The accuracy of each bin is then computed. A perfectly calibrated estimator would return the identity.

References

1. Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R.P.; de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 2016, 104, 148–175. [CrossRef]

2. Settles, B. Active Learning Literature Survey; Computer Sciences Technical Report 1648; University of Wisconsin–Madison: Madison, WI, USA, 2009.

3. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; The MIT Press: Cambridge, MA, USA, 2018.

4. Bardenet, R.; Doucet, A.; Holmes, C. On Markov chain Monte Carlo methods for tall data. J. Mach. Learn. Res. 2017, 18, 1515–1557.

5. Cowles, M.K.; Carlin, B.P. Markov chain Monte Carlo convergence diagnostics: A comparative review. J. Am. Stat. Assoc. 1996, 91, 883–904. [CrossRef]

6. Barber, D.; Bishop, C.M. Ensemble learning for multi-layer networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1998; pp. 395–401.

7. Graves, A. Practical Variational Inference for Neural Networks. In Proceedings of the 24th International Conference on Neural Information Processing Systems, Granada, Spain, 12–15 December 2011; Volume 24, pp. 2348–2356.

8. Ranganath, R.; Gerrish, S.; Blei, D. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 814–822.

9. Liu, Q.; Lee, J.; Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 276–284.

10. Liu, Q.; Wang, D. Stein variational gradient descent as moment matching. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 32, pp. 8868–8877.

11. Zhuo, J.; Liu, C.; Shi, J.; Zhu, J.; Chen, N.; Zhang, B. Message Passing Stein Variational Gradient Descent. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 6018–6027.

12. Opper, M.; Archambeau, C. The variational Gaussian approximation revisited. Neural Comput. 2009, 21, 786–792. [CrossRef] [PubMed]

13. Challis, E.; Barber, D. Gaussian kullback-leibler approximate inference. J. Mach. Learn. Res. 2013, 14, 2239–2286.

14. Titsias, M.; Lázaro-Gredilla, M. Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1971–1979.

15. Ong, V.M.H.; Nott, D.J.; Smith, M.S. Gaussian variational approximation with a factor covariance structure. J. Comput. Graph. Stat. 2018, 27, 465–478. [CrossRef]

16. Tan, L.S.; Nott, D.J. Gaussian variational approximation with sparse precision matrices. Stat. Comput. 2018, 28, 259–275. [CrossRef]

17. Lin, W.; Schmidt, M.; Khan, M.E. Handling the Positive-Definite Constraint in the Bayesian Learning Rule. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; Volume 119, pp. 6116–6126.

18. Hinton, G.E.; van Camp, D. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; COLT ’93; Association for Computing Machinery: New York, NY, USA, 1993; pp. 5–13.

19. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [CrossRef]

20. Amari, S.I. Natural Gradient Works Efficiently in Learning. Neural Comput. 1998, 10, 251–276. [CrossRef]

21. Khan, M.E.; Nielsen, D. Fast yet simple natural-gradient descent for variational inference in complex models. In Proceedings of the International Symposium on Information Theory and Its Applications (ISITA), Singapore, 28–31 October 2018; pp. 31–35.

22. Lin, W.; Khan, M.E.; Schmidt, M. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3992–4002.

23. Salimbeni, H.; Eleftheriadis, S.; Hensman, J. Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Lanzarote, Canary Islands, 9–11 April 2018; pp. 689–697.

24. Liu, Q.; Wang, D. Stein variational gradient descent: A general purpose bayesian inference algorithm. arXiv 2016, arXiv:1608.04471.

25. Ba, J.; Erdogdu, M.A.; Ghassemi, M.; Suzuki, T.; Sun, S.; Wu, D.; Zhang, T. Towards Characterizing the High-dimensional Bias of Kernel-based Particle Inference Algorithms. In Proceedings of the 2nd Symposium on Advances in Approximate Bayesian Inference, Vancouver, BC, Canada, 8 December 2019.

26. Tomczak, M.; Swaroop, S.; Turner, R. Efficient Low Rank Gaussian Variational Inference for Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33.

27. Maddox, W.J.; Izmailov, P.; Garipov, T.; Vetrov, D.P.; Wilson, A.G. A simple baseline for bayesian uncertainty in deep learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13153–13164.

28. Evensen, G. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J. Geophys. Res. Oceans 1994, 99, 10143–10162. [CrossRef]


29. Rezende, D.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1530–1538.

30. Chen, R.T.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural ordinary differential equations. In Proceedings of the 32nd International Conference on Neural Information Processing, Montréal, QC, Canada, 3–8 December 2018; pp. 6572–6583.

31. Ingersoll, J.E. Theory of Financial Decision Making; Rowman & Littlefield: Lanham, MD, USA, 1987; Volume 3.

32. Barfoot, T.D.; Forbes, J.R.; Yoon, D.J. Exactly sparse gaussian variational inference with application to derivative-free batch nonlinear state estimation. Int. J. Robot. Res. 2020, 39, 1473–1502. [CrossRef]

33. Korba, A.; Salim, A.; Arbel, M.; Luise, G.; Gretton, A. A Non-Asymptotic Analysis for Stein Variational Gradient Descent. In Proceedings of the 32nd International Conference on Neural Information Processing, Virtual, 6–12 December 2020; Volume 33, pp. 4672–4682.

34. Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.

35. Zaki, N.; Galy-Fajou, T.; Opper, M. Evidence Estimation by Kullback-Leibler Integration for Flow-Based Methods. In Proceedings of the Third Symposium on Advances in Approximate Bayesian Inference, Virtual Event, January–February 2021.

36. Bezanson, J.; Edelman, A.; Karpinski, S.; Shah, V.B. Julia: A fresh approach to numerical computing. SIAM Rev. 2017, 59, 65–98. [CrossRef]

37. Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop, Coursera: Neural Networks for Machine Learning; Technical Report; University of Toronto: Toronto, ON, USA, 2012.

38. Zhang, G.; Li, L.; Nado, Z.; Martens, J.; Sachdeva, S.; Dahl, G.; Shallue, C.; Grosse, R.B. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model. In Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32, pp. 8196–8207.

39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.

40. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/datasets.php (accessed on 28 July 2021).

41. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375.

42. LeCun, Y. The MNIST Database of Handwritten Digits. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 20 July 2021).

43. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 1321–1330.

44. Liu, C.; Zhuo, J.; Cheng, P.; Zhang, R.; Zhu, J. Understanding and accelerating particle-based variational inference. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 4082–4092.

45. Zhu, M.H.; Liu, C.; Zhu, J. Variance Reduction and Quasi-Newton for Particle-Based Variational Inference. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020.

46. Gronwall, T.H. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Ann. Math. 1919, 20, 292–296. [CrossRef]

47. Zhang, F. Matrix Theory: Basic Results and Techniques; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.


Discussions and extensions

This chapter presents both discussions and extensions of the models and ideas presented in Chapters 3, 4, and 5. All figures presented are reproducible by running the examples provided in the GitHub repository https://github.com/theogf/Phd-Thesis. Section 7.1 considers how augmentations can be generalized further and what analysis we need to fully understand the improvement brought by augmentations. Section 7.2 presents new augmented models for regression with heteroscedastic noise. Section 7.3 explores how HMC could be used (or not) with augmented models. Section 7.4 shows how the multi-class model of Chapter 4 can be improved in multiple ways. Section 7.5 presents a way to combine inducing points and sampling using augmentations. Finally, Section 7.6 considers more broadly the limitations of our augmentation approach.

7.1 Further generalizations and understanding

The works presented in this thesis only scratched the surface of how helpful mixtures and representations are.

Moment Generating Functions

We are still exploring ways to identify larger classes of functions identifiable as scale mixtures or hierarchical mixtures. Already mentioned in Chapters 4 and 5, the connection with the Moment Generating Function (MGF) of a distribution is a promising direction. We already identified augmentable functions as being a transformed MGF of the augmented variables in Chapter 5:

$$\varphi(x^2) = \int_0^\infty e^{-x^2\omega} p(\omega)\,d\omega, \ \forall x \in \mathbb{R} \quad\equiv\quad \mathrm{MGF}_{p(\omega)}(x) = \varphi(-\sqrt{x}), \ \forall x \geq 0.$$

However, this is limited to MGFs of continuous variables with a square transformation on the inputs. We can extend the notion of augmentable functions to MGFs of discrete and multivariate distributions, where the domain of $\omega$ is not always $\mathbb{R}^+$. For example, we used the MGF of a Poisson distribution in Chapter 4:

$$\exp\left(\lambda(e^x - 1)\right) = \sum_{n=0}^{\infty} e^{nx}\,\mathrm{Po}(n|\lambda).$$


It is not a scale mixture of Gaussians, but with the right variable transformations, it can still be useful. The MGF of a Poisson is known, but we could also consider arbitrary MGFs since we are able to sample from a distribution given its Laplace transform only [47].

The MGF is also an interesting tool for creating hierarchical models. Since the MGF is of the form $\sum_x e^{tx} p(x)$ or $\int e^{tx} p(x)\,dx$, by setting $t = \log\sigma(f)$, we get scale mixtures of the form $\sum_x \sigma^x(f)\,p(x)$. Thanks to the property that $\sigma^n(f)$ is augmentable for any $n \in \mathbb{R}^+$, we can use Pólya-Gamma variables and obtain a conditionally conjugate model for the latent $f$. Additional examples of such constructions are shown in this chapter in Sections 7.2 and 7.4.
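As a quick sanity check of the Poisson identity above, the following minimal sketch compares a truncated version of the series with the closed form. The function names, the truncation at 200 terms and the test values of $x$ and $\lambda$ are illustrative assumptions, not part of the thesis code.

```python
import math

def poisson_mgf_series(x, lam, n_terms=200):
    """Truncated series sum_n exp(n*x) * Po(n | lam), evaluated in log space."""
    total = 0.0
    for n in range(n_terms):
        # log of exp(n*x) * lam^n * exp(-lam) / n!
        log_term = n * x + n * math.log(lam) - lam - math.lgamma(n + 1)
        total += math.exp(log_term)
    return total

def poisson_mgf_closed_form(x, lam):
    """Closed form exp(lam * (exp(x) - 1))."""
    return math.exp(lam * (math.exp(x) - 1.0))

x, lam = -0.7, 2.5  # arbitrary test values
series = poisson_mgf_series(x, lam)
closed = poisson_mgf_closed_form(x, lam)
print(series, closed)
```

Setting $x = \log\sigma(f)$ in the same routine gives the scale mixture $\sum_n \sigma^n(f)\,\mathrm{Po}(n|\lambda)$ discussed in the text.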

Marginalizing out augmented variables

A potential improvement for augmented models is the identification of marginalizable augmented variables that keep the conditional conjugacy of the model. For example, in the multi-class model from Chapter 4, the augmented variable $\lambda$ can be marginalized out, as shown in Section 7.4. We can reduce the dimensionality of the model and avoid tricky situations like the inner loop updates in Chapter 4. This marginalization step is avoidable by identifying the right MGF from the start. As shown in Section 7.2.2, switching between marginalized and augmented models gives great inference flexibility.

Convergence speed analysis

An unfinished piece of work (despite our attempts) is to establish convergence rates (error as a function of the number of iterations) for the CAVI algorithm and to derive theoretical bounds on the intra-chain correlation and on the ergodicity of the Gibbs sampler. Experimental results indicate that the error on the variational free energy (and variational parameters) decreases as $\|F^* - F_t\| \propto C_0 e^{-ct}$, where $t$ is the number of iterations, but we did not manage to write a formal proof. We show the decay for both the variational free energy and the variational parameters for different examples in Figure 7.1.
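An exponential decay of this kind can be estimated from an error trace with a log-linear least-squares fit. The sketch below is one simple way to do it, under the assumption that all errors are strictly positive; the function name and the synthetic trace are hypothetical and this is not necessarily the fitting procedure used for Figure 7.1.

```python
import math

def fit_exponential_decay(errors):
    """Fit errors[t] ~ C0 * exp(-c * t) by least squares on log(errors).

    Returns (C0, c). Assumes every error is strictly positive.
    """
    n = len(errors)
    ts = list(range(n))
    logs = [math.log(e) for e in errors]
    t_mean = sum(ts) / n
    log_mean = sum(logs) / n
    slope = sum((t - t_mean) * (l - log_mean) for t, l in zip(ts, logs)) / sum(
        (t - t_mean) ** 2 for t in ts
    )
    intercept = log_mean - slope * t_mean
    return math.exp(intercept), -slope

# Synthetic trace with known constants, for illustration only.
trace = [3.0 * math.exp(-0.8 * t) for t in range(10)]
C0, c = fit_exponential_decay(trace)
print(C0, c)
```

With the sign convention of the text, a decaying error gives $c > 0$; a plot that reports the slope of the log-error directly would show the negative of this value.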


[Figure 7.1 plot: grid of convergence curves. Rows: Bernoulli, Student-T, Laplace, Poisson likelihoods. Columns: $|m_t - m^*|$, $|S_t - S^*|$, $|F_t - F^*|$. x-axis: iteration $t$; y-axis in log scale. Fitted exponential coefficients printed in the panels, row by row: c = -2.28, -2.28, -3.62; c = -0.99, -0.99, -1.39; c = -0.53, -0.53, -1.22; c = -0.52, -0.52, -1.24.]

Figure 7.1: Convergence plot of the CAVI updates for a one-dimensional toy example with different likelihoods (y-axis in log scale). The solid blue line shows the empirical error over the number of iterations and the dashed green line shows the fit of the function $C_0 \exp(ct)$. The exponential coefficient is written down explicitly for each likelihood.

7.2 Double bounds for intricate latent GPs

The multi-class model developed in Chapter 4 paves the way to work with multi-latent models and hierarchical augmentations. Based on this idea, we developed another multi-latent model on the heteroscedastic regression likelihood [ ]. It models simultaneously the mean and variance of a regression likelihood with two latent GPs $f$ and $g$. We consider both Gaussian and non-Gaussian likelihoods since we can stack augmentations. We start with the simplest model: the heteroscedastic Gaussian likelihood.

7.2.1 Heteroscedastic Gaussian Likelihood

A crucial model choice is the function mapping $g$ to the likelihood variance $\epsilon^2$. The exponential link, i.e. $\epsilon^2(x) = \exp(g(x))$, is the most popular; however, to be able to apply our augmentations, we use the link $\epsilon^2(x) = (\lambda\sigma(g(x)))^{-1}$. Let us look at the case of the heteroscedastic Gaussian likelihood, defined as:

$$p(y|f, g, \lambda) = \frac{\sqrt{\lambda\sigma(g)}}{\sqrt{2\pi}} \exp\left(-\frac{\lambda\sigma(g)(y-f)^2}{2}\right). \tag{7.1}$$

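The augmentations for this likelihood are straightforward and quite similar to the multi-class ones from Chapter 4.

$$\exp\left(-\frac{\lambda\sigma(g)(y-f)^2}{2}\right) = \exp\left(\frac{\lambda(\sigma(-g)-1)(y-f)^2}{2}\right) = \sum_{n=0}^{\infty} \sigma^n(-g)\,\mathrm{Po}\left(n \,\Big|\, \frac{\lambda(y-f)^2}{2}\right), \tag{7.2}$$

where we used the MGF of the Poisson distribution. Using the Pólya-Gamma augmentation and the additivity property of Pólya-Gamma variables, we get the final augmented likelihood:

$$p(y, n, \omega | f, g, \lambda) = \frac{\sqrt{\lambda}}{2^n\sqrt{\pi}} \exp\left(\frac{1}{2}\left(g\left(\frac{1}{2}-n\right) - g^2\omega\right)\right) \mathrm{PG}\left(\omega \,\Big|\, \frac{1}{2}+n, 0\right) \mathrm{Po}\left(n \,\Big|\, \frac{\lambda(y-f)^2}{2}\right). \tag{7.3}$$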
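The Poisson step in Equation (7.2) can be checked numerically before adding the Pólya-Gamma layer. The following is a minimal sketch under the assumption $y \neq f$ (so that the Poisson rate is strictly positive); the helper names and test values are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_kernel(y, f, g, lam):
    """Left-hand side of (7.2): exp(-lam * sigma(g) * (y - f)^2 / 2)."""
    return math.exp(-lam * sigmoid(g) * (y - f) ** 2 / 2.0)

def poisson_mixture(y, f, g, lam, n_terms=200):
    """Right-hand side of (7.2): sum_n sigma(-g)^n Po(n | lam * (y - f)^2 / 2)."""
    rate = lam * (y - f) ** 2 / 2.0  # assumed > 0, i.e. y != f
    log_s = math.log(sigmoid(-g))
    total = 0.0
    for n in range(n_terms):
        total += math.exp(n * log_s + n * math.log(rate) - rate - math.lgamma(n + 1))
    return total

y, f, g, lam = 1.3, 0.2, 0.5, 2.0  # arbitrary test values
left = gaussian_kernel(y, f, g, lam)
right = poisson_mixture(y, f, g, lam)
print(left, right)
```

The two values agree because $\sum_n (\sigma(-g) r)^n e^{-r}/n! = e^{-r(1-\sigma(-g))} = e^{-r\sigma(g)}$ with $r = \lambda(y-f)^2/2$.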

The interesting part about this augmented likelihood (7.3) is that although it is conditionally conjugate in $g$, $\omega$, and $n$, it is unclear how to infer $f$: it is quadratic in $g$ but not in $f$. It turns out that the Gibbs sampler for this model is very simple: we take the augmented likelihood $p(y, \omega, n | f, g, \lambda)$, marginalize out $\omega$ and $n$ and, as expected, we get the original likelihood (7.1), which is conditionally conjugate with $f$. The conditional $p(f | y, g, \lambda)$ on this likelihood is the collapsed conditional. In a Gibbs sampling scheme, this allows us to perform a collapsed step. We give all the Gibbs sampling steps in Algorithm 2. So far, we have excluded the $\lambda$ parameter from inference. By putting a Gamma prior $\mathrm{Ga}(\lambda | \alpha, \beta)$, where $\alpha$ is the shape and $\beta$ is the rate, the collapsed conditional is available in closed-form:

$$p(\lambda | f, g, y) = \mathrm{Ga}\left(\lambda \,\Big|\, \alpha + \frac{N}{2},\ \beta + \sum_{i=1}^{N} \frac{\sigma(g_i)}{2}(y_i - f_i)^2\right).$$
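The collapsed conditional for $\lambda$ is a plain Gamma draw. The sketch below shows it with the Python standard library; the helper names and toy data are assumptions made for illustration. Note that `random.gammavariate` takes a shape and a scale, so the rate from the formula above is inverted.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lambda_conditional_params(y, f, g, alpha, beta):
    """Shape and rate of Ga(lam | alpha + N/2, beta + sum_i sigma(g_i) (y_i - f_i)^2 / 2)."""
    shape = alpha + len(y) / 2.0
    rate = beta + sum(sigmoid(gi) * (yi - fi) ** 2 / 2.0 for yi, fi, gi in zip(y, f, g))
    return shape, rate

def sample_lambda(y, f, g, alpha, beta, rng):
    shape, rate = lambda_conditional_params(y, f, g, alpha, beta)
    # gammavariate(shape, scale): the scale is the inverse of the rate.
    return rng.gammavariate(shape, 1.0 / rate)

# Toy values, for illustration only.
y = [0.5, -1.0, 2.0, 0.3, 1.1]
f = [0.4, -0.7, 1.5, 0.0, 1.0]
g = [0.2, -0.3, 1.0, 0.0, 0.5]
rng = random.Random(0)
draws = [sample_lambda(y, f, g, alpha=1.0, beta=1.0, rng=rng) for _ in range(5)]
print(draws)
```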

As underlined in Section 2.3.2, the CAVI updates need the model’s full conditionals and are not compatible with collapsed conditionals. To solve this problem, we need to reverse-engineer how CAVI updates are obtained and start with a first bound on the KL divergence:

$$\begin{aligned}
\mathrm{KL}\left(q(f)q(g)\,\|\,p(f,g|y)\right) &\leq \min_{q(g)} -\mathbb{E}_{q(g)}\left[\mathbb{E}_{q(f)}[\log p(y|f,g)]\right] + \mathrm{KL}\left(q(f)q(g)\,\|\,p(f)p(g)\right) - \log p(y) \\
&= \min_{q(g)} -\mathbb{E}_{q(g)}\left[\log p(y|g,\mu_f^*,\Sigma_f^*)\right] + \mathrm{KL}\left(q(g)\,\|\,p(g)\right) + \mathrm{KL}_f^* - \log p(y) = F_1.
\end{aligned}$$

$\log p(y|g,\mu_f^*,\Sigma_f^*)$ and $\mathrm{KL}_f^*$ are expectations computed with the optimal $q^*(f) = \mathcal{N}(f|\mu_f^*,\Sigma_f^*)$. We can now use the augmentations from Equation (7.3) on the expected log-likelihood, where we replaced $(y_i - f_i)^2$ by $(y_i - (\mu_f^*)_i)^2 + (\Sigma_f^*)_{ii}$, and build a second bound.

$$F_1 \leq \min_{q(g)q(\omega,n)} \mathbb{E}_{q(g)q(\omega,n)}\left[\log p(\omega,n,y|g,\mu_f^*,\Sigma_f^*)\right] + \mathrm{KL}\left(q(g)\,\|\,p(g)\right) + \mathrm{KL}_f^* = F_2 \tag{7.4}$$


It is straightforward to find the optimal variational distributions $q^*(g)$ and $q^*(\omega,n)$ minimizing $F_2$, which allows us to use CAVI updates. Then, injecting the optimal distribution $q^*(g)q^*(\omega,n)$ in $F_2$, we can derive the optimal $\mu_f^*$ and $\Sigma_f^*$, obtainable in closed-form. The resulting CAVI updates are given in Algorithm 3. For $\lambda$, we can use the second bound (7.4) and obtain a closed-form maximum-likelihood estimate, given in Algorithm 3.

This double-bound approach is very similar to Lázaro-Gredilla and Titsias [32], although they are using the exponential link and need some extra computations.

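Algorithm 2 Gibbs sampling for the Heteroscedastic Gaussian likelihood

input: $f$, $g$, $\lambda$, $y$, $p(f,g) = \mathcal{N}(f|\mu_f^0, K)\,\mathcal{N}(g|\mu_g^0, K)$, $p(\lambda|\alpha,\beta)$.
for $t$ in $1 : N_{\mathrm{samples}}$ do
    Draw $\lambda \sim p(\lambda|f,g,y) = \mathrm{Ga}\left(\lambda \,|\, \alpha + \frac{N}{2},\ \beta + \sum_{i=1}^{N} \frac{\sigma(g_i)}{2}(y_i - f_i)^2\right)$.
    Draw $n_i \sim p(n_i|f_i,g_i,\lambda) = \mathrm{Po}\left(\frac{\lambda\sigma(-g_i)(y_i - f_i)^2}{2}\right)$
    Draw $\omega_i \sim p(\omega_i|n_i,g_i) = \mathrm{PG}(0.5 + n_i, |g_i|)$
    Draw $g \sim p(g|n,\omega) = \mathcal{N}(\mu_g,\Sigma_g)$
        where $\Sigma_g = \left(K^{-1} + \mathrm{diag}(\omega)\right)^{-1}$ and $\mu_g = \Sigma_g\left(K^{-1}\mu_g^0 + \frac{0.5 - n}{2}\right)$
    Draw $f \sim p(f|g,\lambda) = \mathcal{N}(\mu_f,\Sigma_f)$
        where $\Sigma_f = \left(K^{-1} + \lambda\,\mathrm{diag}(\sigma(g))\right)^{-1}$ and $\mu_f = \Sigma_f\left(K^{-1}\mu_f^0 + \lambda\,\mathrm{diag}(\sigma(g))\,y\right)$
end for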
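The Poisson step of the Gibbs sampler needs only elementwise rates. The following sketch draws the $n_i$ with Knuth's multiplication method, which is adequate for moderate rates; this sampler choice, the helper names and the toy values are assumptions for illustration, not the implementation in AugmentedGPLikelihoods.jl.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_poisson(rate, rng):
    """Knuth's method: count uniforms until their product drops below exp(-rate)."""
    limit = math.exp(-rate)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def poisson_rates(y, f, g, lam):
    """Rates lam * sigma(-g_i) * (y_i - f_i)^2 / 2 of the conditional of n_i."""
    return [lam * sigmoid(-gi) * (yi - fi) ** 2 / 2.0 for yi, fi, gi in zip(y, f, g)]

# Toy values, for illustration only.
y = [0.5, -1.0, 2.0]
f = [0.4, 0.7, -0.5]
g = [0.2, -0.3, 1.0]
rng = random.Random(0)
rates = poisson_rates(y, f, g, lam=2.0)
n = [sample_poisson(r, rng) for r in rates]
print(rates, n)
```

Observations with a small residual $(y_i - f_i)^2$ get a rate close to zero and therefore almost always $n_i = 0$.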

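Algorithm 3 CAVI Updates for the Heteroscedastic Gaussian likelihood

input: $q(f,g) = \mathcal{N}(f|\mu_f,\Sigma_f)\,\mathcal{N}(g|\mu_g,\Sigma_g)$, $p(f,g) = \mathcal{N}(f|\mu_f^0, K)\,\mathcal{N}(g|\mu_g^0, K)$, $y$ and $\lambda$.
while convergence criteria is not met do
    $\psi_i = \tilde{\sigma}(q(g_i))$
    $\lambda = \dfrac{N}{\sum_{i=1}^{N} (1 - \psi_i)\left((y_i - \mu_f^i)^2 + \Sigma_f^{ii}\right)}$
    $\gamma_i = \dfrac{\lambda}{2}\,\psi_i\left((y_i - \mu_f^i)^2 + \Sigma_f^{ii}\right)$
    $c_i = \sqrt{(\mu_g^i)^2 + \Sigma_g^{ii}}$
    $\theta_i = \mathbb{E}_{q(\omega_i|n_i)q(n_i)}[\omega_i] = \dfrac{0.5 + \gamma_i}{2c_i}\tanh\left(\dfrac{c_i}{2}\right)$
    $\Sigma_f = \left(K^{-1} + \lambda\,\mathrm{diag}(1 - \psi)\right)^{-1}$
    $\mu_f = \Sigma_f\left(K^{-1}\mu_f^0 + \lambda\,\mathrm{diag}(1 - \psi)\,y\right)$
    $\Sigma_g = \left(K^{-1} + \mathrm{diag}(\theta)\right)^{-1}$
    $\mu_g = \Sigma_g\left(K^{-1}\mu_g^0 + \dfrac{0.5 - \gamma}{2}\right)$
end while

where $q(n,\omega) = \prod_{i=1}^{N} \mathrm{PG}(\omega_i | 0.5 + n_i, c_i)\,\mathrm{Po}(n_i|\gamma_i)$ and $\tilde{\sigma}(q(g_i)) = \dfrac{e^{-\mu_g^i/2}}{2\cosh\left(\sqrt{(\mu_g^i)^2 + \Sigma_g^{ii}}\,/\,2\right)}$ can be seen as a close approximation to $\mathbb{E}_{q(g_i)}[\sigma(-g_i)]$.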
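The two scalar quantities that drive these updates are cheap to compute. The sketch below evaluates the approximation $\tilde{\sigma}$ and the Pólya-Gamma mean; the $2\cosh$ denominator is an assumption reconstructed from the fact that the expression must reduce to $\sigma(-\mu)$ when the variance is zero, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigma_tilde(mu, var):
    """Approximation of E[sigma(-g)] for g ~ N(mu, var).

    Assumed form: exp(-mu / 2) / (2 * cosh(sqrt(mu^2 + var) / 2)).
    """
    c = math.sqrt(mu * mu + var)
    return math.exp(-mu / 2.0) / (2.0 * math.cosh(c / 2.0))

def polya_gamma_mean(b, c):
    """E[omega] for omega ~ PG(b, c): b / (2c) * tanh(c / 2), with limit b / 4 at c = 0."""
    if abs(c) < 1e-8:
        return b / 4.0
    return b / (2.0 * c) * math.tanh(c / 2.0)

mu, var, gamma = 1.2, 0.3, 0.8  # arbitrary test values
c = math.sqrt(mu * mu + var)
psi = sigma_tilde(mu, var)
theta = polya_gamma_mean(0.5 + gamma, c)
print(psi, theta)
```

With zero variance the approximation is exact, since $e^{-\mu/2} / (e^{\mu/2} + e^{-\mu/2}) = \sigma(-\mu)$.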

A 1-dimensional toy example is shown in Figure 7.2 with the results of the inference algorithms.


[Figure 7.2 plot: four panels. Rows: Variational Inference (top), Gibbs Sampling (bottom). Columns: output space $y|f,g$ (left), latent GPs (right). Legend: $y$; $p(y|f,g)$; $\mathbb{E}_{q(f,g)}[p(y|f,g)]$; $\{p(y|f_s,g_s)\}_{s=1}^{S}$; $\{f_s,g_s\}_{s=1}^{S} \sim p(f,g|y)$; $q(f)q(g)$; $f$; $g$.]

Figure 7.2: Toy example of a heteroscedastic Gaussian regression problem and the resulting inference from Algorithm 2 (Gibbs sampling, bottom plots) and Algorithm 3 (Variational Inference, top plots). The left plots show the output space. The training data $y$ are in orange, the generating likelihood is shown in blue (mean in solid line and one standard deviation in dashed line). The green bands show the predictive distributions with one standard deviation obtained after posterior inference (one band for variational inference and cumulative bands for the sampling approach). The right plots show the true latent functions $f$ and $g$ used to generate $y$ as well as the inferred posteriors: variational on top (mean with one standard deviation) and samples at the bottom.

We can see that on this one-dimensional example, both algorithms, but more particularly Gibbs sampling, manage to recover the original model. For variational inference, the variance on the latent $f$ is almost negligible since all the data variance is absorbed into the likelihood variance term. The samples obtained with Gibbs sampling, without any warmup, fit nicely the true processes of $f$ and $g$.

An implementation as well as detailed derivations are in the AugmentedGPLikelihoods.jl package [15].

7.2.2 Heteroscedastic Non-Gaussian Likelihood

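This method extends to non-Gaussian likelihoods as well. We take the example of the heteroscedastic Student-t likelihood, where we have a local scale with standard deviation $\epsilon(x) = \lambda\sigma(g)$ with $\lambda \in \mathbb{R}^+$. Similar to the heteroscedastic Gaussian likelihood (7.1), we get the likelihood:

$$p(y|f, g, \lambda, \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)\sqrt{\lambda\sigma(g)}}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}} \left(1 + \frac{\lambda\sigma(g)(y-f)^2}{\nu}\right)^{-\frac{\nu+1}{2}} \tag{7.5}$$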


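To simplify the notation, we define the scaled residuals $\Delta = \Delta(f, y, \lambda) = \frac{\lambda(y-f)^2}{\nu}$, the exponent $\alpha = \frac{\nu+1}{2}$ from Equation (7.5), and the normalization constant $Z = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)}$. We can proceed with the first augmentation:

$$\begin{aligned}
(1 + \sigma(g)\Delta)^{-\alpha} &= (1 + \Delta(1 - \sigma(-g)))^{-\alpha} \\
&= (1 + \Delta - \Delta\sigma(-g))^{-\alpha} \\
&= (\Delta\sigma(-g))^{-\alpha}\left(\frac{\sigma(-g)\Delta}{1 + \Delta - \sigma(-g)\Delta}\right)^{\alpha} \\
&= (\Delta\sigma(-g))^{-\alpha}\sum_{k=0}^{\infty} \Delta^k\,\mathrm{NB}(k|\sigma(-g), \alpha),
\end{aligned} \tag{7.6}$$

where we used the MGF of the Negative Binomial distribution.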

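We obtain the same result by performing first the augmentation of the Student-t with a Gamma variable:

$$\begin{aligned}
p(y|f, g, \lambda) &= \int_0^\infty \mathcal{N}\left(y|f, (\lambda\sigma(g)\gamma)^{-1}\right)\mathrm{IG}\left(\gamma \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\right)d\gamma \\
p(y, \gamma|f, g, \lambda) &= \mathcal{N}\left(y|f, (\lambda\sigma(g)\gamma)^{-1}\right)\mathrm{IG}\left(\gamma \,\Big|\, \frac{\nu}{2}, \frac{\nu}{2}\right).
\end{aligned} \tag{7.7}$$

$\mathcal{N}\left(y|f, (\lambda\sigma(g)\gamma)^{-1}\right)$ is the same starting point as Equation (7.1) with an additional scaling $\gamma$. The next augmentation steps are the same as in Equation (7.2) with an augmentation with a Poisson variable. Marginalizing out the Gamma variable $\gamma$ results in a Negative Binomial distribution.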

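Back to Equation (7.6), we rework the likelihood by reorganizing the terms in the augmented likelihood.

$$\begin{aligned}
p(y, k|f, g, \lambda, \nu) &= Z\sqrt{\lambda}\,\sigma^{\frac{1}{2}}(g)\,\sigma(-g)^{-\alpha}\Delta^{-\alpha}\Delta^k \underbrace{C(k, \alpha)\,\sigma^{\alpha}(g)\,\sigma(-g)^k}_{\mathrm{NB}(k|\sigma(-g),\alpha)} \\
&= Z\,C(k, \alpha)\sqrt{\lambda}\,(\sigma(g))^{\frac{1}{2}+\alpha}(\sigma(-g))^{k-\alpha}\Delta^{k-\alpha}
\end{aligned}$$

where $C(k, \alpha) = \frac{\Gamma(\alpha+k)}{k!\,\Gamma(\alpha)}$ is the normalization constant of the negative binomial. We set $Z' = Z\,C(k, \alpha)\sqrt{\lambda}$ as a constant independent of $f$ or $g$. The final step is the Pólya-Gamma augmentation:

$$p(y, k, \omega|f, g, \lambda, \nu) = Z'\Delta^{k-\alpha}\,2^{-\left(\frac{1}{2}+k\right)}\exp\left(\frac{1}{2}\left(\left(\frac{1}{2} + 2\alpha - k\right)g - g^2\omega\right)\right)\mathrm{PG}\left(\omega \,\Big|\, \frac{1}{2}+k, 0\right). \tag{7.8}$$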

As for the heteroscedastic Gaussian likelihood, the augmented likelihood (7.8) is conjugate in $g$ but not in $f$. We can find the collapsed conditional for $f$ in closed-form.

The key to performing inference on this augmented model is to use the right augmented likelihood for each variable. For example, for $f$ and $\lambda$, we only want to use the Inverse Gamma augmentation described in Equation (7.7). For $g$, $\omega$, and $k$ (used as a mixture of inverse Gamma and Poisson) we will use the fully augmented likelihood (7.8). This will give a combination of collapsed conditionals and full conditionals directly usable in a Gibbs sampling scheme. For the CAVI updates, we reuse the double bound idea of Section 7.2.1.

The full derivations, resulting algorithms and implementation can be found in the AugmentedGPLikelihoods.jl package [15].

7.3 Using Hamiltonian Monte Carlo on the augmented model

The Gibbs sampler in the experiments of Chapters 3 and 5 outperforms the state-of-the-art HMC algorithm introduced in Section 2.3.1. A recurrent question I got is: is the performance gain due only to the augmentation or to the Gibbs sampling scheme? To answer this question, we try using the HMC algorithm on augmented models.

Before doing any experiments, let us consider the consequences that the augmented model has on the HMC sampler. First, the augmentation increases the dimensionality of the model. For $N$ observations, we need a number of additional dimensions proportional to $N$ (with a factor that depends on the model); therefore, gradient computations and algorithm tuning should be more expensive. On the other hand, since the likelihood is simplified to a quadratic problem, the computational complexity of each step can decrease. The second issue with using HMC on the augmented model is that the probability density function (pdf) of the prior distribution on augmented variables is not always available in closed-form or not usable at all. For example, one approximates the density of a Pólya-Gamma variable with a truncated alternating series. Truncated series are computationally expensive and can also be biased and unstable. My experience with the Pólya-Gamma variables is that even when using tricks like "logsumexp" to improve numerical stability, the pdf approximation can be negative, breaking the computations. Finally, the critical problem with HMC is that it only works with continuous variables. Some augmentations directly involve discrete variables like the Poisson in the multi-class setting, making them incompatible with a scheme involving only HMC.

We try running HMC and NUTS with a compatible augmentation (augmented variable pdf known in closed-form, no discrete variables). Figure 7.3 shows the auto-correlation plots on a regression problem with a Student-t likelihood with $\nu = 3$ degrees of freedom applied on the Boston housing dataset (506 data points, 13 dimensions) [ ]. We draw one chain of 2000 samples (plus 500 adaptation samples for HMC and NUTS) for both the original and augmented model.

At first glance, HMC applied on the augmented model has a lower auto-correlation. When using NUTS, the gain becomes less clear. Moreover, the algorithm produces antithetic chains, making a proper comparison harder. The Gibbs sampler has the smallest intra-chain correlation, but one could argue that negative correlations are desirable to compute expectations. However, HMC and NUTS turned out to be much slower than the Gibbs sampler: the Gibbs sampler took around 20 seconds to run against an average of 12 minutes for HMC and NUTS. This difference is due to HMC (and NUTS) needing to compute many gradients for every sample. Perhaps surprisingly, there was no significant time difference between the augmented and original models for HMC and NUTS.

Note that HMC is already, in a sense, making an augmentation of its own with the momentum variables, and it could be added to the list of successful types of augmentations improving inference.

We should only consider these results preliminary since we used a simple likelihood, and the dataset is relatively small and easy.
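For readers who want to reproduce this kind of diagnostic, the following sketch computes an empirical auto-correlation function for one dimension of a chain. The estimator (biased, normalized by the chain length) and the two synthetic AR(1) chains are illustrative assumptions; they are not the chains from the experiment above.

```python
import random

def autocorrelation(chain, max_lag):
    """Empirical auto-correlation of a scalar chain for lags 0..max_lag."""
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    variance = sum(c * c for c in centered) / n
    acf = []
    for lag in range(max_lag + 1):
        cov = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        acf.append(cov / variance)
    return acf

def ar1_chain(coefficient, length, rng):
    """Synthetic chain x_t = coefficient * x_{t-1} + noise, used as a stand-in for MCMC output."""
    chain = [0.0]
    for _ in range(length - 1):
        chain.append(coefficient * chain[-1] + rng.gauss(0.0, 1.0))
    return chain

rng = random.Random(0)
sticky = ar1_chain(0.8, 20000, rng)        # positively correlated samples
antithetic = ar1_chain(-0.5, 20000, rng)   # negatively correlated samples
acf_sticky = autocorrelation(sticky, 5)
acf_antithetic = autocorrelation(antithetic, 5)
print(acf_sticky[:3], acf_antithetic[:3])
```

A negative lag-1 value, as in the second chain, is what the text refers to as an antithetic chain.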

[Figure 7.3 plot: autocorrelation (y-axis) against lag (x-axis) for Gibbs Sampling, HMC (aug. model), HMC, NUTS (aug. model) and NUTS.]

Figure 7.3: Auto-correlation function of the Gibbs sampler, HMC and NUTS on the augmented model, and HMC and NUTS on the original model. The mean is shown with one standard deviation over all dimensions.


7.4 Improvements on the Multi-Class Classification

We recently figured out additional ways to improve the multi-class classification model and the associated inference. We present them here in three different sections.

7.4.1 Marginalizing out variables

In the augmentation derived in Chapter 4, we add $2K+1$ new variables per observation: $\{n_i\}_{i=1}^{K}$, $\{\omega_i\}_{i=1}^{K}$ and $\lambda$. However, we can reduce this number to $2K$ and avoid unnecessary inner loops by marginalizing out $\lambda$. When deriving the augmentations, one ends up with the following augmented likelihood:

$$p(y = k, \{n_j\}_{j=1}^{K}, \lambda \,|\, \{f_j\}_{j=1}^{K}) = \sigma(f_k)\prod_{j=1}^{K} \sigma(-f_j)^{n_j}\,\mathrm{Po}(n_j|\lambda), \tag{7.9}$$

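where we omitted the improper prior $1_{[0,\infty)}$ on $\lambda$. We can marginalize out $\lambda$:

$$\begin{aligned}
\int_0^\infty \prod_{j=1}^{K} \sigma(-f_j)^{n_j}\,\mathrm{Po}(n_j|\lambda)\,d\lambda &= \frac{1}{\prod_{j=1}^{K} n_j!}\prod_{j=1}^{K} \sigma(-f_j)^{n_j}\int_0^\infty \lambda^{\sum_{j=1}^{K} n_j} e^{-K\lambda}\,d\lambda \\
&= \frac{K^{-\sum_{j=1}^{K} n_j}}{\prod_{j=1}^{K} n_j!}\prod_{j=1}^{K} \sigma(-f_j)^{n_j}\int_0^\infty (K\lambda)^{\sum_{j=1}^{K} n_j} e^{-K\lambda}\,d\lambda \\
&\propto \prod_{j=1}^{K} \sigma(-f_j)^{n_j}\,\Gamma\Big(1 + \sum_{j=1}^{K} n_j\Big)\prod_{j=1}^{K}\left(\frac{1}{K}\right)^{n_j}\frac{1}{n_j!}.
\end{aligned} \tag{7.10}$$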

This is proportional to a Negative Multinomial $\mathrm{NM}(x_0, p)$ defined by:

$$\mathrm{NM}(x|x_0, p) = \Gamma\Big(\sum_{j=0}^{K} x_j\Big)\frac{p_0^{x_0}}{\Gamma(x_0)}\prod_{j=1}^{K}\frac{p_j^{x_j}}{x_j!}$$

with parameters $x_0 = 1$, $p = \left\{\frac{\sigma(-f_j)}{K}\right\}_{j=1}^{K}$, and where $p_0 = 1 - \sum_{j=1}^{K} p_j$. Note that the normalization term $p_0^{x_0}$ is missing in Equation (7.10). However, we do not add it, as it would render the likelihood unusable. We keep the prior unnormalized, but this does not influence the inference, as in Chapter 4, since all full conditionals are available in closed-form and normalized.
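The integral over $\lambda$ in Equation (7.10) can be verified numerically. In the sketch below the factors $\sigma(-f_j)^{n_j}$ are left out because they do not depend on $\lambda$; the trapezoidal rule, the integration bound and the test counts are assumptions chosen for illustration.

```python
import math

def poisson_product(lam, counts):
    """prod_j Po(n_j | lam) for a shared rate lam."""
    if lam == 0.0:
        return 1.0 if sum(counts) == 0 else 0.0
    return math.exp(sum(n * math.log(lam) - lam - math.lgamma(n + 1) for n in counts))

def marginal_numeric(counts, upper=60.0, steps=60000):
    """Trapezoidal approximation of int_0^inf prod_j Po(n_j | lam) d lam."""
    h = upper / steps
    total = 0.5 * (poisson_product(0.0, counts) + poisson_product(upper, counts))
    total += sum(poisson_product(i * h, counts) for i in range(1, steps))
    return total * h

def marginal_closed_form(counts):
    """Gamma(1 + sum n_j) / (K^(1 + sum n_j) * prod_j n_j!)."""
    K, s = len(counts), sum(counts)
    log_value = math.lgamma(s + 1) - (s + 1) * math.log(K) - sum(math.lgamma(n + 1) for n in counts)
    return math.exp(log_value)

counts = [2, 0, 3]  # arbitrary test counts for K = 3 classes
numeric = marginal_numeric(counts)
closed = marginal_closed_form(counts)
print(numeric, closed)
```

The closed form carries an extra factor $1/K$ compared with the last line of (7.10), which is why that line holds up to a constant.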

These derivations could have been avoided by noticing that the MGF of a negative multinomial distribution is given by:

$$\mathrm{MGF}_{\mathrm{NM}(x_0,p)}(t) = \left(\frac{p_0}{1 - \sum_{j=1}^{K} p_j e^{t_j}}\right)^{x_0}.$$

Both the Gibbs sampling and CAVI updates based on this marginalization are described in Algorithms 4 and 5.

7.4.2 A new model for the multi-class classification

In Chapter 4, two concerns can be raised. First, the parametrization of a categorical distribution with $K$ categories requires only $K-1$ independent parameters due to the constraint $\sum_{j=1}^{K} p_j = 1$. However, in the original model, which we will call over-parametrized, we consider $K$ independent parameters. Second, the augmented variable $\lambda$ has the improper prior $p(\lambda) = 1_{[0,\infty)}$, which is a proper measure but is not normalizable. It is not an important concern since the posterior is normalizable despite the improper prior. Nevertheless, one might argue that improper priors should be avoided, as they do not allow model comparison.

On a side note, the fact that augmentations with improper priors still lead to valid inference is a good indication that scale mixtures for augmentation can be extended to non-normalizable measures. These two issues seem connected, but we do not have any proof for it.

We propose an alternative parametrization with $K-1$ latent GPs. The likelihood stays the same but with one latent being fixed:

$$p(y = k \,|\, \{f_j\}_{j=1}^{K-1}) = \begin{cases} \dfrac{\sigma(f_k)}{D + \sum_{j=1}^{K-1}\sigma(f_j)}, & \text{if } 1 \leq k \leq K-1, \\[2ex] \dfrac{D}{D + \sum_{j=1}^{K-1}\sigma(f_j)}, & \text{if } k = K, \end{cases} \tag{7.11}$$

where $D = \sigma(f_K) \in [0, 1]$. We call this version of the likelihood bijective since the dimensionality of the simplex output is the same as the inputs.

This likelihood comes with different properties. Unlike the softmax link, the logistic-softmax link is not translation invariant¹. We cannot freely exchange classes, and the "fixed" class has a different behavior than the rest. For example, since we fix $f_K$, the probability for classes other than $K$ will be upper bounded by $\frac{1}{D+1}$. Taking $D = 0.5$ ($f_K = 0$) leads to a maximum probability of 1 for the class $K$ and $\frac{2}{3}$ for all other classes. On the other hand, if $D = 0$, the probability of the class $K$ will always be 0. The bijective likelihood can still be practical if we do not care about one of the classes. Additionally, the scaled model presented in the next Section 7.4.3 can also help with the imbalance between classes.
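The asymmetry between the fixed class and the others is easy to see numerically. The sketch below evaluates the bijective likelihood for $K - 1$ free latents and a fixed contribution $D$; the function names, the assumption that the fixed class is listed last, and the extreme latent values are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bijective_probabilities(latents, D=0.5):
    """Class probabilities for K - 1 free latents; the fixed class K comes last."""
    scores = [sigmoid(f) for f in latents]
    normalizer = D + sum(scores)
    return [s / normalizer for s in scores] + [D / normalizer]

# K = 3 classes: one free latent pushed very high, the other very low.
favoured_first = bijective_probabilities([50.0, -50.0], D=0.5)
# Both free latents pushed very low: the fixed class takes almost all the mass.
favoured_fixed = bijective_probabilities([-50.0, -50.0], D=0.5)
print(favoured_first, favoured_fixed)
```

With $D = 0.5$, no free class can exceed $1 / (D + 1) = 2/3$, while the fixed class can get arbitrarily close to 1.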

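Starting from the likelihood in Equation (7.11), the first augmentation that led to an improper prior in the over-parametrized model of Chapter 4:

$$\frac{1}{\sum_{j=1}^{K}\sigma(f_j)} = \int_0^\infty e^{-\lambda\sum_{j=1}^{K}\sigma(f_j)}\,d\lambda$$

is replaced by the known MGF of a Gamma distribution with the following mixture:

$$\frac{1}{D + \sum_{j=1}^{K-1}\sigma(f_j)} = \frac{1}{D}\,\frac{1}{1 + \frac{1}{D}\sum_{j=1}^{K-1}\sigma(f_j)} = \frac{1}{D}\int_0^\infty e^{-\lambda\sum_{j=1}^{K-1}\sigma(f_j)}\,\mathrm{Ga}\left(\lambda \,\Big|\, 1, \frac{1}{D}\right)d\lambda,$$

which is true for $D > 0$ (the identity holds when the second Gamma parameter $\frac{1}{D}$ is read as a scale).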
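The Gamma mixture above can be checked with a simple quadrature. The sketch assumes the scale reading of the second Gamma parameter, so that the density is $D e^{-D\lambda}$; the trapezoidal rule, the integration bound and the test latents are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gamma_mixture(score_sum, D, upper=60.0, steps=60000):
    """(1/D) * int_0^inf exp(-lam * score_sum) * Ga(lam | shape=1, scale=1/D) d lam."""
    h = upper / steps

    def integrand(lam):
        # Ga(lam | shape 1, scale 1/D) has density D * exp(-D * lam).
        return math.exp(-lam * score_sum) * D * math.exp(-D * lam)

    total = 0.5 * (integrand(0.0) + integrand(upper))
    total += sum(integrand(i * h) for i in range(1, steps))
    return total * h / D

latents = [0.3, -1.2]  # K - 1 = 2 free latents, arbitrary values
D = 0.5
score_sum = sum(sigmoid(f) for f in latents)
numeric = gamma_mixture(score_sum, D)
closed = 1.0 / (D + score_sum)
print(numeric, closed)
```

Because the Gamma prior is proper for any $D > 0$, this replaces the improper flat prior on $\lambda$ used in the over-parametrized model.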

The next augmentation steps are the same for the bijective and over-parametrized models: we use the MGF of the Poisson distribution and finally the Pólya-Gamma augmentation. The resulting updates are given in Algorithms 4 and 5. We show 1-dimensional examples with 3 classes, with and without the bijection, in Figures 7.4 and 7.5.

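Algorithm 4 Gibbs sampling updates: $K$/$K-1$ latent GPs for $K$ classes

input: $F = \{f_k\}_{k=1}^{K/K-1}$, $p(F) = \prod_{k=1}^{K/K-1} p(f_k|\mu_0, K_X)$, $Y = \{y_i\}_{i=1}^{N}$ (one-hot encoded)
for $t$ in $1 : \#\,\mathrm{samples}$ do
    Draw $n^i \sim p(n^i|F) = \mathrm{NM}(1, p^i)$ where $p_k^i = \frac{\sigma(-f_k^i)}{K}$ (over-parametrized) / $\frac{\sigma(-f_k^i)}{D+K-1}$ (bijective)
    Draw $\omega_k^i \sim p(\omega_k^i|f_k^i, n_k^i, y_k^i) = \mathrm{PG}(y_k^i + n_k^i, |f_k^i|)$
    Draw $f_k \sim p(f_k|\omega_k, n_k, Y) = \mathcal{N}(m_k, S_k)$
        where $S_k = \left(K_X^{-1} + \mathrm{diag}(\omega_k)\right)^{-1}$ and $m_k = S_k\left(K_X^{-1}\mu_0 + \frac{y_k - n_k}{2}\right)$
end for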

¹ There is no function $f(\Delta)$ such that $\sigma(x + \Delta) = f(\Delta)\sigma(x)$ for all $x$.


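Algorithm 5: CAVI updates with $K$ / $K-1$ latent GPs for $K$ classes

input: $q(F) = \prod_{k=1}^{K/K-1} q(f_k \mid \mu_k, \Sigma_k)$, $p(F) = \prod_{k=1}^{K/K-1} p(f_k \mid \mu_0, K_X)$, $Y = \{y_i\}_{i=1}^{N}$ (one-hot encoded)
while the convergence criterion is not met do
    $c^i_k = \sqrt{(\mu^i_k)^2 + \Sigma^{ii}_k}$
    $p^i_k = \tilde{\sigma}(q(f^i_k))/K$ (with $K$ latent GPs) or $p^i_k = \tilde{\sigma}(q(f^i_k))/(D + K - 1)$ (with $K-1$ latent GPs)
    $\gamma^i_k = \mathbb{E}_{q(n^i)}\left[n^i_k\right] = \frac{p^i_k}{1 - \sum_{j=1}^{K/K-1} p^i_j}$
    $\theta^i_k = \mathbb{E}_{q(\omega^i_k)}\left[\omega^i_k\right] = \frac{y^i_k + \gamma^i_k}{2 c^i_k} \tanh\left(\frac{c^i_k}{2}\right)$
    $\Sigma_k = \left(K_X^{-1} + \mathrm{diag}(\theta_k)\right)^{-1}$
    $\mu_k = \Sigma_k \left(K_X^{-1} \mu_0 + \frac{y_k - \gamma_k}{2}\right)$
end while

where $q(N, \Omega) = \prod_{i=1}^{N} \mathrm{PG}(\omega^i \mid y^i + n^i, c^i)\, \mathrm{NM}(n^i \mid 1, p^i)$ and $\tilde{\sigma}(q(f^i_k)) = \frac{e^{-\mu^i_k/2}}{2\cosh\left(\sqrt{(\mu^i_k)^2 + \Sigma^{ii}_k}\,/\,2\right)}$ is an approximation to $\sigma(-f^i_k)$.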
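The updates of Algorithm 5 can be sketched in NumPy as follows. This is a minimal illustration rather than the reference implementation: the function name `cavi_step`, the array layout, the toy kernel and the dense matrix inversions are assumptions made for readability, and σ̃ is taken to be exp(−µ/2) / (2 cosh(c/2)). The argument `denom` stands for K with the over-parametrized link and for D + K − 1 with the bijective one.

```python
import numpy as np


def cavi_step(mu, Sigma, Y, K_X, mu0, denom):
    """One sweep of the CAVI updates.

    mu:    (L, N) variational means, one row per latent GP
    Sigma: (L, N, N) variational covariances
    Y:     (L, N) one-hot labels restricted to the L latent GPs
    denom: K (over-parametrized) or D + K - 1 (bijective); assumed given
    """
    L, N = mu.shape
    diag = np.stack([np.diag(Sigma[k]) for k in range(L)])
    c = np.sqrt(mu**2 + diag)
    sig_tilde = np.exp(-mu / 2) / (2 * np.cosh(c / 2))  # approximates sigma(-f)
    p = sig_tilde / denom
    gamma = p / (1 - p.sum(axis=0, keepdims=True))      # E[n] under q(n)
    theta = (Y + gamma) / (2 * c) * np.tanh(c / 2)      # E[omega] under q(omega)
    K_inv = np.linalg.inv(K_X)
    new_mu = np.empty_like(mu)
    new_Sigma = np.empty_like(Sigma)
    for k in range(L):
        new_Sigma[k] = np.linalg.inv(K_inv + np.diag(theta[k]))
        new_mu[k] = new_Sigma[k] @ (K_inv @ mu0 + (Y[k] - gamma[k]) / 2)
    return new_mu, new_Sigma


# Toy usage: 3 classes, over-parametrized link (3 latent GPs), 6 inputs.
X = np.linspace(-3, 3, 6)
K_X = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-6 * np.eye(6)
labels = np.array([0, 0, 1, 1, 2, 2])
Y = np.eye(3)[labels].T
mu = np.zeros((3, 6))
Sigma = np.stack([K_X] * 3)
for _ in range(20):
    mu, Sigma = cavi_step(mu, Sigma, Y, K_X, np.zeros(6), denom=3)
```

A practical implementation would use Cholesky factorizations instead of explicit inverses and would monitor the ELBO to decide when to stop; the fixed number of sweeps here is only for illustration.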

[Figure 7.4 (plot not reproduced). Rows: Variational Inference (top), Gibbs Sampling (bottom). Columns: $y \mid \{f_j\}$ (left), Latent GPs (right). Legend: $y$; $p(y = k \mid \{f_j\})$; $\mathbb{E}_{q(f_j)}[p(y = k \mid \{f_j\})]$; $\{f_j\}$; $\{q(f_j)\}$; $\{p(y = k \mid \{f_j\}^s)\}_{s=1}^{S}$; $\{\{f_j\}^s\}_{s=1}^{S}$.]

Figure 7.4: Illustration of Algorithms 4 and 5 with the bijective link introduced in Section 7.4.2 and the marginalization of Section 7.4.1. Each color represents a class, and we compare the true process to the inferred one for both Gibbs sampling and variational inference. The solid lines represent the true probabilities and latent GPs. The plots on top show the variational inference results, with the expected predictive probability on the left and the variational posterior on the right. The plots at the bottom show the probabilities and latent GPs obtained via Gibbs sampling.


[Figure 7.5 (plot not reproduced). Rows: Variational Inference (top), Gibbs Sampling (bottom). Columns: $y \mid \{f_j\}$ (left), Latent GPs (right). Legend: $y$; $p(y = k \mid \{f_j\})$; $\mathbb{E}_{q(f_j)}[p(y = k \mid \{f_j\})]$; $\{f_j\}$; $\{q(f_j)\}$; $\{p(y = k \mid \{f_j\}^s)\}_{s=1}^{S}$; $\{\{f_j\}^s\}_{s=1}^{S}$.]

Figure 7.5: Illustration of Algorithms 4 and 5 with the over-parametrized link and the marginalization of Section 7.4.1. Each color represents a class, and we compare the true process to the inferred one for both Gibbs sampling and variational inference. The solid lines represent the true probabilities and latent GPs. The plots on top show the variational inference results, with the expected predictive probability on the left and the variational posterior on the right. The plots at the bottom show the probabilities and latent GPs obtained via Gibbs sampling.

Both the bijective and over-parametrized links fit this one-dimensional example correctly. The over-parametrized link in Figure 7.5 does not correctly recover the latent GP that is fixed to 0, but it still returns good predictive distributions.

When repeatedly running these examples, we observe that the predictive probabilities for the bijective link are consistently more accurate, but the predictive log-likelihood for the correct class is higher with the over-parametrized link. Confirming this trend would require further experiments on real datasets and with a higher number of classes.

7.4.3 Scaling the logistic-softmax link

The logistic-softmax link has issues with the predictive probabilities, in particular with many classes. Because of the boundedness of the logistic function, the logistic-softmax link needs large values of $f$ to reach predictive probabilities close to 1. Even when the model should be very confident about a prediction and the latent GPs are correctly inferred, the predictive probability for the correct class will be around $(1 - \epsilon)/((K - 1)\epsilon + 1 - \epsilon)$, where $\epsilon$ is the minimum value taken by $\sigma(f)$. With a prior centered at 0 and a reasonable kernel variance, $f$ cannot take large values. For example, taking 10 classes, if we assume $f = 4$ for the correct class and $-4$ for the others, $\epsilon \approx 0.018$, which gives a probability of 0.858 with the logistic-softmax link against 0.997 for the softmax link.

This can be solved by using a scaled logistic function. We add hyperparameters $\{\theta_i\}_{i=1}^{K}$ such that the likelihood becomes

$$p(y = k \mid \{f_j\}_{j=1}^{K}, \theta) = \frac{\theta_k \sigma(f_k)}{\sum_{j=1}^{K} \theta_j \sigma(f_j)}.$$
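The 10-class example and the effect of the scaling can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names and the value θ = 10 for the correct class are arbitrary choices, not values from the experiments.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_softmax(f, theta=None):
    # p(y = k) = theta_k * sigma(f_k) / sum_j theta_j * sigma(f_j).
    # With theta=None all scales are 1, i.e. the unscaled link.
    if theta is None:
        theta = [1.0] * len(f)
    weights = [t * sigmoid(v) for t, v in zip(theta, f)]
    total = sum(weights)
    return [w / total for w in weights]


def softmax(f):
    m = max(f)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in f]
    total = sum(exps)
    return [e / total for e in exps]


# 10 classes: f = 4 for the correct class, -4 for the nine others.
f = [4.0] + [-4.0] * 9
p_ls = logistic_softmax(f)[0]
p_sm = softmax(f)[0]
# Hypothetical scaling: a larger theta for the correct class.
p_scaled = logistic_softmax(f, theta=[10.0] + [1.0] * 9)[0]
print(p_ls, p_sm, p_scaled)
```

With the same latent values, increasing the scale of the correct class moves the logistic-softmax probability closer to 1, which is the behavior the scaled link is designed to provide.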


The $\theta$ parameters can be optimized using the ELBO together with the other hyperparameters. They can also provide information about each class, a high $\theta_j$ meaning that the $j$-th class has zones of very high confidence. With the likelihood augmented with the variable $\lambda$, the collapsed conditional and the maximum-likelihood optimum of $\theta$ are available in closed form. The maximum-likelihood optimizer is given by:

$$\theta_k^* = \frac{\sum_{i=1}^{N} \delta(y_i, k)}{\sum_{i=1}^{N} \mathbb{E}_{q(\lambda_i)}[\lambda_i] \left(1 - \tilde{\sigma}(q(f^i_k))\right)},$$

where $\delta(x, y)$ is the Kronecker delta function, equal to 1 if $x = y$ and 0 otherwise, and where $\tilde{\sigma}(q(f^i_k))$ is defined as in Algorithm 5. We used the model definition where $\lambda$ is not marginalized out.

By putting a prior $\mathrm{Ga}(\theta_k \mid \alpha, \beta)$, the collapsed conditional of each $\theta_k$ is given by:

$$p(\theta_k \mid f_k, \lambda) = \mathrm{Ga}\left(\theta_k \,\middle|\, \alpha + \sum_{i=1}^{N} \delta(y_i, k),\; \beta + \sum_{i=1}^{N} \lambda_i \sigma(f^i_k)\right)$$
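Sampling from this collapsed conditional is a one-liner once the counts and the weighted sums are available. The sketch below is illustrative only: it assumes that the second argument of the Gamma distribution is a rate (NumPy expects a scale, hence the inversion), and the helper name `sample_theta` as well as the toy inputs are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sample_theta(labels, lam, F, alpha, beta, rng):
    """Draw each theta_k from its Gamma conditional.

    labels: (N,) integer class labels in {0, ..., K-1}
    lam:    (N,) current values of the augmented variables lambda_i
    F:      (K, N) current latent values f_k^i
    """
    K = F.shape[0]
    counts = np.bincount(labels, minlength=K)              # sum_i delta(y_i, k)
    rate = beta + (lam[None, :] * sigmoid(F)).sum(axis=1)  # beta + sum_i lam_i * sigma(f_k^i)
    # NumPy parameterizes the Gamma distribution by shape and scale = 1 / rate.
    return rng.gamma(shape=alpha + counts, scale=1.0 / rate)


rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 2, 2, 2])
F = rng.normal(size=(3, 6))
lam = rng.gamma(shape=2.0, scale=1.0, size=6)
theta = sample_theta(labels, lam, F, alpha=1.0, beta=1.0, rng=rng)
print(theta)
```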

A Julia implementation as well as detailed derivations can be found in the AugmentedGPLikelihoods.jl package [15].

7.5 Sampling from a sparse augmented model

Another work in progress concerns the sampling of sparse GP models. Sampling from the augmented model proves to be very effective (see Chapter 5) while still producing samples from the posterior $p(f \mid y)$ of the original model. Unfortunately, this property does not transfer when using sparse GPs (for a reminder on sparse GPs, see Section 2.2.3) and the scalability is limited. Simply adding inducing point locations with realizations $u$ leads to a Gibbs sampling algorithm whose computational complexity per step does not help with scalability. To solve this problem, we propose to mix the Gibbs sampling approach we presented in Chapter 5 with variational inference.

We build on the work of Hensman et al. [22]. They make Titsias' assumption [53], i.e. setting the variational distribution as $q(u, f) = q(u)\, p(f \mid u)$. Since they also assume a fully factorizable likelihood $p(y \mid f) = \prod_i p(y_i \mid f_i)$, only the marginals $q(f_i)$ are required and the computational complexity of the bound decreases to $O(NM^2)$. Hensman et al. [22] show that the optimal variational distribution of the inducing variables $u$ minimizing $\mathrm{KL}\left(q(u, f) \,\|\, p(u)\, p(f \mid u)\, p(y \mid f)\right)$ for a factorizable likelihood $p(y \mid f) = \prod_i p(y_i \mid f_i)$ is given by:

$$\log q^*(u) = \sum_i \mathbb{E}_{p(f_i \mid u)}[\log p(y_i \mid f_i)] + \log p(u) + C, \quad (7.12)$$

where $C$ is an intractable constant. $q^*(u)$ does not have a specific form in the general case, but we can sample from it by using HMC and evaluating the integrals $\mathbb{E}_{p(f_i \mid u)}[\log p(y_i \mid f_i)]$ numerically² as in [22].

² With quadrature in low dimensions.

We propose instead to derive a variational Gibbs sampling algorithm to draw samples from the variational distribution minimizing the Rényi divergence [57], defined as

$$D_\alpha(p, q) = \frac{1}{\alpha(\alpha - 1)} \log \int \left(\alpha p(x) + (1 - \alpha) q(x) - p^{\alpha}(x)\, q^{1-\alpha}(x)\right) dx, \quad \alpha \in \mathbb{R}^+. \quad (7.13)$$

The Rényi divergence converges to the forward KL divergence $\mathrm{KL}(p \,\|\, q)$ for $\alpha = 1$ and to the reverse KL divergence $\mathrm{KL}(q \,\|\, p)$ for $\alpha = 0$ [ ]. We define our variational distribution as $q(u, f, \Omega) = q(u, \Omega) \prod_i p(f_i \mid u)$, and aim at minimizing $D_\alpha(p(u, f, \Omega \mid y), q(u, f, \Omega))$. Note that we do not assume any independence between $u$ and $\Omega$, only that every $f_i$ is conditionally independent given $u$. There is no parametric closed form for the optimal distribution $q^*(u, \Omega)$ minimizing the divergence in Equation (7.13), hence we take the approach of Hensman et al. [22] and sample from it instead. We draw $u$ and $\Omega$ with a blocked Gibbs sampler, by sampling from the optimal variational distributions minimizing the conditional Rényi divergences:

$$\Omega^i \sim q^*(\Omega) = \arg\min_q D_\alpha\left(p(\Omega \mid u^{i-1}, f^{i-1}, y),\, q(\Omega)\right) \quad (7.14)$$

$$u^i, f^i \sim q^*(u, f) = \arg\min_q D_\alpha\left(p(u, f \mid \Omega^i, y),\, q(u, f)\right) = \arg\min_q D_\alpha\left(p(u)\, p(f \mid u)\, p(f \mid \Omega^i, y),\; q(u) \prod_i p(f_i \mid u)\right). \quad (7.15)$$

For all $\alpha$, the minimizer $q^*(\Omega)$ is $p(\Omega \mid u^{i-1}, f^{i-1}, y)$, setting the conditional divergence to 0. With the approach from Chapter 5, we know $p(\Omega \mid u, f, y)$ (which can be simplified to $p(\Omega \mid f, y)$) in closed form and can sample from it with linear complexity with respect to the number of data points.

Bui et al. [8] solved the optimization problem of Equation (7.15) for Gaussian likelihoods with the Power-EP algorithm. Since $p(f \mid \Omega, y)$ is conjugate in $f$, the optimal $q^*(u)$ is a multivariate normal distribution with the mean and variance known in closed form for all $\alpha \in \mathbb{R}^+$. Each sampling step for $u$ and $f$ only has complexity $O(M^2 N)$. Like in the Power-EP setting, $\alpha = 0$ corresponds to the variational approach of Titsias [53], while $\alpha = 1$ corresponds to the Fully Independent Training Conditional (FITC) approach of Snelson and Ghahramani [51], as shown in Bui et al. [8].

The only parameters left are the hyperparameters $\theta$, omitted in the previous equations, which can represent a real challenge. For $\alpha = 0$, we could sample from $q^*(\theta)$ with the HMC algorithm in a separate Gibbs sampling step. For other $\alpha$, we could optimize $q(\theta)$ with variational inference methods [ ] and hot-start with the previous distribution. The complete variational Gibbs sampler is described in Algorithm 6.

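Algorithm 6: Variational Gibbs Sampler for Sparse GPs

input: $y$, $u^0 \sim p(u)$, $f^0 \sim p(f \mid u^0)$, $\theta^0 \sim p(\theta)$
for $i$ in 1 : # samples do
    Draw $\Omega^i \sim p(\Omega \mid f^{i-1}, \theta^{i-1}, y)$ (in closed form)
    Draw $u^i, f^i \sim q^*(u, f) = \arg\min_q D_\alpha\left(p(u, f \mid \Omega^i, \theta^{i-1}, y),\, q(u, f)\right)$ (in closed form)
    Draw $\theta^i \sim q^*(\theta) = \arg\min_q D_\alpha\left(p(\theta \mid u^i, f^i, \Omega^i, y),\, q(\theta)\right)$ (HMC or optimization)
end for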

Our approach completely removes the need to compute expectations over $f$. It opens up more possibilities for more complex likelihoods, like the multi-class or heteroscedastic ones, where computing expectations numerically, as in Equation (7.12), is a limitation. For medium-sized datasets, this outperforms the CAVI algorithm as it has the same convergence speed but does not suffer from the mean-field assumption on the variational parameters. We show preliminary results in Figure 7.6 for a binary classification problem on the Magic Telescope dataset (10 dimensions, 19020 data points) [5]. The experiment is run with 10-fold cross-validation, we use $M = 50$ inducing points selected via the k-means++ algorithm [2], and we keep the hyperparameters fixed. We compare our approach (VI-Gibbs) with $\alpha = 0$ against the HMC³ variational sampling method of Hensman et al. [22] mentioned earlier (VI-HMC), a standard VI method optimized with an L-BFGS optimizer (Std. VI) and the augmented VI approach from Chapter 3 with CAVI updates (Aug. VI). We show the classification error and test negative log-likelihood over time in Figure 7.6.

³ HMC is run with a fixed step size of 0.1 and with 10 leapfrog steps.


[Figure 7.6 (plot not reproduced). Two panels with x-axis Time [s] on a log scale from $10^{-1}$ to $10^{3}$: Avg. Predictive Neg. Log-Likelihood (log scale) and Classification Error. Curves: Aug. VI, Std. VI, VI-HMC, VI-Gibbs.]

Figure 7.6: Negative test log-likelihood and classification test error over time on the Magic Telescope dataset. The mean with one standard deviation over 10 runs is shown for each algorithm.

These are first results, and there is still work to do on optimizing the implementation, but some first impressions can already be drawn. In terms of iterations, VI-Gibbs is just as fast as the CAVI updates but seems to reach a slightly better optimum. It also completely outperforms the methods applied to the original model.

These preliminary graphs look very promising, but adding hyperparameter sampling might slow down the process. We also need to compare results with different likelihoods and different values of $\alpha$.

7.6 Limitations

Unfortunately, augmentations are not a silver bullet for approximate Bayesian inference.

Augmentable functions

The largest issue is naturally the limited domain of application. Only a constrained set of functions can be augmented. The idea of generalization using MGFs, as mentioned in Section 7.1, is promising but limited nonetheless. Even when augmentable functions exist, identifying them in a given model can be tedious and may require lengthy derivations. We often need to rearrange terms and use mathematical identities before applying procedures like the ones described in this thesis. This is accessible to someone with expertise, but automating the derivation process is complicated. Current progress in symbolic programming could eventually help in this direction: we could automate this process by having a lookup table of augmentable functions and manipulating terms symbolically.

Mean-field approximation in VI

Another issue is that the variational distribution $q(f, \Omega)$ (or $q(u, \Omega)$) approximating the posterior $p(f, \Omega \mid y)$ of the augmented model is not as accurate as the variational distribution $q(f)$ (or $q(u)$) approximating the posterior $p(f \mid y)$ of the original model (see Section 2.3.2). Although the original model can be recovered from the augmented model by marginalizing out the augmented variables $\Omega$, the mean-field approximation loses information (the correlation between $\Omega$ and $f$) and breaks this link. Marginalizing out $\Omega$ from $q^*(f, \Omega)$ will not return the optimal $q^*(f)$ trained on the original model. Interestingly, the bound difference comes exclusively from the mean-field assumption between $q(f)$ and $q(\Omega)$. We can even identify these bound differences via the interpretation of Jaakkola and Jordan [26] as missing terms from a Taylor series, as shown in Chapter 3. When analyzing the quality of the predictive distributions, the variational distribution trained on the augmented model proves to be almost as good as the variational distribution trained on the original model. The difference of bounds mentioned earlier is often not significant at convergence but will create a difference nonetheless. These empirical results give us an indication that $f$ and $\Omega$ are naturally strongly decorrelated, which would explain why the Gibbs sampling and CAVI updates are so efficient.


Conclusion

With this thesis, I want to motivate the use of different representations to ease inference in probabilistic models. The work on scale mixtures gets the best out of the blocked Gibbs sampling and the blocked CAVI algorithms. Deriving these augmentations can be complicated and requires a certain expertise. Finding more generalizations and rules will simplify this approach and make it more accessible.

We do not have a clear theoretical understanding of the reason for the fast convergence of these algorithms. By exploring the properties of these likelihoods, we work on obtaining bounds on the convergence speed of these algorithms. An intuition for why these augmentations work so well is the notion of decoupling. Many inference bottlenecks come from highly correlated variables and heavy tails of distributions [ ]. By separating these components into different variables, all parts become easier to model and do not suffer from the typical inference issues mentioned above. These ideas do not constitute an actual theory for now, and a thorough analysis is needed. A better understanding could give insights into how convergence speed and variable correlations are connected.

Another challenge, as pointed out in Chapter 7, is to widen the class of functions representable as mixtures. The most promising lead is the Moment Generating Function (MGF), but there is little theory on its properties. Schwartz [50] is one of the few people who developed a theory on distributions and their Laplace transforms, but, to our knowledge, the relevant pieces are missing.

Regardless, one of the biggest challenges is to popularize the use of such models. The gradient descent approach for VI of Hensman and Matthews [21] is by far the most popular, partly due to the success of the GPflow library [36]. Implementing these augmentations in popular libraries would be a good step. There has been an effort in the Julia programming language [4] with the AugmentedGPLikelihoods.jl package [15], but implementations in GPyTorch [17] or GPflow would help the adoption of these techniques.


References

[1] Amari, S. I. (1998). Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251–276.

[2] Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics.

[3] Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.

[4] Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98.

[5] Bock, R., Chilingarian, A., Gaug, M., Hakl, F., Hengstebeck, T., Jiřina, M., Klaschka, J., Kotrč, E., Savický, P., Towers, S., et al. (2004). Methods for multidimensional event classification: a case study using images from a cherenkov gamma-ray telescope. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 516(2-3):511–528.

[6] Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (2011). Handbook of markov chain monte carlo. CRC press.

[7] Bui, T. D., Yan, J., and Turner, R. E. (2017a). A unifying framework for gaussian process pseudo-point approximations using power expectation propagation. The Journal of Machine Learning Research, 18(1):3649–3720.

[8] Bui, T. D., Yan, J., and Turner, R. E. (2017b). A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation. arXiv:1605.07066 [cs, stat].

[9] Cressie, N. (1990). The origins of kriging. Mathematical geology, 22(3):239–252.

[10] Csató, L. (2002). Gaussian processes: iterative sparse approximations. PhD thesis.

[11] Csató, L. and Opper, M. (2002). Sparse on-line Gaussian processes. Neural computation, 14(3):641–668.

[12] Donner, C. and Opper, M. (2018). Efficient bayesian inference for a gaussian process density model. arXiv preprint arXiv:1805.11494.

[13] Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid monte carlo. Physics letters B, 195(2):216–222.

[14] Galy-Fajou, T. (2021). theogf/AugmentedGaussianProcesses.jl.

[15] Galy-Fajou, T. (2022). JuliaGaussianProcesses/AugmentedGPLikelihoods.jl: v0.4.9.

[16] Galy-Fajou, T., Widmann, D., Yalburgi, S., willtebbutt, st, Falk, I., Ridderbusch, S., Wright, T., david vicente, Khan, S., Ge, H., Giersdorf, J., TagBot, J., Mones, L., Monticone, P., Viljoen, R., Schölly, S., and Öcal, K. (2022). JuliaGaussianProcesses/KernelFunctions.jl.


[17] Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. (2018). Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. Advances in neural information processing systems, 31.

[18] Gorinova, M., Moore, D., and Hoffman, M. (2020). Automatic Reparameterisation of Probabilistic Programs. In International Conference on Machine Learning, pages 3648–3657. PMLR.

[19] Harrison Jr, D. and Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81–102.

[20] Henao, R., Yuan, X., and Carin, L. (2014). Bayesian Nonlinear Support Vector Machines and Discriminative Factor Modeling. Nips, (Mcmc):1–9.

[21] Hensman, J. and Matthews, A. (2015). Scalable Variational Gaussian Process Classification. Aistats, 38:1–9. arXiv: 1411.2005.

[22] Hensman, J., Matthews, A. G. d. G., Filippone, M., and Ghahramani, Z. (2015). MCMC for Variationally Sparse Gaussian Processes. arXiv:1506.04000 [stat].

[23] Hensman, J., Sheffield, U., Fusi, N., and Lawrence, N. (2013). Gaussian Processes for Big Data. Proceedings of UAI 29, pages 282–290. arXiv: 1309.6835.

[24] Hernandez-Lobato, J., Li, Y., Rowland, M., Bui, T., Hernández-Lobato, D., and Turner, R. (2016). Black-box alpha divergence minimization. In International Conference on Machine Learning, pages 1511–1520. PMLR.

[25] Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.

[26] Jaakkola, T. S. and Jordan, M. I. (1997). A Variational Approach to Bayesian Logistic Regression Models and their Extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, pages 283–294. PMLR.

[27] Jaakkola, T. S. and Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. Statistics and Computing, 10(1):25–37.

[28] Jensen, C. S., Kjærulff, U., and Kong, A. (1995). Blocking gibbs sampling in very large probabilistic expert systems. International Journal of Human-Computer Studies, 42(6):647–666.

[29] Jordan, M. I. and Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245):255–260.

[30] Kulesza, A. and Taskar, B. (2012). Determinantal point processes for machine learning. pages 1–120. arXiv: 1207.6083.

[31] Lázaro-Gredilla, M. and Figueiras-Vidal, A. (2009). Inter-domain gaussian processes for sparse inference using inducing features. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., and Culotta, A., editors, Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc.

[32] Lázaro-Gredilla, M. and Titsias, M. K. (2011). Variational heteroscedastic gaussian process regression. In ICML.

[33] Li, Y. and Turner, R. E. (2016). Rényi divergence variational inference. Advances in neural information processing systems, 29.

[34] Lin, W., Schmidt, M., and Khan, M. E. (2020). Handling the Positive-Definite Constraint in the Bayesian Learning Rule. arXiv:2002.10060 [cs, stat].


[35] Liu, J. S. (1994). The collapsed gibbs sampler in bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association, 89(427):958–966.

[36] Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., and Hensman, J. (2017). GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6.

[37] Murphy, K. P. (2012). Machine learning: a probabilistic perspective. Adaptive computation and machine learning series. MIT Press, Cambridge, MA.

[38] Murray, I., Adams, R., and MacKay, D. (2010). Elliptical slice sampling. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 541–548. JMLR Workshop and Conference Proceedings.

[39] Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31(3):705–741.

[40] Neal, R. M. et al. (2011). Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2.

[41] Nguyen, T. M. and Wu, Q. M. (2012). Robust student's-t mixture model with spatial constraints and its application in medical image segmentation. IEEE Transactions on Medical Imaging, 31(1):103–116.

[42] O'Hagan, A. and Forster, J. J. (2004). Kendall's advanced theory of statistics, volume 2B: Bayesian inference, volume 2. Arnold.

[43] Palmer, J. A. (2006). Variational and scale mixture representations of non-Gaussian densities for estimation in the Bayesian linear model: Sparse coding, independent component analysis, and minimum entropy segmentation. PhD thesis, UC San Diego.

[44] Polson, N. G., Scott, J. G., and Windle, J. (2012). Bayesian inference for logistic models using Polya-Gamma latent variables. pages 1–42. arXiv: 1205.0310.

[45] Quinonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959.

[46] Rasmussen, C. E. and Williams, C. K. I. (2018). Gaussian Processes for Machine Learning, volume 1. MIT press Cambridge.

[47] Ridout, M. S. (2009). Generating random numbers from a distribution specified by its Laplace transform. Statistics and Computing, 19(4):439.

[48] Salimbeni, H., Eleftheriadis, S., and Hensman, J. (2018). Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models. arXiv:1803.09151 [cs, stat].

[49] Schlaifer, R. and Raiffa, H. (1961). Applied statistical decision theory.

[50] Schwartz, L. (1952). Transformation de laplace des distributions. Comm. Sém. Math. Univ. Lund [Medd. Lunds Univ. Mat. Sem.], 1952(Tome Supplémentaire):196–206.

[51] Snelson, E. and Ghahramani, Z. (2009). Sparse Gaussian Processes using Pseudo-inputs. Advances in Neural Information Processing Systems 18, pages 1–24.

[52] Solin, A., Hensman, J., and Turner, R. E. (2018). Infinite-Horizon Gaussian Processes. arXiv:1811.06588 [cs, stat].


[53] Titsias, M. (2009). Variational Learning of Inducing Variables in Sparse Gaussian Processes. Aistats, 5:567–574.

[54] Titsias, M. and Lázaro-Gredilla, M. (2014). Doubly stochastic variational bayes for non-conjugate inference. In International conference on machine learning, pages 1971–1979. PMLR.

[55] Turner, R., Deisenroth, M., and Rasmussen, C. (2010). State-space inference and learning with gaussian processes. In Teh, Y. W. and Titterington, M., editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 868–875, Chia Laguna Resort, Sardinia, Italy. PMLR.

[56] van der Wilk, M., Dutordoir, V., John, S., Artemev, A., Adam, V., and Hensman, J. (2020). A framework for interdomain and multioutput gaussian processes.

[57] Van Erven, T. and Harremos, P. (2014). Rényi divergence and kullback-leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820.

[58] Wang, C. and Neal, R. M. (2012). Gaussian Process Regression with Heteroscedastic or Non-Gaussian Residuals. arXiv:1212.6246 [cs, stat].

[59] Wenzel, F., Galy-Fajou, T., Deutsch, M., and Kloft, M. (2017). Bayesian nonlinear support vector machines for big data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 307–322. Springer.

[60] Wenzel, F., Galy-Fajou, T., Donner, C., Kloft, M., and Opper, M. (2018). Efficient Gaussian Process Classification Using Polya-Gamma Data Augmentation. arXiv:1802.06383 [cs, stat].

[61] Widmann, D., willtebbutt, Galy-Fajou, T., st, Yalburgi, S., Ge, H., david vicente, Bosch, N., Schmitz, N., Viljoen, R., Wright, T., and andreaskoher (2022). JuliaGaussianProcesses/AbstractGPs.jl.

[62] Williams, C. K., Rasmussen, C. E., Schwaighofer, A., and Tresp, V. (2002). Observations on the nyström method for gaussian process prediction.

[63] Wilson, J. T., Borovitskiy, V., Terenin, A., Mostowsky, P., and Deisenroth, M. P. (2021). Pathwise conditioning of gaussian processes. Journal of Machine Learning Research, 22(105):1–47.


Additional work

The following work does not fit the storyline of the thesis and is therefore presented here only as a side project.

A.1 Adaptive Inducing Points Selection for Gaussian Processes

Two important questions raised when using the sparse GPs presented in Section 2.2.3 are: How should the inducing points be located? How many points does one need to reach a desired level of accuracy? This work tries to answer these questions by proposing an adaptive algorithm, working in $O(N)$ time and also valid in an online setting.

Although the algorithm proves to be more efficient than standard methods and to have interesting theoretical properties related to Determinantal Point Processes [30], it has serious tuning issues. The parameters regulating the algorithm, i.e. how often one adds or removes a point, are tightly correlated to the kernel hyperparameters. When optimizing hyperparameters during training, an unstable behavior may lead to picking all points as inducing points or selecting none. I presented this work at the Continual Learning Workshop of ICML 2020.

Authors:

Théo Galy-Fajou¹, Manfred Opper¹

¹ TU Berlin

Details:

Type: Workshop article

Submitted: June 2020

Accepted: July 2020

URL: https://arxiv.org/abs/2107.10066

Workshop: Continual Learning (ICML 2020)


Adaptive Inducing Points Selection for Gaussian Processes

Théo Galy-Fajou¹, Manfred Opper¹

¹ Technical University of Berlin

Abstract

Gaussian Processes (GPs) are flexible non-parametric models with strong probabilistic interpretation. While being a standard choice for performing inference on time series, GPs have few techniques to work in a streaming setting. Bui et al. (2017) developed an efficient variational approach to train online GPs by using sparsity techniques: the whole set of observations is approximated by a smaller set of inducing points (IPs), which are moved around with new data. Both the number and the locations of the IPs will greatly affect the performance of the algorithm. In addition to optimizing their locations, we propose to adaptively add new points, based on the properties of the GP and the structure of the data.

1. Introduction

Gaussian Processes (GPs) are flexible non-parametric

models with strong probabilistic interpretation. They are

particularly fitted for time-series (Roberts et al.,2013) but

one of their biggest limitations is that they scale cubically

with the number of points (Williams & Rasmussen,2006).

Quinonero-Candela & Rasmussen (2005) introduced the

notion of sparse GPs, models approximating the posterior

by a smaller number Mof inducing points (IPs) and re-

ducing the inference complexity from O(N3)to O(M3)

where Mis the number of IPs. Titsias (2009) introduced

them later in a variational setting, allowing to optimize their

locations. Based on this idea, (Bui et al.,2017) introduced

a variational streaming model relying on inducing points.

One of their algorithm’s features is that hyper-parameters

can be optimized and more specifically the number of in-

ducing can vary between batches of data. However in their

work, the number of IPs is fixed and their locations are sim-

ply optimized against the variational bound of the marginal

likelihood. Having a fixed number of IPs limits the model’s

scope if the total data size is unknown. A gradient based

approach leads to two problems:

- IP’s locations need to be optimized until convergence for

every batch. Therefore batches need to be sufficiently large

to get a meaningful improvement. If the new data comes

in very far from the original positions of the IPs, the opti-

Figure 1: Illustration of the inducing point selection pro-

cess. Blue points represent inducing points, green points

data and the orange line represent the mean of the predic-

tion from the GP model surrounded by one standard error.

The dashed represent the space covered by the existing IPs,

only points seen outside those areas are selected as new IPs.

mization will be extremely slow.

- The number of IPs being fixed, there is no way to know

how many will be required to have a desired accuracy.

Finding the optimal number of IPs is also not an option as

it is an ill-posed problem: the objective will only decrease

with more IPs, i.e. the optimum is obtained when every

data point is an IP.

We propose a different approach to this problem with

a simple algorithm, Online Inducing Points Selection

(OIPS), requiring only one parameter to select automati-

cally both the number of inducing points and their location.

OIPS naturally takes into account the structure of the data

while the performance trade-off and the expected number

of IPs can be inferred.

Our main contributions are as follows:

- We develop an efficient online algorithm to automatically

select the number and location of inducing points for a

streaming GP.

- We give theoretical guarantees on the expected number of

inducing points and the performance of the GP.

In section 2 we present existing methods to select inducing points, as well as an online inference method for GPs. We present our algorithm and its theoretical guarantees in section 3. In section 4 we compare it experimentally with popular inducing point selection methods. Finally, we summarize our findings and discuss future directions in section 5.

2. Background

2.1. Sparse Variational Gaussian Processes

Gaussian Processes: Given some training data $\mathcal{D} = \{X, y\}$, where $X = \{x_i\}_{i=1}^N$ are the inputs $x_i \in \mathbb{R}^D$ and $y = \{y_i\}_{i=1}^N$ are the labels, we want to compute the predictive distribution $p(y_* | \mathcal{D}, x_*)$ for new inputs $x_*$. To do so, we look for an optimal distribution over a latent function $f$. We define the latent vector $\mathbf{f}$ as the realization of $f(X)$, where $f_i = f(x_i)$, and put a GP prior $\mathcal{GP}(\mu_0, k)$ on $f$, with $\mu_0$ the prior mean (set to 0 without loss of generality) and $k$ a kernel function. In this work we use an isotropic squared exponential kernel (SE kernel), $k(x, x') = \exp(-\|x - x'\|^2 / l^2)$, but the approach is generally applicable to all translation-invariant kernels. We then compute the posterior:

$$p(\mathbf{f} | \mathcal{D}) = \frac{\prod_{i=1}^N p(y_i | f_i)\, p(\mathbf{f})}{p(\mathcal{D})} \tag{1}$$

where $p(\mathbf{f}) = \mathcal{N}(0, K_{XX})$ and $K_{XX}$ is the kernel matrix evaluated on $X$ (in later notation we use $K_X$ instead of $K_{XX}$). For a Gaussian likelihood the posterior $p(\mathbf{f} | \mathcal{D})$ is known analytically in closed form. Prediction and inference nonetheless have a complexity of $O(N^3)$.
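As an illustration of the closed-form Gaussian case, the following is a minimal sketch of exact GP regression with the SE kernel defined above. The function names, the toy data and the noise variance are assumptions made for this example only; the Cholesky factorization is the step responsible for the $O(N^3)$ cost.

```python
import numpy as np

def se_gram(A, B, lengthscale=1.0):
    """SE kernel matrix with entries exp(-||a - b||^2 / l^2), as in the text."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / lengthscale**2)

def gp_predict(X, y, X_star, lengthscale=1.0, noise_var=0.01):
    """Posterior mean and variance of the latent f at X_star (Gaussian likelihood)."""
    K = se_gram(X, X, lengthscale) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                      # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_s = se_gram(X_star, X, lengthscale)
    mean = K_s @ alpha
    V = np.linalg.solve(L, K_s.T)
    var = 1.0 - np.sum(V**2, axis=0)               # k(x, x) = 1 for this kernel
    return mean, var

# Toy data (hypothetical): a noisy sine on a regular grid.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 40)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
mean, var = gp_predict(X, y, X, lengthscale=1.0, noise_var=0.01)
```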

Sparse Variational Gaussian Processes: When the likelihood is not Gaussian, there is no tractable solution for the posterior. One possible approximation is variational inference: a family of distributions over $\mathbf{f}$ is selected, e.g. the multivariate Gaussian $q(\mathbf{f}) = \mathcal{N}(m, S)$, and the variational parameters $m$ and $S$ are optimized by minimizing the negative ELBO, a proxy for the KL divergence $\mathrm{KL}(q(\mathbf{f}) \| p(\mathbf{f} | \mathcal{D}))$. However, the computational complexity still grows cubically with the number of samples, which makes the approach unsuitable for large datasets.

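Quinonero-Candela & Rasmussen (2005) and Titsias (2009) introduced the notion of sparse variational GPs (SVGP). One adds inducing variables $\mathbf{u}$ and their inducing locations $Z = \{Z_i\}_{i=1}^M$ to the model. In this work we restrict $Z_i$ to lie in the same domain as $X_i$, but inter-domain approaches also exist (Hensman et al., 2017). The relation between $\mathbf{u}$ and $\mathbf{f}$ is given by the distribution $p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u})\, p(\mathbf{u})$, where

$$p(\mathbf{f} | \mathbf{u}) = \mathcal{N}(\mathbf{f} \,|\, K_{XZ} K_Z^{-1} \mathbf{u}, \widetilde{K}), \qquad p(\mathbf{u}) = \mathcal{N}(0, K_Z) \tag{2}$$

with $\widetilde{K} = K_X - K_{XZ} K_Z^{-1} K_{ZX}$.

We then approximate the posterior $p(\mathbf{f}, \mathbf{u} | \mathcal{D})$ with the variational distribution $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} | \mathbf{u})\, q(\mathbf{u})$, where $q(\mathbf{u}) = \mathcal{N}(\mu, \Sigma)$, by minimizing $\mathrm{KL}(q(\mathbf{f}, \mathbf{u}) \| p(\mathbf{f}, \mathbf{u} | \mathcal{D}))$.

Note that if the likelihood is Gaussian, the optimal variational parameters $\mu^*$ and $\Sigma^*$ are known in closed form. What remains is to optimize the kernel parameters and to select the number and the locations of the inducing variables.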
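The conditional in equation 2 can be sketched numerically as follows. This is an illustrative example under stated assumptions: the helper names, the jitter term added to $K_Z$ for numerical stability, and the toy inputs are not part of the original text.

```python
import numpy as np

def se_gram(A, B, lengthscale=1.0):
    """SE kernel matrix with entries exp(-||a - b||^2 / l^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / lengthscale**2)

def sparse_conditional(X, Z, u, lengthscale=1.0, jitter=1e-10):
    """Mean K_XZ K_Z^{-1} u and covariance K_X - K_XZ K_Z^{-1} K_ZX of p(f | u)."""
    K_Z = se_gram(Z, Z, lengthscale) + jitter * np.eye(len(Z))  # jitter: an assumption
    K_XZ = se_gram(X, Z, lengthscale)
    kappa = np.linalg.solve(K_Z, K_XZ.T).T                      # K_XZ K_Z^{-1}
    mean = kappa @ u
    K_tilde = se_gram(X, X, lengthscale) - kappa @ K_XZ.T
    return mean, K_tilde

X = np.linspace(0.0, 4.0, 9)[:, None]
Z = X[::2]                      # 5 inducing locations taken from the inputs
u = np.sin(Z[:, 0])             # hypothetical values of the inducing variables
mean, K_tilde = sparse_conditional(X, Z, u)
```

At inputs that coincide with an inducing location the conditional variance vanishes and the mean reproduces the corresponding entry of `u`; between inducing locations the diagonal of `K_tilde` measures what the inducing points fail to explain.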

2.2. Inducing points selection methods

Titsias (2009) initially proposed to select the point locations via greedy selection: a small batch of data is randomly sampled, and each sample is tested in turn by adding it to the set of inducing points and evaluating the improvement in the ELBO. The sample bringing the largest improvement is added to the set of inducing points, and the operation is repeated until the desired number of inducing points is reached. This greedy approach has the advantage of selecting a set that is already close to the optimal set, but it is extremely expensive and is not applicable to non-conjugate likelihoods, as it relies on estimating the optimal bound.

The most popular approach currently is to use the k-means++ algorithm (Arthur & Vassilvitskii, 2007) and take the optimized cluster centers as inducing point locations. The clustering nature of the algorithm gives good coverage of the whole dataset. However, the k-means algorithm has a complexity of $O(NMDT)$ on the whole dataset, where $T$ is the number of k-means iterations. Another issue is that it might allocate multiple centers in a region of high density, leading to very close inducing points and no significant performance improvement. It is also not applicable online and does not solve the problem of choosing the number of inducing points.

Another classical approach is to simply take a grid. For example, Moreno-Muñoz et al. (2019) use a grid in an online setting by updating the bounds of a uniform grid. Using a grid is unfortunately limited to a small number of dimensions and does not take into account the structure of the data.

2.3. Online Variational Gaussian Process Learning

Bui et al. (2017) developed a streaming algorithm for GPs (SSVGP) based on the inducing points approach of Titsias (2009). The method consists in recursively optimizing the variational distribution $q_t(\mathbf{u}_t, \mathbf{f})$ for each new batch of data $\mathcal{D}_t$ given the previous variational distribution $q_{t-1}(\mathbf{u}_{t-1}, \mathbf{f})$. Initially, $q_t$ approximates the posterior

$$p(\mathbf{u}_t, \mathbf{f} | \mathcal{D}_{1:t}) = \frac{p(\mathcal{D}_t | \mathbf{f})\, p(\mathcal{D}_{1:(t-1)} | \mathbf{f})\, p(\mathbf{u}_t, \mathbf{f} | \theta_t)}{p(\mathcal{D}_{1:t})} \tag{3}$$

where $\theta_t$ is the set of hyper-parameters. Since $\mathcal{D}_{1:(t-1)}$ is no longer accessible, the likelihood on previously seen data is approximated using the previous variational approximation $q_{t-1}(\mathbf{u}_{t-1})$ and the previous hyper-parameters $\theta_{t-1}$:

$$p(\mathcal{D}_{1:(t-1)} | \mathbf{f}) \approx \frac{q_{t-1}(\mathbf{u}_{t-1})\, p(\mathcal{D}_{1:(t-1)})}{p(\mathbf{u}_{t-1} | \theta_{t-1})}.$$

The distribution approximated by $q_t$ is in the end:

$$q_t(\mathbf{u}_t, \mathbf{f} | \mathcal{D}_{1:t}) \approx \frac{p(\mathcal{D}_t | \mathbf{f})\, q_{t-1}(\mathbf{u}_{t-1})\, p(\mathbf{u}_t, \mathbf{f} | \theta_t)}{p(\mathbf{u}_{t-1} | \theta_{t-1})}\, \frac{p(\mathcal{D}_{1:(t-1)})}{p(\mathcal{D}_{1:t})} \tag{4}$$

Optimizing the (bound on the) KL divergence between the two distributions for each new batch preserves the information of $\mathcal{D}_{1:(t-1)}$ via $q_{t-1}$ and ensures a smooth transition of the hyper-parameters, including the number of inducing points. We give all technical details, including the hyper-parameter derivatives and the ELBO in full form, in appendix A.

3. Algorithm

The idea of our algorithm is that, to give a good approximation, a large majority of the samples should be "close" (in the reproducing kernel Hilbert space (RKHS)) to the set $Z$ of IP locations. Additionally, $Z$ should be as diverse as possible, since degenerate IPs will not improve the approximation. This intuition is supported by previous work:

- Bauer et al. (2016) showed that the most substantial improvement obtained by adding a new inducing point comes from the reduction of the uncertainty of $q(\mathbf{f})$, which decreases quadratically with $K_{XZ}$.

- Burt et al. (2019) showed that the quality of the approximation made with inducing points is bounded by the norm of $K_X - Q_X$, where $Q_X = K_{XZ} K_Z^{-1} K_{ZX}$.

Therefore, by ensuring that $K_{XZ}$ and $|K_Z|$ are sufficiently large, we can expect an improved approximation of the non-sparse problem.

3.1. Adding New Inducing Points

A simple yet efficient strategy is to verify that, for each new data point $x$ seen during training, there exists a close inducing point. We first compute $K_{xZ} = [k(x, Z_1), \ldots, k(x, Z_M)]$. If the maximum value of $K_{xZ}$ is smaller than a threshold parameter $\rho$, the sample is added to the set of IPs $Z$. If not, the algorithm moves on to the next sample. We summarize all steps in Algorithm 1.

Algorithm 1 Online Inducing Point Selection (OIPS)
  Input: sample $x$, set of inducing points $Z = \{Z_j\}_{j=1}^M$, acceptance threshold $0 < \rho < 1$, kernel function $k$
  $d \leftarrow \max_j k(x, Z_j)$
  if $d < \rho$ then
    $\{Z_j\} \leftarrow \{Z_j\} \cup \{x\}$
    $M \leftarrow M + 1$
  end if
  return $\{Z_j\}$

The streaming nature of the algorithm makes it well suited to an online learning setting: it needs to see each sample only once, whereas other algorithms such as k-means need to parse all the data multiple times before converging. It is fully deterministic for a given sequence of samples, and convergence guarantees can therefore be given under some conditions. This approach was previously explored in a different context by Csató & Opper (2002), but was limited to small datasets.
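The selection rule of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy data and the convention that an empty set of inducing points always accepts the first sample are assumptions of this sketch.

```python
import numpy as np

def se_kernel(x, z, lengthscale=1.0):
    """SE kernel k(x, z) = exp(-||x - z||^2 / l^2) between two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / lengthscale**2))

def oips_step(x, Z, rho, kernel):
    """One call of Algorithm 1: add x to Z if no inducing point is close to it."""
    if not Z:                       # assumption: an empty set always accepts x
        return Z + [x]
    d = max(kernel(x, z) for z in Z)
    return Z + [x] if d < rho else Z

def oips(X, rho, kernel):
    """Stream over the samples once and return the selected inducing points."""
    Z = []
    for x in X:
        Z = oips_step(x, Z, rho, kernel)
    return np.array(Z)

# Toy usage on hypothetical uniform data in the unit square.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 2))
kernel = lambda x, z: se_kernel(x, z, lengthscale=0.3)
Z = oips(X, rho=0.7, kernel=kernel)
```

By construction, any two selected points have a kernel value below $\rho$, and every sample seen has at least one inducing point with kernel value of at least $\rho$.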

The algorithm comes at virtually no extra cost, since $K_{XZ}$ needs to be computed anyway for the variational updates of the model.

One of our claims is that our algorithm is model and data agnostic. The reason is that, as kernel hyper-parameters are optimized, the acceptance condition changes as well.

Note that this method can be interpreted as a half-greedy approach to sequential sampling from a determinantal point process (Kulesza & Taskar, 2012). In appendix B, we show that, for the same number of points, the probability of our selected set is higher than that of a k-DPP sample.

3.2. Theoretical guarantees

The final size of $Z$ depends on many factors: the selected threshold $\rho$, the chosen kernel, the structure of the data (distribution, sparsity, etc.) and the number of points seen. However, under some weak assumptions on the data we can prove a bound on the expected number of inducing points as well as on the quality of the variational approximation.

Expected number of inducing points: Since the selection process depends directly on the data, it is impossible to give a bound for arbitrary data. However, by adding assumptions on the distribution of $x$, one can bound the expected number of inducing points.

Theorem 1. Given an i.i.d. and uniformly distributed dataset, i.e. $x \sim \mathcal{U}(0, a)^D$, and an SE kernel with lengthscale $l$ such that $l^D \ll 1$, the expected number of selected inducing points $M$ after parsing $N$ points satisfies

$$E[M | N] \le \frac{a^D - (a^D - \alpha)^{N+1}}{\alpha}, \tag{5}$$

where $\alpha = \left(\frac{l \sqrt{-D \log \rho}}{2}\right)^D$.

The proof is given in appendix C. As $N \to \infty$, this bound converges to $a^D / \alpha$, which is the estimated number of overlapping hyper-spheres of radius $l \sqrt{-D \log \rho}$ needed to fill a hypercube of dimension $D$ with side length $a$. This can be used as an upper bound for any data lying in a compact domain. It confirms the intuition that the number of selected inducing points grows faster with larger dimensions, a larger $\rho$ and smaller lengthscales.
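To get a feeling for the bound of Theorem 1, the short sketch below evaluates equation 5 exactly as stated in the text. The function names are hypothetical, and the parameter values ($D = 3$, $\rho = 0.7$, $l = 0.3$, $a = 1$) are chosen to mirror the empirical comparison of appendix C.1; the choice $a = 1$ is an assumption of this example.

```python
import math

def hypercube_volume(lengthscale, rho, dim):
    """alpha = (l * sqrt(-D log rho) / 2)^D, as defined below equation 5."""
    return (lengthscale * math.sqrt(-dim * math.log(rho)) / 2.0) ** dim

def expected_ips_bound(n_seen, a, dim, lengthscale, rho):
    """Right-hand side of equation 5 after parsing n_seen points."""
    a_d = a ** dim
    alpha = hypercube_volume(lengthscale, rho, dim)
    return (a_d - (a_d - alpha) ** (n_seen + 1)) / alpha

alpha = hypercube_volume(0.3, 0.7, 3)
for n in (10, 100, 1000, 10**6):
    print(n, expected_ips_bound(n, a=1.0, dim=3, lengthscale=0.3, rho=0.7))
print("limit a^D / alpha:", 1.0 / alpha)
```

The printed values increase with the number of points seen and saturate at $a^D / \alpha$.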

Expected performance on regression: Burt et al. (2019) derived a convergence bound for the inducing points approach of Titsias (2009). Although they derive this bound in an offline setting, it remains relevant for online problems. They show that when $Z$ is sampled via a k-DPP (Kulesza & Taskar, 2011), i.e. a determinantal point process conditioned on a fixed set size, the difference between the ELBO and the log evidence $\log p(\mathcal{D})$ is bounded by

$$E_Z\left[\|K_X - Q_X\|\right] \le (M + 1) \sum_{i=M+1}^{N} \lambda_i(K_X) \tag{6}$$

where $\lambda_i(K_X)$ is the $i$-th largest eigenvalue of $K_X$ and $Q_X = K_{XZ} K_Z^{-1} K_{ZX}$ is the Nyström approximation of $K_X$.

We derive a similar bound when using our algorithm instead of k-DPP sampling:

Theorem 2. Let $Z$ be the set of inducing point locations of size $M$ selected via Algorithm 1 on the dataset $X$ of size $N$. Then

$$\|K_X - Q_X\| \le (N - M) \left(1 - \frac{\rho^2}{1 + M(M-1)\rho}\right) \tag{7}$$

where $K_X$ is the kernel matrix on $X$ and $Q_X$ is the Nyström approximation of $K_X$ using the subset $Z$.

The proof and an empirical comparison are given in appendix D.
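The bound of equation 7 can be checked numerically on toy data. The sketch below selects inducing points with the OIPS rule, computes the Frobenius norm of $K_X - Q_X$ and compares it with the right-hand side of equation 7. All names, the jitter term and the uniform toy data are assumptions of this example, not the experimental setup of the paper.

```python
import numpy as np

def se_gram(A, B, lengthscale):
    """SE kernel matrix with entries exp(-||a - b||^2 / l^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / lengthscale**2)

def oips_indices(X, rho, lengthscale):
    """Indices selected by the OIPS rule (the first point is always accepted)."""
    idx = [0]
    for i in range(1, len(X)):
        if se_gram(X[i:i + 1], X[idx], lengthscale).max() < rho:
            idx.append(i)
    return np.array(idx)

def nystrom_error(X, Z, lengthscale, jitter=1e-8):
    """Frobenius norm of K_X - K_XZ K_Z^{-1} K_ZX (jitter added for stability)."""
    K_X = se_gram(X, X, lengthscale)
    K_XZ = se_gram(X, Z, lengthscale)
    K_Z = se_gram(Z, Z, lengthscale) + jitter * np.eye(len(Z))
    Q_X = K_XZ @ np.linalg.solve(K_Z, K_XZ.T)
    return np.linalg.norm(K_X - Q_X, ord="fro")

def theorem2_bound(N, M, rho):
    """Right-hand side of equation 7."""
    return (N - M) * (1.0 - rho**2 / (1.0 + M * (M - 1) * rho))

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
rho, lengthscale = 0.7, 0.3
Z = X[oips_indices(X, rho, lengthscale)]
err = nystrom_error(X, Z, lengthscale)
bound = theorem2_bound(len(X), len(Z), rho)
print(len(Z), err, bound)
```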

4. Experiments

In this section we take a quick look at how our algorithm performs in different settings compared to the approaches described in section 2.2. We compare the online model SSVGP described in section 2 with different IP selection techniques: we select IPs from the first batch via k-means and then optimize them (k-means/opt), select them via our algorithm and optimize them (OIPS/opt), select them via our algorithm without optimizing them (OIPS), and finally create a grid that we adapt according to new bounds (Grid). We consider 3 different toy datasets, two of which are displayed in figure 2. Dataset A is a uniform time series whose output function is a noisy sine. Dataset B is an irregular time series, with a gap in the inputs; its output function is also a noisy sine. The inputs of dataset C are random samples from an isotropic multivariate 3D Gaussian and the output function is given by $\sin(\|x\|)/\|x\|$. All datasets contain 200 training points and 200 test points. For all experiments we use an isotropic SE kernel with fixed parameters. For datasets A and B, Grid and k-means have 25 IPs while OIPS converged to around 20 IPs. For dataset C, Grid has $10^3$ IPs, k-means 50, and both OIPS variants converged to 10 IPs. Figure 2 shows the evolution of the average negative log-likelihood on test data after each batch has been seen. In a uniform time-series context all methods are roughly equivalent. The presence of a gap blocks the optimization of IP locations and impedes inference on future points, whereas the grid suffers from being in high dimensions. All details on the datasets, training methods, hyper-parameters and optimization parameters are given in appendix E.

Figure 2: Toy datasets A and B, divided into 4 batches. Average negative test log-likelihood on a test set as a function of the number of batches seen. In a uniform streaming setting all methods perform similarly, but a gap blocks the convergence of a simple position optimization, whereas in a non-compact situation the adaptive grid suffers in performance.

5. Conclusion

We presented a new algorithm, OIPS, able to select inducing points automatically for a GP in an online setting. The theoretical bounds we derived outperform the previous work based on DPPs. The selection process has yet to be made robust to outliers and to variations of the hyper-parameters. Using, for instance, a threshold on the median or on a mean over the k-nearest IPs could help to avoid picking adversarial points such as outliers. We have only considered regression, but our algorithm is also compatible with non-conjugate likelihoods: using augmentation approaches (Wenzel et al., 2019; Galy-Fajou et al., 2019), the same performance can be attained. Finally, the most interesting improvement would be to use a non-stationary kernel (Remes et al., 2017) and to automatically adapt the number of inducing points across the dataset.


References

Arthur, D. and Vassilvitskii, S. k-means++: The advan-

tages of careful seeding. In Proceedings of the eighteenth

annual ACM-SIAM symposium on Discrete algorithms,

pp. 1027–1035. Society for Industrial and Applied Math-

ematics, 2007.

Bauer, M., van der Wilk, M., and Rasmussen, C. E. Under-

standing probabilistic sparse gaussian process approxi-

mations. In Advances in neural information processing

systems, pp. 1533–1541, 2016.

Belabbas, M.-A. and Wolfe, P. J. Spectral methods in ma-

chine learning and new strategies for very large datasets.

Proceedings of the National Academy of Sciences, 106

(2):369–374, 2009.

Bui, T. D., Nguyen, C., and Turner, R. E. Streaming sparse

gaussian process approximations. In Advances in Neural

Information Processing Systems, pp. 3299–3307, 2017.

Burt, D., Rasmussen, C. E., and Van Der Wilk, M. Rates

of convergence for sparse variational gaussian process

regression. In International Conference on Machine

Learning, pp. 862–871, 2019.

Csató, L. and Opper, M. Sparse on-line gaussian processes. Neural computation, 14(3):641–668, 2002.

Galy-Fajou, T., Wenzel, F., Donner, C., and Opper, M.

Multi-class gaussian process classification made conju-

gate: Efficient inference via data augmentation. arXiv

preprint arXiv:1905.09670, 2019.

Hensman, J., Durrande, N., and Solin, A. Variational

fourier features for gaussian processes. The Journal of

Machine Learning Research, 18(1):5537–5588, 2017.

Kulesza, A. and Taskar, B. k-dpps: Fixed-size determinan-

tal point processes. In Proceedings of the 28th Interna-

tional Conference on Machine Learning (ICML-11), pp.

1193–1200, 2011.

Kulesza, A. and Taskar, B. Determinantal point processes for machine learning. pp. 1–120, 2012. ISSN 1935-8237. doi: 10.1561/2200000044. URL http://arxiv.org/abs/1207.6083. arXiv: 1207.6083. ISBN: 9781601986283.

Moreno-Muñoz, P., Artés-Rodríguez, A., and Álvarez, M. A. Continual multi-task gaussian processes. arXiv preprint arXiv:1911.00002, 2019.

Quinonero-Candela, J. and Rasmussen, C. E. A Unifying View of Sparse Approximate Gaussian Process Regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

Remes, S., Heinonen, M., and Kaski, S. Non-stationary

spectral kernels. In Advances in Neural Information Pro-

cessing Systems, pp. 4642–4651, 2017.

Roberts, S., Osborne, M., Ebden, M., Reece, S., Gibson,

N., and Aigrain, S. Gaussian processes for time-series

modelling. Philosophical Transactions of the Royal

Society A: Mathematical, Physical and Engineering

Sciences, 371(1984):20110550, February 2013. ISSN

1364-503X, 1471-2962. doi: 10.1098/rsta.2011.0550.

URL https://royalsocietypublishing.

org/doi/10.1098/rsta.2011.0550.

Stewart, G. W. and guang Sun, J. Matrix Perturbation The-

ory. Academic Press, 1990.

Titsias, M. Variational learning of inducing variables in

sparse gaussian processes. In Artificial Intelligence and

Statistics, pp. 567–574, 2009.

Wenzel, F., Galy-Fajou, T., Donner, C., Kloft, M., and Opper, M. Efficient gaussian process classification using Pólya-gamma data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5417–5424, 2019.

Williams, C. K. and Rasmussen, C. E. Gaussian processes

for machine learning, volume 2. MIT press Cambridge,

MA, 2006.


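A. Derivations for online GPs

A.1. ELBO

Following Bui et al. (2017), the ELBO for variational inference is defined as:

$$\begin{aligned}
\mathcal{L} = {} & -\mathrm{KL}\left(q_t(u_t) \,\|\, p(u_t | \theta_t)\right) + E_{q_t(u_t, f_t)}\left[\log p(y_t | f_t)\right] \\
& - \mathrm{KL}\left(q_t(u_{t-1}) \,\|\, q_{t-1}(u_{t-1})\right) + \mathrm{KL}\left(q_t(u_{t-1}) \,\|\, p(u_{t-1} | \theta_{t-1})\right)
\end{aligned}$$

The terms of the first line correspond to a classical SVGP problem, and the second line expresses the KL divergence with the previous variational posterior. The distributions are defined as:

$$\begin{aligned}
q_t(u_t) &= \mathcal{N}(\mu_t, \Sigma_t) \\
p(u_t | \theta_t) &= \mathcal{N}(0, K_{Z_t}) \\
q_t(u_{t-1}) &= \int p(u_{t-1} | u_t)\, q_t(u_t)\, du_t = \mathcal{N}\left(\kappa_{Z_{t-1}Z_t} \mu_t, \widetilde{K}_{Z_{t-1}}\right) \\
\widetilde{K}_{Z_{t-1}} &= K_{Z_{t-1}} + \kappa_{Z_{t-1}Z_t} \Sigma_t \kappa_{Z_{t-1}Z_t}^\top - K_{Z_{t-1}Z_t} K_{Z_t}^{-1} K_{Z_tZ_{t-1}} \\
q_{t-1}(u_{t-1}) &= \mathcal{N}(\mu_{t-1}, \Sigma_{t-1}) \\
p(u_{t-1} | \theta_{t-1}) &= \mathcal{N}(0, K'_{Z_{t-1}}), \quad \text{with } K'_{Z_{t-1}} \text{ computed given } \theta_{t-1}
\end{aligned}$$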

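The first terms are

$$\mathrm{KL}\left(q_t(u_t) \,\|\, p(u_t | \theta_t)\right) = \frac{1}{2}\left(\log|K_{Z_t}| - \log|\Sigma_t| - M_t + \mathrm{tr}(K_{Z_t}^{-1} \Sigma_t) + \mu_t^\top K_{Z_t}^{-1} \mu_t\right)$$

and, for $p(y_t | f_t) = \prod_{i=1}^B \mathcal{N}(y_i | f_i, \sigma^2)$, the expected log-likelihood is given by

$$E_{q_t(u_t, f_t)}\left[\log p(y_t | f_t)\right] = -\frac{B}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^2} \sum_{i=1}^B \left[(y_i - \kappa_{X_iZ_t} \mu_t)^2 + \widetilde{K} + \kappa_{X_iZ_t} \Sigma_t \kappa_{X_iZ_t}^\top\right]$$

Writing the second terms in full we get:

$$\begin{aligned}
\mathrm{KL}\left(q_t(u_{t-1}) \,\|\, p(u_{t-1} | \theta_{t-1})\right) = \frac{1}{2}\Big[ & \log|K'_{Z_{t-1}}| - \log|\widetilde{K}_{Z_{t-1}}| - M_{t-1} + \mathrm{tr}\left((K'_{Z_{t-1}})^{-1} \widetilde{K}_{Z_{t-1}}\right) \\
& + (\kappa_{Z_{t-1}Z_t} \mu_t)^\top (K'_{Z_{t-1}})^{-1} \kappa_{Z_{t-1}Z_t} \mu_t \Big]
\end{aligned}$$

$$\begin{aligned}
\mathrm{KL}\left(q_t(u_{t-1}) \,\|\, q_{t-1}(u_{t-1})\right) = \frac{1}{2}\Big[ & \log|\Sigma_{t-1}| - \log|\widetilde{K}_{Z_{t-1}}| - M_{t-1} + \mathrm{tr}\left(\Sigma_{t-1}^{-1} \widetilde{K}_{Z_{t-1}}\right) \\
& + (\mu_{t-1} - \kappa_{Z_{t-1}Z_t} \mu_t)^\top \Sigma_{t-1}^{-1} (\mu_{t-1} - \kappa_{Z_{t-1}Z_t} \mu_t) \Big]
\end{aligned}$$

Subtracting the second term from the first we get:

$$\begin{aligned}
\mathrm{KL}_{t:t-1} = {} & \mathrm{KL}\left(q_t(u_{t-1}) \,\|\, p(u_{t-1} | \theta_{t-1})\right) - \mathrm{KL}\left(q_t(u_{t-1}) \,\|\, q_{t-1}(u_{t-1})\right) \\
= {} & \frac{1}{2}\Big[\log|K'_{Z_{t-1}}| - \log|\Sigma_{t-1}| - \mathrm{tr}\left(\left(\Sigma_{t-1}^{-1} - (K'_{Z_{t-1}})^{-1}\right) \widetilde{K}_{Z_{t-1}}\right) \\
& - \mu_{t-1}^\top \Sigma_{t-1}^{-1} \mu_{t-1} + 2 \mu_{t-1}^\top \Sigma_{t-1}^{-1} \kappa_{Z_{t-1}Z_t} \mu_t \\
& - (\kappa_{Z_{t-1}Z_t} \mu_t)^\top \left(\Sigma_{t-1}^{-1} - (K'_{Z_{t-1}})^{-1}\right) (\kappa_{Z_{t-1}Z_t} \mu_t)\Big] \\
= {} & \frac{1}{2}\Big[\log|K'_{Z_{t-1}}| - \log|\Sigma_{t-1}| - \mathrm{tr}\left(D_{t-1}^{-1} \widetilde{K}_{Z_{t-1}}\right) \\
& - \mu_{t-1}^\top \Sigma_{t-1}^{-1} \mu_{t-1} + 2 \mu_{t-1}^\top \Sigma_{t-1}^{-1} \kappa_{Z_{t-1}Z_t} \mu_t \\
& - (\kappa_{Z_{t-1}Z_t} \mu_t)^\top D_{t-1}^{-1} (\kappa_{Z_{t-1}Z_t} \mu_t)\Big]
\end{aligned}$$

where $D_t = \left(\Sigma_t^{-1} - K_{Z_t}^{-1}\right)^{-1}$.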

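Taking the derivative of $\mathcal{L}$ with respect to $\mu_t$ and $\Sigma_t$ gives us directly the optimal solution for Gaussian regression:

$$\begin{aligned}
\Sigma_t^* &= \left(\sigma^{-2} \kappa_{X_tZ_t}^\top \kappa_{X_tZ_t} + \kappa_{Z_{t-1}Z_t}^\top D_{t-1}^{-1} \kappa_{Z_{t-1}Z_t} + K_{Z_t}^{-1}\right)^{-1} \\
\mu_t^* &= \Sigma_t^* \left(\kappa_{X_tZ_t}^\top \sigma^{-2} y_t + \kappa_{Z_{t-1}Z_t}^\top \Sigma_{t-1}^{-1} \mu_{t-1}\right)
\end{aligned}$$

Rewritten in terms of natural parameters:

$$\begin{aligned}
\eta_1^t &= \kappa_{X_tZ_t}^\top \sigma^{-2} y_t + \kappa_{Z_{t-1}Z_t}^\top \eta_1^{t-1} \\
\eta_2^t &= -\frac{1}{2}\left(\kappa_{X_tZ_t}^\top \sigma^{-2} I \kappa_{X_tZ_t} + \kappa_{Z_{t-1}Z_t}^\top \left(-2\eta_2^{t-1} - K_{Z_{t-1}}^{-1}\right) \kappa_{Z_{t-1}Z_t} + K_{Z_t}^{-1}\right)
\end{aligned}$$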


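A.2. Hyper-parameter derivatives

Given a kernel hyper-parameter $\theta$ and $J = \frac{dK}{d\theta}$, the derivatives are given by:

$$\frac{d\mathrm{KL}_{t:t-1}}{d\theta_t} = -\frac{1}{2} \mathrm{tr}\left(D_{t-1}^{-1} \frac{d\widetilde{K}_{Z_{t-1}}}{d\theta_t}\right) + \mu_{t-1}^\top \Sigma_{t-1}^{-1} \frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t} \mu_t - (\kappa_{Z_{t-1}Z_t} \mu_t)^\top D_{t-1}^{-1} \left(\frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t} \mu_t\right)$$

$$\begin{aligned}
\frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t} &= \frac{dK_{Z_{t-1}Z_t}}{d\theta_t} K_{Z_t}^{-1} + K_{Z_{t-1}Z_t} \frac{dK_{Z_t}^{-1}}{d\theta_t} \\
&= \left(J_{Z_{t-1}Z_t} - \kappa_{Z_{t-1}Z_t} J_{Z_t}\right) K_{Z_t}^{-1} = \iota_{Z_{t-1}Z_t}
\end{aligned}$$

$$\begin{aligned}
\frac{d\widetilde{K}_{Z_{t-1}}}{d\theta_t} &= \frac{dK_{Z_{t-1}}}{d\theta_t} + 2 \frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t} \Sigma_t \kappa_{Z_{t-1}Z_t}^\top - \frac{d\kappa_{Z_{t-1}Z_t}}{d\theta_t} K_{Z_tZ_{t-1}} - \kappa_{Z_{t-1}Z_t} \frac{dK_{Z_tZ_{t-1}}}{d\theta_t} \\
&= J_{Z_{t-1}} + 2 \iota_{Z_{t-1}Z_t} \Sigma_t \kappa_{Z_{t-1}Z_t}^\top - \iota_{Z_{t-1}Z_t} K_{Z_tZ_{t-1}} - \kappa_{Z_{t-1}Z_t} J_{Z_tZ_{t-1}}
\end{aligned}$$

$$\frac{d\mathrm{KL}\left(q_t(u_t) \,\|\, p(u_t | \theta_t)\right)}{d\theta_t}$$

Special derivative with respect to the variance:

$$\frac{d\mathrm{KL}_a}{dv} = -\frac{1}{2} \mathrm{tr}\left(D_a^{-1} \frac{1}{v}\left(K_{aa} - K_{ab} K_{bb}^{-1} K_{ba}\right)\right)$$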

A.3. Comparison with SVI

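If we take the special case where inducing points do not change between iterations, then $\kappa_{Z_{t-1}Z_t} = I$ and $K_{Z_{t-1}} = K_{Z_t}$. The updates become

$$\begin{aligned}
\eta_1^t &= \kappa_{X_tZ_t}^\top \sigma^{-2} y_t + \eta_1^{t-1} \\
\eta_2^t &= -\frac{1}{2}\left(\kappa_{X_tZ_t}^\top \sigma^{-2} \kappa_{X_tZ_t} + \left(-2\eta_2^{t-1} - K_{Z_t}^{-1}\right) + K_{Z_t}^{-1}\right) \\
&= -\frac{1}{2} \kappa_{X_tZ_t}^\top \sigma^{-2} \kappa_{X_tZ_t} + \eta_2^{t-1}
\end{aligned}$$

Compared to the SVI updates (where $\rho$ denotes the SVI step size, not the OIPS threshold):

$$\begin{aligned}
\eta_1^t &= \eta_1^{t-1} + \rho\left(\frac{N}{|B|} \kappa_{X_tZ_t}^\top \sigma^{-2} y_t - \eta_1^{t-1}\right) \\
\eta_2^t &= \eta_2^{t-1} + \rho\left(-\frac{1}{2}\left(\frac{N}{|B|} \kappa_{X_tZ_t}^\top \sigma^{-2} \kappa_{X_tZ_t} + K_{Z_t}^{-1}\right) - \eta_2^{t-1}\right)
\end{aligned}$$

If we ignore $\rho$ by setting it to 1:

$$\begin{aligned}
\eta_1^t &= \frac{N}{|B|} \kappa_{X_tZ_t}^\top \sigma^{-2} y_t \\
\eta_2^t &= -\frac{1}{2}\left(\frac{N}{|B|} \kappa_{X_tZ_t}^\top \sigma^{-2} \kappa_{X_tZ_t} + K_{Z_t}^{-1}\right)
\end{aligned}$$

The previous $\eta_1$ is then completely forgotten.

To make it directly comparable to streaming:

SVI

$$\begin{aligned}
\eta_1^{t+1} &= (1 - \rho)\, \eta_1^t + \rho \frac{N}{|B|} \kappa_f^\top \sigma^{-2} y \\
\eta_2^{t+1} &= (1 - \rho)\, \eta_2^t - \frac{1}{2} \rho \left(\frac{N}{|B|} \kappa_f^\top \sigma^{-2} \kappa_f + K_{bb}^{-1}\right) \\
\eta_1^t &= (1 - \rho)^t \eta^0 + \sum_{i=1}^{t} (1 - \rho)^{i-1} \rho \frac{N}{|B|} \kappa_f^\top \sigma^{-2} y_i
\end{aligned}$$

Streaming

$$\begin{aligned}
\eta_1^{t+1} &= \eta_1^t + \kappa_f^\top \sigma^{-2} y \\
\eta_2^{t+1} &= \eta_2^t - \frac{1}{2} \kappa_f^\top \sigma^{-2} \kappa_f
\end{aligned}$$

Figure 3: Histogram of $p(Z | k = M)$ for the OIPS algorithm and k-DPP sampling.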
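The fixed-inducing-point streaming updates of appendix A.3 are purely additive in the natural parameters, which the following sketch illustrates: processing the data in four batches gives the same result as a single update with all the data. The helper names, the toy data and the initialization at the prior $q_0(u) = \mathcal{N}(0, K_Z)$ are assumptions of this example.

```python
import numpy as np

def se_gram(A, B, lengthscale=1.0):
    """SE kernel matrix with entries exp(-||a - b||^2 / l^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / lengthscale**2)

def streaming_update(eta1, eta2, kappa, y, noise_var):
    """Streaming update for fixed Z: eta1 += kappa^T y / s2, eta2 -= kappa^T kappa / (2 s2)."""
    return eta1 + kappa.T @ y / noise_var, eta2 - 0.5 * kappa.T @ kappa / noise_var

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
Z = np.linspace(0.0, 5.0, 6)[:, None]
noise_var, lengthscale = 0.01, 1.0

K_Z = se_gram(Z, Z, lengthscale) + 1e-8 * np.eye(len(Z))
kappa_all = np.linalg.solve(K_Z, se_gram(Z, X, lengthscale)).T   # K_XZ K_Z^{-1}

# Assumption: start from the prior, i.e. eta1 = 0 and eta2 = -K_Z^{-1} / 2.
eta1 = np.zeros(len(Z))
eta2 = -0.5 * np.linalg.inv(K_Z)
for batch in np.array_split(np.arange(len(X)), 4):
    eta1, eta2 = streaming_update(eta1, eta2, kappa_all[batch], y[batch], noise_var)

# The same update applied once to the full dataset.
eta1_full, eta2_full = streaming_update(
    np.zeros(len(Z)), -0.5 * np.linalg.inv(K_Z), kappa_all, y, noise_var)

# Recover the mean and covariance of q(u) from the natural parameters.
Sigma = np.linalg.inv(-2.0 * eta2)
mu = Sigma @ eta1
```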

B. Deterministic algorithm as a DPP half-greedy sampling

We run a simple experiment in which, given a dataset, Abalone ($N = 4177$, $D = 7$), we repeatedly shuffle the data. We apply Algorithm 1, parsing all the data, to get the subset $Z_{OIPS}$. We use the resulting number of inducing points $k$ as a parameter to sample from a k-DPP and obtain $Z_{kDPP}$. We compute the log-probabilities $\log p(Z_{OIPS} | M = k)$ and $\log p(Z_{kDPP} | M = k)$ and report their histogram in figure 3. One can observe that the probability given by the OIPS algorithm is consistently higher, and its distribution narrower, than that of the sampling. This can be explained by the fact that we deterministically constrain all points to keep a certain distance from each other, and therefore put a deterministic limit on the determinant of $K_Z$.
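For reference, the log-probability of a fixed-size subset under a k-DPP with kernel matrix $K$ is $\log\det(K_Y) - \log e_k(\lambda)$, where $e_k$ is the $k$-th elementary symmetric polynomial of the eigenvalues of $K$ (Kulesza & Taskar, 2011). The sketch below is an illustration under assumptions: the function names and the tiny toy problem are hypothetical, and the plain recursion for $e_k$ used here would need a log-domain implementation to remain stable at the scale of the Abalone experiment.

```python
import itertools
import math
import numpy as np

def elementary_symmetric(eigvals, k):
    """k-th elementary symmetric polynomial of eigvals via the standard recursion."""
    e = np.zeros(k + 1)
    e[0] = 1.0
    for lam in eigvals:
        for j in range(k, 0, -1):
            e[j] += lam * e[j - 1]
    return e[k]

def kdpp_log_prob(K, subset):
    """log p(Y | |Y| = k) = log det(K_Y) - log e_k(eigenvalues of K)."""
    k = len(subset)
    _, logdet = np.linalg.slogdet(K[np.ix_(subset, subset)])
    return logdet - math.log(elementary_symmetric(np.linalg.eigvalsh(K), k))

# Tiny toy problem: 6 points, subsets of size 3.
rng = np.random.default_rng(0)
pts = rng.uniform(size=(6, 2))
sq = np.sum((pts[:, None, :] - pts[None, :, :])**2, axis=-1)
K = np.exp(-sq / 0.5**2)

# The probabilities of all subsets of size 3 should sum to one.
total = sum(math.exp(kdpp_log_prob(K, list(c)))
            for c in itertools.combinations(range(6), 3))
```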


C. Proof of Theorem 1: Bound on the number of points

Algorithm 1 can be interpreted as filling a domain with closed balls, where intersections between balls are allowed but no center can lie inside another ball. For an SE kernel we can compute the radius $r$ (in Euclidean space) of these balls:

$$\begin{aligned}
k(x, x') &= \rho \\
\exp\left(-\frac{\|x - x'\|^2}{h^2}\right) &= \rho \\
\|x - x'\|^2 &= -h^2 \log \rho \\
r &= h \sqrt{-\log \rho}
\end{aligned}$$

We can bound the volume of the union of the balls by the union of inscribed hypercubes. The side length of a hypercube inscribed in a hypersphere of radius $r$ is $l = r\sqrt{D}/2$. Since the volume of the hypercube is smaller by construction, this gives us an upper bound on the expected number of inducing points. Defining $K_n$ as the number of inducing points at time $n$, the probability of having a point outside of the union of all $k$ hypercubes is

$$p(K_{n+1} = k + 1 | K_n = k) = \max\left(a^D - \sum_{i=1}^{k} l^D, 0\right) = \max\left(a^D - k l^D, 0\right), \qquad p_k^{+} = \max\left(a^D - k\alpha, 0\right)$$

where $\alpha = \left(\frac{r\sqrt{D}}{2}\right)^D$ is the volume of one hypercube and therefore the probability of a new sample appearing in it. The probability of keeping the same number of points is

$$p(K_{n+1} = k | K_n = k) = \min\left(\sum_{i=1}^{k} l^D, 1\right), \qquad p_k^{=} = \min(k\alpha, 1)$$

We now consider the problem as a Markov chain where the state $p$ is represented by a vector $\{p_i\}_{i=1}^N$ where $p_i = 1$ if there are $i$ inducing points. The transition matrix $P$ is given by:

$$P = \begin{pmatrix}
p_1^{=} & 0 & 0 & 0 \\
p_1^{+} & p_2^{=} & 0 & 0 \\
0 & p_2^{+} & \ddots & 0 \\
0 & 0 & \ddots & 0 \\
0 & 0 & p_{N-1}^{+} & p_N^{=}
\end{pmatrix}$$

If we start with one inducing point, the initial state is $p_1 = \{1, 0, \ldots, 0\}^\top$, the probability of having $k$ balls after $n$ steps is $p(K_n = k | p_1) = \left(P^n p_1\right)_k$, and the expected number of points is given by $\sum_k k \cdot p(K_n = k | p_1)$.

This sequence can be complex to compute. Instead, we can approximate the final expectation by recursively computing the update given the expectation at the previous step:

$$\begin{aligned}
E_{p(K_{n+1} | K_n = E[K_n])}[K_{n+1}] &= E[K_n]\, E[K_n]\, \alpha + (E[K_n] + 1)\left(a^D - E[K_n]\, \alpha\right) \\
&= a^D E[K_n] + a^D - E[K_n]\, \alpha = a^D + E[K_n]\left(a^D - \alpha\right)
\end{aligned}$$

This is an arithmetico-geometric sequence, and given the initial condition $E[K_0] = 1$ and since $\alpha < a^D$ we can get a closed-form solution for $E[K_n]$:

$$\begin{aligned}
E[K_n] &= (a^D - \alpha)^n \left(1 - \frac{a^D}{\alpha}\right) + \frac{a^D}{\alpha} \\
&= \frac{a^D - (a^D - \alpha)^{n+1}}{\alpha}
\end{aligned}$$
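The recursion and its closed form can be checked numerically. The sketch below is restricted to the case $a^D = 1$, which is an assumption of this example, and all function names are hypothetical. It compares the closed form, the step-by-step recursion and the expectation under the Markov chain defined above; as long as $k\alpha \le 1$ for all reachable states, the expected increment is linear in $k$, so the three values coincide.

```python
import numpy as np

def closed_form(n, alpha, a_d=1.0):
    """Closed-form solution (a^D - (a^D - alpha)^(n+1)) / alpha."""
    return (a_d - (a_d - alpha) ** (n + 1)) / alpha

def recursion(n, alpha, a_d=1.0):
    """Iterate E[K_{n+1}] = a^D + E[K_n] (a^D - alpha), starting from E[K_0] = 1."""
    e = 1.0
    for _ in range(n):
        e = a_d + e * (a_d - alpha)
    return e

def markov_chain_expectation(n, alpha):
    """Expected number of inducing points after n steps of the chain, for a^D = 1."""
    p = np.zeros(n + 2)          # p[k] = probability of having k inducing points
    p[1] = 1.0
    for _ in range(n):
        new = np.zeros_like(p)
        for k in range(1, n + 1):
            stay = min(k * alpha, 1.0)       # p_k^=
            new[k] += p[k] * stay
            new[k + 1] += p[k] * (1.0 - stay)  # p_k^+ = max(1 - k alpha, 0)
        p = new
    return float(np.arange(n + 2) @ p)

n, alpha = 50, 0.01
values = (closed_form(n, alpha), recursion(n, alpha), markov_chain_expectation(n, alpha))
print(values)
```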

C.1. Empirical Comparison

We show the realization of this bound on uniform data with 3 dimensions, $\rho = 0.7$ and $l = 0.3$ in figure 4.

Figure 4: Bound on the number of accepted inducing points $M$ given the number of seen points $N$, versus the empirical estimate.

D. Proof of Theorem 2: Bounding the ELBO

We follow the approach of Burt et al. (2019) and Belabbas & Wolfe (2009). Burt et al. (2019) showed that the error between the ELBO and the log evidence is bounded by $\|K_X - K_{XZ} K_Z^{-1} K_{ZX}\|$, where $\|\cdot\|$ is the Frobenius norm. Using k-DPP sampling (Kulesza & Taskar, 2011), they were able to show a bound on the expectation of this norm.

We follow similar calculations with our deterministic algorithm for fixed kernel parameters. Let $K_X$ be the kernel matrix of the full dataset and $K_Z$ the submatrix given by the set of points $\{Z_i\}_{i=1}^M$. The Schur complement of $K_{ZZ}$ in $K_{XX}$, $SC(K_{ZZ})$, is given by $K_X - K_{XZ} K_Z^{-1} K_{ZX}$. Following an approach similar to that of Belabbas & Wolfe (2009), we bound the norm by the trace:

$$\|SC(K_{ZZ})\| = \sqrt{\sum_{j=1}^{N-M} \lambda_j^2} \le \sum_{j=1}^{N-M} \lambda_j = \mathrm{tr}(SC(K_{ZZ}))$$

Using the definition of $SC(K_{ZZ})$ we get:

$$\mathrm{tr}(SC(K_{ZZ})) = \sum_{i=1}^{N-M} \left(K_{X_i} - K_{X_iZ} K_Z^{-1} K_{ZX_i}\right)$$

where every element of the sum is a scalar and the sum runs over the points not selected as inducing points. Taking $W^\top \Lambda W$ the eigendecomposition of $K_Z^{-1}$, $w_i = W K_{ZX_i}$, and assuming a kernel variance $v$ of 1 (although generalizable to all variances) and a translation-invariant kernel such that $k(x, x) = 1$, we get:

$$\begin{aligned}
K_{X_i} - K_{X_iZ} K_Z^{-1} K_{ZX_i} &= 1 - w_i^\top \Lambda w_i = 1 - \sum_{j=1}^{M} \lambda_j (w_i)_j^2 \\
&\le 1 - \lambda_{\min} \|w_i\|^2 = 1 - \lambda_{\min} \|K_{X_iZ}\|^2 \le 1 - \lambda_{\min} \rho^2
\end{aligned}$$

where we used the fact that each such $X_i$ is close enough to at least one $Z_j$, such that $k(X_i, Z_j) \ge \rho$. For clarity we replace $\lambda_{\min}$, the smallest eigenvalue of $K_Z^{-1}$, by $\lambda_{\max}^{-1}$, where $\lambda_{\max}$ is the largest eigenvalue of $K_Z$. When summing over the trace we get the final bound

$$\|K_X - K_{XZ} K_Z^{-1} K_{ZX}\| \le (N - M)\left(1 - \frac{\rho^2}{\lambda_{\max}}\right)$$

Now by construction all off-diagonal terms of $K_Z$ are smaller than $\rho$. Using the inequality (Stewart & Sun, 1990)

$$|\lambda_i(A) - \lambda_i(B)| \le \|A - B\|, \quad \forall i = 1, \ldots, N$$

we get that

$$|\lambda_{\max}(K_Z) - 1| \le \|K_Z - I\|_2 = \sqrt{\sum_{i \neq j} (K_Z)_{ij}^2} \le M(M-1)\rho$$

Assuming $\lambda_{\max}(K_Z) \ge 1$, we get

$$\lambda_{\max}(K_Z) \le 1 + M(M-1)\rho$$

which then gives the final bound:

$$\|K_X - Q_X\| \le (N - M)\left(1 - \frac{\rho^2}{1 + M(M-1)\rho}\right)$$

Figure 5: Evaluation of $\|K_X - Q_X\|$ given the OIPS algorithm, and computation of the bound from Burt et al. (2019) given in equation 6 and of our bound given in equation 7.

D.1. Empirical Comparison

These bounds are difficult to compare due to the different parameters characterizing them. Nevertheless, we give an example by comparing the bounds and the empirical value on toy data drawn uniformly in 3 dimensions in figure 5. For each $N$ we ran our algorithm and used the resulting number of selected inducing points as the $M$ required in the bounds. We show in section 4 the empirical effect of the choice of $\rho$ on the accuracy and on the number of points.

E. Experiments parameters

For every problem we use an isotropic squared exponential kernel:

$$k(x, x') = v \exp\left(-\frac{\|x - x'\|^2}{h^2}\right)$$

where $h$ is initialized as the median of the lower triangular part of the pairwise distance matrix of the first subset of points, and is kept fixed for the rest of the training. Future work will involve kernel parameter optimization as well. We fix the noise of the Gaussian likelihood to $\sigma^2 = 0.01$.
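The median initialization of $h$ can be sketched as follows. The function name is hypothetical, and excluding the (zero) diagonal of the distance matrix is an assumption of this example.

```python
import numpy as np

def median_lengthscale(X):
    """Median of the strictly lower triangular part of the pairwise distance matrix."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    rows, cols = np.tril_indices(len(X), k=-1)   # k=-1 excludes the diagonal
    return float(np.median(dist[rows, cols]))

# Three points on a line: the pairwise distances are 1, 3 and 2.
X_first = np.array([[0.0], [1.0], [3.0]])
h = median_lengthscale(X_first)
```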

IPs were optimized via ADAM ($\alpha = 10^{-2}$).
