Document [original]

A collection of identities for variational inference

with exponential-family models

Dominik Endres, Kathrin Pabst, Anna-Lena Eckert, Raphael Schween

[email protected]

September 5, 2022

Abstract

This is a collection of identities useful for variational inference with

exponential family distributions/densities. All derviations were done by

the authors, unless indicated otherwise. This does not imply that the

results collected here have not appeared in the literature before. DIS-

CLAIMER: this collection is a work in progress. It is certainly in-

complete and probably buggy. Bug-reports and contributions are most

welcome, please email [email protected].

1 Exponential family distributions

A distribution is said to belong to the exponential family, if it can be written

in the form [1]:

p(x|η) = h(x)g(η) exp(ηTu(x)) (1)

where the random variates xmay be discrete or continuous, the sufficient

statistics uare functions of the x, not necessarily of the same dimensionality.

However, the uneed to be linearly independent. The ηare the natural param-

eters, one for each sufficient statistic. The function g(η) is the normalization

constant (replace the integral with a sum for discrete x):

g(η)Zdxh(x) exp(ηTu(x)) = 1 (2)

1.1 Moments

Taking the gradient ∇w.r.t. to ηon both sides of equation 2, we find:

(∇g(η)) Zdxh(x) exp(ηTu(x))

| {z }

g(η)

+g(η)Zdxh(x)u(x) exp(ηTu(x))

| {z }

hu(x)i

=0(3)

and thus the expectation hu(x)iis:

hu(x)i=−∇g(η)

g(η)=−∇log(g(η)) (4)

Computing the derivative of the i-th component of the l.h.s. of eqn. 3 w.r.t.

ηjyields (noting that ∂2log(f(x,y))

∂x∂y =∂2f(x,y)

∂x∂y /f(x, y)−∂f(x,y)

∂x

∂f(x,y)

∂y /f2(x, y)):

∂2g(η)

∂ηi∂ηv

g(η)−

∂g(η)

∂ηi

∂g(η)

∂ηj

g2(η)+∂g(η)

∂ηj

hui(x)i

g(η)=−g(η)Zdxh(x)ui(x)uj(x) exp(ηTu(x))

∂2log(g(η))

∂ηi∂ηj−hui(x)ihuj(x)i=−hui(x)uj(x)i

⇒ hui(x)uj(x)i−hui(x)ihuj(x)i=−∂2log(g(η))

∂ηi∂ηj

(5)

Denoting the Hessian by ∇∇, the covariance matrix of u(x) is thus given by

Cov(u(x)) = −∇∇log(g(η)) (6)

Higher order moments can be computed via higher order derviatives.

1.2 Maximum Likelihood

For maximum-likelihood approximations, we need the gradient of log(p(x|η))

w.r.t. η:

∇log(p(x|η)) = ∇log(g(η)) + u(x) = −hu(x)i+u(x).(7)

In other words, maximizing the likelihood (i.e. following the gradient towards

higher likelihood values) amounts to making the expected value of the suffi-

cient statistic more similiar to the actually observed sufficient statistic. For

second-order optimization methods, the Hessian matrix may be needed, which

is given by eqn. 6. For maximum likelihood parameter estimates, assume we

had observed Ni.i.d. datapoints xn. The parameter estimate is obtained by

solving

n=1 ∇log(p(xn|η)) = 0 ⇒ hu(x)i=PN

i=1 u(xn)

N(8)

i.e. by setting the data mean of the sufficient statistics equal to the expectation.

1.3 Entropy

The (differential) entropy of xgiven ηis defined as

H(x|η) = −Zdxp(x|η) log(p(x|η)).(9)

note that this is not the conditional entropy of xgiven η. Using the definition

of the exponential family distribution (eqn. 1) , this can be written as

H(x|η) = −Zdxp(x|η)log(g(η)) + log(h(x)) + ηTu(x)

and thus

H(x|η) = −log(g(η)) −hlog(h(x))i−ηThu(x)i(10)

For the gradient of the entropy w.r.t. ηwe need

∂

∂ηk

ηThu(x)i=huk(x)i+X

ηi

∂hui(x)i

∂ηk

=huk(x)i−X

ηi

∂2log(g(η))

∂ηi∂ηk

.(11)

Using eqn. 6, we find

∇ηThu(x)i=hu(x)i+ Cov(u(x)) ·η(12)

We also need

∂hlog(h(x))i

∂ηk

=∂

∂ηkg(η)Zdxh(x) log(h(x)) exp(ηTu(x))

=∂g(η)

∂ηk

hlog(h(x))i

g(η)+g(η)Zdxh(x) log(h(x))uk(x) exp(ηTu(x))

=∂log(g(η))

∂ηkhlog(h(x))i+hlog(h(x))uk(x)i

=−huk(x)ihlog(h(x))i+hlog(h(x))uk(x)i(13)

and thus

∇hlog(h(x))i=−hlog(h(x))ihu(x)i+hlog(h(x))u(x)i(14)

The gradient of the entropy w.r.t. the ηis thus

∇H(x|η) = −∇log(g(η))

| {z }

hu(x)i

−∇hlog(h(x))i−∇ηThu(x)i(15)

∇H(x|η) = hlog(h(x))ihu(x)i−hlog(h(x))u(x)i−Cov(u(x)) ·η(16)

1.4 Kullback-Leibler divergence

The KL-divergence of a distribution with parameters ˜

ηto a distribution with

parameters ηis:

D(p(x|˜

η)||p(x|η)) = Zdxp(x|˜

η) log p(x|˜

η)

p(x|η)

=Zdxp(x|˜

η) [log (p(x|˜

η)) −log (p(x|η))]

=−H(x|˜

η)−Zdxp(x|˜

η) log (p(x|η)) (17)

For this, we compute the following expectation under p(x|˜

η):

Zdxp(x|˜

η) log (p(x|η)) = Zdxp(x|˜

η)log(g(η)) + log(h(x)) + ηTu(x)

= log(g(η)) + hlog(h(x))i+ηThu(x)i(18)

and thus, using eqn. 10:

D(p(x|˜

η)||p(x|η)) = log(g(˜

η)) −log(g(η)) + (˜

ηT−ηT)hu(x)i(19)

Its gradient w.r.t. ˜

ηis then (using eqn. 4 and eqn. 6)

∇D(p(x|˜

η)||p(x|η)) = ∇log(g(˜

η)) + ∇(˜

ηT−ηT)hu(x)i

=−hu(x)i+hu(x)i+∇hu(x)i(˜

η−η) (20)

where ∇hu(x)iacts on ucomponent-wise, yielding a matrix (ij) of deriva-

tives ∂ui(x)

∂˜

ηj. Thus

∇D(p(x|˜

η)||p(x|η)) = −∇∇log(g(˜

η)) ·(˜

η−η) = Cov(u(x))(˜

η−η) (21)

2 Conjugate priors on exponential family distri-

butions

For inference and learning in hierarchical models, conjugate priors on the pa-

rameters ηare very useful, because inference/learning with i.i.d. observations

translates into parameter updates (rather than complicated integrals). The

conjugate prior on an exponential family distribution (eqn. 1) is given by

p(η|λ, ν) = f(λ, ν)m(η)g(η)νexp(νηTλ) (22)

where the λ, ν are the parameters of the p(oste)rior, g(η) is the same function

as above and m(η) is an arbitrary positive function (different from g(η)). To see

that this is a conjugate prior on p(x|η), assume we had observed Ndatapoints

xn. The posterior of ηis then (using eqns. 1 and 22)

p(η|λ, ν, x1:N) = p(x1:N,η|λ, ν)

p(x1:N|λ, ν)

=QN

n=1 p(xn|η)p(η|λ, ν)

RdηQN

n=1 p(xn|η)p(η|λ, ν)

=QN

n=1 g(η)h(xn) exp(ηTu(xn)) ·f(λ, ν)m(η)g(η)νexp(νηTλ)

RdηQN

n=1 g(η)h(xn) exp(ηTu(xn)) ·f(λ, ν)m(η)g(η)νexp(νηTλ)

=[Qnh(xn)] f(λ, ν)m(η)g(η)ν+Nexp ηT(νλ+Pnu(xn))

[Qnh(xn)] f(λ, ν)Rdηm(η)g(η)ν+Nexp (ηT(νλ+Pnu(xn)))

=m(η)g(η)ν+Nexp ηT(νλ+Pnu(xn))

Rdηm(η)g(η)ν+Nexp (ηT(νλ+Pnu(xn))) (23)

Note that this expression depends on the data x1:Nonly through Nand

Pnu(xn). This is why the u(x) are called sufficient statistics: they contain

all the information about ηwhich we need from the data to determine the

parameter posterior. A similar result holds for maximum-likelihood learning,

see [1]. By introducting the

posterior parameters

˜ν:= ν+N(24)

λ:= νλ+Pnu(xn)

˜ν(25)

thus, νλ+PNu(xn) = ˜ν˜

λ. We furthermore identify f(λ, ν) as the normaliza-

tion constant of the p(oste)rior, i.e.

f(˜

λ,˜ν) = 1

Rdηm(η)g(η)˜νexp ηT˜ν˜

λ

and finally, plugging these identities back into eqn. 23, we obtain:

p(η|λ, ν, x1:N) = f(˜

λ,˜ν)m(η)g(η)˜νexp ηT˜ν˜

λ=p(η|˜

λ,˜ν).(26)

In other words, given an exponential family observation model, i.i.d. data and

a conjugate prior, we obtain posterior just by replacing the prior parameters

according to eqns. 24 and 25. Furthermore, note that



According to eqn. 24, ˜νkeeps track of the number of observed data-

points. Since it also contains prior information via ν, it is referred to as a

pseudocount.



For large enough N, the posterior is unimodal, and the log-posterior is

convex. The width of the maximum is monotonically decreasing in ˜ν

(Can be shown by computing the Hessian of the log-posterior). Hence, ˜ν

is also called the concentration parameter.



The posterior ˜

λis just a weighted mean of the prior λand and the

observed data.



These posterior updates can be iterated, i.e. the accumulation of the

sufficient statistics can be restarted at any point. The extreme case of

updating after every datapoint, i.e. online learning, boils down to keeping

track of this weighted average datapoint per datapoint.

2.1 Maximum-a-posteriori (MAP) parameter estimates

Instead of working with the full posterior of the natural parameters η(eqn.

26), it is sometimes enough to use the parameter values which maximize the

posterior, i.e. the numerator of eqn. 26 (the denominator does not depend on

ηafter the integration). Setting the derivative of the log of the numerator to

zero, we find

∇ηlog(m(η)) + (ν+N)∇ηlog(g(η))

| {z }

−hu(x)i,eqn. 4

+(νλ+X

u(xn)) !

= 0

⇒νλ+Pnu(xn)

N+ν+∇ηlog(m(η))

N+ν

=hu(x)i(27)

i.e. the posterior maximum is located at a point where the expected value of the

natural parameters is equal to the quotient of the posterior parameters (eqns.

24,25) plus a term depending on m(η). The latter is often zero, since m(η)=1

for many distributions (see tables in section 4).

2.2 Expectations

Computing parameter expectations (i.e. of ηand functions thereof) of a conju-

gate p(oste)rior can be done similar to the expectations of an exponential family

distribution. From the normalization equation

f(λ, ν)Zdηm(η)g(η)νexp νηTλ= 1 (28)

follows

∇λf(λ, ν)Zdηm(η)g(η)νexp νηTλ= 0

⇒ ∇λf(λ, ν)Zdηm(η)g(η)νexp νηTλ

| {z }

f(λ,ν)−1

=−f(λ, ν)Zdηm(η)g(η)νexp νηTλνη

⇒∇λf(λ, ν)

f(λ, ν)=−νhηi(29)

and thus

hηi=−∇λlog(f(λ, ν))

ν(30)

Likewise, from the derivative w.r.t. νwe find, noting that ∂g(η)ν

∂ν = log(g(η))g(η)ν:

hlog(g(η))i+λThηi=−∂log(f(λ, ν))

∂ν (31)

For the second moments, take the derivatives of hηi(see eqn. 29):

∂hηi

∂λj

=∂

∂λjf(λ, ν)Zdηm(η)g(η)νexp νηTλη

=∂f(λ, ν)

∂λjZdηm(η)g(η)νexp νηTλη

| {z }

=hηi

f(λ,ν)

+f(λ, ν)Zdηm(η)g(η)νexp νηTλνηηj

=∂log(f(λ, ν))

∂λjhηi+νhηηji

=−νhηjihηi+νhηηji(32)

⇒1

∂hηki

∂λj

=hηjηki−hηjihηki(33)

and thus

Cov(η) = −∇λ∇λlog(f(λ, ν))

ν2(34)

Likewise, computing the derivative of eqn. 31 on both sides yields

∂hlog(g(η))i

∂ν =−∂2log (f(λ, ν)))

∂ν2−λT∂hηi

∂ν (35)

where

∂hlog(g(η))i

∂ν =∂log(f(λ, ν))

∂ν hlog(g(η))i+log(g(η))2+λThlog(g(η))ηi

=log(g(η))2−hlog(g(η))i2+λThlog(g(η))ηi−λThlog(g(η))ihηi

= Var(log(g(η))) + Cov(log(g(η)),λTη)

∂hηii

∂ν =∂log(f(λ, ν))

∂ν hηii+hηilog(g(η))i+λThηηii(36)

=hηilog(g(η))i−hηiihlog(g(η))i+λThηηii−λThηihηii

= Cov(log(g(η)), ηi) + Cov(λTη, ηi)

⇒λT∂hηi

∂ν = Cov(log(g(η)),λTη) + Var(λTη) (37)

and thus, noting that Var(x+y) = Var(x) + Var(y) + 2 Cov(x, y):

Var log(g(η)) + λTη=−∂2log (f(λ, ν)))

∂ν2(38)

Another expectation that can be computed from the normalization constant

f(λ, ν) is

hg(η)ki=f(λ, ν)Zdηm(η)g(η)ν+kexp νηTλ.(39)

To evaluate the integral, choose new parameters ν0,λ0such that

ν0=ν+k(40)

ν0λ0=νλ⇒λ0=ν

ν0λ.(41)

With these parameters, the integral is in exponential family normal form, and

thus

hg(η)ki=f(λ, ν)

f(λ0, ν0)(42)

hg(η)i=f(λ, ν)

f(ν

ν+1 λ, ν + 1) (43)

Var(g(η)) = f(λ, ν)

f(ν

ν+2 λ, ν + 2) −"f(λ, ν)

f(ν

ν+1 λ, ν + 1)#2

(44)

2.3 Predictive distribution, entropy and log-likelihood

Let α > 0. The predictive distribution and related quantities can be derived

from the integral

Zdηp(x|η)αp(η|λ, ν) = h(x)αf(λ, ν)Zdηg(η)αexp αηTu(x)m(η)g(η)νexp νηTλ

=h(x)αf(λ, ν)Zdηm(η)g(η)ν+αexp νηTλ+αηTu(x)

=h(x)αf(λ, ν)Zdηm(η)g(η)ν+αexp (ν+α)ηTνλ+αu(x)

ν+α

hp(x|η)α)ip(η|λ,ν)=h(x)αf(λ, ν)1

fνλ+αu(x)

ν+α, ν +α(45)

where the last line follows from the normalization equation 28. For α= 1, the

integral on the l.h.s. is the expectation of p(x|η) under the prior:

p(x|λ, ν) = h(x)f(λ, ν)

fνλ+u(x)

ν+1 , ν + 1(46)

Differentiating with respect to αyields:

∂

= log(h(x)) h(x)αf(λ, ν)

fνλ+αu(x)

ν+α, ν +α

−h(x)αf(λ, ν)

fνλ+αu(x)

ν+α, ν +α2·ν

(ν+α)2∇λ0f(λ0, ν0)(u(x)−λ) + ∂f(λ0, ν0)

∂ν0

where the derivatives are evaluated at λ0=νλ+αu(x)

ν+αand ν0=ν+α. For α= 0,

we obtain the expected log-likelihood, using eqn. 30:

hlog(p(x|η))ip(η|λ,ν)= log(h(x)) + hηTi(u(x)−λ)−∂log(f(λ, ν))

∂ν (47)

For α= 1, we obtain expectations of the form hlog(p(x|η))p(x|η))ip(η|λ,ν),

which are the terms required for the computation of the expected entropy of x:

λ0=νλ+u(x)

ν+ 1 (48)

ν0=ν+ 1 (49)

hlog(p(x|η))p(x|η)ip(η|λ,ν)=p(x|λ, ν)hlog(h(x)) + ν

ν0hηTiλ0,ν0(u(x)−λ)

−∂log(f(λ0, ν0))

∂ν0(50)

2.4 Expected sufficient statistics

The expectations of the sufficient statistics ufor fixed natural parameters ηis

given by eqn. 4:

hu(x)ip(x|η)=−∇log(g(η)) (51)

If ηis drawn from a conjugate prior, the expectation of u(x) under the prior is

given by averaging the r.h.s. over p(η|λ, ν),

hu(x)ip(x|η)p(η|λ,ν)=−h∇log(g(η))ip(η|λ,ν)(52)

To compute this expectation, one can use the Divergence Theorem from vec-

tor calculus (Ostrogradsky-Gauss). For a differentiable vector field f(x), this

theorem states that

dx∇·f(x) = IS

ds f(x) (53)

where Sis the surface enclosing the volume V. As a special case, consider the

field f(x) = cz(x), with ~c a constant vector and z(x) a smoothly differentiable

scalar function. Then, using ∇·cz(x) = Pi

∂ciz(x)

∂xi=cT∇z(x) we find

cTZV

dx∇z(x) = cTIS

ds z(x) (54)

and since this holds for any c, it follows that

dx∇~z(x) = IS

ds z(x) (55)

This identity can be used to compute the expectation on the r.h.s. of eqn.

52 by integrating the gradient of p(η|λ, ν):

Zdη∇ηp(η|λ, ν) = f(λ, ν)Zdη∇ηm(η)g(η)νexp νηTλ

=f(λ, ν)Zdη∇ηm(η)

m(η)+ν∇ηg(η)

g(η)+νλp(η|λ, ν)

=h∇ηlog(m(η))ip(η|λ,ν)+νh∇ηlog(g(η))ip(η|λ,ν)+νλ

=IS

dsp(η|λ, ν) (56)

where the surface Sencloses the range of η. Hence, the expectation is:

hu(x)ip(x|η)p(η|λ,ν)=λ+h∇ηlog(m(η))ip(η|λ,ν)−Hdsp(η|λ, ν)

ν(57)

If η∈RDwith no further constraints, then p(η|λ, ν)→0 on the surface of the

range of η, since p(η|λ, ν) has to be normalizable. Hence, the surface integral

must be zero. This is e.g. the case for the Multinomial distribution and the

Poisson distribution. Furthermore, if m(η) = const., then the gradient vanishes

(Dirichlet,Gamma, Stick-breaking for ci= 0, see tables in appendix). In those

cases, the above expression simplifies to

hu(x)ip(x|η)p(η|λ,ν)=λ(58)

The expectation for the multivariate Gaussian is also computable, see ap-

pendix.

2.5 Maximum likelihood

For maximum-likelihood approximations, we need the gradient of log(p(η|λ, ν))

w.r.t. λand ν:

∇λlog(p(η|λ, ν)) = ∇λlog(f(λ, ν)) + νη

=−νhηi+νη=ν(η−hηi) (59)

∂

∂ν log(p(η|λ, ν)) = ∂f(λ, ν)

∂ν + log(g(η)) + ηTλ

= (log(g(η)) −hlog(g(η))i) + λT(η−hηi) (60)

Similar to the gradient of the exponential family distributions, this gradient

points towards the actual value of ηand away from the expectation.

2.6 Entropy

The differential entropy (not the conditional entropy) of ηgiven λand νis

H(η|λ, ν) = −f(λ, ν)Zdηm(η)g(η)νexp(νηTλ)log(f(λ, ν)) + log(m(η)) + νlog(g(η)) + νηTλ

=−log(f(λ, ν)) −hlog(m(η))i−νhhlog(g(η))i+λThηii(61)

where the expectations are w.r.t. the p(oste)rior eqn. 22. Using eqn. 31, this

can be rewritten as

H(η|λ, ν) = −log(f(λ, ν)) −hlog(m(η))i+ν∂log(f(λ, ν))

∂ν (62)

The derivates of this entropy are therefore:

∂H(η|λ, ν)

∂ν =ν∂2log (f(λ, ν))

∂ν2−∂hlog(m(η))i

∂ν (63)

and (using eqn. 36 and 30):

∇λH(η|λ, ν) = −∇λlog(f(λ, ν)) + ν∇λ

∂log(f(λ, ν))

∂ν −∇λhlog(m(η))i

=νhηi+ν∂

∂ν ∇λlog(f(λ, ν)) −∇λhlog(m(η))i

=νhηi−ν∂

∂ν νhηi−∇λhlog(m(η))i

=νhηi−νhηi−ν2∂hηi

∂ν −∇λhlog(m(η))i

=−ν2Cov(log(g(η)),η) + Cov(λTη,η)−∇λhlog(m(η))i

∇λH(η|λ, ν) = −∇λlog(f(λ, ν)) + ν∇λ

∂log(f(λ, ν))

∂ν −∇λhlog(m(η))i

(64)

∇λH(η|λ, ν) = −ν2∂hηi

∂ν −∇λhlog(m(η))i(65)

2.7 Kullback-Leibler divergence

The KL-divergence of a distribution with parameters ˜

λ,˜νto a distribution with

parameters λ, ν is given by

D(p(η|˜

λ,˜ν)||p(η|λ, ν)) = Zdηp(η|˜

λ,˜ν)hlog(p(η|˜

λ,˜ν)) −log(p(η|λ, ν))i

=−H(η|˜

λ,˜ν)−Zdηp(η|˜

λ,˜ν) log(p(η|λ, ν)) (66)

The second term on the r.h.s. is given by (expectations w.r.t p(η|˜

λ,˜ν))

Zdηp(η|˜

λ,˜ν) log(p(η|λ, ν)) = log(f(λ, ν))+hlog(m(η))i+νhlog(g(η))i+νλhηi

(67)

and thus, using eqn. 61 and 31 we find

D(p(η|˜

λ,˜ν)||p(η|λ, ν)) = log f(˜

λ,˜ν)

f(λ, ν)!−(˜ν−ν)∂log(f(˜

λ,˜ν))

∂˜ν+ (˜

λT−λT)νhηi

(68)

The derivatives are:

∇˜

λD(p(η|˜

λ,˜ν)||p(η|λ, ν)) = ∇˜

λlog(f(˜

λ,˜ν)) −(˜ν−ν)∂

∂˜ν∇˜

λlog(f(˜

λ,˜ν))

+νhηi− ν

˜ν(˜

λT−λT)∇˜

λ∇˜

λlog(f(˜

λ,˜ν))

=−˜νhηi+ (˜ν−ν)∂

∂˜ν˜νhηi

+νhηi− ν

˜ν(˜

λT−λT)∇2

λlog(f(˜

λ,˜ν))

= ˜ν(˜ν−ν)∂hηi

∂˜ν−ν

˜ν(˜

λT−λT)∇2

λlog(f(˜

λ,˜ν))

= ˜ν(˜ν−ν)∂hηi

∂˜ν+ν(˜

λT−λT)∇˜

λhηi(69)

∇˜

λD(p(η|˜

λ,˜ν)||p(η|λ, ν)) = ˜ν(˜ν−ν)∂hηi

∂˜ν+ν(˜

λT−λT)∇˜

λhηi(70)

where ∇˜

λhηiis a matrix with entries ∇˜

λhηii,j =∂hηii

∂˜

λj. Likewise,

∂

∂˜νD(p(η|˜

λ,˜ν)||p(η|λ, ν)) = −(˜ν−ν)∂2f(˜

λ,˜ν)

∂˜ν2−ν(˜

λT−λT)∂h˜

ηi

∂˜ν(71)

3 Variational approximations with exponential

family distributions

3.1 Variational inference with conjugate p(oste)riors

In variational inference, we replace an intractable distribution (or density)

q(X|d) (i.e. one where marginals and conditionals are hard to compute) with

a tractable, factorized approximation q(X). dis the observed data. Strictly

speaking, q(X) = q(X|d) but it is customary to omit writing this condition-

ing, since it is only approximate. The approximation is linked to the correct

distribution via the variational bound also ’evidence lower bound (ELBO)’:

log(q(d)) = log X

q(d,x)!= log X

q(x)p(d|x)p(x)

p(x)!

≥Zx

q(x) log p(d|x)p(x)

q(x)=hlog p(d|x)iq(X)−D(q(X)||p(X))

⇒L(q, d) = hlog p(d|x)iq(X)−D(q(X)||p(X)) ≤log(p(d)) (72)

where the second line follows from Jensen’s inequality for convex functions and

the definition of the Kullback-Leibler divergence. L(q, d) is a lower bound on the

log-marginal-likelihood, which we try to maximize w.r.t. q(x). The resulting

q(x) is an approximate version of the correct posterior p(x|d) which will be

exact iff p(x|d) is contained in the class of distributions which can be modeled

by q(x). In that case (and only in that case), the bound will be tight.

In the following, we will derive the posterior update rules for the case p(X)

is conjugate to q(d|X) and both are in the exponential family. We also assume

that q(X) is conjugate to the likelihood, such that posterior updated reduce to

parameter updates, like in section 2. Lastly, assume that the data are comprised

of Ni.i.d. observations, i.e. p(d|X) = QN

i=1 p(di|X). We use a generalized

version of the ELBO, which has an inverse temperature parameter β≥0:

L(q, d) = hlog p(d|x)iq(X)−βD(q(X)||p(X)) ≤log(P(d)) (73)

β6= 1 can be used to model deviations from optimal inference, or for stochastic

updating in minibatches etc.. Denote the prior parameters with ν, λand the

posterior parameters with ˜ν, ˜

λ.

The expected log-likelihood under the posterior hlog p(d|x)iq(X)for Ndat-

apoints can then be computed from eqn. (47):

hlog(p(d|η))iq(η|˜

λ,˜ν)=

i=1

log(h(di))+hηTi(

i=1

u(di)−N˜

λ)−N∂log(f(˜

λ,˜ν))

∂˜ν

(74)

where hηTiis given by eqn. (30) and the KL-divergence D(q(X)||p(X)) by eqn.

(68). To maximize eqn. (73) w.r.t. the posterior parameters ˜

λ,˜ν, we will rewrite

the elbo as difference between one part that does not depend on the posterior

parameters, and a KL-divergence between the posterior, and a distribution in

the same exponential family as the posterior that depends on parameters λ0, ν0

Because the KL-divergence is zero exactly if the two distributions that enter it

are pointwise equal, we can then compute the maximal ELBO by setting ˜

λ=λ0

and ν0= ˜ν.

L(q, d) = hlog p(d|x)iq(X)−βD(q(X)||p(X))

i=1

log(h(di)) + hηTi N

i=1

u(di)−N˜

λ!−N∂log(f(˜

λ,˜ν))

∂˜ν

−β"log f(˜

λ,˜ν)

f(λ, ν)!−(˜ν−ν)∂log(f(˜

λ,˜ν))

∂˜ν+ (˜

λT−λT)νhηi#

i=1

log(h(di)) −βlog f(˜

λ,˜ν)

f(λ, ν)!

+hηTi N

i=1

u(di) + βνλ−(N+βν)˜

λ!

−(N+βν −β˜ν)∂log(f(˜

λ,˜ν))

∂˜ν(75)

Define

Posterior parameters for β-variational update

ν0=N

β+ν(76)

λ0=PN

i=1 u(di) + βν ˜

N+βν =νλ+PN

i=1 u(di)/β

ν0(77)

and note the similarity of these definitions with the exact posterior updates eqn.

(26) – all data-related quantities have been divided by β. Collecting terms in

the ELBO eqn. (75) and plugging in these definitions, we find

L(q, d) =

i=1

log(h(di)) −βlog f(˜

λ,˜ν)

f(λ, ν)!

+βν0hηTiλ0−˜

λ

−β(˜ν−ν0)∂log(f(˜

λ,˜ν))

∂˜ν(78)

Comparing the last two lines to the expression for the KL divergence eqn. (68),

we find that up to a log f(˜

λ,˜ν)

f(λ0,ν0)term and a factor β, these lines are a KL-

divergence. Inserting and subtracting this term, we find

X1X2

X6X7

X3X5

X1X2

X6X7

X3X5

p(X4|X1)p(X5|X2)

p(X3|X1, X2)

p(X6|X4, X3)p(X7|X5, X3)

p(X1)p(X2)

Figure 1: Left: Bayesian network with (undirected) loops. Exact sum-product

can not be run on this graph. Right: corresponding factor graph. For vari-

ational message passing, a node (e.g. X3) needs to receive messages from all

members of its Markov blanket, which are the variables connected to neighbour-

ing factors (for X3: all other nodes). through the connecting factors.

L(q, d) =

i=1

log(h(di)) −βlog f(˜

λ,˜ν)

f(λ, ν)!+βlog f(˜

λ,˜ν)

f(λ0, ν0)!

−βD(q(X|˜

λ,˜ν)||p(X|λ0, ν0))

i=1

log(h(di)) −βlog f(λ0, ν0)

f(λ, ν)(79)

−βD(q(X|˜

λ,˜ν)||p(X|λ0, ν0)) (80)

which is maximal if D(q(X|˜

λ,˜ν)||p(X|λ0, ν0)) = 0, which happens if and only

if ˜

λ=λ0and ν0= ˜ν. Thus, the maximal ELBO after the posterior update is

given by eqn. (79), which is achieved for the parameters given in eqns. (76).

Now we derive an expression for the expected log-likelihood eqn. (74) that

depends only on the p(oste)rior parameters. This is possible because these

parameters are computed from the sufficient statistics (which are by definition

sufficient to determine the likelihood). Rewrite the posterior parameters as

N=β(ν0−ν) and β(λ0ν0−λν) and substitute these expressions into the log-

likelihood, then

hlog(p(d|η))iq(η|˜

λ,˜ν)=

i=1

log(h(di))+βνhηTi(λ0−λ)−β(ν0−ν)∂log(f(λ0, ν0))

∂ν0

(81)

3.2 Variational message passing

Denote the set of indexes of latent variables by L, the set of indexes of observed

variables with O, and the set of all indexes by Nsuch that L∩O=∅and

L∪O=N. We consider a fully factorized approximation, i.e. one where the

density of the latent variables

Q(x) = Y

i∈L

Qi(xi) = Y

i∈L

Q(xi)

is a product over distributions of individual variables. Strictly speaking, the

notation in the middle is correct because there is one density per variable. We

omit the extra index and assume the reader knows what is meant. Let the

correct density be expressed as a Bayes net,

P(d,x) = Y

j∈O

P(dj|paXj)Y

i∈L

P(xi|paXi)

and djis the observed data at node Xj. Furthermore, to simplify notation, we

introduce the ”sum-product” symbol:

i∈K

Q(xi) := X

xi:i∈KY

i∈K

Q(xi) (82)

The bound then is:

L(Q, d) = X

i∈L

Q(xi)

X

j∈O

log(P(dj|paXj)) + X

k∈L

log(P(xk|paXk)) −X

k∈L

log(Q(xk))



i∈L

Q(xi)

X

j∈O

log(P(dj|paXj)) + X

k∈L

log(P(xk|paXk))



−X

i∈LX

Q(xi) log(Q(xi))

j∈OXY

i∈L∩paXj

Q(xi) log(p(dj|paXj))

k∈LXY

i∈L∩paXk∪{k}

Q(xi) log(P(xk|paXk))

−X

i∈LX

Q(xi) log(Q(xi)) (83)

To find the Q(x) that maximizes the bound, we take the derivative w.r.t. q(x)

and set it to zero. This is a necessary condition for a maximum, it can be shown

that it is sufficient, too. We furthermore impose the constraint that all q(xi)

have to be distributions, i.e. q(xi)≥0 and Pxiq(xi) = 1. It will turn out that

we do not have to impose the first constraint, we do however need to make sure

the second one is fulfilled. This can be achieved by a Lagrange multiplier for

each distribution. The Lagrangian functional therefore is

L(Q, d) = L(Q, d) + X

i∈L

τi X

Q(xi)−1!(84)

A necessary condition for an extemum of L(Q, d) is a stationary point of L(Q, d),

i.e. the derivatives w.r.t. the components of Q(x) have to vanish:

δL(Q, d)

δQ(Xm)=X

j∈O∩chXmXY

i∈L∩paXj\{m}

Q(xi) log(P(dj|paXj))

+XY

i∈L∩paXm

Q(xi) log(P(xm|paXm))

k∈L∩chXmXY

i∈L∩paXk∪{k}\{m}

Q(xi) log(P(xk|paXk))

−log(Q(xm)) −1 + τm(85)

which we set to 0 and solve for Q(xm) to find the optimal approximate posterior

marginals. To do so, we interpret the terms in eqn. 85 as messages sent to node

Xmon the factor graph corresponding to the Bayesian network (see fig. 1 for

an example).

Define the messages sent a variable node Xmto a neighbouring factor node

f(Xm, . . .) as just the variational posterior marginals for unobserved nodes, and

a 1-or-0 message for observed nodes:

∀m∈L:µXm→f(...,Xm,...)(Xm) = Q(Xm) (86)

∀j∈O:µXj→f(...,Xj,...)(Xj) = δxj,dj(87)

and the messages sent from a factor f(...,Xm, . . .) depending on variables with

indices j∈F, m ∈Fto a neighbouring variable node as:

µf(...,Xm,...)→Xm(Xm) = XY

j∈F\{m}

Q(xj)f(...,xm, . . .) (88)

=XY

j∈F\{m}

f(...,xm, . . .)µXj→f(...,Xj,...)(xj) (89)

i.e. the average over the factor with respect to the posterior of all variables that

connect to it, except for the variable where the message is being sent to. With

these message definitions, eqn. 85 becomes

δL(Q, d)

δQ(Xm)=X

j∈O∩chXm

µlog(P(Xj|paXj))→Xm(Xm)

+µlog(P(Xm|paXm))→Xm(Xm)

k∈L∩chXm

µlog(P(Xk|paXk))→Xm(Xm)

−log(Q(xm)) −1 + τm(90)

j∈chXm

µlog(P(Xj|paXj))→Xm(Xm)

+µlog(P(Xm|paXm))→Xm(Xm)

−log(Q(Xm)) −1 + τm(91)

Defining exp(τm−1) = 1

Zm, we can solve for Q(Xm) now:

Q(Xm) = 1

exp 

X

j∈chXm

µlog(P(Xj|paXj))→Xm(Xm) + µlog(P(Xm|paXm))→Xm(Xm)



(92)

In other words, the variational posterior at variable node Xmis computed by

adding up all incoming messages, exponentiating, and normalizing (since Ziis

computed from the Lagrange multiplier which enforces normalization). This

message-passing scheme has to be iterated until convergence, which is guaran-

teed since the bound L(Q, D) is a Lyapunov function of the iteration dynamics.

Another way of deriving this algorithm without computing derivatives is via

the KL divergence D(Q(X)||˜

Q(X)). Recall that the KL divergence is positive,

and zero if and only if the distributions are equal everywhere. Assume again we

wanted to maximize eqn. 83 w.r.t. Q(Xk). To carry out this maximization, we

only need to consider terms which depend on Q(Xk), which in turn depend on

the members of Xk’s Markov blanket:

argmax

Q(Xk)

[L(Q, d)] = argmax

Q(Xk)



X

j∈O∩chXkXY

i∈L∩paXj

Q(xi) log(P(dj|paXj))

j∈L∩chXk∪{k}XY

i∈L∩paXj∪{j}

log(P(xj|paXj))

−X

Q(xk) log(Q(xk))#(93)

Note that all terms on the r.h.s. include a factor Q(xk) and a summation over

xk. We can therefore pull it out:

argmax

Q(Xk)

[L(Q, d)] = argmax

Q(Xk)



X

Q(xk)



X

j∈O∩chXkXY

i∈L∩paXj\{k}

Q(xi) log(P(dj|paXj))

+XY

i∈L∩paXk

log(P(xk|paXk))

j∈L∩chXkXY

i∈L∩paXj∪{j}\{k}

log(P(xj|paXj)) −log(Q(xk))







(94)

With the message definitions (eqns. 86 – 88) the r.h.s. can be written as

argmax

Q(Xk)

[L(Q, d)] = argmax

Q(Xk)



X

Q(xk)



X

j∈O∩chXk

µlog(P(dj|paXj))→Xk(xk)

+µlog(P(Xk|paXk))→Xk(xk)

j∈L∩chXk

µlog(P(Xj|paXj)→Xk(xk)−log(Q(xk))







(95)

i.e. we need the incoming messages from all neighbouring factor nodes to com-

pute this expression. Note that the unions of the index sets of the sums in the

first and the last line are simply the indexes of all children of Xk, whereas the

message on the second line is the incoming message from the parents. Thus,

define

log(U(xk)) = X

j∈chXk

µlog(P(dj|paXj))→Xk(xk) + µlog(P(Xk|paXk→Xk(xk) (96)

and let

U(xk) = Zk˜

Q(xk) (97)

with Pxk

Q(xk) = 1 and Zk>0, i.e. ˜

Q(Xk) is a probability distribution. With

these definition we obtain

argmax

Q(Xk)

[L(Q, d)] = argmax

Q(Xk)"X

Q(xk) log(U(xk)) −log(Q(xk))#

= argmax

Q(Xk)"X

Q(xk) log(Zk) + X

Q(xk) log ˜

Q(xk)

Q(xk)!#

= log(Zk)−argmax

Q(Xk)hD(Q(Xk)||˜

Q(Xk))i(98)

Since the KL-divergence is ≥0, it follows that the variational bound L(Q, d)

is maximized if Q(Xk) = ˜

Q(Xk). In other words, to compute the optimal

distribution at a given variable node given the distributions of the variables in

its Markov blanket, do the following:



sum all incoming messages from neighbouring factor nodes,



exponentiate,



normalize.

The factor nodes collect messages from their neighbouring variable nodes, and

compute messages by summing their log-factor over all variables except the one

where the message is being sent to, similar to sum-product message passing.

Figure 2: Variational approximation of parameter (Θ) learning in a Bayesian

network with a fully factorized approximation. A: fragment of a Bayesian net-

work. paXare the parents of X, which may have parents and other chil-

dren. For learning the parameters Θ, these other children/parents are not rele-

vant. B: corresponding factor graph fragment. The parameters Θ appear only

in the factor connecting Xto its parents. C: fully factorized approximation

and Dcorresponding factor graph. When computing the variational bound,

log(P(X|Θ,paX)) has to be averaged over all variables that appear in it, which

are the neighbours of the factor node P(X|Θ,paX) in B.

3.3 Learning parameters with exponential family distri-

butions

To apply variational message passing, it is necessary to know the factors, which

are the conditional probability distributions in case of a Bayesian network. If

we want to learn these factors from data, then it is useful to have a compact

parametrization of them, which can be done with exponential family distri-

butions and their conjugate p(oste)riors. Consider the network (fragment) in

fig. 2. Assume we wanted to learn the conditional distribution of Xgiven its

parents, and parametrize this distribution by Θ. We lump these parents to-

gether in one supernode. Xmay be continuous or discrete, but we assume that

the parents of X, paXare discrete (in some special cases, continuous models

are tractable). Also assume that the distribution of Xgiven paXis from the

exponential family, i.e.

p(X|paX,ηpaX) = h(x)g(ηpaX) exp(ηT

paXu(x)) (99)

i.e. there is one parameter vector ηpaXfor each value of paX, and Θ is the

concatentation of these parameter vectors. The conjugate prior on each ηpaX

is then

p(ηpaX|λpaX, νpaX) = f(λpaX, νpaX)g(ηpaX)νexp(νpaXηT

paXλpaX) (100)

Assume now we had observed n= 1, . . . , N datapoints d1:Nand computed

the corresponding latent variable distributions Q(Xn), Q(pan

X). If any of the

nodes are observed, replace the corresponding distribution with a distribution

concentrated at the observed value. Note that in a fully factorized approxi-

mation, each Q(paXn) is a actually a product over the parents, Q(paXn) =

QYn∈paXnQ(Yn). Looking at fig. 2, we note that there is exactly one fac-

tor connecting the parameter node to Xand paX, and the prior factor for Θ.

Using an approximating posterior for Q(Θ) which has the same form as the

prior (eqn. 100), the summands in the variational bound which depend on the

posterior distribution of Θ are (where yn,y∈range(paXn)):

LΘ=

n=1 X

xnX

Q(xn)Q(yn)hlog(P(xn|Θ,yn))iQ(Θ) −βD(Q(Θ)||P(Θ))

n=1 X

xnX

Q(xn)Q(yn)log(P(xn|ηyn))Q(ηyn|˜

λyn,˜νyn)

−X

βD(Q(ηy|˜

λy,˜νy)||P(ηy|λy, νy)) (101)

We rewrite the first term as a sum over the possible values of xand paX:

LΘ=X

ylog(P(x|ηy))Q(ηy|˜

λy,˜νy)

n=1

Q(Xn=x)Q(paXn=y)

−X

βD(Q(ηy|˜

λy,˜νy)||P(ηy|λy, νy)) (102)

and define the ”responsibilities” (because they measure how much a given setting

of the latent variables contributes to explaining a datapoint)

rx,y=

n=1

Q(Xn=x)Q(paXn=y) (103)

Using the definition of the exponential familiy distribution (eqn. 1), the Kullback-

Leibler divergence between conjugate p(oste)riors (eqn. 68) and denoting Q(ηy) =

Q(ηy|˜

λy,˜νy), we find

LΘ=X

yhlog(h(x)) + hlog(g(ηy))iQ(ηy)+u(x)ThηyiQ(ηy)irx,y

−X

y"βlog f(˜

λy,˜νy)

f(λy, νy)!−β(˜νy−νy)∂log(f(˜

λy,˜νy))

∂˜νy

+β(˜

λT

y−λT

y)νyhηyiQ(ηy)#

(104)

The expectations hlog(g(ηy))iQ(ηy)can be expressed using eqn. 31:

hlog(g(ηy))iQ(ηy)=−∂log(f(˜

λy,˜νy)

∂˜νy−˜

λT

yhηyiQ(ηy)(105)

Inserting this expression into eqn. 104:

LΘ=X

y"log(h(x)) −∂log(f(˜

λy,˜νy)

∂˜νy−˜

λT

yhηyiQ(ηy)+u(x)ThηyiQ(ηy)#rx,y

−X

y"βlog f(˜

λy,˜νy)

f(λy, νy)!−β(˜νy−νy)∂log(f(˜

λy,˜νy))

∂˜νy

+β(˜

λT

y−λT

y)νyhηyiQ(ηy)#

(106)

Collecting terms, we find:

LΘ=X

log(h(x))rx,y−X

βlog f(˜

λy,˜νy)

f(λy, νy)!

∂log(f(˜

λy,˜νy)

∂˜νy"β(˜νy−νy)−X

rx,y#

−X

νyβ˜

λT

y−λT

y+˜

λT

yPxrx,y

νy−Pxrx,yu(x)T

νyhηyiQ(ηy)

(107)

With the expressions

ˆνy=νy+Px,yrx,y

β=νy1 + Pxrx,y

νyβ

λy=νyλy+Pxrx,y

βu(x)

νy+Pxrx,y

=νy

ˆνy

λy+Pxrx,y

βu(x)

νy

(108)

and noting that

−βˆνy=−βνy−X

x,y

rx,y(109)

νyβ˜

λT

y+˜

λT

yPxrx,y

νy=βνy˜

λT

y+˜

λT

yPxrx,y

νyβ

=βνy˜

λT

y1 + Pxrx,y

νyβ=βˆνy˜

λT

y(110)

νy−βλT

y−Pxrx,yu(x)T

νy=−νyβλT

y+Pxrx,yu(x)T

νyβ

=−βˆνyˆ

λT

y(111)

we insert a zero by adding and subtracting the term Pyβlog f(˜

λy,˜νy)

f(ˆ

λy,ˆνy)

and write (using eqn. 68)

LΘ=X

log(h(x))rx,y−X

βlog f(˜

λy,˜νy)

f(λy, νy)!

βlog f(˜

λy,˜νy)

f(ˆ

λy,ˆνy)!−X

βlog f(˜

λy,˜νy)

f(ˆ

λy,ˆνy)!

∂log(f(˜

λy,˜νy)

∂˜νy

β(˜νy−ˆνy)

−X

βˆνy(˜

λT

y−ˆ

λT

y)hηyiQ(ηy)(112)

log(h(x))rx,y−X

βlog f(ˆ

λy,ˆνy)

f(λy, νy)!

−X

βD(Q(ηy|˜

λy,˜νy)||Q(ηy|ˆ

λy,ˆνy)) (113)

The first part of this expression is constant w.r.t. to the ˜

λy,˜νy, and the second

part is a sum of KL-divergences. To maximize LΘ, we therefore choose the

posterior parameters so that each KL-divergence is zero, i.e. ˜

λy=ˆ

λyand

˜νy= ˆνy:

variational posterior parameters

rx,y:=

n=1

Q(Xn=x)Q(paXn=y) (114)

˜νy:= νy+Pxrx,y

β(115)

λy:= νyλy+Pxrx,y

βu(x)

νy+Pxrx,y

(116)

Comparing these learning rules to the conjugate update rules (eqns. 24,25),

we see that in the variational framework, a datapoint can be ”shared” between

different values of the latent variables in the model. This sharing comes about

because the responsibilities are (approximate) probability, rather than deter-

ministic 0s or 1s. Otherwise the rules are identical. Note in particular that if

rx,y∈ {0,1}, i.e. the variables are known with certainty, then the variational

rules reduce to the exact update rules. Also, the β-variational update rules can

be obtained by dividing all responsibilities by β. Thus, if β > 1, the prior be-

comes ’stiffer’ and tends to ignore the data, whereas for β→0, we get maximum

likelihood updates.

4 Frequently used special cases

This section contains frequently used conjugate pairs and relevant quanitities

computed thereof.

4.1 Bernoulli-Beta

For a discrete random variable x∈ {0; 1}, where 1 is alternatively called ”suc-

cess” (e.g. when betting on coin tosses), is given by

P(x|q) = qx(1 −q)(1−x).(117)

Its canonical conjugate prior is a Beta distribution in qwith density

p(q|α, β) = 1

B(α, β)qα−1(1 −q)β−1.(118)

To transform these expressions into the exponential family normal form (eqns.

1 and 22), introduce the logit

η= log q

1−q(119)

whence q=1

1+exp(−η), 1 −q=1

1+exp(η)and dq

dη =−exp(−η)

(1+exp(−η))2=−q(1 −q).

Substitute ηin eqn. (117):

P(x|η) = exp (xlog(q) + (1 −x) log(1 −q))

= exp xlog q

1−q+ log(1 −q)

= (1 −q) exp(ηx) = 1

1 + exp(η)exp(ηx) (120)

Hence, h(x) = 1, g(η) = 1

1+exp(η)(cf. eqn. (1)) and

P(x|η) = h(x)g(η) exp(ηx).(121)

To transform the Beta density into exponential family normal form, note that

densities transform like p(η) = p(q(η)) 

dη :

p(η|α, β) = 1

B(α, β)qα−1(1 −q)β−1

dη 

B(α, β)q−1(1 −q)−1exp (αlog(q) + βlog(1 −q)) q(1 −q)

B(α, β)exp αlog q

1−q+ (α+β) log(1 −q)

B(α, β)1

1 + exp(η)α+β

exp (αη) (122)

Letting ν:= α+β,λ:= α

ν,f(λ, ν) = 1

B(νλ,ν(1−λ)) ,m(η) = 1 we find

p(η|λ, ν) = 1

B(νλ, ν(1 −λ)) 1

1 + exp(η)ν

exp(νλη)

=f(λ, ν)m(η)g(η)νexp(νλη).(123)

Bernoulli distribution

standard form qx(1 −q)1−x

constraints x x ∈ {0,1}

constraints q q ∈[0,1]

u(x)x

ηlog q

1−q

1+exp(−η)

constraints η η ∈R

g(η)1

1+exp(η)= 1 −q

h(x) 1

hu(x)i1

1+exp(−η)=q

Var(u(x)) exp(−η)

(1+exp(−η))2=q(1 −q)

Beta distribution

standard form qα−1(1−q)β−1

B(α,β)

constraints α, β α, β ∈R+

λα

α+β

ν α +β

α νλ

β ν(1 −λ)

constraints ν ν ∈R+

constraints λ λ ∈[0,1]

f(λ, ν)1

B(νλ,ν(1−λ))

m(η) 1

hηiψ(νλ)−ψ(ν(1 −λ))

Var(η)ψ(νλ)−ψ(ν(1−λ))

ν2+ψ0(νλ) + ψ0(ν(1 −λ))

∂log(f(λ,ν))

∂ν ψ(ν)−λψ(νλ)−(1 −λ)ψ(ν(1 −λ))

∂2log(f(λ,ν))

∂ν2ψ0(ν)−λ2ψ0(νλ)−(1 −λ)2ψ0(ν(1 −λ))

hg(η)i1−λ=β

α+β= 1 −hqi

Var(g(η)) (1−λ)λ

ν+1 =

α+β

α+β+1

p(x|λ, ν)λx + (1 −λ)(1 −x) = xhqi+ (1 −x)(1 −hqi)

hu(x)ip(x|λ,ν)λ

Table 1: Bernoulli distribution and conjugate Beta prior

4.2 Multinomial-Dirichlet

The multinomial distribution is the generalization of the Bernoulli distribution

to Kpossible outcomes. It is convenient to represent multinomial random

variates xby vectors with Kcomponents, such that xk∈ {0; 1}and PK

k=1 xk=

1, whence xK= 1 −PK−1

k=1 xk. This is called 1-of-K representation, because

exactly one component of xis 1 and all others are 0. Let q= (q1, . . . , qK) be

the probabilities of the Kpossible x, such that qK= 1 −PK−1

k=1 qk. Then, the

multinomial distribution can be written as

P(x|q) =

K−1

k=1

qxk

k 1−

K−1

k=1

qk!xK

(124)

This expression can be transformed into exponential family form via

P(x|q) = exp K−1

k=1

xklog(qk) + 1−

K−1

k=1

xk!log 1−

K−1

k=1

qk!!

= 1−

K−1

k=1

qk!exp K−1

k=1

xklog qk

1−PK−1

i=1 qi!! (125)

and by introducing the generalized logit ηk= log qk

1−PK−1

i=1 qiwe find that

∀k= 1, . . . , K −1 : qk=exp(ηk)

1 + PK−1

i=1 exp(ηi)(126)

qK=1

1 + PK−1

i=1 exp(ηi)(127)

(alternatively, we could fix ηK= 0, and let q= softmax(η)). Hence, h(x) = 1

(after xk∈ {0; 1}has been enforced) and

g(η) = 1−

K−1

k=1

qk!=qK=1

1 + PK−1

i=1 exp(ηi)

The standard conjugate prior on the multinomial is the Dirichlet distribu-

tion. Let α= (α1, . . . , αK) with αk≥0 and

M=QK

k=1 Γ(αk)

Γ(PK

k=1 αk)(128)

then the density of the Dirichlet distribution is

p(q|α) = M

K−1

k=1

qαk−1

k 1−

K−1

k=1

qk!αK−1

(129)

=Mexp K−1

k=1

(αk−1) log(qk)+(αK−1) log 1−

K−1

k=1

qk!!

K−1

k=1

q−1

k(1 −

K−1

k=1

qk)−1exp K−1

k=1

αklog qk

qK+

k=1

αKlog(qK)!

=M qPK

k=1 αK

K−1

k=1

q−1

k(1 −

K−1

k=1

qk)−1exp αTη(130)

To substitute qby η, we need to compute the determinant of the Jacobian dq

dη.

Let Z= 1 + PK−1

k=1 exp(ηk), i.e. qk=exp(ηk)

Z. Then



dη=







exp(η1)

Z−exp(2η1)

Z2. . . −exp(η1) exp(ηK)

...

−exp(ηK) exp(η1)

Z2. . . exp(ηK)

Z−exp(2ηK)







=







q1. . . 0

0...0

0. . . qK





−









(q1. . . qK)

=







q1. . . 0

0...0

0. . . qK









1K×K−









(q1. . . qK)





K−1

k=1

qk 1−

K−1

k=1

qk!(131)

And thus

p(η|α) = p(q(η)|α)

dη

=M 1

1 + PK−1

i=1 exp(ηi)!PK

k=1 αK

exp αTη(132)

With ν=PK

k=1 αk,λk=αk

ν,f(λ, ν) = M,m(η) = 1 and g(η) as above, we

find

p(η|λ, ν) = f(λ, ν)m(η)g(η)νexp(νηλ) (133)

multinomial distribution

standard form QK−1

k=1 qxk

k1−PK−1

k=1 qkxK

constraints xxk∈ {0,1},PK

k=1 xk= 1

constraints qqk∈[0,1], PK−1

k=1 qk≤1

u(x)x

ηlog q

1−PK−1

i=1 qi

qexp(η)

1+PK−1

i=1 exp(ηi)

constraints ηηk∈R

g(η)1

1+PK−1

i=1 exp(ηi)= 1 −PK−1

i=1 qi=qK

h(x) 1

hu(x)iexp(η)

1+PK−1

i=1 exp(ηi)=q

Cov(u(x)) Ckl =δkl exp(ηk)

1+PK−1

i=1 exp(ηi)−exp(ηk+ηl)

(1+PK−1

i=1 exp(ηi))2=δklql−qlqk

Dirichlet distribution

standard form Γ(PK

k=1 αk)

k=1 Γ(αk)QK−1

k=1 qαk−1

k1−PK−1

k=1 qkαK−1

constraints ααk∈R+

λ∀k= 1, . . . , K −1 : λk=αk

i=1 αi

νPK

i=1 αi

α νλ

constraints ν ν ∈R+

constraints λλi∈[0,1], PK−1

i=1 λi≤1

f(λ, ν)Γ(ν)

Γ(ν(1−PK−1

k=1 λk))QK−1

i=1 Γ(νλi)

m(η) 1

hηiψ(νλ)−ψν(1 −PK−1

i=1 λk)

Cov(ηi, ηj)δi,jψ0(νλi) + ψ0ν(1 −PK−1

i=1 λk)

∂log(f(λ,ν))

∂ν ψ(ν)−(1 −PK−1

i=1 λi)ψν(1 −PK−1

i=1 λi)−PK−1

i=1 λiψ(νλi)

∂2log(f(λ,ν))

∂ν2ψ0(ν)−(1 −PK−1

i=1 λi)2ψ0ν(1 −PK−1

i=1 λi)−PK−1

i=1 λ2

iψ0(νλi)

hg(η)i1−PK−1

i=1 λi=αK

i=1 αi=hqKi

Var(g(η)) (1−PK−1

i=1 λi)PK−1

i=1 λi

ν+1 =hqKih1−qKi

ν+1

p(x|λ, ν)PK−1

i=1 xiλi+xK(1 −PK−1

i=1 λi) = PK

i=1 xihqii

hu(x)ip(x|λ,ν)λ

Table 2: Multiomial distribution and conjugate Dirichlet prior

4.3 Multinomial-StickBreaking

The stick-breaking construction is another way of parameterizing multinomial

distributions. It has attracted a lot of attention in Machine Learning around

2005, because it is a convenient way of representing infinite multinomials with

a Dirichlet-process prior. To re-parameterize the distribution of a Kcompo-

nent multinomial variable x(in 1-of-K representation) with probabilities q,

introduce the variables v= (v1, . . . , vK−1) such that

q1=v1(134)

q2= (1 −v1)v2(135)

q3= (1 −v1)(1 −v2)v3(136)

qK−1= (1 −v1)(1 −v2). . . vK−1(137)

qK= (1 −v1)(1 −v2). . . (1 −vK−1).(138)

The distribution of xcan then be written as

p(x|v) =

K−1

i=1

vxi

i(1 −vi)PK

j=i+1 xj(139)

We now introduce the sufficient statistics u(x)=(u0, . . . , uK−1)

uk=

i=k+1

xi(140)

hence u0= 1. uk= 1 implies that xi= 1 with i>k. Eqn. 139 becomes

p(x|v) =

K−1

i=1

vui−1−ui

i(1−vi)ui= exp K−1

i=1

(ui−1−ui) log(vi) + uilog(1 −vi)!

(141)

To transform this expression into exponential family normal form, rewrite the

exponent on the r.h.s. as

log(v1) +

K−2

i=1

uilog 1−vi

vi+1+uK−1log 1−vK−1

vK−1(142)

and introduce the natural parameters η= (η1, . . . , ηK−1):

ηK−1= log 1−vK−1

vK−1(143)

∀k=K−2,...,1 : ηk= log 1−vk

vk+1(144)

Solving for vyields

vK−1=1

1 + exp(ηK−1)(145)

vK−2=1

1 + exp(ηK−2)(1 + exp(ηK−1)) (146)

v1=1

1 + exp(η1)(1 + exp(η2)(1 + . . . (1 + exp(ηK−1)) . . .)(147)

Thus

P(x|η) = v1(η) exp ηTu(x)(148)

which is in exponential family normal form, with h(x) = 1 and g(η) = v1(η).

The conjugate p(oste)rior on vis given by a product of Beta distributions.

Let α,βbe the parameters of these Beta distributions, then

p(v|α,β) =

K−1

i=1

B(αi, βi)vαi−1

i(1 −vi)βi−1(149)

To derive the corresponding density in ηwe need the determinant of the

Jacobian dv

dη. Note that as a consequence of eqns. 145-147, the Jacobian is an

upper triangular matrix, hence the determinant is given by the product of the

diagonal entries:

dvi

dηi

=−exp(ηi)(1 + exp(ηi+1)...)

(1 + exp(ηi)(1 + exp(ηi+1)(....)))2=−vi(1 −vi) (150)

⇒

dη=

K−1

i=1

vi(1 −vi).(151)

Now we can rewrite eqn. 149 in exponential form

p(v|α,β) =

K−1

i=1

v−1

i(1 −vi)−1

B(αi, βi)exp K−1

k=1

αilog(vi) + βilog(1 −vi)!(152)

and rearrange the exponent as:

K−1

k=1

αilog(vi) + βilog(1 −vi)

K−1

k=1

(αi+βi) log(vi) + βilog 1−vi

vi(153)

=βK−1log 1−vK−1

vK−1+

K−2

i=1

βilog 1−vi

vi+1

+(α1+β1) log(v1) +

K−1

i=2

(αi+βi−βi−1)

| {z }

=:ci

log(vi) (154)

=βK−1ηK−1+

K−2

i=1

βiηi+ (α1+β1) log(g(η)) +

K−1

i=2

cilog(vi) (155)

Hence, letting N(α,β) = QK−1

i=1 B(αi, βi)−1we obtain for the density of η:

p(η|α,β) = p(v(η)|α, β)

dη(156)

=N(α,β)

K−1

i=2

vi(η)cig(η)α1+β1exp ηTβ(157)

which is almost in exponential family normal form, except for:



QK−1

i=2 vi(η)cishould be m(η), i.e. it must not depend on the data, hence

the cineed to be constant. Since αand βwill be updated on observation,

this requires that ∀i= 2, . . . , K −1 : αi=ci−βi+βi−1. This implies

that αis determined by βup to the ci, except for α1.



ν:= α1+β1and λi:= βi

ν, hence α1=ν(1 −λ1).



f(λ, ν) = N(α(λ, ν), νλ)

In the standard parametrization of the prior on the stick-breaking construc-

tion (eqn. 149), the αiare the pseudocounts on the number of instances where

xi= 1. To see that our definition

αi:= ci−βi+βi−1(158)

has the same meaning, assume that we had observed Ndatapoints x1:Nand

corresponding sufficient statistics u1:N. Using the exponential family update

rules (eqns. 24 and 25), we find for the posterior parameters

αN

1=ν−βN

1=α1+β1+N−β1−

n=1

=α1+N−

n=1

=α1+

n=1

x1(159)

∀i= 2, . . . , K −1 : βN

i=βi+

n=1

i(160)

αN

i=ci−βN

i+βN

i−1

=ci−βi+βi−1+

n=1

(un

i−1−un

=ci+βi−1−βi+

n=1

=αi+

n=1

i(161)

Hence, the meaning of αis preserved.

In this table, let

γk= 1 + exp(ηk)(1 + exp(ηk+1)(1 + . . . (1 + exp(ηK−1)) . . .))

multinomial stick-breaking distribution

standard form QK−1

k=1 vxk

k(1 −vk)PK

i=k+1 xi

constraints xxk∈ {0,1},PK

k=1 xk= 1

constraints vvk∈[0,1]

u(x)∀k= 1, . . . , K −1 : uk=PK

i=k+1 xi

ηηK−1= log 1−vK−1

vK−1,∀k < K −1 : ηk= log 1−vk

vkvk+1

vvK−1=1

1+exp(ηK−1),∀k < K −1 : vi=1

1+ exp(ηi)

vi+1

constraints ηηk∈R

g(η)1

γ1=v1

h(x) 1

hu(x)i huki=exp(Pk

i=1 ηi)γk+1

γ1=Qk

i=1(1 −vi)

Cov(u(x)) Ckl =expPmax(k,l)

i=1 ηiγmax(k,l)+1

γ1−exp(Pk

i=1 ηi)γk+1 exp(Pk

i=1 ηi)γl+1

γ2

=Qmax(k,l)

i=1 (1 −vi)−Qk

i=1(1 −vi)Ql

j=1(1 −vj)

Table 3: The stick-breaking distribution for multinomial variables

stick-breaking prior

standard form QK−1

i=1 1

B(αi,βi)vαi−1

i(1 −vi)βi−1

constraints α,βαk, βk∈R+

λβ

ν α1+β1

constraints c∀i= 2, . . . , K −1 : ci> βi−βi−1

αα1=ν(1 −λ1), ∀i= 2, . . . , K −1 : αi=ci−βi+βi−1

constraints ν ν ∈R+

constraints λλi∈R+

f(λ, ν)Γ(ν)

Γ(νλ1)Γ(ν(1−λ1)) QK−1

i=2

Γ(ci+νλi−1)

Γ(νλi)Γ(ci−νλi+νλi−1)

m(η)QK−1

i=2 vi(η)ci

∀i<K−1 : hηiiψ(νλi)−ψ(ci−νλi+νλi−1) + ψ(ci+1 −νλi+1 +νλi)−ψ(ci+1 +νλi)

=ψ(βi)−ψ(αi) + ψ(αi+1)−ψ(αi+1 +βi+1)

hηK−1iψ(νλi)−ψ(ci−νλi+νλi−1) = ψ(βi)−ψ(αi)

Var(η1)ψ0(νλ1) + ψ0(ν(1 −λ1)) −ψ0(c2+νλ1) + ψ0(c2−νλ2+νλ1)

=ψ0(β1) + ψ0(α1)−ψ0(α2+β2) + ψ0(α2)

∀1< i < K −1 : Var(ηi)ψ0(νλi) + ψ0(ci−νλi+νλi−1)−ψ0(ci+1 +νλi) + ψ0(ci+1 −νλi+1 +νλi)

=ψ0(βi) + ψ0(αi)−ψ0(αi+1 +βi+1) + ψ0(αi+1)

Cov(ηi, ηi+1)ψ0(ci+1 −νλi+1 +νλi) = ψ0(αi+1)

Var(ηK−1)ψ0(νλK−1) + ψ0(cK−1−νλK−1+νλK−2) = ψ0(βK−1) + ψ0(αK−1)

∂log(f(λ,ν))

∂ν

ψ(ν)−λ1ψ(νλ1)−(1 −λ1)ψ(ν(1 −λ1))

+PK−1

i=2 [λi−1ψ(ci+νλi−1)−λiψ(νλi)−(λi+1 −λi)ψ(ci−νλi+νλi+1)]

∂2log(f(λ,ν))

∂ν2

ψ0(ν)−λ2

1ψ0(νλ1)−(1 −λ1)2ψ0(ν(1 −λ1))

+PK−1

i=2 λ2

i−1ψ0(ci+νλi−1)−λ2

iψ0(νλi)−(λi+1 −λi)2ψ0(ci−νλi+νλi+1)

hg(η)i1−λ1=α1

Var(g(η)) λ1(1−λ1)

nu+1 =α1β1

ν2(1−ν)

p(x|λ, ν)

x1= 1 : 1 −λ1=α1

α1+β1

xi∈{2;...;K−2}= 1 : λ1Qi−1

j=2

νλj

cj+νλj−1

ci−νλi+νλi−1

ci+νλi−1=Pi−1

j=1

βj

αj+βj

αi

αj+βj

xK= 1 : λ1QK−1

j=2

νλj

cj+νλj−1=QK−1

j=1

βj

αj+βj

Table 4: The conjugate prior on the stick-breaking distribution for multinomial

variables

4.4 Poisson-Gamma

The Poisson distribution is a distribution over a univariate integer-valued ran-

dom variable x, e.g. spike count or radioactive decay events. In standard form,

it is given by

p(x|r) = rxexp(−r)

x!(162)

where r∈R+

0is the rate. Its sufficient statistic and natural parameter are

u(x) = x(163)

η= log(r) (164)

and hence

p(x|η) = 1

Γ(x+ 1)

| {z }

h(x)

exp(−exp(η))

| {z }

g(η)

exp(η u(x)).(165)

Note that for x∈N, Γ(x+ 1) = x!. The conjugate prior on ris a Gamma

distribution with density

p(r|α, S) = 1

Γ(α)Sαrα−1exp −r

S(166)

where α∈R+

0is the shape parameter, and S∈R+is the scale. To tranform

this into exponential family form, let

ν=1

Sλ=αS (167)

and note that 

dr(η)

dη = exp(η) = r(η). Thus, we find

p(η|ν, λ) = ννλ

Γ(νλ)r(η)νλ−1exp (−νr(η)) r(η)

=ννλ

Γ(νλ)exp (νλ log(r(η))) exp (−νr(η))

=ννλ

Γ(νλ)

|{z}

f(λ,ν)

exp(−exp(η)))ν

| {z }

g(η)ν

exp(νλη) (168)

4.5 Multivariate Gaussian with Gauss-Wishart prior

The multivariate Gaussian is widely used, e.g. all finite-sized marginals of a

Gaussian process are Gaussian. But it is also a standard ingredient in parametric

models for regression, e.g. linear or other basis function. In the standard form, a

multivariate Gaussian density of a vector-valued random variable with variates

x,dim(x) = Dis parameterized by a mean vector µand a symmetric, positive

definite covariance matrix Σ:

p(x|µ,Σ) = 1

√2πDp|Σ|

exp −0.5·(x−µ)TΣ−1(x−µ)(169)

Poisson distribution

standard form rxexp(−r)

constraints x∈N0,r∈R+

u(x)x

ηlog(r)

constraints η η ∈R

g(η) exp (−exp(η))

h(x)1

Γ(x+1)

hu(x)iexp(η) = r

Var(u(x)) exp(η) = r

Gamma prior for Poisson-distributed RV

standard form 1

Γ(α)Sαrα−1exp −r

S

constraints α∈R+

0,S∈R+

λ αS

ν1

constraints ν∈R+,λ∈R+

f(λ, ν)ννλ

Γ(νλ)

m(η) 1

hηiψ(νλ)−log(ν)

Var(η)ψ0(νλ)

∂log(f(λ,ν))

∂ν λ(log(ν)+1−ψ(νλ))

∂2log(f(λ,ν))

∂ν2λ1

ν−λψ0(νλ)

hg(η)i1 + 1

ν−λν

Var(g(η)) 1 + 2

ν−λν −1 + 1

ν−2λν

p(x|λ, ν)Γ(νλ+x)

Γ(νλ)Γ(x+1)

(ν+1)−(νλ+x)

ν−λν

hu(x)ip(x|λ,ν)λ

Table 5: Poisson distribution and conjugate Gamma prior

It is often convenient to use the inverse of Σ, called precision matrix P=Σ−1,

which is also symmetric (P=PT) and positive definite:

p(x|µ,P) = p|P|

√2πDexp −1

2·(x−µ)TP(x−µ)(170)

To transform the Gaussian into the exponential family normal form, we rewrite

the exponent as

−1

2·(x−µ)TP(x−µ) = −1

2xTP x +µTP x −1

2µTP µ (171)

Note that

xTP x =X

xiPi,jxj=X

Pi,ix2

i+ 2 X

j<i

Pi,jxixj

µTP x =X

(P µ)ixi(172)

We therefore introduce sufficient statistics and natural parameters comprised of

three parts: first, from the diagonal elements in eqn. 172, the row vectors

ud= (x2

1,x2

2,...,x2

D)T(173)

ηd=−1

2(P1,1,P2,2,...,Pd,d)T.(174)

Second, we order the off-diagonal elements (with i < j) in some arbitrary fashion

(e.g. lectically) and construct the vectors

uc= (x2x1,x3x1,x3x2,...,xDxD−1)T= lt(xxT) (175)

ηc=−(P2,1,P3,1,P3,2,...,PD,D−1)T=−lt(P) (176)

i.e. ηccontains the lower triangle of P, and the lt() operator extracts the lower

triangle of a matrix, excluding the diagonal. Third, from eqn. 172:

uµ=x(177)

ηµ=P µ.(178)

We stack these vectors into the total sufficient statistics and natural parameters

u(x) = 



uµ

(179)

η=



ηd

ηc

ηµ

.(180)

Noting that µTP µ =µTPTP−1P µ =ηT

µP−1ηµand P=P(η), eqn. 171

can be written as

ηTu(x)−1

2µtP µ =ηT

dud+ηT

cuc

| {z }

−1

2xTP x

+ηT

µuµ

|{z}

µTP x

−1

2ηT

µP(η)−1ηµ(181)

With these substitutions, the exponential family normal form of the multivariate

Gaussian is therefore

p(x|η) = p|P(η)|

√2πDexp −1

2ηT

µP(η)−1ηµ

| {z }

=:g(η)

exp ηTu(x)(182)

and thus

log(g(η)) = −D

2log(2π) + 1

2log (|P(η)|)−1

2ηT

µP(η)−1ηµ.(183)

To compute moments, we need the gradient of this expression w.r.t. η, whose

components can be computed via the chain rule. Since

∂Pi,j

∂(ηd)k

=−2δi,kδj,k (184)

∂Pi,j

∂(ηc)(k,l)

=−δi,kδj,l −δi,lδj,k (185)

and Pdoes not depend on ηµ, we find

∂log(g(η))

∂(ηµ)k

=−1

2·2(P−1ηµ)k=−µk(186)

∂log(g(η))

∂(ηd)k

i,j

2P−1

i,j +P−1ηµηT

µP−1·(−2)δi,kδj,k =−P−1

k,k −µ2

k(187)

∂log(g(η))

∂(ηc)(k,l)

i,j

2P−1

i,j +P−1ηµηT

µP−1·(−δi,kδj,l −δi,lδj,k) = −P−1

k,l −µkµl

(188)

which implies, by virtue of eqn. 4:

huµ(x)i=µ(189)

hud(x)i=−diag(P−1)−µTµ(190)

huc(x)(kl)i=−P−1

k,l −µkµl(191)

i.e. the well-known expressions for expectations of Gaussians. The second

derivatives (necessary for the evaluation of the gradient of the KL divergence)

are omitted here, they can be computed by automatic differentiation from the

above expressions, e.g. by Theano.

The prior on the parameters of the Gaussian is given by a (multivariate)

Gauss-Wishart distribution [1]. In standard form, it is given by

p(µ,P|β, µ0, γ, V) = p(µ|β, P)p(P|γ, V) (192)

p(µ|β, P) = βD

2|P|1

√2πDexp −1

2β(µ−µ0)TP(µ−µ0)(193)

p(P|γ, V) = |P|γ−D−1

2γD

2|V|γ

2ΓDγ

2exp −1

2tr(V−1P)(194)

where ΓD(x) is the multivariate gamma function [3]:

ΓD(x) = πD(D−1)

j=1

Γx+1−j

2(195)

We reparameterize as follows:

β=ν(196)

γ=ν+α, α > 0 and const. (197)

λµ=µ0(198)

B=V−1(199)

(λd)i=Bi,i

ν+ (µ0)2

i(200)

(λc)(k,l)=Bk,l

ν+ (µ0)k(µ0)l=Bl,k

ν+ (µ0)k(µ0)l(201)

λ= (λµ,λd,λc)T(202)

Using these substitutions and µTP µ =ηT

µP−1ηµ, the arguments of the expo-

nentials in eqn. 192 can be written as

−ν1

2µTP µ +νµT

0P µ −ν1

2µT

0P µ0−1

2tr(BP )

=ν−1

2ηT

µP−1ηµ+ηT

µµ0−1

2tr PB

ν+µ0µT

0

=ν−1

2ηT

µP−1ηµ+ηT

µλµ+ηT

dλds+ηT

cλc(203)

Thus eqn. 192 becomes

p(µ,P|ν, µ0,B) = √2π(ν−1)Dν−D

2|B(λ)|ν+α

2(ν+α)D

2ΓDν+α

2|P|α−D

× 1

√2πD!ν

|P|ν

2exp −1

2ηT

µP−1ηµν

×exp νηT

µλµ+νηT

dλd+νηT

cλc.(204)

To transform this into the desired exponential family form, we need to change

µ,Pinto the natural parameters of the multivariate Gaussian (see eqns. 174,176,178).

This can be accomplished by multiplying eqn.192 with the determinant of the

Jacobian Jof that transformation, which can be constructed as a block matrix

as follows: stack the diagonal elements of Pinto the vector Pdof dimen-

sionality D, the off-diagonal lower triangle into the vector Pcof dimensionality

Q=D(D−1)

2. The Jacobian then has the following structure, which follows from

the definitions in eqns. 174,176,178 (rows and columns labelled with variable

names):

µ1. . . µD(Pd)1. . . (Pd)D(Pc)1. . . (Pc)Q

(ηµ)1

.P−10 0

(ηµ)D

(ηd)1

.∂µ

∂ηd−2·1D0

(ηd)D

(ηc)1

.∂µ

∂ηc0−1Q

(ηc)Q

(205)

By virtue of eqns. 397 and 398 in [2], the absolute value of the determinant of

Jis therefore

|J|= 2D|P−1(η)|(206)

Reparameterizing eqn. 204 in terms of ηand multiplying with this expression

yields (cf. eqn. 182 for the definition of g(η)):

p(η|ν, µ0,B)=2D√2π(ν−1)DνD

2|B(λ)|ν+α

2(ν+α)D

2ΓDν+α

2

| {z }

=:f(ν,λ)

|P|α−D

2−1

| {z }

=:m(η)

× 1

√2πD!ν

|P|ν

2exp −1

2ηT

µP−1ηµν

| {z }

=g(η)ν

×exp νηT

µλµ+νηT

dλd+νηT

cλc

=f(ν, λ)m(η)g(η)νexp(νηTλ) (207)

which is the exponential family normal form of the Gauss-Wishart prior on the

parameters of the multivariate Gaussian.

To calculate expectations, we need the derivatives of log(f(ν, λ)):

log(f(ν, λ)) = C+Dν

2log(2π)+D

2log(ν)−log ΓDν+α

2+ν+α

2log(|B(λ)|)

(208)

with C=Dlog(2) −D

2log(2π)−αD

2log(2). Thus

∂f(ν, λ)

∂ν =D

2log(2π) + D

2ν−1

2ΨDν+α

2+1

2log(|B(λ)|) (209)

where ΨD(x) = ∂log(ΓD(x))

∂x is the multivariate digamma function. For the

derivatives w.r.t. λ, note that eqns. 196-202 imply Bi,i =ν(2(λd)i−(λmu)2

and ∀i>j:Bij=ν((λc)i,j −(λµ)i(λµ)j,∀j > i :Bij=ν((λc)j,i −(λµ)i(λµ)j,

hence

∂Bi,j

∂(λµ)k

=−ν((λµ)iδj,k + (λµ)jδi,k) (210)

∂Bi,j

∂(λd)k

=νδi,kδj,k (211)

∂Bi,j

∂(λc)k,l

=ν(δi,kδj,l +δi,lδj,k) (212)

With formula 57 from [2], we therefore find

∂log(f(ν, λ))

∂(λd)k

m,n

ν+α

∂log(|B|)

∂Bm,n

∂(λd)k

=ν(ν+α)

m,n

B−1

m,nδm,kδn,k =−ν(ν+α)B−1

k,k (213)

∂log(f(ν, λ))

∂(λc)k,l

=ν(ν+α)B−1

k,l (214)

∂log(f(ν, λ))

∂(λµ)k

=−ν(ν+α)

m,n

B−1

m,n((λµ)mδn,k + (λµ)nδm,k)

=−ν(ν+α)X

B−1

k,n(λµ)n(215)

The expectations of the sufficient statistics of the multivariate Gaussian are thus

(lt(X): vectorized form of lower triangle of X)

hηµi=hP µi= (ν+α)B−1(λ)λµ=γV µ0(216)

hηdi=−1

2hdiag(P)i=−(ν+α)

2diag(B−1) (217)

hηci=−hlt(P)i=−(ν+α)lt(B−1) (218)

⇒ hPi=γV(219)

Second derivatives can be obtained similarly, or more easily by automatic dif-

ferentiation.

The expectation of the sufficient statistics u(x) can be computed from eqn.

57. We already computed λ(eqn. 202). The surface integral vanishes, because

p(x|λ, ν)→0 as λµ→ ∞, since the density must be normalized. Furthermore,

in the subspace where |P|= 0, the density is also zero if γ > D −1. Since m(η)

does not depend on µ0, we see that

huµ(x)ip(x|λ,µ)=λµ=µ0(220)

For ud,uc, it remains to evaluate the expectation

h∇ηlog(m(η))ip(x|λ,µ)=−2α−D

2−1P−1p(x|λ,µ)

=−(α−D−2) V−1

ν+α−D−1(221)

where the factor −2 results from the differentation of the elements of Pw.r.t. η,

see also the Jacobian above. The entries of V−1have to be suitably rearranged

to match the entries of η. Thus we find

hud(x)ip(x|λ,µ)= diag(µ0µT

0) + diag(V−1)

ν−α−D−2

diag(V−1)

ν+α−D−1

= diag(µ0µT

0) + ν+ 1

diag(V−1)

ν+α−D−1(222)

and likewise for uc.

4.5.1 Univariate Gauss-Gauss-Gamma

trivial, see above.

5 Random identities that might be useful

Decomposition of Kullback-Leibler divergence for multivariate Gaussians: let

p(x1,x2|µ1,µ2,K11,K21,K22) = N(x1,x2)|(µ1,µ2),K11 KT

K21 K22 

(223)

Multivariate Gaussian distribution

standard form 1

√2πD√|Σ|exp −0.5·(x−µ)TΣ−1(x−µ)

constraints x,µ∈RD,Σpositive semidefinite and symmetric

u(x)ud= diag(xxT),uc= lt(xxT),uµ=x

η ηd=−1

2diag(P), ηc=−lt(P), ηµ=P µ

constraints ηηµ∈RD,ηd≤0, ηcs.t. Ppos.semidef.

g(η)√|P(η)|

√2πDexp −1

2ηT

µP(η)−1ηµ

h(x) 1

hu(x)i huµi=µ,hudi=−diag(P−1+µµT),huci=−lt(P−1+µµT)

Gauss-Wishart prior for multivariate Gaussian RV

standard form βD

2|P|1

√2πDexp −1

2β(µ−µ0)TP(µ−µ0)|P|γ−D−1

2γD

2|V|γ

2ΓD(γ

2)exp −1

2tr(V−1P)

constraints µ,µ0∈RD,P,Vpos.semidef., β > 0,γ > D −1

λ B =V−1,λd= diag B

ν+µ0µT

0,λc= lt B

ν+µ0µT

0,λµ=µ0

ν ν =β,γ=ν+α

constraints ν∈R+,αconst. and s.t. ν+α > D −1

f(ν, λ) 2D(1−α)

2√π(ν−1)Dν−D

2|B(λ)|ν+α

ΓD(ν+α

m(η)|P(η)|α−D

2−1

hηi hηµi= (ν+α)B−1(λ)λµ,hηdi=−(ν+α)diag(B−1), hηci=−(ν+α)lt(B−1)

∂log(f(λ,ν))

∂ν

2log(π)−D

2ν−1

2ΨDν+α

2+1

2log(|B(λ)|)

p(x|λ, ν)ΓD(ν+α+1

√πΓD(ν+α

2)ν

ν+1 D

2|B(λ)|ν+α

|B(νλ+u(x)

ν+1 )|ν+α+1

huµ(x)ip(x|λ,µ)λu=µ0

hud(x)ip(x|λ,µ)λd−α−D−2

diag(V−1)

ν+α−D−1= diag(µ0µT

0) + ν+1

νdiag(V−1)

huc(x)ip(x|λ,µ)λc−α−D−2

lt(V−1)

ν+α−D−1= lt(µ0µT

0) + ν+1

νlt(V−1)

Table 6: Multivariate Gaussian distribution and conjugate Gauss-Wishart prior

be a joint multivariate Gaussian on x1,x2. The conditional distribution of x2

given x1is then also multivariate Gaussian (see [2]):

M21 =K21K−1

11 (224)

K2|1=K22 −M21K−1

11 MT

21 (225)

p(x1|x2) = Nx2|µ2+M21(x1−µ1),K2|1(226)

Assume now we had a variational posterior for x1,x2, which we decompose in

the following way (tilde indicates variational parameters):

q(x1,x2) = q(x2|x1)q(x1) = Nx2|˜

µ2+˜

M21x1,˜

K2|1Nx1|˜

µ1,˜

K1

(227)

which is a generalized form of the conditional Gaussian above, because ˜

µ2|1,˜

M21,˜

K2|1

are now free variational parameters. Note the following decomposition property

of the KL-divergence, which follows directly from its definition:

D(q(x1,x2)|p(x1,x2)) = hD(q(x2|x1)|p(x2|x1))iq(x1) +D(q(x1)|p(x1))

(228)

Using the above distributions, the second term on the right hand side is given

by the usual expression for the KL-divergence between multivariate Gaussians:

D(q(x1)|p(x1)) = 1

2tr hK−1

1˜

K1i+ (˜

µ1−µ1)TK−1

1(˜

µ1−µ1)

−dim[x1] + log(|K1|)−log(|˜

K1|)(229)

and the first term is

hD(q(x2|x1)|p(x2|x1))iq(x1)

2tr hK−1

2|1˜

K2|1i+

+˜

µ2+˜

M2|1˜

µ1−(µ2+M2|1(˜

µ1−µ1))T

K−1

1˜

µ2+˜

M2|1˜

µ1−(µ2+M2|1(˜

µ1−µ1))

−dim[x2] + log(|K2|1|)−log(|˜

K2|1|)

+tr h(˜

M2|1−M2|1)TK−1

2|1(˜

M2|1−M2|1)˜

K2|1i (230)

This expression is zero iff: ˜

M2|1=M2|1,K2|1=˜

K2|1,˜

µ1=µ1,˜

µ2=µ2−

M2|1µ1. Note that a zero expectation can not be achieved if we had made the

Ansatz q(x2|x1) = Nx2|˜

µ2|1,˜

K2|1, because the KL-divergence is positive,

thus for its expectation to be zero, it has to be zero for all values of x1which

requires an explicit representation of the mean projection matrix M.

References

[1] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, New

York, 2007.

[2] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. Version

20121115, http://www.math.uwaterloo.ca/ hwolkowi/matrixcookbook.pdf

[3] A.T. James. Distributions of Matrix Variates and Latent Roots Derived

from Normal Samples. Ann. Math. Statist. 35 (1964), no. 2, 475–501.