scieee Science in your language
[en] (orig) [de]
A collection of identities for variational inference
with exponential-family models
Dominik Endres, Kathrin Pabst, Anna-Lena Eckert, Raphael Schween
September 5, 2022
Abstract
This is a collection of identities useful for variational inference with
exponential family distributions/densities. All derviations were done by
the authors, unless indicated otherwise. This does not imply that the
results collected here have not appeared in the literature before. DIS-
CLAIMER: this collection is a work in progress. It is certainly in-
complete and probably buggy. Bug-reports and contributions are most
welcome, please email [email protected].
1 Exponential family distributions
A distribution is said to belong to the exponential family, if it can be written
in the form [1]:
p(x|η) = h(x)g(η) exp(ηTu(x)) (1)
where the random variates xmay be discrete or continuous, the sufficient
statistics uare functions of the x, not necessarily of the same dimensionality.
However, the uneed to be linearly independent. The ηare the natural param-
eters, one for each sufficient statistic. The function g(η) is the normalization
constant (replace the integral with a sum for discrete x):
g(η)Zdxh(x) exp(ηTu(x)) = 1 (2)
1.1 Moments
Taking the gradient w.r.t. to ηon both sides of equation 2, we find:
(g(η)) Zdxh(x) exp(ηTu(x))
| {z }
=1
g(η)
+g(η)Zdxh(x)u(x) exp(ηTu(x))
| {z }
hu(x)i
=0(3)
and thus the expectation hu(x)iis:
hu(x)i=g(η)
g(η)=−∇log(g(η)) (4)
1
Computing the derivative of the i-th component of the l.h.s. of eqn. 3 w.r.t.
ηjyields (noting that 2log(f(x,y))
x∂y =2f(x,y)
x∂y /f(x, y)f(x,y)
x
f(x,y)
y /f2(x, y)):
2g(η)
ηiηv
g(η)
g(η)
ηi
g(η)
ηj
g2(η)+g(η)
ηj
hui(x)i
g(η)=g(η)Zdxh(x)ui(x)uj(x) exp(ηTu(x))
2log(g(η))
ηiηjhui(x)ihuj(x)i=−hui(x)uj(x)i
hui(x)uj(x)i−hui(x)ihuj(x)i=2log(g(η))
ηiηj
(5)
Denoting the Hessian by ∇∇, the covariance matrix of u(x) is thus given by
Cov(u(x)) = −∇∇log(g(η)) (6)
Higher order moments can be computed via higher order derviatives.
1.2 Maximum Likelihood
For maximum-likelihood approximations, we need the gradient of log(p(x|η))
w.r.t. η:
log(p(x|η)) = log(g(η)) + u(x) = −hu(x)i+u(x).(7)
In other words, maximizing the likelihood (i.e. following the gradient towards
higher likelihood values) amounts to making the expected value of the suffi-
cient statistic more similiar to the actually observed sufficient statistic. For
second-order optimization methods, the Hessian matrix may be needed, which
is given by eqn. 6. For maximum likelihood parameter estimates, assume we
had observed Ni.i.d. datapoints xn. The parameter estimate is obtained by
solving
N
X
n=1 log(p(xn|η)) = 0 hu(x)i=PN
i=1 u(xn)
N(8)
i.e. by setting the data mean of the sufficient statistics equal to the expectation.
1.3 Entropy
The (differential) entropy of xgiven ηis defined as
H(x|η) = Zdxp(x|η) log(p(x|η)).(9)
note that this is not the conditional entropy of xgiven η. Using the definition
of the exponential family distribution (eqn. 1) , this can be written as
H(x|η) = Zdxp(x|η)log(g(η)) + log(h(x)) + ηTu(x)
and thus
H(x|η) = log(g(η)) hlog(h(x))iηThu(x)i(10)
2
For the gradient of the entropy w.r.t. ηwe need
ηk
ηThu(x)i=huk(x)i+X
i
ηi
hui(x)i
ηk
=huk(x)iX
i
ηi
2log(g(η))
ηiηk
.(11)
Using eqn. 6, we find
ηThu(x)i=hu(x)i+ Cov(u(x)) ·η(12)
We also need
hlog(h(x))i
ηk
=
ηkg(η)Zdxh(x) log(h(x)) exp(ηTu(x))
=g(η)
ηk
hlog(h(x))i
g(η)+g(η)Zdxh(x) log(h(x))uk(x) exp(ηTu(x))
=log(g(η))
ηkhlog(h(x))i+hlog(h(x))uk(x)i
=−huk(x)ihlog(h(x))i+hlog(h(x))uk(x)i(13)
and thus
∇hlog(h(x))i=−hlog(h(x))ihu(x)i+hlog(h(x))u(x)i(14)
The gradient of the entropy w.r.t. the ηis thus
H(x|η) = −∇log(g(η))
| {z }
hu(x)i
−∇hlog(h(x))i−∇ηThu(x)i(15)
H(x|η) = hlog(h(x))ihu(x)i−hlog(h(x))u(x)iCov(u(x)) ·η(16)
1.4 Kullback-Leibler divergence
The KL-divergence of a distribution with parameters ˜
ηto a distribution with
parameters ηis:
D(p(x|˜
η)||p(x|η)) = Zdxp(x|˜
η) log p(x|˜
η)
p(x|η)
=Zdxp(x|˜
η) [log (p(x|˜
η)) log (p(x|η))]
=H(x|˜
η)Zdxp(x|˜
η) log (p(x|η)) (17)
For this, we compute the following expectation under p(x|˜
η):
Zdxp(x|˜
η) log (p(x|η)) = Zdxp(x|˜
η)log(g(η)) + log(h(x)) + ηTu(x)
= log(g(η)) + hlog(h(x))i+ηThu(x)i(18)
3
and thus, using eqn. 10:
D(p(x|˜
η)||p(x|η)) = log(g(˜
η)) log(g(η)) + (˜
ηTηT)hu(x)i(19)
Its gradient w.r.t. ˜
ηis then (using eqn. 4 and eqn. 6)
D(p(x|˜
η)||p(x|η)) = log(g(˜
η)) + (˜
ηTηT)hu(x)i
=−hu(x)i+hu(x)i+∇hu(x)i(˜
ηη) (20)
where ∇hu(x)iacts on ucomponent-wise, yielding a matrix (ij) of deriva-
tives ui(x)
˜
ηj. Thus
D(p(x|˜
η)||p(x|η)) = −∇∇log(g(˜
η)) ·(˜
ηη) = Cov(u(x))(˜
ηη) (21)
2 Conjugate priors on exponential family distri-
butions
For inference and learning in hierarchical models, conjugate priors on the pa-
rameters ηare very useful, because inference/learning with i.i.d. observations
translates into parameter updates (rather than complicated integrals). The
conjugate prior on an exponential family distribution (eqn. 1) is given by
p(η|λ, ν) = f(λ, ν)m(η)g(η)νexp(νηTλ) (22)
where the λ, ν are the parameters of the p(oste)rior, g(η) is the same function
as above and m(η) is an arbitrary positive function (different from g(η)). To see
that this is a conjugate prior on p(x|η), assume we had observed Ndatapoints
xn. The posterior of ηis then (using eqns. 1 and 22)
p(η|λ, ν, x1:N) = p(x1:N,η|λ, ν)
p(x1:N|λ, ν)
=QN
n=1 p(xn|η)p(η|λ, ν)
RdηQN
n=1 p(xn|η)p(η|λ, ν)
=QN
n=1 g(η)h(xn) exp(ηTu(xn)) ·f(λ, ν)m(η)g(η)νexp(νηTλ)
RdηQN
n=1 g(η)h(xn) exp(ηTu(xn)) ·f(λ, ν)m(η)g(η)νexp(νηTλ)
=[Qnh(xn)] f(λ, ν)m(η)g(η)ν+Nexp ηT(νλ+Pnu(xn))
[Qnh(xn)] f(λ, ν)Rdηm(η)g(η)ν+Nexp (ηT(νλ+Pnu(xn)))
=m(η)g(η)ν+Nexp ηT(νλ+Pnu(xn))
Rdηm(η)g(η)ν+Nexp (ηT(νλ+Pnu(xn))) (23)
Note that this expression depends on the data x1:Nonly through Nand
Pnu(xn). This is why the u(x) are called sufficient statistics: they contain
all the information about ηwhich we need from the data to determine the
parameter posterior. A similar result holds for maximum-likelihood learning,
see [1]. By introducting the
4
posterior parameters
˜ν:= ν+N(24)
˜
λ:= νλ+Pnu(xn)
˜ν(25)
thus, νλ+PNu(xn) = ˜ν˜
λ. We furthermore identify f(λ, ν) as the normaliza-
tion constant of the p(oste)rior, i.e.
f(˜
λ,˜ν) = 1
Rdηm(η)g(η)˜νexp ηT˜ν˜
λ
and finally, plugging these identities back into eqn. 23, we obtain:
p(η|λ, ν, x1:N) = f(˜
λ,˜ν)m(η)g(η)˜νexp ηT˜ν˜
λ=p(η|˜
λ,˜ν).(26)
In other words, given an exponential family observation model, i.i.d. data and
a conjugate prior, we obtain posterior just by replacing the prior parameters
according to eqns. 24 and 25. Furthermore, note that
According to eqn. 24, ˜νkeeps track of the number of observed data-
points. Since it also contains prior information via ν, it is referred to as a
pseudocount.
For large enough N, the posterior is unimodal, and the log-posterior is
convex. The width of the maximum is monotonically decreasing in ˜ν
(Can be shown by computing the Hessian of the log-posterior). Hence, ˜ν
is also called the concentration parameter.
The posterior ˜
λis just a weighted mean of the prior λand and the
observed data.
These posterior updates can be iterated, i.e. the accumulation of the
sufficient statistics can be restarted at any point. The extreme case of
updating after every datapoint, i.e. online learning, boils down to keeping
track of this weighted average datapoint per datapoint.
2.1 Maximum-a-posteriori (MAP) parameter estimates
Instead of working with the full posterior of the natural parameters η(eqn.
26), it is sometimes enough to use the parameter values which maximize the
posterior, i.e. the numerator of eqn. 26 (the denominator does not depend on
ηafter the integration). Setting the derivative of the log of the numerator to
zero, we find
ηlog(m(η)) + (ν+N)ηlog(g(η))
| {z }
−hu(x)i,eqn. 4
+(νλ+X
n
u(xn)) !
= 0
νλ+Pnu(xn)
N+ν+ηlog(m(η))
N+ν
!
=hu(x)i(27)
5
i.e. the posterior maximum is located at a point where the expected value of the
natural parameters is equal to the quotient of the posterior parameters (eqns.
24,25) plus a term depending on m(η). The latter is often zero, since m(η)=1
for many distributions (see tables in section 4).
2.2 Expectations
Computing parameter expectations (i.e. of ηand functions thereof) of a conju-
gate p(oste)rior can be done similar to the expectations of an exponential family
distribution. From the normalization equation
f(λ, ν)Zdηm(η)g(η)νexp νηTλ= 1 (28)
follows
λf(λ, ν)Zdηm(η)g(η)νexp νηTλ= 0
λf(λ, ν)Zdηm(η)g(η)νexp νηTλ
| {z }
f(λ)1
=f(λ, ν)Zdηm(η)g(η)νexp νηTλνη
λf(λ, ν)
f(λ, ν)=νhηi(29)
and thus
hηi=λlog(f(λ, ν))
ν(30)
Likewise, from the derivative w.r.t. νwe find, noting that g(η)ν
ν = log(g(η))g(η)ν:
hlog(g(η))i+λThηi=log(f(λ, ν))
ν (31)
For the second moments, take the derivatives of hηi(see eqn. 29):
hηi
λj
=
λjf(λ, ν)Zdηm(η)g(η)νexp νηTλη
=f(λ, ν)
λjZdηm(η)g(η)νexp νηTλη
| {z }
=hηi
f(λ)
+f(λ, ν)Zdηm(η)g(η)νexp νηTλνηηj
=log(f(λ, ν))
λjhηi+νhηηji
=νhηjihηi+νhηηji(32)
1
ν
hηki
λj
=hηjηki−hηjihηki(33)
and thus
6
Cov(η) = λλlog(f(λ, ν))
ν2(34)
Likewise, computing the derivative of eqn. 31 on both sides yields
hlog(g(η))i
ν =2log (f(λ, ν)))
ν2λThηi
ν (35)
where
hlog(g(η))i
ν =log(f(λ, ν))
ν hlog(g(η))i+log(g(η))2+λThlog(g(η))ηi
=log(g(η))2hlog(g(η))i2+λThlog(g(η))ηiλThlog(g(η))ihηi
= Var(log(g(η))) + Cov(log(g(η)),λTη)
hηii
ν =log(f(λ, ν))
ν hηii+hηilog(g(η))i+λThηηii(36)
=hηilog(g(η))i−hηiihlog(g(η))i+λThηηiiλThηihηii
= Cov(log(g(η)), ηi) + Cov(λTη, ηi)
λThηi
ν = Cov(log(g(η)),λTη) + Var(λTη) (37)
and thus, noting that Var(x+y) = Var(x) + Var(y) + 2 Cov(x, y):
Var log(g(η)) + λTη=2log (f(λ, ν)))
ν2(38)
Another expectation that can be computed from the normalization constant
f(λ, ν) is
hg(η)ki=f(λ, ν)Zdηm(η)g(η)ν+kexp νηTλ.(39)
To evaluate the integral, choose new parameters ν0,λ0such that
ν0=ν+k(40)
ν0λ0=νλλ0=ν
ν0λ.(41)
With these parameters, the integral is in exponential family normal form, and
thus
hg(η)ki=f(λ, ν)
f(λ0, ν0)(42)
hg(η)i=f(λ, ν)
f(ν
ν+1 λ, ν + 1) (43)
Var(g(η)) = f(λ, ν)
f(ν
ν+2 λ, ν + 2) "f(λ, ν)
f(ν
ν+1 λ, ν + 1)#2
(44)
7
2.3 Predictive distribution, entropy and log-likelihood
Let α > 0. The predictive distribution and related quantities can be derived
from the integral
Zdηp(x|η)αp(η|λ, ν) = h(x)αf(λ, ν)Zdηg(η)αexp αηTu(x)m(η)g(η)νexp νηTλ
=h(x)αf(λ, ν)Zdηm(η)g(η)ν+αexp νηTλ+αηTu(x)
=h(x)αf(λ, ν)Zdηm(η)g(η)ν+αexp (ν+α)ηTνλ+αu(x)
ν+α
hp(x|η)α)ip(η|λ)=h(x)αf(λ, ν)1
fνλ+αu(x)
ν+α, ν +α(45)
where the last line follows from the normalization equation 28. For α= 1, the
integral on the l.h.s. is the expectation of p(x|η) under the prior:
p(x|λ, ν) = h(x)f(λ, ν)
fνλ+u(x)
ν+1 , ν + 1(46)
Differentiating with respect to αyields:
αhp(x|η)α)ip(η|λ)=hlog(p(x|η))p(x|η)αip(η|λ)
= log(h(x)) h(x)αf(λ, ν)
fνλ+αu(x)
ν+α, ν +α
h(x)αf(λ, ν)
fνλ+αu(x)
ν+α, ν +α2·ν
(ν+α)2λ0f(λ0, ν0)(u(x)λ) + f(λ0, ν0)
ν0
where the derivatives are evaluated at λ0=νλ+αu(x)
ν+αand ν0=ν+α. For α= 0,
we obtain the expected log-likelihood, using eqn. 30:
hlog(p(x|η))ip(η|λ)= log(h(x)) + hηTi(u(x)λ)log(f(λ, ν))
ν (47)
For α= 1, we obtain expectations of the form hlog(p(x|η))p(x|η))ip(η|λ),
which are the terms required for the computation of the expected entropy of x:
λ0=νλ+u(x)
ν+ 1 (48)
ν0=ν+ 1 (49)
hlog(p(x|η))p(x|η)ip(η|λ)=p(x|λ, ν)hlog(h(x)) + ν
ν0hηTiλ00(u(x)λ)
log(f(λ0, ν0))
ν0(50)
8
2.4 Expected sufficient statistics
The expectations of the sufficient statistics ufor fixed natural parameters ηis
given by eqn. 4:
hu(x)ip(x|η)=−∇log(g(η)) (51)
If ηis drawn from a conjugate prior, the expectation of u(x) under the prior is
given by averaging the r.h.s. over p(η|λ, ν),
hu(x)ip(x|η)p(η|λ)=−h∇log(g(η))ip(η|λ)(52)
To compute this expectation, one can use the Divergence Theorem from vec-
tor calculus (Ostrogradsky-Gauss). For a differentiable vector field f(x), this
theorem states that
ZV
dx·f(x) = IS
ds f(x) (53)
where Sis the surface enclosing the volume V. As a special case, consider the
field f(x) = cz(x), with ~c a constant vector and z(x) a smoothly differentiable
scalar function. Then, using ·cz(x) = Pi
ciz(x)
xi=cTz(x) we find
cTZV
dxz(x) = cTIS
ds z(x) (54)
and since this holds for any c, it follows that
ZV
dx~z(x) = IS
ds z(x) (55)
This identity can be used to compute the expectation on the r.h.s. of eqn.
52 by integrating the gradient of p(η|λ, ν):
Zdηηp(η|λ, ν) = f(λ, ν)Zdηηm(η)g(η)νexp νηTλ
=f(λ, ν)Zdηηm(η)
m(η)+νηg(η)
g(η)+νλp(η|λ, ν)
=h∇ηlog(m(η))ip(η|λ)+νh∇ηlog(g(η))ip(η|λ)+νλ
=IS
dsp(η|λ, ν) (56)
where the surface Sencloses the range of η. Hence, the expectation is:
hu(x)ip(x|η)p(η|λ)=λ+h∇ηlog(m(η))ip(η|λ)Hdsp(η|λ, ν)
ν(57)
If ηRDwith no further constraints, then p(η|λ, ν)0 on the surface of the
range of η, since p(η|λ, ν) has to be normalizable. Hence, the surface integral
must be zero. This is e.g. the case for the Multinomial distribution and the
Poisson distribution. Furthermore, if m(η) = const., then the gradient vanishes
(Dirichlet,Gamma, Stick-breaking for ci= 0, see tables in appendix). In those
cases, the above expression simplifies to
9
hu(x)ip(x|η)p(η|λ)=λ(58)
The expectation for the multivariate Gaussian is also computable, see ap-
pendix.
2.5 Maximum likelihood
For maximum-likelihood approximations, we need the gradient of log(p(η|λ, ν))
w.r.t. λand ν:
λlog(p(η|λ, ν)) = λlog(f(λ, ν)) + νη
=νhηi+νη=ν(ηhηi) (59)
ν log(p(η|λ, ν)) = f(λ, ν)
ν + log(g(η)) + ηTλ
= (log(g(η)) hlog(g(η))i) + λT(ηhηi) (60)
Similar to the gradient of the exponential family distributions, this gradient
points towards the actual value of ηand away from the expectation.
2.6 Entropy
The differential entropy (not the conditional entropy) of ηgiven λand νis
H(η|λ, ν) = f(λ, ν)Zdηm(η)g(η)νexp(νηTλ)log(f(λ, ν)) + log(m(η)) + νlog(g(η)) + νηTλ
=log(f(λ, ν)) hlog(m(η))iνhhlog(g(η))i+λThηii(61)
where the expectations are w.r.t. the p(oste)rior eqn. 22. Using eqn. 31, this
can be rewritten as
H(η|λ, ν) = log(f(λ, ν)) hlog(m(η))i+νlog(f(λ, ν))
ν (62)
The derivates of this entropy are therefore:
H(η|λ, ν)
ν =ν2log (f(λ, ν))
ν2hlog(m(η))i
ν (63)
and (using eqn. 36 and 30):
λH(η|λ, ν) = −∇λlog(f(λ, ν)) + νλ
log(f(λ, ν))
ν λhlog(m(η))i
=νhηi+ν
ν λlog(f(λ, ν)) λhlog(m(η))i
=νhηiν
ν νhηi−∇λhlog(m(η))i
=νhηiνhηiν2hηi
ν λhlog(m(η))i
=ν2Cov(log(g(η)),η) + Cov(λTη,η)λhlog(m(η))i
10
λH(η|λ, ν) = −∇λlog(f(λ, ν)) + νλ
log(f(λ, ν))
ν λhlog(m(η))i
(64)
λH(η|λ, ν) = ν2hηi
ν λhlog(m(η))i(65)
2.7 Kullback-Leibler divergence
The KL-divergence of a distribution with parameters ˜
λ,˜νto a distribution with
parameters λ, ν is given by
D(p(η|˜
λ,˜ν)||p(η|λ, ν)) = Zdηp(η|˜
λ,˜ν)hlog(p(η|˜
λ,˜ν)) log(p(η|λ, ν))i
=H(η|˜
λ,˜ν)Zdηp(η|˜
λ,˜ν) log(p(η|λ, ν)) (66)
The second term on the r.h.s. is given by (expectations w.r.t p(η|˜
λ,˜ν))
Zdηp(η|˜
λ,˜ν) log(p(η|λ, ν)) = log(f(λ, ν))+hlog(m(η))i+νhlog(g(η))i+νλhηi
(67)
and thus, using eqn. 61 and 31 we find
D(p(η|˜
λ,˜ν)||p(η|λ, ν)) = log f(˜
λ,˜ν)
f(λ, ν)!(˜νν)log(f(˜
λ,˜ν))
˜ν+ (˜
λTλT)νhηi
(68)
The derivatives are:
˜
λD(p(η|˜
λ,˜ν)||p(η|λ, ν)) = ˜
λlog(f(˜
λ,˜ν)) (˜νν)
˜ν˜
λlog(f(˜
λ,˜ν))
+νhηi ν
˜ν(˜
λTλT)˜
λ˜
λlog(f(˜
λ,˜ν))
=˜νhηi+ (˜νν)
˜ν˜νhηi
+νhηi ν
˜ν(˜
λTλT)2
˜
λlog(f(˜
λ,˜ν))
= ˜ν(˜νν)hηi
˜νν
˜ν(˜
λTλT)2
˜
λlog(f(˜
λ,˜ν))
= ˜ν(˜νν)hηi
˜ν+ν(˜
λTλT)˜
λhηi(69)
˜
λD(p(η|˜
λ,˜ν)||p(η|λ, ν)) = ˜ν(˜νν)hηi
˜ν+ν(˜
λTλT)˜
λhηi(70)
where ˜
λhηiis a matrix with entries ˜
λhηii,j =hηii
˜
λj. Likewise,
˜νD(p(η|˜
λ,˜ν)||p(η|λ, ν)) = (˜νν)2f(˜
λ,˜ν)
˜ν2ν(˜
λTλT)h˜
ηi
˜ν(71)
11
3 Variational approximations with exponential
family distributions
3.1 Variational inference with conjugate p(oste)riors
In variational inference, we replace an intractable distribution (or density)
q(X|d) (i.e. one where marginals and conditionals are hard to compute) with
a tractable, factorized approximation q(X). dis the observed data. Strictly
speaking, q(X) = q(X|d) but it is customary to omit writing this condition-
ing, since it is only approximate. The approximation is linked to the correct
distribution via the variational bound also ’evidence lower bound (ELBO)’:
log(q(d)) = log X
x
q(d,x)!= log X
x
q(x)p(d|x)p(x)
p(x)!
Zx
q(x) log p(d|x)p(x)
q(x)=hlog p(d|x)iq(X)D(q(X)||p(X))
L(q, d) = hlog p(d|x)iq(X)D(q(X)||p(X)) log(p(d)) (72)
where the second line follows from Jensen’s inequality for convex functions and
the definition of the Kullback-Leibler divergence. L(q, d) is a lower bound on the
log-marginal-likelihood, which we try to maximize w.r.t. q(x). The resulting
q(x) is an approximate version of the correct posterior p(x|d) which will be
exact iff p(x|d) is contained in the class of distributions which can be modeled
by q(x). In that case (and only in that case), the bound will be tight.
In the following, we will derive the posterior update rules for the case p(X)
is conjugate to q(d|X) and both are in the exponential family. We also assume
that q(X) is conjugate to the likelihood, such that posterior updated reduce to
parameter updates, like in section 2. Lastly, assume that the data are comprised
of Ni.i.d. observations, i.e. p(d|X) = QN
i=1 p(di|X). We use a generalized
version of the ELBO, which has an inverse temperature parameter β0:
L(q, d) = hlog p(d|x)iq(X)βD(q(X)||p(X)) log(P(d)) (73)
β6= 1 can be used to model deviations from optimal inference, or for stochastic
updating in minibatches etc.. Denote the prior parameters with ν, λand the
posterior parameters with ˜ν, ˜
λ.
The expected log-likelihood under the posterior hlog p(d|x)iq(X)for Ndat-
apoints can then be computed from eqn. (47):
hlog(p(d|η))iq(η|˜
λ,˜ν)=
N
X
i=1
log(h(di))+hηTi(
N
X
i=1
u(di)N˜
λ)Nlog(f(˜
λ,˜ν))
˜ν
(74)
where hηTiis given by eqn. (30) and the KL-divergence D(q(X)||p(X)) by eqn.
(68). To maximize eqn. (73) w.r.t. the posterior parameters ˜
λ,˜ν, we will rewrite
the elbo as difference between one part that does not depend on the posterior
parameters, and a KL-divergence between the posterior, and a distribution in
the same exponential family as the posterior that depends on parameters λ0, ν0
Because the KL-divergence is zero exactly if the two distributions that enter it
12
are pointwise equal, we can then compute the maximal ELBO by setting ˜
λ=λ0
and ν0= ˜ν.
L(q, d) = hlog p(d|x)iq(X)βD(q(X)||p(X))
=
N
X
i=1
log(h(di)) + hηTi N
X
i=1
u(di)N˜
λ!Nlog(f(˜
λ,˜ν))
˜ν
β"log f(˜
λ,˜ν)
f(λ, ν)!(˜νν)log(f(˜
λ,˜ν))
˜ν+ (˜
λTλT)νhηi#
=
N
X
i=1
log(h(di)) βlog f(˜
λ,˜ν)
f(λ, ν)!
+hηTi N
X
i=1
u(di) + βνλ(N+βν)˜
λ!
(N+βν β˜ν)log(f(˜
λ,˜ν))
˜ν(75)
Define
Posterior parameters for β-variational update
ν0=N
β+ν(76)
λ0=PN
i=1 u(di) + βν ˜
λ
N+βν =νλ+PN
i=1 u(di)
ν0(77)
and note the similarity of these definitions with the exact posterior updates eqn.
(26) all data-related quantities have been divided by β. Collecting terms in
the ELBO eqn. (75) and plugging in these definitions, we find
L(q, d) =
N
X
i=1
log(h(di)) βlog f(˜
λ,˜ν)
f(λ, ν)!
+βν0hηTiλ0˜
λ
β(˜νν0)log(f(˜
λ,˜ν))
˜ν(78)
Comparing the last two lines to the expression for the KL divergence eqn. (68),
we find that up to a log f(˜
λ,˜ν)
f(λ00)term and a factor β, these lines are a KL-
divergence. Inserting and subtracting this term, we find
13
X1X2
X6X7
X3X5
X4
X1X2
X6X7
X3X5
X4
p(X4|X1)p(X5|X2)
p(X3|X1, X2)
p(X6|X4, X3)p(X7|X5, X3)
p(X1)p(X2)
Figure 1: Left: Bayesian network with (undirected) loops. Exact sum-product
can not be run on this graph. Right: corresponding factor graph. For vari-
ational message passing, a node (e.g. X3) needs to receive messages from all
members of its Markov blanket, which are the variables connected to neighbour-
ing factors (for X3: all other nodes). through the connecting factors.
L(q, d) =
N
X
i=1
log(h(di)) βlog f(˜
λ,˜ν)
f(λ, ν)!+βlog f(˜
λ,˜ν)
f(λ0, ν0)!
βD(q(X|˜
λ,˜ν)||p(X|λ0, ν0))
=
N
X
i=1
log(h(di)) βlog f(λ0, ν0)
f(λ, ν)(79)
βD(q(X|˜
λ,˜ν)||p(X|λ0, ν0)) (80)
which is maximal if D(q(X|˜
λ,˜ν)||p(X|λ0, ν0)) = 0, which happens if and only
if ˜
λ=λ0and ν0= ˜ν. Thus, the maximal ELBO after the posterior update is
given by eqn. (79), which is achieved for the parameters given in eqns. (76).
Now we derive an expression for the expected log-likelihood eqn. (74) that
depends only on the p(oste)rior parameters. This is possible because these
parameters are computed from the sufficient statistics (which are by definition
sufficient to determine the likelihood). Rewrite the posterior parameters as
N=β(ν0ν) and β(λ0ν0λν) and substitute these expressions into the log-
likelihood, then
hlog(p(d|η))iq(η|˜
λ,˜ν)=
N
X
i=1
log(h(di))+βνhηTi(λ0λ)β(ν0ν)log(f(λ0, ν0))
ν0
(81)
3.2 Variational message passing
Denote the set of indexes of latent variables by L, the set of indexes of observed
variables with O, and the set of all indexes by Nsuch that LO=and
14
LO=N. We consider a fully factorized approximation, i.e. one where the
density of the latent variables
Q(x) = Y
iL
Qi(xi) = Y
iL
Q(xi)
is a product over distributions of individual variables. Strictly speaking, the
notation in the middle is correct because there is one density per variable. We
omit the extra index and assume the reader knows what is meant. Let the
correct density be expressed as a Bayes net,
P(d,x) = Y
jO
P(dj|paXj)Y
iL
P(xi|paXi)
and djis the observed data at node Xj. Furthermore, to simplify notation, we
introduce the ”sum-product” symbol:
XY
iK
Q(xi) := X
xi:iKY
iK
Q(xi) (82)
The bound then is:
L(Q, d) = X
xY
iL
Q(xi)
X
jO
log(P(dj|paXj)) + X
kL
log(P(xk|paXk)) X
kL
log(Q(xk))
=X
xY
iL
Q(xi)
X
jO
log(P(dj|paXj)) + X
kL
log(P(xk|paXk))
X
iLX
xi
Q(xi) log(Q(xi))
=X
jOXY
iLpaXj
Q(xi) log(p(dj|paXj))
+X
kLXY
iLpaXk∪{k}
Q(xi) log(P(xk|paXk))
X
iLX
xi
Q(xi) log(Q(xi)) (83)
To find the Q(x) that maximizes the bound, we take the derivative w.r.t. q(x)
and set it to zero. This is a necessary condition for a maximum, it can be shown
that it is sufficient, too. We furthermore impose the constraint that all q(xi)
have to be distributions, i.e. q(xi)0 and Pxiq(xi) = 1. It will turn out that
we do not have to impose the first constraint, we do however need to make sure
the second one is fulfilled. This can be achieved by a Lagrange multiplier for
each distribution. The Lagrangian functional therefore is
L(Q, d) = L(Q, d) + X
iL
τi X
xi
Q(xi)1!(84)
A necessary condition for an extemum of L(Q, d) is a stationary point of L(Q, d),
i.e. the derivatives w.r.t. the components of Q(x) have to vanish:
15
δL(Q, d)
δQ(Xm)=X
jOchXmXY
iLpaXj\{m}
Q(xi) log(P(dj|paXj))
+XY
iLpaXm
Q(xi) log(P(xm|paXm))
+X
kLchXmXY
iLpaXk∪{k}\{m}
Q(xi) log(P(xk|paXk))
log(Q(xm)) 1 + τm(85)
which we set to 0 and solve for Q(xm) to find the optimal approximate posterior
marginals. To do so, we interpret the terms in eqn. 85 as messages sent to node
Xmon the factor graph corresponding to the Bayesian network (see fig. 1 for
an example).
Define the messages sent a variable node Xmto a neighbouring factor node
f(Xm, . . .) as just the variational posterior marginals for unobserved nodes, and
a 1-or-0 message for observed nodes:
mL:µXmf(...,Xm,...)(Xm) = Q(Xm) (86)
jO:µXjf(...,Xj,...)(Xj) = δxj,dj(87)
and the messages sent from a factor f(...,Xm, . . .) depending on variables with
indices jF, m Fto a neighbouring variable node as:
µf(...,Xm,...)Xm(Xm) = XY
jF\{m}
Q(xj)f(...,xm, . . .) (88)
=XY
jF\{m}
f(...,xm, . . .)µXjf(...,Xj,...)(xj) (89)
i.e. the average over the factor with respect to the posterior of all variables that
connect to it, except for the variable where the message is being sent to. With
these message definitions, eqn. 85 becomes
δL(Q, d)
δQ(Xm)=X
jOchXm
µlog(P(Xj|paXj))Xm(Xm)
+µlog(P(Xm|paXm))Xm(Xm)
+X
kLchXm
µlog(P(Xk|paXk))Xm(Xm)
log(Q(xm)) 1 + τm(90)
=X
jchXm
µlog(P(Xj|paXj))Xm(Xm)
+µlog(P(Xm|paXm))Xm(Xm)
log(Q(Xm)) 1 + τm(91)
16
Defining exp(τm1) = 1
Zm, we can solve for Q(Xm) now:
Q(Xm) = 1
Zm
exp
X
jchXm
µlog(P(Xj|paXj))Xm(Xm) + µlog(P(Xm|paXm))Xm(Xm)
(92)
In other words, the variational posterior at variable node Xmis computed by
adding up all incoming messages, exponentiating, and normalizing (since Ziis
computed from the Lagrange multiplier which enforces normalization). This
message-passing scheme has to be iterated until convergence, which is guaran-
teed since the bound L(Q, D) is a Lyapunov function of the iteration dynamics.
Another way of deriving this algorithm without computing derivatives is via
the KL divergence D(Q(X)||˜
Q(X)). Recall that the KL divergence is positive,
and zero if and only if the distributions are equal everywhere. Assume again we
wanted to maximize eqn. 83 w.r.t. Q(Xk). To carry out this maximization, we
only need to consider terms which depend on Q(Xk), which in turn depend on
the members of Xk’s Markov blanket:
argmax
Q(Xk)
[L(Q, d)] = argmax
Q(Xk)
X
jOchXkXY
iLpaXj
Q(xi) log(P(dj|paXj))
+X
jLchXk∪{k}XY
iLpaXj∪{j}
log(P(xj|paXj))
X
xk
Q(xk) log(Q(xk))#(93)
Note that all terms on the r.h.s. include a factor Q(xk) and a summation over
xk. We can therefore pull it out:
argmax
Q(Xk)
[L(Q, d)] = argmax
Q(Xk)
X
xk
Q(xk)
X
jOchXkXY
iLpaXj\{k}
Q(xi) log(P(dj|paXj))
+XY
iLpaXk
log(P(xk|paXk))
+X
jLchXkXY
iLpaXj∪{j}\{k}
log(P(xj|paXj)) log(Q(xk))
(94)
17
With the message definitions (eqns. 86 88) the r.h.s. can be written as
argmax
Q(Xk)
[L(Q, d)] = argmax
Q(Xk)
X
xk
Q(xk)
X
jOchXk
µlog(P(dj|paXj))Xk(xk)
+µlog(P(Xk|paXk))Xk(xk)
+X
jLchXk
µlog(P(Xj|paXj)Xk(xk)log(Q(xk))
(95)
i.e. we need the incoming messages from all neighbouring factor nodes to com-
pute this expression. Note that the unions of the index sets of the sums in the
first and the last line are simply the indexes of all children of Xk, whereas the
message on the second line is the incoming message from the parents. Thus,
define
log(U(xk)) = X
jchXk
µlog(P(dj|paXj))Xk(xk) + µlog(P(Xk|paXkXk(xk) (96)
and let
U(xk) = Zk˜
Q(xk) (97)
with Pxk
˜
Q(xk) = 1 and Zk>0, i.e. ˜
Q(Xk) is a probability distribution. With
these definition we obtain
argmax
Q(Xk)
[L(Q, d)] = argmax
Q(Xk)"X
xk
Q(xk) log(U(xk)) log(Q(xk))#
= argmax
Q(Xk)"X
xk
Q(xk) log(Zk) + X
xk
Q(xk) log ˜
Q(xk)
Q(xk)!#
= log(Zk)argmax
Q(Xk)hD(Q(Xk)||˜
Q(Xk))i(98)
Since the KL-divergence is 0, it follows that the variational bound L(Q, d)
is maximized if Q(Xk) = ˜
Q(Xk). In other words, to compute the optimal
distribution at a given variable node given the distributions of the variables in
its Markov blanket, do the following:
sum all incoming messages from neighbouring factor nodes,
exponentiate,
normalize.
The factor nodes collect messages from their neighbouring variable nodes, and
compute messages by summing their log-factor over all variables except the one
where the message is being sent to, similar to sum-product message passing.
18
Figure 2: Variational approximation of parameter (Θ) learning in a Bayesian
network with a fully factorized approximation. A: fragment of a Bayesian net-
work. paXare the parents of X, which may have parents and other chil-
dren. For learning the parameters Θ, these other children/parents are not rele-
vant. B: corresponding factor graph fragment. The parameters Θ appear only
in the factor connecting Xto its parents. C: fully factorized approximation
and Dcorresponding factor graph. When computing the variational bound,
log(P(X|Θ,paX)) has to be averaged over all variables that appear in it, which
are the neighbours of the factor node P(X|Θ,paX) in B.
3.3 Learning parameters with exponential family distri-
butions
To apply variational message passing, it is necessary to know the factors, which
are the conditional probability distributions in case of a Bayesian network. If
we want to learn these factors from data, then it is useful to have a compact
parametrization of them, which can be done with exponential family distri-
butions and their conjugate p(oste)riors. Consider the network (fragment) in
fig. 2. Assume we wanted to learn the conditional distribution of Xgiven its
parents, and parametrize this distribution by Θ. We lump these parents to-
gether in one supernode. Xmay be continuous or discrete, but we assume that
the parents of X, paXare discrete (in some special cases, continuous models
are tractable). Also assume that the distribution of Xgiven paXis from the
exponential family, i.e.
p(X|paX,ηpaX) = h(x)g(ηpaX) exp(ηT
paXu(x)) (99)
i.e. there is one parameter vector ηpaXfor each value of paX, and Θ is the
concatentation of these parameter vectors. The conjugate prior on each ηpaX
is then
p(ηpaX|λpaX, νpaX) = f(λpaX, νpaX)g(ηpaX)νexp(νpaXηT
paXλpaX) (100)
Assume now we had observed n= 1, . . . , N datapoints d1:Nand computed
the corresponding latent variable distributions Q(Xn), Q(pan
X). If any of the
nodes are observed, replace the corresponding distribution with a distribution
concentrated at the observed value. Note that in a fully factorized approxi-
mation, each Q(paXn) is a actually a product over the parents, Q(paXn) =
19
QYnpaXnQ(Yn). Looking at fig. 2, we note that there is exactly one fac-
tor connecting the parameter node to Xand paX, and the prior factor for Θ.
Using an approximating posterior for Q(Θ) which has the same form as the
prior (eqn. 100), the summands in the variational bound which depend on the
posterior distribution of Θ are (where yn,yrange(paXn)):
LΘ=
N
X
n=1 X
xnX
yn
Q(xn)Q(yn)hlog(P(xn|Θ,yn))iQ(Θ) βD(Q(Θ)||P(Θ))
=
N
X
n=1 X
xnX
yn
Q(xn)Q(yn)log(P(xn|ηyn))Q(ηyn|˜
λyn,˜νyn)
X
y
βD(Q(ηy|˜
λy,˜νy)||P(ηy|λy, νy)) (101)
We rewrite the first term as a sum over the possible values of xand paX:
LΘ=X
xX
ylog(P(x|ηy))Q(ηy|˜
λy,˜νy)
N
X
n=1
Q(Xn=x)Q(paXn=y)
X
y
βD(Q(ηy|˜
λy,˜νy)||P(ηy|λy, νy)) (102)
and define the ”responsibilities” (because they measure how much a given setting
of the latent variables contributes to explaining a datapoint)
rx,y=
N
X
n=1
Q(Xn=x)Q(paXn=y) (103)
Using the definition of the exponential familiy distribution (eqn. 1), the Kullback-
Leibler divergence between conjugate p(oste)riors (eqn. 68) and denoting Q(ηy) =
Q(ηy|˜
λy,˜νy), we find
LΘ=X
xX
yhlog(h(x)) + hlog(g(ηy))iQ(ηy)+u(x)ThηyiQ(ηy)irx,y
X
y"βlog f(˜
λy,˜νy)
f(λy, νy)!β(˜νyνy)log(f(˜
λy,˜νy))
˜νy
+β(˜
λT
yλT
y)νyhηyiQ(ηy)#
(104)
The expectations hlog(g(ηy))iQ(ηy)can be expressed using eqn. 31:
hlog(g(ηy))iQ(ηy)=log(f(˜
λy,˜νy)
˜νy˜
λT
yhηyiQ(ηy)(105)
Inserting this expression into eqn. 104:
20
LΘ=X
xX
y"log(h(x)) log(f(˜
λy,˜νy)
˜νy˜
λT
yhηyiQ(ηy)+u(x)ThηyiQ(ηy)#rx,y
X
y"βlog f(˜
λy,˜νy)
f(λy, νy)!β(˜νyνy)log(f(˜
λy,˜νy))
˜νy
+β(˜
λT
yλT
y)νyhηyiQ(ηy)#
(106)
Collecting terms, we find:
LΘ=X
xX
y
log(h(x))rx,yX
y
βlog f(˜
λy,˜νy)
f(λy, νy)!
+X
y
log(f(˜
λy,˜νy)
˜νy"β(˜νyνy)X
x
rx,y#
X
y
νyβ˜
λT
yλT
y+˜
λT
yPxrx,y
νyPxrx,yu(x)T
νyhηyiQ(ηy)
(107)
With the expressions
ˆνy=νy+Px,yrx,y
β=νy1 + Pxrx,y
νyβ
ˆ
λy=νyλy+Pxrx,y
βu(x)
νy+Pxrx,y
β
=νy
ˆνy
λy+Pxrx,y
βu(x)
νy
(108)
and noting that
βˆνy=βνyX
x,y
rx,y(109)
νyβ˜
λT
y+˜
λT
yPxrx,y
νy=βνy˜
λT
y+˜
λT
yPxrx,y
νyβ
=βνy˜
λT
y1 + Pxrx,y
νyβ=βˆνy˜
λT
y(110)
νyβλT
yPxrx,yu(x)T
νy=νyβλT
y+Pxrx,yu(x)T
νyβ
=βˆνyˆ
λT
y(111)
we insert a zero by adding and subtracting the term Pyβlog f(˜
λy,˜νy)
f(ˆ
λy,ˆνy)
21
and write (using eqn. 68)
LΘ=X
xX
y
log(h(x))rx,yX
y
βlog f(˜
λy,˜νy)
f(λy, νy)!
+X
y
βlog f(˜
λy,˜νy)
f(ˆ
λy,ˆνy)!X
y
βlog f(˜
λy,˜νy)
f(ˆ
λy,ˆνy)!
+X
y
log(f(˜
λy,˜νy)
˜νy
β(˜νyˆνy)
X
y
βˆνy(˜
λT
yˆ
λT
y)hηyiQ(ηy)(112)
=X
xX
y
log(h(x))rx,yX
y
βlog f(ˆ
λy,ˆνy)
f(λy, νy)!
X
y
βD(Q(ηy|˜
λy,˜νy)||Q(ηy|ˆ
λy,ˆνy)) (113)
The first part of this expression is constant w.r.t. to the ˜
λy,˜νy, and the second
part is a sum of KL-divergences. To maximize LΘ, we therefore choose the
posterior parameters so that each KL-divergence is zero, i.e. ˜
λy=ˆ
λyand
˜νy= ˆνy:
variational posterior parameters
rx,y:=
N
X
n=1
Q(Xn=x)Q(paXn=y) (114)
˜νy:= νy+Pxrx,y
β(115)
˜
λy:= νyλy+Pxrx,y
βu(x)
νy+Pxrx,y
β
(116)
Comparing these learning rules to the conjugate update rules (eqns. 24,25),
we see that in the variational framework, a datapoint can be ”shared” between
different values of the latent variables in the model. This sharing comes about
because the responsibilities are (approximate) probability, rather than deter-
ministic 0s or 1s. Otherwise the rules are identical. Note in particular that if
rx,y {0,1}, i.e. the variables are known with certainty, then the variational
rules reduce to the exact update rules. Also, the β-variational update rules can
be obtained by dividing all responsibilities by β. Thus, if β > 1, the prior be-
comes ’stiffer’ and tends to ignore the data, whereas for β0, we get maximum
likelihood updates.
22
4 Frequently used special cases
This section contains frequently used conjugate pairs and relevant quanitities
computed thereof.
4.1 Bernoulli-Beta
For a discrete random variable x {0; 1}, where 1 is alternatively called ”suc-
cess” (e.g. when betting on coin tosses), is given by
P(x|q) = qx(1 q)(1x).(117)
Its canonical conjugate prior is a Beta distribution in qwith density
p(q|α, β) = 1
B(α, β)qα1(1 q)β1.(118)
To transform these expressions into the exponential family normal form (eqns.
1 and 22), introduce the logit
η= log q
1q(119)
whence q=1
1+exp(η), 1 q=1
1+exp(η)and dq
=exp(η)
(1+exp(η))2=q(1 q).
Substitute ηin eqn. (117):
P(x|η) = exp (xlog(q) + (1 x) log(1 q))
= exp xlog q
1q+ log(1 q)
= (1 q) exp(ηx) = 1
1 + exp(η)exp(ηx) (120)
Hence, h(x) = 1, g(η) = 1
1+exp(η)(cf. eqn. (1)) and
P(x|η) = h(x)g(η) exp(ηx).(121)
To transform the Beta density into exponential family normal form, note that
densities transform like p(η) = p(q(η))
dq
:
p(η|α, β) = 1
B(α, β)qα1(1 q)β1
dq
=1
B(α, β)q1(1 q)1exp (αlog(q) + βlog(1 q)) q(1 q)
=1
B(α, β)exp αlog q
1q+ (α+β) log(1 q)
=1
B(α, β)1
1 + exp(η)α+β
exp (αη) (122)
23
Letting ν:= α+β,λ:= α
ν,f(λ, ν) = 1
B(νλ,ν(1λ)) ,m(η) = 1 we find
p(η|λ, ν) = 1
B(νλ, ν(1 λ)) 1
1 + exp(η)ν
exp(νλη)
=f(λ, ν)m(η)g(η)νexp(νλη).(123)
Bernoulli distribution
standard form qx(1 q)1x
constraints x x {0,1}
constraints q q [0,1]
u(x)x
ηlog q
1q
q1
1+exp(η)
constraints η η R
g(η)1
1+exp(η)= 1 q
h(x) 1
hu(x)i1
1+exp(η)=q
Var(u(x)) exp(η)
(1+exp(η))2=q(1 q)
Beta distribution
standard form qα1(1q)β1
B(α,β)
constraints α, β α, β R+
λα
α+β
ν α +β
α νλ
β ν(1 λ)
constraints ν ν R+
constraints λ λ [0,1]
f(λ, ν)1
B(νλ,ν(1λ))
m(η) 1
hηiψ(νλ)ψ(ν(1 λ))
Var(η)ψ(νλ)ψ(ν(1λ))
ν2+ψ0(νλ) + ψ0(ν(1 λ))
log(f(λ,ν))
ν ψ(ν)λψ(νλ)(1 λ)ψ(ν(1 λ))
2log(f(λ,ν))
ν2ψ0(ν)λ2ψ0(νλ)(1 λ)2ψ0(ν(1 λ))
hg(η)i1λ=β
α+β= 1 hqi
Var(g(η)) (1λ)λ
ν+1 =
β
α+β
α
α+β
α+β+1
p(x|λ, ν)λx + (1 λ)(1 x) = xhqi+ (1 x)(1 hqi)
hu(x)ip(x|λ,ν)λ
Table 1: Bernoulli distribution and conjugate Beta prior
4.2 Multinomial-Dirichlet
The multinomial distribution is the generalization of the Bernoulli distribution
to Kpossible outcomes. It is convenient to represent multinomial random
24
variates xby vectors with Kcomponents, such that xk {0; 1}and PK
k=1 xk=
1, whence xK= 1 PK1
k=1 xk. This is called 1-of-K representation, because
exactly one component of xis 1 and all others are 0. Let q= (q1, . . . , qK) be
the probabilities of the Kpossible x, such that qK= 1 PK1
k=1 qk. Then, the
multinomial distribution can be written as
P(x|q) =
K1
Y
k=1
qxk
k 1
K1
X
k=1
qk!xK
(124)
This expression can be transformed into exponential family form via
P(x|q) = exp K1
X
k=1
xklog(qk) + 1
K1
X
k=1
xk!log 1
K1
X
k=1
qk!!
= 1
K1
X
k=1
qk!exp K1
X
k=1
xklog qk
1PK1
i=1 qi!! (125)
and by introducing the generalized logit ηk= log qk
1PK1
i=1 qiwe find that
k= 1, . . . , K 1 : qk=exp(ηk)
1 + PK1
i=1 exp(ηi)(126)
qK=1
1 + PK1
i=1 exp(ηi)(127)
(alternatively, we could fix ηK= 0, and let q= softmax(η)). Hence, h(x) = 1
(after xk {0; 1}has been enforced) and
g(η) = 1
K1
X
k=1
qk!=qK=1
1 + PK1
i=1 exp(ηi)
.
The standard conjugate prior on the multinomial is the Dirichlet distribu-
tion. Let α= (α1, . . . , αK) with αk0 and
M=QK
k=1 Γ(αk)
Γ(PK
k=1 αk)(128)
then the density of the Dirichlet distribution is
p(q|α) = M
K1
Y
k=1
qαk1
k 1
K1
X
k=1
qk!αK1
(129)
=Mexp K1
X
k=1
(αk1) log(qk)+(αK1) log 1
K1
X
k=1
qk!!
=M
K1
Y
k=1
q1
k(1
K1
X
k=1
qk)1exp K1
X
k=1
αklog qk
qK+
K
X
k=1
αKlog(qK)!
=M qPK
k=1 αK
K
K1
Y
k=1
q1
k(1
K1
X
k=1
qk)1exp αTη(130)
25
To substitute qby η, we need to compute the determinant of the Jacobian dq
dη.
Let Z= 1 + PK1
k=1 exp(ηk), i.e. qk=exp(ηk)
Z. Then
dq
dη=
exp(η1)
Zexp(2η1)
Z2. . . exp(η1) exp(ηK)
Z2
...
exp(ηK) exp(η1)
Z2. . . exp(ηK)
Zexp(2ηK)
Z2
=
q1. . . 0
0...0
0. . . qK
q1
.
.
.
qK
(q1. . . qK)
=
q1. . . 0
0...0
0. . . qK
1K×K
1
.
.
.
1
(q1. . . qK)
=
K1
Y
k=1
qk 1
K1
X
k=1
qk!(131)
And thus
p(η|α) = p(q(η)|α)
dq
dη
=M 1
1 + PK1
i=1 exp(ηi)!PK
k=1 αK
exp αTη(132)
With ν=PK
k=1 αk,λk=αk
ν,f(λ, ν) = M,m(η) = 1 and g(η) as above, we
find
p(η|λ, ν) = f(λ, ν)m(η)g(η)νexp(νηλ) (133)
26
multinomial distribution
standard form QK1
k=1 qxk
k1PK1
k=1 qkxK
constraints xxk {0,1},PK
k=1 xk= 1
constraints qqk[0,1], PK1
k=1 qk1
u(x)x
ηlog q
1PK1
i=1 qi
qexp(η)
1+PK1
i=1 exp(ηi)
constraints ηηkR
g(η)1
1+PK1
i=1 exp(ηi)= 1 PK1
i=1 qi=qK
h(x) 1
hu(x)iexp(η)
1+PK1
i=1 exp(ηi)=q
Cov(u(x)) Ckl =δkl exp(ηk)
1+PK1
i=1 exp(ηi)exp(ηk+ηl)
(1+PK1
i=1 exp(ηi))2=δklqlqlqk
Dirichlet distribution
standard form Γ(PK
k=1 αk)
QK
k=1 Γ(αk)QK1
k=1 qαk1
k1PK1
k=1 qkαK1
constraints ααkR+
λk= 1, . . . , K 1 : λk=αk
PK
i=1 αi
νPK
i=1 αi
α νλ
constraints ν ν R+
constraints λλi[0,1], PK1
i=1 λi1
f(λ, ν)Γ(ν)
Γ(ν(1PK1
k=1 λk))QK1
i=1 Γ(νλi)
m(η) 1
hηiψ(νλ)ψν(1 PK1
i=1 λk)
Cov(ηi, ηj)δi,jψ0(νλi) + ψ0ν(1 PK1
i=1 λk)
log(f(λ))
ν ψ(ν)(1 PK1
i=1 λi)ψν(1 PK1
i=1 λi)PK1
i=1 λiψ(νλi)
2log(f(λ))
ν2ψ0(ν)(1 PK1
i=1 λi)2ψ0ν(1 PK1
i=1 λi)PK1
i=1 λ2
iψ0(νλi)
hg(η)i1PK1
i=1 λi=αK
PK
i=1 αi=hqKi
Var(g(η)) (1PK1
i=1 λi)PK1
i=1 λi
ν+1 =hqKih1qKi
ν+1
p(x|λ, ν)PK1
i=1 xiλi+xK(1 PK1
i=1 λi) = PK
i=1 xihqii
hu(x)ip(x|λ)λ
Table 2: Multiomial distribution and conjugate Dirichlet prior
27
4.3 Multinomial-StickBreaking
The stick-breaking construction is another way of parameterizing multinomial
distributions. It has attracted a lot of attention in Machine Learning around
2005, because it is a convenient way of representing infinite multinomials with
a Dirichlet-process prior. To re-parameterize the distribution of a Kcompo-
nent multinomial variable x(in 1-of-K representation) with probabilities q,
introduce the variables v= (v1, . . . , vK1) such that
q1=v1(134)
q2= (1 v1)v2(135)
q3= (1 v1)(1 v2)v3(136)
.
.
.
qK1= (1 v1)(1 v2). . . vK1(137)
qK= (1 v1)(1 v2). . . (1 vK1).(138)
The distribution of xcan then be written as
p(x|v) =
K1
Y
i=1
vxi
i(1 vi)PK
j=i+1 xj(139)
We now introduce the sufficient statistics u(x)=(u0, . . . , uK1)
uk=
K
X
i=k+1
xi(140)
hence u0= 1. uk= 1 implies that xi= 1 with i>k. Eqn. 139 becomes
p(x|v) =
K1
Y
i=1
vui1ui
i(1vi)ui= exp K1
X
i=1
(ui1ui) log(vi) + uilog(1 vi)!
(141)
To transform this expression into exponential family normal form, rewrite the
exponent on the r.h.s. as
log(v1) +
K2
X
i=1
uilog 1vi
vi
vi+1+uK1log 1vK1
vK1(142)
and introduce the natural parameters η= (η1, . . . , ηK1):
ηK1= log 1vK1
vK1(143)
k=K2,...,1 : ηk= log 1vk
vk
vk+1(144)
28
Solving for vyields
vK1=1
1 + exp(ηK1)(145)
vK2=1
1 + exp(ηK2)(1 + exp(ηK1)) (146)
.
.
.
v1=1
1 + exp(η1)(1 + exp(η2)(1 + . . . (1 + exp(ηK1)) . . .)(147)
Thus
P(x|η) = v1(η) exp ηTu(x)(148)
which is in exponential family normal form, with h(x) = 1 and g(η) = v1(η).
The conjugate p(oste)rior on vis given by a product of Beta distributions.
Let α,βbe the parameters of these Beta distributions, then
p(v|α,β) =
K1
Y
i=1
1
B(αi, βi)vαi1
i(1 vi)βi1(149)
To derive the corresponding density in ηwe need the determinant of the
Jacobian dv
dη. Note that as a consequence of eqns. 145-147, the Jacobian is an
upper triangular matrix, hence the determinant is given by the product of the
diagonal entries:
dvi
i
=exp(ηi)(1 + exp(ηi+1)...)
(1 + exp(ηi)(1 + exp(ηi+1)(....)))2=vi(1 vi) (150)
dv
dη=
K1
Y
i=1
vi(1 vi).(151)
Now we can rewrite eqn. 149 in exponential form
p(v|α,β) =
K1
Y
i=1
v1
i(1 vi)1
B(αi, βi)exp K1
X
k=1
αilog(vi) + βilog(1 vi)!(152)
and rearrange the exponent as:
K1
X
k=1
αilog(vi) + βilog(1 vi)
=
K1
X
k=1
(αi+βi) log(vi) + βilog 1vi
vi(153)
=βK1log 1vK1
vK1+
K2
X
i=1
βilog 1vi
vi
vi+1
+(α1+β1) log(v1) +
K1
X
i=2
(αi+βiβi1)
| {z }
=:ci
log(vi) (154)
=βK1ηK1+
K2
X
i=1
βiηi+ (α1+β1) log(g(η)) +
K1
X
i=2
cilog(vi) (155)
29
Hence, letting N(α,β) = QK1
i=1 B(αi, βi)1we obtain for the density of η:
p(η|α,β) = p(v(η)|α, β)
dv
dη(156)
=N(α,β)
K1
Y
i=2
vi(η)cig(η)α1+β1exp ηTβ(157)
which is almost in exponential family normal form, except for:
QK1
i=2 vi(η)cishould be m(η), i.e. it must not depend on the data, hence
the cineed to be constant. Since αand βwill be updated on observation,
this requires that i= 2, . . . , K 1 : αi=ciβi+βi1. This implies
that αis determined by βup to the ci, except for α1.
ν:= α1+β1and λi:= βi
ν, hence α1=ν(1 λ1).
f(λ, ν) = N(α(λ, ν), νλ)
In the standard parametrization of the prior on the stick-breaking construc-
tion (eqn. 149), the αiare the pseudocounts on the number of instances where
xi= 1. To see that our definition
αi:= ciβi+βi1(158)
has the same meaning, assume that we had observed Ndatapoints x1:Nand
corresponding sufficient statistics u1:N. Using the exponential family update
rules (eqns. 24 and 25), we find for the posterior parameters
αN
1=νβN
1=α1+β1+Nβ1
N
X
n=1
un
1
=α1+N
N
X
n=1
un
1
=α1+
N
X
n=1
x1(159)
i= 2, . . . , K 1 : βN
i=βi+
N
X
n=1
un
i(160)
αN
i=ciβN
i+βN
i1
=ciβi+βi1+
N
X
n=1
(un
i1un
i)
=ci+βi1βi+
N
X
n=1
xn
i
=αi+
N
X
n=1
xn
i(161)
Hence, the meaning of αis preserved.
30
In this table, let
γk= 1 + exp(ηk)(1 + exp(ηk+1)(1 + . . . (1 + exp(ηK1)) . . .))
multinomial stick-breaking distribution
standard form QK1
k=1 vxk
k(1 vk)PK
i=k+1 xi
constraints xxk {0,1},PK
k=1 xk= 1
constraints vvk[0,1]
u(x)k= 1, . . . , K 1 : uk=PK
i=k+1 xi
ηηK1= log 1vK1
vK1,k < K 1 : ηk= log 1vk
vkvk+1
vvK1=1
1+exp(ηK1),k < K 1 : vi=1
1+ exp(ηi)
vi+1
constraints ηηkR
g(η)1
γ1=v1
h(x) 1
hu(x)i huki=exp(Pk
i=1 ηi)γk+1
γ1=Qk
i=1(1 vi)
Cov(u(x)) Ckl =expPmax(k,l)
i=1 ηiγmax(k,l)+1
γ1exp(Pk
i=1 ηi)γk+1 exp(Pk
i=1 ηi)γl+1
γ2
1
=Qmax(k,l)
i=1 (1 vi)Qk
i=1(1 vi)Ql
j=1(1 vj)
Table 3: The stick-breaking distribution for multinomial variables
31
stick-breaking prior
standard form QK1
i=1 1
B(αii)vαi1
i(1 vi)βi1
constraints α,βαk, βkR+
λβ
ν
ν α1+β1
constraints ci= 2, . . . , K 1 : ci> βiβi1
αα1=ν(1 λ1), i= 2, . . . , K 1 : αi=ciβi+βi1
constraints ν ν R+
constraints λλiR+
f(λ, ν)Γ(ν)
Γ(νλ1)Γ(ν(1λ1)) QK1
i=2
Γ(ci+νλi1)
Γ(νλi)Γ(ciνλi+νλi1)
m(η)QK1
i=2 vi(η)ci
i<K1 : hηiiψ(νλi)ψ(ciνλi+νλi1) + ψ(ci+1 νλi+1 +νλi)ψ(ci+1 +νλi)
=ψ(βi)ψ(αi) + ψ(αi+1)ψ(αi+1 +βi+1)
hηK1iψ(νλi)ψ(ciνλi+νλi1) = ψ(βi)ψ(αi)
Var(η1)ψ0(νλ1) + ψ0(ν(1 λ1)) ψ0(c2+νλ1) + ψ0(c2νλ2+νλ1)
=ψ0(β1) + ψ0(α1)ψ0(α2+β2) + ψ0(α2)
1< i < K 1 : Var(ηi)ψ0(νλi) + ψ0(ciνλi+νλi1)ψ0(ci+1 +νλi) + ψ0(ci+1 νλi+1 +νλi)
=ψ0(βi) + ψ0(αi)ψ0(αi+1 +βi+1) + ψ0(αi+1)
Cov(ηi, ηi+1)ψ0(ci+1 νλi+1 +νλi) = ψ0(αi+1)
Var(ηK1)ψ0(νλK1) + ψ0(cK1νλK1+νλK2) = ψ0(βK1) + ψ0(αK1)
log(f(λ))
ν
ψ(ν)λ1ψ(νλ1)(1 λ1)ψ(ν(1 λ1))
+PK1
i=2 [λi1ψ(ci+νλi1)λiψ(νλi)(λi+1 λi)ψ(ciνλi+νλi+1)]
2log(f(λ))
ν2
ψ0(ν)λ2
1ψ0(νλ1)(1 λ1)2ψ0(ν(1 λ1))
+PK1
i=2 λ2
i1ψ0(ci+νλi1)λ2
iψ0(νλi)(λi+1 λi)2ψ0(ciνλi+νλi+1)
hg(η)i1λ1=α1
ν
Var(g(η)) λ1(1λ1)
nu+1 =α1β1
ν2(1ν)
p(x|λ, ν)
x1= 1 : 1 λ1=α1
α1+β1
xi∈{2;...;K2}= 1 : λ1Qi1
j=2
νλj
cj+νλj1
ciνλi+νλi1
ci+νλi1=Pi1
j=1
βj
αj+βj
αi
αj+βj
xK= 1 : λ1QK1
j=2
νλj
cj+νλj1=QK1
j=1
βj
αj+βj
Table 4: The conjugate prior on the stick-breaking distribution for multinomial
variables
32
4.4 Poisson-Gamma
The Poisson distribution is a distribution over a univariate integer-valued ran-
dom variable x, e.g. spike count or radioactive decay events. In standard form,
it is given by
p(x|r) = rxexp(r)
x!(162)
where rR+
0is the rate. Its sufficient statistic and natural parameter are
u(x) = x(163)
η= log(r) (164)
and hence
p(x|η) = 1
Γ(x+ 1)
| {z }
h(x)
exp(exp(η))
| {z }
g(η)
exp(η u(x)).(165)
Note that for xN, Γ(x+ 1) = x!. The conjugate prior on ris a Gamma
distribution with density
p(r|α, S) = 1
Γ(α)Sαrα1exp r
S(166)
where αR+
0is the shape parameter, and SR+is the scale. To tranform
this into exponential family form, let
ν=1
Sλ=αS (167)
and note that
dr(η)
= exp(η) = r(η). Thus, we find
p(η|ν, λ) = ννλ
Γ(νλ)r(η)νλ1exp (νr(η)) r(η)
=ννλ
Γ(νλ)exp (νλ log(r(η))) exp (νr(η))
=ννλ
Γ(νλ)
|{z}
f(λ,ν)
exp(exp(η)))ν
| {z }
g(η)ν
exp(νλη) (168)
4.5 Multivariate Gaussian with Gauss-Wishart prior
The multivariate Gaussian is widely used, e.g. all finite-sized marginals of a
Gaussian process are Gaussian. But it is also a standard ingredient in parametric
models for regression, e.g. linear or other basis function. In the standard form, a
multivariate Gaussian density of a vector-valued random variable with variates
x,dim(x) = Dis parameterized by a mean vector µand a symmetric, positive
definite covariance matrix Σ:
p(x|µ,Σ) = 1
2πDp|Σ|
exp 0.5·(xµ)TΣ1(xµ)(169)
33
Poisson distribution
standard form rxexp(r)
x!
constraints xN0,rR+
0
u(x)x
ηlog(r)
constraints η η R
g(η) exp (exp(η))
h(x)1
Γ(x+1)
hu(x)iexp(η) = r
Var(u(x)) exp(η) = r
Gamma prior for Poisson-distributed RV
standard form 1
Γ(α)Sαrα1exp r
S
constraints αR+
0,SR+
λ αS
ν1
S
constraints νR+,λR+
0
f(λ, ν)ννλ
Γ(νλ)
m(η) 1
hηiψ(νλ)log(ν)
Var(η)ψ0(νλ)
log(f(λ))
ν λ(log(ν)+1ψ(νλ))
2log(f(λ))
ν2λ1
νλψ0(νλ)
hg(η)i1 + 1
νλν
Var(g(η)) 1 + 2
νλν 1 + 1
ν2λν
p(x|λ, ν)Γ(νλ+x)
Γ(νλ)Γ(x+1)
(ν+1)(νλ+x)
νλν
hu(x)ip(x|λ,ν)λ
Table 5: Poisson distribution and conjugate Gamma prior
It is often convenient to use the inverse of Σ, called precision matrix P=Σ1,
which is also symmetric (P=PT) and positive definite:
p(x|µ,P) = p|P|
2πDexp 1
2·(xµ)TP(xµ)(170)
To transform the Gaussian into the exponential family normal form, we rewrite
the exponent as
1
2·(xµ)TP(xµ) = 1
2xTP x +µTP x 1
2µTP µ (171)
Note that
xTP x =X
iX
j
xiPi,jxj=X
i
Pi,ix2
i+ 2 X
iX
j<i
Pi,jxixj
µTP x =X
i
(P µ)ixi(172)
34
We therefore introduce sufficient statistics and natural parameters comprised of
three parts: first, from the diagonal elements in eqn. 172, the row vectors
ud= (x2
1,x2
2,...,x2
D)T(173)
ηd=1
2(P1,1,P2,2,...,Pd,d)T.(174)
Second, we order the off-diagonal elements (with i < j) in some arbitrary fashion
(e.g. lectically) and construct the vectors
uc= (x2x1,x3x1,x3x2,...,xDxD1)T= lt(xxT) (175)
ηc=(P2,1,P3,1,P3,2,...,PD,D1)T=lt(P) (176)
i.e. ηccontains the lower triangle of P, and the lt() operator extracts the lower
triangle of a matrix, excluding the diagonal. Third, from eqn. 172:
uµ=x(177)
ηµ=P µ.(178)
We stack these vectors into the total sufficient statistics and natural parameters
u(x) =
ud
uc
uµ
(179)
η=
ηd
ηc
ηµ
.(180)
Noting that µTP µ =µTPTP1P µ =ηT
µP1ηµand P=P(η), eqn. 171
can be written as
ηTu(x)1
2µtP µ =ηT
dud+ηT
cuc
| {z }
1
2xTP x
+ηT
µuµ
|{z}
µTP x
1
2ηT
µP(η)1ηµ(181)
With these substitutions, the exponential family normal form of the multivariate
Gaussian is therefore
p(x|η) = p|P(η)|
2πDexp 1
2ηT
µP(η)1ηµ
| {z }
=:g(η)
exp ηTu(x)(182)
and thus
log(g(η)) = D
2log(2π) + 1
2log (|P(η)|)1
2ηT
µP(η)1ηµ.(183)
To compute moments, we need the gradient of this expression w.r.t. η, whose
components can be computed via the chain rule. Since
Pi,j
(ηd)k
=2δi,kδj,k (184)
Pi,j
(ηc)(k,l)
=δi,kδj,l δi,lδj,k (185)
35
and Pdoes not depend on ηµ, we find
log(g(η))
(ηµ)k
=1
2·2(P1ηµ)k=µk(186)
log(g(η))
(ηd)k
=X
i,j
1
2P1
i,j +P1ηµηT
µP1·(2)δi,kδj,k =P1
k,k µ2
k(187)
log(g(η))
(ηc)(k,l)
=X
i,j
1
2P1
i,j +P1ηµηT
µP1·(δi,kδj,l δi,lδj,k) = P1
k,l µkµl
(188)
which implies, by virtue of eqn. 4:
huµ(x)i=µ(189)
hud(x)i=diag(P1)µTµ(190)
huc(x)(kl)i=P1
k,l µkµl(191)
i.e. the well-known expressions for expectations of Gaussians. The second
derivatives (necessary for the evaluation of the gradient of the KL divergence)
are omitted here, they can be computed by automatic differentiation from the
above expressions, e.g. by Theano.
The prior on the parameters of the Gaussian is given by a (multivariate)
Gauss-Wishart distribution [1]. In standard form, it is given by
p(µ,P|β, µ0, γ, V) = p(µ|β, P)p(P|γ, V) (192)
p(µ|β, P) = βD
2|P|1
2
2πDexp 1
2β(µµ0)TP(µµ0)(193)
p(P|γ, V) = |P|γD1
2
2γD
2|V|γ
2ΓDγ
2exp 1
2tr(V1P)(194)
where ΓD(x) is the multivariate gamma function [3]:
ΓD(x) = πD(D1)
4
D
Y
j=1
Γx+1j
2(195)
We reparameterize as follows:
β=ν(196)
γ=ν+α, α > 0 and const. (197)
λµ=µ0(198)
B=V1(199)
(λd)i=Bi,i
ν+ (µ0)2
i(200)
(λc)(k,l)=Bk,l
ν+ (µ0)k(µ0)l=Bl,k
ν+ (µ0)k(µ0)l(201)
λ= (λµ,λd,λc)T(202)
36
Using these substitutions and µTP µ =ηT
µP1ηµ, the arguments of the expo-
nentials in eqn. 192 can be written as
ν1
2µTP µ +νµT
0P µ ν1
2µT
0P µ01
2tr(BP )
=ν1
2ηT
µP1ηµ+ηT
µµ01
2tr PB
ν+µ0µT
0
=ν1
2ηT
µP1ηµ+ηT
µλµ+ηT
dλds+ηT
cλc(203)
Thus eqn. 192 becomes
p(µ,P|ν, µ0,B) = 2π(ν1)DνD
2|B(λ)|ν+α
2
2(ν+α)D
2ΓDν+α
2|P|αD
2
× 1
2πD!ν
|P|ν
2exp 1
2ηT
µP1ηµν
×exp νηT
µλµ+νηT
dλd+νηT
cλc.(204)
To transform this into the desired exponential family form, we need to change
µ,Pinto the natural parameters of the multivariate Gaussian (see eqns. 174,176,178).
This can be accomplished by multiplying eqn.192 with the determinant of the
Jacobian Jof that transformation, which can be constructed as a block matrix
as follows: stack the diagonal elements of Pinto the vector Pdof dimen-
sionality D, the off-diagonal lower triangle into the vector Pcof dimensionality
Q=D(D1)
2. The Jacobian then has the following structure, which follows from
the definitions in eqns. 174,176,178 (rows and columns labelled with variable
names):
J=
µ1. . . µD(Pd)1. . . (Pd)D(Pc)1. . . (Pc)Q
(ηµ)1
.
.
.P10 0
(ηµ)D
(ηd)1
.
.
.µ
ηd2·1D0
(ηd)D
(ηc)1
.
.
.µ
ηc01Q
(ηc)Q
(205)
By virtue of eqns. 397 and 398 in [2], the absolute value of the determinant of
Jis therefore
|J|= 2D|P1(η)|(206)
Reparameterizing eqn. 204 in terms of ηand multiplying with this expression
37
yields (cf. eqn. 182 for the definition of g(η)):
p(η|ν, µ0,B)=2D2π(ν1)DνD
2|B(λ)|ν+α
2
2(ν+α)D
2ΓDν+α
2
| {z }
=:f(ν,λ)
|P|αD
21
| {z }
=:m(η)
× 1
2πD!ν
|P|ν
2exp 1
2ηT
µP1ηµν
| {z }
=g(η)ν
×exp νηT
µλµ+νηT
dλd+νηT
cλc
=f(ν, λ)m(η)g(η)νexp(νηTλ) (207)
which is the exponential family normal form of the Gauss-Wishart prior on the
parameters of the multivariate Gaussian.
To calculate expectations, we need the derivatives of log(f(ν, λ)):
log(f(ν, λ)) = C+Dν
2log(2π)+D
2log(ν)log ΓDν+α
2+ν+α
2log(|B(λ)|)
(208)
with C=Dlog(2) D
2log(2π)αD
2log(2). Thus
f(ν, λ)
ν =D
2log(2π) + D
2ν1
2ΨDν+α
2+1
2log(|B(λ)|) (209)
where ΨD(x) = logD(x))
x is the multivariate digamma function. For the
derivatives w.r.t. λ, note that eqns. 196-202 imply Bi,i =ν(2(λd)i(λmu)2
i
and i>j:Bij=ν((λc)i,j (λµ)i(λµ)j,j > i :Bij=ν((λc)j,i (λµ)i(λµ)j,
hence
Bi,j
(λµ)k
=ν((λµ)iδj,k + (λµ)jδi,k) (210)
Bi,j
(λd)k
=νδi,kδj,k (211)
Bi,j
(λc)k,l
=ν(δi,kδj,l +δi,lδj,k) (212)
With formula 57 from [2], we therefore find
log(f(ν, λ))
(λd)k
=X
m,n
ν+α
2
log(|B|)
Bm,n
Bm,n
(λd)k
=ν(ν+α)
2X
m,n
B1
m,nδm,kδn,k =ν(ν+α)B1
k,k (213)
log(f(ν, λ))
(λc)k,l
=ν(ν+α)B1
k,l (214)
log(f(ν, λ))
(λµ)k
=ν(ν+α)
2X
m,n
B1
m,n((λµ)mδn,k + (λµ)nδm,k)
=ν(ν+α)X
n
B1
k,n(λµ)n(215)
38
The expectations of the sufficient statistics of the multivariate Gaussian are thus
(lt(X): vectorized form of lower triangle of X)
hηµi=hP µi= (ν+α)B1(λ)λµ=γV µ0(216)
hηdi=1
2hdiag(P)i=(ν+α)
2diag(B1) (217)
hηci=−hlt(P)i=(ν+α)lt(B1) (218)
hPi=γV(219)
Second derivatives can be obtained similarly, or more easily by automatic dif-
ferentiation.
The expectation of the sufficient statistics u(x) can be computed from eqn.
57. We already computed λ(eqn. 202). The surface integral vanishes, because
p(x|λ, ν)0 as λµ , since the density must be normalized. Furthermore,
in the subspace where |P|= 0, the density is also zero if γ > D 1. Since m(η)
does not depend on µ0, we see that
huµ(x)ip(x|λ,µ)=λµ=µ0(220)
For ud,uc, it remains to evaluate the expectation
h∇ηlog(m(η))ip(x|λ,µ)=2αD
21P1p(x|λ,µ)
=(αD2) V1
ν+αD1(221)
where the factor 2 results from the differentation of the elements of Pw.r.t. η,
see also the Jacobian above. The entries of V1have to be suitably rearranged
to match the entries of η. Thus we find
hud(x)ip(x|λ,µ)= diag(µ0µT
0) + diag(V1)
ναD2
ν
diag(V1)
ν+αD1
= diag(µ0µT
0) + ν+ 1
ν
diag(V1)
ν+αD1(222)
and likewise for uc.
4.5.1 Univariate Gauss-Gauss-Gamma
trivial, see above.
5 Random identities that might be useful
Decomposition of Kullback-Leibler divergence for multivariate Gaussians: let
p(x1,x2|µ1,µ2,K11,K21,K22) = N(x1,x2)|(µ1,µ2),K11 KT
21
K21 K22 
(223)
39
Multivariate Gaussian distribution
standard form 1
2πD|Σ|exp 0.5·(xµ)TΣ1(xµ)
constraints x,µRD,Σpositive semidefinite and symmetric
u(x)ud= diag(xxT),uc= lt(xxT),uµ=x
η ηd=1
2diag(P), ηc=lt(P), ηµ=P µ
constraints ηηµRD,ηd0, ηcs.t. Ppos.semidef.
g(η)|P(η)|
2πDexp 1
2ηT
µP(η)1ηµ
h(x) 1
hu(x)i huµi=µ,hudi=diag(P1+µµT),huci=lt(P1+µµT)
Gauss-Wishart prior for multivariate Gaussian RV
standard form βD
2|P|1
2
2πDexp 1
2β(µµ0)TP(µµ0)|P|γD1
2
2γD
2|V|γ
2ΓD(γ
2)exp 1
2tr(V1P)
constraints µ,µ0RD,P,Vpos.semidef., β > 0,γ > D 1
λ B =V1,λd= diag B
ν+µ0µT
0,λc= lt B
ν+µ0µT
0,λµ=µ0
ν ν =β,γ=ν+α
constraints νR+,αconst. and s.t. ν+α > D 1
f(ν, λ) 2D(1α)
2π(ν1)DνD
2|B(λ)|ν+α
2
ΓD(ν+α
2)
m(η)|P(η)|αD
21
hηi hηµi= (ν+α)B1(λ)λµ,hηdi=(ν+α)diag(B1), hηci=(ν+α)lt(B1)
log(f(λ))
ν
D
2log(π)D
2ν1
2ΨDν+α
2+1
2log(|B(λ)|)
p(x|λ, ν)ΓD(ν+α+1
2)
πΓD(ν+α
2)ν
ν+1 D
2|B(λ)|ν+α
2
|B(νλ+u(x)
ν+1 )|ν+α+1
2
huµ(x)ip(x|λ)λu=µ0
hud(x)ip(x|λ)λdαD2
ν
diag(V1)
ν+αD1= diag(µ0µT
0) + ν+1
νdiag(V1)
huc(x)ip(x|λ)λcαD2
ν
lt(V1)
ν+αD1= lt(µ0µT
0) + ν+1
νlt(V1)
Table 6: Multivariate Gaussian distribution and conjugate Gauss-Wishart prior
be a joint multivariate Gaussian on x1,x2. The conditional distribution of x2
given x1is then also multivariate Gaussian (see [2]):
M21 =K21K1
11 (224)
K2|1=K22 M21K1
11 MT
21 (225)
p(x1|x2) = Nx2|µ2+M21(x1µ1),K2|1(226)
Assume now we had a variational posterior for x1,x2, which we decompose in
the following way (tilde indicates variational parameters):
q(x1,x2) = q(x2|x1)q(x1) = Nx2|˜
µ2+˜
M21x1,˜
K2|1Nx1|˜
µ1,˜
K1
(227)
which is a generalized form of the conditional Gaussian above, because ˜
µ2|1,˜
M21,˜
K2|1
are now free variational parameters. Note the following decomposition property
40
of the KL-divergence, which follows directly from its definition:
D(q(x1,x2)|p(x1,x2)) = hD(q(x2|x1)|p(x2|x1))iq(x1) +D(q(x1)|p(x1))
(228)
Using the above distributions, the second term on the right hand side is given
by the usual expression for the KL-divergence between multivariate Gaussians:
D(q(x1)|p(x1)) = 1
2tr hK1
1˜
K1i+ (˜
µ1µ1)TK1
1(˜
µ1µ1)
dim[x1] + log(|K1|)log(|˜
K1|)(229)
and the first term is
hD(q(x2|x1)|p(x2|x1))iq(x1)
=1
2tr hK1
2|1˜
K2|1i+
+˜
µ2+˜
M2|1˜
µ1(µ2+M2|1(˜
µ1µ1))T
K1
1˜
µ2+˜
M2|1˜
µ1(µ2+M2|1(˜
µ1µ1))
dim[x2] + log(|K2|1|)log(|˜
K2|1|)
+tr h(˜
M2|1M2|1)TK1
2|1(˜
M2|1M2|1)˜
K2|1i (230)
This expression is zero iff: ˜
M2|1=M2|1,K2|1=˜
K2|1,˜
µ1=µ1,˜
µ2=µ2
M2|1µ1. Note that a zero expectation can not be achieved if we had made the
Ansatz q(x2|x1) = Nx2|˜
µ2|1,˜
K2|1, because the KL-divergence is positive,
thus for its expectation to be zero, it has to be zero for all values of x1which
requires an explicit representation of the mean projection matrix M.
References
[1] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, New
York, 2007.
[2] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook. Version
20121115, http://www.math.uwaterloo.ca/ hwolkowi/matrixcookbook.pdf
[3] A.T. James. Distributions of Matrix Variates and Latent Roots Derived
from Normal Samples. Ann. Math. Statist. 35 (1964), no. 2, 475–501.
41