arXiv:2505.17856v1 [cs.LG] 23 May 2025
Stochastic Weight Sharing for Bayesian Neural Networks
Moule Lin (Lero, Trinity College Dublin)
Shuhao Guan (University College Dublin)
Weipeng Jing (Northeast Forestry University)
Goetz Botterweck (Lero, Trinity College Dublin)
Andrea Patane (Lero, Trinity College Dublin)
Abstract
While offering a principled framework for uncertainty quantification in deep learning, the employment of Bayesian Neural Networks (BNNs) is still constrained by their increased computational requirements and the convergence difficulties when training very deep, state-of-the-art architectures. In this work, we reinterpret weight-sharing quantization techniques from a stochastic perspective in the context of training and inference with BNNs. Specifically, we leverage 2D-adaptive Gaussian distributions, Wasserstein distance estimations, and alpha-blending to encode the stochastic behaviour of a BNN in a lower-dimensional, soft Gaussian representation. Through extensive empirical investigation, we demonstrate that our approach significantly reduces the computational overhead inherent in Bayesian learning by several orders of magnitude, enabling the efficient Bayesian training of large-scale models, such as ResNet-101 and Vision Transformer (ViT). On various computer vision benchmarks (including CIFAR-10, CIFAR-100, and ImageNet1k) our approach compresses model parameters by approximately 50× and reduces model size by 75%, while achieving accuracy and uncertainty estimations comparable to the state-of-the-art.
1 INTRODUCTION
Bayesian Neural Networks (BNNs) promise to combine the representational capacity of deep learning with principled uncertainty estimations enabled by means of Bayesian learning theory (Hinton and Neal, 1995). Arguably, this combination makes them particularly appealing to safety-critical machine learning applications where the quantification of uncertainty is of paramount importance (Forsberg et al., 2020). Indeed, they have been widely employed in scenarios like e-Health (Marcos et al., 2010), robust control (Wicker et al., 2024), autonomous driving (Michelmore et al., 2020), human-in-the-loop applications (Treiss et al., 2021), automated diagnosis (Billah and Javed, 2022) and many others (Lampinen and Vehtari, 2001; Bharadiya, 2023; Vehtari and Lampinen, 1999).

Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025, Mai Khao, Thailand. PMLR: Volume 258. Copyright 2025 by the author(s).
Unfortunately, though, the principled treatment of uncertainty comes at the price of an increased pressure on computational resources, including the model size (×2 in the common case of mean-field Variational Inference (Blundell et al., 2015)), and the inference time, increased by an order of magnitude as multiple forward passes are needed (Hinton and Neal, 1995). Therefore, despite their potential, the use of BNNs in edge-AI and resource-constrained applications is still very limited (Bonnet et al., 2023). While recent works have investigated the development of techniques to tackle the aforementioned challenges, these are generally limited to the application of methods originally developed for deterministic neural networks (NNs) (Ferianc et al., 2021; Park et al., 2021; Chien and Chang, 2023), or the usage of the Bayesian paradigm at training time for model-order reduction purposes but without the uncertainty estimation at inference time (Van Baalen et al., 2020; Guo, 2018; Perrin et al., 2024; Subia-Waud and Dasmahapatra, 2024).
In this work, we present a quantisation technique specifically tailored to capture the stochastic behaviour of BNNs. More specifically, we design a stochastic weight-sharing quantisation method, called 2DGBNN, based on dynamically adaptive mini-batch 2D Gaussian Mixture Models and predicated on optimising weight distributions through metrics derived from Wasserstein distances (Chizat et al., 2020; De Palma et al., 2021), network gradients, and intra-class variance. Our technique works by reinterpreting standard weight-sharing (Subia-Waud and Dasmahapatra, 2024) from a 2D perspective, accounting for both the mean and variance of the BNN's parameters in the case of mean-field Variational Inference (VI), and by giving it a stochastic semantic. At each training step, the current sum-total of network parameters is clustered using standard parameter-free techniques, and a mini-batch approach is used for the estimation of Gaussian distributions on a parameter space accounting for (possibly) millions of parameters. Representatives of each cluster are then selected, and alpha-blending techniques are used for sampling parameter realisations during forward passes through the network architecture. Thanks to its simplicity, the method can be seamlessly integrated into commonly used Bayesian approximation techniques based on Variational Inference (Blundell et al., 2015; Gal and Ghahramani, 2016; Minka, 2001; Welling and Teh, 2011).
We perform an extensive empirical investigation on the effectiveness of our method in training large-scale models and in reducing the computational footprint of BNNs. We utilize four widely used image classification datasets (i.e., MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009) and ImageNet1K (Deng et al., 2009)) and test the model results in four widely employed neural network architectures, including ResNet-18, ResNet-50, ResNet-101 (He et al., 2016) and Vision Transformer (ViT) (Dosovitskiy et al., 2021). Through our approach, we can reduce the number of trainable parameters in the network by up to 50×, while obtaining accuracy and uncertainty metrics comparable to the state-of-the-art.¹
This paper makes the following main contributions:
• We introduce a stochastic weight-sharing technique specifically tailored to BNNs that employs Wasserstein distance, gradients, and within-class variance in order to improve model efficiency.
• We empirically demonstrate that our stochastic weight-sharing method compares favourably against quantisation methods employed for BNNs in terms of further reducing the computational requirements and better preserving accuracy and uncertainty metrics.
• In a variety of architectures, including ResNet-18, ResNet-50, ResNet-101 and ViT, we show how our technique can reduce model size to as little as a quarter of the original size, while working on par with state-of-the-art techniques for large-scale approximate BNN training.
2 RELATED WORK
Computational efficiency is one of the long-standing issues concerning the application of BNNs to edge-AI and embedded systems (Bonnet et al., 2023). Indeed, there is a vast section of the literature that aims to tackle the issue from a variety of different angles. One such closely related area is that of quantisation of the BNN's parameters (Guo, 2018), where the precision of the latter is reduced to minimise their computational footprint. Works have adapted techniques initially developed for deterministic NNs (Ferianc et al., 2021; Subedar et al., 2021; Lin et al., 2023; Dong et al., 2022; Ullrich et al., 2017) or developed new techniques that take into account the distributional behaviour of BNNs' parameters (Chien and Chang, 2023; Park et al., 2021; Yang et al., 2020a). While these techniques increase the computational efficiency of a given BNN architecture, by developing a weight-sharing technique (and therefore not only quantising but also in effect reducing the number of BNN parameters) the method we introduce is able to match their behaviour in standard BNN benchmarks, while at the same time allowing for the training of large-scale models (Hernández-Lobato and Adams, 2015).

¹ To support reproducibility, our code is available at https://github.com/moulelin/2DGBNN
Several works have looked at reducing the number of parameters in BNNs, either by applying pruning techniques (Sharma and Jennings, 2021; Beckers et al., 2023; Roth and Pernkopf, 2018) or using low-rank approximations (Doan et al., 2024; Dusenberry et al., 2020; Swiatkowski et al., 2020). The latter work by reparameterising BNN weights and biases using a lower rank representation, enabling them to scale BNN inference to large models such as ResNet-50 (Dusenberry et al., 2020) and ViT (Doan et al., 2024) architectures, and are as such closely related to our work in that a smaller common representation is found for BNN parameters. The final number of parameters is, however, still generally higher than that of the full deterministic counterpart, so these methods cannot be used for edge-AI applications. In Section 5, we will observe that, albeit at the price of a small reduction in accuracy, our method reduces the number of parameters by 3 orders of magnitude.
Finally, a number of works have looked into applying Bayesian techniques for quantisation of deterministic NNs (Subia-Waud and Dasmahapatra, 2024; Louizos et al., 2017; Van Baalen et al., 2020; Perrin et al., 2024; Achterhold et al., 2018; Soudry et al., 2014; Yang et al., 2020b), including the application of weight-sharing techniques (Roth and Pernkopf, 2018; Subia-Waud and Dasmahapatra, 2024; Nowlan and Hinton, 2018). While these techniques provide encouraging results on the suitability of Bayesian theory for increasing the efficiency of deep learning, being specifically tailored to deterministic neural networks they cannot be applied to BNNs.
3 BAYESIAN NEURAL NETWORKS
We consider a neural network architecture $f_{\mathbf{w}} : \mathbb{R}^n \to \mathbb{R}^m$ parameterised by a vector of weights and biases $\mathbf{w} \in \mathbb{R}^{n_{\mathbf{w}}}$, which we refer to collectively as parameters of the neural network. Bayesian learning of neural networks begins by placing a prior distribution, $p(\mathbf{w})$, over the network's parameters. This is often assumed to be encoded through a vector of independent Gaussian distributions, one for each weight and bias in the BNN (Blundell et al., 2015). This prior belief is then updated given a dataset's evidence through the application of the Bayesian learning rule. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ denote the full training dataset, $\mathbf{X} = (x_1, \ldots, x_n)$ the combined vector of training inputs and $\mathbf{y} = (y_1, \ldots, y_n)$ their corresponding outputs, then the posterior distribution on the weights is computed as:

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})}, \tag{1}$$

where $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})$ is the likelihood and $p(\mathbf{y} \mid \mathbf{X})$ is the model evidence. Finally, given a test point $x^*$, the BNN's posterior predictive distribution on $x^*$ is defined by:

$$p(y^* \mid x^*, \mathbf{X}, \mathbf{y}) = \int p(y^* \mid x^*, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})\, d\mathbf{w}. \tag{2}$$
Unfortunately, neither Equation (1) nor Equation (2) can generally be computed exactly (Hinton and Neal, 1995). Therefore a variety of approximate Bayesian inference techniques have been developed in the literature, with the two most prominent classes of approaches being based on Monte Carlo algorithms (Hinton and Neal, 1995) or on Variational Inference methods (Blundell et al., 2015). While the former provides the gold standard in terms of approximation accuracy in small architectures, Variational Inference (VI) guarantees better scalability and is therefore the focus of this paper. Briefly, VI works by approximating the true posterior, $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})$, with a simpler, parameterised distribution, $q(\mathbf{w})$, chosen by minimising the KL divergence between the two; $q(\mathbf{w})$ is often a multidimensional Gaussian distribution with diagonal covariance. The predictive distribution of Equation (2) is then approximated by sampling multiple times from $q(\mathbf{w})$ and averaging the results.
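The sampling-and-averaging approximation of Equation (2) can be illustrated with a minimal sketch. It assumes a toy one-weight logistic model, $p(y = 1 \mid x, w) = \mathrm{sigmoid}(w x)$, and a one-dimensional Gaussian $q(w)$; the function name `predictive_mean` and the model are illustrative assumptions, not part of the paper.

```python
import math
import random


def predictive_mean(x, mu, sigma, n_samples=1000, seed=0):
    """Monte Carlo estimate of p(y=1 | x) under q(w) = N(mu, sigma^2).

    Toy likelihood (an assumption for illustration only):
    p(y=1 | x, w) = sigmoid(w * x).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        w = rng.gauss(mu, sigma)                   # one draw from q(w)
        total += 1.0 / (1.0 + math.exp(-w * x))    # one forward pass
    return total / n_samples                       # average over the draws
```

Each sample costs one forward pass, which is why the inference time of a BNN grows with the number of draws.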
Despite the approximation, however, VI BNNs still come with several limitations that impede their deployment in practice. First, even in the case of diagonal Gaussian distributions, the number of parameters in the BNN is doubled compared to their deterministic counterpart. Furthermore, approximating the predictive distribution requires multiple sampling procedures and multiple forward passes through the network, so that their computational time is orders of magnitude higher than, again, their deterministic counterpart. Finally, despite its greater flexibility, standard Variational Inference struggles to learn very deep BNNs and is generally limited to more traditional architectures and small-to-medium-size datasets. In the following, we develop a weight-sharing quantisation scheme targeted at BNNs to tackle these limitations.
4 2D GAUSSIAN BAYESIAN NEURAL NETWORK
Consider the vector $\mathbf{w}$ of the BNN's weights, where, at each step of the training process, each weight $w_i$, $i = 1, \ldots, n_{\mathbf{w}}$, is distributed according to a given Gaussian distribution $\mathcal{N}(\mu_{w_i}, \sigma_{w_i})$. We denote with $\mathcal{G}_{full}$ the full set of weight distributions. Our stochastic weight-sharing technique aims at finding a set of 2D Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1), \ldots, \mathcal{N}(\mu_k, \Sigma_k)$ (which we collectively denote as $\mathcal{G}_{ws}$),² with $k \ll n_{\mathbf{w}}$, and such that $\mathcal{G}_{ws}$ can be used to approximate the behaviour of $\mathcal{G}_{full}$ in terms of resulting accuracy and uncertainty.
Briefly, we do this by first modelling all the hyperparameters of the $\mathcal{G}_{full}$ distributions through a Gaussian Mixture Model (GMM) (Section 4.2), and then applying alpha-blending to sample weights from the resulting realisations of the GMM (Section 4.3). Additionally, 2DGBNN implements several steps informed by best practice in quantisation of deterministic NNs for further reducing the number of shared weights, including outlier detection (Section 4.1), cluster dimensionality reduction and the merging of similar distributions (Section 4.2). Finally, we will present the overall algorithm for 2DGBNN in Section 4.4.
4.1 Outliers vs. Inliers Classification
The key observation behind the weight-classification stage of our algorithm is that not all the weights of a neural network have an equal impact on the output. Taking inspiration from quantisation techniques for deterministic neural networks (Subedar et al., 2021), we therefore do not quantise extreme values in the BNN, as those are likely to be particularly influential in the final result. Specifically, we partition the full weight vector $\mathbf{w}$ into two separate vectors, $\mathbf{w}_{in}$ and $\mathbf{w}_{out}$, and only apply weight-sharing to the former. We do this by using two different criteria.
Mean Threshold: Weights associated with a mean with an absolute value greater than a threshold $\tau$ (e.g., $\tau = 0.2$) are classified as outliers, i.e.:

$$w_i \in \mathbf{w} \text{ is an outlier if } |\mu_i| > \tau.$$

A discussion of how we chose parameters like $\tau$ is provided in Appendix E.
Gradient Threshold: Weights associated with gradient magnitudes exceeding the threshold that places them within the top 1% during backpropagation are also categorized as outliers, as their high gradient values likely signify their substantial impact on model performance, i.e.:

$$w_i \in \mathbf{w} \text{ is an outlier if } |\nabla w_i| \text{ is in the top 1\% of } \{|\nabla w_j|\}_{j=1}^{n_{\mathbf{w}}}.$$

² As standard we use $\sigma$ to denote the one-dimensional standard deviation in the case of a 1D Gaussian, and $\Sigma$ to denote the multidimensional covariance in the case of a multidimensional Gaussian.
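The two criteria above can be sketched as a simple partition over per-weight statistics. This is a minimal illustration, assuming the posterior means and gradient magnitudes are available as flat Python lists; the function name `split_outliers`, and the choice to always flag at least one weight under the gradient criterion, are assumptions made here rather than details from the paper.

```python
def split_outliers(means, grads, tau=0.2, top_frac=0.01):
    """Partition weight indices into (inliers, outliers).

    A weight is an outlier if |mean| > tau (mean threshold) or if its
    gradient magnitude is among the top `top_frac` fraction of all
    weights (gradient threshold).
    """
    n = len(means)
    # Assumption: keep at least one weight in the "top" set for tiny inputs.
    n_top = max(1, int(top_frac * n))
    by_grad = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    top_grad = set(by_grad[:n_top])

    outliers = [i for i in range(n) if abs(means[i]) > tau or i in top_grad]
    outlier_set = set(outliers)
    inliers = [i for i in range(n) if i not in outlier_set]
    return inliers, outliers
```

Only the indices returned as inliers would then be passed on to the weight-sharing stage.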
Algorithm 1 2DGBNN
Input: NN architecture $f_{\mathbf{w}}$; training data $\mathcal{D} = \{(\mathbf{X}, \mathbf{y})\}$; algorithm thresholds $\tau_w, \tau_d, \tau_g, \tau_v$; BNN prior $p(\mathbf{w})$.
Output: Stochastic weight-sharing trained BNN.
Stage 1: Initialise GMM
 1: Initialise $\mu, \sigma$ according to $p(\mathbf{w})$
 2: for each epoch do  ⊳ Pre-training
 3:   Sample weights: $\mathbf{w} = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
 4:   Update $\mu, \sigma$ by training on $\mathcal{D}$
 5: end for
 6: if $|w_i| > \tau_w$ or $|\nabla w_i|$ in top 1% then $w_i$ is outlier
 7: else $w_i$ is inlier  ⊳ §4.1
 8: end if
 9: Learn GMM on inlier params $\Theta_{in}$ (Equation (3))
Stage 2: Refine GMM
10: for each inlier weight $w_i$ do
11:   Perform §4.3 check on Mahalanobis distance
12:   if outside 95th percentile then
13:     $w_i$ is assigned to multiple clusters.
14:   else $w_i$ is assigned only to the closest Gaussian.
15:   end if
16: end for
17: Apply alpha-blending to ellipse points (Eq. 7)
18: repeat
19:   for each pair $(\mathcal{G}_1, \mathcal{G}_2)$ in GMM do
20:     if $W(\mathcal{G}_1, \mathcal{G}_2) < \tau_d$, $\Delta g < \tau_g$, $\Delta v < \tau_v$ then
21:       Merge $\mathcal{G}_1, \mathcal{G}_2$ using Eqs. (5), (6)
22:     end if
23:   end for
24: until no more Gaussians can be merged
Stage 3: Final BNN Training
25: for each epoch do
26:   for each weight $w_i$ do
27:     if $w_i$ is inlier then
28:       Sample $w_i \sim \sum_{k=1}^{K} \pi_k \mathcal{N}(\mu_k, \Sigma_k)$
29:     else
30:       Use $w_i \sim \mathcal{N}(\mu_{w_i}, \sigma^2_{w_i})$
31:     end if
32:   end for
33:   Minimising step of $\hat{\mathcal{L}}(\mathcal{D}, q)$ of Eq. (10)
34: end for
4.2 2DGBNN training
Given the set of distributions of the inlier weights, which we denote as $\mathcal{G}_{in}$, we proceed by clustering their means and variances on the 2-dimensional $\mu$-$\sigma$ plane. We do this by learning a Gaussian Mixture Model (GMM) of the form $p((\mu, \sigma)) = \sum_{k=1}^{K} \pi_k \mathcal{N}((\mu, \sigma) \mid \mu_k, \Sigma_k)$ over the set of points $\Theta_{in} = \{(\mu_{w_i}, \sigma_{w_i})\}_{i=1}^{n_{\mathbf{w}_{in}}}$. Due to the large volume of points involved in the learning of the GMM (typically, millions of weights), we rely on mini-batch learning of GMMs (Li et al., 2014). This is achieved by sampling random mini-batches $\mathcal{B} \subset \Theta_{in}$, and iteratively minimising the negative log-likelihood over the mini-batch:

$$\min \; -\sum_{(\mu_i, \sigma_i) \in \mathcal{B}} \log \Big( \sum_{k=1}^{K} \pi_k\, \mathcal{N}\big((\mu_i, \sigma_i) \mid \mu_k, \Sigma_k\big) \Big). \tag{3}$$
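As a rough illustration of the mini-batch objective in Equation (3), the sketch below evaluates the negative log-likelihood of a 2D Gaussian mixture on a randomly drawn mini-batch of $(\mu, \sigma)$ points. It is written with the standard library only; the function names, the list-of-tuples data layout and the explicit 2×2 algebra are assumptions made for this example, and the optimisation step itself (e.g. a stochastic EM or gradient update) is not shown.

```python
import math
import random


def gauss2d_pdf(p, mean, cov):
    """Density of a 2D Gaussian at point p, using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def minibatch_nll(points, weights, means, covs, batch_size, rng):
    """Negative log-likelihood of the mixture on one random mini-batch."""
    batch = rng.sample(points, min(batch_size, len(points)))
    nll = 0.0
    for p in batch:
        mix = sum(w * gauss2d_pdf(p, m, c)
                  for w, m, c in zip(weights, means, covs))
        nll -= math.log(mix)
    return nll
```

Because each evaluation touches only `batch_size` points, the cost per step is independent of the total number of inlier weights.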
After the initial GMM learning, we perform two further reduction steps based on the number of points around each Gaussian, and on the distance between pairs of Gaussians.

Cluster Size Reductions: During the initial Gaussian clustering stage, clusters associated with fewer than 30 weights are identified. Weights within these small clusters are treated as outliers due to their lack of representation within the broader weight distribution. Overall, we empirically find that approximately 1.8% of the total weights of a neural network are generally allocated as outliers.
Merging Gaussians: We merge together Gaussian distributions that are very close to each other. We do this by relying on the distance between Gaussians and their gradients. Specifically, we compute the Wasserstein-2 distance between pairs of distributions as (Jacobs et al., 2023):

$$W_2\big(\mathcal{N}(\mu_i, \Sigma_i), \mathcal{N}(\mu_j, \Sigma_j)\big)^2 = \|\mu_i - \mu_j\|_2^2 + \mathrm{Tr}\Big(\Sigma_i + \Sigma_j - 2\big(\Sigma_i^{1/2} \Sigma_j \Sigma_i^{1/2}\big)^{1/2}\Big) \tag{4}$$
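For 2D Gaussians, the squared Wasserstein-2 distance of Equation (4) can be evaluated without an explicit matrix square root. The sketch below relies on the fact that $\Sigma_i^{1/2} \Sigma_j \Sigma_i^{1/2}$ has the same eigenvalues as $\Sigma_i \Sigma_j$, so for a 2×2 matrix with eigenvalues $\lambda_1, \lambda_2$ the trace of its square root is $\sqrt{\lambda_1} + \sqrt{\lambda_2} = \sqrt{\mathrm{Tr} + 2\sqrt{\det}}$. The function name and the tuple-based layout are assumptions for this example.

```python
import math


def w2_squared(mu_i, cov_i, mu_j, cov_j):
    """Squared Wasserstein-2 distance between two 2D Gaussians (Eq. 4)."""
    (a, b), (c, d) = cov_i
    (e, f), (g, h) = cov_j
    mean_term = (mu_i[0] - mu_j[0]) ** 2 + (mu_i[1] - mu_j[1]) ** 2
    # Trace and determinant of the product cov_i @ cov_j.
    tr_prod = a * e + b * g + c * f + d * h
    det_prod = (a * d - b * c) * (e * h - f * g)
    # Tr((cov_i^{1/2} cov_j cov_i^{1/2})^{1/2}) for the 2x2 case.
    cross = math.sqrt(tr_prod + 2.0 * math.sqrt(max(det_prod, 0.0)))
    return mean_term + (a + d) + (e + h) - 2.0 * cross
```

For two isotropic Gaussians the covariance term reduces to the sum of squared differences of the per-axis standard deviations, which gives a quick sanity check.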
If the distance between two Gaussians is less than a given threshold $\gamma$, then we inspect the gradient of the network in the weight associated with the cluster centroid, and its variance. If those are smaller than two given thresholds $\sigma$ and $\alpha$, then we proceed by merging the two Gaussians into one.³ (These three thresholds play the role of $\tau_d$, $\tau_g$ and $\tau_v$ in Algorithm 1.) The merge is executed using the following equations (Agueh and Carlier, 2011; Takatsu, 2011):

$$\mu_{merged} = \frac{\mu_1 + \mu_2}{2} \tag{5}$$

$$\Sigma_{merged} = \frac{\Sigma_1 + \Sigma_2}{2} + \frac{1}{8} (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T + \frac{1}{2} \big(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\big)^{1/2} \tag{6}$$

This ensures that the newly formed Gaussian component accurately reflects the collective distribution characteristics of the initial components while maintaining minimal internal variation.
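A minimal sketch of the merge step follows. It evaluates Equations (5) and (6) exactly as printed, for 2D Gaussians stored as nested lists, and uses the closed-form principal square root of a 2×2 symmetric positive-definite matrix, $\sqrt{M} = (M + \sqrt{\det M}\, I) / \sqrt{\mathrm{Tr}\, M + 2\sqrt{\det M}}$. The helper names are hypothetical and not taken from the authors' code.

```python
import math


def matmul2(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def sqrtm2(M):
    """Principal square root of a 2x2 symmetric positive-definite matrix."""
    s = math.sqrt(M[0][0] * M[1][1] - M[0][1] * M[1][0])
    t = math.sqrt(M[0][0] + M[1][1] + 2.0 * s)
    return [[(M[0][0] + s) / t, M[0][1] / t],
            [M[1][0] / t, (M[1][1] + s) / t]]


def merge_gaussians(mu1, cov1, mu2, cov2):
    """Merge two 2D Gaussians following Eqs. (5) and (6) as printed."""
    mu = [(mu1[k] + mu2[k]) / 2.0 for k in range(2)]          # Eq. (5)
    d = [mu1[k] - mu2[k] for k in range(2)]
    root1 = sqrtm2(cov1)
    cross = sqrtm2(matmul2(matmul2(root1, cov2), root1))
    cov = [[(cov1[i][j] + cov2[i][j]) / 2.0                   # Eq. (6)
            + d[i] * d[j] / 8.0
            + cross[i][j] / 2.0
            for j in range(2)] for i in range(2)]
    return mu, cov
```

In a full implementation this function would be called inside the repeat-until loop of Algorithm 1, once the distance, gradient and variance checks have all passed for a pair.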
³ Details about the thresholds we use in our experiments can be found in the Appendix.
Table 1: Comparison of 2DGBNN and competitive techniques (superscripts indicate matching architectures) on the ImageNet1k dataset. We also provide the number of outliers, ellipses, and Gaussians derived by our method.

| Architecture | Method | Accuracy ↑ | NLL ↓ | ECE ↓ | #Outliers | #Ellipses | #Gaussians | #Parameters (M) ↓ / Compression Ratio (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| ResNet-18 | Mutual BNN (Pham et al., 2024) | 67.7 | 1.327 | 0.1300 | - | - | - | 23.4M / - |
| ResNet-18 | 2DGBNN (ours) | 68.1 | 1.253 | 0.019 | 23013 | 10885 | 2217 | 0.038M / 99% |
| ResNet-50 | Deep Ensembles (Lakshminarayanan et al., 2017) | 77.5 | 0.877 | 0.0305 | - | - | - | 146.7M / - |
| ResNet-50 | Rank-1 BNN (Dusenberry et al., 2020) | 77.3 | 0.886 | 0.0166 | - | - | - | 26.0M / - |
| ResNet-50 | ATMC (30 samples) (Heek and Kalchbrenner, 2019) | 77.5 | 0.883 | - | - | - | - | 768.0M / - |
| ResNet-50 | MCMC (9 samples) BNN (Zhang et al., 2019) | 77.1 | 0.888 | - | - | - | - | 230.4M / - |
| ResNet-50 | 2DGBNN (ours) | 75.1 | 0.961 | 0.029 | 37172 | 56873 | 3250 | 0.101M / 99% |
| ResNet-101 | 2DGBNN (ours) | 75.50 | 0.969 | 0.023 | 53641 | 4311 | 2464 | 0.063M / 99% |
| VIT-B-16 | 2DGBNN (ours) | 76.01 | 0.901 | 0.064 | 9765 | 338329 | 5440 | 0.359M / 98% |
4.3 $\alpha$-blending (Multi-Clusters) of Weights
Before the final sampling step, we reassess the inlier weights $\mathbf{w}_{in}$ by computing their squared Mahalanobis distances to the cluster means, i.e., $D^2 = (w_i - \mu_k)^\top \Sigma_k^{-1} (w_i - \mu_k)$. If a weight's $D^2$ exceeds 5.991,⁴ we reassess its cluster assignment. We do this by relying on $\alpha$-blending (Mildenhall et al., 2021).
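Specifically, for each $w_i \in \mathbf{w}_{in}$ we compute the subset of the GMM's components $\mathcal{N}(\mu_k, \Sigma_k)$, for $k = 1, \ldots, n_i$, such that the above condition on $D^2$ is met. We then sample the final value for the weight from the resulting distribution:

$$p(w_i) = \sum_{k=1}^{n_i} \alpha_k\, \mathcal{N}(\mu_k, \Sigma_k) \tag{7}$$

where $\alpha_k$ is the mixing coefficient, computed as the pdf of $(\mu_i, \sigma_i^2)$ according to $\mathcal{N}(\mu_k, \Sigma_k)$.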
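The Mahalanobis check and the blending coefficients of Equation (7) can be sketched as follows for 2D points. The threshold 5.991 is the value quoted in the text; normalising the coefficients so that they sum to one is an assumption made here so that the blend is a proper mixture, and the function names are illustrative only.

```python
import math

CHI2_95_2DOF = 5.991  # 95th percentile of the chi-squared distribution, 2 d.o.f.


def mahalanobis_sq(p, mean, cov):
    """Squared Mahalanobis distance of a 2D point p from N(mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def blend_weights(p, means, covs):
    """Coefficients alpha_k proportional to the pdf of p under each component.

    Assumption: the coefficients are normalised to sum to one.
    """
    dens = []
    for m, c in zip(means, covs):
        det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
        dens.append(math.exp(-0.5 * mahalanobis_sq(p, m, c))
                    / (2.0 * math.pi * math.sqrt(det)))
    total = sum(dens)
    return [x / total for x in dens]
```

A point whose `mahalanobis_sq` to its assigned cluster exceeds `CHI2_95_2DOF` would be the one passed to `blend_weights` over its candidate components.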
4.3.1 Combined Variational Formulation
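Finally, we observe that our stochastic weight-sharing technique can be seamlessly integrated within the ELBO formulation of VI training (Nowlan and Hinton, 2018; Zhang et al., 2018). Formally, we assume that $\mathbf{w}_{out}$ and $\mathbf{w}_{in}$ are vectors of pairwise independent weights,⁵ which allows us to bound the variational objective as follows:

\begin{align}
\mathcal{L}(\mathcal{D}, q) &= \mathbb{E}_{q(\mathbf{w})}[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})] - \mathrm{KL}(q(\mathbf{w}) \,\|\, p(\mathbf{w})) \notag \\
&= \mathbb{E}_{q(\mathbf{w})}[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})] - \mathrm{KL}(q(\mathbf{w}_{in}) \,\|\, p(\mathbf{w}_{in})) - \mathrm{KL}(q(\mathbf{w}_{out}) \,\|\, p(\mathbf{w}_{out})) \tag{8} \\
&\approx \mathbb{E}_{q(\mathbf{w})}[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})] - \mathrm{KL}\Big(\sum_k \pi_k \mathcal{N}_k \,\Big\|\, p(\mathbf{w}_{in})\Big) - \mathrm{KL}(q(\mathbf{w}_{out}) \,\|\, p(\mathbf{w}_{out})) \tag{9} \\
&\geq \mathbb{E}_{q(\mathbf{w})}[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})] - \underbrace{\sum_k \pi_k \mathrm{KL}(\mathcal{N}_k \,\|\, p(\mathbf{w}_{in}))}_{\text{GMM KL divergence}} - \underbrace{\mathrm{KL}(q(\mathbf{w}_{out}) \,\|\, p(\mathbf{w}_{out}))}_{\text{Outliers KL divergence}} := \hat{\mathcal{L}}(\mathcal{D}, q), \tag{10}
\end{align}

where the equality in Equation (8) is due to the pairwise independence assumption between components of $\mathbf{w}_{in}$ and $\mathbf{w}_{out}$, the approximation of Equation (9) is due to the GMM approximation of the inlier weights, and the final inequality is due to the convexity of the KL divergence. Notice that, since $\hat{\mathcal{L}}$ is a lower bound on the original objective, the resulting loss function (its negation) is an upper bound on the original loss, so that its minimisation by means of gradient descent guarantees the improvement of the latter.

⁴ Corresponding to the 95th percentile of the $\chi^2_2$ distribution, which models the Mahalanobis distance for multidimensional Gaussians.

⁵ This is true for the variational distribution but it is an approximation for the true posterior.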
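The convexity step behind the final inequality in Equation (10) can be spelled out as follows. This is a short sketch, assuming the mixture weights are non-negative and sum to one, and using the standard fact that the KL divergence is convex in its first argument when the second is held fixed:

```latex
\begin{align*}
\mathrm{KL}\Big(\sum_k \pi_k \mathcal{N}_k \,\Big\|\, p(\mathbf{w}_{in})\Big)
  &\le \sum_k \pi_k\, \mathrm{KL}\big(\mathcal{N}_k \,\big\|\, p(\mathbf{w}_{in})\big),
  \qquad \pi_k \ge 0,\ \sum_k \pi_k = 1, \\
\text{hence}\quad
-\,\mathrm{KL}\Big(\sum_k \pi_k \mathcal{N}_k \,\Big\|\, p(\mathbf{w}_{in})\Big)
  &\ge -\sum_k \pi_k\, \mathrm{KL}\big(\mathcal{N}_k \,\big\|\, p(\mathbf{w}_{in})\big).
\end{align*}
```

The practical benefit is that each term on the right-hand side is a KL divergence between a single Gaussian component and the prior, which is cheaper to evaluate than the divergence of the full mixture.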
4.4 Overall Methodology
The overall methodology is presented in pseudocode form in Algorithm 1. 2DGBNN combines its component parts in three stages. In the first stage, the BNN is initialised with the given prior, a pre-training step is performed and the resulting BNN is used to initialise the GMM clustering. In stage 2, the initial GMM clustering obtained is refined by merging close Gaussians, and by performing $\alpha$-blending. Finally, the BNN is trained by optimising the variational objective on a combination of full weight distributions (for the outliers) and GMM-based weight-sharing (for the inliers).
[Figure 1, panel (a): Density and Frequency of ResNet50 in ImageNet and CIFAR-100. Panel (b): Weights Scatter ImageNet (above) and CIFAR-100 (below).]
Figure 1: Weight distribution of the BNN prior to stochastic-sharing. Panel (a) shows the density plot of the second convolutional layer of ResNet-50 when trained on ImageNet (in blue) and CIFAR-100 (in red). Panel (b) shows the corresponding scatter plots, including lines for the 1% and 99%.
5 EXPERIMENTS
To validate the effectiveness and scalability of 2DGBNN, we conduct comprehensive experiments using various NN architectures on benchmark image classification datasets. This section details the datasets, models, experimental setup, results, and analysis of our findings. Specifically, we evaluate our method on four common image classification benchmarks (MNIST, CIFAR-10, CIFAR-100, and ImageNet1k) as well as four widely used NN architectures: ResNet-18, ResNet-50, ResNet-101 and a Vision Transformer (ViT). The hyperparameters used and the details of the training are given in the Appendix. Throughout this section, we compare our technique against the results obtained by Deep Ensembles (Lakshminarayanan et al., 2017), Rank-1 BNN (Dusenberry et al., 2020), MCMC BNN (Zhang et al., 2019), ATMC (Heek and Kalchbrenner, 2019), Mutual BNN (Pham et al., 2024), F-SGVB-LRT (Nguyen et al., 2024), ABNN (Franchi et al., 2024), LP-BNN (Franchi et al., 2023), IR (Kim et al., 2023), SSVI (Li et al., 2024) and mBCNN (Kong et al., 2023). We evaluate the resulting models in terms of Accuracy, Negative Log-Likelihood (NLL, which measures the model's uncertainty in its predictions), and Expected Calibration Error (ECE, which measures the calibration of predicted probabilities (Guo et al., 2017)). Each experiment is conducted three times with different random seeds, and we report the average results. We conduct experiments with all the aforementioned comparison models, totaling 13 2DGBNN experiments, which include the base models of all comparison methods. Additionally, we perform two quantisation comparison experiments (Section 5.2) and an ablation study (Section 5.3).
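For reference, the ECE metric mentioned above is commonly computed by binning predictions by confidence and averaging the gap between accuracy and confidence in each bin. The sketch below assumes equal-width bins over [0, 1] and takes the top-class confidence and a correctness flag per prediction; the bin count of 10 and the function name are assumptions for this example, not settings reported in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins.

    confidences: top-class probabilities in [0, 1].
    correct: booleans, True where the prediction matched the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece
```

A perfectly calibrated model yields an ECE of zero, while a model that is overconfident by a constant margin yields an ECE equal to that margin.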
5.1 Performance Evaluation
The results obtained with 2DGBNN and those of the state-of-the-art are listed in Tables 1, 2 and 3 for ImageNet1k, CIFAR-100 and CIFAR-10 respectively, along with details on the computations performed by 2DGBNN.

In the case of ImageNet1k (Table 1) the comparison is performed against Deep Ensembles (Lakshminarayanan et al., 2017), Rank-1 BNN (Dusenberry et al., 2020), MCMC BNN with 9 samples (Zhang et al., 2019), ATMC (Heek and Kalchbrenner, 2019), and Mutual BNN (Pham et al., 2024). We observe that, in all cases, our method successfully reduces the number of parameters by 3 to 4 orders of magnitude. While the accuracy is reduced by around 2% in the ResNet-50 case, we do obtain comparable uncertainty estimation as evaluated by NLL and ECE.
We obtain similar results on the CIFAR-100 dataset (Table 2), comparing against Deep Ensembles (Lakshminarayanan et al., 2017), Rank-1 BNN (Dusenberry et al., 2020), F-SGVB-LRT (Nguyen et al., 2024) and ABNN (Franchi et al., 2024). Our method achieves a substantial reduction in model parameters while maintaining or improving performance compared to other methods, at the price of approximately 2% in accuracy when compared to Deep Ensembles and Rank-1 BNN. For instance, for ResNet-18 we use only 0.019M parameters, which is drastically lower than the 23.4M parameters used by the F-SGVB-LRT model, yet we achieve competitive accuracy. Finally, Table 3 lists analogous results in the context of CIFAR-10, comparing against IR (Kim et al., 2023), F-SGVB-LRT (Nguyen et al., 2024), ABNN (Franchi et al., 2024) and LP-BNN (Franchi et al., 2023).

Table 2: Comparison of 2DGBNN and competitive techniques (superscripts indicate matching architectures) on the CIFAR-100 dataset. We also provide the number of outliers, ellipses, and Gaussians derived by our method.

| Architecture | Method | Accuracy ↑ | NLL ↓ | ECE ↓ | #Outliers | #Ellipses | #Gaussians | #Parameters (M) ↓ / Compression Ratio (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| ResNet-18 | F-SGVB-LRT (Nguyen et al., 2024) | 70.1 | 1.121 | 0.036 | - | - | - | 23.4M / - |
| ResNet-18 | SSVI (Li et al., 2024) | 75.8 | - | 0.001 | - | - | - | 2.32M / 90% |
| ResNet-18 | mBCNN (Kong et al., 2023) | 73.7 | 1.004 | 0.002 | - | - | - | 2.86M / 87.8% |
| ResNet-18 | 2DGBNN (ours) | 74.7 | 1.053 | 0.038 | 14624 | 260 | 2387 | 0.019M / 99% |
| WRN-28-10 | Deep Ensembles (Lakshminarayanan et al., 2017) | 82.7 | 0.666 | 0.021 | - | - | - | 146M / - |
| WRN-28-10 | Rank-1 BNN (Dusenberry et al., 2020) | 82.4 | 0.689 | 0.012 | - | - | - | 36.6M / - |
| WRN-28-10 | LP-BNN (Franchi et al., 2023) | 79.3 | - | 0.0702 | - | - | - | 26.8M / 63% |
| WRN-28-10 | 2DGBNN (ours) | 80.5 | 0.798 | 0.0432 | 40354 | 341 | 2390 | 0.045M / 99% |
| ResNet-50 | ABNN (Franchi et al., 2024) | 74.20 | 0.828 | 4.5 | - | - | - | 54.2M / - |
| ResNet-50 | 2DGBNN (ours) | 78.1 | 0.986 | 0.107 | 247591 | 330 | 1980 | 0.251M / 99% |
| ResNet-101 | 2DGBNN (ours) | 78.4 | 0.834 | 0.066 | 45240 | 348 | 3199 | 0.052M / 99% |

Table 3: Comparison of 2DGBNN and competitive techniques (superscripts indicate matching architectures) on the CIFAR-10 dataset. We also provide the number of outliers, ellipses, and Gaussians derived by our method.

| Architecture | Method | Accuracy ↑ | NLL ↓ | ECE ↓ | #Outliers | #Ellipses | #Gaussians | #Parameters (M) ↓ / Compression Ratio (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| ResNet-18 | F-SGVB-LRT (Nguyen et al., 2024) | 90.31 | 0.262 | 0.014 | - | - | - | 23.4M / - |
| ResNet-18 | SSVI (Li et al., 2024) | 93.74 | - | 0.006 | - | - | - | 1.17M / 95% |
| ResNet-18 | mBCNN (Kong et al., 2023) | 93.20 | 0.220 | 0.008 | - | - | - | 0.93M / 96% |
| ResNet-18 | 2DGBNN (ours) | 91.72 | 0.305 | 0.019 | 123310 | 67 | 1569 | 0.018M / 99% |
| WRN-28-10 | Deep Ensembles (Lakshminarayanan et al., 2017) | 96.2 | 0.143 | 0.020 | - | - | - | 146M / - |
| WRN-28-10 | Rank-1 BNN (Dusenberry et al., 2020) | 96.3 | 0.128 | 0.008 | - | - | - | 36.6M / 50.8% |
| WRN-28-10 | LP-BNN (Franchi et al., 2023) | 95.0 | - | 0.009 | - | - | - | 26.8M / 63% |
| WRN-28-10 | 2DGBNN (ours) | 95.2 | 0.142 | 0.012 | 39395 | 365977 | 3950 | 0.413M / 99% |
| ResNet-50 | ABNN (Franchi et al., 2024) | 95.01 | 0.160 | 1.0 | - | - | - | 54.2M / 25% |
| ResNet-50 | 2DGBNN (ours) | 93.84 | 0.223 | 0.012 | 129640 | 0 | 1628 | 0.132M / 99% |
| ResNet-101 | 2DGBNN (ours) | 92.78 | 0.270 | 0.015 | 45240 | 348 | 3199 | 0.052M / 99% |

In addition to the architectures used for comparisons, the tables report results for ResNet-101 and ViT. In these architectures too, 2DGBNN is able to reduce the number of parameters while obtaining accuracy and uncertainty metrics on par with those of state-of-the-art techniques across the remaining architectures. Interestingly, observing the training results obtained, we notice how the distribution of weights in models trained on smaller datasets (CIFAR-100 and CIFAR-10) tends to cluster near zero, as depicted in Figure 1(b) (bottom). Conversely, in ImageNet1k the weight distribution is broader, as can be seen from Figure 1(a), which compares the empirical distributions obtained on CIFAR-100 and ImageNet1k. Notice how this translates into, for example, the ResNet-50 model trained on the ImageNet1k dataset having a significantly greater number of Gaussian and Ellipse weights than when trained on the CIFAR-100 dataset.
5.2 Comparison against Quantisation
We now compare 2DGBNN against quantisation techniques applied to BNNs (Subedar et al., 2021). For this purpose, we remove the pretraining stage of 2DGBNN so as to mimic the "vanilla" BNN training employed by Subedar et al. (2021). We use a Gaussian prior with a mean of 0 and a standard deviation of 0.1. Notice that these experiments are limited to CIFAR-10 and MNIST, as the vanilla training of BNNs used in Subedar et al. (2021) does not scale to the larger architectures and datasets analysed in the previous section.

The comparative results are presented in Table 4. Accuracy values are very similar across the board, while our technique obtains significantly better uncertainty metrics, except for NLL in the case of CIFAR-10.
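Table 4: Comparison against the quantisation technique of Subedar et al. (2021) on CIFAR-10 and MNIST.

| Dataset | Algorithm | Quantisation technique | Accuracy ↑ | NLL ↓ | ECE ↓ | #Outliers | #Ellipses | #Gaussians | #Parameters (MB: Megabyte) |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA4) | 90.92 | 0.266 | 1.778 | - | - | - | 0.87 MB |
| CIFAR-10 | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA2) | 90.85 | 0.273 | 2.547 | - | - | - | 0.72 MB |
| CIFAR-10 | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA1) | 90.96 | 0.266 | 0.711 | - | - | - | 0.54 MB |
| CIFAR-10 | 2DGBNN | ResNet-20 (without pretrained) | 90.91 | 0.303 | 0.040 | 74181 | 3634 | 142 | 0.644MB (1.62MB) |
| CIFAR-10 | 2DGBNN | ResNet-20 (with pretrained) | 91.04 | 0.303 | 0.037 | 14624 | 260 | 2387 | 0.020M / (0.71MB) |
| MNIST | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA4) | 99.36 | 0.020 | 0.215 | - | - | - | 0.10 MB |
| MNIST | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA2) | 99.32 | 0.024 | 0.277 | - | - | - | 0.08 MB |
| MNIST | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA1) | 99.34 | 0.027 | 0.351 | - | - | - | 0.06 MB |
| MNIST | 2DGBNN | ResNet-20 (without pretrained) | 99.52 | 0.013 | 0.001 | 3206 | 581 | 237 | 0.092MB (0.403MB) |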
Table 5: Ablation Study: Impact of Outliers and Ellipses on CIFAR-10 Using ResNet-20

| Configuration | #Outliers | #Ellipses | Accuracy (%) ↑ | NLL ↓ | ECE ↓ |
|---|---|---|---|---|---|
| 2DGBNN | ✓ | ✓ | 90.92 | 0.265 | 0.040 |
| Without Ellipse | ✓ | – | 90.43 | 0.304 | 0.043 |
| Without Outliers | – | ✓ | 90.18 | 0.308 | 0.026 |
| Without Outliers and Ellipse | – | – | 90.01 | 0.318 | 0.029 |
In terms of the model size (here compared in Megabytes), the two techniques compare similarly when it comes to the size of trainable parameters (corresponding to the value reported outside brackets for 2DGBNN), with quantisation having a slight edge when only 1 bit is used for encoding the standard deviation. Notice, however, that in small NNs (like the one analysed here) our technique incurs a significant storage overhead, in that we need to keep an index (encoded in uint8) that assigns each inlier weight to its cluster. When this value is added (size reported in brackets in the Table), quantisation has a significant advantage over our storage requirements. While techniques such as Huffman coding or multi-level index tables can potentially reduce the size of the index vector by several factors, we leave further investigations for future work, and here notice that, despite maintaining full precision on the workings of the BNN, weight-sharing quantisation can already obtain comparable results to int8 quantisation. We notice that the two methods are complementary, and int8 quantisation can further reduce the storage requirements of the outlier weights and GMMs.
5.3 Ablation Study
Table 5 presents an ablation study on CIFAR-10 using ResNet-20 to explore how outliers and ellipses contribute to the performance of 2DGBNN. When both outliers and ellipses are included, the model achieves an accuracy of 90.92%, with the lowest NLL (0.265) and an ECE of 0.040. However, removing either component significantly impacts performance. Without ellipses, the accuracy drops by 0.49 percentage points to 90.43%, and excluding outliers reduces it slightly further, by 0.25 percentage points, to 90.18%. When both are removed, the accuracy drops by 0.87% to a low of 90.05%.
6 CONCLUSIONS
We have presented a stochastic weight-sharing quantisation technique based on GMMs specifically tailored to BNNs. In an extensive empirical evaluation, we have seen how our technique can significantly reduce the effective number of parameters of a BNN while obtaining results on par with the state-of-the-art in large datasets and architectures such as ImageNet1k and ViT.

Future work will explore how to integrate our method into a fully Bayesian framework and the application of further quantisation of the outlier weights.
7 ACKNOWLEDGEMENTS
This publication has emanated from research jointly funded by the European Union's Horizon Europe 2021–2027 framework programme, Marie Skłodowska-Curie Actions, Grant Agreement No. 101072456 and Taighde Éireann – Research Ireland under grant number 13/RC/2094_2.
References
Achterhold, J., Koehler, J. M., Schmeink, A., and Genewein, T. (2018). Variational network quantization. In International conference on learning representations.
Agueh, M. and Carlier, G. (2011). Barycenters in the wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924.
Beckers, J., Van Erp, B., Zhao, Z., Kondrashov, K., and De Vries, B. (2023). Principled pruning of bayesian neural networks through variational free energy minimization. IEEE Open Journal of Signal Processing.
Bharadiya, J. P. (2023). A review of bayesian machine learning principles, methods, and applications. International Journal of Innovative Science and Research Technology, 8(5):2033–2038.
Billah, M. E. and Javed, F. (2022). Bayesian convolutional neural network-based models for diagnosis of blood cancer. Applied Artificial Intelligence, 36(1):2011688.
Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 1613–1622.
Bonnet, D., Hirtzlin, T., Majumdar, A., Dalgaty, T., Esmanhotto, E., Meli, V., Castellani, N., Martin, S., Nodin, J.-F., Bourgeois, G., et al. (2023). Bringing uncertainty quantification to the extreme-edge with memristor-based bayesian neural networks. Nature Communications, 14(1):7530.
Chien, J.-T. and Chang, S.-T. (2023). Bayesian asymmetric quantized neural networks. Pattern Recognition, 139:109463.
Chizat, L., Roussillon, P., Léger, F., Vialard, F.-X., and Peyré, G. (2020). Faster wasserstein distance estimation with the sinkhorn divergence. Advances in Neural Information Processing Systems, 33:2257–2269.
De Palma, G., Marvian, M., Trevisan, D., and Lloyd, S. (2021). The quantum wasserstein distance of order 1. IEEE Transactions on Information Theory, 67(10):6627–6643.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
Doan, B. G., Shamsi, A., Guo, X.-Y., Mohammadi, A., Alinejad-Rokny, H., Sejdinovic, D., Ranasinghe, D. C., and Abbasnejad, E. (2024). Bayesian low-rank learning (bella): A practical approach to bayesian neural networks. arXiv preprint arXiv:2407.20891.
Dong, R., Tan, Z., Wu, M., Zhang, L., and Ma, K. (2022). Finding the task-optimal low-bit sub-distribution in deep neural networks. In International Conference on Machine Learning, pages 5343–5359. PMLR.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
Dusenberry, M., Jerfel, G., Wen, Y., Ma, Y., Snoek, J., Heller, K., Lakshminarayanan, B., and Tran, D. (2020). Efficient and scalable bayesian neural nets with rank-1 factors. In International conference on machine learning, pages 2782–2792. PMLR.
Ferianc, M., Maji, P., Mattina, M., and Rodrigues, M. (2021). On the effects of quantisation on model uncertainty in bayesian neural networks. In Uncertainty in Artificial Intelligence, pages 929–938. PMLR.
Forsberg, H., Lindén, J., Hjorth, J., Månefjord, T., and Daneshtalab, M. (2020). Challenges in using neural networks in safety-critical applications. In 2020 AIAA/IEEE 39th Digital Avionics Systems Conference (DASC), pages 1–7. IEEE.
Franchi, G., Bursuc, A., Aldea, E., Dubuisson, S., and Bloch, I. (2023). Encoding the latent posterior of bayesian neural networks for uncertainty quantification. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Franchi, G., Laurent, O., Leguéry, M., Bursuc, A., Pilzer, A., and Yao, A. (2024). Make me a bnn: A simple strategy for estimating bayesian uncertainty from pre-trained models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12194–12204.
Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International conference on machine learning, pages 1050–1059. PMLR.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR.
Guo, Y. (2018). A survey on methods and theories of quantized neural networks. arXiv preprint arXiv:1808.04752.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of
Table 9: Deterministic neural networks used as priors for the initialisation of 2DGBNN in the experiments discussed in Section 5.1.
Method       Dataset       Accuracy ↑   NLL ↓   ECE ↓   #Parameters (M: million)
ResNet-18    ImageNet-1k   69.70        1.265   0.027   11.7M
ResNet-50    ImageNet-1k   76.10        0.989   0.035   25.6M
ResNet-101   ImageNet-1k   77.30        0.936   0.936   44.5M
VIT-B-16     ImageNet-1k   81.07        0.856   0.056   86.0M
ResNet-18    CIFAR-100     77.10        1.038   0.114   11.7M
ResNet-50    CIFAR-100     79.20        0.950   0.054   25.6M
ResNet-101   CIFAR-100     80.02        0.849   0.095   44.5M
WRN-28-10    CIFAR-100     81.41        0.766   0.045   53.6M
ResNet-20    CIFAR-10      91.84        0.246   0.031   0.27M
ResNet-18    CIFAR-10      93.21        0.201   0.022   11.7M
ResNet-50    CIFAR-10      94.81        0.211   0.010   25.6M
ResNet-101   CIFAR-10      94.70        0.849   0.095   44.5M
WRN-28-10    CIFAR-10      95.92        0.131   0.010   53.6M
The table is structured to highlight the performance across multiple architectures and datasets, facilitating a direct comparison. Higher accuracies and lower NLL and ECE values indicate better model performance. For instance, VIT-B-16 on ImageNet-1k achieves the highest accuracy (81.07%) and the lowest NLL (0.856) among its dataset counterparts, with an ECE of 0.056. Similarly, WRN-28-10 shows superior performance on CIFAR-10 with the highest accuracy of 95.92% and a remarkably low ECE of 0.010. The number of parameters, reported in millions, also provides insight into the model complexity, ranging from 0.27M for ResNet-20 on CIFAR-10 to 86.0M for VIT-B-16 on ImageNet-1k.
The 2DGBNN models on ImageNet1k showed a decrease in accuracy, with our ResNet-50 configuration dropping from the benchmark high of 77.5% to 75.10%. This reduction was coupled with a substantial decrease in the number of parameters, from models requiring up to 25.6 million parameters to just 0.101 million. Despite these changes, the increases in NLL and ECE were minimal and within acceptable ranges, indicating that the models maintain satisfactory predictive performance and calibration despite the reduced complexity. Similar trends were observed in the CIFAR-100 dataset, where our WRN-28-10 model's accuracy decreased from 81.41% to 80.5%, while the parameter count was reduced from 53.6 million to only 0.045 million. The NLL and ECE metrics, although slightly elevated, remained competitive, affirming the effective uncertainty estimation capabilities of the models despite their reduced complexity.
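G Additional Background
KL Divergence between Gaussians
The KL divergence between two Gaussian distributions $q(w) = \mathcal{N}(w \mid \mu_q, \sigma_q^2)$ and $p(w) = \mathcal{N}(w \mid \mu_p, \sigma_p^2)$ is given by:
$$\mathrm{KL}(q(w)\,\|\,p(w)) = \frac{1}{2}\left(\frac{\sigma_q^2}{\sigma_p^2} + \frac{(\mu_p - \mu_q)^2}{\sigma_p^2} - 1 + \ln\frac{\sigma_p^2}{\sigma_q^2}\right) \tag{11}$$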
This formula can be used to compute the KL divergence terms in the ELBO expression for both inliers and outliers.
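As a minimal sketch of Eq. (11), the closed form can be evaluated directly for a pair of univariate Gaussians. The function name `kl_gaussian` and its argument order are illustrative choices, not taken from the paper's code.

```python
import math


def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for q = N(mu_q, sigma_q^2) and p = N(mu_p, sigma_p^2), as in Eq. (11)."""
    if sigma_q <= 0 or sigma_p <= 0:
        raise ValueError("standard deviations must be positive")
    var_q = sigma_q ** 2
    var_p = sigma_p ** 2
    return 0.5 * (
        var_q / var_p                      # variance ratio
        + (mu_p - mu_q) ** 2 / var_p       # squared mean shift, scaled by the prior variance
        - 1.0
        + math.log(var_p / var_q)          # log-variance ratio
    )
```

The divergence is zero when the two distributions coincide and is not symmetric in its arguments, so the order (posterior first, prior second) matters when summing the terms of the ELBO.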
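Expected Log-Likelihood Computation
The expected log-likelihood term involves an expectation over the variational posterior:
$$\mathbb{E}_{q(\mathbf{w})}[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})] = \int q(\mathbf{w}) \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\, d\mathbf{w} \tag{12}$$
In practice, this integral is intractable and is approximated using Monte Carlo sampling.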
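The Monte Carlo approximation of Eq. (12) can be sketched as follows. This is a toy illustration under stated assumptions: the posterior is taken to be a diagonal Gaussian sampled as `mu + sigma * eps`, and `log_lik` is a hypothetical callable returning $\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})$ for one weight sample. Neither the function names nor the sampling scheme are taken from the paper's implementation.

```python
import numpy as np


def mc_expected_log_likelihood(log_lik, mu, sigma, n_samples, rng=None):
    """Estimate E_q[log p(y | X, w)] for q(w) = N(mu, diag(sigma^2)) by sampling."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)  # standard normal noise
        w = mu + sigma * eps                 # one sample from q(w)
        total += log_lik(w)
    return total / n_samples


def toy_log_lik(w):
    """Hypothetical stand-in for log p(y | X, w); analytic mean is -0.5 * sum(mu^2 + sigma^2)."""
    return -0.5 * float(np.sum(w ** 2))
```

For the toy log-likelihood the expectation is known in closed form, which makes it easy to check that the estimate approaches the exact value as `n_samples` grows.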
H Wasserstein-based 2D Gaussian Merging
In the context of comparing two Gaussian distributions, the Wasserstein distance provides a meaningful way to measure the distance between probability distributions. Specifically, for two 2D Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, where $\mu_1, \mu_2$ are the means and $\Sigma_1, \Sigma_2$ are the covariance matrices, the Wasserstein-2 distance is given by the following formula:
$$W_2\big(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2)\big)^2 = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\Big(\Sigma_1 + \Sigma_2 - 2\big(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\big)^{1/2}\Big) \tag{13}$$
where $\|\mu_1 - \mu_2\|_2$ is the Euclidean distance between the means of the two distributions, $\mathrm{Tr}$ is the trace operator, which sums the diagonal elements of a matrix, and $\Sigma_1^{1/2}$ refers to the matrix square root of the covariance matrix $\Sigma_1$.
The Wasserstein distance threshold is set at $1.5 \times 10^{-7}$, with further discussion in Section I.1. If the distance between Gaussian components falls below this threshold, gradient information is further considered, ensuring a precise analysis of component similarity. This method ensures that the newly formed Gaussian component accurately reflects the collective distribution characteristics of the initial components while maintaining minimal internal variation.
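A minimal sketch of the squared Wasserstein-2 distance of Eq. (13) is given below. It assumes symmetric positive semi-definite covariance matrices and computes matrix square roots by eigendecomposition; the helper names `psd_sqrt` and `w2_squared` are illustrative and not from the paper. The merging rule itself and the use of gradient information are not reproduced here.

```python
import numpy as np


def psd_sqrt(mat):
    """Square root of a symmetric PSD matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T   # V diag(sqrt(vals)) V^T


def w2_squared(mu1, cov1, mu2, cov2):
    """Squared Wasserstein-2 distance between N(mu1, cov1) and N(mu2, cov2), Eq. (13)."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    root1 = psd_sqrt(cov1)
    cross = psd_sqrt(root1 @ cov2 @ root1)   # (cov1^{1/2} cov2 cov1^{1/2})^{1/2}
    mean_term = float(np.sum((mu1 - mu2) ** 2))
    cov_term = float(np.trace(cov1 + cov2 - 2.0 * cross))
    return mean_term + max(cov_term, 0.0)    # clamp round-off below zero
```

For isotropic covariances $a^2 I$ and $b^2 I$ in two dimensions the covariance term reduces to $2(a - b)^2$, which gives a quick sanity check of the implementation.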
I Sub-module Discussion and Hyper-parameter Configuration
I.1 Effectiveness of 2D Gaussian Merging Discussion
In this section, we discuss the effectiveness of the merging of 2D Gaussian distributions in our method. Experimental results analysing its effect are listed in Table 10. For the ResNet20 model on the CIFAR-10 dataset, after merging the 2D Gaussians, the accuracy improved from 88.77% to 91.02%, the NLL decreased from 0.3792 to 0.3066, and the ECE reduced from 0.0500 to 0.0374. For the ResNet18 model on the CIFAR-100 dataset, although the improvement is smaller, the accuracy still increased from 74.50% to 74.63%, and the NLL decreased from 1.0555 to 1.0535, while the ECE remained essentially unchanged (0.0391 before and 0.0400 after merging). These results demonstrate the effectiveness of our 2D Gaussian merging method in enhancing model performance and calibration.
Table 10: Performance of ResNet20 on CIFAR-10 and ResNet18 on CIFAR-100 before and after merging Gaussian distributions.
Model      Dataset     Method                    Accuracy (%)   NLL      ECE
ResNet20   CIFAR-10    Before Merging Gaussian   88.77          0.3792   0.0500
ResNet20   CIFAR-10    After Merging Gaussian    91.02          0.3066   0.0374
ResNet18   CIFAR-100   Before Merging Gaussian   74.50          1.0555   0.0391
ResNet18   CIFAR-100   After Merging Gaussian    74.63          1.0535   0.0400
I.2 Data Augmentation
To improve the robustness and generalization of our models, we applied a series of data augmentation techniques during training on the ImageNet-1K, CIFAR-100, and CIFAR-10 datasets. Table 11 summarises the specific augmentation methods used for each dataset. For the ImageNet-1K dataset, we applied random resized cropping to obtain images of size 224 × 224 pixels. This was followed by random horizontal flipping with a probability of 50% to augment the dataset with mirrored images. Color jittering was used to adjust the brightness, contrast, saturation, and hue of the images with factors of 0.4, 0.4, 0.4, and 0.1, respectively. The images were then converted to tensors and normalised using the standard mean and standard deviation values for ImageNet. In the case of the CIFAR-100 dataset, we started by converting the images to tensors. We then padded the images with 4 pixels on each side using reflection mode to preserve edge information. After padding, we performed a random crop to 32 × 32 pixels, followed by random horizontal flipping to introduce mirror variations. The images were converted back to tensors and normalised accordingly.
Table 11: Summary of data augmentation techniques applied to each dataset.
Dataset       Data Augmentation Techniques
ImageNet-1K   Random resized crop to 224 × 224 pixels,
              Random horizontal flip,
              Color jittering (brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
              Conversion to tensor,
              Normalization
CIFAR-100     Conversion to tensor,
              Padding of 4 pixels (reflection mode),
              Random crop to 32 × 32 pixels,
              Random horizontal flip,
              Conversion to tensor,
              Normalization
CIFAR-10      Random crop to 32 × 32 pixels,
              Random horizontal flip,
              Conversion to tensor,
              Normalization
For the CIFAR-10 dataset, the augmentation process involved a random crop to 32 × 32 pixels, which helps in teaching the model to be invariant to translations. We also applied random horizontal flipping to include mirrored versions of the images. Finally, the images were converted to tensors and normalised to standardise the input data.
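The CIFAR-style pipeline described above (reflection padding, random crop, random horizontal flip, normalisation) can be sketched in plain NumPy as follows. This is an illustrative re-implementation, not the authors' code: the function name `augment_cifar`, the channels-last layout, and the flip probability of 0.5 for CIFAR are assumptions, and the mean and standard deviation are the `TRAIN_MEAN` and `TRAIN_STD` values reported in Table 12.

```python
import numpy as np

TRAIN_MEAN = (0.5071, 0.4865, 0.4409)  # per-channel mean, from Table 12
TRAIN_STD = (0.2673, 0.2564, 0.2761)   # per-channel standard deviation, from Table 12


def augment_cifar(image, mean, std, pad=4, crop=32, flip_prob=0.5, rng=None):
    """Augment one image of shape (H, W, C) with values in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    # Reflection padding on the two spatial axes only.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    # Random crop back to crop x crop pixels.
    top = rng.integers(0, padded.shape[0] - crop + 1)
    left = rng.integers(0, padded.shape[1] - crop + 1)
    out = padded[top:top + crop, left:left + crop, :]
    # Random horizontal flip (reverse the width axis).
    if rng.random() < flip_prob:
        out = out[:, ::-1, :]
    # Per-channel normalisation.
    return (out - np.asarray(mean)) / np.asarray(std)
```

Setting `pad=0` and `flip_prob=0.0` reduces the function to normalisation only, which is a convenient way to obtain a deterministic evaluation-time transform from the same code path.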
I.3 Hyperparameters for training the deterministic network
In our experiments, similar to those conducted by previous researchers, we standardised the hyperparameters across all models in the initial stage. This approach was applied to various models including ResNet-18, ResNet-50, ResNet-101, and VIT. The hyperparameters used are as follows:
Hyperparameter    Variable       Default Value
Batch Size        -b             256
Warm-up Phases    -warm          2
Learning Rate     -lr            0.1
Resume Training   -resume        False
Total Epochs      -EPOCH         250
Milestones        -MILESTONES    [30, 60, 90, 120, 150, 200]
Weight Decay      -MultiStepLR   5e-4
Training Mean     -TRAIN_MEAN    (0.5071, 0.4865, 0.4409)
Training Std      -TRAIN_STD     (0.2673, 0.2564, 0.2761)
Table 12: Summary of hyperparameters in the neural network training configuration
The table summarizes the standardised hyperparameters used across all models in our neural network training configurations, mirroring settings from previous research. It outlines common parameters such as batch size, warm-up phases, learning rate, and total epochs, alongside specific settings like weight decay and learning rate milestones.
I.4 Discussion of the Initialisation of $\sigma$
We experimented with several different initialisation methods for our models, including initialisation via a specific function as detailed by Lee et al., random generation, and Gaussian distribution, among others. According to our experimental results, we ultimately adopted the following initialisation methods for our neural network parameters.
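For the weight parameter weight_sigma, we used the Xavier uniform initialisation with a gain of 0.01, defined as:
$$\mathbf{W} \sim \mathcal{U}\left(-\frac{g}{\sqrt{n_{\text{in}} + n_{\text{out}}}},\ \frac{g}{\sqrt{n_{\text{in}} + n_{\text{out}}}}\right) \tag{14}$$
where $g = 0.01$, $n_{\text{in}}$ is the number of input units, $n_{\text{out}}$ is the number of output units, and $\mathcal{U}(a, b)$ denotes a uniform distribution between $a$ and $b$.
For the bias parameter bias_sigma, we initialised it using a normal distribution with a mean of 0.0 and a standard deviation of 0.001:
$$\mathbf{b} \sim \mathcal{N}(0, 0.001^2) \tag{15}$$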
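The two initialisation rules of Eqs. (14) and (15) can be sketched in NumPy as below. The function name `init_sigma_params` and the `(n_out, n_in)` weight layout are assumptions made for illustration. The bound follows Eq. (14) as printed; note that common library implementations of Xavier uniform (for example PyTorch's `torch.nn.init.xavier_uniform_`) use the bound $g\sqrt{6/(n_{\text{in}} + n_{\text{out}})}$, so a run that calls such a routine would draw from an interval wider by a factor of $\sqrt{6}$.

```python
import numpy as np


def init_sigma_params(n_in, n_out, gain=0.01, bias_std=0.001, rng=None):
    """Draw weight_sigma from Eq. (14) and bias_sigma from Eq. (15)."""
    rng = np.random.default_rng() if rng is None else rng
    # Bound exactly as printed in Eq. (14); see the lead-in about the sqrt(6) convention.
    bound = gain / np.sqrt(n_in + n_out)
    weight_sigma = rng.uniform(-bound, bound, size=(n_out, n_in))
    # Zero-mean normal with standard deviation bias_std, as in Eq. (15).
    bias_sigma = rng.normal(0.0, bias_std, size=n_out)
    return weight_sigma, bias_sigma
```

With `n_in=64` and `n_out=32`, for example, every entry of `weight_sigma` lies within roughly $\pm 1.02 \times 10^{-3}$.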
In our neural network, we assign different learning rates to different parameters. Table 13 summarises these hyperparameters.
Table 13: Summary of Hyperparameters Used in Our Experiments
Hyperparameter                               Symbol                            Value
Learning rate for weight and bias $\mu$      $\eta_{\text{weight\_mu}}$        $1 \times 10^{-4}$
Learning rate for weight and bias $\sigma$   $\eta_{\text{weight\_sigma}}$     $1 \times 10^{-2}$
We set the learning rates for the parameters as follows: the learning rates for weight_mu and bias_mu are both $\eta_{\text{weight\_mu}} = 1 \times 10^{-4}$; the learning rates for weight_sigma and bias_sigma are both $\eta_{\text{weight\_sigma}} = 1 \times 10^{-3}$.
I.5 Hyper-parameters for training Bayesian Neural Network
In our experiments, we utilise several hyperparameters that are crucial to the performance and convergence of our Bayesian neural network (BNN) model. These hyperparameters are carefully selected based on empirical studies to balance computational efficiency and model accuracy. Table 14 summarises the hyperparameters used in our experiments.
In the KMeans clustering algorithm, we initially use $K = 2000$ clusters. Outliers in the weight values are identified using a threshold $T_w = \pm 0.2$. Any weight value whose magnitude exceeds this threshold is considered an outlier. This threshold is chosen based on the empirical distribution of the weights after initial training. Similarly, outliers in the gradients are identified by selecting the top $P_g = 1\%$ of gradient magnitudes.
A minimum cluster size of $N_{\min} = 30$ is enforced to ensure statistical significance in the clustering results. Clusters with fewer than $N_{\min}$ samples are considered invalid and their associated weights are treated as outliers. This prevents the model from being influenced by clusters that may represent noise or insignificant patterns.
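The three outlier criteria above (weight magnitude, top gradient percentile, minimum cluster size) can be sketched as a single boolean mask. This is a hedged illustration: the function name `find_outliers` is hypothetical, the cluster labels are assumed to come from a prior KMeans run, and combining the criteria with a logical OR is an assumption, since the text does not state how they interact.

```python
import numpy as np


def find_outliers(weights, grads, labels, t_w=0.2, top_grad_frac=0.01, n_min=30):
    """Return a boolean mask marking weights treated as outliers."""
    weights = np.asarray(weights, dtype=float).ravel()
    grads = np.asarray(grads, dtype=float).ravel()
    labels = np.asarray(labels).ravel()

    # Criterion 1: weight magnitude beyond the threshold T_w.
    by_value = np.abs(weights) > t_w

    # Criterion 2: gradient magnitude within the top P_g fraction.
    cutoff = np.quantile(np.abs(grads), 1.0 - top_grad_frac)
    by_grad = np.abs(grads) > cutoff

    # Criterion 3: membership of a cluster with fewer than N_min points.
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    by_cluster = counts[inverse] < n_min

    return by_value | by_grad | by_cluster  # union of the criteria (assumption)
```

The complement of the mask gives the inlier weights, which are the ones represented by shared Gaussian components.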
During the BNN training, we use a learning rate of $\eta_{\text{BNN}} = 1 \times 10^{-5}$, which is lower than the initial learning rate used in the preliminary training. The smaller learning rate is necessary to accommodate the Bayesian updates and to ensure that the posterior distributions over the weights converge properly.
In the predictive function, we draw $N_s = 30$ samples from the posterior distribution to estimate the predictive mean and uncertainty. The Expected Calibration Error (ECE) is computed using $N_b = 15$ bins. The Mahalanobis distance threshold $T_M = 5.991$ corresponds to the chi-squared distribution value with 2 degrees of freedom at the 95% confidence level. The Mahalanobis distance of a data point $x$ with respect to a Gaussian distribution with mean $\mu$ and covariance $\Sigma$ is calculated as:
$$D_M^2 = (x - \mu)^\top \Sigma^{-1} (x - \mu). \tag{16}$$
Table 14: Summary of Hyperparameters Used in Our Experiments
Hyperparameter                            Symbol                 Value
Initial learning rate for $\mu$           $\eta_\mu$             $1 \times 10^{-4}$
Number of epochs                          $E$                    200
Number of clusters in KMeans              $K$                    6000
Outlier threshold (weight value)          $T_w$                  $\pm 0.2$
Outlier threshold (gradient percentile)   $P_g$                  Top 1%
Minimum samples per cluster               $N_{\min}$             20
Learning rate in BNN training             $\eta_{\text{BNN}}$    $1 \times 10^{-5}$
Number of samples in predictive function  $N_s$                  30
Number of bins in ECE computation         $N_b$                  15
Wasserstein distance threshold            $T_W$                  $1 \times 10^{-2}$
Mahalanobis distance threshold            $T_M$                  5.991
Number of nearest Gaussians               $k$                    5
Points with a Mahalanobis distance greater than $T_M$ are considered outliers.
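A minimal sketch of the Mahalanobis test of Eq. (16) follows. It assumes that the squared distance $D_M^2$ is the quantity compared against $T_M$, which is the reading consistent with a chi-squared quantile; the function names are illustrative. For 2 degrees of freedom the chi-squared quantile has the closed form $-2\ln(1 - p)$, which reproduces the value 5.991 at $p = 0.95$.

```python
import math

import numpy as np

T_M = 5.991  # threshold from Table 14

# Chi-squared quantile with 2 degrees of freedom at the 95% level (closed form).
CHI2_95_2DOF = -2.0 * math.log(1.0 - 0.95)


def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu), Eq. (16)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve cov @ z = diff instead of forming the explicit inverse.
    return float(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff))


def is_outlier(x, mu, cov, threshold=T_M):
    """True when the point lies outside the 95% ellipse of the 2D Gaussian."""
    return mahalanobis_sq(x, mu, cov) > threshold
```

With an identity covariance, a point at Euclidean distance 3 from the mean has $D_M^2 = 9$ and is flagged, whereas a point at distance 2 has $D_M^2 = 4$ and is kept.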
In handling outliers, we consider the $k = 5$ nearest Gaussian components for each outlier point. This allows us to reassign outlier weights to the most probable Gaussian components based on their proximity in the parameter space.
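The predictive mean over $N_s$ posterior samples and the ECE with $N_b = 15$ bins mentioned earlier in this section can be sketched as follows. The equal-width binning over $(0, 1]$ is the commonly used formulation and is assumed here, since the text does not spell out the binning scheme; the function names and array layouts are illustrative.

```python
import numpy as np


def predictive_mean(prob_samples):
    """Average class probabilities over samples; input shape (N_s, N, C)."""
    return np.asarray(prob_samples, dtype=float).mean(axis=0)


def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE with equal-width confidence bins over (0, 1]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between empirical accuracy and mean confidence in this bin,
            # weighted by the fraction of points that fall in the bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)
```

In use, `confidences` would be the maximum of the predictive mean for each input and `correct` a 0/1 indicator of whether the arg-max class matches the label.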