Stochastic Weight Sharing for Bayesian Neural Networks

Author: Lin, Moule

Publisher: Zenodo

DOI: 10.5281/zenodo.17533924

Source: https://zenodo.org/records/17533924/files/2GDBNNs.pdf

arXiv:2505.17856v1 [cs.LG] 23 May 2025
Stochastic Weight Sharing for Bayesian Neural Networks
Moule Lin, Shuhao Guan, Weipeng Jing, Goetz Botterweck, Andrea Patane
Affiliations (in author order): Lero, Trinity College Dublin; University College Dublin; Northeast Forestry University; Lero, Trinity College Dublin; Lero, Trinity College Dublin
Abstract
While offering a principled framework for uncertainty quantification in deep learning, the employment of Bayesian Neural Networks (BNNs) is still constrained by their increased computational requirements and the convergence difficulties when training very deep, state-of-the-art architectures. In this work, we reinterpret weight-sharing quantization techniques from a stochastic perspective in the context of training and inference with Bayesian Neural Networks (BNNs). Specifically, we leverage 2D-adaptive Gaussian distributions, Wasserstein distance estimations, and alpha-blending to encode the stochastic behaviour of a BNN in a lower-dimensional, soft Gaussian representation. Through extensive empirical investigation, we demonstrate that our approach significantly reduces the computational overhead inherent in Bayesian learning by several orders of magnitude, enabling the efficient Bayesian training of large-scale models, such as ResNet-101 and Vision Transformer (ViT). On various computer vision benchmarks (including CIFAR-10, CIFAR-100, and ImageNet1k) our approach compresses model parameters by approximately 50× and reduces model size by 75%, while achieving accuracy and uncertainty estimations comparable to the state-of-the-art.
Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025, Mai Khao, Thailand. PMLR: Volume 258. Copyright 2025 by the author(s).
1 INTRODUCTION
Bayesian Neural Networks (BNNs) promise to combine the representational capacity of deep learning with principled uncertainty estimations enabled by means of Bayesian learning theory (Hinton and Neal, 1995). Arguably, this combination makes them particularly appealing for safety-critical machine learning applications where the quantification of uncertainty is of paramount importance (Forsberg et al., 2020). Indeed, they have been widely employed in scenarios like e-Health (Marcos et al., 2010), robust control (Wicker et al., 2024), autonomous driving (Michelmore et al., 2020), human-in-the-loop applications (Treiss et al., 2021), automated diagnosis (Billah and Javed, 2022) and many others (Lampinen and Vehtari, 2001; Bharadiya, 2023; Vehtari and Lampinen, 1999).
Unfortunately, though, the principled treatment of uncertainty comes at the price of an increased pressure on computational resources, including the model size (×2 in the common case of mean-field Variational Inference (Blundell et al., 2015)), and the inference time, increased by an order of magnitude as multiple forward passes are needed (Hinton and Neal, 1995). Therefore, despite their potential, the use of BNNs in edge-AI and resource-constrained applications is still very limited (Bonnet et al., 2023). While recent works have investigated the development of techniques to tackle the aforementioned challenges, these are generally limited to the application of methods originally developed for deterministic neural networks (NNs) (Ferianc et al., 2021; Park et al., 2021; Chien and Chang, 2023), or the usage of the Bayesian paradigm at training time for model-order reduction purposes but without the uncertainty estimation at inference time (Van Baalen et al., 2020; Guo, 2018; Perrin et al., 2024; Subia-Waud and Dasmahapatra, 2024).
In this work, we present a quantisation technique specifically tailored to capture the stochastic behaviour of BNNs. More specifically, we design a stochastic weight-sharing quantisation method, called 2DGBNN, based on dynamically adaptive mini-batch 2D Gaussian Mixture Models and predicated on optimising weight distributions through metrics derived from Wasserstein distances (Chizat et al., 2020; De Palma et al., 2021), network gradients, and intra-class variance. Our technique works by reinterpreting standard weight-sharing (Subia-Waud and Dasmahapatra, 2024) from a 2D perspective, accounting for both the mean and variance of the BNN's parameters in the case of mean-field Variational Inference (VI), and by giving it a stochastic semantic. At each training step, the current sum-total of network parameters is clustered using standard parameter-free techniques, and a mini-batch approach is used for the estimation of Gaussian distributions on a parameter space accounting for (possibly) millions of parameters. Representatives of each cluster are then selected, and alpha-blending techniques are used for sampling parameter realisations during forward passes through the network architecture. Thanks to its simplicity, the method can be seamlessly integrated into commonly used Bayesian approximation techniques based on Variational Inference (Blundell et al., 2015; Gal and Ghahramani, 2016; Minka, 2001; Welling and Teh, 2011).
We perform an extensive empirical investigation on the effectiveness of our method in training large-scale models and in reducing the computational footprint of BNNs. We utilize four widely used image classification datasets (i.e., MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky, 2009), CIFAR-100 (Krizhevsky, 2009) and ImageNet1K (Deng et al., 2009)) and test the model results in four widely employed neural network architectures, including ResNet-18, ResNet-50, ResNet-101 (He et al., 2016) and Vision Transformer (ViT) (Dosovitskiy et al., 2021). Through our approach, we can reduce the number of trainable parameters in the network by up to 50×, while obtaining accuracy and uncertainty metrics comparable to the state-of-the-art.¹
This paper makes the following main contributions:
• We introduce a stochastic weight-sharing technique specifically tailored for BNNs that employs Wasserstein distance, gradients, and within-class variance in order to improve model efficiency.
• We empirically demonstrate that our stochastic weight-sharing method compares favourably against quantisation methods employed for BNNs in terms of further reducing the computational requirements and better preserving accuracy and uncertainty metrics.
• In a variety of architectures, including ResNet-18, ResNet-50, ResNet-101 and ViT, we show how our technique can reduce model size to as little as a quarter of the original size, while working on par with state-of-the-art techniques for large-scale approximate BNN training.
2 RELATED WORK
Computational efficiency is one of the long-standing issues concerning the application of BNNs to edge-AI and embedded systems (Bonnet et al., 2023). Indeed, there is a vast section of the literature that aims to tackle the issue from a variety of different angles. One such closely related area is that of quantisation of the BNN's parameters (Guo, 2018), where the precision of the latter is reduced to minimise their computational footprint. Works have adapted techniques initially developed for deterministic NNs (Ferianc et al., 2021; Subedar et al., 2021; Lin et al., 2023; Dong et al., 2022; Ullrich et al., 2017) or developed new techniques that take into account the distributional behaviour of BNNs' parameters (Chien and Chang, 2023; Park et al., 2021; Yang et al., 2020a). While these techniques increase the computational efficiency of a given BNN architecture, by developing a weight-sharing technique (and therefore not only quantising but also in effect reducing the number of BNN parameters) the method we introduce is able to match their behaviour in standard BNN benchmarks, while at the same time allowing for the training of large-scale models (Hernández-Lobato and Adams, 2015).
¹ To support reproducibility, our code is available at https://github.com/moulelin/2DGBNN
Several works have looked at reducing the number of parameters in BNNs, either by applying pruning techniques (Sharma and Jennings, 2021; Beckers et al., 2023; Roth and Pernkopf, 2018) or using low-rank approximations (Doan et al., 2024; Dusenberry et al., 2020; Swiatkowski et al., 2020). The latter work by reparameterising BNN weights and biases using a lower-rank representation, enabling them to scale BNN inference to large models such as ResNet-50 (Dusenberry et al., 2020) and ViT (Doan et al., 2024) architectures, and are as such closely related to our work in that a smaller common representation is found for BNN parameters. The number of final parameters, though, is still generally higher than that of their full deterministic counterpart, so these methods cannot be used for edge-AI applications. In Section 5, we will observe that, albeit at the price of a small reduction in accuracy, our method reduces the number of parameters by 3 orders of magnitude.
Finally, a number of works have looked into applying Bayesian techniques for quantisation of deterministic NNs (Subia-Waud and Dasmahapatra, 2024; Louizos et al., 2017; Van Baalen et al., 2020; Perrin et al., 2024; Achterhold et al., 2018; Soudry et al., 2014; Yang et al., 2020b), including the application of weight-sharing techniques (Roth and Pernkopf, 2018; Subia-Waud and Dasmahapatra, 2024; Nowlan and Hinton, 2018). While these techniques provide encouraging results on the suitability of Bayesian theory for increasing the efficiency of deep learning, being specifically tailored for deterministic neural networks they cannot be applied to BNNs.
3 BAYESIAN NEURAL NETWORKS
We consider a neural network architecture $f_{\mathbf{w}} : \mathbb{R}^{n} \to \mathbb{R}^{m}$ parameterised by a vector of weights and biases $\mathbf{w} \in \mathbb{R}^{n_{\mathbf{w}}}$, which we refer to collectively as parameters of the neural network. Bayesian learning of neural networks begins by placing a prior distribution, $p(\mathbf{w})$, over the network's parameters. This is often assumed to be encoded through a vector of independent Gaussian distributions, one for each weight and bias in the BNN (Blundell et al., 2015). This prior belief is then updated given a dataset's evidence through the application of the Bayesian learning rule. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ denote the full training dataset, $\mathbf{X} = (x_1, \ldots, x_n)$ the combined vector of training inputs and $\mathbf{y} = (y_1, \ldots, y_n)$ their corresponding outputs, then the posterior distribution on the weights is computed as:
\[ p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})}, \tag{1} \]
where $p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})$ is the likelihood and $p(\mathbf{y} \mid \mathbf{X})$ is the model evidence. Finally, given a test point $x^{*}$, the BNN's posterior predictive distribution on $x^{*}$ is defined by:
\[ p(y^{*} \mid x^{*}, \mathbf{X}, \mathbf{y}) = \int p(y^{*} \mid x^{*}, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})\, d\mathbf{w}. \tag{2} \]
Unfortunately, neither Equation (1) nor Equation (2) can generally be computed exactly (Hinton and Neal, 1995). Therefore a variety of approximate Bayesian inference techniques have been developed in the literature, with the two most prominent classes of approaches being based on Monte Carlo algorithms (Hinton and Neal, 1995) or on Variational Inference methods (Blundell et al., 2015). While the former provides the gold standard in terms of approximation accuracy in small architectures, Variational Inference (VI) guarantees better scalability and is therefore the focus of this paper. Briefly, VI works by approximating the true posterior, $p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})$, by optimising the KL divergence over a simpler, parameterised distribution, $q(\mathbf{w})$, often a multidimensional Gaussian distribution with diagonal covariance. The predictive distribution of Equation (2) is then approximated by sampling multiple times from $q(\mathbf{w})$, and averaging the results.
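The sampling-and-averaging step can be illustrated with a minimal sketch. It assumes a single linear layer followed by a softmax and a diagonal Gaussian $q(\mathbf{w})$. The function names `softmax` and `mc_predictive`, the layer shape and the default of 30 samples are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_predictive(x, mu, sigma, n_samples=30, rng=None):
    """Monte Carlo estimate of the predictive distribution of Equation (2).

    x:     (batch, d_in) inputs
    mu:    (d_in, n_classes) variational means of a single linear layer
    sigma: (d_in, n_classes) variational standard deviations
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.zeros((x.shape[0], mu.shape[1]))
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        w = mu + sigma * eps          # one draw from the diagonal Gaussian q(w)
        probs += softmax(x @ w)       # one forward pass with the sampled weights
    return probs / n_samples          # average over the sampled forward passes
```

Each call costs `n_samples` forward passes, which is the inference-time overhead discussed in the text.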
Despite the approximation, however, VI BNNs still come with several limitations that impede their deployment in practice. First, even in the case of diagonal Gaussian distributions, the number of parameters in the BNN is doubled compared to their deterministic counterpart. Furthermore, approximating the predictive distribution requires multiple sampling procedures and multiple forward passes through the network, so that their computational time is orders of magnitude higher than, again, their deterministic counterpart. Finally, despite its greater flexibility, standard Variational Inference struggles to learn very deep BNNs and is generally limited to more traditional architectures and small-to-medium-size datasets. In the following, we develop a weight-sharing quantisation scheme targeted at BNNs to tackle these limitations.
4 2D GAUSSIAN BAYESIAN NEURAL NETWORK
Consider the vector $\mathbf{w}$ of the BNN's weights, where, at each step of the training process, each weight $w_i$, $i = 1, \ldots, n_{\mathbf{w}}$, is distributed according to a given Gaussian distribution $\mathcal{N}(\mu_{w_i}, \sigma_{w_i})$. We denote with $\mathcal{N}_{full}$ the full set of weight distributions. Our stochastic weight-sharing technique aims at finding a set of 2D Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1), \ldots, \mathcal{N}(\mu_k, \Sigma_k)$ (which we collectively denote as $\mathcal{N}_{ws}$),² with $k \ll n_{\mathbf{w}}$, such that $\mathcal{N}_{ws}$ can be used to approximate the behaviour of $\mathcal{N}_{full}$ in terms of resulting accuracy and uncertainty.
Briefly, we do this by first modelling all the hyperparameters of the $\mathcal{N}_{full}$ distributions through a Gaussian Mixture Model (GMM) (Section 4.2), and then applying alpha-blending to sample weights from the resulting realisations of the GMM (Section 4.3). Additionally, 2DGBNN implements several steps informed by best practice in quantisation of deterministic NNs for further reducing the shared number of weights, including outlier detection (Section 4.1), cluster dimensionality reduction and the merging of similar distributions (Section 4.2). Finally, we will present the overall algorithm for 2DGBNN in Section 4.4.
4.1 Outliers vs. Inliers Classification
The key observation behind the weight-classification stage of our algorithm is that not all the weights of a neural network have an equal impact on the output. Taking inspiration from quantisation techniques for deterministic neural networks (Subedar et al., 2021), we therefore do not quantise extreme values in the BNN, as those are likely to be particularly influential in the final result. Specifically, we partition the full weight vector $\mathbf{w}$ into two separate vectors, $\mathbf{w}_{in}$ and $\mathbf{w}_{out}$, and only apply weight-sharing to the former. We do this by using two different criteria.
Mean Threshold: Weights associated with a mean with an absolute value greater than a threshold $\tau$ (e.g., $\tau = 0.2$) are classified as outliers, i.e.:
\[ w_i \in \mathbf{w} \text{ is an outlier if } |\mu_i| > \tau. \]
A discussion of how we chose parameters like $\tau$ is provided in Appendix E.
Gradient Threshold: Weights associated with gradient magnitudes exceeding the threshold that places them within the top 1% during backpropagation are also categorized as outliers, as their high gradient values likely signify their substantial impact on model performance, i.e.:
\[ w_i \in \mathbf{w} \text{ is an outlier if } |\nabla w_i| \text{ is in the top } 1\% \text{ of } \{|\nabla w_j|\}_{j=1}^{n_{\mathbf{w}}}. \]
² As standard, we use $\sigma$ to denote the one-dimensional standard deviation in the case of a 1D Gaussian, and $\Sigma$ to denote the multidimensional covariance in the case of a multidimensional Gaussian.
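The two outlier criteria of Section 4.1 can be sketched as a single boolean mask. This is an illustrative sketch only: the function name `split_outliers`, the flat-array inputs and the use of `np.quantile` to locate the top-1% gradient cut-off are assumptions, not the authors' implementation.

```python
import numpy as np

def split_outliers(mu, grad, tau=0.2, top_frac=0.01):
    """Return a boolean mask that is True where a weight is an outlier.

    mu:   variational means of the weights (any shape)
    grad: gradients of the loss w.r.t. the same weights (same shape)
    """
    mu = np.asarray(mu, dtype=float).ravel()
    grad_mag = np.abs(np.asarray(grad, dtype=float).ravel())

    # Mean threshold: |mu_i| > tau.
    mean_outlier = np.abs(mu) > tau

    # Gradient threshold: |grad_i| lies in the top `top_frac` of all magnitudes.
    grad_cut = np.quantile(grad_mag, 1.0 - top_frac)
    grad_outlier = grad_mag > grad_cut

    return mean_outlier | grad_outlier
```

The inlier set is then simply the complement, e.g. `theta_in = np.stack([mu[~mask], sigma[~mask]], axis=1)` for a hypothetical array `sigma` of standard deviations.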
Algorithm 1 2DGBNN
Input: NN architecture $f_{\mathbf{w}}$, training data $\mathcal{D} = \{(\mathbf{X}, \mathbf{y})\}$, algorithm thresholds $\tau_w, \tau_d, \tau_g, \tau_v$, BNN prior $p(\mathbf{w})$.
Output: Stochastic weight-sharing trained BNN.
Stage 1: Initialise GMM
  Initialise $\mu, \sigma$ according to $p(\mathbf{w})$
  for each epoch do    ▷ Pre-training
    Sample weights: $\mathbf{w} = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$
    Update $\mu, \sigma$ by training on $\mathcal{D}$
  end for
  if $|w_i| > \tau_w$ or $|\nabla w_i|$ in top 1% then $w_i$ is outlier
  else $w_i$ is inlier    ▷ §4.1
  end if
  Learn GMM on inlier params $\Theta_{in}$ (Equation (3))
Stage 2: Refine GMM
  for each inlier weight $w_i$ do
    Perform §4.3 check on Mahalanobis distance
    if outside 95th percentile then
      $w_i$ is assigned to multiple clusters
    else $w_i$ is assigned only to the closest Gaussian
    end if
  end for
  Apply alpha-blending to ellipse points (Eq. 7)
  repeat
    for each pair $(\mathcal{N}_1, \mathcal{N}_2)$ in GMM do
      if $W(\mathcal{N}_1, \mathcal{N}_2) < \tau_d$, $\Delta_g < \tau_g$, $\Delta_v < \tau_v$ then
        Merge $\mathcal{N}_1, \mathcal{N}_2$ using Eqs. (5), (6)
      end if
    end for
  until no more Gaussians can be merged
Stage 3: Final BNN Training
  for each epoch do
    for each weight $w_i$ do
      if $w_i$ is inlier then
        Sample $w_i \sim \sum_{k=1}^{K} \pi_k \mathcal{N}(\mu_k, \Sigma_k)$
      else
        Use $w_i \sim \mathcal{N}(\mu_{w_i}, \sigma_{w_i}^{2})$
      end if
    end for
    Minimising step on the loss $-\hat{\mathcal{L}}(\mathcal{D}, q)$ from Eq. (10)
  end for
4.2 2DGBNN training
Given the set of distributions of the inlier weights, which we denote as $\mathcal{N}_{in}$, we proceed by clustering their means and variances on the 2-dimensional $\mu$-$\sigma$ plane. We do this by learning a Gaussian Mixture Model (GMM) of the form $p((\mu, \sigma)) = \sum_{k=1}^{K} \pi_k \mathcal{N}((\mu, \sigma) \mid \mu_k, \Sigma_k)$ over the set of points $\Theta_{in} = \{(\mu_{w_i}, \sigma_{w_i})\}_{i=1}^{n_{\mathbf{w}_{in}}}$. Due to the large volume of points involved in the learning of the GMM (typically, millions of weights), we rely on mini-batch learning of GMMs (Li et al., 2014). This is achieved by sampling random mini-batches $\mathcal{B} \subset \Theta_{in}$, and iteratively minimising the negative log-likelihood over the mini-batch:
\[ \min \; -\sum_{(\mu_i, \sigma_i) \in \mathcal{B}} \log\Big( \sum_{k=1}^{K} \pi_k \mathcal{N}((\mu_i, \sigma_i) \mid \mu_k, \Sigma_k) \Big). \tag{3} \]
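As a rough illustration of the mini-batch objective in Equation (3), the sketch below evaluates the negative log-likelihood of a 2D GMM on one mini-batch of $(\mu, \sigma)$ points. The function name `gmm_minibatch_nll` and the array layout are assumptions made for this example. The optimiser that updates $\pi_k, \mu_k, \Sigma_k$ is left out.

```python
import numpy as np

def gmm_minibatch_nll(points, pi, means, covs):
    """Negative log-likelihood of a 2D GMM on a mini-batch.

    points: (B, 2) mini-batch of (mu, sigma) pairs
    pi:     (K,)   mixing weights, summing to one
    means:  (K, 2) component means
    covs:   (K, 2, 2) component covariances
    """
    diff = points[:, None, :] - means[None, :, :]               # (B, K, 2)
    inv = np.linalg.inv(covs)                                   # (K, 2, 2)
    maha = np.einsum("bki,kij,bkj->bk", diff, inv, diff)        # squared Mahalanobis
    _, logdet = np.linalg.slogdet(covs)                         # (K,)
    # log(pi_k) + log N(point | mean_k, cov_k) for a 2-dimensional Gaussian
    log_comp = np.log(pi) - 0.5 * (maha + logdet + 2.0 * np.log(2.0 * np.pi))
    # log-sum-exp over components for numerical stability
    m = log_comp.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return -log_mix.sum()
```

A mini-batch can be drawn with, for instance, `theta_in[rng.choice(len(theta_in), size=4096, replace=False)]`, where `theta_in` and the batch size are placeholders.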
After the initial GMM learning, we perform two further reduction steps based on the number of points around each Gaussian, and on the distance between pairs of Gaussians.
Cluster Size Reductions: During the initial Gaussian clustering stage, clusters associated with fewer than 30 weights are identified. Weights within these small clusters are treated as outliers due to their lack of representation within the broader weight distribution. Overall, we empirically find that approximately 1.8% of the total weights of a neural network are generally allocated as outliers.
Merging Gaussians: We merge together Gaussian distributions that are very close to each other. We do this by relying on the distance between Gaussians and their gradients. Specifically, we compute the Wasserstein-2 distance between pairs of distributions as (Jacobs et al., 2023):
\[ W_2\big(\mathcal{N}(\mu_i, \Sigma_i), \mathcal{N}(\mu_j, \Sigma_j)\big)^2 = \|\mu_i - \mu_j\|_2^2 + \mathrm{Tr}\Big(\Sigma_i + \Sigma_j - 2\big(\Sigma_i^{1/2} \Sigma_j \Sigma_i^{1/2}\big)^{1/2}\Big). \tag{4} \]
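Equation (4) has a closed form that is straightforward to evaluate. The sketch below is one possible NumPy transcription. The helper `psd_sqrt` computes the matrix square root by eigendecomposition, which is an implementation choice made here to avoid extra dependencies. The function names are illustrative.

```python
import numpy as np

def psd_sqrt(a):
    # Square root of a symmetric positive semi-definite matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)   # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def wasserstein2_sq(mu_i, cov_i, mu_j, cov_j):
    """Squared Wasserstein-2 distance between two Gaussians, as in Equation (4)."""
    root_i = psd_sqrt(cov_i)
    cross = psd_sqrt(root_i @ cov_j @ root_i)
    mean_term = np.sum((mu_i - mu_j) ** 2)
    cov_term = np.trace(cov_i + cov_j - 2.0 * cross)
    return float(mean_term + cov_term)
```

For two identical Gaussians the distance is zero, and for isotropic covariances the trace term reduces to the squared difference of the standard deviations summed over dimensions.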
If the distance between two Gaussians is less than a given threshold $\gamma$, then we inspect the gradient of the network in the weight associated to the cluster centroid and its variance. If those are smaller than two given thresholds $\sigma$ and $\alpha$, then we proceed by merging the two Gaussians into one.³ The merge is executed using the following equations (Agueh and Carlier, 2011; Takatsu, 2011):
\[ \mu_{merged} = \frac{\mu_1 + \mu_2}{2} \tag{5} \]
\[ \Sigma_{merged} = \frac{\Sigma_1 + \Sigma_2}{2} + \frac{1}{8}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^{T} + \frac{1}{2}\big(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\big)^{1/2} \tag{6} \]
This ensures that the newly formed Gaussian component accurately reflects the collective distribution characteristics of the initial components while maintaining minimal internal variation.
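A direct transcription of the merge step, following Equations (5) and (6) term by term as printed, might look as follows. The function names are illustrative, and the eigendecomposition-based square root is an implementation choice made for this sketch rather than something specified in the text.

```python
import numpy as np

def psd_sqrt(a):
    # Square root of a symmetric positive semi-definite matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def merge_gaussians(mu1, cov1, mu2, cov2):
    """Merge two Gaussians using Equations (5) and (6) as printed."""
    d = (mu1 - mu2)[:, None]                       # column vector mu1 - mu2
    root1 = psd_sqrt(cov1)
    cross = psd_sqrt(root1 @ cov2 @ root1)         # (S1^{1/2} S2 S1^{1/2})^{1/2}
    mu = 0.5 * (mu1 + mu2)                         # Equation (5)
    cov = 0.5 * (cov1 + cov2) + 0.125 * (d @ d.T) + 0.5 * cross   # Equation (6)
    return mu, cov
```

In the overall procedure this would only be called for pairs that pass the distance, gradient and variance thresholds described above.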
³ Details about the thresholds we use in our experiments can be found in the Appendix.
Table 1: Comparison of 2DGBNN and competitive techniques (superscripts indicate matching architectures) on the ImageNet1k dataset. We also provide the number of outliers, ellipses, and Gaussians derived by our method.
| Architecture | Method | Accuracy ↑ | NLL ↓ | ECE ↓ | #Outliers | #Ellipses | #Gaussians | #Parameters (M) ↓ / Compression Ratio (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| ResNet-18 | Mutual BNN (Pham et al., 2024) | 67.7 | 1.327 | 0.1300 | - | - | - | 23.4M / - |
| ResNet-18 | 2DGBNN (ours) | 68.1 | 1.253 | 0.019 | 23013 | 10885 | 2217 | 0.038M / 99% |
| ResNet-50 | Deep Ensembles (Lakshminarayanan et al., 2017) | 77.5 | 0.877 | 0.0305 | - | - | - | 146.7M / - |
| ResNet-50 | Rank-1 BNN (Dusenberry et al., 2020) | 77.3 | 0.886 | 0.0166 | - | - | - | 26.0M / - |
| ResNet-50 | ATMC (30 samples) (Heek and Kalchbrenner, 2019) | 77.5 | 0.883 | - | - | - | - | 768.0M / - |
| ResNet-50 | MCMC (9 samples) BNN (Zhang et al., 2019) | 77.1 | 0.888 | - | - | - | - | 230.4M / - |
| ResNet-50 | 2DGBNN (ours) | 75.1 | 0.961 | 0.029 | 37172 | 56873 | 3250 | 0.101M / 99% |
| ResNet-101 | 2DGBNN (ours) | 75.50 | 0.969 | 0.023 | 53641 | 4311 | 2464 | 0.063M / 99% |
| ViT-B-16 | 2DGBNN (ours) | 76.01 | 0.901 | 0.064 | 9765 | 338329 | 5440 | 0.359M / 98% |
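4.3 $\alpha$-blending (Multi-Clusters) of Weights
Before the final sampling step, we reassess the inlier weights $\mathbf{w}_{in}$ by computing their squared Mahalanobis distances to the cluster means, i.e., $D^2 = (w_i - \mu_k)^{\top} \Sigma_k^{-1} (w_i - \mu_k)$, where $\mu_k$ and $\Sigma_k$ are the mean and covariance of cluster $k$. If a weight's $D^2$ exceeds 5.991,⁴ we reassess its cluster assignment. We do this by relying on $\alpha$-blending (Mildenhall et al., 2021).
Specifically, for each $w_i \in \mathbf{w}_{in}$ we compute the subset of the GMM's components $\mathcal{N}(\mu_k, \Sigma_k)$, for $k = 1, \ldots, n_i$, such that the above condition on $D^2$ is met. We then sample the final value of the weight from the resulting distribution:
\[ p(w_i) = \sum_{k=1}^{n_i} \alpha_k\, \mathcal{N}(\mu_k, \Sigma_k), \tag{7} \]
where $\alpha_k$ is the mixing coefficient, computed as the pdf of $(\mu_i, \sigma_i^2)$ according to $\mathcal{N}(\mu_k, \Sigma_k)$.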
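One way to read the Mahalanobis check and the blending of Equation (7) is sketched below. Several details are assumptions made only for illustration: the blending coefficients are normalised to sum to one, the blend runs over all components when the closest one fails the check, and the final weight is drawn as $\mu + |\sigma|\,\epsilon$ from a sampled $(\mu, \sigma)$ pair. The function names are hypothetical.

```python
import numpy as np

CHI2_95_2DOF = 5.991  # 95th percentile of the chi-square distribution with 2 dof

def mahalanobis_sq(theta, means, covs):
    # Squared Mahalanobis distance of one (mu, sigma) point to every component.
    diff = theta[None, :] - means
    return np.einsum("ki,kij,kj->k", diff, np.linalg.inv(covs), diff)

def blending_coefficients(theta, means, covs):
    """Blending coefficients alpha_k for one inlier weight (assumed normalised)."""
    d2 = mahalanobis_sq(theta, means, covs)
    closest = int(np.argmin(d2))
    if d2[closest] <= CHI2_95_2DOF:
        # Inside the 95% ellipse: keep only the closest Gaussian.
        alpha = np.zeros(len(means))
        alpha[closest] = 1.0
        return alpha
    # Outside: blend components in proportion to their 2D Gaussian pdf at theta.
    _, logdet = np.linalg.slogdet(covs)
    log_pdf = -0.5 * (d2 + logdet) - np.log(2.0 * np.pi)
    alpha = np.exp(log_pdf - log_pdf.max())
    return alpha / alpha.sum()

def sample_weight(theta, means, covs, rng):
    # Draw a component by alpha, then a (mu, sigma) pair, then the weight itself.
    alpha = blending_coefficients(theta, means, covs)
    k = rng.choice(len(means), p=alpha)
    mu, sigma = rng.multivariate_normal(means[k], covs[k])
    return mu + abs(sigma) * rng.standard_normal()
```

Here `theta` stands for the $(\mu_{w_i}, \sigma_{w_i})$ point of a single inlier weight, and `means`, `covs` hold the GMM components.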
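4.3.1 Combined Variational Formulation
Finally, we observe that our stochastic weight-sharing technique can be seamlessly integrated within the ELBO formulation of VI training (Nowlan and Hinton, 2018; Zhang et al., 2018). Formally, we assume that $\mathbf{w}_{out}$ and $\mathbf{w}_{in}$ are vectors of pairwise independent weights,⁵ which allows us to bound the variational objective as follows:
\begin{align*}
\mathcal{L}(\mathcal{D}, q) &= \mathbb{E}_{q(\mathbf{w})}[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})] - \mathrm{KL}(q(\mathbf{w}) \,\|\, p(\mathbf{w})) \\
&= \mathbb{E}_{q(\mathbf{w})}[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})] - \mathrm{KL}(q(\mathbf{w}_{in}) \,\|\, p(\mathbf{w}_{in})) - \mathrm{KL}(q(\mathbf{w}_{out}) \,\|\, p(\mathbf{w}_{out})) \tag{8} \\
&\approx \mathbb{E}_{q(\mathbf{w})}[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})] - \mathrm{KL}\Big(\sum_k \pi_k \mathcal{N}_k \,\Big\|\, p(\mathbf{w}_{in})\Big) - \mathrm{KL}(q(\mathbf{w}_{out}) \,\|\, p(\mathbf{w}_{out})) \tag{9} \\
&\geq \mathbb{E}_{q(\mathbf{w})}[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})] - \underbrace{\sum_k \pi_k\, \mathrm{KL}(\mathcal{N}_k \,\|\, p(\mathbf{w}_{in}))}_{\text{GMM KL divergence}} - \underbrace{\mathrm{KL}(q(\mathbf{w}_{out}) \,\|\, p(\mathbf{w}_{out}))}_{\text{Outliers KL divergence}} := \hat{\mathcal{L}}(\mathcal{D}, q), \tag{10}
\end{align*}
where the equality in Equation (8) is due to the pairwise independence assumption between components of $\mathbf{w}_{in}$ and $\mathbf{w}_{out}$, the approximation of Equation (9) is due to the GMM approximation of the inlier weights, and the final inequality is due to the convexity of the KL divergence. Notice that, in terms of losses (i.e., negated objectives), $-\hat{\mathcal{L}}$ is an upper bound on the original loss $-\mathcal{L}$ (up to the approximation in Equation (9)), so that its minimisation by means of gradient descent guarantees the improvement of the latter.
⁴ Corresponding to the 95th percentile of the $\chi^2_2$ distribution, which models the Mahalanobis distance of multidimensional Gaussians.
⁵ This is true for the variational distribution, but it is an approximation for the true posterior.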
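The convexity step behind the final inequality can be written out explicitly. The block below is a short worked step that relies only on the standard convexity of the KL divergence in its first argument. The shorthand $p_{in} = p(\mathbf{w}_{in})$ is introduced here for brevity and is not the authors' notation.

```latex
% Convexity of KL in its first argument (second argument fixed at p_in):
\mathrm{KL}\Big(\sum_k \pi_k \mathcal{N}_k \,\Big\|\, p_{in}\Big)
  \;\le\; \sum_k \pi_k\, \mathrm{KL}\big(\mathcal{N}_k \,\|\, p_{in}\big),
  \qquad \pi_k \ge 0,\ \ \sum_k \pi_k = 1.
% Negating both sides flips the inequality, giving the step from (9) to (10):
-\mathrm{KL}\Big(\sum_k \pi_k \mathcal{N}_k \,\Big\|\, p_{in}\Big)
  \;\ge\; -\sum_k \pi_k\, \mathrm{KL}\big(\mathcal{N}_k \,\|\, p_{in}\big).
```

The practical benefit is that each term $\mathrm{KL}(\mathcal{N}_k \,\|\, p_{in})$ is a divergence between a single Gaussian component and the prior, which is available in closed form when the prior is Gaussian, whereas the KL divergence of the full mixture is not.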
4.4 Overall Methodology
The overall methodology is presented in pseudocode form in Algorithm 1. 2DGBNN combines its component parts in three stages. In the first stage, the BNN is initialised with the given prior, a pre-training step is performed and the resulting BNN is used to initialise the GMM clustering. In stage 2, the initial GMM clustering obtained is refined by merging close Gaussians, and by performing $\alpha$-blending. Finally, the BNN is trained by optimising the variational objective on a combination of full weight distributions (for the outliers) and GMM-based weight-sharing (for the inliers).

(a) Density and Frequency of ResNet50 in ImageNet and CIFAR-100
(b) Weights Scatter: ImageNet (above) and CIFAR-100 (below)
Figure 1: Weight distribution of the BNN prior to stochastic-sharing. Panel (a) shows the density plot of the second convolutional layer of ResNet-50 when trained on ImageNet (in blue) and CIFAR-100 (in red). Panel (b) shows the corresponding scatter plots, including lines for the 1% and 99%.
5 EXPERIMENTS
To validate the effectiveness and scalability of 2DGBNN, we conduct comprehensive experiments using various NN architectures on benchmark image classification datasets. This section details the datasets, models, experimental setup, results, and analysis of our findings. Specifically, we evaluate our method on four common image classification benchmarks: MNIST, CIFAR-10, CIFAR-100, and ImageNet1k; as well as four widely used NN architectures: ResNet-18, ResNet-50, ResNet-101 and a Vision Transformer (ViT). The hyperparameters used and the details of the training are given in the Appendix. Throughout this section, we compare our technique against the results obtained by Deep Ensembles (Lakshminarayanan et al., 2017), Rank-1 BNN (Dusenberry et al., 2020), MCMC BNN (Zhang et al., 2019), ATMC (Heek and Kalchbrenner, 2019), Mutual BNN (Pham et al., 2024), F-SGVB-LRT (Nguyen et al., 2024), ABNN (Franchi et al., 2024), LP-BNN (Franchi et al., 2023), IR (Kim et al., 2023), SSVI (Li et al., 2024) and mBCNN (Kong et al., 2023). We evaluate the resulting models in terms of Accuracy, Negative Log-Likelihood (NLL, which measures the model's uncertainty in its predictions), and Expected Calibration Error (ECE, which measures the calibration of predicted probabilities (Guo et al., 2017)). Each experiment is conducted three times with different random seeds, and we report the average results. We conduct experiments with all the aforementioned comparison models, totaling 13 2DGBNN experiments, which include the base models of all comparison methods. Additionally, we perform two quantisation comparison experiments (Section 5.2) and an ablation study (Section 5.3).
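For reference, the two uncertainty metrics can be computed from predicted class probabilities as in the sketch below. It uses the common equal-width binning estimator of ECE. The choice of 15 bins and the function names are assumptions, since the binning used in the experiments is not stated here.

```python
import numpy as np

def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true labels.

    probs:  (N, C) predicted class probabilities
    labels: (N,)   integer class labels
    """
    p_true = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(p_true, eps, 1.0))))

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins."""
    conf = probs.max(axis=1)                      # confidence of the predicted class
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            total += in_bin.mean() * gap          # weight by the fraction of samples in the bin
    return float(total)
```

For a BNN, `probs` would be the Monte Carlo average of the softmax outputs over several sampled forward passes.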
5.1 Performance Evaluation
The results obtained with 2DGBNN and those of the state-of-the-art are listed in Tables 1, 2 and 3 for ImageNet1k, CIFAR-100 and CIFAR-10 respectively, along with details on the computations performed by 2DGBNN.
In the case of ImageNet1k (Table 1) the comparison is performed against Deep Ensembles (Lakshminarayanan et al., 2017), Rank-1 BNN (Dusenberry et al., 2020), MCMC BNN with 9 samples (Zhang et al., 2019), ATMC (Heek and Kalchbrenner, 2019), and Mutual BNN (Pham et al., 2024). We observe that, in all cases, our method successfully reduces the number of parameters by 3 to 4 orders of magnitude. While the accuracy is reduced by around 2% in the ResNet-50 case, we do obtain comparable uncertainty estimation as evaluated by NLL and ECE.
We obtain similar results on the CIFAR-100 dataset (Table 2), comparing against Deep Ensembles (Lakshminarayanan et al., 2017), Rank-1 BNN (Dusenberry et al., 2020), F-SGVB-LRT (Nguyen et al., 2024) and ABNN (Franchi et al., 2024). Our method achieves a substantial reduction in model parameters while maintaining or improving performance compared to other methods, at the price of an accuracy reduction of approximately 2% when compared to Deep Ensembles and Rank-1 BNN. For instance, for ResNet-18 we use only 0.019M parameters, which is drastically lower than the 23.4M parameters used by the F-SGVB-LRT model, yet we achieve competitive accuracy. Finally, Table 3 lists analogous results in the context of CIFAR-10, comparing against IR (Kim et al., 2023), F-SGVB-LRT (Nguyen et al., 2024), ABNN (Franchi et al., 2024) and LP-BNN (Franchi et al., 2023).
Table 2: Comparison of 2DGBNN and competitive techniques (superscripts indicate matching architectures) on the CIFAR-100 dataset. We also provide the number of outliers, ellipses, and Gaussians derived by our method.
| Architecture | Method | Accuracy ↑ | NLL ↓ | ECE ↓ | #Outliers | #Ellipses | #Gaussians | #Parameters (M) ↓ / Compression Ratio (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| ResNet-18 | F-SGVB-LRT (Nguyen et al., 2024) | 70.1 | 1.121 | 0.036 | - | - | - | 23.4M / - |
| ResNet-18 | SSVI (Li et al., 2024) | 75.8 | - | 0.001 | - | - | - | 2.32M / 90% |
| ResNet-18 | mBCNN (Kong et al., 2023) | 73.7 | 1.004 | 0.002 | - | - | - | 2.86M / 87.8% |
| ResNet-18 | 2DGBNN (ours) | 74.7 | 1.053 | 0.038 | 14624 | 260 | 2387 | 0.019M / 99% |
| WRN-28-10 | Deep Ensembles (Lakshminarayanan et al., 2017) | 82.7 | 0.666 | 0.021 | - | - | - | 146M / - |
| WRN-28-10 | Rank-1 BNN (Dusenberry et al., 2020) | 82.4 | 0.689 | 0.012 | - | - | - | 36.6M / - |
| WRN-28-10 | LP-BNN (Franchi et al., 2023) | 79.3 | - | 0.0702 | - | - | - | 26.8M / 63% |
| WRN-28-10 | 2DGBNN (ours) | 80.5 | 0.798 | 0.0432 | 40354 | 341 | 2390 | 0.045M / 99% |
| ResNet-50 | ABNN (Franchi et al., 2024) | 74.20 | 0.828 | 4.5 | - | - | - | 54.2M / - |
| ResNet-50 | 2DGBNN (ours) | 78.1 | 0.986 | 0.107 | 247591 | 330 | 1980 | 0.251M / 99% |
| ResNet-101 | 2DGBNN (ours) | 78.4 | 0.834 | 0.066 | 45240 | 348 | 3199 | 0.052M / 99% |
Table 3: Comparison of 2DGBNN and competitive techniques (superscripts indicate matching architectures) on the CIFAR-10 dataset. We also provide the number of outliers, ellipses, and Gaussians derived by our method.
| Architecture | Method | Accuracy ↑ | NLL ↓ | ECE ↓ | #Outliers | #Ellipses | #Gaussians | #Parameters (M) ↓ / Compression Ratio (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| ResNet-18 | F-SGVB-LRT (Nguyen et al., 2024) | 90.31 | 0.262 | 0.014 | - | - | - | 23.4M / - |
| ResNet-18 | SSVI (Li et al., 2024) | 93.74 | - | 0.006 | - | - | - | 1.17M / 95% |
| ResNet-18 | mBCNN (Kong et al., 2023) | 93.20 | 0.220 | 0.008 | - | - | - | 0.93M / 96% |
| ResNet-18 | 2DGBNN (ours) | 91.72 | 0.305 | 0.019 | 123310 | 67 | 1569 | 0.018M / 99% |
| WRN-28-10 | Deep Ensembles (Lakshminarayanan et al., 2017) | 96.2 | 0.143 | 0.020 | - | - | - | 146M / - |
| WRN-28-10 | Rank-1 BNN (Dusenberry et al., 2020) | 96.3 | 0.128 | 0.008 | - | - | - | 36.6M / 50.8% |
| WRN-28-10 | LP-BNN (Franchi et al., 2023) | 95.0 | - | 0.009 | - | - | - | 26.8M / 63% |
| WRN-28-10 | 2DGBNN (ours) | 95.2 | 0.142 | 0.012 | 39395 | 365977 | 3950 | 0.413M / 99% |
| ResNet-50 | ABNN (Franchi et al., 2024) | 95.01 | 0.160 | 1.0 | - | - | - | 54.2M / 25% |
| ResNet-50 | 2DGBNN (ours) | 93.84 | 0.223 | 0.012 | 129640 | 0 | 1628 | 0.132M / 99% |
| ResNet-101 | 2DGBNN (ours) | 92.78 | 0.270 | 0.015 | 45240 | 348 | 3199 | 0.052M / 99% |
In addition to the architectures used for comparisons, the tables report results for ResNet-101 and ViT. In these architectures too, 2DGBNN is able to reduce the number of parameters while obtaining accuracy and uncertainty metrics on par with those of state-of-the-art techniques across the remaining architectures. Interestingly, observing the training results obtained, we notice how the distribution of weights in models trained on smaller datasets (CIFAR-100 and CIFAR-10) tends to cluster near zero, as depicted in Figure 1(b) (bottom). Conversely, in ImageNet1k the weight distribution is broader, as can be seen from Figure 1(a), which compares the empirical distributions obtained on CIFAR-100 and ImageNet1k. Notice how this translates to, for example, the ResNet-50 model trained on the ImageNet1k dataset having a significantly greater number of Gaussian and Ellipse weights than when trained on the CIFAR-100 dataset.
5.2 Comparison against Quantisation
We now compare 2DGBNN against quantisation techniques applied to BNNs (Subedar et al., 2021). For this purpose, we remove the pretraining stage of 2DGBNN so as to mimic the "vanilla" BNN training employed by Subedar et al. (2021). We use a Gaussian prior with a mean of 0 and a standard deviation of 0.1. Notice that these experiments are limited to CIFAR-10 and MNIST, as the vanilla training of BNNs used in Subedar et al. (2021) does not scale to the large architectures and datasets analysed in the previous section.
The comparative results are presented in Table 4. Accuracy values are very similar across the board, while our technique obtains significantly better uncertainty metrics, except for NLL in the case of CIFAR-10.
Table 4: Comparison against the quantisation technique of Subedar et al. (2021) on CIFAR-10 and MNIST.
| Dataset | Algorithm | Quantisation technique | Accuracy ↑ | NLL ↓ | ECE ↓ | #Outliers | #Ellipses | #Gaussians | #Parameters (MB: Megabyte) |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA4) | 90.92 | 0.266 | 1.778 | - | - | - | 0.87 MB |
| CIFAR-10 | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA2) | 90.85 | 0.273 | 2.547 | - | - | - | 0.72 MB |
| CIFAR-10 | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA1) | 90.96 | 0.266 | 0.711 | - | - | - | 0.54 MB |
| CIFAR-10 | 2DGBNN | ResNet-20 (without pretrained) | 90.91 | 0.303 | 0.040 | 74181 | 3634 | 142 | 0.644MB (1.62MB) |
| CIFAR-10 | 2DGBNN | ResNet-20 (with pretrained) | 91.04 | 0.303 | 0.037 | 14624 | 260 | 2387 | 0.020M /(0.71MB) |
| MNIST | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA4) | 99.36 | 0.020 | 0.215 | - | - | - | 0.10 MB |
| MNIST | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA2) | 99.32 | 0.024 | 0.277 | - | - | - | 0.08 MB |
| MNIST | BNNs Quantization (Subedar et al., 2021) | ResNet-20 (INT8 SIGMA1) | 99.34 | 0.027 | 0.351 | - | - | - | 0.06 MB |
| MNIST | 2DGBNN | ResNet-20 (without pretrained) | 99.52 | 0.013 | 0.001 | 3206 | 581 | 237 | 0.092MB (0.403MB) |
Table 5: Ablation Study: Impact of Outliers and Ellipses on CIFAR-10 Using ResNet-20
| Configuration | #Outliers | #Ellipses | Accuracy (%) ↑ | NLL ↓ | ECE ↓ |
|---|---|---|---|---|---|
| 2DGBNN | ✓ | ✓ | 90.92 | 0.265 | 0.040 |
| Without Ellipse | ✓ | – | 90.43 | 0.304 | 0.043 |
| Without Outliers | – | ✓ | 90.18 | 0.308 | 0.026 |
| Without Outliers and Ellipse | – | – | 90.01 | 0.318 | 0.029 |
In terms of the model size (here compared in Megabytes), the two techniques compare similarly when it comes to the size of trainable parameters (corresponding to the value reported outside brackets for 2DGBNN), with quantisation having a slight edge when only 1 bit is used for encoding the standard deviation. Notice, however, that in small NNs (like the one analysed here) our technique incurs significant storage overhead, in that we need to keep an index (encoded in uint8) that assigns each inlier weight to its cluster. When this value is added (size reported in brackets in the Table), quantisation has a significant advantage over our storage requirements. While techniques such as Huffman coding or multi-level index tables can potentially reduce the size of the index vector by several factors, we leave further investigations for future work, and here notice that, despite maintaining full precision on the workings of the BNN, weight-sharing quantisation can already obtain comparable results to int8 quantisation. We notice that the two methods are complementary, and int8 quantisation can further reduce the storage requirements of the outlier weights and GMMs.
5.3 Ablation Study
Table 5 presents an ablation study on CIFAR-10 using ResNet-20 to explore how outliers and ellipses contribute to the performance of 2DGBNN. When both outliers and ellipses are included, the model achieves an accuracy of 90.92%, with the lowest NLL (0.265) and an ECE of 0.040. However, removing either component significantly impacts performance. Without ellipses, the accuracy drops by 0.49 percentage points to 90.43%, and excluding outliers reduces it slightly further by 0.25 percentage points to 90.18%. When both are removed, the accuracy drops by 0.87 percentage points to a low of 90.05%.
6 CONCLUSIONS
We have presented a stochastic weight-sharing quantisation technique based on GMMs specifically tailored to BNNs. In an extensive empirical evaluation, we have seen how our technique can significantly reduce the effective number of parameters of a BNN while obtaining results on par with state-of-the-art in large datasets and architectures such as ImageNet1k and ViT.
Future work will explore how to integrate our method into a fully Bayesian framework and the application of further quantisation to the outlier weights.
7 ACKNOWLEDGEMENTS
This publication has emanated from research jointly funded by European Union's Horizon Europe 2021–2027 framework programme, Marie Skłodowska-Curie Actions, Grant Agreement No. 101072456 and Taighde Éireann – Research Ireland under grant number 13/RC/2094_2.
Re e ences
Ach e hold, J., Koehle , J. M., Schmeink, A., and Ge-
newein, T. (2018). Va ia ional ne wo k quan iza ion. In
In e na ional con e ence on lea ning ep esen a ions.
Agueh, M. and Ca lie , G. (2011). Ba ycen e s in he
wasse s ein space. SIAM Jou nal on Ma hema ical Anal-
ysis, 43(2):904–924.
Becke s, J., Van E p, B., Zhao, Z., Kond asho , K., and
De V ies, B. (2023). P incipled p uning o bayesian neu-
al ne wo ks h ough a ia ional ee ene gy minimiza-
ion. IEEE Open Jou nal o Signal P ocessing.
Bha adiya, J. P. (2023). A e iew o bayesian machine
lea ning p inciples, me hods, and applica ions. In e na-
ional Jou nal o Inno a i e Science and Resea ch Tech-
nology, 8(5):2033–2038.
Billah, M. E. and Ja ed, F. (2022). Bayesian con olu ional
neu al ne wo k-based models o diagnosis o blood can-
ce . Applied A i icial In elligence, 36(1):2011688.
Blundell, C., Co nebise, J., Ka ukcuoglu, K., and Wie -
s a, D. (2015). Weigh unce ain y in neu al ne wo ks.
In P oceedings o he 32nd In e na ional Con e ence on
Machine Lea ning, pages 1613–1622.
Bonne , D., Hi zlin, T., Majumda , A., Dalga y, T., Es-
manho o, E., Meli, V., Cas ellani, N., Ma in, S., Nodin,
J.-F., Bou geois, G., e al. (2023). B inging unce -
ain y quan i ica ion o he ex eme-edge wi h mem is o -
based bayesian neu al ne wo ks. Na u e Communica-
ions, 14(1):7530.
Chien, J.-T. and Chang, S.-T. (2023). Bayesian asym-
me ic quan ized neu al ne wo ks. Pa e n Recogni ion,
139:109463.
Chiza , L., Roussillon, P., Lége , F., Viala d, F.-X., and
Pey é, G. (2020). Fas e wasse s ein dis ance es ima ion
wi h he sinkho n di e gence. Ad ances in Neu al In o -
ma ion P ocessing Sys ems, 33:2257–2269.
De Palma, G., Ma ian, M., T e isan, D., and Lloyd, S.
(2021). The quan um wasse s ein dis ance o o de 1.
IEEE T ansac ions on In o ma ion Theo y, 67(10):6627–
6643.
Deng, J., Dong, W., Soche , R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). Imagene : A la ge-scale hie a chical image
da abase. In 2009 IEEE Con e ence on Compu e Vision
and Pa e n Recogni ion, pages 248–255. IEEE.
Doan, B. G., Shamsi, A., Guo, X.-Y., Mohammadi, A.,
Alinejad-Rokny, H., Sejdino ic, D., Ranasinghe, D. C.,
and Abbasnejad, E. (2024). Bayesian low- ank lea n-
ing (bella): A p ac ical app oach o bayesian neu al ne -
wo ks. a Xi p ep in a Xi :2407.20891.
Dong, R., Tan, Z., Wu, M., Zhang, L., and Ma, K. (2022).
Finding he ask-op imal low-bi sub-dis ibu ion in deep
neu al ne wo ks. In In e na ional Con e ence on Ma-
chine Lea ning, pages 5343–5359. PMLR.
Doso i skiy, A., Beye , L., Kolesniko , A., Weissenbo n,
D., Zhai, X., Un e hine , T., Dehghani, M., Minde e ,
M., Heigold, G., Gelly, S., e al. (2021). An image is
wo h 16x16 wo ds: T ans o me s o image ecogni ion
a scale. In In e na ional Con e ence on Lea ning Rep-
esen a ions.
Dusenbe y, M., Je el, G., Wen, Y., Ma, Y., Snoek, J.,
Helle , K., Lakshmina ayanan, B., and T an, D. (2020).
E icien and scalable bayesian neu al ne s wi h ank-1
ac o s. In In e na ional con e ence on machine lea n-
ing, pages 2782–2792. PMLR.
Fe ianc, M., Maji, P., Ma ina, M., and Rod igues, M.
(2021). On he e ec s o quan isa ion on model unce -
ain y in bayesian neu al ne wo ks. In Unce ain y in A -
i icial In elligence, pages 929–938. PMLR.
Fo sbe g, H., Lindén, J., Hjo h, J., Måne jo d, T., and
Danesh alab, M. (2020). Challenges in using neu al ne -
wo ks in sa e y-c i ical applica ions. In 2020 AIAA/IEEE
39 h Digi al A ionics Sys ems Con e ence (DASC), pages
1–7. IEEE.
F anchi, G., Bu suc, A., Aldea, E., Dubuisson, S., and
Bloch, I. (2023). Encoding he la en pos e io o
bayesian neu al ne wo ks o unce ain y quan i ica ion.
IEEE T ansac ions on Pa e n Analysis and Machine In-
elligence.
F anchi, G., Lau en , O., Legué y, M., Bu suc, A., Pilze ,
A., and Yao, A. (2024). Make me a bnn: A simple s a -
egy o es ima ing bayesian unce ain y om p e- ained
models. In P oceedings o he IEEE/CVF Con e ence on
Compu e Vision and Pa e n Recogni ion, pages 12194–
12204.
Gal, Y. and Ghah amani, Z. (2016). D opou as a bayesian
app oxima ion: Rep esen ing model unce ain y in deep
lea ning. In in e na ional con e ence on machine lea n-
ing, pages 1050–1059. PMLR.
Guo, C., Pleiss, G., Sun, Y., and Weinbe ge , K. Q. (2017).
On calib a ion o mode n neu al ne wo ks. In In e -
na ional con e ence on machine lea ning, pages 1321–
1330. PMLR.
Guo, Y. (2018). A su ey on me hods and heo ies o quan-
ized neu al ne wo ks. a Xi p ep in a Xi :1808.04752.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep esid-
ual lea ning o image ecogni ion. In P oceedings o
S ochas ic Weigh Sha ing o Bayesian Neu al Ne wo ks
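Table 9: Deterministic neural networks used as prior for the Initialisation 2DGBNN in the experiments discussed in Section 5.1.

| Dataset | Method | Accuracy ↑ | NLL ↓ | ECE ↓ | #Parameters (M: million) |
|---|---|---|---|---|---|
| ImageNet-1k | ResNet-18 | 69.70 | 1.265 | 0.027 | 11.7M |
| ImageNet-1k | ResNet-50 | 76.10 | 0.989 | 0.035 | 25.6M |
| ImageNet-1k | ResNet-101 | 77.30 | 0.936 | 0.936 | 44.5M |
| ImageNet-1k | VIT-B-16 | 81.07 | 0.856 | 0.056 | 86.0M |
| CIFAR-100 | ResNet-18 | 77.10 | 1.038 | 0.114 | 11.7M |
| CIFAR-100 | ResNet-50 | 79.20 | 0.950 | 0.054 | 25.6M |
| CIFAR-100 | ResNet-101 | 80.02 | 0.849 | 0.095 | 44.5M |
| CIFAR-100 | WRN-28-10 | 81.41 | 0.766 | 0.045 | 53.6M |
| CIFAR-10 | ResNet-20 | 91.84 | 0.246 | 0.031 | 0.27M |
| CIFAR-10 | ResNet-18 | 93.21 | 0.201 | 0.022 | 11.7M |
| CIFAR-10 | ResNet-50 | 94.81 | 0.211 | 0.010 | 25.6M |
| CIFAR-10 | ResNet-101 | 94.70 | 0.849 | 0.095 | 44.5M |
| CIFAR-10 | WRN-28-10 | 95.92 | 0.131 | 0.010 | 53.6M |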
The table is structured to highlight the performance across multiple architectures and datasets, facilitating a direct comparison. Higher accuracies and lower NLL and ECE values indicate better model performance. For instance, VIT-B-16 on ImageNet-1k achieves the highest accuracy of 81.07% and the lowest NLL of 0.856 among its dataset counterparts, with an ECE of 0.056. Similarly, WRN-28-10 shows superior performance on CIFAR-10 with the highest accuracy of 95.92% and a remarkably low ECE of 0.010. The number of parameters, reported in millions, also provides insight into the model complexity, ranging from 0.27M for ResNet-20 on CIFAR-10 to 86.0M for VIT-B-16 on ImageNet-1k.
The 2DGBNN models on ImageNet1k showed a decrease in accuracy, with our ResNet-50 configuration dropping from the benchmark high of 77.5% to 75.10%. This reduction was coupled with a substantial decrease in the number of parameters, from models requiring up to 25.6 million parameters to just 0.101 million. Despite these changes, the increases in NLL and ECE were minimal and within acceptable ranges, indicating that the models maintain satisfactory predictive performance and calibration despite the reduced complexity. Similar trends were observed in the CIFAR-100 dataset, where our WRN-28-10 model's accuracy decreased from 81.41% to 80.5%, while significantly reducing the parameter count to only 0.045 million from 53.6 million. The NLL and ECE metrics, although slightly elevated, remained competitive, affirming the effective uncertainty estimation capabilities of the models despite their reduced complexity.
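G Additional Background
KL Divergence between Gaussians The KL divergence between two Gaussian distributions $q(w) = \mathcal{N}(w \mid \mu_q, \sigma_q^2)$ and $p(w) = \mathcal{N}(w \mid \mu_p, \sigma_p^2)$ is given by:
$$\mathrm{KL}(q(w) \,\|\, p(w)) = \frac{1}{2}\left(\frac{\sigma_q^2}{\sigma_p^2} + \frac{(\mu_p - \mu_q)^2}{\sigma_p^2} - 1 + \ln\frac{\sigma_p^2}{\sigma_q^2}\right) \tag{11}$$
This formula can be used to compute the KL divergence terms in the ELBO expression for both inliers and outliers.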
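As a minimal sketch of Eq. (11), the closed form can be evaluated per weight and summed over a factorised posterior. The function names `kl_gaussian` and `kl_total`, and the representation of each weight as a `(mu, sigma)` pair, are illustrative assumptions and not taken from the original implementation.

```python
import math


def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between N(mu_q, sigma_q^2) and N(mu_p, sigma_p^2), as in Eq. (11)."""
    if sigma_q <= 0 or sigma_p <= 0:
        raise ValueError("standard deviations must be positive")
    var_q = sigma_q ** 2
    var_p = sigma_p ** 2
    return 0.5 * (
        var_q / var_p
        + (mu_p - mu_q) ** 2 / var_p
        - 1.0
        + math.log(var_p / var_q)
    )


def kl_total(posterior, prior):
    """Sum of per-weight KL terms; both arguments are lists of (mu, sigma) pairs.

    Assumes a fully factorised posterior and prior (hypothetical layout).
    """
    return sum(
        kl_gaussian(mq, sq, mp, sp)
        for (mq, sq), (mp, sp) in zip(posterior, prior)
    )
```

The divergence is zero when the two distributions coincide and grows with both the mean offset and the mismatch in variance.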

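Expected Log-Likelihood Computation
The expected log-likelihood term involves an expectation over the variational posterior:
$$\mathbb{E}_{q(\mathbf{w})}\left[\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\right] = \int q(\mathbf{w}) \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \, d\mathbf{w} \tag{12}$$
In practice, this integral is intractable and is approximated using Monte Carlo sampling.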
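The Monte Carlo approximation of Eq. (12) can be illustrated on a toy problem. The sketch below assumes a single scalar weight with a Gaussian variational posterior and a linear model with Gaussian noise; the model, the function names and the default sample count are illustrative assumptions, not the setup used in the experiments.

```python
import math
import random


def gaussian_log_likelihood(w, xs, ys, noise_std=1.0):
    """log p(y | X, w) for a toy model y = w * x + Gaussian noise (assumed model)."""
    const = -0.5 * math.log(2.0 * math.pi * noise_std ** 2)
    return sum(
        const - 0.5 * ((y - w * x) / noise_std) ** 2
        for x, y in zip(xs, ys)
    )


def mc_expected_log_likelihood(log_lik, mu, sigma, n_samples=30, rng=None):
    """Average log_lik(w) over samples w = mu + sigma * eps, eps ~ N(0, 1)."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_samples):
        w = mu + sigma * rng.gauss(0.0, 1.0)
        total += log_lik(w)
    return total / n_samples
```

Usage: `mc_expected_log_likelihood(lambda w: gaussian_log_likelihood(w, xs, ys), mu, sigma)` returns an unbiased estimate whose variance shrinks as `n_samples` grows.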
H Wasserstein-based 2D Gaussian Merging
In the context of comparing two Gaussian distributions, the Wasserstein distance provides a meaningful way to measure the distance between probability distributions. Specifically, for two 2D Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$, where $\mu_1, \mu_2$ are the means and $\Sigma_1, \Sigma_2$ are the covariance matrices, the Wasserstein-2 distance is given by the following formula:
$$W_2\big(\mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2)\big)^2 = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2}\right)^{1/2}\right) \tag{13}$$
where $\|\mu_1 - \mu_2\|_2$ is the Euclidean distance between the means of the two distributions, $\mathrm{Tr}$ is the trace operator, which sums the diagonal elements of a matrix, and $\Sigma_1^{1/2}$ refers to the matrix square root of the covariance matrix $\Sigma_1$.
The Wasserstein distance threshold is set at $1.5 \times 10^{-7}$, with further discussion in Section I.1. If the distance between Gaussian components falls below this, gradient information is further considered, ensuring a precise analysis of component similarity. This method ensures that the newly formed Gaussian component accurately reflects the collective distribution characteristics of the initial components while maintaining minimal internal variation.
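A minimal sketch of Eq. (13) for the 2D case follows. It avoids an explicit matrix square root by using the 2×2 identity $\mathrm{Tr}(M^{1/2}) = \sqrt{\mathrm{Tr}(M) + 2\sqrt{\det M}}$ for a positive semi-definite $M$, together with $\mathrm{Tr}(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}) = \mathrm{Tr}(\Sigma_1\Sigma_2)$ and $\det(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}) = \det\Sigma_1 \det\Sigma_2$. The function name and the nested-tuple covariance layout are assumptions made for illustration.

```python
import math


def w2_squared_2d(mu1, cov1, mu2, cov2):
    """Squared Wasserstein-2 distance between two 2D Gaussians, as in Eq. (13).

    cov1 and cov2 are 2x2 covariance matrices given as ((a, b), (c, d)).
    """
    (a1, b1), (c1, d1) = cov1
    (a2, b2), (c2, d2) = cov2

    # Squared Euclidean distance between the means.
    mean_term = (mu1[0] - mu2[0]) ** 2 + (mu1[1] - mu2[1]) ** 2

    # Tr(cov1 @ cov2) and det(cov1) * det(cov2).
    trace_prod = a1 * a2 + b1 * c2 + c1 * b2 + d1 * d2
    det_prod = (a1 * d1 - b1 * c1) * (a2 * d2 - b2 * c2)

    # Tr((cov1^{1/2} cov2 cov1^{1/2})^{1/2}) via the 2x2 identity above.
    cross = math.sqrt(max(0.0, trace_prod + 2.0 * math.sqrt(max(0.0, det_prod))))

    cov_term = (a1 + d1) + (a2 + d2) - 2.0 * cross
    # Clamp tiny negative values caused by floating-point round-off.
    return mean_term + max(0.0, cov_term)
```

A merge candidate could then be flagged by comparing `math.sqrt(w2_squared_2d(...))` against the chosen threshold; how the gradient information enters after that test is not specified here and is left out of the sketch.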
I Sub-module Discussion and Hyper-parameter Configuration
I.1 Effectiveness of 2D Gaussian Merging Discussion
In this section, we discuss the effectiveness of the merging of 2D Gaussian distributions in our method. Experimental results analysing its effect are listed in Table 10. For the ResNet20 model on the CIFAR-10 dataset, after merging the 2D Gaussian, the accuracy improved from 88.77% to 91.02%, the NLL decreased from 0.3792 to 0.3066, and the ECE reduced from 0.0500 to 0.0374. For the ResNet18 model on the CIFAR-100 dataset, although the improvement is smaller, the accuracy still increased from 74.50% to 74.63%, and the NLL decreased from 1.0555 to 1.0535. These results demonstrate the effectiveness of our 2D Gaussian merging method in enhancing model performance and calibration.

Table 10: Performance of ResNet20 on CIFAR-10 and ResNet18 on CIFAR-100 before and after merging Gaussian distributions.

| Model | Dataset | Method | Accuracy (%) | NLL | ECE |
|---|---|---|---|---|---|
| ResNet20 | CIFAR-10 | Before Merging Gaussian | 88.77 | 0.3792 | 0.0500 |
| ResNet20 | CIFAR-10 | After Merging Gaussian | 91.02 | 0.3066 | 0.0374 |
| ResNet18 | CIFAR-100 | Before Merging Gaussian | 74.50 | 1.0555 | 0.0391 |
| ResNet18 | CIFAR-100 | After Merging Gaussian | 74.63 | 1.0535 | 0.0400 |
I.2 Data Augmentation
To improve the robustness and generalization of our models, we applied a series of data augmentation techniques during training on the ImageNet-1K, CIFAR-100, and CIFAR-10 datasets. Table 11 summarises the specific augmentation methods used for each dataset. For the ImageNet-1K dataset, we applied random resized cropping to obtain images of size 224 × 224 pixels. This was followed by random horizontal flipping with a probability of 50% to augment the dataset with mirrored images. Color jittering was used to adjust the brightness, contrast, saturation, and hue of the images with factors of 0.4, 0.4, 0.4, and 0.1, respectively. The images were then converted to tensors and normalised using the standard mean and standard deviation values for ImageNet.
In the case of the CIFAR-100 dataset, we started by converting the images to tensors. We then padded the images with 4 pixels on each side using reflection mode to preserve edge information. After padding, we performed a random crop of 32 × 32 pixels, followed by random horizontal flipping to introduce mirror variations. The images were converted back to tensors and normalised accordingly.
For the CIFAR-10 dataset, the augmentation process involved a random crop of 32 × 32 pixels, which helps in teaching the model to be invariant to translations. We also applied random horizontal flipping to include mirrored versions of the images. Finally, the images were converted to tensors and normalised to standardise the input data.

Table 11: Summary of data augmentation techniques applied to each dataset.

| Dataset | Data Augmentation Techniques |
|---|---|
| ImageNet-1K | Random resized crop to 224 × 224 pixels; Random horizontal flip; Color jittering (brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1); Conversion to tensor; Normalization |
| CIFAR-100 | Conversion to tensor; Padding of 4 pixels (reflection mode); Random crop to 32 × 32 pixels; Random horizontal flip; Conversion to tensor; Normalization |
| CIFAR-10 | Random crop to 32 × 32 pixels; Random horizontal flip; Conversion to tensor; Normalization |
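The CIFAR-style pipeline described above (reflection padding, random crop, random horizontal flip) can be sketched without any imaging library. The code below is an illustration only: it assumes a single-channel image stored as a list of rows, and the helper names are hypothetical. In practice these steps would typically be delegated to an image-transform library.

```python
import random


def reflect_pad(img, pad):
    """Pad a 2D image on every side by mirroring about the border pixels
    (the border pixel itself is not repeated). Requires pad < height and width."""
    h, w = len(img), len(img[0])

    def mirror(i, n):
        if i < 0:
            return -i
        if i >= n:
            return 2 * (n - 1) - i
        return i

    rows = [mirror(r, h) for r in range(-pad, h + pad)]
    cols = [mirror(c, w) for c in range(-pad, w + pad)]
    return [[img[r][c] for c in cols] for r in rows]


def random_crop(img, size, rng):
    """Cut a size x size window at a uniformly random position."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]


def random_hflip(img, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    return [row[::-1] for row in img] if rng.random() < p else img


def augment_cifar(img, rng, pad=4, size=32):
    """Reflection pad by 4 pixels, random 32 x 32 crop, random horizontal flip."""
    return random_hflip(random_crop(reflect_pad(img, pad), size, rng), rng)
```

Padding before cropping means the output keeps the 32 × 32 size while the content shifts by up to 4 pixels in each direction.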
I.3 Hyperparameters for training the deterministic network
In our experiments, similar to those conducted by previous researchers, we standardised the hyperparameters across all models in the initial stage. This approach was applied to various models including ResNet-18, ResNet-50, ResNet-101, and VIT. The hyperparameters used are as follows:

| Hyperparameter | Variable | Default Value |
|---|---|---|
| Batch Size | -b | 256 |
| Warm-up Phases | -warm | 2 |
| Learning Rate | -lr | 0.1 |
| Resume Training | -resume | False |
| Total Epochs | -EPOCH | 250 |
| Milestones | -MILESTONES | [30, 60, 90, 120, 150, 200] |
| Weight Decay | -MultiStepLR | 5e-4 |
| Training Mean | -TRAIN_MEAN | (0.5071, 0.4865, 0.4409) |
| Training Std | -TRAIN_STD | (0.2673, 0.2564, 0.2761) |

Table 12: Summary of hyperparameters in the neural network training configuration

The table summarizes the standardised hyperparameters used across all models in our neural network training configurations, mirroring settings from previous research. It outlines common parameters such as batch size, warm-up phases, learning rate, and total epochs, alongside specific settings like weight decay and learning rate milestones.
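To make the role of the milestones concrete, the sketch below computes a step-decay learning rate from the base rate and milestone list in Table 12. It assumes a multi-step schedule in which the rate is multiplied by a constant factor at each milestone. The decay factor `gamma` is not stated in the text, so the value 0.2 used here is purely a placeholder, and the warm-up phases are not modelled.

```python
from bisect import bisect_right

# Values taken from Table 12.
BASE_LR = 0.1
MILESTONES = [30, 60, 90, 120, 150, 200]


def multistep_lr(epoch, base_lr=BASE_LR, milestones=MILESTONES, gamma=0.2):
    """Learning rate at a given epoch under a multi-step decay schedule.

    gamma is a hypothetical decay factor (not given in the source).
    The rate is multiplied by gamma once for every milestone <= epoch.
    """
    return base_lr * gamma ** bisect_right(milestones, epoch)
```

With six milestones over 250 epochs, the rate stays at 0.1 for the first 30 epochs and is reduced six times in total by the end of training.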
I.4 Discussion of the Initialisation of σ
We experimented with several different initialisation methods for our models, including initialisation via a specific function as detailed by Lee et al., random generation, and Gaussian distribution, among others. According to our experimental results, we ultimately adopted the following initialisation methods for our neural network parameters.
For the weight parameter weight_sigma, we used the Xavier uniform initialisation with a gain of 0.01, defined as:
$$\mathbf{W} \sim \mathcal{U}\left(-\frac{g}{\sqrt{n_{\text{in}} + n_{\text{out}}}},\ \frac{g}{\sqrt{n_{\text{in}} + n_{\text{out}}}}\right) \tag{14}$$
where $g = 0.01$, $n_{\text{in}}$ is the number of input units, $n_{\text{out}}$ is the number of output units, and $\mathcal{U}(a, b)$ denotes a uniform distribution between $a$ and $b$.
For the bias parameter bias_sigma, we initialised it using a normal distribution with a mean of 0.0 and a standard deviation of 0.001:
$$\mathbf{b} \sim \mathcal{N}(0, 0.001^2) \tag{15}$$
In our neural network, we assign different learning rates to different parameters. Table 13 summarises these hyperparameters.

Table 13: Summary of Hyperparameters Used in Our Experiments

| Hyperparameter | Symbol | Value |
|---|---|---|
| Learning rate for weight and bias $\mu$ | $\eta_{\text{weight\_mu}}$ | $1 \times 10^{-4}$ |
| Learning rate for weight and bias $\sigma$ | $\eta_{\text{weight\_sigma}}$ | $1 \times 10^{-2}$ |

We set the learning rates for the parameters as follows: the learning rates for weight_mu and bias_mu are both $\eta_{\text{weight\_mu}} = 1 \times 10^{-4}$; the learning rates for weight_sigma and bias_sigma are both $\eta_{\text{weight\_sigma}} = 1 \times 10^{-3}$.
I.5 Hyper-parameters for training Bayesian Neural Network
In our experiments, we utilise several hyperparameters that are crucial for the performance and convergence of our Bayesian neural network (BNN) model. These hyperparameters are carefully selected based on empirical studies to balance computational efficiency and model accuracy. Table 14 summarises the hyperparameters used in our experiments.
In the KMeans clustering algorithm, we initially use $K = 2000$ clusters. Outliers in the weight values are identified using a threshold $T_w = \pm 0.2$. Any weight value exceeding this threshold is considered an outlier. This threshold is chosen based on the empirical distribution of the weights after initial training. Similarly, outliers in the gradients are identified by selecting the top $P_g = 1\%$ of gradient magnitudes.
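The two outlier criteria above can be sketched as index selections over flat lists of weights and gradients. The function names are hypothetical, and two details are assumptions rather than statements from the text: the number of gradient outliers is rounded up and is at least one, and the two index sets are returned separately because the way they are combined is not specified here.

```python
import math


def weight_outliers(weights, threshold=0.2):
    """Indices of weights whose magnitude exceeds the threshold T_w."""
    return {i for i, w in enumerate(weights) if abs(w) > threshold}


def gradient_outliers(grads, top_fraction=0.01):
    """Indices of the top P_g fraction of gradients by magnitude.

    Assumption: the count is rounded up and at least one index is returned.
    """
    n = len(grads)
    if n == 0:
        return set()
    k = max(1, math.ceil(top_fraction * n))
    ranked = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    return set(ranked[:k])
```

For example, `weight_outliers([0.05, -0.31, 0.2])` selects only index 1, since the comparison is strict and 0.2 itself does not exceed the threshold.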
A minimum cluster size of $N_{\min} = 30$ is enforced to ensure statistical significance in the clustering results. Clusters with fewer than $N_{\min}$ samples are considered invalid and their associated weights are treated as outliers. This prevents the model from being influenced by clusters that may represent noise or insignificant patterns.
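A minimal sketch of the minimum-cluster-size rule follows, assuming the clustering step has already produced one integer label per weight. The function name and return layout are illustrative assumptions.

```python
from collections import Counter


def split_small_clusters(labels, min_size=30):
    """Separate weights in valid clusters from those in undersized clusters.

    labels[i] is the cluster assigned to weight i. Returns the set of valid
    cluster ids, the indices kept as inliers, and the indices treated as outliers.
    """
    counts = Counter(labels)
    valid = {cluster for cluster, n in counts.items() if n >= min_size}
    inliers = [i for i, cluster in enumerate(labels) if cluster in valid]
    outliers = [i for i, cluster in enumerate(labels) if cluster not in valid]
    return valid, inliers, outliers
```

A cluster with exactly `min_size` members is kept, matching the "fewer than" wording above.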
During the BNN training, we use a learning rate of $\eta_{\text{BNN}} = 1 \times 10^{-5}$, which is lower than the initial learning rate used in the preliminary training. The smaller learning rate is necessary to accommodate the Bayesian updates and to ensure that the posterior distributions over the weights converge properly.
In the predictive function, we draw $N_s = 30$ samples from the posterior distribution to estimate the predictive mean and uncertainty. The Expected Calibration Error (ECE) is computed using $N_b = 15$ bins. The Mahalanobis distance threshold $T_M = 5.991$ corresponds to the chi-squared distribution value with 2 degrees of freedom at the 95% confidence level. The Mahalanobis distance of a data point $x$ with respect to a Gaussian distribution with mean $\mu$ and covariance $\Sigma$ is calculated as:
$$D_M^2 = (x - \mu)^\top \Sigma^{-1} (x - \mu). \tag{16}$$
Table 14: Summary of Hyperparameters Used in Our Experiments

| Hyperparameter | Symbol | Value |
|---|---|---|
| Initial learning rate for $\mu$ | $\eta_\mu$ | $1 \times 10^{-4}$ |
| Number of epochs | $E$ | 200 |
| Number of clusters in KMeans | $K$ | 6000 |
| Outlier threshold (weight value) | $T_w$ | ±0.2 |
| Outlier threshold (gradient percentile) | $P_g$ | Top 1% |
| Minimum samples per cluster | $N_{\min}$ | 20 |
| Learning rate in BNN training | $\eta_{\text{BNN}}$ | $1 \times 10^{-5}$ |
| Number of samples in predictive function | $N_s$ | 30 |
| Number of bins in ECE computation | $N_b$ | 15 |
| Wasserstein distance threshold | $T_W$ | $1 \times 10^{-2}$ |
| Mahalanobis distance threshold | $T_M$ | 5.991 |
| Number of nearest Gaussians | $k$ | 5 |
Points with a Mahalanobis distance greater than $T_M$ are considered outliers.
In handling outliers, we consider the $k = 5$ nearest Gaussian components for each outlier point. This allows us to reassign outlier weights to the most probable Gaussian components based on their proximity in the parameter space.
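The outlier test and the reassignment step can be sketched for 2D points as follows. Since 5.991 is the 95% quantile of a chi-squared distribution with 2 degrees of freedom, the sketch compares the squared distance of Eq. (16) against it. The reassignment rule is an assumption made for illustration: the $k$ nearest components are chosen by Euclidean distance to their means, and among those the component with the smallest Mahalanobis distance is selected. All function names are hypothetical.

```python
import math

CHI2_95_2DOF = 5.991  # T_M from Table 14


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of a 2D point, as in Eq. (16).

    cov is a 2x2 covariance matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of a 2x2 matrix is (1 / det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def is_outlier(x, mean, cov, threshold=CHI2_95_2DOF):
    """True when the squared distance exceeds the chi-squared threshold."""
    return mahalanobis_sq(x, mean, cov) > threshold


def reassign(x, components, k=5):
    """Index of the component chosen for an outlier point x.

    components is a list of (mean, cov) pairs. Candidate selection by
    Euclidean distance and the final choice by Mahalanobis distance are
    both assumptions of this sketch.
    """
    def euclid(i):
        m = components[i][0]
        return math.hypot(x[0] - m[0], x[1] - m[1])

    candidates = sorted(range(len(components)), key=euclid)[:k]
    return min(candidates, key=lambda i: mahalanobis_sq(x, *components[i]))
```

Restricting the search to $k$ candidates keeps the cost per outlier constant regardless of how many Gaussian components the mixture contains.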
