Communication-Efficient Distributed Deep Learning via Federated Dynamic Averaging

Author: Theologitis, Michael; Frangias, Georgios; Anestis, Georgios; Samoladas, Vasilis; Deligiannakis, Antonios

Publisher: Zenodo

DOI: 10.48786/edbt.2025.33

Source: https://zenodo.org/records/17640662/files/paper-113.pdf

Communication-Efficient Distributed Deep Learning via
Federated Dynamic Averaging
Michail Theologitis
Technical University of Crete
Chania, Greece
Georgios Frangias
Technical University of Crete
Chania, Greece
Georgios Anestis
Technical University of Crete
Chania, Greece
Vasilis Samoladas
Technical University of Crete
Chania, Greece
Antonios Deligiannakis
Technical University of Crete
Chania, Greece
ABSTRACT
The ever-growing volume and decentralized nature of data, coupled with the need to harness this data and generate knowledge from it, have led to the extensive use of distributed deep learning (DDL) techniques for training. These techniques rely on local training that is performed at the distributed nodes based on locally collected data, followed by a periodic synchronization process that combines these models to create a global model. However, frequent synchronization of DL models, encompassing millions to many billions of parameters, creates a communication bottleneck, severely hindering scalability. Worse yet, DDL algorithms typically waste valuable bandwidth, and make themselves less practical in bandwidth-constrained federated settings, by relying on overly simplistic, periodic, and rigid synchronization schedules. These drawbacks also have a direct impact on the time required for the training process, necessitating excessive time for data communication. To address these shortcomings, we propose Federated Dynamic Averaging (FDA), a communication-efficient DDL strategy that dynamically triggers synchronization based on the value of the model variance. In essence, the costly synchronization step is triggered only if the local models, which are initialized from a common global model after each synchronization, have significantly diverged. This decision is facilitated by the communication of a small local state from each distributed node/worker. Through extensive experiments across a wide range of learning tasks we demonstrate that FDA reduces communication cost by orders of magnitude, compared to both traditional and cutting-edge communication-efficient algorithms. Additionally, we show that FDA maintains robust performance across diverse data heterogeneity settings.
1 INTRODUCTION
The big data era has been marked by an unprecedented scale of training datasets [41, 67]. These datasets are not only growing in size, but are often physically distributed and cannot be easily centralized due to business considerations, privacy concerns, bandwidth limitations (especially in federated settings, such as drones collecting and collaboratively building a global model/view of an area), and data sovereignty laws [9, 23, 64]. Such constraints complicate the use of Deep Learning (DL) techniques in the aforementioned scenarios.
©2025 Copyright held by the owner/author(s). Published in Proceedings of the 28th International Conference on Extending Database Technology (EDBT), 25th March-28th March, 2025, ISBN 978-3-89318-098-1 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
Distributed Deep Learning (DDL) has emerged as an alternative paradigm to the traditional centralized approach [6, 69], offering efficient learning over large-scale data across multiple worker-nodes, enhancing the speed of training DL models and paving the way for more scalable and resilient DL applications [10, 28, 35, 55, 68]. Most DDL methods are iterative, where, in each iteration, some amount of local training is followed by synchronization of the local models with the global one. The predominant method, based on the bulk synchronous parallel (BSP) approach [56], is to average the local model updates and then apply the average update to each local model [69]. Less synchronized variants have also been proposed to ameliorate the effect of straggler workers [14, 37], but they compromise convergence speed and model quality.
A significant challenge inherent in the traditional techniques, especially in federated DL settings, where models are huge and worker interconnections are slow, is the communication bottleneck, restricting system scalability [53, 60]. Specifically, the communication bottleneck arises from the frequent exchange (synchronization) of model parameters, often in the range of billions, across distributed workers. The synchronization process entails substantial data volume transfer and generally dominates the overall training time, leading to a low computation-to-communication ratio [14, 46]. Addressing this challenge to expedite DDL algorithms has been a focal point of research for many years; speeding-up SGD is arguably among the most impactful and transformative problems in machine learning [58].
The most direct method to alleviate the communication burden is to reduce the frequency of communication rounds. Local-SGD is the prime example of this approach. It allows workers to perform $\tau$ local update steps on their models before aggregating them, as opposed to averaging the updates in every step [17, 66]. Although Local-SGD is effective in reducing communication while maintaining comparable model quality [58], determining the optimal value of $\tau$ presents a critical challenge, with only a handful of studies offering theoretical insights into its influence on convergence [50, 58, 66].
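As a rough illustration of the Local-SGD scheme described above, the following single-process sketch lets each simulated worker take `tau` local steps before the models are averaged. The function name `local_sgd`, the quadratic toy objective, and the `centers` array are assumptions made for illustration only; they do not come from the paper.

```python
import numpy as np

def local_sgd(grad_fn, w0, num_workers, tau, rounds, lr):
    """Toy Local-SGD: each worker takes tau local steps, then models are averaged."""
    w_global = np.array(w0, dtype=float)
    for _ in range(rounds):
        local_models = []
        for k in range(num_workers):
            w = w_global.copy()
            for _ in range(tau):
                w = w - lr * grad_fn(w, k)        # one local update step
            local_models.append(w)
        w_global = np.mean(local_models, axis=0)  # one communication round
    return w_global

# Hypothetical toy objective: worker k minimizes 0.5 * ||w - c_k||^2.
centers = np.array([[3.0, 2.0], [1.0, 2.0], [2.0, 3.0], [2.0, 1.0]])
grad = lambda w, k: w - centers[k]

w_final = local_sgd(grad, np.zeros(2), num_workers=4, tau=5, rounds=50, lr=0.1)
print(w_final)  # approaches the mean of the centers, [2, 2]
```

With `tau=1` the loop averages after every step, which corresponds to the fully synchronous baseline; larger `tau` trades communication rounds for more local divergence between averages.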
To further reduce communication costs of Local-SGD, more sophisticated communication strategies introduce varying sequences of local update steps $\{\tau_0, \ldots, \tau_R\}$, instead of a fixed $\tau$. In [57], in order to minimize convergence error with respect to wall-time, the authors proposed a decreasing sequence of local update steps. Conversely, the focus in [17] was on reducing the number of communication rounds for a fixed number of model updates, and an increasing sequence emerged. These contrasting approaches underscore the multifaceted nature of communication strategies in distributed deep learning, highlighting not only the absence of a one-size-fits-all solution but also the growing need for dynamic, context-aware strategies that can continuously adapt to the specific intricacies of the learning task.
Series ISSN: 2367-2005
Main Idea and Contributions. Our work addresses critical efficiency challenges in DDL, particularly in communication-constrained environments, such as the ones encountered in Federated Learning (FL) applications [23]. We introduce Federated Dynamic Averaging (FDA), a novel, adaptive distributed deep learning strategy that massively improves communication efficiency over previous work.
FDA utilizes a novel 2-action, conditional synchronization protocol, designed to avoid the need to decide or guess the proper values of local update steps, or to synchronize after each training step; instead, it performs the costly synchronization process only when needed. Our FDA algorithm dynamically triggers synchronization based on the value of model variance across worker-nodes. In a nutshell, the costly synchronization step is only triggered if the local models have diverged significantly, which implies that the global model may no longer be accurate.
As Figure 1 demonstrates, at the start, workers enter the local training step with the same global model (Figure 1.A). Then, local training commences and each distributed worker-node computes its local state, which encapsulates helpful information for estimating the model variance (Figure 1.B). This is followed by the transmission (Figure 1.C) of these local states, an operation that is bandwidth- and time-efficient because of their small size. During transmission, the local states are aggregated and their average is made available to all workers, an operation known as AllReduce. This operation does not require (or prohibit) the use of a central node. Based on the aggregated state, the workers can estimate (Figure 1.D) whether the variance of the local models may have exceeded a threshold. If this is not the case, the costly synchronization step (Figure 1.E) is avoided and local training continues. The key question is how to choose the local states that are computed at, and then transmitted by, the local workers. To address this problem, we propose two variants of our FDA algorithm. Our contributions can be summarized as follows:
• We propose FDA, an algorithm that dynamically decides to synchronize local workers when model variance across workers exceeds a threshold. This strategy drastically reduces communication, while preserving cohesive progress towards the shared training objective.
• We propose two variants of FDA, which differ in the amount of information preserved in the local states that are transmitted by each worker and aggregated for subsequent estimation of model variance. These two variants, termed SketchFDA and LinearFDA, offer a different balance between communication efficiency and approximation accuracy.
• We evaluate and compare FDA with other DDL algorithms through a comprehensive suite of experiments with diverse datasets, models, and tasks. Our experiments demonstrate that FDA outperforms traditional and contemporary FL algorithms by 1-2 orders of magnitude in communication savings, while maintaining equivalent model performance. Furthermore, it effectively balances the competing demands of communication and computation, providing greatly improved trade-offs.
• We demonstrate FDA's robustness in various challenging Non-IID settings, common in real-world Federated Learning applications. While state-of-the-art methods typically require substantially more resources to converge under Non-IID conditions, FDA maintains consistent and comparable performance across both IID and Non-IID settings.
[Figure 1 flowchart: (A) Nodes start training step. (B) Local training step / compute local state. (C) Aggregate local state using AllReduce. (D) Estimate if synchronization is needed; if not, go to Step (B). (E) Synchronize models using AllReduce; go to Step (A).]
Figure 1: FDA. The local training step is followed by the computation of a local state by all worker-nodes. Then, the (small in size) local states are aggregated. Based on the aggregated result, all workers estimate if synchronization is required. In most cases, the expensive synchronization step of the models is avoided and local training continues.
Outline. The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 introduces our DDL technique, Federated Dynamic Averaging (FDA), and its two variants. Section 4 details the experimental setup, and discusses the insights and conclusions drawn from our empirical investigation. Lastly, Section 5 contains concluding remarks.
2 RELATED WORK
Problem formulation. Consider distributed training of deep neural networks over multiple workers [11, 31]. In this setting, each worker represents a data owner (equivalently, a local model owner) and has access to its own set of training data $\mathcal{D}_k$. Workers can utilize any available hardware they possess (e.g., GPUs, CPUs) to perform learning steps. The collective goal is to find a common model $\mathbf{w} \in \mathbb{R}^d$ by minimizing the overall training loss. This scenario can be effectively modeled as a distributed optimization problem, formulated as follows:
$$\underset{\mathbf{w} \in \mathbb{R}^d}{\text{minimize}} \; F(\mathbf{w}) \triangleq \frac{1}{K} \sum_{k=1}^{K} F_k(\mathbf{w}) \tag{1}$$
where $K$ is the number of workers and $F_k(\mathbf{w}) \triangleq \mathbb{E}_{\zeta_k \sim \mathcal{D}_k}[\ell(\mathbf{w}; \zeta_k)]$ is the local objective function of worker $k$. Function $\ell(\mathbf{w}; \zeta_k)$ represents the loss for data sample $\zeta_k$ given model $\mathbf{w}$.
Solution direction. As noted in the seminal work [23], research in FL should focus primarily on synchronous solutions. This allows different lines of research (e.g., compression, privacy, etc.) to be developed independently and then combined seamlessly. Our work, along with most communication-efficient FL strategies, adheres to this synchronous paradigm. However, such approaches may be less effective in environments where each communication operation incurs significant overhead regardless of the size of the data being transmitted (e.g., high-latency). In these scenarios, asynchronous mechanisms become necessary, though they typically fall outside the primary focus of contemporary FL research. That said, FDA can be modified to work asynchronously (as explained in Section 3.3).
Communication efficient Local-SGD. The work in [31] decomposes each round into two phases. In the first phase, each worker runs Local-SGD with $\tau = I_1$, while the second phase runs $I_2$ steps with $\tau = 1$; [31] proposes to exponentially decay $I_1$ every $M$ rounds. In the heterogeneous setting, the work in [40], by analysing the convergence rate, proposes an increasing sequence of local update steps for strongly-convex local objectives and fixed local update steps for other types of local objectives. The study in [65] dynamically increases batch sizes to reduce communication rounds, maintaining the same convergence rate as SSP-SGD. However, the large-batch approach leads to poor generalization [20], a challenge addressed by the post-local SGD method [32], which divides training into two phases: BSP-SGD followed by Local-SGD with a fixed number of steps. In the Lazily Aggregated Algorithm (LAG) [5], a different approach was taken, using only new gradients from some selected workers and reusing the outdated gradients from the rest, which essentially skips communication rounds.
Federated Averaging (FedAvg) [36] is another representative of communication efficient Local-SGD algorithms, which is a pivotal method in Federated Learning (FL) [23]. In the FL setting with edge computing systems, the work in [59] tries to find the optimal synchronization period $\tau$ subject to local computation and aggregation constraints. Recently [38], in the FL setting with the assumption of strongly-convex objectives, by analysing the balance between fast convergence and higher-round completion rate, a decaying local update step scheme emerged.
Unlike previous approaches that rely on predetermined synchronization schedules (fixed, decaying, or otherwise), our work introduces a dynamic synchronization strategy. FDA adapts continuously during the training process, basing synchronization decisions on a real-time metric: the model variance across workers.
Accelerating convergence. An indirect, yet highly effective way to mitigate the communication burden in DDL is to speed up convergence. Consequently, recent works have built upon communication efficient Local-SGD methods by deploying accelerated versions of SGD to the distributed setting. Specifically, FedAdam [42] extends Adam [26] and FedAvgM [21] extends SGD with momentum (SGD-M) [51]. Recently, Mime [24] provides a framework to adapt arbitrary centralized optimization algorithms to the FL setting. However, these methods still suffer from the model divergence problem, particularly in heterogeneous settings. When solving (1), the disparity between each worker's optimal solution $\mathbf{w}_k^*$ for their objective $F_k$, and the global optimum $\mathbf{w}^*$ for $F$, can potentially cause worker models to diverge (drift) towards their disparate minima [25, 42, 63]. The result is slow and unstable convergence with significant communication overhead. To address this problem, the SCAFFOLD algorithm [25] used control-variates (in the same spirit of SVRG), with significant speed-up. FedProx [45] re-parameterized FedAvg [36] by adding $L_2$ regularization in the workers' objectives to be near the global model. Lastly, FedDyn [2] improved upon these ideas with a dynamic regularizer making sure that if local models converge to a consensus, this consensus point aligns with the stationary point of the global objective function.
While these approaches primarily focus on enhancing the optimization process and typically employ fixed synchronization intervals (e.g., every local epoch), our work addresses a complementary aspect: determining the optimal timing for synchronization. FDA's dynamic synchronization strategy is orthogonal to these optimization techniques and can be integrated with them by simply adjusting the synchronization decision.
Compression. To reduce communication overhead in DDL, significant efforts have been directed towards minimizing message sizes. Key strategies include sparsification, where only crucial components of information are transmitted, as explored in [3], and quantization techniques, which involve transmitting only quantized gradients, as detailed in [47]. These techniques can be combined with Local-SGD methods to enhance communication-efficiency further. An example is Qsparse-local-SGD [4], which integrates aggressive sparsification and quantization with Local-SGD, achieving substantial communication savings. Crucially, FDA is fully compatible with any technique that reduces the cost of synchronization (e.g. model compression). Our approach simply adjusts the timing of the synchronization decision without altering the data being synchronized. This ensures that any compression technique effective in traditional methods (BSP, Local-SGD, etc.) will be equally effective when deployed with FDA. Therefore, the communication savings demonstrated in the relevant literature [61] can be safely expected to carry over to our approach as well.
Additionally, sketching emerges as another fundamental tool in large-scale machine learning. It effectively compresses high-dimensional problems into lower dimensions to save runtime and memory, typically utilizing hash-based probabilistic data structures. For instance, [49] use Count Sketches to compress auxiliary variables in optimization algorithms, significantly freeing up memory. Similarly, FetchSGD [43] employs Count Sketches to compress model updates and leverages their linearity for efficient merging. In contrast to these applications, our approach utilizes sketches not for compression but to estimate local state information, and based on this to decide whether a synchronization is required, an application orthogonal to traditional use cases. A comprehensive survey of compression techniques in DDL can be found in [61].
3 FEDERATED DYNAMIC AVERAGING
We now present our algorithms, based on our notion of Federated Dynamic Averaging (FDA). Our algorithms deviate from prior work in these two key ways:
(1) The decision on when to synchronize.
(2) The actual synchronization process.
To the best of our knowledge, this is the first Distributed Deep Learning algorithm that dynamically decides when to synchronize based on the current collective state of the training progress, that is, whether it is advancing well or poorly.
Notation. At each time step $t$, each worker $k$ independently maintains its own vector of model parameters¹, denoted as $\mathbf{w}_t^{(k)} \in \mathbb{R}^d$. Let $\mathbf{w}_t$ represent the $K \times d$ tensor of all local model vectors, and $\overline{\mathbf{w}}_t$ be the average model vector (this notation applies to all vector quantities):
$$\mathbf{w}_t = \left[ \mathbf{w}_t^{(1)}, \ldots, \mathbf{w}_t^{(K)} \right], \qquad \overline{\mathbf{w}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{w}_t^{(k)}$$
¹ The terms "model" and "model parameters" are used interchangeably, as is common in the literature.
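Table 1: Notation
Symbol | Meaning
$\langle \cdot, \cdot \rangle$ | Dot product
$t$ | Time step index
$K$ | Number of workers
$d$ | Model dimension
$\mathcal{D}_k$ | Training data of worker $k$
$\mathcal{B}_t^{(k)}$ | A batch sampled from $\mathcal{D}_k$
$\mathbf{w}_t^{(k)} \in \mathbb{R}^d$ | Model of worker $k$
$\mathbf{w}_t = [\mathbf{w}_t^{(1)}, \ldots, \mathbf{w}_t^{(K)}]$ | Tensor of local models
$\overline{\mathbf{w}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{w}_t^{(k)}$ | Average model (global model)
$\mathbf{w}_{t_0}$ | Model after most recent sync.
$\mathbf{w}_{t_{-1}}$ | Model after 2nd most recent sync.
$\mathbf{u}_t^{(k)} = \mathbf{w}_t^{(k)} - \mathbf{w}_{t_0}$ | Local model drift
$\overline{\mathbf{u}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}_t^{(k)}$ | Average model drift (global drift)
$\mathrm{Var}(\mathbf{w}_t)$ | Model variance
$\Theta$ | Model variance threshold
$\mathbf{S}_t^{(k)}$ | State of worker $k$
$\overline{\mathbf{S}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{S}_t^{(k)}$ | Average state
$H(\cdot)$ | Function for variance estimation
$\mathrm{sk}(\cdot) : \mathbb{R}^d \to \mathbb{R}^{l \times m}$ | AMS sketch operator (§3.1)
$\mathcal{M}_2(\cdot) : \mathbb{R}^{l \times m} \to \mathbb{R}$ | $L_2$ norm squared estimate (§3.1)
$\epsilon$ | Error of sketch estimate (§3.1)
$(1 - \delta)$ | Confidence of approximation (§3.1)
$l = O(\log 1/\delta)$ | #Rows of sketch matrix (§3.1)
$m = O(1/\epsilon^2)$ | #Columns of sketch matrix (§3.1)
$\xi = \frac{\mathbf{w}_{t_0} - \mathbf{w}_{t_{-1}}}{\| \mathbf{w}_{t_0} - \mathbf{w}_{t_{-1}} \|_2}$ | Heuristic vec. for LinearFDA (§3.2)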
Furthermore, let $\mathrm{Optimize}(\mathbf{w}, \mathcal{B})$ be the updated model [16] computed by some optimization algorithm (e.g., SGD, Adam) using the model $\mathbf{w}$ and the batch $\mathcal{B}$ of training data. It incorporates the learning rate, loss function and relevant gradients. During step $t$, each worker $k$ first applies the update:
$$\mathbf{w}_t^{(k)} = \mathrm{Optimize}\left( \mathbf{w}_{t-1}^{(k)}, \mathcal{B}_t^{(k)} \right)$$
Moreover, operation $\mathrm{AllReduce}(\mathbf{w}_t^{(k)})$ computes and returns the average model vector [30]:
$$\overline{\mathbf{w}}_t = \mathrm{AllReduce}\left( \mathbf{w}_t^{(k)} \right)$$
Workers synchronize by executing $\mathrm{AllReduce}(\mathbf{w}_t^{(k)})$, thereby setting $\mathbf{w}_t^{(k)} := \overline{\mathbf{w}}_t$. If synchronization is not performed at step $t$, each worker continues training with its locally updated model. A comprehensive list of the notation used throughout this section is provided in Table 1.
Model Variance and FDA. The model variance quantifies the dispersion or spread of worker models around the average model:
$$\mathrm{Var}(\mathbf{w}_t) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{w}_t^{(k)} - \overline{\mathbf{w}}_t \right\|_2^2 \tag{2}$$
This measure provides insight into how closely aligned the workers' models are at any given time. High variance indicates that the models are widely spread out, essentially drifting apart, leading to a lack of cohesion in the aggregated model. Conversely, a moderate or low variance suggests that the workers' models are closely aligned, working collectively towards the shared objective.
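The definition in Eq. (2) can be evaluated directly when all local models are available in one place, which is useful as a reference when testing estimators. The snippet below is a minimal NumPy sketch; the function name `model_variance` and the two small example arrays are illustrative assumptions.

```python
import numpy as np

def model_variance(local_models):
    """Eq. (2): mean squared L2 distance of the K local models from their average."""
    w = np.asarray(local_models, dtype=float)   # shape (K, d)
    w_bar = w.mean(axis=0)                      # average (global) model
    return float(np.mean(np.sum((w - w_bar) ** 2, axis=1)))

identical = np.ones((4, 3))                     # all workers hold the same model
spread = np.array([[0.0, 0.0], [2.0, 0.0]])     # two workers, each at distance 1 from the mean

print(model_variance(identical))  # 0.0
print(model_variance(spread))     # 1.0
```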
The FDA algorithm (Algorithm 1) is based on the premise that, as long as the variance is below a threshold $\Theta$, synchronization is not needed. Thus, we introduce the Round Invariant (RI):
$$\mathrm{Var}(\mathbf{w}_t) \le \Theta \tag{3}$$
To preserve the RI, our FDA algorithm maintains (Lines 4-6 of Algorithm 1) at each worker $k$ a local (low-dimensional) state-vector $\mathbf{S}_t^{(k)}$, which is computed based on $\mathbf{w}_t^{(k)}$. These state vectors are vital for the subsequent estimation of the model variance, and underpin the two variants of the FDA algorithm (provided in Sections 3.1 and 3.2, respectively). Our estimation techniques begin by performing AllReduce on the states $\mathbf{S}_t^{(k)}$, consolidating them into the average state $\overline{\mathbf{S}}_t$ (Line 7). Importantly, this communication step requires significantly less bandwidth and resources than transmitting the full models $\mathbf{w}_t^{(k)}$.
For each FDA variant, we also define a (different) function $H(\overline{\mathbf{S}}_t)$ that overestimates the variance, i.e., it ensures that as long as $H(\overline{\mathbf{S}}_t) \le \Theta$ the variance is bounded by $\Theta$. This guarantee is probabilistic for the Sketch-based variant of FDA, and deterministic for its Linear counterpart. Consequently, if $H(\overline{\mathbf{S}}_t) > \Theta$, the RI can no longer be guaranteed and synchronization is performed (Lines 8-9). After synchronization, the model variance is zero.
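Efficiently Monitoring the RI. Estimating model variance efficiently is at the heart of FDA. To this end, we first introduce the local model drift, $\mathbf{u}_t^{(k)}$, and average drift, $\overline{\mathbf{u}}_t$, defined as follows:
$$\mathbf{u}_t^{(k)} = \mathbf{w}_t^{(k)} - \mathbf{w}_{t_0}, \qquad \overline{\mathbf{u}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}_t^{(k)}$$
Here, $\mathbf{w}_{t_0}$ denotes the model vector after the most recent synchronization. Subsequently, the model variance can be written as:
$$\mathrm{Var}(\mathbf{w}_t) = \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 \right) - \left\| \overline{\mathbf{u}}_t \right\|_2^2 \tag{4}$$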
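Proof. Adding an offset ($-\mathbf{w}_{t_0}$) to each $\mathbf{w}_t^{(k)}$ does not alter the variance, therefore:
$$\begin{aligned}
\mathrm{Var}(\mathbf{w}_t) &= \mathrm{Var}(\mathbf{w}_t - \mathbf{w}_{t_0}) = \mathrm{Var}(\mathbf{u}_t) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} - \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \frac{1}{K} \sum_{k=1}^{K} \left( \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - 2 \left\langle \mathbf{u}_t^{(k)}, \overline{\mathbf{u}}_t \right\rangle + \left\| \overline{\mathbf{u}}_t \right\|_2^2 \right) \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 \right) - 2 \left( \frac{1}{K} \sum_{k=1}^{K} \left\langle \mathbf{u}_t^{(k)}, \overline{\mathbf{u}}_t \right\rangle \right) + \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \overline{\mathbf{u}}_t \right\|_2^2 \right) \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 \right) - 2 \left\langle \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}_t^{(k)}, \overline{\mathbf{u}}_t \right\rangle + \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 \right) - 2 \left\langle \overline{\mathbf{u}}_t, \overline{\mathbf{u}}_t \right\rangle + \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 \right) - 2 \left\| \overline{\mathbf{u}}_t \right\|_2^2 + \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 \right) - \left\| \overline{\mathbf{u}}_t \right\|_2^2
\end{aligned}$$
□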
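The identity in Eq. (4) is easy to sanity-check numerically. The following sketch draws hypothetical local models around a common synchronized model and compares the direct definition of Eq. (2) with the drift-based form of Eq. (4); all sizes and the random data are assumptions chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 8, 1000
w_sync = rng.normal(size=d)                    # model after the most recent sync
w = w_sync + 0.1 * rng.normal(size=(K, d))     # hypothetical local models at step t

u = w - w_sync                                 # local drifts
u_bar = u.mean(axis=0)                         # global drift

var_direct = np.mean(np.sum((w - w.mean(axis=0)) ** 2, axis=1))   # Eq. (2)
var_drift = np.mean(np.sum(u ** 2, axis=1)) - np.sum(u_bar ** 2)  # Eq. (4)

print(np.isclose(var_direct, var_drift))  # True
```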
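Algorithm 1 Federated Dynamic Averaging - FDA
Require: $K$: The number of workers indexed by $k$
Require: $\Theta$: The model variance threshold
Require: $b$: The local mini-batch size
1: Initialize $\mathbf{w}_0^{(k)} = \mathbf{w}_0 \in \mathbb{R}^d$
2: for each step $t = 1, 2, \ldots$ do
3:     for each worker $k = 1, \ldots, K$ in parallel do
4:         $\mathcal{B}_t^{(k)} \leftarrow$ (sample a batch of size $b$ from $\mathcal{D}_k$)
5:         $\mathbf{w}_t^{(k)} \leftarrow \mathrm{Optimize}(\mathbf{w}_{t-1}^{(k)}, \mathcal{B}_t^{(k)})$
6:         Update $\mathbf{S}_t^{(k)}$
7:         $\overline{\mathbf{S}}_t \leftarrow \mathrm{AllReduce}(\mathbf{S}_t^{(k)})$
8:         if $H(\overline{\mathbf{S}}_t) > \Theta$ then
9:             $\mathbf{w}_t^{(k)} \leftarrow \mathrm{AllReduce}(\mathbf{w}_t^{(k)})$    ▷ In-place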
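To make the control flow of Algorithm 1 concrete, here is a single-process simulation in which AllReduce is modelled as a plain mean over simulated workers. This is a sketch under stated assumptions, not the authors' TensorFlow implementation: the names `fda_train`, `optimize`, `local_state` and `estimate` are hypothetical, the toy objective is a quadratic per worker, and the state used in the demo is deliberately crude (only the squared drift norm, so that $H$ is the average squared drift, which upper-bounds the variance by Eq. (4)).

```python
import numpy as np

def fda_train(optimize, local_state, estimate, w0, num_workers, theta, steps):
    """Single-process simulation of Algorithm 1; AllReduce is modelled as a mean."""
    w = np.tile(np.asarray(w0, dtype=float), (num_workers, 1))  # Line 1: common start
    w_sync = w[0].copy()                       # model after the most recent sync
    num_syncs = 0
    for t in range(1, steps + 1):
        for k in range(num_workers):
            w[k] = optimize(w[k], k, t)        # Lines 4-5: local training step
        states = np.array([local_state(w[k] - w_sync)
                           for k in range(num_workers)])        # Line 6
        avg_state = states.mean(axis=0)        # Line 7: AllReduce on small states
        if estimate(avg_state) > theta:        # Line 8
            w_sync = w.mean(axis=0)            # Line 9: AllReduce on full models
            w[:] = w_sync
            num_syncs += 1
    return w.mean(axis=0), num_syncs

# Hypothetical toy problem: worker k takes gradient steps on 0.5 * ||w - c_k||^2.
centers = np.array([[3.0, 2.0], [1.0, 2.0], [2.0, 3.0], [2.0, 1.0]])
step = lambda w, k, t: w - 0.1 * (w - centers[k])

w_avg, num_syncs = fda_train(step,
                             local_state=lambda u: np.array([u @ u]),
                             estimate=lambda s: s[0],
                             w0=np.zeros(2), num_workers=4, theta=0.05, steps=200)
print(w_avg, num_syncs)  # the average model approaches [2, 2] with fewer than 200 syncs
```

Swapping in a tighter `local_state`/`estimate` pair changes only how often the branch on Line 8 fires; the surrounding loop stays the same.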
Conceptually, following Eq. (4), to precisely monitor the variance, we need to calculate two quantities: (1) $\frac{1}{K} \sum_{k=1}^{K} \| \mathbf{u}_t^{(k)} \|_2^2$, and (2) $\| \overline{\mathbf{u}}_t \|_2^2$. The first quantity requires an AllReduce operation on the squared norm of the worker drifts, which involves minimal overhead since these values are scalar. In contrast, the second quantity necessitates an AllReduce operation on the worker drifts themselves, which are of model dimension, thus incurring a high communication cost. In fact, this operation is equivalent to synchronization, which is exactly what we aim to avoid in the first place. Thus, it becomes evident that communication-efficient model variance estimation hinges on estimating $\| \overline{\mathbf{u}}_t \|_2^2$ efficiently.
Upcoming sections will detail two techniques for communication efficient variance estimation (which primarily involves estimating $\| \overline{\mathbf{u}}_t \|_2^2$): SketchFDA and LinearFDA. To present them uniformly, we introduce the local state $\mathbf{S}_t^{(k)}$, a tensor which contains: (1) the scalar value $\| \mathbf{u}_t^{(k)} \|_2^2$ for precisely calculating the first quantity, and (2) a low-dimensional summary of $\mathbf{u}_t^{(k)}$, different for each technique, for estimating the second quantity. For each technique we define an estimation function $H(\cdot)$ that calculates the current variance estimate from the average state $\overline{\mathbf{S}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{S}_t^{(k)}$ (obtained via AllReduce).
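3.1 SketchFDA: Sketch-based Estimation
An optimal estimator for $\| \overline{\mathbf{u}}_t \|_2^2$ can be obtained through the utilization and properties of AMS sketches, as detailed in [8]. An AMS sketch of a vector $\mathbf{v} \in \mathbb{R}^d$ is an $l \times m$ real matrix:
$$\mathrm{sk}(\mathbf{v}) = \left[ \psi_1 \; \psi_2 \; \ldots \; \psi_l \right]^\top \in \mathbb{R}^{l \times m}, \qquad l \cdot m \ll d$$
An estimate of the squared norm $\| \mathbf{v} \|_2^2$ is provided by the formula
$$\mathcal{M}_2(\mathrm{sk}(\mathbf{v})) = \mathrm{median}\left\{ \| \psi_i \|_2^2, \; i = 1, \ldots, l \right\}$$
The quality of estimation depends on the size of the sketch. For chosen $\epsilon, \delta > 0$, where sketch dimensions are given by $l = O(\log 1/\delta)$ and $m = O(1/\epsilon^2)$, we have the following probabilistic guarantee: with confidence at least $1 - \delta$,
$$\mathcal{M}_2(\mathrm{sk}(\mathbf{v})) \in (1 \pm \epsilon) \| \mathbf{v} \|_2^2$$
Notably, observe that the accuracy ($\epsilon$) and confidence ($1 - \delta$) only depend on the size of the sketch and not on the dimensionality of vector $\mathbf{v}$.
Two crucial properties of the AMS sketch are that (a) it is a linear transformation, i.e., for $\alpha_1, \alpha_2 \in \mathbb{R}$ and $\mathbf{v}_1, \mathbf{v}_2 \in \mathbb{R}^d$,
$$\mathrm{sk}(\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2) = \alpha_1 \mathrm{sk}(\mathbf{v}_1) + \alpha_2 \mathrm{sk}(\mathbf{v}_2)$$
and (b) it can be computed efficiently in time $O(l \cdot d)$.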
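One common way to realize such a sketch is the bucketed construction below: each of the $l$ rows assigns every coordinate to one of $m$ columns with a random sign, and $\mathcal{M}_2$ takes the median of the squared row norms. This is a simplified illustration, not the construction of [8]: the class name `AmsSketch` is hypothetical, and the bucket and sign tables are drawn fully at random from a shared seed rather than from the limited-independence hash families used in the AMS literature (which also avoid storing $l \cdot d$ table entries).

```python
import numpy as np

class AmsSketch:
    """AMS-style sketch operator sk(.) with l rows and m columns (illustrative)."""

    def __init__(self, d, l=5, m=250, seed=0):
        rng = np.random.default_rng(seed)   # a shared seed gives every worker the same operator
        self.l, self.m = l, m
        self.buckets = rng.integers(0, m, size=(l, d))      # column index per row and coordinate
        self.signs = rng.choice([-1.0, 1.0], size=(l, d))   # random sign per row and coordinate

    def sk(self, v):
        """Sketch a d-dimensional vector into an l x m matrix in O(l * d) time."""
        rows = [np.bincount(self.buckets[i], weights=self.signs[i] * v, minlength=self.m)
                for i in range(self.l)]
        return np.stack(rows)

    @staticmethod
    def m2(sketch):
        """Estimate the squared L2 norm as the median of the squared row norms."""
        return float(np.median(np.sum(sketch ** 2, axis=1)))

rng = np.random.default_rng(42)
d = 10_000
v1, v2 = rng.normal(size=d), rng.normal(size=d)
ams = AmsSketch(d)

estimate = ams.m2(ams.sk(v1))     # sketch-based estimate of ||v1||^2
exact = float(v1 @ v1)
# Property (a): the sketch of a linear combination equals the combination of sketches.
linear_ok = np.allclose(ams.sk(2.0 * v1 - 3.0 * v2), 2.0 * ams.sk(v1) - 3.0 * ams.sk(v2))
```

Linearity is what allows workers to average their sketches with AllReduce and obtain the sketch of the average drift without ever exchanging the drifts themselves.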
In the SketchFDA approach, the salient idea is to employ AMS sketches $\mathrm{sk}(\mathbf{u}_t^{(k)}) \in \mathbb{R}^{l \times m}$ as a low-dimensional representation of the local drifts $\mathbf{u}_t^{(k)}$.
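Theorem 3.1. Let $l = O(\log \frac{1}{\delta})$ and $m = O(\frac{1}{\epsilon^2})$. Define the local state as
$$\mathbf{S}_t^{(k)} = \left( \left\| \mathbf{u}_t^{(k)} \right\|_2^2, \; \mathrm{sk}\left( \mathbf{u}_t^{(k)} \right) \right) \in \mathbb{R} \times \mathbb{R}^{l \times m}$$
and the approximation function as
$$H\left( \overline{\mathbf{S}}_t \right) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \frac{1}{1 + \epsilon} \mathcal{M}_2\left( \frac{1}{K} \sum_{k=1}^{K} \mathrm{sk}\left( \mathbf{u}_t^{(k)} \right) \right).$$
Then, the condition $H(\overline{\mathbf{S}}_t) \le \Theta$ implies $\mathrm{Var}(\mathbf{w}_t) \le \Theta$ with probability at least $(1 - \delta)$.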
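Proof.
$$\begin{aligned}
H\left( \overline{\mathbf{S}}_t \right) &= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \frac{1}{1 + \epsilon} \mathcal{M}_2\left( \frac{1}{K} \sum_{k=1}^{K} \mathrm{sk}\left( \mathbf{u}_t^{(k)} \right) \right) \\
&\overset{\text{(lin.)}}{=} \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \frac{1}{1 + \epsilon} \mathcal{M}_2\left( \mathrm{sk}\left( \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}_t^{(k)} \right) \right) \\
&= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \frac{1}{1 + \epsilon} \mathcal{M}_2\left( \mathrm{sk}\left( \overline{\mathbf{u}}_t \right) \right) \\
&\overset{(\epsilon\text{-err.})}{\ge} \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \left\| \overline{\mathbf{u}}_t \right\|_2^2 \quad \text{with prob. at least } (1 - \delta) \\
&= \mathrm{Var}(\mathbf{w}_t)
\end{aligned}$$
The inequality uses the sketch guarantee $\mathcal{M}_2(\mathrm{sk}(\overline{\mathbf{u}}_t)) \le (1 + \epsilon) \| \overline{\mathbf{u}}_t \|_2^2$. We proved that $H(\overline{\mathbf{S}}_t) \ge \mathrm{Var}(\mathbf{w}_t)$ with probability at least $(1 - \delta)$, i.e., we overestimate the model variance with probability at least $(1 - \delta)$, completing the proof. □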
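The estimator of Theorem 3.1 can be sketched as follows, with AllReduce again modelled as a plain mean. The helper names `make_sketch` and `sketch_H`, the randomly generated drifts, and the fully random bucket/sign tables are illustrative assumptions. Because the guarantee is probabilistic, the example only compares $H$ with the true variance and does not assert $H \ge \mathrm{Var}$.

```python
import numpy as np

def make_sketch(d, l=5, m=250, seed=0):
    """Return sk(.), a linear map from R^d to R^{l x m} (same seed on all workers)."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, m, size=(l, d))
    signs = rng.choice([-1.0, 1.0], size=(l, d))
    def sk(v):
        return np.stack([np.bincount(buckets[i], weights=signs[i] * v, minlength=m)
                         for i in range(l)])
    return sk

def sketch_H(avg_sq_norm, avg_sketch, eps):
    """H from Theorem 3.1, evaluated on the two components of the average state."""
    m2 = np.median(np.sum(avg_sketch ** 2, axis=1))   # estimate of ||u_bar||^2
    return float(avg_sq_norm - m2 / (1.0 + eps))

rng = np.random.default_rng(7)
K, d, eps = 8, 10_000, 0.06
u = rng.normal(size=(K, d)) + 1.0        # hypothetical local drifts with a shared component
sk = make_sketch(d)

# Each worker contributes (squared norm, sketch); AllReduce is modelled as a mean.
avg_sq_norm = np.mean([uk @ uk for uk in u])
avg_sketch = np.mean([sk(uk) for uk in u], axis=0)

H = sketch_H(avg_sq_norm, avg_sketch, eps)
u_bar = u.mean(axis=0)
true_var = float(avg_sq_norm - u_bar @ u_bar)   # Eq. (4), needs the full drifts
print(H, true_var)
```

Only `avg_sq_norm` and `avg_sketch` cross the network in this scheme; `u_bar` is computed here solely to provide the reference value.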
In Section 3.3, we discuss the empirical basis for choosing the values of $l$ and $m$, and how they practically impact the quality of the sketch approximation.
3.2 LinearFDA: Linear Approximation
Although AMS sketches provide good estimates of variance, their dimension is in the several hundreds, and the communication cost of AllReduce on sketches, performed at each step, may be non-negligible. Therefore, we also introduce a low-cost, ad-hoc estimation variant.
In this approach, instead of an AMS sketch, each local state contains the scalar value $\langle \xi, \mathbf{u}_t^{(k)} \rangle \in \mathbb{R}$, where $\xi \in \mathbb{R}^d$ is a unit vector, known to all workers.
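Theorem 3.2. Define the local state as
$$\mathbf{S}_t^{(k)} = \left( \left\| \mathbf{u}_t^{(k)} \right\|_2^2, \; \left\langle \xi, \mathbf{u}_t^{(k)} \right\rangle \right) \in \mathbb{R} \times \mathbb{R}, \qquad \| \xi \|_2 = 1$$
and the approximation function as
$$H\left( \overline{\mathbf{S}}_t \right) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \left| \frac{1}{K} \sum_{k=1}^{K} \left\langle \xi, \mathbf{u}_t^{(k)} \right\rangle \right|^2$$
Then, the condition $H(\overline{\mathbf{S}}_t) \le \Theta$ implies $\mathrm{Var}(\mathbf{w}_t) \le \Theta$.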

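Proof.
$$\begin{aligned}
H\left( \overline{\mathbf{S}}_t \right) &= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \left| \frac{1}{K} \sum_{k=1}^{K} \left\langle \xi, \mathbf{u}_t^{(k)} \right\rangle \right|^2 \\
&= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \left| \left\langle \xi, \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}_t^{(k)} \right\rangle \right|^2 \\
&= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \left| \left\langle \xi, \overline{\mathbf{u}}_t \right\rangle \right|^2 \\
&\ge \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \| \xi \|_2^2 \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}_t^{(k)} \right\|_2^2 - \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \mathrm{Var}(\mathbf{w}_t)
\end{aligned}$$
The inequality follows from the Cauchy-Schwarz inequality. We proved that $H(\overline{\mathbf{S}}_t) \ge \mathrm{Var}(\mathbf{w}_t)$, i.e., we always overestimate the model variance, completing the proof. □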
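The deterministic bound of Theorem 3.2 can be checked with a few lines of NumPy. In this sketch the names `linear_state`, `linear_H` and `estimate` are hypothetical, the drifts are random placeholders, and AllReduce is modelled as a mean; the two choices of $\xi$ illustrate the loosest and tightest ends of the bound.

```python
import numpy as np

def linear_state(u_k, xi):
    """Local state of Theorem 3.2: (||u_k||^2, <xi, u_k>)."""
    return np.array([u_k @ u_k, xi @ u_k])

def linear_H(avg_state):
    """H from Theorem 3.2, evaluated on the average state."""
    return float(avg_state[0] - avg_state[1] ** 2)

rng = np.random.default_rng(3)
K, d = 8, 1000
u = rng.normal(size=(K, d)) + 0.5            # hypothetical local drifts
u_bar = u.mean(axis=0)
true_var = float(np.mean(np.sum(u ** 2, axis=1)) - u_bar @ u_bar)   # Eq. (4)

def estimate(xi):
    xi = xi / np.linalg.norm(xi)             # enforce ||xi||_2 = 1
    avg_state = np.mean([linear_state(uk, xi) for uk in u], axis=0)  # AllReduce as a mean
    return linear_H(avg_state)

H_random = estimate(rng.normal(size=d))      # direction unrelated to the global drift
H_aligned = estimate(u_bar)                  # best case: xi parallel to the global drift
print(true_var, H_aligned, H_random)         # H_aligned matches true_var; H_random is larger
```

The gap between `H_random` and `true_var` is exactly the part of $\| \overline{\mathbf{u}}_t \|_2^2$ that $\xi$ fails to capture, which is why the choice of $\xi$ matters in practice.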
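An arbitrary choice of $\xi$ (e.g., a random vector) is likely to estimate $\| \overline{\mathbf{u}}_t \|_2^2$ poorly; if $\xi$ is uncorrelated to $\overline{\mathbf{u}}_t$, then $| \langle \xi, \overline{\mathbf{u}}_t \rangle |^2$ will likely be close to zero. A heuristic choice that might be correlated to $\overline{\mathbf{u}}_t$ is the (normalized) value of $\overline{\mathbf{u}}_{t_0}$, the global drift vector right at the time of last synchronization. All nodes can compute it independently without extra communication, if they take the difference of the models of the last two synchronizations:
$$\xi = \frac{\overline{\mathbf{u}}_{t_0}}{\left\| \overline{\mathbf{u}}_{t_0} \right\|_2} = \frac{\mathbf{w}_{t_0} - \mathbf{w}_{t_{-1}}}{\left\| \mathbf{w}_{t_0} - \mathbf{w}_{t_{-1}} \right\|_2}$$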
3.3 Discussion
FDA: Intuition. The main intuition behind FDA is to make the decision to synchronize dynamic, based on model variance during training. This metric is designed to capture the collective state of the training process. In what follows, we provide intuition on why this is the case. It is important to remember that the global model $\overline{\mathbf{w}}_t$ and, by extension, the global drift $\overline{\mathbf{u}}_t$, are ultimately what we care about and evaluate.
Model variance, as defined in Equation (4), is the difference between the average of the squared local drifts $\frac{1}{K} \sum \| \mathbf{u}_t^{(k)} \|_2^2$ and the squared global drift $\| \overline{\mathbf{u}}_t \|_2^2$. The first term reflects how far the individual worker models have moved (essentially, how much each worker has learned). The second term indicates how much of this learning is retained in the global model after aggregation.
The interplay between these two quantities is crucial. For example, when the local drifts are high but the global drift is low, the variance increases, signaling the need for synchronization. This scenario suggests that while individual workers have made significant progress (as indicated by high local drifts), this progress is not being effectively captured in the global model (indicated by the low global drift). In other words, the worker models have moved significantly, but the global model has remained relatively stationary in this high-dimensional space. This misalignment indicates that training is no longer progressing optimally, as the workers are moving towards disparate and conflicting local minima, making it crucial to synchronize and realign them. Conversely, when both the local and global drifts are either low or high, synchronization is not necessary, and the variance naturally remains low.
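A two-worker toy case makes this interplay concrete. In the sketch below (the function name `variance_from_drifts` and the hand-picked drift vectors are assumptions for illustration), both scenarios have the same average squared local drift, yet only the conflicting one produces variance under Eq. (4).

```python
import numpy as np

def variance_from_drifts(u):
    """Eq. (4): average squared local drift minus squared global drift."""
    u = np.asarray(u, dtype=float)
    return float(np.mean(np.sum(u ** 2, axis=1)) - np.sum(u.mean(axis=0) ** 2))

conflicting = np.array([[1.0, 0.0], [-1.0, 0.0]])   # workers pull in opposite directions
aligned = np.array([[1.0, 0.0], [1.0, 0.0]])        # workers move together

print(variance_from_drifts(conflicting))  # 1.0  (local drift 1, global drift 0)
print(variance_from_drifts(aligned))      # 0.0  (local drift 1, global drift 1)
```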
[Figure 2 diagram: the local drift is mapped to a local state for each variant, LinearFDA and SketchFDA.]
Figure 2: SketchFDA & LinearFDA: Local State structure.
Neither the average of the local drifts nor the global drift alone provides a complete picture of the collective training progress. Relying solely on one or the other would lead to suboptimal synchronization decisions and likely prove ineffective. In FDA, it is the relationship between these quantities, as captured by the model variance, that offers valuable insights and guides the crucial decision of when to synchronize.
SketchFDA vs. LinearFDA: Both methods send the squared norm of the drift $\| \mathbf{u}_t^{(k)} \|_2^2$, but differ in the additional accompanying lower-dimensional representation they transmit (Figure 2):
(1) SketchFDA: An AMS sketch of the local drift.
(2) LinearFDA: The dot product of a vector and the local drift.
The key difference between these two variants lies in the fidelity of approximation of the model variance. While both methods conservatively overestimate the variance, SketchFDA provides a provably accurate estimation, which is expected to lead to fewer synchronizations. LinearFDA requires less computational effort and bandwidth to create and communicate the local states, but may overestimate variance by too much, causing unnecessary synchronizations.
SketchFDA: Choice of $l$ and $m$. We empirically measured the approximation achieved with sketch dimensions of $l = 5$ rows and $m = 250$ columns (as defined in Section 3.1): these settings yield an error bound of $\epsilon \approx 6\%$ and a probabilistic confidence of $(1 - \delta) \approx 95\%$. Based on these measurements, we adopted these values in our experiments and recommend them. Using these values, the byte-size of a sketch is $l \cdot m \cdot 4$ bytes $= 5$ kB, significantly smaller than the size of all our models. Sketches of smaller size could be used, albeit weakening the approximation of the variance. However, given that LinearFDA similarly weakens approximation and avoids using AMS sketches, in the interest of space we do not explore varying AMS sketch sizes in this paper.
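The size comparison can be reproduced with simple arithmetic. The snippet assumes 4-byte (float32) values for both sketch entries and model parameters, and uses the approximate parameter counts reported for LeNet-5 and VGG16* in Section 4.1; the variable names are illustrative.

```python
BYTES_PER_FLOAT32 = 4                               # assumption: 32-bit floats throughout
l, m = 5, 250

sketch_bytes = l * m * BYTES_PER_FLOAT32            # one AMS sketch
lenet5_bytes = 62_000 * BYTES_PER_FLOAT32           # approx. 62 thousand parameters
vgg16_star_bytes = 2_600_000 * BYTES_PER_FLOAT32    # approx. 2.6 million parameters

print(sketch_bytes)                       # 5000
print(lenet5_bytes / sketch_bytes)        # 49.6
print(vgg16_star_bytes / sketch_bytes)    # 2080.0
```

Under these assumptions, even for the smallest model the sketch is roughly fifty times smaller than a full model exchange.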
FDA: Asynchronous Operation. As mentioned in Section 2, FDA can be readily modified to operate asynchronously. In this setup, one worker-node acts as a coordinator, aggregating local states and determining whether synchronization is needed each time a local state is received. This decision is based on the most recent local states from all workers. It is important to note that, since local states are small in size, asynchronous operation is unlikely to alleviate bandwidth issues. The primary advantage is that it allows training to continue even in the presence of stragglers. Asynchronous operation might also be beneficial in rare cases where the overhead of initializing communication dominates the actual transmission time.
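The coordinator logic described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the local state is taken to be the two-number LinearFDA state, the variance estimate is the average squared norm minus the squared average dot product, and a synchronization is requested when that estimate exceeds $\Theta$. The class and method names are hypothetical.

```python
from typing import Dict, Tuple


class AsyncCoordinator:
    """Keeps each worker's most recent local state and re-checks on every arrival."""

    def __init__(self, num_workers: int, theta: float):
        self.num_workers = num_workers
        self.theta = theta
        self.latest: Dict[int, Tuple[float, float]] = {}

    def receive(self, worker_id: int, sq_norm: float, dot: float) -> bool:
        """Store a worker's newest state; return True if a sync should be triggered."""
        self.latest[worker_id] = (sq_norm, dot)
        if len(self.latest) < self.num_workers:
            return False                      # still waiting for a first report
        mean_sq_norm = sum(s[0] for s in self.latest.values()) / self.num_workers
        mean_dot = sum(s[1] for s in self.latest.values()) / self.num_workers
        return mean_sq_norm - mean_dot ** 2 > self.theta

    def reset(self) -> None:
        """Assumption: drifts restart from zero after a sync, so old states are dropped."""
        self.latest.clear()


coordinator = AsyncCoordinator(num_workers=3, theta=1.0)
decisions = [
    coordinator.receive(0, sq_norm=0.4, dot=0.1),
    coordinator.receive(1, sq_norm=0.5, dot=0.2),
    coordinator.receive(2, sq_norm=0.6, dot=0.0),   # all reported, estimate below theta
    coordinator.receive(0, sq_norm=3.0, dot=0.1),   # worker 0 reports again, nobody waits
]
```

The last call shows the point of the asynchronous variant: a fast worker can report a second time and trigger the check while the other workers' older states are reused, so a straggler does not block the decision.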
4 EXPERIMENTS
4.1 Setup
Table 2 provides a comprehensive overview of our experiments. For each experiment, we detail the Neural Network (NN) architecture, its parameter count ($d$), and the dataset used for training. The table also specifies key hyper-parameters: the batch size ($b$), the number of workers ($K$), and the FDA-specific variance threshold ($\Theta$). Additionally, we indicate the chosen optimizer (as detailed in Section 3) and the training algorithms employed for each configuration.
Platform. We employ TensorFlow [1], integrated with Keras [7], as the platform for conducting our experiments. We used TensorFlow to implement our FDA variants and all competitive algorithms. All relevant code, figures, and data of this study are available in https://github.com/miketheologitis/FedL-Sync-FDA.
Hardware & Infrastructure. We conducted our experiments on the ARIS High Performance Computing (HPC) environment², utilizing a cluster of 44 GPU-accelerated worker-nodes. Each worker is equipped with two NVIDIA Tesla K40m GPUs and interconnected via an InfiniBand FDR14 network, providing up to 56 Gb/s of bandwidth. Crucially, our evaluation remains agnostic to the underlying infrastructure of the specific workers.
Datasets & Models. The core experiments involve training Convolutional Neural Networks (CNNs) of varying sizes and complexities on two datasets: MNIST [12] and CIFAR-10 [27]. For the MNIST dataset, we employ LeNet-5 [29], composed of approximately 62 thousand parameters, and a modified version of VGG16 [48], denoted as VGG16*, consisting of 2.6 million parameters. VGG16* was specifically adapted for the MNIST dataset, a less demanding learning problem compared to ImageNet [44], for which VGG16 was designed. In VGG16*, we omitted the 512-channel convolutional blocks and downscaled the final two fully connected (FC) layers from 4096 to 512 units each. Both models use Glorot uniform initialization [15]. For CIFAR-10, we utilize DenseNet121 and DenseNet201 [22], as implemented in Keras [7], with the addition of dropout regularization layers at rate 0.2 and weight decay of $10^{-4}$, as prescribed in [22]. The DenseNet121 and DenseNet201 models have 6.9 million and 18 million parameters, respectively, and are both initialized with He normal [19].
Lastly, we explore a transfer learning scenario on the dataset CIFAR-100 [27], a choice reflecting the DL community's growing preference for using pre-trained models in such downstream tasks [18]. For example, a pre-trained visual transformer (ViT) on ImageNet, transferred to classify CIFAR-100, is currently on par with the state-of-the-art results for this task [13]. We adopt this exact transfer learning scenario, leveraging the more powerful ConvNeXtLarge model, pre-trained on ImageNet, with 198 million parameters [7, 33]. Following the feature extraction step [16], the testing accuracy on CIFAR-100 stands at 60%. Subsequently, we employ and evaluate our FDA algorithms in the arduous fine-tuning stage, where the entirety of the model is trained [39].
Algorithms. We consider five distributed deep learning algorithms: LinearFDA, SketchFDA, Synchronous³, FedAdam [42], and FedAvgM [21]; the first three are standard in all experiments. Depending on the local optimizer, Adam [26] or SGD with Nesterov momentum (SGD-NM) [52], we also include their communication-efficient federated counterparts FedAdam or FedAvgM, respectively.
² https://www.hpc.grnet.gr/en/hardware-2/
³ The name was derived from the Bulk Synchronous Parallel approach; it can be understood as a special case of the FDA Algorithm 1 where $\Theta$ is set to zero.
Evaluation Methodology. Comparing DDL algorithms is not straightforward. For example, comparing DDL algorithms based on the average cost of a training epoch can be misleading, as it does not consider the effects on the trained model's quality. To achieve a comprehensive performance assessment of FDA, we define a training run as the process of executing the DDL algorithm under evaluation, on (a) a specific DL model and training dataset, and (b) until a final epoch in which the trained model achieves a specific testing accuracy (termed as Accuracy Target in figures). Based on this definition, we focus on two performance metrics:
(1) Communication cost, which is the total data (in bytes) transmitted by all workers. Notably, communication cost is unaffected by the training data volume since only model updates (when synchronizing) and local states (at each step), but no training data, are transmitted. Thus, the communication cost mainly depends on the complexity (number of parameters) of the used model. Translating the communication cost to wall-clock time (i.e., the total time required for the computation and communication of the DDL) depends on the network infrastructure connecting the workers and on the overhead of establishing and initializing communication. Its impact is larger in FL scenarios, where workers often use slower Wi-Fi connections.
(2) Computation cost, which is the number of mini-batch steps (termed as In-Parallel Learning Steps in figures) performed by each worker. Translating this cost to wall-clock time is determined by the mini-batch size and the computational resources of the worker-nodes. Its impact is larger for workers with lower computational resources.
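As a rough illustration of the communication metric, the snippet below tallies bytes for a single hypothetical training run. The accounting is an assumption made for this example only: every worker sends its $d$ float32 parameters at each synchronization and a fixed-size local state at each step. The step and synchronization counts are made-up inputs, not results from the paper, and the function name is illustrative.

```python
def communication_bytes(num_workers: int, num_params: int, steps: int,
                        syncs: int, state_bytes: int,
                        bytes_per_param: int = 4) -> int:
    """Total bytes sent by all workers under the simplified accounting above."""
    per_worker = syncs * num_params * bytes_per_param + steps * state_bytes
    return num_workers * per_worker


K, d = 10, 62_000                # LeNet-5 sized model on 10 workers
steps = 10_000                   # hypothetical In-Parallel Learning Steps
linear_state = 2 * 4             # two float32 numbers per step
sketch_state = 4 + 5 * 250 * 4   # squared norm plus a 5 x 250 float32 sketch

# Synchronous: a full model exchange at every step, no extra local state.
synchronous = communication_bytes(K, d, steps, syncs=steps, state_bytes=0)
# FDA variants: few synchronizations (50 is invented), plus a state every step.
linear_fda = communication_bytes(K, d, steps, syncs=50, state_bytes=linear_state)
sketch_fda = communication_bytes(K, d, steps, syncs=50, state_bytes=sketch_state)
```

With these invented inputs the per-step sketch accounts for most of SketchFDA's traffic on a model this small, which matches the text's remark that the sketch carries a larger local-state overhead than LinearFDA's two numbers.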
Hyper-Parameters & Optimizers. Hyper-parameters unique to each training dataset and model are detailed in Table 2; $\Theta$ is pertinent to FDA algorithms and not applicable to others. Notably, a guideline for setting the parameter $\Theta$ is provided in Section 4.3. For experiments involving FedAvgM and FedAdam, we use $E = 1$ local epoch, following [42]. For experiments with LeNet-5 and VGG16*, local optimization employs Adam, using the default settings as per [26]. In these cases, FedAdam also adheres to the default settings for both local and server optimization [7, 42]. For DenseNet121 and DenseNet201, local optimization is performed using SGD with Nesterov momentum (SGD-NM), setting the momentum parameter at 0.9 and learning rate at 0.1 [22]. For FedAvgM, local optimization is conducted with default settings [7, 21], while server optimization employs SGD with momentum, setting the momentum parameter and learning rate to 0.9 and 0.316, respectively [42]. Lastly, for the transfer learning experiments, local optimization leverages AdamW [34], with the hyper-parameters used for fine-tuning ConvNeXtLarge in the original study [33].
Data Distribution. In all experiments, the training dataset is divided into approximately equal parts among the workers. To assess the impact of data heterogeneity, we explore three scenarios:
(1) IID. Independent and identically distributed.
(2) Non-IID: $X$%. A portion $X$% of the dataset is sorted by label and sequentially allocated to workers, with the remainder distributed in an IID fashion.
Table 2: Summary of Experiments

                                                 |------------------ Hyper-Parameters ------------------|------------- Training -------------|
NN                          d     Dataset     Θ                                    b    K                   Optimizer  Algorithms
LeNet-5                     62K   MNIST       {0.5, 1, 1.5, 2, 3, 5, 7}            32   {5, 10, ..., 60}    Adam       FDA, Synchronous, FedAdam
VGG16*                      2.6M  MNIST       {20, 25, 30, 50, 75, 90, 100}        32   {5, 10, ..., 60}    Adam       FDA, Synchronous, FedAdam
DenseNet121                 6.9M  CIFAR-10    {200, 250, 275, 300, 325, 350, 400}  32   {5, 10, ..., 30}    SGD-NM     FDA, Synchronous, FedAvgM
DenseNet201                 18M   CIFAR-10    {350, 500, 600, 700, 800, 850, 900}  32   {5, 10, ..., 30}    SGD-NM     FDA, Synchronous, FedAvgM
ConvNeXtLarge (fine-tuning) 198M  CIFAR-100   {25, 50, 100, 150}                   32   {3, 5}              AdamW      FDA, Synchronous
[Figure 3 plots omitted. Three panels: IID; Non-IID: Label "0"; Non-IID: 60%. Accuracy Target: 0.985 in each. Horizontal axis: Communication (GB). Vertical axis: In-Parallel Learning Steps. Strategies shown: LinearFDA, SketchFDA, FedAdam, Synchronous.]
Figure 3: LeNet-5 on MNIST. At Non-IID: Label "0", the samples of Label "0" are assigned to few workers. At Non-IID: 60%, 60% of the dataset is sorted and allocated to workers, causing some workers to receive many samples from the same label.
(3) Non-IID: Label $Y$. All samples from label $Y$ are assigned to a few workers, while the rest are distributed in an IID manner.
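The three data distribution scenarios can be sketched with NumPy index splits. This is one plausible reading of the descriptions, not the authors' code: the function names, the `num_holders` parameter (how many workers count as "a few"), and the synthetic labels are assumptions, and the Label $Y$ split below does not rebalance part sizes after adding the extra samples.

```python
import numpy as np


def split_iid(indices, num_workers, rng):
    """Shuffle and cut into approximately equal parts."""
    return np.array_split(rng.permutation(indices), num_workers)


def split_non_iid_percent(labels, num_workers, percent, rng):
    """Non-IID X%: sort X% of the samples by label and allocate them sequentially."""
    shuffled = rng.permutation(len(labels))
    n_sorted = int(len(labels) * percent / 100)
    sorted_part = shuffled[:n_sorted]
    sorted_part = sorted_part[np.argsort(labels[sorted_part], kind="stable")]
    sorted_chunks = np.array_split(sorted_part, num_workers)
    iid_chunks = split_iid(shuffled[n_sorted:], num_workers, rng)
    return [np.concatenate([s, r]) for s, r in zip(sorted_chunks, iid_chunks)]


def split_non_iid_label(labels, num_workers, label, num_holders, rng):
    """Non-IID Label Y: only the first `num_holders` workers receive label Y."""
    label_idx = np.flatnonzero(labels == label)
    other_idx = np.flatnonzero(labels != label)
    label_chunks = np.array_split(rng.permutation(label_idx), num_holders)
    iid_chunks = split_iid(other_idx, num_workers, rng)
    parts = []
    for k in range(num_workers):
        extra = label_chunks[k] if k < num_holders else np.array([], dtype=int)
        parts.append(np.concatenate([iid_chunks[k], extra]))
    return parts


rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=6_000)     # synthetic stand-in for MNIST labels
iid = split_iid(np.arange(len(labels)), 5, rng)
non_iid_60 = split_non_iid_percent(labels, 5, percent=60, rng=rng)
non_iid_label0 = split_non_iid_label(labels, 5, label=0, num_holders=2, rng=rng)
```

Each function returns one index array per worker, so the same helper output can be used to slice both the images and the labels of a dataset.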
4.2 Main Findings
The main findings of our experimental analyses are:
(1) LinearFDA and SketchFDA outperform the Synchronous, FedAdam and FedAvgM techniques (their use depends on the local optimizer choice) by 1-2 orders of magnitude in communication, while maintaining equivalent model performance.
(2) LinearFDA and SketchFDA also significantly outperform the FedAdam and FedAvgM techniques in terms of computation.
(3) The performance of LinearFDA and SketchFDA is comparable in most experiments. SketchFDA provides a more accurate estimator of the variance and leads to fewer synchronizations than LinearFDA, but has a larger communication overhead for its local state (a sketch, compared to two numbers). SketchFDA significantly outperforms LinearFDA in the transfer learning scenario.
(4) The FDA variants remain robust across various data heterogeneity settings, maintaining comparable performance to the IID case.
4.3 Results
Due to the extensive set of unique experiments (over 1000), as detailed in Table 2, we leverage Kernel Density Estimation (KDE) plots [62] to visualize the bivariate distribution of computation and communication costs incurred by each strategy for attaining the Accuracy Target. These KDE plots provide a high-level overview of the cost trade-offs for training accurate models. The varying levels of opacity in the filled areas of the KDE plots represent the density of the underlying data points: higher opacity indicates areas with a greater concentration of data, whereas lower opacity signifies less dense areas.
As an illustrative example, Figure 3 depicts the strategies' bivariate distribution for the LeNet-5 model trained on MNIST with different data heterogeneity setups. In these plots, the SketchFDA distribution is generated from experiments across all hyper-parameter combinations ($\Theta$ and $K$ in Table 2) that attained the Accuracy Target of 0.985. The observed high variance in the method's distribution stems from the varying $K$ and $\Theta$ values. In subsequent subsections, we elucidate how these hyper-parameters influence the communication and computation costs.
[Figure 4 plots omitted. Six panels: IID, Non-IID: Label "0", and Non-IID: Label "8", each at Accuracy Targets 0.994 and 0.995. Horizontal axis: Communication (GB). Vertical axis: In-Parallel Learning Steps. Strategies shown: LinearFDA, SketchFDA, FedAdam, Synchronous.]
Figure 4: VGG16* on MNIST
[Figure 5 plots omitted. Two panels: IID at Accuracy Targets 0.78 and 0.81. Horizontal axis: Communication (GB). Vertical axis: In-Parallel Learning Steps. Strategies shown: LinearFDA, SketchFDA, FedAvgM, Synchronous.]
Figure 5: DenseNet121 on CIFAR-10
[Figure 6 plots omitted. Two panels: IID at Accuracy Targets 0.78 and 0.8. Horizontal axis: Communication (GB). Vertical axis: In-Parallel Learning Steps. Strategies shown: LinearFDA, SketchFDA, FedAvgM, Synchronous.]
Figure 6: DenseNet201 on CIFAR-10
FDA balances Communication vs. Computation. DDL algorithms face a fundamental challenge: balancing the competing demands of computation and communication. Frequent communication accelerates convergence and potentially improves model performance, but incurs higher network overhead, an overhead that may be prohibitive when workers communicate through lower speed connections. Conversely, reducing communication saves bandwidth but risks hindering, or even stalling, convergence. Traditional DDL approaches, like Synchronous, require synchronizing model parameters after every learning step, leading to significant communication overhead but facilitating faster convergence (lower computation cost). This is evident in Figures 3, 4, 5, and 6 (where Synchronous appears in the bottom right: low computation, very high communication). Conversely, Federated Optimization (FedOpt) methods [42] are designed to be communication-efficient, reducing communication between devices (workers) at the expense of increased local computation. Indeed, as shown in Figures 3-6, FedAvgM and FedAdam reduce communication by orders of magnitude but at the price of a corresponding increase in computation. Our two proposed FDA strategies achieve the best of both worlds: the low computation cost of traditional methods and the communication efficiency of FedOpt approaches, as seen in Figures 3, 4, 5, and 6. In fact, they significantly outperform FedAvgM and FedAdam in their element, that is, communication-efficiency. Across all experiments,