Communication-Efficient Distributed Deep Learning via
Federated Dynamic Averaging
Michail Theologitis
Technical University of Crete
Chania, Greece
[email protected]
Georgios Frangias
Technical University of Crete
Chania, Greece
[email protected]
Georgios Anestis
Technical University of Crete
Chania, Greece
[email protected]
Vasilis Samoladas
Technical University of Crete
Chania, Greece
[email protected]
Antonios Deligiannakis
Technical University of Crete
Chania, Greece
[email protected]
ABSTRACT
The ever-growing volume and decentralized nature of data, coupled with the need to harness this data and generate knowledge from it, have led to the extensive use of distributed deep learning (DDL) techniques for training. These techniques rely on local training that is performed at the distributed nodes based on locally collected data, followed by a periodic synchronization process that combines these models to create a global model. However, frequent synchronization of DL models, encompassing millions to many billions of parameters, creates a communication bottleneck, severely hindering scalability. Worse yet, DDL algorithms typically waste valuable bandwidth, and make themselves less practical in bandwidth-constrained federated settings, by relying on overly simplistic, periodic, and rigid synchronization schedules. These drawbacks also have a direct impact on the time required for the training process, necessitating excessive time for data communication. To address these shortcomings, we propose Federated Dynamic Averaging (FDA), a communication-efficient DDL strategy that dynamically triggers synchronization based on the value of the model variance. In essence, the costly synchronization step is triggered only if the local models, which are initialized from a common global model after each synchronization, have significantly diverged. This decision is facilitated by the communication of a small local state from each distributed node/worker. Through extensive experiments across a wide range of learning tasks we demonstrate that FDA reduces communication cost by orders of magnitude, compared to both traditional and cutting-edge communication-efficient algorithms. Additionally, we show that FDA maintains robust performance across diverse data heterogeneity settings.
1 INTRODUCTION
The big data era has been marked by an unprecedented scale of training datasets [41, 67]. These datasets are not only growing in size, but are often physically distributed and cannot be easily centralized due to business considerations, privacy concerns, bandwidth limitations (especially in federated settings, such as drones collecting and collaboratively building a global model/view of an area), and data sovereignty laws [9, 23, 64]. Such constraints complicate the use of Deep Learning (DL) techniques in the aforementioned scenarios.
©2025 Copyright held by the owner/author(s). Published in Proceedings of the 28th International Conference on Extending Database Technology (EDBT), 25th March-28th March, 2025, ISBN 978-3-89318-098-1 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.
Distributed Deep Learning (DDL) has emerged as an alternative paradigm to the traditional centralized approach [6, 69], offering efficient learning over large-scale data across multiple worker-nodes, enhancing the speed of training DL models and paving the way for more scalable and resilient DL applications [10, 28, 35, 55, 68]. Most DDL methods are iterative, where, in each iteration, some amount of local training is followed by synchronization of the local models with the global one. The predominant method, based on the bulk synchronous parallel (BSP) approach [56], is to average the local model updates and then apply the average update to each local model [69]. Less synchronized variants have also been proposed to ameliorate the effect of straggler workers [14, 37], but they compromise convergence speed and model quality.
A significant challenge inherent in the traditional techniques, especially in federated DL settings, where models are huge and worker interconnections are slow, is the communication bottleneck, restricting system scalability [53, 60]. Specifically, the communication bottleneck arises from the frequent exchange (synchronization) of model parameters, often in the range of billions, across distributed workers. The synchronization process entails substantial data volume transfer and generally dominates the overall training time, leading to a low computation-to-communication ratio [14, 46]. Addressing this challenge to expedite DDL algorithms has been a focal point of research for many years; speeding-up SGD is arguably among the most impactful and transformative problems in machine learning [58].
The most direct method to alleviate the communication burden is to reduce the frequency of communication rounds. Local-SGD is the prime example of this approach. It allows workers to perform $\tau$ local update steps on their models before aggregating them, as opposed to averaging the updates in every step [17, 66].
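To make the role of the local update count concrete, the following is a minimal sketch of Local-SGD on a toy problem. It is an illustration under stated assumptions, not an implementation from the paper: the model is a single scalar, each worker's objective is the hypothetical quadratic $F_k(w) = \frac{1}{2}(w - c_k)^2$, full gradients replace stochastic ones, and the names `local_sgd` and `targets` are invented for this example.

```python
def local_sgd(targets, tau, rounds, lr=0.1, w0=0.0):
    """Toy Local-SGD: worker k minimizes F_k(w) = 0.5 * (w - targets[k])**2."""
    w_global = w0
    for _ in range(rounds):
        local_models = []
        for c in targets:               # each worker, conceptually in parallel
            w = w_global                # start from the shared global model
            for _ in range(tau):        # tau local steps, no communication
                grad = w - c            # gradient of 0.5 * (w - c)**2
                w -= lr * grad
            local_models.append(w)
        # One communication round: average the local models.
        w_global = sum(local_models) / len(local_models)
    return w_global


targets = [1.0, 2.0, 6.0]               # hypothetical per-worker optima
w_final = local_sgd(targets, tau=5, rounds=200)
```

With these equally curved quadratics the averaged model approaches the mean of the worker optima for any fixed number of local steps; only the number of communication rounds needed changes. Real, heterogeneous objectives do not behave this cleanly, which is why choosing the number of local steps is hard in practice.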
Although Local-SGD is effective in reducing communication while maintaining comparable model quality [58], determining the optimal value of $\tau$ presents a critical challenge, with only a handful of studies offering theoretical insights into its influence on convergence [50, 58, 66].
To further reduce communication costs of Local-SGD, more sophisticated communication strategies introduce varying sequences of local update steps $\{\tau_0, \ldots, \tau_R\}$, instead of a fixed $\tau$. In [57], in order to minimize convergence error with respect to wall-time, the authors proposed a decreasing sequence of local update steps. Conversely, the focus in [17] was on reducing the number of communication rounds for a fixed number of model updates, and an increasing sequence emerged. These contrasting approaches underscore the multifaceted nature of communication strategies in distributed deep learning, highlighting not only the absence of a one-size-fits-all solution but also the growing need for dynamic, context-aware strategies that can continuously adapt to the specific intricacies of the learning task.
Series ISSN: 2367-2005, DOI: 10.48786/edbt.2025.33
Main Idea and Contributions. Our work addresses critical efficiency challenges in DDL, particularly in communication-constrained environments, such as the ones encountered in Federated Learning (FL) applications [23]. We introduce Federated Dynamic Averaging (FDA), a novel, adaptive distributed deep learning strategy that massively improves communication efficiency over previous work.
FDA utilizes a novel 2-action, conditional synchronization protocol. It avoids the need to decide or guess the proper number of local update steps, or to synchronize after each training step; instead, it performs the costly synchronization process only when needed. Our FDA algorithm dynamically triggers synchronization based on the value of model variance across worker-nodes. In a nutshell, the costly synchronization step is only triggered if the local models have diverged significantly, which implies that the global model may no longer be accurate.
As Figure 1 demonstrates, at the start, workers enter the local training step with the same global model (Figure 1.A). Then, local training commences and each distributed worker-node computes its local state, which encapsulates helpful information for estimating the model variance (Figure 1.B). This is followed by the transmission (Figure 1.C) of these small-size local states, an operation that is bandwidth- and time-efficient because of their small size. During transmission, the local states are aggregated and their average is made available to all workers, an operation known as AllReduce. This operation does not require (or prohibit) the use of a central node. Based on the aggregated state, the workers can estimate (Figure 1.D) whether the variance of the local models may have exceeded a threshold. If this is not the case, the costly synchronization step (Figure 1.E) is avoided and local training continues. What is important is how to properly pick these local states computed at, and then transmitted by, the local workers. To address this problem, we propose two variants of our FDA algorithm. Our contributions can be summarized as follows:
• We propose FDA, an algorithm that dynamically decides to synchronize local workers when model variance across workers exceeds a threshold. This strategy drastically reduces communication, while preserving cohesive progress towards the shared training objective.
• We propose two variants of FDA, which differ in the amount of information preserved in the local states that are transmitted by each worker and aggregated for subsequent estimation of model variance. These two variants, termed SketchFDA and LinearFDA, offer a different balance between communication efficiency and approximation accuracy.
• We evaluate and compare FDA with other DDL algorithms through a comprehensive suite of experiments with diverse datasets, models, and tasks. Our experiments demonstrate that FDA outperforms traditional and contemporary FL algorithms by 1-2 orders of magnitude in communication savings, while maintaining equivalent model performance. Furthermore, it effectively balances the competing demands of communication and computation, providing greatly improved trade-offs.
• We demonstrate FDA's robustness in various challenging Non-IID settings, common in real-world Federated Learning applications. While state-of-the-art methods typically require substantially more resources to converge under Non-IID conditions, FDA maintains consistent and comparable performance across both IID and Non-IID settings.

[Figure 1 panels: (A) Nodes start training step. (B) Local training step / compute local state. (C) Aggregate local state using AllReduce. (D) Estimate if synchronization is needed; if not, go to Step (B). (E) Synchronize models using AllReduce; go to Step (A).]
Figure 1: FDA. The local training step is followed by the computation of a local state by all worker-nodes. Then, the (small in size) local states are aggregated. Based on the aggregated result, all workers estimate if synchronization is required. In most cases, the expensive synchronization step of the models is avoided and local training continues.
Outline. The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 introduces our DDL technique, Federated Dynamic Averaging (FDA), and its two variants. Section 4 details the experimental setup, and discusses the insights and conclusions drawn from our empirical investigation. Lastly, Section 5 contains concluding remarks.
2 RELATED WORK
Problem formulation. Consider distributed training of deep neural networks over multiple workers [11, 31]. In this setting, each worker represents a data owner (equivalently, a local model owner) and has access to its own set of training data $\mathcal{D}_k$. Workers can utilize any available hardware they possess (e.g., GPUs, CPUs) to perform learning steps. The collective goal is to find a common model $\mathbf{w} \in \mathbb{R}^d$ by minimizing the overall training loss. This scenario can be effectively modeled as a distributed optimization problem, formulated as follows:
\[
\underset{\mathbf{w} \in \mathbb{R}^d}{\text{minimize}} \quad F(\mathbf{w}) \triangleq \frac{1}{K} \sum_{k=1}^{K} F_k(\mathbf{w}) \tag{1}
\]
where $K$ is the number of workers and $F_k(\mathbf{w}) \triangleq \mathbb{E}_{\zeta_k \sim \mathcal{D}_k}[\ell(\mathbf{w}; \zeta_k)]$ is the local objective function of worker $k$. Function $\ell(\mathbf{w}; \zeta_k)$ represents the loss for data sample $\zeta_k$ given model $\mathbf{w}$.
Solution direction. As noted in the seminal work [23], research in FL should focus primarily on synchronous solutions. This allows different lines of research (e.g., compression, privacy, etc.) to be developed independently and then combined seamlessly. Our work, along with most communication-efficient FL strategies, adheres to this synchronous paradigm. However, such approaches may be less effective in environments where each communication operation incurs significant overhead regardless of the size of the data being transmitted (e.g., high-latency). In these scenarios, asynchronous mechanisms become necessary, though they typically fall outside the primary focus of contemporary FL research. That said, FDA can be modified to work asynchronously (as explained in Section 3.3).
Communication efficient Local-SGD. The work in [31] decomposes each round into two phases. In the first phase, each worker runs Local-SGD with $\tau = I_1$, while the second phase runs $I_2$ steps with $\tau = 1$; [31] proposes to exponentially decay $I_1$ every $M$ rounds. In the heterogeneous setting, the work in [40], by analysing the convergence rate, proposes an increasing sequence of local update steps for strongly-convex local objectives and fixed local update steps for other types of local objectives. The study in [65] dynamically increases batch sizes to reduce communication rounds, maintaining the same convergence rate as SSP-SGD. However, the large-batch approach leads to poor generalization [20], a challenge addressed by the post-local SGD method [32], which divides training into two phases: BSP-SGD followed by Local-SGD with a fixed number of steps. In the Lazily Aggregated Algorithm (LAG) [5], a different approach was taken, using only new gradients from some selected workers and reusing the outdated gradients from the rest, which essentially skips communication rounds.
Federated Averaging (FedAvg) [36] is another representative of communication efficient Local-SGD algorithms, which is a pivotal method in Federated Learning (FL) [23]. In the FL setting with edge computing systems, the work in [59] tries to find the optimal synchronization period $\tau$ subject to local computation and aggregation constraints. Recently [38], in the FL setting with the assumption of strongly-convex objectives, by analysing the balance between fast convergence and higher-round completion rate, a decaying local update step scheme emerged.
Unlike previous approaches that rely on predetermined synchronization schedules (fixed, decaying, or otherwise), our work introduces a dynamic synchronization strategy. FDA adapts continuously during the training process, basing synchronization decisions on a real-time metric: the model variance across workers.
Accelerating convergence. An indirect, yet highly effective way to mitigate the communication burden in DDL is to speed up convergence. Consequently, recent works have built upon communication efficient Local-SGD methods by deploying accelerated versions of SGD to the distributed setting. Specifically, FedAdam [42] extends Adam [26] and FedAvgM [21] extends SGD with momentum (SGD-M) [51]. Recently, Mime [24] provides a framework to adapt arbitrary centralized optimization algorithms to the FL setting. However, these methods still suffer from the model divergence problem, particularly in heterogeneous settings. When solving (1), the disparity between each worker's optimal solution $\mathbf{w}^*_k$ for their objective $F_k$, and the global optimum $\mathbf{w}^*$ for $F$, can potentially cause worker models to diverge (drift) towards their disparate minima [25, 42, 63]. The result is slow and unstable convergence with significant communication overhead. To address this problem, the SCAFFOLD algorithm [25] used control-variates (in the same spirit of SVRG), with significant speed-up. FedProx [45] re-parameterized FedAvg [36] by adding $L_2$ regularization in the workers' objectives to be near the global model. Lastly, FedDyn [2] improved upon these ideas with a dynamic regularizer making sure that if local models converge to a consensus, this consensus point aligns with the stationary point of the global objective function.
While these approaches primarily focus on enhancing the optimization process and typically employ fixed synchronization intervals (e.g., every local epoch), our work addresses a complementary aspect: determining the optimal timing of synchronization. FDA's dynamic synchronization strategy is orthogonal to these optimization techniques and can be integrated with them by simply adjusting the synchronization decision.
Compression. To reduce communication overhead in DDL, significant efforts have been directed towards minimizing message sizes. Key strategies include sparsification, where only crucial components of information are transmitted, as explored in [3], and quantization techniques, which involve transmitting only quantized gradients, as detailed in [47]. These techniques can be combined with Local-SGD methods to enhance communication-efficiency further. An example is Qsparse-local-SGD [4], which integrates aggressive sparsification and quantization with Local-SGD, achieving substantial communication savings. Crucially, FDA is fully compatible with any technique that reduces the cost of synchronization (e.g., model compression). Our approach simply adjusts the timing of the synchronization decision without altering the data being synchronized. This ensures that any compression technique effective in traditional methods (BSP, Local-SGD, etc.) will be equally effective when deployed with FDA. Therefore, the communication savings demonstrated in the relevant literature [61] can be safely expected to carry over to our approach as well.
Additionally, sketching emerges as another fundamental tool in large-scale machine learning. It effectively compresses high-dimensional problems into lower dimensions to save runtime and memory, typically utilizing hash-based probabilistic data structures. For instance, [49] use Count Sketches to compress auxiliary variables in optimization algorithms, significantly freeing up memory. Similarly, FetchSGD [43] employs Count Sketches to compress model updates and leverages their linearity for efficient merging. In contrast to these applications, our approach utilizes sketches not for compression but to estimate local state information, and based on this to decide whether a synchronization is required, which is an application orthogonal to traditional use cases. A comprehensive survey of compression techniques in DDL can be found in [61].
3 FEDERATED DYNAMIC AVERAGING
We now present our algorithms, based on our notion of Federated Dynamic Averaging (FDA). Our algorithms deviate from prior work in these two key ways:
(1) The decision on when to synchronize.
(2) The actual synchronization process.
To the best of our knowledge, this is the first Distributed Deep Learning algorithm that dynamically decides when to synchronize based on the current collective state of the training progress, that is, whether it is advancing well or poorly.
Notation. At each time step $t$, each worker $k$ independently maintains its own vector of model parameters^1, denoted as $\mathbf{w}^{(k)}_t \in \mathbb{R}^d$. Let $\mathbf{w}_t$ represent the $K \times d$ tensor of all local model vectors, and $\overline{\mathbf{w}}_t$ be the average model vector (this notation applies to all vector quantities):
\[
\mathbf{w}_t = \left[ \mathbf{w}^{(1)}_t, \ldots, \mathbf{w}^{(K)}_t \right], \qquad \overline{\mathbf{w}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{w}^{(k)}_t
\]
^1 The terms "model" and "model parameters" are used interchangeably, as is common in the literature.
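Table 1: Notation
Symbol | Meaning
$\langle \cdot, \cdot \rangle$ | Dot product
$t$ | Time step index
$K$ | Number of workers
$d$ | Model dimension
$\mathcal{D}_k$ | Training data of worker $k$
$\mathcal{B}^{(k)}_t$ | A batch sampled from $\mathcal{D}_k$
$\mathbf{w}^{(k)}_t \in \mathbb{R}^d$ | Model of worker $k$
$\mathbf{w}_t = [\mathbf{w}^{(1)}_t, \ldots, \mathbf{w}^{(K)}_t]$ | Tensor of local models
$\overline{\mathbf{w}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{w}^{(k)}_t$ | Average model (global model)
$\overline{\mathbf{w}}_{t_0}$ | Model after most recent sync.
$\overline{\mathbf{w}}_{t_{-1}}$ | Model after 2nd most recent sync.
$\mathbf{u}^{(k)}_t = \mathbf{w}^{(k)}_t - \overline{\mathbf{w}}_{t_0}$ | Local model drift
$\overline{\mathbf{u}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}^{(k)}_t$ | Average model drift (global drift)
$\mathrm{Var}(\mathbf{w}_t)$ | Model variance
$\Theta$ | Model variance threshold
$\mathbf{S}^{(k)}_t$ | State of worker $k$
$\overline{\mathbf{S}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{S}^{(k)}_t$ | Average state
$H(\cdot)$ | Function for variance estimation
$\mathrm{sk}(\cdot) : \mathbb{R}^d \to \mathbb{R}^{l \times m}$ | AMS sketch operator (§3.1)
$\mathcal{M}_2(\cdot) : \mathbb{R}^{l \times m} \to \mathbb{R}$ | $L_2$ norm squared estimate (§3.1)
$\epsilon$ | Error of sketch estimate (§3.1)
$(1 - \delta)$ | Confidence of approximation (§3.1)
$l = O(\log 1/\delta)$ | #Rows of sketch matrix (§3.1)
$m = O(1/\epsilon^2)$ | #Columns of sketch matrix (§3.1)
$\xi = \dfrac{\overline{\mathbf{w}}_{t_0} - \overline{\mathbf{w}}_{t_{-1}}}{\| \overline{\mathbf{w}}_{t_0} - \overline{\mathbf{w}}_{t_{-1}} \|_2}$ | Heuristic vec. for LinearFDA (§3.2)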
Furthermore, let $\mathrm{Optimize}(\mathbf{w}, \mathcal{B})$ be the updated model [16] computed by some optimization algorithm (e.g., SGD, Adam) using the model $\mathbf{w}$ and the batch $\mathcal{B}$ of training data. It incorporates the learning rate, loss function and relevant gradients. During step $t$, each worker $k$ first applies the update:
\[
\mathbf{w}^{(k)}_t = \mathrm{Optimize}\left( \mathbf{w}^{(k)}_{t-1}, \mathcal{B}^{(k)}_t \right)
\]
Moreover, operation $\mathrm{AllReduce}(\mathbf{w}^{(k)}_t)$ computes and returns the average model vector [30]:
\[
\overline{\mathbf{w}}_t = \mathrm{AllReduce}\left( \mathbf{w}^{(k)}_t \right)
\]
Workers synchronize by executing $\mathrm{AllReduce}(\mathbf{w}^{(k)}_t)$, thereby setting $\mathbf{w}^{(k)}_t := \overline{\mathbf{w}}_t$. If synchronization is not performed at step $t$, each worker continues training with its locally updated model. A comprehensive list of the notation used throughout this section is provided in Table 1.
Model Variance and FDA. The model variance quantifies the dispersion or spread of worker models around the average model:
\[
\mathrm{Var}(\mathbf{w}_t) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{w}^{(k)}_t - \overline{\mathbf{w}}_t \right\|_2^2 \tag{2}
\]
This measure provides insight into how closely aligned the workers' models are at any given time. High variance indicates that the models are widely spread out, essentially drifting apart, leading to a lack of cohesion in the aggregated model. Conversely, a moderate or low variance suggests that the workers' models are closely aligned, working collectively towards the shared objective.
The FDA algorithm (Algorithm 1) is based on the premise that, as long as the variance is below a threshold $\Theta$, synchronization is not needed. Thus, we introduce the Round Invariant (RI):
\[
\mathrm{Var}(\mathbf{w}_t) \leq \Theta \tag{3}
\]
To preserve the RI, our FDA algorithm maintains (Lines 4-6 of Algorithm 1) at each worker $k$ a local (low-dimensional) state vector $\mathbf{S}^{(k)}_t$, which is computed based on $\mathbf{w}^{(k)}_t$. These state vectors are vital for the subsequent estimation of the model variance, and underpin the two variants of the FDA algorithm (provided in Sections 3.1 and 3.2, respectively). Our estimation techniques begin by performing AllReduce on the states $\mathbf{S}^{(k)}_t$, consolidating them into the average state $\overline{\mathbf{S}}_t$ (Line 7). Importantly, this communication step requires significantly less bandwidth and resources than transmitting the full models $\mathbf{w}^{(k)}_t$.
For each FDA variant, we also define a (different) function $H(\overline{\mathbf{S}}_t)$ that overestimates the variance, i.e., it ensures that as long as $H(\overline{\mathbf{S}}_t) \leq \Theta$ then the variance is bounded by $\Theta$. This guarantee is probabilistic for the Sketch-based variant of FDA, and deterministic for its Linear counterpart. Consequently, if $H(\overline{\mathbf{S}}_t) > \Theta$ then synchronization is performed (Lines 8-9), since the RI can no longer be guaranteed. After synchronization, the model variance is zero.
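Efficiently Monitoring the RI. Estimating model variance efficiently is at the heart of FDA. To this end, we first introduce the local model drift, $\mathbf{u}^{(k)}_t$, and average drift, $\overline{\mathbf{u}}_t$, defined as follows:
\[
\mathbf{u}^{(k)}_t = \mathbf{w}^{(k)}_t - \overline{\mathbf{w}}_{t_0}, \qquad \overline{\mathbf{u}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}^{(k)}_t
\]
Here, $\overline{\mathbf{w}}_{t_0}$ denotes the model vector after the most recent synchronization. Subsequently, the model variance can be written as:
\[
\mathrm{Var}(\mathbf{w}_t) = \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 \right) - \left\| \overline{\mathbf{u}}_t \right\|_2^2 \tag{4}
\]
Proof. Adding an offset ($-\overline{\mathbf{w}}_{t_0}$) to each $\mathbf{w}^{(k)}_t$ does not alter the variance, therefore:
\begin{align*}
\mathrm{Var}(\mathbf{w}_t) &= \mathrm{Var}(\mathbf{w}_t - \overline{\mathbf{w}}_{t_0}) = \mathrm{Var}(\mathbf{u}_t) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t - \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \frac{1}{K} \sum_{k=1}^{K} \left( \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - 2 \left\langle \mathbf{u}^{(k)}_t, \overline{\mathbf{u}}_t \right\rangle + \left\| \overline{\mathbf{u}}_t \right\|_2^2 \right) \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 \right) - 2 \left( \frac{1}{K} \sum_{k=1}^{K} \left\langle \mathbf{u}^{(k)}_t, \overline{\mathbf{u}}_t \right\rangle \right) + \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \overline{\mathbf{u}}_t \right\|_2^2 \right) \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 \right) - 2 \left\langle \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}^{(k)}_t, \overline{\mathbf{u}}_t \right\rangle + \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 \right) - 2 \left\langle \overline{\mathbf{u}}_t, \overline{\mathbf{u}}_t \right\rangle + \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 \right) - 2 \left\| \overline{\mathbf{u}}_t \right\|_2^2 + \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \left( \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 \right) - \left\| \overline{\mathbf{u}}_t \right\|_2^2
\end{align*}
□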
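The drift-based form of the variance in Eq. (4) can be checked numerically. The sketch below is only an illustration under assumed toy data: it draws a few random "models" around a hypothetical last-synchronized model `w_sync` and compares the direct definition of Eq. (2) with the drift form. The helper names are invented for this example.

```python
import random


def sq_norm(v):
    return sum(x * x for x in v)


def mean_vector(vectors):
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]


def variance_direct(models):
    """Eq. (2): average squared distance of the local models from their mean."""
    w_bar = mean_vector(models)
    return sum(sq_norm([a - b for a, b in zip(w, w_bar)])
               for w in models) / len(models)


def variance_from_drifts(models, w_sync):
    """Eq. (4): mean squared drift norm minus squared norm of the mean drift."""
    drifts = [[a - b for a, b in zip(w, w_sync)] for w in models]
    u_bar = mean_vector(drifts)
    return sum(sq_norm(u) for u in drifts) / len(drifts) - sq_norm(u_bar)


rng = random.Random(0)
K, d = 4, 10                                         # toy sizes (assumed)
w_sync = [rng.gauss(0, 1) for _ in range(d)]         # model after the last sync
models = [[w + rng.gauss(0, 0.1) for w in w_sync] for _ in range(K)]

v_direct = variance_direct(models)
v_drift = variance_from_drifts(models, w_sync)
```

Both values agree up to floating-point rounding, whatever offset vector is subtracted, which is exactly the property the proof relies on.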
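Algorithm 1 Federated Dynamic Averaging - FDA
Require: $K$: The number of workers indexed by $k$
Require: $\Theta$: The model variance threshold
Require: $b$: The local mini-batch size
1: Initialize $\mathbf{w}^{(k)}_0 = \mathbf{w}_0 \in \mathbb{R}^d$
2: for each step $t = 1, 2, \ldots$ do
3:     for each worker $k = 1, \ldots, K$ in parallel do
4:         $\mathcal{B}^{(k)}_t \leftarrow$ (sample a batch of size $b$ from $\mathcal{D}_k$)
5:         $\mathbf{w}^{(k)}_t \leftarrow \mathrm{Optimize}(\mathbf{w}^{(k)}_{t-1}, \mathcal{B}^{(k)}_t)$
6:         Update $\mathbf{S}^{(k)}_t$
7:     $\overline{\mathbf{S}}_t \leftarrow \mathrm{AllReduce}(\mathbf{S}^{(k)}_t)$
8:     if $H(\overline{\mathbf{S}}_t) > \Theta$ then
9:         $\mathbf{w}^{(k)}_t \leftarrow \mathrm{AllReduce}(\mathbf{w}^{(k)}_t)$    ⊲ In-place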
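The control flow of Algorithm 1 can be sketched as a toy, single-process simulation. Everything problem-specific here is an assumption made for illustration: worker objectives are quadratics with hypothetical optima `targets`, Optimize is a noisy gradient step, AllReduce is a plain average, and $H$ is an "oracle" that evaluates the exact variance of Eq. (4). The oracle needs the full drifts, so it is not communication-efficient; the actual FDA variants replace it with a small local state.

```python
import random


def sq_norm(v):
    return sum(x * x for x in v)


def all_reduce_mean(vectors):
    """Stand-in for AllReduce: returns the element-wise average."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]


def run_fda(targets, theta, steps, lr=0.05, noise=0.05, seed=0):
    rng = random.Random(seed)
    K, d = len(targets), len(targets[0])
    w_sync = [0.0] * d                            # model after the last sync
    models = [list(w_sync) for _ in range(K)]     # line 1: common initialization
    syncs, max_var = 0, 0.0
    for _ in range(steps):                        # line 2
        for k in range(K):                        # lines 3-5: noisy gradient step
            models[k] = [w - lr * ((w - c) + rng.gauss(0, noise))
                         for w, c in zip(models[k], targets[k])]
        # Lines 6-7 with an oracle state: the full drift is averaged, so h is
        # the exact variance of Eq. (4) rather than an estimate.
        drifts = [[a - b for a, b in zip(w, w_sync)] for w in models]
        u_bar = all_reduce_mean(drifts)
        h = sum(sq_norm(u) for u in drifts) / K - sq_norm(u_bar)
        if h > theta:                             # lines 8-9: synchronize
            w_sync = all_reduce_mean(models)
            models = [list(w_sync) for _ in range(K)]
            syncs += 1
            h = 0.0                               # variance is zero after a sync
        max_var = max(max_var, h)
    return syncs, max_var


targets = [[1.0, -1.0], [-1.0, 1.0], [2.0, 0.0]]  # hypothetical worker optima
syncs, max_var = run_fda(targets, theta=0.05, steps=200)
```

In this toy run the models are synchronized in only a fraction of the steps, while the variance recorded at the end of every step never exceeds the threshold, which is the Round Invariant.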
Conceptually, following Eq. (4), to precisely monitor the variance, we need to calculate two quantities: (1) $\frac{1}{K} \sum_{k=1}^{K} \| \mathbf{u}^{(k)}_t \|_2^2$, and (2) $\| \overline{\mathbf{u}}_t \|_2^2$. The first quantity requires an AllReduce operation on the squared norm of the worker drifts, which involves minimal overhead since these values are scalar. In contrast, the second quantity necessitates an AllReduce operation on the worker drifts themselves, which are of model dimension, thus incurring a high communication cost. In fact, this operation is equivalent to synchronization, which is exactly what we aim to avoid in the first place. Thus, it becomes evident that communication-efficient model variance estimation hinges on estimating $\| \overline{\mathbf{u}}_t \|_2^2$ efficiently.
Upcoming sections will detail two techniques for communication-efficient variance estimation (which primarily involves estimating $\| \overline{\mathbf{u}}_t \|_2^2$): SketchFDA and LinearFDA. To present them uniformly, we introduce the local state $\mathbf{S}^{(k)}_t$, a tensor which contains: (1) the scalar value $\| \mathbf{u}^{(k)}_t \|_2^2$ for precisely calculating the first quantity, and (2) a low-dimensional summary of $\mathbf{u}^{(k)}_t$, different for each technique, for estimating the second quantity. For each technique we define an estimation function $H(\cdot)$ that calculates the current variance estimate from the average state $\overline{\mathbf{S}}_t = \frac{1}{K} \sum_{k=1}^{K} \mathbf{S}^{(k)}_t$ (obtained via AllReduce).
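3.1 SketchFDA: Sketch-based Estimation
An optimal estimator for $\| \overline{\mathbf{u}}_t \|_2^2$ can be obtained through the utilization and properties of AMS sketches, as detailed in [8]. An AMS sketch of a vector $\mathbf{v} \in \mathbb{R}^d$ is an $l \times m$ real matrix:
\[
\mathrm{sk}(\mathbf{v}) = \left[ \psi_1 \; \psi_2 \; \ldots \; \psi_l \right]^{\top} \in \mathbb{R}^{l \times m}, \qquad l \cdot m \ll d
\]
An estimate of the squared norm $\| \mathbf{v} \|_2^2$ is provided by the formula
\[
\mathcal{M}_2(\mathrm{sk}(\mathbf{v})) = \operatorname{median} \left\{ \| \psi_i \|_2^2, \; i = 1, \ldots, l \right\}
\]
The quality of estimation depends on the size of the sketch. For chosen $\epsilon, \delta > 0$, where sketch dimensions are given by $l = O(\log 1/\delta)$ and $m = O(1/\epsilon^2)$, we have the following probabilistic guarantee: with confidence at least $1 - \delta$,
\[
\mathcal{M}_2(\mathrm{sk}(\mathbf{v})) \in (1 \pm \epsilon) \| \mathbf{v} \|_2^2
\]
Notably, observe that the accuracy ($\epsilon$) and confidence ($1 - \delta$) only depend on the size of the sketch and not on the dimensionality of vector $\mathbf{v}$.
Two crucial properties of the AMS sketch are that (a) it is a linear transformation, i.e., for $\alpha_1, \alpha_2 \in \mathbb{R}$ and $\mathbf{v}_1, \mathbf{v}_2 \in \mathbb{R}^d$,
\[
\mathrm{sk}(\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2) = \alpha_1 \mathrm{sk}(\mathbf{v}_1) + \alpha_2 \mathrm{sk}(\mathbf{v}_2)
\]
and (b) it can be computed efficiently in time $O(l \cdot d)$.
In the SketchFDA approach, the salient idea is to employ AMS sketches $\mathrm{sk}(\mathbf{u}^{(k)}_t) \in \mathbb{R}^{l \times m}$ as a low-dimensional representation of the local drifts $\mathbf{u}^{(k)}_t$.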
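Theorem 3.1. Let $l = O(\log \frac{1}{\delta})$ and $m = O(\frac{1}{\epsilon^2})$. Define the local state as
\[
\mathbf{S}^{(k)}_t = \left( \left\| \mathbf{u}^{(k)}_t \right\|_2^2, \; \mathrm{sk}\left( \mathbf{u}^{(k)}_t \right) \right) \in \mathbb{R} \times \mathbb{R}^{l \times m}
\]
and the approximation function as
\[
H(\overline{\mathbf{S}}_t) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \frac{1}{1 + \epsilon} \mathcal{M}_2 \left( \frac{1}{K} \sum_{k=1}^{K} \mathrm{sk}\left( \mathbf{u}^{(k)}_t \right) \right).
\]
Then, the condition $H(\overline{\mathbf{S}}_t) \leq \Theta$ implies $\mathrm{Var}(\mathbf{w}_t) \leq \Theta$ with probability at least $(1 - \delta)$.
Proof.
\begin{align*}
H(\overline{\mathbf{S}}_t) &= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \frac{1}{1 + \epsilon} \mathcal{M}_2 \left( \frac{1}{K} \sum_{k=1}^{K} \mathrm{sk}\left( \mathbf{u}^{(k)}_t \right) \right) \\
&\overset{\text{(lin.)}}{=} \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \frac{1}{1 + \epsilon} \mathcal{M}_2 \left( \mathrm{sk}\left( \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}^{(k)}_t \right) \right) \\
&= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \frac{1}{1 + \epsilon} \mathcal{M}_2 \left( \mathrm{sk}\left( \overline{\mathbf{u}}_t \right) \right) \\
&\overset{(\epsilon\text{-err.})}{\geq} \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \left\| \overline{\mathbf{u}}_t \right\|_2^2 \quad \text{with prob. at least } (1 - \delta) \\
&= \mathrm{Var}(\mathbf{w}_t)
\end{align*}
The inequality uses the upper side of the sketch guarantee, $\mathcal{M}_2(\mathrm{sk}(\overline{\mathbf{u}}_t)) \leq (1 + \epsilon) \| \overline{\mathbf{u}}_t \|_2^2$. We proved that $H(\overline{\mathbf{S}}_t) \geq \mathrm{Var}(\mathbf{w}_t)$ with probability at least $(1 - \delta)$, i.e., we overestimate the model variance with probability at least $(1 - \delta)$, completing the proof. □
In Section 3.3, we discuss the empirical basis for choosing the values of $l$ and $m$, and how they practically impact the quality of the sketch approximation.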
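For intuition, the sketch operator, the median estimator, and the resulting variance estimate can be written in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the class name `AmsSketch` and the helper `average_sketches` are hypothetical, fully random bucket and sign tables stand in for the limited-independence hash functions used in the sketching literature, and all workers are assumed to share the same seed so that they apply the same linear operator.

```python
import random
import statistics


class AmsSketch:
    """Sketch operator sk(.) mapping R^d to an l x m matrix (list of rows)."""

    def __init__(self, d, l=5, m=250, seed=0):
        rng = random.Random(seed)     # shared seed => same operator everywhere
        self.l, self.m = l, m
        # Assumption: fully random tables instead of hash functions.
        self.bucket = [[rng.randrange(m) for _ in range(d)] for _ in range(l)]
        self.sign = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(l)]

    def sk(self, v):
        rows = [[0.0] * self.m for _ in range(self.l)]
        for i in range(self.l):       # O(l * d) work in total
            for j, x in enumerate(v):
                rows[i][self.bucket[i][j]] += self.sign[i][j] * x
        return rows

    @staticmethod
    def m2(sketch):
        """Median over rows of the squared row norm: estimate of ||v||_2^2."""
        return statistics.median(sum(x * x for x in row) for row in sketch)


def average_sketches(sketches):
    """Element-wise mean, i.e. what AllReduce would return for the sketches."""
    k = len(sketches)
    return [[sum(vals) / k for vals in zip(*rows)] for rows in zip(*sketches)]


rng = random.Random(1)
K, d, eps = 4, 2000, 0.06             # toy sizes; eps is an assumed error bound
drifts = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
ams = AmsSketch(d)

# Linearity: the average of the sketches equals the sketch of the average drift.
u_bar = [sum(col) / K for col in zip(*drifts)]
avg_sketch = average_sketches([ams.sk(u) for u in drifts])
sketch_of_avg = ams.sk(u_bar)

true_sq = sum(x * x for x in u_bar)   # exact ||u_bar||^2, for comparison only
est_sq = AmsSketch.m2(avg_sketch)     # estimate available from the sketches

mean_sq_drift = sum(sum(x * x for x in u) for u in drifts) / K
h = mean_sq_drift - est_sq / (1 + eps)    # H as defined in Theorem 3.1
variance = mean_sq_drift - true_sq        # Eq. (4)
```

Only `avg_sketch` and `mean_sq_drift` would need to be communicated; `true_sq` and `variance` are computed here solely to compare the estimate against the exact values. Because the guarantee is probabilistic, `h` is not guaranteed to exceed `variance` in every run.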
3.2 LinearFDA: Linear Approximation
Although AMS sketches provide good estimates of variance, their dimension is in the several hundreds, and the communication cost of AllReduce on sketches, performed at each step, may be non-negligible. Therefore, we also introduce a low-cost, ad-hoc estimation variant.
In this approach, instead of an AMS sketch, each local state contains the scalar value $\langle \xi, \mathbf{u}^{(k)}_t \rangle \in \mathbb{R}$, where $\xi \in \mathbb{R}^d$ is a unit vector, known to all workers.
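Theorem 3.2. Define the local state as
\[
\mathbf{S}^{(k)}_t = \left( \left\| \mathbf{u}^{(k)}_t \right\|_2^2, \; \left\langle \xi, \mathbf{u}^{(k)}_t \right\rangle \right) \in \mathbb{R} \times \mathbb{R}, \qquad \| \xi \|_2 = 1
\]
and the approximation function as
\[
H(\overline{\mathbf{S}}_t) = \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \left| \frac{1}{K} \sum_{k=1}^{K} \left\langle \xi, \mathbf{u}^{(k)}_t \right\rangle \right|^2
\]
Then, the condition $H(\overline{\mathbf{S}}_t) \leq \Theta$ implies $\mathrm{Var}(\mathbf{w}_t) \leq \Theta$.
Proof.
\begin{align*}
H(\overline{\mathbf{S}}_t) &= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \left| \frac{1}{K} \sum_{k=1}^{K} \left\langle \xi, \mathbf{u}^{(k)}_t \right\rangle \right|^2 \\
&= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \left| \left\langle \xi, \frac{1}{K} \sum_{k=1}^{K} \mathbf{u}^{(k)}_t \right\rangle \right|^2 \\
&= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \left| \left\langle \xi, \overline{\mathbf{u}}_t \right\rangle \right|^2 \\
&\geq \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \| \xi \|_2^2 \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \frac{1}{K} \sum_{k=1}^{K} \left\| \mathbf{u}^{(k)}_t \right\|_2^2 - \left\| \overline{\mathbf{u}}_t \right\|_2^2 \\
&= \mathrm{Var}(\mathbf{w}_t)
\end{align*}
The inequality is the Cauchy-Schwarz inequality, and the step after it uses $\| \xi \|_2 = 1$. We proved that $H(\overline{\mathbf{S}}_t) \geq \mathrm{Var}(\mathbf{w}_t)$, i.e., we always overestimate the model variance, completing the proof. □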
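An arbitrary choice of $\xi$ (e.g., a random vector) is likely to estimate $\| \overline{\mathbf{u}}_t \|_2^2$ poorly; if $\xi$ is uncorrelated to $\overline{\mathbf{u}}_t$, then $| \langle \xi, \overline{\mathbf{u}}_t \rangle |^2$ will likely be close to zero. A heuristic choice that might be correlated to $\overline{\mathbf{u}}_t$ is the (normalized) value of $\overline{\mathbf{u}}_{t_0}$, the global drift vector right at the time of the last synchronization. All nodes can compute it independently without extra communication, if they take the difference of the models of the last two synchronizations:
\[
\xi = \frac{\overline{\mathbf{u}}_{t_0}}{\left\| \overline{\mathbf{u}}_{t_0} \right\|_2} = \frac{\overline{\mathbf{w}}_{t_0} - \overline{\mathbf{w}}_{t_{-1}}}{\left\| \overline{\mathbf{w}}_{t_0} - \overline{\mathbf{w}}_{t_{-1}} \right\|_2}
\]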
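The LinearFDA state and estimator are small enough to sketch directly. The code below is an illustration under assumed toy data rather than the authors' implementation; the function names (`heuristic_xi`, `local_state`, `h_linear`) and the random vectors standing in for models and drifts are hypothetical.

```python
import math
import random


def sq_norm(v):
    return sum(x * x for x in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def heuristic_xi(w_sync, w_prev_sync):
    """Unit vector along the difference of the last two synchronized models."""
    diff = [a - b for a, b in zip(w_sync, w_prev_sync)]
    norm = math.sqrt(sq_norm(diff))
    return [x / norm for x in diff]


def local_state(drift, xi):
    """Two scalars per worker: (||u||^2, <xi, u>)."""
    return (sq_norm(drift), dot(xi, drift))


def h_linear(states):
    """H from Theorem 3.2, computed from the averaged two-scalar state."""
    k = len(states)
    mean_sq = sum(s[0] for s in states) / k
    mean_proj = sum(s[1] for s in states) / k
    return mean_sq - mean_proj ** 2


rng = random.Random(0)
K, d = 4, 50                                            # toy sizes (assumed)
w_prev_sync = [rng.gauss(0, 1) for _ in range(d)]       # 2nd most recent sync
w_sync = [rng.gauss(0, 1) for _ in range(d)]            # most recent sync
xi = heuristic_xi(w_sync, w_prev_sync)

drifts = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(K)]
states = [local_state(u, xi) for u in drifts]           # what workers would send

u_bar = [sum(col) / K for col in zip(*drifts)]
variance = sum(sq_norm(u) for u in drifts) / K - sq_norm(u_bar)   # Eq. (4)
h = h_linear(states)
```

Each worker contributes only two numbers, and `h` can never fall below `variance`. With drifts that are unrelated to `xi`, as in this random toy data, the projection term is small and `h` stays close to the mean squared drift, which illustrates why the choice of `xi` matters.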
3.3 Discussion
FDA: Intuition. The main intuition of FDA is summarized in making the decision to synchronize dynamic, based on model variance during training. This metric is designed to capture the collective state of the training process. In what follows, we provide intuition on why this is the case. It is important to remember that the global model $\overline{\mathbf{w}}_t$ and, by extension, the global drift $\overline{\mathbf{u}}_t$, are ultimately what we care about and evaluate.
Model variance, as defined in Equation (4), is the difference between the average of the squared local drifts $\frac{1}{K} \sum \| \mathbf{u}^{(k)}_t \|_2^2$ and the squared global drift $\| \overline{\mathbf{u}}_t \|_2^2$. The first term reflects how far the individual worker models have moved, essentially how much each worker has learned. The second term indicates how much of this learning is retained in the global model after aggregation.
The interplay between these two quantities is crucial. For example, when the local drifts are high but the global drift is low, the variance increases, signaling the need for synchronization. This scenario suggests that while individual workers have made significant progress (as indicated by high local drifts), this progress is not being effectively captured in the global model (indicated by the low global drift). In other words, the worker models have moved significantly, but the global model has remained relatively stationary in this high-dimensional space. This misalignment indicates that training is no longer progressing optimally, as the workers are moving towards disparate and conflicting local minima, making it crucial to synchronize and realign them. Conversely, when both the local and global drifts are either low or high, synchronization is not necessary, and the variance naturally remains low.
[Figure 2 labels: Local Drift, Local State, LinearFDA, SketchFDA.]
Figure 2: SketchFDA & LinearFDA: Local State structure.
Neither the average of the local drifts nor the global drift alone provides a complete picture of the collective training progress. Relying solely on one or the other would lead to suboptimal synchronization decisions and likely prove ineffective. In FDA, it is the relationship between these quantities, as captured by the model variance, that offers valuable insights and guides the crucial decision of when to synchronize.
SketchFDA vs. LinearFDA: Both methods send the squared norm of the drift $\| \mathbf{u}^{(k)}_t \|_2^2$, but differ in the additional accompanying lower-dimensional representation they transmit (Figure 2):
(1) SketchFDA: An AMS sketch of the local drift.
(2) LinearFDA: The dot product of a vector and the local drift.
The key difference between these two variants lies in the fidelity of approximation of the model variance. While both methods conservatively overestimate the variance, SketchFDA provides a provably accurate estimation, which is expected to lead to fewer synchronizations. LinearFDA requires less computational effort and bandwidth to create and communicate the local states, but may overestimate variance by too much, causing unnecessary synchronizations.
SketchFDA: Choice of $l$ and $m$. We empirically measured the approximation achieved with sketch dimensions of $l = 5$ rows and $m = 250$ columns (as defined in Section 3.1): these settings yield an error bound of $\epsilon \approx 6\%$ and a probabilistic confidence of $(1 - \delta) \approx 95\%$. Based on these measurements, we adopted these values in our experiments and recommend them. Using these values, the byte-size of a sketch is $l \cdot m \cdot 4$ bytes $= 5$ kB, significantly smaller than the size of all our models. Sketches of smaller size could be used, albeit weakening the approximation of the variance. However, given that LinearFDA similarly weakens approximation and avoids using AMS sketches, in the interest of space we do not explore varying AMS sketch sizes in this paper.
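The size comparison can be reproduced with a few lines of arithmetic. The snippet below is only a back-of-the-envelope sketch: it assumes that model parameters, like sketch entries, occupy 4 bytes each, and it uses the approximate parameter counts quoted in the experimental setup (Section 4.1).

```python
BYTES_PER_VALUE = 4                 # 4 bytes per sketch entry, as in the text

sketch_rows, sketch_cols = 5, 250   # l and m from the text
sketch_bytes = sketch_rows * sketch_cols * BYTES_PER_VALUE

# Approximate parameter counts quoted in the experimental setup.
model_params = {
    "LeNet-5": 62_000,
    "VGG16*": 2_600_000,
    "DenseNet121": 6_900_000,
    "DenseNet201": 18_000_000,
}

for name, params in model_params.items():
    model_bytes = params * BYTES_PER_VALUE    # assumes 4 bytes per parameter
    ratio = model_bytes / sketch_bytes
    print(f"{name}: model is about {ratio:,.0f}x the size of one sketch")
```

Under these assumptions even the smallest model in the study is tens of times larger than a sketch, and the largest is several thousand times larger, so exchanging a sketch at every step costs little compared with a single full synchronization.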
FDA: Asynchronous Operation. As mentioned in Section 2, FDA can be readily modified to operate asynchronously. In this setup, one worker-node acts as a coordinator, aggregating local states and determining whether synchronization is needed each time a local state is received. This decision is based on the most recent local states from all workers. It is important to note that, since local states are small in size, asynchronous operation is unlikely to alleviate bandwidth issues. The primary advantage is that it allows training to continue even in the presence of stragglers. Asynchronous operation might also be beneficial in rare cases where the overhead of initializing communication dominates the actual transmission time.
4 EXPERIMENTS
4.1 Setup
Table 2 provides a comprehensive overview of our experiments. For each experiment, we detail the Neural Network (NN) architecture, its parameter count ($d$), and the dataset used for training. The table also specifies key hyper-parameters: the batch size ($b$), the number of workers ($K$), and the FDA-specific variance threshold ($\Theta$). Additionally, we indicate the chosen optimizer (as detailed in Section 3) and the training algorithms employed for each configuration.
Platform. We employ TensorFlow [1], integrated with Keras [7],
as the platform for conducting our experiments. We used TensorFlow
to implement our FDA variants and all competitive algorithms.
All relevant code, figures, and data for this study are
available at https://github.com/miketheologitis/FedL-Sync-FDA.
Hardware & Infrastructure. We conducted our experiments
on the ARIS High Performance Computing (HPC) environment²,
utilizing a cluster of 44 GPU-accelerated worker-nodes. Each
worker is equipped with two NVIDIA Tesla K40m GPUs and
interconnected via an InfiniBand FDR14 network, providing up to
56 Gb/s of bandwidth. Crucially, our evaluation remains agnostic
to the underlying infrastructure or the specific workers.
Datasets & Models. The core experiments involve training
Convolutional Neural Networks (CNNs) of varying sizes and complexities
on two datasets: MNIST [12] and CIFAR-10 [27]. For
the MNIST dataset, we employ LeNet-5 [29], composed of approximately
62 thousand parameters, and a modified version of
VGG16 [48], denoted as VGG16*, consisting of 2.6 million parameters.
VGG16* was specifically adapted for the MNIST dataset,
a less demanding learning problem compared to ImageNet [44],
for which VGG16 was designed. In VGG16*, we omitted the
512-channel convolutional blocks and downscaled the final two fully
connected (FC) layers from 4096 to 512 units each. Both models
use Glorot uniform initialization [15]. For CIFAR-10, we utilize
DenseNet121 and DenseNet201 [22], as implemented in Keras [7],
with the addition of dropout regularization layers at rate 0.2 and
weight decay of 10⁻⁴, as prescribed in [22]. The DenseNet121 and
DenseNet201 models have 6.9 million and 18 million parameters,
respectively, and are both initialized with He normal [19].
Lastly, we explore a transfer learning scenario on the dataset
CIFAR-100 [27], a choice reflecting the DL community's growing
preference for using pre-trained models in such downstream
tasks [18]. For example, a pre-trained visual transformer (ViT) on
ImageNet, transferred to classify CIFAR-100, is currently on par
with the state-of-the-art results for this task [13]. We adopt this
exact transfer learning scenario, leveraging the more powerful
ConvNeXtLarge model, pre-trained on ImageNet, with 198 million
parameters [7, 33]. Following the feature extraction step [16],
the testing accuracy on CIFAR-100 stands at 60%. Subsequently,
we employ and evaluate our FDA algorithms in the arduous
fine-tuning stage, where the entirety of the model is trained [39].
Algorithms. We consider five distributed deep learning algorithms:
LinearFDA, SketchFDA, Synchronous³, FedAdam [42],
and FedAvgM [21]; the first three are standard in all experiments.
Depending on the local optimizer, Adam [26] or SGD
with Nesterov momentum (SGD-NM) [52], we also include their
communication-efficient federated counterparts FedAdam or
FedAvgM, respectively.

² https://www.hpc.grnet.gr/en/hardware-2/
³ The name was derived from the Bulk Synchronous Parallel approach; Synchronous
can be understood as a special case of the FDA Algorithm 1 where Θ is set to zero.
Evaluation Methodology. Comparing DDL algorithms is not
straightforward. For example, comparing DDL algorithms based
on the average cost of a training epoch can be misleading, as it
does not consider the effects on the trained model's quality. To
achieve a comprehensive performance assessment of FDA, we
define a training run as the process of executing the DDL algorithm
under evaluation (a) on a specific DL model and training
dataset, and (b) until a final epoch in which the trained model
achieves a specific testing accuracy (termed Accuracy Target in
figures). Based on this definition, we focus on two performance
metrics:
(1) Communication cost, which is the total data (in bytes)
    transmitted by all workers. Notably, communication cost
    is unaffected by the training data volume, since only model
    updates (when synchronizing) and local states (at each step),
    but not training data, are transmitted. Thus, the communication
    cost mainly depends on the complexity (number of
    parameters) of the model used. Translating the communication
    cost to wall-clock time (i.e., the total time required for
    the computation and communication of the DDL) depends
    on the network infrastructure connecting the workers and
    on the overhead of establishing and initializing communication.
    Its impact is larger in FL scenarios, where workers
    often use slower Wi-Fi connections.
(2) Computation cost, which is the number of mini-batch
    steps (termed In-Parallel Learning Steps in figures) performed
    by each worker. Translating this cost to wall-clock
    time is determined by the mini-batch size and the computational
    resources of the worker-nodes. Its impact is larger
    for workers with lower computational resources.
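To make the two metrics concrete, the sketch below tallies them for one
training run. The accounting is an assumption made for illustration: every
synchronization is charged as each of the 𝐾 workers transmitting its 𝑑
parameters once at 4 bytes per parameter, and every step as each worker
transmitting one local state of `state_bytes` bytes. The `RunCost` type, the
function name, and the example numbers are hypothetical and are not results
from the paper.

```python
from dataclasses import dataclass


@dataclass
class RunCost:
    communication_bytes: int  # total bytes transmitted by all workers
    computation_steps: int    # mini-batch steps performed by each worker


def run_cost(num_workers: int, num_params: int, steps: int,
             syncs: int, state_bytes: int, bytes_per_param: int = 4) -> RunCost:
    """Tally both costs for one training run under the stated assumptions."""
    model_traffic = syncs * num_workers * num_params * bytes_per_param
    state_traffic = steps * num_workers * state_bytes
    return RunCost(model_traffic + state_traffic, steps)


# Synchronizing after every step versus after 1 step in 100 (illustrative numbers).
every_step = run_cost(num_workers=10, num_params=62_000, steps=1_000,
                      syncs=1_000, state_bytes=0)
sparse = run_cost(num_workers=10, num_params=62_000, steps=1_000,
                  syncs=10, state_bytes=8)
ratio = every_step.communication_bytes / sparse.communication_bytes
```

Under this accounting, model traffic dominates the total, so the per-step
local states add little as long as they are much smaller than the model.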
Hyper-Parameters & Optimizers. Hyper-parameters unique
to each training dataset and model are detailed in Table 2; Θ is
pertinent to FDA algorithms and not applicable to others. Notably,
a guideline for setting the parameter Θ is provided in Section 4.3.
For experiments involving FedAvgM and FedAdam, we use 𝐸 = 1
local epoch, following [42]. For experiments with LeNet-5 and
VGG16*, local optimization employs Adam, using the default settings
as per [26]. In these cases, FedAdam also adheres to the
default settings for both local and server optimization [7, 42].
For DenseNet121 and DenseNet201, local optimization is performed
using SGD with Nesterov momentum (SGD-NM), setting
the momentum parameter to 0.9 and the learning rate to 0.1 [22].
For FedAvgM, local optimization is conducted with default settings
[7, 21], while server optimization employs SGD with momentum,
setting the momentum parameter and learning rate to
0.9 and 0.316, respectively [42]. Lastly, for the transfer learning
experiments, local optimization leverages AdamW [34], with the
hyper-parameters used for fine-tuning ConvNeXtLarge in the
original study [33].
Table 2: Summary of Experiments

                                              Hyper-Parameters                                     Training
NN                           d     Dataset    Θ                              b   K                 Optimizer  Algorithms
LeNet-5                      62K   MNIST      {0.5,1,1.5,2,3,5,7}            32  {5, 10, ..., 60}  Adam       FDA, Synchronous, FedAdam
VGG16*                       2.6M  MNIST      {20,25,30,50,75,90,100}        32  {5, 10, ..., 60}  Adam       FDA, Synchronous, FedAdam
DenseNet121                  6.9M  CIFAR-10   {200,250,275,300,325,350,400}  32  {5, 10, ..., 30}  SGD-NM     FDA, Synchronous, FedAvgM
DenseNet201                  18M   CIFAR-10   {350,500,600,700,800,850,900}  32  {5, 10, ..., 30}  SGD-NM     FDA, Synchronous, FedAvgM
ConvNeXtLarge (fine-tuning)  198M  CIFAR-100  {25,50,100,150}                32  {3, 5}            AdamW      FDA, Synchronous

[Figure 3: KDE plots, three panels. x-axis: Communication (GB); y-axis:
In-Parallel Learning Steps. Strategies: LinearFDA, SketchFDA, FedAdam,
Synchronous. Panels: IID; Non-IID: Label "0"; Non-IID: 60%; Accuracy
Target 0.985 in all panels.]
Figure 3: LeNet-5 on MNIST. At Non-IID: Label "0", the samples
of Label "0" are assigned to few workers. At Non-IID: 60%, 60% of
the dataset is sorted and allocated to workers, causing some
workers to receive many samples from the same label.

Data Distribution. In all experiments, the training dataset is
divided into approximately equal parts among the workers. To
assess the impact of data heterogeneity, we explore three scenarios:
(1) IID. Independent and identically distributed.
(2) Non-IID: 𝑋%. A portion 𝑋% of the dataset is sorted by
    label and sequentially allocated to workers, with the
    remainder distributed in an IID fashion.
(3) Non-IID: Label 𝑌. All samples from label 𝑌 are assigned
    to a few workers, while the rest are distributed in an IID
    manner.
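The three data-distribution scenarios can be sketched with the Python
standard library as follows. This is one possible reading of the descriptions,
not the paper's partitioning code: the function names are hypothetical, shards
are contiguous slices of an index ordering, and in the Label 𝑌 scenario the
number of workers that hold label 𝑌 is not fixed explicitly but falls out of
the shard size.

```python
import random
from typing import List, Sequence


def split_evenly(items: Sequence[int], k: int) -> List[List[int]]:
    """Split items into k contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + q + (1 if i < r else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def partition_iid(labels: Sequence[int], k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle all sample indices and deal them out in equal shards."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    return split_evenly(idx, k)


def partition_non_iid_percent(labels: Sequence[int], k: int, x_percent: int,
                              seed: int = 0) -> List[List[int]]:
    """Sort x_percent of the samples by label and allocate them sequentially; rest IID."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) * x_percent // 100
    sorted_part = sorted(idx[:cut], key=lambda i: labels[i])
    parts = split_evenly(sorted_part, k)
    for part, extra in zip(parts, split_evenly(idx[cut:], k)):
        part.extend(extra)
    return parts


def partition_non_iid_label(labels: Sequence[int], k: int, y: int,
                            seed: int = 0) -> List[List[int]]:
    """Place all samples of label y first, so only the first few shards contain them."""
    with_y = [i for i, lab in enumerate(labels) if lab == y]
    others = [i for i, lab in enumerate(labels) if lab != y]
    random.Random(seed).shuffle(others)
    return split_evenly(with_y + others, k)
```

For instance, with 1000 samples over 10 balanced labels and 𝐾 = 5,
`partition_non_iid_label(labels, 5, y=0)` places all 100 samples of label 0
in the first worker's shard of 200, and every shard keeps the same size.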
4.2 Main Findings
The main findings of our experimental analyses are:
(1) LinearFDA and SketchFDA outperform the Synchronous,
    FedAdam and FedAvgM techniques (their use depends
    on the local optimizer choice) by 1-2 orders of magnitude
    in communication, while maintaining equivalent
    model performance.
(2) LinearFDA and SketchFDA also significantly outperform
    the FedAdam and FedAvgM techniques in terms of
    computation.
(3) The performance of LinearFDA and SketchFDA is comparable
    in most experiments. SketchFDA provides a more
    accurate estimator of the variance and leads to fewer
    synchronizations than LinearFDA, but has a larger
    communication overhead for its local state (a sketch, compared
    to two numbers). SketchFDA significantly outperforms
    LinearFDA in the transfer learning scenario.
(4) The FDA variants remain robust across various data
    heterogeneity settings, maintaining comparable performance to
    the IID case.
4.3 Results
Due to the extensive set of unique experiments (over 1000), as
detailed in Table 2, we leverage Kernel Density Estimation (KDE)
plots [62] to visualize the bivariate distribution of computation
and communication costs incurred by each strategy for attaining
the Accuracy Target. These KDE plots provide a high-level
overview of the cost trade-offs for training accurate models. The
varying levels of opacity in the filled areas of the KDE plots
represent the density of the underlying data points: higher opacity
indicates areas with a greater concentration of data, whereas
lower opacity signifies less dense areas.
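For readers unfamiliar with bivariate KDE, the following sketch evaluates a
density estimate at a single point of the (communication, computation) plane.
It is an illustration only: the product Gaussian kernel, the fixed bandwidths,
and the use of log10 coordinates are assumptions and need not match the
plotting library's defaults, and the sample runs are made-up values, not
results from the paper.

```python
import math
from typing import Sequence, Tuple

# One run: (communication in GB, in-parallel learning steps).
Point = Tuple[float, float]


def kde2d(points: Sequence[Point], query: Point,
          hx: float = 0.3, hy: float = 0.3) -> float:
    """Product-Gaussian KDE evaluated at `query`, computed in log10 coordinates."""
    qx, qy = math.log10(query[0]), math.log10(query[1])
    total = 0.0
    for x, y in points:
        zx = (qx - math.log10(x)) / hx
        zy = (qy - math.log10(y)) / hy
        total += math.exp(-0.5 * (zx * zx + zy * zy))
    return total / (len(points) * 2.0 * math.pi * hx * hy)


# Made-up runs: three that cluster together and one isolated run.
runs = [(0.5, 2.0e4), (0.7, 2.5e4), (0.6, 1.8e4), (40.0, 1.2e4)]
dense = kde2d(runs, (0.6, 2.0e4))    # inside the cluster
sparse = kde2d(runs, (40.0, 1.2e4))  # at the isolated run
```

A region where many runs land close together receives a higher density value,
which is what the plots render as higher opacity.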
As an illustrative example, Figure 3 depicts the strategies'
bivariate distribution for the LeNet-5 model trained on MNIST with
different data heterogeneity setups. In these plots, the SketchFDA
distribution is generated from experiments across all
hyper-parameter combinations (Θ and 𝐾 in Table 2) that attained the
Accuracy Target of 0.985. The observed high variance in the
method's distribution stems from the varying 𝐾 and Θ values. In
subsequent subsections, we elucidate how these hyper-parameters
influence the communication and computation costs.
[Figure 4: KDE plots, six panels. x-axis: Communication (GB); y-axis:
In-Parallel Learning Steps. Strategies: LinearFDA, SketchFDA, FedAdam,
Synchronous. Panels: IID (Accuracy Target 0.994 and 0.995); Non-IID:
Label "0" (0.994 and 0.995); Non-IID: Label "8" (0.994 and 0.995).]
Figure 4: VGG16* on MNIST

[Figure 5: KDE plots, two panels. x-axis: Communication (GB); y-axis:
In-Parallel Learning Steps. Strategies: LinearFDA, SketchFDA, FedAvgM,
Synchronous. Panels: IID, Accuracy Target 0.78; IID, Accuracy Target 0.81.]
Figure 5: DenseNet121 on CIFAR-10

[Figure 6: KDE plots, two panels. x-axis: Communication (GB); y-axis:
In-Parallel Learning Steps. Strategies: LinearFDA, SketchFDA, FedAvgM,
Synchronous. Panels: IID, Accuracy Target 0.78; IID, Accuracy Target 0.8.]
Figure 6: DenseNet201 on CIFAR-10

FDA balances Communication vs. Computation. DDL algorithms
face a fundamental challenge: balancing the competing
demands of computation and communication. Frequent communication
accelerates convergence and potentially improves model
performance, but incurs higher network overhead, an overhead
that may be prohibitive when workers communicate through
lower speed connections. Conversely, reducing communication
saves bandwidth but risks hindering, or even stalling, convergence.
Traditional DDL approaches, like Synchronous, require
synchronizing model parameters after every learning step, leading
to significant communication overhead but facilitating faster
convergence (lower computation cost). This is evident in
Figures 3, 4, 5, and 6 (where Synchronous appears in the bottom
right: low computation, very high communication). Conversely,
Federated Optimization (FedOpt) methods [42] are designed to
be communication-efficient, reducing communication between
devices (workers) at the expense of increased local computation.
Indeed, as shown in Figures 3-6, FedAvgM and FedAdam reduce
communication by orders of magnitude, but at the price of a
corresponding increase in computation. Our two proposed FDA
strategies achieve the best of both worlds: the low computation
cost of traditional methods and the communication efficiency of
FedOpt approaches, as seen in Figures 3, 4, 5, and 6. In fact, they
significantly outperform FedAvgM and FedAdam in their element,
that is, communication-efficiency. Across all experiments,