Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

Author: Jiaqi, Liu; Jichao, Zhang; Rota, Paolo; Sebe, Niculae
Publisher: Zenodo
DOI: 10.1109/CVPR52734.2025.01493
Source: https://zenodo.org/records/17688122/files/Liu_Multi-focal_Conditioned_Latent_Diffusion_for_Person_Image_Synthesis_CVPR_2025_paper.pdf
Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
Jiaqi Liu¹, Jichao Zhang², Paolo Rota¹, Nicu Sebe¹
¹University of Trento, ²Ocean University of China
Abstract
The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression process of LDM often results in the deterioration of details, particularly in sensitive areas such as facial features and clothing textures. In this paper, we propose a Multi-focal Conditioned Latent Diffusion (MCLD) method to address these limitations by conditioning the model on disentangled, pose-invariant features from these sensitive regions. Our approach utilizes a multi-focal condition aggregation module, which effectively integrates facial identity and texture-specific information, enhancing the model's ability to produce appearance-realistic and identity-consistent images. Our method demonstrates consistent identity and appearance generation on the DeepFashion dataset and enables flexible person image editing due to its generation consistency. The code is available at https://github.com/jqliu09/mcld.
1. Introduction
The pose-guided person image synthesis (PGPIS) task focuses on transforming a source image of a person into a target pose, while preserving the appearance and identity of the individual as accurately as possible. This task has significant implications in applications like virtual reality, e-commerce, and the fashion industry, where maintaining photorealistic quality and identity consistency is essential.
Recent approaches to PGPIS largely rely on Generative Adversarial Networks (GANs) [6], which, despite their success, often struggle with training instability and mode collapse, resulting in suboptimal preservation of identity and garment details [27,36,45,51,54,57]. As an alternative, diffusion models [11,37] have shown promise in generating high-quality images by progressively refining details through multiple denoising steps. The introduction of PIDM [2] marked the first application of diffusion models to PGPIS. Later methods build on latent diffusion models (LDM) [37], which compress images into high-level feature representations, thereby reducing the computational complexity while supporting high-resolution outputs. Extensions such as PoCoLD [9] enhance 3D pose correspondence using pose-constrained attention, and CFLD [24] emphasizes semantic understanding with hybrid-granularity attention.

[Figure 1: panels (a) and (b); column labels: Source Image, VAE Recon, VAE Recon + ε, GT, CFLD, Ours]
Figure 1. (a) The VAE [37] reconstruction deteriorates the detailed information of person images, especially the facial regions and complex textures. These issues worsen for a generated latent with small deviations. A small deviation $\epsilon = 0.2$ is added to demonstrate the common case of a generated latent. (b) Our method preserves this detailed information better than other LDM-based methods by introducing multi-focal conditions.
Despite these advancements, LDM-based methods encounter limitations in recovering fine appearance details, especially in facial and clothing regions. As shown in Fig. 1 (a), this challenge is primarily due to the lossy nature of autoencoder compression [1], which can degrade complex textures and identity-specific features during encoding. Since the lossy reconstructed images are the upper bound on the quality of images generated by LDM-based methods, this issue worsens at inference time, because the generated latent deviates from the compressed real latent. Additionally, LDM's reliance on whole-image conditioning often struggles to focus on sensitive regions where appearance precision is critical. The integration of pose and appearance information complicates detail reconstruction, leading to suboptimal performance across diverse poses and sensitive areas.
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

To overcome these limitations, we introduce a Multi-focal Conditioned Latent Diffusion (MCLD) approach for PGPIS. Our method mitigates the loss of detail in sensitive regions by conditioning the diffusion model on the corresponding selectively decoupled features rather than the entire image. Specifically, we isolate high-frequency regions, such as facial identity and appearance textures, from the source image and treat them as independent conditions. This decoupling strategy enhances control over sensitive areas, ensuring better identity preservation and texture fidelity. Our approach first generates pose-invariant embeddings of the selected regions shared in the source and target images using pretrained modules, which are then used within the Multi-focal Condition Aggregation module. This module introduces selective cross-attention layers, leveraging the structural advantages of UNet to combine the conditions effectively. Consequently, our MCLD method achieves improved control and accuracy, facilitating high-quality, realistic person image synthesis. Our main contributions can be summarized as follows:
• We introduce a new approach, MCLD, that focuses on alleviating the deterioration of important details in sensitive areas like the face and clothing by using separate conditions for these regions, which improves both identity preservation and appearance fidelity.
• We develop a multi-focal condition aggregation module that combines controls from multiple focus areas, allowing our model to produce more realistic images without loss or collapse of details in key regions.
• Our method achieves consistent appearance generation across different poses, especially in challenging regions like faces and textures, leading to state-of-the-art results on the DeepFashion dataset [22] and to flexible yet accurate downstream editing applications.
2. Related Works
Pose-Guided Person Image Synthesis was initially proposed by PG2 [26], which first applied conditional GANs to adversarially refine pose-guided human generation. Later, GAN-based research addressed this problem through two main approaches. The first focuses on the transfer process, where methods model the deformation between poses using affine transformations [42,43] and flow fields [18,33–35]. The second approach aims to enhance the generation quality by better disentangling pose and appearance information. This disentanglement can be implicitly achieved by modeling the spatial correspondence between the pose and appearance features [27,36,45,51,54,55,57]. Auxiliary explicit information is also introduced to improve the appearance quality, especially the UV texture map [7,40,41,50], which provides pose-irrelevant appearance guidance. However, due to the instability in training and the mode collapse issues associated with GAN models, previous GAN-based works have encountered challenges with unrealistic or changed textures in posed person images.
To mitigate this limitation, diffusion-based methods have more recently been introduced in PGPIS. PIDM [2] was the first to utilize the iterative denoising property of the diffusion model. Subsequent methods [9,24] have sought to improve the generation capability by employing latent diffusion models (LDM) [37] rather than operating in pixel space. In detail, CFLD [24] addresses the importance of semantic understanding towards the decoupling of fine-grained appearance and poses, while PoCoLD [9] establishes the correspondence between pose and appearance. More recently, some video person animation methods have also taken advantage of the compressed latent in LDM, but they mainly concentrate on keeping temporal consistency through spatial attention [12,47] and consistent pose guidance [56]. Both the image and video synthesis methods use a source person image as the condition, and the generated image collapses when the target pose varies greatly from the source image. In addition, it has been noticed that image quality deteriorates [1,9] when LDM compresses images to lower dimensions, especially for images with high-frequency information. However, very few works have considered tackling this problem.
Conditional Diffusion Models. Recently, diffusion models [11,44] have outperformed GANs and significantly improved the visual fidelity of synthesized images across various domains, including text-to-image generation [37–39], person image generation [2,4,9,24,48], and 3D avatar generation [13,17,19,20,31]. For most tasks, the widely used model is Stable Diffusion [37] (and its variants), which is a unified conditional diffusion model that allows for semantic maps, text, or images to be used as conditions for controlling generation. Its key contributions lie in applying the diffusion process in latent space, which minimizes resource consumption while maintaining generation quality and flexibility. In this paper, our architecture, along with the baseline's, is derived from this conditional model, i.e., Stable Diffusion. Previous conditional diffusion models can be categorized into three types based on the condition: text-conditioned [37,38], image-conditioned [12,15,24,28], and mixed-conditioned models [49,52]. These methods typically use a pretrained model [29,32,37] to extract condition features, which are then injected into the denoising UNet via cross-attention. Different from the mainstream approaches that regard images and texts as a whole, our proposed Multi-focal Conditioned method takes a human image as the input, transforms it by different focuses (e.g., texture maps and facial features), and encodes these focuses into embeddings using various pre-trained models. This approach is a sophisticated combination of image-conditioned and mixed-conditioned strategies. Additionally, we introduce a Multi-focal Conditions Aggregation technique to effectively distribute these conditions into the UNet.
[Figure 2: pipeline diagram. Panels: (a) Multi-focal Feature Extraction, (b) MFCA, (c) Denoising UNet, (d) Pose Guider. Labeled components: Source Human Image (I), Face Region (F), Appearance Region (A), Warp, Texture map, Source Pose, Target Pose ($p_t$), Noise, Target Image ($I_t$), Face Encoder, Projection, CLIP, VAE, ReferenceNet, Pose Guider, Multi-focal Condition Aggregation (MFCA)]
Figure 2. The overall pipeline of our proposed Multi-focal Conditioned Diffusion Model. (a) Face regions and appearance regions are first extracted from the source person images; (b) the multi-focal condition aggregation module $\phi$ is used to fuse the focal embeddings as $c_{emb}$; (c) ReferenceNet $R$ is used to aggregate information from the appearance texture map, denoted as $c_{ref}$; (d) Densepose provides the pose control to be fused into UNet with noise by Pose Guider.
3. Methodology
Given a reference image $I$ representing the appearance condition $c$, the task of PGPIS aims to generate a target image $I_t$ with a desired pose $p_t$. This is achieved by learning a conditional network $T$ such that $I_t = T(c, p_t)$. While the representation of $p_t$ is typically predefined [3,8,23], the success of generation largely relies on the network $T$ and conditions $c$, which extract the shared pose-irrelevant appearance features between $I$ and $I_t$, ensuring high-quality synthesis of $I_t$. To enhance synthesis, we introduce a diffusion model $\epsilon_\theta$ conditioned by multiple factors, collectively denoted as $c^*$, which iteratively recovers $I_t$ from noise.
3.1. Multi-Conditioned Latent Diffusion Model
The backbone of our proposed method is based on Stable Diffusion [37] (SD), which is an implementation of LDM. An encoder $E$ compresses the image $I$ to a latent $z$, and a decoder $D$ transforms $z$ back to an image $I' = D(z)$. The compressed latent representation reduces the optimization space and allows the generation of higher resolution and richer diversity. The optimization of loss $L$ in LDM can be repurposed as:
\[
L_{mse} = \mathbb{E}_{z_t, p_t, t, \epsilon, c^*}\left( \left\| \epsilon - \epsilon_\theta(z_t, p_t, t, c^*) \right\| \right), \tag{1}
\]
where $\epsilon_\theta$ represents the forward process of UNet in LDM, $p_t$ is the target pose, $z_t$ is the noisy latent $z$ under timestep $t$, and $c^*$ is our multi-focal condition.
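As a rough illustration of Eq. (1), the following sketch computes a noise-prediction loss on a toy latent. It is not the authors' training code: the forward noising rule (standard DDPM form), the use of a mean squared error for the norm, and the `toy_denoiser` stand-in for $\epsilon_\theta$ are all assumptions made for illustration.

```python
import math
import random

def add_noise(z0, eps, alpha_bar_t):
    """Assumed DDPM-style forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def mse_loss(eps, eps_pred):
    """Mean squared error between the sampled noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

def toy_denoiser(z_t, pose, t, cond):
    """Hypothetical stand-in for eps_theta(z_t, p_t, t, c*); a real model is a UNet."""
    return [0.0 for _ in z_t]

random.seed(0)
z0 = [0.5, -1.0, 0.25, 2.0]                      # toy clean latent
eps = [random.gauss(0.0, 1.0) for _ in z0]       # sampled Gaussian noise
z_t = add_noise(z0, eps, alpha_bar_t=0.7)        # noisy latent at some timestep t
loss = mse_loss(eps, toy_denoiser(z_t, pose=None, t=10, cond=None))
print(loss)
```

In an actual training loop the denoiser would receive the target pose and the multi-focal conditions, and the loss would be averaged over a batch of sampled timesteps.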
Despite the advantages of having a latent representation, $I'$ deteriorates during the compression process. While the perceptual differences between $I'$ and $I$ may be very small, this degeneration diminishes the significance of the latent code $z$, particularly for features that are supposed to exhibit substantial variance in the original input $I$, such as facial traits and garment texture. Furthermore, this deterioration is magnified because minimizing $L_{mse}$ for $T$ cannot guarantee that the generated latent is free of deviation, which ultimately leads to unsatisfactory appearance generation in these high-frequency regions. Previous LDM-based methods [9,24] have neglected this issue by relying only on images, which resulted in the model's failure to accurately generate sensitive regions.
To address this problem, we propose a solution that utilizes multi-focal conditions $c^*$ to focus attention on the important areas of the image. To implement this approach, we have designed a two-branch conditional diffusion model that effectively captures multi-focal attention. The pipeline is shown in Fig. 2. On the first branch, we follow the structure of ReferenceNet [12] to provide the semantic and low-level features $c_{ref}$, which are concatenated with the UNet features in each stage. In the second branch, we exploit pretrained models to embed three selected focal features from the source image $I$, face region $F$, and appearance region $A$, respectively. These embeddings are aggregated into UNet with our Multi-focal Conditions Aggregation (MFCA).
3.2. Multi-focal Conditions Aggregation
Multi-focal Regions. To enhance latent feature preservation, we incorporate high-frequency focal regions $c^*$ from the source image $I$ as conditioning inputs. These focal regions help guide attention mechanisms to mitigate the degradation of human-sensitive features. In our implementation, we focus on regions of the face and appearance that, although they constitute a small portion of the image, capture essential perceptual variations. The degradation of these areas within the autoencoder can lead to losing fine details, potentially causing the latent feature representation to overlook subtle distinctions present in the source image.
Specifically, we employ [21] to crop the source image $I$, obtaining the face region $F$. Additionally, we obtain the appearance region $A$ by warping $I$ into a structured texture map defined by the SMPL model [23], indexing from its estimated DensePose [8] $p_I$. The texture map disentangles appearance from the posed image, retaining only the pose-invariant texture information.
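The first stage of this warping can be pictured as scattering image pixels into per-part UV tiles. The sketch below is only a simplified illustration under stated assumptions: DensePose output is represented as a hypothetical per-pixel `(part, u, v)` triple with `u, v` in [0, 1], the vertical axis is assumed to flip (`v` grows upward), and the later predefined mapping from part tiles to the SMPL texture layout is not shown. The tile size of 200 and the 24 parts follow the implementation details given later in the paper.

```python
def build_part_tiles(pixels, iuv, tile=200, n_parts=24):
    """Scatter image pixels into per-part UV tiles.

    pixels: dict mapping (row, col) -> color
    iuv:    dict mapping (row, col) -> (part, u, v); part 0 means background
    Returns a dict mapping part index -> {(tile_row, tile_col): color}.
    """
    tiles = {part: {} for part in range(1, n_parts + 1)}
    for pos, color in pixels.items():
        part, u, v = iuv.get(pos, (0, 0.0, 0.0))
        if part == 0:
            continue  # background pixels carry no texture
        col = min(int(u * tile), tile - 1)
        row = min(int((1.0 - v) * tile), tile - 1)  # assumption: v grows upward
        tiles[part][(row, col)] = color
    return tiles

# Toy example: one body pixel on part 3 and one background pixel.
pixels = {(0, 0): "red", (0, 1): "blue"}
iuv = {(0, 0): (3, 0.5, 0.5), (0, 1): (0, 0.0, 0.0)}
tiles = build_part_tiles(pixels, iuv)
```

Texels that no source pixel maps to stay empty, which corresponds to the incomplete texture maps discussed in the failure cases.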
Multi-focal Embeddings. The three multi-focal conditions are managed using pretrained modules. Starting with a source image $I$, we extract its embedding $I_{emb}$ using a pretrained CLIP image encoder [32]. The texture map $A$ is processed in two ways through $T$. First, we encode $A$ with a VAE encoder [37], producing an output $A_{ref}$, which is then passed to ReferenceNet $R$. Additionally, we use CLIP to obtain the texture map encoding $A_{emb}$. For facial regions $F$, we note that general image encoders like CLIP may struggle to accurately capture identity features, as face appearance and view in $I_t$ may differ significantly from those in $I$. To address this, we utilize a pretrained face recognition model [5] to localize the face region and extract identity features. These features are then projected to match the dimensionality of the previous embeddings, noted as $F_{emb}$. It is important to note that both $F_{emb}$ and $A_{emb}$ are shared between $I$ and $I_t$, as they are pose-invariant and represent attributes of the same appearance.
Multi-focal Conditioning. The conditions $c^*$ are assembled as follows:
\[
c^* = \left\{
\begin{aligned}
c_{ref} &= R(A_{ref}) \\
c_{emb} &= \phi(I_{emb}, A_{emb}, F_{emb}, z),
\end{aligned}
\right. \tag{2}
\]
where $R$ is a trainable ReferenceNet extracting both the structured details and layouts of appearance regions. $\phi$ denotes a multi-focal condition aggregation module (MFCA) to aggregate the embeddings to UNet. $z$ is a latent input in UNet. Inspired by InstantID [46], $\phi$ is defined as follows (see Fig. 2(b)):
\[
\phi = \sum_{i \in \{s,\, F_{emb}\}} \lambda_i \, \mathrm{Attn}(Q, K_i, V_i), \qquad
Q = z W_q, \quad K_i = i\, W_{k_i}, \quad V_i = i\, W_{v_i}, \tag{3}
\]
where $Q$, $K_i$, $V_i$ are the query, key and value matrices of the cross-attentions. $W$ is the attention weight and $\lambda_i$ is the scaling factor. $Q$ is computed from the latent $z$, while $K_i$, $V_i$ are computed from the conditioning embeddings $i$, including the face $F_{emb}$ and a selective condition switcher $s$. $s$ is defined as:
\[
s =
\begin{cases}
I_{emb} & \text{if } z \in U_E \\
\mathrm{cat}(I_{emb}, A_{emb}) & \text{if } z \in U_M \\
A_{emb} & \text{if } z \in U_D
\end{cases} \tag{4}
\]
where $U_E$, $U_M$, $U_D$ are the encoder, the latent layer and the decoder of UNet, respectively.
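A minimal sketch of Eq. (3) and Eq. (4) on plain Python lists may help make the aggregation concrete. It is written under several assumptions that the text does not confirm: single-head attention with the usual 1/sqrt(d) scaling, concatenation along the token axis for the middle stage, and the assignment of scale 1 to the switcher branch and 0.5 to the face branch (the paper only states that the $\lambda_i$ are set to 1 and 0.5). The function names and the weight dictionary `w` are hypothetical.

```python
import math

def matmul(a, b):
    """Matrix product of a (n x k) and b (k x m), both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    total = sum(e)
    return [x / total for x in e]

def attention(q, k, v):
    """Scaled dot-product attention; q is (n x d), k and v are (m x d)."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]            # transpose k to (d x m)
    scores = matmul(q, k_t)                         # (n x m)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def select_condition(stage, i_emb, a_emb):
    """Selective switcher s of Eq. (4) for 'encoder', 'middle' or 'decoder'."""
    if stage == "encoder":
        return i_emb
    if stage == "middle":
        return i_emb + a_emb                        # assumed: concatenate tokens
    if stage == "decoder":
        return a_emb
    raise ValueError(f"unknown stage: {stage}")

def mfca(z, stage, i_emb, a_emb, f_emb, w, lam_s=1.0, lam_f=0.5):
    """phi of Eq. (3): scaled sum of the switcher branch and the face branch."""
    q = matmul(z, w["q"])
    s = select_condition(stage, i_emb, a_emb)
    out = [[0.0] * len(w["vs"][0]) for _ in z]
    for emb, wk, wv, lam in ((s, w["ks"], w["vs"], lam_s), (f_emb, w["kf"], w["vf"], lam_f)):
        branch = attention(q, matmul(emb, wk), matmul(emb, wv))
        out = [[o + lam * b for o, b in zip(r_out, r_b)] for r_out, r_b in zip(out, branch)]
    return out

# Toy usage with 2-dimensional tokens and identity projections.
eye = [[1.0, 0.0], [0.0, 1.0]]
w = {"q": eye, "ks": eye, "vs": eye, "kf": eye, "vf": eye}
out = mfca([[1.0, 0.0]], "encoder", [[1.0, 0.0]], [[0.0, 1.0]], [[2.0, 2.0]], w)
```

With a single key token per branch, each attention output equals the value token, so the toy result is the scaled sum of the image token and the face token.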
When combining all conditions, our Multi-Focal Condition Aggregator (MFCA) efficiently aggregates the multi-focal embeddings. This efficiency stems from reducing attention operations to focus on a specific region at each step, while simultaneously leveraging the embedding properties and the inherent structure of the UNet architecture.
Moreover, we introduce a selective condition injection approach to accommodate the distinct characteristics of the UNet structure. Specifically, $U_E$ encodes information into a lower-dimensional space, so we inject global information from $I_{emb}$ related to high-level semantics, such as clothing categories and the general background. Conversely, during the decoding stage, $U_D$ requires fine-grained information to effectively reconstruct the final image; thus, $A_{emb}$ is injected at this phase to provide pose-irrelevant features, such as texture and garment details. This targeted injection strategy reduces parameter counts and guides the model to prioritize the information most relevant to each specific architectural stage.
Since $F_{emb}$ is derived from a pretrained face recognition model, it maintains robustness across diverse views and appearances. We retain $F_{emb}$ throughout all stages of UNet to consistently represent both the input and target faces. An addition operation is employed to aggregate $F_{emb}$ and $s$.
Pose Guider. We harness Densepose as our pose condition as it provides appropriate 3D information as claimed in PoCoLD [9]. In addition, Densepose coordinates establish a bijection between the UV space texture map $A$ and image pixels of $I_t$, which implicitly bridges the appearance alignment of the two focuses. Similar to [12], we employ a lightweight pose guider module constructed with a series of convolutional layers derived from ControlNet. This module is initialized with pretrained parameters from the ControlNet segmentation model, enabling it to leverage prior knowledge for enhanced guidance.
Methods | FID↓ | LPIPS↓ | SSIM↑ | PSNR↑
Evaluation on 256×176 resolution
GFLA‡ [34] (CVPR20) | 9.827 | 0.1878 | 0.7082 | –
SPGNet‡ [25] (CVPR21) | 16.184 | 0.2256 | 0.6965 | 17.222
NTED‡ [36] (CVPR22) | 8.517 | 0.177 | 0.7156 | 17.74
CASD‡ [55] (ECCV22) | 13.137 | 0.1781 | 0.7224 | 17.880
PIDM† [2] (CVPR23) | 6.36 | 0.1678 | 0.7312 | –
PoCoLD† [9] (ICCV23) | 8.067 | 0.1642 | 0.7310 | –
CFLD† [24] (CVPR24) | 6.804 | 0.1519 | 0.7378 | 18.235
MCLD (B3) | 6.86 | 0.157 | 0.734 | 18.03
MCLD (Ours) | 6.693 | 0.1482 | 0.7511 | 18.84
Evaluation on 512×352 resolution
CoCosNet-v2 [54] (CVPR22) | 13.325 | 0.2265 | 0.7236 | –
NTED‡ [36] (CVPR22) | 7.645 | 0.1999 | 0.7359 | 17.385
PIDM† [2] (CVPR23) | 5.8365 | 0.1768 | 0.7419 | –
PoCoLD† [9] (ICCV23) | 8.416 | 0.1920 | 0.7430 | –
CFLD† [24] (CVPR24) | 7.149 | 0.1819 | 0.7478 | 17.645
MCLD (B3) | 7.23 | 0.1951 | 0.7405 | 17.48
MCLD (Ours) | 7.079 | 0.1757 | 0.7557 | 18.211
Table 1. Quantitative comparison with the state of the art in terms of image quality benchmarks. † The scores are reported in their paper, since the same split is followed. ‡ The scores are evaluated and reported in CFLD [24], since they split the validation set in a different way. Our evaluation code is the same as CFLD.
3.3. Overall objective
To force the model to concentrate more on the target face regions, we introduce an extra loss for supervision at face regions:
\[
L_{face} = \mathbb{E}_{z_t, p_t, t, \epsilon, c^*}\left( \left\| \left( \epsilon - \epsilon_\theta(z_t, p_t, t, c^*) \right) \odot m \right\| \right) \tag{5}
\]
where $m$ is the segmentation mask of face regions, which is parsed from the densepose $p_t$.
Combining Eq. (1), the overall objective function is:
\[
L_{overall} = L_{mse} + L_{face} \tag{6}
\]
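The effect of the face term in Eq. (5) and Eq. (6) can be seen on a flattened toy example. This is a sketch only: it assumes a mean squared error for both norms and a binary mask, and the helper names are hypothetical.

```python
def masked_mse(eps, eps_pred, mask=None):
    """Mean of ((eps - eps_pred) * m)^2 over all elements; mask=None means all ones."""
    if mask is None:
        mask = [1.0] * len(eps)
    return sum(((e - p) * m) ** 2 for e, p, m in zip(eps, eps_pred, mask)) / len(eps)

def overall_loss(eps, eps_pred, face_mask):
    """L_overall = L_mse + L_face, following Eq. (6)."""
    return masked_mse(eps, eps_pred) + masked_mse(eps, eps_pred, face_mask)

eps = [1.0, 2.0, 3.0, 4.0]           # sampled noise (flattened latent)
eps_pred = [0.0, 2.0, 1.0, 4.0]      # model prediction
face_mask = [1.0, 1.0, 0.0, 0.0]     # 1 inside the face region, 0 elsewhere
print(overall_loss(eps, eps_pred, face_mask))  # prints 1.5
```

Errors inside the face mask are counted twice (once in each term), which is how the extra loss shifts the optimization towards the face region.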
4. Experiments
In this section, we present a detailed analysis of our experiments including the dataset setup, evaluation metrics, implementation details, and a thorough comparison of our approach with state-of-the-art methods.
Dataset. Following [2,9,24,57], experiments are conducted using the DeepFashion In-Shop Clothes Retrieval Benchmark [22], which contains 52,712 high-resolution images of fashion models. Consistent with CFLD, we split the dataset into training and validation subsets, comprising 101,966 and 8,570 non-overlapping image pairs, respectively. Pose pairs are extracted by Densepose and we evaluate results on 256×176 and 512×352 resolutions.
Methods | FS_ref↑ | dist_ref↓ | FS_gt↑ | dist_gt↓
Evaluation on 256×176 resolution
CASD [55] (ECCV22) | 0.207 | 28.80 | 0.317 | 26.28
PIDM [2] (CVPR23) | 0.270 | 28.06 | 0.394 | 25.17
CFLD [24] (CVPR24) | 0.243 | 28.86 | 0.363 | 26.11
MCLD (B3) | 0.279 | 28.1 | 0.381 | 25.7
MCLD (Ours) | 0.301 | 27.65 | 0.413 | 25.02
ref | – | – | 0.497 | 22.53
Evaluation on 512×352 resolution
CFLD [24] (CVPR24) | 0.227 | 28.62 | 0.286 | 27.54
MCLD (B3) | 0.289 | 27.04 | 0.333 | 26.25
MCLD (Ours) | 0.294 | 27.01 | 0.344 | 26.31
ref | – | – | 0.643 | 17.42
Table 2. Quantitative comparison with the state of the art regarding face quality benchmarks. FS is the face similarity metric, while dist is the Euclidean distance measure. ref refers to the input source human image, gt is the ground truth image. Both ref and gt are real images.

Metrics. We use two groups of objective metrics to evaluate the overall generated image quality and the generated face preservation, respectively. For the overall generated image quality, four metrics are adopted for comparison. The Fréchet Inception Distance (FID) [10] measures the Wasserstein-2 distance between the feature distributions of generated images and real images, with features extracted from the pretrained Inception-v3 network. Specifically, the generated image features come from the validation dataset, while the real image features are from the training dataset. The Learned Perceptual Image Patch Similarity (LPIPS) [53] computes image-wise similarity in the perceptual feature space. Both FID and LPIPS assess image quality in a high-level feature domain. Additionally, we use two pixel-wise metrics: the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR), which evaluate the accuracy of pixel-wise correspondences between the generated and real images. To assess the identity preservation of the face region, we use a pretrained Face Recognition Model [5] to extract the face embeddings and compute the face cosine similarity FS and Euclidean distance dist between the face regions of generated images and real images. Both the source image (ref) and the target image (gt) are evaluated to assess the overall model ability.
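For reference, the face metrics (cosine similarity FS and Euclidean distance dist) and PSNR reduce to a few lines once embeddings and pixel values are available as flat lists. The sketch below uses only the textbook definitions; whether the embeddings are normalized before the distance is computed, and the value range used for PSNR, are not stated in the text and are left as explicit assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """L2 distance between two embedding vectors (no normalization assumed)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB for flattened images; max_value is assumed."""
    mse = sum((x - y) ** 2 for x, y in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy vectors standing in for face embeddings of a generated and a real image.
generated, real = [3.0, 4.0], [6.0, 8.0]
fs = cosine_similarity(generated, real)
dist = euclidean_distance(generated, real)
```

The toy vectors point in the same direction, so the similarity is maximal even though the distance is not zero, which is why the two face metrics are reported side by side.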
[Figure 3: eight example rows, (1) to (8), in two side-by-side groups; columns: Source, Pose, Target, NTED, CASD, PIDM, CFLD, Ours]
Figure 3. Qualitative comparison with several state-of-the-art models on the DeepFashion dataset. The inputs to our models are the target pose $p_t$ and the source person image $I$. From left to right, the results are of NTED, CASD, PIDM, CFLD and ours, respectively.

Implementation details. Our model is implemented on the Stable Diffusion [37] 1.5 model using PyTorch [30] and Huggingface Diffusers. The source image and the target image are resized to 512×512. Face regions are detected by a single shot detector [21] implemented in OpenCV [14], while the face embedding is acquired by a pretrained face analysis model, antelopev2 (https://github.com/deepinsight/insightface). For appearance regions, the images are first converted to the 24 parts defined in Densepose with the size of 200×200, then these parts are transformed to a 512×512 SMPL texture map by a predefined mapping. The model is trained for 60,000 iterations using the Adam optimizer [16] with a learning rate of 1e-5. We train our model on two Nvidia A100 GPUs with a batch size of 12 for each GPU. During sampling, a classifier-free guidance (CFG) strategy is adopted to improve the sampling quality. We set the CFG scale to 3.5 and $\lambda_i$ in MFCA to 1 and 0.5.
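The classifier-free guidance step mentioned in the implementation details can be sketched with the commonly used formulation, in which the conditional and unconditional noise predictions are extrapolated by the guidance scale. Which conditions are dropped to obtain the unconditional prediction is not specified in the text, so the inputs below are simply two hypothetical prediction vectors.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale=3.5):
    """Common CFG rule: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # prediction with conditions dropped (assumed setup)
eps_cond = [1.0, 1.0]     # prediction with all conditions present
guided = classifier_free_guidance(eps_uncond, eps_cond)   # uses the paper's scale 3.5
```

A scale of 1 returns the conditional prediction unchanged, while larger values push the sample further along the direction indicated by the conditions.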
4.1. Quantitative Comparison
Our method is compared with both GAN-based and diffusion-based state-of-the-art approaches. Specifically, PIDM [2] is diffusion-based while PoCoLD [9] and CFLD [24] are LDM-based. The evaluation is performed on two resolutions, 256×176 and 512×352. In addition, we compare our method with our baseline B3 since it has an aggregation structure similar to [46]. As shown in Tab. 1, our method performs better by conditioning with multi-focal regions in the image quality benchmarks. LDM-based methods are known to encounter challenges due to autoencoder compression, which often results in suboptimal FID scores compared to fully diffusion-based approaches. Our proposed method mitigates these limitations, achieving improved FID scores among LDM-based techniques. While certain recent diffusion-based methods do not publicly release their best-performing checkpoints, we report results as stated in their respective publications. Additionally, as demonstrated in Tab. 2, our method exhibits robust identity preservation across evaluated metrics. The table also includes similarity metrics between the reference source image and target ground truth, where our method achieves performance on par with reference images, which serve as authentic representations providing facial cues to the network.
4.2. Qualitative Comparison
We present a comprehensive visual comparison with recent approaches that release their validation results or are reproducible. From left to right, the columns show NTED [36], CASD [55], PIDM [2], CFLD [24] and ours, respectively. We draw several conclusions, listed below. Firstly, current methods struggle to reconstruct the details of the textures since they only use the source image as the condition. This is especially noticeable in GAN-based methods and the LDM-based method, partially because of the limited detail representation ability of GANs and the information deterioration in LDM. However, after introducing the appearance regions through the texture map, our method shows consistent generation results when the information provided by the appearance region and face region is adequate. In rows 1-2, our method better preserves clothing styles even when these styles are rarely seen in the dataset. In rows 3-4, our method also shows a consistent ability to reconstruct the appearance patterns under the given reference images, while other methods struggle with the details of the original patterns. In row 5, for input images with complex patterns, all the methods fail to reproduce the same details. However, our method shows a consistent layout of the clothes. In addition, identity preservation is one of the most challenging tasks for current methods, since it is highly sensitive to human perception but not to generative losses. As the illustrated images show, especially in rows 6-8, our method preserves identity well by introducing the invariant face region embeddings as conditions and supervision.

Method | Conditions | Aggregation | Params | FID | LPIPS | SSIM | PSNR
Evaluation on 256×176 resolution
B1 | I | – | 1622 M | 6.427 | 0.1629 | 0.7371 | 18.18
B2 | I, A | concat | 1622 M | 6.830 | 0.1620 | 0.7357 | 18.20
B3 | I, A, F | concat | 1698 M | 6.858 | 0.1536 | 0.7340 | 18.03
B4 | F, A | MFCA | 1711 M | 6.717 | 0.1619 | 0.734 | 18.16
B5 | I, A | MFCA | 1622 M | 6.723 | 0.1483 | 0.7499 | 18.72
Ours | I, A, F | MFCA | 1717 M | 6.693 | 0.1482 | 0.7511 | 18.84
Evaluation on 512×352 resolution
B1 | I | – | 1622 M | 6.738 | 0.1923 | 0.7463 | 17.64
B2 | I, A | concat | 1622 M | 7.129 | 0.1932 | 0.7433 | 17.63
B3 | I, A, F | concat | 1698 M | 7.23 | 0.1951 | 0.7405 | 17.48
B4 | F, A | MFCA | 1711 M | 7.138 | 0.1923 | 0.7408 | 17.56
B5 | I, A | MFCA | 1622 M | 7.047 | 0.1766 | 0.7543 | 18.13
Ours | I, A, F | MFCA | 1717 M | 7.079 | 0.1757 | 0.7557 | 18.21
Table 3. Quantitative comparison of ablation studies. I, A, F are the embeddings from source images, appearance regions, and face regions respectively. The Aggregation column refers to the feature fusion strategy. Params refers to the trainable parameters in the network.

[Figure 4 columns: Source Image, Target Pose, GT, B1, B2, B3, B4, B5, Ours]
Figure 4. Qualitative ablation comparison. Refer to Tab. 3 for baseline settings.
4.3. Ablation Study
We perform ablation studies on our MFCA module to demonstrate the value of multi-focal conditions. The quantitative results are shown in Tab. 3, while the qualitative results are illustrated in Fig. 4. B1 only takes $I_{emb}$ from the source image as a condition, which is similar to other image-based methods. Due to the limited power of the image condition, the generated image fails to preserve facial and textural traits, introducing undesired artifacts. When we gradually add $A_{emb}$ and $F_{emb}$ to B2 and B3 with a concatenation strategy, the cloth style and textures in B2 slightly improve. Introducing facial conditioning (i.e., B3) further improves performance. However, this simple concatenation does not ensure stable performance. When too many conditions are handled in parallel, the contribution of each condition remains unclear, and the focused areas become ambiguous, resulting in unpredictable clothing styles, textures, and identities. Quantitative results also show that concatenation struggles to improve the generated image quality. Thus, in B4 and B5, we adopt our MFCA without the conditions $I_{emb}$ and $F_{emb}$, respectively. Overall, this reduces unwanted artifacts due to the reduced attention regions. Qualitatively, when dropping $I_{emb}$ in B4, the cloth styles lose detail. $F_{emb}$ and $A_{emb}$ only receive the information inside the Densepose estimation, and the regions outside $A$ are randomly generated, which is not consistent with the semantic cloth style. In B5, texture improves in quality but the facial traits are almost entirely lost. This seems to confirm our quantitative findings, where the deformed and incomplete warping in the texture map affects the facial appearance. Though our method is close to B5 in terms of metrics, probably due to the fact that the face regions only occupy a small portion of the entire image, the facial traits are well preserved.
Finally, we noticed a decrease in FID performance after introducing more conditions. As reported in [24], the FID of VAE reconstruction in LDM methods is 7.967. Consequently, a lower FID in LDM-based methods does not necessarily indicate a superior overall performance. The other three metrics provide stronger quantitative evidence.
4.4. Appearance Editing
Our approach enables flexible, localized editing by adjusting specific focal conditions within the generative pipeline, allowing precise control over focal regions. Editing examples are illustrated in Fig. 5.

[Figure 5 labels: Ref Image, Source Image, Pose (three groups)]
Figure 5. Appearance editing results. Our method accepts flexible editing of given identities, poses, and clothes. This is achieved only by modifying some regions of the conditions, with no need for any masking or further training.

[Figure 6: five example rows, (1) to (5), in two groups; columns: Source Image, Warped Texture map, Target Pose, Ground Truth, Generated Image]
Figure 6. Failure cases caused by (1) Wrong target pose, (2) Incomplete texture map, (3) Squeezed texture map, (4) Missing face information, (5) Significant view changes.

By modifying the texture map $A$ for designated clothing regions, we can seamlessly alter clothing to reflect chosen reference styles, showcasing the strong control capability of our texture map focalization (row 2). Additionally, by substituting the face embedding $F_{emb}$ and updating the corresponding facial regions in the texture map, our method supports person identity swapping (row 3). This disentangling of maps, facial identity, and pose permits arbitrary combinations of identities, clothing styles, and poses.
For selective edits, such as altering only specific clothing parts, we can replace the corresponding regions within the texture map, which is particularly effective for simple clothing designs with clear texture segments (as shown in the top section of row 4). Unlike traditional diffusion-based methods, which rely on mask-based blending within latent spaces, our method provides a more streamlined and adaptable editing solution through structured modifications.
In general, our approach offers a more straightforward and flexible editing solution by solely modifying the structured conditions. This highlights the superiority of our proposed Multi-Focal Conditions Aggregation module in terms of editing capabilities. Furthermore, our editing results are more realistic than those of baseline methods [2,9,24], as they avoid the boundary artifacts often associated with mask-based techniques. A detailed comparison can be found in the supplementary materials.
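The texture-map editing described in this section amounts to replacing selected texels of the source texture map with those of a reference map before conditioning. A minimal sketch follows, assuming both maps share the same UV layout and that a binary region mask for the garment part is available; the mask and the function name are hypothetical, not part of the released code.

```python
def replace_region(source_map, reference_map, region_mask):
    """Return a copy of source_map with masked texels taken from reference_map."""
    return [
        [ref if m else src for src, ref, m in zip(src_row, ref_row, mask_row)]
        for src_row, ref_row, mask_row in zip(source_map, reference_map, region_mask)
    ]

# Toy 2x3 texture maps: swap only the right-hand column (e.g. one garment part).
source_map = [["s", "s", "s"], ["s", "s", "s"]]
reference_map = [["r", "r", "r"], ["r", "r", "r"]]
region_mask = [[0, 0, 1], [0, 0, 1]]
edited = replace_region(source_map, reference_map, region_mask)
```

Because the edit happens in the pose-invariant UV space, the same edited map can in principle be reused for any target pose, without blending masks in the latent space.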
4.5. Failure Cases
Despite achieving satisfactory appearance-preserving ability in most cases, our model occasionally fails to produce the desired results when dealing with uncommon or abrupt images, as shown in Fig. 6. We notice several failure scenarios: (1) the target pose is wrongly estimated; (2) the source texture map is missing or incomplete; (3) the source texture map is fully estimated, but its appearance is shifted due to limited pixel resolution; (4) facial traits are missing; (5) there are significant view changes that are not captured by the source image.
5. Conclusions
In this paper, we introduced the MCLD framework for
pose-guided person image generation. We addressed the
challenge of compression degradation in LDMs, especially
over sensitive regions, by developing a multi-focal
conditioning strategy that strengthens control over both
identity and appearance. Our MFCA module selectively
integrates pose-invariant focal points as conditioning inputs,
significantly enhancing the quality of the generated images.
Through extensive qualitative and quantitative evaluations,
we demonstrate that MFCA surpasses existing methods in
preserving both the appearance and identity of the subject.
Moreover, our approach enables more flexible image
editing through improved condition disentanglement. In future
work, we aim to explore 3D priors to further enhance
generation consistency and improve appearance fidelity.
Acknowledgement. This work was supported by the MUR
PNRR project FAIR (PE00000013) funded by the
NextGenerationEU and the EU Horizon project ELIAS (No.
101120237). We acknowledge the CINECA award under
the ISCRA initiative, for the availability of
high-performance computing resources and support.
References
[1] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ACM Transactions on Graphics (TOG), 2023. 1, 2
[2] Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Mubarak Shah, and Fahad Shahbaz Khan. Person image synthesis via denoising diffusion model. In CVPR, 2023. 1, 2, 5, 6, 8
[3] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017. 3
[4] Soon Yau Cheong, Armin Mustafa, and Andrew Gilbert. Upgpt: Universal diffusion model for person image generation, editing and pose transfer. In ICCV, 2023. 2
[5] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR, 2019. 4, 5
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014. 1
[7] Artur Grigorev, Artem Sevastopolsky, Alexander Vakhitov, and Victor Lempitsky. Coordinate-based texture inpainting for pose-guided human image generation. In CVPR, 2019. 2
[8] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In CVPR, 2018. 3, 4
[9] Xiao Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, and Tao Xiang. Controllable person image synthesis with pose-constrained latent diffusion. In ICCV, 2023. 1, 2, 3, 4, 5, 6, 8
[10] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017. 5
[11] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33, 2020. 1, 2
[12] Li Hu. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In CVPR, 2024. 2, 3, 4
[13] Xin Huang, Ruizhi Shao, Qi Zhang, Hongwen Zhang, Ying Feng, Yebin Liu, and Qing Wang. Humannorm: Learning normal diffusion model for high-quality and realistic 3d human generation. CVPR, 2024. 2
[14] Itseez. Open source computer vision library. https://github.com/itseez/opencv, 2015. 5
[15] Jeongho Kim, Guojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. Stableviton: Learning semantic correspondence with latent diffusion model for virtual try-on. In CVPR, 2024. 2
[16] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 5
[17] Nikos Kolotouros, Thiemo Alldieck, Andrei Zanfir, Eduard Gabriel Bazavan, Mihai Fieraru, and Cristian Sminchisescu. Dreamhuman: Animatable 3d avatars from text. NeurIPS, 2023. 2
[18] Yining Li, Chen Huang, and Chen Change Loy. Dense intrinsic appearance flow for human pose transfer. In CVPR, 2019. 2
[19] Tingting Liao, Hongwei Yi, Yuliang Xiu, Jiaxiang Tang, Yangyi Huang, Justus Thies, and Michael J Black. Tada! text to animatable digital avatars. 3DV, 2024. 2
[20] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object, 2023. 2
[21] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV. Springer, 2016. 4, 5
[22] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR, 2016. 2, 5
[23] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015. 3, 4
[24] Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, and Jianhuang Lai. Coarse-to-fine latent diffusion for pose-guided person image synthesis. In CVPR, 2024. 1, 2, 3, 5, 6, 7, 8
[25] Zhengyao Lv, Xiaoming Li, Xin Li, Fu Li, Tianwei Lin, Dongliang He, and Wangmeng Zuo. Learning semantic person image generation by region-adaptive normalization. In CVPR, 2021. 5
[26] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose guided person image generation. In NeurIPS, 2017. 2
[27] Yifang Men, Yiming Mao, Yuning Jiang, Wei-Ying Ma, and Zhouhui Lian. Controllable person image synthesis with attribute-decomposed gan. In CVPR, 2020. 1, 2
[28] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, and Ying Shan. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In AAAI, 2024. 2
[29] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023. 2
[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 2019. 5
[31] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. ICLR, 2022. 2
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,