
Expotion: Facial Expression and Motion Control for Multimodal Music Generation

Author: Fathinah Izzati; Xinyue Li; Gus Xia
Publisher: Zenodo
DOI: 10.5281/zenodo.17706414
Source: https://zenodo.org/records/17706414/files/000041.pdf
Fathinah Izzati∗ Xinyue Li∗ Gus Xia
Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates
{fathinah.izzati, xinyue.li, gus.xia}@mbzuai.ac.ae
ABSTRACT
We propose EXPOTION (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls (specifically, human facial expressions and upper-body motion) as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation. Code, demo and dataset are available at https://github.com/xinyueli2896/Expotion.git
1. INTRODUCTION
Music generation models have become increasingly versatile and interactive, capable of integrating control signals from various modalities, including audio, text, and symbolic representations (such as MIDI or musical scores). These conditions act as controls that guide the model toward producing more precise and targeted outputs aligned with user expectations. Although current text-to-music generation models can produce impressive musical quality, they often lack the fine-grained temporal control mechanisms and expressivity necessary to flexibly adapt to a wide range of real-world scenarios.

∗These authors contributed equally to this work.
© F. Izzati, X. Li, and G. Xia. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: F. Izzati, X. Li, and G. Xia, “Expotion: Facial Expression and Motion Control for Multimodal Music Generation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Figure 1. Overview of Expotion’s multimodal inference pipeline, showing how visual gestures and facial expressions guide expressive music generation.
Inspired by the idea that gestures and facial expressions can act as important guides for music, similar to what conductors do, we further explore visual controls in this study and propose Expotion, a deep music generative model with multimodal controls: facial expression and upper-body motion, as well as text prompts. Expotion is designed to synthesize high-quality music that is both expressive (ensuring that the musical content reflects the emotional and expressive cues of the face and gestures) and temporally accurate (so that every change in motion and expression is accurately mirrored in the music) with respect to the input video, as shown in Figure 1.
While some studies have explored audio and music generation tasks given videos, ranging from audio effects generation (e.g., [1–11]) and video background music generation (e.g., [12–15]) to dance music generation (e.g., [16–18]), our work focuses on the subtle dynamics of gestures and facial expressions, emphasizing their fine-grained temporal synchronization with music generation. This opens up potential applications in real-time, interactive audiovisual systems.
To achieve our goal, we first introduce a newly curated dataset comprising 7 hours of carefully synchronized video-music pairs featuring expressive gestures and facial expressions closely matched to the corresponding music. Given the limited amount of data, we employ parameter-efficient fine-tuning (PEFT) on a transformer-based text-to-music generation model [19], leveraging the powerful ability of a model pretrained on massive music-text data; such models have been shown to be effective in incorporating additional modalities [9, 10, 20–23]. By fine-tuning only 4% of the original model’s parameters [20], our method seamlessly integrates multimodal visual inputs (facial expressions and upper-body gestures) using only 130 video–audio pairs for training, thereby minimizing network complexity while ensuring robust multimodal fusion. We also propose an approach called temporal smoothing to ensure precise and efficient temporal alignment between audio and video modalities.
Our experiments show that Expotion can generate high-quality, temporally accurate music that faithfully reflects the expressive nuances inherent in the visual inputs and textual descriptions. Although textual descriptions supply the primary contextual cues (leveraging the model’s strong text-understanding capabilities to guarantee baseline music quality), the addition of visual input further enhances alignment, expressiveness, and consistency. In comprehensive subjective and objective evaluations, Expotion consistently outperforms current state-of-the-art video-to-music generation models [24] and multimodal captioning baselines across multiple metrics, including (1) Quality of Generated Music, (2) Beats and Tempo Consistency, (3) Text-Audio Similarity, and (4) Video–Music Consistency.
To the best of our knowledge, this work is the first to leverage synchronized expressive gestures and facial expressions for music generation. Our experiments show that incorporating visual features as control signals not only enhances the temporal alignment between the video and generated music but also improves text adherence and overall musical quality, highlighting the complementary strengths of both modalities. We believe that Expotion will empower artists with a more expressive, controllable, and interactive approach to music creation.
2. RELATED WORK
We review three key paradigms of controllable music generation: visual and motion-based control, textual and symbolic conditioning, and training and adaptation strategies.
Visual and Motion-Based Control. Early interactive systems mapped facial or bodily features directly to sound. Valenti et al.’s Sonify Your Face modulated audio via Bayesian classification of facial motion units [25], and Clay et al. translated whole-body emotional expressions into electronic-music parameters [26]. D2MNet extracted global style and local beat vectors from LMA-derived movement signals to drive an autoregressive generator [27]. DeepTunes [28] and [29] combine CNN-based emotion detection with GPT-2 lyric models, LSTMs, and transformers to jointly predict discrete and valence–arousal emotions and produce synchronized music (and lyrics) that closely reflect the user’s image input. Video-conditioned models such as VidMuse [24], V2Meow [15], and Video2Music [14] align music with motion cues and scene context, while Foley-style systems translate individual events to sound via latent diffusion or transformers [1,2,8].
Textual and Symbolic Conditioning. Text-to-music models like MusicLM [30] and MusicGen [19] generate high-quality audio from descriptive prompts but lack explicit time-varying control. MusicGen-Melody extends this by conditioning on a reference melody track or pitch contours [19]. CoCoMulla leverages chord charts and drum patterns to control harmony and rhythm [20], and Sketch2Sound introduces continuous vocal-imitation curves (loudness, brightness, pitch) alongside text to modulate audio generation [21].
Training and Adaptation Strategies. Many video-to-audio and music models are trained from scratch on large paired datasets, such as AudioSet [31] and VGGSound [32], to learn cross-modal correspondences [1–4, 24]. When data is limited, parameter-efficient fine-tuning (PEFT) is effective: Sketch2Sound fine-tunes a single linear layer per control signal on a frozen diffusion backbone [21], while CoCoMulla and AirGen attach lightweight adapters to MusicGen, tuning under 4% of parameters for chord and rhythm control or music inpainting [20,33]. Expotion adopts a similar PEFT approach with visual multimodal inputs.
3. METHODOLOGY
Our approach consists of 1) a joint embedding encoder to integrate temporally aligned video-based controls, and 2) a condition adaptor to fine-tune MusicGen by incorporating the learned joint visual embeddings. We froze the parameters of the vanilla MusicGen during training to preserve its text understanding ability.
3.1 Joint Visual Embeddings
To effectively incorporate facial expression and upper-body movement features from video, we adopt a joint embedding framework combining these two types of features, as shown in Figure 2. Since videos capturing facial expressions and upper-body gestures typically exhibit less dynamic motion, a relatively low frame rate is sufficient for smooth human perception. To address the discrepancy between this frame rate and the frame rate of MusicGen, we introduce targeted sampling strategies for video feature processing that differ from those applied to audio. We also propose an approach called temporal smoothing to ensure precise and efficient temporal alignment between audio and video modalities, maintaining the expressivity of the video and enhancing the overall multimodal integration.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 2. Joint embedding of visual features. Facial expression and motion features are first extracted from the given video and temporally smoothed through interpolation, producing refined intermediate representations. These temporally aligned embeddings then undergo dimensionality reduction through the low-rank projection layer. Then, the concatenated embeddings are projected to the same dimension as the MusicGen hidden layers. Finally, the positional encoding is added to the latent embeddings to form the final joint embeddings.

3.1.1 Facial Expression Embedding
To extract facial expression features from video, we employed MARLIN, a self-supervised learning framework specifically designed to derive universal facial representations from unannotated video data [34]. For each frame, MARLIN generates features that capture information from both the current frame and its neighboring frames. To temporally align these facial expression features with the audio codes produced by Encodec in MusicGen, we utilized a specialized temporal smoothing strategy: first resampling the video from 30 fps to 80 fps and then, since MARLIN processes 16 frames simultaneously, obtaining the facial expression features at a frame rate of 5 fps by setting the stride to 16 frames. To minimize information loss from this downsampling, we applied linear interpolation to the facial expression feature $z \in \mathbb{R}^{T \times 768}$, where $T$ is the total number of frames and 768 is the hidden dimension. $z_i \in \mathbb{R}^{768}$ represents the feature at the $i$-th frame. For a desired (possibly non-integer) time index $t$, let
$$i = \lfloor t \rfloor, \qquad \alpha = t - i,$$
so that $t$ lies between the $i$-th and $(i+1)$-th frames. The interpolated feature $\hat{z}_t$ is computed as
$$\hat{z}_t = (1 - \alpha)\, z_i + \alpha\, z_{i+1}. \tag{1}$$
This formula linearly weights the neighboring features, ensuring a smooth transition between frames.
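The interpolation in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function name `interpolate_features`, the 50 Hz target rate (taken from the EnCodec code rate given in Section 4.2), and the clamping of the final frame are all assumptions.

```python
import numpy as np

def interpolate_features(z: np.ndarray, src_rate: float, dst_rate: float) -> np.ndarray:
    """Linearly interpolate a (T, D) feature sequence from src_rate to dst_rate."""
    num_src = z.shape[0]
    num_dst = int(round(num_src * dst_rate / src_rate))
    # Fractional source index t for every target frame.
    t = np.arange(num_dst) * src_rate / dst_rate
    i = np.clip(np.floor(t).astype(int), 0, num_src - 1)   # i = floor(t)
    j = np.clip(i + 1, 0, num_src - 1)                      # i + 1, clamped at the last frame (assumption)
    alpha = (t - i)[:, None]                                # alpha = t - i
    return (1.0 - alpha) * z[i] + alpha * z[j]              # Eq. (1)

# Toy input: 10 seconds of 2-dimensional features at 5 fps, rising linearly.
z = np.arange(50, dtype=float)[:, None] * np.array([1.0, 2.0])
z_hat = interpolate_features(z, src_rate=5, dst_rate=50)
```

With these toy values, a 10-second clip goes from 50 feature frames to 500, one per audio code frame under the stated assumption.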
We further compress the interpolated facial features by projecting them onto a low-dimensional space with dimension $d_1$ using a trainable matrix $W \in \mathbb{R}^{d_1 \times 768}$:
$$z'_i = W^{T} \hat{z}_i \in \mathbb{R}^{d_1}. \tag{2}$$
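As a rough sketch of the projection in Eq. (2), the snippet below maps 768-dimensional interpolated features to $d_1 = 12$ dimensions (the value given in Section 4.2). The matrix is stored here with shape (768, d1) so that the product $W^{T}\hat{z}_i$ is well-defined; that storage layout, the random initialisation, and the name `project` are assumptions made for illustration.

```python
import numpy as np

HIDDEN_DIM, D1 = 768, 12          # d1 = 12 follows Section 4.2

rng = np.random.default_rng(0)
# Stand-in for the trainable matrix; stored as (768, d1) (assumed layout).
W = rng.normal(scale=HIDDEN_DIM ** -0.5, size=(HIDDEN_DIM, D1))

def project(z_hat: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Apply z'_i = W^T z_hat_i to every row of a (T, 768) sequence."""
    return z_hat @ W

z_hat = rng.normal(size=(500, HIDDEN_DIM))   # toy interpolated features
z_proj = project(z_hat, W)                   # shape (500, 12)
```

Under this layout the projection adds 768 × 12 = 9,216 trainable weights, which is small next to the frozen backbone.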
3.1.2 Motion Embedding
We compare two motion representations: one extracted from the Synchformer visual encoder [35] and another from RAFT optical flow [36].
For Synchformer, we standardize the videos' fps and segment each video into 16-frame clips with a stride of 5, yielding outputs of shape $(T, 8, \mathrm{dim})$, where the 8 dimension captures local temporal context. We flatten the first two dimensions, perform temporal interpolation as
$$\hat{z}_{m,t} = (1 - \alpha)\, z_{m,i} + \alpha\, z_{m,i+1}, \tag{3}$$
and then project the interpolated features to a lower-dimensional space using a trainable matrix $W_m \in \mathbb{R}^{d_2 \times D}$ ($D$ = original feature dimension), yielding
$$z'_{m,i} = W_m^{T} \hat{z}_{m,i} \in \mathbb{R}^{d_2}. \tag{4}$$
For RAFT, we sample the video at 5 fps to obtain frames $\{I_t\}_{t=1}^{T}$ and compute the dense optical flow $F_t \in \mathbb{R}^{H \times W \times 2}$ for each consecutive pair $(I_t, I_{t+1})$ using RAFT [36]. Each flow field is then processed by a Flow Embedding CNN $\mathcal{F}$ to yield a compact feature vector:
$$z_t^{\mathrm{flow}} = \mathcal{F}(F_t) \in \mathbb{R}^{256}. \tag{5}$$
$\mathcal{F}$ is composed of several convolutional layers with kernel size 3 and ReLU activations, followed by an adaptive average pooling operation. We perform temporal interpolation and project the resulting sequence to a lower-dimensional space in the same way as for Synchformer. This ensures that both motion representations are aligned in time and compatible with the MusicGen transformer’s input requirements.
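To make the shape bookkeeping of Eq. (5) concrete, here is a minimal NumPy sketch of a flow-embedding network. The paper specifies only 3×3 kernels, ReLU activations, adaptive average pooling, and a 256-dimensional output. The number of layers, the channel widths (2, 32, 64, 256), stride 1 with zero padding, pooling to a single spatial cell, and the random weights are all assumptions for illustration.

```python
import numpy as np

def conv3x3_relu(x, weight, bias):
    """3x3 convolution (stride 1, zero padding 1) followed by ReLU.

    x: (C_in, H, W); weight: (C_out, C_in, 3, 3); bias: (C_out,)
    """
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], height, width))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], window)
    return np.maximum(out + bias[:, None, None], 0.0)

def flow_embedding(flow, layers):
    """Map one flow field of shape (H, W, 2) to a compact feature vector."""
    x = np.transpose(flow, (2, 0, 1))          # channels first: (2, H, W)
    for weight, bias in layers:
        x = conv3x3_relu(x, weight, bias)
    return x.mean(axis=(1, 2))                 # average pooling to one value per channel

def init_layers(channels, seed=0):
    """Random stand-ins for trained convolution weights."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(scale=(9 * c_in) ** -0.5, size=(c_out, c_in, 3, 3)), np.zeros(c_out))
        for c_in, c_out in zip(channels[:-1], channels[1:])
    ]

layers = init_layers([2, 32, 64, 256])         # hypothetical channel widths
flow = np.random.default_rng(1).normal(size=(16, 16, 2))   # toy flow field
z_flow = flow_embedding(flow, layers)          # shape (256,)
```

Because the pooling averages over all spatial positions, the output size does not depend on the video resolution, which is the property adaptive pooling provides.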
3.1.3 Positional Embeddings
We define a learnable matrix $W_e \in \mathbb{R}^{(d_1 + d_2) \times d}$ to fuse the embeddings mentioned above, together with a learnable positional embedding $z_{\mathrm{pos},i} \in \mathbb{R}^{d}$ to support sequential modeling. The combined joint facial and motion embedding is computed as:
$$z_i = W_e^{T} [z'_i; z'_{m,i}] + z_{\mathrm{pos},i} \in \mathbb{R}^{d}. \tag{6}$$
Let $T$ denote the total number of frames. Then, the overall sequential joint embedding is given by:
$$z = \{z_1, z_2, \ldots, z_T\} \in \mathbb{R}^{T \times d}. \tag{7}$$
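The fusion in Eqs. (6) and (7) amounts to a concatenation, one matrix product, and an addition. The sketch below uses toy sizes: $d_1 = d_2 = 12$ follows Section 4.2, while $d = 64$ and $T = 500$ are placeholders rather than the real MusicGen hidden size. The positional term is given dimension $d$ so that the sum in Eq. (6) is well-defined, which is an assumption about the intended shapes.

```python
import numpy as np

def joint_embedding(z_face, z_motion, W_e, z_pos):
    """Concatenate per-frame features, project to d, add positions (Eq. 6).

    z_face: (T, d1), z_motion: (T, d2), W_e: (d1 + d2, d), z_pos: (T, d)
    """
    fused = np.concatenate([z_face, z_motion], axis=-1)   # (T, d1 + d2)
    return fused @ W_e + z_pos                            # (T, d), the sequence of Eq. (7)

rng = np.random.default_rng(0)
T, d1, d2, d = 500, 12, 12, 64        # d and T are toy values
z_face = rng.normal(size=(T, d1))
z_motion = rng.normal(size=(T, d2))
W_e = rng.normal(size=(d1 + d2, d))   # stand-in for the learnable fusion matrix
z_pos = rng.normal(size=(T, d))       # stand-in for the learnable positional embedding
z = joint_embedding(z_face, z_motion, W_e, z_pos)
```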
3.2 Condition Adaptor
Adopting a similar approach to CoCoMulla [20], we extend the idea of a condition adaptor to handle time-varying video inputs, such as facial expressions and motion. In a standard Transformer, each self-attention layer processes $T$ hidden embeddings (one per Encodec frame). In our approach, the final $L$ layers of the MusicGen decoder expand this to $2T$ embeddings by adding $T$ condition prefix positions, which encode the control information. Specifically, we insert a sequence of learnable input embeddings into the $(N - L + 1)$-th decoder layer to initiate the condition prefix. In this prefix, the hidden states go only through self-attention layers (omitting cross-attention). Let
$$H^{p}_{l} \in \mathbb{R}^{T \times d} \quad (N - L + 1 \le l \le N)$$
be the output of the condition prefix, with $H^{p}_{0}$ as the learnable input embeddings. The condition prefix is computed as:
$$Q^{p}_{l}, K^{p}_{l}, V^{p}_{l} = \text{QKV-projector}(H^{p}_{l} + Z_l),$$
$$H^{p}_{l+1} = \text{Self-Attention}(Q^{p}_{l}, K^{p}_{l}, V^{p}_{l}),$$
where $Z_l$ are the sequential joint embeddings (defined in Eq. (7)). No causal mask is applied here, and the condition prefix does not attend to the Encodec tokens.
For the remaining part, the hidden states $H_l \in \mathbb{R}^{T \times d}$ (for $1 \le l \le N$) are processed normally. Their standard attention output $S_l$ is computed as:
$$Q_l, K_l, V_l = \text{QKV-projector}(H_l),$$
$$S_l = \text{Self-Attention}(Q_l, K_l, V_l).$$
To incorporate condition information in the last $L$ layers, we compute cross attention $S'_l$ between $Q_l$ and $\{K^{p}_{l}, V^{p}_{l}\}$ using self-attention, fusing $Q_l$ with $Q^{p}_{l}$:
$$S'_l = \text{Self-Attention}(Q_l + Q^{p}_{l}, K^{p}_{l}, V^{p}_{l}).$$
A learnable gating factor $g_l$ (initialized to zero) combines the outputs:
$$H_{l+1} = \text{Cross-Attention}(S_l + g_l \cdot S'_l, \text{text}).$$
In our implementation, all MusicGen layers (including QKV-projector, Self-Attention, and Cross-Attention) are frozen; only $H^{p}_{0}$, $W_p$, $W_a$, $W_e$, $z_{\mathrm{pos}}$, and $g_l$ are trainable.
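The gated combination above can be illustrated with a single-head NumPy sketch of one adaptor layer. It is a simplification under several assumptions: one attention head, the same frozen projection weights applied to both branches, toy dimensions, and no text cross-attention (the final step of the layer is omitted). The names `attention` and `adaptor_layer` are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v, causal=False):
    """Scaled dot-product attention for a single head."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def adaptor_layer(H, H_p, Z, Wq, Wk, Wv, gate):
    """One of the last L layers, without the text cross-attention step."""
    # Condition prefix: projector applied to H_p + Z, no causal mask.
    cond = H_p + Z
    Q_p, K_p, V_p = cond @ Wq, cond @ Wk, cond @ Wv
    H_p_next = attention(Q_p, K_p, V_p)
    # Audio tokens: standard causal self-attention gives S.
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    S = attention(Q, K, V, causal=True)
    # Tokens attend to the prefix with the fused query Q + Q_p, giving S'.
    S_cond = attention(Q + Q_p, K_p, V_p)
    return S + gate * S_cond, H_p_next

rng = np.random.default_rng(0)
T, d = 6, 8                                   # toy sizes
H, H_p, Z = (rng.normal(size=(T, d)) for _ in range(3))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

plain = attention(H @ Wq, H @ Wk, H @ Wv, causal=True)
at_init, H_p_next = adaptor_layer(H, H_p, Z, Wq, Wk, Wv, gate=0.0)
after_training, _ = adaptor_layer(H, H_p, Z, Wq, Wk, Wv, gate=0.5)
```

With the gate at zero the layer output equals the frozen model's self-attention output, so training starts from the pretrained behaviour and the visual condition is blended in only as the gate moves away from zero.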
4. EXPERIMENTS
4.1 Dataset
Due to the lack of sufficient paired video-audio data with clear facial features, we curated our own dataset by collecting the data manually. We recruited volunteers to record their facial expressions and upper-body movements while listening to 30-second audio clips. Before starting the recording, the volunteers were asked to listen to the music track once, allowing them time to think about the facial expressions and body movements that would align with the music. The audio clips used were licensed instrumental tracks from Epidemic Sound, ensuring no vocals were present. The collection includes a variety of music genres, such as pop, jazz, blues, classical, and epic, among others. We were able to collect 7 hours of paired video-audio data. The paired video-audio data are then chopped into 10-second clips. We set aside 30 minutes of data for test and validation. We also generated captions for each audio clip with the audio-captioning model SALMONN [37] to be used as text prompts during training and inference. The prompt given to SALMONN for captioning is ‘Please describe the music’. In the data collection process, volunteers gave informed consent, agreeing that their videos would be used only for research and anonymized to protect their privacy.
4.2 Implementation
Our base model, MusicGen (text-only), consists of three main parts: a pre-trained EnCodec, a pre-trained T5 encoder, and an acoustic transformer decoder. The decoder has 48 layers, each with causal self-attention and cross-attention to process text prompts. MusicGen uses EnCodec, a Residual Vector Quantization (RVQ) auto-encoder [38], to convert audio sampled at 32,000 Hz into discrete codes at 50 Hz, which are then passed to the transformer decoder.
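The rates quoted in Sections 3.1.1, 4.1 and 4.2 fit together as follows. This is plain arithmetic on figures stated in the text; the only assumption is that the interpolation target for the visual features is the 50 Hz code rate.

```python
SAMPLE_RATE_HZ = 32_000        # EnCodec input sample rate
CODE_RATE_HZ = 50              # discrete code frames per second
CLIP_SECONDS = 10              # clip length used for training
RESAMPLED_VIDEO_FPS = 80       # video rate after resampling from 30 fps
MARLIN_STRIDE = 16             # frames between consecutive MARLIN windows

samples_per_code_frame = SAMPLE_RATE_HZ // CODE_RATE_HZ        # 640 audio samples
code_frames_per_clip = CODE_RATE_HZ * CLIP_SECONDS             # 500 code frames (T)
face_feature_fps = RESAMPLED_VIDEO_FPS // MARLIN_STRIDE        # 5 features per second
face_features_per_clip = face_feature_fps * CLIP_SECONDS       # 50 features
upsampling_factor = CODE_RATE_HZ // face_feature_fps           # 10x via interpolation
```

Each 10-second clip therefore yields 50 facial features that must be stretched to 500 positions, which is the gap the temporal smoothing step is meant to bridge.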
We trained the proposed model using four A1000 GPUs, employing an initial learning rate of 1e-02 and a batch size of 10 consisting of ten 10-second audio samples, for 40 epochs. Training was stopped after 40 epochs to prevent overfitting. During training, the model’s parameters were updated using a cross-entropy reconstruction loss. In the low-rank projection step, we set $d_1$ in Equation (2) and $d_2$ in Equation (4) to be 12.
4.3 Baselines
Since our model addresses a novel domain for which no existing open-source models are specifically trained with both video (facial expression and motion) and text controls, we selected vanilla MusicGen (text-conditioned) and two recent video-conditioned music generation models, VidMuse [24] and Video2Music [39], as our baselines. Video2Music uses an Affective Multimodal Transformer to generate emotionally aligned, expressive symbolic music (chords) from video inputs by leveraging semantic, motion, scene, and emotion features, while VidMuse integrates both local and global visual cues through a Long-Short-Term Visual Module.
4.4 Evaluation
Our evaluation consists of two parts: subjective and objective evaluations. Because MusicGen only accepts textual input, we generate multimodal captions from the video–audio pairs to serve as prompts for the text-only MusicGen baseline. This approach ensures a fair comparison by providing equivalent descriptive information across all models.

| Model | Motion Feat. | Prompt | FAD-VGG↓ | Avg. KL↓ | Avg. IS Score↑ | Tempo Err. (bpm)↓ | F1-score (Beat)↑ | Text-Audio↑ | Video-Audio↑ |
|---|---|---|---|---|---|---|---|---|---|
| face | – | Generated | 2.52 | 0.74 | 1.45 | 31.53 | 0.38 | 0.48 | 0.55 |
| motion | RAFT | Generic | 3.56 | 1.05 | 1.16 | 28.07 | 0.37 | 0.38 | 0.65 |
| motion | RAFT | Generated | 2.25 | 0.66 | 1.53 | 33.71 | 0.37 | 0.52 | 0.59 |
| motion | Synchformer | Generic | 3.52 | 1.08 | 1.20 | 31.03 | 0.35 | 0.37 | 0.61 |
| motion | Synchformer | Generated | 1.93 | 0.65 | 1.57 | 32.89 | 0.36 | 0.52 | 0.61 |
| face+motion | Synchformer | Generated | 2.55 | 0.67 | 1.49 | 32.72 | 0.37 | 0.54 | 0.59 |
| MusicGen | – | Generated | 2.76 | 0.79 | 1.54 | 35.84 | 0.33 | 0.50 | 0.42 |
| Video2Music | – | – | 18.97 | 1.32 | 1.01 | 26.13 | 0.38 | 0.25 | 0.59 |
| VidMuse | – | – | 9.91 | 1.10 | 1.33 | 33.01 | 0.37 | 0.36 | 0.52 |

Table 1. Comparison of music quality, rhythm alignment, and text/video-audio consistency metrics across different configurations. Column groups: General Music Quality (FAD-VGG, Avg. KL, Avg. IS Score), Rhythm Alignment (Tempo Err., F1-score), and Text/Video-Audio Consistency (Text-Audio, Video-Audio).
4.4.1 Objective Evaluation
In our objective evaluation, we assess the generated music from three perspectives: (1) the inherent quality, (2) rhythm alignment, and (3) text/video-audio consistency. To measure the inherent quality of the music, we employed Fréchet Audio Distance (FAD) with VGGish embeddings, which measures perceptual similarity to real music [40, 41]; Kullback-Leibler (KL) divergence, which quantifies distributional alignment with real audio labels [42,43]; and Inter-Sample Score (IS Score), which captures the diversity among generated samples [44]. To evaluate the rhythm alignment between generated and ground-truth music, we calculated the tempo error, which is defined as how far an estimated BPM (beats per minute) deviates from the true (ground-truth) BPM, and beat consistency between the generated and reference music. To evaluate the text/video-audio consistency, we use CLAP [45] and LanguageBind [46], respectively. We employ these models, trained with a contrastive learning approach, to measure the model’s ability to generate music that reflects the semantic meaning of the text and video.
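The two rhythm metrics can be sketched as below. The paper does not give the exact definitions used, so this is an illustration under stated assumptions: tempo error is taken as the absolute BPM difference, and beat consistency as an F1-score in which an estimated beat counts as correct when it falls within a tolerance window of a reference beat (70 ms is a common choice, assumed here). The greedy one-to-one matching and the function names are also assumptions.

```python
def tempo_error(estimated_bpm: float, reference_bpm: float) -> float:
    """Absolute deviation of the estimated tempo from the reference, in BPM."""
    return abs(estimated_bpm - reference_bpm)

def beat_f1(estimated, reference, tolerance=0.07):
    """F1-score between two lists of beat times in seconds."""
    estimated, reference = sorted(estimated), sorted(reference)
    matched, j = 0, 0
    for ref in reference:
        # Skip estimates that are too early to match this reference beat.
        while j < len(estimated) and estimated[j] < ref - tolerance:
            j += 1
        if j < len(estimated) and abs(estimated[j] - ref) <= tolerance:
            matched += 1
            j += 1                      # each estimate matches at most one beat
    if matched == 0:
        return 0.0
    precision = matched / len(estimated)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)

reference_beats = [0.5, 1.0, 1.5, 2.0]
estimated_beats = [0.52, 1.0, 1.7, 2.05]     # third beat is 200 ms late
score = beat_f1(estimated_beats, reference_beats)
error = tempo_error(120.0, 128.0)
```

In this toy case three of four beats match, so precision and recall are both 0.75. Averaging `tempo_error` over all test clips would give a figure comparable in kind to the Tempo Err. column of Table 1.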
4.4.2 Subjective Evaluation
We gathered participants with varying levels of musical background to rate the generated music, across various configurations and the baseline, over five groups of six videos. They were shown the text prompts without being informed which model produced each track. Ratings were based on the following criteria:
• Musicality: How effectively the audio captures key musical qualities.
• Text-audio Similarity: The degree of alignment between the generated music and the provided textual prompt.
• Video-audio Consistency: The extent to which the music corresponds with the video content in terms of tempo and emotional expression.
• Creativity: The uniqueness and innovativeness of the generated audio.
4.5 Ablation Studies
We conducted ablation studies to evaluate the effects of various experimental setups. Specifically, these studies explored the following model setups: (1) the use of different motion features (RAFT versus Synchformer) and (2) the use of the generic prompt ‘music with catchy melody’ versus detailed prompts generated by the audio-captioning model.
5. RESULTS
5.1 Objective Evaluations
General Music Quality. The results in Table 1 demonstrate that models incorporating motion information, particularly those using the Synchformer features with generated prompts, outperform the others, and especially the baselines, across general music quality metrics. This configuration performs best overall, with the lowest FAD (1.93) and KL (0.65) scores and the highest IS Score (1.57), indicating realistic, diverse, and well-aligned music generation. Comparatively, models using the RAFT features perform less consistently, and those using only facial input or generic prompts yield weaker results. Baseline methods like VidMuse and Video2Music show significantly poorer FAD and KL scores, highlighting the advantage of the multimodal control in our model.
Rhythm Alignment. Among all the proposed configurations tested, the model trained with RAFT motion features and generic captions achieves the lowest Average Tempo Error (28.07 BPM), indicating better temporal alignment with the audio than the other proposed methods. Video2Music achieves the lowest tempo error overall (26.13 BPM) because it transcribes the audio into MIDI and computes rhythmic characteristics in the form of note density and loudness from the audio as proxies of the music’s rhythms [14]. The baseline model MusicGen shows the poorest performance in all tempo and beat tracking metrics (a tempo error of 35.84 BPM and a beat F1-score of 0.33), underscoring its non-expressiveness in controlling generated music.
Text/Video-Audio Consistency. The text-audio similarity scores reveal that multimodal conditioning (face and motion) enhances text alignment compared to the text-only conditioning baseline (MusicGen), although the visual inputs do not provide explicit textual information. The face+motion model achieves the highest text-music score (0.54). In contrast, video-audio similarity scores show that models trained with generic (non-descriptive) captions achieve stronger alignment with video (e.g., 0.65 and 0.61), suggesting that text conditioning may not benefit, and may even hinder, video-audio consistency. The poor video-audio alignment of MusicGen outputs and the low text-audio alignment in music generated by Video2Music and VidMuse are expected, as these models are not conditioned on the respective modalities.
5.2 Subjective Evaluation
Figure 3. Subjective Evaluation Results on Four Metrics

Figure 3 presents a subjective analysis of the quality of music generated by five models: vanilla MusicGen (baseline) and four variants of our model trained with different configurations. A total of 13 participants (6 female, 7 male) aged 18–40 years (Median = 27) took part. Based on self-reported musical training, 20% of participants have a beginner level of experience in music, 20% intermediate, and 60% professional.
Notably, all of our models except for one outperform the baseline in creativity and musicality, suggesting that incorporating facial expressions and motion cues enables the system to better capture the expressive qualities of the input video. This expressiveness is reflected in the generated music, which participants perceive as more musical and creative. Our model trained with only motion features and generic prompts performs poorly in all evaluation metrics, indicating that generic textual prompts are insufficient to guide high-quality music generation. Without meaningful textual context, visual features alone do not provide enough semantic grounding to produce good music. The superior performance of motion features over facial features in both objective and subjective evaluations likely stems from the nature of the dataset: participants found it easier to convey musical cues through movements rather than facial expressions, resulting in richer and more expressive motion data. These findings highlight that visual and textual modalities are complementary: while textual input provides semantic intent, visual features, especially motion, enrich the generated music’s expressiveness. Overall, our model, integrating features from both modalities, is capable of producing music that is coherent, expressive, and creative as perceived by humans.
5.3 Ablation Studies
5.3.1 RAFT vs. Synchformer
The comparison between RAFT and Synchformer as motion feature extractors reveals notable differences in general music quality but not in rhythm-related metrics, as shown in Table 1. With generated prompts, Synchformer outperforms RAFT on FAD (1.93 versus 2.25) and IS Score (1.57 versus 1.53), indicating more realistic and slightly more diverse music generation. However, there is no significant difference between these two motion features in terms of tempo error and beat accuracy of the generated music. These findings suggest that Synchformer may capture more expressive and semantically rich motion patterns, whereas RAFT may be better at preserving rhythmic consistency in simple contexts.
5.3.2 Generic vs. Generated Captions
Comparing models trained with generated captions to those using generic captions, we observed that although music generated with generic captions yielded lower CLAP scores, likely due to receiving minimal textual information, these models achieved better tempo accuracy than those trained with generated prompts (a tempo error of 28.07 BPM versus 33.71 BPM when comparing models trained with the same configuration except for the choice of prompts). This suggests that adding extra textual context may introduce noise or distract from the purely visual motion cues, ultimately reducing temporal accuracy.
6. CONCLUSION
Expotion demonstrates that visual cues, specifically body movements and facial expressions, can effectively serve as expressive controls for music generation. By leveraging a pretrained text-to-music model [19] and applying parameter-efficient fine-tuning, our approach achieves notable improvements over the original text-only conditioning MusicGen using only 130 clips (6 hours) of training data in 40 epochs. Integrating these multimodal signals with textual prompts, Expotion produces music that shows strength in musicality, creativity, and temporal accuracy, as evidenced by enhanced beat, tempo, and overall semantic consistency across text, video, and audio. A temporal smoothing strategy further ensures fine-grained alignment between the visual cues and the generated music. Our results outperform state-of-the-art baselines, and subjective studies confirm that the combination of facial and motion features yields superior performance, while objective evaluations highlight that motion features, particularly those extracted via Synchformer, strike an optimal balance between rhythmic consistency and expressive dynamics. Overall, Expotion represents a promising step toward more expressive, controllable, and interactive audiovisual music generation systems.
7. REFERENCES
[1] H. K. Cheng, M. Ishii, A. Hayakawa, T. Shibuya, A. Schwing, and Y. Mitsufuji, “Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis,” CVPR, 2025.
[2] S. Luo, C. Yan, C. Hu, and H. Zhao, “Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
[3] I. Viertola, V. Iashin, and E. Rahtu, “Temporally Aligned Audio for Video with Autoregression,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[4] Y. Zhang, Y. Gu, Y. Zeng, Z. Xing, Y. Wang, Z. Wu, and K. Chen, “FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds,” arXiv preprint arXiv:2407.01494, 2024.
[5] Y. Wang, W. Guo, R. Huang, J. Huang, Z. Wang, F. You, R. Li, and Z. Zhao, “Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching,” in Advances in Neural Information Processing Systems (NeurIPS), 2024.
[6] M. Sun, W. Wang, Y. Qiao, J. Sun, Z. Qin, L. Guo, X. Zhu, and J. Liu, “MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation,” in Proceedings of the 32nd ACM International Conference on Multimedia (ACM MM), 2024.
[7] S. Yang, Z. Zhong, M. Zhao, S. Takahashi, M. Ishii, T. Shibuya, and Y. Mitsufuji, “Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation,” arXiv preprint arXiv:2405.14598, 2024.
[8] J. Lee, J. Im, D. Kim, and J. Nam, “Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound,” arXiv preprint arXiv:2408.11915, 2024.
[9] X. Liu, K. Su, and E. Shlizerman, “Tell what you hear from what you see: Video to audio generation through text,” arXiv preprint arXiv:2402.05937, 2024.
[10] S. Mo, J. Shi, and Y. Tian, “Text-to-audio generation synchronized with videos,” arXiv preprint arXiv:2403.07055, 2024.
[11] Y. Du, Z. Chen, J. Salamon, B. Russell, and A. Owens, “Conditional Generation of Audio from Video via Foley Analogies,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2426–2436.
[12] R. Li, S. Zheng, X. Cheng, Z. Zhang, S. Ji, and Z. Zhao, “MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization,” arXiv preprint arXiv:2410.12957, 2024.
[13] Y.-B. Lin, Y. Tian, L. Yang, G. Bertasius, and H. Wang, “VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
[14] J. Kang, S. Po ia, and D. He emans, “Video2Music:
Sui able Music Gene a ion om Videos using an A -
ec i e Mul imodal T ans o me model,” Expe Sys-
ems wi h Applica ions, ol. 249, p. 123640, 2024.
[15] K. Su, J. Y. Li, Q. Huang, D. Kuzmin, J. Lee, C. Don-
ahue, F. Sha, A. Jansen, Y. Wang, M. Ve ze i, and
T. I. Denk, “V2Meow: Meowing o he Visual Bea
ia Video- o-Music Gene a ion,” in P oceedings o
he AAAI Con e ence on A i icial In elligence (AAAI),
2024, pp. 4952–4960.
[16] X. Liang, W. Li, L. Huang, and C. Gao, “DanceCom-
pose : Dance- o-Music Gene a ion Using a P og es-
si e Condi ional Music Gene a o ,” IEEE T ansac ions
on Mul imedia, 2024.
[17] Y. Zhu, Y. Wu, K. Olszewski, J. Ren, S. Tulyako ,
and Y. Yan, “Disc e e Con as i e Di usion o C oss-
Modal Music and Image Gene a ion,” in P oceedings
o he In e na ional Con e ence on Lea ning Rep esen-
a ions (ICLR), 2023.
[18] J. Yu, Y. Wang, X. Chen, X. Sun, and Y. Qiao,
“Long-Te m Rhy hmic Video Sound acke ,” in P o-
ceedings o he 40 h In e na ional Con e ence on Ma-
chine Lea ning (ICML), 2023.
[19] J. Cope , F. K euk, I. Ga , T. Remez, D. Kan , G. Syn-
nae e, Y. Adi, and A. Dé ossez, “Simple and con ol-
lable music gene a ion,” in Ad ances in Neu al In o -
ma ion P ocessing Sys ems (Neu IPS), 2024.
[20] L. Lin, G. Xia, J. Jiang, and Y. Zhang, “Con en -
based con ols o music la ge language modeling,”
in P oceedings o he In e na ional Socie y o Music
In o ma ion Re ie al Con e ence (ISMIR), 2024.
[Online]. A ailable: h ps://a xi .o g/abs/2310.17162
[21] H. F. Ga cia, O. Nie o, J. Salamon, B. Pa do, and
P. See ha aman, “Ske ch2sound: Con ollable audio
gene a ion ia ime- a ying signals and sonic imi a-
ions,” a Xi p ep in a Xi :2402.13253, 2024.
[22] A. Guzho , F. Raue, J. Hees, and A. Dengel, “Au-
dioclip: Ex ending clip o image, ex and audio,”
in ICASSP 2022 - IEEE In e na ional Con e ence on
Acous ics, Speech and Signal P ocessing (ICASSP).
IEEE, 2022, pp. 976–980.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
the limits of transfer learning with a unified text-to-text
transformer,” Journal of Machine Learning Research,
vol. 21, no. 140, pp. 1–67, 2020.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[24] Z. Tian, Z. Liu, R. Yuan, J. Pan, Q. Liu, X. Tan,
Q. Chen, W. Xue, and Y. Guo, “Vidmuse: A simple
video-to-music generation framework with long-short-term
modeling,” CVPR, 2025.
[25] R. Valenti, A. Jaimes, and N. Sebe, “Sonify Your Face:
Facial Expressions for Sound Generation,” in Proceedings
of the 2010 ACM Multimedia Workshop on Visual
Media Interpretation and Understanding (VMIU),
2010.
[26] A. Clay, N. Couture, E. Decarsin, M. Desainte-Catherine,
P.-H. Vulliard, and J. Larralde, “Movement
to emotions to music: using whole body emotional expression
as an interaction for electronic music generation,”
in Proceedings of the International Conference
on New Interfaces for Musical Expression (NIME),
2012.
[27] J. Huang, X. Huang, L. Yang, and Z. Tao, “D2MNet for
music generation jointly driven by facial expressions
and dance movements,” Array, 2024.
[28] V. P, P. A, S. G. Vasist, S. Rao, and K. S. Srinivas,
“DeepTunes: Music Generation based on Facial Emotions
using Deep Learning,” in International Conference
on Intelligent Computing and Technology (I2CT),
2022.
[29] J. Huang, X. Huang, L. Yang, and Z. Tao, “A Continuous
Emotional Music Generation System Based on Facial
Expressions,” in Proceedings of the International
Conference on Intelligent Data (ICID), 2022.
[30] G. Research, “Musiclm: Generating music from text,”
2023, preprint available on arXiv.
[31] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen,
W. Lawrence, and R. C. Moore, “Audio set: An ontology
and human-labeled dataset for audio events,”
in 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2017,
pp. 776–780.
[32] H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Vggsound:
A large-scale audio-visual dataset,” in Proceedings
of the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
2020, pp. 721–725.
[33] L. Lin, G. Xia, Y. Zhang, and J. Jiang, “Arrange, inpaint,
and refine: Steerable long-term music audio generation
and editing via content-based controls,” in Proceedings
of the 32nd International Joint Conference on
Artificial Intelligence (IJCAI). IJCAI, 2024.
[34] Z. Cai, S. Ghosh, K. Stefanov, A. Dhall, J. Cai,
H. Rezatofighi, R. Haffari, and M. Hayat, “Marlin:
Masked autoencoder for facial video representation
learning,” in CVPR. CVPR, 2023.
[35] V. Iashin, W. Xie, E. Rahtu, and A. Zisserman, “Synchformer:
Efficient synchronization from sparse cues,”
in ICASSP 2024 - 2024 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2024.
[36] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field
transforms for optical flow,” in European Conference
on Computer Vision (ECCV). Springer, 2020, pp.
402–419.
[37] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu,
Z. Ma, and C. Zhang, “Salmonn: Towards generic
hearing abilities for large language models,” in Proceedings
of the International Conference on Learning
Representations (ICLR), 2024.
[38] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High
fidelity neural audio compression,” 2022. [Online].
Available: https://arxiv.org/abs/2210.13438
[39] J. Kang, S. Poria, and D. Herremans, “Video2music:
Suitable music generation from videos using an
affective multimodal transformer model,” Expert
Systems with Applications, vol. 249, p. 123640, Sep.
2024. [Online]. Available:
http://dx.doi.org/10.1016/j.eswa.2024.123640
[40] K. Kilgour, R. Clark, K. Simonyan, and M. Sharifi,
“Frechet audio distance: A reference-free metric for
evaluating music enhancement algorithms,” in Interspeech,
2019, pp. 2350–2354.
[41] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke,
A. Jansen, R. Moore, M. Plakal, D. Platt, R. A.
Saurous, B. Seybold, M. Slaney, R. J. Weiss, and
K. Wilson, “Cnn architectures for large-scale audio
classification,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2017, pp. 131–135.
[42] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and
M. D. Plumbley, “Panns: Large-scale pretrained audio
neural networks for audio pattern recognition,”
IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 28, pp. 2880–2894, 2020.
[43] K. Koutini, H. Eghbal-zadeh, M. Widrich, J. Brandstetter,
A. Thakur, V. Berenz, T. Mörwald, S. Hochreiter,
and B. Hammer, “Efficient training of audio transformers
with patchout,” in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2022, pp. 874–878.
[44] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung,
A. Radford, and X. Chen, “Improved techniques for
training gans,” in Advances in Neural Information Processing
Systems (NeurIPS), 2016, pp. 2234–2242.
[45] Y. Wu, K. Chen, T. Zhang, Y. Hui, M. Nezhurina,
T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale
contrastive language-audio pretraining with feature
fusion and keyword-to-caption augmentation,” in
Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
2023. [Online]. Available:
https://arxiv.org/abs/2211.06687
[46] B. Zhu, B. Lin, M. Ning, Y. Yan, J. Cui, H. Wang,
Y. Pang, W. Jiang, J. Zhang, Z. Li, W. Zhang, Z. Li,
W. Liu, and L. Yuan, “Languagebind: Extending
video-language pretraining to n-modality by language-based
semantic alignment,” 2024. [Online]. Available:
https://arxiv.org/abs/2310.01852