Expotion: Facial Expression and Motion Control for Multimodal Music Generation

Author: Fathinah Izzati; Xinyue Li; Gus Xia

Publisher: Zenodo

DOI: 10.5281/zenodo.17706414

Source: https://zenodo.org/records/17706414/files/000041.pdf

EXPOTION: FACIAL EXPRESSION AND MOTION CONTROL FOR
MULTIMODAL MUSIC GENERATION
Fathinah Izzati∗ Xinyue Li∗ Gus Xia
Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates
{fathinah.izzati, xinyue.li, gus.xia}@mbzuai.ac.ae
ABSTRACT
We propose EXPOTION (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls (specifically, human facial expressions and upper-body motion) as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation. Code, demo and dataset are available at https://github.com/xinyueli2896/Expotion.git
1. INTRODUCTION
Music generation models have become increasingly versatile and interactive, capable of integrating control signals from various modalities, including audio, text, and symbolic representations (such as MIDI or musical scores). These conditions act as control to guide the model toward producing more precise and targeted outputs aligned with user expectations. Although current text-to-music generation models can produce impressive musical quality, they often lack the fine-grained temporal control mechanisms and expressivity necessary to flexibly adapt to a wide range of real-world scenarios.

∗These authors contributed equally to this work.
© F. Izzati, X. Li, and G. Xia. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: F. Izzati, X. Li, and G. Xia, “Expotion: Facial Expression and Motion Control for Multimodal Music Generation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Figure 1. Overview of Expotion’s multimodal inference pipeline, showing how visual gestures and facial expressions guide expressive music generation.
Inspired by the idea that gestures and facial expressions can act as important guides for music, similar to what conductors do, we further explore visual controls in this study and propose Expotion, a deep music generative model with multimodal controls: facial expression and upper-body motion, as well as text prompts. Expotion is designed to synthesize high-quality music that is both expressive (ensuring that the musical content reflects the emotional and expressive cues of the face and gestures) and temporally accurate (so that every change in motion and expression is accurately mirrored in the music) with the input video, as shown in Figure 1.
While some studies have explored audio and music generation tasks given videos, ranging from audio effects generation (e.g., [1–11]) and video background music generation (e.g., [12–15]) to dance music generation (e.g., [16–18]), among others, our work focuses on the subtle dynamics of gestures and facial expressions, emphasizing their fine-grained temporal synchronization with music generation. This opens up potential applications in real-time, interactive audiovisual systems.
To achieve our goal, we first introduce a newly curated dataset comprising 7 hours of carefully synchronized video-music pairs featuring expressive gestures and facial expressions closely matched to the corresponding music. Given the limited amount of data, we employ parameter-efficient fine-tuning (PEFT) on a transformer-based text-to-music generation model [19], leveraging the powerful ability of models pretrained on massive music-text data, which have been shown to be effective in incorporating additional modalities [9, 10, 20–23]. By fine-tuning only 4% of the original model’s parameters [20], our method seamlessly integrates multimodal visual inputs (facial expressions and upper-body gestures) using only 130 video–audio pairs for training, thereby minimizing network complexity while ensuring robust multimodal fusion. We also propose an approach called temporal smoothing to ensure precise and efficient temporal alignment between audio and video modalities.
Our experiments show that Expotion can generate high-quality, temporally accurate music that faithfully reflects the expressive nuances inherent in the visual inputs and textual descriptions. Although textual descriptions supply the primary contextual cues (leveraging the model’s strong text-understanding capabilities to guarantee baseline music quality), the addition of visual input further enhances alignment, expressiveness, and consistency. In comprehensive subjective and objective evaluations, Expotion consistently outperforms current state-of-the-art video-to-music generation models [24] and multimodal captioning baselines across multiple metrics, including (1) Quality of Generated Music, (2) Beats and Tempo Consistency, (3) Text-Audio Similarity, and (4) Video–Music Consistency.
To the best of our knowledge, this work is the first to leverage synchronized expressive gestures and facial expressions for music generation. Our experiments show that incorporating visual features as control signals not only enhances the temporal alignment between the video and generated music but also improves text adherence and overall musical quality, highlighting the complementary strengths of both modalities. We believe that Expotion will empower artists with a more expressive, controllable, and interactive approach to music creation.
2. RELATED WORK
We review three key paradigms for controllable music generation: visual and motion-based control, textual and symbolic conditioning, and training and adaptation strategies.
Visual and Motion-Based Control. Early interactive systems mapped facial or bodily features directly to sound. Valenti et al.’s Sonify Your Face modulated audio via Bayesian classification of facial motion units [25], and Clay et al. translated whole-body emotional expressions into electronic-music parameters [26]. D2MNet extracted global style and local beat vectors from LMA-derived movement signals to drive an autoregressive generator [27]. DeepTunes [28] and [29] combine CNN-based emotion detection with GPT-2 lyric models, LSTMs, and transformers to jointly predict discrete and valence–arousal emotions and produce synchronized music (and lyrics) that closely reflect the user’s image input. Video-conditioned models such as VidMuse [24], V2Meow [15], and Video2Music [14] align music with motion cues and scene context, while Foley-style systems translate individual events to sound via latent diffusion or transformers [1,2,8].
Textual and Symbolic Conditioning. Text-to-music models like MusicLM [30] and MusicGen [19] generate high-quality audio from descriptive prompts but lack explicit time-varying control. MusicGen-Melody extends this by conditioning on a reference melody track or pitch contours [19]. CoCoMulla leverages chord charts and drum patterns to control harmony and rhythm [20], and Sketch2Sound introduces continuous vocal-imitation curves (loudness, brightness, pitch) alongside text to modulate audio generation [21].
Training and Adaptation Strategies. Many video-to-audio and music models are trained from scratch on large paired datasets, such as AudioSet [31] and VGGSound [32], to learn cross-modal correspondences [1–4, 24]. When data is limited, parameter-efficient fine-tuning (PEFT) is effective: Sketch2Sound fine-tunes a single linear layer per control signal on a frozen diffusion backbone [21], while CoCoMulla and AirGen attach lightweight adapters to MusicGen, tuning under 4% of parameters for chord and rhythm control or music inpainting [20,33]. Expotion adopts a similar PEFT approach with visual multimodal inputs.
3. METHODOLOGY
Our approach consists of two components: 1) a joint embedding encoder to integrate temporally aligned video-based controls, and 2) a condition adaptor to fine-tune MusicGen by incorporating the learned joint visual embeddings. We froze the parameters of the vanilla MusicGen during training to preserve its text understanding ability.
3.1 Joint Visual Embeddings
To effectively incorporate facial expression and upper-body movement features from video, we adopt a joint embedding framework combining these two types of features, as shown in Figure 2. Since videos capturing facial expressions and upper-body gestures typically exhibit less dynamic motion, a relatively low frame rate is sufficient for smooth human perception. To address the discrepancy between this frame rate and the frame rate of MusicGen, we introduce targeted sampling strategies for video feature processing that differ from those applied to audio. We also propose an approach called temporal smoothing to ensure precise and efficient temporal alignment between audio and video modalities, maintaining the expressivity of video, and enhancing the overall multimodal integration.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 2. Joint embedding of visual features. Facial expression and motion features are first extracted from the given video and temporally smoothed through interpolation, producing refined intermediate representations. These temporally aligned embeddings then undergo dimensionality reduction through the low-rank projection layer. Then, the concatenated embeddings are projected to the same dimension as the MusicGen hidden layers. Finally, the positional encoding is added to the latent embeddings to form the final joint embeddings.

3.1.1 Facial Expression Embedding
To extract facial expression features from video, we employed MARLIN, a self-supervised learning framework specifically designed to derive universal facial representations from unannotated video data [34]. For each frame, MARLIN generates features that capture information from both the current frame and its neighboring frames. To temporally align these facial expression features with the audio codes produced by Encodec in MusicGen, we utilized a specialized temporal smoothing strategy: first resampling the video from 30 fps to 80 fps and then, since MARLIN processes 16 frames simultaneously, obtaining the facial expression features at a frame rate of 5 fps by setting the stride to 16 frames. To minimize information loss from this downsampling, we applied linear interpolation to the facial expression feature $z \in \mathbb{R}^{T \times 768}$, where $T$ is the total number of frames and 768 is the hidden dimension. $z_i \in \mathbb{R}^{d_1}$ represents the feature at the $i$-th frame. For a desired (possibly non-integer) time index $t$, let
\[ i = \lfloor t \rfloor, \qquad \alpha = t - i, \]
so that $t$ lies between the $i$-th and $(i+1)$-th frames. The interpolated feature $\hat{z}_t$ is computed as
\[ \hat{z}_t = (1 - \alpha)\, z_i + \alpha\, z_{i+1}. \tag{1} \]
This formula linearly weights the neighboring features, ensuring a smooth transition between frames.
We further compress the interpolated facial features by projecting them onto a low-dimensional space with dimension $d_1$ using a trainable matrix $W \in \mathbb{R}^{d_1 \times 768}$:
\[ z'_i = W^{T} \hat{z}_i \in \mathbb{R}^{d_1}. \tag{2} \]
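The temporal smoothing and low-rank projection of Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `temporal_smooth`, the random stand-in features, and the choice of the 50 Hz EnCodec code rate as interpolation target are all assumptions made for the example. With $W$ stored as a $d_1 \times 768$ array, the projection is written `z_hat @ W.T` so that the shapes agree.

```python
import numpy as np


def temporal_smooth(z: np.ndarray, src_fps: float, dst_fps: float) -> np.ndarray:
    """Linearly interpolate features z of shape (T, D) from src_fps to dst_fps, as in Eq. (1)."""
    num_src = z.shape[0]
    num_dst = int(round(num_src * dst_fps / src_fps))
    # Fractional source index t of every output frame, clamped to the last source frame.
    t = np.minimum(np.arange(num_dst) * src_fps / dst_fps, num_src - 1)
    i = np.floor(t).astype(int)
    alpha = (t - i)[:, None]
    i_next = np.minimum(i + 1, num_src - 1)  # repeat the last frame at the clip boundary
    return (1.0 - alpha) * z[i] + alpha * z[i_next]


rng = np.random.default_rng(0)

# Resampling to 80 fps and using a 16-frame stride gives 80 / 16 = 5 features per second.
feature_fps = 80 / 16
z = rng.standard_normal((int(10 * feature_fps), 768))  # stand-in for MARLIN features of a 10 s clip

# Assumption: the interpolation target is the 50 Hz EnCodec code rate.
z_hat = temporal_smooth(z, src_fps=feature_fps, dst_fps=50.0)

d1 = 12
W = 0.01 * rng.standard_normal((d1, 768))  # stand-in for the trainable matrix W
z_proj = z_hat @ W.T  # low-rank projection: one d1-dimensional vector per frame

print(z.shape, z_hat.shape, z_proj.shape)
```

Clamping at the clip boundary is one possible design choice; the source does not say how the last frames are handled.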
3.1.2 Motion Embedding
We compare two motion representations: one extracted from the Synchformer visual encoder [35] and another from RAFT optical flow [36].
For Synchformer, we standardize the video fps and segment each video into 16-frame clips with a stride of 5, yielding outputs of shape $(T, 8, \mathrm{dim})$, where the 8 dimension captures local temporal context. We flatten the first two dimensions, perform temporal interpolation as
\[ \hat{z}_{m,t} = (1 - \alpha)\, z_{m,i} + \alpha\, z_{m,i+1}, \tag{3} \]
and then project the interpolated features to a lower-dimensional space using a trainable matrix $W_m \in \mathbb{R}^{d_2 \times D}$ ($D$ = original feature dimension), yielding
\[ z'_{m,i} = W_m^{T} \hat{z}_{m,i} \in \mathbb{R}^{d_2}. \tag{4} \]
For RAFT, we sample the video at 5 fps to obtain frames $\{I_t\}_{t=1}^{T}$ and compute the dense optical flow $F_t \in \mathbb{R}^{H \times W \times 2}$ for each consecutive pair $(I_t, I_{t+1})$ using RAFT [36]. Each flow field is then processed by a Flow Embedding CNN $\mathcal{F}$ to yield a compact feature vector:
\[ z^{\mathrm{flow}}_t = \mathcal{F}(F_t) \in \mathbb{R}^{256}. \tag{5} \]
$\mathcal{F}$ is composed of several convolutional layers with kernel size 3 and ReLU activations, followed by an adaptive average pooling operation. We perform temporal interpolation and project the resulting sequence to a lower-dimensional space in the same way as for Synchformer. This ensures that both motion representations are aligned in time and compatible with the MusicGen transformer’s input requirements.
3.1.3 Positional Embeddings
We define a learnable matrix $W_e \in \mathbb{R}^{(d_1 + d_2) \times d}$ to fuse the embeddings mentioned above, together with a learnable positional embedding $z_{\mathrm{pos},i} \in \mathbb{R}^{d_1 + d_2}$ to support sequential modeling. The combined joint symbolic and acoustic embedding is computed as:
\[ z_i = W_e^{T} [z'_i ; z'_{m,i}] + z_{\mathrm{pos},i} \in \mathbb{R}^{d}. \tag{6} \]
Let $T$ denote the total number of frames. Then, the overall sequential joint embedding is given by:
\[ z = \{z_1, z_2, \ldots, z_T\} \in \mathbb{R}^{T \times d}. \tag{7} \]
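As a rough illustration of Eqs. (6) and (7), the fusion step amounts to concatenating the two projected streams per frame, applying one linear map, and adding a positional term. The sketch below uses random stand-ins for all learned quantities. The hidden size `d = 64` is a placeholder rather than MusicGen's actual width, and adding the positional embedding after the projection (in $\mathbb{R}^{d}$) is an assumption made here so that the sum is well defined.

```python
import numpy as np

rng = np.random.default_rng(0)

T, d1, d2 = 500, 12, 12  # frames in a 10 s clip at 50 Hz; d1 = d2 = 12 as in the paper
d = 64                   # placeholder hidden size (assumption, not MusicGen's real width)

z_face = rng.standard_normal((T, d1))    # stand-in for projected facial features
z_motion = rng.standard_normal((T, d2))  # stand-in for projected motion features

W_e = 0.02 * rng.standard_normal((d1 + d2, d))  # stand-in for the learnable fusion matrix
z_pos = 0.02 * rng.standard_normal((T, d))      # stand-in for learnable positional embeddings

# Eq. (6), applied to all frames at once: rows are frames, so W_e^T x becomes x @ W_e.
concat = np.concatenate([z_face, z_motion], axis=-1)  # shape (T, d1 + d2)
z_joint = concat @ W_e + z_pos                        # shape (T, d), the sequence of Eq. (7)

print(concat.shape, z_joint.shape)
```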
3.2 Condition Adaptor
Adopting a similar approach to CoCoMulla [20], we extend the idea of a condition adaptor to handle time-varying video inputs, such as facial expressions and motion. In a standard Transformer, each self-attention layer processes $T$ hidden embeddings (one per Encodec frame). In our approach, the final $L$ layers of the MusicGen decoder expand this to $2T$ embeddings by adding $T$ condition prefix positions, which encode the control information. Specifically, we insert a sequence of learnable input embeddings into the $(N - L + 1)$-th decoder layer to initiate the condition prefix. In this prefix, the hidden states go only through self-attention layers (omitting cross-attention). Let
\[ H^{p}_{l} \in \mathbb{R}^{T \times d} \quad (N - L + 1 \le l \le N) \]
be the output of the condition prefix, with $H^{p}_{0}$ as the learnable input embeddings. The condition prefix is computed as:
\[ Q^{p}_{l}, K^{p}_{l}, V^{p}_{l} = \text{QKV-projector}(H^{p}_{l} + Z_{l}), \]
\[ H^{p}_{l+1} = \text{Self-Attention}(Q^{p}_{l}, K^{p}_{l}, V^{p}_{l}), \]
where $Z_l$ are the sequential joint embeddings (defined in Eq. (7)). No causal mask is applied here, and the condition prefix does not attend to the Encodec tokens.
For the remaining part, the hidden states $H_l \in \mathbb{R}^{T \times d}$ (for $1 \le l \le N$) are processed normally. Their standard attention output $S_l$ is computed as:
\[ Q_{l}, K_{l}, V_{l} = \text{QKV-projector}(H_{l}), \]
\[ S_{l} = \text{Self-Attention}(Q_{l}, K_{l}, V_{l}). \]
To incorporate condition information in the last $L$ layers, we compute cross attention $S'_l$ between $Q_l$ and $\{K^{p}_{l}, V^{p}_{l}\}$ using self-attention, fusing $Q_l$ with $Q^{p}_{l}$:
\[ S'_{l} = \text{Self-Attention}(Q_{l} + Q^{p}_{l}, K^{p}_{l}, V^{p}_{l}). \]
A learnable gating factor $g_l$ (initialized to zero) combines the outputs:
\[ H_{l+1} = \text{Cross-Attention}(S_{l} + g_{l} \cdot S'_{l}, \text{text}). \]
In our implementation, all MusicGen layers (including QKV-projector, Self-Attention, and Cross-Attention) are frozen; only $H^{p}_{0}$, $W_p$, $W_a$, $W_e$, $z_{\mathrm{pos}}$, and $g_l$ are trainable.
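To make the gating mechanism concrete, here is a minimal single-head NumPy sketch of one adaptor layer. It is a simplification under several assumptions: real MusicGen layers are multi-head and include the text cross-attention and feed-forward blocks, which are omitted here; the names `attention` and `adaptor_layer` are hypothetical; and leaving the attention over the prefix unmasked is a guess based on the statement that no causal mask is applied in the prefix. The point it illustrates is that with the gate at its initial value of zero, the layer reproduces the frozen model's output exactly.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


def adaptor_layer(H, Hp, Z, Wq, Wk, Wv, g):
    """One gated condition-prefix layer; Wq, Wk, Wv play the frozen QKV-projector."""
    # Condition prefix: project (Hp + Z) and apply self-attention without a causal mask.
    Qp, Kp, Vp = (Hp + Z) @ Wq, (Hp + Z) @ Wk, (Hp + Z) @ Wv
    Hp_next = attention(Qp, Kp, Vp)
    # Main stream: the usual causal self-attention over the audio-token states.
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    S = attention(Q, K, V, causal=True)
    # Attend from the fused queries to the prefix keys and values, then gate.
    S_prime = attention(Q + Qp, Kp, Vp)
    return S + g * S_prime, Hp_next


rng = np.random.default_rng(0)
T, d = 6, 8
H, Hp, Z = (rng.standard_normal((T, d)) for _ in range(3))
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out_zero, Hp_next = adaptor_layer(H, Hp, Z, Wq, Wk, Wv, g=0.0)
out_half, _ = adaptor_layer(H, Hp, Z, Wq, Wk, Wv, g=0.5)
plain = attention(H @ Wq, H @ Wk, H @ Wv, causal=True)
```

Initializing the gate to zero means training starts from the unmodified pretrained behavior and the visual condition is blended in only as `g` moves away from zero.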
4. EXPERIMENTS
4.1 Dataset
Due to the lack of sufficient paired video-audio data with clear facial features, we curated our own dataset by collecting the data manually. We recruited volunteers to record their facial expressions and upper body movements while listening to 30-second audio clips. Before starting the recording, the volunteers were asked to listen to the music track once, allowing them time to think about the facial expressions and body movements that would align with the music. The audio clips used were licensed instrumental tracks from Epidemic Sound, ensuring no vocals were present. The collection includes a variety of music genres, such as pop, jazz, blues, classical, and epic, among others. We were able to collect 7 hours of paired video-audio data. The paired video-audio data were then split into 10-second clips. We set aside 30 minutes of data for testing and validation. We also generated captions for each audio clip with the audio-captioning model SALMONN [37] to be used as text prompts during training and inference. The prompt given to SALMONN for captioning is ‘Please describe the music’. In the data collection process, volunteers gave informed consent, agreeing that their videos would be used only for research and anonymized to protect their privacy.
4.2 Implementation
Our base model, MusicGen (text-only), consists of three main parts: a pre-trained EnCodec, a pre-trained T5 encoder, and an acoustic transformer decoder. The decoder has 48 layers, each with causal self-attention and cross-attention to process text prompts. MusicGen uses EnCodec, a Residual Vector Quantization (RVQ) auto-encoder [38], to convert audio sampled at 32,000 Hz into discrete codes at 50 Hz, which are then passed to the transformer decoder.
We trained the proposed model using four A1000 GPUs, employing an initial learning rate of 1e-02 and a batch size of 10 consisting of ten 10-second audio samples, for 40 epochs. Training was stopped after 40 epochs to prevent overfitting. During training, the model’s parameters were updated using a cross-entropy reconstruction loss. In the low-rank projection step, we set $d_1$ in Equation (2) and $d_2$ in Equation (4) to be 12.
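The rates quoted above fix the sequence lengths involved in aligning video features with audio codes. The short calculation below works them out for one 10-second clip; the variable names are illustrative, and treating the 5 fps visual features as needing one value per 50 Hz code frame is an assumption drawn from the description of the condition prefix.

```python
sample_rate_hz = 32_000   # audio sampling rate fed to EnCodec
code_rate_hz = 50         # EnCodec code frames per second
clip_seconds = 10         # clip length used for training
video_feature_fps = 5     # visual feature rate before temporal smoothing

samples_per_code = sample_rate_hz // code_rate_hz        # audio samples summarized by one code frame
codes_per_clip = code_rate_hz * clip_seconds             # decoder sequence length T per clip
features_per_clip = video_feature_fps * clip_seconds     # visual features extracted per clip
upsampling_factor = code_rate_hz // video_feature_fps    # interpolated frames per visual feature

print(samples_per_code, codes_per_clip, features_per_clip, upsampling_factor)
```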
4.3 Baselines
Since our model addresses a novel domain for which no existing open-source models are specifically trained with both video (facial expression and motion) and text controls, we selected vanilla MusicGen (text-conditioned) and two recent video-conditioned music generation models, VidMuse [24] and Video2Music [39], as our baselines. Video2Music uses an Affective Multimodal Transformer to generate emotionally aligned, expressive symbolic music (chords) from video inputs by leveraging semantic, motion, scene, and emotion features, while VidMuse integrates both local and global visual cues through a Long-Short-Term Visual Module.
4.4 Evaluation
Our evaluation consists of two parts: subjective and objective evaluations. Because MusicGen only accepts textual input, we used multimodal captions generated from the video–audio pairs to serve as prompts for the text-only MusicGen baseline. This approach ensures a fair comparison by providing equivalent descriptive information across all models.

| Input / Model | Motion Feat. | Prompt    | FAD-VGG↓ | Avg. KL↓ | Avg. IS Score↑ | Tempo Err. (bpm)↓ | F1-score (Beat)↑ | Text-Audio↑ | Video-Audio↑ |
| face          | –            | Generated | 2.52     | 0.74     | 1.45           | 31.53             | 0.38             | 0.48        | 0.55         |
| motion        | RAFT         | Generic   | 3.56     | 1.05     | 1.16           | 28.07             | 0.37             | 0.38        | 0.65         |
| motion        | RAFT         | Generated | 2.25     | 0.66     | 1.53           | 33.71             | 0.37             | 0.52        | 0.59         |
| motion        | Synchformer  | Generic   | 3.52     | 1.08     | 1.20           | 31.03             | 0.35             | 0.37        | 0.61         |
| motion        | Synchformer  | Generated | 1.93     | 0.65     | 1.57           | 32.89             | 0.36             | 0.52        | 0.61         |
| face+motion   | Synchformer  | Generated | 2.55     | 0.67     | 1.49           | 32.72             | 0.37             | 0.54        | 0.59         |
| MusicGen      | –            | Generated | 2.76     | 0.79     | 1.54           | 35.84             | 0.33             | 0.50        | 0.42         |
| Video2Music   | –            | –         | 18.97    | 1.32     | 1.01           | 26.13             | 0.38             | 0.25        | 0.59         |
| VidMuse       | –            | –         | 9.91     | 1.10     | 1.33           | 33.01             | 0.37             | 0.36        | 0.52         |

Table 1. Comparison of music quality, rhythm alignment, and text/video-audio consistency metrics across different configurations. Column groups: General Music Quality (FAD-VGG, Avg. KL, Avg. IS Score), Rhythm Alignment (Tempo Err., F1-score), and Text/Video-Audio Consistency (Text-Audio, Video-Audio).
4.4.1 Objective Evaluation
In our objective evaluation, we assess the generated music from three perspectives: (1) the inherent quality, (2) rhythm alignment, and (3) text/video-audio consistency. To measure the inherent quality of the music, we employed Fréchet Audio Distance (FAD) with VGGish embeddings, which measures perceptual similarity to real music [40, 41]; Kullback-Leibler (KL) divergence, which quantifies distributional alignment with real audio labels [42,43]; and Inter-Sample Score (IS Score), which captures the diversity among generated samples [44]. To evaluate the rhythm alignment between generated and ground-truth music, we calculated the tempo error, defined as how far an estimated BPM (beats per minute) deviates from the true (ground-truth) BPM, and the beat consistency between the generated and reference music. To evaluate text-audio and video-audio consistency, we use CLAP [45] and LanguageBind [46], respectively. We employ these models, which are trained with a contrastive learning approach, to measure the model’s ability to generate music that reflects the semantic meaning of the text and video.
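The two rhythm metrics can be sketched as follows. This is an illustration only: the paper does not give the exact beat-matching procedure, so the greedy one-to-one matching and the ±70 ms tolerance window (a common convention in beat-tracking evaluation) are assumptions, and the function names `tempo_error` and `beat_f1` are hypothetical.

```python
def tempo_error(est_bpm: float, ref_bpm: float) -> float:
    """Absolute deviation of the estimated tempo from the ground-truth tempo, in BPM."""
    return abs(est_bpm - ref_bpm)


def beat_f1(est_beats, ref_beats, tol: float = 0.07) -> float:
    """F1 score of beat times (in seconds), matching each beat at most once within +/- tol."""
    est, ref = sorted(est_beats), sorted(ref_beats)
    i = j = hits = 0
    while i < len(est) and j < len(ref):
        diff = est[i] - ref[j]
        if abs(diff) <= tol:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # estimated beat is too early to match any remaining reference beat
        else:
            j += 1  # reference beat was missed
    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = [0.5, 1.0, 1.5, 2.0]
estimated = [0.52, 1.0, 1.7, 2.05]  # the third beat is off by 200 ms and does not match

print(tempo_error(120.0, 125.5), beat_f1(estimated, reference))
```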
4.4.2 Subjective Evaluation
We gathered participants with varying levels of musical background to rate the generated music, across various configurations and the baseline, over five groups of six videos. They were shown the text prompts without being informed which model produced each track. Ratings were based on the following criteria:
• Musicality: How effectively the audio captures key musical qualities.
• Text-audio Similarity: The degree of alignment between the generated music and the provided textual prompt.
• Video-audio Consistency: The extent to which the music corresponds with the video content in terms of tempo and emotional expression.
• Creativity: The uniqueness and innovativeness of the generated audio.
4.5 Ablation studies
We conducted ablation studies to evaluate the effects of various experimental setups. Specifically, these studies explored the following model setups: (1) the use of different motion features (RAFT versus Synchformer) and (2) the use of generic prompts of ‘music with catchy melody’ versus detailed prompts generated by the audio-captioning model.
5. RESULTS
5.1 Objective Evaluations
General Music Quality. The results in Table 1 demonstrate that models incorporating motion information, particularly those using the Synchformer features with generated prompts, consistently outperform the others, and especially the baselines, across general music quality metrics. This configuration performs best overall, with the lowest FAD and KL scores and the highest IS Score, indicating realistic, diverse, and well-aligned music generation. Comparatively, models using the RAFT features perform less consistently, and those using only facial input or generic prompts yield weaker results. Baseline methods like VidMuse and Video2Music show significantly poorer FAD and KL scores, highlighting the advantage of multimodal control in our model.
Rhythm Alignment. Among all of our configurations tested, the model trained with RAFT motion features and generic captions achieves the lowest Average Tempo Error (28.07 BPM), indicating better temporal alignment with the audio compared to the other proposed methods. Video2Music achieves the lowest tempo error overall (26.13 BPM) because it transcribes the audio into MIDI and computes rhythmic characteristics in the form of note density and loudness from the audio as proxies for the music’s rhythms [14]. The baseline model MusicGen shows the poorest performance in all tempo and beat tracking metrics, underscoring its non-expressiveness in controlling generated music.
Text/Video-Audio Consistency. The text-audio similarity scores reveal that multimodal conditioning (face and motion) significantly enhances text alignment compared to the text-only conditioning baseline (MusicGen), although the visual inputs do not provide explicit textual information. The face+motion model achieves the highest text-music score (0.54). In contrast, video-audio similarity scores show that models trained with generic (non-descriptive) captions achieve stronger alignment with video (e.g., 0.65 and 0.61), suggesting that text conditioning may not benefit, and may even hinder, video-audio consistency. The poor video-audio alignment of MusicGen outputs and the low text-audio alignment in music generated by Video2Music and VidMuse are expected, as these models are not conditioned on the respective modalities.
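The per-metric comparisons discussed in this subsection can be re-derived directly from the values reported in Table 1. The snippet below is a small bookkeeping aid rather than part of the evaluation pipeline: it copies the table into a dictionary (the row labels and metric keys are naming choices made here) and picks the best row per column according to the direction of each arrow.

```python
# Columns: FAD-VGG, Avg. KL, Avg. IS, tempo error (bpm), beat F1, text-audio, video-audio.
table1 = {
    "face / - / generated": (2.52, 0.74, 1.45, 31.53, 0.38, 0.48, 0.55),
    "motion / RAFT / generic": (3.56, 1.05, 1.16, 28.07, 0.37, 0.38, 0.65),
    "motion / RAFT / generated": (2.25, 0.66, 1.53, 33.71, 0.37, 0.52, 0.59),
    "motion / Synchformer / generic": (3.52, 1.08, 1.20, 31.03, 0.35, 0.37, 0.61),
    "motion / Synchformer / generated": (1.93, 0.65, 1.57, 32.89, 0.36, 0.52, 0.61),
    "face+motion / Synchformer / generated": (2.55, 0.67, 1.49, 32.72, 0.37, 0.54, 0.59),
    "MusicGen": (2.76, 0.79, 1.54, 35.84, 0.33, 0.50, 0.42),
    "Video2Music": (18.97, 1.32, 1.01, 26.13, 0.38, 0.25, 0.59),
    "VidMuse": (9.91, 1.10, 1.33, 33.01, 0.37, 0.36, 0.52),
}

# (metric key, True if higher is better)
metrics = [
    ("fad", False), ("kl", False), ("is", True), ("tempo_err", False),
    ("beat_f1", True), ("text_audio", True), ("video_audio", True),
]

best = {}
for col, (name, higher_is_better) in enumerate(metrics):
    pick = max if higher_is_better else min
    # Ties (e.g. beat F1) resolve to the first row in table order.
    best[name] = pick(table1, key=lambda row: table1[row][col])

for name, row in best.items():
    print(f"{name:12s} -> {row}")
```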
5.2 Subjective Evaluation
Figure 3. Subjective Evaluation Results on Four Metrics
Figure 3 presents a subjective analysis of the quality of music generated by five models: vanilla MusicGen (baseline) and four variants of our model trained with different configurations. A total of 13 participants (6 female, 7 male) aged 18–40 years (median = 27) took part. Based on self-reported musical training, 20% of participants have a beginner level of experience in music, 20% intermediate, and 60% professional.
Notably, all of our models except for one outperform the baseline in creativity and musicality, suggesting that incorporating facial expressions and motion cues enables the system to better capture the expressive qualities of the input video. This expressiveness is reflected in the generated music, which participants perceive as more musical and creative. Our model trained with only motion features and generic prompts performs poorly in all evaluation metrics, indicating that generic textual prompts are insufficient to guide high-quality music generation. Without meaningful textual context, visual features alone do not provide enough semantic grounding to produce good music. The superior performance of motion features over facial features in both objective and subjective evaluations likely stems from the nature of the dataset: participants found it easier to convey musical cues through movements rather than facial expressions, resulting in richer and more expressive motion data. These findings highlight that visual and textual modalities are complementary: while textual input provides semantic intent, visual features, especially motion, enrich the generated music’s expressiveness. Overall, our model, integrating features from both modalities, is capable of producing music that is coherent, expressive, and creative as perceived by humans.
5.3 Ablation Studies
5.3.1 RAFT vs. Synchformer
The comparison between RAFT and Synchformer as motion feature extractors reveals notable differences in both general music quality and rhythm-related metrics, as shown in Table 1. Synchformer outperforms RAFT on FAD and IS Score, indicating more realistic and slightly more diverse music generation. However, there is no significant difference between these two motion features in terms of tempo error and beat accuracy of the generated music. These findings suggest that Synchformer may capture more expressive and semantically rich motion patterns, whereas RAFT may be better at preserving rhythmic consistency in simple contexts.
5.3.2 Generic vs. Generated Captions
Comparing models trained with generated captions to those using generic captions, we observed that although music generated with generic captions yielded lower CLAP scores, likely due to receiving minimal textual information, these models achieved better tempo accuracy than those trained with generated prompts (a tempo error of 28.07 BPM versus 33.71 BPM when comparing models trained with the same configuration except for the choice of prompts). This suggests that adding extra textual context may introduce noise or distract from the purely visual motion cues, ultimately reducing temporal accuracy.
6. CONCLUSION
Expotion demonstrates that visual cues, specifically body movements and facial expressions, can effectively serve as expressive controls for music generation. By leveraging a pretrained text-to-music model [19] and applying parameter-efficient fine-tuning, our approach achieves notable improvements over the original text-only conditioned MusicGen using only 130 clips (6 hours) of training data in 40 epochs. Integrating these multimodal signals with textual prompts, Expotion produces music that shows strength in musicality, creativity, and temporal accuracy, as evidenced by enhanced beat, tempo, and overall semantic consistency across text, video, and audio. A temporal smoothing strategy further ensures fine-grained alignment between the visual cues and the generated music. Our results outperform state-of-the-art baselines, and subjective studies confirm that the combination of facial and motion features yields superior performance, while objective evaluations highlight that motion features, particularly those extracted via Synchformer, strike an optimal balance between rhythmic consistency and expressive dynamics. Overall, Expotion represents a promising step toward more expressive, controllable, and interactive audiovisual music generation systems.
7. REFERENCES
[1] H. K. Cheng, M. Ishii, A. Hayakawa, T. Shibuya, A. Schwing, and Y. Mitsufuji, “Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis,” CVPR, 2025.
[2] S. Luo, C. Yan, C. Hu, and H. Zhao, “Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models,” in Advances in Neural Information Processing Systems (NeurIPS), 2023.
[3] I. Viertola, V. Iashin, and E. Rahtu, “Temporally Aligned Audio for Video with Autoregression,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
[4] Y. Zhang, Y. Gu, Y. Zeng, Z. Xing, Y. Wang, Z. Wu, and K. Chen, “FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds,” arXiv preprint arXiv:2407.01494, 2024.
[5] Y. Wang, W. Guo, R. Huang, J. Huang, Z. Wang, F. You, R. Li, and Z. Zhao, “Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching,” in Advances in Neural Information Processing Systems (NeurIPS), 2024.
[6] M. Sun, W. Wang, Y. Qiao, J. Sun, Z. Qin, L. Guo, X. Zhu, and J. Liu, “MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation,” in Proceedings of the 32nd ACM International Conference on Multimedia (ACM MM), 2024.
[7] S. Yang, Z. Zhong, M. Zhao, S. Takahashi, M. Ishii, T. Shibuya, and Y. Mitsufuji, “Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation,” arXiv preprint arXiv:2405.14598, 2024.
[8] J. Lee, J. Im, D. Kim, and J. Nam, “Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound,” arXiv preprint arXiv:2408.11915, 2024.
[9] X. Liu, K. Su, and E. Shlizerman, “Tell what you hear from what you see: Video to audio generation through text,” arXiv preprint arXiv:2402.05937, 2024.
[10] S. Mo, J. Shi, and Y. Tian, “Text-to-audio generation synchronized with videos,” arXiv preprint arXiv:2403.07055, 2024.
[11] Y. Du, Z. Chen, J. Salamon, B. Russell, and A. Owens, “Conditional Generation of Audio from Video via Foley Analogies,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 2426–2436.
[12] R. Li, S. Zheng, X. Cheng, Z. Zhang, S. Ji, and Z. Zhao, “MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization,” arXiv preprint arXiv:2410.12957, 2024.
[13] Y.-B. Lin, Y. Tian, L. Yang, G. Bertasius, and H. Wang, “VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
[14] J. Kang, S. Poria, and D. Herremans, “Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model,” Expert Systems with Applications, vol. 249, p. 123640, 2024.
[15] K. Su, J. Y. Li, Q. Huang, D. Kuzmin, J. Lee, C. Donahue, F. Sha, A. Jansen, Y. Wang, M. Verzetti, and T. I. Denk, “V2Meow: Meowing to the Visual Beat via Video-to-Music Generation,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024, pp. 4952–4960.
[16] X. Liang, W. Li, L. Huang, and C. Gao, “DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator,” IEEE Transactions on Multimedia, 2024.
[17] Y. Zhu, Y. Wu, K. Olszewski, J. Ren, S. Tulyakov, and Y. Yan, “Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation,” in Proceedings of the International Conference on Learning Representations (ICLR), 2023.
[18] J. Yu, Y. Wang, X. Chen, X. Sun, and Y. Qiao, “Long-Term Rhythmic Video Soundtracker,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.
[19] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” in Advances in Neural Information Processing Systems (NeurIPS), 2024.
[20] L. Lin, G. Xia, J. Jiang, and Y. Zhang, “Content-based controls for music large language modeling,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2024. [Online]. Available: https://arxiv.org/abs/2310.17162
[21] H. F. Garcia, O. Nieto, J. Salamon, B. Pardo, and P. Seetharaman, “Sketch2sound: Controllable audio generation via time-varying signals and sonic imitations,” arXiv preprint arXiv:2402.13253, 2024.
[22] A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Audioclip: Extending clip to image, text and audio,” in ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 976–980.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[24] Z. Tian, Z. Liu, R. Yuan, J. Pan, Q. Liu, X. Tan,
Q. Chen, W. Xue, and Y. Guo, “Vidmuse: A simple
video-to-music generation framework with long-short-term
modeling,” CVPR, 2025.
[25] R. Valenti, A. Jaimes, and N. Sebe, “Sonify Your Face:
Facial Expressions for Sound Generation,” in Proceedings
of the 2010 ACM Multimedia Workshop on Visual
Media Interpretation and Understanding (VMIU),
2010.
[26] A. Clay, N. Couture, E. Decarsin, M. Desainte-Catherine,
P.-H. Vulliard, and J. Larralde, “Movement
to emotions to music: using whole body emotional
expression as an interaction for electronic music generation,”
in Proceedings of the International Conference
on New Interfaces for Musical Expression (NIME),
2012.
[27] J. Huang, X. Huang, L. Yang, and Z. Tao, “D2MNet for
music generation jointly driven by facial expressions
and dance movements,” Array, 2024.
[28] V. P, P. A, S. G. Vasist, S. Rao, and K. S. Srinivas,
“DeepTunes: Music Generation based on Facial Emotions
using Deep Learning,” in International Conference
on Intelligent Computing and Technology (I2CT),
2022.
[29] J. Huang, X. Huang, L. Yang, and Z. Tao, “A Continuous
Emotional Music Generation System Based on Facial
Expressions,” in Proceedings of the International
Conference on Intelligent Data (ICID), 2022.
[30] G. Research, “Musiclm: Generating music from text,”
2023, preprint available on arXiv.
[31] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen,
W. Lawrence, and R. C. Moore, “Audio set: An ontology
and human-labeled dataset for audio events,”
in 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2017,
pp. 776–780.
[32] H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Vggsound:
A large-scale audio-visual dataset,” in
Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
2020, pp. 721–725.
[33] L. Lin, G. Xia, Y. Zhang, and J. Jiang, “Arrange, inpaint,
and refine: Steerable long-term music audio generation
and editing via content-based controls,” in
Proceedings of the 32nd International Joint Conference on
Artificial Intelligence (IJCAI). IJCAI, 2024.
[34] Z. Cai, S. Ghosh, K. Stefanov, A. Dhall, J. Cai,
H. Rezatofighi, R. Haffari, and M. Hayat, “Marlin:
Masked autoencoder for facial video representation
learning,” in CVPR. CVPR, 2023.
[35] V. Iashin, W. Xie, E. Rahtu, and A. Zisserman, “Synchformer:
Efficient synchronization from sparse cues,”
in ICASSP 2024-2024 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2024.
[36] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field
transforms for optical flow,” in European Conference
on Computer Vision (ECCV). Springer, 2020, pp.
402–419.
[37] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu,
Z. Ma, and C. Zhang, “Salmonn: Towards generic
hearing abilities for large language models,” in
Proceedings of the International Conference on Learning
Representations (ICLR), 2024.
[38] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High
fidelity neural audio compression,” 2022. [Online].
Available: https://arxiv.org/abs/2210.13438
[39] J. Kang, S. Poria, and D. Herremans, “Video2music:
Suitable music generation from videos using an
affective multimodal transformer model,” Expert
Systems with Applications, vol. 249, p. 123640, Sep.
2024. [Online]. Available:
http://dx.doi.org/10.1016/j.eswa.2024.123640
[40] K. Kilgour, R. Clark, K. Simonyan, and M. Sharifi,
“Frechet audio distance: A reference-free metric for
evaluating music enhancement algorithms,” in
Interspeech, 2019, pp. 2350–2354.
[41] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke,
A. Jansen, R. Moore, M. Plakal, D. Platt, R. A.
Saurous, B. Seybold, M. Slaney, R. J. Weiss, and
K. Wilson, “Cnn architectures for large-scale audio
classification,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2017, pp. 131–135.
[42] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and
M. D. Plumbley, “Panns: Large-scale pretrained audio
neural networks for audio pattern recognition,”
IEEE/ACM Transactions on Audio, Speech, and Language
Processing, vol. 28, pp. 2880–2894, 2020.
[43] K. Koutini, H. Eghbal-zadeh, M. Widrich, J. Brandstetter,
A. Thakur, V. Berenz, T. Mörwald, S. Hochreiter,
and B. Hammer, “Efficient training of audio transformers
with patchout,” in Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2022, pp. 874–878.
[44] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung,
A. Radford, and X. Chen, “Improved techniques for
training gans,” in Advances in Neural Information
Processing Systems (NeurIPS), 2016, pp. 2234–2242.
[45] Y. Wu, K. Chen, T. Zhang, Y. Hui, M. Nezhurina,
T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale
contrastive language-audio pretraining with feature
fusion and keyword-to-caption augmentation,” in
Proceedings of the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP),
2023. [Online]. Available: https://arxiv.org/abs/2211.06687
[46] B. Zhu, B. Lin, M. Ning, Y. Yan, J. Cui, H. Wang,
Y. Pang, W. Jiang, J. Zhang, Z. Li, W. Zhang, Z. Li,
W. Liu, and L. Yuan, “Languagebind: Extending
video-language pretraining to n-modality by language-based
semantic alignment,” 2024. [Online]. Available:
https://arxiv.org/abs/2310.01852
