
A Survey on Vision-to-Music Generation: Methods, Datasets, Evaluation, and Challenges

Author: Zhaokai Wang; Chenxi Bao; Le Zhuo; Jingrui Han; Yang Yue; Yihong Tang; Victor Shea-Jay Huang; Yue Liao
Publisher: Zenodo
DOI: 10.5281/zenodo.17706381
Source: https://zenodo.org/records/17706381/files/000027.pdf
A SURVEY ON VISION-TO-MUSIC GENERATION:
METHODS, DATASETS, EVALUATION, AND CHALLENGES
Zhaokai Wang1, Chenxi Bao2, Le Zhuo3, Jingrui Han4
Yang Yue5, Yihong Tang6, Victor Shea-Jay Huang3, Yue Liao7
1Shanghai Jiao Tong University 2Music Tech Lab, DynamiX
3The Chinese University of Hong Kong 4Beijing Film Academy
5Tsinghua University 6McGill University 7National University of Singapore
[email protected] {cloudingcxb17,zhuole1025,liaoyue.ai}@gmail.com
ABSTRACT
Vision-to-music generation, including video-to-music and image-to-music
tasks, is a significant branch of multimodal artificial intelligence
demonstrating vast applications like film scoring and short video creation.
However, research in vision-to-music is still in its preliminary stage due
to its complex internal structure and the difficulty of modeling dynamic
relationships with video. To the best of our knowledge, existing surveys
focus on general music generation without comprehensive discussion on
vision-to-music. In this paper, we systematically review the research
progress in the field of vision-to-music generation. We first analyze the
technical characteristics and core challenges of three input types: general
videos, human movement videos, and images, as well as two output types of
symbolic music and audio music. We then summarize the existing methodologies
from the architecture perspective. A detailed review of common datasets and
evaluation metrics is provided. Finally, we discuss current challenges and
future directions. We hope our survey can inspire further innovation in
vision-to-music generation and the broader field of multimodal generation in
academic research and industrial applications.
1. INTRODUCTION
Recent advances in multimodal artificial intelligence have witnessed
substantial progress in generating and understanding content of modalities
like text, images, video, and speech [1–7]. Music generation, as an important
part of this multimodal ecosystem, has also seen remarkable development.
Among the various music generation tasks (e.g. unconditional music generation
[8,9] and text-to-music generation [10,11]), vision-to-music, including
video-to-music and image-to-music generation, has garnered particular
interest due to its practical applications in film scoring, short video
platforms, and music accompaniment. For general users, automatically
generated background music can alleviate copyright concerns and reduce the
time spent searching for suitable music. For professional composers,
AI-assisted music composition can streamline the iterative process of
matching a score to visual content, expediting the communication cycle with
directors and producers.

© Authors. Licensed under a Creative Commons Attribution 4.0 International
License (CC BY 4.0). Attribution: Authors, “A Survey on Vision-to-Music
Generation: Methods, Datasets, Evaluation, and Challenges”, in Proc. of the
26th Int. Society for Music Information Retrieval Conf., Daejeon, South
Korea, 2025.

Despite this demand, the development of vision-to-music remains relatively
preliminary. For academic research, the inherent challenges ranging from
aligning rich visual cues with musical structure to handling the multifaceted
nature of music generation contribute to the task's high complexity, making
it more difficult than the common text-to-music task [10–12]. Although a
growing number of works have emerged in recent years [13–18], they are still
far from meeting the diverse requirements of real-world scenarios. For
industrial integration, while other AI-generated content (AIGC) fields, such
as text-to-image [19–21] and text-to-video [22,23], have experienced rapid
adoption in both professional and consumer contexts, vision-to-music systems
have yet to see broad industrial deployment, with only pilot products like
Tianpuyue AI¹. The unique demands of film scoring, which often require
precise emotional and temporal synchronization with visual storytelling,
heighten the difficulty of achieving robust and artistically consistent
results through AI methods.
To the best of our knowledge, although existing works provide reviews on
general music generation [24–28], there is a lack of surveys focusing on the
vision-to-music generation task. Given this gap, we aim to provide a
comprehensive survey on vision-to-music generation. We provide a timeline of
representative works in Fig. 1, and an overview of vision-to-music generation
in Fig. 2.
The subsequent sections of this paper are organized as follows: Sec. 2
introduces the fundamentals of vision-to-music generation, analyzing the
technical characteristics and core challenges of three major scenarios:
general videos, human movement videos, and images. Sec. 3 reviews current
vision-to-music methods, comparing innovations and limitations in vision
encoding, vision-music projection, and music generation module design. Sec. 4
discusses recent vision-to-music datasets. Sec. 5 introduces evaluation
metrics, categorized by their purposes (music-only and vision-music
correspondence) and approaches (objective and subjective). Sec. 6 discusses
the current research status and existing challenges. Through this work, we
aspire to inspire further innovation in vision-to-music generation and the
broader multimodal learning community, driving progress in both academic
research and industrial applications of vision-to-music generation.

¹ https://www.tianpuyue.cn/video2music

[Figure 1: Timeline of representative works in vision-to-music generation,
spanning 2020-2022 to 2025.3. Works shown: CMT, Foley Music, Dance2Music,
RhythmicNet, V-MusProd, V2Meow, MuMu-LLaMA, Video2Music, EIMG, VidMuse,
Mozart's Touch, VMAS, MeLFusion, Diff-BGM, SONIQUE, VidMusician, VEH, MuVi,
MTM, AudioX, GVMGen, XMusic, FilmComposer.]

2. FUNDAMENTALS
In the broad multimodal research community, music is often treated as a
subset of audio [8,22,29–33]. However, unlike general audio which may include
background noise, speech, or sound effects, music embodies intricate internal
structures and richness of information, including harmony, counterpoint, and
instrumentation. These complexities make it essential to consider music as an
independent modality, which sets the stage for exploring vision-to-music
generation.
When delving into this specific area, we first need to recognize the unique
relationship between visual input and musical output. We analyze the
characteristics of three input types in vision-to-music generation: general
videos, human movement videos, and images, and two output types: symbolic
music and audio music. This categorization helps us better understand the
current state and challenges of the field.
2.1 Input Types
General Videos. This includes a wide range of video contents, such as natural
landscapes, films, sports, animations, etc. Techniques in this category
typically focus on extracting features like motion, color, or visual
semantics to create music that aligns with the visual narrative.
Images. Approaches in this domain focus on transforming static images into
music. Since images lack temporal semantics and rhythm, these methods usually
only need to focus on the overall style, and there is no strict requirement
for the duration of the generated music. The application scenarios of
image-to-music are not as extensive as video-to-music, but they include
functionalities like creating musical memories for photo albums. This type of
pair data is easy to collect, but its inherent correlation may not be very
strong.
Human Movement Videos. These videos typically include dance, sports,
instrument performances, and other human movements. For instrument
performance videos, where humans play music instruments but the audio is
removed, the music is determined by the input video to some extent, and the
generation process is similar to reconstructing music from the silent videos.
For dance, sports, and other human movements, they emphasize rhythmic
alignment (especially local rhythm) more than general videos, while semantic
constraints are generally weaker, requiring only overall style matching.
Therefore, extracted 2D/3D keypoints representing human motion are often
directly used as inputs instead of raw videos.
In the remaining sections, we will mainly focus on general videos and images,
while paying relatively less attention to human movement videos. This is
because they focus on rhythmic relations and the semantic association with
music is relatively weak, where 2D body keypoints are directly used as video
features. Their application scenarios are also relatively limited [13].
2.2 Output Types
Symbolic Music. Symbolic music is represented as discrete elements like
notes, chords, or sequences of musical symbols [25]. Most early
vision-to-music methods are symbolic [13,15,34,35]. Symbolic music can
incorporate music theory, such as chords, and generate longer pieces with
good controllability. However, the limited data availability restricts its
scalability to large models, and the expressive and emotional depth is
constrained by sound fonts.
Audio Music. Such methods aim to generate music in its audio form
[14,16–18,36–39], often employing generative models such as transformers
[40], VAEs, GANs [41], or diffusion models [42] to synthesize realistic sound
from the visual input. Audio music benefits from large-scale datasets for
training, enabling end-to-end generation with rich expressiveness and
performance. However, despite efforts of adding controls in audio generation
[43], the controllability of audio music is still relatively weak, compared
to symbolic music where multiple control signals can be used, e.g. pitch,
duration, instrument, rhythm, chord, etc. Moreover, the generated music is
typically shorter due to sampling rate limitations, e.g. usually under 20
seconds for methods in Tab. 1, whereas symbolic methods can easily achieve
whole-song length.

[Figure 2: Overview of vision-to-music generation. Panels: Input Types
(general videos, motion videos, images); Output Types (symbolic music, audio
music); Architecture (vision encoding, e.g. CLIP and Video CLIP features;
vision-music projection via feature, text, or adapter; music generation via
auto-regressive or diffusion models); Modeling Vision-Music Relationships
(semantic, rhythm); Datasets; Evaluation (music-only: objective FAD, FD, KL,
... and subjective Melody, Rhythm, Overall, ...; vision-music correspondence:
objective ImageBind Score, CLAP Score, ... and subjective Semantic, Rhythm,
Emotion, Overall, ...); Challenges (standardized dataset and benchmark;
customization and controllability; choice of symbolic or audio forms).]

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

3. METHODS
In this section, we discuss the existing works on vision-to-music generation.
We summarize vision-to-music generation methods in Tab. 1.
3.1 Tasks
We begin by categorizing the methods based on the input types outlined in
Sec. 2.
General Video-to-Music. Early general video-to-music methods were usually
symbolic [13,15,34,35]. With the development of audio-form music generation
[11,12], a large number of audio-based general video-to-music works have
emerged in the past few years [16–18,36–39,44–46].
Image-to-Music. Early image-to-music works were also primarily symbolic
[47–50], where models analyze color, texture, and semantic content to
generate music. Recent works [16,38,51,52] generated audio-form music from
multiple modalities (video, image, and text).
Human Movement Video-to-Music. For dance or sports videos, existing methods
focus on extracting rhythmic patterns from dance videos and mapping them to
musical rhythm generation [31,53–59]. For music performance videos, current
methods learn to reconstruct the original music from the silent videos
[60–63].
3.2 Architecture
The architecture of vision-to-music systems can be broken down into three
major components: vision encoding, vision-music projection, and music
generation.
Vision Encoding. This stage is focused on extracting features from the input
video or image. A commonly used vision encoder is CLIP [66], which is
pretrained on massive image-text pairs to achieve open-domain visual
understanding capabilities. Video understanding backbones
[64,70,78,85,88,92,98] are also used to extract spatiotemporal features. Some
also use additional encoders for color information [15], emotion information
[46], or intermediate text features [37]. For human movement videos, it is
important to extract motion features for rhythmic alignment, e.g., directly
calculating the first-order difference of human keypoints as motion velocity
[31,53], or using a pre-trained motion encoder [14,54].
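
As a minimal sketch of the first-order difference mentioned above, the snippet below computes a per-frame motion velocity from 2D keypoints. It assumes keypoints are given as one list of (x, y) joint positions per frame; the function name `motion_velocity`, the `fps` parameter, and the choice of averaging joint speeds are illustrative assumptions, not the exact procedure of any cited method.

```python
from math import hypot


def motion_velocity(keypoints, fps=1.0):
    """Mean joint speed between consecutive frames (first-order difference).

    keypoints: list of frames; each frame is a list of (x, y) joint positions.
    Returns a list of length len(keypoints) - 1.
    """
    velocities = []
    for prev, curr in zip(keypoints, keypoints[1:]):
        # Displacement of each joint between two frames, scaled to units per second.
        speeds = [
            hypot(cx - px, cy - py) * fps
            for (px, py), (cx, cy) in zip(prev, curr)
        ]
        velocities.append(sum(speeds) / len(speeds))
    return velocities


# Toy input: 3 frames, 2 joints (hypothetical values).
frames = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(3.0, 4.0), (1.0, 1.0)],
    [(3.0, 4.0), (1.0, 2.0)],
]
print(motion_velocity(frames))  # [2.5, 0.5]
```

Peaks in such a velocity curve are one plausible signal to align with musical beats; real systems typically smooth the curve or feed it to a learned encoder.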
Vision-Music Projection. This component involves mapping the visual features
into the music space. Most methods directly use the visual features as the
input of the music generation model, or through simple cross-attention
mechanisms [15,35,37,52]. Some methods design specialized adapters [16,17,36]
for better feature alignment, e.g. to capture temporal-related or local
features. Besides using feature-based mapping, some studies suggest using
text as an intermediate representation of the visual features [18,38,44,87]
and subsequently utilizing text-to-music models for music generation. Some
symbolic music generation methods use symbolic elements as the vision-music
mapping [13,99].
Music Generation. Once the visual and music features have been aligned, the
next step is to generate the musical output. This stage can be tackled using
auto-regressive or diffusion-based generative models. Auto-regressive models
[40] can be used for both symbolic [13,15,99] and audio music generation
[17,18,37,39,44,45]. Diffusion models [42,86] can be used for symbolic [35]
music to directly generate piano rolls, but mostly for audio music
[12,38,46,101].
Table 1: Methods of vision-to-music generation. Sem: Semantics. Rhy: Rhythm. AR: Auto-regressive. Diff.: Diffusion.
Method | Demo | Date | Input Type | Modality | Music Length | Vision-Music Relationships | Vision Encoding | Vision-Music Projection | Music Generation
▼General Videos and Images:
CMT [13] | Link | 2021/11 | General Video | Symbolic | 3min | Rhy | - | Elements | AR (CP [9])
V-MusProd [15] | Link | 2022/11 | General Video | Symbolic | 6min | Sem, Rhy | CLIP2Video [64], Histogan [65] | Feature | AR (CP [9])
V2Meow [14] | Link | 2023/05 | General Video | Audio | 10sec | Sem, Rhy | CLIP [66], I3D Flow [67], ViT-VQGAN [68] | Feature | AR
MuMu-LLaMA [51] (M2UGen [16]) | Link | 2023/11 | General Video, Image | Audio | 30sec | Sem | ViT [69], ViViT [70] | Adapter | AR (LLaMA2 [71])
Video2Music [34] | Link | 2023/11 | General Video | Symbolic | 5min | Sem, Rhy | CLIP [66] | Feature | AR
EIMG [72] | Link | 2023/12 | Image | Symbolic | 15sec | Sem | ALAE [73], β-VAE [74], VQ-VAE [75] | Adapter | VAE (FNT [76], LSR [77])
Diff-BGM [35] | Link | 2024/05 | General Video | Symbolic | 5min | Sem | VideoCLIP [78] | Feature | Diff. (Polyffusion [79])
Mozart's Touch [44] | Link | 2024/05 | General Video, Image | Audio | 10sec | Sem | BLIP [80] | Text | AR (MusicGen [11])
MeLFusion [52] | Link | 2024/06 | Image | Audio | 10sec | Sem | DDIM [81] + T2I LDM [82] | Feature | Diff.
VidMuse [17] | Link | 2024/06 | General Video | Audio | 20sec | Sem | CLIP [66] | Adapter | AR (MusicGen [11])
S2L2-V2M [83] | Link | 2024/08 | General Video | Audio | 10sec | Sem | Enhanced Video Mamba | Adapter | AR (LLaMA2 [71])
VMAS [45] | Link | 2024/09 | General Video | Audio | 10sec | Sem, Rhy | Hiera [84] | Feature | AR
MuVi [36] | Link | 2024/10 | General Video | Audio | 20sec | Sem, Rhy | VideoMAE V2 [85] | Adapter | Diff. (DiT [86])
SONIQUE [87] | Link | 2024/10 | General Video | Audio | 20sec | Sem, Rhy | Video-LLaMA [88], CLAP [89] | Text | Diff. (Stable Audio [90])
VEH [91] | - | 2024/10 | General Video | Symbolic | 30sec | Sem | VideoChat [92] | Text | AR (T5 [93])
M2M-Gen [94] | Link | 2024/10 | Image (Manga) | Audio | 1min | Sem | CLIP [66], GPT-4 [1] | Text | AR (MusicLM [95])
HPM [46] | Link | 2024/11 | General Video | Audio | 10sec | Sem | CLIP [66], TAVAR [96], WECL [97] | Feature | Diff. (AudioLDM [29])
VidMusician [37] | Link | 2024/12 | General Video | Audio | 30sec | Sem, Rhy | CLIP [66], T5 [93] | Adapter | AR (MusicGen [11])
MTM [38] | Link | 2024/12 | General Video, Image | Audio | 30sec | Sem | InternVL2 [98] | Text | Diff. (Stable Audio Open [12])
XMusic [99] | Link | 2025/01 | General Video, Image | Symbolic | 20sec | Sem, Rhy | ResNet [100], CLIP [66] | Elements | AR (CP [9])
GVMGen [39] | Link | 2025/01 | General Video | Audio | 15sec | Sem | CLIP [66] | Adapter | AR (MusicGen [11])
AudioX [101] | Link | 2025/03 | General Video | Audio | 10sec | Sem | CLIP [66] | Feature | Diff. (Stable Audio Open [12])
FilmComposer [18] | Link | 2025/03 | General Video | Audio | 15sec | Sem, Rhy | Controllable Rhythm Transformer, GPT-4 [1], Motion Detector | Text | AR (MusicGen [11])
▼Human Movement Videos:
Audeo [61] | Link | 2020/06 | Performance Video | Symbolic | 30sec | Rhy | ResNet [100] | Feature | GAN
Foley Music [62] | Link | 2020/07 | Performance Video | Symbolic | 10sec | Rhy | 2D Body Keypoints | Feature | AR
Multi-Instrument Net [60] | - | 2020/12 | Performance Video | Audio | 10sec | Rhy | 2D Body Keypoints | Feature | VAE
RhythmicNet [53] | Link | 2021/06 | Dance Video | Symbolic | 10sec | Rhy | 2D Body Keypoints | Feature | AR (REMI [102])
Dance2Music [57] | Link | 2021/07 | Dance Video | Symbolic | 12sec | Rhy | 2D Body Keypoints | Feature | AR
D2M-GAN [54] | Link | 2022/04 | Dance Video | Audio | 2sec | Rhy | 2D Body Keypoints, I3D [103] | Feature | GAN
CDCD [55] | Link | 2022/06 | Dance Video | Audio | 2sec | Rhy | 2D Body Keypoints, I3D [103] | Feature | Diff.
LORIS [31] | Link | 2023/05 | Movement Video | Audio | 50sec | Rhy | 2D Body Keypoints, I3D [103] | Feature | Diff.
VisBeatNet [56] | - | 2024/01 | Dance Video | Symbolic | Realtime | Rhy | 2D Body Keypoints | Feature | AR
UniMuMo [104] | Link | 2024/10 | Dance Video | Audio | 10sec | Rhy | 2D Body Keypoints | Feature | Diff.

3.3 Vision-Music Relationships
Vision-music relationships establish the correspondence between videos and
music. Unlike the vision-music projection discussed in the previous section
(which focuses on the architecture), the relationships discussed here focus
on the overall correspondence between input and output. These relationships
can be broadly classified into two categories: semantic relationships and
rhythmic relationships.
Semantic Relationships. This type of relationship focuses on how visual
elements (such as color, objects, or scenes) relate to musical components
(such as mood, melody, or chords). For music performance videos, the music is
determined by the video instead of a general and implicit semantic
relationship [61,62]. For dance and movement videos, the semantics in the
video is not utilized. Symbolic methods [15,34,99] explicitly define
semantic, color, and emotion relationships extracted from pretrained models
to utilize the controllability of symbolic music. Recent audio-based methods
generally use a single vision encoder to extract semantic features. These
semantic features are usually global and insensitive to semantic changes
within the video. Some methods [17,36,39] also design special modules to
enhance local semantic correspondence. However, for most audio methods
generating 10-second music, the concept of “local” may not have a significant
impact.
Rhythmic Relationships. Rhythmic relationships mainly refer to the
correspondence between the rhythm of the video (e.g. local movements, scene
transitions, global video rhythm) and the rhythm of the music (e.g. local
beats, global tempo). For human movement videos, such as dance or instrument
playing, rhythmic relationships become significant, especially the
correspondence between local rhythm and human movements. For general videos,
early works [13–15,45] use optical flow or RGB difference to represent the
video rhythm. Recent works mostly do not consider rhythm information or use
frame-by-frame semantic features to implicitly provide local rhythm
correspondence [17,37], which is not prominent in the generated music. In
methods that use text for vision-music projection [38,87], the video content
is used to generate requirements for musical rhythm, such as the rhythm of
each scene or the overall tempo.
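
The RGB-difference cue used by early general-video methods can be illustrated with a small sketch. It assumes each frame is already flattened into a list of pixel intensities of equal length; the function names, the mean-absolute-difference formulation, and the fixed threshold are illustrative assumptions rather than the definition used in any cited work.

```python
def frame_difference_curve(frames):
    """Mean absolute pixel difference between consecutive frames.

    frames: list of frames, each a flat list of pixel intensities.
    Returns one value per consecutive frame pair.
    """
    curve = []
    for prev, curr in zip(frames, frames[1:]):
        total = sum(abs(c - p) for p, c in zip(prev, curr))
        curve.append(total / len(curr))
    return curve


def detect_cuts(curve, threshold):
    """Indices of frames whose difference to the previous frame exceeds threshold."""
    return [i + 1 for i, diff in enumerate(curve) if diff > threshold]


# Toy clip: a dark scene, then an abrupt change to a bright scene at frame 2.
clip = [
    [10, 10, 10, 10],
    [12, 10, 10, 10],
    [200, 190, 210, 200],
    [200, 190, 210, 202],
]
curve = frame_difference_curve(clip)
print(curve)                   # [0.5, 189.5, 0.5]
print(detect_cuts(curve, 50))  # [2]
```

A difference curve of this kind gives a coarse visual rhythm: large spikes suggest scene transitions, while sustained moderate values suggest continuous motion.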
4. DATASETS
In this section, we introduce common datasets for vision-to-music. Plenty of
datasets have been proposed in the literature of the vision-to-music field,
and different methods often use different datasets for training and testing.
Therefore, it is necessary to organize and analyze these datasets. Common
datasets are listed in Tab. 2.

Table 2: Datasets of vision-to-music generation.
Dataset | Access | Date | Source | Modality | Size | Total Length (hour) | Avg. Length (second) | Annotations
▼General Videos:
HIMV-200K [105] | Link | 2017/04 | Music Video (Youtube-8M [106]) | Audio | 200K | - | - | -
MVED [107] | Link | 2020/09 | Music Video | Audio | 1.9K | 16.5 | 30 | Emotion
SymMV [15] | Link | 2022/11 | Music Video | MIDI, Audio | 1.1K | 76.5 | 241 | Lyrics, Genre, Chord, Melody, Tonality, Beat
MV100K [14] | - | 2023/05 | Music Video (Youtube-8M [106]) | Audio | 110K | 5000 | 163 | Genre
MusicCaps [95] | Link | 2023/01 | Diverse Videos (AudioSet [108]) | Audio | 5.5K | 15.3 | 10 | Genre, Caption, Emotion, Tempo, Instrument, Rhythm, ...
EmoMV [109] | Link | 2023/03 | Music Video (MVED [107], AudioSet [108]) | Audio | 6K | 44.3 | 27 | Emotion
MUVideo [16] | Link | 2023/11 | Diverse Videos (Balanced-AudioSet [108]) | Audio | 14.5K | 40.3 | 10 | Instructions
MuVi-Sync [34] | Link | 2023/11 | Music Video | MIDI, Audio | 784 | - | - | Scene Offset, Emotion, Motion, Semantic, Chord, Key, Loudness, Density, ...
BGM909 [35] | Link | 2024/05 | Music Video | MIDI | 909 | - | - | Caption, Style, Chord, Melody, Beat, Shot
V2M [17] | - | 2024/06 | Diverse Videos | Audio | 360K | 18000 | 180 | Genre
DISCO-MV [45] | - | 2024/09 | Music Video (DISCO-10M [110]) | Audio | 2200K | 47000 | 77 | Genre
FilmScoreDB [46] | - | 2024/11 | Film Video | Audio | 32K | 90.3 | 10 | Movie Title
DVMSet [37] | - | 2024/12 | Diverse Videos | Audio | 3.8K | - | - | -
HarmonySet [111] | Link | 2025/03 | Diverse Videos | Audio | 48K | 458.8 | 32 | Description
MusicPro-7k [18] | Link | 2025/03 | Film Video | Audio | 7K | - | - | Description, Melody, Rhythm Spots
▼Human Movement Videos:
URMP [112] | Link | 2016/12 | Performance Video | MIDI, Audio | 44 | 1.3 | 106 | Instruments
MUSIC [113] | Link | 2018/04 | Performance Video | Audio | 685 | 45.7 | 239 | Instruments
AIST++ [114] | Link | 2021/01 | Dance Video (AIST [115]) | Audio | 1.4K | 5.2 | 13 | 3D Motion
TikTok Dance-Music [54] | Link | 2022/04 | Dance Video | Audio | 445 | 1.5 | 12 | -
LORIS [31] | Link | 2023/05 | Dance Video, Sports Video (AIST [115], FisV [116], FS1000 [117]) | Audio | 16K | 86.43 | 19 | 2D Pose
▼Images:
Music-Image [118] | Link | 2016/07 | Image (Music Video) | Audio | 22.6K | 377 | 60 | Lyrics
Shuttersong [119] | Link | 2017/08 | Image (Shuttersong App) | Audio | 586 | - | - | Lyrics
IMAC [120] | Link | 2019/04 | Image (FI [121]) | Audio | 3.8K | 63.3 | 60 | Emotion
MUImage [16] | Link | 2023/11 | Image (Balanced-AudioSet [108]) | Audio | 14.5K | 40.3 | 10 | Instructions
EIMG [72] | Link | 2023/12 | Image (IAPS [122], NAPS [123]) | MIDI | 3K | 12.5 | 15 | VA Value
MeLBench [52] | Link | 2024/06 | Image (Diverse Videos) | Audio | 11.2K | 31.2 | 10 | Genre, Caption

4.1 Input Categories
Based on the types of videos/images in vision-music datasets, we categorize
the datasets as follows:
General Videos. Videos in these datasets are usually sourced from platforms
like YouTube. Most datasets focus on Music Videos [15,34,35,45,105,107], as
they have satisfactory video-music alignment and are easier to collect. Other
datasets include a variety of video types, such as trailers, advertisements,
animations, and documentaries [17,37], or subsets from large datasets like
AudioSet [108]. These videos offer better diversity, but the video-music
alignment may be weaker, requiring strict filtering [16,95]. FilmScoreDB and
MusicPro-7k [18,46] focus on film scores, where the music has a deeper
semantic correspondence with the video and serves as an accompaniment rather
than being the primary focus, as in music videos. Recently, some datasets
also provide textual descriptions of videos and music [38,111,124] to assist
text-bridged video-to-music generation methods.
Human Movement Videos. These videos can be divided into instrument
performances and dance/sport categories. Instrument performance datasets
[112,113] aim to reconstruct music from instrumental performance videos.
Dance/sport datasets [31,54,114] focus on generating music from dance or
sports videos, emphasizing local rhythmic alignment while downplaying
semantic relationships.
Images. Existing image-to-music datasets are relatively scarce. Sources of
the images are usually frames from music videos [52,118] or existing image
datasets [16,72,120].
4.2 Music Domains
Vision-music datasets can be divided into MIDI and audio based on the music
modality. MIDI datasets [15,34,35] are created by transcribing audio into the
MIDI format or sourced from existing music-only datasets [125]. Audio
datasets contain only raw audio files.
Compared to audio datasets, MIDI datasets have the following advantages:
(1) More annotations like Chord, Melody, Beat, Tonality, etc.; (2) Longer
average duration enables generating longer music pieces; (3) Suitable for
training both symbolic and audio music generation models. However, a
significant limitation of MIDI datasets is their smaller scale (e.g. 1K
songs, 100 hours vs. 100K-2M songs, 5K-50K hours) and relatively limited
diversity.
5. EVALUATION
Common metrics for vision-to-music are categorized in Tab. 3 and 4. The
evaluation of vision-to-music generation can be divided into two categories:
objective and subjective. Objective evaluation uses fixed rule-based
algorithms or existing models to extract features and calculate musical
metrics. It is relatively objective and convenient for fair comparison, but
has certain biases and cannot cover all aspects of music generation, often
differing significantly from human subjective perception. Similar to other
generation tasks [126,127], subjective evaluation is typically used in
vision-to-music generation for a more comprehensive assessment, i.e.
conducting user studies where participants rate/compare music generated by
different models.
From another perspective, metrics can be divided into music-only and
vision-music correspondence based on assessment purposes. The former only
evaluates whether the music itself is pleasant/realistic/structurally
complete, etc., while the latter focuses on the correspondence between the
music and the visual input.
Table 3: Objective metrics of vision-to-music generation. M: MIDI. A: Audio. V: Video. I: Image. T: Text. Pit: Pitch. Rhy: Rhythm. Fid: Fidelity. Sem: Semantic.
Metric | Used in Paper | Input | Type
▼Music-only:
Scale Consistency | [15,83] | M | Pit
Pitch Entropy | [15,72,83] | M | Pit
Pitch Class Histogram Entropy | [13,15,35,83,99] | M | Pit
Empty Beat Rate | [15,83,99] | M | Rhy
Average Inter-Onset Interval | [15,83] | M | Rhy
Grooving Pattern Similarity | [13,35,99] | M | Rhy
Structure Indicator | [13,35] | M | Rhy
Fréchet Audio Distance (FAD) | [14,16–18,36,39,44–46,52,83,91,101] | A | Fid
Fréchet Distance (FD) | [14,17,36–38,52,87,101] | A | Fid
Kullback-Leibler Divergence (KL) | [14,16–18,36–39,44–46,52,83,87,91,101] | A | Fid
Beats Coverage Score (BCS) | [36,46] | A | Rhy
Beats Hit Score (BHS) | [36,46] | A | Rhy
Inception Score (IS) | [36,46,101] | A | Fid
▼Vision-music Correspondence:
ImageBind Score/Rank | [16–18,37,38,44,83,101] | A,V/I | Sem
CLAP Score | [37,87,91] | A,A/T | Sem
Video-Music CLIP Precision | [15,83] | A,V | Sem
Video-Music Correspondence | [35] | A,V | Sem
Cross-modal Relevance | [39] | A,V | Sem
Temporal Alignment | [39] | A,V | Rhy
Rhythm Alignment | [37] | A,V | Rhy

For music-only objective metrics, symbolic music generation methods
[13,15,35,72,83,99] use some statistics-based methods to calculate certain
pitch or rhythm-related statistical metrics of MIDI, such as Scale
Consistency, Pitch Entropy, etc. These metrics are usually compared with
ground truth music, and the closer they are, the more realistic the music is
considered. Audio music generation methods widely adopt metrics such as
Fréchet Audio Distance (FAD), Fréchet Distance (FD)², and Kullback-Leibler
Divergence (KL) to evaluate the similarity between generated music and ground
truth music. Some methods [36,46] also introduce metrics like BCS and BHS to
measure rhythmic similarity based on music beats.
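
To make the statistics-based symbolic metrics more concrete, the following sketch computes a pitch class histogram entropy over a list of MIDI note numbers. It assumes the common formulation as the Shannon entropy (base 2) of the 12-bin pitch class distribution; individual papers may compute it over fixed-length bar windows or with different normalization, so the function name and details here are illustrative only.

```python
from collections import Counter
from math import log2


def pitch_class_histogram_entropy(midi_pitches):
    """Shannon entropy (bits) of the pitch class distribution.

    midi_pitches: iterable of MIDI note numbers (0-127).
    Lower values indicate a clearer tonal center; the maximum is log2(12).
    """
    counts = Counter(pitch % 12 for pitch in midi_pitches)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one note is required")
    return sum(-(c / total) * log2(c / total) for c in counts.values())


# A passage using only C and D in several octaves: two equally likely classes.
print(pitch_class_histogram_entropy([60, 62, 72, 74]))  # 1.0

# A full chromatic scale spreads mass evenly over all 12 classes.
chromatic = list(range(60, 72))
print(pitch_class_histogram_entropy(chromatic))
```

In an evaluation setting, the same statistic would be computed for generated and ground truth pieces, and the gap between the two reported.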
Objective metrics for vision-music correspondence usually focus on the audio
modality. The most commonly used are ImageBind Score/Rank and CLAP score,
which leverage pretrained multimodal models like ImageBind [131] and CLAP
[89] for similarity evaluation. Some methods [15,35,39,83] have also designed
specific vision-music retrieval evaluation metrics, with slight differences
in model selection and retrieval methods. Additionally, GVMGen [39] and
VidMusician [37] have designed objective metrics to evaluate the rhythmic
correspondence between visions and music. However, since the pretrained
models are usually trained with general audio data instead of specified music
data, these objective metrics commonly do not perfectly align with human
judgments.
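
Embedding-based correspondence scores of this kind generally reduce to a similarity between a visual embedding and a music embedding in a shared space. The sketch below assumes the embeddings have already been produced by some pretrained multimodal encoder (no real ImageBind or CLAP API is called); `cosine_similarity`, `correspondence_score`, and `retrieval_rank` are hypothetical helper names, and the rank definition is one simple variant rather than the exact protocol of any cited paper.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def correspondence_score(visual_embs, music_embs):
    """Mean cosine similarity over paired (visual, music) embeddings."""
    sims = [cosine_similarity(v, m) for v, m in zip(visual_embs, music_embs)]
    return sum(sims) / len(sims)


def retrieval_rank(query_emb, candidate_embs, target_index):
    """1-based rank of the target candidate when sorted by similarity to the query."""
    sims = [cosine_similarity(query_emb, c) for c in candidate_embs]
    return 1 + sum(1 for s in sims if s > sims[target_index])


# Placeholder 2-D embeddings (real encoders output hundreds of dimensions).
video = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(retrieval_rank(video, candidates, target_index=1))  # 1
print(retrieval_rank(video, candidates, target_index=0))  # 3
```

Because the score inherits whatever the underlying encoder learned, an encoder trained mostly on general audio can rank music poorly even when the arithmetic is correct, which is the misalignment with human judgment noted above.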
Subjective metrics mainly include MOS (generally using a 5-point Likert
scale), pair preference (i.e. win rate), and ranking different music. Common
subjective metrics in vision-to-music generation are given in Tab. 4. The
selection of specific subjective metrics depends on the vision-music
relationship emphasized by the method.
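
As a small illustration of how the two most common subjective scores are aggregated, the sketch below computes a mean opinion score from 5-point ratings and a win rate from pairwise preferences. The function names, the rating validation, and the choice to count ties as non-wins are assumptions made for this example; user-study protocols differ between papers.

```python
from statistics import mean


def mean_opinion_score(ratings, low=1, high=5):
    """Average of Likert ratings, rejecting values outside [low, high]."""
    for rating in ratings:
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} outside {low}-{high} scale")
    return mean(ratings)


def win_rate(preferences, model):
    """Fraction of pairwise comparisons in which `model` was preferred.

    preferences: list of labels, one per comparison (ties count as non-wins).
    """
    if not preferences:
        raise ValueError("at least one comparison is required")
    return sum(1 for choice in preferences if choice == model) / len(preferences)


# Hypothetical user-study results.
ratings = [4, 5, 3, 4]
choices = ["ours", "baseline", "ours", "ours"]
print(mean_opinion_score(ratings))  # 4
print(win_rate(choices, "ours"))    # 0.75
```

Reporting the number of participants and samples alongside these aggregates makes the results easier to compare across studies.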
² The difference between FAD and FD is the feature extractor: FAD [128] uses
VGGish [129], while FD uses PANNs [130].

Table 4: Subjective metrics of vision-to-music generation.
Metric | Used in Paper
▼Music-only:
Music Melody | [15,35]
Music Rhythm | [15,35]
Music Richness | [39,99]
Audio Quality | [17,36]
Overall Music Quality | [13,14,17,18,34,38,39,44,45,52,91,94]
▼Vision-music Correspondence:
Semantic Consistency | [15,18,35–38]
Rhythm Consistency | [15,18,34,35,38,91,99]
Emotion Consistency | [38,91,99]
Overall Correspondence | [13–18,34,39,44,45,52,83,87,94]

6. CHALLENGES
Despite advances in vision-to-music generation, we identify several key
challenges for the academic community:
Lack of Standardized Objective Datasets and Benchmarks: The training and
evaluation datasets differ across models, sometimes leading to comparisons
between models fine-tuned on proprietary datasets and those evaluated via
zero-shot inference on other datasets. This disparity significantly
undermines the fairness of model comparisons and makes it challenging to
identify the state-of-the-art. Besides, current evaluation metrics often do
not align with actual human perception, e.g. FAD and KL are based on general
audio data rather than on music-specific data, and symbolic metrics are
statistically based and exhibit low correlation with human preferences.
Though most papers provide demos for qualitative comparisons, they are prone
to issues such as cherry-picked examples, insufficient sample size, and
subjectivity in evaluating the outputs.
Limited Customization and Controllability: Most existing models function as
black boxes, making it challenging to personalize or control attributes of
the generated music, such as style, instrumentation, and rhythm. This
significantly affects the models' practical applicability.
Trade-off Between Symbolic and Audio Forms: As discussed in Sec. 2,
audio-based methods benefit from large-scale data but generally offer limited
controllability and are constrained by the computational cost of
high-fidelity generation, often resulting in shorter musical pieces. In
contrast, symbolic approaches, while limited by available data, offer better
controllability and can produce longer compositions. A promising direction is
to combine symbolic and audio methods to achieve a better trade-off.
Beyond academic challenges, exploring how to align these technologies with
music industry applications (such as film and game scoring) and end-user
products (like short-video platform background music) offers significant
commercial opportunities.
7. CONCLUSION
In this paper, we review the recent advancements in vision-to-music
generation, covering both symbolic and audio-based approaches. We identified
key technical challenges in visual feature extraction, cross-modal
projection, and music generation, and discussed the limitations in current
datasets and evaluation metrics. We believe that addressing these challenges
will pave the way for future research and applications on vision-to-music
generation.
8. ETHICAL STATEMENT
This wo k is conduc ed wi h a clea awa eness o he
e hical challenges associa ed wi h ision- o-music gene -
a ion echnologies. As models lea n o ansla e isual in-
pu s—such as images, ideos, o a wo ks—in o music,
hey ope a e a he in e sec ion o mul iple cul u al, le-
gal, and emo ional domains. One majo conce n is copy-
igh in ingemen , pa icula ly when models a e ained
on copy igh ed music ha closely esemble exis ing com-
posi ions wi hou clea a ibu ion o licensing. Gi en
he c ea i e and exp essi e na u e o music, e en s ylis ic
mimic y may aise legal and e hical ques ions.
In addition, these models risk amplifying cultural or
stylistic biases present in the training data. For example,
models trained primarily on Western classical or pop music
may marginalize non-Western musical traditions or underrepresent
diverse emotional and cultural expressions. The
alignment between visual input and musical output can
further reinforce problematic stereotypes or fail to capture
culturally appropriate interpretations, especially when visual
content carries religious or political significance.
A third dimension involves emotional appropriateness.
Music is a powerful emotional medium. Generating music
from sensitive or traumatic visual content, such as scenes
of violence, grief, or historical trauma, may result in emotionally
discordant or insensitive outcomes.
To mitigate these risks, we advocate for transparent data
collection practices, inclusion of diverse musical and visual
cultures, and evaluation frameworks that assess not
only perceptual quality but also cultural and emotional
alignment. We further encourage interdisciplinary collaboration
between technologists, musicians, ethicists, and legal
scholars to ensure the responsible development and deployment
of these systems.
9. REFERENCES
[1] J. Achiam, S. Adle , S. Aga wal, L. Ahmad, I. Akkaya,
F. L. Aleman, D. Almeida, J. Al enschmid , S. Al -
man, S. Anadka e al., “Gp -4 echnical epo ,” a Xi
p ep in a Xi :2303.08774, 2023.
[2] H. Zhang, X. Li, and L. Bing, “Video-llama:
An ins uc ion- uned audio- isual language
model o ideo unde s anding,” a Xi p ep in
a Xi :2306.02858, 2023.
[3] G. Luo, X. Yang, W. Dou, Z. Wang, J. Liu, J. Dai,
Y. Qiao, and X. Zhu, “Mono-in e n l: Pushing he
bounda ies o monoli hic mul imodal la ge language
models wi h endogenous isual p e- aining,” a Xi
p ep in a Xi :2410.08202, 2024.
[4] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang,
F. Zhang, Y. Wang, Z. Li, Q. Yu e al., “Emu3:
Nex - oken p edic ion is all you need,” a Xi p ep in
a Xi :2409.18869, 2024.
[5] Z. Wang, X. Zhu, X. Yang, G. Luo, H. Li, C. Tian,
W. Dou, J. Ge, L. Lu, Y. Qiao, and J. Dai, “Pa ame e -
in e ed image py amid ne wo ks o isual pe cep-
ion and mul imodal unde s anding,” a Xi p ep in
a Xi :2501.07783, 2025.
[6] Y. Tang, A. Qu, Z. Wang, D. Zhuang, Z. Wu, W. Ma,
S. Wang, Y. Zheng, Z. Zhao, and J. Zhao, “Spa kle:
Mas e ing basic spa ial capabili ies in ision language
models elici s gene aliza ion o composi e spa ial ea-
soning,” a Xi p ep in a Xi :2410.16162, 2024.
[7] B. Kang, Y. Yue, R. Lu, Z. Lin, Y. Zhao, K. Wang,
G. Huang, and J. Feng, “How a is ideo gene a ion
om wo ld model: A physical law pe spec i e,” a Xi
p ep in a Xi :2411.02385, 2024.
[8] Z. Bo sos, R. Ma inie , D. Vincen , E. Kha i ono ,
O. Pie quin, M. Sha i i, D. Roblek, O. Teboul,
D. G angie , M. Tagliasacchi e al., “Audiolm: a
language modeling app oach o audio gene a ion,”
IEEE/ACM ansac ions on audio, speech, and lan-
guage p ocessing, ol. 31, pp. 2523–2533, 2023.
[9] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang,
“Compound wo d ans o me : Lea ning o compose
ull-song music o e dynamic di ec ed hype g aphs,”
in P oceedings o he AAAI Con e ence on A i icial In-
elligence, ol. 35, no. 1, 2021, pp. 178–186.
[10] A. Agos inelli, T. I. Denk, Z. Bo sos, J. Engel,
M. Ve ze i, A. Caillon, Q. Huang, A. Jansen,
A. Robe s, M. Tagliasacchi e al., “Musiclm:
Gene a ing music om ex ,” a Xi p ep in
a Xi :2301.11325, 2023.
[11] J. Cope , F. K euk, I. Ga , T. Remez, D. Kan , G. Syn-
nae e, Y. Adi, and A. Dé ossez, “Simple and con ol-
lable music gene a ion,” Ad ances in Neu al In o ma-
ion P ocessing Sys ems, ol. 36, pp. 47 704–47 720,
2023.
[12] Z. E ans, J. D. Pa ke , C. Ca , Z. Zukowski, J. Tay-
lo , and J. Pons, “S able audio open,” a Xi p ep in
a Xi :2407.14358, 2024.
[13] S. Di, Z. Jiang, S. Liu, Z. Wang, L. Zhu, Z. He, H. Liu,
and S. Yan, “Video backg ound music gene a ion wi h
con ollable music ans o me ,” in P oceedings o he
29 h ACM In e na ional Con e ence on Mul imedia,
2021, pp. 2037–2045.
[14] K. Su, J. Y. Li, Q. Huang, D. Kuzmin, J. Lee, C. Don-
ahue, F. Sha, A. Jansen, Y. Wang, M. Ve ze i e al.,
“V2meow: meowing o he isual bea ia ideo- o-
music gene a ion,” in P oceedings o he AAAI Con e -
ence on A i icial In elligence, ol. 38, no. 5, 2024, pp.
4952–4960.
[15] L. Zhuo, Z. Wang, B. Wang, Y. Liao, C. Bao, S. Peng,
S. Han, A. Zhang, F. Fang, and S. Liu, “Video back-
g ound music gene a ion: Da ase , me hod and e alua-
ion,” in P oceedings o he IEEE/CVF In e na ional
Con e ence on Compu e Vision, 2023, pp. 15 637–
15 647.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
229
[16] S. Liu, A. S. Hussain, Q. Wu, C. Sun, and Y. Shan,
“Mumu-llama: Mul i-modal music unde s anding and
gene a ion ia la ge language models,” a Xi p ep in
a Xi :2412.06660, 2024.
[17] Z. Tian, Z. Liu, R. Yuan, J. Pan, Q. Liu, X. Tan,
Q. Chen, W. Xue, and Y. Guo, “Vidmuse: A simple
ideo- o-music gene a ion amewo k wi h long-sho -
e m modeling,” a Xi p ep in a Xi :2406.04321,
2024.
[18] Z. Xie, Q. He, Y. Zhu, Q. He, and M. Li, “Film-
compose : Llm-d i en music p oduc ion o silen ilm
clips,” a Xi p ep in a Xi :2503.08147, 2025.
[19] P. Esse , S. Kulal, A. Bla mann, R. En eza i, J. Mülle ,
H. Saini, Y. Le i, D. Lo enz, A. Saue , F. Boesel e al.,
“Scaling ec i ied low ans o me s o high- esolu ion
image syn hesis,” in Fo y- i s in e na ional con e -
ence on machine lea ning, 2024.
[20] L. Zhuo, R. Du, H. Xiao, Y. Li, D. Liu, R. Huang,
W. Liu, X. Zhu, F.-Y. Wang, Z. Ma e al., “Lumina-
nex : Making lumina- 2x s onge and as e wi h nex -
di ,” in The Thi y-eigh h Annual Con e ence on Neu al
In o ma ion P ocessing Sys ems.
[21] J. Chen, Y. Jincheng, G. Chongjian, L. Yao, E. Xie,
Z. Wang, J. Kwok, P. Luo, H. Lu, and Z. Li, “Pixa -α:
Fas aining o di usion ans o me o pho o ealis ic
ex - o-image syn hesis,” in The Twel h In e na ional
Con e ence on Lea ning Rep esen a ions.
[22] A. Polyak, A. Zoha , A. B own, A. Tjand a, A. Sinha,
A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang,
D. Yan e al., “Mo ie gen: A cas o media ounda ion
models,” a Xi p ep in a Xi :2410.13720, 2024.
[23] W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou,
J. Xiong, X. Li, B. Wu, J. Zhang e al., “Hunyuan-
ideo: A sys ema ic amewo k o la ge ideo gene a-
i e models,” a Xi p ep in a Xi :2412.03603, 2024.
[24] Y. Ma, A. Øland, A. Ragni, B. M. Del Se e, C. Sai is,
C. Donahue, C. Lin, C. Plachou as, E. Bene os, E. Sha-
i e al., “Founda ion models o music: A su ey,”
a Xi p ep in a Xi :2408.14340, 2024.
[25] S. Ji, X. Yang, and J. Luo, “A su ey on deep lea ning
o symbolic music gene a ion: Rep esen a ions, algo-
i hms, e alua ions, and challenges,” ACM Compu ing
Su eys, ol. 56, no. 1, pp. 1–39, 2023.
[26] S. Ji, J. Luo, and X. Yang, “A comp ehensi e su ey
on deep music gene a ion: Mul i-le el ep esen a ions,
algo i hms, e alua ions, and u u e di ec ions,” a Xi
p ep in a Xi :2011.06801, 2020.
[27] C. He nandez-Oli an and J. R. Bel an, “Music com-
posi ion wi h deep lea ning: A e iew,” Ad ances in
speech and music echnology: compu a ional aspec s
and applica ions, pp. 25–50, 2022.
[28] Y. Zhu, J. Baca, B. Rekabda , and R. Rawassizadeh, “A
su ey o ai music gene a ion ools and models,” a Xi
p ep in a Xi :2308.12982, 2023.
[29] H. Liu, Y. Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian,
Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley,
“Audioldm 2: Lea ning holis ic audio gene a ion wi h
sel -supe ised p e aining,” IEEE/ACM T ansac ions
on Audio, Speech, and Language P ocessing, 2024.
[30] X. Liu, K. Su, and E. Shlize man, “Tell wha you hea
om wha you see– ideo o audio gene a ion h ough
ex ,” a Xi p ep in a Xi :2411.05679, 2024.
[31] J. Yu, Y. Wang, X. Chen, X. Sun, and Y. Qiao, “Long-
e m hy hmic ideo sound acke ,” in In e na ional
Con e ence on Machine Lea ning. PMLR, 2023, pp.
40 339–40 353.
[32] Z. Tang, Z. Yang, C. Zhu, M. Zeng, and M. Bansal,
“Any- o-any gene a ion ia composable di usion,”
Neu IPS, ol. 36, 2024.
[33] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “Nex -
gp : Any- o-any mul imodal llm,” a Xi : 2309.05519,
2023.
[34] J. Kang, S. Po ia, and D. He emans, “Video2music:
Sui able music gene a ion om ideos using an a ec-
i e mul imodal ans o me model,” Expe Sys ems
wi h Applica ions, ol. 249, p. 123640, 2024.
[35] S. Li, Y. Qin, M. Zheng, X. Jin, and Y. Liu, “Di -bgm:
A di usion model o ideo backg ound music gene -
a ion,” in P oceedings o he IEEE/CVF Con e ence on
Compu e Vision and Pa e n Recogni ion, 2024, pp.
27 348–27 357.
[36] R. Li, S. Zheng, X. Cheng, Z. Zhang, S. Ji, and
Z. Zhao, “Mu i: Video- o-music gene a ion wi h
seman ic alignmen and hy hmic synch oniza ion,”
a Xi p ep in a Xi :2410.12957, 2024.
[37] S. Li, B. Yang, C. Yin, C. Sun, Y. Zhang, W. Dong,
and C. Li, “Vidmusician: Video- o-music gene a ion
wi h seman ic- hy hmic alignmen ia hie a chical i-
sual ea u es,” a Xi p ep in a Xi :2412.06296, 2024.
[38] B. Wang, L. Zhuo, Z. Wang, C. Bao, W. Chengjing,
X. Nie, J. Dai, J. Han, Y. Liao, and S. Liu, “Mul imodal
music gene a ion wi h explici b idges and e ie al
augmen a ion,” a Xi p ep in a Xi :2412.09428,
2024.
[39] H. Zuo, W. You, J. Wu, S. Ren, P. Chen, M. Zhou,
Y. Lu, and L. Sun, “G mgen: A gene al ideo- o-
music gene a ion model wi h hie a chical a en ions,”
a Xi p ep in a Xi :2501.09972, 2025.
[40] A. Vaswani, N. Shazee , N. Pa ma , J. Uszko ei ,
L. Jones, A. N. Gomez, Ł. Kaise , and I. Polosukhin,
“A en ion is all you need,” Ad ances in neu al in o -
ma ion p ocessing sys ems, ol. 30, 2017.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
230
[41] I. J. Good ellow, J. Pouge -Abadie, M. Mi za, B. Xu,
D. Wa de-Fa ley, S. Ozai , A. Cou ille, and Y. Ben-
gio, “Gene a i e ad e sa ial ne s,” Ad ances in neu al
in o ma ion p ocessing sys ems, ol. 27, 2014.
[42] J. Ho, A. Jain, and P. Abbeel, “Denoising di usion
p obabilis ic models,” Ad ances in neu al in o ma ion
p ocessing sys ems, ol. 33, pp. 6840–6851, 2020.
[43] S.-L. Wu, C. Donahue, S. Wa anabe, and N. J. B yan,
“Music con olne : Mul iple ime- a ying con ols o
music gene a ion,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, ol. 32, pp. 2692–
2703, 2024.
[44] J. Li, T. Xu, X. Chen, X. Yao, and S. Liu, “Moza ’s
ouch: A ligh weigh mul i-modal music gene a ion
amewo k based on p e- ained la ge models,” a Xi
p ep in a Xi :2405.02801, 2024.
[45] Y.-B. Lin, Y. Tian, L. Yang, G. Be asius, and
H. Wang, “Vmas: Video- o-music gene a ion ia se-
man ic alignmen in web music ideos,” a Xi p ep in
a Xi :2409.07450, 2024.
[46] F. Qi, L. Ni, and C. Xu, “Ha monizing pixels
and melodies: Maes o-guided ilm sco e gene a-
ion and composi ion s yle ans e ,” a Xi p ep in
a Xi :2411.07539, 2024.
[47] X. Tan, M. An ony, and H. Kong, “Au oma ed music
gene a ion o isual a h ough emo ion.” in ICCC,
2020, pp. 247–250.
[48] A. San os, H. Pin o, R. Pe ei a Jo ge, and N. Co eia,
“Musy i: music syn hesis om images,” in P oceed-
ings o he 12 h In e na ional Con e ence on Compu-
a ional C ea i i y, 2021, pp. 103–112.
[49] R. Zhang, Y. Zhang, K. Shao, Y. Shan, and G. Xia,
“Vis2mus: Explo ing mul imodal ep esen a ion map-
ping o con ollable music gene a ion,” a Xi p ep in
a Xi :2211.05543, 2022.
[50] Z. Xiong, P.-C. Lin, and A. Fa judian, “Re aining se-
man ics in image o music con e sion,” in 2022 IEEE
In e na ional Symposium on Mul imedia (ISM). IEEE,
2022, pp. 228–235.
[51] S. Liu, A. S. Hussain, C. Sun, and Y. Shan, “M2ugen:
Mul i-modal music unde s anding and gene a ion wi h
he powe o la ge language models,” a Xi p ep in
a Xi :2311.11255, 2023.
[52] S. Chowdhu y, S. Nag, K. Joseph, B. V. S ini asan,
and D. Manocha, “Mel usion: Syn hesizing music
om image and language cues using di usion mod-
els,” in P oceedings o he IEEE/CVF Con e ence on
Compu e Vision and Pa e n Recogni ion, 2024, pp.
26 826–26 835.
[53] K. Su, X. Liu, and E. Shlize man, “How does i
sound?” Ad ances in Neu al In o ma ion P ocessing
Sys ems, ol. 34, pp. 29 258–29 273, 2021.
[54] Y. Zhu, K. Olszewski, Y. Wu, P. Achliop as, M. Chai,
Y. Yan, and S. Tulyako , “Quan ized gan o com-
plex music gene a ion om dance ideos,” in Eu o-
pean Con e ence on Compu e Vision. Sp inge , 2022,
pp. 182–199.
[55] Y. Zhu, Y. Wu, K. Olszewski, J. Ren, S. Tulyako , and
Y. Yan, “Disc e e con as i e di usion o c oss-modal
music and image gene a ion,” in The Ele en h In e na-
ional Con e ence on Lea ning Rep esen a ions, 2023.
[56] X. Liu, K. Su, and E. Shlize man, “Le he bea ol-
low you-c ea ing in e ac i e d um sounds om body
hy hm,” in P oceedings o he IEEE/CVF Win e Con-
e ence on Applica ions o Compu e Vision, 2024, pp.
7187–7197.
[57] G. Agga wal and D. Pa ikh, “Dance2music: Au o-
ma ic dance-d i en music gene a ion,” a Xi p ep in
a Xi :2107.06252, 2021.
[58] S. Li, W. Dong, Y. Zhang, F. Tang, C. Ma, O. Deussen,
T.-Y. Lee, and C. Xu, “Dance- o-music gene a ion wi h
encode -based ex ual in e sion,” in SIGGRAPH Asia
2024 Con e ence Pape s, 2024, pp. 1–11.
[59] X. Liang, W. Li, L. Huang, and C. Gao, “Dancecom-
pose : Dance- o-music gene a ion using a p og essi e
condi ional music gene a o ,” IEEE T ansac ions on
Mul imedia, 2024.
[60] K. Su, X. Liu, and E. Shlize man, “Mul i-
ins umen alis ne : Unsupe ised gene a ion o
music om body mo emen s,” a Xi p ep in
a Xi :2012.03478, 2020.
[61] ——, “Audeo: Audio gene a ion o a silen pe o -
mance ideo,” Ad ances in Neu al In o ma ion P o-
cessing Sys ems, ol. 33, pp. 3325–3337, 2020.
[62] C. Gan, D. Huang, P. Chen, J. B. Tenenbaum, and
A. To alba, “Foley music: Lea ning o gene a e music
om ideos,” in Compu e Vision–ECCV 2020: 16 h
Eu opean Con e ence, Glasgow, UK, Augus 23–28,
2020, P oceedings, Pa XI 16. Sp inge , 2020, pp.
758–775.
[63] A. S. Koepke, O. Wiles, Y. Moses, and A. Zisse man,
“Sigh o sound: An end- o-end app oach o isual pi-
ano ansc ip ion,” in ICASSP 2020-2020 IEEE In e -
na ional Con e ence on Acous ics, Speech and Signal
P ocessing (ICASSP). IEEE, 2020, pp. 1838–1842.
[64] H. Fang, P. Xiong, L. Xu, and Y. Chen, “Clip2 ideo:
Mas e ing ideo- ex e ie al ia image clip,” a Xi
p ep in a Xi :2106.11097, 2021.
[65] M. A i i, M. A. B ubake , and M. S. B own, “His ogan:
Con olling colo s o gan-gene a ed and eal images
ia colo his og ams,” in P oceedings o he IEEE/CVF
con e ence on compu e ision and pa e n ecogni-
ion, 2021, pp. 7941–7950.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
231