A Survey on Vision-to-Music Generation: Methods, Datasets, Evaluation, and Challenges

Author: Zhaokai Wang; Chenxi Bao; Le Zhuo; Jingrui Han; Yang Yue; Yihong Tang; Victor Shea-Jay Huang; Yue Liao

Publisher: Zenodo

DOI: 10.5281/zenodo.17706381

Source: https://zenodo.org/records/17706381/files/000027.pdf

A SURVEY ON VISION-TO-MUSIC GENERATION:
METHODS, DATASETS, EVALUATION, AND CHALLENGES
Zhaokai Wang1, Chenxi Bao2, Le Zhuo3, Jingrui Han4
Yang Yue5, Yihong Tang6, Victor Shea-Jay Huang3, Yue Liao7
1Shanghai Jiao Tong University 2Music Tech Lab, DynamiX
3The Chinese University of Hong Kong 4Beijing Film Academy
5Tsinghua University 6McGill University 7National University of Singapore
[email protected] {cloudingcxb17,zhuole1025,liaoyue.ai}@gmail.com
ABSTRACT
Vision-to-music generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast applications like film scoring and short video creation. However, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. To the best of our knowledge, existing surveys focus on general music generation without comprehensive discussion on vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types: general videos, human movement videos, and images, as well as two output types of symbolic music and audio music. We then summarize the existing methodologies from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and future directions. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications.
© Authors. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Authors, “A Survey on Vision-to-Music Generation: Methods, Datasets, Evaluation, and Challenges”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
1. INTRODUCTION
Recent advances in multimodal artificial intelligence have witnessed substantial progress in generating and understanding content of modalities like text, images, video, and speech [1–7]. Music generation, as an important part of this multimodal ecosystem, has also seen remarkable development. Among the various music generation tasks (e.g. unconditional music generation [8,9] and text-to-music generation [10,11]), vision-to-music, including video-to-music and image-to-music generation, has garnered particular interest due to its practical applications in film scoring, short video platforms, and music accompaniment. For general users, automatically generated background music can alleviate copyright concerns and reduce the time spent searching for suitable music. For professional composers, AI-assisted music composition can streamline the iterative process of matching a score to visual content, expediting the communication cycle with directors and producers.
Despite this demand, the development of vision-to-music remains relatively preliminary. For academic research, the inherent challenges ranging from aligning rich visual cues with musical structure to handling the multifaceted nature of music generation contribute to the task’s high complexity, making it more difficult than the common text-to-music task [10–12]. Although a growing number of works have emerged in recent years [13–18], they are still far from meeting the diverse requirements of real-world scenarios. For industrial integration, while other AI-generated content (AIGC) fields, such as text-to-image [19–21] and text-to-video [22,23], have experienced rapid adoption in both professional and consumer contexts, vision-to-music systems have yet to see broad industrial deployment, with only pilot products like Tianpuyue AI 1. The unique demands of film scoring, which often require precise emotional and temporal synchronization with visual storytelling, heighten the difficulty of achieving robust and artistically consistent results through AI methods.
To the best of our knowledge, although existing works provide reviews on general music generation [24–28], surveys focusing on the vision-to-music generation task are lacking. Given the above gaps, we aim to provide a comprehensive survey on vision-to-music generation. We provide a timeline of representative works in Fig. 1, and an overview of vision-to-music generation in Fig. 2.
The subsequent sections of this paper are organized as follows: Sec. 2 introduces the fundamentals of vision-to-music generation, analyzing the technical characteristics and core challenges of three major scenarios: general videos, human movement videos, and images. Sec. 3 reviews current vision-to-music methods, comparing innovations and limitations in vision encoding, vision-music projection, and music generation module design. Sec. 4 discusses recent vision-to-music datasets. Sec. 5 introduces evaluation metrics, categorized by their purposes (music-only and vision-music correspondence) and approaches (objective and subjective). Sec. 6 discusses the current research status and existing challenges. Through this work, we aspire to inspire further innovation in vision-to-music generation and the broader multimodal learning community, driving progress in both academic research and industrial applications of vision-to-music generation.
1 https://www.tianpuyue.cn/video2music
Figure 1: Timeline of representative works in vision-to-music generation (time axis marks: 2020-2022, 2023.12, 2024.9, 2024.12, 2025.3).
2. FUNDAMENTALS
In the broad multimodal research community, music is often treated as a subset of audio [8,22,29–33]. However, unlike general audio which may include background noise, speech, or sound effects, music embodies intricate internal structures and richness of information, including harmony, counterpoint, and instrumentation. These complexities make it essential to consider music as an independent modality, which sets the stage for exploring vision-to-music generation.
When delving into this specific area, we first need to recognize the unique relationship between visual input and musical output. We analyze the characteristics of three input types in vision-to-music generation: general videos, human movement videos, and images, and two output types: symbolic music and audio music. This categorization helps us better understand the current state and challenges of the field.
2.1 Input Types
General Videos. This includes a wide range of video contents, such as natural landscapes, films, sports, animations, etc. Techniques in this category typically focus on extracting features like motion, color, or visual semantics to create music that aligns with the visual narrative.
Images. Approaches in this domain focus on transforming static images into music. Since images lack temporal semantics and rhythm, these methods usually only need to focus on the overall style, and there is no strict requirement for the duration of the generated music. The application scenarios of image-to-music are not as extensive as video-to-music, but they include functionalities like creating musical memories for photo albums. This type of paired data is easy to collect, but its inherent correlation may not be very strong.
Human Movement Videos. These videos typically include dance, sports, instrument performances, and other human movements. For instrument performance videos, where humans play musical instruments but the audio is removed, the music is determined by the input video to some extent, and the generation process is similar to reconstructing music from the silent videos. Dance, sports, and other human movement videos emphasize rhythmic alignment (especially local rhythm) more than general videos, while semantic constraints are generally weaker, requiring only overall style matching. Therefore, extracted 2D/3D keypoints representing human motion are often directly used as inputs instead of raw videos.
In the remaining sections, we will mainly focus on general videos and images, while paying relatively less attention to human movement videos. This is because methods for human movement videos focus on rhythmic relations, with 2D body keypoints directly used as video features, and their semantic association with music is relatively weak. Their application scenarios are also relatively limited [13].
2.2 Output Types
Symbolic Music. Symbolic music is represented as discrete elements like notes, chords, or sequences of musical symbols [25]. Most early vision-to-music methods are symbolic [13,15,34,35]. Symbolic music can incorporate music theory, such as chords, and generate longer pieces with good controllability. However, the limited data availability restricts its scalability to large models, and the expressive and emotional depth is constrained by sound fonts.
Audio Music. Such methods aim to generate music in its audio form [14,16–18,36–39], often employing generative models such as transformers [40], VAEs, GANs [41], or diffusion models [42] to synthesize realistic sound from the visual input. Audio music benefits from large-scale datasets for training, enabling end-to-end generation with rich expressiveness and performance. However, despite efforts of adding controls in audio generation [43], the controllability of audio music is still relatively weak, compared to symbolic music where multiple control signals can be used, e.g. pitch, duration, instrument, rhythm, chord, etc. Moreover, the generated music is typically shorter due to sampling rate limitations, e.g. usually under 20 seconds for methods in Tab. 1, whereas symbolic methods can easily achieve whole-song length.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 2: Overview of vision-to-music generation. Panels: Input Types (Motion Videos, General Videos, Images); Output Types (Symbolic Music, Audio Music); Architecture (Vision Encoding, Vision-Music Projection, Music Generation with Auto-Regressive or Diffusion models); Modeling Vision-Music Relationships (Semantic, Rhythm); Datasets; Evaluation (Music-only and Vision-music Correspondence, objective and subjective); Challenges (Standardized Dataset and Benchmark, Customization and Controllability, Choice of Symbolic or Audio Forms).
3. METHODS
In this section, we discuss the existing works on vision-to-music generation. We summarize vision-to-music generation methods in Tab. 1.
3.1 Tasks
We begin by categorizing the methods based on the input types outlined in Sec. 2.
General Video-to-Music. Early general video-to-music methods were usually symbolic [13,15,34,35]. With the development of audio-form music generation [11,12], a large number of audio-based general video-to-music works have emerged in the past few years [16–18,36–39,44–46].
Image-to-Music. Early image-to-music works were also primarily symbolic [47–50], where models analyze color, texture, and semantic content to generate music. Recent works [16,38,51,52] generated audio-form music from multiple modalities (video, image, and text).
Human Movement Video-to-Music. For dance or sports videos, existing methods focus on extracting rhythmic patterns from dance videos and mapping them to musical rhythm generation [31,53–59]. For music performance videos, current methods learn to reconstruct the original music from the silent videos [60–63].
3.2 Architecture
The architecture of vision-to-music systems can be broken down into three major components: vision encoding, vision-music projection, and music generation.
Vision Encoding. This stage is focused on extracting features from the input video or image. A commonly used vision encoder is CLIP [66], which is pretrained on massive image-text pairs to achieve open-domain visual understanding capabilities. Video understanding backbones [64,70,78,85,88,92,98] are also used to extract spatiotemporal features. Some also use additional encoders for color information [15], emotion information [46], or intermediate text features [37]. For human movement videos, it is important to extract motion features for rhythmic alignment, e.g., directly calculating the first-order difference of human keypoints as motion velocity [31,53], or using a pre-trained motion encoder [14,54].
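To make the first-order difference concrete, the following is a minimal sketch, not the implementation of any cited method. It assumes keypoints are given as per-frame lists of (x, y) joints; the function name `motion_velocity` and the choice to average speed over joints are illustrative assumptions.

```python
from math import hypot


def motion_velocity(keypoints):
    """Per-frame motion velocity from 2D keypoints (illustrative assumption).

    keypoints: list of frames, each frame a list of (x, y) joint positions.
    Returns one value per consecutive frame pair: the mean joint speed.
    """
    velocities = []
    for prev, curr in zip(keypoints, keypoints[1:]):
        # First-order difference: displacement of each joint between frames.
        speeds = [hypot(cx - px, cy - py)
                  for (px, py), (cx, cy) in zip(prev, curr)]
        velocities.append(sum(speeds) / len(speeds))
    return velocities


# Toy input: two joints over three frames; only the first joint moves once.
frames = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(3.0, 4.0), (1.0, 1.0)],
    [(3.0, 4.0), (1.0, 1.0)],
]
print(motion_velocity(frames))  # prints [2.5, 0.0]
```

Peaks in such a velocity curve are one plausible cue for aligning musical beats with movement; how the cited works post-process the signal is not specified here.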
Vision-Music Projection. This component involves mapping the visual features into the music space. Most methods directly use the visual features as the input of the music generation model, or inject them through simple cross-attention mechanisms [15,35,37,52]. Some methods design specialized adapters [16,17,36] for better feature alignment, e.g. to capture temporal-related or local features. Besides using feature-based mapping, some studies suggest using text as an intermediate representation of the visual features [18,38,44,87] and subsequently utilizing text-to-music models for music generation. Some symbolic music generation methods use symbolic elements as the vision-music mapping [13,99].
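The cross-attention mechanism mentioned above can be sketched as scaled dot-product attention in which music-side queries attend over visual features. This is a simplified illustration under stated assumptions: the learned query/key/value projections and multi-head structure used in practice are omitted, and the function names are hypothetical.

```python
from math import exp, sqrt


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(music_queries, visual_keys, visual_values):
    """Each music query becomes a weighted sum of visual values.

    Weights are softmax(q . k / sqrt(d)) over all visual positions.
    Learned projections are deliberately left out of this sketch.
    """
    d = len(visual_keys[0])
    dim = len(visual_values[0])
    outputs = []
    for q in music_queries:
        weights = softmax([dot(q, k) / sqrt(d) for k in visual_keys])
        out = [sum(w * v[j] for w, v in zip(weights, visual_values))
               for j in range(dim)]
        outputs.append(out)
    return outputs


# Two visual frames with identical keys receive equal attention weight,
# so the output is the average of their values.
result = cross_attention([[1.0, 0.0]],
                         [[0.0, 1.0], [0.0, 1.0]],
                         [[0.0, 0.0], [2.0, 4.0]])
print(result)  # prints [[1.0, 2.0]]
```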
Music Generation. Once the visual and music features have been aligned, the next step is to generate the musical output. This stage can be tackled using auto-regressive or diffusion-based generative models. Auto-regressive models [40] can be used for both symbolic [13,15,99] and audio music generation [17,18,37,39,44,45]. Diffusion models [42,86] can be used for symbolic music [35] to directly generate piano rolls, but are mostly used for audio music [12,38,46,101].
3.3 Vision-Music Relationships
Vision-music relationships establish the correspondence between videos and music. Unlike the vision-music projection discussed in the previous section (which focuses on the architecture), the relationships discussed here focus on the overall correspondence between input and output. These relationships can be broadly classified into two categories: semantic relationships and rhythmic relationships.
Semantic Relationships. This type of relationship focuses on how visual elements (such as color, objects, or scenes)
Table 1: Methods of vision-to-music generation. Sem: Semantics. Rhy: Rhythm. AR: Auto-regressive. Diff.: Diffusion.
Method | Demo | Date | Input Type | Modality | Music Length | Vision-Music Relationships | Vision Encoding | Vision-Music Projection | Music Generation
▼General Videos and Images:
CMT [13] | Link | 2021/11 | General Video | Symbolic | 3min | Rhy | - | Elements | AR (CP [9])
V-MusProd [15] | Link | 2022/11 | General Video | Symbolic | 6min | Sem, Rhy | CLIP2Video [64], Histogan [65] | Feature | AR (CP [9])
V2Meow [14] | Link | 2023/05 | General Video | Audio | 10sec | Sem, Rhy | CLIP [66], I3D Flow [67], ViT-VQGAN [68] | Feature | AR
MuMu-LLaMA [51] (M2UGen [16]) | Link | 2023/11 | General Video, Image | Audio | 30sec | Sem | ViT [69], ViViT [70] | Adapter | AR (LLaMA2 [71])
Video2Music [34] | Link | 2023/11 | General Video | Symbolic | 5min | Sem, Rhy | CLIP [66] | Feature | AR
EIMG [72] | Link | 2023/12 | Image | Symbolic | 15sec | Sem | ALAE [73], β-VAE [74], VQ-VAE [75] | Adapter | VAE (FNT [76], LSR [77])
Diff-BGM [35] | Link | 2024/05 | General Video | Symbolic | 5min | Sem | VideoCLIP [78] | Feature | Diff. (Polyffusion [79])
Mozart’s Touch [44] | Link | 2024/05 | General Video, Image | Audio | 10sec | Sem | BLIP [80] | Text | AR (MusicGen [11])
MeLFusion [52] | Link | 2024/06 | Image | Audio | 10sec | Sem | DDIM [81] + T2I LDM [82] | Feature | Diff.
VidMuse [17] | Link | 2024/06 | General Video | Audio | 20sec | Sem | CLIP [66] | Adapter | AR (MusicGen [11])
S2L2-V2M [83] | Link | 2024/08 | General Video | Audio | 10sec | Sem | Enhanced Video Mamba | Adapter | AR (LLaMA2 [71])
VMAS [45] | Link | 2024/09 | General Video | Audio | 10sec | Sem, Rhy | Hiera [84] | Feature | AR
MuVi [36] | Link | 2024/10 | General Video | Audio | 20sec | Sem, Rhy | VideoMAE V2 [85] | Adapter | Diff. (DiT [86])
SONIQUE [87] | Link | 2024/10 | General Video | Audio | 20sec | Sem, Rhy | Video-LLaMA [88], CLAP [89] | Text | Diff. (Stable Audio [90])
VEH [91] | - | 2024/10 | General Video | Symbolic | 30sec | Sem | VideoChat [92] | Text | AR (T5 [93])
M2M-Gen [94] | Link | 2024/10 | Image (Manga) | Audio | 1min | Sem | CLIP [66], GPT-4 [1] | Text | AR (MusicLM [95])
HPM [46] | Link | 2024/11 | General Video | Audio | 10sec | Sem | CLIP [66], TAVAR [96], WECL [97] | Feature | Diff. (AudioLDM [29])
VidMusician [37] | Link | 2024/12 | General Video | Audio | 30sec | Sem, Rhy | CLIP [66], T5 [93] | Adapter | AR (MusicGen [11])
MTM [38] | Link | 2024/12 | General Video, Image | Audio | 30sec | Sem | InternVL2 [98] | Text | Diff. (Stable Audio Open [12])
XMusic [99] | Link | 2025/01 | General Video, Image | Symbolic | 20sec | Sem, Rhy | ResNet [100], CLIP [66] | Elements | AR (CP [9])
GVMGen [39] | Link | 2025/01 | General Video | Audio | 15sec | Sem | CLIP [66] | Adapter | AR (MusicGen [11])
AudioX [101] | Link | 2025/03 | General Video | Audio | 10sec | Sem | CLIP [66] | Feature | Diff. (Stable Audio Open [12])
FilmComposer [18] | Link | 2025/03 | General Video | Audio | 15sec | Sem, Rhy | Controllable Rhythm Transformer, GPT-4 [1], Motion Detector | Text | AR (MusicGen [11])
▼Human Movement Videos:
Audeo [61] | Link | 2020/06 | Performance Video | Symbolic | 30sec | Rhy | ResNet [100] | Feature | GAN
Foley Music [62] | Link | 2020/07 | Performance Video | Symbolic | 10sec | Rhy | 2D Body Keypoints | Feature | AR
Multi-Instrumentalist Net [60] | - | 2020/12 | Performance Video | Audio | 10sec | Rhy | 2D Body Keypoints | Feature | VAE
RhythmicNet [53] | Link | 2021/06 | Dance Video | Symbolic | 10sec | Rhy | 2D Body Keypoints | Feature | AR (REMI [102])
Dance2Music [57] | Link | 2021/07 | Dance Video | Symbolic | 12sec | Rhy | 2D Body Keypoints | Feature | AR
D2M-GAN [54] | Link | 2022/04 | Dance Video | Audio | 2sec | Rhy | 2D Body Keypoints, I3D [103] | Feature | GAN
CDCD [55] | Link | 2022/06 | Dance Video | Audio | 2sec | Rhy | 2D Body Keypoints, I3D [103] | Feature | Diff.
LORIS [31] | Link | 2023/05 | Movement Video | Audio | 50sec | Rhy | 2D Body Keypoints, I3D [103] | Feature | Diff.
VisBeatNet [56] | - | 2024/01 | Dance Video | Symbolic | Realtime | Rhy | 2D Body Keypoints | Feature | AR
UniMuMo [104] | Link | 2024/10 | Dance Video | Audio | 10sec | Rhy | 2D Body Keypoints | Feature | Diff.
relate to musical components (such as mood, melody, or chords). For music performance videos, the music is determined by the video instead of a general and implicit semantic relationship [61,62]. For dance and movement videos, the semantics in the video are not utilized. Symbolic methods [15,34,99] explicitly define semantic, color, and emotion relationships extracted from pretrained models to utilize the controllability of symbolic music. Recent audio-based methods generally use a single vision encoder to extract semantic features. These semantic features are usually global and insensitive to semantic changes within the video. Some methods [17,36,39] also design special modules to enhance local semantic correspondence. However, for most audio methods generating 10-second music, the concept of “local” may not have a significant impact.
Rhythmic Relationships. Rhythmic relationships mainly refer to the correspondence between the rhythm of the video (e.g. local movements, scene transitions, global video rhythm) and the rhythm of the music (e.g. local beats, global tempo). For human movement videos, such as dance or instrument playing, rhythmic relationships become significant, especially the correspondence between local rhythm and human movements. For general videos, early works [13–15,45] use optical flow or RGB Difference to represent the video rhythm. Recent works mostly do not consider rhythm information or use frame-by-frame semantic features to implicitly provide local rhythm correspondence [17,37], which is not prominent in the generated music. In methods that use text for vision-music projection [38,87], the video content is used to generate requirements for musical rhythm, such as the rhythm of each scene or the overall tempo.
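As a rough illustration of the RGB-difference idea, the sketch below measures the mean absolute pixel change between consecutive frames and flags frames where the change exceeds a threshold. The flat-list frame representation, the function names, and the thresholding step are assumptions made for illustration; the cited works may compute and use the signal differently.

```python
def frame_difference(frames):
    """Mean absolute pixel difference between consecutive frames.

    frames: list of frames, each a flat list of pixel intensities
    (an assumed, simplified stand-in for decoded RGB frames).
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        total = sum(abs(c - p) for p, c in zip(prev, curr))
        diffs.append(total / len(curr))
    return diffs


def visual_onsets(diffs, threshold):
    """Indices of frames whose change from the previous frame is large.

    The threshold is a hypothetical tuning parameter.
    """
    return [i + 1 for i, d in enumerate(diffs) if d > threshold]


# Toy clip: a hard cut between frame 1 and frame 2, then a tiny change.
clip = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [200, 200, 200, 200],
    [200, 200, 200, 210],
]
diffs = frame_difference(clip)
print(diffs)                       # prints [0.0, 200.0, 2.5]
print(visual_onsets(diffs, 50.0))  # prints [2]
```

Large values tend to coincide with scene transitions or fast motion, which is why such a curve can serve as a coarse proxy for video rhythm.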
4. DATASETS
In this section, we introduce common datasets for vision-to-music. Plenty of datasets have been proposed in the literature of the vision-to-music field, and different methods often use different datasets for training and testing. Therefore, it is necessary to organize and analyze these datasets. Common datasets are listed in Tab. 2.
4.1 Input Categories
Based on the types of videos/images in vision-music datasets, we categorize the datasets as follows:
General Videos. Videos in these datasets are usually sourced from platforms like YouTube. Most datasets focus on Music Videos [15,34,35,45,105,107], as they have satisfactory video-music alignment and are easier to collect. Other datasets include a variety of video types, such as trailers, advertisements, animations, and documentaries [17,37], or subsets from large datasets like AudioSet [108]. These videos offer better diversity, but the video-music alignment may be weaker, requiring strict filtering [16,95]. FilmScoreDB and MusicPro-7k [18,46] focus on film scores, where the music has a deeper semantic correspondence with the video and serves as an accompaniment rather than being the primary focus, as in music videos. Recently, some datasets also provide textual descriptions for videos and music [38,111,124] to assist text-bridged video-to-music generation methods.
Human Movement Videos. These videos can be divided into instrument performances and dance/sport categories. Instrument performance datasets [112,113] aim to reconstruct music from instrumental performance videos. Dance/sport datasets [31,54,114] focus on generating music from dance or sports videos, emphasizing local rhythmic alignment while downplaying semantic relationships.
Table 2: Datasets of vision-to-music generation.
Dataset | Access | Date | Source | Modality | Size | Total Length (hour) | Avg. Length (second) | Annotations
▼General Videos:
HIMV-200K [105] | Link | 2017/04 | Music Video (Youtube-8M [106]) | Audio | 200K | - | - | -
MVED [107] | Link | 2020/09 | Music Video | Audio | 1.9K | 16.5 | 30 | Emotion
SymMV [15] | Link | 2022/11 | Music Video | MIDI, Audio | 1.1K | 76.5 | 241 | Lyrics, Genre, Chord, Melody, Tonality, Beat
MV100K [14] | - | 2023/05 | Music Video (Youtube-8M [106]) | Audio | 110K | 5000 | 163 | Genre
MusicCaps [95] | Link | 2023/01 | Diverse Videos (AudioSet [108]) | Audio | 5.5K | 15.3 | 10 | Genre, Caption, Emotion, Tempo, Instrument, Rhythm, ...
EmoMV [109] | Link | 2023/03 | Music Video (MVED [107], AudioSet [108]) | Audio | 6K | 44.3 | 27 | Emotion
MUVideo [16] | Link | 2023/11 | Diverse Videos (Balanced-AudioSet [108]) | Audio | 14.5K | 40.3 | 10 | Instructions
MuVi-Sync [34] | Link | 2023/11 | Music Video | MIDI, Audio | 784 | - | - | Scene Offset, Emotion, Motion, Semantic, Chord, Key, Loudness, Density, ...
BGM909 [35] | Link | 2024/05 | Music Video | MIDI | 909 | - | - | Caption, Style, Chord, Melody, Beat, Shot
V2M [17] | - | 2024/06 | Diverse Videos | Audio | 360K | 18000 | 180 | Genre
DISCO-MV [45] | - | 2024/09 | Music Video (DISCO-10M [110]) | Audio | 2200K | 47000 | 77 | Genre
FilmScoreDB [46] | - | 2024/11 | Film Video | Audio | 32K | 90.3 | 10 | Movie Title
DVMSet [37] | - | 2024/12 | Diverse Videos | Audio | 3.8K | - | - | -
HarmonySet [111] | Link | 2025/03 | Diverse Videos | Audio | 48K | 458.8 | 32 | Description
MusicPro-7k [18] | Link | 2025/03 | Film Video | Audio | 7K | - | - | Description, Melody, Rhythm Spots
▼Human Movement Videos:
URMP [112] | Link | 2016/12 | Performance Video | MIDI, Audio | 44 | 1.3 | 106 | Instruments
MUSIC [113] | Link | 2018/04 | Performance Video | Audio | 685 | 45.7 | 239 | Instruments
AIST++ [114] | Link | 2021/01 | Dance Video (AIST [115]) | Audio | 1.4K | 5.2 | 13 | 3D Motion
TikTok Dance-Music [54] | Link | 2022/04 | Dance Video | Audio | 445 | 1.5 | 12 | -
LORIS [31] | Link | 2023/05 | Dance Video, Sports Video (AIST [115], FisV [116], FS1000 [117]) | Audio | 16K | 86.43 | 19 | 2D Pose
▼Images:
Music-Image [118] | Link | 2016/07 | Image (Music Video) | Audio | 22.6K | 377 | 60 | Lyrics
Shuttersong [119] | Link | 2017/08 | Image (Shuttersong App) | Audio | 586 | - | - | Lyrics
IMAC [120] | Link | 2019/04 | Image (FI [121]) | Audio | 3.8K | 63.3 | 60 | Emotion
MUImage [16] | Link | 2023/11 | Image (Balanced-AudioSet [108]) | Audio | 14.5K | 40.3 | 10 | Instructions
EIMG [72] | Link | 2023/12 | Image (IAPS [122], NAPS [123]) | MIDI | 3K | 12.5 | 15 | VA Value
MeLBench [52] | Link | 2024/06 | Image (Diverse Videos) | Audio | 11.2K | 31.2 | 10 | Genre, Caption
Images. Existing image-to-music datasets are relatively scarce. Sources of the images are usually frames from music videos [52,118] or existing image datasets [16,72,120].
4.2 Music Domains
Vision-music datasets can be divided into MIDI and audio based on the music modality. MIDI datasets [15,34,35] are created by transcribing audio into the MIDI format or sourced from existing music-only datasets [125]. Audio datasets contain only raw audio files.
Compared to audio datasets, MIDI datasets have the following advantages: (1) More annotations like Chord, Melody, Beat, Tonality, etc.; (2) Longer average duration enables generating longer music pieces; (3) Suitable for training both symbolic and audio music generation models. However, a significant limitation of MIDI datasets is their smaller scale (e.g. 1K songs, 100 hours vs. 100K-2M songs, 5K-50K hours) and relatively limited diversity.
5. EVALUATION
Common metrics for vision-to-music are categorized in Tab. 3 and 4. The evaluation of vision-to-music generation can be divided into two categories: objective and subjective. Objective evaluation uses fixed rule-based algorithms or existing models to extract features and calculate musical metrics. It is relatively objective and convenient for fair comparison, but has certain biases and cannot cover all aspects of music generation, often differing significantly from human subjective perception. Similar to other generation tasks [126,127], subjective evaluation is typically used in vision-to-music generation for a more comprehensive assessment, i.e. conducting user studies where participants rate/compare music generated by different models.
From another perspective, metrics can be divided into music-only and vision-music correspondence based on assessment purposes. The former only evaluates whether the music itself is pleasant/realistic/structurally complete, etc., while the latter focuses on the correspondence between the music and the visual input.
For music-only objective metrics, symbolic music generation methods [13,15,35,72,83,99] use some statistics-based methods to calculate certain pitch or rhythm-related statistical metrics of MIDI, such as Scale Consistency, Pitch Entropy, etc. These metrics are usually compared with ground truth music, and the closer they are, the more realistic the music is considered. Audio music

Table 3: Objective metrics of vision-to-music generation. M: MIDI. A: Audio. V: Video. I: Image. T: Text. Pit: Pitch. Rhy: Rhythm. Fid: Fidelity. Sem: Semantic.
Metric | Used in Papers | Input | Type
▼Music-only:
Scale Consistency | [15,83] | M | Pit
Pitch Entropy | [15,72,83] | M | Pit
Pitch Class Histogram Entropy | [13,15,35,83,99] | M | Pit
Empty Beat Rate | [15,83,99] | M | Rhy
Average Inter-Onset Interval | [15,83] | M | Rhy
Grooving Pattern Similarity | [13,35,99] | M | Rhy
Structure Indicator | [13,35] | M | Rhy
Frechet Audio Distance (FAD) | [14,16–18,36,39,44–46,52,83,91,101] | A | Fid
Frechet Distance (FD) | [14,17,36–38,52,87,101] | A | Fid
Kullback-Leibler Divergence (KL) | [14,16–18,36–39,44–46,52,83,87,91,101] | A | Fid
Beats Coverage Score (BCS) | [36,46] | A | Rhy
Beats Hit Score (BHS) | [36,46] | A | Rhy
Inception Score (IS) | [36,46,101] | A | Fid
▼Vision-music Correspondence:
ImageBind Score/Rank | [16–18,37,38,44,83,101] | A,V/I | Sem
CLAP Score | [37,87,91] | A,A/T | Sem
Video-Music CLIP Precision | [15,83] | A,V | Sem
Video-Music Correspondence | [35] | A,V | Sem
Cross-modal Relevance | [39] | A,V | Sem
Temporal Alignment | [39] | A,V | Rhy
Rhythm Alignment | [37] | A,V | Rhy
generation methods widely adopt metrics such as Frechet Audio Distance (FAD), Frechet Distance (FD) 2, and Kullback-Leibler Divergence (KL) to evaluate the similarity between generated music and ground truth music. Some methods [36,46] also introduce metrics like BCS and BHS to measure rhythmic similarity based on music beats.
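As an example of the statistics-based symbolic metrics, the sketch below computes a pitch class histogram entropy over a list of MIDI pitch numbers. It is a minimal illustration, assuming the entropy is taken over the whole clip with a base-2 logarithm; the cited works may compute it per bar or over a sliding window, and the function name is hypothetical.

```python
from collections import Counter
from math import log2


def pitch_class_histogram_entropy(midi_pitches):
    """Shannon entropy (bits) of the 12-bin pitch class histogram.

    midi_pitches: list of MIDI note numbers (0-127).
    Lower values suggest a clearer tonal center; higher values
    suggest notes spread across many pitch classes.
    """
    if not midi_pitches:
        raise ValueError("at least one pitch is required")
    counts = Counter(p % 12 for p in midi_pitches)  # fold octaves together
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# C4 and C5 share one pitch class, so the entropy is zero.
print(pitch_class_histogram_entropy([60, 72]) == 0.0)  # prints True
# Two equally frequent pitch classes give exactly one bit.
print(pitch_class_histogram_entropy([60, 62]))  # prints 1.0
```

In the comparison described above, the same statistic would be computed for both generated and ground truth music, and the gap between the two values reported.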
Objective metrics for vision-music correspondence usually focus on the audio modality. The most commonly used are ImageBind Score/Rank and CLAP Score, which leverage pretrained multimodal models like ImageBind [131] and CLAP [89] for similarity evaluation. Some methods [15,35,39,83] have also designed specific vision-music retrieval evaluation metrics, with slight differences in model selection and retrieval methods. Additionally, GVMGen [39] and VidMusician [37] have designed objective metrics to evaluate the rhythmic correspondence between vision and music. However, since the pretrained models are usually trained with general audio data instead of music-specific data, these objective metrics often do not align well with human judgments.
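Embedding-based correspondence scores of this kind are typically a similarity between a visual embedding and a music embedding in a shared space. The sketch below assumes cosine similarity averaged over pairs and takes the embeddings as given; it does not call ImageBind or CLAP, and the function names and toy vectors are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm


def correspondence_score(pairs):
    """Mean cosine similarity over (visual_embedding, music_embedding) pairs.

    The embeddings are assumed to come from a pretrained multimodal
    encoder with a shared space; producing them is outside this sketch.
    """
    return sum(cosine_similarity(v, m) for v, m in pairs) / len(pairs)


# Toy embeddings: one aligned pair and one orthogonal pair.
pairs = [
    ([3.0, 4.0], [3.0, 4.0]),
    ([1.0, 0.0], [0.0, 1.0]),
]
print(correspondence_score(pairs))
```

A rank-style variant would instead compare each video against many candidate music clips and report where the true pairing lands in the sorted similarity list.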
Subjective metrics mainly include MOS (generally using a 5-point Likert scale), pair preference (i.e. win rate), and ranking different music. Common subjective metrics in vision-to-music generation are given in Tab. 4. The selection of specific subjective metrics depends on the vision-music relationship emphasized by the method.
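The two most common aggregations can be sketched in a few lines. This is an illustrative sketch only: the function names, the 1 to 5 rating range check, and the string labels for models are assumptions, and real user studies usually also report confidence intervals.

```python
def mean_opinion_score(ratings):
    """Average of Likert ratings, each expected to lie in 1..5."""
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range: {r}")
    return sum(ratings) / len(ratings)


def win_rate(preferences, model):
    """Fraction of pairwise comparisons in which `model` was preferred.

    preferences: list of winner labels, one per comparison.
    """
    wins = sum(1 for winner in preferences if winner == model)
    return wins / len(preferences)


# Hypothetical study: four ratings for one clip, four A-vs-B comparisons.
print(mean_opinion_score([5, 4, 3, 4]))      # prints 4.0
print(win_rate(["A", "B", "A", "A"], "A"))   # prints 0.75
```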
6. CHALLENGES
Despite advances in vision-to-music generation, we identify several key challenges for the academic community:
Lack of Standardized Objective Datasets and Benchmarks: The training and evaluation datasets differ across
2 The difference between FAD and FD is the feature extractor: FAD [128] uses VGGish [129], while FD uses PANNs [130].
Table 4: Subjective metrics of vision-to-music generation.
Metric | Used in Papers
▼Music-only:
Music Melody | [15,35]
Music Rhythm | [15,35]
Music Richness | [39,99]
Audio Quality | [17,36]
Overall Music Quality | [13,14,17,18,34,38,39,44,45,52,91,94]
▼Vision-music Correspondence:
Semantic Consistency | [15,18,35–38]
Rhythm Consistency | [15,18,34,35,38,91,99]
Emotion Consistency | [38,91,99]
Overall Correspondence | [13–18,34,39,44,45,52,83,87,94]
models, sometimes leading to comparisons between models fine-tuned on proprietary datasets and those evaluated via zero-shot inference on other datasets. This disparity significantly undermines the fairness of model comparisons and makes it challenging to identify the state-of-the-art. Besides, current evaluation metrics often do not align with actual human perception, e.g. FAD and KL are based on general audio data rather than on music-specific data, and symbolic metrics are statistically based and exhibit low correlation with human preferences. Though most papers provide demos for qualitative comparisons, they are prone to issues such as cherry-picked examples, insufficient sample size, and subjectivity in evaluating the outputs.
Limited Customization and Controllability: Most existing models function as black boxes, making it challenging to personalize or control attributes of the generated music, such as style, instrumentation, and rhythm. This significantly affects the models’ practical applicability.
Trade-off Between Symbolic and Audio Forms: As discussed in Sec. 2, audio-based methods benefit from large-scale data but generally offer limited controllability and are constrained by the computational cost of high-fidelity generation, often resulting in shorter musical pieces. In contrast, symbolic approaches, while limited by available data, offer better controllability and can produce longer compositions. A promising direction is to combine symbolic and audio methods to achieve a better trade-off.
Beyond academic challenges, exploring how to align these technologies with music industry applications (such as film and game scoring) and end-user products (like short-video platform background music) offers significant commercial opportunities.
7. CONCLUSION
In this paper, we review the recent advancements in vision-to-music generation, covering both symbolic and audio-based approaches. We identified key technical challenges in visual feature extraction, cross-modal projection, and music generation, and discussed the limitations in current datasets and evaluation metrics. We believe that addressing these challenges will pave the way for future research and applications on vision-to-music generation.
8. ETHICAL STATEMENT
This work is conducted with a clear awareness of the ethical challenges associated with vision-to-music generation technologies. As models learn to translate visual inputs (such as images, videos, or artworks) into music, they operate at the intersection of multiple cultural, legal, and emotional domains. One major concern is copyright infringement, particularly when models trained on copyrighted music produce outputs that closely resemble existing compositions without clear attribution or licensing. Given the creative and expressive nature of music, even stylistic mimicry may raise legal and ethical questions.
In addition, these models risk amplifying cultural or stylistic biases present in the training data. For example, models trained primarily on western classical or pop music may marginalize non-western musical traditions or underrepresent diverse emotional and cultural expressions. The alignment between visual input and musical output can further reinforce problematic stereotypes or fail to capture culturally appropriate interpretations, especially when visual content carries religious or political significance.
A third dimension involves emotional appropriateness.
Music is a powerful emotional medium. Generating music
from sensitive or traumatic visual content, such as scenes
of violence, grief, or historical trauma, may result in
emotionally discordant or insensitive outcomes.
To mitigate these risks, we advocate for transparent data
collection practices, inclusion of diverse musical and
visual cultures, and evaluation frameworks that assess not
only perceptual quality but also cultural and emotional
alignment. We further encourage interdisciplinary
collaboration between technologists, musicians, ethicists, and
legal scholars to ensure the responsible development and
deployment of these systems.
9. REFERENCES
[1] J. Achiam, S. Adle , S. Aga wal, L. Ahmad, I. Akkaya,
F. L. Aleman, D. Almeida, J. Al enschmid , S. Al -
man, S. Anadka e al., “Gp -4 echnical epo ,” a Xi
p ep in a Xi :2303.08774, 2023.
[2] H. Zhang, X. Li, and L. Bing, “Video-llama:
An ins uc ion- uned audio- isual language
model o ideo unde s anding,” a Xi p ep in
a Xi :2306.02858, 2023.
[3] G. Luo, X. Yang, W. Dou, Z. Wang, J. Liu, J. Dai,
Y. Qiao, and X. Zhu, “Mono-in e n l: Pushing he
bounda ies o monoli hic mul imodal la ge language
models wi h endogenous isual p e- aining,” a Xi
p ep in a Xi :2410.08202, 2024.
[4] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang,
F. Zhang, Y. Wang, Z. Li, Q. Yu e al., “Emu3:
Nex - oken p edic ion is all you need,” a Xi p ep in
a Xi :2409.18869, 2024.
[5] Z. Wang, X. Zhu, X. Yang, G. Luo, H. Li, C. Tian,
W. Dou, J. Ge, L. Lu, Y. Qiao, and J. Dai, “Pa ame e -
in e ed image py amid ne wo ks o isual pe cep-
ion and mul imodal unde s anding,” a Xi p ep in
a Xi :2501.07783, 2025.
[6] Y. Tang, A. Qu, Z. Wang, D. Zhuang, Z. Wu, W. Ma,
S. Wang, Y. Zheng, Z. Zhao, and J. Zhao, “Spa kle:
Mas e ing basic spa ial capabili ies in ision language
models elici s gene aliza ion o composi e spa ial ea-
soning,” a Xi p ep in a Xi :2410.16162, 2024.
[7] B. Kang, Y. Yue, R. Lu, Z. Lin, Y. Zhao, K. Wang,
G. Huang, and J. Feng, “How a is ideo gene a ion
om wo ld model: A physical law pe spec i e,” a Xi
p ep in a Xi :2411.02385, 2024.
[8] Z. Bo sos, R. Ma inie , D. Vincen , E. Kha i ono ,
O. Pie quin, M. Sha i i, D. Roblek, O. Teboul,
D. G angie , M. Tagliasacchi e al., “Audiolm: a
language modeling app oach o audio gene a ion,”
IEEE/ACM ansac ions on audio, speech, and lan-
guage p ocessing, ol. 31, pp. 2523–2533, 2023.
[9] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang,
“Compound wo d ans o me : Lea ning o compose
ull-song music o e dynamic di ec ed hype g aphs,”
in P oceedings o he AAAI Con e ence on A i icial In-
elligence, ol. 35, no. 1, 2021, pp. 178–186.
[10] A. Agos inelli, T. I. Denk, Z. Bo sos, J. Engel,
M. Ve ze i, A. Caillon, Q. Huang, A. Jansen,
A. Robe s, M. Tagliasacchi e al., “Musiclm:
Gene a ing music om ex ,” a Xi p ep in
a Xi :2301.11325, 2023.
[11] J. Cope , F. K euk, I. Ga , T. Remez, D. Kan , G. Syn-
nae e, Y. Adi, and A. Dé ossez, “Simple and con ol-
lable music gene a ion,” Ad ances in Neu al In o ma-
ion P ocessing Sys ems, ol. 36, pp. 47 704–47 720,
2023.
[12] Z. E ans, J. D. Pa ke , C. Ca , Z. Zukowski, J. Tay-
lo , and J. Pons, “S able audio open,” a Xi p ep in
a Xi :2407.14358, 2024.
[13] S. Di, Z. Jiang, S. Liu, Z. Wang, L. Zhu, Z. He, H. Liu,
and S. Yan, “Video backg ound music gene a ion wi h
con ollable music ans o me ,” in P oceedings o he
29 h ACM In e na ional Con e ence on Mul imedia,
2021, pp. 2037–2045.
[14] K. Su, J. Y. Li, Q. Huang, D. Kuzmin, J. Lee, C. Don-
ahue, F. Sha, A. Jansen, Y. Wang, M. Ve ze i e al.,
“V2meow: meowing o he isual bea ia ideo- o-
music gene a ion,” in P oceedings o he AAAI Con e -
ence on A i icial In elligence, ol. 38, no. 5, 2024, pp.
4952–4960.
[15] L. Zhuo, Z. Wang, B. Wang, Y. Liao, C. Bao, S. Peng,
S. Han, A. Zhang, F. Fang, and S. Liu, “Video back-
g ound music gene a ion: Da ase , me hod and e alua-
ion,” in P oceedings o he IEEE/CVF In e na ional
Con e ence on Compu e Vision, 2023, pp. 15 637–
15 647.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
229
[16] S. Liu, A. S. Hussain, Q. Wu, C. Sun, and Y. Shan,
“Mumu-llama: Mul i-modal music unde s anding and
gene a ion ia la ge language models,” a Xi p ep in
a Xi :2412.06660, 2024.
[17] Z. Tian, Z. Liu, R. Yuan, J. Pan, Q. Liu, X. Tan,
Q. Chen, W. Xue, and Y. Guo, “Vidmuse: A simple
ideo- o-music gene a ion amewo k wi h long-sho -
e m modeling,” a Xi p ep in a Xi :2406.04321,
2024.
[18] Z. Xie, Q. He, Y. Zhu, Q. He, and M. Li, “Film-
compose : Llm-d i en music p oduc ion o silen ilm
clips,” a Xi p ep in a Xi :2503.08147, 2025.
[19] P. Esse , S. Kulal, A. Bla mann, R. En eza i, J. Mülle ,
H. Saini, Y. Le i, D. Lo enz, A. Saue , F. Boesel e al.,
“Scaling ec i ied low ans o me s o high- esolu ion
image syn hesis,” in Fo y- i s in e na ional con e -
ence on machine lea ning, 2024.
[20] L. Zhuo, R. Du, H. Xiao, Y. Li, D. Liu, R. Huang,
W. Liu, X. Zhu, F.-Y. Wang, Z. Ma e al., “Lumina-
nex : Making lumina- 2x s onge and as e wi h nex -
di ,” in The Thi y-eigh h Annual Con e ence on Neu al
In o ma ion P ocessing Sys ems.
[21] J. Chen, Y. Jincheng, G. Chongjian, L. Yao, E. Xie,
Z. Wang, J. Kwok, P. Luo, H. Lu, and Z. Li, “Pixa -α:
Fas aining o di usion ans o me o pho o ealis ic
ex - o-image syn hesis,” in The Twel h In e na ional
Con e ence on Lea ning Rep esen a ions.
[22] A. Polyak, A. Zoha , A. B own, A. Tjand a, A. Sinha,
A. Lee, A. Vyas, B. Shi, C.-Y. Ma, C.-Y. Chuang,
D. Yan e al., “Mo ie gen: A cas o media ounda ion
models,” a Xi p ep in a Xi :2410.13720, 2024.
[23] W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou,
J. Xiong, X. Li, B. Wu, J. Zhang e al., “Hunyuan-
ideo: A sys ema ic amewo k o la ge ideo gene a-
i e models,” a Xi p ep in a Xi :2412.03603, 2024.
[24] Y. Ma, A. Øland, A. Ragni, B. M. Del Se e, C. Sai is,
C. Donahue, C. Lin, C. Plachou as, E. Bene os, E. Sha-
i e al., “Founda ion models o music: A su ey,”
a Xi p ep in a Xi :2408.14340, 2024.
[25] S. Ji, X. Yang, and J. Luo, “A su ey on deep lea ning
o symbolic music gene a ion: Rep esen a ions, algo-
i hms, e alua ions, and challenges,” ACM Compu ing
Su eys, ol. 56, no. 1, pp. 1–39, 2023.
[26] S. Ji, J. Luo, and X. Yang, “A comp ehensi e su ey
on deep music gene a ion: Mul i-le el ep esen a ions,
algo i hms, e alua ions, and u u e di ec ions,” a Xi
p ep in a Xi :2011.06801, 2020.
[27] C. He nandez-Oli an and J. R. Bel an, “Music com-
posi ion wi h deep lea ning: A e iew,” Ad ances in
speech and music echnology: compu a ional aspec s
and applica ions, pp. 25–50, 2022.
[28] Y. Zhu, J. Baca, B. Rekabda , and R. Rawassizadeh, “A
su ey o ai music gene a ion ools and models,” a Xi
p ep in a Xi :2308.12982, 2023.
[29] H. Liu, Y. Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian,
Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley,
“Audioldm 2: Lea ning holis ic audio gene a ion wi h
sel -supe ised p e aining,” IEEE/ACM T ansac ions
on Audio, Speech, and Language P ocessing, 2024.
[30] X. Liu, K. Su, and E. Shlize man, “Tell wha you hea
om wha you see– ideo o audio gene a ion h ough
ex ,” a Xi p ep in a Xi :2411.05679, 2024.
[31] J. Yu, Y. Wang, X. Chen, X. Sun, and Y. Qiao, “Long-
e m hy hmic ideo sound acke ,” in In e na ional
Con e ence on Machine Lea ning. PMLR, 2023, pp.
40 339–40 353.
[32] Z. Tang, Z. Yang, C. Zhu, M. Zeng, and M. Bansal,
“Any- o-any gene a ion ia composable di usion,”
Neu IPS, ol. 36, 2024.
[33] S. Wu, H. Fei, L. Qu, W. Ji, and T.-S. Chua, “Nex -
gp : Any- o-any mul imodal llm,” a Xi : 2309.05519,
2023.
[34] J. Kang, S. Po ia, and D. He emans, “Video2music:
Sui able music gene a ion om ideos using an a ec-
i e mul imodal ans o me model,” Expe Sys ems
wi h Applica ions, ol. 249, p. 123640, 2024.
[35] S. Li, Y. Qin, M. Zheng, X. Jin, and Y. Liu, “Di -bgm:
A di usion model o ideo backg ound music gene -
a ion,” in P oceedings o he IEEE/CVF Con e ence on
Compu e Vision and Pa e n Recogni ion, 2024, pp.
27 348–27 357.
[36] R. Li, S. Zheng, X. Cheng, Z. Zhang, S. Ji, and
Z. Zhao, “Mu i: Video- o-music gene a ion wi h
seman ic alignmen and hy hmic synch oniza ion,”
a Xi p ep in a Xi :2410.12957, 2024.
[37] S. Li, B. Yang, C. Yin, C. Sun, Y. Zhang, W. Dong,
and C. Li, “Vidmusician: Video- o-music gene a ion
wi h seman ic- hy hmic alignmen ia hie a chical i-
sual ea u es,” a Xi p ep in a Xi :2412.06296, 2024.
[38] B. Wang, L. Zhuo, Z. Wang, C. Bao, W. Chengjing,
X. Nie, J. Dai, J. Han, Y. Liao, and S. Liu, “Mul imodal
music gene a ion wi h explici b idges and e ie al
augmen a ion,” a Xi p ep in a Xi :2412.09428,
2024.
[39] H. Zuo, W. You, J. Wu, S. Ren, P. Chen, M. Zhou,
Y. Lu, and L. Sun, “G mgen: A gene al ideo- o-
music gene a ion model wi h hie a chical a en ions,”
a Xi p ep in a Xi :2501.09972, 2025.
[40] A. Vaswani, N. Shazee , N. Pa ma , J. Uszko ei ,
L. Jones, A. N. Gomez, Ł. Kaise , and I. Polosukhin,
“A en ion is all you need,” Ad ances in neu al in o -
ma ion p ocessing sys ems, ol. 30, 2017.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
230
[41] I. J. Good ellow, J. Pouge -Abadie, M. Mi za, B. Xu,
D. Wa de-Fa ley, S. Ozai , A. Cou ille, and Y. Ben-
gio, “Gene a i e ad e sa ial ne s,” Ad ances in neu al
in o ma ion p ocessing sys ems, ol. 27, 2014.
[42] J. Ho, A. Jain, and P. Abbeel, “Denoising di usion
p obabilis ic models,” Ad ances in neu al in o ma ion
p ocessing sys ems, ol. 33, pp. 6840–6851, 2020.
[43] S.-L. Wu, C. Donahue, S. Wa anabe, and N. J. B yan,
“Music con olne : Mul iple ime- a ying con ols o
music gene a ion,” IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing, ol. 32, pp. 2692–
2703, 2024.
[44] J. Li, T. Xu, X. Chen, X. Yao, and S. Liu, “Moza ’s
ouch: A ligh weigh mul i-modal music gene a ion
amewo k based on p e- ained la ge models,” a Xi
p ep in a Xi :2405.02801, 2024.
[45] Y.-B. Lin, Y. Tian, L. Yang, G. Be asius, and
H. Wang, “Vmas: Video- o-music gene a ion ia se-
man ic alignmen in web music ideos,” a Xi p ep in
a Xi :2409.07450, 2024.
[46] F. Qi, L. Ni, and C. Xu, “Ha monizing pixels
and melodies: Maes o-guided ilm sco e gene a-
ion and composi ion s yle ans e ,” a Xi p ep in
a Xi :2411.07539, 2024.
[47] X. Tan, M. An ony, and H. Kong, “Au oma ed music
gene a ion o isual a h ough emo ion.” in ICCC,
2020, pp. 247–250.
[48] A. San os, H. Pin o, R. Pe ei a Jo ge, and N. Co eia,
“Musy i: music syn hesis om images,” in P oceed-
ings o he 12 h In e na ional Con e ence on Compu-
a ional C ea i i y, 2021, pp. 103–112.
[49] R. Zhang, Y. Zhang, K. Shao, Y. Shan, and G. Xia,
“Vis2mus: Explo ing mul imodal ep esen a ion map-
ping o con ollable music gene a ion,” a Xi p ep in
a Xi :2211.05543, 2022.
[50] Z. Xiong, P.-C. Lin, and A. Fa judian, “Re aining se-
man ics in image o music con e sion,” in 2022 IEEE
In e na ional Symposium on Mul imedia (ISM). IEEE,
2022, pp. 228–235.
[51] S. Liu, A. S. Hussain, C. Sun, and Y. Shan, “M2ugen:
Mul i-modal music unde s anding and gene a ion wi h
he powe o la ge language models,” a Xi p ep in
a Xi :2311.11255, 2023.
[52] S. Chowdhu y, S. Nag, K. Joseph, B. V. S ini asan,
and D. Manocha, “Mel usion: Syn hesizing music
om image and language cues using di usion mod-
els,” in P oceedings o he IEEE/CVF Con e ence on
Compu e Vision and Pa e n Recogni ion, 2024, pp.
26 826–26 835.
[53] K. Su, X. Liu, and E. Shlize man, “How does i
sound?” Ad ances in Neu al In o ma ion P ocessing
Sys ems, ol. 34, pp. 29 258–29 273, 2021.
[54] Y. Zhu, K. Olszewski, Y. Wu, P. Achliop as, M. Chai,
Y. Yan, and S. Tulyako , “Quan ized gan o com-
plex music gene a ion om dance ideos,” in Eu o-
pean Con e ence on Compu e Vision. Sp inge , 2022,
pp. 182–199.
[55] Y. Zhu, Y. Wu, K. Olszewski, J. Ren, S. Tulyako , and
Y. Yan, “Disc e e con as i e di usion o c oss-modal
music and image gene a ion,” in The Ele en h In e na-
ional Con e ence on Lea ning Rep esen a ions, 2023.
[56] X. Liu, K. Su, and E. Shlize man, “Le he bea ol-
low you-c ea ing in e ac i e d um sounds om body
hy hm,” in P oceedings o he IEEE/CVF Win e Con-
e ence on Applica ions o Compu e Vision, 2024, pp.
7187–7197.
[57] G. Agga wal and D. Pa ikh, “Dance2music: Au o-
ma ic dance-d i en music gene a ion,” a Xi p ep in
a Xi :2107.06252, 2021.
[58] S. Li, W. Dong, Y. Zhang, F. Tang, C. Ma, O. Deussen,
T.-Y. Lee, and C. Xu, “Dance- o-music gene a ion wi h
encode -based ex ual in e sion,” in SIGGRAPH Asia
2024 Con e ence Pape s, 2024, pp. 1–11.
[59] X. Liang, W. Li, L. Huang, and C. Gao, “Dancecom-
pose : Dance- o-music gene a ion using a p og essi e
condi ional music gene a o ,” IEEE T ansac ions on
Mul imedia, 2024.
[60] K. Su, X. Liu, and E. Shlize man, “Mul i-
ins umen alis ne : Unsupe ised gene a ion o
music om body mo emen s,” a Xi p ep in
a Xi :2012.03478, 2020.
[61] ——, “Audeo: Audio gene a ion o a silen pe o -
mance ideo,” Ad ances in Neu al In o ma ion P o-
cessing Sys ems, ol. 33, pp. 3325–3337, 2020.
[62] C. Gan, D. Huang, P. Chen, J. B. Tenenbaum, and
A. To alba, “Foley music: Lea ning o gene a e music
om ideos,” in Compu e Vision–ECCV 2020: 16 h
Eu opean Con e ence, Glasgow, UK, Augus 23–28,
2020, P oceedings, Pa XI 16. Sp inge , 2020, pp.
758–775.
[63] A. S. Koepke, O. Wiles, Y. Moses, and A. Zisse man,
“Sigh o sound: An end- o-end app oach o isual pi-
ano ansc ip ion,” in ICASSP 2020-2020 IEEE In e -
na ional Con e ence on Acous ics, Speech and Signal
P ocessing (ICASSP). IEEE, 2020, pp. 1838–1842.
[64] H. Fang, P. Xiong, L. Xu, and Y. Chen, “Clip2 ideo:
Mas e ing ideo- ex e ie al ia image clip,” a Xi
p ep in a Xi :2106.11097, 2021.
[65] M. A i i, M. A. B ubake , and M. S. B own, “His ogan:
Con olling colo s o gan-gene a ed and eal images
ia colo his og ams,” in P oceedings o he IEEE/CVF
con e ence on compu e ision and pa e n ecogni-
ion, 2021, pp. 7941–7950.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
231
