VIDEO-GUIDED TEXT-TO-MUSIC GENERATION USING PUBLIC
DOMAIN MOVIE COLLECTIONS
Haven Kim1 Zachary Novack1 Weihan Xu2 Julian McAuley1 Hao-Wen Dong3
1 University of California San Diego
2 Duke University
3 University of Michigan
ABSTRACT
Despite recent advancements in music generation systems, their application in film production remains limited, as they struggle to capture the nuances of real-world filmmaking, where filmmakers consider multiple factors (such as visual content, dialogue, and emotional tone) when selecting or composing music for a scene. This limitation primarily stems from the absence of comprehensive datasets that integrate these elements. To address this gap, we introduce Open Screen Soundtrack Library (OSSL), a dataset consisting of movie clips from public domain films, totaling approximately 36.5 hours, paired with high-quality soundtracks and human-annotated mood information. To demonstrate the effectiveness of our dataset in improving the performance of pre-trained models on film music generation tasks, we introduce a new video adapter that enhances an autoregressive transformer-based text-to-music model by adding video-based conditioning. Our experimental results demonstrate that our proposed approach effectively enhances MusicGen-Medium in terms of both objective measures of distributional and paired fidelity, and subjective compatibility in mood and genre. To facilitate reproducibility and foster future work, we publicly release the dataset, code, and demo 1.
1. INTRODUCTION
Music plays a crucial role in films, shaping their artistic quality and influencing their commercial success [1]. A well-composed soundtrack enhances the emotional depth of a scene, guiding audience perception and engagement [2–5]. Despite recent advancements in music generation systems, significant challenges remain in adapting these technologies to film production, as these systems are not designed to align with the real-world practices of film music composition, where multiple elements (such as visual content, dialogue, and emotional tone) are considered altogether [6].

1 Dataset: https://havenpersona.github.io/ossl-v1
Code: https://github.com/havenpersona/ossl-v1
Demo: https://havenpersona.github.io/demo/ismir2025

© Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, and Hao-Wen Dong. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, and Hao-Wen Dong, “Video-Guided Text-to-Music Generation Using Public Domain Movie Collections”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

A major obstacle is the lack of comprehensive datasets containing movie clips paired with their corresponding soundtracks. While some existing film datasets are derived from commercial movies, many are no longer available for download [7–9], and some provide only video embeddings rather than raw movie clips [10], making it challenging to distinguish segments containing music from those that do not. The remaining datasets do not include isolated soundtrack stems [11–14]. While we acknowledge the possibility of constructing a dataset using source-separated music, as done in previous approaches [15], we find the quality of source-separated music to be suboptimal, as it occasionally includes incorrect audio such as voice or sound effects, making it unsuitable for our use case.
Another missing element that plays a pivotal role in making film music is mood information, as dialogues and scenes alone are often insufficient to fully convey emotions, which explains why scripts typically include not only dialogues but also separate mood descriptions. However, we find that this aspect remains largely unaddressed in existing resources, with no publicly available film datasets annotated with mood information.
To bridge this gap, we introduce the Open Screen Soundtrack Library (OSSL), a dataset comprising approximately 36.5 hours of movie clips from public domain films, each paired with its corresponding soundtrack and mood annotation. The dataset is created by automatically identifying timestamps of musical segments within the films, mapping these segments to the soundtracks, and then verifying the mappings manually. To demonstrate the effectiveness of this dataset for enhancing music generation for movie clips, we incorporate a video adapter into a text-to-music generation model (MusicGen) [16] and fine-tune it on our dataset. We evaluate our models on both public domain films and commercial films, using two sets that we call OSSL Evaluation Set-Public and OSSL Evaluation Set-Commercial (OES-Pub and OES-Com, respectively), in order to ensure that our approach can handle various kinds of data. Our objective and subjective evaluations demonstrate that we successfully extend a text-to-music generation model, MusicGen-Medium, to handle both textual and video inputs for enhanced film music generation.
For reproducibility and accessibility, we publicly release our main dataset, OSSL, as well as the evaluation sets, OES-Pub and OES-Com.
2. RELATED WORK
Audio-Domain Music Generation. Contemporary music generation architectures in the audio domain predominantly follow two distinct paradigms. The first employs neural codecs to transform digital audio signals into discrete tokens, enabling transformer-based models [17] to generate music by learning token distributions from prompts such as text [16, 18, 19] or video [20–22]. The second paradigm leverages diffusion-based frameworks, where diffusion models are trained to generate audio signals, conditioned on prompts such as textual descriptions [23–29] or video frames [30, 31]. Our models are built upon MusicGen [16], a text-to-music generation framework based on the former approach.
Multimodal Music Generation. Most existing audio-domain music generation models support purely unimodal conditions, such as text [18, 29], or have limited multimodal support for text along with audio [16, 18, 32] or abstract musical signals [33–35]. In contrast, work on conditioning on visual signals is more limited. One example is a transformer-based model conditioned on text, speech, images, and videos, which generates discrete music tokens that are later converted into raw waveforms from embeddings obtained from pre-trained speech, image, and video encoders [36]. An alternative approach involves converting visual inputs into detailed textual descriptions and feeding them into a model. This enables the use of diverse modalities, including text, videos, and images, as inputs [37]. Another recent study trained a video-to-music generation model on a diverse set of video features, while integrating textual conditioning to enable high-level control through textual embeddings derived from a pre-trained text encoder [21]. On the other hand, our approach focuses on integrating a video adapter into a pre-trained text-to-music generation model in order to build a music generation model conditioned on multimodal inputs.
Video-Conditioned Music Generation. One of the earliest contributions to video-to-music generation utilized human pose features extracted by pre-trained models, in order to generate plausible music for video clips containing individuals playing musical instruments [38]. In contrast, another early study employed self-defined handcrafted features to capture relevant video attributes [39]. More recently, the field has shifted toward leveraging embeddings derived from pre-trained video encoders [20–22, 40], both in the audio and symbolic domain. This approach has been integrated into our models, enabling them to perform cross-attention on embeddings derived from a pre-trained video encoder designed to capture both spatial and temporal video information [41].
Adapter Mechanisms. Adapter mechanisms, which integrate compact and trainable modules into pre-trained models, were originally introduced to enhance transfer learning efficiency in natural language processing (NLP) [42]. This approach is commonly employed to adapt pre-trained backbone models to various tasks [43, 44].
Our work is particularly inspired by methods that modify pre-trained models to accommodate external conditions with new modalities. Notably, prior research has demonstrated the effectiveness of lightweight adapters in incorporating image prompts into text-to-image diffusion models [45, 46]. Building on this idea, subsequent work has extended the approach by using adapters to integrate audio prompts into diffusion-based text-to-music generation models [47]. On the other hand, our approach attempts to integrate video prompts into autoregressive transformer-based text-to-music generation models.

Dataset          Audio  MIDI  Self-Hosted  Mood  Video Content                      Length (Hours)
HIMV-200K [48]   ✓      ✗     ✗            ✗     Music Video, User-Generated Video  -
URMP [49]        ✓      ✓     ✗            ✗     Music Performance                  33.5
TikTok [50]      ✓      ✗     ✗            ✗     Dance Video                        1.5
AIST++ [51]      ✓      ✗     ✓            ✗     3D Dance Motion                    5.2
SymMV [52]       ✓      ✓     ✗            ✗     Music Video                        76.5
MuVi-Sync [53]   ✓      ✓     ✗            ✗     Music Video                        -
BGM909 [54]      ✓      ✓     ✗            ✗     Music Video                        -
NES-VMDB [55]    ✗      ✓     ✗            ✗     Gameplay Video                     474.0
OSSL (Ours)      ✓      ✗     ✓            ✓     Films                              36.5
Table 1: Comparison of video-music datasets available as of June 2025. The proposed OSSL dataset is the first self-hosted video-music dataset (i.e., without requiring separate download procedures via YouTube URLs or sharing requests) which includes mood annotations.

3. DATASET CONSTRUCTION
3.1 Open Screen Soundtrack Library (OSSL)
We introduce the Open Screen Soundtrack Library (OSSL), a collection of movie clips with their corresponding soundtracks and associated metadata, including mood annotations. We provide a comparison of video-music datasets in Table 1 and an illustration of our dataset construction methodology in Figure 1.
Data Collection. We compile a list of public domain films along with metadata (e.g., title, release date, and genres) and obtain complete versions from YouTube. To achieve the highest music quality by obtaining soundtrack stems without unnecessary noise, we download soundtracks, instead of source-separated music, for each film from YouTube, guided by IMDB 2 metadata. The frame rate and resolution of each video are 25 fps and 960x720, respectively, and the sampling rate of each soundtrack is 44.1 kHz.

2 An online database on films. https://www.imdb.com

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 1: Illustration of our methodology for constructing the OSSL dataset. We collect publicly available movies and their soundtracks. We separate movie audio into music and other elements, and extract music using a silence detection algorithm. An event detection algorithm ensures the extracted audio is truly music. We then calculate chroma distance to match clips with their corresponding soundtracks, assigning the closest match. Human inspection verifies these mappings.

Figure 2: Distribution of release years of the OSSL dataset

Musical Segment Identification. To identify the timestamps of segments with soundtracks within the films, we employ a pre-trained source separation model [56] trained to decompose movie audio into three components: music, effect, and dialogue. After source separation, we apply a silence detection algorithm to music parts using pyAudioAnalysis [57], with a frame length and step size of 20 ms and a threshold scaling factor of 0.2. We define a segment as a clip if it is longer than 10 seconds and contains no silence longer than 1 second.
Following this, we verify the presence of musical content using an event detection algorithm [58], as the extracted soundtrack may contain non-musical elements owing to the limitations of current source separation models. Specifically, we apply the event detection algorithm to each clip and retain only those where the average probability of containing a musical event exceeds 0.3. (This threshold value is determined empirically by testing different values across multiple samples.)
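
The two filtering rules above (clip length and silence gap, then the average music-event probability) can be sketched in a few lines. This is only an illustrative sketch, not the released pipeline: the function names `merge_into_clips` and `is_music`, and the assumption that silence detection returns a list of non-silent `(start, end)` intervals in seconds, are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) of non-silent music audio


def merge_into_clips(nonsilent: List[Interval],
                     max_gap: float = 1.0,
                     min_len: float = 10.0) -> List[Interval]:
    """Merge non-silent intervals whose gap is at most max_gap seconds,
    then keep only merged spans longer than min_len seconds."""
    clips: List[Interval] = []
    for start, end in sorted(nonsilent):
        if clips and start - clips[-1][1] <= max_gap:
            # Silence of at most max_gap seconds: extend the current clip.
            clips[-1] = (clips[-1][0], max(clips[-1][1], end))
        else:
            clips.append((start, end))
    return [(s, e) for s, e in clips if e - s > min_len]


def is_music(frame_probs: List[float], threshold: float = 0.3) -> bool:
    """Retain a clip only if its average music-event probability exceeds threshold."""
    return bool(frame_probs) and sum(frame_probs) / len(frame_probs) > threshold


# Toy example: two short silences are bridged, one 2-second silence splits clips.
nonsilent = [(0.0, 4.0), (4.5, 12.0), (14.0, 20.0), (20.8, 31.0)]
clips = merge_into_clips(nonsilent)
print(clips)
```

The thresholds (1 second, 10 seconds, 0.3) are the values stated in the text; everything else about the data layout is assumed for illustration.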
Movie Clips-Soundtracks Mapping. Because a movie typically contains multiple soundtracks, it was essential to determine which soundtrack each movie clip corresponds to. The most effective method we found is using chroma similarity 3. Specifically, we compare the chroma features of source-separated soundtracks from movie clips with those of each soundtrack in the film, assigning clips to the soundtrack with the minimum chroma distance.
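
As a rough illustration of minimum-distance assignment, the sketch below compares time-averaged 12-bin chroma vectors with a cosine distance. The text does not specify which chroma distance is used, so the averaging step, the cosine distance, and the names `mean_chroma`, `cosine_distance`, and `assign_soundtrack` are all assumptions made for this example; a practical implementation would likely also need to align the clip against positions within each soundtrack.

```python
import math
from typing import Dict, List, Sequence

Chroma = Sequence[Sequence[float]]  # sequence of 12-dimensional chroma frames


def mean_chroma(frames: Chroma) -> List[float]:
    """Average chroma frames over time into a single 12-dimensional vector."""
    n = len(frames)
    return [sum(frame[k] for frame in frames) / n for k in range(12)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity (assumed distance; not specified in the text)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def assign_soundtrack(clip: Chroma, soundtracks: Dict[str, Chroma]) -> str:
    """Return the name of the soundtrack with the minimum chroma distance."""
    clip_vec = mean_chroma(clip)
    return min(soundtracks,
               key=lambda name: cosine_distance(clip_vec, mean_chroma(soundtracks[name])))


def toy_chroma(pitch_classes, n_frames=4) -> List[List[float]]:
    """Hypothetical chroma frames with energy on the given pitch classes."""
    frame = [1.0 if k in pitch_classes else 0.05 for k in range(12)]
    return [frame] * n_frames


clip = toy_chroma({0, 4, 7})                       # C major triad
tracks = {"main_theme": toy_chroma({0, 4, 7}, 8),  # same harmony
          "chase": toy_chroma({1, 5, 8}, 8)}       # different harmony
best = assign_soundtrack(clip, tracks)
print(best)
```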
Manual Quality Inspection. The resulting movie clips are manually assessed by human evaluators, and any clips with incorrect mappings (i.e., when a clip is paired to a wrong soundtrack) are excluded from our dataset.

3 We attempted a fingerprinting approach (https://github.com/worldveil/dejavu) to test mapping 20 clips from the movie “D.O.A” against 24 soundtracks. However, this method produced inaccurate mappings in most instances, achieving only one correct identification out of 20 samples (a failure rate of 95%). Our chroma similarity approach, on the other hand, accurately mapped 17 clips to their corresponding soundtracks, yielding a success rate of 85%.

Attributes               OSSL          OES-Pub      OES-Com
Number of samples        736           100          100
Number of unique films   299           76           37
Average clip duration    178.47 sec    30 sec       30 sec
Total duration           36.49 hours   0.83 hours   0.83 hours
Table 2: Statistical overview of OSSL, OES-Pub, and OES-Com

Mood Annotation. The final stage of our dataset construction involves annotating mood information for each movie clip. We classify mood into four categories based on a previously suggested taxonomy, Russell’s 4Q, where the four classes are HVHA (high valence, high arousal), HVLA (high valence, low arousal), LVHA (low valence, high arousal), and LVLA (low valence, low arousal) [59, 60].
We employ two human annotators. We first provide them with a brief explanation of the concepts of valence and arousal in the context of music and then ask them to independently annotate the movie clips using the four mood categories. In the majority of cases (89.9% of samples), both annotators assign the same label to a movie clip. When agreement occurs, we retain the assigned annotation. If they disagree, they discuss their choices until they reach a consensus. As a result, we obtain 276 movie clips classified as happy, 30 as sad, 315 as nervous, and 115 as peaceful.
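
The sketch below shows one way the quadrant labels and the agreement figure might be computed. The mapping from quadrant to mood word follows the paper's own naming (happy = HVHA, peaceful = HVLA, nervous = LVHA, sad = LVLA); the function names and the toy annotator labels are hypothetical.

```python
from typing import List

QUADRANT_TO_MOOD = {
    "HVHA": "happy",     # high valence, high arousal
    "HVLA": "peaceful",  # high valence, low arousal
    "LVHA": "nervous",   # low valence, high arousal
    "LVLA": "sad",       # low valence, low arousal
}


def quadrant(valence_high: bool, arousal_high: bool) -> str:
    """Encode binary valence/arousal judgments as a Russell 4Q quadrant code."""
    return ("HV" if valence_high else "LV") + ("HA" if arousal_high else "LA")


def agreement_rate(labels_a: List[str], labels_b: List[str]) -> float:
    """Fraction of clips on which two annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Toy labels from two hypothetical annotators (not the real annotations).
annotator_1 = ["happy", "nervous", "sad", "peaceful", "nervous"]
annotator_2 = ["happy", "nervous", "sad", "happy", "nervous"]
print(agreement_rate(annotator_1, annotator_2))

# The reported per-class counts add up to the 736 clips in OSSL.
reported_counts = {"happy": 276, "sad": 30, "nervous": 315, "peaceful": 115}
print(sum(reported_counts.values()))
```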
3.2 OSSL Evaluation Set (OES)
We evaluate our models on two distinct datasets: OSSL Evaluation Set-Public (OES-Pub), comprising movie clips from public domain films that are not included in OSSL, and OSSL Evaluation Set-Commercial (OES-Com), consisting of clips from commercial films. The evaluation on OES-Pub serves to confirm that our models have effectively learned to generalize from this data distribution. On the other hand, the evaluation on OES-Com shows whether these models can generalize to contemporary commercial films. Notably, OES-Com incorporates a large proportion (89%) of clips from films released within the past year. We also annotate each movie clip with mood information in the same way as we did for constructing OSSL (this annotation fills the [MOOD] field of the text prompts described in Section 5.3).
Finally, we present detailed statistical information on OSSL, OES-Pub, and OES-Com in Table 2.

Figure 3: Extending MusicGen: We introduce a video adapter, which applies cross-attention to linearly transformed video embeddings, scaled by α, and integrates it into the original cross-attention mechanism. A fire icon denotes trainable components, while a snowflake icon indicates frozen components.

4. MODEL ARCHITECTURE
In this section, we present our methodology for integrating a video adapter into an existing text-to-music generation model, MusicGen [16], along with its illustration in Figure 3.
MusicGen [16] is an autoregressive transformer [17]-based model that generates discrete tokens from textual prompts, which are subsequently converted into audio signals by a neural codec [61]. Each attention head in the original model’s cross-attention modules is initially defined as:

$$\mathrm{head}_i = \mathrm{Attention}\left(x W_i^{(q)},\; z_t W_i^{(k)},\; z_t W_i^{(v)}\right) \tag{1}$$

In this formulation, $x$ represents the decoder’s current hidden states, $z_t$ denotes the text encoder’s output, and $i$ is the index of an attention head.
To implement our video adapter, we leverage a pre-trained video encoder. Specifically, we choose ViViT 4 [41], a transformer-based video understanding model pretrained on a large-scale dataset [62], which provides a strong foundation for our task of film video understanding. We first obtain the video embeddings ($z_v \in \mathbb{R}^{n}$) using it. Subsequently, we apply an affine linear transformation $X \in \mathbb{R}^{m \times n}$ to adjust the dimension of the video embeddings from the model’s original dimension $n$ to the dimension $m$ that is compatible with our text embeddings ($\tilde{z}_v = X \ast z_v$).
We modify the cross-attention layer, which originally processes single modalities, to incorporate video embeddings, thereby enabling the model to attend to multimodal contexts. Specifically, we augment the original cross-attention mechanism with a video-conditioned component, where each attention head computes its output by leveraging both text and video modalities.

4 google/vivit-b-16x2-kinetics400

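$$\mathrm{head}_i = \mathrm{Attention}\left(x W_i^{(q)},\; z_t W_i^{(k)},\; z_t W_i^{(v)}\right) + \alpha \times \mathrm{Attention}\left(x \tilde{W}_i^{(q)},\; \tilde{z}_v \tilde{W}_i^{(k)},\; \tilde{z}_v \tilde{W}_i^{(v)}\right) \tag{2}$$

where $x$ represents the decoder’s current hidden states, $\tilde{z}_v$ represents the video embeddings with adjusted dimensions, and $\alpha$ is a trainable parameter.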
During training, only the newly introduced parameters $\tilde{W}_i^{(q)}$, $\tilde{W}_i^{(k)}$, $\tilde{W}_i^{(v)}$, $\alpha$, and $X$ are optimized after random initialization, while all other components of the model remain frozen.
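
To make the two-branch attention head concrete, here is a minimal single-head sketch in plain Python. It is not the authors' implementation: it omits batching, multiple heads, masking, and any bias term, treats embeddings as row vectors (so the projection is applied as `z_v @ X^T`), and uses identity matrices as stand-in weights. All function and variable names are assumptions for illustration.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (p x q) and b (q x r)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [sum(x * y for x, y in zip(q_row, k_row)) / scale for k_row in k]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def adapted_head(x: Matrix, z_t: Matrix, z_v: Matrix,
                 W: Dict[str, Matrix], W_tilde: Dict[str, Matrix],
                 X: Matrix, alpha: float) -> Matrix:
    """Text cross-attention plus alpha times video cross-attention."""
    z_v_tilde = matmul(z_v, transpose(X))  # project video rows from R^n to R^m
    text = attention(matmul(x, W["q"]), matmul(z_t, W["k"]), matmul(z_t, W["v"]))
    video = attention(matmul(x, W_tilde["q"]),
                      matmul(z_v_tilde, W_tilde["k"]),
                      matmul(z_v_tilde, W_tilde["v"]))
    return [[t + alpha * u for t, u in zip(t_row, u_row)]
            for t_row, u_row in zip(text, video)]


# Toy shapes: decoder width 2, text width m = 2, video width n = 3.
I2 = [[1.0, 0.0], [0.0, 1.0]]
W = {"q": I2, "k": I2, "v": I2}          # frozen text projections (stand-ins)
W_tilde = {"q": I2, "k": I2, "v": I2}    # new, trainable video projections
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # m x n projection of video embeddings
x = [[1.0, 0.0]]                         # one decoder position
z_t = [[1.0, 0.0], [0.0, 1.0]]           # two text tokens
z_v = [[1.0, 2.0, 3.0]]                  # one video embedding

text_only = adapted_head(x, z_t, z_v, W, W_tilde, X, alpha=0.0)
with_video = adapted_head(x, z_t, z_v, W, W_tilde, X, alpha=1.0)
print(text_only, with_video)
```

With `alpha = 0.0` the head reduces to the original text-only cross-attention, which is one way to see why a scalar gate lets the frozen model's behavior be recovered exactly.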
5. EXPERIMENTAL SETTING
5.1 Comparison Models
We fine-tune MusicGen-Small and MusicGen-Medium with video adapters, 5 as described in the previous section. Here, S-MULTI and M-MULTI denote these models, where “S” and “M” stand for “Small” and “Medium,” respectively.
As baselines, we use the original MusicGen-Small and MusicGen-Medium models, which generate results solely based on text prompts. We denote these models as S-BASE and M-BASE, respectively.
To assess the effectiveness of video adapters, we also compare them against models fine-tuned on our dataset using only textual prompts (i.e., without video adapters). These models are denoted as S-TEXT and M-TEXT. Unlike S-MULTI and M-MULTI, where existing parameters are frozen and new parameters are trained, S-TEXT and M-TEXT do not incorporate new parameters. Instead, we apply Low-Rank Adaptation [63] when fine-tuning these models.
5.2 Dataset Preprocessing
Since the default maximum length of MusicGen is 30 seconds, we train our models to generate only the first 30 seconds of music for the first 30 seconds of each video clip. Because each clip begins at the detected start time of a musical segment, using the opening 30 seconds prevents the model from being trained to generate from the middle of a track. Because MusicGen produces raw audio at a sampling rate of 32,000 Hz, all soundtracks are resampled to this rate before training. Additionally, we normalize all soundtracks so that their maximum amplitude is always one.
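
The cropping and peak-normalization steps can be sketched as follows. This is a simplified illustration on a plain list of samples; resampling to 32,000 Hz is assumed to have been done beforehand with an audio library and is not shown, and the function name `crop_and_peak_normalize` is hypothetical.

```python
import math
from typing import List, Sequence


def crop_and_peak_normalize(samples: Sequence[float],
                            sample_rate: int = 32000,
                            max_seconds: float = 30.0) -> List[float]:
    """Keep the first max_seconds of audio and scale it so max |amplitude| is 1."""
    clip = list(samples[: int(sample_rate * max_seconds)])
    peak = max((abs(s) for s in clip), default=0.0)
    if peak == 0.0:
        return clip  # silent or empty input: nothing to scale
    return [s / peak for s in clip]


# Toy signal: 40 "seconds" at a tiny sample rate of 10 Hz to keep the example small.
toy_rate = 10
signal = [0.5 * math.sin(0.3 * n) for n in range(40 * toy_rate)]
processed = crop_and_peak_normalize(signal, sample_rate=toy_rate)
print(len(processed), max(abs(s) for s in processed))
```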
5.3 Text Prompts Design
Prompts have been suggested to be a significant factor influencing the outcomes of generative models [64–66]. In our experiments, text prompts serve not only as primary information for fine-tuning text-based models (S-TEXT and M-TEXT) but also as a basis for inference in baseline models (S-BASE and M-BASE). To ensure a fair comparison, we therefore carefully design them to fully leverage the capabilities of text-to-music generation models.
5 Due to computational constraints, we did not conduct experiments on MusicGen-Large.
We experiment with three types of textual information: (1) mood annotations from our dataset, (2) movie genre labels (e.g., thriller, romance), and (3) LLM-generated music descriptions obtained using an open-source music captioning model [67]. We observe that while both mood annotations and LLM-generated music captions subjectively improve generation quality, adding genre information often has a negative impact. For instance, when using the prompt, “A film soundtrack for a peaceful scene in a thriller movie,” our baseline models, S-BASE and M-BASE, overemphasize the keyword “thriller,” producing music unsuitable for peaceful scenes. As a result, our final structured template includes only mood information and LLM-generated music captions: “A film soundtrack for a [MOOD] scene. [CAPTION].” Here, [MOOD] is a natural language description of our Russell’s 4Q-based mood annotation, which is one of happy (high valence, high arousal), sad (low valence, low arousal), nervous (low valence, high arousal), or peaceful (high valence, low arousal). To generate music captions ([CAPTION]), we extract the first 30 seconds of each soundtrack in our dataset, divide it into three 10-second segments, and generate a caption for each using a music captioning model [67], as we recognize that film music evolves over time. We then use a commercial LLM, specifically Claude 3.5 Sonnet, to summarize these captions with the prompt: “Summarize the description of each song in one sentence from 0 to 30 seconds.” Consequently, the resulting captions concisely describe how the music changes over time, providing a brief musical summary (e.g., ‘The piece transitions from cembalo to marimba, concluding with a Tibetan singing bowl and animal sounds’).
When designing text prompts for OES-Pub and OES-Com, we use source-separated music to generate [CAPTION], unlike OSSL, which features original soundtrack stems. This results in the music captioning model frequently describing the audio as having poor recording quality. To mitigate this, we explicitly instruct Claude 3.5 Sonnet to exclude any mention of audio quality when summarizing the captions for OES-Pub and OES-Com.
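
Filling the structured template is simple string formatting; the sketch below shows one possible helper. The template text and mood words are taken from the description above, while the function name `build_prompt` and the handling of the trailing period are assumptions for illustration.

```python
MOOD_WORDS = {
    "HVHA": "happy",
    "HVLA": "peaceful",
    "LVHA": "nervous",
    "LVLA": "sad",
}


def build_prompt(quadrant: str, caption: str) -> str:
    """Fill the template 'A film soundtrack for a [MOOD] scene. [CAPTION].'"""
    caption = caption.strip()
    if caption and not caption.endswith("."):
        caption += "."  # the template ends the caption with a period
    return f"A film soundtrack for a {MOOD_WORDS[quadrant]} scene. {caption}".strip()


prompt = build_prompt(
    "LVHA",
    "The piece transitions from cembalo to marimba, concluding with a "
    "Tibetan singing bowl and animal sounds",
)
print(prompt)
```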
5.4 Training Details
Our models are trained using the AdamW optimizer [68] ($\beta_1 = 0.9$, $\beta_2 = 0.999$) with a weight decay of $1 \times 10^{-2}$. The initial learning rate is set to $1 \times 10^{-4}$ and scheduled using a cosine annealing strategy [69] with a linear warm-up phase. The OSSL dataset is split into training and validation sets in a 9:1 ratio. To prevent overfitting, we stop training a model when the model does not improve for three epochs during the validation stage. Due to computational constraints, the batch size is set to 1. Training is conducted on a single NVIDIA A6000 GPU.
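
The schedule and stopping rule can be written down compactly. The sketch below is an assumption-laden illustration: the number of warm-up steps, the total step count, the minimum learning rate of zero, and the use of validation loss as the early-stopping criterion are not stated in the text, and the function names are hypothetical.

```python
import math
from typing import List


def learning_rate(step: int, total_steps: int, warmup_steps: int,
                  base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Linear warm-up to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr + (base_lr - min_lr) * cosine


def should_stop(val_losses: List[float], patience: int = 3) -> bool:
    """Stop when the last `patience` epochs failed to beat the earlier best loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


# Hypothetical run: 1,000 steps with 100 warm-up steps.
for step in (0, 99, 550, 1000):
    print(step, learning_rate(step, total_steps=1000, warmup_steps=100))
print(should_stop([1.0, 0.9, 0.95, 0.92, 0.91]))
```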
5.5 Objective Evaluation Metrics
Distributional Fidelity. As is common in most audio-domain text-to-music (TTM) research [70, 71], we assess the quality of our outputs at the distributional level by comparing generations from our model against a high-quality reference set, here comprised of 5K commercial soundtracks. To do so, we first extract embeddings from the generated audio of each method and from the reference set using CLAP [72]. With these embeddings, we then compute Fréchet Audio Distance (FAD) [73] and Precision [74]. FAD compares distributional distance by fitting a high-dimensional Gaussian to each dataset and measuring the Fréchet distance between them, while Precision uses a k-NN estimate of the reference set’s distribution and measures how many generated samples lie in the estimated manifold.
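
For reference, the Fréchet distance between two Gaussians has a standard closed form. Writing $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ for the mean and covariance of the reference and generated embeddings (symbols introduced here for illustration, not taken from the paper), the quantity described above is usually computed as:

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

The first term penalizes a shift in the average embedding, and the trace term penalizes differences in spread; both vanish when the two Gaussians coincide.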
Paired Fidelity. As our evaluation sets, OES-Pub and OES-Com, also contain paired video-music data in the form of source-separated musical tracks from each movie clip, we are also able to directly assess how well our models recreate the reference music on a paired sample-by-sample basis. Specifically, we measure the CLAP Audio Similarity and the Kullback-Leibler (KL) Divergence between the generated music and the reference music. The CLAP Audio Similarity is calculated as the cosine similarity between the CLAP embeddings of the generated and reference samples for each video clip. The KL divergence is calculated using the estimated distributions of the generated and reference samples with the PaSST audio classifier [75].
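
Both paired metrics reduce to short formulas once embeddings and class probabilities are available. The sketch below assumes the embeddings and classifier outputs have already been computed (CLAP and PaSST are not called here); the direction of the KL divergence (reference relative to generated) and the smoothing constant `eps` are assumptions, since the text does not specify them.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two class-probability vectors, with small smoothing."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical embeddings and class probabilities for one clip.
reference_embedding = [0.2, 0.9, 0.1]
generated_embedding = [0.25, 0.8, 0.3]
reference_probs = [0.7, 0.2, 0.1]
generated_probs = [0.5, 0.3, 0.2]

print(cosine_similarity(reference_embedding, generated_embedding))
print(kl_divergence(reference_probs, generated_probs))
```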
Sample Diversity. To evaluate the diversity of the generated samples, we employ Recall [74], following prior work in music generation [70, 71]. Using the same embedding model (CLAP) and reference/generated datasets from Distributional Fidelity, we use a k-NN estimate of each generated distribution and calculate the fraction of real samples that lie in the generated manifold.
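
Precision and Recall as described here share one routine: estimate a manifold as the union of k-NN balls around one set, then count how many points of the other set fall inside. The following is a small brute-force sketch under that reading; the value of k, the Euclidean distance, and the function names are assumptions, and real evaluations would use CLAP embeddings rather than 2-D toy points.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def knn_radii(points: Sequence[Point], k: int) -> List[float]:
    """Distance from each point to its k-th nearest neighbour within the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries: Sequence[Point], support: Sequence[Point], k: int = 1) -> float:
    """Fraction of queries lying inside at least one k-NN ball of `support`."""
    radii = knn_radii(support, k)
    hits = sum(any(math.dist(q, s) <= r for s, r in zip(support, radii))
               for q in queries)
    return hits / len(queries)


# Toy 2-D "embeddings": generated samples cluster around a single reference point.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
generated = [(0.0, 0.1), (0.2, 0.1), (0.1, 0.3)]

precision = coverage(generated, reference)  # generated samples inside the reference manifold
recall = coverage(reference, generated)     # reference samples inside the generated manifold
print(precision, recall)
```

In this toy case every generated point is realistic (high Precision) but the generated set covers only one of the four reference points (low Recall), which is the low-diversity pattern the metric is meant to expose.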
5.6 Subjective Survey
To subjectively evaluate our models, we first select 10 representative samples from our evaluation set, OES-Com. Specifically, OES-Com is divided into 10 clusters using k-means clustering, based on the CLAP embeddings of source-separated music, and we select the sample closest to the centroid of each cluster. Using these 10 samples, we design a survey in the form of a website and distribute it within our social network, recruiting 15 participants.
The survey procedure is structured as follows: Each participant is randomly assigned 2 out of the 10 samples. First, participants are required to watch the original version of the first movie clip. This step serves two purposes: to familiarize them with the atmosphere of the clip and to standardize the experience of participants regardless of prior exposure to the movie. After viewing the original clip, participants rate the inference results of each model (S-BASE, S-TEXT, S-MULTI, M-BASE, M-TEXT, and M-MULTI) for the corresponding clip, with the presentation order of the models randomized by the website. Subsequently, participants repeat the process for the second assigned movie clip, watching its original version and rating the inference results of each model, again in randomized order.

Model    OSSL        Video       FAD ↓          Precision ↑    Similarity ↑   KL ↓          Recall ↑       Mood ↑        Genre ↑       Quality ↑
         Fine-tuned  Adapter     pub    com     pub    com     pub    com     pub    com    pub    com     avg ± CI      avg ± CI      avg ± CI
S-BASE   ✗           ✗           64.91  77.99   22.00  14.00   41.55  34.77   1.20   1.93   4.78   6.20    4.53 ± 0.91   5.27 ± 1.04   5.67 ± 0.99
S-TEXT   ✓           ✗           61.98  75.75   20.00  14.00   42.44  34.81   1.13   1.97   9.90   19.00   5.77 ± 1.73   6.00 ± 1.11   6.20 ± 0.76
S-MULTI  ✓           ✓           64.39  75.59   16.00  14.00   43.36  36.09   1.15   1.98   7.50   3.30    4.93 ± 0.99   5.87 ± 1.00   6.10 ± 0.91
M-BASE   ✗           ✗           60.91  76.79   21.00  12.00   43.61  34.39   1.06   1.88   8.04   8.46    5.13 ± 0.94   5.60 ± 1.12   6.20 ± 0.84
M-TEXT   ✓           ✗           61.15  77.79   24.00  17.00   45.31  33.72   1.04   1.90   7.28   13.78   5.20 ± 0.99   6.03 ± 0.98   6.00 ± 1.02
M-MULTI  ✓           ✓           59.51  73.26   25.00  21.00   45.31  36.25   1.00   1.81   9.96   8.72    6.20 ± 1.05   6.70 ± 1.06   7.07 ± 0.93
Table 3: Evaluation results. Objective metrics include FAD and Precision (distributional fidelity), CLAP Audio Similarity (Similarity) and KL Divergence (paired fidelity), and Recall (diversity). Subjective metrics include human ratings of mood, genre, and audio quality. pub and com indicate results on OES-Pub and OES-Com, respectively; avg ± CI refers to average values and 95% confidence intervals.

The evaluation assesses each model across three dimensions (genre, mood, and audio quality) using a 10-point Likert scale. For genre, participants rate how cinematic the AI-generated music sounds (1: not cinematic at all, 10: very cinematic). For mood, participants evaluate the compatibility of the music with the emotional tone of the movie clip (1: not compatible at all, 10: very compatible). For audio quality, participants provide a subjective assessment of the audio quality (1: very poor quality, 10: very high quality). In the following section, we present the average rating values and 95% confidence intervals for each model, based on each of the three criteria.
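
One common way to obtain an average with a 95% confidence interval from a list of ratings is the normal approximation shown below. The text does not say which interval estimator was used, so this formula (mean ± 1.96 standard errors), the function name `mean_ci95`, and the toy ratings are assumptions for illustration only.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_ci95(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width) of a normal-approximation 95% confidence interval."""
    mean = statistics.fmean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical 10-point ratings for one model on one criterion.
ratings = [5, 6, 7, 8, 4, 6]
mean, half_width = mean_ci95(ratings)
print(f"{mean:.2f} ± {half_width:.2f}")
```

With few ratings per model, the half-width is large relative to differences between models, which is consistent with treating the subjective results as indicative rather than conclusive.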
6. RESULTS AND ANALYSIS
We present our comprehensive evaluation results in Table 3.
Distributional Fidelity. Our evaluation on OES-Com reveals that fine-tuning on OSSL enhances distributional fidelity. Specifically, S-TEXT and S-MULTI achieve lower FAD scores compared to S-BASE, while M-TEXT and M-MULTI exhibit higher Precision scores relative to M-BASE, indicating closer alignment with the target distribution. In contrast, the results on OES-Pub show no consistent improvement in distributional fidelity. Although M-TEXT and M-MULTI demonstrate increased Precision scores compared to M-BASE, S-TEXT and S-MULTI actually experience a decrease in Precision when evaluated on OES-Pub. This discrepancy is expected, as distributional fidelity is calculated using reference embeddings from commercial movie soundtracks. Encouragingly, the ability to improve performance on commercial soundtracks by training on public domain data suggests effective feature transfer across domains. In particular, M-MULTI achieves the lowest FAD scores and the highest Precision scores on both evaluation sets, demonstrating its outstanding performance in distributional fidelity.
Paired Fidelity. Fine-tuning on OSSL generally increases the CLAP audio similarity between reference and generated music across both OES-Pub and OES-Com, the one exception being M-TEXT on OES-Com (33.72 versus 34.39 for M-BASE). Additionally, training reduces KL divergence when evaluated on OES-Pub, indicating improved alignment in classifier predictions. However, this improvement does not carry over consistently to OES-Com, where only M-MULTI lowers KL divergence relative to its baseline, suggesting some limitations in generalizing to commercial soundtracks. Notably, M-MULTI stands out by achieving the lowest KL scores on both datasets, highlighting its superior performance in paired fidelity.
Sample Diversity. Training on OSSL generally results in an increase in sample diversity relative to the base models, as indicated by Recall scores when evaluated on OES-Pub, with the exception of M-TEXT, which shows a minor decrease. We observe markedly higher Recall scores for S-TEXT and M-TEXT on OES-Com, despite the use of commercial soundtracks as reference tracks. This finding counters initial concerns about overfitting to patterns unique to public domain data, demonstrating that fine-tuning can enhance diversity even on out-of-domain references. However, S-MULTI exhibits a notable drop in diversity on OES-Com, likely due to the added complexity of learning from both text and video inputs. In contrast, the larger M-MULTI model avoids this issue and even shows increases in Recall scores on both datasets relative to the baseline, suggesting that greater model capacity helps mitigate the challenges of multimodal training.
Human Ratings. Human ratings for mood, genre, and quality exhibit wide 95% confidence intervals, reflecting considerable variability in subjective assessments. Despite this, models fine-tuned on OSSL generally receive higher average ratings for mood and genre fidelity compared to their baseline counterparts, indicating better alignment with human expectations in these dimensions. However, training has a smaller and less consistent impact on perceptual quality. We observe an interesting trend with video adapters: integrating them into smaller models (S-MULTI) slightly lowers average ratings across all metrics relative to S-TEXT, whereas in medium-sized models (M-MULTI), they enhance performance in mood, genre, and quality. This indicates that the benefits of video integration may be contingent on sufficient model capacity.
7. CONCLUSION AND FUTURE WORK
In this paper, we introduced the Open Screen Soundtrack Library (OSSL), a dataset comprising movie clips, corresponding soundtracks, and mood annotations. To show the effectiveness of our dataset, we adapted a text-to-music generation model with video conditions and fine-tuned it on our dataset. We conducted evaluations on both public domain and commercial films, and the results show the effectiveness of our dataset and architecture when applied to medium-sized models. However, due to the limited number of participants, the subjective evaluation requires further validation. Despite this, we believe that OSSL will advance film music research within ethical boundaries, and that our experiments on the proposed methods offer insightful observations on integrating video modalities into existing text-to-music generation models.
8. ETHICS STATEMENT
Our research adheres to ethical principles by ensuring that the construction of Open Screen Soundtrack Library (OSSL) and training methodologies are based solely on copyright-free materials. By publicly releasing our dataset, we aim to promote ethical research practices and encourage the broader community to utilize copyright-free data for model training, fostering transparency and responsible AI development.
For evaluation, we used commercial movie clips strictly for scientific purposes, with no intent to infringe upon the rights of content producers. We believe our use falls within ethical and fair-use guidelines. However, to respect copyright laws, we will not release video clips of OES-Com, and instead provide YouTube URLs of the video clips for reproducibility.
9. REFERENCES
[1] B. Millet, J. Chattah, and S. Ahn, “Soundtrack design: The impact of music on visual attention and affective responses,” Applied ergonomics, vol. 93, p. 103301, 2021.
[2] H. T. P. Thao, D. Herremans, and G. Roig, “Multimodal deep models for predicting affective responses evoked by movies.” in ICCV Workshops, 2019, pp. 1618–1627.
[3] M. Won, J. Salamon, N. J. Bryan, G. J. Mysore, and X. Serra, “Emotion embedding spaces for matching music to stories,” arXiv preprint arXiv:2111.13468, 2021.
[4] H. T. P. Thao, B. Balamurali, G. Roig, and D. Herremans, “Attendaffectnet–emotion prediction of movie viewers using multimodal fusion with self-attention,” Sensors, vol. 21, no. 24, p. 8356, 2021.
[5] P. Chua, D. Makris, D. Herremans, G. Roig, and K. Agres, “Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses,” arXiv preprint arXiv:2202.10453, 2022.
[6] K. Xu, “Analysis of the roles of film soundtracks in films,” in 2022 International Conference on Comprehensive Art and Cultural Communication (CACC 2022). Atlantis Press, 2022, pp. 351–355.
[7] M. Marszalek, I. Laptev, and C. Schmid, “Actions in context,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 2929–2936.
[8] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, “Movieqa: Understanding stories in movies through question-answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4631–4640.
[9] Q. Huang, Y. Xiong, A. Rao, J. Wang, and D. Lin, “Movienet: A holistic dataset for movie understanding,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. Springer, 2020, pp. 709–727.
[10] M. Soldan, A. Pardo, J. L. Alcázar, F. Caba, C. Zhao, S. Giancola, and B. Ghanem, “Mad: A scalable dataset for language grounding in videos from movie audio descriptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5026–5035.
[11] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, “A dataset for movie description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3202–3212.
[12] P. Vicol, M. Tapaswi, L. Castrejon, and S. Fidler, “Moviegraphs: Towards understanding human-centric situations from videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8581–8590.
[13] K. Curtis, G. Awad, S. Rajput, and I. Soboroff, “Hlvu: A new challenge to test deep understanding of movies the way humans do,” in Proceedings of the 2020 International Conference on Multimedia Retrieval, 2020, pp. 355–361.
[14] M. Bain, A. Nagrani, A. Brown, and A. Zisserman, “Condensed movies: Story based retrieval with contextual embeddings,” in Proceedings of the Asian Conference on Computer Vision, 2020.
[15] W. Xu, P. P. Liang, H. Kim, J. McAuley, T. Berg-Kirkpatrick, and H.-W. Dong, “Teasergen: Generating teasers for long documentaries,” 2024. [Online]. Available: https://arxiv.org/abs/2410.05586
[16] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[17] A. Vaswani, “Attention is all you need,” Advances in Neural Information Processing Systems, 2017.
[18] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., “Musiclm: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023.
[19] Y.-H. Lan, W.-Y. Hsiao, H.-C. Cheng, and Y.-H. Yang, “Musicongen: Rhythm and chord control for transformer-based text-to-music generation,” arXiv preprint arXiv:2407.15060, 2024.
[20] Z. Tian, Z. Liu, R. Yuan, J. Pan, Q. Liu, X. Tan, Q. Chen, W. Xue, and Y. Guo, “Vidmuse: A simple video-to-music generation framework with long-short-term modeling,” arXiv preprint arXiv:2406.04321, 2024.
[21] K. Su, J. Y. Li, Q. Huang, D. Kuzmin, J. Lee, C. Donahue, F. Sha, A. Jansen, Y. Wang, M. Verzetti et al., “V2meow: Meowing to the visual beat via video-to-music generation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4952–4960.
[22] H. Zuo, W. You, J. Wu, S. Ren, P. Chen, M. Zhou, Y. Lu, and L. Sun, “Gvmgen: A general video-to-music generation model with hierarchical attentions,” arXiv preprint arXiv:2501.09972, 2025.
[23] S. Forsgren and H. Martiros, “Riffusion-stable diffusion for real-time music generation,” URL https://riffusion.com, 2022.
[24] Q. Huang, D. S. Park, T. Wang, T. I. Denk, A. Ly, N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. Frank et al., “Noise2music: Text-conditioned music generation with diffusion models,” arXiv preprint arXiv:2302.03917, 2023.
[25] F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf, “Moûsai: Text-to-music generation with long-context latent diffusion,” arXiv preprint arXiv:2301.11757, 2023.
[26] M. W. Lam, Q. Tian, T. Li, Z. Yin, S. Feng, M. Tu, Y. Ji, R. Xia, M. Ma, X. Song et al., “Efficient neural music generation,” Advances in Neural Information Processing Systems, vol. 36, pp. 17 450–17 463, 2023.
[27] J. Melechovsky, Z. Guo, D. Ghosal, N. Majumder, D. Herremans, and S. Poria, “Mustango: Toward controllable text-to-music generation,” arXiv preprint arXiv:2311.08355, 2023.
[28] T. Karchkhadze, M. R. Izadi, K. Chen, G. Assayag, and S. Dubnov, “Multi-track musicldm: Towards versatile music generation with latent diffusion model,” arXiv preprint arXiv:2409.02845, 2024.
[29] Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable audio open,” arXiv preprint arXiv:2407.14358, 2024.
[30] Y.-B. Lin, Y. Tian, L. Yang, G. Bertasius, and H. Wang, “Vmas: Video-to-music generation via semantic alignment in web music videos,” arXiv preprint arXiv:2409.07450, 2024.
[31] R. Li, S. Zheng, X. Cheng, Z. Zhang, S. Ji, and Z. Zhao, “Muvi: Video-to-music generation with semantic alignment and rhythmic synchronization,” arXiv preprint arXiv:2410.12957, 2024.
[32] O. Tal, A. Ziv, I. Gat, F. Kreuk, and Y. Adi, “Joint audio and symbolic conditioning for temporally controlled text-to-music generation,” arXiv preprint arXiv:2406.10970, 2024.
[33] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan, “Music controlnet: Multiple time-varying controls for music generation,” 2023.
[34] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “DITTO: Diffusion inference-time T-optimization for music generation,” in International Conference on Machine Learning (ICML), 2024.
[35] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “DITTO-2: Distilled diffusion inference-time T-optimization for music generation,” in International Society for Music Information Retrieval (ISMIR), 2024.
[36] S. Liu, A. S. Hussain, Q. Wu, C. Sun, and Y. Shan, “Mumu-llama: Multi-modal music understanding and generation via large language models,” arXiv preprint arXiv:2412.06660, 2024.
[37] B. Wang, L. Zhuo, Z. Wang, C. Bao, W. Chengjing, X. Nie, J. Dai, J. Han, Y. Liao, and S. Liu, “Multimodal music generation with explicit bridges and retrieval augmentation,” arXiv preprint arXiv:2412.09428, 2024.
[38] C. Gan, D. Huang, P. Chen, J. B. Tenenbaum, and A. Torralba, “Foley music: Learning to generate music from videos,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. Springer, 2020, pp. 758–775.
[39] S. Di, Z. Jiang, S. Liu, Z. Wang, L. Zhu, Z. He, H. Liu, and S. Yan, “Video background music generation with controllable music transformer,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 2037–2045.
[40] L. Zhuo, Z. Wang, B. Wang, Y. Liao, C. Bao, S. Peng, S. Han, A. Zhang, F. Fang, and S. Liu, “Video background music generation: Dataset, method and evaluation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 637–15 647.
[41] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 6836–6846.
[42] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International conference on machine learning. PMLR, 2019, pp. 2790–2799.
[43] J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych, “Adapterfusion: Non-destructive task composition for transfer learning,” arXiv preprint arXiv:2005.00247, 2020.
[44] J. Pfeiffer, A. Rücklé, C. Poth, A. Kamath, I. Vulić, S. Ruder, K. Cho, and I. Gurevych, “Adapterhub: A framework for adapting transformers,” arXiv preprint arXiv:2007.07779, 2020.
[45] H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv:2308.06721, 2023.
[46] C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4296–4304.
[47] F.-D. Tsai, S.-L. Wu, H. Kim, B.-Y. Chen, H.-C. Cheng, and Y.-H. Yang, “Audio prompt adapter: Unleashing music editing abilities for text-to-music with lightweight finetuning,” arXiv preprint arXiv:2407.16564, 2024.
[48] S. Hong, W. Im, and H. S. Yang, “Content-based video-music retrieval using soft intra-modal structure constraint,” 2017. [Online]. Available: https://arxiv.org/abs/1704.06761
[49] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, Feb. 2019. [Online]. Available: http://dx.doi.org/10.1109/TMM.2018.2856090
[50] Y. Zhu, K. Olszewski, Y. Wu, P. Achlioptas, M. Chai, Y. Yan, and S. Tulyakov, “Quantized gan for complex music generation from dance videos,” 2022. [Online]. Available: https://arxiv.org/abs/2204.00604
[51] R. Li, S. Yang, D. A. Ross, and A. Kanazawa, “Ai choreographer: Music conditioned 3d dance generation with aist++,” 2021. [Online]. Available: https://arxiv.org/abs/2101.08779
[52] L. Zhuo, Z. Wang, B. Wang, Y. Liao, C. Bao, S. Peng, S. Han, A. Zhang, F. Fang, and S. Liu, “Video background music generation: Dataset, method and evaluation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 637–15 647.
[53] J. Kang, S. Poria, and D. Herremans, “Video2music: Suitable music generation from videos using an affective multimodal transformer model,” Expert Systems with Applications, vol. 249, p. 123640, Sep. 2024. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2024.123640
[54] S. Li, Y. Qin, M. Zheng, X. Jin, and Y. Liu, “Diff-bgm: A diffusion model for video background music generation,” 2024.
[55] I. Cardoso, R. O. Moraes, and L. N. Ferreira, “The nes video-music database: A dataset of symbolic video game music paired with gameplay videos,” in Proceedings of the 19th International Conference on the Foundations of Digital Games, ser. FDG 2024. ACM, May 2024, pp. 1–6. [Online]. Available: http://dx.doi.org/10.1145/3649921.3650011
[56] R. Solovyev, A. Stempkovskiy, and T. Habruseva, “Benchmarks and leaderboards for sound demixing tasks,” 2023.
[57] T. Giannakopoulos, “pyaudioanalysis: An open-source python library for audio signal analysis,” PloS one, vol. 10, no. 12, p. e0144610, 2015.
[58] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “Panns: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.
[59] J. A. Russell, “A circumplex model of affect.” Journal of personality and social psychology, vol. 39, no. 6, p. 1161, 1980.
[60] H.-T. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and Y.-H. Yang, “Emopia: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation,” arXiv preprint arXiv:2108.01374, 2021.
[61] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
[62] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, “The kinetics human action video dataset,” 2017. [Online]. Available: https://arxiv.org/abs/1705.06950
[63] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
[64] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[65] V. Liu and L. B. Chilton, “Design guidelines for prompt engineering text-to-image generative models,” in Proceedings of the 2022 CHI conference on human factors in computing systems, 2022, pp. 1–23.
[66] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to enhance prompt engineering with chatgpt,” arXiv preprint arXiv:2302.11382, 2023.
[67] S. Doh, K. Choi, J. Lee, and J. Nam, “Lp-musiccaps: Llm-based pseudo music captioning,” arXiv preprint arXiv:2307.16372, 2023.
[68] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.