Video-Guided Text-to-Music Generation Using Public Domain Movie Collections

Author: Haven Kim; Zachary Novack; Weihan Xu; Julian McAuley; Hao-Wen Dong

Publisher: Zenodo

DOI: 10.5281/zenodo.17706500

Source: https://zenodo.org/records/17706500/files/000060.pdf

VIDEO-GUIDED TEXT-TO-MUSIC GENERATION USING PUBLIC
DOMAIN MOVIE COLLECTIONS
Haven Kim1, Zachary Novack1, Weihan Xu2, Julian McAuley1, Hao-Wen Dong3
1University of California San Diego
2Duke University
3University of Michigan
ABSTRACT
Despite recent advancements in music generation systems, their application in film
production remains limited, as they struggle to capture the nuances of real-world
filmmaking, where filmmakers consider multiple factors (such as visual content, dialogue,
and emotional tone) when selecting or composing music for a scene. This limitation
primarily stems from the absence of comprehensive datasets that integrate these elements.
To address this gap, we introduce Open Screen Soundtrack Library (OSSL), a dataset
consisting of movie clips from public domain films, totaling approximately 36.5 hours,
paired with high-quality soundtracks and human-annotated mood information. To demonstrate
the effectiveness of our dataset in improving the performance of pre-trained models on
film music generation tasks, we introduce a new video adapter that enhances an
autoregressive transformer-based text-to-music model by adding video-based conditioning.
Our experimental results demonstrate that our proposed approach effectively enhances
MusicGen-Medium in terms of both objective measures of distributional and paired fidelity,
and subjective compatibility in mood and genre. To facilitate reproducibility and foster
future work, we publicly release the dataset, code, and demo (footnote 1).
1. INTRODUCTION
Music plays a crucial role in films, shaping their artistic quality and influencing their
commercial success [1]. A well-composed soundtrack enhances the emotional depth of a
scene, guiding audience perception and engagement [2–5]. Despite recent advancements in
music generation systems, significant challenges remain in adapting these technologies to
film production, as these systems are not designed to align with the real-world practices
of film music composition, where multiple elements (such as visual content, dialogue, and
emotional tone) are considered together [6]. A major obstacle is the lack of
comprehensive datasets containing movie clips paired with their corresponding soundtracks.
While some existing film datasets are derived from commercial movies, many are no longer
available for download [7–9], and some provide only video embeddings rather than raw movie
clips [10], making it challenging to distinguish segments containing music from those that
do not. The remaining datasets do not include isolated soundtrack stems [11–14]. While we
acknowledge the possibility of constructing a dataset using source-separated music, as
done in previous approaches [15], we find the quality of source-separated music to be
suboptimal, as it occasionally includes incorrect audio such as voice or sound effects,
making it unsuitable for our use case.

Footnote 1: Dataset: https://havenpersona.github.io/ossl-v1
Code: https://github.com/havenpersona/ossl-v1
Demo: https://havenpersona.github.io/demo/ismir2025

© Haven Kim, Zachary Novack, Weihan Xu, Julian McAuley, and Hao-Wen Dong. Licensed under a
Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Haven
Kim, Zachary Novack, Weihan Xu, Julian McAuley, and Hao-Wen Dong, “Video-Guided
Text-to-Music Generation Using Public Domain Movie Collections”, in Proc. of the 26th Int.
Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Another missing element that plays a pivotal role in making film music is mood
information, as dialogues and scenes alone are often insufficient to fully convey
emotions, which explains why scripts typically include not only dialogues but also
separate mood descriptions. However, we find that this aspect remains largely unaddressed
in existing resources, with no publicly available film datasets annotated with mood
information.
To bridge this gap, we introduce the Open Screen Soundtrack Library (OSSL), a dataset
comprising approximately 36.5 hours of movie clips from public domain films, each paired
with its corresponding soundtrack and mood annotations. The dataset is created by
automatically identifying timestamps of musical segments within the films, mapping these
segments to the soundtracks, and then verifying the mappings manually. To demonstrate the
effectiveness of this dataset for enhancing music generation for movie clips, we
incorporate a video adapter into a text-to-music generation model (MusicGen) [16] and
fine-tune it on our dataset. We evaluate our models on both public domain films and
commercial films, which we call OSSL Evaluation Set - Public and Commercial (OES-Pub and
OES-Com), respectively, in order to ensure that our approach can handle various kinds of
data. Our objective and subjective evaluations demonstrate that we successfully extend a
text-to-music generation model, MusicGen-Medium, to handle both textual and video inputs
for enhanced film music generation.
For reproducibility and accessibility, we publicly release our main dataset, OSSL, as well
as the evaluation sets, OES-Pub and OES-Com.
2. RELATED WORK
Audio-Domain Music Generation. Contemporary music generation architectures in the audio
domain predominantly follow two distinct paradigms. The first employs neural codecs to
transform digital audio signals into discrete tokens, enabling transformer-based models
[17] to generate music by learning token distributions from prompts such as text
[16, 18, 19] or video [20–22]. The second paradigm leverages diffusion-based frameworks,
where diffusion models are trained to generate audio signals, conditioned on prompts such
as textual descriptions [23–29] or video frames [30, 31]. Our models are built upon
MusicGen [16], a text-to-music generation framework based on the former approach.
Multimodal Music Generation. Most existing audio-domain music generation models support
purely unimodal conditions, such as text [18, 29], or have limited multimodal support for
text along with audio [16, 18, 32] or abstract musical signals [33–35]. In contrast, work
on conditioning on visual signals is more limited. One example is a transformer-based
model conditioned on text, speech, images, and videos, which generates discrete music
tokens that are later converted into raw waveforms from embeddings obtained from
pre-trained speech, image, and video encoders [36]. An alternative approach involves
converting visual inputs into detailed textual descriptions, and feeding them into a
model. This enables the use of diverse modalities, including text, videos, and images as
inputs [37]. Another recent study trained a video-to-music generation model on a diverse
set of video features, while integrating textual conditioning to enable high-level control
through textual embeddings derived from a pre-trained text encoder [21]. On the other
hand, our approach focuses on integrating a video adapter into a pre-trained text-to-music
generation model in order to build a music generation model conditioned on multimodal
inputs.
Video-Conditioned Music Generation. One of the earliest contributions to video-to-music
generation utilized human pose features extracted by pre-trained models, in order to
generate plausible music for video clips containing individuals playing musical
instruments [38]. In contrast, another early study employed self-defined handcrafted
features to capture relevant video attributes [39]. More recently, the field has shifted
toward leveraging embeddings derived from pre-trained video encoders [20–22, 40], both in
the audio and symbolic domain. This approach has been integrated into our models, enabling
them to perform cross-attention on embeddings derived from a pre-trained video encoder
designed to capture both spatial and temporal video information [41].
Adapter Mechanisms. Adapter mechanisms, which integrate compact and trainable modules into
pre-trained models, were originally introduced to enhance transfer learning efficiency in
natural language processing (NLP) [42]. This approach is commonly employed to adapt
pre-trained backbone models to various tasks [43, 44].

Dataset          Audio  MIDI  Self-Hosted  Mood  Video Content                      Length (Hours)
HIMV-200K [48]   ✓      ✗     ✗            ✗     Music Video, User-Generated Video  -
URMP [49]        ✓      ✓     ✗            ✗     Music Performance                  33.5
TikTok [50]      ✓      ✗     ✗            ✗     Dance Video                        1.5
AIST++ [51]      ✓      ✗     ✓            ✗     3D Dance Motion                    5.2
SymMV [52]       ✓      ✓     ✗            ✗     Music Video                        76.5
MuVi-Sync [53]   ✓      ✓     ✗            ✗     Music Video                        -
BGM909 [54]      ✓      ✓     ✗            ✗     Music Video                        -
NES-VMDB [55]    ✗      ✓     ✗            ✗     Gameplay Video                     474.0
OSSL (Ours)      ✓      ✗     ✓            ✓     Films                              36.5
Table 1: Comparison of video-music datasets available as of June 2025. The proposed OSSL
dataset is the first self-hosted video-music dataset (i.e., without requiring separate
download procedures via YouTube URLs or sharing requests) which includes mood annotations.
Our work is particularly inspired by methods that modify pre-trained models to accommodate
external conditions with new modalities. Notably, prior research has demonstrated the
effectiveness of lightweight adapters in incorporating image prompts into text-to-image
diffusion models [45, 46]. Building on this idea, subsequent work has extended the
approach by using adapters to integrate audio prompts into diffusion-based text-to-music
generation models [47]. On the other hand, our approach attempts to integrate video
prompts into autoregressive transformer-based text-to-music generation models.
3. DATASET CONSTRUCTION
3.1 Open Screen Soundtrack Library (OSSL)
We introduce the Open Screen Soundtrack Library (OSSL), a collection of movie clips with
their corresponding soundtracks and associated metadata, including mood annotations. We
provide an overview of the comparison of video-music datasets in Table 1 and an
illustration of our dataset construction methodology in Figure 1.
Data Collection. We compile a list of public domain films along with metadata (e.g.,
title, release date, and genres) and obtain complete versions from YouTube. To achieve the
highest music quality by obtaining soundtrack stems without unnecessary noise, we download
soundtracks, instead of source-separated music, for each film from YouTube, guided by IMDB
(footnote 2) metadata. The frame rate and resolution of each video are 25 fps and 960x720,
respectively, and the sampling rate of each soundtrack is 44.1 kHz.
Footnote 2: An online database on films. https://www.imdb.com
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1: Illustration of our methodology for constructing the OSSL dataset. We collect
publicly available movies and their soundtracks. We separate movie audio into music and
other elements, and extract music using a silence detection algorithm. An event detection
algorithm ensures the extracted audio is truly music. We then calculate chroma distance to
match clips with their corresponding soundtracks, assigning the closest match. Human
inspection verifies these mappings.
Figure 2: Distribution of release years of the OSSL dataset
Musical Segment Identification. To identify the timestamps of segments with soundtracks
within the films, we employ a pre-trained source separation model [56] trained to
decompose movie audio into three components: music, effect, and dialogue. After source
separation, we apply a silence detection algorithm to music parts using pyAudioAnalysis
[57], with a frame length and step size of 20 ms and a threshold scaling factor of 0.2. We
define a segment as a clip if it is longer than 10 seconds and contains no silence longer
than 1 second.
Following this, we verify the presence of musical content using an event detection
algorithm [58], as the extracted soundtrack may contain non-musical elements owing to the
limitations of current source separation models. Specifically, we apply the event
detection algorithm to each clip and retain only those where the average probability of
containing a musical event exceeds 0.3. (This threshold value is determined empirically by
testing different values across multiple samples.)
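The two filtering rules above (clip length and silence, then average music probability) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes silence intervals have already been detected and are given as (start, end) pairs in seconds, and that an event detector has produced per-frame music probabilities. The function names `split_into_clips` and `is_music` are hypothetical.

```python
def split_into_clips(silences, total_duration, max_silence=1.0, min_clip=10.0):
    """Cut a music track at silences longer than `max_silence` seconds and
    keep only the resulting segments longer than `min_clip` seconds.

    `silences` is an assumed input: a list of (start, end) pairs in seconds.
    """
    clips = []
    cursor = 0.0  # start of the segment currently being built
    for start, end in sorted(silences):
        if end - start > max_silence:
            # A long silence closes the current segment.
            if start - cursor > min_clip:
                clips.append((cursor, start))
            cursor = end
    # Close the final segment at the end of the track.
    if total_duration - cursor > min_clip:
        clips.append((cursor, total_duration))
    return clips


def is_music(frame_probs, threshold=0.3):
    """Keep a clip only if its average music-event probability exceeds the threshold."""
    if not frame_probs:
        return False
    return sum(frame_probs) / len(frame_probs) > threshold
```

Short pauses (1 second or less) stay inside a clip, so a clip may still contain brief silences, which matches the definition given in the text.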
Movie Clips-Soundtracks Mapping. Because a movie typically contains multiple soundtracks,
it was essential to determine which soundtrack each movie clip corresponds to. The most
effective method we found is using chroma similarity (footnote 3). Specifically, we
compare the chroma features of source-separated soundtracks from movie clips with those of
each soundtrack in the film, assigning clips to the soundtrack with the minimum chroma
distance.
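The nearest-soundtrack assignment can be sketched as below. The text does not state which distance is used or how chroma features are aggregated over time, so this sketch makes two assumptions: each audio item is summarized by a single 12-dimensional mean chroma vector, and the distance is cosine distance. `cosine_distance` and `match_clip` are hypothetical helper names.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two chroma vectors (assumed distance measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def match_clip(clip_chroma, soundtrack_chromas):
    """Return the name of the soundtrack whose chroma vector is closest to the clip's."""
    return min(
        soundtrack_chromas,
        key=lambda name: cosine_distance(clip_chroma, soundtrack_chromas[name]),
    )
```

A time-aligned comparison of full chroma sequences would be a natural refinement; the mean-vector version is only the simplest instance of the idea.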
Manual Quality Inspection. The resulting movie clips are manually assessed by human
evaluators, and any clips with incorrect mappings (i.e., when a clip is paired to a wrong
soundtrack) are excluded from our dataset.

Footnote 3: We attempted a fingerprinting approach (https://github.com/worldveil/dejavu)
to test mapping 20 clips from the movie “D.O.A” against 24 soundtracks. However, this
method produced inaccurate mappings in most instances, achieving only one correct
identification out of 20 samples, a failure rate of 95%. Our chroma similarity approach,
on the other hand, accurately mapped 17 clips to their corresponding soundtracks, yielding
a success rate of 85%.

Attributes              OSSL         OES-Pub     OES-Com
Number of samples       736          100         100
Number of unique films  299          76          37
Average clip duration   178.47 sec   30 sec      30 sec
Total duration          36.49 hours  0.83 hours  0.83 hours
Table 2: Statistical overview of OSSL, OES-Pub, and OES-Com
Mood Annotation. The final stage of our dataset construction involves annotating mood
information for each movie clip. We classify mood into four categories based on a
previously suggested taxonomy, Russell's 4Q, where the four classes are HVHA (high
valence, high arousal), HVLA (high valence, low arousal), LVHA (low valence, high
arousal), and LVLA (low valence, low arousal) [59, 60].
We employ two human annotators. We first provide them with a brief explanation of the
concepts of valence and arousal in the context of music and then ask them to independently
annotate the movie clips using the four mood categories. In the majority of cases (89.9%
of samples), both annotators assign the same label to a movie clip. When agreement occurs,
we retain the assigned annotation. If they disagree, they discuss their choices until they
reach a consensus. As a result, we obtain 276 movie clips classified as happy, 30 as sad,
315 as nervous, and 115 as peaceful.
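The relation between the four quadrants and the mood words, together with a raw agreement rate between two annotators, can be written down compactly. This is an illustrative sketch: the quadrant-to-word mapping follows the definitions given in the paper's prompt design (happy = HVHA, peaceful = HVLA, nervous = LVHA, sad = LVLA), while `agreement_rate` is a hypothetical helper and simple percent agreement is an assumption about how the 89.9% figure is defined.

```python
# Russell's 4Q quadrants keyed by (valence, arousal).
MOOD_BY_QUADRANT = {
    ("high", "high"): "happy",     # HVHA
    ("high", "low"): "peaceful",   # HVLA
    ("low", "high"): "nervous",    # LVHA
    ("low", "low"): "sad",         # LVLA
}

# Label counts reported for OSSL.
OSSL_MOOD_COUNTS = {"happy": 276, "sad": 30, "nervous": 315, "peaceful": 115}


def agreement_rate(labels_a, labels_b):
    """Fraction of clips on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotation lists must be non-empty and of equal length")
    same = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return same / len(labels_a)
```

The four counts add up to the 736 samples listed for OSSL in Table 2, and the distribution is clearly imbalanced, with sad clips making up only about 4% of the data.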
3.2 OSSL Evaluation Set (OES)
We evaluate our models on two distinct datasets: OSSL Evaluation Set-Public (OES-Pub),
comprising movie clips from public domain films that are not included in OSSL, and OSSL
Evaluation Set-Commercial (OES-Com), consisting of movie clips from commercial films. The
evaluation on OES-Pub serves to confirm that our models have effectively learned to
generalize from this data distribution. On the other hand, the evaluation on OES-Com shows
whether these models can generalize to contemporary commercial films. Notably, OES-Com
incorporates a large proportion (89%) of clips from films released within the past year.
Finally, we annotate each movie clip with mood information in the same way as we did for
constructing OSSL ([MOOD]).
We present detailed statistical information for OSSL, OES-Pub, and OES-Com in Table 2.

Figure 3: Extending MusicGen: We introduce a video adapter, which applies cross-attention
to linearly transformed video embeddings, scaled by α, and integrates it into the original
cross-attention mechanism. A fire icon denotes trainable components, while a snowflake
icon indicates frozen components.
4. MODEL ARCHITECTURE
In this section, we present our methodology for integrating a video adapter into an
existing text-to-music generation model, MusicGen [16], along with its illustration in
Figure 3.
MusicGen [16] is an autoregressive transformer [17]-based model that generates discrete
tokens from textual prompts, which are subsequently converted into audio signals by a
neural codec [61]. Each attention head in the original model's cross-attention modules is
initially defined as:
$$\mathrm{head}_i = \mathrm{Attention}\big(x W_i^{(q)},\ z_t W_i^{(k)},\ z_t W_i^{(v)}\big) \quad (1)$$
In this formulation, $x$ represents the decoder's current hidden states, $z_t$ is the text
encoder's output, and $i$ is the index of an attention head.
To implement our video adapter, we leverage a pre-trained video encoder. Specifically, we
choose ViViT (footnote 4) [41], a transformer-based video understanding model pretrained
on a large-scale dataset [62], which provides a strong foundation for our task of film
video understanding. We first obtain the video embeddings ($z_v \in \mathbb{R}^{n}$) using
it. Subsequently, we apply an affine linear transformation
$X \in \mathbb{R}^{m \times n}$ to adjust the dimension of video embeddings from the
model's original dimension $n$ to the dimension $m$ that is compatible with our text
embeddings ($\tilde{z}_v = X z_v$).
We modify the cross-attention layer, which originally processes single modalities, to
incorporate video embeddings, thereby enabling the model to attend to multimodal contexts.
Specifically, we augment the original cross-attention mechanism with a video-conditioned
component, where each attention head computes its output by leveraging both text and video
modalities.
$$\mathrm{head}_i = \mathrm{Attention}\big(x W_i^{(q)},\ z_t W_i^{(k)},\ z_t W_i^{(v)}\big) + \alpha \times \mathrm{Attention}\big(x \tilde{W}_i^{(q)},\ \tilde{z}_v \tilde{W}_i^{(k)},\ \tilde{z}_v \tilde{W}_i^{(v)}\big) \quad (2)$$
where $x$ represents the decoder's current hidden states, $z_t$ is the text encoder's
output, $\tilde{z}_v$ represents the video embeddings with adjusted dimensions, and
$\alpha$ is a trainable parameter.

Footnote 4: google/vivit-b-16x2-kinetics400
During training, only the newly introduced parameters $\tilde{W}_i^{(q)}$,
$\tilde{W}_i^{(k)}$, $\tilde{W}_i^{(v)}$, $\alpha$, and $X$ are optimized after random
initialization, while all other components of the model remain frozen.
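To make the adapter's arithmetic concrete, the following is a small, dependency-free sketch of a single attention head that adds a video branch, scaled by alpha, to the text branch. It is not the released implementation: it assumes standard scaled dot-product attention, represents matrices as nested lists, takes video embeddings that have already been projected to the text dimension, and uses hypothetical names (`adapter_head`, `W`, `W_tilde`).

```python
import math


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def attention(q, k, v):
    """Scaled dot-product attention (assumed form of Attention in the equations)."""
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def adapter_head(x, z_t, z_v_tilde, W, W_tilde, alpha):
    """Text cross-attention plus alpha times video cross-attention for one head.

    W holds the frozen projections and W_tilde the trainable adapter
    projections, each a dict with keys "q", "k", "v".
    """
    text = attention(matmul(x, W["q"]), matmul(z_t, W["k"]), matmul(z_t, W["v"]))
    video = attention(matmul(x, W_tilde["q"]),
                      matmul(z_v_tilde, W_tilde["k"]),
                      matmul(z_v_tilde, W_tilde["v"]))
    return [[t + alpha * u for t, u in zip(t_row, u_row)]
            for t_row, u_row in zip(text, video)]
```

With alpha set to zero the head reduces exactly to the original text-only cross-attention, which is why an additive branch of this kind leaves the frozen model's behavior recoverable.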
5. EXPERIMENTAL SETTING
5.1 Comparison Models
We fine-tune MusicGen-Small and MusicGen-Medium with video adapters (footnote 5), as
described in the previous section. Here, S-MULTI and M-MULTI denote these models, where
“S” and “M” stand for “Small” and “Medium,” respectively.
As baselines, we use the original MusicGen-Small and MusicGen-Medium models, which
generate results solely based on text prompts. We denote these models as S-BASE and
M-BASE, respectively.
To assess the effectiveness of video adapters, we also compare them against models
fine-tuned on our dataset using only textual prompts (i.e., without video adapters). These
models are denoted as S-TEXT and M-TEXT. Unlike S-MULTI and M-MULTI, where existing
parameters are frozen and new parameters are trained, S-TEXT and M-TEXT do not incorporate
new parameters. Instead, we apply Low-Rank Adaptation [63] when fine-tuning these models.
5.2 Dataset Preprocessing
Since the default maximum length of MusicGen is 30 seconds, we train our models to
generate only the first 30 seconds of music for the first 30 seconds of each video clip.
We use the beginning of each clip because we detected the start time of each segment, so
this choice prevents generating from the middle of a track. Because MusicGen produces raw
audio at a sampling rate of 32,000 Hz, all soundtracks are resampled to this rate before
training. Additionally, we normalize all soundtracks so that their maximum amplitude is
always one.
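The trimming and peak-normalization steps can be sketched on a plain list of samples as below. Resampling to 32,000 Hz is deliberately left out, since it would normally be delegated to an audio library; the sketch assumes the input is already at the target rate. The function names are hypothetical.

```python
def first_seconds(samples, sample_rate, seconds=30):
    """Keep only the first `seconds` seconds of a mono signal."""
    return samples[: int(sample_rate * seconds)]


def peak_normalize(samples):
    """Scale a signal so that its maximum absolute amplitude is one."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input is returned unchanged
    return [s / peak for s in samples]
```

For example, `peak_normalize(first_seconds(audio, 32000))` would yield at most 960,000 samples with a peak magnitude of one, where `audio` stands for any already-resampled mono signal.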
5.3 Text Prompts Design
Prompts have been suggested to be a significant factor influencing the outcomes of
generative models [64–66]. In our experiments, text prompts serve not only as primary
information for fine-tuning text-based models (S-TEXT and M-TEXT) but also as a basis for
inference in baseline models (S-BASE and M-BASE). To ensure a fair comparison, therefore,
we carefully design them to fully leverage the capabilities of text-to-music generation
models.
Footnote 5: Due to computational constraints, we did not conduct experiments on
MusicGen-Large.
We experiment with three types of textual information: (1) mood annotations from our
dataset, (2) movie genre labels (e.g., thriller, romance), and (3) LLM-generated music
descriptions obtained using an open-source music captioning model [67]. We observe that
while both mood annotations and LLM-generated music captions subjectively improve
generation quality, adding genre information often has a negative impact. For instance,
when using the prompt, “A film soundtrack for a peaceful scene in a thriller movie,” our
baseline models, S-BASE and M-BASE, overemphasize the keyword “thriller,” producing music
unsuitable for peaceful scenes. As a result, our final structured template includes only
mood information and LLM-generated music captions: “A film soundtrack for a [MOOD] scene.
[CAPTION].” Here, [MOOD] is a natural language description of our Russell's 4Q-based mood
annotation, which is one of happy (high valence, high arousal), sad (low valence, low
arousal), nervous (low valence, high arousal), or peaceful (high valence, low arousal). To
generate music captions ([CAPTION]), we extract the first 30 seconds of each soundtrack in
our dataset, divide it into three 10-second segments, and generate a caption for each
using a music captioning model [67], as we recognize that film music evolves over time. We
then use a commercial LLM, specifically Claude 3.5 Sonnet, to summarize these captions
with the prompt: “Summarize the description of each song in one sentence from 0 to 30
seconds.” Consequently, the resulting captions concisely describe how the music changes
over time, providing a brief musical summary (e.g., ‘The piece transitions from cembalo to
marimba, concluding with a Tibetan singing bowl and animal sounds’).
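Filling the structured template is a simple string operation, sketched below. The template wording and the four mood words come from the text; the function name `build_prompt` and the handling of trailing periods are assumptions made for this illustration.

```python
MOODS = ("happy", "sad", "nervous", "peaceful")


def build_prompt(mood, caption):
    """Fill the template "A film soundtrack for a [MOOD] scene. [CAPTION]."."""
    if mood not in MOODS:
        raise ValueError(f"unknown mood: {mood!r}")
    # Drop surrounding whitespace and any trailing period so that the
    # template's own final period is not doubled.
    body = caption.strip().rstrip(".")
    return f"A film soundtrack for a {mood} scene. {body}."
```

Calling `build_prompt("peaceful", "The piece transitions from cembalo to marimba.")` produces a two-sentence prompt with the mood stated first and the musical summary second.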
When designing text prompts for OES-Pub and OES-Com, we use source-separated music to
generate [CAPTION], unlike OSSL, which features original soundtrack stems. This results in
the music captioning model frequently describing the audio as having poor recording
quality. To mitigate this, we explicitly instruct Claude 3.5 Sonnet to exclude any mention
of audio quality when summarizing the captions for OES-Pub and OES-Com.
5.4 Training Details
Our models are trained using the AdamW optimizer [68] ($\beta_1 = 0.9$, $\beta_2 = 0.999$)
with a weight decay of $1 \times 10^{-2}$. The initial learning rate is set to
$1 \times 10^{-4}$ and scheduled using a cosine annealing strategy [69] with a linear
warm-up phase. The OSSL dataset is split into training and validation sets in a 9:1 ratio.
To prevent overfitting, we stop training a model when the model does not improve for three
epochs during the validation stage. Due to computational constraints, the batch size is
set to 1. Training is conducted on a single NVIDIA A6000 GPU.
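The learning-rate schedule and the stopping rule can be sketched as two small functions. The base learning rate of 1e-4 and the patience of three epochs come from the text; the number of warm-up steps, the total number of steps, a minimum learning rate of zero, and the use of validation loss as the monitored quantity are all assumptions of this sketch.

```python
import math


def learning_rate(step, total_steps, warmup_steps, base_lr=1e-4, min_lr=0.0):
    """Linear warm-up to `base_lr`, followed by cosine annealing to `min_lr`."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def should_stop(val_losses, patience=3):
    """True once the best validation loss is at least `patience` epochs old."""
    if len(val_losses) <= patience:
        return False
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience
```

With a batch size of 1, one optimizer step corresponds to one training clip, so the step counts in such a schedule would be on the order of the number of training clips times the number of epochs.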
5.5 Objective Evaluation Metrics
Distributional Fidelity. As is common in most audio-domain text-to-music (TTM) research
[70, 71], we assess the quality of our outputs at the distributional level by comparing
generations from our model against a high-quality reference set, here comprised of 5K
commercial soundtracks. To do so, we first extract embeddings from our generated audio for
each method and from the reference set using CLAP [72]. With these embeddings, we then
compute Fréchet Audio Distance (FAD) [73] and Precision [74]. FAD compares distributional
distance by fitting a high-dimensional Gaussian to each dataset and measuring the Fréchet
distance between them, while Precision uses a k-NN estimate of the reference set's
distribution and measures how many generated samples lie in the estimated manifold.
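For reference, the Fréchet distance between two Gaussians has a well-known closed form, shown below. The symbols are introduced here for illustration and do not appear in the paper: $\mu_r, \Sigma_r$ denote the mean and covariance of the reference embeddings, and $\mu_g, \Sigma_g$ those of the generated embeddings.

```latex
\mathrm{FAD} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
\;+\; \operatorname{Tr}\!\Big( \Sigma_r + \Sigma_g - 2\,\big(\Sigma_r \Sigma_g\big)^{1/2} \Big)
```

The first term penalizes a shift between the two embedding clouds, and the trace term penalizes a mismatch in their spread, so the distance is zero only when both means and covariances coincide.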
Paired Fidelity. As our evaluation sets, OES-Pub and OES-Com, also contain paired
video-music data in the form of source-separated musical tracks from each movie clip, we
are also able to directly assess how well our models recreate the reference music on a
paired sample-by-sample basis. Specifically, we measure the CLAP Audio Similarity and
Kullback-Leibler (KL) Divergence between the generated music and the reference music. The
CLAP Audio Similarity is calculated as the cosine similarity between the CLAP embeddings
of the generated and reference samples for each video clip. The KL divergence is
calculated using the estimated distributions of the generated and reference samples with
the PaSST audio classifier [75].
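Both paired measures reduce to short formulas once embeddings and class probabilities are available. The sketch below assumes they are given as plain lists; the direction of the KL divergence (reference relative to generated) and the small epsilon used for numerical stability are assumptions, since the text does not specify them.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two class-probability vectors.

    Here p is taken to be the classifier output for the reference audio and
    q the output for the generated audio (an assumed direction).
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)
```

Higher similarity and lower divergence both indicate that the generated clip is closer to the reference music for the same video.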
Sample Diversity. To evaluate the diversity of the generated samples, we employ Recall
[74], following prior work in music generation [70, 71]. Using the same embedding model
(CLAP) and reference/generated datasets from Distributional Fidelity, we use a k-NN
estimate of each generated distribution and calculate the fraction of real samples that
lie in the generated manifold.
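Precision and Recall as described above are mirror images of each other, which the following sketch makes explicit. It uses a common k-NN manifold formulation in which a point counts as inside a set's manifold if it falls within the k-th-nearest-neighbor radius of at least one point of that set. The value of k, the Euclidean distance, and the function names are assumptions of this sketch.

```python
import math


def kth_neighbor_radii(points, k):
    """Distance from each point to its k-th nearest neighbor within the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii


def coverage(queries, support, k=3):
    """Fraction of `queries` that fall inside the k-NN manifold of `support`."""
    radii = kth_neighbor_radii(support, k)
    inside = sum(
        1 for q in queries
        if any(math.dist(q, s) <= r for s, r in zip(support, radii))
    )
    return inside / len(queries)


def precision(generated, reference, k=3):
    """Share of generated samples lying in the reference manifold."""
    return coverage(generated, reference, k)


def recall(generated, reference, k=3):
    """Share of reference samples lying in the generated manifold."""
    return coverage(reference, generated, k)
```

One known caveat of this estimator is that a single outlier in the support set gets a very large radius and can inflate the score, which is worth keeping in mind when the evaluation sets contain only 100 clips each.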
5.6 Subjective Survey
To subjectively evaluate our models, we first select 10 representative samples from our
evaluation set, OES-Com. Specifically, OES-Com is divided into 10 clusters using k-means
clustering, based on the CLAP embeddings of source-separated music, and we select the
sample closest to the centroid of each cluster. Using these 10 samples, we design a survey
in the form of a website and distribute it within our social network, recruiting 15
participants.
The survey procedure is structured as follows: Each participant is randomly assigned 2 out
of the 10 samples. First, participants are required to watch the original version of the
first movie clip. This step serves two purposes: to familiarize them with the atmosphere
of the clip and to standardize the experience of participants regardless of prior exposure
to the movie. After viewing the original clip, participants rate the inference results of
each model (S-BASE, S-TEXT, S-MULTI, M-BASE, M-TEXT, and M-MULTI) for the corresponding
clip, with the presentation order of the models randomized by the website. Subsequently,
participants repeat the process for the second assigned movie clip, watching its original
version and rating the inference results of each model, again in randomized order.
The evaluation assesses each model across three dimensions (genre, mood, and audio
quality) using a 10-point Likert scale. For genre, participants rate how cinematic the
AI-generated music sounds (1: not cinematic at all, 10: very cinematic). For mood,
participants evaluate the compatibility of the music with the emotional tone of the movie
clip (1: not compatible at all, 10: very compatible). For audio quality, participants
provide a subjective assessment of the audio quality (1: very poor quality, 10: very high
quality).

         OSSL        Video       Objective                                                                                 Subjective (Human Ratings)
         Fine-tuned  Adapter     FAD ↓         Precision ↑   Similarity ↑  KL ↓        Recall ↑      Mood ↑       Genre ↑      Quality ↑
Model                Integrated  pub    com    pub    com    pub    com    pub   com   pub    com    avg ± CI     avg ± CI     avg ± CI
S-BASE   ✗           ✗           64.91  77.99  22.00  14.00  41.55  34.77  1.20  1.93  4.78   6.20   4.53 ± 0.91  5.27 ± 1.04  5.67 ± 0.99
S-TEXT   ✓           ✗           61.98  75.75  20.00  14.00  42.44  34.81  1.13  1.97  9.90   19.00  5.77 ± 1.73  6.00 ± 1.11  6.20 ± 0.76
S-MULTI  ✓           ✓           64.39  75.59  16.00  14.00  43.36  36.09  1.15  1.98  7.50   3.30   4.93 ± 0.99  5.87 ± 1.00  6.10 ± 0.91
M-BASE   ✗           ✗           60.91  76.79  21.00  12.00  43.61  34.39  1.06  1.88  8.04   8.46   5.13 ± 0.94  5.60 ± 1.12  6.20 ± 0.84
M-TEXT   ✓           ✗           61.15  77.79  24.00  17.00  45.31  33.72  1.04  1.90  7.28   13.78  5.20 ± 0.99  6.03 ± 0.98  6.00 ± 1.02
M-MULTI  ✓           ✓           59.51  73.26  25.00  21.00  45.31  36.25  1.00  1.81  9.96   8.72   6.20 ± 1.05  6.70 ± 1.06  7.07 ± 0.93
Table 3: Evaluation results. Objective metrics include FAD, Precision, CLAP Audio
Similarity (Similarity), KL Divergence, and Recall. Subjective metrics include human
ratings for mood, genre, and audio quality. pub and com indicate results on OES-Pub and
OES-Com, respectively; avg ± CI refers to average values and 95% confidence intervals.

In the following section, we present the average rating values and 95% confidence
intervals for each model, based on each of the three criteria.
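An average rating with a 95% confidence interval can be computed as sketched below. The paper does not say how its intervals were derived, so this sketch assumes the usual normal approximation (mean plus or minus 1.96 standard errors, with the sample standard deviation); a t-based interval would be slightly wider for small samples. The function name `mean_ci95` is hypothetical.

```python
import math
import statistics


def mean_ci95(ratings, z=1.96):
    """Return (mean, half_width) of a normal-approximation 95% confidence interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate a confidence interval")
    mean = statistics.mean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * std_err
```

Since 15 participants each rate 2 clips, every model receives about 30 ratings per criterion, which is consistent with the fairly wide intervals (roughly one point on a 10-point scale) reported in Table 3.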
6. RESULTS AND ANALYSIS
We present our comprehensive evaluation results in Table 3.
Distributional Fidelity. Our evaluation on OES-Com reveals that fine-tuning on OSSL
enhances distributional fidelity. Specifically, S-TEXT and S-MULTI achieve lower FAD
scores compared to S-BASE while M-TEXT and M-MULTI exhibit higher Precision scores
relative to M-BASE, indicating closer alignment with the target distribution. In contrast,
the results on OES-Pub show no significant improvement in distributional fidelity.
Although M-TEXT and M-MULTI demonstrate increased Precision scores compared to M-BASE,
S-TEXT and S-MULTI actually experience a decrease in Precision when evaluated on OES-Pub.
This discrepancy is expected, as distributional fidelity is calculated using reference
embeddings from commercial movie soundtracks. Encouragingly, the ability to improve
performance on commercial soundtracks by training on public domain data suggests effective
feature transfer across domains. In particular, M-MULTI achieved the lowest FAD scores and
highest Precision scores on both evaluation sets, demonstrating its outstanding
performance in distributional fidelity.
Paired Fidelity. Fine-tuning on OSSL consistently increases the CLAP audio similarity
between reference and generated music across both OES-Pub and OES-Com datasets.
Additionally, training reduces KL divergence when evaluated on OES-Pub, indicating
improved alignment in classifier predictions. However, this improvement is less pronounced
on OES-Com, suggesting some limitations in generalizing to commercial soundtracks.
Notably, M-MULTI stands out by achieving significantly lower KL scores on both datasets,
highlighting its superior performance in paired fidelity.
Sample Diversity. Training on OSSL generally results in a slight increase in sample
diversity relative to the base models, as indicated by Recall scores when evaluated on
OES-Pub, with the exception of M-TEXT, which shows a minor decrease. We observe
significantly higher Recall scores for S-TEXT and M-TEXT on OES-Com, despite the use of
commercial soundtracks as reference tracks. This finding counters initial concerns about
overfitting to patterns unique to public domain data, demonstrating that fine-tuning can
enhance diversity even on out-of-domain references. However, S-MULTI exhibits a notable
drop in diversity on OES-Com, likely due to the added complexity of learning from both
text and video inputs. In contrast, the larger M-MULTI model avoids this issue and even
shows slight increases in Recall scores on both datasets relative to the baseline,
suggesting that greater model capacity helps mitigate the challenges of multimodal
training.
Human Ratings. Human ratings for mood, genre, and quality exhibit wide 95% confidence
intervals, reflecting considerable variability in subjective assessments. Despite this,
models fine-tuned on OSSL generally receive higher average ratings for mood and genre
fidelity compared to their baseline counterparts, indicating better alignment with human
expectations in these dimensions. However, training has minimal impact on perceptual
quality. We observe an interesting trend with video adapters: integrating them into
smaller models (S-MULTI) slightly lowers average ratings across all metrics relative to
the text-only fine-tuned model (S-TEXT), whereas in medium-sized models (M-MULTI), they
enhance performance in mood, genre, and quality. This indicates that the benefits of video
integration may be contingent on sufficient model capacity.
7. CONCLUSION AND FUTURE WORK
In this paper, we introduced the Open Screen Soundtrack Library (OSSL), a dataset
comprising movie clips, corresponding soundtracks, and mood annotations. To show the
effectiveness of our dataset, we adapted a text-to-music generation model with video
conditions and fine-tuned it on our dataset. We conducted evaluations on both public
domain and commercial films, and the results show the effectiveness of our dataset and
architecture when applied to medium-sized models. However, due to the limited number of
participants, the subjective evaluation requires further validation. Despite this, we
believe that OSSL will advance film music research within ethical boundaries and that our
experiments on the proposed methods offer insightful observations for integrating video
modalities into existing text-to-music generation models.
8. ETHICS STATEMENT
Our research adheres to ethical principles by ensuring that the construction of Open
Screen Soundtrack Library (OSSL) and training methodologies are based solely on
copyright-free materials. By publicly releasing our dataset, we aim to promote ethical
research practices and encourage the broader community to utilize copyright-free data for
model training, fostering transparency and responsible AI development.
For evaluation, we used commercial movie clips strictly for scientific purposes, with no
intent to infringe upon the rights of content producers. We believe our use falls within
ethical and fair-use guidelines. However, to respect copyright laws, we will not release
video clips of the OES-Com, and instead provide YouTube URLs of video clips for
reproducibility.
9. REFERENCES
[1]
B. Mille , J. Cha ah, and S. Ahn, “Sound ack design:
The impac o music on isual a en ion and a ec i e
esponses,” Applied e gonomics, ol. 93, p. 103301,
2021.
[2]
H. T. P. Thao, D. He emans, and G. Roig, “Mul imodal
deep models o p edic ing a ec i e esponses e oked
by mo ies.” in ICCV Wo kshops, 2019, pp. 1618–1627.
[3]
M. Won, J. Salamon, N. J. B yan, G. J. Myso e, and
X. Se a, “Emo ion embedding spaces o ma ching mu-
sic o s o ies,” a Xi p ep in a Xi :2111.13468, 2021.
[4]
H. T. P. Thao, B. Balamu ali, G. Roig, and D. He e-
mans, “A enda ec ne –emo ion p edic ion o mo ie
iewe s using mul imodal usion wi h sel -a en ion,”
Senso s, ol. 21, no. 24, p. 8356, 2021.
[5]
P. Chua, D. Mak is, D. He emans, G. Roig, and
K. Ag es, “P edic ing emo ion om music ideos: ex-
plo ing he ela i e con ibu ion o isual and audi-
o y in o ma ion o a ec i e esponses,” a Xi p ep in
a Xi :2202.10453, 2022.
[6]
K. Xu, “Analysis o he oles o ilm sound acks in
ilms,” in 2022 In e na ional Con e ence on Comp e-
hensi e A and Cul u al Communica ion (CACC 2022).
A lan is P ess, 2022, pp. 351–355.
[7]
M. Ma szalek, I. Lap e , and C. Schmid, “Ac ions in
con ex ,” in 2009 IEEE Con e ence on Compu e Vision
and Pa e n Recogni ion. IEEE, 2009, pp. 2929–2936.
[8]
M. Tapaswi, Y. Zhu, R. S ie elhagen, A. To alba, R. U -
asun, and S. Fidle , “Mo ieqa: Unde s anding s o ies
in mo ies h ough ques ion-answe ing,” in P oceedings
o he IEEE con e ence on compu e ision and pa e n
ecogni ion, 2016, pp. 4631–4640.
[9]
Q. Huang, Y. Xiong, A. Rao, J. Wang, and D. Lin,
“Mo iene : A holis ic da ase o mo ie unde s and-
ing,” in Compu e Vision–ECCV 2020: 16 h Eu opean
Con e ence, Glasgow, UK, Augus 23–28, 2020, P o-
ceedings, Pa IV 16. Sp inge , 2020, pp. 709–727.
[10]
M. Soldan, A. Pa do, J. L. Alcáza , F. Caba, C. Zhao,
S. Giancola, and B. Ghanem, “Mad: A scalable da ase
o language g ounding in ideos om mo ie audio
desc ip ions,” in P oceedings o he IEEE/CVF Con-
e ence on Compu e Vision and Pa e n Recogni ion,
2022, pp. 5026–5035.
[11]
A. Roh bach, M. Roh bach, N. Tandon, and B. Schiele,
“A da ase o mo ie desc ip ion,” in P oceedings o
he IEEE con e ence on compu e ision and pa e n
ecogni ion, 2015, pp. 3202–3212.
[12]
P. Vicol, M. Tapaswi, L. Cas ejon, and S. Fidle ,
“Mo ieg aphs: Towa ds unde s anding human-cen ic
si ua ions om ideos,” in P oceedings o he IEEE
con e ence on compu e ision and pa e n ecogni ion,
2018, pp. 8581–8590.
[13]
K. Cu is, G. Awad, S. Rajpu , and I. Sobo o , “Hl u:
A new challenge o es deep unde s anding o mo ies
he way humans do,” in P oceedings o he 2020 In e -
na ional Con e ence on Mul imedia Re ie al, 2020, pp.
355–361.
[14]
M. Bain, A. Nag ani, A. B own, and A. Zisse man,
“Condensed mo ies: S o y based e ie al wi h con ex-
ual embeddings,” in P oceedings o he Asian Con e -
ence on Compu e Vision, 2020.
[15]
W. Xu, P. P. Liang, H. Kim, J. McAuley, T. Be g-
Ki kpa ick, and H.-W. Dong, “Tease gen: Gene a ing
ease s o long documen a ies,” 2024. [Online].
A ailable: h ps://a xi .o g/abs/2410.05586
[16]
J. Cope , F. K euk, I. Ga , T. Remez, D. Kan , G. Syn-
nae e, Y. Adi, and A. Dé ossez, “Simple and con ol-
lable music gene a ion,” Ad ances in Neu al In o ma-
ion P ocessing Sys ems, ol. 36, 2024.
[17]
A. Vaswani, “A en ion is all you need,” Ad ances in
Neu al In o ma ion P ocessing Sys ems, 2017.
[18]
A. Agos inelli, T. I. Denk, Z. Bo sos, J. Engel,
M. Ve ze i, A. Caillon, Q. Huang, A. Jansen,
A. Robe s, M. Tagliasacchi e al., “Musiclm: Gene a -
ing music om ex ,” a Xi p ep in a Xi :2301.11325,
2023.
[19]
Y.-H. Lan, W.-Y. Hsiao, H.-C. Cheng, and Y.-H.
Yang, “Musicongen: Rhy hm and cho d con ol o
ans o me -based ex - o-music gene a ion,” a Xi
p ep in a Xi :2407.15060, 2024.
[20]
Z. Tian, Z. Liu, R. Yuan, J. Pan, Q. Liu, X. Tan, Q. Chen,
W. Xue, and Y. Guo, “Vidmuse: A simple ideo- o-
music gene a ion amewo k wi h long-sho - e m mod-
eling,” a Xi p ep in a Xi :2406.04321, 2024.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
524
[21]
K. Su, J. Y. Li, Q. Huang, D. Kuzmin, J. Lee, C. Don-
ahue, F. Sha, A. Jansen, Y. Wang, M. Ve ze i e al.,
“V2meow: Meowing o he isual bea ia ideo- o-
music gene a ion,” in P oceedings o he AAAI Con e -
ence on A i icial In elligence, ol. 38, no. 5, 2024, pp.
4952–4960.
[22]
H. Zuo, W. You, J. Wu, S. Ren, P. Chen, M. Zhou,
Y. Lu, and L. Sun, “G mgen: A gene al ideo- o-music
gene a ion model wi h hie a chical a en ions,” a Xi
p ep in a Xi :2501.09972, 2025.
[23]
S. Fo sg en and H. Ma i os, “Ri usion-s able di usion
o eal- ime music gene a ion,” URL h ps:// i usion.
com, 2022.
[24]
Q. Huang, D. S. Pa k, T. Wang, T. I. Denk, A. Ly,
N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. F ank e al.,
“Noise2music: Tex -condi ioned music gene a ion wi h
di usion models,” a Xi p ep in a Xi :2302.03917,
2023.
[25]
F. Schneide , O. Kamal, Z. Jin, and B. Schölkop ,
“Mo
ˆ usai: Tex - o-music gene a ion wi h long-con ex
la en di usion,” a Xi p ep in a Xi :2301.11757,
2023.
[26]
M. W. Lam, Q. Tian, T. Li, Z. Yin, S. Feng, M. Tu,
Y. Ji, R. Xia, M. Ma, X. Song e al., “E icien neu-
al music gene a ion,” Ad ances in Neu al In o ma ion
P ocessing Sys ems, ol. 36, pp. 17 450–17 463, 2023.
[27]
J. Melecho sky, Z. Guo, D. Ghosal, N. Majumde ,
D. He emans, and S. Po ia, “Mus ango: Towa d
con ollable ex - o-music gene a ion,” a Xi p ep in
a Xi :2311.08355, 2023.
[28]
T. Ka chkhadze, M. R. Izadi, K. Chen, G. Assayag, and
S. Dubno , “Mul i- ack musicldm: Towa ds e sa ile
music gene a ion wi h la en di usion model,” a Xi
p ep in a Xi :2409.02845, 2024.
[29]
Z. E ans, J. D. Pa ke , C. Ca , Z. Zukowski, J. Tay-
lo , and J. Pons, “S able audio open,” a Xi p ep in
a Xi :2407.14358, 2024.
[30]
Y.-B. Lin, Y. Tian, L. Yang, G. Be asius, and
H. Wang, “Vmas: Video- o-music gene a ion ia se-
man ic alignmen in web music ideos,” a Xi p ep in
a Xi :2409.07450, 2024.
[31]
R. Li, S. Zheng, X. Cheng, Z. Zhang, S. Ji, and Z. Zhao,
“Mu i: Video- o-music gene a ion wi h seman ic align-
men and hy hmic synch oniza ion,” a Xi p ep in
a Xi :2410.12957, 2024.
[32]
O. Tal, A. Zi , I. Ga , F. K euk, and Y. Adi,
“Join audio and symbolic condi ioning o empo ally
con olled ex - o-music gene a ion,” a Xi p ep in
a Xi :2406.10970, 2024.
[33]
S.-L. Wu, C. Donahue, S. Wa anabe, and N. J. B yan,
“Music con olne : Mul iple ime- a ying con ols o
music gene a ion,” 2023.
[34]
Z. No ack, J. McAuley, T. Be g-Ki kpa ick, and
N. J. B yan, “DITTO: Di usion in e ence- ime -
op imiza ion o music gene a ion,” in In e na ional
Con e ence on Machine Lea ning (ICML), 2024.
[35]
——, “DITTO-2: Dis illed di usion in e ence- ime
-op imiza ion o music gene a ion,” in In e na ional
Socie y o Music In o ma ion Re ie al (ISMIR), 2024.
[36]
S. Liu, A. S. Hussain, Q. Wu, C. Sun, and Y. Shan,
“Mumu-llama: Mul i-modal music unde s anding and
gene a ion ia la ge language models,” a Xi p ep in
a Xi :2412.06660, 2024.
[37]
B. Wang, L. Zhuo, Z. Wang, C. Bao, W. Chengjing,
X. Nie, J. Dai, J. Han, Y. Liao, and S. Liu, “Mul imodal
music gene a ion wi h explici b idges and e ie al aug-
men a ion,” a Xi p ep in a Xi :2412.09428, 2024.
[38]
C. Gan, D. Huang, P. Chen, J. B. Tenenbaum, and
A. To alba, “Foley music: Lea ning o gene a e music
om ideos,” in Compu e Vision–ECCV 2020: 16 h
Eu opean Con e ence, Glasgow, UK, Augus 23–28,
2020, P oceedings, Pa XI 16. Sp inge , 2020, pp.
758–775.
[39]
S. Di, Z. Jiang, S. Liu, Z. Wang, L. Zhu, Z. He, H. Liu,
and S. Yan, “Video backg ound music gene a ion wi h
con ollable music ans o me ,” in P oceedings o he
29 h ACM In e na ional Con e ence on Mul imedia,
2021, pp. 2037–2045.
[40]
L. Zhuo, Z. Wang, B. Wang, Y. Liao, C. Bao, S. Peng,
S. Han, A. Zhang, F. Fang, and S. Liu, “Video back-
g ound music gene a ion: Da ase , me hod and e alu-
a ion,” in P oceedings o he IEEE/CVF In e na ional
Con e ence on Compu e Vision, 2023, pp. 15 637–
15 647.
[41]
A. A nab, M. Dehghani, G. Heigold, C. Sun, M. Luˇ
ci´
c,
and C. Schmid, “Vi i : A ideo ision ans o me ,” in
P oceedings o he IEEE/CVF in e na ional con e ence
on compu e ision, 2021, pp. 6836–6846.
[42]
N. Houlsby, A. Giu giu, S. Jas zebski, B. Mo one,
Q. De La oussilhe, A. Gesmundo, M. A a iyan, and
S. Gelly, “Pa ame e -e icien ans e lea ning o
nlp,” in In e na ional con e ence on machine lea ning.
PMLR, 2019, pp. 2790–2799.
[43]
J. P ei e , A. Kama h, A. Rücklé, K. Cho, and
I. Gu e ych, “Adap e usion: Non-des uc i e ask
composi ion o ans e lea ning,” a Xi p ep in
a Xi :2005.00247, 2020.
[44]
J. P ei e , A. Rücklé, C. Po h, A. Kama h, I. Vuli´
c,
S. Rude , K. Cho, and I. Gu e ych, “Adap e hub: A
amewo k o adap ing ans o me s,” a Xi p ep in
a Xi :2007.07779, 2020.
[45]
H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang,
“Ip-adap e : Tex compa ible image p omp adap e
o ex - o-image di usion models,” a Xi p ep in
a Xi :2308.06721, 2023.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
525
[46]
C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi,
and Y. Shan, “T2i-adap e : Lea ning adap e s o dig
ou mo e con ollable abili y o ex - o-image di usion
models,” in P oceedings o he AAAI Con e ence on
A i icial In elligence, ol. 38, no. 5, 2024, pp. 4296–
4304.
[47]
F.-D. Tsai, S.-L. Wu, H. Kim, B.-Y. Chen,
H.-C. Cheng, and Y.-H. Yang, “Audio p omp
adap e : Unleashing music edi ing abili ies o ex -
o-music wi h ligh weigh ine uning,” a Xi p ep in
a Xi :2407.16564, 2024.
[48]
S. Hong, W. Im, and H. S. Yang, “Con en -
based ideo-music e ie al using so in a-modal
s uc u e cons ain ,” 2017. [Online]. A ailable: h ps:
//a xi .o g/abs/1704.06761
[49]
B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sha ma,
“C ea ing a mul i ack classical music pe o mance
da ase o mul imodal music analysis: Challenges,
insigh s, and applica ions,” IEEE T ansac ions on
Mul imedia, ol. 21, no. 2, p. 522–535, Feb. 2019.
[Online]. A ailable: h p://dx.doi.o g/10.1109/TMM.
2018.2856090
[50]
Y. Zhu, K. Olszewski, Y. Wu, P. Achliop as, M. Chai,
Y. Yan, and S. Tulyako , “Quan ized gan o complex
music gene a ion om dance ideos,” 2022. [Online].
A ailable: h ps://a xi .o g/abs/2204.00604
[51]
R. Li, S. Yang, D. A. Ross, and A. Kanazawa,
“Ai cho eog aphe : Music condi ioned 3d dance
gene a ion wi h ais ++,” 2021. [Online]. A ailable:
h ps://a xi .o g/abs/2101.08779
[52]
L. Zhuo, Z. Wang, B. Wang, Y. Liao, C. Bao, S. Peng,
S. Han, A. Zhang, F. Fang, and S. Liu, “Video back-
g ound music gene a ion: Da ase , me hod and e alu-
a ion,” in P oceedings o he IEEE/CVF In e na ional
Con e ence on Compu e Vision, 2023, pp. 15 637–
15 647.
[53]
J. Kang, S. Po ia, and D. He emans, “Video2music:
Sui able music gene a ion om ideos using an
a ec i e mul imodal ans o me model,” Expe
Sys ems wi h Applica ions, ol. 249, p. 123640, Sep.
2024. [Online]. A ailable: h p://dx.doi.o g/10.1016/j.
eswa.2024.123640
[54]
S. Li, Y. Qin, M. Zheng, X. Jin, and Y. Liu, “Di -
bgm: A di usion model o ideo backg ound music
gene a ion,” 2024.
[55]
I. Ca doso, R. O. Mo aes, and L. N. Fe ei a, “The
nes ideo-music da abase: A da ase o symbolic
ideo game music pai ed wi h gameplay ideos,” in
P oceedings o he 19 h In e na ional Con e ence
on he Founda ions o Digi al Games, se . FDG
2024. ACM, May 2024, p. 1–6. [Online]. A ailable:
h p://dx.doi.o g/10.1145/3649921.3650011
[56]
R. Solo ye , A. S empko skiy, and T. Hab use a,
“Benchma ks and leade boa ds o sound demixing
asks,” 2023.
[57]
T. Giannakopoulos, “pyaudioanalysis: An open-sou ce
py hon lib a y o audio signal analysis,” PloS one,
ol. 10, no. 12, p. e0144610, 2015.
[58]
Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D.
Plumbley, “Panns: La ge-scale p e ained audio neu al
ne wo ks o audio pa e n ecogni ion,” IEEE/ACM
T ansac ions on Audio, Speech, and Language P ocess-
ing, ol. 28, pp. 2880–2894, 2020.
[59]
J. A. Russell, “A ci cumplex model o a ec .” Jou nal
o pe sonali y and social psychology, ol. 39, no. 6, p.
1161, 1980.
[60]
H.-T. Hung, J. Ching, S. Doh, N. Kim, J. Nam, and
Y.-H. Yang, “Emopia: A mul i-modal pop piano da ase
o emo ion ecogni ion and emo ion-based music gen-
e a ion,” a Xi p ep in a Xi :2108.01374, 2021.
[61]
A. Dé ossez, J. Cope , G. Synnae e, and Y. Adi,
“High ideli y neu al audio comp ession,” a Xi p ep in
a Xi :2210.13438, 2022.
[62]
W. Kay, J. Ca ei a, K. Simonyan, B. Zhang, C. Hillie ,
S. Vijayana asimhan, F. Viola, T. G een, T. Back,
P. Na se , M. Suleyman, and A. Zisse man, “The
kine ics human ac ion ideo da ase ,” 2017. [Online].
A ailable: h ps://a xi .o g/abs/1705.06950
[63]
E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li,
S. Wang, L. Wang, and W. Chen, “Lo a: Low- ank
adap a ion o la ge language models,” a Xi p ep in
a Xi :2106.09685, 2021.
[64]
T. B own, B. Mann, N. Ryde , M. Subbiah, J. D. Ka-
plan, P. Dha iwal, A. Neelakan an, P. Shyam, G. Sas-
y, A. Askell e al., “Language models a e ew-sho
lea ne s,” Ad ances in neu al in o ma ion p ocessing
sys ems, ol. 33, pp. 1877–1901, 2020.
[65]
V. Liu and L. B. Chil on, “Design guidelines o p omp
enginee ing ex - o-image gene a i e models,” in P o-
ceedings o he 2022 CHI con e ence on human ac o s
in compu ing sys ems, 2022, pp. 1–23.
[66]
J. Whi e, Q. Fu, S. Hays, M. Sandbo n, C. Olea,
H. Gilbe , A. Elnasha , J. Spence -Smi h, and D. C.
Schmid , “A p omp pa e n ca alog o enhance
p omp enginee ing wi h cha gp ,” a Xi p ep in
a Xi :2302.11382, 2023.
[67]
S. Doh, K. Choi, J. Lee, and J. Nam, “Lp-musiccaps:
Llm-based pseudo music cap ioning,” a Xi p ep in
a Xi :2307.16372, 2023.
[68]
F. H. Ilya Loshchilo , “Decoupled weigh decay egu-
la iza ion,” a Xi p ep in a Xi :1711.05101, 2017.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
526
