GD-RETRIEVER: CONTROLLABLE GENERATIVE TEXT-MUSIC
RETRIEVAL WITH DIFFUSION MODELS
Julien Guinot∗,1,2  Elio Quinton2  György Fazekas1
1 Centre for Digital Music, Queen Mary University of London, U.K.
2 Music & Audio Machine Learning Lab, Universal Music Group, London, U.K.
[email protected]
ABSTRACT
Multimodal contrastive models have achieved strong performance in text-audio retrieval and zero-shot settings, but improving joint embedding spaces remains an active research area. Less attention has been given to making these systems controllable and interactive for users. In text-music retrieval, the ambiguity of freeform language creates a many-to-many mapping, often resulting in inflexible or unsatisfying results.
We introduce Generative Diffusion Retriever (GDR), a novel framework that leverages diffusion models to generate queries in a retrieval-optimized latent space. This enables controllability through generative tools such as negative prompting and denoising diffusion implicit models (DDIM) inversion, opening a new direction in retrieval control. GDR improves retrieval performance over contrastive teacher models and supports retrieval in audio-only latent spaces using non-jointly trained encoders. Finally, we demonstrate that GDR enables effective post-hoc manipulation of retrieval behavior, enhancing interactive control for text-music retrieval tasks.
1. INTRODUCTION
Multimodal text-music joint embedding models have largely facilitated text-queried music retrieval applications [1–5]. Multimodal contrastive learning of text and music joint embedding spaces specifically has shown high performance on frozen probing tasks and promise for zero-shot classification approaches, with strong representation learning capacities. Despite the improvement in effectively encoding musical information, these approaches often lack controllability. Unsatisfactory results from text querying require re-prompting with a different query to refine retrieval, and it is difficult to navigate the latent space of joint-embedding models with interpretable controls, such as “I would like this retrieval result to be punchier” or “I would like the retrieval result to be similar to this track, but with an electric guitar instead of acoustic”.
© Julien Guinot, Elio Quinton, György Fazekas. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Julien Guinot, Elio Quinton, György Fazekas, “GD-Retriever: Controllable Generative Text-Music Retrieval with Diffusion Models”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Figure 1: Overview of GD-Retriever’s proposal: Instead of encoding text queries and audio keys in a joint embedding space (top), we generate queries in the audio space directly through conditioning on a text query (bottom).
One field in which such controls are more extensively explored, efficient, and disentangled is the field of generative AI. Diffusion generative models, specifically, have not only been widely adopted by virtue of their high quality outputs and multimodal conditionability [6–12], but have also been the focus of an extensive range of controllability approaches which have incrementally added powerful, multimodal, and intuitive controls to generative diffusion models [10,12–20]. In light of this observation, we are strongly motivated to combine the retrieval and diffusion paradigms. This work focuses on exploring the capabilities of generative text-music models for retrieval, with the motivation of enabling controllability mechanisms for interactive retrieval. For instance, this would enable discovering directions of modification of musical attributes in the latent space or being able to modify the genre or instrumentation of a retrieved musical piece without modifying other semantic attributes.
We propose Generative Diffusion Retriever, a new mechanism for retrieval, in which we train a conditional latent diffusion model on a retrieval-optimized latent space. We prompt GD-Retriever to generate hypothetical queries in the audio latent space and retrieve nearest neighbours (See Figure 1). Our contributions are (see footnote 1 for code):
1. We present a generative diffusion retrieval framework that conditionally generates hypothetical queries in the latent space, leading to improved performance on in-domain text-music retrieval.
2. We distinguish ourselves from previous approaches by directly generating embedding sequences over aggregated embeddings, promoting finer-grained text-music understanding.
3. We show that we successfully unlock the array of controllability methods of generative models for retrieval through examples such as negative prompting and DDIM inversion [17].
4. Our approach is directly applicable to audio-only latent spaces, and can leverage text encoders that have not been jointly trained with an audio encoder.
2. BACKGROUND
2.1 Text-music contrastive learning and retrieval
Multimodal contrastive learning has shown strong results in computer vision [6,21–23], and has been successfully extended to audio and music domains [1,2,5]. These models encode paired text and audio inputs using encoders $E_T$ and $E_A$, project them into a shared latent space, and apply a contrastive InfoNCE loss [24] on pooled embeddings ($Z_T$, $Z_A$) to align positive pairs while separating negative ones.
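As a rough illustration of the contrastive objective described above, the following is a minimal sketch of a symmetric InfoNCE loss on pooled embeddings. The function names, the temperature value of 0.07, and the symmetric text-to-audio / audio-to-text averaging are assumptions made for this example; they are not taken from the models cited here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired pooled embeddings (assumed form)."""
    zt = [l2_normalize(v) for v in text_emb]
    za = [l2_normalize(v) for v in audio_emb]
    n = len(zt)
    # logits[i][j]: temperature-scaled cosine similarity between text i and audio j
    logits = [[sum(a * b for a, b in zip(zt[i], za[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class of row i is column i (the positive pair)
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # audio-to-text direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))

# Toy batch: matched pairs give a much lower loss than mismatched pairs.
paired = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce(paired, paired)
loss_swapped = info_nce(paired, paired[::-1])
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss drops as positives become more similar than the in-batch negatives.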
Early text-audio/music models such as CLAP [3,5], MusCALL [1], and MuLan [2] adopt CLIP-style training [21]. Later work improves alignment through better captions and token-/time fine-grained mechanisms [4,25–27]. Learned representations from these models are widely used for retrieval, in which the learned similarity metric between text and music encodings can be used to retrieve the most similar music key in a database of audio samples [1–3], generative conditioning [7,8,14,15,28], and retrieval-augmented captioning [29,30].
2.2 Diffusion Models
Diffusion models are powerful generative models that iteratively refine noisy inputs to generate high-quality outputs by learning a reverse Markov process [6,31,32]. These models corrupt data with noise over multiple steps and then learn to reconstruct the sample from the step information. The denoising process is modeled as a learned transition, where at each step, a generator $G$ predicts either the original data $x_0$ (sample objective) or the noise added to the original latent ($\epsilon$ objective). We adopt the sample prediction objective, where the model predicts the clean latent at each step, as in prior work [33,34]. The diffusion process can be conditioned on auxiliary conditioning information such as text or other modalities [35,36], typically applied with classifier-free guidance (CFG) [6] by interpolating between unconditional and conditional predictions at each denoising step. Latent diffusion models reduce computational costs by operating in a compressed latent space using pretrained autoencoders [7,8,14,15,37,38].

1 Code is made available at https://github.com/Pliploop/GDRetriever
2.3 Controllability of generation and retrieval
Controllability in generation refers to how well generative models respond to human-guided interactions, allowing for attribute modification or refinement either during or after generation. In diffusion models, this includes techniques like text-based attribute editing [39–41], inpainting [42], inversion, and negative prompting [17,43]. In music generation, controllability is an active area of research due to its potential for creative applications [12,44–46].
While extensively studied in generation, controllability in retrieval, especially for music, remains underexplored. This involves enabling users to guide or modify retrieval results interactively, by specifying attributes of interest for retrieval in a disentangled way [47,48] or applying latent transformations, e.g. tempo adjustments [49]. Generative retrieval has emerged in recent work to generate latent queries, mainly to improve performance in general audio retrieval [34] or add multimodal guidance [50], rather than enabling interactivity during or after retrieval.
3. GENERATIVE DIFFUSION RETRIEVAL
We propose an intuitive generative approach to retrieval using diffusion models, which we name Generative Diffusion Retriever. Using a pretrained latent space optimized for audio-audio retrieval, we train a generative diffusion model conditioned on text to generate audio latent embeddings in this space. At inference time, rather than encoding the text query into the shared latent space and retrieving the nearest audio [1,3], we generate a “ghost” audio query in the latent space conditioned on the text query (similar to [12,50]) and retrieve the nearest neighbours in the audio space. By using a generative model as a retriever, we enable adaptation of audio-only latent spaces to text-audio retrieval without requiring multimodal pretraining. The generative modelling of retrieval allows for greater retrieval controllability through attribute modification and interactive refinement techniques from the generative domain. The approach is shown in Figure 2.
Consider an audio, query caption pair $\{x^q, x_a\}$ and a conditioning text encoder $E_T$ which encodes the text query into a sequence of embeddings $z^q_T$. Let $E_A$ be a pretrained and frozen audio encoder encoding $x_a$ into a sequence of audio embeddings $z_A$. We train $G$ with a diffusion loss to reconstruct $z_A$ conditioned on $z^q_T$, which we notate $\tilde{z}^q_A$:

$$\mathcal{L}_G = \mathbb{E}_{\tau, z_A, z^q_T}\left[\left\lVert z_A - G(z_{A,\tau}, \tau, z^q_T)\right\rVert_2^2\right] \quad (1)$$

where $\tau$ is the diffusion step. We aggregate $\tilde{z}^q_A$ into $\tilde{Z}^q_A$ through averaging over the sequence timesteps and use $\tilde{Z}^q_A$ as a query in the audio space to retrieve audio.
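The aggregation and retrieval step can be sketched as follows. This is a minimal illustration, not the authors' implementation: cosine similarity as the nearest-neighbour metric, the dictionary-based database, and all function and track names are assumptions for the example.

```python
import math

def mean_pool(seq):
    """Average a sequence of embeddings (T x D) over time into a single D-vector."""
    t = len(seq)
    return [sum(frame[d] for frame in seq) / t for d in range(len(seq[0]))]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(ghost_query_seq, database, k=1):
    """Pool a generated query sequence and return the k most similar database keys."""
    q = mean_pool(ghost_query_seq)
    ranked = sorted(database, key=lambda name: cosine(q, database[name]), reverse=True)
    return ranked[:k]

# Hypothetical generated query (2 timesteps, 2 dimensions) and pooled audio keys.
ghost = [[0.9, 0.1], [1.1, -0.1]]
db = {"track_a": [1.0, 0.1], "track_b": [0.0, 1.0], "track_c": [-1.0, 0.0]}
top = retrieve(ghost, db, k=2)
```

Because the query lives in the audio space, the database only needs audio embeddings; no text encoder is involved at ranking time.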
3.1 Model architecture
3.1.1 Diffusion backbone
We use a well-established UNet with cross-attention conditioning [6–9] as our diffusion model. Early experimentation led to the design choice of a ∼40M parameter model. Models are conditioned on text embedding sequences through cross-attention. Compared to previous work [34,50] that performs generative diffusion retrieval with aggregated embeddings, this enables more fine-grained interaction between audio and text [50], as we show in Section 4.2. Supported by results in [33], we find that a sample-objective (See Section 2) yields better retrieval results than $\epsilon$-objective.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

(a) Stage 1: Generative pretraining. (b) Stage 2: Retrieval.
Figure 2: GD-Retriever method: We train a model to generate text-conditioned ghost queries for retrieval. Left: A diffusion model is trained to generate audio latents from text captions. Right: Using the frozen model, we generate audio embeddings from a caption to retrieve similar audio via ghost queries.
3.1.2 Text and audio encoders
We use three audio encoders: CLAP [3], MusCALL [1], and MULE [51]. CLAP uses an HTSAT backbone [52] with the publicly available Music checkpoint (see footnote 2). MusCALL is based on a ResNet50 encoder, and MULE is reimplemented and pretrained on MTG-Jamendo following [48]. For text encoders, we use CLAP’s RoBERTa-based encoder [53,54], MusCALL’s 4-layer transformer, or a pretrained Flan-T5 model [53] from HuggingFace (see footnote 3). Flan-T5 is a fine-tuned T5 language model commonly used in music generation [8,11,55,56].
3.2 Datasets
We use two well-explored public music-caption pair datasets as well as a private dataset for training and evaluation. Song Describer [57] (SD) is a dataset of 1100 crowd-sourced captions corresponding to 700 excerpts of 2 minute music clips. MusicCaps [58] is another music-caption pair dataset consisting of 5500 pairs with 10s audio. Finally, we use a private large scale dataset of professionally-annotated song descriptions (PrivateCaps). Table 1 inventories dataset scales. For evaluation on PrivateCaps, we use a 5500-sample subset of the test set.
2 https://github.com/LAION-AI/CLAP
3 https://huggingface.co/google/flan-t5-base

Dataset | # tracks | # captions | Hours | Training | Eval
SongDescriber [57] | 0.7k | 1.1k | 23.3 | ✗ | ✓
MusicCaps [56] | 5.5k | 5.5k | 15.3 | ✗ | ✓
PrivateCaps | 251k | 251k | 12.5k | ✓ | ✓
Table 1: Dataset details. PrivateCaps is an internal dataset of full-length professionally annotated production tracks.

3.3 Training details
We extract latent embeddings from our training datasets (See Section 3.2) with the selected audio encoders. We extract $z_A$ by sliding the encoder over the audio input at a frequency of 1 Hz. GDR is trained on PrivateCaps with a batch size of 256 for 100k steps, with an early stopping mechanism on validation diffusion loss. Models are trained on a single A5000 GPU. We use AdamW with a linear warmup of the learning rate to 1e-4 over 5000 steps, followed by cosine decay to 0. Our model is trained on latent sequences of length $T = 64$, i.e. roughly 1 minute of audio at 1 Hz. We use CFG on text conditioning [35] (masking probability 10%).
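The learning-rate schedule described in the training details can be written down as a small function. This is a sketch under one stated assumption: the cosine decay is taken to reach 0 exactly at the 100k-step training budget, which the text does not specify. The constant and function names are illustrative.

```python
import math

PEAK_LR = 1e-4          # peak learning rate from the text
WARMUP_STEPS = 5_000    # linear warmup length from the text
TOTAL_STEPS = 100_000   # assumed decay horizon (the stated training budget)

def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)  # hold at 0 if training runs past the horizon
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

# Sample the schedule at a few points of interest.
schedule_samples = {s: learning_rate(s) for s in (0, 2_500, 5_000, 52_500, 100_000)}
```

With early stopping, training may end before the decay completes, in which case the final learning rate is simply whatever the schedule returns at the stopping step.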
4. EXPERIMENTS
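4.1 Retrieval

T→A retrieval, by evaluation dataset
Model | Metric | PC | SD | MC
CLAP | R@1 ↑ | 2.2 | 3.1 | 3.8
CLAP | R@5 ↑ | 7.2 | 13.7 | 12.9
CLAP | R@10 ↑ | 12.3 | 23.2 | 19.5
CLAP | MedR (%) ↓ | 3.7 | 4.0 | 1.4
GDR-CLAP | R@1 ↑ | 6.9 | 4.7 | 2.7
GDR-CLAP | R@5 ↑ | 17.1 | 15.3 | 7.6
GDR-CLAP | R@10 ↑ | 22.9 | 24.7 | 11.5
GDR-CLAP | MedR (%) ↓ | 1.6 | 3.8 | 2.9
MusCALL | R@1 ↑ | 10.1 | 3.6 | 1.0
MusCALL | R@5 ↑ | 26.2 | 13.6 | 3.9
MusCALL | R@10 ↑ | 35.1 | 22.0 | 7.0
MusCALL | MedR (%) ↓ | 0.4 | 4.2 | 5.1
GDR-MusCALL | R@1 ↑ | 10.8 | 5.1 | 1.8
GDR-MusCALL | R@5 ↑ | 25.1 | 16.9 | 6.4
GDR-MusCALL | R@10 ↑ | 33.3 | 25.5 | 9.9
GDR-MusCALL | MedR (%) ↓ | 0.6 | 3.5 | 3.4
Table 2: Main retrieval results for GDR. We compare GDR-CLAP to CLAP and GDR-MusCALL to MusCALL for R@1,5,10 on the PC, SD and MC datasets.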
We evaluate GD-Retriever’s retrieval performance against teacher models. We generate $n_q = 5$ audio latent queries $\tilde{z}_A$ conditioned on $z^q_T$. We average $\tilde{z}_A$ over time and $n_q$ into $\tilde{Z}_A$ for retrieval. Teachers encode text and audio into embeddings $Z_T$ and $Z_A$. Results are shown in Table 2. While GD-Retriever outperforms teacher models in several in-domain scenarios, most notably on PrivateCaps (PC) and to a lesser extent SongDescriber (SD), its retrieval performance degrades on out-of-domain MusicCaps (MC). In particular, GDR-CLAP underperforms on MC relative to the CLAP teacher, despite showing strong improvements on PC. Conversely, GDR-MusCALL underperforms on PC while providing stronger performance than the teacher baseline on MC and SD. Despite strong in-domain results, these inconsistencies suggest that domain mismatch plays a role in limiting retrieval performance.
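For readers unfamiliar with the metrics in Table 2, the sketch below computes Recall@k and a median rank from a query-by-item similarity matrix. It assumes one ground-truth item per query and reads "MedR (%)" as the median rank expressed as a percentage of the database size; that reading, and all names in the code, are assumptions rather than details confirmed by the text.

```python
def rank_of_target(scores, target_index):
    """1-based rank of the ground-truth item when items are sorted by descending score."""
    target = scores[target_index]
    return 1 + sum(1 for s in scores if s > target)

def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def median_rank_percent(ranks, database_size):
    """Median rank as a percentage of the database size (assumed meaning of MedR (%))."""
    ordered = sorted(ranks)
    mid = len(ordered) // 2
    median = ordered[mid] if len(ordered) % 2 else 0.5 * (ordered[mid - 1] + ordered[mid])
    return 100.0 * median / database_size

# Toy similarity matrix: row i is query i, whose ground-truth item is column i.
similarities = [
    [0.9, 0.1, 0.2, 0.0],  # target ranked 1st
    [0.3, 0.2, 0.8, 0.1],  # target ranked 3rd
    [0.1, 0.4, 0.7, 0.2],  # target ranked 1st
    [0.5, 0.6, 0.7, 0.4],  # target ranked 4th
]
ranks = [rank_of_target(row, i) for i, row in enumerate(similarities)]
r_at_1 = recall_at_k(ranks, 1)
r_at_3 = recall_at_k(ranks, 3)
med_r = median_rank_percent(ranks, database_size=4)
```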
4.1.1 Domain adaptation
We identify two compounding sources of domain shift. First, pretrained contrastive models like CLAP and MusCALL often fail to generalize across datasets with differing audio and text distributions, a well-known issue in dense retrieval and multimodal learning [59,60]. CLAP, trained on LAION-630k [3], performs worse than MusCALL on its in-domain evaluation set (PC), but shows similar performance on SD, and better performance on MC.
Second, GD-Retriever learns a text-to-audio mapping on the teacher’s frozen embedding space during diffusion training. If the teacher suffers from domain mismatch (e.g., CLAP on PC), GDR can compensate by adapting to the training distribution. Conversely, when the teacher is well-aligned (e.g., MusCALL on PC), GDR tends to match, but not exceed, its performance. This explains why GDR-CLAP, trained on PrivateCaps with a CLAP teacher, reproduces performance trends seen in MusCALL’s teacher embeddings: performing best on PC, acceptably on SD, and poorly on MC.
Dataset | Encoder pair | FTD ↓ ($Z_T$) | FAD ($Z_A$) | R@5 ($Z_A$) | FAD ↓ ($\tilde{Z}_A$) | FAD ↓ ($\tilde{Z}^{align}_A$) | R@5 ↑ ($\tilde{Z}_A$) | R@5 ↑ ($\tilde{Z}^{align}_A$)
PC | CLAP | - | - | 7.2 | 0.03 | 0.003 | 17.1 | 18.2
SD | CLAP | - | - | 13.7 | 0.09 | 0.001 | 15.3 | 15.9
MC | CLAP | - | - | 12.9 | 0.34 | 0.001 | 7.6 | 8.1
PC | MusCALL | 0.008 | 0.002 | 26.2 | 0.02 | ∼0 | 25.1 | 25.3
SD | MusCALL | 0.14 | 0.18 | 13.6 | 0.12 | ∼0 | 16.9 | 17.7
MC | MusCALL | 0.32 | 0.20 | 3.9 | 0.17 | ∼0 | 6.4 | 7.2
Table 3: Fréchet distances of $Z_A$/$Z_T$ to the training distribution of teacher models, Fréchet audio distance (FAD) to the evaluation set, and retrieval performance (R@5) of generated queries ($\tilde{Z}_A$), and aligned queries ($\tilde{Z}^{align}_A$) across datasets and encoder pairs. We are unable to evaluate FAD on $Z_A$ for CLAP’s training set as LAION-630k is private.
To test this hypothesis, we compute Fréchet distances between training and evaluation distributions, comparing (1) teacher audio embeddings of each evaluation set to the training distribution and (2) GDR-generated queries to the evaluation set, before and after a lightweight alignment step. Following prior domain adaptation work [61,62], we apply a post-hoc shift in mean and covariance to $\tilde{Z}_A$ to match the evaluation set (notated $\tilde{Z}^{align}_A$). This method is model-agnostic, efficient, and requires no retraining.
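To make the alignment step concrete, here is a deliberately simplified sketch that matches only per-dimension means and standard deviations, i.e. a diagonal approximation of the mean-and-covariance shift described above. The full-covariance version used in the cited domain adaptation work would require matrix square roots and is not shown. The Fréchet distance below is likewise the closed form for diagonal Gaussians, and all names and toy values are hypothetical.

```python
import math

def mean_std(rows):
    """Per-dimension mean and (population) standard deviation of a set of embeddings."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    sd = [math.sqrt(sum((r[j] - mu[j]) ** 2 for r in rows) / n) for j in range(d)]
    return mu, sd

def frechet_distance_diag(a, b):
    """Fréchet distance between diagonal Gaussians fitted to two embedding sets."""
    mu_a, sd_a = mean_std(a)
    mu_b, sd_b = mean_std(b)
    return sum((ma - mb) ** 2 + (sa - sb) ** 2
               for ma, mb, sa, sb in zip(mu_a, mu_b, sd_a, sd_b))

def align_diag(queries, reference, eps=1e-8):
    """Standardise the queries, then rescale and shift them to the reference statistics."""
    mu_q, sd_q = mean_std(queries)
    mu_r, sd_r = mean_std(reference)
    return [[(x - mq) / (sq + eps) * sr + mr
             for x, mq, sq, mr, sr in zip(row, mu_q, sd_q, mu_r, sd_r)]
            for row in queries]

# Toy generated queries and evaluation-set embeddings with mismatched statistics.
generated = [[0.0, 2.0], [1.0, 4.0], [2.0, 6.0]]
evaluation = [[10.0, 0.0], [12.0, 1.0], [14.0, 2.0]]
before = frechet_distance_diag(generated, evaluation)
after = frechet_distance_diag(align_diag(generated, evaluation), evaluation)
```

The alignment needs only summary statistics of the evaluation embeddings, which is why it can be applied post hoc without touching the diffusion model.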
We evaluate on joint encoder pairs to illustrate the joint text-audio domain generalization gap. As shown in Table 3, GDR-generated latents exhibit similar distribution shifts as the teacher. Alignment consistently reduces FAD and improves R@5, supporting our claim that retrieval degradation stems from inherited distribution divergence rather than a limitation of our approach. While not a complete solution, this provides both evidence for our diagnosis and a simple, effective mitigation. We leave broader generalization strategies to future work.
4.1.2 Encoder pair variation
Two core affordances of GDR are its ability to (1) operate in audio-only latent spaces not trained jointly with text, and (2) support arbitrary text encoders for conditioning. This is enabled by the diffusion model learning a generative mapping between text and audio embeddings, independent of any contrastive pre-training alignment. The generative retrieval objective imposes no constraints on the multimodality of the space or the choice of text encoder. To demonstrate this, we test several combinations of audio and text encoders that were not jointly trained: we replace the text encoder in GDR-CLAP with Flan-T5 (Section 3.1.2), and use the audio encoder of MULE paired with T5. Retrieval results are reported in Table 4.
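T→A retrieval, by evaluation dataset
Model | $E_T$ | Metric | PC | SD | MC
GDR-CLAP | T5 | R@1 ↑ | 8.1 | 4.9 | 2.3
GDR-CLAP | T5 | R@5 ↑ | 21.1 | 15.6 | 7.8
GDR-CLAP | T5 | R@10 ↑ | 29.2 | 25.1 | 11.7
GDR-CLAP | T5 | MR ↓ | 0.8 | 3.7 | 2.9
GDR-CLAP | CLAP | R@1 ↑ | 6.9 | 4.7 | 2.7
GDR-CLAP | CLAP | R@5 ↑ | 17.1 | 15.3 | 7.6
GDR-CLAP | CLAP | R@10 ↑ | 22.9 | 24.7 | 11.5
GDR-CLAP | CLAP | MR ↓ | 1.6 | 3.8 | 2.9
GDR-MULE | T5 | R@1 ↑ | 7.6 | 4.1 | 1.6
GDR-MULE | T5 | R@5 ↑ | 18.5 | 13.9 | 6.2
GDR-MULE | T5 | R@10 ↑ | 25.3 | 21.8 | 11.0
GDR-MULE | T5 | MR ↓ | 1.6 | 4.2 | 3.2
Table 4: Comparison of retrieval performance across models and text encoders. GD-Retriever enables text-conditioned retrieval on non-jointly trained encoders.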
Including T5 as the text encoder improves in-domain performance for GDR-CLAP, likely by regularizing the mapping between contrastively trained audio and text encoders. However, this benefit is more limited out-of-domain, consistent with our findings in Section 4.1.1, where domain mismatch stems from training set distributions. We also find that GDR-MULE supports text-music retrieval and outperforms the CLAP teacher on PrivateCaps and SongDescriber. This demonstrates that joint retrieval spaces can be built from unimodal audio latents without large-scale multimodal pre-training, a key affordance of our approach.
4.2 Quality of Generated Queries
Beyond retrieval, we evaluate generated query quality using fidelity, diversity, and prompt adherence metrics. FAD captures audio fidelity, while CLAP score assesses alignment with input text. To evaluate diversity, we generate clusters of 10 audio queries per prompt and measure the intrasample cosine similarity (MICS) following previous work [50] and the normalized Vendi score (NVendi), along with its intracluster variant (MINVS), to assess the variation in generated queries [63].
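One way to read the MICS diversity measure is as the average pairwise cosine similarity among the queries generated for a single prompt; lower values then indicate a more varied cluster. The sketch below implements that reading only. The exact definition in [50] may differ, the Vendi-based scores are not covered, and the function names are illustrative.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_intra_sample_cosine(queries):
    """Average cosine similarity over all unordered pairs within one prompt's cluster."""
    pairs = list(combinations(queries, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# A collapsed cluster (identical queries) versus a spread-out one.
identical = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
mics_identical = mean_intra_sample_cosine(identical)
mics_spread = mean_intra_sample_cosine(spread)
```

A deterministic regression model returns the same output for a given prompt, so under this reading its cluster similarity is exactly 1.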
We compare our diffusion-based UNet to two baselines: a regression UNet predicting sequences of audio embeddings from sequences of learned mask embeddings conditioned on $z^q_T$, and two 1-hidden-layer MLPs (GeLU, 4096 units), trained with diffusion and regression objectives respectively to reconstruct $Z_A$ conditioned on $Z^q_T$. All models use the same training hyperparameters and CLAP T/A encoders. Table 5 shows results on PrivateCaps. Our diffusion UNet outperforms alternatives in retrieval and fidelity. While MLP Diffusion achieves higher diversity, it comes at the cost of realism and retrieval performance.
Category | Metric | Diffusion UNet | Diffusion MLP | Regression UNet | Regression MLP
Retrieval (SD) | R@5 ↑ | 15.3 | 11.3 | 9.1 | 7.1
Retrieval (SD) | NMedR ↓ | 3.8 | 5.5 | 5.6 | 5.1
Fidelity | FAD ↓ | 0.09 | 0.13 | 0.13 | 0.13
Fidelity | CLAP ↑ | 0.63 | 0.53 | 0.54 | 0.53
Diversity | MICS ↓ | 0.92 | 0.62 | 1 | 1
Diversity | MINVS ↑ | 1.42 | 4.53 | 1 | 1
Diversity | NVendi ↑ | 12.9 | 68.9 | 3 | 3.4
Table 5: Fidelity, Retrieval, and Diversity metrics for CLAP-Retrievers trained on PrivateCaps. FAD is grounded on MTG-Jamendo [64].
4.3 Controllability
4.3.1 Negative prompting
Negative prompting is an off-the-shelf affordance of diffusion models related to CFG [35], which allows users to specify what they do not want to be in the generated output. Negative prompting modifies the CFG update by incorporating a negative conditioning signal $z^{q-}_T$. Given a query embedding $z^{q+}_T$, a denoising step is given by:

$$\tilde{z}^{NP}_{A,\tau+1} = (1 + w)\,G(\tilde{z}_{A,\tau}, \tau + 1, z^{q+}_T) - w\,G(\tilde{z}_{A,\tau}, \tau + 1, z^{q-}_T) \quad (2)$$

where $w$ is the classifier-free guidance strength [35]. This formulation removes undesired attributes by interpolating towards conditional generation and away from negatively conditioned outputs at each diffusion step (See Section 3).
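The update in Eq. 2 amounts to a two-call extrapolation around the generator output. The sketch below applies it with a toy stand-in for $G$; the toy generator, the two-dimensional conditioning vectors, and the function names are hypothetical and only illustrate the arithmetic. Passing an unconditional embedding as the negative signal recovers ordinary CFG in the same form.

```python
def negative_prompt_step(generator, z_tau, tau, cond_pos, cond_neg, w):
    """Combine positive and negative predictions as (1 + w) * G(pos) - w * G(neg)."""
    pos = generator(z_tau, tau, cond_pos)
    neg = generator(z_tau, tau, cond_neg)
    return [(1.0 + w) * p - w * n for p, n in zip(pos, neg)]

def toy_generator(z_tau, tau, cond):
    # Hypothetical stand-in for G: moves the latent halfway towards the conditioning vector.
    return [0.5 * z + 0.5 * c for z, c in zip(z_tau, cond)]

z = [0.0, 0.0]
rock = [1.0, 0.0]   # attribute to remove (negative conditioning)
calm = [0.0, 1.0]   # attribute to keep (positive conditioning)

plain = toy_generator(z, 1, calm)
steered = negative_prompt_step(toy_generator, z, 1, calm, rock, w=2.0)
unguided = negative_prompt_step(toy_generator, z, 1, calm, rock, w=0.0)
```

In this toy case the steered output moves further along the "calm" axis and takes a negative coordinate on the "rock" axis, while `w=0` leaves the positive prediction unchanged.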
We evaluate the effectiveness of negative prompting in retrieval by curating 50 negative prompts across genre, mood, instrumentation, key, and tempo (e.g. “a rock song” as a negative prompt “removes” rock). Each category includes different phrasing formats. For each query $q$, we create new, modified query latents $z^{mod}_A$ using negative prompting and three additional modification methods as baselines, all applied with guidance strength $w$:
Negative prompting (NP) modifies latents by applying Eq. 2 with $z^{-}_T$ as negative conditioning. Text Interpolation ($\Delta T$) interpolates from $Z_A$ away from $Z^{q-}_T$: $Z'_A = Z_A + w\Delta T$, where $\Delta T = Z^{q+}_T - Z^{q-}_T$. Audio interpolation interpolates along the direction from $\tilde{Z}_A$ and away from $\tilde{Z}^{-}_A$: $\Delta A = \tilde{Z}_A - \tilde{Z}^{-}_A$. Our last baseline is Prefix Negation Prompting (PNP): We modify the query by negating attributes (e.g., “a rock track” → “not a rock track”), then generate $\tilde{z}^{PNP}_A$ with GDR.

Figure 3: CLAP score results between $Z^{mod}_A$ and $\tilde{Z}^{q+}_A$, $Z^{mod}_A$ and $\tilde{Z}^{q-}_A$ for negative prompting and baselines for different categories.
$Z^{mod}_A$ obtained from all modification methods is compared using CLAP score to $Z^{q+}_A$, $\tilde{Z}^{q+}_A$, and $Z^{q+}_T$, for which a higher CLAP score is better as it signifies higher similarity to the positive prompt. We also compare $Z^{mod}_A$ to $\tilde{Z}^{q-}_A$ and $Z^{q-}_T$, for which a lower CLAP score (less similar to the negative conditioning) is better. We also use FAD to assess the fidelity of $Z^{mod}_A$. A modified audio latent that is similar to the original $Z^{q+}_A$ and dissimilar to $Z^{q-}_A$ but is very far from any reasonable distribution (i.e. large FAD) will yield unrealistic or uncommon music results. CLAP score and FAD grounded on MTG-Jamendo are reported in Table 6. Fine-grained category experiments are shown in Figure 3.
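Metric | Key | $Z^{q+}_A$ | NP | $\Delta T$ | $\Delta A$ | PNP
CLAP ↑ | $Z_A$ | 0.69 | 0.66 | 0.38 | 0.39 | 0.51
CLAP ↑ | $\tilde{Z}_A$ | 1 | 0.85 | 0.65 | 0.62 | 0.49
CLAP ↑ | $Z_T$ | 0.42 | 0.39 | 0.28 | 0.33 | 0.21
CLAP ↓ | $\tilde{Z}^{n}_A$ | 0.41 | 0.21 | -0.02 | -0.23 | 0.46
CLAP ↓ | $Z^{n}_T$ | 0.17 | 0.08 | -0.51 | -0.04 | 0.21
Fidelity | FAD | 0.11 | 0.12 | 3.12 | 0.60 | 0.12
Table 6: Negative prompting experiments. CLAP score of modified queries $z^{mod}_A$ vs original/negative $Z_A$/$Z_T$. The columns NP, $\Delta T$, $\Delta A$ and PNP are the modified queries $\tilde{Z}^{mod}_A$.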
Negative prompting shows strong desirable results: The modified audio latent remains the most similar to the original prompt across modification baselines, while significantly distancing itself from the negative prompt. At the same time, NP-modified latents remain realistic and in distribution, as demonstrated by the lower FAD score. While $\Delta T$ and $\Delta A$ can lower the CLAP score relative to undesired attributes, they also degrade semantic alignment with the original prompt, as reflected in lower CLAP scores. In addition, this reduction is achieved unrealistically: higher FAD scores indicate the resulting latents deviate significantly from the grounding distribution. In retrieval, this would lead to unnatural or implausible results which are less relevant, even if less similar to the negative attribute.
Moreover, a lower CLAP score to a negative prompt is not always desirable: Usually, a CLAP score of 0 denotes the absence of shared information between the two embeddings, while a negative CLAP score signifies opposite information. Colloquially, a CLAP score of -1 to a reference embedding of “guitar” does not mean “no guitar”, but rather “the opposite of guitar”, which is ill-defined.
4.3.2 DDIM Inversion
One useful property of diffusion models is DDIM inversion, which enables user-controlled modifications while preserving semantic similarity [17]. By re-noising an embedding via an inverse DDIM scheduler, the latent returns to a noisier state where high-level features are established. From this pivot, applying new guidance during denoising yields outputs that better match the new prompt while remaining close to the original.
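The re-noise-then-denoise loop can be sketched with the deterministic DDIM update written for a generator that predicts the clean latent, in line with the sample objective adopted earlier. Everything concrete below is an assumption for illustration: the noise schedule `alphas`, the toy generator that simply returns its conditioning vector, and the function names. It is a minimal sketch of vanilla DDIM inversion, not the authors' implementation.

```python
import math

def ddim_move(z, x0_pred, alpha_from, alpha_to):
    """Deterministic DDIM move between two noise levels, given a clean-latent prediction."""
    out = []
    for zi, x0i in zip(z, x0_pred):
        # Noise implied by the current latent and the predicted clean latent
        eps = (zi - math.sqrt(alpha_from) * x0i) / math.sqrt(1.0 - alpha_from)
        out.append(math.sqrt(alpha_to) * x0i + math.sqrt(1.0 - alpha_to) * eps)
    return out

def run(z, generator, cond, alphas, order):
    """Walk through consecutive noise levels in `order`, querying the generator at each."""
    for src, dst in zip(order[:-1], order[1:]):
        x0_pred = generator(z, src, cond)
        z = ddim_move(z, x0_pred, alphas[src], alphas[dst])
    return z

def fixed_target_generator(z, step, cond):
    # Hypothetical stand-in for G: always predicts the conditioning vector as the clean latent.
    return list(cond)

# alphas[i]: cumulative signal level at step i (index 0 is almost clean; values stay below 1).
alphas = [0.999, 0.9, 0.7, 0.5, 0.3]
original_cond, edited_cond = [1.0, 0.0], [0.0, 1.0]
z_start = [1.0, 0.0]

noisy = run(z_start, fixed_target_generator, original_cond, alphas, [0, 1, 2, 3])    # inversion
round_trip = run(noisy, fixed_target_generator, original_cond, alphas, [3, 2, 1, 0])  # same prompt
edited = run(noisy, fixed_target_generator, edited_cond, alphas, [3, 2, 1, 0])        # new prompt
```

Denoising from the pivot with the original conditioning returns the starting latent, while denoising with the edited conditioning moves towards the new target and carries over the noise component recovered during inversion. How far to invert (here 3 of 4 available steps) controls how much of the original is retained.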
This is valuable for retrieval, where users may want to refine a specific attribute of a query they are partly satisfied with. Current joint embedding retrieval models lack native support for such interaction. In contrast, GD-Retriever supports DDIM inversion directly, allowing for interactive and controllable retrieval refinement.
Figure 4: DDIM inversion example. Left: CLAP scores to audio from original and modified prompts. Right: CLAP scores to text prompts and added/removed words.
To demonstrate DDIM inversion for retrieval, we apply it to a prompt from the Song Describer dataset: “A choppy beat-heavy acoustic guitar song with soft vocals.” From the generated latent $\tilde{z}^q_A$, we perform inversion using an inversion prompt $x^{q,inv}_T$: “A smooth, solo acoustic guitar song with harsh vocals.” The aim is to produce a latent $Z^{mod}_A$ that remains close to the original while aligning more with the modified prompt. We track CLAP scores between $Z^{mod}_A(\tau)$, with $\tau$ the inversion step, and both original and modified audio/text queries: $\tilde{Z}^q_A$, $\tilde{Z}^{q,inv}_A$, $Z^q_T$, $Z^{q,inv}_T$, and the text encodings of added ($Z^{+}_T$) and removed ($Z^{-}_T$) words. Results are shown in Figure 4. We observe a clear transition: CLAP similarity shifts from the original to the modified prompt while remaining high to the original, confirming the effectiveness of DDIM inversion for fine-grained control in retrieval.
To validate the usability of DDIM inversion as a retrieval controllability tool on a large scale, we curate 50 prompts from the Song Describer dataset that represent realistic use-cases of refining a search result for retrieval, and curate modified prompts representing realistic modifications to refine a query, either by modifying qualifiers or subjects, or by adding more details.
Original prompts are notated $z^q_T$ and modified prompts $z^{q'}_T$ (again, modified audio latents are notated $z^{mod}_A$). For inversion, we re-noise $\tilde{z}^q_A$ for 20 out of 50 steps and denoise conditioned on $z^{q'}_T$. We use re-generation as a baseline by simply generating $\tilde{z}^{q'}_A$. We compare $Z^{mod}_A$ to $Z^q_A$, $\tilde{Z}^q_A$ and $\tilde{Z}^{q'}_A$ for audio comparison, and $Z^q_T$, $Z^{q'}_T$ for text. A desirable result is for $Z^{mod}_A$ to be similar to these embeddings, meaning semantic similarity to the original and modified prompts. Results are shown in Figure 5.
Figure 5: Systematic evaluation of DDIM inversion on curated prompt modifications, comparing CLAP score of inverted latents and regenerated latents to (left: audio, right: text) original and modified latents.
DDIM inversion, a native affordance of GDR, yields higher similarity to the original prompt compared to regeneration, which causes a drop in CLAP score between $Z^{regen}_A$ and $\tilde{Z}^{q'}_A$. In contrast, $Z^{inv}_A$ remains close to $\tilde{Z}^q_A$. While both methods maintain similar alignment with the modified prompt, only inversion preserves semantic similarity to the original, making it a more faithful controllability mechanism. Additionally, we observe that both classifier-free guidance and the number of inversion steps modulate the strength of the effect, offering further control. While we use vanilla DDIM here, recent improvements in inversion techniques [17,39] can be directly applied to GDR for more realistic and fine-grained control.
5. CONCLUSION AND FUTURE WORK
We present GD-Retriever, a generative framework for text-to-music retrieval that uses diffusion models to produce latent queries in retrieval-relevant spaces. GD-Retriever outperforms contrastive teacher models on in-domain data and enables retrieval in unimodal audio spaces by leveraging independently pretrained text and audio encoders, removing the need for joint multimodal training.
Beyond retrieval performance, GD-Retriever enables inference-time controllability through negative prompting and DDIM inversion, offering flexible and post-hoc manipulation of retrieval behavior. These affordances open the door to interactive and user-steerable music retrieval. While domain mismatch remains a challenge, our findings suggest this can be partially mitigated through latent alignment. We encourage future work to expand generative control in retrieval, aiming for more robust, adaptable, and expressive retrieval systems.
6. ACKNOWLEDGEMENT
This work is supported by the EPSRC UKRI Centre for Doctoral Training in Artificial Intelligence and Music (EP/S022694/1) and Universal Music Group.
7. REFERENCES
[1] I. Manco, E. Bene os, E. Quin on e al., “Con as i e
audio-language lea ning o music,” in P oceedings o
he 23 d In e na ional Socie y o Music In o ma ion
Re ie al Con e ence, ISMIR 2022, Bengalu u, India,
Decembe 4-8, 2022, 2022, pp. 640–649.
[2] Q. Huang, A. Jansen, J. Lee e al., “Mulan: A join em-
bedding o music audio and na u al language,” in P o-
ceedings o he 23 d In e na ional Socie y o Music
In o ma ion Re ie al Con e ence, ISMIR 2022, Ben-
galu u, India, Decembe 4-8, 2022, 2022, pp. 559–
566.
[3] Y. Wu, K. Chen, T. Zhang e al., “La ge-scale con-
as i e language-audio p e aining wi h ea u e usion
and keywo d- o-cap ion augmen a ion,” in ICASSP
2023-2023 IEEE In e na ional Con e ence on Acous-
ics, Speech and Signal P ocessing (ICASSP). IEEE,
2023, pp. 1–5.
[4] J. Wu, W. Li, Z. No ack e al., “Collap: Con as i e
long- o m language-audio p e aining wi h musical
empo al s uc u e augmen a ion,” in ICASSP 2025-
2025 IEEE In e na ional Con e ence on Acous ics,
Speech and Signal P ocessing (ICASSP). IEEE, 2025,
pp. 1–5.
[5] B. Elizalde, S. Deshmukh, M. Al Ismail e al., “Clap
lea ning audio concep s om na u al language supe i-
sion,” in ICASSP 2023-2023 IEEE In e na ional Con-
e ence on Acous ics, Speech and Signal P ocessing
(ICASSP). IEEE, 2023, pp. 1–5.
[6] J. Ho, A. Jain, and P. Abbeel, “Denoising di usion
p obabilis ic models,” Ad ances in neu al in o ma ion
p ocessing sys ems, ol. 33, pp. 6840–6851, 2020.
[7] H. Liu, Z. Chen, Y. Yuan e al., “AudioLDM: Tex -
o-audio gene a ion wi h la en di usion models,” P o-
ceedings o he In e na ional Con e ence on Machine
Lea ning, 2023.
[8] H. Liu, Y. Yuan, X. Liu e al., “Audioldm 2: Lea ning
holis ic audio gene a ion wi h sel -supe ised p e ain-
ing,” IEEE/ACM T ansac ions on Audio, Speech, and
Language P ocessing, 2024.
[9] K. Chen, Y. Wu, H. Liu e al., “Musicldm: En-
hancing no el y in ex - o-music gene a ion using
bea -synch onous mixup s a egies,” a Xi p ep in
a Xi :2308.01546, 2023.
[10] S.-L. Wu, C. Donahue, S. Wa anabe e al., “Music con-
olne : Mul iple ime- a ying con ols o music gen-
e a ion,” IEEE/ACM T ansac ions on Audio, Speech,
and Language P ocessing, 2024.
[11] D. Ghosal, N. Majumde , A. Meh ish e al., “Tex - o-
audio gene a ion using ins uc ion guided la en di u-
sion model,” in P oceedings o he 31s ACM In e na-
ional Con e ence on Mul imedia, 2023, p. 3590–3598.
[12] J. Nis al, M. Pasini, C. Aouameu e al., “Di -a- i :
Musical accompanimen co-c ea ion ia la en di u-
sion models,” in ISMIR, 2024, 2024.
[13] Z. No ack, J. McAuley, T. Be g-Ki kpa ick e al.,
“Di o: di usion in e ence- ime -op imiza ion o mu-
sic gene a ion,” in P oceedings o he 41s In e na-
ional Con e ence on Machine Lea ning, 2024, pp.
38 426–38 447.
[14] Z. E ans, J. D. Pa ke , C. Ca e al., “Long- o m mu-
sic gene a ion wi h la en di usion,” in P oceedings o
he 25 h In e na ional Socie y o Music In o ma ion
Re ie al Con e ence (ISMIR), 2024.
[15] Z. E ans, C. Ca , J. Taylo e al., “Fas iming-
condi ioned la en audio di usion,” in P oceedings o
he 41s In e na ional Con e ence on Machine Lea n-
ing, 2024, pp. 12 652–12 665.
[16] R. Gal, Y. Alalu , Y. A zmon e al., “An im-
age is wo h one wo d: Pe sonalizing ex - o-image
gene a ion using ex ual in e sion,” a Xi p ep in
a Xi :2208.01618, 2022.
[17] R. Mokady, A. He z, K. Abe man e al., “Null- ex
in e sion o edi ing eal images using guided di usion
models,” in P oceedings o he IEEE/CVF con e ence
on compu e ision and pa e n ecogni ion, 2023, pp.
6038–6047.
[18] F. Yang, S. Yang, M. A. Bu e al., “Dynamic p omp
lea ning: Add essing c oss-a en ion leakage o ex -
based image edi ing,” Ad ances in Neu al In o ma ion
P ocessing Sys ems, ol. 36, pp. 26 291–26 303, 2023.
[19] S. A. Baumann, F. K ause, M. Neumay e al., “Con-
inuous, subjec -speci ic a ibu e con ol in 2i mod-
els by iden i ying seman ic di ec ions,” a Xi p ep in
a Xi :2403.17064, 2024.
[20] X. Zhang, X.-Y. Wei, J. Wu e al., “Composi ional in-
e sion o s able di usion models,” in P oceedings o
he AAAI Con e ence on A i icial In elligence, ol. 38,
no. 7, 2024, pp. 7350–7358.
[21] A. Radford, J. W. Kim, C. Hallacy et al., “Learning
transferable visual models from natural language
supervision,” in International conference on machine
learning. PMLR, 2021, pp. 8748–8763.
[22] X. Zhai, B. Mustafa, A. Kolesnikov et al., “Sigmoid
loss for language image pre-training,” in Proceedings
of the IEEE/CVF International Conference on Computer
Vision, 2023, pp. 11 975–11 986.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[23] I. Bica, A. Ilić, M. Bauer et al., “Improving fine-grained
understanding in image-text pre-training,” in
Proceedings of the 41st International Conference on
Machine Learning, 2024, pp. 3974–3995.
[24] T. Chen, S. Kornblith, M. Norouzi et al., “A simple
framework for contrastive learning of visual representations,”
in International conference on machine learning.
PMLR, 2020, pp. 1597–1607.
[25] Y. Yuan, Z. Chen, X. Liu et al., “T-clap: Temporal-enhanced
contrastive language-audio pretraining,” in
2024 IEEE 34th International Workshop on Machine
Learning for Signal Processing (MLSP). IEEE, 2024,
pp. 1–6.
[26] I. Manco, J. Salamon, and O. Nieto, “Augment, drop &
swap: Improving diversity in llm captions for efficient
music-text representation learning,” in Proceedings of
the 25th International Society for Music Information
Retrieval Conference (ISMIR), 2024.
[27] G. Zhu, J. Darefsky, and Z. Duan, “Cacophony: An
improved contrastive audio-text model,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing,
2024.
[28] M. Comunità, Z. Zhong, A. Takahashi et al.,
“SpecMaskGIT: Masked Generative Modeling of Audio
Spectrograms for Efficient Audio Synthesis and Beyond,”
in Proceedings of the 25th International Society
for Music Information Retrieval Conference (ISMIR),
2024.
[29] X. Li, W. Chen, Z. Ma et al., “Drcap: Decoding clap latents
with retrieval-augmented generation for zero-shot
audio captioning,” in ICASSP 2025-2025 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2025, pp. 1–5.
[30] S. Ghosh, S. Kumar, C. K. R. Evuru et al., “Recap:
Retrieval-augmented audio captioning,” in ICASSP
2024-2024 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE,
2024, pp. 1161–1165.
[31] P. Esser, R. Rombach, and B. Ommer, “Taming transformers
for high-resolution image synthesis,” in Proceedings
of the IEEE/CVF conference on computer vision
and pattern recognition, 2021, pp. 12 873–12 883.
[32] J. Yu, Y. Xu, J. Y. Koh et al., “Scaling autoregressive
models for content-rich text-to-image generation,”
Transactions on Machine Learning Research, 2022.
[33] A. Ramesh, P. Dhariwal, A. Nichol et al., “Hierarchical
text-conditional image generation with clip latents,”
arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3,
2022.
[34] S. Mo, Z. Chen, F. Bao et al., “Diffgap: A lightweight
diffusion module in contrastive space for bridging
cross-model gap,” in ICASSP 2025-2025 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2025, pp. 1–5.
[35] J. Ho and T. Salimans, “Classifier-free diffusion guidance,”
in NeurIPS 2021 Workshop on Deep Generative
Models and Downstream Applications, 2022.
[36] W. Peebles and S. Xie, “Scalable diffusion models with
transformers,” in Proceedings of the IEEE/CVF
International Conference on Computer Vision, 2023, pp.
4195–4205.
[37] R. Rombach, A. Blattmann, D. Lorenz et al.,
“High-resolution image synthesis with latent diffusion models,”
in Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2022, pp.
10 684–10 695.
[38] F. Schneider, Z. Jin, and B. Schölkopf, “Moûsai:
Text-to-music generation with long-context latent
diffusion,” arXiv e-prints, pp. arXiv–2301, 2023.
[39] W. Dong, S. Xue, X. Duan et al., “Prompt tuning inversion
for text-driven image editing using diffusion models,”
in Proceedings of the IEEE/CVF International
Conference on Computer Vision, 2023, pp. 7430–7440.
[40] A. Hertz, R. Mokady, J. Tenenbaum et al.,
“Prompt-to-prompt image editing with cross-attention control,”
in The Eleventh International Conference on Learning
Representations, 2022.
[41] D. Sridhar and N. Vasconcelos, “Prompt sliders for
fine-grained control, editing and erasing of concepts in
diffusion models,” arXiv preprint arXiv:2409.16535,
2024.
[42] A. Lugmayr, M. Danelljan, A. Romero et al., “Repaint:
Inpainting using denoising diffusion probabilistic
models,” in Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2022,
pp. 11 461–11 471.
[43] D. Miyake, A. Iohara, Y. Saito et al., “Negative-prompt
inversion: Fast image inversion for editing
with text-guided diffusion models,” arXiv preprint
arXiv:2305.16807, 2023.
[44] Y. Zhang, Y. Ikemiya, G. Xia et al., “Musicmagus:
zero-shot text-to-music editing via diffusion models,”
in Proceedings of the Thirty-Third International Joint
Conference on Artificial Intelligence, 2024, pp.
7805–7813.
[45] J. Nistal, M. Pasini, and S. Lattner, “Improving musical
accompaniment co-creation via diffusion transformers,”
arXiv preprint arXiv:2410.23005, 2024.
[46] L. Lin, G. Xia, Y. Zhang et al., “Arrange, inpaint, and
refine: steerable long-term music audio generation and
editing via content-based controls,” in Proceedings of
the Thirty-Third International Joint Conference on
Artificial Intelligence, 2024, pp. 7690–7698.
[47] J. Lee, N. J. Bryan, J. Salamon et al., “Disentangled
multidimensional metric learning for music similarity,”
in ICASSP 2020-2020 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2020, pp. 6–10.
[48] J. Guinot, E. Quinton, and G. Fazekas,
“Leave-one-equivariant: Alleviating invariance-related information
loss in contrastive music representations,” in ICASSP
2025-2025 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE,
2025, pp. 1–5.
[49] M. C. McCallum, F. Henkel, J. Kim et al., “Similar
but faster: manipulation of tempo in music audio embeddings
for tempo prediction and search,” in ICASSP
2024-2024 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE,
2024, pp. 686–690.
[50] X. Bao, J. Y. Li, Z. Y. Wan et al., “Diff4steer: Steerable
diffusion prior for generative music retrieval with
semantic guidance,” in ICASSP 2025-2025 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2025, pp. 1–5.
[51] M. C. McCallum, F. Korzeniowski, S. Oramas et al.,
“Supervised and unsupervised learning of audio representations
for music understanding,” in Ismir 2022
Hybrid Conference, 2022.
[52] K. Chen, X. Du, B. Zhu et al., “Hts-at: A hierarchical
token-semantic audio transformer for sound classification
and detection,” in ICASSP 2022-2022 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2022, pp. 646–650.
[53] H. W. Chung, L. Hou, S. Longpre et al., “Scaling
instruction-finetuned language models,” Journal of
Machine Learning Research, vol. 25, no. 70, pp. 1–53,
2024.
[54] Y. Liu, M. Ott, N. Goyal et al., “Roberta: A robustly
optimized bert pretraining approach,” arXiv preprint
arXiv:1907.11692, 2019.
[55] J. Melechovsky, Z. Guo, D. Ghosal et al., “Mustango:
Toward controllable text-to-music generation,” in Proceedings
of the 2024 Conference of the North American
Chapter of the Association for Computational Linguistics:
Human Language Technologies (Volume 1: Long
Papers), 2024, pp. 8286–8309.
[56] J. Copet, F. Kreuk, I. Gat et al., “Simple and controllable
music generation,” Advances in Neural Information
Processing Systems, vol. 36, pp. 47 704–47 720,
2023.
[57] I. Manco, B. Weck, S. Doh et al., “The song describer
dataset: a corpus of audio captions for music-and-language
evaluation,” NeurIPS Machine Learning for
Audio Workshop, 2023.
[58] A. Agostinelli, T. I. Denk, Z. Borsos et al.,
“MusicLM: Generating music from text,” arXiv preprint
arXiv:2301.11325, 2023.
[59] Y. Yu, C. Xiong, S. Sun et al., “Coco-dr: Combating
distribution shifts in zero-shot dense retrieval with
contrastive and distributionally robust learning,” in Proceedings
of the 2022 Conference on Empirical Methods
in Natural Language Processing, 2022, pp.
1462–1479.
[60] E. Khramtsova, S. Zhuang, M. Baktashmotlagh et al.,
“Selecting which dense retriever to use for zero-shot
search,” in Proceedings of the Annual International
ACM SIGIR Conference on Research and Development
in Information Retrieval in the Asia Pacific Region,
2023, pp. 223–233.
[61] Y. Zhou, J. Ren, F. Li et al., “Test-time distribution
normalization for contrastively learned visual-language
models,” Advances in Neural Information Processing
Systems, vol. 36, pp. 47 105–47 123, 2023.
[62] B. Sun, J. Feng, and K. Saenko, “Correlation alignment
for unsupervised domain adaptation,” Domain
adaptation in computer vision applications, pp. 153–171,
2017.
[63] D. Friedman and A. B. Dieng, “The vendi score: A
diversity evaluation metric for machine learning,” arXiv
preprint arXiv:2210.02410, 2022.
[64] D. Bogdanov, M. Won, P. Tovstogan et al., “The
mtg-jamendo dataset for automatic music tagging,” in
Machine Learning for Music Discovery Workshop,
International Conference on Machine Learning (ICML
2019), Long Beach, CA, United States, 2019.