GD-Retriever: Controllable Generative Text-Music Retrieval With Diffusion Models

Author: Julien Guinot; Elio Quinton; George Fazekas

Publisher: Zenodo

DOI: 10.5281/zenodo.17706387

Source: https://zenodo.org/records/17706387/files/000030.pdf

GD-RETRIEVER: CONTROLLABLE GENERATIVE TEXT-MUSIC
RETRIEVAL WITH DIFFUSION MODELS
Julien Guinot∗,1,2 Elio Quinton2 György Fazekas1
1 Centre for Digital Music, Queen Mary University of London, U.K.
2 Music & Audio Machine Learning Lab, Universal Music Group, London, U.K.
ABSTRACT
Multimodal contrastive models have achieved strong performance in text-audio retrieval and zero-shot settings, but improving joint embedding spaces remains an active research area. Less attention has been given to making these systems controllable and interactive for users. In text-music retrieval, the ambiguity of freeform language creates a many-to-many mapping, often resulting in inflexible or unsatisfying results.
We introduce Generative Diffusion Retriever (GDR), a novel framework that leverages diffusion models to generate queries in a retrieval-optimized latent space. This enables controllability through generative tools such as negative prompting and denoising diffusion implicit models (DDIM) inversion, opening a new direction in retrieval control. GDR improves retrieval performance over contrastive teacher models and supports retrieval in audio-only latent spaces using non-jointly trained encoders. Finally, we demonstrate that GDR enables effective post-hoc manipulation of retrieval behavior, enhancing interactive control for text-music retrieval tasks.
1. INTRODUCTION
Multimodal text-music joint embedding models have largely facilitated text-queried music retrieval applications [1–5]. Multimodal contrastive learning of text and music joint embedding spaces specifically has shown high performance on frozen probing tasks and promise for zero-shot classification approaches, with strong representation learning capacities. Despite the improvement in effectively encoding musical information, these approaches often lack controllability. Unsatisfactory results from text querying require re-prompting with a different query to refine retrieval, and it is difficult to navigate the latent space of joint-embedding models with interpretable controls, such as “I would like this retrieval result to be punchier” or “I would like the retrieval result to be similar to this track, but with an electric guitar instead of acoustic”.
© Julien Guinot, Elio Quinton, György Fazekas. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Julien Guinot, Elio Quinton, György Fazekas, “GD-Retriever: Controllable Generative Text-Music Retrieval with Diffusion Models”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Figure 1: Overview of GD-Retriever’s proposal: Instead of encoding text queries and audio keys in a joint embedding space (top), we generate queries in the audio space directly through conditioning on a text query (bottom).
One field in which such controls are more extensively explored, efficient, and disentangled is the field of generative AI. Diffusion generative models, specifically, have not only been widely adopted by virtue of their high quality outputs and multimodal conditionability [6–12], but have also been the focus of an extensive range of controllability approaches which have incrementally added powerful, multimodal, and intuitive controls to generative diffusion models [10,12–20]. In light of this observation, we are strongly motivated to combine the retrieval and diffusion paradigms. This work focuses on exploring the capabilities of generative text-music models for retrieval, with the motivation of enabling controllability mechanisms for interactive retrieval. For instance, this would enable discovering directions of modification of musical attributes in the latent space or being able to modify the genre or instrumentation of a retrieved musical piece without modifying other semantic attributes.
We propose Generative Diffusion Retriever, a new mechanism for retrieval, in which we train a conditional latent diffusion model on a retrieval-optimized latent space. We prompt GD-Retriever to generate hypothetical queries in the audio latent space and retrieve nearest neighbours (see Figure 1). Our contributions are¹:
1. We present a generative diffusion retrieval framework that conditionally generates hypothetical queries in the latent space, leading to improved performance on in-domain text-music retrieval.
2. We distinguish ourselves from previous approaches by directly generating embedding sequences rather than aggregated embeddings, promoting fine-grained text-music understanding.
3. We show that we successfully unlock the array of controllability methods of generative models for retrieval through examples such as negative prompting and DDIM inversion [17].
4. Our approach is directly applicable to audio-only latent spaces, and can leverage text encoders that have not been jointly trained with an audio encoder.
2. BACKGROUND
2.1 Text-music contrastive learning and retrieval
Multimodal contrastive learning has shown strong results in computer vision [6,21–23], and has been successfully extended to audio and music domains [1,2,5]. These models encode paired text and audio inputs using encoders $E_T$ and $E_A$, project them into a shared latent space, and apply a contrastive InfoNCE loss [24] on pooled embeddings ($Z_T$, $Z_A$) to align positive pairs while separating negative ones.
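To make the contrastive objective concrete, the following is a minimal NumPy sketch of a symmetric InfoNCE loss over a batch of pooled embeddings ($Z_T$, $Z_A$). The function name `info_nce`, the temperature value, and the averaging of both retrieval directions are illustrative assumptions, not details taken from the cited models.

```python
import numpy as np


def info_nce(z_text, z_audio, temperature=0.07):
    """Symmetric InfoNCE over paired, pooled embeddings of shape (B, D).

    Row i of z_text is assumed to be the positive pair of row i of z_audio.
    The temperature is an illustrative default, not a value from the paper.
    """
    # L2-normalise so that dot products are cosine similarities.
    zt = z_text / np.linalg.norm(z_text, axis=1, keepdims=True)
    za = z_audio / np.linalg.norm(z_audio, axis=1, keepdims=True)
    logits = zt @ za.T / temperature  # (B, B); positives sit on the diagonal

    def diagonal_cross_entropy(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-audio and audio-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))
```

Correctly paired batches should give a lower loss than mismatched ones, which is what pulls positive pairs together and pushes negatives apart.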
Early text-audio/music models such as CLAP [3,5], MusCALL [1], and MuLan [2] adopt CLIP-style training [21]. Later work improves alignment through better captions and token-/time fine-grained mechanisms [4,25–27]. Learned representations from these models are widely used for retrieval, in which the learned similarity metric between text and music encodings can be used to retrieve the most similar music key in a database of audio samples [1–3], generative conditioning [7,8,14,15,28], and retrieval-augmented captioning [29,30].
2.2 Diffusion Models
Diffusion models are powerful generative models that iteratively refine noisy inputs to generate high-quality outputs by learning a reverse Markov process [6,31,32]. These models corrupt data with noise over multiple steps and then learn to reconstruct the sample from the step information. The denoising process is modeled as a learned transition, where at each step, a generator $G$ predicts either the original data $x_0$ (sample objective) or the noise added to the original latent ($\epsilon$ objective). We adopt the sample prediction objective, where the model predicts the clean latent at each step, as in prior work [33,34]. The diffusion process can be conditioned on auxiliary conditioning information such as text or other modalities [35,36], typically applied with classifier-free guidance (CFG) [6] by interpolating between unconditional and conditional predictions at each denoising step. Latent diffusion models reduce computational costs by operating in a compressed latent space using pretrained autoencoders [7,8,14,15,37,38].
¹ Code is made available at https://github.com/Pliploop/GDRetriever
2.3 Controllability of generation and retrieval
Controllability in generation refers to how well generative models respond to human-guided interactions, allowing for attribute modification or refinement either during or after generation. In diffusion models, this includes techniques like text-based attribute editing [39–41], inpainting [42], inversion, and negative prompting [17,43]. In music generation, controllability is an active area of research due to its potential for creative applications [12,44–46].
While extensively studied in generation, controllability in retrieval, especially for music, remains underexplored. This involves enabling users to guide or modify retrieval results interactively, by specifying attributes of interest for retrieval in a disentangled way [47,48] or applying latent transformations, e.g. tempo adjustments [49]. Generative retrieval has emerged in recent work to generate latent queries, mainly to improve performance in general audio retrieval [34] or add multimodal guidance [50], rather than enabling interactivity during or after retrieval.
3. GENERATIVE DIFFUSION RETRIEVAL
We propose an intuitive generative approach to retrieval using diffusion models, which we name Generative Diffusion Retriever. Using a pretrained latent space optimized for audio-audio retrieval, we train a generative diffusion model conditioned on text to generate audio latent embeddings in this space. At inference time, rather than encoding the text query into the shared latent space and retrieving the nearest audio [1,3], we generate a “ghost” audio query in the latent space conditioned on the text query (similar to [12,50]) and retrieve the nearest neighbours in the audio space. By using a generative model as a retriever, we enable adaptation of audio-only latent spaces to text-audio retrieval without requiring multimodal pretraining. The generative modelling of retrieval allows for greater retrieval controllability through attribute modification and interactive refinement techniques from the generative domain. The approach is shown in Figure 2.
Consider an audio, query caption pair $\{x_q, x_a\}$ and a conditioning text encoder $E_T$ which encodes the text query into a sequence of embeddings $z_T^q$. Let $E_A$ be a pretrained and frozen audio encoder encoding $x_a$ into a sequence of audio embeddings $z_A$. We train $G$ with a diffusion loss to reconstruct $z_A$ conditioned on $z_T^q$, which we notate $\tilde{z}_A^q$:
$$\mathcal{L}_G = \mathbb{E}_{\tau,\, z_A,\, z_T^q}\left[\left\lVert z_A - G(z_{A,\tau}, \tau, z_T^q) \right\rVert_2^2\right] \tag{1}$$
where $\tau$ is the diffusion step. We aggregate $\tilde{z}_A^q$ into $\tilde{Z}_A^q$ through averaging over the sequence time steps and use $\tilde{Z}_A^q$ as a query in the audio space to retrieve audio.
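The retrieval step described above can be sketched as follows. This is an illustrative NumPy example rather than the authors' implementation: the function name `retrieve`, the use of cosine similarity for the nearest-neighbour search, and the assumption that database keys are already pooled to shape (N, D) are all hypothetical choices.

```python
import numpy as np


def retrieve(generated, database, k=5):
    """Pool generated audio latents into one query and return the top-k keys.

    generated: array whose last axis is the embedding dimension D, e.g.
               (T, D) for one sequence or (n_q, T, D) for several samples.
    database:  (N, D) pooled audio embeddings acting as retrieval keys.
    """
    # Average over every leading axis (time steps and, if present, samples).
    query = generated.reshape(-1, generated.shape[-1]).mean(axis=0)

    # Cosine similarity between the pooled query and each key (an assumption;
    # the text only specifies nearest-neighbour retrieval in the audio space).
    query = query / np.linalg.norm(query)
    keys = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = keys @ query

    top = np.argsort(-scores)[:k]
    return top, scores[top]
```

Because the query lives in the audio space, the same database index used for audio-to-audio search can be reused unchanged for text-conditioned retrieval.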
3.1 Model architecture
3.1.1 Diffusion backbone
We use a well-established UNet with cross-attention conditioning [6–9] as our diffusion model. Early experimentation led to the design choice of a ∼40M parameter model. Models are conditioned on text embedding sequences through cross-attention. Compared to previous work [34,50] that performs generative diffusion retrieval with aggregated embeddings, this enables more fine-grained interaction between audio and text [50], as we show in Section 4.2. Supported by results in [33], we find that a sample objective (see Section 2.2) yields better retrieval results than an $\epsilon$ objective.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 2: GD-Retriever method: We train a model to generate text-conditioned ghost queries for retrieval. Left, (a) Stage 1, generative pretraining: A diffusion model is trained to generate audio latents from text captions. Right, (b) Stage 2, retrieval: Using the frozen model, we generate audio embeddings from a caption to retrieve similar audio via ghost queries.
3.1.2 Text and audio encoders
We use three audio encoders: CLAP [3], MusCALL [1], and MULE [51]. CLAP uses an HTSAT backbone [52] with the publicly available Music checkpoint². MusCALL is based on a ResNet50 encoder, and MULE is reimplemented and pretrained on MTG-Jamendo following [48]. For text encoders, we use CLAP’s RoBERTa-based encoder [53,54], MusCALL’s 4-layer transformer, or a pretrained Flan-T5 model [53] from HuggingFace³. Flan-T5 is a fine-tuned T5 language model commonly used in music generation [8,11,55,56].
3.2 Datasets
We use two well-explored public music-caption pair datasets as well as a private dataset for training and evaluation. Song Describer [57] (SD) is a dataset of 1100 crowd-sourced captions corresponding to 700 excerpts of 2 minute music clips. MusicCaps [58] is another music-caption pair dataset consisting of 5500 pairs with 10s audio. Finally, we use a private large scale dataset of professionally-annotated song descriptions (PrivateCaps). Table 1 inventories dataset scales. For evaluation on PrivateCaps, we use a 5500-sample subset of the test set.
Dataset              # tracks   # captions   Hours   Training   Eval
SongDescriber [57]   0.7k       1.1k         23.3    ✗          ✓
MusicCaps [56]       5.5k       5.5k         15.3    ✗          ✓
PrivateCaps          251k       251k         12.5k   ✓          ✓
Table 1: Dataset details. PrivateCaps is an internal dataset of full-length professionally annotated production tracks.
² https://github.com/LAION-AI/CLAP
³ https://huggingface.co/google/flan-t5-base
3.3 Training details
We extract latent embeddings from our training datasets (see Section 3.2) with the selected audio encoders. We extract $z_A$ by sliding the encoder over the audio input at a frequency of 1Hz. GDR is trained on PrivateCaps with a batch size of 256 for 100k steps, with an early stopping mechanism on validation diffusion loss. Models are trained on a single A5000 GPU. We use AdamW and linearly warm up the learning rate to 1e−4 over 5000 steps, then cosine decay to 0. Our model is trained on latent sequences of length T = 64, i.e. 1 minute of audio. We use CFG on text conditioning [35] (masking probability 10%).
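As a rough illustration of the learning-rate schedule described above, the sketch below implements linear warmup followed by cosine decay. The function name `learning_rate` is hypothetical, and the code assumes that the cosine decay spans the steps remaining after warmup, ending at the 100k-step budget; the text does not state the decay horizon explicitly.

```python
import math


def learning_rate(step, peak=1e-4, warmup_steps=5000, total_steps=100_000):
    """Linear warmup to `peak`, then cosine decay to 0 at `total_steps`."""
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak * step / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```

With early stopping on the validation loss, training may end before the schedule reaches zero.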
4. EXPERIMENTS
4.1 Retrieval
T→A retrieval; columns PC, SD and MC are evaluation datasets.
Model          Metric        PC     SD     MC
CLAP           R@1 ↑         2.2    3.1    3.8
               R@5 ↑         7.2    13.7   12.9
               R@10 ↑        12.3   23.2   19.5
               MedR (%) ↓    3.7    4.0    1.4
GDR-CLAP       R@1 ↑         6.9    4.7    2.7
               R@5 ↑         17.1   15.3   7.6
               R@10 ↑        22.9   24.7   11.5
               MedR (%) ↓    1.6    3.8    2.9
MusCALL        R@1 ↑         10.1   3.6    1.0
               R@5 ↑         26.2   13.6   3.9
               R@10 ↑        35.1   22.0   7.0
               MedR (%) ↓    0.4    4.2    5.1
GDR-MusCALL    R@1 ↑         10.8   5.1    1.8
               R@5 ↑         25.1   16.9   6.4
               R@10 ↑        33.3   25.5   9.9
               MedR (%) ↓    0.6    3.5    3.4
Table 2: Main retrieval results for GDR. We compare GDR-CLAP to CLAP and GDR-MusCALL to MusCALL for R@1,5,10 on the PC, SD and MC datasets.
We evaluate GD-Retriever’s retrieval performance against teacher models. We generate $n_q = 5$ audio latent queries $\tilde{z}_A$ conditioned on $z_T^q$. We average $\tilde{z}_A$ over time and $n_q$ into $\tilde{Z}_A$ for retrieval. Teachers encode text and audio into embeddings $Z_T$ and $Z_A$. Results are shown in Table 2. While GD-Retriever outperforms teacher models in several in-domain scenarios, most notably on PrivateCaps (PC) and to a lesser extent SongDescriber (SD), its retrieval performance degrades on out-of-domain MusicCaps (MC). In particular, GDR-CLAP underperforms on MC relative to the CLAP teacher, despite showing strong improvements on PC. Conversely, GDR-MusCALL underperforms on PC while providing stronger performance than the teacher baseline on MC and SD. Despite strong in-domain results, these inconsistencies suggest that domain mismatch plays a role in limiting retrieval performance.
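For readers who want to reproduce metrics of the kind reported in Table 2, the following NumPy sketch computes Recall@K and a normalised median rank. It assumes that query i has exactly one ground-truth key, key i, and that "MedR (%)" is the median rank expressed as a percentage of the database size; neither convention is spelled out in this section, so both are assumptions, as is the function name `retrieval_metrics`.

```python
import numpy as np


def retrieval_metrics(queries, keys, ks=(1, 5, 10)):
    """Recall@K (in %) and median rank (in % of database size).

    queries: (N, D) query embeddings; keys: (N, D) audio embeddings.
    Query i is assumed to match key i (one-to-one ground truth).
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = q @ k.T                                  # (N, N) cosine similarities

    # Rank of the correct key: 1 + number of keys scored strictly higher.
    target = np.diag(sims)[:, None]
    ranks = 1 + (sims > target).sum(axis=1)

    metrics = {f"R@{n}": 100.0 * np.mean(ranks <= n) for n in ks}
    metrics["MedR(%)"] = 100.0 * np.median(ranks) / keys.shape[0]
    return metrics
```

Datasets with several captions per track, such as Song Describer, would need a mapping from each caption to its track instead of the diagonal ground truth used here.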
4.1.1 Domain adaptation
We identify two compounding sources of domain shift. First, pretrained contrastive models like CLAP and MusCALL often fail to generalize across datasets with differing audio and text distributions, a well-known issue in dense retrieval and multimodal learning [59,60]. CLAP, trained on LAION-630k [3], performs worse than MusCALL on the latter's in-domain evaluation set (PC), but shows similar performance on SD, and better performance on MC.
Second, GD-Retriever learns a text-to-audio mapping on the teacher’s frozen embedding space during diffusion training. If the teacher suffers from domain mismatch (e.g., CLAP on PC), GDR can compensate by adapting to the training distribution. Conversely, when the teacher is well-aligned (e.g., MusCALL on PC), GDR tends to match, but not exceed, its performance. This explains why GDR-CLAP, trained on PrivateCaps with a CLAP teacher, reproduces performance trends seen in MusCALL’s teacher embeddings: performing best on PC, acceptably on SD, and poorly on MC.
Dataset   Encoder pair   FTD ↓    FAD      R@5      FAD ↓           FAD ↓                          R@5 ↑           R@5 ↑
                         $Z_T$    $Z_A$    $Z_A$    $\tilde{Z}_A$   $\tilde{Z}_A^{\text{align}}$   $\tilde{Z}_A$   $\tilde{Z}_A^{\text{align}}$
PC        CLAP           -        -        7.2      0.03            0.003                          17.1            18.2
SD        CLAP           -        -        13.7     0.09            0.001                          15.3            15.9
MC        CLAP           -        -        12.9     0.34            0.001                          7.6             8.1
PC        MusCALL        0.008    0.002    26.2     0.02            ∼0                             25.1            25.3
SD        MusCALL        0.14     0.18     13.6     0.12            ∼0                             16.9            17.7
MC        MusCALL        0.32     0.20     3.9      0.17            ∼0                             6.4             7.2
Table 3: Fréchet distances of $Z_A$/$Z_T$ to the training distribution for teacher models, Fréchet audio distance (FAD) to the evaluation set, and retrieval performance (R@5) for generated queries ($\tilde{Z}_A$) and aligned queries ($\tilde{Z}_A^{\text{align}}$) across datasets and encoder pairs. We are unable to evaluate FAD on $Z_A$ for CLAP’s training set as LAION-630k is private.
To test this hypothesis, we compute Fréchet distances between training and evaluation distributions, comparing (1) teacher audio embeddings of each evaluation set to the training distribution and (2) GDR-generated queries to the evaluation set, before and after a lightweight alignment step. Following prior domain adaptation work [61,62], we apply a post-hoc shift in mean and covariance to $\tilde{Z}_A$ to match the evaluation set (notated $\tilde{Z}_A^{\text{align}}$). This method is model-agnostic, efficient, and requires no retraining.
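One common way to realise a post-hoc shift in mean and covariance is to whiten the generated embeddings and re-colour them with the reference statistics. The NumPy sketch below illustrates that idea; whether the authors use exactly this transform is an assumption, and the names `align_to_reference` and `_sym_power` as well as the `eps` regulariser are hypothetical.

```python
import numpy as np


def _sym_power(mat, power, eps=1e-6):
    """Matrix power of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, eps, None)      # guard against tiny/negative values
    return (vecs * vals ** power) @ vecs.T


def align_to_reference(generated, reference, eps=1e-6):
    """Match the mean and covariance of `generated` (N, D) to `reference` (M, D)."""
    dim = generated.shape[1]
    mu_g = generated.mean(axis=0)
    mu_r = reference.mean(axis=0)
    cov_g = np.cov(generated, rowvar=False) + eps * np.eye(dim)
    cov_r = np.cov(reference, rowvar=False) + eps * np.eye(dim)

    whiten = _sym_power(cov_g, -0.5)     # removes the generated covariance
    colour = _sym_power(cov_r, 0.5)      # imposes the reference covariance
    return (generated - mu_g) @ whiten @ colour + mu_r
```

The transform is a single affine map estimated from summary statistics, which is why it needs no retraining and is independent of the encoder pair.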
We evaluate on joint encoder pairs to illustrate the joint text-audio domain generalization gap. As shown in Table 3, GDR-generated latents exhibit similar distribution shifts as the teacher. Alignment consistently reduces FAD and improves R@5, supporting our claim that retrieval degradation stems from inherited distribution divergence rather than a limitation of our approach. While not a complete solution, this provides both evidence for our diagnosis and a simple, effective mitigation. We leave broader generalization strategies to future work.
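The Fréchet distances used in this analysis compare two sets of embeddings through their Gaussian statistics. A minimal NumPy sketch follows; the function name `frechet_distance` is hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of the covariance product, which is one of several numerically reasonable choices and not necessarily what the authors' tooling does.

```python
import numpy as np


def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to x (N, D) and y (M, D).

    d^2 = ||mu_x - mu_y||^2 + Tr(C_x) + Tr(C_y) - 2 Tr((C_x C_y)^{1/2})
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)

    # The product of two PSD matrices has real, non-negative eigenvalues,
    # so Tr((C_x C_y)^{1/2}) is the sum of their square roots.
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_x - mu_y) ** 2)
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y)
                 - 2.0 * trace_sqrt)
```

Applied to audio embeddings this gives an FAD-style score; applied to text embeddings it gives the FTD column of Table 3.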
4.1.2 Encoder pair variation
Two core affordances of GDR are its ability to (1) operate in audio-only latent spaces not trained jointly with text, and (2) support arbitrary text encoders for conditioning. This is enabled by the diffusion model learning a generative mapping between text and audio embeddings, independent of any contrastive pre-training alignment. The generative retrieval objective imposes no constraints on the multimodality of the space or the choice of text encoder. To demonstrate this, we test several combinations of audio and text encoders that were not jointly trained: we replace the text encoder in GDR-CLAP with Flan-T5 (Section 3.1.2), and use the audio encoder of MULE paired with T5. Retrieval results are reported in Table 4.
T→A retrieval; columns PC, SD and MC are evaluation datasets.
Model       $E_T$   Metric        PC     SD     MC
GDR-CLAP    T5      R@1 ↑         8.1    4.9    2.3
                    R@5 ↑         21.1   15.6   7.8
                    R@10 ↑        29.2   25.1   11.7
                    MedR (%) ↓    0.8    3.7    2.9
GDR-CLAP    CLAP    R@1 ↑         6.9    4.7    2.7
                    R@5 ↑         17.1   15.3   7.6
                    R@10 ↑        22.9   24.7   11.5
                    MedR (%) ↓    1.6    3.8    2.9
GDR-MULE    T5      R@1 ↑         7.6    4.1    1.6
                    R@5 ↑         18.5   13.9   6.2
                    R@10 ↑        25.3   21.8   11.0
                    MedR (%) ↓    1.6    4.2    3.2
Table 4: Comparison of retrieval performance across models and text encoders. GD-Retriever enables text-conditioned retrieval on non-jointly trained encoders.
Including T5 as the text encoder improves in-domain performance of GDR-CLAP, likely by regularizing the mapping between contrastively trained audio and text encoders. However, this benefit is more limited out-of-domain, consistent with our findings in Section 4.1.1, where domain mismatch stems from training set distributions. We also find that GDR-MULE supports text-music retrieval and outperforms the CLAP teacher on PrivateCaps and SongDescriber. This demonstrates that joint retrieval spaces can be built from unimodal audio latents without large-scale multimodal pre-training, a key affordance of our approach.
4.2 Quality of Generated Queries
Beyond retrieval, we evaluate generated query quality using fidelity, diversity, and prompt adherence metrics. FAD captures audio fidelity, while CLAP score assesses alignment with input text. To evaluate diversity, we generate clusters of 10 audio queries per prompt and measure the intrasample cosine similarity (MICS) following previous work [50] and the normalized Vendi score (NVendi), along with its intracluster variant (MINVS), to assess the variation in generated queries [63].
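The two families of diversity measures can be illustrated with a short NumPy sketch. The exact definitions behind MICS, NVendi and MINVS are not given in this section, so the code makes assumptions: `intra_cluster_cosine` takes the mean cosine similarity over distinct pairs within a cluster, and `vendi_score` computes the plain (unnormalised) Vendi score with a cosine kernel. Both function names are hypothetical, and the paper's normalisation is not reproduced.

```python
import numpy as np


def intra_cluster_cosine(cluster):
    """Mean cosine similarity over distinct pairs in a cluster of shape (n, D)."""
    z = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)
    sims = z @ z.T
    n = len(z)
    # Exclude the diagonal (self-similarity) from the average.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))


def vendi_score(cluster):
    """Vendi score: exp of the entropy of the eigenvalues of K / n.

    K is the cosine-similarity kernel; the score ranges from 1 (all samples
    identical) to n (all samples mutually orthogonal).
    """
    z = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)
    n = len(z)
    eigvals = np.linalg.eigvalsh(z @ z.T / n)
    eigvals = eigvals[eigvals > 1e-12]   # drop numerically-zero eigenvalues
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))
```

A deterministic model that always returns the same query for a prompt gives a similarity of 1 and a Vendi score of 1, which matches the pattern of the regression baselines in Table 5.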
We compare our diffusion-based UNet to two kinds of baselines: a regression UNet predicting sequences of audio embeddings from sequences of learned mask embeddings conditioned on $z_T^q$, and two 1-hidden-layer MLPs (GeLU, 4096 units), trained with diffusion and regression objectives respectively to reconstruct $Z_A$ conditioned on $Z_T^q$. All models use the same training hyperparameters and CLAP T/A encoders. Table 5 shows results on PrivateCaps. Our diffusion UNet outperforms alternatives in retrieval and fidelity. While MLP Diffusion achieves higher diversity, it comes at the cost of realism and retrieval performance.
                             Diffusion        Regression
Group            Metric      UNet     MLP     UNet     MLP
Retrieval (SD)   R@5 ↑       15.3     11.3    9.1      7.1
                 NMedR ↓     3.8      5.5     5.6      5.1
Fidelity         FAD ↓       0.09     0.13    0.13     0.13
                 CLAP ↑      0.63     0.53    0.54     0.53
Diversity        MICS ↓      0.92     0.62    1        1
                 MINVS ↑     1.42     4.53    1        1
                 NVendi ↑    12.9     68.9    3        3.4
Table 5: Fidelity, retrieval, and diversity metrics for CLAP-Retrievers trained on PrivateCaps. FAD is grounded on MTG-Jamendo [64].
4.3 Controllability
4.3.1 Negative prompting
Negative prompting is an off-the-shelf affordance of diffusion models related to CFG [35], which allows users to specify what they do not want to be in the generated output. Negative prompting modifies the CFG update by incorporating a negative conditioning signal $z_T^{q-}$. Given a query embedding $z_T^{q+}$, a denoising step is given by:
$$\tilde{z}^{NP}_{A,\tau+1} = (1 + w)\, G(\tilde{z}_{A,\tau}, \tau + 1, z_T^{q+}) - w\, G(\tilde{z}_{A,\tau}, \tau + 1, z_T^{q-}) \tag{2}$$
where $w$ is the classifier-free guidance strength [35]. This formulation removes undesired attributes by interpolating towards conditional generation and away from negatively conditioned outputs at each diffusion step (see Section 3).
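Eq. 2 amounts to two forward passes of the generator followed by a linear combination. The sketch below shows that combination in NumPy; `negative_prompt_step` and the `denoiser` callable are hypothetical stand-ins for $G$, and the sampler-specific details of moving between noise levels are deliberately left out.

```python
import numpy as np


def negative_prompt_step(denoiser, z_noisy, step, cond_pos, cond_neg, w):
    """Guided prediction of Eq. 2.

    denoiser: callable (z_noisy, step, cond) -> predicted latent; a stand-in
              for the trained generator G (hypothetical interface).
    cond_pos: positive text conditioning, z_T^{q+}.
    cond_neg: negative text conditioning, z_T^{q-}.
    w:        guidance strength.
    """
    pred_pos = denoiser(z_noisy, step, cond_pos)
    pred_neg = denoiser(z_noisy, step, cond_neg)
    # Move towards the positive prediction and away from the negative one.
    return (1.0 + w) * pred_pos - w * pred_neg
```

Passing the unconditional (masked) embedding as `cond_neg` recovers the standard classifier-free guidance update, which is why negative prompting comes at no extra training cost.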
We evaluate the effectiveness of negative prompting in retrieval by curating 50 negative prompts across genre, mood, instrumentation, key, and tempo (e.g. “a rock song” as a negative prompt “removes” rock). Each category includes different phrasing formats. For each query $q$, we create new, modified query latents $z_a^{\text{mod}}$ using negative prompting and three additional modification methods as baselines, all applied with guidance strength $w$:
Negative prompting (NP) modifies latents by applying Eq. 2 with $z_T^{q-}$ as negative conditioning. Text interpolation ($\Delta T$) interpolates from $Z_A$ away from $Z_T^{q-}$: $Z'_A = Z_A + w\Delta T$, where $\Delta T = Z_T^{q+} - Z_T^{q-}$. Audio interpolation ($\Delta A$) interpolates along the direction from $\tilde{Z}_A$ and away from $\tilde{Z}_A^{-}$: $\Delta A = \tilde{Z}_A - \tilde{Z}_A^{-}$. Our last baseline is Prefix Negation Prompting (PNP): we modify the query by negating attributes (e.g., “a rock track” → “not a rock track”), then generate $\tilde{z}_A^{PNP}$ with GDR.
Figure 3: CLAP score results between $Z_a^{\text{mod}}$ and $\tilde{Z}_A^{q+}$, and between $Z_a^{\text{mod}}$ and $\tilde{Z}_a^{q-}$, for negative prompting and baselines for different categories.
$Z_a^{\text{mod}}$ obtained from all modification methods is compared using CLAP score to $Z_A^{q+}$, $\tilde{Z}_A^{q+}$, and $Z_T^{q+}$, for which a higher CLAP score is better as it signifies higher similarity to the positive prompt. We also compare $Z_a^{\text{mod}}$ to $\tilde{Z}_A^{q-}$ and $Z_T^{q-}$, for which a lower CLAP score (less similar to the negative conditioning) is better. We also use FAD to assess the fidelity of $Z_A^{\text{mod}}$. A modified audio latent that is similar to the original $Z_A^{q+}$ and dissimilar to $Z_A^{q-}$ but is very far from any reasonable distribution (i.e. large FAD) will yield unrealistic or uncommon music results. CLAP score and FAD grounded on MTG-Jamendo are reported in Table 6. Fine-grained category experiments are shown in Figure 3.
                                           Modified query $\tilde{Z}_A^{\text{mod}}$
Metric      Key                $Z_A^{q+}$   NP      ∆T      ∆A      PNP
CLAP ↑      $Z_A$              0.69         0.66    0.38    0.39    0.51
            $\tilde{Z}_A$      1            0.85    0.65    0.62    0.49
            $Z_T$              0.42         0.39    0.28    0.33    0.21
CLAP ↓      $\tilde{Z}_A^n$    0.41         0.21    -0.02   -0.23   0.46
            $Z_T^n$            0.17         0.08    -0.51   -0.04   0.21
Fidelity    FAD                0.11         0.12    3.12    0.60    0.12
Table 6: Negative prompting experiments. CLAP score of modified queries $z_A^{\text{mod}}$ vs original/negative $Z_A$/$Z_T$.
Negative prompting shows strong desirable results: The modified audio latent remains the most similar to the original prompt across modification baselines, while significantly distancing itself from the negative prompt. At the same time, NP-modified latents remain realistic and in distribution, as demonstrated by the lower FAD score. While $\Delta T$ and $\Delta A$ can lower the CLAP score relative to undesired attributes, they also degrade semantic alignment with the original prompt, as reflected in lower CLAP scores. In addition, this reduction is achieved unrealistically: higher FAD scores indicate the resulting latents deviate significantly from the grounding distribution. In retrieval, this would lead to unnatural or implausible results which are less relevant, even if less similar to the negative attribute. Moreover, a lower CLAP score to a negative prompt is not always desirable: usually, a CLAP score of 0 denotes the absence of shared information between the two embeddings, while a negative CLAP score signifies opposite information. Colloquially, a CLAP score of -1 to a reference embedding of “guitar” does not mean “no guitar”, but rather “the opposite of guitar”, which is ill-defined.
4.3.2 DDIM Inversion
One useful property of diffusion models is DDIM inversion, which enables user-controlled modifications while preserving semantic similarity [17]. By re-noising an embedding via an inverse DDIM scheduler, the latent returns to a noisier state where high-level features are established. From this pivot, applying new guidance during denoising yields outputs that better match the new prompt while remaining close to the original.
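The re-noise-then-denoise procedure can be sketched with the deterministic DDIM update for a sample-prediction model. The code below is an illustration under stated assumptions: `predict_z0` is a hypothetical stand-in for the (guided) generator, `alphas` is a hypothetical cumulative signal schedule that starts slightly below 1 and decreases towards noise, and the function names are invented for this example.

```python
import numpy as np


def ddim_move(z, z0_pred, alpha_from, alpha_to):
    """Deterministic DDIM update between two noise levels.

    The noise implied by the current latent and the predicted clean latent
    is kept fixed while the signal/noise mix changes to `alpha_to`.
    """
    eps = (z - np.sqrt(alpha_from) * z0_pred) / np.sqrt(1.0 - alpha_from)
    return np.sqrt(alpha_to) * z0_pred + np.sqrt(1.0 - alpha_to) * eps


def invert_and_regenerate(z_clean, predict_z0, cond_old, cond_new,
                          alphas, n_invert):
    """Re-noise for `n_invert` steps, then denoise under a new condition.

    predict_z0: callable (z, step_index, cond) -> predicted clean latent.
    alphas:     decreasing schedule with alphas[0] < 1 (index 0 is cleanest).
    """
    z = z_clean
    # Inversion: walk towards noise under the original conditioning.
    for i in range(n_invert):
        z0 = predict_z0(z, i, cond_old)
        z = ddim_move(z, z0, alphas[i], alphas[i + 1])
    # Regeneration: walk back towards data under the modified conditioning.
    for i in range(n_invert, 0, -1):
        z0 = predict_z0(z, i, cond_new)
        z = ddim_move(z, z0, alphas[i], alphas[i - 1])
    return z
```

Inverting for only part of the schedule keeps the coarse content of the original latent, so the number of inverted steps acts as a knob for how far the result may drift from it.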
This is valuable for retrieval, where users may want to refine a specific attribute of a query they are partly satisfied with. Current joint embedding retrieval models lack native support for such interaction. In contrast, GD-Retriever supports DDIM inversion directly, allowing for interactive and controllable retrieval refinement.
Figure 4: DDIM inversion example. Left: CLAP scores to audio from original and modified prompts. Right: CLAP scores to text prompts and added/removed words.
To demonstrate DDIM inversion for retrieval, we apply it to a prompt from the Song Describer dataset: “A choppy beat-heavy acoustic guitar song with soft vocals.” From the generated latent $\tilde{z}_A^q$, we perform inversion using an inversion prompt $x_T^{q,\text{inv}}$: “A smooth, solo acoustic guitar song with harsh vocals.” The aim is to produce a latent $Z_A^{\text{mod}}$ that remains close to the original while aligning more with the modified prompt. We track CLAP scores between $Z_A^{\text{mod}}(\tau)$, with $\tau$ the inversion step, and both original and modified audio/text queries: $\tilde{Z}_A^q$, $\tilde{Z}_A^{q,\text{inv}}$, $Z_T^q$, $Z_T^{q,\text{inv}}$, and the text encodings of added ($Z_T^+$) and removed ($Z_T^-$) words. Results are shown in Figure 4. We observe a clear transition: CLAP similarity shifts from the original to the modified prompt while remaining high to the original, confirming the effectiveness of DDIM inversion for fine-grained control in retrieval.
To validate the usability of DDIM inversion as a retrieval controllability tool on a large scale, we now curate 50 prompts from the Song Describer dataset that would represent realistic use-cases of refining a search result for retrieval, and curate modified prompts representing realistic modifications to refine a query, either by modifying qualifiers of subjects, or adding more details. Original prompts are notated $z_T^q$ and modified prompts $z_T^{q'}$ (again, modified audio latents are notated $z_a^{\text{mod}}$). For inversion, we re-noise $\tilde{z}_A^q$ for 20 out of 50 steps and denoise conditioned on $z_T^{q'}$. We use re-generation as a baseline by simply generating $\tilde{z}_A^{q'}$. We compare $Z_a^{\text{mod}}$ to $Z_A^q$, $\tilde{Z}_A^q$ and $\tilde{Z}_A^{q'}$ for audio comparison, and to $Z_T^q$, $Z_T^{q'}$ for text. A desirable result is for $Z_A^{\text{mod}}$ to be similar to these embeddings, meaning semantic similarity to the original and modified prompts. Results are shown in Figure 5.
Figure 5: Systematic evaluation of DDIM inversion on curated prompt modifications, comparing CLAP score of inverted latents and regenerated latents to (left: audio, right: text) original and modified latents.
DDIM inversion, a native affordance of GDR, yields higher similarity to the original prompt compared to regeneration, which causes a drop in CLAP score between $Z_A^{\text{regen}}$ and $\tilde{Z}_A^{q'}$. In contrast, $Z_A^{\text{inv}}$ remains close to $\tilde{Z}_A^q$. While both methods maintain similar alignment with the modified prompt, only inversion preserves semantic similarity to the original, making it a more faithful controllability mechanism. Additionally, we observe that both classifier-free guidance and the number of inversion steps modulate the strength of the effect, offering further control. While we use vanilla DDIM here, recent improvements in inversion techniques [17,39] can be directly applied to GDR for more realistic and fine-grained control.
5. CONCLUSION AND FUTURE WORK
We present GD-Retriever, a generative framework for text-to-music retrieval that uses diffusion models to produce latent queries in retrieval-relevant spaces. GD-Retriever outperforms contrastive teacher models on in-domain data and enables retrieval in unimodal audio spaces by leveraging independently pretrained text and audio encoders, removing the need for joint multimodal training.
Beyond retrieval performance, GD-Retriever enables inference-time controllability through negative prompting and DDIM inversion, offering flexible and post-hoc manipulation of retrieval behavior. These affordances open the door to interactive and user-steerable music retrieval. While domain mismatch remains a challenge, our findings suggest this can be partially mitigated through latent alignment. We encourage future work to expand generative control in retrieval, aiming for more robust, adaptable, and expressive retrieval systems.
6. ACKNOWLEDGEMENT
This work is supported by the EPSRC UKRI Centre for Doctoral Training in Artificial Intelligence and Music (EP/S022694/1) and Universal Music Group.
7. REFERENCES
[1] I. Manco, E. Benetos, E. Quinton et al., “Contrastive audio-language learning for music,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022, Bengaluru, India, December 4-8, 2022, 2022, pp. 640–649.
[2] Q. Huang, A. Jansen, J. Lee et al., “Mulan: A joint embedding of music audio and natural language,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022, Bengaluru, India, December 4-8, 2022, 2022, pp. 559–566.
[3] Y. Wu, K. Chen, T. Zhang et al., “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[4] J. Wu, W. Li, Z. Novack et al., “Collap: Contrastive long-form language-audio pretraining with musical temporal structure augmentation,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[5] B. Elizalde, S. Deshmukh, M. Al Ismail et al., “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[6] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[7] H. Liu, Z. Chen, Y. Yuan et al., “AudioLDM: Text-to-audio generation with latent diffusion models,” Proceedings of the International Conference on Machine Learning, 2023.
[8] H. Liu, Y. Yuan, X. Liu et al., “Audioldm 2: Learning holistic audio generation with self-supervised pretraining,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
[9] K. Chen, Y. Wu, H. Liu et al., “Musicldm: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,” arXiv preprint arXiv:2308.01546, 2023.
[10] S.-L. Wu, C. Donahue, S. Watanabe et al., “Music controlnet: Multiple time-varying controls for music generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
[11] D. Ghosal, N. Majumder, A. Mehrish et al., “Text-to-audio generation using instruction guided latent diffusion model,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, p. 3590–3598.
[12] J. Nis al, M. Pasini, C. Aouameu e al., “Di -a- i :
Musical accompanimen co-c ea ion ia la en di u-
sion models,” in ISMIR, 2024, 2024.
[13] Z. No ack, J. McAuley, T. Be g-Ki kpa ick e al.,
“Di o: di usion in e ence- ime -op imiza ion o mu-
sic gene a ion,” in P oceedings o he 41s In e na-
ional Con e ence on Machine Lea ning, 2024, pp.
38 426–38 447.
[14] Z. E ans, J. D. Pa ke , C. Ca e al., “Long- o m mu-
sic gene a ion wi h la en di usion,” in P oceedings o
he 25 h In e na ional Socie y o Music In o ma ion
Re ie al Con e ence (ISMIR), 2024.
[15] Z. E ans, C. Ca , J. Taylo e al., “Fas iming-
condi ioned la en audio di usion,” in P oceedings o
he 41s In e na ional Con e ence on Machine Lea n-
ing, 2024, pp. 12 652–12 665.
[16] R. Gal, Y. Alalu , Y. A zmon e al., “An im-
age is wo h one wo d: Pe sonalizing ex - o-image
gene a ion using ex ual in e sion,” a Xi p ep in
a Xi :2208.01618, 2022.
[17] R. Mokady, A. He z, K. Abe man e al., “Null- ex
in e sion o edi ing eal images using guided di usion
models,” in P oceedings o he IEEE/CVF con e ence
on compu e ision and pa e n ecogni ion, 2023, pp.
6038–6047.
[18] F. Yang, S. Yang, M. A. Bu e al., “Dynamic p omp
lea ning: Add essing c oss-a en ion leakage o ex -
based image edi ing,” Ad ances in Neu al In o ma ion
P ocessing Sys ems, ol. 36, pp. 26 291–26 303, 2023.
[19] S. A. Baumann, F. Krause, M. Neumayr et al.,
“Continuous, subject-specific attribute control in t2i
models by identifying semantic directions,” arXiv preprint
arXiv:2403.17064, 2024.
[20] X. Zhang, X.-Y. Wei, J. Wu et al., “Compositional
inversion for stable diffusion models,” in Proceedings of
the AAAI Conference on Artificial Intelligence, vol. 38,
no. 7, 2024, pp. 7350–7358.
[21] A. Radford, J. W. Kim, C. Hallacy et al.,
“Learning transferable visual models from natural language
supervision,” in International conference on machine
learning. PMLR, 2021, pp. 8748–8763.
[22] X. Zhai, B. Mustafa, A. Kolesnikov et al., “Sigmoid
loss for language image pre-training,” in Proceedings
of the IEEE/CVF International Conference on
Computer Vision, 2023, pp. 11 975–11 986.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[23] I. Bica, A. Ilić, M. Bauer et al., “Improving
fine-grained understanding in image-text pre-training,” in
Proceedings of the 41st International Conference on
Machine Learning, 2024, pp. 3974–3995.
[24] T. Chen, S. Kornblith, M. Norouzi et al., “A simple
framework for contrastive learning of visual
representations,” in International conference on machine
learning. PMLR, 2020, pp. 1597–1607.
[25] Y. Yuan, Z. Chen, X. Liu et al., “T-clap:
Temporal-enhanced contrastive language-audio pretraining,” in
2024 IEEE 34th International Workshop on Machine
Learning for Signal Processing (MLSP). IEEE, 2024,
pp. 1–6.
[26] I. Manco, J. Salamon, and O. Nieto, “Augment, drop &
swap: Improving diversity in llm captions for efficient
music-text representation learning,” in Proceedings of
the 25th International Society for Music Information
Retrieval Conference (ISMIR), 2024.
[27] G. Zhu, J. Darefsky, and Z. Duan, “Cacophony: An
improved contrastive audio-text model,” IEEE/ACM
Transactions on Audio, Speech, and Language
Processing, 2024.
[28] M. Comunità, Z. Zhong, A. Takahashi et al.,
“SpecMaskGIT: Masked Generative Modeling of Audio
Spectrograms for Efficient Audio Synthesis and
Beyond,” in Proceedings of the 25th International Society
for Music Information Retrieval Conference (ISMIR),
2024.
[29] X. Li, W. Chen, Z. Ma et al., “Drcap: Decoding clap
latents with retrieval-augmented generation for zero-shot
audio captioning,” in ICASSP 2025-2025 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2025, pp. 1–5.
[30] S. Ghosh, S. Kumar, C. K. R. Evuru et al., “Recap:
Retrieval-augmented audio captioning,” in ICASSP
2024-2024 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2024, pp. 1161–1165.
[31] P. Esser, R. Rombach, and B. Ommer, “Taming
transformers for high-resolution image synthesis,” in
Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2021, pp. 12 873–12 883.
[32] J. Yu, Y. Xu, J. Y. Koh et al., “Scaling
autoregressive models for content-rich text-to-image generation,”
Transactions on Machine Learning Research, 2022.
[33] A. Ramesh, P. Dhariwal, A. Nichol et al.,
“Hierarchical text-conditional image generation with clip latents,”
arXiv preprint arXiv:2204.06125, vol. 1, no. 2, p. 3,
2022.
[34] S. Mo, Z. Chen, F. Bao et al., “Diffgap: A lightweight
diffusion module in contrastive space for bridging
cross-model gap,” in ICASSP 2025-2025 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2025, pp. 1–5.
[35] J. Ho and T. Salimans, “Classifier-free diffusion
guidance,” in NeurIPS 2021 Workshop on Deep Generative
Models and Downstream Applications, 2022.
[36] W. Peebles and S. Xie, “Scalable diffusion models with
transformers,” in Proceedings of the IEEE/CVF
International Conference on Computer Vision, 2023, pp.
4195–4205.
[37] R. Rombach, A. Blattmann, D. Lorenz et al.,
“High-resolution image synthesis with latent diffusion
models,” in Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, 2022, pp.
10 684–10 695.
[38] F. Schneider, Z. Jin, and B. Schölkopf, “Moûsai:
Text-to-music generation with long-context latent
diffusion,” arXiv e-prints, pp. arXiv–2301, 2023.
[39] W. Dong, S. Xue, X. Duan et al., “Prompt tuning
inversion for text-driven image editing using diffusion
models,” in Proceedings of the IEEE/CVF International
Conference on Computer Vision, 2023, pp. 7430–7440.
[40] A. Hertz, R. Mokady, J. Tenenbaum et al.,
“Prompt-to-prompt image editing with cross-attention control,”
in The Eleventh International Conference on Learning
Representations, 2022.
[41] D. Sridhar and N. Vasconcelos, “Prompt sliders for
fine-grained control, editing and erasing of concepts in
diffusion models,” arXiv preprint arXiv:2409.16535,
2024.
[42] A. Lugmayr, M. Danelljan, A. Romero et al.,
“Repaint: Inpainting using denoising diffusion probabilistic
models,” in Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2022,
pp. 11 461–11 471.
[43] D. Miyake, A. Iohara, Y. Saito et al.,
“Negative-prompt inversion: Fast image inversion for editing
with text-guided diffusion models,” arXiv preprint
arXiv:2305.16807, 2023.
[44] Y. Zhang, Y. Ikemiya, G. Xia et al., “Musicmagus:
zero-shot text-to-music editing via diffusion models,”
in Proceedings of the Thirty-Third International Joint
Conference on Artificial Intelligence, 2024, pp.
7805–7813.
[45] J. Nistal, M. Pasini, and S. Lattner, “Improving
musical accompaniment co-creation via diffusion
transformers,” arXiv preprint arXiv:2410.23005, 2024.
[46] L. Lin, G. Xia, Y. Zhang et al., “Arrange, inpaint, and
refine: steerable long-term music audio generation and
editing via content-based controls,” in Proceedings of
the Thirty-Third International Joint Conference on
Artificial Intelligence, 2024, pp. 7690–7698.
[47] J. Lee, N. J. Bryan, J. Salamon et al., “Disentangled
multidimensional metric learning for music similarity,”
in ICASSP 2020-2020 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2020, pp. 6–10.
[48] J. Guinot, E. Quinton, and G. Fazekas,
“Leave-one-equivariant: Alleviating invariance-related information
loss in contrastive music representations,” in ICASSP
2025-2025 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2025, pp. 1–5.
[49] M. C. McCallum, F. Henkel, J. Kim et al., “Similar
but faster: manipulation of tempo in music audio
embeddings for tempo prediction and search,” in ICASSP
2024-2024 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2024, pp. 686–690.
[50] X. Bao, J. Y. Li, Z. Y. Wan et al., “Diff4steer:
Steerable diffusion prior for generative music retrieval with
semantic guidance,” in ICASSP 2025-2025 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2025, pp. 1–5.
[51] M. C. McCallum, F. Korzeniowski, S. Oramas et al.,
“Supervised and unsupervised learning of audio
representations for music understanding,” in Ismir 2022
Hybrid Conference, 2022.
[52] K. Chen, X. Du, B. Zhu et al., “Hts-at: A hierarchical
token-semantic audio transformer for sound
classification and detection,” in ICASSP 2022-2022 IEEE
International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2022, pp. 646–650.
[53] H. W. Chung, L. Hou, S. Longpre et al., “Scaling
instruction-finetuned language models,” Journal of
Machine Learning Research, vol. 25, no. 70, pp. 1–53,
2024.
[54] Y. Liu, M. Ott, N. Goyal et al., “Roberta: A robustly
optimized bert pretraining approach,” arXiv preprint
arXiv:1907.11692, 2019.
[55] J. Melechovsky, Z. Guo, D. Ghosal et al., “Mustango:
Toward controllable text-to-music generation,” in
Proceedings of the 2024 Conference of the North American
Chapter of the Association for Computational
Linguistics: Human Language Technologies (Volume 1: Long
Papers), 2024, pp. 8286–8309.
[56] J. Copet, F. Kreuk, I. Gat et al., “Simple and
controllable music generation,” Advances in Neural
Information Processing Systems, vol. 36, pp. 47 704–47 720,
2023.
[57] I. Manco, B. Weck, S. Doh et al., “The song describer
dataset: a corpus of audio captions for
music-and-language evaluation,” NeurIPS Machine Learning for
Audio Workshop, 2023.
[58] A. Agostinelli, T. I. Denk, Z. Borsos et al.,
“MusicLM: Generating music from text,” arXiv preprint
arXiv:2301.11325, 2023.
[59] Y. Yu, C. Xiong, S. Sun et al., “Coco-dr: Combating
distribution shifts in zero-shot dense retrieval with
contrastive and distributionally robust learning,” in
Proceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing, 2022, pp.
1462–1479.
[60] E. Khramtsova, S. Zhuang, M. Baktashmotlagh et al.,
“Selecting which dense retriever to use for zero-shot
search,” in Proceedings of the Annual International
ACM SIGIR Conference on Research and Development
in Information Retrieval in the Asia Pacific Region,
2023, pp. 223–233.
[61] Y. Zhou, J. Ren, F. Li et al., “Test-time distribution
normalization for contrastively learned visual-language
models,” Advances in Neural Information Processing
Systems, vol. 36, pp. 47 105–47 123, 2023.
[62] B. Sun, J. Feng, and K. Saenko, “Correlation alignment
for unsupervised domain adaptation,” Domain
adaptation in computer vision applications, pp. 153–171,
2017.
[63] D. Friedman and A. B. Dieng, “The vendi score: A
diversity evaluation metric for machine learning,” arXiv
preprint arXiv:2210.02410, 2022.
[64] D. Bogdanov, M. Won, P. Tovstogan et al., “The
mtg-jamendo dataset for automatic music tagging,” in
Machine Learning for Music Discovery Workshop,
International Conference on Machine Learning (ICML
2019), Long Beach, CA, United States, 2019.