
Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Author: Yixiao Zhang; Yukara Ikemiya; Woosung Choi; Naoki Murata; Marco Martínez-Ramírez; Liwei Lin; Gus Xia; Wei-Hsiang Liao; Yuki Mitsufuji; Simon Dixon
Publisher: Zenodo
DOI: 10.5281/zenodo.17706408
Source: https://zenodo.org/records/17706408/files/000038.pdf
INSTRUCT-MUSICGEN: UNLOCKING TEXT-TO-MUSIC EDITING FOR
MUSIC LANGUAGE MODELS VIA INSTRUCTION TUNING
Yixiao Zhang1, Yukara Ikemiya2, Woosung Choi2, Naoki Murata2, Marco A. Martínez-Ramírez2,
Liwei Lin3, Gus Xia3, Wei-Hsiang Liao2, Yuki Mitsufuji2, Simon Dixon1
1C4DM, Queen Mary University of London
2Sony AI 3Music X Lab, MBZUAI
[email protected], [email protected], {gus.xia, ll4270}@nyu.edu
ABSTRACT
The task of text-to-music editing, which employs text queries to modify music (e.g. by changing its style or adjusting instrumental components), presents unique challenges and opportunities for AI-assisted music creation. Previous approaches in this domain have been constrained by the necessity to train specific editing models from scratch, which is both resource-intensive and inefficient; other research uses large language models to predict edited music, resulting in imprecise audio reconstruction. In this paper, we introduce Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently follow editing instructions such as adding, removing, or separating stems. Our approach involves a modification of the original MusicGen architecture by incorporating a text fusion module and an audio fusion module, which allow the model to process instruction texts and audio inputs concurrently and yield the desired edited music. Remarkably, although Instruct-MusicGen only introduces ∼8% new parameters to the original MusicGen model and only trains for 5K steps, it achieves superior performance across all tasks compared to existing baselines. This advancement not only enhances the efficiency of text-to-music editing but also broadens the applicability of music language models in dynamic music production environments (footnotes 1 and 2).
1. INTRODUCTION
The rapid advances in text-to-music generation have opened up new possibilities for AI-assisted music creation [1–5]. This paradigm shift has also sparked a growing interest in developing models that offer greater controllability [6–9] and editability [10–12] over the music generation process. In music production, a stem, i.e. a mixed group of tracks often related by instrument type (like drums or lead vocals), is essential for mixing and mastering because it allows producers to isolate, adjust, and manipulate individual elements of a song. Following the definition in MusicMagus [11], “text-to-music editing” involves using textual queries to modify various aspects of a music recording, which can be categorised into two main types: intra-stem editing, which focuses on modifying a single stem (e.g., changing the instrument, timbre, or performance style), and inter-stem editing, which involves altering the relationships among stems (e.g., adding, removing, or separating stems). Our work mainly focuses on the problem of inter-stem editing.
Previous attempts to develop text-based music editing models have encountered several challenges. Some approaches [10,13] have focused on training specialised editing models from scratch, which is resource-intensive and may not yield results comparable to state-of-the-art music generation models. Other work [12, 14, 15] has sought to leverage existing large language models (LLMs) and MusicGen [2], allowing the LLM to interpret editing instructions without further training the music model. Although this approach offers flexibility, it often lacks the ability to precisely reconstruct the conditional audio, leading to unreliable results. To address these limitations, an ideal solution should harness the knowledge embedded in pretrained models to ensure high-quality audio output while adapting the architecture to accommodate the specific requirements of music editing tasks.
1 Code, model weights and demo are available at: https://github.com/ldzhangyx/instruct-musicgen.
2 This work was done during Yixiao Zhang’s internship at Sony AI.
© Y. Zhang, Y. Ikemiya, W. Choi, N. Murata, M. A. Martínez-Ramírez, L. Lin, G. Xia, W.-H. Liao, Y. Mitsufuji, and S. Dixon. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Zhang, Y. Ikemiya, W. Choi, N. Murata, M. A. Martínez-Ramírez, L. Lin, G. Xia, W.-H. Liao, Y. Mitsufuji, and S. Dixon, “Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
[Figure 1 diagram. MusicGen: text description input ("Generate music piece of sad jazz") → T5 encoder → MusicGen → music output. Instruct-MusicGen: text instruction ("Instruction: add drums.") → T5 encoder, together with the source music → Instruct-MusicGen → edited music output.]
Figure 1: Comparison between MusicGen and Instruct-MusicGen. Instruct-MusicGen accepts both audio input and editing instruction text as conditions.
In this paper, we introduce Instruct-MusicGen, a novel approach that applies an instruction-following tuning strategy to the pretrained MusicGen model, enhancing its ability to follow editing instructions effectively without finetuning all its parameters. As shown in Figure 1, by incorporating an audio fusion module based on LLaMA-Adapter [6,16] and a text fusion module based on LoRA [17] into the original MusicGen architecture, we allow the model to process both precise audio conditions and text-based instructions simultaneously, which the original MusicGen does not do. This enables Instruct-MusicGen to perform a range of editing tasks. In this paper, we focus on a specific set of these tasks: adding, separating, and removing stems. To train Instruct-MusicGen, we synthesize an instructional dataset from the Slakh2100 dataset [18]. The method introduces only 8% additional parameters compared to the original model, and we finetune the model for only 5K steps, which is less than 1% of the training required for a music editing model trained from scratch.
We evaluate Instruct-MusicGen on two datasets: the Slakh test set and the out-of-domain MoisesDB dataset [19]. Our model outperforms existing baselines and achieves performance comparable to models specifically trained for individual tasks. This demonstrates the effectiveness of our approach in leveraging pretrained models for text-to-music editing while maintaining high-quality results.
2. RELATED WORK
Text-based music editing provides a flexible approach to editing music using textual queries. This method is similar to those used in other modalities that require editing, such as image [20, 21] and video [22, 23] editing. In text-to-music editing, text is used to specify precise alterations to existing music compositions. Previous research such as AUDIT [13] and InstructME [10] developed a diffusion model trained with paired music editing data. Additionally, models like M2UGen [12], Loop Copilot [14], MusicAgent [24], ComposerX [25] and WavCraft [26] use large language models (LLMs) for reasoning and regenerate music with external music generation models. Furthermore, GMSDI [27] attempts to model a joint multi-stem distribution of music for text-based generation and separation. Certain models focus exclusively on specific tasks within music editing, such as conditioned generation [6, 7, 9] and separation [28], along with intra-stem editing tasks such as text-based timbre transfer and style transfer [11,29–31].
The task of inter-stem music editing is closely related to stem-wise music generation. Although not directly tied to text-based controls, some research focuses on modeling stem-wise representations to enable simultaneous stem generation and separation. For instance, Jen-1 Composer [32] and MSDM [33] jointly model the distribution of music with four stems using a diffusion model. The abilities of most existing stem-wise music models are restricted to a fixed set of 4 stems, which limits flexibility but enhances controllability. Besides, StemGen [34] trains a LLaMA-based auto-regressive model for flexible stem-wise audio generation.
Our work distinguishes itself from these existing efforts in several key ways. First, rather than developing a new model from scratch or strictly adhering to a fixed set of stems, we leverage the power of a pretrained music language model, MusicGen, and enhance it with instruction tuning. This approach not only reduces the computational cost but also retains the high audio quality of the original MusicGen model. Furthermore, our method introduces minimal additional parameters and requires significantly less training, demonstrating a more efficient and scalable solution for text-based music editing.
3. METHOD
3.1 MusicGen
The original MusicGen consists of three components: (1) the EnCodec [35] audio encoder and decoder, which compress music audio waveforms into latent codes and reconstruct them back into waveforms; (2) a multi-layer transformer architecture that models sequences of latent codes, capturing higher-level music representations and efficiently modeling internal relationships within music audio; and (3) the T5 [36] text encoder, which converts text descriptions into embeddings for text-conditioned generation.
EnCodec employs Residual Vector Quantization (RVQ) [37] to compress audio into tokens using multiple codebooks, where each quantizer encodes the quantization error from the previous one. For a reference audio $X \in \mathbb{R}^{d \cdot f_n}$, where $d$ is the duration and $f_n$ is the sample rate, EnCodec compresses $X$ into $Q \in \{1, \ldots, L\}^{N \times d \cdot f_s}$, where $L$ is the RVQ codebook size, $N$ is the number of codebooks, and $f_s$ is the latent code sample rate ($f_s \ll f_n$). In MusicGen, $N = 4$, $f_n = 32000$, $f_s = 50$, and $L = 2048$. Finally, the transformer models the sequence relationships over latent codes (see footnote 3).
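The residual quantization idea described above can be illustrated with a minimal sketch. It uses toy scalar codebooks, whereas real EnCodec codebooks hold learned vectors; the codebook values and all function names here are hypothetical. It also computes the shape of the token grid for a clip, assuming four codebooks and a latent code rate of 50 Hz.

```python
def rvq_encode(x, codebooks):
    """Residual quantization of a scalar: each stage quantizes the error left by the previous stage."""
    residual = x
    codes = []
    for codebook in codebooks:
        # Pick the codebook entry closest to the current residual.
        index = min(range(len(codebook)), key=lambda i: abs(codebook[i] - residual))
        codes.append(index)
        residual -= codebook[index]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected entry of every codebook to reconstruct the value."""
    return sum(codebook[i] for i, codebook in zip(codes, codebooks))


def token_grid_shape(duration_s, latent_rate_hz=50, num_codebooks=4):
    """Shape (codebooks, frames) of the token grid for a clip of the given duration."""
    return (num_codebooks, int(duration_s * latent_rate_hz))


# Toy codebooks, coarse to fine (illustrative values only, not EnCodec's).
TOY_CODEBOOKS = [[-1.0, 0.0, 1.0], [-0.3, 0.0, 0.3], [-0.1, 0.0, 0.1]]

codes = rvq_encode(0.62, TOY_CODEBOOKS)
approx = rvq_decode(codes, TOY_CODEBOOKS)
print(codes, approx, token_grid_shape(5))
```

Each additional codebook shrinks the reconstruction error, which is why a stack of small codebooks can represent audio frames compactly.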
3.2 Instruct-MusicGen
MusicGen is a text-to-music generation model, capable of generating music audio from a given text prompt. However, MusicGen cannot edit existing music audio. To address this limitation, we introduce Instruct-MusicGen, which transforms MusicGen into a model that can follow editing instructions to modify existing music audio.
Instruct-MusicGen takes a music audio input $X_{\text{cond}}$ and a text instruction $X_{\text{instruct}}$ (e.g., "Add guitar") as inputs. The model then edits the music audio $X_{\text{cond}}$ according to the instruction $X_{\text{instruct}}$ and generates the desired edited music $X_{\text{music}}$. As illustrated in Figure 2, Instruct-MusicGen incorporates two additional modules into the vanilla MusicGen: an audio fusion module and a text fusion module.
3 MusicGen’s EnCodec uses a 50Hz sample rate, which is different from the original 75Hz EnCodec model.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 2 diagram. A Transformer decoder block with positional encoding, masked multi-head self-attention, layer norm, multi-head cross-attention (Q, K, V), FFN and linear layers, operating on the shifted-right outputs. Audio fusion branch: the condition music audio passes through positional encoding, a linear layer and a duplicated multi-head self-attention path. Text fusion branch: the text instruction passes through the T5 encoder into the cross-attention, whose projections carry LoRA adapters.]
Figure 2: Illustration of the fusion mechanism inside the Transformer module of Instruct-MusicGen. The audio fusion module transforms the conditional music audio into embeddings using a duplicated encoder and integrates these embeddings into the MusicGen decoder. The text fusion module modifies the cross-attention mechanism to handle text instructions by finetuning specific layers while keeping the text encoder parameters frozen.
3.2.1 Audio Fusion Module
The audio fusion module enables Instruct-MusicGen to accept external audio inputs, which is inspired by LLaMA-Adapter [16] and Coco-mulla [6]. The lower part of Figure 2 illustrates the audio fusion module. Initially, we convert $X_{\text{cond}}$ into EnCodec tokens, followed by re-encoding these tokens into the embedding $z^{\text{cond}}$ through the pretrained embedding layers of MusicGen. Similarly, we transform $X_{\text{music}}$ into the pretrained embedding $z^{\text{music}}$.
The module begins with duplicating self-attention modules of the pretrained MusicGen model to extract latent representations of $z^{\text{cond}}$. Given that MusicGen consists of $M$ layers, we denote
$$Z^{\text{cond}} = \{z^{\text{cond}}_0, z^{\text{cond}}_1, \ldots, z^{\text{cond}}_M\}, \tag{1}$$
$$Z^{\text{music}} = \{z^{\text{music}}_0, z^{\text{music}}_1, \ldots, z^{\text{music}}_M\}, \tag{2}$$
which represent the hidden states of $X_{\text{cond}}$ and $X_{\text{music}}$ respectively. Note that we use a learnable input embedding as $z^{\text{cond}}_0$ and initialize $z^{\text{music}}_0$ with $z^{\text{music}}$.
We compute the vanilla self-attention of $X_{\text{music}}$ as follows:
$$Q^{\text{music}}_l, K^{\text{music}}_l, V^{\text{music}}_l = \text{QKV-projector}(z^{\text{music}}_l), \tag{3}$$
$$o^{\text{music}}_l = \text{SelfAttn}(Q^{\text{music}}_l, K^{\text{music}}_l, V^{\text{music}}_l). \tag{4}$$
We project $z^{\text{cond}}$ to a high-dimension representation $h$ via a linear layer $f_l$ and learnable positional encoding $e_l$,
$$h = f_l(z^{\text{cond}}) + e_l. \tag{5}$$
Then, we compute the $(l+1)$-th layer hidden states of $X_{\text{cond}}$ as follows:
$$Q^{\text{cond}}_l, K^{\text{cond}}_l, V^{\text{cond}}_l = \text{QKV-projector}(z^{\text{cond}}_l + h), \tag{6}$$
$$z^{\text{cond}}_{l+1} = \text{SelfAttn}(Q^{\text{cond}}_l, K^{\text{cond}}_l, V^{\text{cond}}_l). \tag{7}$$
To fuse information of $X_{\text{cond}}$ into $X_{\text{music}}$, we compute the cross-attention between them,
$$s^{\text{music}}_l = \text{CrossAttn}(Q^{\text{music}}_l + Q^{\text{cond}}_l, K^{\text{cond}}_l, V^{\text{cond}}_l). \tag{8}$$
Finally, the attention output of $X_{\text{music}}$ is updated as follows,
$$s'_l = o^{\text{music}}_l + g_l \cdot s^{\text{music}}_l, \tag{9}$$
$$z^{\text{music}}_{l+1} = \text{TextFusion}(s'_l, X_{\text{instruct}}), \tag{10}$$
where $g$ is a zero-initialized learnable gating factor.
Thus, the total trainable parameters in Instruct-MusicGen include the input embedding $z^{\text{cond}}_0$, linear layers $f_l$, learnable position embeddings $e_l$, learnable gating factors $g$, and learnable parameters in the text fusion module.
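The gated update in Equations (8) and (9) can be sketched in a few lines. The version below is a simplified, single-head illustration on plain Python lists, without masking, layer norm or learned projections; the function names and toy values are assumptions made for this example and are not taken from the released code.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (single head, no masking)."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def gated_audio_fusion(o_music, q_music, q_cond, k_cond, v_cond, gate):
    """s' = o_music + gate * CrossAttn(q_music + q_cond, k_cond, v_cond), cf. Eqs. (8)-(9)."""
    fused_q = [[a + b for a, b in zip(qm, qc)] for qm, qc in zip(q_music, q_cond)]
    s_music = attention(fused_q, k_cond, v_cond)
    return [[o + gate * s for o, s in zip(o_row, s_row)]
            for o_row, s_row in zip(o_music, s_music)]


# Two time steps with two-dimensional features (toy values).
o_music = [[0.5, -0.2], [0.1, 0.3]]
q_music = [[1.0, 0.0], [0.0, 1.0]]
q_cond = [[0.2, 0.1], [0.0, 0.4]]
k_cond = [[1.0, 0.0], [0.0, 1.0]]
v_cond = [[1.0, 2.0], [3.0, 4.0]]

at_init = gated_audio_fusion(o_music, q_music, q_cond, k_cond, v_cond, gate=0.0)
after_training = gated_audio_fusion(o_music, q_music, q_cond, k_cond, v_cond, gate=0.5)
```

With the gate at zero the output equals the vanilla self-attention output, so finetuning starts from the unmodified pretrained behaviour and the condition audio is blended in only as the gate is learned.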
3.2.2 Text Fusion Module
To replace the text description input with instruction input, we modify the behavior of the current text encoder. We achieve this by finetuning only the cross-attention module between the text embedding and the music representations while keeping the text encoder’s parameters frozen.
The instruction is embedded and encoded by the T5 text encoder as $z^{\text{instruct}} = \text{T5}(X_{\text{instruct}})$. For efficient finetuning of the cross-attention module, we apply LoRA to the query and value projection layers. Thus, we expand Equation 10 as follows,
$$Q_l, K^{\text{instruct}}_l, V^{\text{instruct}}_l = \text{QKV-Lora}(s'_l, z^{\text{instruct}}), \tag{11}$$
$$z^{\text{music}}_{l+1} = \text{CrossAttn}(Q_l, K^{\text{instruct}}_l, V^{\text{instruct}}_l). \tag{12}$$
During finetuning, only query and value projection layers are trainable in the text fusion module.
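To make the LoRA idea concrete, here is a minimal sketch of a frozen projection with a trainable low-rank update, as would be applied to a query or value projection. The rank, the scaling factor `alpha` and the deterministic initialisation of `A` are assumptions for illustration (LoRA normally initialises `A` randomly); the paper does not state its LoRA hyperparameters.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained projection
        self.scale = alpha / rank
        # A gets small non-zero values; B starts at zero so the update is initially zero.
        self.A = [[0.01 * (i + j + 1) for j in range(d_in)] for i in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


# A 4x4 identity projection adapted with rank 1 (toy example).
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(W, rank=1)
x = [1.0, 2.0, 3.0, 4.0]
before = layer(x)            # identical to the frozen projection at initialisation
layer.B = [[1.0], [0.0], [0.0], [0.0]]  # pretend training moved B away from zero
after = layer(x)
```

Here the adapter trains 8 values instead of the 16 in the full matrix; the saving grows quickly with the layer width because the trainable count scales with `rank * (d_in + d_out)`.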
4. EXPERIMENTS
We conduct both subjective experiments and objective experiments for evaluation, and also provide example spectrograms in Figure 3.
4.1 Objective Experiments
4.1.1 Dataset
For our objective evaluations, we utilise two distinct datasets, each serving a specific purpose in assessing both in-domain and out-of-domain performance capabilities of various models.
1. Slakh2100 dataset [18]. The Synthesized Lakh (Slakh) Dataset, originally derived from the Lakh MIDI Dataset v0.1, comprises audio tracks synthesised using high-quality sample-based virtual instruments. This dataset features 2100 tracks complete with corresponding MIDI files.
2. MoisesDB dataset [19]. The MoisesDB dataset includes 240 real audio tracks sourced from 45 diverse artists spanning twelve musical genres. Uniquely, MoisesDB organises its tracks into a detailed two-level hierarchical taxonomy of stems, offering a varied number of stems per track, each annotated with textual descriptions.
The rationale for selecting two datasets lies in their diverse configurations and common applications. While the Slakh dataset is traditionally utilised for training models tailored to a four-stem arrangement, our model, Instruct-MusicGen, although initially trained on this dataset, is designed to generalise to various stem configurations. Conversely, models such as InstructME and AUDIT are trained on private or larger, more diverse datasets. By employing both Slakh2100 and MoisesDB, we ensure a comprehensive evaluation, allowing us to fairly compare the adaptability and performance of different models under varying conditions of data familiarity and complexity.
4.1.2 Data Preprocessing
We utilised the Slakh2100 dataset to construct an instruction-based dataset for our experiments, employing the following pipeline:
• A data point was randomly selected from the Slakh training dataset.
• An instruction was chosen from a predefined set {add, remove, extract} along with a target stem. Subsequently, n other stems were selected from the remaining stems.
• An offset was randomly determined to cut a 5-second audio clip. If the target stem contained more than 50% silence, a different offset was selected.
• The stems were mixed according to the specified instructions to create a triplet consisting of {instruction text, condition audio input, audio ground truth}.
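The pipeline above can be sketched as follows. This is an illustrative reading and not the authors' code: how stems are mixed for each instruction, the instruction text template, the silence threshold and the retry limit are all assumptions, and audio is represented as plain lists of samples with at least one background stem.

```python
import random


def mix(stems):
    """Sum equally long mono stems sample by sample."""
    return [sum(samples) for samples in zip(*stems)]


def silence_ratio(samples, threshold=1e-4):
    """Fraction of samples whose magnitude is below the (hypothetical) threshold."""
    return sum(abs(s) < threshold for s in samples) / len(samples)


def build_triplet(stems, instruction, target, others):
    """Return (instruction text, condition audio, ground-truth audio) for one example."""
    background = [stems[name] for name in others]
    full_mix = mix(background + [stems[target]])
    if instruction == "add":
        condition, truth = mix(background), full_mix
    elif instruction == "remove":
        condition, truth = full_mix, mix(background)
    elif instruction == "extract":
        condition, truth = full_mix, stems[target]
    else:
        raise ValueError(f"unknown instruction: {instruction}")
    return f"{instruction} {target}", condition, truth


def sample_offset(num_samples, clip_len, target_stem, rng, max_tries=100):
    """Draw random clip offsets until the target stem is at most 50% silent."""
    for _ in range(max_tries):
        offset = rng.randrange(0, num_samples - clip_len + 1)
        if silence_ratio(target_stem[offset:offset + clip_len]) <= 0.5:
            return offset
    return None


# Toy four-sample stems standing in for 5-second clips.
stems = {
    "drums": [1.0, 0.0, 1.0, 0.0],
    "bass": [0.5, 0.5, 0.5, 0.5],
    "piano": [0.0, 0.25, 0.0, 0.25],
}
text, condition, truth = build_triplet(stems, "add", "drums", ["bass", "piano"])
```

For "add", the condition is the mix without the target stem and the ground truth includes it; "remove" swaps the two, and "extract" uses the isolated target stem as ground truth.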
4.1.3 Experimental Setup
For the finetuning of MusicGen, we jointly trained the audio fusion module and the text fusion module. The optimisation process utilised the AdamW optimiser, with a learning rate set at $5 \times 10^{-3}$. We use L2 loss over latent tokens as the training objective. Training incorporated a Cosine Annealing schedule with an initial warmup of 100 steps. The training regimen extended over 5,000 steps with an accumulated batch size of 32, achieved through setting the batch size to 8 and using gradient accumulation over 4 iterations. The finetuning process was executed on a single NVIDIA A100 GPU and was completed within two days.
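The schedule and batch arithmetic described above can be sketched as a small function. The shape of the warmup (linear from zero) and the final learning rate (zero) are assumptions; the text only specifies a cosine annealing schedule, 100 warmup steps, 5,000 total steps and a peak rate of 5×10⁻³.

```python
import math

PEAK_LR = 5e-3
WARMUP_STEPS = 100
TOTAL_STEPS = 5000
MICRO_BATCH = 8
ACCUM_STEPS = 4


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine annealing towards zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# Gradients from 4 micro-batches of 8 are accumulated before each optimizer step.
effective_batch = MICRO_BATCH * ACCUM_STEPS

for step in (0, 50, 100, 2550, 5000):
    print(step, learning_rate(step))
```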
4.1.4 Baselines
In this section, we explore two baseline models, each distinguished by their unique methodologies for handling audio data.
1. AUDIT [13]: AUDIT is an instruction-guided audio editing model, consisting of a variational autoencoder (VAE) for converting input audio into a latent space representation, a T5 text encoder for processing edit instructions, and a diffusion network that performs the actual audio editing in the latent space. The system accepts mel-spectrograms of input audio and edit instructions, and generates the edited audio as output.
2. M2UGen [12]: The M2UGen framework leverages large language models to comprehend and generate music across various modalities, integrating abilities from external models such as MusicGen [2] and AudioLDM 2 [4]. It is designed to stimulate creative outputs from diverse sources, showcasing robust performance in multi-modal music generation.
Besides, InstructME can also perform instruction-guided music editing and remixing with latent diffusion models. We exclude it from comparison because InstructME’s model weights and evaluation protocol are not publicly released.
Model | Trainable params | Total params | Dataset | Hours (h) | Steps
AUDIT | 942M | 1.5B | Multiple | ∼6500 | 0.5M
InstructME | 967M | 1.7B | Multiple | 417 | 2M
M2UGen | 637M | ∼9B | MUEdit | 60.22 | -
Ours | 264M | 3.5B | Slakh | 145 | 5K
Table 1: Comparison of different models, where the param size numbers are trainable parameters and total parameters respectively. Our model has the fewest trainable parameters, and only requires 5K training steps.
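The efficiency figures quoted in the text can be checked against Table 1 with a little arithmetic. The sketch below uses only the numbers in the table; treating the total parameter count as "frozen base plus trainable parameters" is an assumption made here to relate the table to the ∼8% figure.

```python
# (trainable parameters, total parameters, training steps) as listed in Table 1.
TABLE_1 = {
    "AUDIT": (942e6, 1.5e9, 0.5e6),
    "InstructME": (967e6, 1.7e9, 2e6),
    "M2UGen": (637e6, 9e9, None),
    "Ours": (264e6, 3.5e9, 5e3),
}


def trainable_fraction(model):
    """Trainable parameters as a share of the total parameter count."""
    trainable, total, _ = TABLE_1[model]
    return trainable / total


def added_fraction_of_base(model):
    """New parameters relative to the frozen base, assuming total = base + trainable."""
    trainable, total, _ = TABLE_1[model]
    return trainable / (total - trainable)


def step_ratio(model, reference):
    """Training steps of one model relative to another."""
    return TABLE_1[model][2] / TABLE_1[reference][2]


print(round(added_fraction_of_base("Ours"), 3))  # new parameters relative to the base model
print(step_ratio("Ours", "AUDIT"), step_ratio("Ours", "InstructME"))
```

Under that assumption the added parameters come to roughly 8% of the base model, and 5K steps correspond to 1% of AUDIT's and 0.25% of InstructME's reported training steps.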
4.1.5 Metrics
The metrics to evaluate model performance are listed below.
1. Fréchet Audio Distance (FAD) [38] (footnote 4) measures the similarity between two sets of audio files by comparing multivariate Gaussian distributions fitted to feature embeddings from the audio data. We use the FAD score to evaluate the overall audio quality of the predicted music.
2. CLAP Score (CLAP) [39] (footnote 5) is used in our experiments to measure the correspondence between the edited music and a target text. For the removal task, the target text is generated by deleting the name of the removed instrument from the original text.
3. Kullback-Leibler Divergence (KL) (footnote 6) assesses the difference between the probability distributions of audio features from two sources, indicating information loss when approximating one distribution with another. A low KL score indicates the predicted music shares similar features with the ground truth.
4. Structural Similarity (SSIM) [40] is an image quality metric that we adapt to evaluate structural similarity between predicted music and ground truth.
5. Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) [41] quantifies audio quality, especially in source separation tasks. It is scale-invariant, useful for varying audio volumes, and measures distortion relative to a reference signal. We use SI-SDR to evaluate the signal loss of the predicted audio.
6. Scale-Invariant Signal-to-Distortion Ratio improvement (SI-SDRi) [42] extends SI-SDR, measuring the improvement in signal-to-distortion ratio before and after processing. It is commonly used in audio enhancement and separation contexts.
4 https://github.com/gudgud96/frechet-audio-distance.
5 https://github.com/LAION-AI/CLAP.
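Of the metrics listed above, SI-SDR and SI-SDRi are simple enough to write out directly. The following is a minimal sketch of the standard definitions on plain Python lists; it is not the evaluation code used in the paper, and the small `eps` term is an assumption added for numerical safety.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB for two equally long signals."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)  # optimal scaling of the reference
    target = [alpha * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the processed signal minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals: the interference is orthogonal to the reference.
reference = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture = [r + i for r, i in zip(reference, interference)]
estimate = [r + 0.1 * i for r, i in zip(reference, interference)]

improvement = si_sdr_improvement(estimate, mixture, reference)
```

Because the reference is rescaled to best match the estimate before the error is measured, multiplying the estimate by a constant leaves the score unchanged, while any sample-level mismatch still lowers it.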
To further investigate whether the model successfully adds, removes or extracts the instrument, we propose the P-Demucs score to evaluate the model performance. This metric specifically focuses on detecting the presence of a newly added instrument in the generated audio. It leverages the Demucs model, a source separation model, to isolate the target instrument from the audio. After separation, the root-mean-square energy (RMSE) of the isolated track is analyzed. For example, if the instruction is to "add guitar," a non-silent guitar track is regarded as a successful edit.
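The energy check at the heart of this score can be sketched as below. The sketch starts from a stem that has already been isolated by a separation model and does not call Demucs itself; the silence threshold and the aggregation into a fraction of successful examples are assumptions, since the text does not specify them.

```python
import math


def rms_energy(samples):
    """Root-mean-square energy of a mono signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def stem_is_present(isolated_stem, threshold=1e-3):
    """True when the separated target stem is non-silent (threshold is a hypothetical choice)."""
    return rms_energy(isolated_stem) > threshold


def presence_score(isolated_stems, threshold=1e-3):
    """Fraction of examples whose isolated target stem is judged non-silent."""
    hits = sum(stem_is_present(stem, threshold) for stem in isolated_stems)
    return hits / len(isolated_stems)


# Two toy "isolated guitar" tracks: one silent, one clearly audible.
silent_track = [0.0, 0.0, 0.0, 0.0]
audible_track = [0.1, -0.1, 0.1, -0.1]
score = presence_score([silent_track, audible_track])
```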
4.1.6 Objective Experiment Results
Our evaluation of Instruct-MusicGen demonstrates its superior performance across various tasks compared to existing text-to-music editing baselines (AUDIT and M2UGen). On the Slakh dataset (Table 2), Instruct-MusicGen excelled in adding, removing, and extracting stems, achieving the lowest Fréchet Audio Distance (FAD) and highest CLAP and SSIM scores. It also significantly improved the signal-to-distortion ratio (SI-SDR) in the removal task, showing balanced performance across all metrics and proving its robustness in various editing scenarios. Similarly, in the MoisesDB dataset evaluations (Table 3), Instruct-MusicGen demonstrated strong performance, with the best performance on most metrics over the three tasks.
We find that all models exhibit negative SI-SDR and SI-SDRi scores, which is a common occurrence when evaluating generative models on a signal level. These metrics are typically designed for source separation tasks and are not entirely fair to generative models, as they penalise even minor discrepancies between the generated and original signals. Generative models like Instruct-MusicGen often focus on producing perceptually plausible audio rather than perfectly matching the original signal at a technical level.
6 https://github.com/haoheliu/audioldm_eval.
4.2 Subjective Experiments
4.2.1 Experimental Setup
We conducted a subjective listening test to evaluate the model’s performance (footnote 7). This test involved disseminating an online survey within the Music Information Retrieval (MIR) community and our broader research network, which resulted in the collection of 30 complete responses. The gender distribution of the participants was 23 males (76.7%) and 7 females (23.3%). Regarding professional musical education experience, 4 participants (13.3%) had less than 1 year of experience, 13 (43.3%) had between 1 and 5 years, and 13 participants (43.3%) had more than 5 years of experience. For the data preparation, we randomly selected a subset of data points from the objective test dataset. Specifically, 6 audio samples were chosen, comprising 2 audio samples for each subtask (add, remove, extract). Each data point included results from the baseline models, our model, and the ground truth from the dataset.
4.2.2 Metrics
1. Instruction Adherence (IA) assesses how accurately the generated music follows the given editing instruction. In this experiment, participants rate the generated music on a scale from 1 to 5, where 1 indicates that the instruction was not followed at all, and 5 indicates that the instruction was followed perfectly. For example, if the instruction is "Remove Drums," a rating of 1 would mean that the drums were not removed at all, while a rating of 5 would mean that the drums were completely removed.
2. Audio Quality (AQ) evaluates the overall audio quality of the generated music in comparison to the original music. Participants rate the audio quality on a scale from 1 to 5, where 1 represents very poor quality with significant degradation compared to the original music, and 5 represents excellent quality, as good as or better than the original music. This metric helps in understanding how the editing process affects the overall sound quality of the music.
4.2.3 Subjective Experiment Results
The results of our subjective experiments are summarised in Table 4. We conducted two paired t-tests with Bonferroni correction, setting the significance level at α = 0.05. The results show that our model demonstrates a significant improvement in both Instruction Adherence (IA) and Audio Quality (AQ) compared to the baseline models, AUDIT and M2UGen (footnote 8).
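For readers unfamiliar with the test procedure, the sketch below computes a paired t statistic and the Bonferroni-adjusted threshold. The ratings are made-up toy values, not data from the listening study, and which two comparisons the tests covered is not spelled out in the text; obtaining a p-value from the statistic additionally requires the t distribution, which is left out here.

```python
import math
import statistics


def paired_t_statistic(ratings_a, ratings_b):
    """t statistic and degrees of freedom for paired samples."""
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error, len(diffs) - 1


def bonferroni_alpha(alpha, num_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / num_tests


# Toy ratings from four hypothetical listeners (not study data).
ratings_ours = [4, 5, 3, 4]
ratings_baseline = [2, 3, 2, 2]

t_value, dof = paired_t_statistic(ratings_ours, ratings_baseline)
per_test_alpha = bonferroni_alpha(0.05, 2)
```

With two tests, each individual comparison must reach p < 0.025 for the overall error rate to stay at 0.05.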
7 This subjective test was approved by the ethics committee of Sony.
8 More audio samples can be found at https://bit.ly/instruct-musicgen.

Task | Model | FAD↓ | CLAP↑ | KL↓ | SSIM↑ | P-Demucs↑ | SI-SDR↑ | SI-SDRi↑
Add | AUDIT | 6.88 | 0.12 | 1.02 | 0.21 | 0.53 | - | -
Add | M2UGen | 7.24 | 0.22 | 0.99 | 0.20 | 0.43 | - | -
Add | Ours | 3.75 | 0.23 | 0.67 | 0.26 | 0.80 | - | -
Remove | AUDIT | 15.48 | 0.07 | 2.75 | 0.35 | 0.33 | -45.60 | -47.28
Remove | M2UGen | 8.26 | 0.09 | 1.59 | 0.23 | 0.70 | -44.20 | -46.13
Remove | Ours | 3.35 | 0.12 | 0.66 | 0.45 | 0.76 | -2.09 | -3.77
Extract | AUDIT | 15.08 | 0.06 | 2.38 | 0.42 | 0.61 | -52.90 | -50.16
Extract | M2UGen | 8.14 | 0.11 | 2.15 | 0.31 | 0.60 | -46.38 | -43.53
Extract | Ours | 3.24 | 0.12 | 0.54 | 0.52 | 0.75 | -9.00 | -6.15
Table 2: Comparison of text-based music editing models on the Slakh dataset (4 stems).

Task | Model | FAD↓ | CLAP↑ | KL↓ | SSIM↑ | P-Demucs↑ | SI-SDR↑ | SI-SDRi↑
Add | AUDIT | 4.06 | 0.12 | 0.84 | 0.21 | 0.50 | - | -
Add | M2UGen | 5.00 | 0.18 | 0.83 | 0.20 | 0.45 | - | -
Add | Ours | 3.79 | 0.18 | 0.35 | 0.35 | 0.77 | - | -
Remove | AUDIT | 10.72 | 0.10 | 2.46 | 0.34 | 0.41 | -44.32 | -57.10
Remove | M2UGen | 3.75 | 0.13 | 1.27 | 0.19 | 0.72 | -43.94 | -56.73
Remove | Ours | 5.05 | 0.10 | 0.84 | 0.34 | 0.78 | -13.70 | -26.48
Extract | AUDIT | 6.67 | 0.07 | 1.97 | 0.45 | 0.60 | -54.53 | -56.17
Extract | M2UGen | 5.74 | 0.08 | 1.91 | 0.25 | 0.52 | -42.84 | -44.49
Extract | Ours | 4.96 | 0.11 | 1.36 | 0.40 | 0.78 | -21.39 | -23.03
Table 3: Comparison of text-based music editing models on the MoisesDB dataset.

[Figure 3 spectrogram panels: (a) Input music. (b) Edited music output. (c) Ground truth.]
Figure 3: Spectrograms when Instruct-MusicGen removes the drum stem.

Model | Instruction Adherence↑ | Audio Quality↑
AUDIT | 1.54 | 2.56
M2UGen | 1.70 | 1.92
Ours | 3.85 | 3.55
Ground truth | 4.36 | 4.21
Table 4: The subjective experiment results.

5. CONCLUSION
In this paper, we introduced Instruct-MusicGen, a novel approach to text-to-music editing that fosters joint musical and textual controls. By finetuning the existing MusicGen model with instruction tuning, Instruct-MusicGen demonstrated its capability of editing music in various ways, including adding, removing and extracting a stem from music audio using textual queries, without the need for training specialised models from scratch. Also, it outperforms various baseline models that are dedicated to specific music editing tasks. Furthermore, our method uses significantly fewer resources than previous models, requiring the tuning of new parameters that amount to only about 8% of the parameters of the original MusicGen.
6. ACKNOWLEDGEMENTS
This work was done during Yixiao Zhang’s internship at Sony AI. Yixiao Zhang was a research student at the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, supported jointly by the China Scholarship Council, Queen Mary University of London and Apple Inc.
7. REFERENCES
[1] A. Agostinelli, T. I. Denk, Z. Borsos, J. H. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. H. Frank, “MusicLM: Generating music from text,” CoRR, vol. abs/2301.11325, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2301.11325
[2] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., 2023. [Online]. Available: http://papers.nips.cc/paper_files/paper/2023/hash/94b472a1842cd7c56dcb125fb2765fbd-Abstract-Conference.html
[3] P. Li, B. Chen, Y. Yao, Y. Wang, A. Wang, and A. Wang, “JEN-1: Text-guided universal music generation with omnidirectional diffusion models,” CoRR, vol. abs/2308.04729, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2308.04729
[4] H. Liu, Q. Tian, Y. Yuan, X. Liu, X. Mei, Q. Kong, Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley, “AudioLDM 2: Learning holistic audio generation with self-supervised pretraining,” CoRR, vol. abs/2308.05734, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2308.05734
[5] K. Chen, Y. Wu, H. Liu, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov, “MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,” CoRR, vol. abs/2308.01546, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2308.01546
[6] L. Lin, G. Xia, J. Jiang, and Y. Zhang, “Content-based controls for music large language modeling,” CoRR, vol. abs/2310.17162, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2310.17162
[7] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan, “Music controlnet: Multiple time-varying controls for music generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2692–2703, 2024.
[8] J. Melechovsky, Z. Guo, D. Ghosal, N. Majumder, D. Herremans, and S. Poria, “Mustango: Toward controllable text-to-music generation,” arXiv preprint arXiv:2311.08355, 2023.
[9] L. Lin, G. Xia, Y. Zhang, and J. Jiang, “Arrange, inpaint, and refine: Steerable long-term music audio generation and editing via content-based controls,” CoRR, vol. abs/2402.09508, 2024. [Online]. Available: https://doi.org/10.48550/arxiv.2402.09508
[10] B. Han, J. Dai, X. Song, W. Hao, X. He, D. Guo, J. Chen, Y. Wang, and Y. Qian, “InstructME: An instruction guided music edit and remix framework with latent diffusion models,” CoRR, vol. abs/2308.14360, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2308.14360
[11] Y. Zhang, Y. Ikemiya, G. Xia, N. Murata, M. A. M. Ramírez, W. Liao, Y. Mitsufuji, and S. Dixon, “MusicMagus: Zero-shot text-to-music editing via diffusion models,” CoRR, vol. abs/2402.06178, 2024. [Online]. Available: https://doi.org/10.48550/arxiv.2402.06178
[12] A. S. Hussain, S. Liu, C. Sun, and Y. Shan, “M2UGen: Multi-modal music understanding and generation with the power of large language models,” CoRR, vol. abs/2311.11255, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2311.11255
[13] Y. Wang, Z. Ju, X. Tan, L. He, Z. Wu, J. Bian, and S. Zhao, “AUDIT: Audio editing by following instructions with latent diffusion models,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., 2023. [Online]. Available: http://papers.nips.cc/paper_files/paper/2023/hash/e1b619a9e241606a23eb21767f16cf81-Abstract-Conference.html
[14] Y. Zhang, A. Maezawa, G. Xia, K. Yamamoto, and S. Dixon, “Loop Copilot: Conducting AI ensembles for music generation and iterative editing,” CoRR, vol. abs/2310.12404, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2310.12404
[15] D. Yang, J. Tian, X. Tan, R. Huang, S. Liu, X. Chang, J. Shi, S. Zhao, J. Bian, X. Wu, Z. Zhao, S. Watanabe, and H. Meng, “UniAudio: An audio foundation model toward universal audio generation,” CoRR, vol. abs/2310.00704, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2310.00704
[16] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and Y. Qiao, “LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention,” CoRR, vol. abs/2303.16199, 2023. [Online]. Available: https://doi.org/10.48550/arxiv.2303.16199
[17] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” CoRR, vol. abs/2106.09685, 2021. [Online]. Available: https://arxiv.org/abs/2106.09685
[18] E. Manilow, G. Wiche n, P. See ha aman, and J. L.
Roux, “Cu ing music sou ce sepa a ion some Slakh:
A da ase o s udy he impac o aining da a quali y
and quan i y,” in 2019 IEEE Wo kshop on Applica ions
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
334
o Signal P ocessing o Audio and Acous ics, WASPAA
2019, New Pal z, NY, USA, Oc obe 20-23, 2019.
IEEE, 2019, pp. 45–49. [Online]. A ailable: h ps:
//doi.o g/10.1109/WASPAA.2019.8937170
[19] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl,
“MoisesDB: A dataset for source separation beyond
4-stems,” in Proceedings of the 24th International
Society for Music Information Retrieval Conference,
ISMIR 2023, Milan, Italy, November 5-9, 2023,
A. Sarti, F. Antonacci, M. Sandler, P. Bestagini,
S. Dixon, B. Liang, G. Richard, and J. Pauwels,
Eds., 2023, pp. 619–626. [Online]. Available:
https://doi.org/10.5281/zenodo.10265363
[20] T. Brooks, A. Holynski, and A. A. Efros,
“InstructPix2Pix: Learning to follow image editing
instructions,” in IEEE/CVF Conference on
Computer Vision and Pattern Recognition, CVPR
2023, Vancouver, BC, Canada, June 17-24, 2023.
IEEE, 2023, pp. 18 392–18 402. [Online]. Available:
https://doi.org/10.1109/CVPR52729.2023.01764
[21] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction
tuning,” in Advances in Neural Information Processing
Systems 36: Annual Conference on Neural Information
Processing Systems 2023, NeurIPS 2023, New
Orleans, LA, USA, December 10 - 16, 2023, A. Oh,
T. Naumann, A. Globerson, K. Saenko, M. Hardt,
and S. Levine, Eds., 2023. [Online]. Available:
http://papers.nips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html
[22] W. Chai, X. Guo, G. Wang, and Y. Lu,
“StableVideo: Text-driven consistency-aware
diffusion video editing,” in IEEE/CVF International
Conference on Computer Vision, ICCV
2023, Paris, France, October 1-6, 2023. IEEE,
2023, pp. 22 983–22 993. [Online]. Available:
https://doi.org/10.1109/ICCV51070.2023.02106
[23] D. Ceylan, C. P. Huang, and N. J. Mitra, “Pix2Video:
Video editing using image diffusion,” in IEEE/CVF
International Conference on Computer Vision, ICCV
2023, Paris, France, October 1-6, 2023. IEEE,
2023, pp. 23 149–23 160. [Online]. Available:
https://doi.org/10.1109/ICCV51070.2023.02121
[24] D. Yu, K. Song, P. Lu, T. He, X. Tan, W. Ye, S. Zhang,
and J. Bian, “MusicAgent: An AI agent for music
understanding and generation with large language
models,” in Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing,
EMNLP 2023 - System Demonstrations, Singapore,
December 6-10, 2023, Y. Feng and E. Lefever,
Eds. Association for Computational Linguistics,
2023, pp. 246–255. [Online]. Available:
https://doi.org/10.18653/v1/2023.emnlp-demo.21
[25] Q. Deng, Q. Yang, R. Yuan, Y. Huang, Y. Wang, X. Liu,
Z. Tian, J. Pan, G. Zhang, H. Lin et al., “ComposerX:
Multi-agent symbolic music composition with LLMs,”
arXiv preprint arXiv:2404.18081, 2024.
[26] J. Liang, H. Zhang, H. Liu, Y. Cao, Q. Kong, X. Liu,
W. Wang, M. D. Plumbley, H. Phan, and E. Benetos,
“WavCraft: Audio editing and generation with large
language models,” in ICLR 2024 Workshop on Large
Language Model (LLM) Agents, 2024.
[27] E. Postolache, G. Mariani, L. Cosmo, E. Benetos,
and E. Rodolà, “Generalized multi-source inference
for text conditioned music diffusion models,” CoRR,
vol. abs/2403.11706, 2024. [Online]. Available:
https://doi.org/10.48550/arxiv.2403.11706
[28] X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan,
Y. Liu, R. Xia, Y. Wang, M. D. Plumbley,
and W. Wang, “Separate anything you describe,”
CoRR, vol. abs/2308.05037, 2023. [Online]. Available:
https://doi.org/10.48550/arxiv.2308.05037
[29] H. Manor and T. Michaeli, “Zero-shot unsupervised
and text-based audio editing using DDPM inversion,”
CoRR, vol. abs/2402.10009, 2024. [Online]. Available:
https://doi.org/10.48550/arxiv.2402.10009
[30] S. Li, Y. Zhang, F. Tang, C. Ma, W. Dong, and C. Xu,
“Music style transfer with time-varying inversion of
diffusion models,” in Thirty-Eighth AAAI Conference
on Artificial Intelligence, AAAI 2024, Thirty-Sixth
Conference on Innovative Applications of Artificial
Intelligence, IAAI 2024, Fourteenth Symposium on
Educational Advances in Artificial Intelligence, EAAI
2024, February 20-27, 2024, Vancouver, Canada,
M. J. Wooldridge, J. G. Dy, and S. Natarajan, Eds.
AAAI Press, 2024, pp. 547–555. [Online]. Available:
https://doi.org/10.1609/aaai.v38i1.27810
[31] F.-D. Tsai, S.-L. Wu, H. Kim, B.-Y. Chen,
H.-C. Cheng, and Y.-H. Yang, “Audio prompt
adapter: Unleashing music editing abilities for
text-to-music with lightweight finetuning,” arXiv preprint
arXiv:2407.16564, 2024.
[32] Y. Yao, P. Li, B. Chen, and A. Wang,
“JEN-1 Composer: A unified framework for
high-fidelity multi-track music generation,” CoRR,
vol. abs/2310.19180, 2023. [Online]. Available:
https://doi.org/10.48550/arxiv.2310.19180
[33] G. Mariani, I. Tallini, E. Postolache, M. Mancusi,
L. Cosmo, and E. Rodolà, “Multi-source diffusion
models for simultaneous music generation and
separation,” CoRR, vol. abs/2302.02257, 2023. [Online].
Available: https://doi.org/10.48550/arxiv.2302.02257
[34] J. D. Parker, J. Spijkervet, K. Kosta, F. Yesiler,
B. Kuznetsov, J. Wang, M. Avent, J. Chen, and D. Le,
“StemGen: A music generation model that listens,”
CoRR, vol. abs/2312.08723, 2023. [Online]. Available:
https://doi.org/10.48550/arxiv.2312.08723
[35] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi,
“High fidelity neural audio compression,” CoRR, vol.
abs/2210.13438, 2022. [Online]. Available:
https://doi.org/10.48550/arxiv.2210.13438
[36] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
M. Matena, Y. Zhou, W. Li, and P. J. Liu,
“Exploring the limits of transfer learning with a
unified text-to-text transformer,” J. Mach. Learn. Res.,
vol. 21, pp. 140:1–140:67, 2020. [Online]. Available:
http://jmlr.org/papers/v21/20-074.html
[37] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund,
and M. Tagliasacchi, “SoundStream: An end-to-end
neural audio codec,” IEEE ACM Trans. Audio
Speech Lang. Process., vol. 30, pp. 495–507, 2022.
[Online]. Available: https://doi.org/10.1109/TASLP.2021.3129994
[38] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi,
“Fréchet audio distance: A reference-free metric
for evaluating music enhancement algorithms,” in
Interspeech 2019, 20th Annual Conference of the
International Speech Communication Association,
Graz, Austria, 15-19 September 2019, G. Kubin
and Z. Kacic, Eds. ISCA, 2019, pp. 2350–2354.
[Online]. Available: https://doi.org/10.21437/Interspeech.2019-2219
[39] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick,
and S. Dubnov, “Large-scale contrastive
language-audio pretraining with feature fusion and
keyword-to-caption augmentation,” in IEEE International
Conference on Acoustics, Speech and Signal
Processing ICASSP 2023, Rhodes Island, Greece, June
4-10, 2023. IEEE, 2023, pp. 1–5. [Online]. Available:
https://doi.org/10.1109/ICASSP49357.2023.10095969
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P.
Simoncelli, “Image quality assessment: From error
visibility to structural similarity,” IEEE Transactions
on Image Processing, vol. 13, no. 4, pp. 600–612,
2004. [Online]. Available: https://doi.org/10.1109/TIP.2003.819861
[41] J. L. Roux, S. Wisdom, H. Erdogan, and J. R.
Hershey, “SDR - half-baked or well done?” in IEEE
International Conference on Acoustics, Speech and
Signal Processing, ICASSP 2019, Brighton, United
Kingdom, May 12-17, 2019. IEEE, 2019, pp.
626–630. [Online]. Available: https://doi.org/10.1109/ICASSP.2019.8683855
[42] Y. Z. Isik, J. L. Roux, Z. Chen, S. Watanabe,
and J. R. Hershey, “Single-channel multi-speaker
separation using deep clustering,” in Interspeech
2016, 17th Annual Conference of the International
Speech Communication Association, San Francisco,
CA, USA, September 8-12, 2016, N. Morgan,
Ed. ISCA, 2016, pp. 545–549. [Online]. Available:
https://doi.org/10.21437/Interspeech.2016-1176