Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning

Author: Yixiao Zhang; Yukara Ikemiya; Woosung Choi; Naoki Murata; Marco Martínez-Ramírez; Liwei Lin; Gus Xia; Wei-Hsiang Liao; Yuki Mitsufuji; Simon Dixon

Publisher: Zenodo

DOI: 10.5281/zenodo.17706408

Source: https://zenodo.org/records/17706408/files/000038.pdf

INSTRUCT-MUSICGEN: UNLOCKING TEXT-TO-MUSIC EDITING FOR
MUSIC LANGUAGE MODELS VIA INSTRUCTION TUNING
Yixiao Zhang1, Yukara Ikemiya2, Woosung Choi2, Naoki Murata2, Marco A. Martínez-Ramírez2,
Liwei Lin3, Gus Xia3, Wei-Hsiang Liao2, Yuki Mitsufuji2, Simon Dixon1
1C4DM, Queen Mary University of London
2Sony AI 3Music X Lab, MBZUAI
[email protected], [email protected], {gus.xia, ll4270}@nyu.edu
ABSTRACT
The task of text-to-music editing, which employs text queries to modify music (e.g. by
changing its style or adjusting instrumental components), presents unique challenges and
opportunities for AI-assisted music creation. Previous approaches in this domain have been
constrained by the necessity to train specific editing models from scratch, which is both
resource-intensive and inefficient; other research uses large language models to predict
edited music, resulting in imprecise audio reconstruction. In this paper, we introduce
Instruct-MusicGen, a novel approach that finetunes a pretrained MusicGen model to efficiently
follow editing instructions such as adding, removing, or separating stems. Our approach
involves a modification of the original MusicGen architecture by incorporating a text fusion
module and an audio fusion module, which allow the model to process instruction texts and
audio inputs concurrently and yield the desired edited music. Remarkably, although
Instruct-MusicGen only introduces ∼8% new parameters to the original MusicGen model and only
trains for 5K steps, it achieves superior performance across all tasks compared to existing
baselines. This advancement not only enhances the efficiency of text-to-music editing but
also broadens the applicability of music language models in dynamic music production
environments. 1 2
1. INTRODUCTION
The rapid advances in text-to-music generation have opened up new possibilities for
AI-assisted music creation [1–5]. This paradigm shift has also sparked a growing interest in
developing models that offer greater controllability [6–9] and editability [10–12] over the
music generation process. In music production, a stem, a mixed group of tracks often related
by instrument type (like drums or lead vocals), is essential for mixing and mastering because
it allows producers to isolate, adjust, and manipulate individual elements of a song.
Following the definition in MusicMagus [11], “text-to-music editing” involves using textual
queries to modify various aspects of a music recording, which can be categorised into two
main types: intra-stem editing, which focuses on modifying a single stem (e.g., changing the
instrument, timbre, or performance style), and inter-stem editing, which involves altering
the relationships among stems (e.g., adding, removing, or separating stems). Our work mainly
focuses on the problem of inter-stem editing.
[Figure 1 diagram. MusicGen panel: text description input ("Generate music piece of sad
jazz"), T5 encoder, music output. Instruct-MusicGen panel: text instruction ("Instruction:
add drums."), source music, T5 encoder, edited music output.]
Figure 1: Comparison between MusicGen and Instruct-MusicGen. Instruct-MusicGen accepts both
audio input and editing instruction text as conditions.
1 Code, model weights and demo are available at: https://github.com/ldzhangyx/instruct-musicgen.
2 This work was done during Yixiao Zhang’s internship at Sony AI.
© Y. Zhang, Y. Ikemiya, W. Choi, N. Murata, M. A. Martínez-Ramírez, L. Lin, G. Xia,
W.-H. Liao, Y. Mitsufuji, and S. Dixon. Licensed under a Creative Commons Attribution 4.0
International License (CC BY 4.0). Attribution: Y. Zhang, Y. Ikemiya, W. Choi, N. Murata,
M. A. Martínez-Ramírez, L. Lin, G. Xia, W.-H. Liao, Y. Mitsufuji, and S. Dixon,
“Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via
Instruction Tuning”, in Proc. of the 26th Int. Society for Music Information Retrieval
Conf., Daejeon, South Korea, 2025.
Previous attempts to develop text-based music editing models have encountered several
challenges. Some approaches [10,13] have focused on training specialised editing models from
scratch, which is resource-intensive and may not yield results comparable to state-of-the-art
music generation models. Other work [12, 14, 15] has sought to leverage existing large
language models (LLMs) and MusicGen [2], allowing the LLM to interpret editing instructions
without further training the music model. Although this approach offers flexibility, it often
lacks the ability to precisely reconstruct the conditional audio, leading to unreliable
results. To address these limitations, an ideal solution should harness the knowledge
embedded in pretrained models to ensure high-quality audio output while adapting the
architecture to accommodate the specific requirements of music editing tasks.
In this paper, we introduce Instruct-MusicGen, a novel approach that applies an
instruction-following tuning strategy to the pretrained MusicGen model, enhancing its ability
to follow editing instructions effectively without finetuning all its parameters. As shown in
Figure 1, by incorporating an audio fusion module based on LLaMA-Adapter [6,16] and a text
fusion module based on LoRA [17] into the original MusicGen architecture, we allow the model
to process both precise audio conditions and text-based instructions simultaneously, which
the original MusicGen does not do. This enables Instruct-MusicGen to perform a range of
editing tasks. In this paper, we focus on a specific set of these tasks: adding, separating,
and removing stems. To train Instruct-MusicGen, we synthesize an instructional dataset from
the Slakh2100 dataset [18]. The approach introduces only 8% additional parameters compared to
the original model and finetunes the model for only 5K steps, which is less than 1% of the
training required for a music editing model built from scratch.
We evaluate Instruct-MusicGen on two datasets: the Slakh test set and the out-of-domain
MoisesDB dataset [19]. Our model outperforms existing baselines and achieves performance
comparable to models specifically trained for individual tasks. This demonstrates the
effectiveness of our approach in leveraging pretrained models for text-to-music editing while
maintaining high-quality results.
2. RELATED WORK
Text-based music editing provides a flexible approach to editing music using textual
queries. This method is similar to those used in other modalities that require editing, such
as image [20, 21] and video [22, 23] editing. In text-to-music editing, text is used to
specify precise alterations to existing music compositions. Previous research such as AUDIT
[13] and InstructME [10] developed diffusion models trained with paired music editing data.
Additionally, models like M2UGen [12], Loop Copilot [14], MusicAgent [24], ComposerX [25] and
WavCraft [26] use large language models (LLMs) for reasoning and regenerate music with
external music generation models. Furthermore, GMSDI [27] attempts to model a joint
multi-stem distribution of music for text-based generation and separation. Certain models
focus exclusively on specific tasks within music editing, such as conditioned generation
[6, 7, 9] and separation [28], along with intra-stem editing tasks such as text-based timbre
transfer and style transfer [11,29–31].
The task of inter-stem music editing is closely related to stem-wise music generation.
Although not directly tied to text-based controls, some research focuses on modeling
stem-wise representations to enable simultaneous stem generation and separation. For
instance, Jen-1 Composer [32] and MSDM [33] jointly model the distribution of music with four
stems using a diffusion model. The abilities of most existing stem-wise music models are
restricted to a fixed set of 4 stems, which limits flexibility but enhances controllability.
Besides, StemGen [34] trains a LLaMA-based auto-regressive model for flexible stem-wise audio
generation.
Our work distinguishes itself from these existing efforts in several key ways. First, rather
than developing a new model from scratch or strictly adhering to a fixed set of stems, we
leverage the power of a pretrained music language model, MusicGen, and enhance it with
instruction tuning. This approach not only reduces the computational cost but also retains
the high audio quality of the original MusicGen model. Furthermore, our method introduces
minimal additional parameters and requires significantly less training, demonstrating a more
efficient and scalable solution for text-based music editing.
3. METHOD
3.1 MusicGen
The original MusicGen consists of three components: (1) the EnCodec [35] audio encoder and
decoder, which compress music audio waveforms into latent codes and reconstruct them back
into waveforms; (2) a multi-layer transformer architecture that models sequences of latent
codes, capturing higher-level music representations and efficiently modeling internal
relationships within music audio; and (3) the T5 [36] text encoder, which converts text
descriptions into embeddings for text-conditioned generation.
EnCodec employs Residual Vector Quantization (RVQ) [37] to compress audio into tokens using
multiple codebooks, where each quantizer encodes the quantization error from the previous
one. For a reference audio $X \in \mathbb{R}^{d \cdot f_n}$, where $d$ is the duration and
$f_n$ is the sample rate, EnCodec compresses $X$ into
$Q \in \{1, \ldots, L\}^{N \times d \cdot f_s}$, where $L$ is the RVQ codebook size, $N$ is
the number of codebooks, and $f_s$ is the latent code sample rate ($f_s \ll f_n$). In
MusicGen, $N = 4$, $f_s = 50$, $f_n = 32000$, and $L = 2048$. Finally, the transformer models
the sequence relationships over latent codes. 3
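As a rough illustration of the token grid described above, the following sketch computes its shape for a clip of a given duration, using the MusicGen values quoted in the text (4 codebooks, a 50 Hz latent rate, 32 kHz audio). The function names are hypothetical and are not part of MusicGen or EnCodec; the 5-second duration is the clip length used later for the training data.

```python
def encodec_token_shape(duration_s, num_codebooks=4, frame_rate_hz=50):
    """Shape (N, d * f_s) of the RVQ token grid for a clip of the given duration."""
    num_frames = int(round(duration_s * frame_rate_hz))
    return (num_codebooks, num_frames)


def samples_per_frame(sample_rate_hz=32000, frame_rate_hz=50):
    """Number of waveform samples summarised by one latent frame."""
    return sample_rate_hz // frame_rate_hz


token_shape = encodec_token_shape(5.0)  # a 5-second clip
hop = samples_per_frame()
print(token_shape, hop)
```

Under these assumptions a 5-second clip of 160,000 samples becomes a 4 × 250 grid of indices, each index drawn from a codebook of 2048 entries.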
3.2 Instruct-MusicGen
MusicGen is a text-to-music generation model, capable of generating music audio from a given
text prompt. However, MusicGen cannot edit existing music audio. To address this limitation,
we introduce Instruct-MusicGen, which transforms MusicGen into a model that can follow
editing instructions to modify existing music audio.
Instruct-MusicGen takes a music audio input $X_{\text{cond}}$ and a text instruction
$X_{\text{instruct}}$ (e.g., "Add guitar") as inputs. The model then edits the music audio
$X_{\text{cond}}$ according to the instruction $X_{\text{instruct}}$ and generates the
desired edited music $X_{\text{music}}$. As illustrated in Figure 2, Instruct-MusicGen
incorporates two additional modules into the vanilla MusicGen: an audio fusion module and a
text fusion module.
3 MusicGen’s EnCodec uses a 50 Hz sample rate, which is different from the original 75 Hz
EnCodec model.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 2 diagram: a MusicGen decoder layer (masked multi-head self-attention, multi-head
cross-attention, FFN, layer norms, positional encoding, shifted-right outputs) with an audio
fusion branch (condition music audio, positional encoding, linear layer, multi-head
self-attention) and a text fusion branch (text instruction, T5 encoder, LoRA).]
Figure 2: Illustration of the fusion mechanism inside the Transformer module of
Instruct-MusicGen. The audio fusion module transforms the conditional music audio into
embeddings using a duplicated encoder and integrates these embeddings into the MusicGen
decoder. The text fusion module modifies the cross-attention mechanism to handle text
instructions by finetuning specific layers while keeping the text encoder parameters frozen.
3.2.1 Audio Fusion Module
The audio fusion module enables Instruct-MusicGen to accept external audio inputs, which is
inspired by LLaMA-Adapter [16] and Coco-mulla [6]. The lower part of Figure 2 illustrates the
audio fusion module. Initially, we convert $X_{\text{cond}}$ into EnCodec tokens, followed by
re-encoding these tokens into the embedding $z_{\text{cond}}$ through the pretrained
embedding layers of MusicGen. Similarly, we transform $X_{\text{music}}$ into the pretrained
embedding $z_{\text{music}}$.
The module begins with duplicating self-attention modules of the pretrained MusicGen model to
extract latent representations of $z_{\text{cond}}$. Given that MusicGen consists of $M$
layers, we denote
$$Z^{\text{cond}} = \{z^{\text{cond}}_0, z^{\text{cond}}_1, \ldots, z^{\text{cond}}_M\}, \tag{1}$$
$$Z^{\text{music}} = \{z^{\text{music}}_0, z^{\text{music}}_1, \ldots, z^{\text{music}}_M\}, \tag{2}$$
which represent the hidden states of $X_{\text{cond}}$ and $X_{\text{music}}$ respectively.
Note that we use a learnable input embedding as $z^{\text{cond}}_0$ and initialize
$z^{\text{music}}_0$ with $z_{\text{music}}$.
We compute the vanilla self-attention of $X_{\text{music}}$ as follows:
$$Q^{\text{music}}_l, K^{\text{music}}_l, V^{\text{music}}_l = \text{QKV-projector}(z^{\text{music}}_l), \tag{3}$$
$$o^{\text{music}}_l = \text{SelfAttn}(Q^{\text{music}}_l, K^{\text{music}}_l, V^{\text{music}}_l). \tag{4}$$
We project $z_{\text{cond}}$ to a high-dimension representation $h$ via a linear layer $f_l$
and learnable positional encoding $e_l$,
$$h = f_l(z_{\text{cond}}) + e_l. \tag{5}$$
Then, we compute the $(l+1)$-th layer hidden states of $X_{\text{cond}}$ as follows:
$$Q^{\text{cond}}_l, K^{\text{cond}}_l, V^{\text{cond}}_l = \text{QKV-projector}(z^{\text{cond}}_l + h), \tag{6}$$
$$z^{\text{cond}}_{l+1} = \text{SelfAttn}(Q^{\text{cond}}_l, K^{\text{cond}}_l, V^{\text{cond}}_l). \tag{7}$$
To fuse information of $X_{\text{cond}}$ into $X_{\text{music}}$, we compute the
cross-attention between them,
$$s^{\text{music}}_l = \text{CrossAttn}(Q^{\text{music}}_l + Q^{\text{cond}}_l, K^{\text{cond}}_l, V^{\text{cond}}_l). \tag{8}$$
Finally, the attention output of $X_{\text{music}}$ is updated as follows,
$$s'_l = o^{\text{music}}_l + g_l \cdot s^{\text{music}}_l, \tag{9}$$
$$z^{\text{music}}_{l+1} = \text{TextFusion}(s'_l, X_{\text{instruct}}), \tag{10}$$
where $g_l$ is a zero-initialized learnable gating factor.
Thus, the total trainable parameters in Instruct-MusicGen include the input embedding
$z^{\text{cond}}_0$, linear layers $f_l$, learnable position embeddings $e_l$, learnable
gating factors $g_l$, and learnable parameters in the text fusion module.
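The gated update in Equations (8) and (9) can be sketched in a few lines of plain Python. This is a toy, single-head illustration under simplifying assumptions: there is no causal masking, no multi-head split, and the music and condition sequences have equal length so their queries can be summed. The helper names `attention` and `gated_fusion` are hypothetical. The point it shows is that a zero-initialized gate leaves the pretrained self-attention output untouched at the start of finetuning.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def gated_fusion(o_music, q_music, q_cond, k_cond, v_cond, gate):
    """s' = o_music + gate * CrossAttn(q_music + q_cond, k_cond, v_cond)."""
    fused_q = [[a + b for a, b in zip(qm, qc)] for qm, qc in zip(q_music, q_cond)]
    s_music = attention(fused_q, k_cond, v_cond)
    return [[o + gate * s for o, s in zip(o_row, s_row)]
            for o_row, s_row in zip(o_music, s_music)]


# Toy tensors: 2 time steps, 2 dimensions (values are arbitrary).
o_music = [[1.0, 0.0], [0.0, 1.0]]
q_music = [[0.2, 0.1], [0.0, 0.3]]
q_cond = [[0.1, 0.0], [0.5, 0.5]]
k_cond = [[1.0, 0.0], [0.0, 1.0]]
v_cond = [[2.0, 0.0], [0.0, 2.0]]

at_init = gated_fusion(o_music, q_music, q_cond, k_cond, v_cond, gate=0.0)
after_training = gated_fusion(o_music, q_music, q_cond, k_cond, v_cond, gate=0.5)
```

With `gate=0.0` the output equals `o_music`, so the condition branch only starts to influence generation as the gate is learned.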
3.2.2 Text Fusion Module
To replace the text description input with instruction input, we modify the behavior of the
current text encoder. We achieve this by finetuning only the cross-attention module between
the text embedding and the music representations while keeping the text encoder’s parameters
frozen.
The instruction is embedded and encoded by the T5 text encoder as
$z_{\text{instruct}} = \text{T5}(X_{\text{instruct}})$. For efficient finetuning of the
cross-attention module, we apply LoRA to the query and value projection layers. Thus, we
expand Equation 10 as follows,
$$Q_l, K^{\text{instruct}}_l, V^{\text{instruct}}_l = \text{QKV-Lora}(s'_l, z_{\text{instruct}}), \tag{11}$$
$$z^{\text{music}}_{l+1} = \text{CrossAttn}(Q_l, K^{\text{instruct}}_l, V^{\text{instruct}}_l). \tag{12}$$
During fine-tuning, only query and value projection layers are trainable in the text fusion
module.
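For readers unfamiliar with LoRA, the sketch below shows the low-rank update applied to one frozen projection matrix. It is a generic illustration, not the code used for Instruct-MusicGen: the class name `LoRALinear`, the rank, the `alpha / rank` scaling and the deterministic values standing in for the random initialisation of `lora_a` are all assumptions. Zero-initialising `lora_b` follows the usual LoRA convention, so the layer initially reproduces the frozen projection.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, weight, rank, alpha=1.0):
        self.weight = weight  # frozen, shape d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        self.rank = rank
        self.scale = alpha / rank
        # Small deterministic values stand in for the usual random init of A.
        self.lora_a = [[0.01 * (i + j + 1) for j in range(d_in)] for i in range(rank)]
        # B starts at zero, so the update is zero before any training step.
        self.lora_b = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        d_out, d_in = len(self.weight), len(self.weight[0])
        return self.rank * (d_in + d_out)


frozen = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
layer = LoRALinear(frozen, rank=2)
x = [1.0, 2.0, 3.0]
before = layer(x)          # identical to the frozen projection
layer.lora_b[0][0] = 1.0   # pretend one training step changed B
after = layer(x)
```

In a real model the projection matrices are far larger than the rank, which is where the parameter saving comes from; the toy sizes here are only chosen to keep the arithmetic easy to follow.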
4. EXPERIMENTS
We conduct both subjective experiments and objective experiments for evaluation, and also
provide example spectrograms in Figure 3.
4.1 Objective Experiments
4.1.1 Dataset
For our objective evaluations, we utilise two distinct datasets, each serving a specific
purpose in assessing both in-domain and out-of-domain performance capabilities of various
models.
1. Slakh2100 dataset [18]. The Synthesized Lakh (Slakh) Dataset, originally derived from the
Lakh MIDI Dataset v0.1, comprises audio tracks synthesised using high-quality sample-based
virtual instruments. This dataset features 2100 tracks complete with corresponding MIDI
files.
2. MoisesDB dataset [19]. The MoisesDB dataset includes 240 real audio tracks sourced from 45
diverse artists spanning twelve musical genres. Uniquely, MoisesDB organises its tracks into
a detailed two-level hierarchical taxonomy of stems, offering a varied number of stems per
track, each annotated with textual descriptions.
The rationale for selecting two datasets lies in their diverse configurations and common
applications. While the Slakh dataset is traditionally utilised for training models tailored
to a four-stem arrangement, our model, Instruct-MusicGen, although initially trained on this
dataset, is designed to generalise to various stem configurations. Conversely, models such as
InstructME and AUDIT are trained on private or larger, more diverse datasets. By employing
both Slakh2100 and MoisesDB, we ensure a comprehensive evaluation, allowing us to fairly
compare the adaptability and performance of different models under varying conditions of data
familiarity and complexity.
4.1.2 Data Preprocessing
We utilised the Slakh2100 dataset to construct an instruction-based dataset for our
experiments, employing the following pipeline:
• A data point was randomly selected from the Slakh training dataset.
• An instruction was chosen from a predefined set {add, remove, extract} along with a target
stem. Subsequently, n other stems were selected from the remaining stems.
• An offset was randomly determined to cut a 5-second audio clip. If the target stem
contained more than 50% silence, a different offset was selected.
• The stems were mixed according to the specified instructions to create a triple consisting
of {instruction text, condition audio input, audio ground truth}.
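The pipeline above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: stems are plain lists of samples, and the silence threshold, retry limit, and the exact pairing of condition and ground truth per instruction (for example, "add" maps the context mix to the context mix plus the target) are assumptions made for the example. The instruction string follows the format shown in Figure 1.

```python
import random

INSTRUCTIONS = ("add", "remove", "extract")


def silence_ratio(samples, threshold=1e-4):
    """Fraction of samples whose magnitude falls below the (assumed) threshold."""
    return sum(1 for s in samples if abs(s) < threshold) / len(samples)


def mix(stems):
    """Sum aligned stems sample by sample."""
    return [sum(vals) for vals in zip(*stems)]


def make_triple(stems, clip_len, rng, max_tries=20):
    """Build (instruction text, condition audio, ground truth) from a dict of stems."""
    instruction = rng.choice(INSTRUCTIONS)
    target = rng.choice(sorted(stems))
    others = [name for name in sorted(stems) if name != target]
    context = rng.sample(others, rng.randint(1, len(others)))
    total = len(stems[target])
    for _ in range(max_tries):
        offset = rng.randint(0, total - clip_len)
        target_clip = stems[target][offset:offset + clip_len]
        if silence_ratio(target_clip) <= 0.5:
            break  # target is audible enough in this window
    else:
        return None  # no usable offset found within max_tries
    context_mix = mix([stems[name][offset:offset + clip_len] for name in context])
    full_mix = [a + b for a, b in zip(context_mix, target_clip)]
    if instruction == "add":
        condition, truth = context_mix, full_mix
    elif instruction == "remove":
        condition, truth = full_mix, context_mix
    else:  # extract
        condition, truth = full_mix, target_clip
    return f"Instruction: {instruction} {target}.", condition, truth


# Toy stems: 100 non-silent samples each (stand-ins for real audio).
toy_stems = {name: [0.1 * (i % 7 + 1) for i in range(100)]
             for name in ("bass", "drums", "guitar", "piano")}
example = make_triple(toy_stems, clip_len=20, rng=random.Random(0))
```

With real audio, `clip_len` would be 5 seconds times the sample rate, and the stems would come from a randomly chosen Slakh training track.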
4.1.3 Experimental Setup
For the finetuning of MusicGen, we jointly trained the audio fusion module and the text
fusion module. The optimisation process utilised the AdamW optimiser, with a learning rate
set at $5 \times 10^{-3}$. We use L2 loss over latent tokens as the training objective.
Training incorporated a Cosine Annealing schedule with an initial warmup of 100 steps. The
training regimen extended over 5,000 steps with an accumulated batch size of 32, achieved
through setting the batch size to 8 and using gradient accumulation over 4 iterations. The
finetuning process was executed on a single NVIDIA A100 GPU and was completed within two
days.
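The schedule described above can be written out explicitly. The sketch below assumes a linear warmup and a cosine decay to zero at step 5,000; the text does not specify the warmup shape or the final learning rate, so both are assumptions, and `learning_rate` is a hypothetical helper rather than a library call.

```python
import math


def learning_rate(step, base_lr=5e-3, warmup_steps=100, total_steps=5000):
    """Linear warmup (assumed) followed by cosine annealing to zero (assumed)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Accumulated batch size: 8 samples per step, gradients accumulated over 4 steps.
effective_batch_size = 8 * 4

schedule_preview = [learning_rate(s) for s in (0, 99, 100, 2550, 5000)]
```

The preview covers the first warmup step, the last warmup step, the peak, the midpoint of the decay and the final step.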
4.1.4 Baselines
In this section, we explore two baseline models, each distinguished by their unique
methodologies for handling audio data.
1. AUDIT [13]: AUDIT is an instruction-guided audio editing model, consisting of a
variational autoencoder (VAE) for converting input audio into a latent space representation,
a T5 text encoder for processing edit instructions, and a diffusion network that performs the
actual audio editing in the latent space. The system accepts mel-spectrograms of input audio
and edit instructions, and generates the edited audio as output.
2. M2UGen [12]: The M2UGen framework leverages large language models to comprehend and
generate music across various modalities, integrating abilities from external models such as
MusicGen [2] and AudioLDM 2 [4]. It is designed to stimulate creative outputs from diverse
sources, showcasing robust performance in multi-modal music generation.
Besides, InstructME can also perform instruction-guided music editing and remixing with
latent diffusion models. We exclude it from comparison because InstructME’s model weights and
evaluation protocol are not publicly released.
Model        Param size    Dataset    Hours (h)   Steps
AUDIT        942M (1.5B)   Multiple   ∼6500       0.5M
InstructME   967M (1.7B)   Multiple   417         2M
M2UGen       637M (∼9B)    MUEdit     60.22       -
Ours         264M (3.5B)   Slakh      145         5K
Table 1: Comparison of different models, where the param size numbers are trainable
parameters and total parameters (in parentheses) respectively. Our model has the fewest
trainable parameters and requires only 5K training steps.
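The headline figures can be checked against Table 1 with a few lines of arithmetic. The sketch assumes that the 3.5B total for our model includes the 264M trainable parameters, so the original MusicGen size is taken as their difference; that reading is an assumption, not something the table states.

```python
trainable_params = 264e6
total_params = 3.5e9

# Assumption: total = original MusicGen + newly introduced parameters.
original_params = total_params - trainable_params
new_vs_original = trainable_params / original_params   # roughly 0.08
new_vs_total = trainable_params / total_params          # roughly 0.075

steps_ours = 5_000
steps_from_scratch = {"AUDIT": 500_000, "InstructME": 2_000_000}
step_ratio = {name: steps_ours / steps for name, steps in steps_from_scratch.items()}
```

Under this reading the new parameters amount to about 8% of the original model, and 5K steps correspond to 1% of AUDIT's and 0.25% of InstructME's reported training steps.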
4.1.5 Metrics
The metrics to evaluate model performance are listed below.
1. Fréchet Audio Distance (FAD) [38] 4 measures the similarity between two sets of audio
files by comparing multivariate Gaussian distributions fitted to feature embeddings from the
audio data. We use the FAD score to evaluate the overall audio quality of the predicted
music.
2. CLAP Score (CLAP) [39] 5 is used in our experiments to measure the correspondence between
the edited music and a target text. For the removal task, the target text is generated by
deleting the name of the removed instrument from the original text.
3. Kullback-Leibler Divergence (KL) 6 assesses the difference between the probability
distributions of audio features from two sources, indicating information loss when
approximating one distribution with another. A low KL score indicates the predicted music
shares similar features with the ground truth.
4. Structural Similarity (SSIM) [40] is an image quality metric that we adapt to evaluate
structural similarity between predicted music and ground truth.
5. Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) [41] quantifies audio quality,
especially in source separation tasks. It is scale-invariant, useful for varying audio
volumes, and measures distortion relative to a reference signal. We use SI-SDR to evaluate
the signal loss of the predicted audio.
6. Scale-Invariant Signal-to-Distortion Ratio improvement (SI-SDRi) [42] extends SI-SDR,
measuring the improvement in signal-to-distortion ratio before and after processing. It is
commonly used in audio enhancement and separation contexts.
4 https://github.com/gudgud96/frechet-audio-distance.
5 https://github.com/LAION-AI/CLAP.
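To make the last two metrics concrete, here is a minimal sketch of SI-SDR and SI-SDRi on plain Python lists. It uses the common projection-based definition and omits the mean removal that some implementations apply first; the `eps` constant and the toy signals are assumptions made for the example, not the evaluation code behind the reported numbers.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB: project the estimate onto the reference."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the processed output over the unprocessed input."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals: a reference tone, an interfering tone, and an estimate that
# keeps only a tenth of the interference.
reference = [math.sin(0.1 * i) for i in range(200)]
interference = [0.5 * math.cos(0.37 * i) for i in range(200)]
mixture = [r + n for r, n in zip(reference, interference)]
estimate = [r + 0.1 * n for r, n in zip(reference, interference)]

improvement_db = si_sdr_improvement(estimate, mixture, reference)
louder = [2.0 * e for e in estimate]  # rescaling should not change SI-SDR
```

Because the metric compares waveforms sample by sample, a generated signal that sounds right but is not phase-aligned with the reference can still score very low.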
To further investigate whether the model successfully adds, removes or extracts the
instrument, we propose the P-Demucs score to evaluate the model performance. This metric
specifically focuses on detecting the presence of a newly added instrument in the generated
audio. It leverages the Demucs model, a source separation model, to isolate the target
instrument from the audio. After separation, the root-mean-square energy (RMSE) of the
isolated track is analyzed. For example, if the instruction is to "add guitar," a non-silent
guitar track is regarded as a successful edit.
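The presence check behind P-Demucs can be sketched as an RMS threshold on the isolated stem. The sketch takes the separated track as given (it does not call Demucs), and both the threshold value and the aggregation into a success rate over examples are assumptions; the text only states that a non-silent isolated track counts as a successful edit.

```python
import math


def rms(samples):
    """Root-mean-square energy of a mono signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def stem_is_present(isolated_stem, threshold=1e-3):
    """Treat the target instrument as present if the isolated stem is not silent."""
    return rms(isolated_stem) > threshold


def presence_rate(isolated_stems, threshold=1e-3):
    """Fraction of edited clips whose isolated target stem is non-silent."""
    hits = sum(1 for stem in isolated_stems if stem_is_present(stem, threshold))
    return hits / len(isolated_stems)


# Toy isolated tracks: one silent, one clearly audible.
silent_track = [0.0] * 100
audible_track = [0.1, -0.1] * 50
rate = presence_rate([silent_track, audible_track])
```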
4.1.6 Objective Experiment Results
Our evaluation of Instruct-MusicGen demonstrates its superior performance across various
tasks compared to existing text-to-music editing baselines (AUDIT, M2UGen). On the Slakh
dataset (Table 2), Instruct-MusicGen excelled in adding, removing, and extracting stems,
achieving the lowest Fréchet Audio Distance (FAD) and highest CLAP and SSIM scores. It also
significantly improved the scale-invariant signal-to-distortion ratio (SI-SDR) in the removal
task, showing balanced performance across all metrics and proving its robustness in various
editing scenarios. Similarly, in the MoisesDB dataset evaluations (Table 3),
Instruct-MusicGen demonstrated strong performance, with the best performance on most metrics
over the three tasks.
We find that all models exhibit negative SI-SDR and SI-SDRi scores, which is a common
occurrence when evaluating generative models on a signal level. These metrics are typically
designed for source separation tasks and are not entirely fair for generative models, as they
penalise even minor discrepancies between the generated and original signals. Generative
models like Instruct-MusicGen often focus on producing perceptually plausible audio rather
than perfectly matching the original signal at a technical level.
6 https://github.com/haoheliu/audioldm_eval.
4.2 Subjective Experiments
4.2.1 Experimental Setup
We conducted a subjective listening test to evaluate the model’s performance. 7 This test
involved disseminating an online survey within the Music Information Retrieval (MIR)
community and our broader research network, which resulted in the collection of 30 complete
responses. The gender distribution of the participants was 23 males (76.7%) and 7 females
(23.3%). Regarding professional musical education experience, 4 participants (13.3%) had less
than 1 year of experience, 13 (43.3%) had between 1 and 5 years, and 13 participants (43.3%)
had more than 5 years of experience. For the data preparation, we randomly selected a subset
of data points from the objective test dataset. Specifically, 6 audio samples were chosen,
comprising 2 audio samples for each subtask (add, remove, extract). Each data point included
results from the baseline models, our models, and the ground truth from the dataset.
4.2.2 Metrics
1. Instruction Adherence (IA) assesses how accurately the generated music follows the given
editing instruction. In this experiment, participants rate the generated music on a scale
from 1 to 5, where 1 indicates that the instruction was not followed at all, and 5 indicates
that the instruction was followed perfectly. For example, if the instruction is "Remove
Drums," a rating of 1 would mean that the drums were not removed at all, while a rating of 5
would mean that the drums were completely removed.
2. Audio Quality (AQ) evaluates the overall audio quality of the generated music in
comparison to the original music. Participants rate the audio quality on a scale from 1 to 5,
where 1 represents very poor quality with significant degradation compared to the original
music, and 5 represents excellent quality, as good as or better than the original music. This
metric helps in understanding how the editing process affects the overall sound quality of
the music.
4.2.3 Subjective Experiment Results
The results of our subjective experiments are summarised in Table 4. We conducted two paired
t-tests with Bonferroni correction, setting the significance level at α = 0.05. The results
show that our model demonstrates a significant improvement in both Instruction Adherence (IA)
and Audio Quality (AQ) compared to the baseline models, AUDIT and M2UGen. 8
7 This subjective test was approved by the ethics committee of Sony.
8 More audio samples can be found at https://bit.ly/instruct-musicgen.

Task      Model    FAD↓    CLAP↑   KL↓    SSIM↑   P-Demucs↑   SI-SDR↑   SI-SDRi↑
Add       AUDIT    6.88    0.12    1.02   0.21    0.53        -         -
Add       M2UGen   7.24    0.22    0.99   0.20    0.43        -         -
Add       Ours     3.75    0.23    0.67   0.26    0.80        -         -
Remove    AUDIT    15.48   0.07    2.75   0.35    0.33        -45.60    -47.28
Remove    M2UGen   8.26    0.09    1.59   0.23    0.70        -44.20    -46.13
Remove    Ours     3.35    0.12    0.66   0.45    0.76        -2.09     -3.77
Extract   AUDIT    15.08   0.06    2.38   0.42    0.61        -52.90    -50.16
Extract   M2UGen   8.14    0.11    2.15   0.31    0.60        -46.38    -43.53
Extract   Ours     3.24    0.12    0.54   0.52    0.75        -9.00     -6.15
Table 2: Comparison of text-based music editing models on the Slakh dataset (4 stems).

Task      Model    FAD↓    CLAP↑   KL↓    SSIM↑   P-Demucs↑   SI-SDR↑   SI-SDRi↑
Add       AUDIT    4.06    0.12    0.84   0.21    0.50        -         -
Add       M2UGen   5.00    0.18    0.83   0.20    0.45        -         -
Add       Ours     3.79    0.18    0.35   0.35    0.77        -         -
Remove    AUDIT    10.72   0.10    2.46   0.34    0.41        -44.32    -57.10
Remove    M2UGen   3.75    0.13    1.27   0.19    0.72        -43.94    -56.73
Remove    Ours     5.05    0.10    0.84   0.34    0.78        -13.70    -26.48
Extract   AUDIT    6.67    0.07    1.97   0.45    0.60        -54.53    -56.17
Extract   M2UGen   5.74    0.08    1.91   0.25    0.52        -42.84    -44.49
Extract   Ours     4.96    0.11    1.36   0.40    0.78        -21.39    -23.03
Table 3: Comparison of text-based music editing models on the MoisesDB dataset.

[Figure 3 panels: (a) Input music. (b) Edited music output. (c) Ground truth.]
Figure 3: Spectrograms when Instruct-MusicGen removes the drum stem.

Model          Instruction Adherence↑   Audio Quality↑
AUDIT          1.54                     2.56
M2UGen         1.70                     1.92
Ours           3.85                     3.55
Ground truth   4.36                     4.21
Table 4: The subjective experiment results.

5. CONCLUSION
In this paper, we introduced Instruct-MusicGen, a novel approach to text-to-music editing
that fosters joint musical and textual controls. By finetuning the existing MusicGen model
with instruction tuning, Instruct-MusicGen demonstrated its capability of editing music in
various ways, including adding, removing and extracting a stem from music audio using textual
queries, without the need for training specialised models from scratch. Also, it outperforms
various baseline models that are dedicated to specific music editing tasks. Furthermore, our
method uses significantly fewer resources than previous models, tuning newly introduced
parameters that amount to only about 8% of the parameters of the original MusicGen.
6. ACKNOWLEDGEMENTS
This work was done during Yixiao Zhang’s internship at Sony AI. Yixiao Zhang was a research
student at the UKRI Centre for Doctoral Training in Artificial Intelligence and Music,
supported jointly by the China Scholarship Council, Queen Mary University of London and
Apple Inc.
7. REFERENCES
[1] A. Agostinelli, T. I. Denk, Z. Borsos, J. H. Engel, M. Verzetti, A. Caillon, Q. Huang,
A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. H. Frank,
“MusicLM: Generating music from text,” CoRR, vol. abs/2301.11325, 2023. [Online].
Available: https://doi.org/10.48550/arXiv.2301.11325
[2] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez,
“Simple and controllable music generation,” in Advances in Neural Information Processing
Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023,
New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko,
M. Hardt, and S. Levine, Eds., 2023. [Online]. Available:
http://papers.nips.cc/paper_files/paper/2023/hash/94b472a1842cd7c56dcb125fb2765fbd-Abstract-Conference.html
[3] P. Li, B. Chen, Y. Yao, Y. Wang, A. Wang, and A. Wang, “JEN-1: Text-guided universal
music generation with omnidirectional diffusion models,” CoRR, vol. abs/2308.04729, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.2308.04729
[4] H. Liu, Q. Tian, Y. Yuan, X. Liu, X. Mei, Q. Kong, Y. Wang, W. Wang, Y. Wang, and
M. D. Plumbley, “AudioLDM 2: Learning holistic audio generation with self-supervised
pretraining,” CoRR, vol. abs/2308.05734, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2308.05734
[5] K. Chen, Y. Wu, H. Liu, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov, “MusicLDM:
Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,”
CoRR, vol. abs/2308.01546, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2308.01546
[6] L. Lin, G. Xia, J. Jiang, and Y. Zhang, “Content-based controls for music large language
modeling,” CoRR, vol. abs/2310.17162, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2310.17162
[7] S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan, “Music controlnet: Multiple
time-varying controls for music generation,” IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 32, pp. 2692–2703, 2024.
[8] J. Melechovsky, Z. Guo, D. Ghosal, N. Majumder, D. Herremans, and S. Poria, “Mustango:
Toward controllable text-to-music generation,” arXiv preprint arXiv:2311.08355, 2023.
[9] L. Lin, G. Xia, Y. Zhang, and J. Jiang, “Arrange, inpaint, and refine: Steerable
long-term music audio generation and editing via content-based controls,” CoRR, vol.
abs/2402.09508, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.09508
[10] B. Han, J. Dai, X. Song, W. Hao, X. He, D. Guo, J. Chen, Y. Wang, and Y. Qian,
“InstructME: An instruction guided music edit and remix framework with latent diffusion
models,” CoRR, vol. abs/2308.14360, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2308.14360
[11] Y. Zhang, Y. Ikemiya, G. Xia, N. Murata, M. A. M. Ramírez, W. Liao, Y. Mitsufuji, and
S. Dixon, “MusicMagus: Zero-shot text-to-music editing via diffusion models,” CoRR, vol.
abs/2402.06178, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.06178
[12] A. S. Hussain, S. Liu, C. Sun, and Y. Shan, “M2UGen: Multi-modal music understanding
and generation with the power of large language models,” CoRR, vol. abs/2311.11255, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.2311.11255
[13] Y. Wang, Z. Ju, X. Tan, L. He, Z. Wu, J. Bian, and S. Zhao, “AUDIT: Audio editing by
following instructions with latent diffusion models,” in Advances in Neural Information
Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023,
NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson,
K. Saenko, M. Hardt, and S. Levine, Eds., 2023. [Online]. Available:
http://papers.nips.cc/paper_files/paper/2023/hash/e1b619a9e241606a23eb21767f16cf81-Abstract-Conference.html
[14] Y. Zhang, A. Maezawa, G. Xia, K. Yamamoto, and S. Dixon, “Loop Copilot: Conducting AI
ensembles for music generation and iterative editing,” CoRR, vol. abs/2310.12404, 2023.
[Online]. Available: https://doi.org/10.48550/arXiv.2310.12404
[15] D. Yang, J. Tian, X. Tan, R. Huang, S. Liu, X. Chang, J. Shi, S. Zhao, J. Bian, X. Wu,
Z. Zhao, S. Watanabe, and H. Meng, “UniAudio: An audio foundation model toward universal
audio generation,” CoRR, vol. abs/2310.00704, 2023. [Online]. Available:
https://doi.org/10.48550/arXiv.2310.00704
[16] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and Y. Qiao,
“LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention,” CoRR,
vol. abs/2303.16199, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.16199
[17] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen, “LoRA: Low-rank
adaptation of large language models,” CoRR, vol. abs/2106.09685, 2021. [Online]. Available:
https://arxiv.org/abs/2106.09685
[18] E. Manilow, G. Wichern, P. Seetharaman, and J. L. Roux, “Cutting music source separation
some Slakh: A dataset to study the impact of training data quality and quantity,” in 2019
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2019, New
Paltz, NY, USA, October 20-23, 2019. IEEE, 2019, pp. 45–49. [Online]. Available:
https://doi.org/10.1109/WASPAA.2019.8937170
[19] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl,
“MoisesDB: A dataset for source separation beyond
4-stems,” in Proceedings of the 24th International
Society for Music Information Retrieval Conference,
ISMIR 2023, Milan, Italy, November 5-9, 2023,
A. Sarti, F. Antonacci, M. Sandler, P. Bestagini,
S. Dixon, B. Liang, G. Richard, and J. Pauwels,
Eds., 2023, pp. 619–626. [Online]. Available:
https://doi.org/10.5281/zenodo.10265363
[20] T. Brooks, A. Holynski, and A. A. Efros,
“InstructPix2Pix: Learning to follow image editing
instructions,” in IEEE/CVF Conference on
Computer Vision and Pattern Recognition, CVPR
2023, Vancouver, BC, Canada, June 17-24, 2023.
IEEE, 2023, pp. 18 392–18 402. [Online]. Available:
https://doi.org/10.1109/CVPR52729.2023.01764
[21] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction
tuning,” in Advances in Neural Information Processing
Systems 36: Annual Conference on Neural Information
Processing Systems 2023, NeurIPS 2023, New
Orleans, LA, USA, December 10 - 16, 2023, A. Oh,
T. Naumann, A. Globerson, K. Saenko, M. Hardt,
and S. Levine, Eds., 2023. [Online]. Available:
http://papers.nips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html
[22] W. Chai, X. Guo, G. Wang, and Y. Lu,
“StableVideo: Text-driven consistency-aware diffusion
video editing,” in IEEE/CVF International
Conference on Computer Vision, ICCV
2023, Paris, France, October 1-6, 2023. IEEE,
2023, pp. 22 983–22 993. [Online]. Available:
https://doi.org/10.1109/ICCV51070.2023.02106
[23] D. Ceylan, C. P. Huang, and N. J. Mitra, “Pix2Video:
Video editing using image diffusion,” in IEEE/CVF
International Conference on Computer Vision, ICCV
2023, Paris, France, October 1-6, 2023. IEEE,
2023, pp. 23 149–23 160. [Online]. Available:
https://doi.org/10.1109/ICCV51070.2023.02121
[24] D. Yu, K. Song, P. Lu, T. He, X. Tan, W. Ye, S. Zhang,
and J. Bian, “MusicAgent: An AI agent for music
understanding and generation with large language
models,” in Proceedings of the 2023 Conference on
Empirical Methods in Natural Language Processing,
EMNLP 2023 - System Demonstrations, Singapore,
December 6-10, 2023, Y. Feng and E. Lefever,
Eds. Association for Computational Linguistics,
2023, pp. 246–255. [Online]. Available:
https://doi.org/10.18653/v1/2023.emnlp-demo.21
[25] Q. Deng, Q. Yang, R. Yuan, Y. Huang, Y. Wang, X. Liu,
Z. Tian, J. Pan, G. Zhang, H. Lin et al., “ComposerX:
Multi-agent symbolic music composition with LLMs,”
arXiv preprint arXiv:2404.18081, 2024.
[26] J. Liang, H. Zhang, H. Liu, Y. Cao, Q. Kong, X. Liu,
W. Wang, M. D. Plumbley, H. Phan, and E. Benetos,
“WavCraft: Audio editing and generation with large
language models,” in ICLR 2024 Workshop on Large
Language Model (LLM) Agents, 2024.
[27] E. Postolache, G. Mariani, L. Cosmo, E. Benetos,
and E. Rodolà, “Generalized multi-source inference
for text conditioned music diffusion models,” CoRR,
vol. abs/2403.11706, 2024. [Online]. Available:
https://doi.org/10.48550/arxiv.2403.11706
[28] X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan,
Y. Liu, R. Xia, Y. Wang, M. D. Plumbley,
and W. Wang, “Separate anything you describe,”
CoRR, vol. abs/2308.05037, 2023. [Online]. Available:
https://doi.org/10.48550/arxiv.2308.05037
[29] H. Manor and T. Michaeli, “Zero-shot unsupervised
and text-based audio editing using DDPM inversion,”
CoRR, vol. abs/2402.10009, 2024. [Online]. Available:
https://doi.org/10.48550/arxiv.2402.10009
[30] S. Li, Y. Zhang, F. Tang, C. Ma, W. Dong, and C. Xu,
“Music style transfer with time-varying inversion of
diffusion models,” in Thirty-Eighth AAAI Conference
on Artificial Intelligence, AAAI 2024, Thirty-Sixth
Conference on Innovative Applications of Artificial
Intelligence, IAAI 2024, Fourteenth Symposium on
Educational Advances in Artificial Intelligence, EAAI
2024, February 20-27, 2024, Vancouver, Canada,
M. J. Wooldridge, J. G. Dy, and S. Natarajan, Eds.
AAAI Press, 2024, pp. 547–555. [Online]. Available:
https://doi.org/10.1609/aaai.v38i1.27810
[31] F.-D. Tsai, S.-L. Wu, H. Kim, B.-Y. Chen,
H.-C. Cheng, and Y.-H. Yang, “Audio prompt
adapter: Unleashing music editing abilities for
text-to-music with lightweight finetuning,” arXiv preprint
arXiv:2407.16564, 2024.
[32] Y. Yao, P. Li, B. Chen, and A. Wang,
“JEN-1 Composer: A unified framework for
high-fidelity multi-track music generation,” CoRR,
vol. abs/2310.19180, 2023. [Online]. Available:
https://doi.org/10.48550/arxiv.2310.19180
[33] G. Mariani, I. Tallini, E. Postolache, M. Mancusi,
L. Cosmo, and E. Rodolà, “Multi-source diffusion
models for simultaneous music generation and
separation,” CoRR, vol. abs/2302.02257, 2023. [Online].
Available: https://doi.org/10.48550/arxiv.2302.02257
[34] J. D. Parker, J. Spijkervet, K. Kosta, F. Yesiler,
B. Kuznetsov, J. Wang, M. Avent, J. Chen, and D. Le,
“StemGen: A music generation model that listens,”
CoRR, vol. abs/2312.08723, 2023. [Online]. Available:
https://doi.org/10.48550/arxiv.2312.08723
[35] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi,
“High fidelity neural audio compression,” CoRR, vol.
abs/2210.13438, 2022. [Online]. Available:
https://doi.org/10.48550/arxiv.2210.13438
[36] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
M. Matena, Y. Zhou, W. Li, and P. J. Liu,
“Exploring the limits of transfer learning with a
unified text-to-text transformer,” J. Mach. Learn. Res.,
vol. 21, pp. 140:1–140:67, 2020. [Online]. Available:
http://jmlr.org/papers/v21/20-074.html
[37] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund,
and M. Tagliasacchi, “SoundStream: An end-to-end
neural audio codec,” IEEE ACM Trans. Audio
Speech Lang. Process., vol. 30, pp. 495–507, 2022.
[Online]. Available: https://doi.org/10.1109/TASLP.2021.3129994
[38] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi,
“Fréchet audio distance: A reference-free metric
for evaluating music enhancement algorithms,” in
Interspeech 2019, 20th Annual Conference of the
International Speech Communication Association,
Graz, Austria, 15-19 September 2019, G. Kubin
and Z. Kacic, Eds. ISCA, 2019, pp. 2350–2354.
[Online]. Available: https://doi.org/10.21437/Interspeech.2019-2219
[39] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick,
and S. Dubnov, “Large-scale contrastive
language-audio pretraining with feature fusion and
keyword-to-caption augmentation,” in IEEE International
Conference on Acoustics, Speech and Signal
Processing ICASSP 2023, Rhodes Island, Greece, June
4-10, 2023. IEEE, 2023, pp. 1–5. [Online]. Available:
https://doi.org/10.1109/ICASSP49357.2023.10095969
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P.
Simoncelli, “Image quality assessment: From error
visibility to structural similarity,” IEEE Transactions
on Image Processing, vol. 13, no. 4, pp. 600–612,
2004. [Online]. Available: https://doi.org/10.1109/TIP.2003.819861
[41] J. L. Roux, S. Wisdom, H. Erdogan, and J. R.
Hershey, “SDR - half-baked or well done?” in IEEE
International Conference on Acoustics, Speech and
Signal Processing, ICASSP 2019, Brighton, United
Kingdom, May 12-17, 2019. IEEE, 2019, pp.
626–630. [Online]. Available: https://doi.org/10.1109/ICASSP.2019.8683855
[42] Y. Z. Isik, J. L. Roux, Z. Chen, S. Watanabe,
and J. R. Hershey, “Single-channel multi-speaker
separation using deep clustering,” in Interspeech
2016, 17th Annual Conference of the International
Speech Communication Association, San Francisco,
CA, USA, September 8-12, 2016, N. Morgan,
Ed. ISCA, 2016, pp. 545–549. [Online]. Available:
https://doi.org/10.21437/Interspeech.2016-1176
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025