ITO-Master: Inference-Time Optimization for Audio Effects Modeling of Music Mastering Processors

Author: Junghyun Koo; Marco Martinez-Ramirez; WeiHsiang Liao; Giorgio Fabbro; Michele Mancusi; Yuki Mitsufuji

Publisher: Zenodo

DOI: 10.5281/zenodo.17706351

Source: https://zenodo.org/records/17706351/files/000016.pdf

ITO-MASTER: INFERENCE-TIME OPTIMIZATION FOR AUDIO EFFECTS
MODELING OF MUSIC MASTERING PROCESSORS
Junghyun Koo (1), Marco A. Martínez-Ramírez (1), Wei-Hsiang Liao (1), Giorgio Fabbro (2), Michele Mancusi (2), Yuki Mitsufuji (1,3)
(1) Sony AI, Japan; (2) Sony Europe B.V., Germany; (3) Sony Group Corporation, Japan
{firstname.lastname}@sony.com
ABSTRACT
Music mastering style transfer aims to model and apply the mastering characteristics of a reference track to a target track, simulating the professional mastering process. However, existing methods apply fixed processing based on a reference track, limiting users' ability to fine-tune the results to match their artistic intent. In this paper, we introduce the ITO-Master framework, a reference-based mastering style transfer system that integrates Inference-Time Optimization (ITO) to enable finer user control over the mastering process. By optimizing the reference embedding $z_{ref}$ during inference, our approach allows users to refine the output dynamically, making micro-level adjustments to achieve more precise mastering results. We explore both black-box and white-box methods for modeling mastering processors and demonstrate that ITO improves mastering performance across different styles. Through objective evaluation, subjective listening tests, and qualitative analysis using text-based conditioning with CLAP embeddings, we validate that ITO enhances mastering style similarity while offering increased adaptability. Our framework provides an effective and user-controllable solution for mastering style transfer, allowing users to refine their results beyond the initial style transfer.
1. INTRODUCTION
Music mastering is the final step in the audio production process, ensuring professional sound quality and consistent playback across music distribution platforms. This process involves applying a series of audio effects such as equalization, compression, stereo imaging, and limiting, which collectively shape the sonic characteristics and enhance the overall quality of the audio [1, 2]. Traditionally, mastering has required skilled engineers who carefully adjust these effects based on the track's content and desired artistic outcome. However, with the increasing volume of music production and the demand for consistency across streaming platforms, the need for automated mastering solutions has grown substantially.

© J. Koo, M. Martínez-Ramírez, W-H. Liao, G. Fabbro, M. Mancusi, and Y. Mitsufuji. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. Koo, M. Martínez-Ramírez, W-H. Liao, G. Fabbro, M. Mancusi, and Y. Mitsufuji, “ITO-Master: Inference-Time Optimization for Audio Effects Modeling of Music Mastering Processors”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In response to this demand, various automatic mastering systems have emerged [3–5]. However, these systems operate in an unconditioned manner, applying audio effects without direct user control. To introduce adaptability, reference-based approaches have been explored, where the processing characteristics of a reference track are applied to another [6, 7]. These methods aim to match audio features such as dynamics, tonal balance, and stereo width, offering an alternative to fully automatic mastering. However, significant challenges remain in achieving both high-quality results and controllability.
Existing reference-based approaches can be broadly categorized into black-box and white-box models. Black-box models, often based on end-to-end neural networks [6], can effectively capture high-level audio patterns but lack transparency and interpretability, making it difficult for users to modify specific aspects of the processing. In contrast, white-box models leverage either feature matching algorithms [7] or differentiable audio processors [5] to provide greater control over individual parameters. While white-box methods offer a structured and interpretable approach [8], they are often constrained by the simplicity of their differentiable processors, which may not fully replicate the complex tools used in professional mastering.
In this paper, we introduce the ITO-Master framework, a reference-based approach to audio effects modeling of music mastering processors that incorporates Inference-Time Optimization (ITO) for finer user control. While previous style transfer methods apply fixed processing based on a reference, our approach allows users to dynamically refine the output when the initial result does not fully align with their preferences. By optimizing the reference embedding $z_{ref}$ during inference, ITO-Master enables micro-level adjustments, allowing for more precise and targeted mastering refinements.
The key contributions of this work include: (1) ITO-Master framework: A novel approach for reference-based audio effects modeling of mastering processors using ITO. (2) Comparison of black-box and white-box models: A systematic study of two paradigms for evaluating their effectiveness and trade-offs. (3) Realistic mastering processor chain: Implementation of a structured differentiable mastering pipeline to enhance the realism of white-box processing. (4) Comprehensive evaluation: Performance validation via objective metrics, listening tests, and qualitative analysis using text-based conditioning with CLAP embeddings.
2. RELATED WORKS
2.1 Audio Effects Style Transfer
Audio effects style transfer has become a significant area of research in automating and enhancing music production. Recent advancements in deep learning have led to more sophisticated approaches, where neural networks are used to learn complex mappings between input and output audio signals. These methods have been applied to style transfer across single audio effects (Fx) or multiple sets of Fx (Fx chain) [5, 6, 9–14], effectively modeling temporal dependencies and applying style transfer based on a reference track at the waveform level. While these methods have shown success in controlled environments, challenges remain in extending their application to diverse real-world scenarios, particularly in adapting to different mastering styles and varying input signals.
2.2 Inference-Time Optimization
Recently, [15, 16] have explored inference-time optimization (ITO) in music generation tasks, where the initial latent embedding is optimized by backpropagating through diffusion-based models with the loss between a generated sample and a reference track. In the context of audio effects style transfer, ITO has been applied in methods like ST-ITO [17, 18], where interpretable parameters in a white-box differentiable Fx chain are optimized.
Our work focuses specifically on mastering style transfer, where handling heavily compressed audio (common in commercially released music) requires particular attention to limiters and dynamic range management. The ITO-Master framework introduces ITO on the reference embedding $z_{ref}$, allowing optimization at inference time in both black-box and white-box models. By fine-tuning $z_{ref}$, our approach adapts the mastering style to a reference track without retraining the entire model. ITO-Master ensures smooth adaptation to the reference track's characteristics while preserving the ability to observe and adjust the underlying parameters in the white-box model. This makes our framework particularly suitable for mastering tasks, providing professionals and amateurs with precise control over the audio mastering process.
3. METHODOLOGY
In this section, we describe the components of the proposed mastering style transfer framework: the training pipeline, Mastering Style Converter, differentiable mastering chain, and ITO. The training pipeline simulates real-world mastering scenarios by applying random Fx manipulations. The Mastering Style Converter $\Psi$ transfers the mastering style from a reference track to a target track and can be implemented using both black-box and white-box approaches. The differentiable mastering chain serves as a white-box processor that models various Fx in a structured sequence. Lastly, the ITO process optimizes $z_{ref}$ at inference time to enhance style transfer performance. The following subsections describe each component in detail.
3.1 Training Pipeline of Mastering Style Transfer
As shown in Figure 1(a), the training pipeline follows established methodologies in style transfer, utilizing a self-supervised training framework with random Fx manipulation [6, 11–13]. Based on the understanding that a single song maintains a consistent mastering style throughout [2], a song is first segmented into two parts, $A$ and $B$. For the input of $\Psi$, random manipulation $f_1$ is applied to simulate a random style, similar to how the process would function in an application setting. Then, we apply Fx-Normalization [19] $f_{norm}$, which normalizes certain Fx characteristics to fixed target levels, to facilitate the performance of style transfer. $f_{norm}$ is only applied to the equalizer (EQ), stereo imager, and loudness levels, allowing the model to capture a broader range of nonlinear Fx transformations, such as compression levels and distortion. While normalizing compression is technically feasible, it is not well-suited to the on-the-fly training procedure. For distortion, normalization would require either removing all distortion from the given song or applying a consistent distortion level across all tracks, which is outside the scope of this paper. In summary, the input of $\Psi$ is defined as $x_{in} = f_1(f_{norm}(A))$.
To achieve style transfer, the Fx information from the reference track $x_{ref}$ is encoded to create the reference embedding that conditions $\Psi$. A second random manipulation $f_2$ is applied to segment $B$, which is then encoded by the reference encoder $\Phi$, resulting in the reference embedding $z_{ref} = \Phi(f_2(B))$. The training process minimizes the loss between the model output $y' = \Psi(x_{in}, z_{ref})$ and the target signal $y = f_2(A)$. Since both $A$ and $B$ originate from the same mastered song, we assume that $y$ and $x_{ref}$ share the same mastering style.
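The data flow of Section 3.1 can be sketched in a few lines. The block below is only an illustration under loud assumptions: `f_norm` is a toy RMS normalizer and `make_random_fx` is a toy gain-plus-hard-clip manipulation, neither of which is the paper's Fx-Normalization or mastering chain, and all function names are hypothetical. The composition order follows the formula $x_{in} = f_1(f_{norm}(A))$ as written.

```python
import math
import random


def f_norm(x, target_rms=0.1):
    """Toy stand-in for Fx-Normalization: scale a segment to a fixed RMS level."""
    rms = math.sqrt(sum(s * s for s in x) / len(x))
    return [s * target_rms / rms for s in x] if rms > 0 else list(x)


def make_random_fx(rng):
    """Toy stand-in for a random mastering manipulation: gain followed by a hard clip."""
    gain = rng.uniform(0.5, 4.0)
    ceiling = rng.uniform(0.2, 1.0)

    def fx(x):
        return [max(-ceiling, min(ceiling, gain * s)) for s in x]

    return fx


def make_training_example(song, rng):
    """Build (x_in, x_ref, y) from one song, mirroring Section 3.1."""
    half = len(song) // 2
    seg_a, seg_b = song[:half], song[half:]   # segments A and B of the same song
    f1, f2 = make_random_fx(rng), make_random_fx(rng)
    x_in = f1(f_norm(seg_a))                  # x_in = f_1(f_norm(A))
    x_ref = f2(seg_b)                         # encoder input, z_ref = Phi(f_2(B))
    y = f2(seg_a)                             # target y = f_2(A)
    return x_in, x_ref, y
```

Because the same $f_2$ is applied to both $A$ and $B$, the target and the reference share a style while carrying different content, which is what makes the objective self-supervised.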
3.2 Mastering Style Converter
$\Psi$ can be implemented using two distinct modeling approaches: black-box modeling and white-box modeling. In the black-box approach, $\Psi$ directly models the waveform signal $y'$. Conversely, the white-box approach estimates the parameters $\Theta$ of the differentiable mastering Fx chain. The formulation of the differentiable mastering Fx chain in the white-box model is identical to that used in the Fx manipulator for processing randomly mastered audio.
The training objective of $\Psi$ is the multi-scale spectral loss $\mathcal{L}_{MSS}$, applied to both left-right and mid-side channels (where mid = left + right, side = left - right), as used in [12].
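As a rough illustration of a multi-scale spectral loss evaluated on left, right, mid and side channels, the sketch below uses a naive rectangular-window DFT and an L1 distance between magnitudes. The FFT sizes, hop size, window, and distance are all assumptions chosen for brevity; the exact configuration used in [12] is not given here and may differ.

```python
import cmath
import math


def mid_side(left, right):
    """mid = left + right, side = left - right, as defined in the text."""
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side


def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram via a naive DFT (rectangular window, no padding)."""
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def spectral_l1(a, b, n_fft):
    """Mean absolute difference between two magnitude spectrograms."""
    fa = stft_mag(a, n_fft, n_fft // 4)
    fb = stft_mag(b, n_fft, n_fft // 4)
    diffs = [abs(u - v) for ra, rb in zip(fa, fb) for u, v in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def mss_loss(pred_lr, target_lr, fft_sizes=(16, 32, 64)):
    """Sum the spectral distance over several resolutions and over L, R, M, S."""
    (pl, pr), (tl, tr) = pred_lr, target_lr
    pm, ps = mid_side(pl, pr)
    tm, ts = mid_side(tl, tr)
    pairs = [(pl, tl), (pr, tr), (pm, tm), (ps, ts)]
    return sum(spectral_l1(p, t, n) for n in fft_sizes for p, t in pairs)
```

Including the mid and side channels penalizes stereo-width mismatches that a purely left/right comparison can underweight.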
3.3 Differentiable Mastering Chain Modeling

[Figure 1: Overall pipeline of ITO-Master. Diagram blocks: Mastering Style Converter $\Psi$, Reference Encoder $\Phi$, FX chain #1, FX norm, FX chain #2.
(a) Training pipeline of Mastering Style Converter $\Psi$. During this phase, Reference Encoder $\Phi$ is trained using diverse mastering styles generated by random FX manipulation $f$. The target signal $y$ is synthesized by applying the same manipulation to segment $A$ as to reference segment $B$, both from the same song.
(b) ITO is performed using an auxiliary (content-independent) objective function, allowing any reference music. Users can optimize $z_{ref}$ based on their preferences for both the reference and objective function.]

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

The mastering chain is designed to be a fully differentiable white-box processor, serving a dual purpose: functioning both as a mastering style converter and as a random mastering manipulation module for training the converter. By modeling a wide range of variability in mastering styles, the chain enables the system to robustly handle and replicate the complexities of real-world music mastering. To simulate a realistic music mastering process [3], the chain includes six distinct Fx modules: (1) 6-band parametric equalizer, (2) distortion, (3) 3-band compressor, (4) makeup gain, (5) stereo imager, and (6) limiter. The order of these modules is fixed, with the probability of applying each Fx module for random manipulation during training set at 90%, 30%, 80%, 85%, 60%, and 100%, respectively. These probabilities are adopted to introduce greater variability while preventing the synthesis of overly unrealistic mastering styles to enhance $\Psi$'s modeling capability. The chain comprises a total of 46 controllable parameters.
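The fixed module order with per-module application probabilities can be sketched as below. The probabilities are those stated in the text; the module labels and the function name are hypothetical, and only the on/off selection is modeled, not the sampling of the 46 Fx parameters.

```python
import random

# (module label, probability of being applied), in the fixed chain order.
# Labels are illustrative names, not identifiers from the released code.
FX_CHAIN = [
    ("equalizer", 0.90),
    ("distortion", 0.30),
    ("multiband_compressor", 0.80),
    ("makeup_gain", 0.85),
    ("stereo_imager", 0.60),
    ("limiter", 1.00),
]


def sample_active_modules(rng):
    """Return the modules applied for one random manipulation, keeping chain order."""
    # rng.random() lies in [0, 1), so a probability of 1.00 always passes.
    return [name for name, prob in FX_CHAIN if rng.random() < prob]
```

For example, `sample_active_modules(random.Random(0))` returns one ordered subset of the six labels, and the limiter is always included.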
To ensure differentiability, the Fx modules are implemented using open-source libraries (see footnotes 1 and 2) that support gradient-based optimization techniques. For modeling the 3-band multiband compression, a fourth-order Linkwitz-Riley crossover filter [20] is first applied to split the signal into three bands, followed by a differentiable all-pole filter. A key component in the chain is the use of these differentiable all-pole filters, as modeled by [21], which enable the computation of both compression and expansion effects in both the multiband compressor and limiter. This capability is crucial for practical applications, allowing the system to manage both limiters and delimiters, which is particularly important in real-world scenarios where most commercially released music is heavily compressed [2], requiring effective style transfer under such conditions.
Footnote 1: https://github.com/csteinmetz1/dasp-pytorch
Footnote 2: https://github.com/DiffAPF/torchcomp

3.4 Inference-Time Optimization on Reference Embedding
The primary contribution of this work is the introduction of ITO on $z_{ref}$. Instead of fine-tuning the entire model $\Psi$, the focus is on optimizing only $z_{ref}$ while keeping the pre-trained $\Psi$ fixed, as shown in Figure 1(b). Although optimizing $z_{ref}$ during inference time in the Black-box model does not provide interpretability in terms of Fx processors, the White-box model preserves interpretability. In fact, the changes in the parameters $\Theta$ before and after the ITO process can be observed, providing insight into how $\Theta$ is adjusted. Additionally, users can optimize the system with an alternative reference signal $x'_{ref}$, offering a different approach from conventional mastering style transfer, as this method merges mastering styles based on the combination of the new reference signal and the optimizing objective function. The advantages of ITO on $z_{ref}$ include a significantly reduced number of optimization steps compared to optimizing the entire differentiable chain's $\Theta$ from scratch, as we compare in Section 5.
For ITO, the Audio Feature (AF) loss proposed by [13] is utilized as the auxiliary objective function $\mathcal{L}_{aux}$. The AF loss is a content-independent loss that combines various audio feature transformations, capturing the dynamics, spatialization, and spectral characteristics of the audio. Each transformation in the AF loss has its own predefined weighting factor, and these weighted transformations are summed together to compute the overall loss. In our experiments, the original weights from the reference paper are followed. We optimize $z_{ref}$ iteratively using gradient descent:

$$z_{ref}^{(t+1)} = z_{ref}^{(t)} - \eta \nabla_{z} \mathcal{L}_{aux}\left(\Psi(x_{in}, z_{ref}^{(t)}), x_{ref}\right),$$

where $\eta$ is the learning rate. For objective and subjective evaluation, we focus solely on using AF loss as the optimization objective for ITO. Since ITO can be optimized with any loss function, we also qualitatively explore optimization using a text prompt with CLAP embeddings [22] in Section 5.3.
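The update rule above can be exercised on a toy problem. In the sketch below, every component is an assumption made for illustration: `toy_converter` is a hypothetical frozen $\Psi$ whose two-dimensional embedding acts as left/right gains, `toy_aux_loss` is a content-independent loss comparing per-channel RMS only (not the AF loss of [13]), gradients are taken by central differences rather than backpropagation, and plain gradient descent with a large step replaces the optimizer settings of Section 4.2.

```python
import math


def rms(x):
    return math.sqrt(sum(s * s for s in x) / len(x))


def toy_converter(x_in, z):
    """Hypothetical frozen converter: z[0] and z[1] act as left/right gains."""
    left, right = x_in
    return [z[0] * s for s in left], [z[1] * s for s in right]


def toy_aux_loss(y, x_ref):
    """Content-independent toy loss: squared per-channel RMS mismatch."""
    return sum((rms(yc) - rms(rc)) ** 2 for yc, rc in zip(y, x_ref))


def numerical_grad(fn, z, eps=1e-4):
    """Central-difference gradient of fn with respect to each entry of z."""
    grad = []
    for i in range(len(z)):
        hi, lo = list(z), list(z)
        hi[i] += eps
        lo[i] -= eps
        grad.append((fn(hi) - fn(lo)) / (2 * eps))
    return grad


def ito(x_in, x_ref, z_init, lr=1.0, steps=100):
    """z <- z - lr * grad_z L_aux(Psi(x_in, z), x_ref); the converter stays fixed."""
    def objective(z):
        return toy_aux_loss(toy_converter(x_in, z), x_ref)

    z = list(z_init)
    losses = [objective(z)]
    for _ in range(steps):
        g = numerical_grad(objective, z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        losses.append(objective(z))
    return z, losses


def sine(amp, cycles, n=256):
    return [amp * math.sin(2 * math.pi * cycles * i / n) for i in range(n)]
```

Only the embedding is updated; the input and reference contain different content (different sine frequencies), yet the loss can still be driven down because it compares level statistics rather than waveforms.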
4. EXPERIMENTS
4.1 Dataset
We utilized the MoisesDB dataset [23] for training, and validated using the MUSDB18 validation subset [24]. Mixture samples from these datasets are employed, as they are not fully mastered, allowing random Fx manipulation with $f$ to create synthetic mastered samples. For Fx-Normalization, mean statistics are precomputed on the MoisesDB dataset, and normalization is applied to match the EQ, stereo imager, and loudness levels. For evaluation, 200 songs are randomly selected from the MTG-Jamendo dataset [25]. Of these, 100 songs are used as $x_{in}$, and the remaining 100 serve as $x_{ref}$ for input into $\Psi$. During training and validation, both segments $A$ and $B$ are 11.8 seconds long. For evaluation, 30-second samples are used since the fully convolutional architecture of $\Psi$ can handle variable-length inputs.
4.2 Experimental Setup
The experimental setup includes two primary training configurations for $\Psi$: Black-box and White-box methods. Both configurations use the Temporal Convolutional Network (TCN) [26] architecture for processing $x_{in}$, with 10.5 million trainable parameters. Pre-trained weights of the FXencoder [12] are adopted as $\Phi$ and tested under two conditions: with and without $\Phi$ being trained alongside $\Psi$. All models are trained for 72,000 iterations with a batch size of 4.
In addition to training the $\Phi$ and $\Psi$ models, ITO is performed on each test sample to optimize the reference embedding $z_{ref}$, which has a dimensionality of 2048. The ITO process is run for a maximum of 100 steps or is stopped earlier if the loss value increases, indicating convergence. For comparison, an alternative ITO approach is also applied where optimization is performed solely on the parameters $\Theta$ of the differentiable mastering chain $f$. In this case, $\Theta$ is optimized for up to 2K steps to evaluate its effectiveness relative to the proposed ITO method focused on $z_{ref}$. All methods are optimized using the RAdam optimizer [27] with a learning rate of $2 \cdot 10^{-4}$. More details are available in our open-source repository (see footnote 3).
Footnote 3: https://github.com/SonyResearch/ITO-Master

4.3 Evaluation Metrics
To objectively evaluate mastering style transfer, content-independent objectives are utilized, given that different content is being compared. The following metrics are employed:
• Audio Feature (AF) Loss: As discussed in Section 3.4, AF Loss measures how well the output $y'$ matches the desired audio features.
• Dynamic Range Variability (DRV): DRV is a crucial metric for assessing the compression level of audio, particularly in the context of music mastering, where limiters play a significant role. The DRV metric is computed by first identifying peak values in the audio signal using a high-frequency content onset detection function. DRV is the standard deviation of these peak values after filtering out the lowest 25% of the values, which is defined as:

$$\mathrm{DRV} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{std}\left(\left\{ p_i^c : p_i^c > \mathrm{percentile}(p^c, 75) \right\}\right) \qquad (1)$$

where $p^c$ denotes the set of peak values $\{p_1^c, p_2^c, ...\}$ in channel $c$, and $C$ is the total number of channels. The metric reflects the variability in dynamic range, with higher values indicating less consistent compression.
• Fx Embedding Similarity (cos sim): Cosine similarity measures the similarity between the reference embedding $\Phi(x_{ref})$ and the output embedding $\Phi(y')$. We adopt the pretrained FXencoder as $\Phi$. This metric evaluates how closely the output matches the reference in terms of its learned Fx characteristics.
• Fréchet Audio Distance (FAD): FAD [28] assesses the perceptual quality of the generated audio by comparing the statistical distribution of the model outputs to a reference distribution. FAD is calculated using three deep audio embeddings: CLAP [29], DAC [30], and EnCodec [31]. We chose these embeddings as they have shown strong correlation with human preference in acoustic quality, with codec-based models like DAC and EnCodec being particularly sensitive to acoustic effects, as demonstrated by [32]. The metric computes the distance between the distribution of features extracted from the model's output and those extracted from a subset of the Jamendo dataset, measuring how natural the style-transferred audio sounds compared to real recordings.
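Equation (1) can be sketched once per-channel peak values are available. The block below follows the equation as written (values strictly above a percentile threshold, default 75) and makes several assumptions: peaks are passed in directly rather than detected with a high-frequency content onset function, the percentile uses linear interpolation, the standard deviation is the population form, and an empty selection contributes 0. The function names are hypothetical.

```python
import math
import statistics


def percentile(values, q):
    """Percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def drv_from_peaks(peaks_per_channel, q=75.0):
    """Average over channels of std({p : p > percentile(p^c, q)})."""
    per_channel = []
    for peaks in peaks_per_channel:          # each channel needs at least one peak
        threshold = percentile(peaks, q)
        kept = [p for p in peaks if p > threshold]
        per_channel.append(statistics.pstdev(kept) if kept else 0.0)
    return sum(per_channel) / len(per_channel)
```

With peaks `[1, ..., 8]` in one channel and `[2, 4, ..., 16]` in the other, the thresholds are 6.25 and 12.5, the retained sets are `{7, 8}` and `{14, 16}`, and the resulting value is (0.5 + 1.0) / 2 = 0.75.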
4.4 Baseline Methods
The following baseline methods, representing existing mastering style transfer systems, are used for comparison:
• Acoustic Feature Matching Approaches: Feature matching approaches aim to adjust the Fx of an input track to match those of a reference track by directly aligning specific audio features.
  – Fx-Normalization [19]: Instead of normalizing the given audio to the mean statistics of the target data distribution, this approach directly matches the Fx levels to those of the reference song. The official implementation (see footnote 4) is used to match the audio effects in the order of EQ, compression, stereo imaging, and loudness, respectively.
  – Matchering [7]: An open-source library that matches the given song's RMS, frequency response, peak amplitude, and stereo width to those of the reference track. The official implementation (see footnote 5) is used to infer the processed songs.
• E2E Remastering [6]: This end-to-end remastering approach is a black-box model that directly predicts the signal $y'$ at the waveform level. The model is trained in a self-supervised manner using a large dataset of released pop songs. It leverages a pre-trained encoder and a projection discriminator to encourage the generation of realistic audio that accurately reflects the mastering style of the reference track.
Footnote 4: https://github.com/sony/FxNorm-automix
Footnote 5: https://github.com/sergree/matchering

5. RESULTS
5.1 Objective Evaluation
The performance of the proposed methods, along with the baseline approaches, is summarized in Table 1. The feature matching methods, specifically Fx-Normalization and Matchering, demonstrate strong performance on the AF and FAD metrics. This is expected, as these approaches directly apply Fx-related transformations to match the reference track's characteristics. However, these methods perform poorly on the DRV metric, as they lack the control needed for proper dynamic range adjustments, which is crucial in real-world mastering tasks involving delimiting. E2E Remastering [6] shows good performance on the FAD with CLAP and EnCodec embeddings, likely due to its use of an adversarial objective during training, which aids in generating realistic audio that closely matches the reference distribution. However, this system falls short on AF and DRV, indicating challenges in capturing precise audio feature transformations and managing dynamic range.

| Category | Method | AF (↓) | DRV (↓) | cos sim (↑) | FAD CLAP (↓) | FAD DAC (↓) | FAD EnCodec (↓) |
|---|---|---|---|---|---|---|---|
| Feature Matching | Fx-Normalization [19] | 0.157 | 0.801 | 0.941 | 161.4 | 177.4 | 84.53 |
| Feature Matching | Matchering [7] | 0.160 | 0.823 | 0.942 | 110.8 | 126.1 | 59.34 |
| Baseline | E2E Remastering [6] | 0.288 | 0.858 | 0.942 | 104.3 | 176.7 | 37.19 |
| Proposed | Black-box | 0.346 | 0.685 | 0.944 | 160.8 | 378.4 | 51.12 |
| Proposed | + train $\Phi$ | 0.125 | 0.577 | 0.945 | 159.8 | 177.4 | 46.94 |
| Proposed | + ITO on $z_{ref}$ | 0.099 | 0.567 | 0.946 | 182.2 | 180.5 | 42.82 |
| Proposed | White-box | 0.253 | 0.598 | 0.946 | 93.7 | 144.8 | 36.22 |
| Proposed | + train $\Phi$ | 0.186 | 0.521 | 0.945 | 93.2 | 101.4 | 38.90 |
| Proposed | + ITO on $z_{ref}$ | 0.139 | 0.474 | 0.946 | 105.2 | 109.1 | 42.99 |
| Proposed | ITO on $\Theta$ | 0.250 | 0.609 | 0.927 | 216.8 | 294.8 | 101.60 |

Table 1: Mastering Style Transfer on Jamendo dataset (real-world scenario).
Among the proposed methods, Black-box approaches outperform White-box methods in terms of AF, indicating their effectiveness in capturing audio feature transformations from direct modeling of $y'$ with $\mathcal{L}_{MSS}$. However, the White-box method shows better results across all FAD metrics, suggesting that it produces audio more aligned with real-world distributions. This may imply that while Black-box models capture more detailed transformations, White-box approaches produce outputs that are more perceptually consistent with real-world mastering styles.
When $\Psi$ is trained while keeping the pre-trained FXencoder $\Phi$ fixed, the performance is generally inferior. This is likely because the FXencoder was trained on a different set of Fx chains and may not fully capture the manipulations applied to the reference songs in this context. However, when $\Phi$ is trained alongside $\Psi$, there is a significant improvement in performance, as this joint training allows the encoder to better adapt to the specific Fx manipulations used, leading to more accurate mastering style transfer.
Applying ITO on $z_{ref}$ enhances AF performance, but introduces a trade-off in FAD scores, indicating that while ITO can refine mastering style transfer, the number of optimization steps must be carefully calibrated to balance competing objectives. Conversely, applying ITO directly on $\Theta$ yields poor results across all metrics, even with a larger number of optimization steps. Interestingly, in reverse engineering tasks, where the input and output content are identical, optimizing $\Theta$ works well despite the complexity of the Fx chain [33–35]. However, in mastering style transfer, content-independent loss functions are used to replicate only the mastering style from the reference track. This distinction highlights why ITO on $\Theta$ is less effective in this context. Since $\Psi$ is trained with a content-dependent objective, it leverages content information to enhance mastering style transfer. In contrast, $\mathcal{L}_{aux}$ used in ITO fails to capture the intricacies of the task, making it unsuitable for optimizing the entire differentiable chain in this scenario.
[Figure 2: Subjective evaluation results. Systems shown: Input, Fx-Normalization, E2E Remastering, Black-box + train, Black-box + ITO, White-box + train, White-box + ITO. Score axis from 0 to 100.]

Audio samples are available on our demo page (see footnote 6).
Footnote 6: https://tinyurl.com/ITO-Master

5.2 Subjective Evaluation
To further validate our proposed methods subjectively, we conducted a MUSHRA-type listening test with 10 participants, all familiar with music post-production and digital effects, having 2 to 5 years of experience in recording, mixing, or mastering. Participants rated various processed tracks based on their similarity in mastering audio effects to a reference track. The evaluation included 8 questions, with 30-second-long music recordings for all stimuli. The reference audio contained different content from the stimuli, but we ensured the reference and stimuli were not too dissimilar in terms of genre or instrumentation. As a low anchor, the initial music track before being input into the style transfer systems was presented. There was no high anchor, as the evaluation setup aimed to mimic real-world Fx style transfer using music tracks from the Jamendo dataset.
As illustrated in Figure 2, the subjective test results align with the trends observed in our objective evaluations. All our proposed methods surpass the baselines, with results further enhanced by ITO, showing audio effects characteristics more similar to those of the reference. The similarity scores of the proposed system ranged from 0 to 100, indicating that the listening test was highly challenging, even for experts with domain knowledge. Nevertheless, the proposed systems consistently outperformed the baseline, showing significant improvements (pairwise t-test, p < 0.05).

[Figure 3: Comparison of different audio features between the input music and ITO-processed tracks using text prompts “Classic Music”, “Metal Music”, and “Hip-Hop Music”. Panels: (a) Input Spectrogram (time in s, frequency in Hz); (b) Spectrogram Difference relative to the input for each prompt (color scale from -40 dB to +40 dB); (c) Spectral Centroid (Hz over time); (d) Crest Factor (peak-to-RMS over time); (e) RMS Energy (over time). Curves in (c) to (e): Input, Classic Music, Metal Music, Hip-Hop Music.]
5.3 Qualitative Analysis of ITO with Text Prompts
To evaluate the effectiveness of ITO under different $\mathcal{L}_{aux}$, we perform a qualitative analysis using text prompts with CLAP embeddings [22], similar to the application demonstrated in [36]. Given an input music track, we optimize $z_{ref}$ of the proposed white-box model using text-based conditioning, leveraging the CLAP embedding cosine similarity as the optimization objective. Specifically, we compute the audio embedding $\mathrm{CLAP}_{aud}$ of the steered output and the text embedding $\mathrm{CLAP}_{txt}$ of the given reference text prompt, then maximize their cosine similarity to guide the transformation. The input music piece used for this analysis is an 11.8-second-long instrumental rock track. Since the input content remains unchanged across different steered results, we can directly assess the influence of each text prompt on various musical features. We explore ITO with three different prompts: “Classic Music”, “Metal Music”, and “Hip-Hop Music” for analysis. This experimental setup can be explored through our interactive demo (see footnote 7).

Footnote 7: https://huggingface.co/spaces/jhtonyKoo/ITO-Master

As shown in Figure 3, the optimized results exhibit distinct characteristics that align with general expectations of each genre. The spectrogram difference plots in 3(b) (steered output - input) highlight the frequency ranges most affected by ITO. Specifically, the “Classic Music” prompt exhibits notable changes in the mid and high frequencies, aligning with the characteristic brightness and clarity often associated with classical recordings. The “Metal Music” prompt shows differences in both low and high frequencies, reflecting the genre's typical emphasis on powerful bass and sharp treble for aggressive instrumentation. In contrast, the “Hip-Hop Music” prompt predominantly affects the low-frequency range, reinforcing the genre's signature emphasis on deep bass and sub-bass elements, which are essential for driving rhythm-heavy beats.
These observations are further supported by the spectral centroid, crest factor, and RMS energy analyses. The spectral centroid results follow an expected trend, where hip-hop has the lowest centroid due to its bass-heavy nature, followed by metal, while classic music has the highest centroid, reflecting its emphasis on harmonic richness and treble clarity. The crest factor, representing peak-to-RMS ratio, is lowest for hip-hop, indicating a more compressed and bass-heavy dynamic structure, while classic music has the highest crest factor, aligning with its typically uncompressed, wide dynamic range. RMS energy shows an increasing trend from classic to hip-hop, with metal falling in between, which is consistent with the respective loudness and dynamic characteristics of these genres. These findings suggest that ITO, guided by $\mathrm{CLAP}_{txt}$, successfully steers the mastering Fx chain to align with the expected sonic characteristics of the given text prompt, demonstrating its potential as a creative tool for music post-production.
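The three descriptors used in this analysis can be computed as in the sketch below. It is a whole-signal, naive-DFT illustration under stated assumptions: no framing, windowing, or smoothing is applied, so it is not the frame-wise analysis plotted in Figure 3, and the function names are hypothetical.

```python
import cmath
import math


def rms_energy(x):
    """Root-mean-square level of the signal."""
    return math.sqrt(sum(s * s for s in x) / len(x))


def crest_factor(x):
    """Peak-to-RMS ratio; heavier compression or limiting lowers it."""
    return max(abs(s) for s in x) / rms_energy(x)


def spectral_centroid(x, sample_rate):
    """Magnitude-weighted mean frequency (Hz) over the non-negative DFT bins."""
    n = len(x)
    num = den = 0.0
    for k in range(n // 2 + 1):
        mag = abs(sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)))
        num += (k * sample_rate / n) * mag
        den += mag
    return num / den if den > 0 else 0.0
```

As a sanity check, a full-scale sine has a crest factor of $\sqrt{2}$, and hard-clipping that sine lowers the value, which matches the reading of a low crest factor as a sign of stronger compression.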
6. CONCLUSION
In this paper, we introduced the ITO-Master framework, which leverages ITO on $z_{ref}$ for mastering style transfer. Our experiments showed that training the reference encoder $\Phi$ alongside $\Psi$ improves performance. Optimizing $z_{ref}$ with ITO led to meaningful improvements with few steps, outperforming direct optimization of $\Theta$ in efficiency. Subjective evaluations confirmed that our method produces perceptually aligned mastering effects, and qualitative results highlighted the potential of text-conditioned ITO for creative applications. As future work, we plan to incorporate production quality and usability into the evaluation, alongside reference alignment. Since mastering is a curatorial task, poor reference choices can lead to suboptimal results despite high alignment, highlighting the need for perceptual preference metrics.
7. REFERENCES
[1]
U. Zölze , X. Ama iain, D. A ib, J. Bonada, G. De Poli,
P. Du illeux, G. E angelis a, F. Keile , A. Loscos,
D. Rocchesso e al.,DAFX-Digi al audio e ec s. John
Wiley & Sons, 2002.
[2]
M. Shel ock, Audio mas e ing as musical p ac ice.
The Uni e si y o Wes e n On a io (Canada), 2012.
[3]
M. Pio owska, S. Pio owski, and B. Kos ek, “A s udy
on audio signal p ocessed by" ins an mas e ing" se -
ices,” in Audio Enginee ing Socie y Con en ion 142.
Audio Enginee ing Socie y, 2017.
[4]
J. S e ne and E. Razlogo a, “Machine lea ning in con-
ex , o lea ning om land : A i icial in elligence and
he pla o miza ion o music mas e ing,” Social Media+
Socie y, ol. 5, no. 2, p. 2056305119847525, 2019.
[5]
M. A. M. Ramí ez, O. Wang, P. Sma agdis, and N. J.
B yan, “Di e en iable signal p ocessing wi h black-
box audio e ec s,” in ICASSP 2021-2021 IEEE In e -
na ional Con e ence on Acous ics, Speech and Signal
P ocessing (ICASSP). IEEE, 2021, pp. 66–70.
[6]
J. Koo, S. Paik, and K. Lee, “End- o-end music e-
mas e ing sys em using sel -supe ised and ad e sa ial
aining,” in P oc. ICASSP, 2022, pp. 4608–4612.
[7]
S. G ishako , C.-Y. Yu, and Zicklag, “Ma che ing:
Audio ma ching and mas e ing py hon lib a y,” h ps:
//gi hub.com/se g ee/ma che ing.
[8]
J. Engel, C. Gu, A. Robe s e al., “DDSP: Di e en-
iable digi al signal p ocessing,” in In e na ional Con-
e ence on Lea ning Rep esen a ions, 2020.
[9]
J. Koo, S. Paik, and K. Lee, “Re e b con e sion o
mixed ocal acks using an end- o-end con olu ional
deep neu al ne wo k,” in ICASSP 2021-2021 IEEE In-
e na ional Con e ence on Acous ics, Speech and Signal
P ocessing (ICASSP). IEEE, 2021, pp. 81–85.
[10] S. Lee, J. Park, S. Paik, and K. Lee, “Blind estimation of audio processing graph,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[11] C. J. Steinmetz, N. J. Bryan, and J. D. Reiss, “Style transfer of audio effects with differentiable signal processing,” J. Audio Eng. Soc, vol. 70, no. 9, pp. 708–721, 2022.
[12] J. Koo, M. A. Martínez-Ramírez, W.-H. Liao, S. Uhlich, K. Lee, and Y. Mitsufuji, “Music mixing style transfer: A contrastive learning approach to disentangle audio effects,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[13] S. S. Vanka, C. Steinmetz, J.-B. Rolland, J. Reiss, and G. Fazekas, “Diff-MST: Differentiable mixing style transfer,” in Proc. ISMIR, 2024.
[14] Y.-H. Chen, Y.-T. Yeh, Y.-C. Cheng, J.-T. Wu, Y.-H. Ho, J.-S. R. Jang, and Y.-H. Yang, “Towards zero-shot amplifier modeling: One-to-many amplifier modeling via tone embedding control,” in Proc. ISMIR, 2024.
[15] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “DITTO: Diffusion inference-time t-optimization for music generation,” in Proc. ICML, 2024.
[16] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. Bryan, “DITTO-2: Distilled diffusion inference-time t-optimization for music generation,” in Proc. ISMIR, 2024.
[17] C. Steinmetz, S. Singh, I. Ibnyahya, S. Yuan, E. Benetos, J. Reiss et al., “ST-ITO: Controlling audio effects for style transfer with inference-time optimization,” in Proc. ISMIR, 2024.
[18] C.-Y. Yu, M. A. Martínez-Ramírez, J. Koo, W.-H. Liao, Y. Mitsufuji, and G. Fazekas, “Improving inference-time optimisation for vocal effects style transfer with a gaussian prior,” arXiv preprint arXiv:2505.11315, 2025.
[19] M. A. Martínez-Ramírez, W.-H. Liao, G. Fabbro, S. Uhlich, C. Nagashima, and Y. Mitsufuji, “Automatic music mixing with deep learning and out-of-domain data,” in Proc. ISMIR, 2022.
[20] S. H. Linkwitz, “Active crossover networks for noncoincident drivers,” Journal of the Audio Engineering Society, vol. 24, no. 1, pp. 2–8, 1976.
[21] C.-y. Yu, C. Mitcheltree, A. Carson, S. Bilbao, J. Reiss, and G. Fazekas, “Differentiable all-pole filters for time-varying audio systems,” in 27th International Conference on Digital Audio Effects (DAFx), 2024.
[22] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[23] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl, “MoisesDB: A dataset for source separation beyond 4-stems,” in Proc. ISMIR, 2023.
[24] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “MUSDB18-HQ - an uncompressed version of MUSDB18,” Aug. 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3338373
[25] D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The mtg-jamendo dataset for automatic music tagging,” in Proc. ICML, 2019.
[26] C. J. Steinmetz and J. D. Reiss, “Efficient neural networks for real-time modeling of analog dynamic range compression,” arXiv preprint arXiv:2102.06200, 2021.
[27] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” in International Conference on Learning Representations, 2020.
[28] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A metric for evaluating music enhancement algorithms,” arXiv preprint arXiv:1812.08466, 2018.
[29] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “CLAP: Learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[30] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[31] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.
[32] A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou, “Adapting frechet audio distance for generative music evaluation,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1331–1335.
[33] S. Lee, M. A. Martínez-Ramírez, W.-H. Liao, S. Uhlich, G. Fabbro, K. Lee, and Y. Mitsufuji, “Searching for music mixing graphs: A pruning approach,” in 27th International Conference on Digital Audio Effects (DAFx), 2024.
[34] ——, “Reverse engineering of music mixing graphs with differentiable processors and iterative pruning,” Journal of the Audio Engineering Society, vol. 73, pp. 344–365, June 2025.
[35] C.-Y. Yu, M. A. Martínez-Ramírez, J. Koo, B. Hayes, W.-H. Liao, G. Fazekas, and Y. Mitsufuji, “Diffvox: A differentiable model for capturing and analysing professional effects distributions,” in 28th International Conference on Digital Audio Effects (DAFx), 2025.
[36] A. Chu, P. O’Reilly, J. Barnett, and B. Pardo, “Text2FX: Harnessing clap embeddings for text-guided audio effects,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.