ITO-Master: Inference-Time Optimization for Audio Effects Modeling of Music Mastering Processors

Author: Junghyun Koo; Marco Martinez-Ramirez; WeiHsiang Liao; Giorgio Fabbro; Michele Mancusi; Yuki Mitsufuji
Publisher: Zenodo
DOI: 10.5281/zenodo.17706351
Source: https://zenodo.org/records/17706351/files/000016.pdf
ITO-MASTER: INFERENCE-TIME OPTIMIZATION FOR AUDIO EFFECTS
MODELING OF MUSIC MASTERING PROCESSORS
Junghyun Koo¹, Marco A. Martínez-Ramírez¹, Wei-Hsiang Liao¹,
Giorgio Fabbro², Michele Mancusi², Yuki Mitsufuji¹,³
¹Sony AI, Japan  ²Sony Europe B.V., Germany  ³Sony Group Corporation, Japan
{firstname.lastname}@sony.com
ABSTRACT
Music mastering style transfer aims to model and apply the mastering characteristics of a reference track to a target track, simulating the professional mastering process. However, existing methods apply fixed processing based on a reference track, limiting users' ability to fine-tune the results to match their artistic intent. In this paper, we introduce the ITO-Master framework, a reference-based mastering style transfer system that integrates Inference-Time Optimization (ITO) to enable finer user control over the mastering process. By optimizing the reference embedding z_ref during inference, our approach allows users to refine the output dynamically, making micro-level adjustments to achieve more precise mastering results. We explore both black-box and white-box methods for modeling mastering processors and demonstrate that ITO improves mastering performance across different styles. Through objective evaluation, subjective listening tests, and qualitative analysis using text-based conditioning with CLAP embeddings, we validate that ITO enhances mastering style similarity while offering increased adaptability. Our framework provides an effective and user-controllable solution for mastering style transfer, allowing users to refine their results beyond the initial style transfer.
1. INTRODUCTION
Music mastering is the final step in the audio production process, ensuring professional sound quality and consistent playback across music distribution platforms. This process involves applying a series of audio effects such as equalization, compression, stereo imaging, and limiting, which collectively shape the sonic characteristics and enhance the overall quality of the audio [1, 2]. Traditionally, mastering has required skilled engineers who carefully adjust these effects based on the track's content and desired artistic outcome. However, with the increasing volume of music production and the demand for consistency across streaming platforms, the need for automated mastering solutions has grown substantially.
© J. Koo, M. Martínez-Ramírez, W-H. Liao, G. Fabbro, M. Mancusi, and Y. Mitsufuji. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. Koo, M. Martínez-Ramírez, W-H. Liao, G. Fabbro, M. Mancusi, and Y. Mitsufuji, “ITO-Master: Inference-Time Optimization for Audio Effects Modeling of Music Mastering Processors”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
In response to this demand, various automatic mastering systems have emerged [3–5]. However, these systems operate in an unconditioned manner, applying audio effects without direct user control. To introduce adaptability, reference-based approaches have been explored, where the processing characteristics of a reference track are applied to another [6, 7]. These methods aim to match audio features such as dynamics, tonal balance, and stereo width, offering an alternative to fully automatic mastering. However, significant challenges remain in achieving both high-quality results and controllability.
Existing reference-based approaches can be broadly categorized into black-box and white-box models. Black-box models, often based on end-to-end neural networks [6], can effectively capture high-level audio patterns but lack transparency and interpretability, making it difficult for users to modify specific aspects of the processing. In contrast, white-box models leverage either feature matching algorithms [7] or differentiable audio processors [5] to provide greater control over individual parameters. While white-box methods offer a structured and interpretable approach [8], they are often constrained by the simplicity of their differentiable processors, which may not fully replicate the complex tools used in professional mastering.
In this paper, we introduce the ITO-Master framework, a reference-based approach to audio effects modeling of music mastering processors that incorporates Inference-Time Optimization (ITO) for finer user control. While previous style transfer methods apply fixed processing based on a reference, our approach allows users to dynamically refine the output when the initial result does not fully align with their preferences. By optimizing the reference embedding z_ref during inference, ITO-Master enables micro-level adjustments, allowing for more precise and targeted mastering refinements.
The key contributions of this work include:
(1) ITO-Master framework: A novel approach for reference-based audio effects modeling of mastering processors using ITO.
(2) Comparison of black-box and white-box models: A systematic study of two paradigms for evaluating their effectiveness and trade-offs.
(3) Realistic mastering processor chain: Implementation of a structured differentiable mastering pipeline to enhance the realism of white-box processing.
(4) Comprehensive evaluation: Performance validation via objective metrics, listening tests, and qualitative analysis using text-based conditioning with CLAP embeddings.
2. RELATED WORKS
2.1 Audio Effects Style Transfer
Audio effects style transfer has become a significant area of research in automating and enhancing music production. Recent advancements in deep learning have led to more sophisticated approaches, where neural networks are used to learn complex mappings between input and output audio signals. These methods have been applied to style transfer across single audio effects (Fx) or multiple sets of Fx (Fx chain) [5, 6, 9–14], effectively modeling temporal dependencies and applying style transfer based on a reference track at the waveform level. While these methods have shown success in controlled environments, challenges remain in extending their application to diverse real-world scenarios, particularly in adapting to different mastering styles and varying input signals.
2.2 Inference-Time Optimization
Recently, [15, 16] have explored inference-time optimization (ITO) in music generation tasks, where the initial latent embedding is optimized by backpropagating through diffusion-based models with the loss between a generated sample and a reference track. In the context of audio effects style transfer, ITO has been applied in methods like ST-ITO [17, 18], where interpretable parameters in a white-box differentiable Fx chain are optimized.
Our work focuses specifically on mastering style transfer, where handling heavily compressed audio, common in commercially released music, requires particular attention to limiters and dynamic range management. The ITO-Master framework introduces ITO on the reference embedding z_ref, allowing optimization at inference time in both black-box and white-box models. By fine-tuning z_ref, our approach adapts the mastering style to a reference track without retraining the entire model. ITO-Master ensures smooth adaptation to the reference track's characteristics while preserving the ability to observe and adjust the underlying parameters in the white-box model. This makes our framework particularly suitable for mastering tasks, providing professionals and amateurs with precise control over the audio mastering process.
3. METHODOLOGY
In this section, we describe the components of the proposed mastering style transfer framework: the training pipeline, Mastering Style Converter, differentiable mastering chain, and ITO. The training pipeline simulates real-world mastering scenarios by applying random Fx manipulations. The Mastering Style Converter Ψ transfers the mastering style from a reference track to a target track and can be implemented using both black-box and white-box approaches. The differentiable mastering chain serves as a white-box processor that models various Fx in a structured sequence. Lastly, the ITO process optimizes z_ref at inference time to enhance style transfer performance. The following subsections describe each component in detail.
3.1 Training Pipeline of Mastering Style Transfer
As shown in Figure 1(a), the training pipeline follows established methodologies in style transfer, utilizing a self-supervised training framework with random Fx manipulation [6, 11–13]. Based on the understanding that a single song maintains a consistent mastering style throughout [2], a song is first segmented into two parts, A and B. For the input of Ψ, random manipulation f_1 is applied to simulate a random style, similar to how the process would function in an application setting. Then, we apply Fx-Normalization [19] f_norm, which normalizes certain Fx characteristics to fixed target levels, to facilitate the performance of style transfer. f_norm is only applied to the equalizer (EQ), stereo imager, and loudness levels, allowing the model to capture a broader range of nonlinear Fx transformations, such as compression levels and distortion. While normalizing compression is technically feasible, it is not well-suited to the on-the-fly training procedure. For distortion, normalization would require either removing all distortion from the given song or applying a consistent distortion level across all tracks, which is outside the scope of this paper. In summary, the input of Ψ is defined as x_in = f_1(f_norm(A)).
To achieve style transfer, the Fx information from the reference track x_ref is encoded to create the reference embedding that conditions Ψ. A second random manipulation f_2 is applied to segment B, which is then encoded by the reference encoder Φ, resulting in the reference embedding z_ref = Φ(f_2(B)). The training process minimizes the loss between the model output y′ = Ψ(x_in, z_ref) and the target signal y = f_2(A). Since both A and B originate from the same mastered song, we assume that y and x_ref share the same mastering style.
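The construction of one self-supervised training triple (x_in, z_ref, y) can be sketched as follows. This is only an illustration under loud assumptions: `peak_normalize`, `sample_gain_chain`, and `peak_encoder` are hypothetical toy stand-ins for f_norm, the random Fx manipulation, and the reference encoder Φ, and audio is represented as a plain list of samples. Only the data flow mirrors the text.

```python
import random

def peak_normalize(x):
    """Toy stand-in for f_norm (assumption): scale to unit peak."""
    peak = max(abs(s) for s in x) or 1.0
    return [s / peak for s in x]

def sample_gain_chain(rng):
    """Toy stand-in for a random Fx manipulation (assumption): a random gain."""
    gain = rng.uniform(0.25, 2.0)
    return lambda x: [gain * s for s in x]

def peak_encoder(x):
    """Toy stand-in for the reference encoder Phi (assumption)."""
    return [max(abs(s) for s in x)]

def make_training_pair(song, f_norm, sample_chain, encoder, rng):
    half = len(song) // 2
    seg_a, seg_b = song[:half], song[half:]         # segments A and B of one song
    f1, f2 = sample_chain(rng), sample_chain(rng)   # two independent manipulations
    x_in = f1(f_norm(seg_a))                        # x_in = f_1(f_norm(A))
    z_ref = encoder(f2(seg_b))                      # z_ref = Phi(f_2(B))
    y = f2(seg_a)                                   # target y = f_2(A)
    return x_in, z_ref, y

rng = random.Random(0)
song = [0.1 * ((i % 7) - 3) for i in range(64)]
x_in, z_ref, y = make_training_pair(
    song, peak_normalize, sample_gain_chain, peak_encoder, rng
)
```

Because f_2 is shared between the target and the reference segment, the converter can only reach y by reading the style from z_ref, which is the point of the setup described above.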
3.2 Mastering Style Converter
Ψ can be implemented using two distinct modeling approaches: black-box modeling and white-box modeling. In the black-box approach, Ψ directly models the waveform signal y′. Conversely, the white-box approach estimates the parameters Θ of the differentiable mastering Fx chain. The formulation of the differentiable mastering Fx chain in the white-box model is identical to that used in the Fx manipulator for processing randomly mastered audio.
The training objective of Ψ is the multi-scale spectral loss L_MSS, applied to both left-right and mid-side channels (where mid = left + right, side = left - right), as used in [12].
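A minimal sketch of a multi-scale spectral loss evaluated on the left, right, mid, and side signals is given below. The FFT sizes, the hop of one quarter frame, the Hann window, and the plain L1 distance on linear magnitudes are all assumptions chosen for illustration; the exact formulation of L_MSS in [12] may differ. A naive DFT is used so the example needs only the standard library.

```python
import cmath
import math

def stft_mag(x, n_fft, hop):
    """Magnitude spectrogram from a naive DFT over Hann-windowed frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = [x[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames

def spectral_l1(a, b, n_fft):
    """Mean absolute difference between two magnitude spectrograms."""
    mag_a = stft_mag(a, n_fft, n_fft // 4)
    mag_b = stft_mag(b, n_fft, n_fft // 4)
    diffs = [abs(p - q) for fa, fb in zip(mag_a, mag_b) for p, q in zip(fa, fb)]
    return sum(diffs) / len(diffs)

def channel_views(stereo):
    """Return left, right, mid = left + right, side = left - right."""
    left, right = stereo
    mid = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return [left, right, mid, side]

def mss_loss_lr_ms(pred, target, fft_sizes=(16, 32, 64)):
    """Sum the spectral distance over all four views and all FFT sizes."""
    return sum(
        spectral_l1(p, t, n_fft)
        for p, t in zip(channel_views(pred), channel_views(target))
        for n_fft in fft_sizes
    )

left = [math.sin(2 * math.pi * 5 * n / 128) for n in range(128)]
right = [0.5 * s for s in left]
target = (left, right)
louder = ([1.5 * s for s in left], right)
loss_same = mss_loss_lr_ms(target, target)
loss_diff = mss_loss_lr_ms(louder, target)
```

Including the mid and side views makes the loss sensitive to stereo-width changes that a purely per-channel comparison would weight differently.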
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[Figure 1: Overall pipeline of ITO-Master. (a) Training pipeline of Mastering Style Converter Ψ. During this phase, Reference Encoder Φ is trained using diverse mastering styles generated by random FX manipulation f. The target signal y is synthesized by applying the same manipulation to segment A as to reference segment B, both from the same song. (b) ITO is performed using an auxiliary (content-independent) objective function, allowing any reference music. Users can optimize z_ref based on their preferences for both the reference and objective function.]
3.3 Differentiable Mastering Chain Modeling
The mastering chain is designed to be a fully differentiable white-box processor, serving a dual purpose: functioning both as a mastering style converter and as a random mastering manipulation module for training the converter. By modeling a wide range of variability in mastering styles, the chain enables the system to robustly handle and replicate the complexities of real-world music mastering. To simulate a realistic music mastering process [3], the chain includes six distinct Fx modules: 1. 6-band parametric equalizer, 2. distortion, 3. 3-band compressor, 4. makeup gain, 5. stereo imager, and 6. limiter. The order of these modules is fixed, with the probability of applying each Fx module for random manipulation during training set at 90%, 30%, 80%, 85%, 60%, and 100%, respectively. These probabilities are adopted to introduce greater variability while preventing the synthesis of overly unrealistic mastering styles to enhance Ψ's modeling capability. The chain comprises a total of 46 controllable parameters.
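The module selection for one random manipulation can be sketched as below. The order and probabilities are those stated above; the module names are hypothetical labels introduced for this example, and how the parameters of each active module are sampled is not shown.

```python
import random

# Fixed module order with application probabilities from Section 3.3.
# The string names are illustrative labels, not identifiers from the paper.
FX_CHAIN = [
    ("parametric_eq_6band", 0.90),
    ("distortion", 0.30),
    ("compressor_3band", 0.80),
    ("makeup_gain", 0.85),
    ("stereo_imager", 0.60),
    ("limiter", 1.00),
]

def sample_active_modules(rng):
    """Return the modules applied in one random manipulation, in chain order."""
    return [name for name, prob in FX_CHAIN if rng.random() < prob]

rng = random.Random(0)
chains = [sample_active_modules(rng) for _ in range(2000)]
distortion_rate = sum("distortion" in chain for chain in chains) / len(chains)
```

Since `rng.random()` returns values in [0, 1), the limiter with probability 1.00 is always kept, so every sampled chain ends with a limiter.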
To ensure differentiability, the Fx modules are implemented using open-source libraries (footnotes 1, 2) that support gradient-based optimization techniques. For modeling the 3-band multiband compression, a fourth-order Linkwitz-Riley crossover filter [20] is first applied to split the signal into three bands, followed by a differentiable all-pole filter. A key component in the chain is the use of these differentiable all-pole filters, as modeled by [21], which enable the computation of both compression and expansion effects in both the multiband compressor and limiter. This capability is crucial for practical applications, allowing the system to manage both limiters and delimiters, which is particularly important in real-world scenarios where most commercially released music is heavily compressed [2], requiring effective style transfer under such conditions.
3.4 Inference-Time Optimization on Reference Embedding
The primary contribution of this work is the introduction of ITO on z_ref. Instead of fine-tuning the entire model Ψ, the focus is on optimizing only z_ref while keeping the pre-trained Ψ fixed, as shown in Figure 1(b). Although optimizing z_ref during inference time in the Black-box model does not provide interpretability in terms of Fx processors, the White-box model preserves interpretability. In fact, the changes in the parameters Θ before and after the ITO process can be observed, providing insight into how Θ is adjusted. Additionally, users can optimize the system with an alternative reference signal x′_ref, offering a different approach from conventional mastering style transfer, as this method merges mastering styles based on the combination of the new reference signal and the optimizing objective function. The advantages of ITO on z_ref include a significantly reduced number of optimization steps compared to optimizing the entire differentiable chain's Θ from scratch, as we compare in Section 5.
1 https://github.com/csteinmetz1/dasp-pytorch
2 https://github.com/DiffAPF/torchcomp
For ITO, the Audio Feature (AF) loss proposed by [13] is utilized as the auxiliary objective function L_aux. The AF loss is a content-independent loss that combines various audio feature transformations, capturing the dynamics, spatialization, and spectral characteristics of the audio. Each transformation in the AF loss has its own predefined weighting factor, and these weighted transformations are summed together to compute the overall loss. In our experiments, the original weights from the reference paper are followed. We optimize z_ref iteratively using gradient descent:
$$z_{ref}^{(t+1)} = z_{ref}^{(t)} - \eta \nabla_{z} L_{aux}\left(\Psi(x_{in}, z_{ref}^{(t)}),\, x_{ref}\right),$$
where η is the learning rate. For objective and subjective evaluation, we focus solely on using AF loss as the optimization objective of ITO. Since ITO can be optimized with any loss function, we also qualitatively explore optimization using a text prompt with CLAP embeddings [22] in Section 5.3.
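The update rule can be illustrated with a toy problem. Everything here is an assumption made for the sake of a runnable example: the frozen "converter" is a pair of per-channel gains set by a 2-dimensional embedding, the auxiliary loss compares per-channel RMS only (so it is content-independent in the same spirit as the AF loss, but far simpler), and gradients come from central finite differences rather than backpropagation. The experiments in the paper use the RAdam optimizer, not the plain gradient descent shown here.

```python
import math

def rms(x):
    return math.sqrt(sum(s * s for s in x) / len(x))

def converter(x_in, z):
    """Toy frozen converter (assumption): z holds one gain per stereo channel."""
    left, right = x_in
    return [z[0] * s for s in left], [z[1] * s for s in right]

def aux_loss(y, x_ref):
    """Toy content-independent loss: compares per-channel RMS, not samples."""
    return sum((rms(a) - rms(b)) ** 2 for a, b in zip(y, x_ref))

def ito(x_in, x_ref, z_init, lr=2.0, steps=200, eps=1e-6):
    """Gradient descent on the embedding only; the converter stays fixed."""
    z = list(z_init)
    objective = lambda zz: aux_loss(converter(x_in, zz), x_ref)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):                      # central finite differences
            hi = z[:i] + [z[i] + eps] + z[i + 1:]
            lo = z[:i] + [z[i] - eps] + z[i + 1:]
            grad.append((objective(hi) - objective(lo)) / (2 * eps))
        z = [zi - lr * gi for zi, gi in zip(z, grad)]  # z <- z - eta * grad
    return z

x_in = ([0.1, -0.2, 0.3, -0.4], [0.2, 0.1, -0.1, -0.2])
x_ref = ([0.5, -0.5, 1.0, -1.0], [0.5, -0.5, 0.5, -0.5])
z0 = [1.0, 1.0]
z_opt = ito(x_in, x_ref, z0)
loss_before = aux_loss(converter(x_in, z0), x_ref)
loss_after = aux_loss(converter(x_in, z_opt), x_ref)
```

Note that the reference carries different sample content from the input; only its level statistics drive the update, which is what allows any reference music to be used.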
4. EXPERIMENTS
4.1 Dataset
We utilized the MoisesDB dataset [23] for training, and validated using the MUSDB18 validation subset [24]. Mixture samples from these datasets are employed, as they are not fully mastered, allowing random Fx manipulation with f to create synthetic mastered samples. For Fx-Normalization, mean statistics are precomputed on the MoisesDB dataset, and normalization is applied to match the EQ, stereo imager, and loudness levels. For evaluation, 200 songs are randomly selected from the MTG-Jamendo dataset [25]. Of these, 100 songs are used as x_in, and the remaining 100 serve as x_ref to input into Ψ. During training and validation, both segments A and B are 11.8 seconds long. For evaluation, 30-second samples are used since the fully convolutional architecture of Ψ can handle variable-length inputs.
4.2 Experimental Setup
The experimental setup includes two primary training configurations for Ψ: Black-box and White-box methods. Both configurations use the Temporal Convolutional Network (TCN) [26] architecture for processing x_in, with 10.5 million trainable parameters. Pre-trained weights of the FXencoder [12] are adopted as Φ and tested under two conditions: with and without Φ being trained alongside Ψ. All models are trained for 72,000 iterations with a batch size of 4.
In addition to training the Φ and Ψ models, ITO is performed on each test sample to optimize the reference embedding z_ref, which has a dimensionality of 2048. The ITO process is run for a maximum of 100 steps or is stopped earlier if the loss value increases, indicating convergence. For comparison, an alternative ITO approach is also applied where optimization is performed solely on the parameters Θ of the differentiable mastering chain f. In this case, Θ is optimized for up to 2K steps to evaluate its effectiveness relative to the proposed ITO method focused on z_ref. All methods are optimized using the RAdam optimizer [27] with a learning rate of 2·10^-4. More details are available in our open-source repository (footnote 3).
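The stopping rule described above (at most 100 steps, stop as soon as the loss rises) can be sketched as a small driver loop. The function name `run_ito`, the choice to return the embedding from just before the increase, and the one-dimensional toy objective are assumptions for illustration; the released code may organize this differently.

```python
def run_ito(step_fn, loss_fn, z_init, max_steps=100):
    """Iterate step_fn until max_steps or until the loss increases."""
    z = z_init
    best_z, best_loss = z, loss_fn(z)
    losses = [best_loss]
    for _ in range(max_steps):
        z = step_fn(z)
        loss = loss_fn(z)
        losses.append(loss)
        if loss > best_loss:          # loss went up: treat as converged
            break
        best_z, best_loss = z, loss
    return best_z, losses

# Toy 1-D problem: fixed-stride steps toward a target, overshooting once.
TARGET = 3.0

def toy_loss(z):
    return (z - TARGET) ** 2

def toy_step(z):
    return z + 0.5 if z < TARGET else z - 0.5

best_z, losses = run_ito(toy_step, toy_loss, z_init=0.0)
```

In this toy run the loss falls for six steps, reaches zero at the target, and rises on the seventh step, at which point the loop stops and the last improving value is returned.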
4.3 Evaluation Metrics
To objectively evaluate mastering style transfer, content-independent objectives are utilized, given that different content is being compared. The following metrics are employed:
• Audio Feature (AF) Loss: As discussed in Section 3.4, AF Loss measures how well the output y′ matches the desired audio features.
• Dynamic Range Variability (DRV): DRV is a crucial metric for assessing the compression level of audio, particularly in the context of music mastering, where limiters play a significant role. The DRV metric is computed by first identifying peak values in the audio signal using a high-frequency content onset detection function. DRV is the standard deviation of these peak values after filtering out the lowest 25% of the values, which is defined as:
$$\mathrm{DRV} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{std}\left(\left\{ p_i^c : p_i^c > \mathrm{percentile}(p^c, 75) \right\}\right) \qquad (1)$$
where p^c denotes the set of peak values {p_1^c, p_2^c, ...} in channel c, and C is the total number of channels. The metric reflects the variability in dynamic range, with higher values indicating less consistent compression.
• Fx Embedding Similarity (cos sim): Cosine similarity measures the similarity between the reference embedding Φ(x_ref) and the output embedding Φ(y′). We adopt the pretrained FXencoder as Φ. This metric evaluates how closely the output matches the reference in terms of its learned Fx characteristics.
• Fréchet Audio Distance (FAD): FAD [28] assesses the perceptual quality of the generated audio by comparing the statistical distribution of the model outputs to a reference distribution. FAD is calculated using three deep audio embeddings: CLAP [29], DAC [30], and EnCodec [31]. We chose these embeddings as they have shown strong correlation with human preference in acoustic quality, with codec-based models like DAC and EnCodec being particularly sensitive to acoustic effects, as demonstrated by [32]. The metric computes the distance between the distribution of features extracted from the model's output and those extracted from a subset of the Jamendo dataset, measuring how natural the style-transferred audio sounds compared to real recordings.
3 https://github.com/SonyResearch/ITO-Master
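Two of the metrics above are simple enough to sketch directly. The DRV function follows Eq. (1) as printed, with the percentile threshold exposed as a parameter; it assumes the per-channel peak values have already been extracted by an onset detector, which is not shown. The linear-interpolation percentile and the use of the population standard deviation are assumptions, since the text does not specify either. The cosine similarity takes two plain embedding vectors standing in for Φ(x_ref) and Φ(y′).

```python
import math
import statistics

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def drv(peaks_per_channel, q=75):
    """Eq. (1): mean over channels of the std of peaks above the q-th percentile."""
    stds = []
    for peaks in peaks_per_channel:
        threshold = percentile(peaks, q)
        kept = [p for p in peaks if p > threshold]
        stds.append(statistics.pstdev(kept) if kept else 0.0)
    return sum(stds) / len(stds)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

varied = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]   # uneven peaks
limited = [1.0] * 8                                 # peaks pinned by a limiter
drv_varied = drv([varied])
drv_limited = drv([limited])
```

A heavily limited signal, whose peaks all sit at the ceiling, yields a DRV of zero, while a signal with uneven peaks yields a larger value, matching the interpretation given for the metric.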
4.4 Baseline Methods
The following baseline methods, representing existing mastering style transfer systems, are used for comparison:
• Acoustic Feature Matching Approaches: Feature matching approaches aim to adjust the Fx of an input track to match those of a reference track by directly aligning specific audio features.
  – Fx-Normalization [19]: Instead of normalizing the given audio to the mean statistics of the target data distribution, this approach directly matches the Fx levels to those of the reference song. The official implementation (footnote 4) is used to match the audio effects in the order of EQ, compression, stereo imaging, and loudness, respectively.
  – Matchering [7]: An open-source library that matches the given song's RMS, frequency response, peak amplitude, and stereo width to those of the reference track. The official implementation (footnote 5) is used to infer the processed songs.
• E2E Remastering [6]: This end-to-end remastering approach is a black-box model that directly predicts the signal y′ at the waveform level. The model is trained in a self-supervised manner using a large dataset of released pop songs. It leverages a pre-trained encoder and a projection discriminator to encourage the generation of realistic audio that accurately reflects the mastering style of the reference track.
5. RESULTS
5.1 Objective Evaluation
The performance of the proposed methods, along with the baseline approaches, is summarized in Table 1. The feature matching methods, specifically Fx-Normalization and Matchering, demonstrate strong performance on the AF and FAD metrics. This is expected, as these approaches directly apply Fx-related transformations to match the reference track's characteristics. However, these methods perform poorly on the DRV metric, as they lack the control needed for proper dynamic range adjustments, which is crucial in real-world mastering tasks involving delimiting. E2E Remastering [6] shows good performance on the FAD with CLAP and EnCodec embeddings, likely due to its use of an adversarial objective during training, which aids in generating realistic audio that closely matches the reference distribution. However, this system falls short on AF and DRV, indicating challenges in capturing precise audio feature transformations and managing dynamic range.
4 https://github.com/sony/FxNorm-automix
5 https://github.com/sergree/matchering

Group             Method                   AF (↓)   DRV (↓)   cos sim (↑)   FAD CLAP (↓)   FAD DAC (↓)   FAD EnCodec (↓)
Feature Matching  Fx-Normalization [19]    0.157    0.801     0.941         161.4          177.4         84.53
Feature Matching  Matchering [7]           0.160    0.823     0.942         110.8          126.1         59.34
Baseline          E2E Remastering [6]      0.288    0.858     0.942         104.3          176.7         37.19
Proposed          Black-box                0.346    0.685     0.944         160.8          378.4         51.12
Proposed            + train Φ              0.125    0.577     0.945         159.8          177.4         46.94
Proposed            + ITO on z_ref         0.099    0.567     0.946         182.2          180.5         42.82
Proposed          White-box                0.253    0.598     0.946         93.7           144.8         36.22
Proposed            + train Φ              0.186    0.521     0.945         93.2           101.4         38.90
Proposed            + ITO on z_ref         0.139    0.474     0.946         105.2          109.1         42.99
Proposed          ITO on Θ                 0.250    0.609     0.927         216.8          294.8         101.60

Table 1: Mastering Style Transfer on Jamendo dataset (real-world scenario).
Among the proposed methods, Black-box approaches outperform White-box methods in terms of AF, indicating their effectiveness in capturing audio feature transformations from direct modeling of y′ with L_MSS. However, the White-box method shows better results across all FAD metrics, suggesting that it produces audio more aligned with real-world distributions. This may imply that while Black-box models capture more detailed transformations, White-box approaches produce outputs that are more perceptually consistent with real-world mastering styles.
When Ψ is trained while keeping the pre-trained FXencoder Φ fixed, the performance is generally inferior. This is likely because the FXencoder was trained on a different set of Fx chains and may not fully capture the manipulations applied to the reference songs in this context. However, when Φ is trained alongside Ψ, there is a significant improvement in performance, as this joint training allows the encoder to better adapt to the specific Fx manipulations used, leading to more accurate mastering style transfer.
Applying ITO on z_ref enhances AF performance, but introduces a trade-off in FAD scores, indicating that while ITO can refine mastering style transfer, the number of optimization steps must be carefully calibrated to balance competing objectives. Conversely, applying ITO directly on Θ yields poor results across all metrics, even with a larger number of optimization steps. Interestingly, in reverse engineering tasks, where the input and output content are identical, optimizing Θ works well despite the complexity of the Fx chain [33–35]. However, in mastering style transfer, content-independent loss functions are used to replicate only the mastering style from the reference track. This distinction highlights why ITO on Θ is less effective in this context. Since Ψ is trained with a content-dependent objective, it leverages content information to enhance mastering style transfer. In contrast, L_aux used in ITO fails to capture the intricacies of the task, making it unsuitable for optimizing the entire differentiable chain in this scenario.
[Figure 2: Subjective evaluation results. Systems compared: Input, Fx-Normalization, E2E Remastering, Black-box + train, Black-box + ITO, White-box + train, White-box + ITO. Score axis from 0 to 100.]
Audio samples are available on our demo page (footnote 6).
5.2 Subjective Evaluation
To further validate our proposed methods subjectively, we conducted a MUSHRA-type listening test with 10 participants, all familiar with music post-production and digital effects, having 2 to 5 years of experience in recording, mixing, or mastering. Participants rated various processed tracks based on their similarity in mastering audio effects to a reference track. The evaluation included 8 questions, with 30-second-long music recordings for all stimuli. The reference audio contained different content from the stimuli, but we ensured the reference and stimuli were not too dissimilar in terms of genre or instrumentation. As a low anchor, the initial music track before being input into style transfer systems was presented. There was no high anchor, as the evaluation setup aimed to mimic real-world Fx style transfer using music tracks from the Jamendo dataset.
As illustrated in Figure 2, the subjective test results align with the trends observed in our objective evaluations. All our proposed methods surpass the baselines, with results further enhanced by ITO, showing audio effects characteristics more similar to those of the reference. The similarity scores of the proposed system ranged from 0 to 100, indicating that the listening test was highly challenging, even for experts with domain knowledge. Nevertheless, the proposed systems consistently outperformed the baseline, showing significant improvements (pairwise t-test, p < 0.05).
6 https://tinyurl.com/ITO-Master
[Figure 3: Comparison of different audio features between the input music and ITO-processed tracks using text prompts “Classic Music”, “Metal Music”, and “Hip-Hop Music”. Panels: (a) Input Spectrogram (time in s, frequency in Hz); (b) Spectrogram Difference relative to the input for each prompt (color scale from -40 dB to +40 dB); (c) Spectral Centroid (Hz); (d) Crest Factor (peak-to-RMS); (e) RMS Energy.]
5.3 Qualitative Analysis of ITO with Text Prompts
To evaluate the effectiveness of ITO under different L_aux, we perform a qualitative analysis using text prompts with CLAP embeddings [22], similar to the application demonstrated in [36]. Given an input music track, we optimize z_ref of the proposed white-box model using text-based conditioning, leveraging the CLAP embedding cosine similarity as the optimization objective. Specifically, we compute the audio embedding CLAP_aud of the steered output and the text embedding CLAP_txt of the given reference text prompt, then maximize their cosine similarity to guide the transformation. The input music piece used for this analysis is an 11.8-second-long instrumental rock track. Since the input content remains unchanged across different steered results, we can directly assess the influence of each text prompt on various musical features. We explore ITO with three different prompts: “Classic Music”, “Metal Music”, and “Hip-Hop Music” for analysis. This experimental setup can be explored through our interactive demo (footnote 7).
As shown in Figure 3, the optimized results exhibit distinct characteristics that align with general expectations of each genre. The spectrogram difference plots in 3(b) (steered output - input) highlight the frequency ranges most affected by ITO. Specifically, the “Classic Music” prompt exhibits notable changes in the mid and high frequencies, aligning with the characteristic brightness and clarity often associated with classical recordings. The “Metal Music” prompt shows differences in both low and high frequencies, reflecting the genre's typical emphasis on powerful bass and sharp treble for aggressive instrumentation. In contrast, the “Hip-Hop Music” prompt predominantly affects the low-frequency range, reinforcing the genre's signature emphasis on deep bass and sub-bass elements, which are essential for driving rhythm-heavy beats.
7 https://huggingface.co/spaces/jhtonyKoo/ITO-Master
These observations are further supported by the spectral centroid, crest factor, and RMS energy analyses. The spectral centroid results follow an expected trend, where hip-hop has the lowest centroid due to its bass-heavy nature, followed by metal, while classic music has the highest centroid, reflecting its emphasis on harmonic richness and treble clarity. The crest factor, representing peak-to-RMS ratio, is lowest for hip-hop, indicating a more compressed and bass-heavy dynamic structure, while classic music has the highest crest factor, aligning with its typically uncompressed, wide dynamic range. RMS energy shows an increasing trend from classic to hip-hop, with metal falling in between, which is consistent with the respective loudness and dynamic characteristics of these genres. These findings suggest that ITO, guided by CLAP_txt, successfully steers the mastering Fx chain to align with the expected sonic characteristics of the given text prompt, demonstrating its potential as a creative tool for music post-production.
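The three descriptors used in this analysis can be computed as in the sketch below. It works on a whole mono signal at once, whereas the figure tracks each feature over time; the frame-wise version, window choice, and sample rate used for the figure are not specified here and are left out. A naive DFT keeps the example dependency-free.

```python
import cmath
import math

def rms(x):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def crest_factor(x):
    """Peak-to-RMS ratio; lower values indicate heavier compression."""
    return max(abs(s) for s in x) / rms(x)

def spectral_centroid(x, sample_rate):
    """Magnitude-weighted mean frequency in Hz, from a naive DFT."""
    n = len(x)
    bins = range(n // 2 + 1)
    mags = [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in bins
    ]
    freqs = [k * sample_rate / n for k in bins]
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)

sample_rate = 8000
sine = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(64)]
clipped = [max(-0.5, min(0.5, s)) for s in sine]   # crude stand-in for limiting

crest_sine = crest_factor(sine)
crest_clipped = crest_factor(clipped)
centroid_sine = spectral_centroid(sine, sample_rate)
```

Hard-clipping the sine lowers its crest factor, which is the same direction of change the text associates with the more compressed "Hip-Hop Music" result.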
6. CONCLUSION
In this paper, we introduced the ITO-Master framework, which leverages ITO on z_ref for mastering style transfer. Our experiments showed that training the reference encoder Φ alongside Ψ improves performance. Optimizing z_ref with ITO led to meaningful improvements with few steps, outperforming direct optimization of Θ in efficiency. Subjective evaluations confirmed that our method produces perceptually aligned mastering effects, and qualitative results highlighted the potential of text-conditioned ITO for creative applications. As future work, we plan to incorporate production quality and usability into the evaluation, alongside reference alignment. Since mastering is a curatorial task, poor reference choices can lead to suboptimal results despite high alignment, highlighting the need for perceptual preference metrics.
7. REFERENCES
[1] U. Zölzer, X. Amatriain, D. Arfib, J. Bonada, G. De Poli, P. Dutilleux, G. Evangelista, F. Keiler, A. Loscos, D. Rocchesso et al., DAFX-Digital audio effects. John Wiley & Sons, 2002.
[2] M. Shelvock, Audio mastering as musical practice. The University of Western Ontario (Canada), 2012.
[3] M. Piotrowska, S. Piotrowski, and B. Kostek, “A study on audio signal processed by "instant mastering" services,” in Audio Engineering Society Convention 142. Audio Engineering Society, 2017.
[4] J. Sterne and E. Razlogova, “Machine learning in context, or learning from LANDR: Artificial intelligence and the platformization of music mastering,” Social Media + Society, vol. 5, no. 2, p. 2056305119847525, 2019.
[5] M. A. M. Ramírez, O. Wang, P. Smaragdis, and N. J. Bryan, “Differentiable signal processing with black-box audio effects,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 66–70.
[6] J. Koo, S. Paik, and K. Lee, “End-to-end music remastering system using self-supervised and adversarial training,” in Proc. ICASSP, 2022, pp. 4608–4612.
[7] S. Grishakov, C.-Y. Yu, and Zicklag, “Matchering: Audio matching and mastering python library,” https://github.com/sergree/matchering.
[8] J. Engel, C. Gu, A. Roberts et al., “DDSP: Differentiable digital signal processing,” in International Conference on Learning Representations, 2020.
[9] J. Koo, S. Paik, and K. Lee, “Reverb conversion of mixed vocal tracks using an end-to-end convolutional deep neural network,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 81–85.
[10] S. Lee, J. Park, S. Paik, and K. Lee, “Blind estimation of audio processing graph,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[11] C. J. Steinmetz, N. J. Bryan, and J. D. Reiss, “Style transfer of audio effects with differentiable signal processing,” J. Audio Eng. Soc, vol. 70, no. 9, pp. 708–721, 2022.
[12] J. Koo, M. A. Martínez-Ramírez, W.-H. Liao, S. Uhlich, K. Lee, and Y. Mitsufuji, “Music mixing style transfer: A contrastive learning approach to disentangle audio effects,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[13] S. S. Vanka, C. Steinmetz, J.-B. Rolland, J. Reiss, and G. Fazekas, “Diff-MST: Differentiable mixing style transfer,” in Proc. ISMIR, 2024.
[14] Y.-H. Chen, Y.-T. Yeh, Y.-C. Cheng, J.-T. Wu, Y.-H. Ho, J.-S. R. Jang, and Y.-H. Yang, “Towards zero-shot amplifier modeling: One-to-many amplifier modeling via tone embedding control,” in Proc. ISMIR, 2024.
[15] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “DITTO: Diffusion inference-time T-optimization for music generation,” in Proc. ICML, 2024.
[16] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. Bryan, “DITTO-2: Distilled diffusion inference-time T-optimization for music generation,” in Proc. ISMIR, 2024.
[17] C. Steinmetz, S. Singh, I. Ibnyahya, S. Yuan, E. Benetos, J. Reiss et al., “ST-ITO: Controlling audio effects for style transfer with inference-time optimization,” in Proc. ISMIR, 2024.
[18] C.-Y. Yu, M. A. Martínez-Ramírez, J. Koo, W.-H. Liao, Y. Mitsufuji, and G. Fazekas, “Improving inference-time optimisation for vocal effects style transfer with a gaussian prior,” arXiv preprint arXiv:2505.11315, 2025.
[19] M. A. Martínez-Ramírez, W.-H. Liao, G. Fabbro, S. Uhlich, C. Nagashima, and Y. Mitsufuji, “Automatic music mixing with deep learning and out-of-domain data,” in Proc. ISMIR, 2022.
[20] S. H. Linkwitz, “Active crossover networks for non-
coinciden d i e s,” Jou nal o he Audio Enginee ing
Socie y, ol. 24, no. 1, pp. 2–8, 1976.
[21]
C.-y. Yu, C. Mi chel ee, A. Ca son, S. Bilbao, J. Reiss,
and G. Fazekas, “Di e en iable all-pole il e s o ime-
a ying audio sys ems,” in 27 h In e na ional Con e -
ence on Digi al Audio E ec s (DAFx), 2024.
[22]
Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Be g-Ki kpa ick,
and S. Dubno , “La ge-scale con as i e language-audio
p e aining wi h ea u e usion and keywo d- o-cap ion
augmen a ion,” in ICASSP 2023-2023 IEEE In e na-
ional Con e ence on Acous ics, Speech and Signal P o-
cessing (ICASSP). IEEE, 2023, pp. 1–5.
[23]
I. Pe ei a, F. A aújo, F. Ko zeniowski, and R. Vogl,
“MoisesDB: A da ase o sou ce sepa a ion beyond 4-
s ems,” in P oc. ISMIR, 2023.
[24]
Z. Ra ii, A. Liu kus, F.-R. S ö e , S. I. Mimilakis,
and R. Bi ne , “MUSDB18-HQ - an uncomp essed
e sion o MUSDB18,” Aug. 2019. [Online]. A ailable:
h ps://doi.o g/10.5281/zenodo.3338373
[25]
D. Bogdano , M. Won, P. To s ogan, A. Po e , and
X. Se a, “The m g-jamendo da ase o au oma ic music
agging,” in P oc. ICML, 2019.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
140
[26]
C. J. S einme z and J. D. Reiss, “E icien neu al ne -
wo ks o eal- ime modeling o analog dynamic ange
comp ession,” a Xi p ep in a Xi :2102.06200, 2021.
[27]
L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and
J. Han, “On he a iance o he adap i e lea ning a e
and beyond,” in In e na ional Con e ence on Lea ning
Rep esen a ions, 2020.
[28]
K. Kilgou , M. Zuluaga, D. Roblek, and M. Sha -
i i, “F éche audio dis ance: A me ic o e alua -
ing music enhancemen algo i hms,” a Xi p ep in
a Xi :1812.08466, 2018.
[29]
B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang,
“CLAP: Lea ning audio concep s om na u al language
supe ision,” in ICASSP 2023-2023 IEEE In e na ional
Con e ence on Acous ics, Speech and Signal P ocessing
(ICASSP). IEEE, 2023, pp. 1–5.
[30]
R. Kuma , P. See ha aman, A. Luebs, I. Kuma , and
K. Kuma , “High- ideli y audio comp ession wi h im-
p o ed qgan,” Ad ances in Neu al In o ma ion P o-
cessing Sys ems, ol. 36, 2024.
[31]
A. Dé ossez, J. Cope , G. Synnae e, and Y. Adi, “High
ideli y neu al audio comp ession,” T ansac ions on Ma-
chine Lea ning Resea ch, 2023.
[32]
A. Gui, H. Gampe , S. B aun, and D. Emmanouilidou,
“Adap ing eche audio dis ance o gene a i e music
e alua ion,” in ICASSP 2024-2024 IEEE In e na ional
Con e ence on Acous ics, Speech and Signal P ocessing
(ICASSP). IEEE, 2024, pp. 1331–1335.
[33]
S. Lee, M. A. Ma ínez-Ramí ez, W.-H. Liao, S. Uhlich,
G. Fabb o, K. Lee, and Y. Mi su uji, “Sea ching o
music mixing g aphs: A p uning app oach,” in 27 h In-
e na ional Con e ence on Digi al Audio E ec s (DAFx),
2024.
[34]
——, “Re e se enginee ing o music mixing g aphs
wi h di e en iable p ocesso s and i e a i e p uning,”
Jou nal o he Audio Enginee ing Socie y, ol. 73, pp.
344–365, June 2025.
[35]
C.-Y. Yu, M. A. Ma ínez-Ramí ez, J. Koo, B. Hayes,
W.-H. Liao, G. Fazekas, and Y. Mi su uji, “Di ox: A
di e en iable model o cap u ing and analysing p o-
essional e ec s dis ibu ions,” in 28 h In e na ional
Con e ence on Digi al Audio E ec s (DAFx), 2025.
[36]
A. Chu, P. O’Reilly, J. Ba ne , and B. Pa do, “Tex 2FX:
Ha nessing clap embeddings o ex -guided audio e -
ec s,” in ICASSP 2025-2025 IEEE In e na ional Con-
e ence on Acous ics, Speech and Signal P ocessing
(ICASSP). IEEE, 2025, pp. 1–5.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
141