Fx-Encoder++: Extracting Instrument-Wise Audio Effect Representations From Mixtures

Author: Yen-Tung Yeh; Junghyun Koo; Marco Martínez-Ramírez; Wei-Hsiang Liao; Yi-Hsuan Yang; Yuki Mitsufuji
Publisher: Zenodo
DOI: 10.5281/zenodo.17706537
Source: https://zenodo.org/records/17706537/files/000071.pdf
FX-ENCODER++: EXTRACTING INSTRUMENT-WISE AUDIO EFFECTS REPRESENTATIONS FROM MIXTURES
Yen-Tung Yeh*1, Junghyun Koo2, Marco A. Martínez-Ramírez2, Wei-Hsiang Liao2, Yi-Hsuan Yang1, Yuki Mitsufuji2,3
1 National Taiwan University, Taipei, Taiwan
2 Sony AI, Tokyo, Japan
3 Sony Group Corporation, Tokyo, Japan
[email protected]
ABSTRACT
General-purpose audio representations have proven effective across diverse music information retrieval applications, yet their utility in intelligent music production remains limited by insufficient understanding of audio effects (Fx). Although previous approaches have emphasized audio effects analysis at the mixture level, this focus falls short for tasks demanding instrument-wise audio effects understanding, such as automatic mixing. In this work, we present Fx-Encoder++, a novel model designed to extract instrument-wise audio effects representations from music mixtures. Our approach leverages a contrastive learning framework and introduces an “extractor” mechanism that, when provided with instrument queries (audio or text), transforms mixture-level audio effects embeddings into instrument-wise audio effects embeddings. We evaluated our model across retrieval and audio effects parameter matching tasks, testing its performance across a diverse range of instruments. The results demonstrate that Fx-Encoder++ outperforms previous approaches at mixture level and show a novel ability to extract effects representation instrument-wise, addressing a critical capability gap in intelligent music production systems.
1. INTRODUCTION
Recent advances in deep learning have enabled significant progress in general-purpose audio representations [1–5], which have proven effective across diverse applications. However, they inadequately capture the nuanced characteristics of audio effects processing [6–8], since they prioritize semantic content recognition over subtle audio effects or timbral transformations. This shortcoming particularly affects specialized applications that require a precise understanding of audio effects configuration and their perceptual impact across different musical contexts.

*Work done during an internship at Sony AI.
© Y-T. Yeh, J. Koo, M. Martínez-Ramírez, W-H. Liao, Y-H. Yang, and Y. Mitsufuji. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y-T. Yeh, J. Koo, M. Martínez-Ramírez, W-H. Liao, Y-H. Yang, and Y. Mitsufuji, “Fx-Encoder++: Extracting Instrument-Wise Audio Effects Representations from Mixtures”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
A key domain affected by this representation gap is intelligent music production, the emerging field of mixing and mastering automation aiming to develop AI systems capable of professional-quality audio engineering [9]. Applications in this domain, including automatic mixing [10, 11], audio effects style transfer [6, 12, 13], and mixing style transfer [14, 15], all require specialized representations that model effects at two distinct levels: how they shape the overall sound of a complete mixture (“mixture level”) and how they transform individual instruments within that mixture (“instrument level”).
Previous approaches to audio effects representation can be categorized into two groups: those analyzing effects on entire musical mixes [12, 15] and those examining effects on isolated single instruments [16, 17]. Although both approaches operate at what we consider the “mixture level”, they fail to extract instrument-specific effect characteristics from complex mixtures. To our knowledge, FX-Encoder [15] is the only existing work that encodes audio effects information from individual instrument stems. However, it focuses only on modeling the aggregate result, rather than identifying how effects have been applied to each instrument. This limitation restricts applications such as automatic mixing, where a precise understanding of the instrument-wise processing is essential.
Extracting instrument-specific audio effects representations from mixtures presents two distinct challenges. First, a straightforward approach might involve applying source separation algorithms [18, 19] to isolate individual instruments before analyzing their effects. This method is limited by separation artifacts, such as missing high frequencies, transient smearing [20], and imperfect signal reconstruction. Those artifacts distort effect characteristics, making the separate-then-analyze pipeline unreliable. Second, to obtain accurate effect parameters, we require both the processed track (wet audio) and its unprocessed counterpart (dry audio) for reverse engineering [21, 22]. This necessitates not only perfect source separation to extract the processed track from the mixture but also access to the original unprocessed audio, which is rarely available.
To address these limitations, we propose Fx-Encoder++, a novel model designed to extract instrument-specific audio effects representations from complete music mixtures. Our model employs a contrastive learning framework based on SimCLR [23], which learns representations by maximizing agreement between different augmented views of the same data. Following FX-Encoder [15], we implement several crucial design elements to ensure effective learning of audio effects including: i) Fx-Normalization [24] to normalize inherent effects in source audio; ii) consistent instrumentation composition, to ensure each mixture in the entire batch is constructed with the same instrument combination, thereby isolating effect-related features; and iii) systematic audio effects manipulation procedures. The core of our contribution is an “extractor” mechanism that transforms mixture-level embeddings into instrument-specific embeddings when provided with instrument queries. This extractor leverages a pretrained CLAP encoder [3] to support both audio and text queries.
In our experiments, we evaluate Fx-Encoder++ against both general-purpose audio representations (VGGish [1], PANN [2], and CLAP [3]) and specialized audio effects encoders (FX-Encoder [15] and AFx-Rep [6]). Building upon the evaluation framework of AFx-Rep [6], we introduce a comprehensive benchmark for audio effects retrieval using real-world multitrack recordings from MUSDB [25] and MedleyDB [26]. We quantitatively assess performance across two dimensions: effects complexity (single vs. multiple cascaded effects) and instrument generalization. Results demonstrate that Fx-Encoder++ consistently outperforms existing methods in mixture-level effects extraction tasks. Furthermore, our approach enables instrument-specific effects extraction, addressing a gap in previous work where effects could only be analyzed at the mixture level.
2. RELATED WORKS
2.1 General-purpose Audio Representation
General-purpose audio representations have emerged to support various downstream tasks. VGGish [1] and PANN [2] employ CNNs trained on AudioSet [27] for audio classification and pattern recognition. More recently, CLAP [3] uses a transformer-based architecture with contrastive learning to align audio and text, enabling applications in text-to-music generation [28, 29] and audio separation [30, 31]. Neural audio compression models [4, 32, 33] represent another category, utilizing VAEs to reconstruct perceptual features while minimizing bitrate. Despite their success across various tasks, these representations show limited sensitivity to audio effects transformations [6–8]. This limitation stems from training objectives prioritizing semantic content recognition or cross-modal alignment rather than preserving the subtle timbral modifications introduced by audio effects, a capability essential to intelligent music production systems.
2.2 Audio Effects Representation
Audio effects representation research has evolved from implicit to explicit approaches. Early works incorporated effects awareness indirectly through applications like reverb parameter estimation [34], automatic mixing [35], differentiable effects [36], and neural style transfer [12, 14]. Dedicated representations emerged with FX-Encoder [15], which used contrastive learning to disentangle effects characteristics from content, enabling mixing style transfer. Similarly, Tone Embedding [16] and OpenAMP [17] focused specifically on guitar tones but were limited to isolated guitar recordings. Recently, AFx-Rep [6] was proposed as a classification-based model designed for inference-time effects optimization. Its training objective is to classify which single effect is applied between two given audio clips and to further predict which preset that effect belongs to. Despite these advances, a limitation persists: existing models operate either at mixture level or on isolated instruments, but cannot extract instrument-wise effects information from complete mixes. Table 1 compares audio effects representation approaches. “Extraction Level” indicates the source from which models extract effects representations: “Isolated Inst.” for single-instrument capability, “Mixture” for complete music mixes. Fx-Encoder++ uniquely extracts at both mixture and instrument-wise levels. The “Query” column shows whether the model supports conditional extraction based on instrument queries, a distinctive feature of our approach.

Model | Training Method | Audio Type | Extraction Level | Query
FX-Encoder [15] | Contrastive | Mixture | Mixture | -
Tone Emb. [16] | Contrastive | Guitar | Isolated Inst. | -
OpenAMP [17] | Contrastive | Guitar | Isolated Inst. | -
AFx-Rep [6] | Classification | Mixture | Mixture | -
Fx-Encoder++ (ours) | Contrastive | Mixture | Mixture & Inst. | ✓

Table 1. Comparison of audio effect representation models. Fx-Encoder++ uniquely extracts instrument-specific effects directly from mixtures, unlike previous approaches limited to whole-mix effects or isolated instruments.
3. METHOD
Our goal is to develop an encoder, denoted as $E(x, q)$, that encodes an audio effects embedding from music mixtures $x$. When conditioned with an instrument query $q$, the encoder extracts effects representations specific to that instrument within the mixture; without conditioning (i.e., $q = \emptyset$), the encoder produces representations that characterize the complete mixture. We formulate a contrastive objective for learning mixture-level representations following FX-Encoder [15], then extend it with a mechanism to extract instrument-specific effects information directly from mixtures, as shown in Figure 1.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 1. Our data preparation pipeline creates training pairs by splitting source tracks (n = 2 in this example) into segments, applying consistent effect chains, and combining them into mixtures with identical effect configurations but distinct musical content. The architecture consists of an encoder that processes mixtures to produce mixture-level embeddings, and an extractor mechanism that transforms these embeddings into instrument-specific representations using CLAP-derived queries. During inference, we can optionally provide a text query to extract the specific instruments.

3.1 Contrastive Objective for Audio Effects
Following FX-Encoder [15], we employed a SimCLR-based [23] contrastive objective to learn audio effects representations by maximizing the agreement between different audio contents processed with identical effects. The contrastive loss for a positive pair $(i, j)$ can be formulated as:

$$\ell^{\text{mixture}}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{I}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)} \tag{1}$$

where $\mathbb{I}_{[k \neq i]} \in \{0, 1\}$ is an indicator function evaluating to 1 iff $k \neq i$, $\tau$ denotes a temperature parameter, and embeddings $z_i = E(M_i)$ and $z_j = E(M_j)$ are encoder outputs from mixtures $M_i$ and $M_j$. These mixtures are generated as detailed below:
• Fx-Normalization [24]. Directly applying identical audio effects to different audio clips is insufficient for our contrastive objective due to inherent effects already present in each recording. When applying effects $f$ to clips $x_a$ and $x_b$ with underlying effects $f_a$ and $f_b$, we obtain distinct transformations $f \circ f_a$ and $f \circ f_b$ despite using the same effects $f$, contradicting our goal of clustering samples with the same effects in embedding space. To overcome this limitation, we employed the Fx-Normalization data preprocessing method [24]. This technique normalizes audio characteristics by neutralizing existing effects, ensuring subsequently applied effects make equal contributions across all samples.
• Consistent Instrument Composition. We ensure all data samples within each training batch share identical instrument composition, preventing the model from taking shortcuts by distinguishing between instrument combinations (e.g., “drums+bass” versus “guitar+vocals”) rather than effects characteristics. By controlling instrumentation across both positive and negative pairs, we force the model to focus on subtle effects features instead of more easily distinguishable instrument timbres. This is crucial for SimCLR, where numerous negative samples could otherwise lead to exploitation of simple patterns based on instrument content [37].
• Audio Effects Manipulation. We create mixture pairs with identical effects processing but different musical content through a systematic approach: (1) generate $k$ distinct audio effects chains by randomly sampling configurations (order, number, types, and parameters) from our effects pool (equalizer, delay, distortion, stereo imager, compression, limiter, reverb, and gain); (2) randomly select $k$ instruments from the normalized pool; (3) split each instrument track into two segments with different musical content; (4) apply the same effects chain $E_j$ to both segments of instrument $j$; (5) create two mixtures, $M_1$ and $M_2$, by combining all first and second segments respectively, with random loudness normalization (−18 to −14 dB LUFS) for individual tracks and a consistent −18 dB LUFS for final mixtures. This creates positive pairs with different content but identical effects processing for contrastive learning. We also employ Fx probability scheduling [15] to prevent the model from focusing only on easily distinguishable effects.
• Hand-Crafted Hard Negative Samples. Hard negative samples are crucial for effective contrastive learning [38]. We explicitly construct hard negatives by applying different effects chains to identical source material, creating samples with the same musical content but different effects styles. We ensure variation through both structural differences (altering effect processor types, counts, and ordering) and independent parameter sampling. This approach forces the model to focus specifically on effects-based rather than content-based features.
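As a rough illustration of the contrastive loss in Equation (1), the following sketch evaluates it for one positive pair in a toy batch. It assumes that sim is cosine similarity (as in SimCLR) and uses plain Python lists in place of encoder outputs; the function names and the toy embeddings are hypothetical and not taken from the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(z, i, j, tau=0.1):
    """Loss of Eq. (1) for the positive pair (i, j) over a list z of 2N embeddings."""
    numerator = math.exp(cosine(z[i], z[j]) / tau)
    # The indicator excludes k == i, so the anchor is never compared with itself.
    denominator = sum(
        math.exp(cosine(z[i], z[k]) / tau) for k in range(len(z)) if k != i
    )
    return -math.log(numerator / denominator)

# Toy batch (assumed values): z[0] and z[1] stand for mixtures sharing one
# effects chain, z[2] and z[3] for mixtures sharing another.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_pos = contrastive_loss(z, 0, 1)  # pair with matching effects
loss_neg = contrastive_loss(z, 0, 2)  # pair with mismatched effects
```

The loss is small when the two embeddings of a positive pair are close relative to the other embeddings in the batch, which is why hard negatives with identical content but different effects make the objective more demanding.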
3.2 Learnable Extractor
A critical design requirement is producing both mixture-level and instrument-specific effects representations with high fidelity. We incorporate instrument queries from a pretrained CLAP encoder [3], allowing audio queries during training and text queries during inference [31] for intuitive control. While we could condition the encoder using methods like FiLM [39], this creates a challenge: it is unclear which query vector is appropriate for extracting the “global” mixture effects. A zero vector lacks semantic meaning, while using the mixture itself as a query would force the model to simultaneously analyze and extract from the same information stream. Our architecture therefore addresses dual requirements: extracting instrument-specific components while maintaining mixture-level representation.
Inspired by [40], our extractor mechanism resolves this by maintaining separate paths for mixture-level effects and instrument-specific extraction. The base encoder learns effects representations from mixtures, while the extractor transforms these based on instrument queries (audio or text). This two-stage design creates a principled interface between effects representation and instrument content representation, enabling targeted extraction of effects characteristics for specific instruments.
To facilitate the transformation of mixture effects embeddings into instrument-specific effects embeddings, we reformulate the standard contrastive loss. Rather than operating solely on mixture pairs, we develop an instrument-aware contrastive objective based on triplets $(Q, M_i, M_j)$, where $Q$ represents the instrument query, and $M_i$ and $M_j$ denote music mixtures with identical effects configurations. To make this approach compatible with mixture-level training, we construct an extractor mechanism that extracts instrument-specific effect embeddings from mixture-level effects embeddings using instrument queries. Formally, we define the extractor as:

$$z_i^m = \mathrm{extractor}(Q_i^m, E(M_i)) \tag{2}$$

where $z_i^m$ represents the instrument-specific effects embedding, $Q_i^m \in \mathbb{R}^D$ is the query vector for the $m$-th instrument in mixture $M_i$ (obtained from a pretrained CLAP encoder), and $E(M_i)$ is the mixture-level audio effects embedding. The extractor is implemented as a multi-layer perceptron that learns to selectively attend to effects-related features associated with the queried instrument.
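A minimal sketch of how an extractor of the form in Equation (2) could be wired is given below. The three-layer depth, hidden size 128, and LeakyReLU slope 0.1 follow the values reported in Section 3.3. Everything else is an assumption made for illustration: the text does not say how the query and the mixture embedding are combined, so they are simply concatenated here, the weights are random rather than trained, and the dimensions and function names are hypothetical.

```python
import random

def leaky_relu(x, slope=0.1):
    """LeakyReLU with the slope reported in Section 3.3."""
    return x if x > 0 else slope * x

def linear(x, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero bias, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_extractor(query_dim, embed_dim, hidden_dim=128, seed=0):
    """Build three dense layers: input -> hidden -> hidden -> embedding."""
    rng = random.Random(seed)
    sizes = [query_dim + embed_dim, hidden_dim, hidden_dim, embed_dim]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def extractor(layers, query, mixture_embedding):
    # Assumption: the query and the mixture embedding are concatenated.
    h = list(query) + list(mixture_embedding)
    for index, (weights, bias) in enumerate(layers):
        h = linear(h, weights, bias)
        if index < len(layers) - 1:  # no activation after the output layer
            h = [leaky_relu(v) for v in h]
    return h

# Toy dimensions (assumed): a 4-dim query and a 6-dim mixture embedding.
layers = make_extractor(query_dim=4, embed_dim=6)
z_m = extractor(layers, [0.1, 0.2, 0.3, 0.4], [0.5] * 6)
```

The output has the same dimensionality as the mixture-level embedding, so instrument-specific and mixture-level embeddings can be compared with the same similarity function.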
The instrument-aware contrastive loss is then formulated as:

$$\ell^{\text{inst}}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i^m, z_j^m)/\tau)}{\sum_{k=1}^{2N} \mathbb{I}_{[k \neq i]} \exp(\mathrm{sim}(z_i^m, z_k^m)/\tau)} \tag{3}$$

In this formulation, positive pairs consist of embeddings from the same instrument type across different mixtures processed with identical effects, while negative pairs include the same instrument with different effects processing, extending Equation (1) to the instrument level. Our final training objective combines both mixture-level and instrument-level contrastive losses:

$$\mathcal{L} = \lambda_{\text{mix}} \cdot \ell^{\text{mixture}}_{i,j} + \lambda_{\text{inst}} \cdot \ell^{\text{inst}}_{i,j} \tag{4}$$
where $\lambda_{\text{mix}} + \lambda_{\text{inst}} = 1.0$. This dual-objective approach enables our model to simultaneously learn effects representations at both mixture and instrument-specific levels. We implement a curriculum learning strategy by starting with $\lambda_{\text{mix}} = 1.0$ and linearly introducing the instrument-wise objective over training steps. This is necessary because instrument-wise gradients lack meaningful guidance until the model has established reasonably accurate mixture-level representations. Our final loss weighting uses $\lambda_{\text{mix}} = 0.8$ and $\lambda_{\text{inst}} = 0.2$, deliberately prioritizing mixture-level learning. This weighting reflects a critical insight: robust mixture-level effects representations form the foundation for successful instrument-specific extraction, as the extractor directly operates on these mixture-level embeddings to isolate individual components.
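The curriculum on the loss weights can be sketched as a simple linear ramp. The start point (mixture weight 1.0), the end point (0.8 and 0.2), and the constraint that the two weights sum to 1 come from the text; the ramp length `ramp_steps` and the function names are assumptions, since the number of steps over which the instrument-wise objective is introduced is not stated.

```python
def loss_weights(step, ramp_steps, final_lambda_inst=0.2):
    """Return (lambda_mix, lambda_inst) for a given training step."""
    # Fraction of the ramp completed, clamped to [0, 1].
    progress = min(max(step / ramp_steps, 0.0), 1.0)
    lambda_inst = final_lambda_inst * progress
    # The two weights always sum to 1, as required by Eq. (4).
    return 1.0 - lambda_inst, lambda_inst

def total_loss(loss_mixture, loss_inst, step, ramp_steps):
    """Weighted sum of the mixture-level and instrument-level losses."""
    lambda_mix, lambda_inst = loss_weights(step, ramp_steps)
    return lambda_mix * loss_mixture + lambda_inst * loss_inst

# ramp_steps=10_000 is a hypothetical value chosen for the example.
start = loss_weights(0, ramp_steps=10_000)
end = loss_weights(20_000, ramp_steps=10_000)
```

At step 0 only the mixture-level loss contributes, and after the ramp the weights stay at their final values.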
3.3 Model Architecture and Training Details
Audio Processing Pipeline. For effects normalization, we follow [24] but exclude reverb normalization. We implement audio processors using dasp-pytorch [41] and torchcomp [42], incorporating seven effects types (equalizer, distortion, multiband-compressor, limiter, gain, stereo imager, delay). Reverberation uses convolution-based processing with in-house impulse responses, and loudness normalization uses pyloudnorm [43].
Model Components. The encoder $E$ is implemented using the PANN architecture [2], following established practices in audio effects processing [6]. Input audio is preprocessed by computing log-mel spectrograms with a window size of 2048 samples and hop size of 512 samples. We clip magnitude values between −80 and 40 dB before scaling the spectrograms to the range [−1, 1]. Each input segment is 10 seconds long to adequately capture audio effects characteristics. The extractor mechanism employs a 3-layer MLP with hidden dimension 128 and LeakyReLU activation (slope 0.1). For instrument queries, we leverage the CLAP model [3] to produce embeddings, applying a high dropout rate (0.75 to 0.95) during training to improve generalization and bridge the modality gap between the text and audio encoders in CLAP, following the approach in [31].
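The clipping and scaling step of the spectrogram preprocessing can be sketched as follows. The clipping bounds (−80 and 40 dB) and the target range [−1, 1] come from the text; a linear mapping between them is an assumption, and the function names and example values are hypothetical. The log-mel computation itself is omitted.

```python
def scale_db(value_db, low=-80.0, high=40.0):
    """Clip a dB value to [low, high] and map it linearly to [-1, 1]."""
    clipped = min(max(value_db, low), high)
    return 2.0 * (clipped - low) / (high - low) - 1.0

def scale_spectrogram(spec_db):
    """Apply scale_db to every bin of a (frames x mel bins) nested list."""
    return [[scale_db(v) for v in frame] for frame in spec_db]

# Toy 2-frame, 3-bin log-mel spectrogram in dB (assumed values).
scaled = scale_spectrogram([[-120.0, -80.0, -20.0], [0.0, 40.0, 55.0]])
```

Values below −80 dB map to −1 and values above 40 dB map to 1, so the encoder always sees inputs in a fixed range.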
Training Procedure. We train our model using the Adam optimizer [44] with β1 = 0.99 and β2 = 0.9, employing a warm-up schedule [45] followed by cosine learning rate decay from a base rate of 1e−4. We set the contrastive temperature to 0.1 and train on 2 NVIDIA H100 GPUs with a batch size of 192. For our experiments, we employ the MoisesDB dataset [46], which contains 240 tracks with 11 stems across 12 genres, totaling over 14 hours of multitrack data. It should be noted that tracks from the MoisesDB dataset [46] are “wet,” meaning they have already been processed with audio effects. To enhance the model’s robustness across varying mixture complexities, we randomly select between one and four instruments for constructing each mixture during training.¹
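A learning rate schedule of the kind described (warm-up followed by cosine decay from a base rate of 1e−4) could look like the sketch below. Only the base rate and the cosine shape come from the text; the linear form of the warm-up, the warm-up length, the total number of steps, and decay to zero are assumptions chosen for the example.

```python
import math

def learning_rate(step, total_steps, warmup_steps, base_lr=1e-4):
    """Linear warm-up to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        # Assumed linear warm-up over the first warmup_steps steps.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))

# total_steps=1000 and warmup_steps=100 are hypothetical values.
schedule = [learning_rate(s, total_steps=1000, warmup_steps=100) for s in range(1001)]
```

The rate rises during warm-up, peaks at the base rate when the decay starts, and falls smoothly to zero at the final step.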
4. EVALUATION METHOD
4.1 Audio Effects Retrieval
To evaluate our model, we conduct audio effects retrieval experiments using a controllable effects pipeline with MUSDB [25] and MedleyDB [26]. Our evaluation builds upon AFx-Rep [6] and makes further improvements in three dimensions, including: (1) expanding small retrieval pools (previously limited to only 20 samples), (2) implementing effects normalization (addressing the issue of inherent effects in audio), and (3) extending performance metrics (beyond the previously reported accuracy).

¹ https://github.com/SonyResearch/Fx-Encoder_PlusPlus

MUSDB18 [25]. Each instrument cell lists R@1 / R@5 / R@10 / $L_d$; the Mixture cell lists R@1 / R@5 / R@10.
Type | Model | Drums | Bass | Vocals | Other | Mixture
GP | CLAP | 1.5 / 3.9 / 5.9 / 1.38 | 1.4 / 3.7 / 6.0 / 1.39 | 0.7 / 3.0 / 4.7 / 1.67 | 0.3 / 1.8 / 3.4 / 1.40 | 1.6 / 5.2 / 8.7
GP | PANN | 0.8 / 2.2 / 3.7 / 1.75 | 0.8 / 2.9 / 4.6 / 1.59 | 0.3 / 1.8 / 3.1 / 1.75 | 0.3 / 0.9 / 2.3 / 1.54 | 1.2 / 3.8 / 6.5
GP | VGGish | 2.0 / 5.7 / 8.5 / 1.13 | 1.8 / 4.7 / 6.7 / 1.28 | 1.1 / 3.7 / 5.8 / 1.65 | 0.6 / 2.6 / 4.3 / 1.39 | 1.4 / 5.1 / 7.8
Fx | FX-Encoder | 12.1 / 25.9 / 35.1 / 1.33 | 7.3 / 17.8 / 24.9 / 1.39 | 6.1 / 16.0 / 22.8 / 1.92 | 5.8 / 15.4 / 22.5 / 1.37 | 12.3 / 26.6 / 34.5
Fx | AFx-Rep | 20.8 / 34.8 / 40.5 / 1.15 | 13.1 / 24.9 / 31.8 / 1.18 | 11.5 / 23.1 / 29.3 / 1.42 | 10.9 / 21.9 / 27.1 / 1.40 | 16.7 / 30.2 / 36.9
Fx | Fx-Encoder++ | 26.1 / 40.7 / 46.5 / 1.16 | 22.1 / 35.9 / 42.0 / 1.13 | 14.2 / 25.4 / 31.9 / 1.58 | 17.4 / 30.2 / 37.1 / 1.38 | 19.4 / 33.5 / 40.5

MedleyDB [26]. Each instrument cell lists R@1 / R@5 / R@10 / $L_d$; the Mixture cell lists R@1 / R@5 / R@10.
Type | Model | Mandolin | Alto Saxophone | Horn | Trumpet | Mixture
GP | CLAP | 1.4 / 3.5 / 5.8 / 1.24 | 1.5 / 4.2 / 6.9 / 1.43 | 1.9 / 4.5 / 6.1 / 1.36 | 1.2 / 3.1 / 4.5 / 1.42 | 0.8 / 3.4 / 5.4
GP | PANN | 0.6 / 2.5 / 4.9 / 1.24 | 0.8 / 2.4 / 4.3 / 1.53 | 1.4 / 3.4 / 4.9 / 1.33 | 0.9 / 2.2 / 3.3 / 1.42 | 0.7 / 2.4 / 4.9
GP | VGGish | 1.7 / 4.4 / 7.5 / 1.26 | 2.0 / 5.4 / 8.1 / 1.44 | 2.8 / 5.7 / 7.5 / 1.30 | 1.5 / 3.6 / 5.1 / 1.33 | 0.9 / 2.6 / 3.8
Fx | FX-Encoder | 2.3 / 6.3 / 10.1 / 1.27 | 2.2 / 6.5 / 10.0 / 1.38 | 3.4 / 8.5 / 11.9 / 1.33 | 1.7 / 4.8 / 6.8 / 1.36 | 0.8 / 3.0 / 5.0
Fx | AFx-Rep | 15.8 / 25.2 / 30.8 / 1.15 | 12.7 / 21.3 / 26.1 / 1.37 | 12.2 / 19.9 / 24.4 / 1.17 | 8.0 / 14.3 / 17.7 / 1.40 | 4.2 / 10.1 / 13.6
Fx | Fx-Encoder++ | 17.5 / 27.8 / 34.1 / 1.20 | 18.6 / 29.4 / 35.4 / 1.33 | 13.5 / 21.6 / 25.9 / 1.24 | 9.1 / 16.1 / 19.8 / 1.30 | 5.6 / 12.4 / 17.3

Table 2. Audio effects retrieval performance (R@K values in percentages, higher is better) on isolated instruments and mixtures. The $L_d$ column indicates performance in audio effects parameter matching tasks (lower is better). GP: general purpose audio representations. Fx: audio effects specific representations.

MUSDB18 [25]. Paired cells in the USS(m) / MSS(m) columns list USS(m) / MSS(m); paired cells in the extractor columns list $E(x, q_{\text{audio}})$ / $E(x, q_{\text{text}})$.
Type | Model | Oracle R@1 | Oracle R@5 | Oracle R@10 | Oracle mAP@10 | USS/MSS R@1 | USS/MSS R@5 | USS/MSS R@10 | USS/MSS mAP@10 | Extractor R@1 | Extractor R@5 | Extractor R@10 | Extractor mAP@10
GP | CLAP | 1.0 | 3.1 | 5.0 | 2.0 | 0.3 / 0.3 | 1.2 / 1.5 | 2.4 / 2.8 | 0.8 / 0.8 | - | - | - | -
GP | PANN | 0.6 | 2.0 | 3.4 | 1.2 | 0.3 / 0.3 | 1.1 / 1.4 | 2.3 / 2.4 | 0.7 / 0.8 | - | - | - | -
GP | VGGish | 1.4 | 4.2 | 6.3 | 2.6 | 0.3 / 0.5 | 1.3 / 1.7 | 2.7 / 3.0 | 0.7 / 1.1 | - | - | - | -
Fx | FX-Encoder | 7.8 | 18.8 | 26.3 | 12.6 | 0.6 / 2.0 | 2.4 / 7.0 | 4.1 / 11.5 | 1.4 / 4.3 | - | - | - | -
Fx | AFx-Rep | 14.0 | 26.2 | 32.2 | 19.3 | 2.1 / 4.6 | 6.0 / 10.8 | 8.8 / 15.0 | 3.8 / 7.3 | - | - | - | -
Fx | Fx-Encoder++ | 19.9 | 33.0 | 39.4 | 25.6 | 2.1 / 7.1 | 6.4 / 15.4 | 9.9 / 20.5 | 4.0 / 10.8 | 3.0 / 3.0 | 8.1 / 8.1 | 11.7 / 12.2 | 5.2 / 5.3

MedleyDB [26]. Paired cells in the extractor columns list $E(x, q_{\text{audio}})$ / $E(x, q_{\text{text}})$.
Type | Model | Oracle R@1 | Oracle R@5 | Oracle R@10 | Oracle mAP@10 | USS(m) R@1 | USS(m) R@5 | USS(m) R@10 | USS(m) mAP@10 | Extractor R@1 | Extractor R@5 | Extractor R@10 | Extractor mAP@10
GP | CLAP | 0.9 | 2.6 | 4.4 | 1.7 | 0.3 | 1.5 | 2.8 | 0.9 | - | - | - | -
GP | PANN | 2.0 | 4.8 | 7.0 | 3.2 | 0.3 | 1.3 | 2.5 | 0.8 | - | - | - | -
GP | VGGish | 1.5 | 3.8 | 5.8 | 2.6 | 0.4 | 1.4 | 2.5 | 0.8 | - | - | - | -
Fx | FX-Encoder | 2.4 | 6.5 | 9.7 | 4.2 | 0.6 | 2.2 | 3.9 | 1.4 | - | - | - | -
Fx | AFx-Rep | 12.1 | 20.2 | 24.8 | 15.6 | 1.8 | 4.7 | 7.2 | 3.1 | - | - | - | -
Fx | Fx-Encoder++ | 14.7 | 23.7 | 28.8 | 18.7 | 1.7 | 4.6 | 7.0 | 3.0 | 1.6 / 1.9 | 5.0 / 5.7 | 7.6 / 8.7 | 3.1 / 3.6

Table 3. Audio effects retrieval performance (R@K values and mAP in percentages) on instrument-wise extraction. Oracle denotes the Target Instrument (Oracle) protocol.
Evaluation Framework. Our retrieval task evaluates how accurately models identify correct effects configurations from 500 candidates when presented with processed audio. Given an audio query, our goal is to retrieve the target audio with the same effects configurations but different musical contents. The task uses cosine similarity in the embedding space between queries and candidate pools, following the evaluation methodology established in CLAP [3]. We report Recall@K (R@K) in percentages, applying effects to evaluation datasets using the procedure described in Section 3.1. We construct separate evaluation sets from MUSDB [25] (vocals, drums, bass, other) and from MedleyDB [26] (mandolin, saxophone, horn, trumpet) to test generalization across instrument diversity. We further assess performance across effects complexity by varying the number of effects applied to each track from 1 to 8.
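The Recall@K computation can be sketched as below, under the assumption that each query has exactly one target in the candidate pool and that candidates are ranked by cosine similarity, as described above. The function names and the toy embeddings are hypothetical; the actual evaluation uses encoder embeddings and a pool of 500 candidates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(queries, candidates, targets, k):
    """Percentage of queries whose target is among the k most similar candidates."""
    hits = 0
    for query, target in zip(queries, targets):
        # Rank candidate indices from most to least similar to the query.
        ranked = sorted(
            range(len(candidates)),
            key=lambda c: cosine(query, candidates[c]),
            reverse=True,
        )
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(queries)

# Toy pool (assumed values): three candidates, two queries.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.6, 0.8]]
targets = [0, 1]  # index of the matching candidate for each query
r1 = recall_at_k(queries, candidates, targets, k=1)
r2 = recall_at_k(queries, candidates, targets, k=2)
```

In this toy pool the second query ranks a distractor first, so it only counts as a hit once K is increased to 2.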
Baselines. We compare against general-purpose representations (VGGish [1], PANN [2], CLAP [3]) and specialized effects representations (FX-Encoder [15], AFx-Rep [6]), excluding guitar-specific models [16, 17] unsuitable for multi-instrument evaluation.
Retrieval Scenarios. We evaluate using two complementary scenarios: (1) Mixture-level Retrieval, testing effects identification in complete mixtures or isolated recordings [6, 15]; and (2) Instrument-wise Retrieval, evaluating our novel extraction capability with both ground-truth isolated tracks and directly from mixtures. For baseline models lacking instrument-specific capabilities, we implement a two-stage approach using universal sound separation [18] for arbitrary instrument types and Hybrid Demucs [47] for MUSDB18’s four standard tracks (vocals, bass, drums, other). In contrast, Fx-Encoder++ directly extracts instrument-specific embeddings without requiring separation. We support both audio and text queries through our CLAP encoder, using the prompt template: “this is the sound of {target instrument}” [31] for text queries.

Figure 2. Performance comparison across datasets, mixture and instrument-level, and effects complexity.
4.2 Matching of Audio Effects Parameters
We evaluate downstream applications through effects parameters matching following [6], extracting effects representations from reference audio to optimize differentiable effect chain parameters applied to dry input while minimizing a multi-resolution STFT Loss [48]. Our evaluation uses audio triplets (clean, reference, target) where input/target share identical content with different applied audio effects, while reference/target have different content but identical effects. We synthesize 100 samples per instrument from MUSDB [25] and MedleyDB [26] datasets, applying seven sequential effects (EQ, multiband compressor, stereo imager, gain, distortion, delay, limiter). We compare against the same baselines used in our retrieval experiments.
5. RESULTS
5.1 Audio Effects Retrieval
Mixture-level Retrieval. Table 2 shows Fx-Encoder++ substantially outperforms both general-purpose models and other effect-specific models on the MUSDB dataset. Our model particularly excels with “drums” and “bass”. However, performance on “vocals” is comparatively lower than other instruments (drops approximately 10%), likely due to the high timbral variation of vocals in each song (e.g. male vs. female singers). This pattern is also observed in the “other” and “mixture” categories, which similarly exhibit high timbral variety. All models show decreased performance on the MedleyDB dataset, suggesting that performance is limited by the frequency of instrument exposure during training. Despite this challenge, Fx-Encoder++ maintains its performance advantage across datasets.
Instrument-wise Retrieval. Table 3 presents retrieval results across three protocols: (1) Target Instrument (Oracle): using ground-truth isolated tracks; (2) USS(m) / MSS(m): applying universal [18] or music source separation [47]; and (3) $E(x, q_{\text{audio}})$ / $E(x, q_{\text{text}})$: our extractor applied directly to mixtures. For the MUSDB dataset, our model substantially outperforms general-purpose models on target instruments. USS-based methods suffer significant performance drops (9.9% R@10), while MSS performs better (20.5% R@10) but is limited to only four predefined stems. Notably, even when using high-quality source separation, we observe a clear gap between the “Target Instrument” and “MSS(m)” protocols (approximately 19 percentage points lower in R@10), indicating that separation artifacts may distort the effects characteristics. Our extractor outperforms USS(m) without requiring an external source separation model. Text queries perform slightly better than audio queries, probably because they provide a more generalized instrument space than $q_{\text{audio}}$, as the query audio has different content and Fx than the target. For the MedleyDB dataset, performance degrades across all protocols, highlighting challenges in generalizing to instruments observed less frequently during training.
Number of Effects. Figure 2 demonstrates our model’s superior performance across varying effect counts. Performance improves with more effects for all models, indicating complex effect chains create more distinctive signatures than a single effect, which may be confused with natural instrument timbres. For the MUSDB dataset, the performance gap widens with increasing effects complexity. While similar trends appear for the MedleyDB dataset, reduced performance indicates challenges in generalizing to novel timbral characteristics.
5.2 Matching of Audio Effects Parameters
Table 2 shows varied matching performance ($L_d$) across instruments. Unlike retrieval tasks, no model consistently outperforms across all scenarios. For the MUSDB dataset, different models excel in different categories (VGGish on drums, Fx-Encoder++ on bass, AFx-Rep on vocals), with general-purpose VGGish surprisingly outperforming specialized models in several cases. For the MedleyDB dataset, AFx-Rep and Fx-Encoder++ achieve the lowest reconstruction losses. These findings suggest that retrieval and effect parameter matching tasks benefit from different representational characteristics.
6. CONCLUSION
We introduced Fx-Encoder++, the first model that extracts instrument-wise audio effects information from music mixtures. Our approach outperforms existing methods in audio effects retrieval at the mixture level while enabling instrument-wise effects embedding extraction. Limitations include reduced effect parameter matching performance and difficulties with single effect understanding. Future work should address these challenges by improving retrieval-transformation bridging and enhancing single effect understanding for applications like VA modeling [49–51].
7. ACKNOWLEDGEMENTS
Yen-Tung thanks National Science and Technology Council for supporting his PhD study.
8. REFERENCES
[1] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “CNN architectures for large-scale audio classification,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017.
[2] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, 2020.
[3] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023.
[4] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Trans. Machine Learning Research, 2023.
[5] S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, and F. Wei, “Beats: Audio pre-training with acoustic tokenizers,” in Proc. ICML, 2022.
[6] C. J. Steinmetz, S. Singh, M. Comunità, I. Ibnyahya, S. Yuan, E. Benetos, and J. D. Reiss, “ST-ITO: Controlling audio effects for style transfer with inference-time optimization,” in Proc. International Society for Music Information Retrieval (ISMIR), 2024.
[7] S. H. Hawley and C. J. Steinmetz, “Leveraging neural representations for audio manipulation,” in 154th Convention of the Audio Engineering Society, 2023.
[8] A. Chu, P. O’Reilly, J. Barnett, and B. Pardo, “Text2fx: Harnessing clap embeddings for text-guided audio effects,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025.
[9] B. De Man, R. Stables, and J. D. Reiss, Intelligent music production. Focal Press, 2019.
[10] C. J. Steinmetz, J. Pons, S. Pascual, and J. Serrà, “Automatic multitrack mixing with a differentiable mixing console of neural audio effects,” in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2021.
[11] M. Martinez Ramirez, D. Stoller, and D. Moffat, “A
deep learning approach to intelligent drum mixing with
the wave-u-net,” Journal of the Audio Engineering
Society, vol. 69, 2021.
[12] C. J. Steinmetz, N. J. Bryan, and J. D. Reiss, “Style
transfer of audio effects with differentiable signal
processing,” J. Audio Eng. Soc, vol. 70, 2022.
[13] J. Koo, S. Paik, and K. Lee, “End-to-end music
remastering system using self-supervised and adversarial
training,” in Proc. International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 2022.
[14] S. S. Vanka, C. Steinmetz, J.-B. Rolland, J. Reiss, and
G. Fazekas, “Diff-MST: Differentiable mixing style
transfer,” in Proc. International Society for Music
Information Retrieval (ISMIR), 2024.
[15] J. Koo, M. A. Martínez-Ramírez, W.-H. Liao, S. Uhlich,
K. Lee, and Y. Mitsufuji, “Music mixing style
transfer: A contrastive learning approach to disentangle
audio effects,” in Proc. International Conference on
Acoustics, Speech, and Signal Processing (ICASSP),
2023.
[16] Y.-H. Chen, Y.-T. Yeh, Y.-C. Cheng, J.-T. Wu, Y.-H.
Ho, J.-S. R. Jang, and Y.-H. Yang, “Towards zero-shot
amplifier modeling: One-to-many amplifier modeling
via tone embedding control,” in Proc. International
Society for Music Information Retrieval (ISMIR), 2024.
[17] A. Wright, A. Carson, and L. Juvela, “Open-Amp:
Synthetic data framework for audio effect foundation
models,” in Proc. International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 2025.
[18] Q. Kong, K. Chen, H. Liu, X. Du, T. Berg-Kirkpatrick,
S. Dubnov, and M. D. Plumbley, “Universal source
separation with weakly labelled data,” arXiv preprint
arXiv:2305.07447, 2023.
[19] A. Défossez, N. Usunier, L. Bottou, and F. Bach,
“Music source separation in the waveform domain,” arXiv
preprint arXiv:1911.13254, 2019.
[20] N. Schaffer, B. Cogan, E. Manilow, M. Morrison,
P. Seetharaman, and B. Pardo, “Music separation
enhancement with generative modeling,” in Proc.
International Society for Music Information Retrieval
(ISMIR), 2022.
[21] J. T. Colonel and J. Reiss, “Reverse engineering of a
recording mix with differentiable digital signal
processing,” The Journal of the Acoustical Society of
America, vol. 150, no. 1, pp. 608–619, 2021.
[22] S. Lee, M. A. Martínez-Ramírez, W.-H. Liao, S. Uhlich,
G. Fabbro, K. Lee, and Y. Mitsufuji, “Searching
for music mixing graphs: A pruning approach,” in
27th International Conference on Digital Audio Effects
(DAFx), 2024.
[23] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton,
“A simple framework for contrastive learning of visual
representations,” in Proc. ICML, 2020.
[24] M. A. Martínez-Ramírez, W.-H. Liao, G. Fabbro,
S. Uhlich, C. Nagashima, and Y. Mitsufuji, “Automatic
music mixing with deep learning and out-of-domain
data,” in Proc. International Society for Music
Information Retrieval (ISMIR), 2022.
[25] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and
R. Bittner, “MUSDB18-HQ - an uncompressed version
of MUSDB18,” Aug. 2019. [Online]. Available:
https://doi.org/10.5281/zenodo.3338373
[26] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch,
C. Cannam, and J. P. Bello, “MedleyDB: A multitrack
dataset for annotation-intensive mir research,” in Proc.
International Society for Music Information Retrieval
(ISMIR), 2014.
[27] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen,
W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter,
“Audio Set: An ontology and human-labeled dataset
for audio events,” in Proc. International Conference on
Acoustics, Speech, and Signal Processing (ICASSP),
2017.
[28] H. Liu, Y. Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian,
Y. Wang, W. Wang, Y. Wang, and M. D. Plumbley,
“AudioLDM 2: Learning holistic audio generation
with self-supervised pretraining,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing,
2024.
[29] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant,
G. Synnaeve, Y. Adi, and A. Défossez, “Simple and
controllable music generation,” in Proc. NeurIPS, 2023.
[30] X. Liu, Q. Kong, Y. Zhao, H. Liu, Y. Yuan, Y. Liu,
R. Xia, Y. Wang, M. D. Plumbley, and W. Wang,
“Separate anything you describe,” IEEE/ACM Trans. Audio,
Speech, Lang. Process., 2024.
[31] K. Saijo, J. Ebbers, F. G. Germain, S. Khurana,
G. Wichern, and J. L. Roux, “Leveraging audio-only
data for text-queried target sound extraction,” in Proc.
International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 2025.
[32] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and
K. Kumar, “High-fidelity audio compression with
improved rvqgan,” in Proc. NeurIPS, 2023.
[33] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent:
Consistency autoencoders for latent audio compression,”
in Proc. International Society for Music Information
Retrieval (ISMIR), 2024.
[34] J. Koo, S. Paik, and K. Lee, “Reverb conversion of
mixed vocal tracks using an end-to-end convolutional
deep neural network,” in Proc. International Conference
on Acoustics, Speech, and Signal Processing
(ICASSP), 2021.
[35] B.-Y. Chen, W.-H. Hsu, W.-H. Liao, M. A. M.
Ramírez, Y. Mitsufuji, and Y.-H. Yang, “Automatic dj
transitions with differentiable audio effects and generative
adversarial networks,” in Proc. International Conference
on Acoustics, Speech, and Signal Processing
(ICASSP), 2022.
[36] S. Lee, H.-S. Choi, and K. Lee, “Differentiable
artificial reverberation,” IEEE/ACM Trans. Audio, Speech,
Lang. Process., vol. 30, 2022.
[37] J. Robinson, L. Sun, K. Yu, K. Batmanghelich,
S. Jegelka, and S. Sra, “Can contrastive learning avoid
shortcut solutions?” in Proc. NeurIPS, 2021.
[38] J. Robinson, C.-Y. Chuang, S. Sra, and S. Jegelka,
“Contrastive learning with hard negative samples,” in
Proc. International Conference on Learning
Representations (ICLR), 2021.
[39] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and
A. Courville, “Film: Visual reasoning with a general
conditioning layer,” in Proc. AAAI, 2018.
[40] H.-Y. Chen, Z. Lai, H. Zhang, X. Wang, M. Eichner,
K. You, M. Cao, B. Zhang, Y. Yang, and Z. Gan,
“Contrastive localized language-image pre-training,” arXiv
preprint arXiv:2410.02746, 2024.
[41] C. J. Steinmetz, “dasp-pytorch,” [Online]
https://github.com/csteinmetz1/dasp-pytorch/.
[42] C.-Y. Yu, C. Mitcheltree, A. Carson, S. Bilbao, J. D.
Reiss, and G. Fazekas, “Differentiable all-pole filters
for time-varying audio systems,” in International
Conference on Digital Audio Effects (DAFx), 2024.
[43] C. J. Steinmetz and J. Reiss, “pyloudnorm: A simple
yet flexible loudness meter in python,” in Audio
Engineering Society Convention 150, 2021.
[44] D. P. Kingma and J. Ba, “Adam: A method for
stochastic optimization,” in Proc. International Conference
on Learning Representations (ICLR), 2014.
[45] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis,
L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and
K. He, “Accurate, Large Minibatch SGD: Training
imagenet in 1 hour,” arXiv preprint arXiv:1706.02677,
2017.
[46] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl,
“MoisesDB: A dataset for source separation beyond
4-stems,” in Proc. International Society for Music
Information Retrieval (ISMIR), 2023, pp. 619–626.
[47] A. Défossez, “Hybrid spectrogram and waveform
source separation,” arXiv preprint arXiv:2111.03600,
2021.
[48] C. J. Steinmetz and J. D. Reiss, “auraloss: Audio
focused loss functions in PyTorch,” in Digital Music
Research Network One-day Workshop (DMRN+15),
2020.
[49] M. A. Martínez Ramírez, E. Benetos, and J. D. Reiss,
“Deep learning for black-box modeling of audio
effects,” Applied Sciences, vol. 10, 2020.
[50] Y.-T. Yeh, W.-Y. Hsiao, and Y.-H. Yang, “Hyper
recurrent neural network: Condition mechanisms for
black-box audio effect modeling,” in International
Conference on Digital Audio Effects (DAFx), 2024.
[51] C. J. Steinmetz and J. D. Reiss, “Efficient neural
networks for real-time modeling of analog dynamic range
compression,” 152nd Convention of the Audio
Engineering Society, 2021.