
Assessing the Alignment of Audio Representations With Timbre Similarity Ratings

Author: Haokun Tian; Stefan Lattner; Charalampos Saitis
Publisher: Zenodo
DOI: 10.5281/zenodo.17706569
Source: https://zenodo.org/records/17706569/files/000083.pdf
Haokun Tian¹, Stefan Lattner², Charalampos Saitis¹
¹Queen Mary University of London, UK; ²Sony Computer Science Laboratories, Paris, France
ABSTRACT
Psychoacoustical so-called “timbre spaces” map perceptual similarity ratings of instrument sounds onto low-dimensional embeddings via multidimensional scaling, but suffer from scalability issues and are incapable of generalization. Recent results from audio (music and speech) quality assessment as well as image similarity have shown that deep learning is able to produce embeddings that align well with human perception while being largely free from these constraints. Although the existing human-rated timbre similarity data is not large enough to train deep neural networks (2,614 pairwise ratings on 334 audio samples), it can serve as test-only data for audio models. In this paper, we introduce metrics to assess the alignment of diverse audio representations with human judgments of timbre similarity by comparing both the absolute values and the rankings of embedding distances to human similarity ratings. Our evaluation involves three signal-processing-based representations, twelve representations extracted from pre-trained models, and three representations extracted from a novel sound matching model. Among them, the style embeddings inspired by image style transfer, extracted from the CLAP model and the sound matching model, remarkably outperform the others, showing their potential in modeling timbre similarity.
1. INTRODUCTION
How do humans distinguish between different musical timbres? This question has driven research in the field of psychoacoustics for decades [4]. Researchers typically recruit a group of people, play different sounds to them in a controlled acoustic environment, and ask them to rate the differences in the sounds by assigning a score. These sounds are normalized in pitch, loudness, and duration so that participants can focus on timbre. After collecting all human ratings, researchers use a technique called multidimensional scaling (MDS) to map the sounds onto a low-dimensional space called timbre space, where distances between the resulting embeddings reflect human ratings. This space is then analyzed to find whether certain acoustic features can explain how timbres are ordered along different dimensions.
Figure 1. Human similarity ratings of three timbre space datasets [1–3]. Each block matrix represents pairwise comparisons between audio stimuli. Entry (i, j) indicates the perceived similarity between audio i and audio j. Darker colors indicate lower similarity. The lower triangular part is made partially transparent due to symmetry.
© H. Tian, S. Lattner, and C. Saitis. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: H. Tian, S. Lattner, and C. Saitis, “Assessing the Alignment of Audio Representations with Timbre Similarity Ratings”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
However, these studies have always been limited in scale; that is, they only involved a small number of sound stimuli. If the number of sounds were to increase, the required human ratings would grow quadratically, since ratings are given in pairs. The largest study to date is [5], which includes 42 sound stimuli. Additionally, pitch, loudness, and duration must be fixed to eliminate confounding effects, meaning that a timbre space with variable pitch or loudness can never be constructed. Furthermore, to place new audio samples into a timbre space, new human ratings must always be collected, indicating that the timbre space as a model does not have the ability to generalize without additional efforts.
To overcome these limitations, we envision a neural network-based model that can serve as a perceptual metric for timbre; that is, it can embed any audio input into a space where distances between embeddings reflect how different humans perceive the inputs to be. Such a metric would have numerous applications. First, to train a model to generate instrument sounds [6, 7], one could match the timbre by directly computing the loss using this metric rather than relying on spectrogram distances. In music production, it could enable efficient retrieval of samples with similar or contrasting timbres without the need for laborious listening. Additionally, it can encode audio signals into meaningful timbre tokens for generative modeling. Finally, it could provide a control space for musical expression, as has already been demonstrated [8].
In this work, we take the first step by evaluating a range of models’ representations, including signal-processing-based low-level representations and those produced by pre-trained audio models, on the past 21 small datasets collected by psychoacoustical timbre space studies. Our goal is to determine which model currently performs best in terms of matching human ratings of timbre similarity. We also include a newly trained sound matching model that predicts synthesis parameters from audio, and we evaluate audio representations extracted from it. In particular, we have found that style embeddings, originally proposed for image style transfer, are promising as a future direction for modeling timbre similarity. We clarify that the term “alignment” in this paper does not refer to any temporal synchronizations, and our assessment does not indicate the extent to which models produce disentangled timbre representations, an ability required for some of the applications mentioned in the previous paragraph.
2. RELATED WORK
Several works have aimed to train models that produce a perceptual timbre space. Esling et al. [9] trained a variational autoencoder to reconstruct audio samples of different timbres, using perceptual ratings from timbre space studies to regularize the space. Lostanlen et al. [10] collected timbre similarity judgments on 78 sounds using free sorting [11]. By aggregating responses from different participants, the sounds were assigned to 19 clusters according to their perceptual similarity. They, along with a subsequent study [12], developed metric learning models to capture this structure. Thoret et al. [13] and Pascal et al. [14] attempted to learn distance metrics directly from timbre space data but did not learn a single metric that generalizes across all datasets. The most similar work to ours is by Vahidi et al. [15], which evaluated the alignment of three representations with timbre similarity ratings. Additionally, the accompanying codebase [16] provides more implementations of alignment scores, but no results have been reported. Our work is a direct continuation, providing a holistic evaluation of various models and implementations with extended functionalities.
In broader domains, perceptual metrics have been developed to handle various levels of perturbations. They are either trained from scratch or fine-tuned using human perceptual judgments. For audio, Manocha et al. proposed DPAM [17] and CDPAM [18], which address low-level audio perturbations such as noise addition, reverb, and compression. In the image domain, Zhang et al. [19] modeled low-level perturbations including traditional distortions such as noise addition and blur, as well as CNN-based distortions. Fu et al. [20] explored mid-level perturbations related to pose, color, and shape. Muttenthaler et al. [21] investigated high-level perturbations involving different object-level concepts (e.g., “grass” versus “sand”). The smallest dataset used is NIGHTS [20] with 20K human judgments, while the largest is THINGS [22], which contains 4.7M human judgments. In particular, [19] demonstrated that deep learning representations, regardless of the training data, model, or task used, exhibit a promising ability to capture perceptual image similarity. Therefore, although current timbre space datasets provide only 2.6K pairwise judgments and are insufficient for training, they are worth being used for model evaluation.
3. DATA AND EXPERIMENTS
3.1 Data
We use data curated by Thoret et al. [13] and Vahidi et al. [15],¹ comprising a total of 21 datasets from 11 published psychoacoustic studies [1–3, 23–25, 27, 28, 31, 33, 34]. We present summary information for each dataset in Table 1. Each dataset contains a set of audio samples along with pairwise timbre similarity ratings. The sounds span a wide range: acoustic recordings of musical instrument notes, digitally edited acoustic samples designed to create simple or cross-instrument spectra, and ones from synthesizers and electromechanical instruments. Each similarity rating is an absolute value calculated by averaging across multiple human listeners. We compile all ratings from the 21 datasets into a large, sparse block-diagonal matrix, where the upper triangular part of each block represents the ratings of unique audio pairs in the corresponding dataset. Figure 1 illustrates three of these blocks, with symmetric pairwise ratings visualized by color. In total, the matrix includes 334 audio samples and 2,614 similarity ratings.
3.2 Metrics
Our goal is to investigate whether an audio model can perceive timbre similarity in a way that aligns with human perception. To this end, we compute pairwise distances between audio representations produced by the model to estimate similarity ratings, which we then compare with real human ratings to obtain alignment scores. Conceptually, the pairwise distances of audio representations form a predicted similarity matrix, which we compare against the ground truth similarity matrix, consisting of human ratings.
To conduct this evaluation, we first obtain equally shaped representations for audio of different lengths, in preparation for the distance computation. Second, we employ a distance function to quantify their similarity. Third, we require metrics to measure the alignment between the predicted and true similarity matrices. Below, we present our approach to these three steps, resulting in two methods for handling variable length (see next section for cases with only one viable option), two methods for computing distances, and five ways of computing alignment scores. This gives a total of 10 or 20 scores for each representation (one model can produce multiple representations). Additionally, we have developed this process into an easy-to-use Python package, as described in Section 3.2.4.
¹ https://github.com/ben-hayes/timbre-dissimilarity-metrics
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

| Study | Dataset | No. sounds | Pitch | Length (s) | Loudness (dB LUFS) | Type of sounds | No. raters∗ |
|---|---|---|---|---|---|---|---|
| Grey (1977) [23] | – | 16 | E♭4 | 0.27 ±0.03 | -14.61 ±1.57 | Synthesized from acoustic | 22 |
| Grey & Gordon (1978) [24] | – | 16 | E♭4 | 0.27 ±0.03 | -14.87 ±1.76 | Same as above, spectral envelopes traded | 22 |
| Iverson & Krumhansl (1993) [25] | Whole | 16 | C4 | 3.19 ±0.69 | -16.62 ±3.99 | Acoustic [26] | 10 |
| | Onset | 16 | C4 | 0.11 ±0.01 | -23.24 ±9.10 | Same as above, only ∼80 ms attack | 9 |
| | Remainder | 16 | C4 | 3.10 ±0.70 | -16.71 ±4.13 | Same as above, ∼80 ms attack removed | 9 |
| McAdams et al. (1995) [27] | – | 18 | E♭4 | 0.69 ±0.19 | -18.36 ±4.05 | FM simulated and hybrids | 22 |
| Lakatos (2000) [28] | Combined | 20 | E♭4 | 1.50 | -24.08 ±2.98 | Acoustic [26] | 34 |
| | Harmonic | 17 | E♭4 | 1.50 | -25.03 ±2.36 | Same as above, only harmonic sounds | 34 |
| | Percussive | 18 | E♭4 | 1.49 ±0.05 | -23.97 ±3.83 | Same as above, only percussive sounds | 34 |
| Barthet et al. (2010) [1] | – | 15 | E♭4 | 1.56 ±0.04 | -23.77 ±1.13 | Physical modeling, clarinet only | 16 |
| Patil et al. (2012) [2] | A3 | 11 | A3 | 0.25 | -19.03 ±0.85 | Acoustic [29] | 6 |
| | D4 | 11 | D4 | 0.25 | -19.17 ±0.81 | Same as above | 20 |
| | G♯4 | 11 | G♯4 | 0.25 | -18.97 ±0.81 | Same as above | 6 |
| Zacharakis et al. (2015) [3] | Greek raters | 24 | A1–4 | 1.30 | -29.36 ±3.84 | Acoustic [30] and synthesizers | 33 |
| | English raters | 24 | A1–4 | 1.30 | -29.36 ±3.84 | Same as above | 20 |
| Siedenburg et al. (2016) [31] | Exp 2A Set 1 | 14 | E♭4 | 0.50 | -23.56 ±3.15 | Acoustic [32] | 24 |
| | Exp 2A Set 2 | 14 | E♭4 | 0.50 | -23.77 ±1.90 | Chimeric (spectral envelopes traded) | 24 |
| | Exp 2A Set 3 | 14 | E♭4 | 0.50 | -23.21 ±2.37 | Acoustic and chimeric | 24 |
| | Exp 2B | 14 | E♭4 | 0.50 | -23.21 ±2.37 | Same as above | 24 |
| Saitis & Siedenburg (2020) [33] | GEdissim | 14 | E♭4 | 0.50 | -23.56 ±3.15 | Same as Exp 2A Set 1 in [31, 32] | 40 |
| Vahidi et al. (2020) [34] | – | 15 | A4 | 1.00 | -9.73 ±3.09 | Subtractive synthesis | 35 |

∗Raters are typically musicians or with some sort of musical background. Some studies used a mixture of musicians and non-musicians.
Table 1. Summary of the 21 timbre similarity datasets. We computed integrated loudness of each sample using pyloudnorm [35] with a block size of 0.08 seconds, slightly shorter than the shortest sample. While some dynamic variations can be observed, loudness is typically reported to have been normalized by expert listeners.
3.2.1 Handling Variable Audio Length
To produce representations of audio, models typically use an analysis window with a fixed length, which slides over time with a hop length to produce time-varying representations. As shown in Table 1, some timbre space datasets have fixed audio lengths, while others do not. This creates the need to compute identically shaped representations for audio of different lengths, as models employing sliding windows inherently produce representations with a time dimension proportional to the input length. To address this, we provide two approaches. The first is to squash the time dimension by computing the average over it. The second, which we call dynamic-length, matches input lengths by padding zeros to the right (or truncating in very few cases) to arrive at the same length. Below are a few scenarios. If the model has a long enough analysis window that covers all input lengths of the timbre space data, all audio samples are padded to this length. For this case, we will not compute the time-averaged version, as the representation has only one frame along the time dimension. If the model operates with a rather short analysis window, the solution is to always dynamically pad the shorter audio within a pair to match the length of the longer one, ensuring the shapes of their representations match. We only truncate audio samples in one case, and that is when the model is sensitive to time shifting and does not accept sliding. We pad the shorter audio and truncate the longer audio to match the fixed window length. In this case we will also not compute the time-averaged version. We describe in Section 3.4 specifically how the dynamic-length approach is applied to each model.
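The two length-handling strategies can be illustrated with a minimal NumPy sketch. It is not taken from the timbremetrics package: the function names are hypothetical, waveforms are assumed to be 1-D arrays, and the time axis of a representation is assumed to be the last axis.

```python
import numpy as np


def time_average(rep):
    """Squash the time axis (assumed to be the last axis) by averaging."""
    return np.asarray(rep).mean(axis=-1)


def pad_pair(a, b):
    """Dynamic-length padding: zero-pad the shorter 1-D waveform on the right."""
    n = max(len(a), len(b))
    return np.pad(a, (0, n - len(a))), np.pad(b, (0, n - len(b)))


def fit_to_window(x, window):
    """Fixed-window case: truncate or right-pad a 1-D waveform to `window` samples."""
    x = np.asarray(x)[:window]
    return np.pad(x, (0, window - len(x)))
```

With `pad_pair`, both waveforms in a pair reach the model with the same number of samples, so a sliding-window model yields representations of identical shape. `fit_to_window` corresponds to the single case in which truncation is allowed.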
3.2.2 Distance Functions
We compute pairwise distances using two functions: the ℓ2 distance and cosine distance (defined as one minus the cosine similarity). These functions are applied to flattened audio representations.
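As a minimal sketch of these two distances on flattened representations (function names are hypothetical, and the small `eps` guard against zero-norm inputs is an assumption not stated in the text):

```python
import numpy as np


def l2_distance(a, b):
    """Euclidean distance between two flattened representations."""
    a, b = np.ravel(a), np.ravel(b)
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b, eps=1e-12):
    """One minus cosine similarity; eps (an assumption) avoids division by zero."""
    a, b = np.ravel(a), np.ravel(b)
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return float(1.0 - np.dot(a, b) / denom)
```

Cosine distance ignores the overall scale of a representation, whereas the ℓ2 distance does not, which is one reason the two can rank the same pairs differently.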
3.2.3 Alignment Scores
The alignment scores are computed between the predicted similarity matrix and the ground truth similarity matrix, both of which are diagonal block matrices. We note that only the pairs within the diagonal blocks are valid, have similarity ratings, and are used to compute the alignment scores. Pairs outside the diagonal blocks lack human ratings as samples are from different datasets. Before computation, the values in each dataset-level block matrix (from both the predicted and ground truth matrices) are rescaled to the range [0, 1].
We compute five alignment scores in total. The first is the mean absolute error (MAE). It computes the absolute error between ratings of each unique audio pair and averages the errors across all pairs. The other four are rank-based scores that evaluate how well the model ranks a list of audio samples based on their timbre similarity to a given sample. This is done by comparing each row of the two similarity matrices (column-wise comparison yields the same results due to symmetry), which contains the similarity ratings between a reference sample (whose index corresponds to the row index) and all other samples from the same dataset, where the relative ordering of these ratings determines the timbre similarity ranking with respect to the reference sample. Sorting the ratings per row yields N rankings, one for each sample (each sample is used once as the reference sample), where N is the total number of samples. The predicted ranking from the computed matrix is then compared to the true ranking from the ground-truth matrix. The final rank-based alignment scores are obtained by computing a score per row and averaging across all rows.
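The MAE computation can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: rescaling is assumed to be a min-max normalization over the unique (upper-triangular) pairs of each block, both matrices are assumed to hold values on the same polarity (both dissimilarities or both similarities), and each block is assumed to contain at least two distinct values. The function names are hypothetical.

```python
import numpy as np


def rescale_block(block):
    """Min-max rescale the unique pairs (upper triangle, no diagonal) to [0, 1]."""
    block = np.asarray(block, dtype=float)
    iu = np.triu_indices(block.shape[0], k=1)
    vals = block[iu]
    lo, hi = vals.min(), vals.max()
    return (vals - lo) / (hi - lo)  # assumes hi > lo


def mae(pred_blocks, true_blocks):
    """Mean absolute error over all unique pairs of all dataset-level blocks."""
    errors = [
        np.abs(rescale_block(p) - rescale_block(t))
        for p, t in zip(pred_blocks, true_blocks)
    ]
    return float(np.concatenate(errors).mean())
```

Because every block is rescaled independently, a model whose distances are a positive affine transform of the human ratings within each dataset obtains an MAE of zero.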
We use four rank-based metrics for per-row comparisons: Kendall’s rank correlation coefficient, Normalized Discounted Cumulative Gain (NDCG) that treats timbre similarity (rather than dissimilarity) as relevance, Spearman’s rank correlation coefficient, and triplet agreement with an adjustable margin. For the first three standard metrics, we use implementations from torchmetrics [36]. For triplet agreement, we first extract all triplets from the ground-truth row that satisfy the given margin condition; that is, the absolute difference between two dissimilarity ratings must be greater than the margin value. This mimics a just-noticeable difference in human perception, where rating differences smaller than this threshold are considered perceptually equivalent to the reference and are therefore excluded from the evaluation. The margin is set to 0.1 in our evaluation. To illustrate, a triplet consists of three audio samples (i, j, k), where i is the reference sample and the similarity between i and j is compared to the similarity between i and k. We then compute the triplet agreement rate as the ratio of triplets for which both matrix rows agree on the relative ordering of j and k with respect to i, to the total number of extracted triplets.
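A minimal sketch of triplet agreement for a single dataset block is given below. It follows the description above, with some assumptions: both inputs are square dissimilarity matrices already rescaled to [0, 1], rows with no valid triplet are skipped when averaging, and at least one row has a valid triplet. The function name is hypothetical and this is not the packaged implementation.

```python
import numpy as np


def triplet_agreement(pred, true, margin=0.1):
    """Average, over reference samples i, of the fraction of triplets (i, j, k)
    on which the predicted row orders j and k the same way as the true row."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    n = true.shape[0]
    row_scores = []
    for i in range(n):
        agree = total = 0
        for j in range(n):
            for k in range(j + 1, n):
                if i == j or i == k:
                    continue
                diff_true = true[i, j] - true[i, k]
                if abs(diff_true) <= margin:
                    continue  # below the margin: treated as perceptually equivalent
                total += 1
                if diff_true * (pred[i, j] - pred[i, k]) > 0:
                    agree += 1
        if total:  # assumption: rows without valid triplets are skipped
            row_scores.append(agree / total)
    return float(np.mean(row_scores))
```

A score of 1 means the model orders every retained pair (j, k) the same way as the listeners did for each reference i, and a score near 0.5 is what random orderings would be expected to give.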
3.2.4 The Python Package
Our Python package² timbremetrics simplifies the evaluation process by requiring only the input of a model capable of converting raw waveforms into audio representations to compute the alignment of that model with timbre similarity perception. This package supports all evaluations described in this section and additionally supports three distance functions: ℓ1, negative dot product, and Poincaré distance, a commonly used hyperbolic distance [37]. It is compatible with audio models interfaced by the Fréchet Audio Distance Toolkit [38].³ Furthermore, it can be used as a convenient training-time evaluation to inspect whether the model can acquire human-like ability to perceive timbre similarity during the learning process.
3.3 Sound Matching
We train a sound matching model that inverts the Vital synthesizer⁴ by predicting the synthesis parameters from synthesized audio.⁵ We are interested in whether, via this task, the model can learn meaningful representations that align with timbre similarity perception. We are motivated by the fact that synthesizer parameters encode all audio content using very little storage, and we expect that by learning this compression, we can obtain compact but expressive intermediate representations.
² https://github.com/tiianhk/timbremetrics
³ https://github.com/microsoft/fadtk
⁴ https://vital.audio
⁵ Code available at https://github.com/tiianhk/sm4tp
3.3.1 Data Generation
We use Vita,⁶ a package that provides Python bindings to the Vital Synthesizer, to generate our dataset. Ten parameters and their ranges are subjectively selected to produce pronounced timbral changes when varied. These parameters include one discrete selection from seven basic waveshapes (e.g., sine wave, triangle wave) in the wavetable, two to distort the waveshape, two from a unison effect, three from the ADSR envelope, and two from an EQ. Of these, two are discrete parameters, and the remaining eight are continuous parameters. We uniformly sample each parameter to generate data. A random pitch is also sampled, but not as a prediction target, as pitch is considered unrelated to timbre. We generate a total of 500k samples, each with a maximum length of two seconds, which takes ∼8 hours on a single CPU core. All continuous parameters are rescaled to the range [0, 1] to be used as regression targets.
3.3.2 Model Architecture
The model starts with an initial convolutional layer with a large kernel size to capture low-level features, followed by batch normalization, ReLU activation, and max pooling. The core of the model comprises four residual blocks with increasing channel dimensions [39]. Each residual block contains two convolutional layers with batch normalization and ReLU, along with a shortcut connection to preserve gradient flow. After feature extraction, a global average pooling layer reduces spatial dimensions, yielding a fixed-length embedding vector of size 256. This embedding is then fed into the output heads: a regression head that produces 8 continuous values between 0 and 1, and two classification heads for discrete predictions. Our model has ∼5M parameters.
3.3.3 Training
For the discrete parameters, we compute cross-entropy losses. For the continuous parameters, we compute the ℓ1 losses. Losses are added without weighting. We use an 8/2 train-validation split, a batch size of 32, and the Adam optimizer with a learning rate of 1e-4. The model is trained for 100 epochs, which takes ∼12 hours on an A100 GPU. We observe that the validation loss of each parameter has converged.
3.3.4 Task and Style Embeddings
We extract and evaluate three representations from the sound matching model. The first representation is the 256-dimensional embedding obtained right before the prediction heads, which encodes information necessary for solving the task (in our case, synthesis parameter prediction). We refer to this as the task embedding. The other two are inspired by image style transfer [40, 41]. We refer to them as the style embeddings. Our motivation for using style embeddings is twofold: first, they are invariant to the spatial location of audio events on spectrograms; and second, they use features from different layers of a model, thus capturing multiple levels of abstraction, a property shown to be effective for modeling perceptual timbre similarity [14]. The first style embedding, proposed by Gatys et al. [40], is computed as a Gram matrix of feature activations, where each entry captures the inner product between two channels across the spatial dimensions, resulting in a symmetric matrix that captures correlations between channels. The second style embedding is shown to be effective by Huang and Belongie [41] and is computed as the channel-wise mean and standard deviation of activations over the spatial dimensions. Given a feature map with shape (B, C, H, W), where B is the batch size, C is the channel number, and H and W are the spatial dimensions, i.e. height and width, the style embedding of Gatys et al. has shape (B, C, C), and the style embedding of Huang and Belongie has shape (B, 2C). With our model, we extract both Gatys style embeddings and Huang style embeddings from five intermediate convolutional layers: the initial convolutional layer, and the first convolutional layers in each of the four residual blocks. Feature maps are obtained after batch normalization. We concatenate the embeddings obtained from different layers, resulting in a single Gatys style embedding and a single Huang style embedding per input. Together with the task embedding, we abbreviate them as s.m.-task, s.m.-Gatys, and s.m.-Huang.
⁶ https://github.com/DBraun/Vita
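The two style embeddings can be sketched for a single feature map as follows. This is an illustrative NumPy version rather than the authors' code: the function names are hypothetical, and dividing the Gram matrix by H·W is an assumed normalization that the text does not specify.

```python
import numpy as np


def gatys_style(feat):
    """Gram matrix of channel activations: (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = feat.shape
    flat = feat.reshape(b, c, h * w)
    # Inner products between channels across spatial positions;
    # the 1 / (H * W) normalization is an assumption.
    return flat @ flat.transpose(0, 2, 1) / (h * w)


def huang_style(feat):
    """Channel-wise mean and standard deviation: (B, C, H, W) -> (B, 2C)."""
    mean = feat.mean(axis=(2, 3))
    std = feat.std(axis=(2, 3))
    return np.concatenate([mean, std], axis=1)
```

Both functions reduce over the spatial axes only, so any permutation of spatial positions (for example, a circular shift along the time axis of a spectrogram feature map) leaves the embedding unchanged. This is the invariance to spatial location mentioned above. Embeddings from several layers would then be flattened and concatenated into one vector per input.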
3.4 Other Models and Dynamic Length Adaption
We evaluate signal-processing-based methods⁷ that do not learn from data, including mel-frequency cepstral coefficients (MFCC), multi-scale spectrograms (MSS) [7], and the joint time-frequency scattering transform (JTFS) [43]. We also evaluate pre-trained audio models interfaced through the Fréchet Audio Distance Toolkit [38], which are typically trained on large datasets and used to evaluate generative models, as their representations are considered to correlate well with perceptual music quality. This includes three CLAP models trained with natural language supervision [44, 45]; two CDPAM models trained on perceptual ratings of audio quality [18]; and neural audio codecs including two Encodec models [46] and one Descript Audio Codec (DAC) [47], which compress audio into lower-bitrate latent representations. Additionally, we evaluate Music2Latent [48] and a reproduction of the Complex Autoencoder (CAE) [49] that produces representations invariant to transposition and time-shift.⁸
In Section 3.2.1, we discussed how to produce equally shaped representations for audio of different lengths. Here, we describe how this is specifically achieved for each model. Since the longest sample in our data is 4.39 seconds (see Table 1), we compute just one frame of representation for the following models, with their respective analysis window lengths indicated in parentheses: CLAP (ten seconds for [45] and seven seconds for [44]) and CDPAM (five seconds). For these models, audio samples are padded to match the window length. The sound matching model has an analysis window of two seconds (which covers 19 out of 21 datasets) and is trained to capture the ADSR envelope, so the analysis window cannot be shifted. Therefore, we pad or truncate samples to a duration of two seconds. The above six models produce eight unique representations, for which we do not compute the time average since they contain only one time frame. For all other models, representations are computed using both time averaging and dynamic padding, where the shorter sample is padded to match the longer one within an audio pair.
⁷ For audio at 44.1 kHz, MFCC is computed with n_mfcc=40, MSS is computed with fft_sizes=(4096, 2048, 1024, 512, 256, 128), and JTFS is computed with J = 12, Q = (8, 2), J_fr = 3, and Q_fr = 2 using the Kymatio package [42].
⁸ Trained with piano music from the MAPS dataset [50].
Figure 2. Evolution of alignment scores during the training of the sound matching model. The left column shows the detailed changes with the first epoch zoomed in, while the right column shows the overall progress across the total 100 epochs. For MAE, lower scores are better; for other metrics, higher scores are better. The best scores refer to the highest (or lowest, for MAE) values of each metric in Figure 3.
We also evaluate style embeddings extracted from one CLAP model [44], which differs from the sound matching model not only in training objective but also in architecture: it uses a Transformer backbone. However, style embeddings can be computed in a similar way: the internal representations produced by the Transformer consist of spatial tokens with a feature dimension, analogous to spatial locations and channel dimensions in CNNs. Therefore, we treat the transformer’s feature dimension as equivalent to the channel dimension in CNNs. Following this analogy, we compute Gatys style embeddings by measuring correlations between feature dimensions, and Huang style embeddings by computing statistics (mean and variance) across each feature. We compute style embeddings using the outputs from each Swin Transformer block in the first three layers. Each layer contains multiple blocks, with nine blocks in total. The resulting embeddings are concatenated into a single embedding per input for evaluation.

Figure 3. Evaluation results. Each configuration shown at the top is represented by a unique color. For each alignment score, the best result within each configuration is marked with a star, while a yellow dot behind the star indicates the overall winner. For MAE, lower scores are better; for other metrics, higher scores are better. Representations are ordered by the mean triplet agreement across configurations, in descending order. For text labels on the x-axis, the same background color indicates different representations extracted from the same model.
4. RESULTS AND DISCUSSION
In Figures 2 and 3, we omit the Kendall and Spearman scores, as they are highly correlated with triplet agreement and produce nearly identical rankings of the evaluated representations. We choose to report triplet agreement instead, as it is a more intuitive metric to interpret than the other two.
Figure 2 shows training-time alignment scores of our sound matching model on all 21 datasets. We use a three-fold validation-test split to ensure a fair comparison with other models shown in Figure 3. For each score, we select the best-performing model based on validation performance using only checkpoints from the first epoch. Final results are reported as averages over the test folds. The style embeddings converge quickly, retaining high alignment with human timbre judgments, whereas the task embedding overfits and loses generalization over time. In Figure 3, the style embeddings from both the sound matching model and the CLAP model show clear improvements over their base representations, demonstrating the effectiveness of style embeddings regardless of training objective or model architecture. In particular, the Huang style embedding extracted from the CLAP model shows the strongest performance.
MFCC remains competitive and outperforms many trained models. By contrast, CDPAM, trained on low-level distortion judgments of speech, does not adapt well to musical timbre. This may be due to differences in data domain, or it may echo findings from DreamSim [20], which suggest that perceptual similarity learned for one type of perturbation does not generalize to others. MSS also underperforms, consistent with prior work [51] showing that spectral distances can be problematic for capturing perceptual similarity of pitch.
Interestingly, Encodec’s 24k model aligns better with human ratings than the 48k model, despite the former being trained on a variety of audio data, whereas the latter is trained exclusively on music. This suggests that increased compression combined with broader training data may encourage the model to discard irrelevant details while preserving perceptually meaningful structure, resulting in a more efficient internal timbre representation. This also highlights the effectiveness of Music2Latent, which aligns better with human ratings, has a high compression rate, and is trained on both music and speech. JTFS performs moderately on rank-based metrics but poorly on MAE. This suggests that its distance scale may be nonlinearly stretched relative to human perception, preserving the order of examples but distorting their absolute differences.
5. CONCLUSION
In this paper, we introduced a unified evaluation framework to compare model-derived distances with human similarity ratings from 21 classic timbre space datasets, encompassing a wide range of musical instrument sounds. We assessed both hand-crafted features (e.g., MFCC) and deep learning-based representations (e.g., CLAP, CDPAM, neural audio codecs), as well as a newly proposed sound matching model that inverts a wavetable synthesizer. Our results showed that style embeddings extracted from different models outperformed their base representations, and in particular, the Huang style embedding from the CLAP model is markedly superior to the others. To encourage further work, we provide a Python package that implements all our metrics and procedures. We hope this evaluation framework and Python package will encourage advancements in timbre metrics across tasks like generative modeling and instrument retrieval.
6. ETHICS STATEMENTS
This work evaluates models using datasets that primarily feature Western musical instruments, which may reflect a cultural bias toward Western music traditions. We acknowledge this limitation and are enthusiastic about including non-Western musical data in our evaluation framework, as it may both enhance cultural diversity and help reveal biases in the models’ behavior under broader musical contexts.
7. ACKNOWLEDGEMENTS
We thank Mathieu Lagrange for the valuable discussions. This work is supported by the EPSRC UKRI Centre for Doctoral Training in Artificial Intelligence and Music (grant number EP/S022694/1). This research utilized Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT. http://doi.org/10.5281/zenodo.438045.
8. REFERENCES
[1] M. Barthet, P. Guillemain, R. Kronland-Martinet, and S. Ystad, “From clarinet control to timbre perception,” Acta Acustica united with Acustica, vol. 96, no. 4, pp. 678–689, 2010.
[2] K. Patil, D. Pressnitzer, S. Shamma, and M. Elhilali, “Music in our ears: the biological bases of musical timbre perception,” PLoS Computational Biology, vol. 8, no. 11, p. e1002759, 2012.
[3] A. Zacharakis, K. Pastiadis, and J. D. Reiss, “An interlanguage unification of musical timbre: Bridging semantic, perceptual, and acoustic dimensions,” Music Perception: An Interdisciplinary Journal, vol. 32, no. 4, pp. 394–412, 2015.
[4] S. McAdams, “The perceptual representation of timbre,” Timbre: Acoustics, Perception, and Cognition, pp. 23–57, 2019.
[5] T. M. Elliott, L. S. Hamilton, and F. E. Theunissen, “Acoustic structure of the five perceptual dimensions of timbre in orchestral instrument tones,” The Journal of the Acoustical Society of America, vol. 133, no. 1, pp. 389–404, 2013.
[6] J. Engel, C. Resnick, A. Roberts, S. Dieleman,
M. Norouzi, D. Eck, and K. Simonyan, “Neural audio
synthesis of musical notes with wavenet autoencoders,”
in International Conference on Machine Learning.
PMLR, 2017, pp. 1068–1077.
[7] J. Engel, L. H. Hantrakul, C. Gu, and A. Roberts,
“DDSP: Differentiable digital signal processing,” in
International Conference on Learning Representations,
2020. [Online]. Available:
https://openreview.net/forum?id=B1x1ma4tDr
[8] D. L. Wessel, “Timbre space as a musical control
structure,” Computer Music Journal, pp. 45–52, 1979.
[9] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton,
“Generative timbre spaces with variational audio
synthesis,” in Proceedings of the International Conference on
Digital Audio Effects (DAFx), 2018, pp. 175–181.
[10] V. Lostanlen, C. El-Hajj, M. Rossignol, G. Lafay,
J. Andén, and M. Lagrange, “Time–frequency scattering
accurately models auditory similarities between
instrumental playing techniques,” EURASIP Journal on Audio,
Speech, and Music Processing, vol. 2021, no. 1, p. 3,
2021.
[11] S. Chollet, D. Valentin, and H. Abdi, “Free sorting
task,” Novel Techniques in Sensory Characterization
and Consumer Profiling, vol. 207, 2014.
[12] C. Vahidi, S. Singh, E. Benetos, H. Phan, D. Stowell,
G. Fazekas, and M. Lagrange, “Perceptual musical
similarity metric learning with graph neural networks,” in
2023 IEEE Workshop on Applications of Signal Processing
to Audio and Acoustics (WASPAA). IEEE, 2023,
pp. 1–5.
[13] E. Thoret, B. Caramiaux, P. Depalle, and S. McAdams,
“Learning metrics on spectrotemporal modulations
reveals the perception of musical instrument timbre,”
Nature Human Behaviour, vol. 5, no. 3, pp. 369–377,
2021.
[14] B. Pascal and M. Lagrange, “On the robustness of
musical timbre perception models: From perceptual to
learned approaches,” in 2024 32nd European Signal
Processing Conference (EUSIPCO). IEEE, 2024, pp.
41–45.
[15] C. Vahidi, B. Hayes, C. Saitis, and G. Fazekas,
“Acoustic representations for perceptual timbre similarity,” in
Digital Music Research Network One-Day Workshop
(DMRN+ 16), 2021.
[16] B. Hayes and C. Vahidi, “Timbre dissimilarity
metrics,”
https://github.com/ben-hayes/timbre-dissimilarity-metrics,
2021, accessed: 2025-03-29.
[17] P. Manocha, A. Finkelstein, R. Zhang, N. J. Bryan,
G. J. Mysore, and Z. Jin, “A differentiable perceptual
audio metric learned from just noticeable differences,”
in Interspeech, 2020.
[18] P. Manocha, Z. Jin, R. Zhang, and A. Finkelstein,
“CDPAM: Contrastive learning for perceptual audio
similarity,” in IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP). IEEE, 2021,
pp. 196–200.
[19] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and
O. Wang, “The unreasonable effectiveness of deep
features as a perceptual metric,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern
Recognition, 2018, pp. 586–595.
[20] S. Fu, N. Tamir, S. Sundaram, L. Chai, R. Zhang,
T. Dekel, and P. Isola, “Dreamsim: Learning new
dimensions of human visual similarity using synthetic data,”
in Advances in Neural Information Processing Systems,
vol. 36, 2023, pp. 50 742–50 768.
[21] L. Muttenthaler, L. Linhardt, J. Dippel, R. A.
Vandermeulen, K. Hermann, A. Lampinen, and S. Kornblith,
“Improving neural network representations using human
similarity judgments,” Advances in Neural Information
Processing Systems, vol. 36, pp. 50 978–51 007, 2023.
[22] M. N. Hebart, O. Contier, L. Teichmann, A. H.
Rockter, C. Y. Zheng, A. Kidder, A. Corriveau, M.
Vaziri-Pashkam, and C. I. Baker, “Things-data, a multimodal
collection of large-scale datasets for investigating object
representations in human brain and behavior,” Elife,
vol. 12, p. e82580, 2023.
[23] J. M. Grey, “Multidimensional perceptual scaling of
musical timbres,” The Journal of the Acoustical Society
of America, vol. 61, no. 5, pp. 1270–1277, 1977.
[24] J. M. Grey and J. W. Gordon, “Perceptual effects of
spectral modifications on musical timbres,” The Journal
of the Acoustical Society of America, vol. 63, no. 5, pp.
1493–1500, 1978.
[25] P. Iverson and C. L. Krumhansl, “Isolating the dynamic
attributes of musical timbre,” The Journal of the
Acoustical Society of America, vol. 94, no. 5, pp. 2595–2603,
1993.
[26] F. Opolko and J. Wapnick, McGill University master
samples (3 CDs). Quebec, Canada: McGill University,
1987.
[27] S. McAdams, S. Winsberg, S. Donnadieu, G. De Soete,
and J. Krimphoff, “Perceptual scaling of synthesized
musical timbres: Common dimensions, specificities, and
latent subject classes,” Psychological Research, vol. 58,
pp. 177–192, 1995.
[28] S. Lakatos, “A common perceptual space for harmonic
and percussive timbres,” Perception & Psychophysics,
vol. 62, no. 7, pp. 1426–1439, 2000.
[29] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka,
“RWC music database: Music genre database and
musical instrument sound database,” in Proceedings of
the 4th International Conference on Music Information
Retrieval (ISMIR), 2003.
[30] F. Opolko and J. Wapnick, The McGill University master
samples collection on DVD (3 DVDs). Quebec, Canada:
McGill University, 1987.
[31] K. Siedenburg, K. Jones-Mollerup, and S. McAdams,
“Acoustic and categorical dissimilarity of musical
timbre: Evidence from asymmetries between acoustic and
chimeric sounds,” Frontiers in Psychology, vol. 6, p.
1977, 2016.
[32] Vienna Symphonic Library, https://www.vsl.co.at/.
[33] C. Saitis and K. Siedenburg, “Brightness perception
for musical instrument sounds: Relation to timbre
dissimilarity and source-cause categories,” The Journal of
the Acoustical Society of America, vol. 148, no. 4, pp.
2256–2266, 2020.
[34] C. Vahidi, G. Fazekas, C. Saitis, and A. Palladini,
“Timbre space representation of a subtractive synthesizer,”
in Proceedings of the 2nd International Conference on
Timbre, 2020, pp. 30–33.
[35] C. J. Steinmetz and J. Reiss, “pyloudnorm: A simple yet
flexible loudness meter in python,” in Audio Engineering
Society Convention 150. Audio Engineering Society,
2021.
[36] Nicki Skafte Detlefsen, Jiri Borovec, Justus Schock,
Ananya Harsh, Teddy Koker, Luca Di Liello, Daniel
Stancl, Changsheng Quan, Maxim Grechkin, and
William Falcon, “TorchMetrics - Measuring
Reproducibility in PyTorch,” Feb. 2022. [Online]. Available:
https://github.com/Lightning-AI/torchmetrics
[37] V. Khrulkov, L. Mirvakhabova, E. Ustinova, I. Oseledets,
and V. Lempitsky, “Hyperbolic image embeddings,” in
Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2020, pp. 6418–6428.
[38] A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou,
“Adapting Frechet audio distance for generative music
evaluation,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE,
2024, pp. 1331–1335.
[39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual
learning for image recognition,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, 2016, pp. 770–778.
[40] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image
style transfer using convolutional neural networks,” in
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 2414–2423.
[41] X. Huang and S. Belongie, “Arbitrary style transfer
in real-time with adaptive instance normalization,” in
Proceedings of the IEEE International Conference on
Computer Vision, 2017, pp. 1501–1510.
[42] M. Andreux, T. Angles, G. Exarchakis, R. Leonarduzzi,
G. Rochette, L. Thiry, J. Zarka, S. Mallat, J. Andén,
E. Belilovsky et al., “Kymatio: Scattering transforms in
python,” Journal of Machine Learning Research, vol. 21,
no. 60, pp. 1–6, 2020.
[43] J. Andén, V. Lostanlen, and S. Mallat, “Joint
time–frequency scattering,” IEEE Transactions on Signal
Processing, vol. 67, no. 14, pp. 3704–3718, 2019.
[44] B. Elizalde, S. Deshmukh, and H. Wang, “Natural
language supervision for general-purpose audio
representations,” 2023. [Online]. Available:
https://arxiv.org/abs/2309.05767
[45] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick,
and S. Dubnov, “Large-scale contrastive language-audio
pretraining with feature fusion and keyword-to-caption
augmentation,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2023.
[46] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi,
“High fidelity neural audio compression,” arXiv preprint
arXiv:2210.13438, 2022.
[47] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and
K. Kumar, “High-fidelity audio compression with
improved rvqgan,” Advances in Neural Information
Processing Systems, vol. 36, pp. 27 980–27 993, 2023.
[48] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent:
Consistency autoencoders for latent audio compression,”
Proceedings of the 25th International Conference on
Music Information Retrieval (ISMIR), 2024.
[49] S. Lattner, M. Dörfler, and A. Arzt, “Learning complex
basis functions for invariant representations of audio,”
in Proceedings of the 20th International Conference on
Music Information Retrieval (ISMIR), 2019.
[50] V. Emiya, N. Bertin, B. David, and R. Badeau, “Maps-a
piano database for multipitch estimation and automatic
transcription of music,” 2010.
[51] J. Turian and M. Henry, “I’m sorry for your
loss: Spectrally-based audio distances are bad
at pitch,” in “I Can’t Believe It’s Not Better!”
NeurIPS 2020 workshop, 2020. [Online]. Available:
https://openreview.net/forum?id=Z4UwGkTRTes