Assessing the Alignment of Audio Representations With Timbre Similarity Ratings

Author: Haokun Tian; Stefan Lattner; Charalampos Saitis

Publisher: Zenodo

DOI: 10.5281/zenodo.17706569

Source: https://zenodo.org/records/17706569/files/000083.pdf

ASSESSING THE ALIGNMENT OF AUDIO REPRESENTATIONS WITH
TIMBRE SIMILARITY RATINGS
Haokun Tian¹, Stefan Lattner², Charalampos Saitis¹
¹Queen Mary University of London, UK; ²Sony Computer Science Laboratories, Paris, France
ABSTRACT
Psychoacoustical so-called “timbre spaces” map perceptual similarity ratings of instrument sounds onto low-dimensional embeddings via multidimensional scaling, but suffer from scalability issues and are incapable of generalization. Recent results from audio (music and speech) quality assessment as well as image similarity have shown that deep learning is able to produce embeddings that align well with human perception while being largely free from these constraints. Although the existing human-rated timbre similarity data is not large enough to train deep neural networks (2,614 pairwise ratings on 334 audio samples), it can serve as test-only data for audio models. In this paper, we introduce metrics to assess the alignment of diverse audio representations with human judgments of timbre similarity by comparing both the absolute values and the rankings of embedding distances to human similarity ratings. Our evaluation involves three signal-processing-based representations, twelve representations extracted from pre-trained models, and three representations extracted from a novel sound matching model. Among them, the style embeddings inspired by image style transfer, extracted from the CLAP model and the sound matching model, remarkably outperform the others, showing their potential in modeling timbre similarity.
1. INTRODUCTION
How do humans distinguish between different musical timbres? This question has driven research in the field of psychoacoustics for decades [4]. Researchers typically recruit a group of people, play different sounds to them in a controlled acoustic environment, and ask them to rate the differences in the sounds by assigning a score. These sounds are normalized in pitch, loudness, and duration so that participants can focus on timbre. After collecting all human ratings, researchers use a technique called multidimensional scaling (MDS) to map the sounds onto a low-dimensional space called timbre space, where distances between the resulting embeddings reflect human ratings. This space is then analyzed to find whether certain acoustic features can explain how timbres are ordered along different dimensions.

© H. Tian, S. Lattner, and C. Saitis. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: H. Tian, S. Lattner, and C. Saitis, “Assessing the Alignment of Audio Representations with Timbre Similarity Ratings”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Figure 1. Human similarity ratings of three timbre space datasets [1–3]. Each block matrix represents pairwise comparisons between audio stimuli. Entry (i, j) indicates the perceived similarity between audio i and audio j. Darker colors indicate lower similarity. The lower triangular part is made partially transparent due to symmetry.

However, these studies have always been limited in scale; that is, they only involved a small number of sound stimuli. If the number of sounds were to increase, the required human ratings would grow quadratically, since ratings are given in pairs. The largest study to date is [5], which includes 42 sound stimuli. Additionally, pitch, loudness, and duration must be fixed to eliminate confounding effects, meaning that a timbre space with variable pitch or loudness can never be constructed. Furthermore, to place new audio samples into a timbre space, new human ratings must always be collected, indicating that the timbre space as a model does not have the ability to generalize without additional efforts.

To overcome these limitations, we envision a neural network-based model that can serve as a perceptual metric of timbre; that is, it can embed any audio input into a space where distances between embeddings reflect how differently humans perceive the sounds. Such a metric would have numerous applications. First, to train a model to generate instrument sounds [6, 7], one could match the timbre by directly computing the loss using this metric rather than relying on spectrogram distances. In music production, it could enable efficient retrieval of samples with similar or contrasting timbres without the need for laborious listening. Additionally, it can encode audio signals into meaningful timbre tokens for generative modeling. Finally, it could provide a control space for musical expression, as has already been demonstrated [8].

In this work, we take the first step by evaluating a range of models’ representations, including signal-processing-based low-level representations and those produced by pre-trained audio models, on the past 21 small datasets collected by psychoacoustical timbre space studies. Our goal is to determine which model currently performs best in terms of matching human ratings of timbre similarity. We also include a newly trained sound matching model that predicts synthesis parameters from audio, and we evaluate audio representations extracted from it. In particular, we have found that style embeddings, originally proposed for image style transfer, are promising as a future direction for modeling timbre similarity. We clarify that the term “alignment” in this paper does not refer to any temporal synchronizations, and our assessment does not indicate the extent to which models produce disentangled timbre representations, an ability required for some of the applications mentioned in the previous paragraph.
2. RELATED WORK
Several works have aimed to train models that produce a perceptual timbre space. Esling et al. [9] trained a variational autoencoder to reconstruct audio samples of different timbres, using perceptual ratings from timbre space studies to regularize the space. Lostanlen et al. [10] collected timbre similarity judgments on 78 sounds using free sorting [11]. By aggregating responses from different participants, the sounds were assigned to 19 clusters according to their perceptual similarity. They, along with a subsequent study [12], developed metric learning models to capture this structure. Thoret et al. [13] and Pascal et al. [14] attempted to learn distance metrics directly from timbre space data but did not learn a single metric that generalizes across all datasets. The most similar work to ours is by Vahidi et al. [15], which evaluated the alignment of three representations with timbre similarity ratings. Additionally, the accompanying codebase [16] provides more implementations of alignment scores, but no results have been reported. Our work is a direct continuation, providing a holistic evaluation of various models and implementations with extended functionalities.

In broader domains, perceptual metrics have been developed to handle various levels of perturbations. They are either trained from scratch or fine-tuned using human perceptual judgments. For audio, Manocha et al. proposed DPAM [17] and CDPAM [18], which address low-level audio perturbations such as noise addition, reverb, and compression. In the image domain, Zhang et al. [19] modeled low-level perturbations including traditional distortions such as noise addition and blur, as well as CNN-based distortions. Fu et al. [20] explored mid-level perturbations related to pose, color, and shape. Muttenthaler et al. [21] investigated high-level perturbations involving different object-level concepts (e.g., “grass” versus “sand”). The smallest dataset used is NIGHTS [20] with 20K human judgments, while the largest is THINGS [22], which contains 4.7M human judgments. In particular, [19] demonstrated that deep learning representations, regardless of the training data, model, or task used, exhibit a promising ability to capture perceptual image similarity. Therefore, although current timbre space datasets provide only 2.6K pairwise judgments and are insufficient for training, they are worth being used for model evaluation.
3. DATA AND EXPERIMENTS
3.1 Data
We use data curated by Thoret et al. [13] and Vahidi et al. [15],¹ comprising a total of 21 datasets from 11 published psychoacoustic studies [1–3, 23–25, 27, 28, 31, 33, 34]. We present summary information of each dataset in Table 1. Each dataset contains a set of audio samples along with pairwise timbre similarity ratings. The sounds span a wide range: acoustic recordings of musical instrument notes, digitally edited acoustic samples designed to create simple or cross-instrument spectra, and ones from synthesizers and electromechanical instruments. Each similarity rating is an absolute value calculated by averaging across multiple human listeners. We compile all ratings from the 21 datasets into a large, sparse block-diagonal matrix, where the upper triangular part of each block represents the ratings of unique audio pairs in the corresponding dataset. Figure 1 illustrates three of these blocks, with symmetric pairwise ratings visualized by color. In total, the matrix includes 334 audio samples and 2,614 similarity ratings.
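
As a quick consistency check on these totals, the sketch below recomputes them from the per-dataset sound counts listed in Table 1, assuming that every unordered pair within a dataset carries exactly one averaged rating. The names `count_unique_pairs` and `build_block_diagonal` are illustrative helpers, not functions from the authors' code, and the toy ratings are made up only to show the matrix layout.

```python
# Number of sounds in each of the 21 datasets, in the order of Table 1.
DATASET_SIZES = [16, 16, 16, 16, 16, 18, 20, 17, 18, 15,
                 11, 11, 11, 24, 24, 14, 14, 14, 14, 14, 15]


def count_unique_pairs(n):
    """Unordered pairs among n sounds: the upper triangle of an n x n block."""
    return n * (n - 1) // 2


def build_block_diagonal(blocks):
    """Place square rating blocks on the diagonal of one large matrix.

    Cells outside the blocks (pairs from different datasets) and cells on or
    below the diagonal are left as None, i.e. they carry no rating.
    """
    total = sum(len(block) for block in blocks)
    matrix = [[None] * total for _ in range(total)]
    offset = 0
    for block in blocks:
        n = len(block)
        for i in range(n):
            for j in range(i + 1, n):  # upper triangle only
                matrix[offset + i][offset + j] = block[i][j]
        offset += n
    return matrix


total_sounds = sum(DATASET_SIZES)
total_pairs = sum(count_unique_pairs(n) for n in DATASET_SIZES)

# Two toy datasets with made-up ratings, only to show the layout.
toy = build_block_diagonal([
    [[0.0, 0.2, 0.9], [0.2, 0.0, 0.5], [0.9, 0.5, 0.0]],
    [[0.0, 0.4], [0.4, 0.0]],
])
rated_cells = sum(value is not None for row in toy for value in row)
```

Under the stated assumption, the counts come to 334 sounds and 2,614 pairs, which agrees with the totals reported for the compiled matrix.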
3.2 Metrics
Our goal is to investigate whether an audio model can perceive timbre similarity in a way that aligns with human perception. To this end, we compute pairwise distances between audio representations produced by the model to estimate similarity ratings, which we then compare with real human ratings to obtain alignment scores. Conceptually, the pairwise distances of audio representations form a predicted similarity matrix, which we compare against the ground truth similarity matrix, consisting of human ratings.

To conduct this evaluation, we first obtain equally shaped representations for audio of different lengths, in preparation for the distance computation. Second, we employ a distance function to quantify their similarity. Third, we require metrics to measure the alignment between the predicted and true similarity matrices. Below, we present our approach to these three steps, resulting in two methods of handling variable length (see next section for cases with only one viable option), two methods of computing distances, and five ways of computing alignment scores. This gives a total of 10 or 20 scores for each representation (one model can produce multiple representations). Additionally, we have developed this process into an easy-to-use Python package, as described in Section 3.2.4.

¹ https://github.com/ben-hayes/timbre-dissimilarity-metrics

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

| Study | Dataset | No. sounds | Pitch | Length (s) | Loudness (dB LUFS) | Type of sounds | No. raters∗ |
|---|---|---|---|---|---|---|---|
| Grey (1977) [23] | – | 16 | E♭4 | 0.27 ± 0.03 | -14.61 ± 1.57 | Synthesized from acoustic | 22 |
| Grey & Gordon (1978) [24] | – | 16 | E♭4 | 0.27 ± 0.03 | -14.87 ± 1.76 | Same as above, spectral envelopes traded | 22 |
| Iverson & Krumhansl (1993) [25] | Whole | 16 | C4 | 3.19 ± 0.69 | -16.62 ± 3.99 | Acoustic [26] | 10 |
| | Onset | 16 | C4 | 0.11 ± 0.01 | -23.24 ± 9.10 | Same as above, only ∼80 ms attack | 9 |
| | Remainder | 16 | C4 | 3.10 ± 0.70 | -16.71 ± 4.13 | Same as above, ∼80 ms attack removed | 9 |
| McAdams et al. (1995) [27] | – | 18 | E♭4 | 0.69 ± 0.19 | -18.36 ± 4.05 | FM simulated and hybrids | 22 |
| Lakatos (2000) [28] | Combined | 20 | E♭4 | 1.50 | -24.08 ± 2.98 | Acoustic [26] | 34 |
| | Harmonic | 17 | E♭4 | 1.50 | -25.03 ± 2.36 | Same as above, only harmonic sounds | 34 |
| | Percussive | 18 | E♭4 | 1.49 ± 0.05 | -23.97 ± 3.83 | Same as above, only percussive sounds | 34 |
| Barthet et al. (2010) [1] | – | 15 | E♭4 | 1.56 ± 0.04 | -23.77 ± 1.13 | Physical modeling, clarinet only | 16 |
| Patil et al. (2012) [2] | A3 | 11 | A3 | 0.25 | -19.03 ± 0.85 | Acoustic [29] | 6 |
| | D4 | 11 | D4 | 0.25 | -19.17 ± 0.81 | Same as above | 20 |
| | G♯4 | 11 | G♯4 | 0.25 | -18.97 ± 0.81 | Same as above | 6 |
| Zacharakis et al. (2015) [3] | Greek raters | 24 | A1–4 | 1.30 | -29.36 ± 3.84 | Acoustic [30] and synthesizers | 33 |
| | English raters | 24 | A1–4 | 1.30 | -29.36 ± 3.84 | Same as above | 20 |
| Siedenburg et al. (2016) [31] | Exp 2A Set 1 | 14 | E♭4 | 0.50 | -23.56 ± 3.15 | Acoustic [32] | 24 |
| | Exp 2A Set 2 | 14 | E♭4 | 0.50 | -23.77 ± 1.90 | Chimeric (spectral envelopes traded) | 24 |
| | Exp 2A Set 3 | 14 | E♭4 | 0.50 | -23.21 ± 2.37 | Acoustic and chimeric | 24 |
| | Exp 2B | 14 | E♭4 | 0.50 | -23.21 ± 2.37 | Same as above | 24 |
| Saitis & Siedenburg (2020) [33] | GEdissim | 14 | E♭4 | 0.50 | -23.56 ± 3.15 | Same as Exp 2A Set 1 in [31, 32] | 40 |
| Vahidi et al. (2020) [34] | – | 15 | A4 | 1.00 | -9.73 ± 3.09 | Subtractive synthesis | 35 |

∗Raters are typically musicians or with some sort of musical background. Some studies used a mixture of musicians and non-musicians.

Table 1. Summary of the 21 timbre similarity datasets. We computed integrated loudness of each sample using pyloudnorm [35] with a block size of 0.08 seconds, slightly shorter than the shortest sample. While some dynamic variations can be observed, loudness is typically reported to have been normalized by expert listeners.

3.2.1 Handling Variable Audio Length
To produce representations of audio, models typically use an analysis window with a fixed length, which slides over time with a hop length to produce time-varying representations. As shown in Table 1, some timbre space datasets have fixed audio lengths, while others do not. This creates the need to compute identically shaped representations for audio of different lengths, as models employing sliding windows inherently produce representations with a time dimension proportional to the input length. To address this, we provide two approaches. The first is to squash the time dimension by computing the average over it. The second, which we call dynamic-length, matches input lengths by padding zeros to the right (or truncating in very few cases) to arrive at the same length. Below are a few scenarios. If the model has a long enough analysis window that covers all input lengths of the timbre space data, all audio samples are padded to this length. For this case, we will not compute the time-averaged version, as the representation has only one frame along the time dimension. And if the model operates with a rather short analysis window, the solution is to always dynamically pad the shorter audio within a pair to match the length of the longer one, ensuring the shapes of their representations match. We only truncate audio samples in one case, and that is when the model is sensitive to time shifting and does not accept sliding. We pad the shorter audio and truncate the longer audio to match the fixed window length. In this case we will also not compute the time-averaged version. We describe in Section 3.4 specifically how the dynamic-length approach is applied to each model.
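
The scenarios above can be sketched with a few small helpers. This is a minimal illustration on plain Python lists, assuming waveforms are sequences of samples and representations are lists of per-frame feature vectors; the function names are hypothetical and do not come from the paper's package.

```python
def time_average(frames):
    """Collapse T frames of D features into one D-dimensional vector."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def pad_right(wave, length):
    """Append zeros so the waveform reaches the requested length."""
    return list(wave) + [0.0] * (length - len(wave))


def match_pair(wave_a, wave_b):
    """Dynamic padding: pad the shorter waveform of a pair to the longer one."""
    target = max(len(wave_a), len(wave_b))
    return pad_right(wave_a, target), pad_right(wave_b, target)


def fit_to_window(wave, window):
    """Fixed window: pad short inputs with zeros and truncate long ones."""
    return pad_right(wave, max(window, len(wave)))[:window]


averaged = time_average([[1.0, 2.0], [3.0, 4.0]])   # two frames, two features
padded_a, padded_b = match_pair([0.5, 0.5, 0.5], [1.0])
truncated = fit_to_window([1.0, 2.0, 3.0], 2)
extended = fit_to_window([1.0], 3)
```

Time averaging removes the time axis entirely, whereas the two padding helpers act on the waveform before the model is applied, so the model itself decides how many frames result.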
3.2.2 Distance Functions
We compute pairwise distances using two functions: the ℓ2 distance and cosine distance (defined as one minus the cosine similarity). These functions are applied to flattened audio representations.
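
A minimal sketch of the two distance functions on flattened representations follows. It uses only the standard library and assumes non-zero vectors of equal length; the helper names are illustrative.

```python
import math


def flatten(frames):
    """Turn a list of per-frame feature vectors into one flat vector."""
    return [value for frame in frames for value in frame]


def l2_distance(a, b):
    """Euclidean (l2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


rep_a = flatten([[0.0, 0.0], [1.0, 0.0]])
rep_b = flatten([[3.0, 4.0], [1.0, 0.0]])
```

Note that the cosine distance ignores vector magnitude, so two representations that differ only by a scale factor are at distance zero, while their ℓ2 distance is not.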
3.2.3 Alignment Scores
The alignment scores are computed between the predicted similarity matrix and the ground truth similarity matrix, both of which are diagonal block matrices. We note that only the pairs within the diagonal blocks are valid, have similarity ratings, and are used to compute the alignment scores. Pairs outside the diagonal blocks lack human ratings as samples are from different datasets. Before computation, the values in each dataset-level block matrix (from both the predicted and ground truth matrices) are rescaled to the range [0, 1].

We compute five alignment scores in total. The first is the mean absolute error (MAE). It computes the absolute error between ratings of each unique audio pair and averages the errors across all pairs. The other four are rank-based scores that evaluate how well the model ranks a list of audio samples based on their timbre similarity to a given sample. This is done by comparing each row of the two similarity matrices (column-wise comparison yields the same results due to symmetry), which contains the similarity ratings between a reference sample (whose index corresponds to the row index) and all other samples from the same dataset, where the relative ordering of these ratings determines the timbre similarity ranking with respect to the reference sample. Sorting the ratings per row yields N rankings, one for each sample (each sample is used once as the reference sample), where N is the total number of samples. The predicted ranking from the computed matrix is then compared to the true ranking from the ground-truth matrix. The final rank-based alignment scores are obtained by computing a score per row and averaging across all rows.
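
The rescaling step and the MAE can be sketched for a single dataset block as follows. The text does not specify how the rescaling to [0, 1] is done, so min-max scaling over the unique pairs of a block is an assumption here, as is treating both matrices as holding dissimilarities; `rescale_block` and `mean_absolute_error` are hypothetical names.

```python
def rescale_block(block):
    """Min-max rescale the pairwise values of one dataset block to [0, 1].

    Assumes a symmetric square block whose off-diagonal values are not all
    equal; the diagonal is set to zero.
    """
    n = len(block)
    values = [block[i][j] for i in range(n) for j in range(i + 1, n)]
    low, high = min(values), max(values)
    return [[(block[i][j] - low) / (high - low) if i != j else 0.0
             for j in range(n)] for i in range(n)]


def mean_absolute_error(pred, true):
    """Average absolute error over the unique pairs (upper triangle)."""
    n = len(true)
    errors = [abs(pred[i][j] - true[i][j])
              for i in range(n) for j in range(i + 1, n)]
    return sum(errors) / len(errors)


# Toy block with three sounds; numbers are made up for illustration.
human = rescale_block([[0, 1, 3], [1, 0, 2], [3, 2, 0]])
model = rescale_block([[0, 2, 4], [2, 0, 4], [4, 4, 0]])
score = mean_absolute_error(model, human)
```

In this toy case two of the three pairs match exactly after rescaling and the third differs by 0.5, so the MAE is 0.5 / 3.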

We use four rank-based metrics for per-row comparisons: Kendall’s rank correlation coefficient, Normalized Discounted Cumulative Gain (NDCG) that treats timbre similarity (rather than dissimilarity) as relevance, Spearman’s rank correlation coefficient, and triplet agreement with an adjustable margin. For the first three standard metrics, we use implementations from torchmetrics [36]. For triplet agreement, we first extract all triplets from the ground-truth row that satisfy the given margin condition; that is, the absolute difference between two dissimilarity ratings must be greater than the margin value. This mimics a just-noticeable difference in human perception, where rating differences smaller than this threshold are considered perceptually equivalent to the reference and are therefore excluded from the evaluation. The margin is set to 0.1 in our evaluation. To illustrate, a triplet consists of three audio samples (i, j, k), where i is the reference sample and the similarity between i and j is compared to the similarity between i and k. We then compute the triplet agreement rate as the ratio of triplets for which both matrix rows agree on the relative ordering of j and k with respect to i, to the total number of extracted triplets.
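
The triplet agreement described above can be sketched as follows for one dataset block of rescaled dissimilarities. This is an illustrative reading of the procedure, not the package's implementation: the function names are hypothetical, and skipping rows that yield no valid triplet is an assumption.

```python
def row_triplet_agreement(pred_row, true_row, ref, margin=0.1):
    """Return (agreeing, total) triplet counts for one reference sample."""
    agreeing = total = 0
    others = [idx for idx in range(len(true_row)) if idx != ref]
    for pos, j in enumerate(others):
        for k in others[pos + 1:]:
            true_diff = true_row[j] - true_row[k]
            if abs(true_diff) <= margin:
                continue  # below the margin: treated as perceptually equivalent
            total += 1
            pred_diff = pred_row[j] - pred_row[k]
            if pred_diff * true_diff > 0:  # same sign: same relative ordering
                agreeing += 1
    return agreeing, total


def triplet_agreement(pred, true, margin=0.1):
    """Average the per-row agreement rate over rows with at least one triplet."""
    rates = []
    for ref in range(len(true)):
        agreeing, total = row_triplet_agreement(pred[ref], true[ref], ref, margin)
        if total:
            rates.append(agreeing / total)
    return sum(rates) / len(rates)


# Toy rows with sample 0 as the reference; values are made up.
counts = row_triplet_agreement([0.0, 0.5, 0.4, 0.1], [0.0, 0.2, 0.9, 0.25], ref=0)

true_block = [[0.0, 0.2, 0.9], [0.2, 0.0, 0.5], [0.9, 0.5, 0.0]]
flipped = [[0.0, 0.9, 0.2], [0.9, 0.0, 0.5], [0.2, 0.5, 0.0]]
```

In the toy row, the pair (1, 3) is dropped because its ground-truth ratings differ by only 0.05, leaving two triplets of which one is ordered correctly.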
3.2.4 The Python Package
Our Python package² timbremetrics simplifies the evaluation process by requiring only the input of a model capable of converting raw waveforms into audio representations to compute the alignment of that model with timbre similarity perception. This package supports all evaluations described in this section and additionally supports three distance functions: ℓ1, negative dot product, and Poincaré distance, a commonly used hyperbolic distance [37]. It is compatible with audio models interfaced by the Fréchet Audio Distance Toolkit [38].³ Furthermore, it can be used as a convenient training-time evaluation to inspect whether the model can acquire human-like ability to perceive timbre similarity during the learning process.
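
For reference, the three additional distance functions can be written in a few lines. This is a standalone sketch using the usual textbook definitions, not the package's own code; in particular, the Poincaré distance below uses the standard Poincaré-ball formula and assumes both points lie strictly inside the unit ball.

```python
import math


def l1_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def negative_dot_product(a, b):
    """Larger dot products (more similar vectors) give smaller values."""
    return -sum(x * y for x, y in zip(a, b))


def poincare_distance(u, v):
    """Hyperbolic distance in the Poincare ball; requires |u| < 1 and |v| < 1."""
    sq_diff = sum((x - y) ** 2 for x, y in zip(u, v))
    sq_u = sum(x * x for x in u)
    sq_v = sum(y * y for y in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


origin_to_half = poincare_distance([0.0, 0.0], [0.5, 0.0])
```

The negative dot product is not a true metric (it can be negative and is not zero for identical inputs), which matters only if a downstream step assumes non-negative distances.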
3.3 Sound Matching
We train a sound matching model that inverts the Vital synthesizer⁴ by predicting the synthesis parameters from synthesized audio.⁵ We are interested in whether, via this task, the model can learn meaningful representations that align with timbre similarity perception. We are motivated by the fact that synthesizer parameters encode all audio content using very little storage, and we expect that by learning this compression, we can obtain compact but expressive intermediate representations.

² https://github.com/tiianhk/timbremetrics
³ https://github.com/microsoft/fadtk
⁴ https://vital.audio
⁵ Code available at https://github.com/tiianhk/sm4tp
3.3.1 Data Generation
We use Vita,⁶ a package that provides Python bindings for the Vital Synthesizer, to generate our dataset. Ten parameters and their ranges are subjectively selected to produce pronounced timbral changes when varied. These parameters include one discrete selection from seven basic waveshapes (e.g., sine wave, triangle wave) in the wavetable, two to distort the waveshape, two from a unison effect, three from the ADSR envelope, and two from an EQ. Of these, two are discrete parameters, and the remaining eight are continuous parameters. We uniformly sample each parameter to generate data. A random pitch is also sampled, but not as a prediction target, as pitch is considered unrelated to timbre. We generate a total of 500k samples, each with a maximum length of two seconds, which takes ∼8 hours on a single CPU core. All continuous parameters are rescaled to the range [0, 1] to be used as regression targets.
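
The sampling and rescaling steps can be illustrated without the synthesizer itself. In the sketch below, every parameter name and range is hypothetical and chosen only for illustration (the paper does not list them here), only three of the eight continuous parameters are mocked up, and no Vita call is made.

```python
import random

# Hypothetical parameter specs; names and ranges are illustrative only.
CONTINUOUS_RANGES = {
    "distortion_amount": (0.0, 1.0),
    "attack_seconds": (0.0, 0.5),
    "eq_gain_db": (-12.0, 12.0),
}
DISCRETE_CHOICES = {"waveshape": 7}  # seven basic waveshapes


def sample_patch(rng):
    """Draw every parameter uniformly from its range or set of choices."""
    patch = {name: rng.uniform(low, high)
             for name, (low, high) in CONTINUOUS_RANGES.items()}
    patch.update({name: rng.randrange(count)
                  for name, count in DISCRETE_CHOICES.items()})
    return patch


def regression_targets(patch):
    """Rescale each continuous parameter to [0, 1] for use as a target."""
    return {name: (patch[name] - low) / (high - low)
            for name, (low, high) in CONTINUOUS_RANGES.items()}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
patch = sample_patch(rng)
targets = regression_targets(patch)
```

Discrete parameters are kept as class indices rather than rescaled, which matches their use as classification targets.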
3.3.2 Model Architecture
The model starts with an initial convolutional layer with a large kernel size to capture low-level features, followed by batch normalization, ReLU activation, and max pooling. The core of the model comprises four residual blocks with increasing channel dimensions [39]. Each residual block contains two convolutional layers with batch normalization and ReLU, along with a shortcut connection to preserve gradient flow. After feature extraction, a global average pooling layer reduces spatial dimensions, yielding a fixed-length embedding vector of size 256. This embedding is then fed into the output heads: a regression head that produces 8 continuous values between 0 and 1, and two classification heads for discrete predictions. Our model has ∼5M parameters.
3.3.3 Training
For the discrete parameters, we compute cross-entropy losses. For the continuous parameters, we compute the ℓ1 losses. Losses are added without weighting. We use an 8/2 train-validation split, a batch size of 32, and the Adam optimizer with a learning rate of 1e-4. The model is trained for 100 epochs, which takes ∼12 hours on an A100 GPU. We observe that the validation loss of each parameter has converged.
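
The unweighted combination of losses can be sketched for a single training example as below. This is a plain-Python illustration, not the training code: the mean reduction for the ℓ1 term and the function names are assumptions.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target class, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target_index]


def l1_loss(predicted, target):
    """Mean absolute error over the continuous parameters (assumed reduction)."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def total_loss(discrete_logits, discrete_targets, continuous_pred, continuous_target):
    """Sum one cross-entropy term per discrete head and the l1 term, unweighted."""
    classification = sum(cross_entropy(logits, index)
                         for logits, index in zip(discrete_logits, discrete_targets))
    return classification + l1_loss(continuous_pred, continuous_target)


# One discrete head with two classes and two continuous parameters.
loss = total_loss([[0.0, 0.0]], [0], [0.5, 1.0], [0.0, 1.0])
```

With equal logits over two classes the cross-entropy is log 2, and the ℓ1 term is 0.25, so the example loss is log 2 + 0.25.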
3.3.4 Task and Style Embeddings
We extract and evaluate three representations from the sound matching model. The first representation is the 256-dimensional embedding obtained right before the prediction heads, which encodes information necessary for solving the task, in our case synthesis parameter prediction. We refer to this as the task embedding. The other two are inspired by image style transfer [40, 41]. We refer to them as the style embeddings. Our motivation for using style embeddings is twofold: first, they are invariant to the spatial location of audio events on spectrograms; and second, they use features from different layers of a model, thus capturing multiple levels of abstraction, a property shown to be effective for modeling perceptual timbre similarity [14]. The first style embedding, proposed by Gatys et al. [40], is computed as a Gram matrix of feature activations, where each entry captures the inner product between two channels across the spatial dimensions, resulting in a symmetric matrix that captures correlations between channels. The second style embedding is shown to be effective by Huang and Belongie [41] and is computed as the channel-wise mean and standard deviation of activations over the spatial dimensions. Given a feature map with shape (B, C, H, W), where B is the batch size, C is the channel number, and H and W are the spatial dimensions, i.e. height and width, the style embedding of Gatys et al. has shape (B, C, C), and the style embedding of Huang and Belongie has shape (B, 2C). With our model, we extract both Gatys style embeddings and Huang style embeddings from five intermediate convolutional layers: the initial convolutional layer, and the first convolutional layers in each of the four residual blocks. Feature maps are obtained after batch normalization. We concatenate the embeddings obtained from different layers, resulting in a single Gatys style embedding and a single Huang style embedding per input. Together with the task embedding, we abbreviate them as s.m.-task, s.m.-Gatys, and s.m.-Huang.

⁶ https://github.com/DBraun/Vita
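
Both style embeddings can be sketched for a single feature map as follows. The feature map is represented as C channels, each holding its H×W activations flattened into a list. The Gram matrix is left unnormalized and the standard deviation is the population version; both choices are assumptions, since the text does not specify them, and the function names are illustrative.

```python
import math


def gatys_style(features):
    """C x C Gram matrix: inner products between channels over all positions."""
    return [[sum(a * b for a, b in zip(chan_c, chan_d)) for chan_d in features]
            for chan_c in features]


def huang_style(features):
    """Channel-wise means followed by channel-wise standard deviations (2C values)."""
    means = [sum(chan) / len(chan) for chan in features]
    stds = [math.sqrt(sum((x - m) ** 2 for x in chan) / len(chan))
            for chan, m in zip(features, means)]
    return means + stds


# Two channels with four spatial positions each (e.g. H = W = 2, flattened).
feature_map = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
# The same activations with the spatial positions reversed in every channel.
moved = [[4.0, 3.0, 2.0, 1.0], [2.0, 2.0, 0.0, 0.0]]

gram = gatys_style(feature_map)
stats = huang_style(feature_map)
```

Reordering the spatial positions identically in every channel leaves both embeddings unchanged, which illustrates the invariance to spatial location mentioned above.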
3.4 Other Models and Dynamic Length Adaption
We evaluate signal-processing-based methods⁷ that do not learn from data, including mel-frequency cepstral coefficients (MFCC), multi-scale spectrograms (MSS) [7], and the joint time-frequency scattering transform (JTFS) [43]. We also evaluate pre-trained audio models interfaced through the Fréchet Audio Distance Toolkit [38], which are typically trained on large datasets and used to evaluate generative models, as their representations are considered to correlate well with perceptual music quality. This includes three CLAP models trained with natural language supervision [44, 45]; two CDPAM models trained on perceptual ratings of audio quality [18]; and neural audio codecs including two Encodec models [46] and one Descript Audio Codec (DAC) [47], which compress audio into lower-bitrate latent representations. Additionally, we evaluate Music2Latent [48] and a reproduction of the Complex Autoencoder (CAE) [49] that produces representations invariant to transposition and time-shift.⁸

In Section 3.2.1, we discussed how to produce equally shaped representations for audio of different lengths. Here, we describe how this is specifically achieved for each model. Since the longest sample in our data is 4.39 seconds (see Table 1), we compute just one frame of representation for the following models, with their respective analysis window lengths indicated in parentheses: CLAP (ten seconds for [45] and seven seconds for [44]) and CDPAM (five seconds). For these models, audio samples are padded to match the window length. The sound matching model has an analysis window of two seconds (which covers 19 out of 21 datasets) and is trained to capture the ADSR envelope, so the analysis window cannot be shifted. Therefore, we pad or truncate samples to a duration of two seconds. The above six models produce eight unique representations, for which we do not compute the time average since they contain only one time frame. For all other models, representations are computed using both time averaging and dynamic padding, where the shorter sample is padded to match the longer one within an audio pair.

⁷ For audio at 44.1 kHz, MFCC is computed with n_mfcc=40, MSS is computed with fft_sizes=(4096, 2048, 1024, 512, 256, 128), and JTFS is computed with J = 12, Q = (8, 2), J_fr = 3, and Q_fr = 2 using the Kymatio package [42].
⁸ Trained with piano music from the MAPS dataset [50].

Figure 2. Evolution of alignment scores during the training of the sound matching model. The left column shows the detailed changes with the first epoch zoomed in, while the right column shows the overall progress across the total 100 epochs. For MAE, lower scores are better; for other metrics, higher scores are better. The best scores refer to the highest (or lowest, for MAE) values of each metric in Figure 3.

We also evaluate style embeddings extracted from one CLAP model [44], which differs from the sound matching model not only in training objective but also in architecture: it uses a Transformer backbone. However, style embeddings can be computed in a similar way: the internal representations produced by the Transformer consist of spatial tokens with a feature dimension, analogous to spatial locations and channel dimensions in CNNs. Therefore, we treat the transformer’s feature dimension as equivalent to the channel dimension in CNNs. Following this analogy, we compute Gatys style embeddings by measuring correlations between feature dimensions, and Huang style embeddings by computing statistics (mean and variance) across each feature. We compute style embeddings using the outputs from each Swin Transformer block in the first three layers. Each layer contains multiple blocks, with nine blocks in total. The resulting embeddings are concatenated into a single embedding per input for evaluation.

Figure 3. Evaluation results. Each configuration shown at the top is represented by a unique color. For each alignment score, the best result within each configuration is marked with a star, while a yellow dot behind the star indicates the overall winner. For MAE, lower scores are better; for other metrics, higher scores are better. Representations are ordered by the mean triplet agreement across configurations, in descending order. For text labels on the x-axis, the same background color indicates different representations extracted from the same model.
4. RESULTS AND DISCUSSION
In Figures 2 and 3, we omit the Kendall and Spearman scores, as they are highly correlated with triplet agreement and produce nearly identical rankings of the evaluated representations. We choose to report triplet agreement instead, as it is a more intuitive metric to interpret than the other two.

Figure 2 shows training-time alignment scores of our sound matching model on all 21 datasets. We use a three-fold validation-test split to ensure a fair comparison with other models shown in Figure 3. For each score, we select the best-performing model based on validation performance using only checkpoints from the first epoch. Final results are reported as averages over the test folds. The style embeddings converge quickly, retaining high alignment with human timbre judgments, whereas the task embedding overfits and loses generalization over time. In Figure 3, the style embeddings from both the sound matching model and the CLAP model show clear improvements over their base representations, demonstrating the effectiveness of style embeddings regardless of training objective or model architecture. In particular, the Huang style embedding extracted from the CLAP model shows the strongest performance.

MFCC remains competitive and outperforms many trained models. By contrast, CDPAM, trained on low-level distortion judgments of speech, does not adapt well to musical timbre. This may be due to differences in data domain, or it may echo findings from DreamSim [20], which suggest that perceptual similarity learned for one type of perturbation does not generalize to others. MSS also underperforms, consistent with prior work [51] showing that spectral distances can be problematic for capturing perceptual similarity of pitch.

Interestingly, Encodec’s 24k model aligns better with human ratings than the 48k model, despite the former being trained on a variety of audio data, whereas the latter is trained exclusively on music. This suggests that increased compression combined with broader training data may encourage the model to discard irrelevant details while preserving perceptually meaningful structure, resulting in a more efficient internal timbre representation. This also highlights the effectiveness of Music2Latent, which aligns better with human ratings, has a high compression rate, and is trained on both music and speech. JTFS performs moderately on rank-based metrics but poorly on MAE. This suggests that its distance scale may be nonlinearly stretched relative to human perception, preserving the order of examples but distorting their absolute differences.
5. CONCLUSION
In this paper, we introduced a unified evaluation framework to compare model-derived distances with human similarity ratings from 21 classic timbre space datasets, encompassing a wide range of musical instrument sounds. We assessed both hand-crafted features (e.g., MFCC) and deep learning-based representations (e.g., CLAP, CDPAM, neural audio codecs), as well as a newly proposed sound matching model that inverts a wavetable synthesizer. Our results showed that style embeddings extracted from different models outperformed their base representations, and in particular, the Huang style embedding from the CLAP model is markedly superior to the others. To encourage further work, we provide a Python package that implements all our metrics and procedures. We hope this evaluation framework and Python package will encourage advancements in timbre metrics across tasks like generative modeling and instrument retrieval.
6. ETHICS STATEMENTS
This work evaluates models using datasets that primarily feature Western musical instruments, which may reflect a cultural bias toward Western music traditions. We acknowledge this limitation and are enthusiastic about including non-Western musical data in our evaluation framework, as it may both enhance cultural diversity and help reveal biases in the models’ behavior under broader musical contexts.
7. ACKNOWLEDGEMENTS
We thank Mathieu Lagrange for the valuable discussions. This work is supported by the EPSRC UKRI Centre for Doctoral Training in Artificial Intelligence and Music (grant number EP/S022694/1). This research utilized Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT. http://doi.org/10.5281/zenodo.438045.
8. REFERENCES
[1]
M. Ba he , P. Guillemain, R. K onland-Ma ine , and
S. Ys ad, “F om cla ine con ol o imb e pe cep ion,”
Ac a Acus ica uni ed wi h Acus ica, ol. 96, no. 4, pp.
678–689, 2010.
[2]
K. Pa il, D. P essni ze , S. Shamma, and M. Elhilali,
“Music in ou ea s: he biological bases o musical imb e
pe cep ion,” PLoS Compu a ional Biology, ol. 8, no. 11,
p. e1002759, 2012.
[3]
A. Zacha akis, K. Pas iadis, and J. D. Reiss, “An in-
e language uni ica ion o musical imb e: B idging
seman ic, pe cep ual, and acous ic dimensions,” Music
Pe cep ion: An In e disciplina y Jou nal, ol. 32, no. 4,
pp. 394–412, 2015.
[4]
S. McAdams, “The pe cep ual ep esen a ion o imb e,”
Timb e: Acous ics, Pe cep ion, and Cogni ion, pp. 23–
57, 2019.
[5]
T. M. Ellio , L. S. Hamil on, and F. E. Theunissen,
“Acous ic s uc u e o he i e pe cep ual dimensions o
imb e in o ches al ins umen ones,” The Jou nal o
he Acous ical Socie y o Ame ica, ol. 133, no. 1, pp.
389–404, 2013.
[6]
J. Engel, C. Resnick, A. Robe s, S. Dieleman,
M. No ouzi, D. Eck, and K. Simonyan, “Neu al au-
dio syn hesis o musical no es wi h wa ene au oen-
code s,” in In e na ional Con e ence on Machine Lea n-
ing. PMLR, 2017, pp. 1068–1077.
[7]
J. Engel, L. H. Han akul, C. Gu, and A. Robe s,
“DDSP: Di e en iable digi al signal p ocessing,” in
In e na ional Con e ence on Lea ning Rep esen a ions,
2020. [Online]. A ailable: h ps://open e iew.ne /
o um?id=B1x1ma4 D
[8]
D. L. Wessel, “Timb e space as a musical con ol s uc-
u e,” Compu e Music Jou nal, pp. 45–52, 1979.
[9]
P. Esling, A. Chemla-Romeu-San os, and A. Bi on,
“Gene a i e imb e spaces wi h a ia ional audio syn he-
sis,” in P oceedings o he In e na ional Con e ence on
Digi al Audio E ec s (DAFx), 2018, pp. 175–181.
[10]
V. Los anlen, C. El-Hajj, M. Rossignol, G. La ay,
J. Andén, and M. Lag ange, “Time– equency sca e ing
accu a ely models audi o y simila i ies be ween ins u-
men al playing echniques,” EURASIP Jou nal on Audio,
Speech, and Music P ocessing, ol. 2021, no. 1, p. 3,
2021.
[11]
S. Cholle , D. Valen in, and H. Abdi, “F ee so ing
ask,” No el Techniques in Senso y Cha ac e iza ion
and Consume P o iling, ol. 207, 2014.
[12]
C. Vahidi, S. Singh, E. Bene os, H. Phan, D. S owell,
G. Fazekas, and M. Lag ange, “Pe cep ual musical simi-
la i y me ic lea ning wi h g aph neu al ne wo ks,” in
2023 IEEE Wo kshop on Applica ions o Signal P ocess-
ing o Audio and Acous ics (WASPAA). IEEE, 2023,
pp. 1–5.
[13]
E. Tho e , B. Ca amiaux, P. Depalle, and S. Mcadams,
“Lea ning me ics on spec o empo al modula ions e-
eals he pe cep ion o musical ins umen imb e,” Na-
u e Human Beha iou , ol. 5, no. 3, pp. 369–377,
2021.
[14]
B. Pascal and M. Lag ange, “On he obus ness o mu-
sical imb e pe cep ion models: F om pe cep ual o
lea ned app oaches,” in 2024 32nd Eu opean Signal
P ocessing Con e ence (EUSIPCO). IEEE, 2024, pp.
41–45.
[15]
C. Vahidi, B. Hayes, C. Sai is, and G. Fazekas, “Acous-
ic ep esen a ions o pe cep ual imb e simila i y,” in
Digi al Music Resea ch Ne wo k One-Day Wo kshop
(DMRN+ 16), 2021.
[16]
B. Hayes and C. Vahidi, “Timb e dissimi-
la i y me ics,” h ps://gi hub.com/ben-hayes/
imb e-dissimila i y-me ics, 2021, accessed: 2025-03-
29.
[17]
P. Manocha, A. Finkels ein, R. Zhang, N. J. B yan,
G. J. Myso e, and Z. Jin, “A di e en iable pe cep ual
audio me ic lea ned om jus no iceable di e ences,”
in In e speech, 2020.
[18]
P. Manocha, Z. Jin, R. Zhang, and A. Finkels ein, “Cd-
pam: Con as i e lea ning o pe cep ual audio simi-
la i y,” in IEEE In e na ional Con e ence on Acous ics,
Speech and Signal P ocessing (ICASSP). IEEE, 2021,
pp. 196–200.
[19]
R. Zhang, P. Isola, A. A. E os, E. Shech man, and
O. Wang, “The un easonable e ec i eness o deep ea-
u es as a pe cep ual me ic,” in P oceedings o he IEEE
Con e ence on Compu e Vision and Pa e n Recogni-
ion, 2018, pp. 586–595.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
716
[20]
S. Fu, N. Tami , S. Sunda am, L. Chai, R. Zhang,
T. Dekel, and P. Isola, “D eamsim: Lea ning new dimen-
sions o human isual simila i y using syn he ic da a,”
in Ad ances in Neu al In o ma ion P ocessing Sys ems,
ol. 36, 2023, pp. 50 742–50 768.
[21]
L. Mu en hale , L. Linha d , J. Dippel, R. A. Vande -
meulen, K. He mann, A. Lampinen, and S. Ko nbli h,
“Imp o ing neu al ne wo k ep esen a ions using human
simila i y judgmen s,” Ad ances in Neu al In o ma ion
P ocessing Sys ems, ol. 36, pp. 50 978–51 007, 2023.
[22]
M. N. Heba , O. Con ie , L. Teichmann, A. H. Rock-
e , C. Y. Zheng, A. Kidde , A. Co i eau, M. Vazi i-
Pashkam, and C. I. Bake , “Things-da a, a mul imodal
collec ion o la ge-scale da ase s o in es iga ing objec
ep esen a ions in human b ain and beha io ,” Eli e,
ol. 12, p. e82580, 2023.
[23]
J. M. G ey, “Mul idimensional pe cep ual scaling o
musical imb es,” he Jou nal o he Acous ical Socie y
o Ame ica, ol. 61, no. 5, pp. 1270–1277, 1977.
[24]
J. M. G ey and J. W. Go don, “Pe cep ual e ec s o
spec al modi ica ions on musical imb es,” The Jou nal
o he Acous ical Socie y o Ame ica, ol. 63, no. 5, pp.
1493–1500, 1978.
[25]
P. I e son and C. L. K umhansl, “Isola ing he dynamic
a ibu es o musical imb ea,” The Jou nal o he Acous-
ical Socie y o Ame ica, ol. 94, no. 5, pp. 2595–2603,
1993.
[26]
F. Opolko and J. Wapnick, McGill Uni e si y mas e
samples (3 CDs). Quebec, Canada: McGill Uni e si y,
1987.
[27]
S. McAdams, S. Winsbe g, S. Donnadieu, G. De Soe e,
and J. K impho , “Pe cep ual scaling o syn hesized
musical imb es: Common dimensions, speci ici ies, and
la en subjec classes,” Psychological Resea ch, ol. 58,
pp. 177–192, 1995.
[28]
S. Laka os, “A common pe cep ual space o ha monic
and pe cussi e imb es,” Pe cep ion & Psychophysics,
ol. 62, no. 7, pp. 1426–1439, 2000.
[29]
M. Go o, H. Hashiguchi, T. Nishimu a, and R. Oka,
“RWC music da abase: Music gen e da abase and mu-
sical ins umen sound da abase,” in P oceedings o
he 4 h In e na ional Con e ence on Music In o ma ion
Re ie al (ISMIR), 2003.
[30]
F. Opolko and J. Wapnick, The McGill Uni e si y mas e
samples collec ion on DVD (3 DVDs). Quebec, Canada:
McGill Uni e si y, 1987.
[31]
K. Siedenbu g, K. Jones-Molle up, and S. McAdams,
“Acous ic and ca ego ical dissimila i y o musical im-
b e: E idence om asymme ies be ween acous ic and
chime ic sounds,” F on ie s in Psychology, ol. 6, p.
1977, 2016.
[32] Vienna Symphonic Lib a y, h ps://www. sl.co.a /.
[33]
C. Sai is and K. Siedenbu g, “B igh ness pe cep ion
o musical ins umen sounds: Rela ion o imb e dis-
simila i y and sou ce-cause ca ego ies,” The Jou nal o
he Acous ical Socie y o Ame ica, ol. 148, no. 4, pp.
2256–2266, 2020.
[34]
C. Vahidi, G. Fazekas, C. Sai is, and A. Palladini, “Tim-
b e space ep esen a ion o a sub ac i e syn hesize ,”
in P oceedings o he 2nd In e na ional Con e ence on
Timb e, 2020, p. 30–33.
[35]
C. J. S einme z and J. Reiss, “pyloudno m: A simple ye
lexible loudness me e in py hon,” in Audio Enginee ing
Socie y Con en ion 150. Audio Enginee ing Socie y,
2021.
[36]
Nicki Ska e De le sen, Ji i Bo o ec, Jus us Schock,
Ananya Ha sh, Teddy Koke , Luca Di Liello, Daniel
S ancl, Changsheng Quan, Maxim G echkin, and
William Falcon, “To chMe ics - Measu ing Rep o-
ducibili y in PyTo ch,” Feb. 2022. [Online]. A ailable:
h ps://gi hub.com/Ligh ning-AI/ o chme ics
[37]
V. Kh ulko , L. Mi akhabo a, E. Us ino a, I. Oselede s,
and V. Lempi sky, “Hype bolic image embeddings,” in
P oceedings o he IEEE/CVF Con e ence on Compu e
Vision and Pa e n Recogni ion, 2020, pp. 6418–6428.
[38]
A. Gui, H. Gampe , S. B aun, and D. Emmanouilidou,
“Adap ing eche audio dis ance o gene a i e music
e alua ion,” in IEEE In e na ional Con e ence on Acous-
ics, Speech and Signal P ocessing (ICASSP). IEEE,
2024, pp. 1331–1335.
[39]
K. He, X. Zhang, S. Ren, and J. Sun, “Deep esid-
ual lea ning o image ecogni ion,” in P oceedings o
he IEEE Con e ence on Compu e Vision and Pa e n
Recogni ion, 2016, pp. 770–778.
[40]
L. A. Ga ys, A. S. Ecke , and M. Be hge, “Image
s yle ans e using con olu ional neu al ne wo ks,” in
P oceedings o he IEEE Con e ence on Compu e Vision
and Pa e n Recogni ion, 2016, pp. 2414–2423.
[41]
X. Huang and S. Belongie, “A bi a y s yle ans e
in eal- ime wi h adap i e ins ance no maliza ion,” in
P oceedings o he IEEE In e na ional Con e ence on
Compu e Vision, 2017, pp. 1501–1510.
[42]
M. And eux, T. Angles, G. Exa chakis, R. Leona duzzi,
G. Roche e, L. Thi y, J. Za ka, S. Malla , J. Andén,
E. Belilo sky e al., “Kyma io: Sca e ing ans o ms in
py hon,” Jou nal o Machine Lea ning Resea ch, ol. 21,
no. 60, pp. 1–6, 2020.
[43]
J. Andén, V. Los anlen, and S. Malla , “Join ime–
equency sca e ing,” IEEE T ansac ions on Signal
P ocessing, ol. 67, no. 14, pp. 3704–3718, 2019.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
717
[44]
B. Elizalde, S. Deshmukh, and H. Wang, “Na u al
language supe ision o gene al-pu pose audio
ep esen a ions,” 2023. [Online]. A ailable: h ps:
//a xi .o g/abs/2309.05767
[45]
Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Be g-Ki kpa ick,
and S. Dubno , “La ge-scale con as i e language-audio
p e aining wi h ea u e usion and keywo d- o-cap ion
augmen a ion,” in IEEE In e na ional Con e ence on
Acous ics, Speech and Signal P ocessing (ICASSP).
IEEE, 2023.
[46]
A. Dé ossez, J. Cope , G. Synnae e, and Y. Adi,
“High ideli y neu al audio comp ession,” a Xi p ep in
a Xi :2210.13438, 2022.
[47]
R. Kuma , P. See ha aman, A. Luebs, I. Kuma , and
K. Kuma , “High- ideli y audio comp ession wi h im-
p o ed qgan,” Ad ances in Neu al In o ma ion P o-
cessing Sys ems, ol. 36, pp. 27 980–27 993, 2023.
[48]
M. Pasini, S. La ne , and G. Fazekas, “Music2la en :
Consis ency au oencode s o la en audio comp ession,”
P oceedings o he 25 h In e na ional Con e ence on
Music In o ma ion Re ie al (ISMIR), 2024.
[49]
S. La ne , M. Dö le , and A. A z , “Lea ning complex
basis unc ions o in a ian ep esen a ions o audio,”
in P oceedings o he 20 h In e na ional Con e ence on
Music In o ma ion Re ie al (ISMIR), 2019.
[50]
V. Emiya, N. Be in, B. Da id, and R. Badeau, “Maps-a
piano da abase o mul ipi ch es ima ion and au oma ic
ansc ip ion o music,” 2010.
[51]
J. Tu ian and M. Hen y, “I’m so y o you
loss: Spec ally-based audio dis ances a e bad
a pi ch,” in ”I Can’ Belie e I ’s No Be e !”
Neu IPS 2020 wo kshop, 2020. [Online]. A ailable:
h ps://open e iew.ne / o um?id=Z4UwGkTRTes
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
718
