
An Evaluation Strategy for Local Key Estimation: Exploiting Cross-Version Consistency

Author: Yiwei Ding; Yannik Venohr; Christof Weiss
Publisher: Zenodo
DOI: 10.5281/zenodo.17706357
Source: https://zenodo.org/records/17706357/files/000019.pdf
Yiwei Ding, Yannik Venohr, Christof Weiß
Center for Artificial Intelligence and Data Science (CAIDAS), University of Würzburg
{yiwei.ding, yannik.venohr, christof.weiss}@uni-wuerzburg.de
ABSTRACT
Local key estimation (LKE) is an important yet challenging task in music information retrieval since it involves a high level of musical abstraction, which entails ambiguity and low inter-annotator agreement. Relying on limited (small) datasets with a single annotation may introduce not only dataset bias but also annotator bias. To address such problems, we propose in this paper a novel, annotation-free evaluation strategy for LKE. To this end, we exploit datasets where multiple versions of the same musical work are available. We investigate the models' consistency across versions, expecting an effective and robust model to output similar predictions on different versions of the same work. In our experiments, we study the behavior of the proposed cross-version consistency measure using examples of different models and datasets, indicating a strong correlation between cross-version consistency and the models' effectiveness on in-domain data as well as their generalization to out-of-domain data. Our further studies show that, while being correlated to common evaluation metrics, cross-version consistency is also capturing different aspects of model behavior, thus serving as an additional figure of merit for evaluating LKE models.
1. INTRODUCTION
Harmony analysis of music audio recordings constitutes an essential part of music information retrieval (MIR) research. A central task in harmony analysis is local key estimation (LKE), which addresses tonal progressions and modulations on a coarse time scale. Unlike global key estimation, where a single key label is assigned to a piece, local key estimation involves first segmenting the audio and then labeling each segment individually. Although we focus on Western classical music with only 24 major and minor keys, LKE still presents several challenges. First, from a music theory perspective, local key can be inherently ambiguous to build up tension and interesting tonalities [1], and the evolution of compositional styles further complicates the search for universal rules. Second, local key is a perceptual notion: the short-term tonal center implied by a key label is interconnected with expectancy and listening experience, making it a highly abstract musical concept [2]. Third, as a consequence of this inherent ambiguity and perceptual nature, local key labels are often highly subjective [3], i.e., different music theorists may assign different key labels throughout a piece. As a result, local key annotations by a single annotator cannot be fully trusted; a so-called "ground-truth" annotation might not exist for LKE.
Figure 1: Example scenario: A model outputs different predictions for two different versions of the same work. (Plot labels: Version 1, Version 2; x-axis: Time (Measure), measures 30 to 34.)
© Y. Ding, Y. Venohr, and C. Weiß. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Ding, Y. Venohr, and C. Weiß, "An Evaluation Strategy for Local Key Estimation: Exploiting Cross-Version Consistency", in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Due to these challenges, evaluating LKE algorithms remains difficult. Common practices for evaluation are based on accuracy or related metrics computed on the test set [4]. However, the lack of multiple annotations for the test data can lead to an annotator-level overfitting, and the lack of large and diverse datasets can result in a dataset-level overfitting. Since creating large-scale datasets with multiple annotators requires music expertise as well as significant human efforts, this strategy does not scale to a desired dataset size. For this reason, we resort to evaluation strategies that require weak or even no annotations, which can be more scalable and less biased towards test set labels as compared to current evaluation metrics.
Since we focus on Western classical music, where multiple recorded performances (versions) that exactly follow the same musical score (work) are easily available, we propose in this paper to investigate cross-version consistency (CVC) as another evaluation strategy. Ideally, given the same musical content, an effective and robust model should yield the same predictions for different versions. Obviously, this is sometimes not the case, as exemplified in Figure 1. CVC quantifies how robust the models are against such version differences while the musical content stays the same. These version differences usually involve changes in the recording conditions, performers, interpretations, etc. Moreover, measuring the CVC only requires multi-version datasets with pairwise alignment, which are easier to curate than fully-annotated datasets and are independent of local key annotations, thus avoiding the problem of annotator bias.
As our main contributions in this paper, we (1) propose a framework to analyze the CVC of local key estimation, (2) carefully investigate the relationship between CVC and the common evaluation metrics, and (3) demonstrate that CVC is measuring related yet different aspects of model behavior, thus serving as an additional figure of merit for LKE evaluation.
The remainder of this paper is structured as follows: In Section 2, we review related work. Section 3 introduces our CVC measure. Section 4 outlines our experimental setup. In Section 5, we present our results by answering and discussing several research questions. In Section 6, we investigate different variants of the CVC that take music knowledge into account. Section 7 concludes the paper.¹
2. RELATED WORK
In this section, we review some of the related work, including evaluation metrics for global and local key estimation, and other works that exploit multi-version datasets.
2.1 Evaluation metrics for key estimation
For evaluating key estimation systems, most studies consider standard metrics such as the accuracy (or recall rate) and MIREX scores [4]. Accuracy is usually used for global key estimation [5], where one piece is often assigned a single key label. In LKE, however, there are often segments labeled "no key", which typically occurs due to local key ambiguity. To account for these "no key" labels, the recall rate is used instead of accuracy: frames labeled "no key" are ignored, and accuracy is computed only over the remaining frames [3, 6].
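The recall rate described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the paper; the integer class encoding and the `NO_KEY` marker are assumptions made for the example.

```python
from typing import Sequence

NO_KEY = -1  # hypothetical marker for frames annotated as "no key"


def key_recall(reference: Sequence[int], estimate: Sequence[int]) -> float:
    """Fraction of correct frames among frames that carry a key annotation."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Keep only frames whose reference annotation is an actual key.
    scored = [(r, e) for r, e in zip(reference, estimate) if r != NO_KEY]
    if not scored:
        raise ValueError("no annotated key frames to score")
    return sum(r == e for r, e in scored) / len(scored)


# Toy example: five frames, the third one is annotated as "no key".
reference = [0, 0, NO_KEY, 7, 7]
estimate = [0, 9, 3, 7, 7]
print(key_recall(reference, estimate))  # 3 of the 4 scored frames match
```

The prediction on the "no key" frame has no influence on the result, which is the difference to plain frame-wise accuracy.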
Nonetheless, the recall rate ignores the musical relationship between key labels and treats all the errors as the same. As shown in [3, 6], a large fraction of LKE errors corresponds to musically meaningful key relationships such as fifth errors (e.g., C:maj–G:maj), parallel errors (C:maj–C:min), and relative errors (C:maj–A:min).
To account for this, the MIREX score has been proposed for evaluation in the MIREX campaign [7] to go beyond a binary correct-or-wrong evaluation and give partial scores to these musically meaningful errors [8–10]. Specifically, the MIREX score assigns 0.5 points to fifth errors, 0.3 points to relative errors, and 0.2 points to parallel errors.
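To make the weighting concrete, the following sketch scores a single pair of key labels with the point values given above. The key encoding (tonic pitch class plus mode) is an assumption for this example, and so is the choice to count a fifth error in both directions (a fifth above or below the reference); existing toolkits may define the fifth error more narrowly.

```python
from typing import Tuple

# Hypothetical encoding: (tonic pitch class with C = 0, mode "maj" or "min").
Key = Tuple[int, str]


def mirex_score(reference: Key, estimate: Key) -> float:
    """Partial credit for one key estimate, using the weights stated in the text."""
    ref_tonic, ref_mode = reference
    est_tonic, est_mode = estimate
    interval = (est_tonic - ref_tonic) % 12
    if ref_mode == est_mode:
        if interval == 0:
            return 1.0  # correct key
        if interval in (5, 7):
            return 0.5  # fifth error (assumption: either direction)
        return 0.0
    if interval == 0:
        return 0.2  # parallel error, e.g. C:maj vs. C:min
    # Relative error: the relative minor lies 9 semitones above a major tonic,
    # the relative major lies 3 semitones above a minor tonic.
    if (ref_mode == "maj" and interval == 9) or (ref_mode == "min" and interval == 3):
        return 0.3
    return 0.0


C_MAJ, G_MAJ, A_MIN, C_MIN, D_MAJ = (0, "maj"), (7, "maj"), (9, "min"), (0, "min"), (2, "maj")
for estimate in (C_MAJ, G_MAJ, A_MIN, C_MIN, D_MAJ):
    print(estimate, mirex_score(C_MAJ, estimate))
```

A frame-wise MIREX score for a recording would then be the average of these values over all frames that carry a key annotation.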
Both recall rate and MIREX score require human annotations. In harmony analysis tasks that can be intrinsically ambiguous, this can lead to an annotator-level overfitting. These problems of annotator subjectivity have been shown for several harmony analysis tasks such as chord recognition [11–13] or LKE [3], where the inter-rater agreement can be as low as 75%. For these reasons, our strategy aims to evaluate LKE models with annotation-free techniques to obtain additional figures of merit.
¹ The code is publicly available at: https://github.com/suncerock/cvc-lke-ismir25
2.2 Exploiting multi-version datasets
Previous works have exploited multi-version datasets in different ways.
First, multi-version datasets can be used to improve harmony analysis. Konz and Müller [14] and Ewert et al. [15] identify the passages where the chord labels are consistent across different versions and find that these passages are likely to be correctly predicted. For resolving inconsistent passages, Konz et al. [16] employ cross-version fusion, yielding stabilized analysis results.
Second, multi-version datasets facilitate the detailed analysis of LKE results, providing an extra perspective on models' generalizability. Weiß et al. [3] study different dataset splits and find that for LKE, generalizing to unseen versions is much easier than generalizing to unseen works. Moreover, they perform a cross-annotator study and raise the concern that many LKE models overfit to certain annotators since models' recall rate can be higher than the rate of inter-rater agreement.
Third, multi-version datasets can be leveraged for domain adaptation, improving models' effectiveness in another domain. Liu and Weiß [17] utilize cross-version comparison as a consistency regularizer. Here, a model trained in the source domain (piano music) generates pseudo-labels in the target domain (orchestral music), followed by filtering out labels which are inconsistent across versions. Such techniques are shown to generate improved pseudo-labels for target domain training.
Finally, multi-version datasets can also be used for abstract representation learning. Krause et al. [18] employ a contrastive learning paradigm to learn musical features that are invariant under version shifts such as instrumentation and pitch-class activity.
Since these prior works either take cross-version consistency as a regularization during training or post-processing, or focus on analyzing the results on a specific (small) dataset, this paper investigates the consistency itself from a more general perspective. More specifically, we propose to use CVC as an evaluation strategy to measure models' robustness against version changes and analyze several use cases of this strategy for improving LKE.
3. CONSISTENCY MEASURE
In this section, we introduce our proposed CVC measures. The overall calculation process is illustrated in Figure 2. We use CVC to measure the consistency of a model's output across different versions of the same work.
To this end, we first require a set of audio tracks that represent different versions of the same work, and the model's predictions on these tracks. For instance, we consider a pair of predictions $y_1, y_2$ where $y_1 \in \mathbb{R}^{N \times d}$ and $y_2 \in \mathbb{R}^{M \times d}$ with $N$ and $M$ representing the number of time frames of the two recordings and $d$ being the dimension of the prediction. In LKE, we assume to have $d = 24$ local key classes, and each frame of $y_1$ and $y_2$ captures the probability distribution over the 24 classes.
Figure 2: Illustration of the consistency measure. (Diagram labels: Musical Time, Ver. 1, ..., Ver. 3, Pair-wise Sim., Avg., Cross-version consistency.)
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Next, since different versions of a work usually differ in (local) tempo and length, we need pairwise alignments between them. We compute this alignment using the synctoolbox Python package [19], yielding a warping path for each pair of tracks. We denote the warping path between $y_1$ and $y_2$ as $P$ with elements $p[l] = (n_l, m_l)$, $l \in [1 : L]$, meaning that the $n_l$-th frame in $y_1$ and the $m_l$-th frame in $y_2$ refer to the same position in the musical score. Then, given a similarity measure $s : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we compute the consistency $C$ between version 1 and 2 as:
$$C(y_1, y_2) = \frac{1}{L} \sum_{l=1}^{L} s\big(y_1[n_l],\, y_2[m_l]\big).$$
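The pairwise consistency can be sketched as follows, assuming that the warping path is already given as a list of frame index pairs. The path below is written by hand and uses 0-based indices, whereas the text counts from 1; no alignment library is called. The `same_argmax` similarity is only an illustrative stand-in for the similarity measure $s$.

```python
from typing import Callable, Sequence, Tuple

Frame = Sequence[float]


def pair_consistency(
    y1: Sequence[Frame],
    y2: Sequence[Frame],
    path: Sequence[Tuple[int, int]],
    similarity: Callable[[Frame, Frame], float],
) -> float:
    """Average frame similarity along a warping path of 0-based index pairs."""
    if not path:
        raise ValueError("warping path must not be empty")
    return sum(similarity(y1[n], y2[m]) for n, m in path) / len(path)


def argmax(frame: Frame) -> int:
    return max(range(len(frame)), key=lambda i: frame[i])


def same_argmax(p: Frame, q: Frame) -> float:
    """Illustrative similarity: 1 if both frames predict the same class, else 0."""
    return float(argmax(p) == argmax(q))


# Toy predictions with d = 3 classes; version 2 is one frame longer than version 1.
y1 = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
y2 = [[0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
path = [(0, 0), (1, 1), (1, 2), (2, 3)]  # hand-made alignment, L = 4

print(pair_consistency(y1, y2, path, same_argmax))  # three of four aligned pairs agree
```

Because the average runs over path elements rather than frames, a frame that is aligned to several frames of the other version (frame 1 of `y1` here) contributes once per alignment.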
Finally, the cross-version consistency (CVC) of one work is defined as the average consistency between all pairs of versions of this work. We aggregate different works by taking the average as well.
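The two averaging steps can be sketched as below. The function takes any callable that returns the consistency of a version pair; the work and version identifiers and the precomputed toy values are hypothetical and serve only to show the aggregation.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, Hashable, Sequence


def cross_version_consistency(
    versions_per_work: Dict[str, Sequence[Hashable]],
    pair_consistency: Callable[[Hashable, Hashable], float],
) -> float:
    """Mean over works of the mean consistency over all version pairs of a work."""
    work_scores = []
    for versions in versions_per_work.values():
        pairs = list(combinations(versions, 2))
        if not pairs:
            continue  # a work with a single version has no pair to compare
        work_scores.append(mean(pair_consistency(a, b) for a, b in pairs))
    if not work_scores:
        raise ValueError("no work with at least two versions")
    return mean(work_scores)


# Hypothetical precomputed pairwise consistencies for two works.
toy_values = {
    ("w1_v1", "w1_v2"): 0.8,
    ("w1_v1", "w1_v3"): 0.6,
    ("w1_v2", "w1_v3"): 0.7,
    ("w2_v1", "w2_v2"): 0.9,
}
versions = {"w1": ["w1_v1", "w1_v2", "w1_v3"], "w2": ["w2_v1", "w2_v2"]}

cvc = cross_version_consistency(versions, lambda a, b: toy_values[(a, b)])
print(round(cvc, 3))
```

Averaging per work first gives every work the same weight, regardless of how many versions (and thus pairs) it has.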
There are various possible choices for the similarity measure. By default, we use a straightforward measure based on the total variation distance (TVD). Given two probability distributions over the 24 local keys $p \in \mathbb{R}^{24}$ and $q \in \mathbb{R}^{24}$, the TVD-based similarity is computed as:
$$s(p, q) = 1 - \frac{1}{2} \sum_{i=1}^{24} |p_i - q_i|.$$
Since $p$ and $q$ are probability distributions (i.e., sum up to one), $s(p, q) \in [0, 1]$, which naturally gives us a normalized consistency measure. We will discuss the effect of alternative similarity measures in Section 6.
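A direct transcription of the TVD-based similarity is given below as a small sketch. The `one_hot` helper and the class indices in the example are assumptions for illustration only.

```python
from typing import List, Sequence


def tvd_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """1 minus the total variation distance between two distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return 1.0 - 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def one_hot(index: int, size: int = 24) -> List[float]:
    """Distribution that puts all probability mass on one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


p = one_hot(0)
q = one_hot(7)
half = [0.5 * a + 0.5 * b for a, b in zip(p, q)]  # mass split between both classes

print(tvd_similarity(p, p))     # identical distributions
print(tvd_similarity(p, q))     # disjoint support
print(tvd_similarity(p, half))  # half of the mass overlaps
```

The three cases cover the range of the measure: identical distributions reach the upper bound, distributions with disjoint support reach the lower bound, and partial overlap lies in between.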
4. EXPERIMENTAL SETUP
In this section, we describe our experimental setups including datasets, models, and training details.
4.1 Datasets
For our study, we consider three cross-version datasets: Schubert Winterreise Dataset (SWD) [20], Beethoven Piano Sonata Dataset (BPSD) [21], and Beethoven String Quartet Dataset (BSQD)², all of which come with local key annotations. The number of works (i.e., movements) and versions as well as the total duration of these datasets are listed in Table 1.
Dataset      # Movements   # Versions   Dur. (hh:mm)
SWD [20]     24            9            10:50
BPSD [21]    32 (a)        11           41:07
BSQD [22]    70 (b)        9 (c)        62:12
Table 1: (a) Only the first movements of the 32 sonatas. (b) 16 full string quartets. (c) 7 of them have all the works and 2 of them have only part of the works.
Model              # Params.
cqt_cnn            293k
hcqt_cnn           294k
octave_lstm        46k
octavefull_lstm    150k
chroma             32k
chroma_res         200k
Table 2: Different models used in our experiments.
Previous work [3] has investigated the effect of different splits of the datasets where training, validation, and test data contain the same works but different versions (version split), the same versions but different works (work split), or neither the same works nor the same versions (neither split). In our experiments, we use the neither split for all datasets since it is the most realistic (and difficult) one.
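One way to realize a neither split is sketched below, under the assumption that works and versions are each partitioned into disjoint training, validation, and test subsets and that a recording is used only if its work and its version fall into the same subset. The identifiers and subset sizes are hypothetical and do not reflect the partitions used in the experiments.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def neither_split(
    work_subsets: Dict[str, Sequence[str]],
    version_subsets: Dict[str, Sequence[str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Pair the works of each subset only with the versions of the same subset."""
    return {
        name: list(product(work_subsets[name], version_subsets[name]))
        for name in work_subsets
    }


# Hypothetical partition of six works and four versions.
works = {"train": ["W1", "W2", "W3", "W4"], "val": ["W5"], "test": ["W6"]}
versions = {"train": ["V1", "V2"], "val": ["V3"], "test": ["V4"]}

split = neither_split(works, versions)
print(len(split["train"]), len(split["val"]), len(split["test"]))
```

Recordings that combine, for example, a training work with a test version are left unused in this sketch, which is the price for sharing neither works nor versions across subsets.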
4.2 Models
While both signal processing-based methods with hand-crafted features and deep learning methods have been applied to LKE, deep learning models typically incorporate less music knowledge and are more data-dependent. Therefore, they often suffer more from the ambiguity and subjectivity of LKE than signal processing methods, so we focus on the deep learning methods in this paper. Building on previous work, we include the following models in our study. The first two, cqt_cnn and hcqt_cnn, are VGG-style convolutional neural networks that take a constant-Q transform (CQT) or a harmonic CQT (HCQT, see [23]) as the input, respectively [3]. Two further models, octave_lstm and octavefull_lstm, rely on a musically inspired octave-based rearrangement in the architecture [24] and add bidirectional LSTM layers to model sequential information [6], where octavefull_lstm is equivalent to the original model and octave_lstm is a reduced one from the ablation study of [6]. The remaining two, chroma and chroma_res, are convolutional networks that were proposed to learn pitch-class (chroma) representations, where chroma_res adds more layers with residual connections than chroma [25]. We adapt these models to LKE by adding a final linear layer to classify the output into 24 local key classes. Table 2 lists these models along with their number of parameters.
² This dataset will be published soon and comprises multiple versions of all movements from all Beethoven's string quartets. Local key annotations are derived from the symbolic ABC dataset [22].
4.3 Training details
We train all our models from scratch using an Adam optimizer for 100 epochs. We set the learning rate as 0.001. To simulate the variance within one run and to obtain models of varying quality, we pick a checkpoint after every 10 epochs during training. To capture the variance due to random initialization, we repeat each training run 5 times, resulting in 10 (checkpoints) times 5 (runs) data points for each model. As a standard metric to compare with our CVC, we compute the key recall rate (accuracy ignoring frames annotated as "no key"), and compare this with other metrics in Section 6.
5. RESEARCH QUESTIONS AND RESULTS
In this section, we raise several research questions (RQs), present our results, and discuss these results regarding our RQs.
RQ1: Is the cross-version consistency correlated with the effectiveness of the model? As mentioned, we expect an effective and robust model both to have high recall and to be consistent across versions. Therefore, we investigate the correlation between recall and CVC on the same set of test data. If they are correlated, we can use CVC as a proxy for models' effectiveness and compare models on multi-version datasets without annotations.
For each model on each dataset, we obtain 50 checkpoints as described in Section 4. From each checkpoint, we compute recall and CVC, and then calculate Spearman's rank correlation coefficient ρ. To see whether such correlation holds across models, we also compute Spearman's correlation coefficients over all data points, including different models. For better visualization, we also draw the regression lines.³
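For reference, Spearman's ρ is the Pearson correlation of the ranks of both variables. The sketch below implements it with the standard library only, using average ranks for ties; it does not compute p-values. The CVC and recall values are made-up toy numbers, not results from the experiments.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy checkpoints (hypothetical values): one pair of checkpoints is ranked differently.
cvc_values = [0.60, 0.65, 0.70, 0.80]
recall_values = [0.50, 0.62, 0.58, 0.75]
rho = spearman_rho(cvc_values, recall_values)
print(round(rho, 3))
```

Because only ranks enter the computation, ρ is insensitive to the scale of CVC and recall and captures any monotonic relationship between them.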
The results are shown in Figure 3. The x-axes indicate the cross-version consistency, the y-axes indicate the recall, and different colors indicate different models.
We can see that on all datasets, within each model, there exist strong correlations between recall and CVC. For example, the model cqt_cnn (gold circles) obtains a ρ of 0.73, 0.79, and 0.71 on the datasets SWD, BPSD, and BSQD, respectively. From all regressions, we obtain p < 0.05, suggesting that for these models, the CVC is correlated with the recall with statistical significance.
The overall linear regression across all models (black line) shows statistical significance as well. On the three datasets, the regressions show ρ values of 0.92, 0.89, and 0.88, all with p < 0.001. This means that the correlation between recall and CVC still holds across models, i.e., models with higher consistency have a higher recall. However, it only suggests a correlation on a coarse scale, while on a smaller scale, this must be taken with care for some of the model architectures. For example, on both BPSD and BSQD, hcqt_cnn (green triangles) has a slightly higher CVC but a slightly lower recall than chroma_res. This might be due to the fact that these two models are based on different architectures and therefore imply different inductive biases.
³ We also calculated Pearson's correlation coefficients, but they are omitted because they are close to the Spearman's correlation coefficients.
Figure 3: Correlation between recall and cross-version consistency on (a) SWD, (b) BPSD, and (c) BSQD.
In conclusion, we see that in general, CVC is strongly correlated with the recall, so we can use cross-version consistency as a proxy for models' effectiveness. Beyond that, CVC also captures different aspects than recall, so when two models obtain similar CVC, the proxy cannot reliably determine which one is better. The model selection process then needs to consider both the effectiveness and the consistency.
RQ2: Does the correlation between cross-version consistency and recall hold on out-of-domain test data? In RQ1, we compute CVC and recall on the same source datasets as used for training. However, we often want to compare models' effectiveness on out-of-domain data, i.e., their out-of-domain generalizability. Therefore, we here investigate the models' generalization to out-of-domain data in relation to their CVC. If they are correlated, we can then compare different models' effectiveness on out-of-domain multi-version datasets by measuring their CVC.
Figure 4: Correlation between recall and CVC on models (a) trained on BPSD and tested on BSQD (b) trained on BSQD and tested on BPSD.
To this end, we perform a cross-dataset experiment where we train models on one source dataset and compute both recall and CVC on another dataset as a hold-out test set. Note that we adopt a straightforward definition of out-of-domain data: data from a different source dataset. In our cases, this means different instrumentation (string instruments in BSQD vs. piano in BPSD) or different composers (Schubert in SWD vs. Beethoven in BPSD and BSQD).
Figure 4 shows the results. We make a similar observation: within the same model architecture, the checkpoints that are more consistent on the test set also have higher recall. For example, cqt_cnn trained on BPSD (Figure 4a) shows a ρ of 0.67 when tested on BSQD, and switching training and test dataset gives a ρ of 0.82. This means that higher CVC on out-of-domain data also suggests higher recall on that data.
Comparing different model architectures, we also see that the overall regressions (across all models) yield ρ = 0.92 and 0.94, respectively. This indicates that, on a coarse scale, more consistent models are also more effective on these unseen out-of-domain data. From this observation, we conclude that CVC enables us to evaluate a model's generalizability on an out-of-domain multi-version test set without requiring labels.
Figure 5: Correlation between recall and CVC on models (a) trained on BPSD and tested on BSQD (b) trained on BSQD and tested on BPSD. Note that CVC is computed on the unseen test partition of the training dataset.
RQ3: Is in-domain cross-version consistency correlated with effectiveness on out-of-domain data? To address RQ2, we computed both CVC and recall on the same test set, which requires the test dataset (target domain) to include multiple versions. However, in practice, we often want to estimate model robustness without such dedicated cross-version datasets. To address this, here in RQ3, we investigate whether CVC in the training (source) domain can be used as a proxy for models' effectiveness on the target domain, i.e., its capability of domain generalization. To this end, we compute the CVC on the in-domain test data (using a neither split) and the recall rate on out-of-domain test data.
Figure 5 shows the results. While in some model-specific and dataset-specific cases, the correlation within one model does not always hold, across different models (black lines), the observations stay similar. These overall regressions show ρ of 0.29 and 0.63, respectively, with all p < 0.001, indicating statistical significance. This means that, on a coarse scale, consistent models in general obtain higher recall than inconsistent ones. This allows us to compare the generalizability of different models: If a model is significantly more consistent on our in-domain cross-version test set, it is very likely to be more robust against domain shifts such as instrumentation changes.
Given these results, we want to emphasize that models with higher recall on a single in-domain test set are not necessarily more generalizable, which is a typical case of dataset bias. If we compare Figure 3b with Figure 5b, we see that cqt_cnn obtains lower recall but slightly higher CVC than chroma_res when tested on the in-domain test set BSQD but shows higher recall on out-of-domain test set BPSD. We conclude that CVC can serve as a complement to the traditional evaluation metrics, indicating the generalizability of the model, even if we neither have labels nor multi-version data in the target domain at hand.
6. TOWARDS INCLUDING MUSIC KNOWLEDGE
In the previous section, we chose recall as a representative of common evaluation metrics, and used the straightforward TVD-based similarity to compute the cross-version consistency. In this section, we investigate the effect of using other evaluation metrics and consistency measures that take music knowledge into account.
As mentioned in Section 2, accuracy or recall ignore the musical relationship between different key labels and are therefore not able to account for the different types of errors. In MIREX, researchers have proposed another evaluation metric for key estimation. This MIREX score partially rewards musically meaningful errors including fifth error, parallel error, and relative error (see Section 2.1).
As an alternative to TVD, we also consider a musically motivated similarity measure. To this end, we arrange a model's output distribution according to the circle of fifths, placing relative keys next to each other in thirds (e.g., A:min between C:maj and F:maj). On this geometric key distribution, we compute the Earth Mover's Distance (EMD), which quantifies the cost of turning one probability distribution into another by moving probability mass in the shortest direction along the circle. The circle-of-fifths arrangement thereby calls for a circular version of the EMD [26]. In [27], a similar measure was applied to compare diatonic scale probabilities, which are closely related to local keys. For example, moving from C:maj to G:maj costs the same as to F:maj and costs less than to D:maj, due to the fifth relationship.
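The cost example above can be reproduced with a small sketch of a circular EMD on 24 positions. Several details are assumptions made for illustration: the circle starts at C:maj, neighboring positions are one cost unit apart, and the distance is left unnormalized (how it is turned into a similarity is not specified here). The computation uses the closed form for equal-mass histograms on a ring: accumulate the differences $p_i - q_i$, subtract the median of the cumulative sums, and add up the absolute values.

```python
from statistics import median
from typing import List, Sequence

PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def circle_of_fifths_order() -> List[str]:
    """24 keys around the circle: C:maj, E:min, G:maj, B:min, ..., F:maj, A:min."""
    order = []
    for step in range(12):
        tonic = (7 * step) % 12  # move up a perfect fifth per step
        order.append(f"{PITCH_NAMES[tonic]}:maj")
        # The minor key a major third above sits between this major key and the next.
        order.append(f"{PITCH_NAMES[(tonic + 4) % 12]}:min")
    return order


def circular_emd(p: Sequence[float], q: Sequence[float]) -> float:
    """Circular EMD with unit cost between neighboring positions on the ring."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    cumulative, running = [], 0.0
    for pi, qi in zip(p, q):
        running += pi - qi
        cumulative.append(running)
    offset = median(cumulative)  # optimal amount of mass sent across the wrap-around
    return sum(abs(c - offset) for c in cumulative)


ORDER = circle_of_fifths_order()


def point_mass(key: str) -> List[float]:
    """Distribution with all probability on one key, in circle order."""
    return [1.0 if name == key else 0.0 for name in ORDER]


for target in ("G:maj", "F:maj", "D:maj"):
    print(target, circular_emd(point_mass("C:maj"), point_mass(target)))
```

With this ordering, G:maj and F:maj lie two positions away from C:maj in opposite directions, while D:maj lies four positions away, so the first two costs are equal and smaller than the third.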
We now want to mutually compare the resulting metrics. To this end, we use the experimental setup of RQ1 in Section 5 and calculate the pairwise Spearman correlation coefficients between recall, MIREX score, TVD-based consistency, and EMD-based consistency. The results are shown in Figure 6, where CVC_TVD and CVC_EMD indicate the consistencies based on TVD and EMD, respectively. We show only the results on SWD; results on other datasets are similar.
Figure 6: Pairwise correlation between recall, MIREX, CVC_TVD and CVC_EMD, computed with (a) hcqt_cnn (b) octavefull_lstm and (c) chroma_res, each trained and tested on SWD.
We can see that the correlations between recall and MIREX score are 0.91, 0.99, and 0.94, respectively, indicating a high correlation. Also, CVC_TVD has a rather high correlation with CVC_EMD, with correlations of 0.76, 0.97, and 0.79, respectively. The group of the two standard metrics and the group of the two consistencies still show correlation, with coefficients of 0.6–0.8 for hcqt_cnn and around 0.4–0.5 for octavefull_lstm and chroma_res. This means that our conclusion from the previous section can be extended to other metrics and consistency measures. In comparison, however, the correlation between these two groups is clearly weaker than the correlation within each group. This observation suggests that our proposed cross-version consistency is not measuring exactly the same thing as recall rate or MIREX score, but is rather capturing related yet different perspectives, thus serving as a novel figure of merit for LKE evaluation.
7. CONCLUSION
In this paper, we propose to investigate the cross-version consistency of LKE models as a new strategy for evaluation. We show that CVC is strongly correlated with models' effectiveness and generalizability to out-of-domain data, while requiring no tedious human annotations but only aligned version pairs. Therefore, we can compare different LKE models with more diverse multi-version datasets without labels, reducing the risk of dataset bias and annotator bias. Note that we do not undermine the importance and necessity of common evaluation metrics, but CVC serves as a good complement, evaluating LKE models from different perspectives.
8. ACKNOWLEDGEMENTS
This work was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) within the Emmy Noether Junior Research Group on Computational Analysis of Music Audio Recordings: A Cross-Version Approach (DFG WE 6611/3-1, Grant No. 531250483).
9. REFERENCES
[1] M. Roig-Francolí, Harmony in Context. New York: McGraw-Hill, 2011.
[2] S. Hallam, I. Cross, and M. Thaut, Oxford handbook of music psychology. Oxford University Press, 2009.
[3] C. Weiß, H. Schreiber, and M. Müller, "Local key estimation in music recordings: A case study across songs, versions, and annotators," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 28, pp. 2919–2932, 2020.
[4] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, "MIR_EVAL: A transparent implementation of common mir metrics," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2014.
[5] Y. Kong, V. Lostanlen, G. Meseguer-Brocal, S. Wong, M. Lagrange, and R. Hennequin, "STONE: self-supervised tonality estimator," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2024, pp. 954–961.
[6] Y. Ding and C. Weiß, "Towards robust local key estimation with a musically inspired neural network," in Proceedings of the European Signal Processing Conference (EUSIPCO), 2024, pp. 26–30.
[7] J. S. Downie, "The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research," Acoustical Science and Technology, vol. 29, no. 4, pp. 247–255, 2008.
[8] H. Papadopoulos and G. Peeters, "Local key estimation from an audio signal relying on harmonic and metrical structures," IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 20, no. 4, pp. 1297–1312, 2011.
[9] F. Korzeniowski and G. Widmer, "End-to-end musical key estimation using a convolutional neural network," in Proceedings of the European Signal Processing Conference (EUSIPCO), 2017, pp. 966–970.
[10] F. Korzeniowski and G. Widmer, "Genre-agnostic key classification with convolutional neural networks," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2018, pp. 264–270.
[11] A. Laaksonen, "Ambiguity in automatic chord transcription: recognizing major and minor chords," in Adaptive Multimedia Retrieval: Semantics, Context, and Adaptation (AMR), 2014, pp. 203–213.
[12] Y. Ni, M. McVicar, R. Santos-Rodriguez, and T. De Bie, "Understanding effects of subjectivity in measuring chord estimation accuracy," IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 21, no. 12, pp. 2607–2615, 2013.
[13] H. V. Koops, W. B. De Haas, J. A. Burgoyne, J. Bransen, A. Kent-Muller, and A. Volk, "Annotator subjectivity in harmony annotations of popular music," Journal of New Music Research (JNMR), vol. 48, no. 3, pp. 232–252, 2019.
[14] V. Konz and M. Müller, "A cross-version approach for harmonic analysis of music recordings," in Multimodal Music Processing, ser. Dagstuhl Follow-Ups. Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2012, vol. 3, pp. 53–72.
[15] S. Ewert, M. Müller, V. Konz, D. Müllensiefen, and G. A. Wiggins, "Towards cross-version harmonic analysis of music," IEEE Transactions on Multimedia, vol. 14, no. 3-2, pp. 770–782, 2012.
[16] V. Konz, M. Müller, and R. Kleinertz, "A cross-version chord labelling approach for exploring harmonic structures - a case study on Beethoven's Appassionata," Journal of New Music Research, vol. 42, no. 1, pp. 61–77, 2013.
[17] L. Liu and C. Weiß, "Utilizing cross-version consistency for domain adaptation: A case study on music audio," in International Conference on Learning Representations (ICLR), Tiny Papers, 2024.
[18] M. Krause, C. Weiß, and M. Müller, "A cross-version approach to audio representation learning for orchestral music," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 832–839.
[19] M. Müller, Y. Özer, M. Krause, T. Prätzlich, and J. Driedger, "Sync toolbox: A python package for efficient, robust, and accurate music synchronization," Journal of Open Source Software, vol. 6, no. 64, p. 3434, 2021.
[20] C. Weiß, F. Zalkow, V. Arifi-Müller, M. Müller, H. V. Koops, A. Volk, and H. G. Grohganz, "Schubert Winterreise dataset: A multimodal scenario for music analysis," ACM Journal on Computing and Cultural Heritage, vol. 14, no. 2, pp. 25:1–18, 2021.
[21] J. Zeitler, C. Weiß, V. Arifi-Müller, and M. Müller, "BPSD: A coherent multi-version dataset for analyzing the first movements of Beethoven's piano sonatas," Transactions of the International Society for Music Information Retrieval, vol. 7, no. 1, pp. 195–212, 2024.
[22] M. Neuwirth, D. Harasim, F. C. Moss, and M. Rohrmeier, "The annotated beethoven corpus (abc): A dataset of harmonic analyses of all beethoven string quartets," Frontiers in Digital Humanities, vol. 5, p. 16, 2018.
[23] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello, "Deep salience representations for F0 tracking in polyphonic music," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 2017, pp. 63–70.
[24] A. Elowsson and A. Friberg, "Modeling music modality with a key-class invariant pitch chroma CNN," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 2019, pp. 541–548.
[25] C. Weiss, J. Zeitler, T. Zunner, F. Schuberth, and M. Müller, "Learning pitch-class representations from score-audio pairs of classical music," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 746–753.
[26] J. Rabin, J. Delon, and Y. Gousseau, "Circular Earth Mover's Distance for the comparison of local features," in Proceedings of the International Conference on Pattern Recognition (ICPR), Tampa, USA, 2008.
[27] C. Weiß and M. Müller, "From music scores to audio recordings: Deep pitch-class representations for measuring tonal structures," ACM Journal on Computing and Cultural Heritage (JOCCH), vol. 17, no. 3, pp. 45:1–19, 2024.