
Towards Robust Automatic Music Transcription By Measuring Cross-Version Consistency

Author: Yannik Venohr; Yiwei Ding; Christof Weiss
Publisher: Zenodo
DOI: 10.5281/zenodo.17706389
Source: https://zenodo.org/records/17706389/files/000031.pdf
TOWARDS ROBUST MUSIC TRANSCRIPTION BY MEASURING
CROSS-VERSION CONSISTENCY IN WESTERN CLASSICAL MUSIC
Yannik Venohr, Yiwei Ding, Christof Weiß
Center for Artificial Intelligence and Data Science, University of Würzburg
{yannik.venohr, yiwei.ding, christof.weiss}@uni-wuerzburg.de
ABSTRACT
Automatic Music Transcription (AMT) is a central task within MIR, enabling various subsequent applications. Despite advancements thanks to deep learning, improving AMT remains challenging due to the scarcity of large, high-quality annotated datasets. Recognizing pitches in multi-instrument settings beyond solo piano is particularly difficult, as models struggle to generalize across domains due to dataset biases and overfitting. AMT research appears to have hit a glass ceiling, where further progress is difficult to achieve and to measure. To address this, we propose cross-version consistency (CVC), an annotation-free evaluation framework that measures a model's transcription consistency across different recordings of the same musical work. We formalize this concept and systematically analyze its relationship with standard evaluation metrics on the AMT subtask of multi-pitch estimation. Our results show that CVC is closely tied to standard evaluation metrics and enables model assessment using only unlabeled multi-version datasets, making it particularly valuable in domains where annotated data is scarce but multi-version recordings are easy to obtain, such as orchestral music. Beyond this, we argue that CVC is, by design, a desirable property of transcription models, and our results indicate that it can provide insights into a model's robustness, i.e., its ability to generalize to out-of-domain data.
1. INTRODUCTION
Automatic Music Transcription (AMT) aims to convert music recordings into some form of music notation, making it a powerful tool for various applications [1]. In musicology, AMT can help to analyze large collections of recorded music, including improvised performances or orally transmitted pieces, revealing patterns that might otherwise remain inaccessible [2,3]. AMT also supports music education by providing automatic transcriptions to help with learning and practice. Ultimately, an AMT system aims towards generating a human-readable score. However, this goal is commonly addressed through intermediate steps with increasing levels of abstraction. These steps include frame-level, note-level, stream-level, and notation-level transcription as defined in [1].

Figure 1: Cross-version consistency. We compare predictions $\hat{Y}_1, \hat{Y}_2$ from a model $\theta$ across two different recordings $X_1, X_2$ of the same work at musically corresponding positions.

© Yannik Venohr, Yiwei Ding and Christof Weiß. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Yannik Venohr, Yiwei Ding and Christof Weiß, “Towards Robust Music Transcription by Measuring Cross-Version Consistency in Western Classical Music”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
While transcription models have shown good results in piano-only scenarios [4, 5, 6], recognizing pitches in a multi-instrument scenario beyond solo piano remains a significant challenge [1]. Recently, using large-scale Transformer architectures, some significant progress has been made for stream-level transcription [7,8]. However, when tested on unseen datasets, these models show a drastic drop in efficacy [8]. Similarly, for frame-level transcription, Weiß and Peeters [9] found that variations between architectures are often smaller than variations across training runs and even become irrelevant in cross-dataset evaluations. This highlights the long-standing problem that deep learning models tend to overfit to inherent dataset biases rather than generalizing effectively [10]. For AMT models to be practically useful for musicology, they must be robust and able to generalize to out-of-domain data.
A major challenge is the scarcity of large, unbiased, and high-quality annotated datasets. Inspired by advances in other modalities like text and images, a natural next step is to find ways to leverage unlabeled data. Early efforts in this direction include self-supervised learning approaches, such as learning from equivariance under pitch transposition [11]. However, even with improved models, the challenge of measuring progress with a limited amount of reliable test data remains unresolved.
Inspired by previous work [12, 13, 14], we propose to address this challenge by exploiting multi-version datasets of Western classical music. These datasets contain several versions (recorded performances) of the same musical work, possibly by different musicians, on different instruments, and in different recording conditions, all closely following the same score. Thus, we obtain different audio signals that carry the same musical content. This provides an opportunity to evaluate transcription models beyond standard evaluation metrics. As our main contribution, we formalize and systematically test the notion of Cross-Version Consistency (CVC). We consider a model to have a high CVC if it makes similar predictions at musically corresponding positions, regardless of performer or recording conditions (Figure 1). As this measure does not depend on annotations, it enables us to evaluate models even in domains where annotated data is scarce or unavailable. Moreover, as it implicitly captures when and which transcription errors occur rather than just how many, it may serve as a useful additional figure of merit for evaluating transcription models. While this paper aims to provide insights for the broader field of AMT, the experiments in this paper focus on the subtask of frame-level transcription, also known as Multi-Pitch Estimation (MPE). Our main contributions are (1) proposing and formalizing CVC, (2) designing experiments to systematically examine its relationship with standard evaluation metrics, and (3) showing that CVC is closely tied to both transcription capabilities and a model's ability to generalize to different domains.
The remainder of the paper is organized as follows: Section 2 reviews related work. Section 3 formalizes CVC. Section 4 presents our experimental setup. Section 5 presents results and discusses our findings. Section 6 concludes the paper.
2. RELATED WORK
AMT has been an active research area for nearly five decades [1]. Given the extensive body of work in this field, we refer to [1] for a comprehensive overview. Most state-of-the-art AMT approaches rely on deep learning with supervised training [4, 5, 6, 15, 16, 9, 7, 8], where models are trained on datasets of music recordings with aligned pitch annotations. Even though these approaches have been successful in certain domains, two key challenges remain: (1) compared to traditional signal processing techniques, deep learning models often struggle to generalize across different domains [1], and (2) in many domains such as choir or orchestral music, annotated datasets are scarce, limiting the effectiveness of supervised learning.
Exploiting multi-version datasets, for both training and evaluation, has emerged as a strategy to address these challenges. Weiß et al. [12, 9] leveraged multi-version datasets for evaluation, studying generalization across versions and analyzing the impact of different splitting strategies. Krause et al. [13, 17] explored training strategies using multi-version data. One approach employs contrastive learning, treating temporally close audio segments across versions as positive pairs and distant segments as negative pairs [17]. However, their findings suggest that the resulting representations capture instrument texture rather than pitch classes and harmonies. Another approach minimizes the distance between time–frequency representations of different versions of the same work [13], demonstrating promising MPE results. Liu and Weiß [14] used multi-version datasets for domain adaptation within a teacher–student learning paradigm. They use a notion of CVC to filter training labels by comparing teacher annotations across versions and retaining only matching annotations for student training. In contrast to [14], where CVC is considered as a filter on binary outputs, we define it as a measure on the probabilities. Moreover, we consider a larger picture, formalizing and systematically exploring CVC.
3. CROSS-VERSION CONSISTENCY
In this paper, we focus on MPE, aiming to train a neural network $\theta$ to estimate pitch probabilities from audio. Specifically, a model produces a sequence of pitch probability vectors $\hat{Y} = (\hat{y}(0), \ldots, \hat{y}(T))$, where each vector $\hat{y}(t) \in [0,1]^{72}$ represents the probability of pitches being active at time frame $t$. We consider a model to have a high CVC if it makes similar predictions at musically corresponding positions, regardless of differences in performer and recording conditions.
3.1 Alignment via Dynamic Time Warping
As a peculiarity of Western classical music, different versions of a work exactly follow the same score regarding pitch and note information, but are quite free with regard to global and local tempo (including fluctuations such as agogics, ritardando, or rubato). To identify musically corresponding positions despite these tempo variations, we rely on Dynamic Time Warping (DTW) [18], a well-established method for aligning time-series data. DTW yields a warping path between two sequences with lengths $N$ and $M$, denoted as $P = (p(1), \ldots, p(L))$ with

$$p(l) = (n_l, m_l) \in [1:N] \times [1:M].$$

This warping path establishes correspondences between time frames of the two sequences, enabling us to align musical positions despite tempo variations (see Figure 1 in blue). Notably, although DTW is computed using audio feature sequences, it can also be applied to align predictions, since we use the same feature rate for both.
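
To make the warping path $P$ concrete, the following is a minimal sketch of textbook DTW with the three standard step sizes, written in plain Python. It is an illustration only: the experiments in this paper use the memory-restricted multi-scale variant from SyncToolbox on chroma and onset features, and the function name `dtw_path`, the toy sequences, and the absolute-difference cost are assumptions made for this example.

```python
def dtw_path(x, y, dist):
    """Return a warping path [(n_1, m_1), ..., (n_L, m_L)] with 1-based indices.

    Textbook DTW with steps (1,1), (1,0), (0,1); quadratic in time and memory.
    """
    n_len, m_len = len(x), len(y)
    inf = float("inf")
    # acc[n][m]: accumulated cost of the cheapest path ending at (n, m).
    # Row 0 and column 0 are borders that can never be part of a path.
    acc = [[inf] * (m_len + 1) for _ in range(n_len + 1)]
    acc[0][0] = 0.0
    for n in range(1, n_len + 1):
        for m in range(1, m_len + 1):
            cost = dist(x[n - 1], y[m - 1])
            acc[n][m] = cost + min(acc[n - 1][m - 1], acc[n - 1][m], acc[n][m - 1])

    # Backtrack from the end (N, M) to the start (1, 1).
    n, m = n_len, m_len
    path = [(n, m)]
    while (n, m) != (1, 1):
        candidates = [
            (acc[n - 1][m - 1], (n - 1, m - 1)),
            (acc[n - 1][m], (n - 1, m)),
            (acc[n][m - 1], (n, m - 1)),
        ]
        _, (n, m) = min(candidates, key=lambda c: c[0])
        path.append((n, m))
    path.reverse()
    return path


# Toy example (hypothetical data): the same "melody" played at different tempi.
version_a = [0, 0, 1, 2]
version_b = [0, 1, 1, 2]
path = dtw_path(version_a, version_b, dist=lambda a, b: abs(a - b))
print(path)
```

Each pair in the returned list plays the role of one $p(l) = (n_l, m_l)$; the path starts at $(1, 1)$, ends at $(N, M)$, and advances by at most one frame per sequence in each step.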
3.2 Cross-Version Consistency
Given two audio signals $X_1$ and $X_2$ and their respective pitch predictions $\hat{Y}_1 = \theta(X_1) \in \mathbb{R}^{N \times 72}$ and $\hat{Y}_2 = \theta(X_2) \in \mathbb{R}^{M \times 72}$, we compute the similarity between aligned frames along the warping path. For the $l$-th element $p(l)$ in the warping path $P$, we define the frame-level similarity using a suitable similarity measure (e.g., cosine similarity) as

$$s(l) = \mathrm{cossim}\big(\hat{Y}_1(n_l), \hat{Y}_2(m_l)\big) \in [0, 1].$$

Since some versions may be performed in different keys, we transpose $\hat{Y}$ accordingly before computing the similarity. For a given model $\theta$ and two recordings $X_1$ and $X_2$ of the same musical work, we define CVC as the average frame-level similarity across the aligned frames:

$$\mathrm{CVC}(\theta, X_1, X_2) = \frac{1}{L} \sum_{l=1}^{L} s(l).$$

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

For a given set of recordings $D$, we define CVC to be computed for all pairwise combinations of different versions of each work and averaged over works. By design, this measure assesses a model's robustness against variations in performer and recording conditions. We argue that being consistent across differing conditions is particularly desirable for applications like corpus analysis. Moreover, as it implicitly captures when and which transcription errors occur rather than just how many, it offers a complementary perspective to standard evaluation metrics. While, by definition, a model making constant but incorrect predictions (e.g., predicting the same pitch for every frame) will yield a high CVC score, we note that this does not invalidate the metric since we consider CVC as an additional figure of merit rather than a standalone quality indicator.
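
The definitions above translate fairly directly into code. The sketch below, in plain Python, computes the pairwise CVC along a given warping path and aggregates it over a dataset. Several points are assumptions made for this illustration: the function names, the nested-dictionary layout of `predictions` and `paths`, the choice to score all-zero frames as 0, and the reading of "averaged over works" as a mean over pairs within each work followed by a mean over works. Key transposition is omitted.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two non-negative vectors, in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: an all-zero frame has no direction, so we score it as 0.
    return dot / norm if norm > 0 else 0.0


def cvc_pair(y1, y2, path):
    """Mean frame-level similarity s(l) along a 1-based warping path."""
    sims = [cosine_similarity(y1[n - 1], y2[m - 1]) for n, m in path]
    return sum(sims) / len(sims)


def cvc_dataset(predictions, paths):
    """predictions: {work: {version: frames}}, paths: {(work, v1, v2): path}.

    Averages over all version pairs of a work, then over works.
    """
    per_work = []
    for work, versions in predictions.items():
        scores = [
            cvc_pair(versions[a], versions[b], paths[(work, a, b)])
            for a, b in combinations(sorted(versions), 2)
        ]
        if scores:  # works with a single version contribute nothing
            per_work.append(sum(scores) / len(scores))
    return sum(per_work) / len(per_work)


# Toy example with 3 pitch bins instead of 72 (hypothetical data).
y1 = [[1, 0, 0], [0, 1, 0]]
y2 = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]   # same content, slower first note
y_const = [[0, 1, 0]] * 3                # always predicts the same pitch
path = [(1, 1), (1, 2), (2, 3)]

predictions = {"work_a": {"v1": y1, "v2": y2}}
paths = {("work_a", "v1", "v2"): path}

print(cvc_pair(y1, y2, path))
print(cvc_pair(y1, y_const, path))
print(cvc_dataset(predictions, paths))
```

Comparing `y_const` with a second constant prediction would also give a score of 1, which is the caveat noted above: a high CVC alone does not imply correct transcriptions.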
4. EXPERIMENTAL SETUP
To study our proposed consistency measure, we conduct a series of experiments on the relationship between CVC on multi-version datasets and common evaluation metrics on labeled test sets.
4.1 Datasets and Splits
At the center of all our experiments are three structured multi-version datasets: Schubert Winterreise Dataset (SWD) [19], Beethoven Piano Sonata Dataset (BPSD) [20], and Beethoven String Quartet Dataset (BSQD) [to be published soon, footnote 1]. Since these datasets are used for training and for evaluation, we split them into subsets. To avoid overfitting to specific recording conditions (“version effect”, [9]) or melodic/harmonic patterns in a work (“cover song effect”, [9]), we ensure that models are neither trained on the same versions nor on the same works they are tested on, and always use a strict “neither split” (see Fig. 4 in [12]).

For our studies on generalization, we use two further high-quality datasets as unseen test sets: the classical subset of the Real World Computing Music Database (RWC) [21] and TRIOS [22]. These datasets, which include orchestral, horn, flute, cembalo, and organ pieces, are used exclusively for testing. To ensure they represent out-of-domain data, we exclude pieces with instrumentations matching our three multi-version datasets, leaving 35 of the 50 pieces in RWC. Table 1 provides an overview.

Footnote 1: A multi-version dataset comprising 6-7 versions of L. v. Beethoven's complete string quartets with annotations derived from the symbolic ABC Corpus [3].

Ref.     Name    Inst.            hh:mm   W × V
[19]     SWD     Piano, Singing   10:49   24 × 9
[20]     BPSD    Piano            41:08   32 (a) × 11
t.b.p.   BSQD    Strings          62:12   70 (b) × 9
[21]     RWC-C   Mixed             5:21   35 × 1
[22]     TRIOS   Mixed             0:03   5 × 1

Table 1: Datasets used in this paper. W: Number of unique works. V: Number of unique versions. (a) The first movements of the 32 sonatas. (b) 16 full string quartets.

Ref.   Architecture    Parameters
[16]   Basic Pitch        13,320
[15]   Deep Salience     406,453
[9]    ResNet-S          393,535
[9]    ResNet-M        1,512,783
[9]    ResNet-L        4,555,683

Table 2: Model architectures used in this paper. Each model is based closely on the referenced work, though minor differences may exist due to reimplementation (e.g., slight variations in parameter counts).
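
The "neither split" described in this subsection can be sketched as a simple filter over (work, version) pairs. This is one reading of the description, not the partition actually used in the experiments: the function name, the identifiers, and the choice of held-out works and versions are hypothetical, and a validation subset is left out for brevity.

```python
def neither_split(recordings, test_works, test_versions):
    """Split (work, version) pairs so that train and test share neither works nor versions.

    Recordings that share only the work or only the version with the test set
    are discarded, since they would leak either the "cover song effect" or the
    "version effect" into training.
    """
    train, test = [], []
    for work, version in recordings:
        if work in test_works and version in test_versions:
            test.append((work, version))
        elif work not in test_works and version not in test_versions:
            train.append((work, version))
    return train, test


# Hypothetical toy dataset: 3 works, each recorded in 3 versions.
recordings = [(w, v) for w in ("w1", "w2", "w3") for v in ("vA", "vB", "vC")]
train, test = neither_split(recordings, test_works={"w3"}, test_versions={"vC"})
print(train)
print(test)
```

The price of the strict separation is visible even in this toy case: of nine recordings, only five end up in either subset.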
4.2 Models
In this paper, we focus on demonstrating the potential of CVC as a measure for evaluating MPE models. Rather than proposing complex network architectures or conducting extensive hyperparameter tuning, we aim to assess a range of commonly used architectures. We implement five fully convolutional deep learning models that are closely inspired by prior work, making minimal adjustments to fit our training framework. The smallest model is based on the note activation component of the Notes and Multipitch (NMP) model [16] (footnote 2), adjusted to work with fewer harmonics, which we will refer to as Basic Pitch. Another model builds upon the Deep Salience network [15], with an additional layer to quantize outputs into semitone bins. Lastly, we use three different sizes of a deep convolutional network with additional residual connections (ResNet), as described in [9]. In Table 2, we show model sizes ranging from 13 thousand to 4.5 million parameters.
4.3 Implementation Details
All models are trained from scratch using the Adam optimizer for 50 epochs, with binary cross-entropy as the loss function. We set the learning rate to 0.0005. The input to all models is a Harmonic Constant-Q Transform (HCQT) [15] with five harmonics and one subharmonic spanning six octaves (C1–C7) with three bins per semitone, resulting in 216 pitch bins. We use a sample rate of 22.05 kHz and an HCQT hop size of 512 samples, yielding a frame rate of approximately 43 Hz. We process sequences of 64 frames (roughly 1.5 seconds of audio) in mini-batches of 128 samples. To mitigate and analyze the impact of randomness [9], we repeat training three times using different seeds. For each training run, we save and evaluate checkpoints after every five epochs. DTW is computed using the SyncToolbox [23] implementation of memory-restricted multi-scale DTW (MrMsDTW) [24] using chroma and onset features at a frame rate of approximately 43 Hz. For further details, refer to the source code (footnote 3).

Footnote 2: Please note that in [16], “note” refers to quantized pitch on a semitone axis, in contrast to sub-semitone pitch contours.
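
The figures quoted in this subsection follow from a few lines of arithmetic, reproduced here as a quick sanity check. The constant names are chosen for this example; the observation that six octaves at semitone resolution give 72 values, matching the 72-dimensional output vectors of Section 3, is an inference from the numbers rather than a statement taken from the text.

```python
SAMPLE_RATE_HZ = 22050
HOP_SIZE_SAMPLES = 512
OCTAVES = 6
SEMITONES_PER_OCTAVE = 12
BINS_PER_SEMITONE = 3
FRAMES_PER_SEQUENCE = 64

# One frame every 512 samples at 22.05 kHz.
frame_rate_hz = SAMPLE_RATE_HZ / HOP_SIZE_SAMPLES

# HCQT input resolution: three bins per semitone over six octaves.
pitch_bins = OCTAVES * SEMITONES_PER_OCTAVE * BINS_PER_SEMITONE

# Semitone resolution over the same range.
semitone_outputs = OCTAVES * SEMITONES_PER_OCTAVE

# Duration of one 64-frame training sequence.
sequence_seconds = FRAMES_PER_SEQUENCE / frame_rate_hz

print(f"frame rate: {frame_rate_hz:.2f} Hz")
print(f"input pitch bins: {pitch_bins}")
print(f"semitone outputs: {semitone_outputs}")
print(f"sequence length: {sequence_seconds:.2f} s")
```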
5. RESULTS
Our experiments for testing and understanding CVC are driven by two Research Questions (RQ).

RQ 1: Can CVC serve as a proxy for model efficacy within one domain? For example, does a model that obtains higher CVC on SWD also show higher efficacy on SWD? To study this, we test whether CVC correlates with standard evaluation metrics on the three multi-version datasets.

RQ 2: Beyond this, does CVC provide insight into model robustness, i.e., the ability to generalize to out-of-domain data? Here we are particularly interested in cases where our consistency measure might be complementary to common evaluation metrics. To test this, we investigate whether CVC on the multi-version datasets correlates with standard evaluation metrics on the two unseen out-of-domain datasets.

To test for correlation, we report Spearman's rank correlation coefficient (ρ). For visualization purposes, we overlay a linear regression line on the plots.
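
For readers less familiar with Spearman's ρ: it is the Pearson correlation of the ranks, so it rewards any monotonic relationship between CVC and the reference metric, not only a linear one. The following plain-Python sketch shows the computation with average ranks for ties. The function names and the four (CVC, AP) value pairs are hypothetical and are not results from this paper; in practice a library routine such as `scipy.stats.spearmanr` would typically be used instead.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical checkpoint scores (not taken from the paper).
cvc_scores = [0.70, 0.75, 0.80, 0.85]
ap_scores = [0.50, 0.62, 0.60, 0.71]
rho = spearman_rho(cvc_scores, ap_scores)
print(rho)
```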
5.1 Proxy for Efficacy Within One Domain
To address the first research question and determine whether CVC can serve as a proxy for efficacy within a domain, we train multiple models θ, systematically varying either their architecture or training data, and compare their CVC scores with their Average Precision (AP) scores.
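
As a reminder of what AP measures, the sketch below computes it for a flat list of binary pitch activations and predicted probabilities: predictions are ranked by score, and the precision is averaged over the ranks at which an active pitch is found. How AP is aggregated over frames, pitches, and recordings in the experiments is not specified in this section, so the flattening, the function name, and the toy values are assumptions; tied scores are simply kept in input order.

```python
def average_precision(labels, scores):
    """AP for binary labels (1 = pitch active) and predicted scores.

    Ranks all predictions by descending score and averages the precision
    obtained at the rank of every positive label.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("AP is undefined without any positive label")
    return precision_sum / hits


# Hypothetical flattened frame/pitch entries.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
print(ap)
```

Unlike CVC, this computation needs the reference labels, which is exactly the dependency that the proposed measure avoids.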
Amount of Training Data: In the first experiment, we vary the amount of training data. We use a mixed train set (SWD+BPSD+BSQD) as a pool and train the ResNet-M architecture on increasing fractions of this pool, ranging from 10% to 20% up to the full dataset. Figure 2 shows one plot for each of the three multi-version test sets. To simulate models of different quality, we evaluate multiple checkpoints of each run. Each marker represents a checkpoint of a model, with the color indicating the amount of training data the model has seen. The x-axis reports CVC, while the y-axis reports the AP, both computed on the respective test set.
As expected, increasing the amount of training data generally leads to higher AP scores. However, due to the inherent randomness of deep learning, small increases in data size do not always result in higher AP. More importantly, looking at the figures and the high correlation coefficients, we can observe a clear correlation between CVC and AP on all three test sets. This suggests its potential as a proxy for model efficacy within one domain. The correlation is slightly weaker for BPSD, possibly due to its smaller AP range (≈15% vs. ≈30% in other datasets), where effects of randomness become more relevant.

Figure 2: Varying the amount of training data. CVC vs. AP (both computed on the respective test set) for ResNet-M trained on increasing fractions of SWD+BPSD+BSQD.

Footnote 3: https://github.com/yannik-venohr/ismir25-cvc-for-mpe
Domain of Training Data: In this next experiment, we explore models trained on different datasets. For this, we train ResNet-M on all possible combinations of the three multi-version datasets. Compared to the previous experiment, this can be seen as a more drastic variation, as we vary not only the amount but also the instrumentation a model has seen during training. As with the previous experiment, we compute CVC and AP on the test sets of the three multi-version datasets. Results are shown in Figure 3. First of all, we see strong differences between the three test sets. When testing within the domain of SWD (Figure 3a), we see a strong correlation (ρ = .92) between CVC and AP. When testing in the string quartet domain (BSQD, Figure 3c), we observe a first case where models have a lower AP but a higher CVC, with SWD in dark blue and BPSD in yellow, while still maintaining a strong correlation overall (ρ = .75). In contrast to this is the domain of BPSD (Figure 3b), where we only observe a weak correlation (ρ ≈ .38). In the center of the plot, there seems to be a correlation on a smaller scale between the models that have seen examples of BPSD during training. However, the models trained on SWD, BSQD, and SWD+BSQD all show low AP, but still comparable CVC. The models trained on BSQD show a particularly low AP, while having a CVC comparable to all other models. Upon examining predictions of these models more closely, we observed that they consistently miss pitches in the higher registers. While this obviously makes for a not very useful model, one could argue that making mistakes regardless of performer and recording conditions is a desired property for corpus studies. Let us take a closer look at another example. The models trained on BPSD (yellow) have a higher AP, but a lower CVC than the models trained on SWD+BSQD (light blue). We expect a higher AP, since these models have been trained in the domain where they are tested, but CVC does not seem to reflect this. We will further discuss this example in Section 5.2.

Figure 3: Varying the training domain. CVC vs. AP (both computed on the respective test set) for ResNet-M trained on combinations of SWD, BPSD and BSQD.

Figure 4: Varying model architecture. CVC vs. AP (both computed on all three multi-version datasets) for different architectures trained on SWD+BPSD+BSQD.
Model Architecture: In this experiment, we compare different model architectures while keeping the training data constant (SWD+BPSD+BSQD). Figure 4 shows the results, where colors indicate the trained architecture. Since the trends were similar across the three multi-version datasets, we combined the results of the three test domains by averaging them. Each dataset is weighted equally, regardless of its size. As expected, models with higher capacity tend to achieve higher AP. More importantly, we see that even in this experiment, we find a correlation between CVC and AP (ρ = .61), even though it is slightly weaker. Notably, compared to the previous experiment, the range of AP is relatively small (≈10%). Interestingly, differences in CVC do not always translate to differences in AP. For example, when comparing Deep Salience (yellow) and ResNet-M (grey), two models of similar size, we observe that although ResNet-M, which incorporates residual connections, achieves higher CVC, both models attain similar AP. We discuss this observation further in Section 5.2.
To summarize, most experiments show a correlation between CVC and AP. However, some individual examples suggest that CVC may capture aspects of a model that are not fully reflected in AP scores within the domain of the multi-version sets. Notably, we observed instances where a model with higher CVC did not necessarily achieve higher AP, and vice versa.
5.2 Proxy for Robustness
While the previous experiments examined the relationship between CVC and model efficacy within the same domain, we now turn our focus to robustness. To address this second research question and assess whether CVC serves as a proxy for a model's ability to generalize to out-of-domain data, we analyze its correlation with AP on the two unseen out-of-domain datasets: RWC and TRIOS. For this, we will take a closer look at the two instances where correlation was not observed in the previous experiments.

Figure 5: CVC as proxy for model robustness. CVC on BPSD vs. AP on RWC+TRIOS. As in Figure 3, we compare ResNet-M trained on the different training data.
Domain of Training Data: Let us recall the experiment on varying the domain of training data from the previous section. When evaluating in the domain of BPSD (Figure 3b), changes in AP did not necessarily correspond to CVC. We now extend this analysis by comparing CVC on BPSD to AP on RWC+TRIOS. The results are shown in Figure 5. Note that, while the meaning of the y-axis changed, the x-axis is exactly the same as in the experiment from the previous section. As opposed to the previous section, we can now see a strong correlation (ρ = .7) between CVC on BPSD and AP on RWC+TRIOS. Looking closer at our previous example, we see that the lower CVC of the models trained on BPSD (yellow) is now reflected in a lower AP on out-of-domain data, whereas the higher CVC of the models trained on SWD+BSQD (light blue) is reflected in a higher AP. This indicates that by measuring CVC, we are able to identify models that overfit to dataset biases of BPSD.
Model Architecture: We now conduct an experiment similar to the architecture comparison from the previous section (Figure 4). However, now we compare CVC on the combined three multi-version datasets to AP on RWC+TRIOS (Figure 6). The overall trends remain similar, but one difference emerges. Taking a closer look at the two models with differences in CVC but similar AP (Deep Salience and ResNet-M), we now observe that Deep Salience performs slightly worse on the out-of-domain test set compared to ResNet-M. This is also reflected in a slightly higher correlation coefficient. Even though the effect is small, this supports two insights: (1) models with residual connections may generalize better, aligning with prior findings [12], and (2) CVC might be able to capture this generalization ability.

Figure 6: CVC as proxy for model robustness. CVC on the three multi-version datasets vs. AP on RWC+TRIOS. As in Figure 4, we compare different architectures, all trained on SWD+BPSD+BSQD.
To summarize, the previous section identified instances where CVC captures model aspects not fully reflected in AP scores within multi-version datasets. For those cases, we investigated whether CVC on the multi-version datasets correlates with standard evaluation metrics on out-of-domain datasets. Our results suggest that CVC does indeed seem to provide insights into model robustness. This would indicate that CVC is not just a useful evaluation metric within the domain of the multi-version dataset, but also a valuable additional figure of merit for assessing a model's ability to generalize to out-of-domain data.
6. CONCLUSION
In this paper, we presented CVC as an annotation-free strategy for evaluating AMT models, which assesses whether a model makes consistent predictions when facing different versions of the same musical piece. We argued that CVC is, by design, a desirable property of transcription models used for corpus analysis. We showed that, in most cases, it may serve as a proxy for model efficacy within a domain. In cases where it does not, we could observe that CVC can provide insights into a model's ability to generalize to out-of-domain data, making it a powerful additional figure of merit.
In this work, we established CVC for evaluation. However, we believe that our findings lay the groundwork for a larger goal: leveraging multi-version data to improve model training. The results give confidence to the idea of cross-version based contrastive learning [17] for MPE. Inspired by the usage in [11], in future work we would like to explore using the CVC as a distance measure for contrastive learning with a Siamese Network architecture [25]. Since this study focused on frame-level transcription (MPE), our future work will explore note-level transcription, integrating onset detection [4, 16] and Transformer architectures for further improvements.
7. ACKNOWLEDGEMENTS
This work was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) within the Emmy Noether Junior Research Group on Computational Analysis of Music Audio Recordings: A Cross-Version Approach (DFG WE 6611/3-1, Grant No. 531250483).
8. REFERENCES
[1] E. Benetos, S. Dixon, Z. Duan, and S. Ewert, “Automatic music transcription: An overview,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2019.
[2] X. Serra, “The computational study of a musical culture through its digital traces,” Acta Musicologica, vol. 89, no. 1, pp. 24–44, 2017.
[3] M. Neuwirth, D. Harasim, F. C. Moss, and M. Rohrmeier, “The annotated Beethoven corpus (ABC): A dataset of harmonic analyses of all Beethoven string quartets,” Frontiers Digit. Humanit., vol. 5, p. 16, 2018.
[4] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. H. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018, pp. 50–57.
[5] C. Hawthorne, I. Simon, R. Swavely, E. Manilow, and J. H. Engel, “Sequence-to-sequence piano transcription with transformers,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Online, 2021, pp. 246–253.
[6] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-resolution piano transcription with pedals by regressing onset and offset times,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 3707–3717, 2021.
[7] J. Gardner, I. Simon, E. Manilow, C. Hawthorne, and J. H. Engel, “MT3: multi-task multitrack music transcription,” in The Tenth International Conference on Learning Representations (ICLR), 2022.
[8] S. Chang, E. Benetos, H. Kirchhoff, and S. Dixon, “YourMT3+: multi-instrument music transcription with enhanced transformer architectures and cross-dataset STEM augmentation,” in 34th IEEE International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2024, pp. 1–6.
[9] C. Weiß and G. Peeters, “Comparing deep models and evaluation strategies for multi-pitch estimation in music recordings,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 30, pp. 2814–2827, 2022.
[10] R. Geirhos, J. Jacobsen, C. Michaelis, R. S. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,” Nat. Mach. Intell., vol. 2, no. 11, pp. 665–673, 2020.
[11] A. Riou, S. Lattner, G. Hadjeres, and G. Peeters, “PESTO: pitch estimation with self-supervised transposition-equivariant objective,” in Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 535–544.
[12] C. Weiß, H. Schreiber, and M. Müller, “Local key estimation in music recordings: A case study across songs, versions, and annotators,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2919–2932, 2020.
[13] M. Krause, S. Strahl, and M. Müller, “Weakly supervised multi-pitch estimation using cross-version alignment,” in Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 289–296.
[14] L. Liu and C. Weiss, “Utilizing cross-version consistency for domain adaptation: A case study on music audio,” in The Second Tiny Papers Track at (ICLR), 2024.
[15] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello, “Deep salience representations for F0 tracking in polyphonic music,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 2017, pp. 63–70.
[16] R. M. Bittner, J. J. Bosch, D. Rubinstein, G. Meseguer-Brocal, and S. Ewert, “A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 781–785.
[17] M. Krause, C. Weiß, and M. Müller, “A cross-version approach to audio representation learning for orchestral music,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR, A. Sarti, F. Antonacci, M. Sandler, P. Bestagini, S. Dixon, B. Liang, G. Richard, and J. Pauwels, Eds., 2023, pp. 832–839.
[18] M. Müller, Fundamentals of Music Processing - Using Python and Jupyter Notebooks, Second Edition. Springer, 2021.
[19] C. Weiß, F. Zalkow, V. Arifi-Müller, M. Müller, H. V. Koops, A. Volk, and H. G. Grohganz, “Schubert winterreise dataset: A multimodal scenario for music analysis,” ACM Journal on Computing and Cultural Heritage, vol. 14, no. 2, pp. 25:1–25:18, 2021.
[20] J. Zeitler, C. Weiß, V. Arifi-Müller, and M. Müller, “BPSD: A coherent multi-version dataset for analyzing the first movements of beethoven's piano sonatas,” Trans. Int. Soc. Music. Inf. Retr., vol. 7, no. 1, pp. 195–212, 2024.
[21] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC music database: Music genre database and musical instrument sound database,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Baltimore, Maryland, USA, 2003, pp. 229–230.
[22] J. Fritsch and M. D. Plumbley, “Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 888–891.
[23] M. Müller, Y. Özer, M. Krause, T. Prätzlich, and J. Driedger, “Sync Toolbox: A Python package for efficient, robust, and accurate music synchronization,” Journal of Open Source Software (JOSS), vol. 6, no. 64, pp. 3434:1–4, 2021.
[24] T. Prätzlich, J. Driedger, and M. Müller, “Memory-restricted multiscale dynamic time warping,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 569–573.
[25] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a siamese time delay neural network,” in Advances in Neural Information Processing Systems 6, [7th NIPS Conference]. Morgan Kaufmann, 1993, pp. 737–744.