Towards Robust Automatic Music Transcription By Measuring Cross-Version Consistency

Author: Yannik Venohr; Yiwei Ding; Christof Weiss

Publisher: Zenodo

DOI: 10.5281/zenodo.17706389

Source: https://zenodo.org/records/17706389/files/000031.pdf

TOWARDS ROBUST MUSIC TRANSCRIPTION BY MEASURING
CROSS-VERSION CONSISTENCY IN WESTERN CLASSICAL MUSIC
Yannik Venohr, Yiwei Ding, Christof Weiß
Center for Artificial Intelligence and Data Science, University of Würzburg
{yannik.venohr, yiwei.ding, christof.weiss}@uni-wuerzburg.de
ABSTRACT
Automatic Music Transcription (AMT) is a central task within MIR, enabling various subsequent applications. Despite advancements thanks to deep learning, improving AMT remains challenging due to the scarcity of large, high-quality annotated datasets. Recognizing pitches in multi-instrument settings beyond solo piano is particularly difficult, as models struggle to generalize across domains due to dataset biases and overfitting. AMT research appears to have hit a glass ceiling, where further progress is difficult to achieve and to measure. To address this, we propose cross-version consistency (CVC), an annotation-free evaluation framework that measures a model's transcription consistency across different recordings of the same musical work. We formalize this concept and systematically analyze its relationship with standard evaluation metrics on the AMT subtask of multi-pitch estimation. Our results show that CVC is closely tied to standard evaluation metrics and enables model assessment using only unlabeled multi-version datasets, making it particularly valuable in domains where annotated data is scarce but multi-version recordings are easy to obtain, such as orchestral music. Beyond this, we argue that CVC is, by design, a desirable property of transcription models, and our results indicate that it can provide insights into a model's robustness, i.e., its ability to generalize to out-of-domain data.
1. INTRODUCTION
Automatic Music Transcription (AMT) aims to convert music recordings into some form of music notation, making it a powerful tool for various applications [1]. In musicology, AMT can help to analyze large collections of recorded music, including improvised performances or orally transmitted pieces, revealing patterns that might otherwise remain inaccessible [2,3]. AMT also supports music education by providing automatic transcriptions to help with learning and practice. Ultimately, an AMT system aims at generating a human-readable score. However, this goal is commonly addressed through intermediate steps with increasing levels of abstraction. These steps include frame-level, note-level, stream-level, and notation-level transcription as defined in [1].
© Yannik Venohr, Yiwei Ding and Christof Weiß. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Yannik Venohr, Yiwei Ding and Christof Weiß, "Towards Robust Music Transcription by Measuring Cross-Version Consistency in Western Classical Music", in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Figure 1: Cross-version consistency. We compare predictions $\hat{Y}_1, \hat{Y}_2$ from a model $\theta$ across two different recordings $X_1, X_2$ of the same work at musically corresponding positions.
While transcription models have shown good results in piano-only scenarios [4, 5, 6], recognizing pitches in a multi-instrument scenario beyond solo piano remains a significant challenge [1]. Recently, using large-scale Transformer architectures, some significant progress has been made for stream-level transcription [7,8]. However, when tested on unseen datasets, these models show a drastic drop in efficacy [8]. Similarly, for frame-level transcription, Weiß and Peeters [9] found that variations between architectures are often smaller than variations across training runs and even become irrelevant in cross-dataset evaluations. This highlights the long-standing problem that deep learning models tend to overfit to inherent dataset biases rather than generalizing effectively [10]. For AMT models to be practically useful for musicology, they must be robust and able to generalize to out-of-domain data.
A major challenge is the scarcity of large, unbiased, and high-quality annotated datasets. Inspired by advances in other modalities like text and images, a natural next step is to find ways to leverage unlabeled data. Early efforts in this direction include self-supervised learning approaches, such as learning from equivariance under pitch transposition [11]. However, even with improved models, the challenge of measuring progress with a limited amount of reliable test data remains unresolved.
Inspired by previous work [12, 13, 14], we propose to address this challenge by exploiting multi-version datasets of Western classical music. These datasets contain several versions (recorded performances) of the same musical work (possibly by different musicians, on different instruments, and in different recording conditions), all closely following the same score. Thus, we obtain different audio signals that carry the same musical content. This provides an opportunity to evaluate transcription models beyond standard evaluation metrics. As our main contribution, we formalize and systematically test the notion of Cross-Version Consistency (CVC). We consider a model to have a high CVC if it makes similar predictions at musically corresponding positions, regardless of performer or recording conditions (Figure 1). As this measure does not depend on annotations, it enables us to evaluate models even in domains where annotated data is scarce or unavailable. Moreover, as it implicitly captures when and which transcription errors occur rather than just how many, it may serve as a useful additional figure of merit for evaluating transcription models. While this paper aims to provide insights for the broader field of AMT, the experiments in this paper focus on the subtask of frame-level transcription, also known as Multi-Pitch Estimation (MPE). Our main contributions are (1) proposing and formalizing CVC, (2) designing experiments to systematically examine its relationship with standard evaluation metrics, and (3) showing that CVC is closely tied to both transcription capabilities and a model's ability to generalize to different domains.
The remainder of the paper is organized as follows: Section 2 reviews related work. Section 3 formalizes CVC. Section 4 presents our experimental setup. Section 5 presents results and discusses our findings. Section 6 concludes the paper.
2. RELATED WORK
AMT has been an active research area for nearly five decades [1]. Given the extensive body of work in this field, we refer to [1] for a comprehensive overview. Most state-of-the-art AMT approaches rely on deep learning with supervised training [4, 5, 6, 15, 16, 9, 7, 8], where models are trained on datasets of music recordings with aligned pitch annotations. Even though these approaches have been successful in certain domains, two key challenges remain: (1) compared to traditional signal processing techniques, deep learning models often struggle to generalize across different domains [1], and (2) in many domains such as choir or orchestral music, annotated datasets are scarce, limiting the effectiveness of supervised learning.
Exploiting multi-version datasets, for both training and evaluation, has emerged as a strategy to address these challenges. Weiß et al. [12, 9] leveraged multi-version datasets for evaluation, studying generalization across versions and analyzing the impact of different splitting strategies. Krause et al. [13, 17] explored training strategies using multi-version data. One approach employs contrastive learning, treating temporally close audio segments across versions as positive pairs and distant segments as negative pairs [17]. However, their findings suggest that the resulting representations capture instrument texture rather than pitch classes and harmonies. Another approach minimizes the distance between time–frequency representations of different versions of the same work [13], demonstrating promising MPE results. Liu and Weiß [14] used multi-version datasets for domain adaptation within a teacher–student learning paradigm. They use a notion of CVC to filter training labels by comparing teacher annotations across versions and retaining only matching annotations for student training. In contrast to [14], where CVC is considered as a filter on binary outputs, we define it as a measure on the probabilities. Moreover, we consider a larger picture, formalize and systematically explore CVC.
3. CROSS-VERSION CONSISTENCY
In this paper, we focus on MPE, aiming to train a neural network $\theta$ to estimate pitch probabilities from audio. Specifically, a model produces a sequence of pitch probability vectors $\hat{Y} = (\hat{y}(0), \ldots, \hat{y}(T))$, where each vector $\hat{y}(t) \in [0,1]^{72}$ represents the probability of pitches being active at time frame $t$. We consider a model to have a high CVC if it makes similar predictions at musically corresponding positions, regardless of differences in performer and recording conditions.
3.1 Alignment via Dynamic Time Warping
As a peculiarity of Western classical music, different versions of a work exactly follow the same score regarding pitch and note information, but are quite free with regard to global and local tempo (including fluctuations such as agogics, ritardando, or rubato). To identify musically corresponding positions despite these tempo variations, we rely on Dynamic Time Warping (DTW) [18], a well-established method for aligning time-series data. DTW yields a warping path between two sequences with lengths $N$ and $M$, denoted as $P = (p(1), \ldots, p(L))$ with
$$p(l) = (n_l, m_l) \in [1:N] \times [1:M].$$
This warping path establishes correspondences between time frames of the two sequences, enabling us to align musical positions despite tempo variations (see Figure 1 in blue). Notably, although DTW is computed using audio feature sequences, it can also be applied to align predictions, since we use the same feature rate for both.
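The warping path described above can be illustrated with textbook DTW. The following is a minimal sketch under stated assumptions, not the memory-restricted multi-scale variant used in the experiments: the function name `dtw_path`, the step sizes (1,1), (1,0), (0,1), the absolute-difference cost, and the toy sequences are all chosen here for illustration only.

```python
def dtw_path(cost):
    """Textbook DTW on an N x M cost matrix given as a list of lists.

    Returns the optimal warping path as 1-based index pairs (n_l, m_l),
    starting at (1, 1) and ending at (N, M).
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # Accumulated cost with an extra border row/column for initialization.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev

    # Backtrack from (N, M) to (1, 1); ties prefer the diagonal step.
    i, j = n, m
    path = [(i, j)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(c for c in candidates if c[1] >= 1 and c[2] >= 1)
        path.append((i, j))
    path.reverse()
    return path


# Toy example: the second sequence holds its first value one frame longer.
seq_a = [0.0, 1.0, 2.0]
seq_b = [0.0, 0.0, 1.0, 2.0]
cost = [[abs(a - b) for b in seq_b] for a in seq_a]
path = dtw_path(cost)
print(path)  # [(1, 1), (1, 2), (2, 3), (3, 4)]
```

The path maps the first frame of `seq_a` to the first two frames of `seq_b`, which is the kind of local tempo difference the alignment is meant to absorb.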
3.2 Cross-Version Consistency
Given two audio signals $X_1$ and $X_2$ and their respective pitch predictions $\hat{Y}_1 = \theta(X_1) \in \mathbb{R}^{N \times 72}$ and $\hat{Y}_2 = \theta(X_2) \in \mathbb{R}^{M \times 72}$, we compute the similarity between aligned frames along the warping path. For the $l$-th element $p(l)$ in the warping path $P$, we define the frame-level similarity using a suitable similarity measure (e.g., cosine similarity) as
$$s(l) = \mathrm{cossim}\big(\hat{Y}_1(n_l), \hat{Y}_2(m_l)\big) \in [0,1].$$
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Since some versions may be performed in different keys, we transpose $\hat{Y}$ accordingly before computing the similarity. For a given model $\theta$ and two recordings $X_1$ and $X_2$ of the same musical work, we define CVC as the average frame-level similarity across the aligned frames:
$$\mathrm{CVC}(\theta, X_1, X_2) = \frac{1}{L} \sum_{l=1}^{L} s(l).$$
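The pairwise definition above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names `cosine_similarity`, `transpose`, and `cvc_pair` are hypothetical, the toy vectors have three pitch bins instead of 72, and two details are assumptions not stated in the text (an all-zero frame is given similarity 0, and bins shifted outside the pitch range during transposition are dropped).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; lies in [0, 1] for non-negative probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: silent (all-zero) frames count as dissimilar
    return dot / (norm_u * norm_v)


def transpose(frame, semitones):
    """Shift a pitch vector by a number of semitone bins, zero-filling the rest."""
    shifted = [0.0] * len(frame)
    for k, prob in enumerate(frame):
        if 0 <= k + semitones < len(frame):
            shifted[k + semitones] = prob
    return shifted


def cvc_pair(y1, y2, path, semitones=0):
    """Mean frame-level similarity along a 1-based warping path.

    `semitones` is the shift applied to y2 to bring it into the key of y1.
    """
    sims = [
        cosine_similarity(y1[n - 1], transpose(y2[m - 1], semitones))
        for n, m in path
    ]
    return sum(sims) / len(sims)


# Toy predictions with 3 pitch bins; version 2 is one frame longer.
y1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y2_consistent = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y2_inconsistent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
path = [(1, 1), (1, 2), (2, 3)]

print(cvc_pair(y1, y2_consistent, path))    # 1.0
print(cvc_pair(y1, y2_inconsistent, path))  # about 0.667 (one of three frames disagrees)
```

Because the similarity is computed on probabilities rather than thresholded outputs, partially agreeing frames contribute fractional values instead of a binary match.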
For a given set of recordings $D$, we define CVC to be computed for all pairwise combinations of different versions of each work and averaged over works. By design, this measure assesses a model's robustness against variations in performer and recording conditions. We argue that being consistent across differing conditions is particularly desirable for applications like corpus analysis. Moreover, as it implicitly captures when and which transcription errors occur rather than just how many, it offers a complementary perspective to standard evaluation metrics. While, by definition, a model making constant but incorrect predictions (e.g., predicting the same pitch for every frame) will yield a high CVC score, we note that this does not invalidate the metric since we consider CVC as an additional figure of merit rather than a standalone quality indicator.
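The dataset-level aggregation can be sketched as follows. This is one possible reading of the definition, stated as an assumption: pairwise scores are first averaged within each work and the per-work means are then averaged with equal weight. The function name `dataset_cvc`, the recording identifiers, and the toy scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def dataset_cvc(works, pair_cvc):
    """Average pairwise CVC within each work, then average over works.

    works:    dict mapping a work id to a list of version identifiers
    pair_cvc: callable returning the CVC of two versions of the same work
    """
    per_work = []
    for versions in works.values():
        pairs = list(combinations(versions, 2))
        if not pairs:
            continue  # a work with a single version has no pair to compare
        per_work.append(mean([pair_cvc(a, b) for a, b in pairs]))
    return mean(per_work)


# Toy data: precomputed pairwise scores stand in for real model predictions.
works = {
    "work_1": ["w1_a", "w1_b", "w1_c"],
    "work_2": ["w2_a", "w2_b"],
}
toy_scores = {
    ("w1_a", "w1_b"): 0.9,
    ("w1_a", "w1_c"): 0.6,
    ("w1_b", "w1_c"): 0.3,
    ("w2_a", "w2_b"): 0.8,
}

result = dataset_cvc(works, lambda a, b: toy_scores[(a, b)])
print(round(result, 3))  # 0.7
```

Averaging per work first keeps a work with many versions from dominating the score, since the number of pairs grows quadratically with the number of versions.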
4. EXPERIMENTAL SETUP
To study our proposed consistency measure, we conduct a series of experiments on the relationship between CVC on multi-version datasets and common evaluation metrics on labeled test sets.
4.1 Datasets and Splits
At the center of all our experiments are three structured multi-version datasets: Schubert Winterreise Dataset (SWD) [19], Beethoven Piano Sonata Dataset (BPSD) [20], and Beethoven String Quartet Dataset (BSQD) [to be published soon] (see footnote 1). Since these datasets are used for training and for evaluation, we split them into subsets. To avoid overfitting to specific recording conditions ("version effect", [9]) or melodic/harmonic patterns in a work ("cover song effect", [9]), we ensure that models are trained on neither the versions nor the works they are tested on, and always use a strict "neither split" (see Fig. 4 in [12]).
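The idea of a "neither split" can be sketched as follows. This is an illustrative sketch, assuming that recordings sharing only the work or only the version with the test set are discarded; the exact procedure is the one described in [12], and the function name `neither_split` and the toy identifiers are hypothetical.

```python
def neither_split(recordings, test_works, test_versions):
    """Split (work, version) pairs so that train and test share neither
    works nor versions. Recordings that overlap with the test set in only
    one of the two dimensions are left out of both subsets."""
    train, test = [], []
    for work, version in recordings:
        in_test_work = work in test_works
        in_test_version = version in test_versions
        if in_test_work and in_test_version:
            test.append((work, version))
        elif not in_test_work and not in_test_version:
            train.append((work, version))
    return train, test


# Toy dataset: 4 works, each available in 3 versions.
recordings = [(w, v) for w in (1, 2, 3, 4) for v in ("A", "B", "C")]
train, test = neither_split(recordings, test_works={4}, test_versions={"C"})

print(len(train), len(test))  # 6 1
```

In this toy setup five of the twelve recordings are unused, which is the price paid for ruling out both the version effect and the cover song effect at once.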
For our studies on generalization, we use two further high-quality datasets as unseen test sets: the classical subset of the Real World Computing Music Database (RWC) [21] and TRIOS [22]. These datasets, which include orchestral, horn, flute, cembalo, and organ pieces, are used exclusively for testing. To ensure they represent out-of-domain data, we exclude pieces with instrumentations matching our three multi-version datasets, leaving 35 of the 50 pieces in RWC. Table 1 provides an overview.
1 A multi-version dataset comprising 6-7 versions of L. v. Beethoven's complete string quartets with annotations derived from the symbolic ABC Corpus [3].
Ref.    Name   Inst.           hh:mm  W × V
[19]    SWD    Piano, Singing  10:49  24 × 9
[20]    BPSD   Piano           41:08  32 (a) × 11
t.b.p.  BSQD   Strings         62:12  70 (b) × 9
[21]    RWC-C  Mixed            5:21  35 × 1
[22]    TRIOS  Mixed            0:03   5 × 1
Table 1: Datasets used in this paper. W: Number of unique works. V: Number of unique versions. (a) The first movements of the 32 sonatas. (b) 16 full string quartets.
Ref.  Architecture   Parameters
[16]  Basic Pitch        13,320
[15]  Deep Salience     406,453
[9]   ResNet-S          393,535
[9]   ResNet-M        1,512,783
[9]   ResNet-L        4,555,683
Table 2: Model architectures used in this paper. Each model is based closely on the referenced work, though minor differences may exist due to reimplementation (e.g., slight variations in parameter counts).
4.2 Models
In this paper, we focus on demonstrating the potential of CVC as a measure for evaluating MPE models. Rather than proposing complex network architectures or conducting extensive hyperparameter tuning, we aim to assess a range of commonly used architectures. We implement five fully convolutional deep learning models that are closely inspired by prior work, making minimal adjustments to fit our training framework. The smallest model is based on the note activation component of the Notes and Multipitch (NMP) model [16] (see footnote 2), adjusted to work with fewer harmonics, which we will refer to as Basic Pitch. Another model builds upon the Deep Salience network [15], with an additional layer to quantize outputs into semitone bins. Lastly, we use three different sizes of a deep convolutional network with additional residual connections (ResNet), as described in [9]. In Table 2, we show model sizes ranging from 13 thousand to 4.5 million parameters.
4.3 Implementation Details
All models are trained from scratch using the Adam optimizer for 50 epochs, with binary cross-entropy as the loss function. We set the learning rate to 0.0005. The input to all models is a Harmonic Constant-Q Transform (HCQT) [15] with five harmonics and one subharmonic spanning six octaves (C1–C7) with three bins per semitone, resulting in 216 pitch bins. We use a sample rate of 22.05 kHz and an HCQT hop size of 512 samples, yielding a frame rate of approximately 43 Hz. We process sequences of 64 frames (roughly 1.5 seconds of audio) in mini-batches of 128 samples. To mitigate and analyze the impact of randomness [9], we repeat training three times using different seeds. For each training run, we save and evaluate checkpoints after every five epochs. DTW is computed using the SyncToolbox [23] implementation of memory-restricted multi-scale DTW (MrMsDTW) [24] using chroma and onset features at a frame rate of approximately 43 Hz. For further details, refer to the source code (see footnote 3).
2 Please note that in [16], "note" refers to quantized pitch on a semitone axis, in contrast to sub-semitone pitch contours.
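The numbers quoted in the implementation details follow from simple arithmetic, reproduced here as a quick check. The variable names are chosen for illustration; the link between six octaves at semitone resolution and the 72-dimensional prediction vectors of Section 3 is an inference from the stated ranges, not a statement from the text.

```python
sample_rate_hz = 22050        # 22.05 kHz
hop_size_samples = 512        # HCQT hop size
frames_per_segment = 64       # sequence length fed to the models

octaves = 6                   # C1 to C7
semitones_per_octave = 12
bins_per_semitone = 3

frame_rate_hz = sample_rate_hz / hop_size_samples
segment_seconds = frames_per_segment / frame_rate_hz
pitch_bins = octaves * semitones_per_octave * bins_per_semitone
semitone_bins = octaves * semitones_per_octave  # inferred output resolution

print(round(frame_rate_hz, 2))    # 43.07
print(round(segment_seconds, 2))  # 1.49
print(pitch_bins)                 # 216
print(semitone_bins)              # 72
```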
5. RESULTS
Our experiments for testing and understanding CVC are driven by two Research Questions (RQ).
RQ 1: Can CVC serve as a proxy for model efficacy within one domain? For example, does a model that obtains higher CVC on SWD also show higher efficacy on SWD? To study this, we test whether CVC correlates with standard evaluation metrics on the three multi-version datasets.
RQ 2: Beyond this, does CVC provide insight into model robustness, i.e., the ability to generalize to out-of-domain data? Here we are particularly interested in cases where our consistency measure might be complementary to common evaluation metrics. To test this, we investigate whether CVC on the multi-version datasets correlates with standard evaluation metrics on the two unseen out-of-domain datasets.
To test for correlation, we report Spearman's rank correlation coefficient (ρ). For visualization purposes, we overlay a linear regression line on the plots.
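For readers unfamiliar with Spearman's ρ, the sketch below computes it as the Pearson correlation of the ranks, using average ranks for ties. It is a plain-Python illustration with hypothetical function names; the CVC and AP values are made up for the example and are not results from the experiments.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five hypothetical checkpoints (not taken from the paper).
cvc_scores = [0.70, 0.75, 0.80, 0.85, 0.90]
ap_scores = [0.50, 0.55, 0.53, 0.62, 0.70]

rho = spearman(cvc_scores, ap_scores)
print(round(rho, 3))  # 0.9
```

Because only the ordering matters, ρ reaches 1 for any monotonically increasing relationship, which suits the question of whether a higher CVC goes along with a higher AP.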
5.1 Proxy for Efficacy Within One Domain
To address the first research question and determine whether CVC can serve as a proxy for efficacy within a domain, we train multiple models $\theta$, systematically varying either their architecture or training data, and compare their CVC scores with their Average Precision (AP) scores.
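As a reminder of what AP measures, the sketch below uses a common ranking-based definition: the mean of the precision values at each correctly retrieved positive. It assumes that all frame/pitch entries are flattened into one list and that scores contain no ties; whether this matches the exact AP variant used in the experiments is not stated in the text, and the function name and toy values are hypothetical.

```python
def average_precision(labels, scores):
    """Mean precision at the rank of each positive, ranking by descending score.

    labels: iterable of 0/1 ground-truth activity values
    scores: iterable of predicted probabilities, same length as labels
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("AP is undefined without positive labels")
    return sum(precisions) / len(precisions)


# Toy example: four frame/pitch entries, two of them truly active.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

ap = average_precision(labels, scores)
print(round(ap, 3))  # 0.833
```

Unlike CVC, this computation needs the ground-truth `labels`, which is exactly the dependency that an annotation-free measure avoids.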
Amount of Training Data: In the first experiment, we vary the amount of training data. We use a mixed train set (SWD+BPSD+BSQD) as a pool and train the ResNet-M architecture on increasing fractions of this pool, ranging from 10% to 20% up to the full dataset. Figure 2 shows one plot for each of the three multi-version test sets. To simulate models of different quality, we evaluate multiple checkpoints of each run. Each marker represents a checkpoint of a model, with the color indicating the amount of training data the model has seen. The x-axis reports CVC, while the y-axis reports the AP, both computed on the respective test set.
As expected, increasing the amount of training data generally leads to higher AP scores. However, due to the inherent randomness of deep learning, small increases in data size do not always result in higher AP. More importantly, looking at the figures and the high correlation coefficients, we can observe a clear correlation between CVC and AP on all three test sets. This suggests its potential as a proxy for model efficacy within one domain. The correlation is slightly weaker for BPSD, possibly due to its smaller AP range (≈15% vs. ≈30% in other datasets), where effects of randomness become more relevant.
3 https://github.com/yannik-venohr/ismir25-cvc-for-mpe
Figure 2: Varying the amount of training data. CVC vs. AP, both computed on the respective test set, for ResNet-M trained on increasing fractions of SWD+BPSD+BSQD.
Domain of Training Data: In this next experiment, we explore models trained on different datasets. For this, we train ResNet-M on all possible combinations of the three multi-version datasets. Compared to the previous experiment, this can be seen as a more drastic variation, as we not only vary the amount, but also the instrumentation a model has seen during training. As with the previous experiment, we compute CVC and AP on the test sets of the three multi-version datasets. Results are shown in Figure 3. First of all, we see strong differences between the three test sets. When testing within the domain of SWD (Figure 3a), we see a strong correlation (ρ=.92) between CVC and AP. When testing in the string quartet domain (BSQD, Figure 3c), we observe a first case where models have a lower AP but a higher CVC, with SWD in dark blue and BPSD in yellow, while still maintaining a strong correlation overall (ρ=.75). In contrast to this is the domain of BPSD (Figure 3b), where we only observe a weak correlation (ρ≈.38). In the center of the plot, there seems to be a correlation on a smaller scale between the models that have seen examples of BPSD during training. However, the models trained on SWD, BSQD, and SWD+BSQD all show low AP, but still comparable CVC. The models trained on BSQD show a particularly low AP, while having a CVC comparable to all other models. Upon examining predictions of these models more closely, we observed that they consistently miss pitches in the higher registers. While this obviously makes for a not very useful model, one could argue that making mistakes regardless of performer and recording conditions is a desired property for corpus studies. Let us take a closer look at another example. The models trained on BPSD (yellow) have a higher AP, but a lower CVC than the models trained on SWD+BSQD (light blue). We expect a higher AP, since these models have been trained in the domain where they are tested, but CVC does not seem to reflect this. We will further discuss this example in Section 5.2.
Figure 3: Varying the training domain. CVC vs. AP, both computed on the respective test set, for ResNet-M trained on combinations of SWD, BPSD and BSQD.
Figure 4: Varying model architecture. CVC vs. AP, both computed on all three multi-version datasets, for different architectures trained on SWD+BPSD+BSQD.
Model Architecture: In this experiment, we compare different model architectures while keeping the training data constant (SWD+BPSD+BSQD). Figure 4 shows the results, where colors indicate the trained architecture. Since the trends were similar across the three multi-version datasets, we combined the results of the three test domains by averaging them. Each dataset is weighted equally, regardless of its size. As expected, models with higher capacity tend to achieve higher AP. More importantly, we see that even in this experiment, we find a correlation between CVC and AP (ρ=.61), even though it is slightly weaker. Notably, compared to the previous experiment, the range of AP is relatively small (≈10%). Interestingly, differences in CVC do not always translate to differences in AP. For example, when comparing Deep Salience (yellow) and ResNet-M (grey), two models of similar size, we observe that although ResNet-M, which incorporates residual connections, achieves higher CVC, both models attain similar AP. We discuss this observation further in Section 5.2.
To summarize, most experiments show a correlation between CVC and AP. However, some individual examples suggest that CVC may capture aspects of a model that are not fully reflected in AP scores within the domain of the multi-version sets. Notably, we observed instances where a model with higher CVC did not necessarily achieve higher AP, and the other way around.
5.2 Proxy for Robustness
While the previous experiments examined the relationship between CVC and model efficacy within the same domain, we now turn our focus to robustness. To address this second research question and assess whether CVC serves as a proxy for a model's ability to generalize to out-of-domain data, we analyze its correlation with AP on the two unseen out-of-domain datasets: RWC and TRIOS. For this, we will take a closer look at the two instances where correlation was not observed in the previous experiments.
Figure 5: CVC as proxy for model robustness. CVC on BPSD vs. AP on RWC+TRIOS. As in Figure 3, we compare ResNet-M trained on the different training data.
Domain of Training Data: Let us recall the experiment on varying the domain of training data from the previous section. When evaluating in the domain of BPSD (Figure 3b), changes in AP did not necessarily correspond to CVC. We now extend this analysis by comparing CVC on BPSD to AP on RWC+TRIOS. The results are shown in Figure 5. Note that, while the meaning of the y-axis changed, the x-axis is exactly the same as in the experiment from the previous section. As opposed to the previous section, we can now see a strong correlation (ρ=.7) between CVC on BPSD and AP on RWC+TRIOS. Looking closer at our previous example, we see that the lower CVC of the models trained on BPSD (yellow) is now reflected in a lower AP on out-of-domain data, whereas the higher CVC of the models trained on SWD+BSQD (light blue) is reflected in a higher AP. This indicates that by measuring CVC, we are able to identify models that overfit to dataset biases of BPSD.
Model Architecture: We now conduct an experiment similar to the architecture comparison from the previous section (Figure 4). However, now we compare CVC on the combined three multi-version datasets to AP on RWC+TRIOS (Figure 6). The overall trends remain similar, but one difference emerges. Taking a closer look at the two models with differences in CVC but similar AP (Deep Salience and ResNet-M), we now observe that Deep Salience performs slightly worse on the out-of-domain test set compared to ResNet-M. This is also reflected in a slightly higher correlation coefficient. Even though the effect is small, this supports two insights: (1) models with residual connections may generalize better, aligning with prior findings [12], and (2) CVC might be able to capture this generalization ability.
Figure 6: CVC as proxy for model robustness. CVC on the three multi-version datasets vs. AP on RWC+TRIOS. As in Figure 4, we compare different architectures, all trained on SWD+BPSD+BSQD.
To summarize, the previous section identified instances where CVC captures model aspects not fully reflected in AP scores within multi-version datasets. For those cases, we investigated whether CVC on the multi-version datasets might correlate with standard evaluation metrics on out-of-domain datasets. Our results suggest that CVC does indeed seem to provide insights into model robustness. This would indicate that CVC is not just a useful evaluation metric within the domain of the multi-version dataset, but also a valuable additional figure of merit for assessing a model's ability to generalize to out-of-domain data.
6. CONCLUSION
In this paper, we presented CVC as an annotation-free strategy for evaluating AMT models, which assesses whether a model makes consistent predictions when facing different versions of the same musical piece. We argued that CVC is, by design, a desirable property for transcription models used for corpus analysis. We showed that, in most cases, it may serve as a proxy for model efficacy within a domain. In cases where it does not, we could observe that CVC can provide insights into a model's ability to generalize to out-of-domain data, making it a powerful additional figure of merit.
In this work, we established CVC for evaluation. However, we believe that our findings lay the groundwork for a larger goal: leveraging multi-version data to improve model training. The results give confidence to the idea of cross-version based contrastive learning [17] for MPE. Inspired by the usage in [11], in future work we would like to explore using the CVC as a distance measure for contrastive learning with a Siamese Network architecture [25]. Since this study focused on frame-level transcription (MPE), our future work will explore note-level transcription, integrating onset detection [4, 16] and Transformer architectures for further improvements.
7. ACKNOWLEDGEMENTS
This work was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) within the Emmy Noether Junior Research Group on Computational Analysis of Music Audio Recordings: A Cross-Version Approach (DFG WE 6611/3-1, Grant No. 531250483).
8. REFERENCES
[1] E. Benetos, S. Dixon, Z. Duan, and S. Ewert, "Automatic music transcription: An overview," IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2019.
[2] X. Serra, "The computational study of a musical culture through its digital traces," Acta Musicologica, vol. 89, no. 1, pp. 24–44, 2017.
[3] M. Neuwirth, D. Harasim, F. C. Moss, and M. Rohrmeier, "The annotated Beethoven corpus (ABC): A dataset of harmonic analyses of all Beethoven string quartets," Frontiers Digit. Humanit., vol. 5, p. 16, 2018.
[4] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. H. Engel, S. Oore, and D. Eck, "Onsets and frames: Dual-objective piano transcription," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018, pp. 50–57.
[5] C. Hawthorne, I. Simon, R. Swavely, E. Manilow, and J. H. Engel, "Sequence-to-sequence piano transcription with transformers," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Online, 2021, pp. 246–253.
[6] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, "High-resolution piano transcription with pedals by regressing onset and offset times," IEEE ACM Trans. Audio Speech Lang. Process., vol. 29, pp. 3707–3717, 2021.
[7] J. Gardner, I. Simon, E. Manilow, C. Hawthorne, and J. H. Engel, "MT3: multi-task multitrack music transcription," in The Tenth International Conference on Learning Representations (ICLR), 2022.
[8] S. Chang, E. Benetos, H. Kirchhoff, and S. Dixon, "YourMT3+: multi-instrument music transcription with enhanced transformer architectures and cross-dataset STEM augmentation," in 34th IEEE International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2024, pp. 1–6.
[9] C. Weiß and G. Peeters, "Comparing deep models and evaluation strategies for multi-pitch estimation in music recordings," IEEE ACM Trans. Audio Speech Lang. Process., vol. 30, pp. 2814–2827, 2022.
[10] R. Geirhos, J. Jacobsen, C. Michaelis, R. S. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, "Shortcut learning in deep neural networks," Nat. Mach. Intell., vol. 2, no. 11, pp. 665–673, 2020.
[11] A. Riou, S. Lattner, G. Hadjeres, and G. Peeters, "PESTO: pitch estimation with self-supervised transposition-equivariant objective," in Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 535–544.
[12] C. Weiß, H. Schreiber, and M. Müller, "Local key estimation in music recordings: A case study across songs, versions, and annotators," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2919–2932, 2020.
[13] M. Krause, S. Strahl, and M. Müller, "Weakly supervised multi-pitch estimation using cross-version alignment," in Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 289–296.
[14] L. Liu and C. Weiss, "Utilizing cross-version consistency for domain adaptation: A case study on music audio," in The Second Tiny Papers Track at ICLR, 2024.
[15] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello, "Deep salience representations for F0 tracking in polyphonic music," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 2017, pp. 63–70.
[16] R. M. Bittner, J. J. Bosch, D. Rubinstein, G. Meseguer-Brocal, and S. Ewert, "A lightweight instrument-agnostic model for polyphonic note transcription and multipitch estimation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 781–785.
[17] M. Krause, C. Weiß, and M. Müller, "A cross-version approach to audio representation learning for orchestral music," in Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR), A. Sarti, F. Antonacci, M. Sandler, P. Bestagini, S. Dixon, B. Liang, G. Richard, and J. Pauwels, Eds., 2023, pp. 832–839.
[18] M. Müller, Fundamentals of Music Processing - Using Python and Jupyter Notebooks, Second Edition. Springer, 2021.
[19] C. Weiß, F. Zalkow, V. Arifi-Müller, M. Müller, H. V. Koops, A. Volk, and H. G. Grohganz, "Schubert Winterreise dataset: A multimodal scenario for music analysis," ACM Journal on Computing and Cultural Heritage, vol. 14, no. 2, pp. 25:1–25:18, 2021.
[20] J. Zeitler, C. Weiß, V. Arifi-Müller, and M. Müller, "BPSD: A coherent multi-version dataset for analyzing the first movements of Beethoven's piano sonatas," Trans. Int. Soc. Music. Inf. Retr., vol. 7, no. 1, pp. 195–212, 2024.
[21] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Music genre database and musical instrument sound database," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Baltimore, Maryland, USA, 2003, pp. 229–230.
[22] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013, pp. 888–891.
[23] M. Müller, Y. Özer, M. Krause, T. Prätzlich, and J. Driedger, "Sync Toolbox: A Python package for efficient, robust, and accurate music synchronization," Journal of Open Source Software (JOSS), vol. 6, no. 64, pp. 3434:1–4, 2021.
[24] T. Prätzlich, J. Driedger, and M. Müller, "Memory-restricted multiscale dynamic time warping," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 569–573.
[25] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, "Signature verification using a siamese time delay neural network," in Advances in Neural Information Processing Systems 6, [7th NIPS Conference]. Morgan Kaufmann, 1993, pp. 737–744.
