COUNT THE NOTES: HISTOGRAM-BASED SUPERVISION FOR
AUTOMATIC MUSIC TRANSCRIPTION
Jonathan Yaffe¹, Ben Maman², Meinard Müller², Amit H. Bermano¹
¹ Tel Aviv University, Israel
² International Audio Laboratories Erlangen
[email protected], [email protected]
ABSTRACT
Automatic Music Transcription (AMT) converts audio recordings into symbolic musical representations. Training deep neural networks (DNNs) for AMT typically requires strongly aligned training pairs with precise frame-level annotations. Since creating such datasets is costly and impractical for many musical contexts, weakly aligned approaches using segment-level annotations have gained traction. However, existing methods often rely on Dynamic Time Warping (DTW) or soft alignment loss functions, both of which still require local semantic correspondences, making them error-prone and computationally expensive. In this article, we introduce CountEM, a novel AMT framework that eliminates the need for explicit local alignment by leveraging note event histograms as supervision, enabling lighter computations and greater flexibility. Using an Expectation-Maximization (EM) approach, CountEM iteratively refines predictions based solely on note occurrence counts, significantly reducing annotation efforts while maintaining high transcription accuracy. Experiments on piano, guitar, and multi-instrument datasets demonstrate that CountEM matches or surpasses existing weakly supervised methods, improving AMT's robustness, scalability, and efficiency. Our project page is available at https://yoni-yaffe.github.io/count-the-notes
© J. Yaffe, B. Maman, M. Müller, and A.H. Bermano. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. Yaffe, B. Maman, M. Müller, and A.H. Bermano, “Count The Notes: Histogram-Based Supervision for Automatic Music Transcription”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

1. INTRODUCTION
Automatic Music Transcription (AMT) converts audio recordings into symbolic, score-like representations. As a core task in Music Information Retrieval (MIR), AMT has applications in music education, analysis, production, and neural generation. However, it remains challenging, particularly for polyphonic and multi-instrument recordings, due to overlapping harmonics, complex timbres, and varying acoustic environments. Most AMT systems rely on strongly aligned training data, where each audio frame has an exact corresponding label [1–4]. While effective, creating such datasets is costly and labor-intensive, restricting AMT models to specific instruments, styles, and acoustic conditions. As an alternative, semi-supervised learning methods use weakly aligned segment-level annotations rather than frame-level labels, showing that imperfect supervision, such as unaligned transcriptions from different performances of the same piece, can still provide useful training targets [5–8].
One such method, NoteEM [5], applies an Expectation-Maximization (EM) framework to iteratively refine weak labels. Beginning with a transcriber trained on synthetic data, it alternates between aligning weak labels using the network's predicted features, and training the network with these labels. This strategy has achieved high transcription accuracy across diverse musical styles and instruments [5–7]. However, alignment methods like Dynamic Time Warping (DTW) [9] introduce synchronization errors, computational overhead, and label inconsistencies, even with improved neural features. This is especially true for note onset detection, where high temporal precision is crucial [1, 2, 5, 6]. Most critically, such approaches assume weak labels preserve event order, even if misaligned, an assumption that often fails in real-world scenarios, such as in arpeggios, where chords are performed as sequential notes.
As the main contribution of this article, we introduce CountEM, a novel AMT framework leveraging an even weaker form of supervision: note event counting, integrated with the Expectation-Maximization (EM) algorithm. Unlike supervised or weakly supervised methods that require structural alignment, CountEM uses note onset histograms to iteratively refine predictions and temporal estimates. A key insight of CountEM is that strict alignment steps based on approximate temporal ordering, enforced by methods like DTW, can be relaxed or eliminated. Instead of enforcing structure-preserving alignment, CountEM counts note onsets within large time windows, using these counts alone as supervision. This reduces annotation effort while improving efficiency, flexibility, and robustness. Compared to DTW-based methods, histogram-based alignment is computationally simpler, and minimizes alignment errors caused by structural variations in musical performances.

[Figure 1 image. Top, "Histogram Supervision": pitch versus time-window panels labeled Accurate Labeling, Ordering Inaccuracy, Timing Inaccuracy, and Translation Inaccuracy, with their per-pitch counts. Bottom, pipeline labels: Input, Transcriber, Predicted Onsets, Peak Picking, Estimated Labels.]
Figure 1. Estimating aligned labels from histograms by peak-picking. For each note in the histogram, the K most likely timings are selected according to the current predicted posteriorgram. Since misaligned labels reduce to the same histogram (top), possible timing inaccuracies common in weakly-aligned labels can be overcome.
To demonstrate the effectiveness of CountEM, we adapt the NoteEM framework to our histogram-based supervision approach. The model is initially trained on synthetic data, or other timing-accurate sources, before undergoing an iterative process of labeling and training. During labeling, the model generates onset estimates for each pitch over the prediction temporal window, and the K most probable timings are selected, where K is the supervised event count. See also Figure 1 for an illustration of the process. This method is applicable at various granularities, from entire audio tracks to smaller segments of 30 seconds, with longer windows providing weaker supervision. We evaluate CountEM on real-world datasets, showcasing its ability to generalize across diverse musical contexts, and demonstrate that it matches or surpasses existing weakly supervised methods. Even with large window sizes (up to entire tracks), it maintains high transcription accuracy. Furthermore, we demonstrate that CountEM is robust to misalignments and annotation errors, enhancing AMT's scalability and extending its applicability to under-documented musical traditions.
The remainder of this article describes our approach in detail. Section 2 introduces the methods underlying CountEM, followed by Section 3, which presents the experiments and evaluation. Section 4 discusses key findings and implications, with directions for future research. Code and qualitative samples can be found on our project page (https://yoni-yaffe.github.io/count-the-notes).
2. METHOD
Note histograms in musical performances can often be accurately derived from sheet music, particularly for Western classical music, which follows a musical score. CountEM leverages this information as coarse supervision for music transcription. The central insight of this work is that such counting supervision, which is easy to label and does not require precise timing or note ordering, can be a sufficient training signal. A second insight is that note onsets are prominent features in musical performances and remain consistent between a score and its rendition: if a note occurs K times in a musical score, then K onsets of that note will be perceived in an actual performance of that score. Indeed, studies on audio–score synchronization demonstrate improved alignment robustness when incorporating onset features [5, 10–12].
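As a minimal illustration of the counting supervision described above, the sketch below derives a pitch histogram from a list of score notes. The note representation (pairs of onset time and MIDI pitch) and the function name `note_histogram` are assumptions made for this example, not part of the paper's code. It also shows that reordering or re-timing the notes, as in an arpeggiated chord, leaves the histogram unchanged.

```python
from collections import Counter

def note_histogram(notes):
    """Count how often each pitch occurs, ignoring onset times and order.

    `notes` is an iterable of (onset_seconds, midi_pitch) pairs (assumed format).
    """
    return Counter(pitch for _, pitch in notes)

# A C-major chord as written in the score: three simultaneous onsets.
score_chord = [(0.0, 60), (0.0, 64), (0.0, 67)]
# The same chord performed as an arpeggio: sequential, slightly late onsets.
performed_arpeggio = [(0.05, 67), (0.12, 60), (0.20, 64)]

# Both renditions reduce to the same histogram.
assert note_histogram(score_chord) == note_histogram(performed_arpeggio)
```

Because the histogram discards both order and timing, a score and an expressive performance of it yield the same supervision target whenever the same notes are played.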
Other performance aspects, such as relative note timing, durations, intensity, and pitch fluctuations, vary by performer and interpretation. Traditional audio–score synchronization algorithms struggle with these variations, especially in polyphonic music, often leading to alignment errors [5, 6]. These errors stem from expressive timing and minor shifts in note order, such as in arpeggios. Effective alignment algorithms, especially for note onsets, must accommodate such variations. Recent transcription methods use DTW with neural onset features, followed by a refinement step that applies local temporal adjustments to each note independently [5, 6].
In contrast, our method alleviates the need for alignment and DTW by adopting a simpler, more flexible approach. Instead of enforcing strict temporal alignment, we use peak-picking to identify the K most probable onsets in a temporal window based on local maxima in the output signal. This straightforward, optimization-free process is robust to structural, timing, and ordering inaccuracies. The method follows an EM loop, alternating between label refinement and model improvement (see Section 2.1). The E-step refines labels using peak-picking (Section 2.2), while the M-step updates network parameters.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Algorithm 1 CountEM
Input: audio $a_1, \ldots, a_N$, histograms $h_1, \ldots, h_N \in \mathbb{N}_0^P$
Output: model $f_\Theta$, labels $Y_1, \ldots, Y_N \in \{0,1\}^{T \times P}$
  pre-train $f_\Theta$ (synthetic / other instrument)
  $Y_i, d_i^{\mathrm{hist}} = \mathrm{None}, \infty$ for $i = 1, \ldots, N$
  repeat
    for $i = 1$ to $N$ do
      $Y_i^{\mathrm{temp}} = \mathrm{PeakPick}(f_\Theta(a_i), h_i)$
      $h_i^{\mathrm{pred}} = \sum_{t=1}^{T} f_\Theta(a_i)_t \in \mathbb{R}_+^P$
      $d_i^{\mathrm{temp}} = \| h_i^{\mathrm{pred}} - h_i \|_2$
      if $d_i^{\mathrm{temp}} < d_i^{\mathrm{hist}}$ then
        $Y_i, d_i^{\mathrm{hist}} = Y_i^{\mathrm{temp}}, d_i^{\mathrm{temp}}$
      end if
    end for
    $\Theta = \arg\min_{\Theta'} \sum_{i=1}^{N} \mathrm{BCE}(f_{\Theta'}(a_i), Y_i)$
  until $\sum_{i=1}^{N} d_i^{\mathrm{hist}}$ converges
  return $f_\Theta, Y_1, \ldots, Y_N$
2.1 Expectation–Maximization
The EM process, outlined in Algorithm 1, consists of the following steps:
• Initialization: The model is pre-trained on fully supervised data from an easily accessible domain, such as synthetic data.
• Expectation (E-step): The model predicts a note onset posteriorgram (heatmap). The likelihoods in the posteriorgram are refined using top-K local-maxima peak picking for each pitch, based on its target number of occurrences, to estimate strongly-aligned onset labels. As a regularization, we only update the estimated label if the Euclidean distance between the current predicted histogram and the target histogram has improved.
• Maximization (M-step): We use the estimated strongly-aligned labels to update the model parameters using standard optimization [13].
The E- and M-steps are alternately repeated until convergence. We used 5 iterations for our experiments. The EM iterations progressively improve temporal localization without relying on detailed temporal annotation. Temporal precision is derived from the model itself, which is pre-trained on another domain.
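The regularized label update of the E-step can be sketched as follows. This is an illustrative reading of Algorithm 1 in plain Python, not the authors' implementation: the function names, the list-of-lists posteriorgram format, and the stand-in picker `argmax_pick` are all assumptions made for the example.

```python
import math

def predicted_histogram(Z):
    """Sum the posteriorgram over time: one soft count per pitch."""
    return [sum(column) for column in zip(*Z)]

def e_step_update(Z, h, current_label, current_dist, pick):
    """Return the (label, distance) pair to keep for one training example.

    `pick` is any peak-picking function mapping (Z, h) to a binary label.
    The new label is accepted only if the predicted histogram moved closer
    to the target histogram (Euclidean distance), as in Algorithm 1.
    """
    new_dist = math.dist(predicted_histogram(Z), h)
    if new_dist < current_dist:
        return pick(Z, h), new_dist
    return current_label, current_dist

# Hypothetical stand-in picker: marks the single best frame of each pitch
# that has a non-zero target count.
def argmax_pick(Z, h):
    T, P = len(Z), len(h)
    Y = [[0] * P for _ in range(T)]
    for p in range(P):
        if h[p] > 0:
            best = max(range(T), key=lambda t: Z[t][p])
            Y[best][p] = 1
    return Y

Z = [[0.2, 0.0], [0.9, 0.1]]   # T = 2 frames, P = 2 pitches
h = [1, 0]                     # target: one onset of pitch 0
label, dist = e_step_update(Z, h, None, math.inf, argmax_pick)
```

Starting from an infinite distance guarantees that the first iteration always assigns a label; later iterations replace it only when the model's predicted counts have moved closer to the target histogram.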
2.2 Strong Alignment from Histograms
We use peak-picking to estimate precise time-aligned labels based on the target note histograms and the model's predictions. We assume a target histogram $h = (h_1, \ldots, h_P)^\top \in \mathbb{N}_0^P$, where $P$ is the number of considered pitches, and a predicted note onset posteriorgram $Z \in [0,1]^{T \times P}$, where $T$ is the number of time frames. The posteriorgram $Z$ can be interpreted as a predicted note onset heatmap, which we assume is computed as $Z = f_\Theta(a)$ for a given input audio representation $a$ and a deep neural network $f_\Theta$.
We assign an estimated label $Y \in \{0,1\}^{T \times P}$ using a peak-picking operator (“PeakPick” in Algorithm 1):
$$\Psi : [0,1]^{T \times P} \times \mathbb{N}_0^P \to \{0,1\}^{T \times P} \tag{1}$$
which simply picks for each pitch $p \in \{1, \ldots, P\}$ the $K$ most likely temporal local peaks according to the predicted posteriorgram $Z$, where $K = h_p$ is the target number of occurrences of the pitch according to the histogram. A position is considered to be a local peak if it is higher than or equal to all its neighbors in a certain radius of frames, e.g., one frame.
Denoting $Y = \Psi(Z, h)$, for each pitch $p$ the peak picker $\Psi$ selects $K = h_p$ peaks from the $p$-th column of $Z$ to define the $p$-th column of $Y$, where peak positions are binary-encoded (multi-hot). Note that by definition, it holds that $\sum_{t=1}^{T} Y_t = h \in \mathbb{N}_0^P$, i.e., the rows of $Y$ sum up to the target histogram $h$.
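The peak-picking operator can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the paper's code: the posteriorgram is a list of T rows with P probabilities each, the function name `peak_pick` and the `radius` parameter are hypothetical, and ties between equally likely peaks are broken in favor of the earlier frame.

```python
def peak_pick(Z, h, radius=1):
    """Top-K local-maxima peak picking, one pitch (column) at a time.

    Z: list of T rows, each a list of P onset probabilities in [0, 1].
    h: list of P non-negative target counts.
    Returns a T x P binary label matrix (lists of 0/1).
    """
    T = len(Z)
    P = len(h)
    Y = [[0] * P for _ in range(T)]
    for p in range(P):
        column = [Z[t][p] for t in range(T)]
        # A frame is a local peak if no neighbor within `radius` is higher.
        peaks = [
            t for t in range(T)
            if all(column[t] >= column[u]
                   for u in range(max(0, t - radius), min(T, t + radius + 1)))
        ]
        # Keep the K = h[p] most likely peaks (ties broken by earlier frame).
        peaks.sort(key=lambda t: (-column[t], t))
        for t in peaks[:h[p]]:
            Y[t][p] = 1
    return Y

Z = [[0.1, 0.0],
     [0.9, 0.2],
     [0.3, 0.1],
     [0.2, 0.8],
     [0.7, 0.3]]
h = [2, 1]
Y = peak_pick(Z, h)
# Summing Y over time recovers the target histogram.
column_sums = [sum(col) for col in zip(*Y)]
```

In this sketch, a column with fewer local peaks than its target count receives fewer labels than requested, so the column sums match the histogram only when enough peaks are available.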
2.3 Model Training
We experiment with two models: the Onsets and Frames architecture [1, 2] pre-trained on synthetic data [5], which we denote Sy, and the model of Kong et al. [3] pre-trained on the MAESTRO dataset, which we denote Kg. We optimize the mean binary cross-entropy (BCE) loss using an Adam optimizer [13]. To address the imbalance between positive and negative labels resulting from note onset sparsity, we assign a weight $w \ge 1$ to positive labels during training. This is done by applying a mask $M = w \cdot Y + (1 - Y)$ to the binary cross-entropy loss matrix, where $Y$ is the estimated label. The loss function is computed as:
$$\mathcal{L}(f_\Theta(a), Y) = \sum_{i,j} M_{i,j} \cdot \mathrm{BCE}(f_\Theta(a)_{i,j}, Y_{i,j}).$$
We set the weight $w$ to 2 (Sy) or 1 (Kg), which from our observation provided approximately equal precision and recall. We apply pitch shift augmentation [5, 6, 14, 15], generating 11 pitch-shifted copies of the audio data, with shifts in the range of ±5 semitones, and with an additional random fractional term in the range of ±0.1 semitones to account for small tuning variation. Labels were computed only for the original copy and transposed accordingly for each augmented copy, enforcing pitch shift equivariance. All experiments were implemented in PyTorch and executed using two NVIDIA GeForce RTX 3090 GPUs. We used a batch size of 16 and trained models for 37.5K steps, except for Section 3.1, where we trained for 500K steps.
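The positively weighted loss can be written out element by element. The sketch below uses plain Python rather than PyTorch so that it stays self-contained; the function name `weighted_bce` and the clamping constant `eps` are assumptions of this example and are not taken from the paper.

```python
import math

def weighted_bce(Z, Y, w=2.0, eps=1e-7):
    """Sum of element-wise BCE terms, with positives up-weighted by w.

    Z: T x P predicted probabilities; Y: T x P binary labels.
    Implements the mask M = w * Y + (1 - Y) applied to the BCE matrix.
    """
    total = 0.0
    for z_row, y_row in zip(Z, Y):
        for z, y in zip(z_row, y_row):
            z = min(max(z, eps), 1.0 - eps)  # clamp to avoid log(0)
            bce = -(y * math.log(z) + (1 - y) * math.log(1.0 - z))
            mask = w * y + (1 - y)           # w for positives, 1 for negatives
            total += mask * bce
    return total

# One positive and one negative cell, both predicted at 0.5:
# the positive contributes 2 * ln 2, the negative contributes ln 2.
loss = weighted_bce([[0.5, 0.5]], [[1, 0]], w=2.0)
```

With w = 1 the mask is all ones and the expression reduces to the unweighted BCE sum.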
Model                      Test                   Train
                       P      R      F        P      R      F
Pre-trained Model
  Sy                  88.3   81.6   84.6     87.8   81.2   84.1
Histogram Supervision
  Rep. iter., F/T     92.4   90.4   91.3     91.8   90.5   91.1
  Rep. iter., 180s    93.2   91.7   92.4     92.9   91.9   92.4
  Rep. iter., 120s    93.1   92.2   92.6     92.8   92.4   92.6
  Rep. iter., 60s     95.7   92.2   93.9     95.6   92.5   94.0
  Rep. iter., 30s     95.5   92.8   94.1     95.3   93.1   94.2
  1-iter., F/T        92.4   87.1   89.6     91.9   87.3   89.5
  1-iter., 60s        93.9   88.4   91.0     93.6   88.5   90.9
Sup                   98.7   93.1   95.8     98.8   93.4   96.0

Table 1. Note-level transcription results for training with histogram-based supervision on the MAESTRO dataset. We report Precision (P), Recall (R), and F-score (F) across different histogram window sizes (or Full Track, F/T). For reference, results include a baseline trained on synthetic data only (Sy) and a supervised model trained with ground-truth labels (Sup).
3. EXPERIMENTS
In this section, we present our experiments evaluating our approach across different datasets and instruments, including piano transcription and noisy histograms (Sections 3.1-3.2, MAESTRO dataset [2]), guitar transcription (Section 3.3, cross-dataset), and multi-instrument transcription including strings and winds (Section 3.4, cross-dataset). Evaluation metrics include note-level precision, recall, and F-score with a 50 ms onset tolerance.
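To make the evaluation metric concrete, the following sketch computes note-level precision, recall, and F-score with an onset tolerance. It is a simplified stand-in for the standard evaluation toolkits normally used for this purpose, not the evaluation code behind the reported numbers; the note format (onset in seconds, MIDI pitch) and the function name `onset_prf` are assumptions.

```python
def onset_prf(reference, estimate, tolerance=0.05):
    """Note-level precision, recall, and F-score with an onset tolerance.

    reference, estimate: lists of (onset_seconds, midi_pitch) pairs.
    An estimated note matches a reference note of the same pitch if their
    onsets differ by at most `tolerance`; each note is matched at most once.
    """
    def by_pitch(notes):
        groups = {}
        for onset, pitch in notes:
            groups.setdefault(pitch, []).append(onset)
        return {p: sorted(ts) for p, ts in groups.items()}

    ref, est = by_pitch(reference), by_pitch(estimate)
    matched = 0
    for pitch, ref_times in ref.items():
        est_times = est.get(pitch, [])
        i = j = 0
        # Two-pointer sweep over the sorted onsets of one pitch.
        while i < len(ref_times) and j < len(est_times):
            diff = est_times[j] - ref_times[i]
            if abs(diff) <= tolerance:
                matched += 1
                i += 1
                j += 1
            elif diff < 0:
                j += 1  # estimate too early to match any remaining reference
            else:
                i += 1  # reference too early to match any remaining estimate
    precision = matched / len(estimate) if estimate else 0.0
    recall = matched / len(reference) if reference else 0.0
    f = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f

reference = [(1.00, 60), (2.00, 60), (1.50, 64)]
estimate = [(1.03, 60), (2.20, 60), (1.49, 64), (3.00, 67)]
precision, recall, f = onset_prf(reference, estimate)
```

In the toy example, two of the four estimated notes fall within 50 ms of a reference note of the same pitch, so precision is 2/4 and recall is 2/3.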
3.1 Piano Transcription: MAESTRO Dataset
We first evaluate our method in a controlled setting using the MAESTRO dataset [2], which provides precise reference annotations generated automatically by a Disklavier. Instead of using these labels directly for training, we derive onset histograms by segmenting the audio and labels into smaller windows along the time axis, over which we compute histograms. These histograms serve as supervision for training, while evaluation is performed using the reference labels. To assess the impact of supervision levels, we test window lengths of 30 seconds, one minute, two minutes, three minutes, and entire tracks (up to 40 minutes).
Table 1 shows that our approach significantly improves transcription accuracy compared to the initial pre-trained model (Sy), even with full-track histograms (F/T), where the F-score increases by over 6 percentage points (from 84.6 to 91.3). Reducing the counting window further improves the F-score, as it better constrains onset timing, effectively increasing supervision. Performance approaches fully supervised levels for windows of one minute or less, indicating that the counting approach is effective even with temporally highly inaccurate labeling.
We also observe that repeating the labeling process during training (“Rep. iter.”) improves performance compared to training for the same number of total steps with a single labeling (“1-iter.”), e.g., from 91.0 to 93.9 for one-minute windows.

Model                  Test                   Train
                   P      R      F        P      R      F
Noisy Histogram Supervision
  60s, 0%         95.7   92.2   93.9     95.6   92.5   94.0
  60s, 10%        93.1   92.0   92.5     93.0   92.2   92.6
  60s, 20%        92.2   90.2   91.2     91.7   90.6   91.1
  F/T, 0%         92.4   90.4   91.3     91.8   90.5   91.1
  F/T, 10%        90.9   89.8   90.3     90.4   89.9   90.1
  F/T, 20%        89.2   88.2   88.6     88.5   88.4   88.4

Table 2. CountEM robustness to noisy histograms on the MAESTRO dataset. We apply ±10% and ±20% random noise to simulate histogram errors and evaluate different window lengths as in Table 1.
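The derivation of window-level histograms from time-aligned reference labels in Section 3.1 can be sketched as below. The note format, the function name `window_histograms`, and the choice to fold any onset at or past the nominal track end into the last window are assumptions of this example.

```python
import math
from collections import Counter

def window_histograms(notes, window_seconds, track_seconds):
    """Split a track into fixed-length windows and count onsets per pitch.

    notes: list of (onset_seconds, midi_pitch) pairs.
    Returns one Counter per window; the last window may be shorter.
    """
    n_windows = max(1, math.ceil(track_seconds / window_seconds))
    histograms = [Counter() for _ in range(n_windows)]
    for onset, pitch in notes:
        index = min(int(onset // window_seconds), n_windows - 1)
        histograms[index][pitch] += 1
    return histograms

notes = [(2.0, 60), (31.0, 60), (45.0, 64), (61.5, 60)]
# 30-second windows over a 70-second track give three histograms.
per_window = window_histograms(notes, window_seconds=30.0, track_seconds=70.0)
# A window as long as the track gives the full-track (F/T) histogram.
full_track = window_histograms(notes, window_seconds=70.0, track_seconds=70.0)
```

Shorter windows pin each onset to a narrower time range, which is why they act as stronger supervision than a single full-track count.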
3.2 Noisy Histograms
While fully supervised datasets like MAESTRO provide near-perfect histograms, labels for real-world recordings rely on musical scores, introducing potential discrepancies. For example, trills performed differently in audio and unaligned labels can cause minor inconsistencies. To assess the robustness of our approach, we train on the MAESTRO dataset with multiplicative random noise sampled from the uniform distribution $U[1-\alpha, 1+\alpha]$ at two levels ($\alpha \in \{0.1, 0.2\}$), introducing up to 10% and 20% noise. We conduct experiments using both one-minute and full-track histograms. Table 2 shows that while histogram errors slightly affect performance, the impact remains limited: no more than 3 percentage points in F-score even with 20% noise.
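The noise model can be simulated as in the sketch below. Two details are assumptions of this example, since the text does not specify them: a separate noise factor is drawn for every pitch count, and noisy counts are rounded to the nearest non-negative integer. The function name `noisy_histogram` is likewise hypothetical.

```python
import random

def noisy_histogram(h, alpha, rng=None):
    """Scale each count by a factor drawn from U[1 - alpha, 1 + alpha].

    Drawing one factor per pitch and rounding to the nearest non-negative
    integer are assumptions of this sketch; the text does not state how
    noisy counts are sampled per pitch or discretized.
    """
    rng = rng or random.Random()
    return [max(0, round(count * rng.uniform(1.0 - alpha, 1.0 + alpha)))
            for count in h]

rng = random.Random(0)          # fixed seed for a reproducible example
clean = [100, 50, 0]
noisy = noisy_histogram(clean, alpha=0.2, rng=rng)
```

Because the noise is multiplicative, pitches that never occur keep a count of zero, and frequent pitches receive proportionally larger absolute errors.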
3.3 Guitar Transcription
As a next step, we evaluate our method on guitar datasets, namely GuitarSet [16] and the Guitar-Aligned Performance Scores (GAPS) dataset [7]. The annotation of GuitarSet was created by applying f0 estimation on monophonic tracks obtained from a hexaphonic pickup, followed by semi-automated methods for note onset and offset localization. The annotation of GAPS was done directly on polyphonic tracks by professional annotators, relying on recent neural network-based alignment techniques [7].
We compare two existing off-the-shelf models: the Onsets and Frames architecture [1, 2] pre-trained on synthetic data [5], and the model of Kong et al. [3] pre-trained on the MAESTRO dataset. We denote these models Sy and Kg, respectively. We train each of them using histogram supervision on each of the two datasets, GuitarSet and GAPS, which we denote Gs and Gp, respectively. This yields four different configurations: SyGs, SyGp, KgGs, KgGp. We evaluate each configuration on each of the two datasets, enabling both intra- and inter-dataset (cross-dataset) evaluation. We train each of the four configurations with histograms computed over different windows: one-minute windows (60s) and entire tracks (F/T). The tracks in GuitarSet are all shorter than 30 seconds, so we only use entire-track histograms for it. Since GuitarSet is small (three hours), we train SyGs and KgGs on the entire set, but only with histogram information. Therefore, evaluation of SyGs and KgGs on GuitarSet measures the ability to restore the original time-aligned labels from the histogram information. When training on GAPS, we use the same train–test split as Riley et al. [7].
Results are shown in Table 3. It can be seen that our approach yields a significant improvement over both baselines (Sy, Kg) of over 15% in F-score for both GuitarSet and GAPS, even when counting over entire tracks (SyGpF/T, KgGpF/T). For example, fine-tuning Sy on GAPS with histogram supervision over entire tracks (SyGpF/T) improves the F-score on GuitarSet from 66.2% to 84.6%.
When reducing the counting window on GAPS to one minute, the F-score on GuitarSet slightly improves, by 1.1% on average.
It can also be seen that by training on GuitarSet with only its histogram information we can restore its ground-truth strongly-aligned labels with an F-score of 88.9% (SyGsF/T) or 89.7% (KgGsF/T).
We further compare our results to previous work in weakly-supervised transcription. The model of Maman and Bermano [5] was fine-tuned from synthetic data (the same pre-trained model we use) on self-collected guitar data. We denote this model by SySc. The accuracy of our model surpasses this model, improving on GuitarSet from 82.2% to 85.8% (SyGp60s) or 86.5% (KgGp60s), and on GAPS from 86.6% to 90% (SyGp60s) or 93% (KgGp60s).
The models of Riley et al. [7] were trained on GAPS with its time-aligned labels either from scratch ([7] Gp) or fine-tuned from piano ([7] KgGp). The labels were obtained by alignment of neural onset features, applying an initial DTW step, followed by a local-max refinement step for each note onset independently. Contrary to Riley et al. [7], we train on GAPS using histogram information only, i.e., weakening or completely omitting the DTW step. Our model's accuracy is slightly higher than [7] trained on GAPS from scratch, and slightly lower than [7] fine-tuned on GAPS from MAESTRO, but on a comparable scale. This shows that the DTW step may be omitted with a small impact.
Most importantly, results show that our approach is robust across different architectures, and enables adaptation to guitar transcription from either synthetic (Sy) or piano (Kg) data pre-training.

Model                 GuitarSet                GAPS
                   P      R      F        P      R      F
Pre-trained Models
  Sy              57.9   80.7   66.2     67.2   86.3   75.0
  Kg              71.1   44.0   50.9     61.9   77.7   67.1
Histogram Supervision
  SyGsF/T         87.6   90.3   88.9     84.2   81.2   82.2
  SyGpF/T         83.6   86.4   84.6     90.6   90.6   90.6
  SyGp60s         85.6   86.5   85.8     89.8   90.1   90.0
  KgGsF/T         89.3   90.1   89.7     85.4   89.0   87.1
  KgGpF/T         83.6   88.1   85.5     93.3   92.5   92.9
  KgGp60s         86.9   85.4   86.5     93.1   93.0   93.0
DTW + Refinement
  [7] Gp          92.4   81.8   86.1     94.9   92.1   93.4
  [7] KgGp        91.1   85.9   88.1     95.0   93.6   94.3
  [5] SySc        86.7   79.7   82.2     82.8   91.8   86.6

Table 3. Guitar transcription evaluation on the GuitarSet and GAPS datasets. We compare models pre-trained on synthetic data (Sy) and MAESTRO (Kg), trained on GuitarSet (Gs) and GAPS (Gp) using histograms from one-minute windows (60s) and entire tracks (F/T). See text for details.
3.4 Multi-Instrument Transcription
As a final, more challenging, and less controlled experiment, we evaluate the generalizability of the CountEM approach by applying our method to multi-instrument transcription using the MusicNet dataset [17], which features recordings of both solo and ensemble performances across various instruments. Unlike the MAESTRO dataset, MusicNet lacks full supervision, as its note labels were derived from aligning audio and MIDI files from different sources, introducing errors, particularly in onset timing [1, 2, 5]. However, a key advantage is that the musical structure was manually verified, ensuring consistency across performances. While fine-grained alignment remains imprecise, note histograms provide a stable and reliable signal, making this dataset well-suited for evaluating our histogram-based supervision approach in real-world, less curated conditions.
Another strength of MusicNet is its diversity in acoustics and instrumentation, making it well-suited for generalization across different musical contexts (zero-shot transcription).
We derive note histograms over entire tracks from unaligned labels. To obtain histograms over shorter chunks, we use loose alignment only to coherently subdivide audio and weakly-aligned labels. Minor errors in onset timing have little impact on histograms computed over 30- or 60-second windows. Future work could explore alternative segmentation techniques for further refinement.
Model            MAESTRO               GuitarSet             URMP                  URMP (Histog.)
              P     R     F         P     R     F         P     R     F         P      R     F
Pre-trained Model
  Sy         88.3  81.6  84.6      57.9  80.7  66.2      76.2  65.4  70.1      91.8   79.8  84.9
Histogram Supervision, MusicNet Piano (ours)
  30s        93.0  88.2  90.4      77.8  82.5  79.4      70.1  79.6  74.5      80.1   90.8  85.0
  F/T        92.1  85.8  88.7      81.2  80.1  79.8      77.3  75.1  76.1      89.7   87.1  88.3
Histogram Supervision, MusicNet Full (ours)
  32ms       77.1  12.1  16.7      85.5   5.0   8.6      56.9   1.5   2.8     100.0   19.0  36.0
  100ms      94.7  33.9  43.9      91.3  31.9  40.6      90.2   6.0  11.2     100.0    6.6  12.1
  500ms      92.4  80.5  85.8      90.5  69.2  75.8      82.9  70.6  76.1      97.7   83.2  89.8
  30s        94.5  86.0  89.9      88.5  75.4  80.3      82.2  79.9  80.9      93.0   90.4  91.6
  60s        93.1  86.1  89.3      86.7  78.5  81.5      81.9  79.7  80.7      92.6   90.3  91.3
  F/T        92.4  85.0  88.4      82.8  82.4  82.0      81.6  78.2  79.7      92.3   88.8  90.3
DTW + Refinement
  [5] AlPl   92.6  87.2  89.7      86.6  80.4  82.9      81.7  77.6  79.6      95.6   91.0  93.2
  [5] Al     96.4  83.4  89.2      89.0  76.9  81.5      84.0  75.2  79.3      96.6   86.8  91.3

Table 4. Cross-dataset evaluation. Training was performed on MusicNet, with evaluation on MAESTRO, GuitarSet, and URMP. For URMP, we also report F-histogram, which does not enforce the 50ms onset threshold.
Note that while refined versions of the dataset exist [5], to demonstrate the efficacy of our approach we use the original, weakly-aligned labels.
We also note that we use the MusicNet dataset exclusively for training, as it lacks precise and reliable reference annotations. For evaluation, we again use the MAESTRO and GuitarSet datasets, along with the URMP dataset [18], which consists of string and wind instruments. In URMP the recordings are multi-tracked, where each track is monophonic, making annotations more accurate and reliable. While these labels are generally accurate, they are not perfectly precise [4]. To account for potential timing inaccuracies, we report both the standard 50ms onset F-score, and a high-tolerance metric, referred to as onset F-histogram. It is computed similarly to the F-score, but without the 50ms threshold. It compares the sets without considering timing, and serves as an upper bound in cases of annotation errors in onset timing.
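One possible reading of the F-histogram metric is sketched below: notes are compared purely by pitch counts, so that for every pitch the smaller of the reference and estimated counts is treated as correctly transcribed. This interpretation, the note format, and the function name `f_histogram` are assumptions of this example; the paper's exact computation may differ.

```python
from collections import Counter

def f_histogram(reference, estimate):
    """F-score over pitch counts only, ignoring onset times.

    reference, estimate: lists of (onset_seconds, midi_pitch) pairs.
    For each pitch, min(reference count, estimated count) notes are
    treated as correct (an assumed reading of the metric).
    """
    ref = Counter(pitch for _, pitch in reference)
    est = Counter(pitch for _, pitch in estimate)
    matched = sum((ref & est).values())  # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(est.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = [(1.0, 60), (2.0, 60), (1.5, 64)]
# Onsets are far off in time, but two of the three pitches agree in count.
estimate = [(1.3, 60), (2.6, 60), (9.0, 67)]
score = f_histogram(reference, estimate)
```

Since any note matched within a timing tolerance is also matched by pitch count, this quantity can only be greater than or equal to the timing-based matched count, which is consistent with its role as an upper bound.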
We experimented with both pre-trained models appearing in Section 3.3; however, the synthetic pre-trained model (Sy) performed better than the piano pre-trained one (Kg). We postulate this is thanks to the diversity in the data used to pre-train Sy (despite being synthetic). Therefore, presented results are from Sy, also used by Maman and Bermano [5], but fine-tuned on MusicNet with our approach.
As shown in Table 4, our approach improves over the synthetic baseline, even with full-track histograms (F/T), increasing the F-score on MAESTRO from 84.6% to 88.7%, and reaching 90.4% for half-minute segments. It slightly outperforms a model from previous work trained with alignment and pseudo-labels ([5] AlPl) while relying on a much simpler label estimation method. Notably, our results even with full-track histograms match results using DTW and local-max refinement ([5] Al), suggesting that DTW may not be essential for this task.
Lastly, we note that when reducing the window size to 100ms or below, accuracy drastically drops, contrary to the MAESTRO dataset where a single frame (corresponding to full supervision) provides the best results. This demonstrates that the MusicNet labels contain errors in onset timing, and also shows that our approach can overcome them, as illustrated in Figure 1.
4. CONCLUSION
In this work, we introduced CountEM, a novel framework for AMT that leverages histogram-based supervision to eliminate the need for explicit temporal alignment. By replacing traditional alignment strategies with a simple peak-picking mechanism, CountEM reduces computational overhead while improving flexibility. Extensive experiments across piano, guitar, and multi-instrument datasets demonstrated its robustness, achieving performance comparable to or surpassing existing weakly-supervised methods with a significantly simplified label estimation process.
Looking ahead, CountEM's principles could extend to tasks such as instrument recognition, rhythm analysis, and lyrics transcription, particularly in complex polyphonic settings. Further exploration of weakly- and semi-supervised learning strategies could enhance transcription accuracy while minimizing annotation costs. By shifting towards more efficient and scalable supervision mechanisms, CountEM paves the way for data-efficient approaches to music transcription across diverse musical contexts.
5. ACKNOWLEDGEMENTS
This research was supported in part by the Len Blavatnik and the Blavatnik family foundation and ISF grant number 1337/22. This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No. 500643750 (MU 2686/15-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS.
6. REFERENCES
[1] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. H. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018, pp. 50–57.
[2] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C. A. Huang, S. Dieleman, E. Elsen, J. H. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, Louisiana, USA, 2019. [Online]. Available: https://openreview.net/forum?id=r1lYRjC9F7
[3] Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-resolution piano transcription with pedals by regressing onset and offset times,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3707–3717, 2021. [Online]. Available: https://doi.org/10.1109/TASLP.2021.3121991
[4] J. Gardner, I. Simon, E. Manilow, C. Hawthorne, and J. H. Engel, “MT3: Multi-task multitrack music transcription,” in Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 2022.
[5] B. Maman and A. H. Bermano, “Unaligned supervision for automatic music transcription in the wild,” in Proceedings of the International Conference on Machine Learning (ICML), Baltimore, Maryland, USA, 2022, pp. 14 918–14 934.
[6] X. Riley, D. Edwards, and S. Dixon, “High resolution guitar transcription via domain adaptation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seoul, South Korea, 2024, pp. 1051–1055.
[7] X. Riley, Z. Guo, D. Edwards, and S. Dixon, “GAPS: A large and diverse classical guitar dataset and benchmark transcription model,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), San Francisco, USA, 2024.
[8] F. Zalkow and M. Müller, “CTC-based learning of chroma features for score-audio music retrieval,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2957–2971, 2021.
[9] M. Müller, Information Retrieval for Music and Motion. Springer Verlag, 2007.
[10] S. Ewert, M. Müller, and P. Grosche, “High resolution audio synchronization using chroma onset features,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Taipei, Taiwan, 2009, pp. 1869–1872.
[11] Y. Özer, M. Istvanek, V. Arifi-Müller, and M. Müller, “Using activation functions for improving measure-level audio synchronization,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022, Bengaluru, India, December 4-8, 2022, P. Rao, H. A. Murthy, A. Srinivasamurthy, R. M. Bittner, R. C. Repetto, M. Goto, X. Serra, and M. Miron, Eds., 2022, pp. 749–756. [Online]. Available: https://archives.ismir.net/ismir2022/paper/000090.pdf
[12] J. Zeitler, B. Maman, and M. Müller, “Robust and accurate audio synchronization using raw features from transcription models,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), San Francisco, USA, 2024.
[13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the International Conference on Learning Representations (ICLR), San Diego, California, USA, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
[14] J. Thickstun, Z. Harchaoui, D. P. Foster, and S. M. Kakade, “Invariances and data augmentation for supervised music transcription,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 2241–2245.
[15] A. Riou, S. Lattner, G. Hadjeres, and G. Peeters, “PESTO: Pitch estimation with self-supervised transposition-equivariant objective,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Milano, Italy, 2023, pp. 535–544.
[16] Q. Xi, R. M. Bittner, J. Pauwels, X. Ye, and J. P. Bello, “Guitarset: A dataset for guitar transcription,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, E. Gómez, X. Hu, E. Humphrey, and E. Benetos, Eds., 2018, pp. 453–460. [Online]. Available: http://ismir2018.ircam.fr/doc/pdfs/188_Paper.pdf
[17] J. Thickstun, Z. Harchaoui, and S. M. Kakade, “Learning features of music from scratch,” in Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 2017. [Online]. Available: https://openreview.net/forum?id=rkFBJv9gg
[18] B. Li, X. Liu, K. Dinesh, Z. Duan, and G. Sharma, “Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications,” IEEE Transactions on Multimedia, vol. 21, no. 2, pp. 522–535, 2019.