
Sheet Music Benchmark: Standardized Optical Music Recognition Evaluation

Author: Juan C. Martinez-Sevilla; Joan Cerveto-Serrano; Noelia Luna-Barahona; Greg Chapman; Craig Sapp; David Rizo; Jorge Calvo-Zaragoza
Publisher: Zenodo
DOI: 10.5281/zenodo.17706531
Source: https://zenodo.org/records/17706531/files/000070.pdf
SHEET MUSIC BENCHMARK:
STANDARDIZED OPTICAL MUSIC RECOGNITION EVALUATION
Juan C. Martinez-Sevilla¹, Joan Cerveto-Serrano¹, Noelia Luna¹, Greg Chapman², Craig Sapp³, David Rizo¹,⁴, Jorge Calvo-Zaragoza¹
¹ Pattern Recognition and Artificial Intelligence Group, University of Alicante, Spain
² Self-employed
³ Center for Computer Research in Music and Acoustics, Stanford University, USA
⁴ Instituto Superior de Enseñanzas Artísticas de la Comunidad Valenciana, Spain
{jcmartinez.sevilla, joan.cerveto, noelia.luna, drizo, jorge.calvo}@ua.es
ABSTRACT
In this work, we introduce the Sheet Music Benchmark (SMB), a dataset of six hundred and eighty-five pages specifically designed to benchmark Optical Music Recognition (OMR) research. SMB encompasses a diverse array of musical textures, including monophony, pianoform, quartet, and others, all encoded in Common Western Modern Notation using the Humdrum **kern format. Alongside SMB, we introduce the OMR Normalized Edit Distance (OMR-NED), a new metric tailored explicitly for evaluating OMR performance. OMR-NED builds upon the widely-used Symbol Error Rate (SER), offering a fine-grained and detailed error analysis that covers individual musical elements such as note heads, beams, pitches, accidentals, and other critical notation features. The resulting numeric score provided by OMR-NED facilitates clear comparisons, enabling researchers and end-users alike to identify optimal OMR approaches. Our work thus addresses a long-standing gap in OMR evaluation, and we support our contributions with baseline experiments using standardized SMB dataset splits for training and assessing state-of-the-art methods.
1. INTRODUCTION
Optical Music Recognition (OMR) is a long-standing challenge within the field of Music Information Retrieval (MIR). It focuses on automatically extracting musical information from scanned images, manuscripts, or printed documents, converting this information into structured digital representations [1], such as Humdrum **kern [2], MEI [3], or MusicXML [4]. These machine-readable formats facilitate large-scale musical information retrieval, enable advanced computational music analysis, and grant broader accessibility to extensive musical archives [5].
© J. C. Martinez-Sevilla, J. Cerveto-Serrano, N. Luna, G. Chapman, C. Sapp, D. Rizo, and J. Calvo-Zaragoza. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. C. Martinez-Sevilla, J. Cerveto-Serrano, N. Luna, G. Chapman, C. Sapp, D. Rizo, and J. Calvo-Zaragoza, “Sheet Music Benchmark: Standardized Optical Music Recognition Evaluation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Traditionally, OMR methods were predominantly rule-based, involving steps such as staff line removal and primitive detection algorithms, which posed significant practical difficulties [6]. However, recent advancements in Deep Learning (DL) have successfully addressed many of these long-standing obstacles, leading to substantial improvements in OMR performance [7]. This is evidenced by the shift in the field towards end-to-end methods that attempt to solve the problem in a few steps, either by directly detecting musical objects in full images [8], or with single-step transcription pipelines at both the region [9, 10] and page levels [11].
Given these advances and extensive ongoing research, one might consider OMR a mature and well-developed subfield of MIR. Nevertheless, unlike other MIR tasks, OMR still lacks a comprehensive, high-quality benchmark corpus, complicating rigorous performance assessment and qualitative comparison between systems. Such benchmarks have significantly benefited other MIR areas; notable examples include MAPS [12], ASAP [13], and MAESTRO [14] in automatic music transcription; Ballroom [15] and GTZAN [16] in beat tracking; or NSynth [17] for instrument classification, to name a few. These established benchmarks have greatly facilitated standardized comparisons and the systematic evaluation of novel approaches.
To date, several attempts have been made to establish an OMR benchmark capable of addressing this gap. In their seminal work, Byrd and Simonsen proposed the OMRTest Corpus [18]; however, this corpus consists of only thirty-four pages with substantial graphical variability, rendering it insufficient and unsuitable for contemporary DL-based OMR approaches. Similarly, CVC-MUSCIMA [19] is only devised for the staff-line removal step, while its extension MUSCIMA++ [20] is object detection-oriented, with annotations not representing full musical scores. The PrIMuS dataset, although introduced as the first end-to-end OMR dataset, includes only monophonic samples at the staff-region level [21]. For pianoform end-to-end recognition, the GrandStaff dataset [11] was presented, but it is synthetically generated and lacks sufficient diversity, as the musical excerpts primarily originate from a limited set of composers. Recently, the OLiMPiC dataset [9] has been proposed for pianoform recognition tasks. Nevertheless, it also exhibits limited variety regarding repertoire and musical purposes, a restricted size, and is not truly end-to-end, since it relies on Linearized XML, a compact representation of MusicXML which requires additional processing to obtain a fully renderable music score.
Considering the previously identified limitations and the existing gap in the field, in this work we introduce the Sheet Music Benchmark (SMB), a novel, full-page end-to-end dataset specifically designed for benchmarking modern DL-based OMR systems. SMB supports comprehensive evaluation, including layout analysis, region-level recognition, and full-page transcription. In addition to the dataset, we design a new evaluation metric: the OMR Normalized Edit Distance (OMR-NED). This metric pursues two main objectives: (i) summarizing the overall performance of an OMR system in a single numerical value, and (ii) enabling detailed profiling of OMR system errors by categorizing them into specific notation elements such as notes, rests, and measures, among others. We expect the combination of the SMB dataset and the OMR-NED metric to significantly advance OMR research, similar to how established benchmarks have driven progress in other MIR areas, by providing researchers and practitioners with a standardized and effective framework to rigorously evaluate OMR systems.
2. BUILDING THE BENCHMARK
The SMB corpus is built upon KernScores [22], an online library of musical data encoded in **kern. This encoding format allows further processing with the Humdrum Toolkit for Music Research.¹ KernScores provides multiple representations of each piece, including a public-domain scan of the sheet music (when available), the **kern encoding, and a MIDI file. To build SMB, all the steps were performed by musically trained experts to ensure the quality of the annotations. This includes the selection of the pieces, which involved linking each sheet music scan with its corresponding raw **kern encoding.²
2.1 Annotation process
The scanned scores were uploaded to HumanSignal,³ a web platform specifically designed for data annotation. Each musical piece was manually labeled, starting with the assignment of a texture tag: Monophonic,⁴ Pianoform, PianoAndVoice, Quartet, or Other.⁵ Then, for each page, the regions corresponding to individual staves or musical systems were delimited by bounding boxes and labeled with their content in raw **kern format (see Fig. 1).
These annotations were performed at the staff level (sometimes referred to as line-level annotations). While full-page end-to-end digitization is the ultimate goal of any OMR system, it remains a highly complex challenge for most architectures. Nonetheless, SMB also provides “diplomatic”⁶ full-page annotations, suitable for staff-level and page-level transcription. These detailed annotations also allow Layout Analysis (LA) processes, making SMB appropriate for any modern OMR pipeline.
¹ https://www.humdrum.org
² Here linking means that when the scan was not available, annotators looked for it in different sources.
³ https://humansignal.com/
⁴ Most of the samples are single-staff and single-voice; however, homophony and polyphony can appear.
⁵ Given the number of samples in the PianoAndVoice and Other texture tags, they will be referred to as Other for the rest of the paper.
Figure 1. Labeled piece example. In blue, the bounding boxes that locate the regions of interest. In the top-right corner, the raw **kern encoding of the first region of the page.
2.2 The use of Humdrum **kern
Music can be digitally represented in various formats, including Humdrum **kern, MEI, or MusicXML. However, **kern remains the most efficient and least verbose option, making it particularly suitable for OMR DL-based systems. This efficiency, combined with the extensive Humdrum ecosystem (which includes rendering tools like Verovio Humdrum Viewer [23] and symbolic analysis frameworks such as humdrum-tools [24]) as well as partitura [25] and music21 [26], ensures **kern is both accessible and highly advantageous for OMR applications.⁷
2.3 Postprocessing
Once the annotation process was completed, compiling a robust, functional dataset for modern end-to-end OMR methods posed significant challenges. An important property of the raw **kern fragments extracted from KernScores is their fidelity to the original score: they provide metadata such as the source, authorship, publication year, work titles, and more. However, for some DL experiments, the presence of this information, together with the inherent ambiguity of the raw **kern encoding itself, which can represent the same symbol in multiple ways (see Fig. 2), is a major drawback, and thus it needed to be addressed.
⁶ Annotations indicate where the staff breaks are in the source score.
⁷ The use of **kern for OMR does not constrain the later usage of another encoding format.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
8#a  8a#  a8#  a#8  #8a  #a8
Figure 2. Possible inconsistent **kern representations of an A#4 eighth note.
For that reason, the construction of standardized, high-quality **kern files, e.g., by constraining the representations of notes to reduce vocabulary size, is a key point of the SMB. The kernpy [27] package was a major pillar during the postprocessing process, providing features that converted non-standardized fragmented **kern data (as it comes from the labeling process) into complete standardized encoding ready to use by the Humdrum ecosystem tools, including the proper **kern headers, spine paths,⁸ and changes in clefs, key signatures, and time signatures.
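To make the idea of constraining note representations concrete, the following minimal sketch maps every ordering of the duration, pitch, and accidental characters of a simple note token to one canonical form. It is an illustration only and not kernpy's actual implementation: the function name `canonical_note`, the chosen order (duration, pitch, accidental), and the restriction to these three character classes are assumptions made for this example.

```python
# Hypothetical helper (not part of kernpy): canonicalize a simple **kern
# note token so that all character orderings collapse to a single form.
DURATION_CHARS = set("0123456789.")
PITCH_CHARS = set("abcdefgABCDEFG")
ACCIDENTAL_CHARS = set("#-n")


def canonical_note(token: str) -> str:
    """Return the token reordered as duration + pitch + accidental."""
    duration, pitch, accidental = [], [], []
    for ch in token:
        if ch in DURATION_CHARS:
            duration.append(ch)
        elif ch in PITCH_CHARS:
            pitch.append(ch)
        elif ch in ACCIDENTAL_CHARS:
            accidental.append(ch)
        else:
            # Beams, ties, articulations, rests, etc. are out of scope here.
            raise ValueError(f"unsupported character {ch!r} in token {token!r}")
    return "".join(duration) + "".join(pitch) + "".join(accidental)


# The six orderings of an A#4 eighth note all map to one vocabulary entry.
variants = ["8#a", "8a#", "a8#", "a#8", "#8a", "#a8"]
print({canonical_note(v) for v in variants})  # {'8a#'}
```

Collapsing the six spellings into one token is what shrinks the vocabulary a model has to learn; a real standardizer also has to handle the remaining **kern signifiers, which this sketch deliberately rejects.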
In addition, this package was also used to enforce structural consistency and perform semantic tokenization across different music categories. This approach even allows for the possibility of performing custom selections for specific research tasks, such as enabling the use of a tailored set of categories or avoiding the inclusion of articulations, accidentals, dynamics, or pitches. Consequently, three formats emerged from this step:
raw. Represents the original fragmented raw **kern KernScores encoding used in the labeling steps, without any change.
kern. Standardized **kern file retrieved from the tokenized version obtained from the kernpy package. Compatible with the Humdrum ecosystem toolkit.
ekern. The "Extended **kern" is a standardized tokenized version of **kern in which the separators “@” and “·” divide each musical symbol into its semantic token categories, i.e., duration, pitch, or accidental. This approach significantly reduces the vocabulary size of the dataset, at the cost of longer sequences.
Following automated data processing, the SMB underwent manual inspection at page and region levels to ensure error-free **kern and **ekern annotations. Encoding irregularities in problematic annotations were systematically solved.
2.4 Data analysis
In this subsection, we present a comprehensive analysis of SMB samples to provide insights into its features, distribution, and content. We depict key statistical properties, explore patterns within the data, and identify notable details given the standardized **kern encoding. This analysis serves to both validate the quality of the benchmark and illustrate in a clearer way the content of it.
⁸ https://www.humdrum.org/Humdrum/representations/kern.html
Figure 3. Pitch token frequency distribution in SMB. (Plot: pitch token on the horizontal axis, from A0 to C7, against count on the vertical axis.)
SMB spans 4039 regions, which correspond to 685 pages. Table 1 provides a detailed description of the main features of this assortment in **kern format. As shown, the most represented texture is Pianoform. On the one hand, the piano is one of the instruments with the broadest repertoire; on the other, pianoform remains one of the most complicated textures to transcribe. Another point to highlight is that when the number of instruments increases, as in the Quartet texture, musicians do not usually play from the full score, as they prefer to use parts. As a result, the number of annotated pages is lower than for the other textures.
An interesting outcome is that the graphical features of higher-voice-count pages such as Quartet, compared to Monophony, cause a drop in the number of regions by page, while the number of tokens by page remains similar. A key difference is the number of tokens per region: as the number of voices increases, this count rises drastically.
In Figure 3 we show the pitch token frequency distribution of SMB, where the most common pitches are the ones between A3 and G5.
Different composers, styles, and periods were considered to increase the variability in the data. The heterogeneity of the collection spans from the Baroque period to the Ragtime era. In between, there is a set of various musical periods such as the Classical, the Romantic, or the Impressionist, which provide a wide range of musical pieces. Some of the available composers are: Carl Maria von Weber, Domenico Scarlatti, Frédéric Chopin, Joseph Haydn, W. A. Mozart, L. van Beethoven, and Scott Joplin, among others.
The diversity present in SMB makes it, to the best of our knowledge, the first and most complete dataset in Common Western Modern Notation (CWMN) for benchmarking DL-based OMR methodologies, providing a set of diverse and complex scores.
Texture   | Pages | Regions | Regions by page (µ±σ) | Tokens by page (µ±σ) | Tokens by region (µ±σ)
Monophony | 115   | 972     | 8.4±3.5               | 1308.5±719.7         | 165.2±51.1
Pianoform | 469   | 2686    | 5.7±1.1               | 1371.3±679.4         | 263.5±105.0
Quartet   | 79    | 304     | 3.8±0.5               | 1338.4±413.2         | 373.1±92.9
Other     | 22    | 77      | 3.5±1.5               | 1203.0±671.8         | 349.8±136.4
All       | 685   | 4039    | 5.8±2.2               | 1351.6±660.9         | 249.7±110.2
Table 1. Results of SMB page, region and token count analysis in **kern encoding.
2.5 Publication
To ensure accessibility and ease of use for the research community, SMB is available on the Hugging Face Datasets platform.⁹
3. EVALUATION METRIC
One key aspect of building a successful OMR system is the way of measuring the quality of it. Measuring OMR performance remains an open problem due to a set of unanswered questions: What is considered a music symbol? How do we measure the correction effort? The answer to these questions remains an open topic that is being addressed in conjunction with musicologists.
Until today, the most popular metric used when evaluating OMR has been the Symbol Error Rate (in some works referred to as Edit Distance, Character Error Rate, or Music Error Rate), which is computed as the average number of elementary operations (insertions, deletions, or substitutions) required to convert prediction $\hat{z}_i$ into reference $z_i$, normalized by the length of the latter.
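As an illustration of the definition above, the sketch below computes a Symbol Error Rate over sequences of symbol tokens with a standard Levenshtein distance. The function names are hypothetical, and pooling the edits and reference lengths over the whole set of samples is an assumption about how the average is taken; it is one common convention, not necessarily the one used in the experiments reported here.

```python
def edit_distance(pred, ref):
    """Levenshtein distance (insertions, deletions, substitutions) between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(
                prev[j] + 1,             # delete p from the prediction
                curr[j - 1] + 1,         # insert r into the prediction
                prev[j - 1] + (p != r),  # substitute p by r (free if equal)
            ))
        prev = curr
    return prev[-1]


def symbol_error_rate(preds, refs):
    """Total edits over all samples divided by the total reference length."""
    total_ref_len = sum(len(r) for r in refs)
    if total_ref_len == 0:
        raise ValueError("references contain no symbols")
    total_edits = sum(edit_distance(p, r) for p, r in zip(preds, refs))
    return total_edits / total_ref_len


# One sample: a substituted token and a missing token against 4 reference tokens.
preds = [["8a#", "4c", "4d"]]
refs = [["8a#", "4c", "4e", "4f"]]
print(symbol_error_rate(preds, refs))  # 0.5
```

Because substitutions count as a single operation and only the reference length appears in the denominator, this value can exceed 1 when the prediction is much longer than the reference.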
Indeed, it still stands as the best metric to correlate with the human correction effort that is necessary to obtain the desired score. However, such correction effort does not indicate in which aspect of the transcription process the OMR is struggling, e.g., clefs, note heads, beams, accidentals, key signatures, or anything else.
In this work, we leverage Music-Score-Diff [28], currently maintained by Greg Chapman and better known as MusicDiff.¹⁰ MusicDiff computes the visual notation differences between two music scores; however, this tool has been known to have trouble processing the scores from OMR pipelines because it requires a parseable score (which is certainly not always the case with OMR systems). Due to this, and other shortcomings, we have improved MusicDiff significantly so it can now do the following: (i) it can compare non-parseable predicted **kern scores, (ii) it can compare all musical objects, not just notes/rests, (iii) it can compare all the notes in a measure without regard to voicing and/or membership in chords, and (iv) it can be told at what fine-grained level of detail it should compare the scores. We have also significantly changed how MusicDiff computes edit distance, and we have carefully audited and changed how music symbols are defined/counted.
⁹ huggingface.co/datasets/PRAIG/SMB
¹⁰ github.com/gregchapman-dev/musicdiff
Hence we present the resultant metric score, referred to as the OMR Normalized Edit Distance (OMR-NED), designed specifically to allow visual score comparison and measuring in a fine-grained manner, therefore standing as the first alternative to the traditionally used SER. OMR-NED defines a set of categories, and a detailed list of visual music symbols for each category, which can then be compared. Note that these music symbols are not defined in terms of **kern (or **ekern). Instead, they are defined in terms of visual music notation, in a file format agnostic way. The OMR-NED metric for the score comparison can be computed as the number of music symbol insertions and deletions required to convert the predicted score into the reference score, normalized by the sum of the number of music symbols in both:
$$\text{OMR-NED} = \frac{I + D}{N_1 + N_2} \qquad (1)$$
$I$ and $D$ are the total number of insertions and deletions of individual music symbols of all categories, and $N_1 + N_2$ is the total number of music symbols in the predicted and ground truth scores. We normalize using both scores' music symbol counts, because when a symbol is changed (say there is a predicted 2/4 time signature, but the ground truth is a 3/4 time signature), instead of computing a substitution of one symbol, we compute a deletion of the “2” and an insertion of the “3”, for an edit distance of two symbols.
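The normalization in Eq. (1) can be illustrated with a small sketch that counts insertions and deletions between two flat symbol lists through their longest common subsequence. This is a simplification under stated assumptions: MusicDiff matches musical objects by their position within the measure rather than by a plain sequence alignment, and the symbol labels and function names below are hypothetical.

```python
def insert_delete_counts(pred, ref):
    """Insertions and deletions needed to turn pred into ref, with no substitutions."""
    n, m = len(pred), len(ref)
    # lcs[i][j] holds the longest common subsequence length of pred[:i] and ref[:j].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if pred[i] == ref[j]:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    common = lcs[n][m]
    insertions = m - common  # reference symbols missing from the prediction
    deletions = n - common   # predicted symbols absent from the reference
    return insertions, deletions


def omr_ned(pred, ref):
    """(I + D) / (N1 + N2), following Eq. (1)."""
    insertions, deletions = insert_delete_counts(pred, ref)
    total = len(pred) + len(ref)
    return (insertions + deletions) / total if total else 0.0


# Predicted 2/4 versus ground-truth 3/4: delete the "2" and insert the "3".
pred = ["clef:G2", "timesig-top:2", "timesig-bottom:4"]
ref = ["clef:G2", "timesig-top:3", "timesig-bottom:4"]
print(insert_delete_counts(pred, ref))  # (1, 1)
print(round(omr_ned(pred, ref), 3))     # 0.333
```

Since every changed symbol contributes two edits, dividing by the symbol count of both scores keeps the value between 0 (identical scores) and 1 (no symbols in common).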
MusicDiff can be pointed at a folder full of ground truth (reference) score files, and a folder full of same-named predicted score files. When running in this mode, MusicDiff will produce a CSV file containing a spreadsheet with a row for each score file comparison and a column for each category. Each column contains the edit distances for that category, with the total edit distance and OMR-NED metrics for all the score comparisons in the final columns. There is a summary row for the entire run of score comparisons at the bottom, with a total edit distance for each category across all the scores, and an overall OMR-NED metric for the entire run.
3.1 OMR-NED categories
In this subsection we present the categories related to music notes and rests, and the non-note ones. These categories underpin the fine-grained analysis that OMR-NED provides.
Notes/Rests: Notes and rests contain multiple categories, each with its own symbol count. MusicDiff only directly compares notes/rests that have the same pitch and are on the same exact offset within the measure. This means that notes that are mismatched by pitch (or by offset within the measure) will have a large edit distance because the entire note (and all its symbols) is deleted, and then the new note is inserted.¹¹
Available categories:
Pitch: One symbol for the pitch.
Accidental: One symbol for the accidental.
Tie: One symbol if the note is tied to a subsequent note.
Note head: One symbol for the note head type (quarter note, half note, whole note, etc). The note head type for eighth notes and smaller is still a quarter note head. The flags/beams category differentiates further between shorter notes.
Note flags/beams: One symbol per flag or beam.
Dots: One symbol per dot.
Articulations: One symbol per articulation (staccato, tenuto, etc).
Ornaments: One symbol per ornament (trill, mordent, etc).
Grace: One symbol if a grace note; one more symbol if slashed.
Non-Notes: Non-note objects are a single category each. MusicDiff only directly compares non-note objects that are at the same exact offset within the measure. So, for example, text directions that are shifted horizontally from each other are deleted/inserted, instead of returning the edit distance between the strings.
Available categories:
Dynamic: Dynamics such as p, f, etc are one symbol only (not one symbol per character). Hairpin dynamic markings are one symbol for the hairpin direction plus an extra symbol for the hairpin duration.
Clef: Clefs are one symbol.
Key signature: Key signatures are one symbol per accidental.
Time signature: Time signatures are one symbol for the top, one for the bottom (e.g. 12/8 is two symbols, C is one symbol).
Slur: One symbol for duration.
Ottava: Two symbols: one for ottava type (e.g. 8va, 8ba, etc), and one for duration.
Direction: One symbol for each character in the string.
Arpeggio: One symbol for arpeggio type (up, down, undirected, non-arpeggiated). One more symbol if the arpeggio spans multiple staves.
Chord symbol: One symbol.
Lyric: One symbol for each character in the lyric syllable. One symbol for verse number/identifier.
Ending: One symbol for each character in the ending name. One symbol for measure count.
3.2 Edit distances
Edit lists are calculated as a series of inserts/deletes of musical objects, with the edit distance being the total number of symbols inserted/deleted. For example, if a predicted note has a flag, but the matching ground truth note has a beam instead, the edit distance will be 2: delete the flag (one symbol) and insert the beam (one symbol).
At a slightly higher level, if the predicted score has a note where the ground truth score has a rest, the edit will delete the note, and insert the rest. The edit distance in that case will be the number of symbols in the deleted note plus the number of symbols in the inserted rest.
At an even higher level, if the predicted score has an extra measure that does not exist in the ground truth score, that measure will be deleted, with an edit distance equal to the total number of symbols in that predicted measure.
¹¹ The description here is of the default categories. There can be more symbols per category (and more categories) if Style and/or Metadata is requested in addition to the default detail level. See gregchapman-dev.github.io/musicdiff/musicdiff/detaillevel.html for more information about specifying detail levels for MusicDiff.
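The three levels described above can be sketched with symbol multisets. This is a toy model rather than MusicDiff's code: the symbol labels, the function names, and in particular the assumption that a rest consists of a single symbol are hypothetical and only serve to reproduce the arithmetic of the examples.

```python
from collections import Counter


def matched_object_distance(pred_symbols, ref_symbols):
    """Symbols to delete plus symbols to insert between two matched objects."""
    pred, ref = Counter(pred_symbols), Counter(ref_symbols)
    deletions = sum((pred - ref).values())   # symbols only in the prediction
    insertions = sum((ref - pred).values())  # symbols only in the ground truth
    return deletions + insertions


def unmatched_object_distance(pred_symbols, ref_symbols):
    """Objects that cannot be matched: delete one entirely, insert the other."""
    return len(pred_symbols) + len(ref_symbols)


def extra_measure_distance(measure_objects):
    """An extra predicted measure costs the sum of all its symbols."""
    return sum(len(symbols) for symbols in measure_objects)


flagged_note = ["pitch", "notehead:quarter", "flag"]
beamed_note = ["pitch", "notehead:quarter", "beam"]
rest = ["rest"]  # assumed one-symbol rest, for illustration only

print(matched_object_distance(flagged_note, beamed_note))  # 2
print(unmatched_object_distance(flagged_note, rest))       # 4
print(extra_measure_distance([flagged_note, rest]))        # 4
```

The flag-versus-beam case yields 2 because the pitch and note head symbols are shared, whereas the note-versus-rest case pays for every symbol of both objects.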
4. BASELINE RESULTS
This work constitutes the first to introduce both the SMB benchmark dataset and the fine-grained evaluation metric OMR-NED. Therefore, in this section, we report a performance baseline using a state-of-the-art model, which serves as a reference for future work.
While the recommended use of SMB is as a test set, for the baseline we consider each texture type a subset of samples, namely Monophony, Pianoform, Quartet, and Other. These scores are used at the region level (assuming a previous LA step) and following a 5-fold cross validation framework, ensuring that every region takes part in the test split once. We train the framework both with **kern and **ekern encoding.
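The property that every region appears in exactly one test split can be sketched as follows. This is a generic k-fold partition, not the authors' released splits: the function name, the shuffling seed, and the use of integer identifiers for regions are assumptions, and the validation partition mentioned later would additionally have to be carved out of each training portion.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs in which each item is tested exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items round-robin into k disjoint folds.
    folds = [shuffled[i::k] for i in range(k)]
    for test_index in range(k):
        test = folds[test_index]
        train = [item for j, fold in enumerate(folds) if j != test_index for item in fold]
        yield train, test


# Example with 77 region identifiers, the size of the Other subset in Table 1.
regions = list(range(77))
for fold_number, (train, test) in enumerate(k_fold_splits(regions), start=1):
    print(fold_number, len(train), len(test))
```

Dealing items round-robin keeps the fold sizes within one item of each other, which matters for the smaller subsets such as Other and Quartet.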
For the learning framework, we resort to the state-of-the-art Sheet Music Transformer [10]. This architecture consists of an encoder-decoder network. The encoder is in charge of retrieving the image features and the decoder is a conditioned language model that predicts the most probable music symbol in an auto-regressive fashion.
The model was trained without the use of any pre-trained weights. We iterate for 400 epochs, considering the ADAM optimizer with a fixed learning rate of $10^{-4}$. We keep the weights that minimize the SER metric in the validation partition, validating every 5 epochs. Finally, all experiments were run using the Python language (v. 3.12.3) with the PyTorch framework (v. 2.0.0) on a single NVIDIA RTX 4080 card with 20GB of video memory.
Table 2 presents the average results obtained using the considered experimental baseline in terms of SER and the introduced OMR-NED. Regarding the (low) recognition rates, it is worth noting that Transformer architectures typically require large corpora to achieve convergence. Nevertheless, we report these values to facilitate future research using SMB. The results indicate ample room for improvement, highlighting the dataset as a challenging benchmark that should drive advancements in OMR.
Texture   | Encoding | ↓OMR-NED (%) | ↓SER (%)
Monophony | kern     | 94.1±5.0     | 57.1±0.8
Monophony | ekern    | 98.8±1.0     | 65.8±1.2
Pianoform | kern     | 57.4±6.0     | 31.4±1.5
Pianoform | ekern    | 77.2±2.0     | 55.1±2.2
Quartet   | kern     | 92.3±1.2     | 39.8±1.6
Quartet   | ekern    | 93.3±1.1     | 82.9±1.2
Other     | kern     | 89.6±2.4     | 60.2±6.2
Other     | ekern    | 93.7±2.0     | 72.4±7.3
Table 2. Results in terms of the SER (%) and OMR-NED (%) metrics when considering Monophony, Pianoform, Quartet and Other score textures. All cases are evaluated using 5-fold cross validation.

(Figure 4 panels: Ground truth, Pred, Render, Scan. The ground truth and predicted encodings each contain a **kern spine and a **text lyrics spine.)
Figure 4. Monophonic (with lyrics) region-level transcription example. Red boxes and circles represent transcription errors.
5. CASE STUDY: OMR-NED VS. SER
To better exhibit the pros and cons of using OMR-NED (at the region level) compared to the traditionally employed SER, we study two different transcription examples. The first one (see Fig. 4) highlights the wrongly transcribed notes and lyrics in a monophonic score. Looking at the results in Table 3, the different OMR-NED categories permit a fine-grained analysis of this behavior, with 65% of the errors related to notes and 35% to lyrics mistranscriptions.
Figure 5 shows an example of a pianoform transcription. In this case, most of the errors (88.2%) correspond to the note category; however, OMR-NED also outlines problems in the extra category, indicating that there are mistaken music symbols involving dynamics or the left-hand clef and key. It is worth mentioning that, given the level of detail OMR-NED depicts and the procedure to calculate the different edit distances between hypothesis and ground truth (see Section 3), the overall score penalizes the transcription errors more than the traditionally used SER.
The comparative analysis of OMR-NED and SER through these case studies demonstrates the advantages of adopting a more refined and structured evaluation metric. While SER provides an acceptable measurement of transcription accuracy, OMR-NED enables a more granular understanding of the specific transcription error types. This is particularly evident in the first example, where OMR-NED differentiates between note and lyric errors. Similarly, the second example highlights OMR-NED's ability to capture complex transcription issues in polyphonic pianoform music, particularly by exposing errors in musical symbols beyond pitch content, such as dynamics and clef misclassifications. As a more informative metric, OMR-NED supports the development of more robust OMR systems.
(Figure 5 panels: Ground truth, Pred, Render, Scan. The ground truth and predicted encodings each contain two **kern spines and a **dynam spine.)
Figure 5. Pianoform region-level transcription example. Red boxes and circles represent transcription errors.
Example | Note | Extra | Lyrics | Measure | Part | Staff Group | OMR-NED | Symbol Error Rate
Fig. 4  | 65.0 | 0     | 35.0   | 0       | 0    | 0           | 13.6    | 9.0
Fig. 5  | 88.2 | 11.8  | 0      | 0       | 0    | 0           | 40.9    | 13.0
Table 3. Results in terms of the SER (%) and OMR-NED Categories (%) metrics. The columns Note through Staff Group are the OMR-NED categories. Note that OMR-NED is prepared for full-page recognition, thus some categories do not affect the OMR-NED score in the region-level scenario.
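The per-category percentages in Table 3 can be read as each category's share of the total edit distance. The sketch below shows that computation under this assumption; the raw counts are invented purely for illustration and were chosen so that the shares match the Fig. 4 row, since the underlying per-category edit distances are not reported here.

```python
def category_shares(edit_distances):
    """Percentage of the total edit distance attributed to each category."""
    total = sum(edit_distances.values())
    if total == 0:
        return {name: 0.0 for name in edit_distances}
    return {name: round(100 * value / total, 1) for name, value in edit_distances.items()}


# Hypothetical per-category edit distances (not taken from the paper).
example_counts = {
    "Note": 13, "Extra": 0, "Lyrics": 7,
    "Measure": 0, "Part": 0, "Staff Group": 0,
}
print(category_shares(example_counts))
```

With these invented counts, the Note and Lyrics categories account for 65.0% and 35.0% of the edits, and the remaining categories for 0.0%.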
6. CONCLUSION
In this work, we addressed a long-standing gap in the Optical Music Recognition (OMR) literature by introducing the Sheet Music Benchmark (SMB), a collection of scores specifically designed for evaluating modern end-to-end OMR pipelines. SMB constitutes the first publicly available corpus that allows comprehensive end-to-end evaluation at multiple levels, including layout analysis as well as region-level and full-page transcription tasks. We also provided baseline results to guide future research efforts in this area. The reported performance with a state-of-the-art methodology demonstrates the benchmark's complexity and its effectiveness in assessing OMR systems targeting Common Western Modern Notation.
Furthermore, we introduced a novel evaluation metric, the OMR Normalized Edit Distance (OMR-NED), which enables a fine-grained analysis of OMR system performance. OMR-NED decomposes the traditional edit distance calculation into distinct error categories, facilitating precise profiling of potential system deficiencies. As demonstrated in Sec. 5, the use of OMR-NED provides valuable insights into transcription errors, thereby enabling targeted improvements in OMR frameworks.
The combined contributions of SMB and OMR-NED establish a standardized evaluation pipeline, facilitating clear performance comparisons among OMR systems. Thus, this work significantly advances the field by addressing a critical need within the OMR research community.
7. ACKNOWLEDGMENTS
This paper is supported by grant CISEJI/2023/9 from “Programa para el apoyo a personas investigadoras con talento (Plan GenT) de la Generalitat Valenciana”.
8. REFERENCES
[1] J. Cal o-Za agoza, J. H. J ., and A. Pacha, “Unde -
s anding Op ical Music Recogni ion,” ACM Compu .
Su ., ol. 53, no. 4, pp. 77:1–77:35, 2021.
[2] D. Hu on, “Humd um and Ke n: Selec i e Fea u e En-
coding BT - Beyond MIDI: The handbook o musi-
cal codes,” in Beyond MIDI: The handbook o musical
codes. Camb idge, MA, USA: MIT P ess, jan 1997,
pp. 375–401.
[3] A. Hankinson, P. Roland, and I. Fujinaga, “The Music
Encoding Ini ia i e as a Documen -Encoding F ame-
wo k,” in P oceedings o he 12 h In e na ional So-
cie y o Music In o ma ion Re ie al Con e ence, IS-
MIR 2011, Miami, Flo ida, USA, Oc obe 24-28, 2011.
Uni e si y o Miami, 2011, pp. 293–298.
[4] M. Good e al., “MusicXML: An in e ne - iendly o -
ma o shee music,” in Xml con e ence and expo.
Ci esee , 2001, pp. 03–04.
[5] M. Al a o-Con e as, D. Rizo, J. M. Iñes a, and
J. Cal o-Za agoza, “OMR-assis ed ansc ip ion: a
case s udy wi h ea ly p in s,” in P oceedings o he
22nd In e na ional Socie y o Music In o ma ion Re-
ie al Con e ence. Online: ISMIR, No . 2021, pp.
35–41.
[6] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. Ma cal,
C. Guedes, and J. S. Ca doso, “Op ical music ecog-
ni ion: s a e-o - he-a and open issues,” In e na ional
Jou nal o Mul imedia In o ma ion Re ie al, ol. 1,
pp. 173–190, 2012.
[7] E. Sha i and G. Fazekas, “Op ical music ecogni ion:
S a e o he a and majo challenges,” a Xi p ep in
a Xi :2006.07885, 2020.
[8] L. Tuggene , R. Embe ge , A. Ghosh, P. Sage , Y. P.
Sa yawan, J. Mon oya, S. Goldschagg, F. Seibold,
U. Gu , P. Acke mann e al., “Real wo ld music objec
ecogni ion,” T ansac ions o he In e na ional Socie y
o Music In o ma ion Re ie al, ol. 7, no. 1, pp. 1–14,
2024.
[9] J. Mayer, M. Straka, J. Hajic, and P. Pecina, “Practical End-to-End Optical Music Recognition for Pianoform Music,” in Document Analysis and Recognition - ICDAR 2024 - 18th International Conference, Athens, Greece, August 30 - September 4, 2024, Proceedings, Part VI, ser. Lecture Notes in Computer Science, E. H. B. Smith, M. Liwicki, and L. Peng, Eds., vol. 14809. Springer, 2024, pp. 55–73.
[10] A. Ríos-Vila, J. Calvo-Zaragoza, and T. Paquet, “Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription,” in International Conference on Document Analysis and Recognition. Springer, 2024, pp. 20–37.
[11] A. Ríos-Vila, J. Calvo-Zaragoza, D. Rizo, and T. Paquet, “Sheet Music Transformer++: End-to-End Full-Page Optical Music Recognition for Pianoform Sheet Music,” CoRR, vol. abs/2405.12105, 2024.
[12] V. Emiya, N. Bertin, B. David, and R. Badeau, “MAPS - A piano database for multipitch estimation and automatic transcription of music,” Research Report, Jul. 2010.
[13] F. Foscarin, A. McLeod, P. Rigaux, F. Jacquemard, and M. Sakai, “ASAP: a dataset of aligned scores and performances for piano transcription,” in Proceedings of the 21st International Society for Music Information Retrieval Conference, ISMIR 2020, Montreal, Canada, October 11-16, 2020, J. Cumming, J. H. Lee, B. McFee, M. Schedl, J. Devaney, C. McKay, E. Zangerle, and T. de Reuse, Eds., 2020, pp. 534–541.
[14] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C. A. Huang, S. Dieleman, E. Elsen, J. H. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[15] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano, “An experimental comparison of audio tempo induction algorithms,” IEEE Trans. Speech Audio Process., vol. 14, no. 5, pp. 1832–1844, 2006.
[16] G. Tzanetakis and P. R. Cook, “Musical genre classification of audio signals,” IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293–302, 2002.
[17] J. H. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. PMLR, 2017, pp. 1068–1077.
[18] D. Byrd and J. Simonsen, “Towards a Standard Testbed for Optical Music Recognition: Definitions, Metrics, and Page Images,” Journal of New Music Research, vol. 44, Jul. 2015.
[19] A. Fornés, A. Dutta, A. Gordo, and J. Lladós, “CVC-MUSCIMA: a ground truth of handwritten music score images for writer identification and staff removal,” Int. J. Document Anal. Recognit., vol. 15, no. 3, pp. 243–251, 2012.
[20] J. Hajic and P. Pecina, “The MUSCIMA++ Dataset for Handwritten Optical Music Recognition,” in 14th IAPR International Conference on Document Analysis and Recognition, ICDAR 2017, Kyoto, Japan, November 9-15, 2017. IEEE, 2017, pp. 39–46.
[21] J. Calvo-Zaragoza and D. Rizo, “Camera-PrIMuS: Neural End-to-End Optical Music Recognition on Realistic Monophonic Scores,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, E. Gómez, X. Hu, E. Humphrey, and E. Benetos, Eds., 2018, pp. 248–255.
[22] C. S. Sapp, “Online database of scores in the humdrum file format,” in ISMIR 2005, 6th International Conference on Music Information Retrieval, London, UK, 11-15 September 2005, Proceedings, 2005, pp. 664–665.
[23] L. Pugin, R. Zitellini, and P. Roland, “Verovio: A library for engraving MEI music notation into SVG,” in Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR 2014, Taipei, Taiwan, October 27-31, 2014, H. Wang, Y. Yang, and J. H. Lee, Eds., 2014, pp. 107–112.
[24] N. Condit-Schultz and C. Arthur, “humdrumR: a new take on an old approach to computational musicology,” in Proceedings of the International Society for Music Information Retrieval, 2019, pp. 715–722.
[25] C. E. Cancino-Chacón, S. D. Peter, E. Karystinaios, F. Foscarin, M. Grachten, and G. Widmer, “Partitura: A Python Package for Symbolic Music Processing,” in Proceedings of the Music Encoding Conference (MEC2022), Halifax, Canada, 2022.
[26] M. S. Cuthbert and C. Ariza, “Music21: A toolkit for computer-aided musicology and symbolic music data,” in Proceedings of the 11th International Society for Music Information Retrieval Conference, ISMIR 2010, Utrecht, Netherlands, August 9-13, 2010, J. S. Downie and R. C. Veltkamp, Eds. International Society for Music Information Retrieval, 2010, pp. 637–642.
[27] J. Cerveto-Serrano, D. Rizo, and J. Calvo-Zaragoza, “kernpy: a Humdrum **Kern Oriented Python Package for Optical Music Recognition Tasks,” in Proceedings of the Music Encoding Conference (MEC2025), London, United Kingdom, 2025.
[28] F. Foscarin, F. Jacquemard, and R. Fournier-S’niehotta, “A diff procedure for music score files,” in 6th International Conference on Digital Libraries for Musicology, 2019, pp. 58–64.