OPTICAL MUSIC RECOGNITION OF JAZZ LEAD SHEETS
Juan C. Martinez-Sevilla1, Francesco Foscarin2, Patricia Garcia-Iasci1,3,
David Rizo1,4, Jorge Calvo-Zaragoza1, Gerhard Widmer2
1 Pattern Recognition and Artificial Intelligence Group, University of Alicante, Spain
2 Institute of Computational Perception, Johannes Kepler University, Austria
3 University of Salamanca, Spain
4 Instituto Superior de Enseñanzas Artísticas de la Comunidad Valenciana, EASDA, Spain
ABSTRACT
In this paper, we address the challenge of Optical Music Recognition (OMR) for handwritten jazz lead sheets, a widely used musical score type that encodes melody and chords. The task is challenging due to the presence of chords, a score component not handled by existing OMR systems, and the high variability and quality issues associated with handwritten images. Our contribution is two-fold. We present a novel dataset consisting of 293 handwritten jazz lead sheets of 163 unique pieces, amounting to 2021 total staves aligned with Humdrum **kern and MusicXML ground truth scores. We also supply synthetic score images generated from the ground truth. The second contribution is the development of an OMR model for jazz lead sheets. We discuss specific tokenisation choices related to our kind of data, and the advantages of using synthetic scores and pretrained models. We publicly release all code, data, and models. 1
1. INTRODUCTION
A lead sheet is a kind of sheet music (musical score) that encodes the melody, chords, and sometimes lyrics of a music composition. In contrast to music styles such as classical music, where the composer specifies with a high degree of precision what musicians have to play, lead sheets are popular in contexts where a lot of freedom is given to the performer. This is the case of jazz music, which lists improvisation as one of its core elements. Multiple collections of “unofficial” lead sheets (i.e., not released by the original authors), named Fake Books, were created by manual transcription from the performed tracks and are widely employed by jazz musicians during learning, performing, and composing [1].
1 https://grfia.dlsi.ua.es/jazz-omr/
© J. C. Martinez-Sevilla, F. Foscarin, P. Garcia-Iasci, D. Rizo, J. Calvo-Zaragoza, and G. Widmer. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. C. Martinez-Sevilla, F. Foscarin, P. Garcia-Iasci, D. Rizo, J. Calvo-Zaragoza, and G. Widmer, “Optical Music Recognition of Jazz Lead Sheets”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Figure 1. A (particularly problematic) handwritten lead sheet of Duke Ellington’s ballad Prelude to a Kiss.
However, handwritten or printed lead sheets still have limited utility compared to digitised lead sheets, which store musical elements in a format that is easily accessible to both people and computers. For example, a digitised lead sheet enables operations such as corrections, transposition, sonification of melody and chords, changing the layout, extension into a longer arrangement, and use in automatic systems for analysis, retrieval, and accompaniment [2].
Producing a digitised lead sheet is possible with music notation software such as MuseScore, Finale, Dorico, etc. However, musicians often find it faster or more enjoyable to write a lead sheet on paper by hand. This is where Optical Music Recognition (OMR) technology becomes useful in the context of jazz lead sheets, as it promises to convert handwritten ones into a digitised format.
However, this task presents unique challenges. The first is the variability in handwriting styles for all the musical symbols, paired with the necessity of handling “dirty” notation such as crossing out and corrections (see Figure 1). The second is the handling of chord symbols, which, to the best of our knowledge, are not included in any existing OMR system. This includes correctly predicting a musically valid chord name, but also correctly aligning the chord to its position in the musical score. Using heuristics based on the vertical alignment is often not sufficient, as we can see in the score of Figure 1, where the first two chords are almost on top of the 2nd and 4th note, even if in the ground truth they should be aligned to the 1st and 3rd note. Moreover, we need to make some original considerations on the tokenisation process of chords and melodies, i.e., on how to effectively preprocess them for the usage in our network. Finally, there isn’t a currently available dataset of handwritten jazz lead sheets and digitised ground truth specifically devised for OMR tasks.
In this paper, we focus on lead sheets with only melody and chords (leaving lyrics for a future extension) and address the aforementioned challenges. We present a new dataset of 293 handwritten lead sheets, collected from jazz school students and professional musicians from Spanish institutions, which include melody and chords (see Section 3). We propose an end-to-end OMR model (see Section 4) dedicated to these data, and propose a set of experiments (see Section 5) to analyse the impact of different tokenisation strategies, the importance of pretraining, and the usage of synthetic data.
2. RELATED WORK
Our problem of transcribing melody and aligned chords is very similar to the transcription of melody and lyrics targeted by Martinez-Sevilla et al. [3]. They develop a network that, given a music region rotated by 90 degrees, performs an internal reshaping of the hidden space to horizontally slice the score at note position. The result is a sequential hidden space that alternately encodes notes and lyrics. A similar approach was proposed for polyphonic music by Ríos-Vila et al. [4]. Although our lead sheets, with chords positioned above the notes, may appear similar and could suggest a comparable approach, there is a fundamental distinction. The earlier studies relied on synthetic data with perfect vertical alignment, while our handwritten data exhibits vertical misalignment between chords and notes, rendering this approach unsuitable.
A more promising system, which is currently state-of-the-art for full-page OMR of polyphonic music, is the Sheet Music Transformer [5], where the geometric considerations described above are left out in favor of an autoregressive transformer, which cross-attends to the encoded input image and learns to directly produce an output in Humdrum **kern format [6], a compact encoding of musical scores. While we are not targeting polyphonic or full-page recognition, we reuse their network architecture, since it is flexible enough to be adapted to our use case, and it has the potential to be easily scaled for future systems.
Regarding our proposed dataset, there are no other publicly available collections of handwritten lead sheets. A related work is the CoCoPops dataset [7], a recent collection of digitised lead sheets. While there are no overlaps between their pieces and ours, we support their effort to standardise the encoding and we make available our digitised scores in the same format, i.e., Humdrum **kern files, with Harte [8] syntax for chords. The Harte syntax is also used by the ChoCo dataset [9], which partially overlaps with our dataset, though their interest is mainly on chords.
3. DATASET
The dataset we release consists of images of musical scores, aligned with a digitised version. Every score encodes a lead sheet of a jazz standard, consisting of a monophonic melody 2 and chord symbols.
3.1 Digitised musical scores
We provide musical scores of 163 unique jazz standards in MusicXML and Humdrum **kern format. The latter is widely used in systems that output musical scores [5], because it is a compact and easy-to-handle format. The MusicXML scores are taken from the Wikifonia database (discontinued in 2013) and partially corrected. We also leave the lyrics, if present in the original files, as they could be helpful for future extensions, but we don’t consider them in this work.
We convert the MusicXML scores to **kern with the musicxml2hum tool in Humlib. 3 However, the original handling of chords creates an **mxhm spine with root, type, and bass, but completely discards the chord extensions (which are an important part of jazz chords). Moreover, it translates the MusicXML harmony/kind 4 field into very long strings, which contradicts our motivations for the usage of the **kern format. For these reasons, we developed an extension of the musicxml2hum tool that converts the chords into a representation that is suitable for our goals (see Section 3.5).
3.2 Musical score images
We provide musical score images of two kinds: handwritten and synthetic.
The handwritten scores were produced by jazz professionals and students at different levels, who were provided with printed versions of the digitised musical scores. They were instructed to copy the scores while maintaining the same chord symbols and layout (i.e., the same number of measures for each staff). To simulate a realistic use case, participants were asked to either scan the completed scores or photograph them using a mobile phone. All the collected scores were manually checked to assess their quality (more details in Section 3.4). Some jazz standards were copied multiple times, for a total of 293 handwritten scores with an average length of 31.6 measures. To further align with real-world usage scenarios, we release the images in JPG format as they were received, retaining their varying resolutions, angles, and light conditions.
2 Few samples have polyphonic melodies.
3 https://github.com/craigsapp/humlib (Retrieved September 10, 2024)
4 https://www.w3.org/2021/06/musicxml40/musicxml-reference/elements/harmony (Retrieved March 26, 2025)
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
The synthetic part was produced with the MuseScore 4 command line tool: for each piece in our digitised scores collection, we generated 2 synthetic full-page renderings: one with the Classical font and one with the MuseJazz font, which tries to mimic the human handwritten style that is commonly used for jazz music. The images were generated in SVG format, then the lyrics were removed, and finally images were converted to PNG. In total, we have 326 synthetic scores with an average length of 30.3 measures.
3.3 From page to staff regions
The data presented above can be used to train systems focusing on page-level OMR. However, a large number of systems work at region level, where each region consists of a music part that can be read sequentially, typically from left to right. For our lead sheets, a region would be a single staff with chord symbols above it. This region-level approach simplifies the OMR task significantly, requiring smaller networks and less training data. Moreover, dedicated networks can be trained to segment the page into multiple regions, making this methodology both practical and effective when working with small datasets.
For this reason, we also include region annotations in the dataset, specifically bounding boxes that identify each staff with chords and a reference to their corresponding digitised staves. We created the bounding boxes using a region identifier system [10] specifically trained for recognising music parts and manually checked them afterwards. The alignment of a region to the specific part of the **kern score has been obtained by splitting at the end of line token (!!linebreak:original), which is automatically inserted when converting from MusicXML using the above mentioned musicxml2hum tool from Humlib and the yank tool from Humdrum Tools 5. To keep the **kern source clean, and the region-wise prediction task reasonable, all comments and measure numbers have been removed.
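The splitting step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline (which relies on musicxml2hum and yank); the function name `split_regions` is hypothetical, and it assumes that the line-break marker sits on its own line, that comments are lines starting with "!", and that measure numbers are the digits following "=" in barline tokens.

```python
import re

LINEBREAK = "!!linebreak:original"  # marker named in the text


def split_regions(kern: str) -> list[str]:
    """Split a **kern document into staff regions at line-break markers.

    Assumptions (not confirmed by the paper): comments are lines starting
    with "!", and measure numbers are digits inside barline tokens ("=12").
    """
    regions, current = [], []
    for line in kern.splitlines():
        if line.strip() == LINEBREAK:
            # A line break closes the current region.
            if current:
                regions.append(current)
            current = []
        elif line.startswith("!"):
            # Drop every other comment line.
            continue
        else:
            # Strip measure numbers from barline fields only.
            fields = [
                re.sub(r"\d+", "", field) if field.startswith("=") else field
                for field in line.split("\t")
            ]
            current.append("\t".join(fields))
    if current:
        regions.append(current)
    return ["\n".join(region) for region in regions]


example = (
    "**kern\t**mxhm\n=1\t=1\n4c\tC:maj\n"
    "!!linebreak:original\n"
    "=2\t=2\n4d\tD:min\n!! a comment\n*-\t*-"
)
regions = split_regions(example)
```

Each returned string would then be paired with the bounding box of the corresponding staff.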
Region-wise, our dataset consists of 2021 handwritten regions and 2208 synthetic regions with an average of 4.6 and 4.5 measures, respectively.
3.4 Quality issues in the dataset
During the inspection of the handwritten scores, we identified the following quality problems. Strike-through, hard-to-read calligraphy, and note-chord misalignments are frequent, but they are part of what makes this problem challenging, so we retain them.
A large number of scores didn’t respect the layout of the reference digitised score. This is problematic for page-level systems that aim at correctly transcribing the layout, and in particular for the extraction of region-level annotations. To avoid discarding all this material, we decided to adapt the digitised scores, i.e., we created a different version of the MusicXML and **kern files that respect the layout of the handwritten score. For this reason, in our dataset, there are as many digitised scores as the number of handwritten scores, though the differences between multiple versions of the same digitised score may be minimal. Partially written scores were also treated in the same way.
5 https://github.com/humdrum-tools/humdrum-tools (Retrieved September 10, 2024)
Another kind of problem we identified is in the usage of equivalent chord symbols. In jazz lead sheets, there are multiple ways of writing the same chord, for example, a major 7 chord can be written as “maj7” or “∆7”, or a minor chord can be written with “-”, “m”, “mi” or “min”. Jazz musicians learn to read all these equivalent symbols and are not sensitive to the specific symbols used. Indeed, we found that, despite being instructed to use the same symbols, many transcribers used equivalent ones. A solution to this problem would be to manually adapt all labels in the MusicXML scores to match the ones used by the transcriber. However, given the time-intensive nature of this task, we have prioritised other aspects of the dataset for this release, with plans to incorporate this in future versions. An alternative path, which we follow in this paper, is to define the chord recognition problem as a mapping not between an image and a specific label, but between an image and the entire class of equivalent labels for a certain chord. In a use-case scenario, the choice of a specific label for the final output can be selected by the user depending on their preference. We then need to identify sets of equivalent chords in a unique way, as we detail in the next section.
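The idea of mapping written symbols to a class of equivalent labels can be sketched as follows. The table is hypothetical and covers only the two examples given above; the names `EQUIVALENT_TYPES`, `canonical_type`, and `same_chord_type` are illustrative and not part of the released code.

```python
# Hypothetical equivalence classes, limited to the examples in the text.
# Keys are canonical labels; values are the written variants that musicians
# treat as the same chord type.
EQUIVALENT_TYPES = {
    "maj7": {"maj7", "∆7"},
    "min": {"-", "m", "mi", "min"},
}


def canonical_type(written: str) -> str:
    """Return the canonical label of the class containing `written`."""
    for canonical, variants in EQUIVALENT_TYPES.items():
        if written in variants:
            return canonical
    # Symbols without a known class are left unchanged.
    return written


def same_chord_type(a: str, b: str) -> bool:
    """Two written symbols match if they belong to the same class."""
    return canonical_type(a) == canonical_type(b)
```

With such a mapping, a prediction counts as correct whenever it falls in the same class as the handwritten symbol, regardless of the spelling chosen by the transcriber.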
Finally, we address a quality issue in the chord encoding of MusicXML scores. We found that many scores used an encoding that did not follow the MusicXML standard. For example, instead of encoding a C6 chord with the harmony/kind attribute major-sixth, the harmony/text field was used with the “6” string. While the graphical output rendered from MusicXML is the same, this is an issue for the automatic processing of scores and our **kern converter. We manually corrected the MusicXML files to ensure a well-defined and consistent chord encoding.
3.5 Chord formalization
To properly describe the chords in our **kern scores, we rely on the Harte syntax [8]. However, Harte syntax still allows multiple ways of encoding the same chord, for example, with combinations of different shorthands and extensions, or by specifying every single note. Therefore, we restrict the syntax as follows. Every chord is composed of a root, a shorthand, an optional list of extensions and an optional alternative bass, encoded in a string as:
root:shorthand(extension1,extension2,...)/bass (1)
The list of shorthands we consider is ’aug’, ’aug7’, ’dim’, ’dim7’, ’hdim7’, ’maj’, ’maj11’, ’maj13’, ’maj6’, ’maj7’, ’maj9’, ’min’, ’min11’, ’min13’, ’min6’, ’min7’, ’min9’, ’minmaj7’, ’sus2’, ’sus4’, ’11’, ’13’, ’7’, ’9’. All remaining chords are obtained by adding an extension, which can add or remove a certain pitch, for example, C:7(b9). A chord for which a shorthand exists cannot be encoded with an extension; for example, C:maj(7,b9) is not a valid formulation in our syntax. The note removal in an extension, i.e., the subtract degree type in MusicXML 6, is coded with the “no” prefix, e.g., C:maj(no5,b9).
6 https://www.w3.org/2021/06/musicxml40/musicxml-reference/data-types/degree-type-value (Retrieved March 26, 2025)
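As an illustration of the restricted syntax in Equation (1), the following sketch parses a chord string into its four components and checks the shorthand against the list above. It is only a syntactic check under stated assumptions: the accepted root spelling (a letter A-G followed by "#" or "b" signs) and the free-form bass field are guesses, the name `parse_chord` is hypothetical, and the rule that forbids extensions when a shorthand already exists is not enforced.

```python
import re

# Shorthands listed in the text.
SHORTHANDS = {
    "aug", "aug7", "dim", "dim7", "hdim7", "maj", "maj11", "maj13", "maj6",
    "maj7", "maj9", "min", "min11", "min13", "min6", "min7", "min9",
    "minmaj7", "sus2", "sus4", "11", "13", "7", "9",
}

# root:shorthand(extension1,extension2,...)/bass
# Assumption: roots are A-G plus optional accidentals; bass is free-form.
CHORD_RE = re.compile(
    r"^(?P<root>[A-G][#b]*)"
    r":(?P<shorthand>[a-z0-9]+)"
    r"(?:\((?P<ext>[^()]+)\))?"
    r"(?:/(?P<bass>[^/()]+))?$"
)


def parse_chord(text: str) -> dict:
    """Parse a chord string; raise ValueError if it does not fit the pattern."""
    match = CHORD_RE.match(text)
    if match is None:
        raise ValueError(f"not a chord in the restricted syntax: {text!r}")
    if match["shorthand"] not in SHORTHANDS:
        raise ValueError(f"unknown shorthand: {match['shorthand']!r}")
    extensions = match["ext"].split(",") if match["ext"] else []
    return {
        "root": match["root"],
        "shorthand": match["shorthand"],
        "extensions": extensions,
        "bass": match["bass"],  # None when no alternative bass is given
    }
```

For example, `parse_chord("C:maj(no5,b9)")` yields the root "C", the shorthand "maj", and the extensions "no5" and "b9".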
3.6 Data split
To facilitate the usage of this dataset for benchmarking, we release an official data split. We consider the number of unique pieces in the dataset, and assign 70%, 10%, and 20% of them to the train, validation, and test splits, respectively. We ensure that all unique pieces for which multiple handwritten copies exist are placed in the train subset only. This ensures that we do not have data in different splits that share the same ground truth, which would bias the experiments, since the model could just memorise a piece and reproduce it. Specifically, out of 163 unique pieces, we assign 115, 16, and 32 to the train, validation, and test subsets. When counting the number of handwritten scores (i.e., including multiple versions of the same piece), the three subsets include 245, 16, and 32 scores, and when counting the number of regions, 1696, 102, and 220.
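The constraint behind the split can be sketched in a few lines. This is not the procedure used to build the official split, which is distributed with the dataset; the function `split_pieces`, its arguments, and the use of a seeded shuffle are assumptions made for illustration.

```python
import random


def split_pieces(copies: dict[str, int], val_ratio: float = 0.1,
                 test_ratio: float = 0.2, seed: int = 0) -> dict[str, list[str]]:
    """Split unique pieces so that multi-copy pieces stay in the train set.

    `copies` maps a piece name to its number of handwritten copies.
    Validation and test sizes are fractions of the number of unique pieces;
    the train set receives everything that remains.
    """
    pieces = sorted(copies)
    multi = [p for p in pieces if copies[p] > 1]
    single = [p for p in pieces if copies[p] == 1]
    random.Random(seed).shuffle(single)

    n_val = int(val_ratio * len(pieces))
    n_test = int(test_ratio * len(pieces))
    if n_val + n_test > len(single):
        raise ValueError("not enough single-copy pieces for validation and test")

    return {
        "validation": single[:n_val],
        "test": single[n_val:n_val + n_test],
        # Multi-copy pieces never leave the train subset.
        "train": multi + single[n_val + n_test:],
    }
```

With 163 unique pieces, these ratios give 115, 16, and 32 pieces, which matches the counts reported above. Because validation and test contain only single-copy pieces, their score counts equal their piece counts.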
4. MODEL
In this section, we describe our approach to OMR for jazz lead sheets. We focus on the region-level approach, i.e., our system’s input is a single staff with chords on top. 7 Formally, each sample in our dataset consists of a pair $(x, y)$ of a region-level image represented as a tensor $x \in \mathbb{R}^{c \times h \times w}$, and a sequence of musical symbols $y = y_1, y_2, \dots, y_{|y|}$. Each sequence of symbols is drawn from a common vocabulary $\Sigma$ and is uniquely associated with a **kern file, through the tokenisation process described in Section 4.2. While the score images are distributed with the original colors, we use black-and-white images, so we set $c = 1$. We fix the dimension of each image to $h = 128$ and $w = 1000$ by first downsampling each image, maintaining the aspect ratio, and then right padding on the width dimension.
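A minimal sketch of this preprocessing step is given below, using plain Python lists so that it stays self-contained. Several details are assumptions, not taken from the paper: the nearest-neighbour resampling, the white padding value (255), the extra bottom padding for regions that are limited by the width, and the function name `resize_and_pad`. Note that this sketch would also enlarge images smaller than the target, whereas the text only mentions downsampling.

```python
def resize_and_pad(img, target_h=128, target_w=1000, pad_value=255):
    """Resize a grayscale image (list of rows) to fit target_h x target_w
    while keeping the aspect ratio, then pad on the right (and bottom)."""
    src_h, src_w = len(img), len(img[0])
    # One scale for both axes preserves the aspect ratio.
    scale = min(target_h / src_h, target_w / src_w)
    new_h = max(1, min(target_h, round(src_h * scale)))
    new_w = max(1, min(target_w, round(src_w * scale)))

    # Nearest-neighbour resampling (assumption; any interpolation would do).
    resized = [
        [
            img[min(src_h - 1, int(r / scale))][min(src_w - 1, int(c / scale))]
            for c in range(new_w)
        ]
        for r in range(new_h)
    ]

    # Right padding on the width dimension, as described in the text.
    padded = [row + [pad_value] * (target_w - new_w) for row in resized]
    # Bottom padding only matters when the width was the limiting side.
    padded += [[pad_value] * target_w for _ in range(target_h - new_h)]
    return padded
```

In practice an image library would replace the nested list comprehension; the point here is only the order of operations: scale first, pad second.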
4.1 Architecture
We use the architecture proposed by Ríos-Vila et al. [5]. The architecture is an encoder-decoder model, where the encoder, a ConvNeXt [12] network, acts as a feature-learning block, by processing the image $x$ into a compressed hidden representation $z$. The decoder is from Vaswani et al. [13], and produces at each step $t$ a probability distribution over all symbols $\Sigma$, $\hat{p}_t \in \mathbb{R}^{|\Sigma|}$, taking into account both the precedent tokens $(\hat{y}_0, \dots, \hat{y}_{t-1})$ and the image hidden representation $z$:
$$\hat{p}_t = P(\hat{y}_t \mid z, (\hat{y}_0, \hat{y}_1, \hat{y}_2, \dots, \hat{y}_{t-1})) \qquad (2)$$
Prediction consists in concatenating the arg max of $\hat{p}_t$, i.e. $\hat{y}_t$, for all the steps. This sequence always starts with the special token <bos> (begin of sequence) and the decoder stops predicting when $\hat{y}_{|\hat{y}|}$ = <eos> (end of sequence).
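The decoding loop described above amounts to greedy decoding, which can be sketched independently of any deep-learning library. Here `step_fn` is a hypothetical stand-in for the decoder of Equation (2): it receives the image representation `z` and the tokens produced so far, and returns one probability per vocabulary entry. The `max_len` safeguard is an addition of this sketch.

```python
def greedy_decode(step_fn, z, vocab, max_len=512):
    """Concatenate the arg max of each step until <eos> is produced."""
    bos, eos = vocab.index("<bos>"), vocab.index("<eos>")
    sequence = [bos]  # the sequence always starts with <bos>
    for _ in range(max_len):
        probs = step_fn(z, sequence)
        # arg max over the vocabulary
        next_id = max(range(len(probs)), key=probs.__getitem__)
        sequence.append(next_id)
        if next_id == eos:
            break
    # Drop the special tokens before returning the symbols.
    return [vocab[i] for i in sequence[1:] if i != eos]


# Toy usage: a fake decoder that ignores the image and replays a script.
vocab = ["<bos>", "<eos>", "4b", "E:min6"]
script = [2, 3, 1]


def toy_step(z, prefix):
    target = script[len(prefix) - 1]
    return [1.0 if i == target else 0.0 for i in range(len(vocab))]


decoded = greedy_decode(toy_step, None, vocab)
```

A real system would replace `toy_step` with a forward pass of the transformer decoder conditioned on the encoder output.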
An advantage of using the same model as Ríos-Vila et al. is that we can initialise our network with the weights from their publicly available pretrained checkpoint. Even though our tasks are not the same, they may still be related enough to make pretraining beneficial. The only part of the model we need to change is the last linear layer, since we use a different symbol vocabulary $\Sigma$.
7 Existing layout analysis methods can be successfully used to detect regions from full-page images [11].
Figure 2. First two measures of Ain’t Misbehavin’ in **kern format, with graphical examples of our tokenisation strategies.
4.2 Tokenisation
As depicted on the left part of Figure 2, our **kern scores are text files with two tab-separated spines, one for melody and one for chords. The melodic spine specifies the duration and pitch of the notes (or if there is a rest), and also other graphical elements, such as beaming starts (L) and ends (J). There are multiple ways of using these files to train our neural network; we propose and evaluate three different tokenisation strategies for the **kern scores: word-level, character-level, and medium-level, which build three different vocabularies $\Sigma_w$, $\Sigma_c$, and $\Sigma_m$ respectively.
The word-level tokeniser is the same employed by Ríos-Vila et al. [5], and considers every tab or new-line separated string as a single token. This has the advantage of generating short sequences, at the cost of a very large vocabulary (1762 tokens). With this tokenisation, the model is forced to learn independently how to process symbols, even when their graphical representation may be very similar; for example, a D# and a D# with a duration dot.
The character-level strategy treats every character as a single token. In direct contrast to the previous tokeniser, this yields long sequences and a small vocabulary (69 tokens). The system can more effectively reuse the graphical-to-symbol mapping it learns, for example, when a sharp symbol # is used in a note or a chord. As a drawback, this forces the transformer decoder to pay more attention to the context, for example, the letter “a” corresponds to very different graphical symbols in the melody spine (a note) and chord spine (part of the “maj” or “aug” chord type).
We develop the last tokeniser to be on a medium level between the previous two. The goal is to have a bijective mapping between graphical symbols and tokens. For the note spine, we encode all pitches as unique symbols (even when they are made of multiple characters, for example, "ee" corresponds to the E one octave higher than "e", the E above middle C), and all other characters as unique symbols, to separate flats, sharps, duration dots, ties, slurs, and beamings. For the chord spine, we separate the root, the type, the extensions (each extension is considered singularly if there are multiple), and the bass if they are specified. All other symbols, like time signature, key signature, clef, spine definition, and measure separators, are considered at the word-level. This generates a vocabulary of size 153.
These tokenisers are developed to work in both directions, i.e., they produce tokens from a **kern file, to train the model, but they also reconstruct the **kern file from a model output for evaluation.
These tokenisers are straightforward to understand and can be implemented in just a few lines of code. This simplicity stems from the use of **kern files, which we use, instead of other formats, such as MusicXML, since they provide a compact and easy-to-handle representation of musical scores. However, the **kern format has its limits; for example, accidentals are explicitly encoded in each note, even if they are defined on the key signature, and this doesn’t align with the graphical representation. For future work, one could have much more control over the tokenisation process by parsing the musical score into some dedicated internal representation, and then producing the tokens, similarly to what MidiTok [14] does with MIDI files. 8 This would also allow us to work with different file formats which encode a wider set of graphical information.
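To make the two simplest strategies concrete, here is a minimal sketch of a word-level and a character-level tokeniser that work in both directions. It is an illustration, not the released implementation: the separator tokens "<t>" and "<b>" are hypothetical names for the tab and line-break symbols, and the medium-level tokeniser is left out because its exact rules are only summarised in the text.

```python
TAB, BREAK = "<t>", "<b>"  # hypothetical separator tokens


def word_tokenise(kern: str) -> list[str]:
    """Every tab- or newline-separated string becomes one token."""
    tokens = []
    for line in kern.split("\n"):
        for i, field in enumerate(line.split("\t")):
            if i > 0:
                tokens.append(TAB)
            if field:
                tokens.append(field)
        tokens.append(BREAK)
    return tokens[:-1]  # no line break after the final line


def word_detokenise(tokens: list[str]) -> str:
    """Rebuild the **kern text from word-level tokens."""
    return "".join({TAB: "\t", BREAK: "\n"}.get(tok, tok) for tok in tokens)


def char_tokenise(kern: str) -> list[str]:
    """Every character (including tabs and newlines) becomes one token."""
    return list(kern)


def char_detokenise(tokens: list[str]) -> str:
    """Rebuild the **kern text from character-level tokens."""
    return "".join(tokens)
```

The same line, for instance "4b" and "E:min6" separated by a tab, gives three word-level tokens but nine character-level tokens, which illustrates the trade-off between sequence length and vocabulary size discussed above.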
4.3 Metrics
As evaluation metrics, we use edit-distance-based metrics, as used in other recent studies [3,5], which count the number of insertions, deletions, or modifications to transform our predicted **kern output into the ground truth. In particular, we consider 3 metrics: character error rate (CER), word error rate (WER) 9, and line error rate (LER). The LER works at the line level, i.e., it counts an error if there is even a single-character difference between two lines. This is the only metric that allows us to evaluate if the chord is aligned to the correct note, but it doesn’t consider “almost-correct” lines, for example, a line where a single duration dot is mispredicted. CER works at the character level, but it has the disadvantage of overly penalising wrong predictions that involve several characters, such as the chord type “maj7”. WER works at the word level, so it has disadvantages similar to those of LER, but less pronounced; for example, it will still consider a note correct, even if the chord next to it is mispredicted. All these metrics have different advantages and disadvantages, so we chose to report all of them. The task of finding a good unique metric that aligns with human perception is still an open problem in the field.
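The three metrics share the same edit-distance core and differ only in the unit being compared. The sketch below is a plain Python illustration; normalising by the length of the ground truth and reporting a percentage are assumptions of this sketch, as is the splitting of words on tabs and newlines.

```python
import re


def edit_distance(pred, truth) -> int:
    """Levenshtein distance between two sequences (insert, delete, replace)."""
    previous = list(range(len(truth) + 1))
    for i, p in enumerate(pred, 1):
        current = [i]
        for j, t in enumerate(truth, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (p != t),  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(pred_units, truth_units) -> float:
    """Edit distance normalised by the ground-truth length, in percent."""
    return 100.0 * edit_distance(pred_units, truth_units) / len(truth_units)


def cer(pred: str, truth: str) -> float:
    return error_rate(list(pred), list(truth))


def wer(pred: str, truth: str) -> float:
    def words(s):
        return [w for w in re.split(r"[\t\n]", s) if w]
    return error_rate(words(pred), words(truth))


def ler(pred: str, truth: str) -> float:
    return error_rate(pred.split("\n"), truth.split("\n"))
```

For a two-line ground truth where the prediction has "E:min7" instead of "E:min6", a single wrong character makes one of four words and one of two lines incorrect, which shows how the same error is weighted differently by CER, WER, and LER.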
8 Note that the tokens they produce are only focusing on the musical content, while we want to also retain graphical information.
9 Note that in the OMR literature WER is referred to as Symbol Error Rate (SER), but we use WER to be consistent with the terms used in the tokenisation section.
5. EXPERIMENTS
With the experiments in this section, we want to answer the following three research questions: Do pretrained SOTA weights on page-level polyphonic piano music [5] improve the performance in our tasks? Which of the three tokenisers is the most effective? Does the inclusion of synthetic data during training help the performance on real data? All combinations of these three factors result in 12 experiments, which we describe in the following.
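The count of 12 follows from crossing two initialisations, two training-data settings, and three tokenisers (2 × 2 × 3). A small sketch of the grid, with hypothetical configuration keys:

```python
from itertools import product

PRETRAINED = (False, True)
TRAIN_DATA = ("handwritten", "handwritten+synthetic")
TOKENISERS = ("word", "medium", "char")


def experiment_grid() -> list[dict]:
    """Enumerate every combination of the three experimental factors."""
    return [
        {"pretrained": pre, "train_data": data, "tokeniser": tok}
        for pre, data, tok in product(PRETRAINED, TRAIN_DATA, TOKENISERS)
    ]


grid = experiment_grid()
```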
5.1 Experimental settings
We run each experiment for 100 epochs with a learning rate of $5 \times 10^{-4}$, with cosine annealing, a warmup phase of 150 steps, weight decay of 0.01, and batch size 64. For each experiment, we use early stopping and compute our test results with the model that minimises the WER metric in the validation split. The ConvNeXt encoder has three layers with kernel sizes of [3, 3, 9] and output channel sizes of [64, 128, 256]. The language model block, i.e., the decoder, is composed of 8 layers, each of which uses 4 attention heads and a hidden dimension of 256.
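The learning-rate schedule can be written as a small function of the training step. The peak rate and the 150 warmup steps come from the text; the linear shape of the warmup, the decay to zero, and the `total_steps` argument are assumptions of this sketch, since the paper does not specify them.

```python
import math


def learning_rate(step: int, total_steps: int,
                  base_lr: float = 5e-4, warmup_steps: int = 150) -> float:
    """Linear warmup (assumed) followed by cosine annealing to zero (assumed)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

For instance, with a hypothetical budget of 1000 steps, the rate rises from 0 to $5 \times 10^{-4}$ over the first 150 steps and then follows half a cosine period down to 0.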
5.2 Results
The results of our experiments are summarised in Table 1. Initialising our model with the pretrained checkpoints before training brings clear advantages for every metric, tokeniser, and data setting. Without pretraining, our model was not able to properly train with handwritten data only, and stayed at a WER level >40. We found that this corresponds to a model that has learned the basic **kern syntax (i.e., it places the notes and chords properly in the left and right spine, respectively), but doesn’t seem to consider the input image at all. Adding synthetic data helped to ease this problem, but only for the medium and character-level tokenisers, and the performance was still much worse than with pretraining.
Regarding the choice of the tokenisers, the medium-level one yields the best results across all metrics for the pretrained models. Similarly, using synthetic data is also beneficial. The results are less clear for the non-pretrained models, but as said before, these models did not properly train, so we should not trust a comparison between their outputs, since we may as well be analysing just some random noise. According to all three metrics, our best model is the model trained from the pretrained checkpoint, with handwritten and synthetic data and using the medium-level tokeniser.
We want to emphasise that while our results are clear and consistent across our experiments, they are only valid for our dataset and model choices. We can speculate that as the model and dataset get bigger, the advantage of using a well-balanced tokenisation procedure would gradually diminish, in favour of flexible tokenisation like the character-level one, which can handle all kinds of **kern scores; or even be able to directly produce very verbose file formats like MusicXML. However, if one is interested in efficient systems or needs to work in a low-data scenario, it is worth exploring these music-informed directions.
Training setting                      Tokeniser  ↓WER   ↓CER   ↓LER
handwritten                           word       40.71  52.93  75.01
handwritten                           medium     44.15  55.51  79.40
handwritten                           char       40.96  52.52  79.54
handwritten + synthetic               word       41.38  53.67  73.55
handwritten + synthetic               medium     35.15  45.70  71.60
handwritten + synthetic               char       27.59  34.78  57.84
pretrained + handwritten              word       15.57  19.43  39.94
pretrained + handwritten              medium     12.42  14.35  31.28
pretrained + handwritten              char       15.91  18.40  39.49
pretrained + handwritten + synthetic  word       12.55  16.37  32.99
pretrained + handwritten + synthetic  medium     11.90  13.67  29.68
pretrained + handwritten + synthetic  char       12.86  15.32  32.71
Table 1. Results of our experiments with different tokenisers (at word, medium, and character level), the addition of synthetic data, and the usage of a pretrained model.
[Figure 3 panels: GT (ground-truth **kern), INPUT IMAGE, PRED (predicted **kern), PRED RENDERED]
Figure 3. Prediction of an excerpt of Embraceable You by George Gershwin (bars 10 to 14) with our best model. In red, errors associated with transcription or missing symbols. Blue boxes depict semantic errors.
5.3 Qualitative analysis
In Figure 3, we show an example of the transcription performance of our best model (pretrained, with synthetic data and medium-level tokenisation).
Looking at the red boxes, which represent both missing and wrongly transcribed symbols, it is worth highlighting that most of the errors we encounter are related to chord symbols. This behaviour underlines the difficulty of transcribing both sources (melody and chords) simultaneously. We can speculate that this derives from the higher frequency of notes compared to chords in a jazz lead sheet, so the model has much less chance to learn a correct behaviour for chords during training. Moreover, the use of the model pretrained on the polyphonic grand piano dataset biases the framework towards transcribing the melody, as no chords appear in that dataset. On the positive side, the chord-note alignment seems to work pretty well, even in the case of the strike-through in measure 2.
Finally, we highlight in blue the semantic errors. These are common problems in OMR transcriptions, where some symbols are transcribed correctly if we only consider their local graphical features, such as the note “1f” at the start of the third bar. Graphically, it does not contain the sharp symbol (#); however, given the key of the piece (*k[f#]), the model misses this implicit accidental.
6. CONCLUSION
With this paper, we took an initial yet significant step towards the development of OMR systems for handwritten jazz lead sheets that could benefit musicians and MIR researchers. We collected a dataset of 293 handwritten lead sheets produced by music professionals and students of different levels, addressed its quality problems, aligned it with ground truth digitised scores in MusicXML and **kern formats, and organised it in a suitable format for training end-to-end page-level and region-level systems. We also supplied synthetic scores generated from the ground truth and a data split to facilitate the usage of this dataset for benchmarking. We released the first OMR system dedicated to handwritten jazz lead sheets, based on an encoder-decoder architecture, which is used for polyphonic piano OMR. We proposed multiple tokenisation strategies and analysed their effect, showing that the most effective option (at least for our model and dataset size) is to have a unique mapping between graphical symbols and tokens, without multiple symbols being grouped in a single token, or a single symbol being encoded by multiple tokens. We also showed that the usage of synthetic data is beneficial, and that pretraining our network with the different but related task of polyphonic piano transcription is fundamental to enable effective training on our dataset.
Since the performance of deep-learning systems highly correlates with the size of the training data, future work on the dataset part would include expanding the collection of handwritten scores and including new digitised scores to produce more synthetic data. Since we are also distributing the SVG files of the synthetic scores, multiple data augmentation procedures can be performed, such as changing the colour and the width of lines, and applying different backgrounds and image transformations to simulate real paper in different lighting conditions. On the model side, it is reasonable that employing more recent transformer components (e.g., rotary positional embedding) could make our model more performant and less data-hungry. The usage of a vision transformer instead of the ConvNeXt encoder would also be a natural next step in the scaling of our network. Finally, as we found it highly beneficial to use a pretrained checkpoint on another OMR task to kickstart our training, we believe that incorporating a wide range of OMR tasks in the pretraining phase could further enhance the results. This also motivates the development of a general OMR model capable of functioning across various music domains.
7. ACKNOWLEDGMENTS
We thank ‘Centro Superior de Música del País Vasco, Musikene’ (Itziar Larrinaga), ‘Casa Sofía de El Altet, Alicante’ (Pedro J. Ponce de León), ‘Conservatorio Superior de Música Joaquín Rodrigo de Valencia’ (Jorge Sevilla), ‘Sedajazz, Valencia’ (Francisco A. Blanco Latino), ‘Conservatorio Superior de Música Óscar Esplá de Alicante’ (Manuel Mas), and ‘AMCE Santa Cecilia’ for coordinating their students’ meticulous work in transcribing jazz lead sheets. Patricia Garcia-Iasci holds a research contract with the University of Salamanca funded by the Regional Government of Castilla y León (Order EDU/1009/2024 of October 10) and co-financed by the European Social Fund Plus (FSE+) budget code (18.218F 463AB05). This paper is supported by grant CISEJI/2023/9 from “Programa para el apoyo a personas investigadoras con talento (Plan GenT) de la Generalitat Valenciana”, the European Research Council (ERC) under the EU’s Horizon 2020 research & innovation programme, grant agreement No. 101019375 (Whither Music?), and the Federal State of Upper Austria (LIT AI Lab).
8. REFERENCES
[1] D. Foster and S. Dixon, “Filosax: A dataset of annotated jazz saxophone recordings,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7-12, 2021, J. H. Lee, A. Lerch, Z. Duan, J. Nam, P. Rao, P. van Kranenburg, and A. Srinivasamurthy, Eds., 2021, pp. 205–212.
[2] J. Calvo-Zaragoza, J. Hajič Jr., and A. Pacha, “Understanding Optical Music Recognition,” ACM Comput. Surv., vol. 53, no. 4, pp. 77:1–77:35, 2021.
[3] J. C. Martinez-Sevilla, A. Rios-Vila, F. J. Castellanos, and J. Calvo-Zaragoza, “A holistic approach for aligned music and lyrics transcription,” in International Conference on Document Analysis and Recognition (ICDAR), 2023.
[4] A. Ríos-Vila, D. Rizo, J. M. Iñesta, and J. Calvo-Zaragoza, “End-to-end optical music recognition for pianoform sheet music,” International Journal on Document Analysis and Recognition (IJDAR), vol. 26, no. 3, pp. 347–362, 2023.
[5] A. Ríos-Vila, J. Calvo-Zaragoza, and T. Paquet, “Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription,” in International Conference on Document Analysis and Recognition (ICDAR), 2024.
[6] D. Huron, “Humdrum and Kern: Selective feature encoding,” in Beyond MIDI: The handbook of musical codes. Cambridge, MA, USA: MIT Press, Jan. 1997, pp. 375–401.
[7] C. Arthur and N. Condit-Schultz, “The coordinated corpus of popular musics (CoCoPops): A meta-corpus of melodic and harmonic transcriptions,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2023.
[8] C. Harte, M. B. Sandler, S. A. Abdallah, and E. Gómez, “Symbolic representation of musical chords: A proposed syntax for text annotations,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2005.
[9] J. de Berardinis, A. Meroño-Peñuela, A. Poltronieri, and V. Presutti, “ChoCo: a chord corpus and a data transformation workflow for musical harmony knowledge graphs,” Scientific Data, vol. 10, no. 1, p. 641, 2023.
[10] G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLOv8,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[11] V. Dvořák, J. Hajič Jr., and J. Mayer, “Staff layout analysis using the YOLO platform,” in 6th International Workshop on Reading Music Systems, 2024, p. 18.
[12] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the 2020s,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[14] N. Fradet, J.-P. Briot, F. Chhel, A. El Fallah Seghrouchni, and N. Gutowski, “MidiTok: A Python package for MIDI file tokenization,” in Extended Abstracts for the Late-Breaking Demo Session of the International Society for Music Information Retrieval Conference, 2021.