HUMAN VS. MACHINE: COMPARING SELECTION STRATEGIES IN
ACTIVE LEARNING FOR OPTICAL MUSIC RECOGNITION
Juan P. Martinez-Esteso¹, Alejandro Galan-Cuenca¹, Carlos Pérez-Sancho¹,
Francisco J. Castellanos¹, Antonio Javier Gallego¹
¹ University Institute for Computing Research, University of Alicante, Spain
{juan.martinez11, a.galan}@ua.es, {cperez, fcastellanos, jgallego}@dlsi.ua.es
ABSTRACT
Optical Music Recognition (OMR) systems rely on accurate layout analysis (LA) to segment different information layers in music score images. While deep learning approaches have improved performance, they remain heavily dependent on large amounts of annotated data. In this work, we propose the integration of a Few-Shot Learning (FSL) architecture into an active learning framework for LA. This enables interactive and iterative training, allowing the model to progressively improve from minimal annotated data. We evaluate how this approach enhances recognition accuracy and reduces annotation effort, and we study the impact of different sample selection criteria within this framework, comparing data selected by five expert annotators against four automated strategies: random, sequential, ink density-based, and entropy-based. Experiments across three diverse music score datasets show that entropy-based selection consistently outperforms human choices, achieving an F1-score of 81.1% with only 8 labeled patches, while humans required at least 16 to reach similar performance. Our method improves over existing FSL approaches by up to 21.6% and substantially reduces annotation time. These results suggest that automated strategies can offer more efficient alternatives to human selection in OMR annotation workflows.
1. INTRODUCTION
Optical Music Recognition (OMR) aims to automatically transcribe music score images into digital formats, enabling access to vast collections of musical heritage for analysis, search, and retrieval [1]. A crucial step in this pipeline is Layout Analysis (LA), which segments score images into layers such as staff lines, symbols, lyrics, and background [2]. The diversity of historical and modern manuscripts, with their varying notational styles and print techniques, adds substantial complexity to this task.
© J.P. Martinez-Esteso, A. Galan-Cuenca, C. Pérez-Sancho, F.J. Castellanos, and A.J. Gallego. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J.P. Martinez-Esteso, A. Galan-Cuenca, C. Pérez-Sancho, F.J. Castellanos, and A.J. Gallego, “Human vs. Machine: Comparing Selection Strategies in Active Learning for Optical Music Recognition”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Modern approaches to LA rely heavily on deep learning [3], which requires large annotated datasets for training. However, manually labeling music score images is time-consuming, labor-intensive, and demands domain expertise to ensure accuracy [4]. To alleviate this burden, several strategies have been explored. Domain adaptation techniques [5] have shown promising results in specific scenarios but often suffer from unstable training and performance. Synthetic data generation and data augmentation [6, 7] can increase training data variability but struggle when labeled data is scarce or when the target domain significantly differs in appearance. Few-Shot Learning (FSL) approaches [8, 9] have also been proposed, achieving promising results by maximizing model performance from minimal labeled data.
Recent works have adapted FSL to LA for text documents [10, 11] and music [12], with the latter reducing annotation to small image patches rather than full pages. By combining masking and oversampling, this proposal enables models to generalize from minimal annotated data. However, it relies on a fixed, sequentially selected training set, leaving open whether performance could be further improved through more informed sample selection and interactive training.
Active learning [13, 14] addresses this by iteratively choosing the most informative samples. Instead of relying on predefined or random choices, sample selection is guided by a human expert or an automated strategy, aiming to maximize model improvement. The process typically begins with a single labeled sample and proceeds through cycles of training, evaluation, and selection. In the context of OMR, however, the comparative effectiveness of human versus automated selection remains unexplored [15, 16].
In this work, we propose the integration of an FSL architecture into an active learning framework for LA in OMR, enabling interactive, iterative training. We assess its impact on accuracy and annotation effort compared to the baseline FSL approach, focusing on the comparison of human decision-making against automated selection strategies. We specifically assess whether human annotators (five domain experts in our case) are capable of selecting the most informative image patches for training in comparison with four automated strategies.
Experiments were conducted on three music score datasets with distinct characteristics, including mensural and neumatic notations. Performance was evaluated across four key layers: musical symbols, lyrics, staff lines, and background. In addition to accuracy, annotation time was measured using a custom-developed pixel-level labeling interface to assess the practical efficiency of each strategy.
In summary, this work makes four main contributions: (i) the adaptation of an FSL architecture to an active learning setting; (ii) a detailed analysis of human versus automated selection strategies; (iii) a study on annotation efficiency using a purpose-built labeling tool; and (iv) a new state-of-the-art result in FSL for LA.
2. METHODOLOGY
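Figure 1: General scheme of the active learning process. (Diagram elements: unlabeled dataset U; patch selection of P_x from image I_i; manual annotation by the oracle, yielding (P_x, P_y); addition to the training dataset D; training process of the LA model (FSAE); stop check "metric ≥ σ" with Yes leading to Finish and No returning to patch selection.)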
Figure 1 provides a general overview of the active learning workflow. The process begins with a pool of unlabeled images, denoted as U. In each iteration, a patch is selected, annotated, and used to retrain the model, progressively improving its performance. This cycle is repeated until a predefined stop criterion is met. Specifically, the following steps are performed:
1. Patch selection: A strategy determines which image I_i from the set U and which patch P_x within the image should be annotated. The effectiveness of this selection is the focus of this study, where different criteria are compared, as detailed in later sections.
2. Annotation: A human expert, often referred to as the oracle in related literature, performs pixel-wise annotations for each information layer of the selected patch, yielding the labeled pair (P_x, P_y).
3. Model update: The new annotated patch is added to the training set D and used to retrain and update the weights of the LA model.
4. Performance evaluation: The model's accuracy is assessed against a predefined performance threshold σ. If it meets or exceeds σ, or a maximum number of iterations is reached, the process stops; otherwise, the selection-annotation-training cycle repeats.
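The four steps above can be sketched as a generic loop. This is only an illustrative outline, not the authors' implementation: the function name `active_learning_loop` and the callables `select_patch`, `annotate`, `train`, and `evaluate` are hypothetical placeholders for the selection strategy, the oracle, the LA model training, and the metric computation.

```python
def active_learning_loop(pool, select_patch, annotate, train, evaluate,
                         sigma=1.0, max_iterations=32):
    """Run select -> annotate -> train -> evaluate until the score reaches
    sigma, the iteration budget is spent, or the pool is exhausted."""
    pool = list(pool)          # unlabeled candidates (U)
    training_set = []          # labeled pairs (D)
    scores = []
    model = None
    while pool and len(scores) < max_iterations:
        patch = select_patch(pool, model)                 # step 1: patch selection
        pool.remove(patch)
        training_set.append((patch, annotate(patch)))     # step 2: oracle annotation
        model = train(training_set)                       # step 3: model update
        score = evaluate(model)                           # step 4: evaluation
        scores.append(score)
        if score >= sigma:                                # stop criterion
            break
    return model, training_set, scores
```

Passing the strategy as a callable keeps the loop identical for human-driven and automated selection; only `select_patch` changes.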
The following sections detail the different selection strategies analyzed in this study, the model architecture, and the training procedure.
2.1 Selection Criteria
Within this active learning framework, the objective of this study is to assess the impact of the patch selection strategy on model performance and to compare the decision-making ability of human annotators against automated selection methods. To this end, we evaluate a human-driven approach, in which annotator H_i, based on their expert intuition, manually selects the patch they believe will most increase the diversity of dataset D and yield the greatest improvement, alongside four automated selection strategies:
• Random selection: Patches are randomly selected from the dataset, serving as a comparison baseline.
• Sequential selection: Patches are selected in a fixed sequential order by traversing each image from left to right and top to bottom, ensuring an even distribution across the dataset. This criterion was originally proposed for FSL [12], and it is included as an additional baseline for comparison.
• Ink density-based selection: Patches are chosen based on the amount of ink in the target layer, prioritizing those with higher quantity.¹
• Entropy-based selection: Patches are chosen based on their predicted uncertainty, using entropy as a measure of informativeness. This value is calculated using the method described in [17] for each candidate patch.
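The two score-based criteria can be illustrated with a short NumPy sketch. The text does not spell out the exact formulas, so the following are assumptions: ink density is taken as the fraction of foreground pixels in a binary layer mask, and entropy as the mean per-pixel binary Shannon entropy of the model's predicted probability map. The function names are hypothetical.

```python
import numpy as np

def ink_density(layer_mask):
    """Fraction of pixels marked as ink in a binary mask of the target layer."""
    return float(np.mean(np.asarray(layer_mask) > 0))

def mean_binary_entropy(prob_map, eps=1e-12):
    """Mean per-pixel Shannon entropy (in bits) of predicted probabilities.
    Assumed uncertainty measure: 0 for confident pixels, 1 for p = 0.5."""
    p = np.clip(np.asarray(prob_map, dtype=np.float64), eps, 1.0 - eps)
    h = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return float(h.mean())

def rank_patches(scores):
    """Indices of candidate patches sorted from highest to lowest score."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64), kind="stable")
    return [int(i) for i in order]
```

Under these assumptions, both strategies reduce to scoring every candidate patch and taking the top-ranked one; they differ only in whether the score comes from the ground truth (ink density) or from the current model (entropy).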
Additionally, a threshold range [λ1, λ2] is defined for all the automated selection criteria, ensuring that selected patches contain a minimum ink density (λ1) of the target layer while avoiding patches that are fully saturated with ink (λ2), which might lack discernible structures. This check emulates the decision of a human annotator, who would reject system-proposed patches that either lack relevant information or are entirely filled with ink.
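A minimal sketch of this threshold check follows. The default values 0.05 and 0.95 correspond to the λ1 = 5% and λ2 = 95% reported in Section 3.3; treating both bounds as inclusive and the helper names `within_ink_range` and `first_valid_patch` are assumptions made for illustration.

```python
import numpy as np

def within_ink_range(layer_mask, lambda1=0.05, lambda2=0.95):
    """True if the ink density of the target layer lies in [lambda1, lambda2]."""
    density = float(np.mean(np.asarray(layer_mask) > 0))
    return lambda1 <= density <= lambda2

def first_valid_patch(ordered_masks, lambda1=0.05, lambda2=0.95):
    """Index of the first patch, in the strategy's own order, that passes the
    threshold check; None if every candidate is rejected."""
    for idx, mask in enumerate(ordered_masks):
        if within_ink_range(mask, lambda1, lambda2):
            return idx
    return None
```

Because the check is applied per target layer, the same ordering of candidates can yield a different accepted patch for each layer.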
It is important to note that while Figure 1 illustrates the selection of a single patch across all layers for simplicity, different patches may be chosen for each layer depending on the selection strategy. This applies to all selection criteria. For example, a human annotator may choose different patches for each layer based on which they consider most informative; in the ink density-based strategy, patches are ranked and selected according to the ink density of the target layer; and in the sequential and entropy-based strategies, although the patch order is predefined, only those within the threshold range [λ1, λ2] for the target layer are selected, which also results in different patches being chosen for each layer.
2.2 LA model
To maximize learning efficiency with minimal annotations, we adopt the Few-shot Selectional Auto-encoder (FSAE) approach proposed by [12]. This FSL architecture, specifically designed for LA, trains a specialized model for each information layer and then combines their predictions at the pixel level using a maximum a posteriori approach weighted by the confidence of each prediction. This makes FSAE particularly well-suited for the proposed active learning framework, as it allows selective annotation of each layer, optimizing the labeling effort.

¹ This strategy simulates human behavior, where annotators tend to focus on regions with more information (i.e., ink) in the layer. To mimic this behavior, ink density is calculated using the ground truth.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
To enhance learning and generalization with limited data, FSAE integrates two key techniques: a masking layer and an oversampling strategy. Each model is trained on the labeled patch set D_l, where l refers to the annotated patches of a specific layer. The oversampling strategy extracts multiple random samples around each labeled patch to enrich the training data, while the masking layer prevents unannotated regions outside the labeled areas from influencing the learning process.
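One possible reading of these two techniques is sketched below with NumPy. It is not the FSAE implementation from [12]: the window-sampling rule (any window overlapping the annotated box by at least one pixel), the use of −1 to mark ignored input pixels, and the masked binary cross-entropy are assumptions, and the names `sample_window` and `masked_bce` are hypothetical.

```python
import numpy as np

def sample_window(image, labels, box, size, rng):
    """Oversampling: draw a random size x size window overlapping the annotated
    box (top, left, height, width). Pixels outside the box are set to -1."""
    top, left, h, w = box
    H, W = image.shape
    y0 = int(rng.integers(max(0, top - size + 1), min(H - size, top + h - 1) + 1))
    x0 = int(rng.integers(max(0, left - size + 1), min(W - size, left + w - 1) + 1))
    valid = np.zeros((size, size), dtype=bool)
    target = np.zeros((size, size), dtype=np.float64)
    # Intersection of the window with the annotated box, in image coordinates.
    iy0, iy1 = max(y0, top), min(y0 + size, top + h)
    ix0, ix1 = max(x0, left), min(x0 + size, left + w)
    valid[iy0 - y0:iy1 - y0, ix0 - x0:ix1 - x0] = True
    target[iy0 - y0:iy1 - y0, ix0 - x0:ix1 - x0] = \
        labels[iy0 - top:iy1 - top, ix0 - left:ix1 - left]
    crop = np.where(valid, image[y0:y0 + size, x0:x0 + size], -1.0)
    return crop, target, valid

def masked_bce(pred, target, valid, eps=1e-7):
    """Binary cross-entropy averaged over annotated (valid) pixels only."""
    if not valid.any():
        return 0.0
    p = np.clip(pred[valid], eps, 1.0 - eps)
    t = target[valid]
    return float(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p)).mean())
```

Drawing many such windows per labeled patch produces varied crops of the same annotation, while the validity mask keeps unlabeled pixels out of the loss.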
Originally, this method was designed to learn from a fixed (and limited) dataset. We propose its integration into an active learning framework, enabling iterative training as described in Figure 1. Within this framework, two strategies are introduced to improve model training:
• Validation set for overfitting prevention: To mitigate overfitting with few samples, a validation set is also selected with one labeled sample. For this, in the first iteration, the annotator selects two samples: one for training and one for validation. In later iterations, the validation sample is added to D, and the new selected sample becomes the validation set. Experimental results indicate that this strategy enhances training robustness by enabling early stopping based on validation performance.
• Iterative model selection: From the second iteration onward, the performance of the current model is compared against the previous iteration. The best-performing model is retained as the baseline for subsequent training cycles. Empirical evaluations demonstrate that this strategy improves training stability and final performance.
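The bookkeeping behind these two strategies can be summarized in a few lines. This is a sketch under assumptions: samples and models are treated as opaque objects, ties between iterations keep the earlier model, and the helper names are hypothetical.

```python
def rotate_validation(training_set, validation_sample, new_sample):
    """The previous validation sample joins the training set D and the newly
    selected sample becomes the validation set."""
    return training_set + [validation_sample], new_sample

def keep_best(previous, candidate):
    """Retain the better of two (model, score) pairs; on a tie the earlier
    model is kept (an assumption, the text does not specify tie handling)."""
    if previous is None or candidate[1] > previous[1]:
        return candidate
    return previous
```

For instance, starting from a training sample `a` and a validation sample `b`, selecting `c` in the next iteration gives the training set `[a, b]` with `c` held out for validation.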
3. EXPERIMENTAL SETUP
This section presents the experimental setup used in this study, including the datasets used for evaluation, the metrics employed to assess performance, and the implementation details of the model.
3.1 Corpora
For the experiments, we considered the following 3 datasets with manual pixel-wise annotations for 4 layers of information (staff, notes, text, and background). Table 1 includes a summary of their details, while Figure 2 shows examples of regions from the original images to better visualize their particularities.
• EINSIEDELN: 9 high-resolution scanned pages of neumatic notation belonging to the Einsiedeln, Stiftsbibliothek, Codex 611(89), from 1314.²
• SALZINNES: 10 high-resolution images of pages from the Salzinnes Antiphonal manuscript (CDM-Hsmu M2149.14), in neumatic notation. It is available in the Cantus Ultimus platform.³
• CAPITAN [18]: Set of 10 double-page images from music manuscripts of the 17th and 18th centuries, originally from the Cathedral of Our Lady of the Pillar in Zaragoza, Spain, using mensural notation.⁴

² http://www.e-codices.unifr.ch/en/sbe/0611/
³ https://cantus.simssa.ca/manuscript/123723/
In all cases, we used 6 images for training and validation (from which the patches are selected), and the remaining ones for testing. Based on the original configuration of the LA model employed [12], we used a patch size of 256 × 256 pixels to extract samples from these images.
                                       Layers (%)
Corpus       # imgs   Resol.          Bg     St    No    Te
EINSIEDELN   9        6496 × 4872     87.9   3.5   2.7   5.9
SALZINNES    10       5847 × 3818     87.6   2.4   2.5   7.5
CAPITAN      10       2126 × 3065     85.7   6.6   5.1   2.6

Table 1: Details of the corpora considered, including the number of images (# imgs), the average resolution, and the proportion of pixels of each layer of interest, with Bg for background, St for staff lines, No for notes, and Te for text.
Figure 2: Examples of regions extracted from the original images in the corpora described in Table 1: (a) EINSIEDELN, (b) SALZINNES, (c) CAPITAN.
3.2 Metrics
To evaluate the performance, we resort to the F-score ($F_1$) as the evaluation metric, ensuring a balanced assessment despite class imbalances in the datasets (see Table 1). In a binary classification setting, $F_1$ is defined as:

$$F_1 = \frac{2 \cdot \mathrm{TP}}{2 \cdot \mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \tag{1}$$

where TP, FP, and FN represent True Positives, False Positives, and False Negatives, respectively.
Since our task involves multiple layers rather than a simple binary classification, we employed the macro-averaged F-score ($F_1^m$). This metric computes the $F_1$ separately for each layer and averages the results, ensuring equal weighting regardless of class distribution.
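As an illustration of Eq. (1) and its macro average, the following NumPy sketch computes both from per-pixel label maps. The function names are hypothetical, and returning 1.0 when a layer is absent from both prediction and ground truth is an assumption rather than something stated in the text.

```python
import numpy as np

def f1_binary(pred, truth):
    """F1 = 2*TP / (2*TP + FP + FN) for two boolean masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    denom = 2 * tp + fp + fn
    # Assumption: a layer absent from both masks counts as a perfect match.
    return 2 * tp / denom if denom else 1.0

def macro_f1(pred_labels, true_labels, layers):
    """Unweighted mean of the per-layer F1 scores."""
    pred_labels = np.asarray(pred_labels)
    true_labels = np.asarray(true_labels)
    return float(np.mean([f1_binary(pred_labels == l, true_labels == l)
                          for l in layers]))
```

Because each layer contributes equally to the mean, a dominant background class (over 85% of the pixels in Table 1) cannot mask poor performance on the sparse layers.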
3.3 Implementation details
The LA model used follows the FSAE architecture proposed by [12], utilizing a separate Selectional Auto-Encoder (SAE) [19] for each of the four layers, with each model specialized in a specific layer. A masking layer is included at the input to ignore unannotated regions, and an oversampling process is applied during training to increase patch variability. After preliminary experiments, we decided to extract 1,024 random samples per epoch around the annotated areas for each iteration. Input images are normalized to [0, 1], with the masking layer allowing values in {−1} ∪ [0, 1], where −1 marks ignored pixels.

⁴ RISM Code “E-Zac” is accessible at https://rism.info.
Each network is trained using binary cross-entropy loss for up to 200 epochs, with early stopping after 20 epochs of no improvement on the validation set. The stochastic gradient descent optimizer [20] is used with a learning rate of 0.01 and a batch size of 32. Standard data augmentation techniques, including random rotations (±45°), zoom (0.8x–1.2x), and horizontal/vertical flips, are applied to increase data diversity.
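The epoch budget and early-stopping rule can be sketched independently of any deep learning framework. In this outline, `train_one_epoch` and `validate` are hypothetical callables, and the monitored quantity is assumed to be a validation score where higher is better; the text does not state whether the loss or a score is monitored.

```python
def train_with_early_stopping(train_one_epoch, validate,
                              max_epochs=200, patience=20):
    """Train for at most max_epochs, stopping once `patience` consecutive
    epochs pass without improving the validation score."""
    best_score = float("-inf")
    best_epoch = 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        score = validate()
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score, epoch
```

The function returns the epoch of the best score together with the epoch at which training actually stopped, which makes it easy to restore the best weights afterwards.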
Patches of size 256 × 256 pixels are extracted using thresholds λ1 = 5% and λ2 = 95%. The stopping criterion σ is set to $F_1^m$ = 100%, so the training will only stop if perfect accuracy is achieved. However, a maximum of 32 iterations is imposed, as this is also the maximum considered in the original FSL approach [12].
4. RESULTS
This section presents the results of this study, beginning with an analysis of the annotation time, followed by a comparison of selection strategies, and concluding with a qualitative examination of the selected patches.
4.1 Annotation Time
To assess the time required for the annotation process, we conducted a study with 5 annotators with expertise in the task. Each annotator labeled 15 patches (5 per dataset) at the pixel level for the 4 layers considered, recording the time taken for each layer individually. For this purpose, we developed a specialized annotation tool, publicly available for the community (https://github.com/cperezs/pixel-level-annotator). While pixel-level annotation is often done using general-purpose graphic editors like GIMP, Photoshop, or Pixelmator, or proprietary tools developed by institutions [4], these are either too generic or not openly accessible. Some document-specific tools, like PixLabeler [21], are available, but they are too simplistic, limited to direct pixel filling, and lack features that facilitate annotation based on color similarity or automatic region completion.
Our tool is designed specifically for this task (see Figure 3) and includes a wide range of features to make annotation easier and faster. Users can define custom layers, use an adjustable brush tool, apply threshold selection, and auto-fill for completing layers. Additionally, it offers a quick-review mode to identify unannotated pixels, an option to prevent overwriting existing annotations, and intuitive keyboard shortcuts for switching between tools, layers, and settings such as brush size and threshold.
Figure 3: Developed tool for pixel-level annotation, showing a zoomed-in area of the image.

Table 2 presents the results of this study, reporting the average annotation time per layer and the total average annotation time per patch, along with the standard deviation, spent by each annotator. As shown, staff and notes take the longest to annotate due to overlapping regions requiring boundary marking. These are followed by the text layer, which does not usually overlap, and the background, which is easily auto-filled with minor edge adjustments. The average annotation time per patch is 8:18±2:06.
                         Time per layer
Ann.   Staff       Notes       Text        Bg.         Total
H1     2:33±1:02   2:18±1:42   1:34±1:03   1:10±1:05   7:36±3:14
H2     3:09±2:15   1:43±1:02   1:10±0:42   0:19±0:14   6:21±2:47
H3     2:28±2:22   2:26±0:45   1:26±1:02   1:05±0:43   7:26±3:43
H4     3:37±1:51   2:34±2:01   3:00±1:48   1:43±1:00   10:55±3:57
H5     2:49±2:06   3:21±0:53   1:57±0:58   1:06±1:24   9:13±3:58
Avg.   2:55±1:08   2:28±0:07   1:49±0:30   1:05±0:30   8:18±2:06

Table 2: Average annotation time (mm:ss) per patch for each annotator H_n. The table shows the mean time to manually annotate 15 patches for each individual layer, with the total average time per patch in the last column and the overall average across all annotators in the last row.
Based on these times, annotating the maximum of 32 patches considered in this study would take about 4.5 hours. However, annotating entire pages from one of the datasets, given their high resolution (see Table 1), could take up to a month of work for the CAPITAN dataset, or two to three months for SALZINNES and EINSIEDELN.
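These estimates follow from simple arithmetic on the average of 8:18 per patch. The sketch below reproduces the 32-patch figure and gives a rough per-page estimate; it assumes that a page is tiled into non-overlapping 256 × 256 patches and that every patch takes the average time, which the text does not state explicitly. The function names are hypothetical.

```python
PATCH_SIDE = 256
SECONDS_PER_PATCH = 8 * 60 + 18   # average of 8:18 per patch (Table 2)

def hms(seconds):
    """Format a duration in seconds as h:mm:ss."""
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"

def labeling_time(n_patches):
    """Estimated time to label n_patches at the average rate."""
    return hms(n_patches * SECONDS_PER_PATCH)

def page_hours(width, height):
    """Rough hours to label one full page, assuming non-overlapping tiling."""
    n_patches = (width * height) / (PATCH_SIDE * PATCH_SIDE)
    return n_patches * SECONDS_PER_PATCH / 3600
```

With these assumptions, `labeling_time(32)` gives 4:25:36, and a single CAPITAN page at the average resolution in Table 1 comes to between 13 and 14 hours, so the labeling effort for whole pages grows to weeks or months per dataset.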
As a reference, we compared the agreement between the annotations performed by the five annotators, resulting in an average agreement of 95.7%±3.2% across all layers. This is quite high, and the differences observed are limited to slight discrepancies along the edges of each layer.
4.2 Human vs. Automated Sample Selection
This section presents the results of the study comparing the human selection criterion with the automatic methods, following the active learning framework and experimental setup described in Sections 2 and 3.
Figure 4a shows the results obtained by human annotators, who sequentially selected patches from 1 to 32, retraining the LA model after each selection. Each curve represents the average performance of one expert across the 4 layers and 3 datasets, along with the overall mean (dashed line). The results show notable variability: 2 annotators achieved the best results, 2 intermediate, and 1 performed worse. The gap between the best and worst performance reaches up to 17%, with an average difference of 13±2% across all iterations. These differences are especially relevant considering that, as discussed in the next section, the selected samples seem quite similar across annotators.
Figure 4b compares the average results of human annotators (dashed line) with those from automatic selection methods, as well as the results from the original FSL approach (FSAE), which used sequential selection without incremental training. The automatic method based on entropy achieves the best results, followed by random selection, which even outperforms human experts. The ink-level-based method performs worse initially but improves after 15 labeled samples. The methods with the worst performance are those based on sequential selection, including the original FSL method. In contrast, a simple random selection criterion performs much better. Notably, the gap between the best (entropy) and worst (sequential) methods reaches up to 22%, with an average difference of 18±3% across all iterations.
This reinforces the importance of the order in which samples are chosen. Methods such as the sequential criterion fail to capture representative samples from the entire document. The original FSL method (based on this criterion) yields even worse results, likely due to the lack of iterative training, which hinders progressive model refinement. The ink-level-based selection is also not an adequate criterion. Human annotators likely use a similar approach, prioritizing patches with more representation of a layer's information. In contrast, methods like entropy and random selection aim to select samples with more variability of information. Among these, entropy performs the best by targeting the most informative samples while maintaining a balanced representation of the layer's information.
Table 3 summarizes the $F_1^m$ scores for 1, 2, 4, 8, 16, and 32 labeled samples, along with the time required for labeling. The entropy-based method generally yields the best results, though random selection outperforms it for 2 and 4 samples. From 8 samples onward, entropy consistently outperforms all methods. According to these results, with just 8 labeled samples (or 16 for further improvements), a competitive model can be achieved, with up to a 4.3% $F_1^m$ improvement over human selection and up to a 21.6% improvement over the current FSL state of the art [12].
4.3 Qualitative analysis
This section presents a qualitative comparison of the patches selected by human annotators and those chosen by the automatic methods. Figure 5 illustrates an example for the “Text” layer of the EINSIEDELN dataset, highlighting with different colors the patches that coincide between methods. Complete examples of the selected patches across all layers and datasets are available as supplementary material at Zenodo: https://doi.org/10.5281/zenodo.15735893.

                  $F_1^m$ per number of labeled samples
Method       1         2         4         8         16        32
Humans       57.0      63.4      69.1      76.8      82.0      85.3
Random       57.4      65.2      70.7      78.1      82.7      84.6
Ink-level    58.1      64.0      68.0      71.7      82.9      86.0
Sequent.     55.5      57.1      57.2      62.4      71.1      72.0
Entropy      59.5      63.4      68.9      81.1      84.4      86.4
FSAE [12]    53.7      49.3      57.4      59.7      67.7      68.4
Tmp.         0:08:18   0:16:36   0:33:12   1:06:24   2:12:48   4:25:36

Table 3: Results of $F_1^m$ for the different selection methods compared, based on the number of labeled samples (1, 2, 4, 8, 16, and 32). The “Tmp.” row indicates the estimated time (hh:mm:ss) required for labeling based on Table 2.
In general, all the selected patches seem appropriate, as they contain text, which does not explain the worse performance of some criteria. A closer look reveals that the ink-level-based method tends to select decorative letters, which have higher ink levels, or cases with two lines of text (last row). However, these cases are less common in the dataset. The random and sequential methods select similar patches, yet sequential selection performs worse. Their selections also resemble those made by the entropy-based method in terms of ink levels and layer examples, but with the difference that entropy selection seems to favor samples with more noise, stains, or degradation.
Regarding the human annotators, the selected patches are quite similar across annotators and also to the automatic methods, especially to the entropy-based criterion. Notably, annotators H4 and H5 (with H4 being the best performer and H5 having intermediate results) selected two patches that overlap with the entropy-based method. Even for H2, the poorest performer, the selected samples seem quite appropriate from a human perspective, showing text as well as examples of degradation. Thus, the reason for H2's poorer performance is not immediately apparent.
This analysis highlights that identifying the best samples is not always straightforward. The results suggest that selection criteria based on layer representation or ink level may not always capture the most informative samples, and that human judgment tends to focus on characteristics that may not be the most suitable for training neural networks.
5. CONCLUSIONS
This work proposes an active learning framework for layout analysis (LA) in Optical Music Recognition (OMR), building upon an existing Few-Shot Learning (FSL) method. By integrating this approach into an iterative training process, where new samples are progressively selected and annotated, we aim to make the most of limited labeled data and improve the original algorithm's accuracy.
A central focus of the study is the comparison between human and automated sample selection strategies within this active learning setup. While selection is typically carried out by human annotators in this context, our findings suggest that their choices do not always align with the requirements of neural networks for effective learning and reliable model performance. Results vary notably depending on the individual, with performance differences of up to 17%, and an average difference of 13% between annotators across all iterations.

Figure 4: Graphs showing the average $F_1^m$ results for the four annotation layers across the three datasets. Each line represents a different selection method. The horizontal axis indicates the number of selected patches, ranging from 1 to 32. (a) Comparison of the results of the five human annotators in this study, along with the overall mean (dashed line). (b) Comparison of the average results obtained by human annotators with those from the automatic selection methods.

Figure 5: Example of the patches selected for the EINSIEDELN dataset, specifically for the “Text” layer, by the five human annotators and the different automatic selection methods compared, for the samples 1, 2, 4, 8, 16, and 32. Patches that coincide across different selection criteria are highlighted with distinct colors.
In contrast, simple automated methods, particularly entropy-based selection, consistently outperform human strategies. Entropy selection reached a competitive $F_1^m$ of 81.1% with only 8 labeled patches, while humans needed 16 or more for similar results. This represents a substantial reduction in annotation effort and time. Furthermore, our approach improves upon existing FSL methods by up to 21.6%, thanks to the enhanced selection strategy and its integration into an iterative and incremental training loop.
Overall, this study demonstrates that automated active learning strategies can remarkably optimize OMR workflows, reduce manual annotation time, and enhance model performance beyond what human intuition can achieve. These findings encourage the integration of machine-driven sample selection into annotation pipelines and question the assumption that human judgment is inherently superior in data selection tasks. As future work, we aim to explore more advanced, model-aware selection strategies that dynamically adapt to the current performance of the model to identify the most informative samples.
6. ACKNOWLEDGMENTS
This research was supported by the Spanish Ministry of Science and Innovation through the LEMUR project (PID2023-148259NB-I00). Juan P. Martinez-Esteso acknowledges support from the Generalitat Valenciana through the SmallOMR project (CIAICO/2023/255). Alejandro Galan-Cuenca also acknowledges support from the Generalitat Valenciana through grant CIACIF/2023/090.
7. REFERENCES
[1] D. Bainbridge and T. Bell, “The challenge of optical music recognition,” Computers and the Humanities, vol. 35, no. 2, pp. 95–121, 2001.
[2] G. M. Binmakhashen and S. A. Mahmoud, “Document layout analysis: a comprehensive survey,” ACM Computing Surveys (CSUR), vol. 52, no. 6, pp. 1–36, 2019.
[3] F. J. Castellanos, A. J. Gallego, and I. Fujinaga, “Deep learning for optical music recognition: A review,” Feb. 2025. [Online]. Available: http://dx.doi.org/10.36227/techrxiv.174077177.78767136/v1
[4] Z. Saleh, K. Zhang, J. Calvo-Zaragoza, G. Vigliensoni, and I. Fujinaga, “Pixel.js: Web-based pixel classification correction platform for ground truth creation,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 02, 2017, pp. 39–40.
[5] F. J. Castellanos, A. J. Gallego, and J. Calvo-Zaragoza, “Unsupervised domain adaptation for document analysis of music score images,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7-12, 2021, 2021, pp. 81–87.
[6] C. Wick, A. Hartelt, and F. Puppe, “Staff, symbol and melody detection of medieval manuscripts written in square notation using deep fully convolutional networks,” Applied Sciences, vol. 9, no. 13, p. 2646, 2019.
[7] F. J. Castellanos, C. Garrido-Munoz, A. Ríos-Vila, and J. Calvo-Zaragoza, “Region-based layout analysis of music score images,” Expert Syst. Appl., vol. 209, p. 118211, 2022.
[8] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM Comput. Surv., vol. 53, no. 3, 2020.
[9] X. Li, L. Yu, C.-W. Fu, M. Fang, and P.-A. Heng, “Revisiting metric learning for few-shot image classification,” Neurocomputing, vol. 406, pp. 49–58, 2020.
[10] A. De Nardin, S. Zottin, M. Paier, G. L. Foresti, E. Colombi, and C. Piciarelli, “Efficient few-shot learning for pixel-precise handwritten document layout analysis,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2023, pp. 3680–3688.
[11] A. De Nardin, S. Zottin, C. Piciarelli, E. Colombi, and G. L. Foresti, “Few-shot pixel-precise document layout segmentation via dynamic instance generation and local thresholding,” International Journal of Neural Systems, vol. 33, no. 10, p. 2350052, 2023.
[12] F. J. Castellanos, A. J. Gallego, and I. Fujinaga, “A few-shot neural approach for layout analysis of music score images,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, Milan, Italy, November 5-9, 2023, 2023, pp. 106–113.
[13] J. Zhu, H. Wang, and E. H. Hovy, “Learning a stopping criterion for active learning for word sense disambiguation and text classification,” in Third International Joint Conference on Natural Language Processing, IJCNLP 2008, Hyderabad, India, January 7-12, 2008. The Association for Computer Linguistics, 2008, pp. 366–372.
[14] B. Settles, “Active learning literature survey,” University of Wisconsin-Madison, Department of Computer Sciences, Tech. Rep. 1648, 2009.
[15] J. Calvo-Zaragoza, K. Zhang, Z. Saleh, G. Vigliensoni, and I. Fujinaga, “Music document layout analysis through machine learning and human feedback,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 02, 2017, pp. 23–24.
[16] I. Fujinaga and G. Vigliensoni, “The art of teaching computers: The SIMSSA optical music recognition workflow system,” in 27th European Signal Processing Conference, EUSIPCO, A Coruña, Spain, September 2-6. IEEE, 2019, pp. 1–5.
[17] R. M. Gray, Entropy and Information Theory. Springer, 2011.
[18] A. E. Esteban, Ed., Música de la Catedral de Barcelona a la Biblioteca de Catalunya. Barcelona: Biblioteca de Catalunya, 2001.
[19] A. J. Gallego and J. Calvo-Zaragoza, “Staff-line removal with selectional auto-encoders,” Expert Syst. Appl., vol. 89, pp. 138–148, 2017.
[20] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010. Springer, 2010, pp. 177–186.
[21] E. Saund, J. Lin, and P. Sarkar, “Pixlabeler: User interface for pixel-level labeling of elements in document images,” in 2009 10th International Conference on Document Analysis and Recognition. IEEE, 2009, pp. 646–650.