Human Vs. Machine: Comparing Selection Strategies in Active Learning for Optical Music Recognition

Author: Juan Pedro Martinez-Esteso; Alejandro Galan-Cuenca; Carlos Pérez-Sancho; Francisco J. Castellanos; Antonio Javier Gallego

Publisher: Zenodo

DOI: 10.5281/zenodo.17706567

Source: https://zenodo.org/records/17706567/files/000082.pdf

HUMAN VS. MACHINE: COMPARING SELECTION STRATEGIES IN
ACTIVE LEARNING FOR OPTICAL MUSIC RECOGNITION
Juan P. Martinez-Esteso¹, Alejandro Galan-Cuenca¹, Carlos Pérez-Sancho¹, Francisco J. Castellanos¹, Antonio Javier Gallego¹
¹ University Institute for Computing Research, University of Alicante, Spain
{juan.martinez11, a.galan}@ua.es, {cperez, fcastellanos, jgallego}@dlsi.ua.es
ABSTRACT
Optical Music Recognition (OMR) systems rely on accurate layout analysis (LA) to segment different information layers in music score images. While deep learning approaches have improved performance, they remain heavily dependent on large amounts of annotated data. In this work, we propose the integration of a Few-Shot Learning (FSL) architecture into an active learning framework for LA. This enables interactive and iterative training, allowing the model to progressively improve from minimal annotated data. We evaluate how this approach enhances recognition accuracy and reduces annotation effort, and we study the impact of different sample selection criteria within this framework, comparing data selected by five expert annotators against four automated strategies: random, sequential, ink density-based, and entropy-based. Experiments across three diverse music score datasets show that entropy-based selection consistently outperforms human choices, achieving an F1-score of 81.1% with only 8 labeled patches, while humans required at least 16 to reach similar performance. Our method improves over existing FSL approaches by up to 21.6% and substantially reduces annotation time. These results suggest that automated strategies can offer more efficient alternatives to human selection in OMR annotation workflows.
1. INTRODUCTION
Optical Music Recognition (OMR) aims to automatically transcribe music score images into digital formats, enabling access to vast collections of musical heritage for analysis, search, and retrieval [1]. A crucial step in this pipeline is Layout Analysis (LA), which segments score images into layers such as staff lines, symbols, lyrics, and background [2]. The diversity of historical and modern manuscripts, with their varying notational styles and print techniques, adds substantial complexity to this task.
© J.P. Martinez-Esteso, A. Galan-Cuenca, C. Pérez-Sancho, F.J. Castellanos, and A.J. Gallego. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J.P. Martinez-Esteso, A. Galan-Cuenca, C. Pérez-Sancho, F.J. Castellanos, and A.J. Gallego, “Human vs. Machine: Comparing Selection Strategies in Active Learning for Optical Music Recognition”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Modern approaches to LA rely heavily on deep learning [3], which requires large annotated datasets for training. However, manually labeling music score images is time-consuming, labor-intensive, and demands domain expertise to ensure accuracy [4]. To alleviate this burden, several strategies have been explored. Domain adaptation techniques [5] have shown promising results in specific scenarios but often suffer from unstable training and performance. Synthetic data generation and data augmentation [6, 7] can increase training data variability but struggle when labeled data is scarce or when the target domain significantly differs in appearance. Few-Shot Learning (FSL) approaches [8, 9] have also been proposed, achieving promising results by maximizing model performance from minimal labeled data.
Recent works have adapted FSL to LA for text documents [10, 11] and music [12], with the latter reducing annotation to small image patches rather than full pages. By combining masking and oversampling, this proposal enables models to generalize from minimal annotated data. However, it relies on a fixed, sequentially selected training set, leaving open whether performance could be further improved through more informed sample selection and interactive training.
Active learning [13, 14] addresses this by iteratively choosing the most informative samples. Instead of relying on predefined or random choices, sample selection is guided by a human expert or an automated strategy, aiming to maximize model improvement. The process typically begins with a single labeled sample and proceeds through cycles of training, evaluation, and selection. In the context of OMR, however, the comparative effectiveness of human versus automated selection remains unexplored [15, 16].
In this work, we propose the integration of an FSL architecture into an active learning framework for LA in OMR, enabling interactive, iterative training. We assess its impact on accuracy and annotation effort compared to the baseline FSL approach, focusing on the comparison of human decision-making against automated selection strategies. We specifically assess whether human annotators (five domain experts in our case) are capable of selecting the most informative image patches for training in comparison with four automated strategies.
Experiments were conducted on three music score datasets with distinct characteristics, including mensural and neumatic notations. Performance was evaluated across four key layers: musical symbols, lyrics, staff lines, and background. In addition to accuracy, annotation time was measured using a custom-developed pixel-level labeling interface to assess the practical efficiency of each strategy.
In summary, this work makes four main contributions: (i) the adaptation of an FSL architecture to an active learning setting; (ii) a detailed analysis of human versus automated selection strategies; (iii) a study on annotation efficiency using a purpose-built labeling tool; and (iv) a new state-of-the-art result in FSL for LA.
2. METHODOLOGY
Figure 1: General scheme of the active learning process. (Diagram elements: unlabeled dataset U, patch selection of P_x in image I_i, manual annotation by the oracle yielding (P_x, P_y), training dataset D, training process of the LA model (FSAE), and a stop check "metric ≥ σ" leading to Finish.)
Figure 1 provides a general overview of the active learning workflow. The process begins with a pool of unlabeled images, denoted as U. In each iteration, a patch is selected, annotated, and used to retrain the model, progressively improving its performance. This cycle is repeated until a predefined stop criterion is met. Specifically, the following steps are performed:
1. Patch selection: A strategy determines which image I_i from the set U and which patch P_x within the image should be annotated. The effectiveness of this selection is the focus of this study, where different criteria are compared, as detailed in later sections.
2. Annotation: A human expert (often referred to as the oracle in related literature) performs pixel-wise annotations of each information layer of the selected patch, yielding the labeled pair (P_x, P_y).
3. Model update: The new annotated patch is added to the training set D and used to retrain and update the weights of the LA model.
4. Performance evaluation: The model's accuracy is assessed against a predefined performance threshold σ. If it meets or exceeds σ, or a maximum number of iterations is reached, the process stops; otherwise, the selection-annotation-training cycle repeats.
The following sections detail the different selection strategies analyzed in this study, the model architecture, and the training procedure.
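The four steps of the cycle can be sketched as a short loop. This is only an illustrative outline, not the authors' implementation: the function name `active_learning_loop` and the callables `select_patch`, `annotate`, `train`, and `evaluate` are hypothetical placeholders for the selection strategy, the oracle, the LA model training, and the metric.

```python
def active_learning_loop(pool, select_patch, annotate, train, evaluate,
                         sigma=1.0, max_iterations=32):
    """Generic select / annotate / train / evaluate cycle (hypothetical sketch)."""
    pool = list(pool)      # unlabeled candidates (the set U)
    labeled = []           # training set D of (patch, annotation) pairs
    history = []           # metric value after each iteration
    model = None
    for _ in range(max_iterations):
        if not pool:
            break
        patch = select_patch(pool, model)          # step 1: patch selection
        pool.remove(patch)
        labeled.append((patch, annotate(patch)))   # step 2: annotation by the oracle
        model = train(labeled)                     # step 3: model update
        score = evaluate(model)                    # step 4: performance evaluation
        history.append(score)
        if score >= sigma:                         # stop when the threshold is met
            break
    return model, history


# Toy usage with stand-in callables (no real model involved).
model, history = active_learning_loop(
    pool=range(10),
    select_patch=lambda pool, model: pool[0],
    annotate=lambda patch: patch * 2,
    train=lambda labeled: len(labeled),
    evaluate=lambda model: model / 4,
    sigma=0.75,
)
```

In this toy run the loop stops at the third iteration, when the stand-in metric reaches the threshold.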
2.1 Selection Criteria
Within this active learning framework, the objective of this study is to assess the impact of the patch selection strategy on model performance and to compare the decision-making ability of human annotators against automated selection methods. To this end, we evaluate a human-driven approach (in which annotator H_i, based on their expert intuition, manually selects the patch they believe will most increase the diversity of dataset D and yield the greatest improvement) alongside four automated selection strategies:
• Random selection: Patches are randomly selected from the dataset, serving as a comparison baseline.
• Sequential selection: Patches are selected in a fixed sequential order by traversing each image from left to right and top to bottom, ensuring an even distribution across the dataset. This criterion was originally proposed for FSL [12], and it is included as an additional baseline for comparison.
• Ink density-based selection: Patches are chosen based on the amount of ink in the target layer, prioritizing those with higher quantity. ¹
• Entropy-based selection: Patches are chosen based on their predicted uncertainty, using entropy as a measure of informativeness. This value is calculated using the method described in [17] for each candidate patch.
Additionally, a threshold range [λ1, λ2] is defined for all the automated selection criteria, ensuring that selected patches contain a minimum ink density (λ1) of the target layer while avoiding patches that are fully saturated with ink (λ2), which might lack discernible structures. This check emulates the decision of a human annotator, who would reject system-proposed patches that either lack relevant information or are entirely filled with ink.
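As a rough illustration of how the ink-density filter and an entropy score could interact, the sketch below works on plain nested lists. Several points are assumptions rather than details from the paper: the ink map is taken to be a binary mask, the range check is inclusive, and uncertainty is computed as the mean per-pixel binary Shannon entropy of the predicted probabilities. All function names are hypothetical.

```python
import math


def ink_density(mask):
    """Fraction of pixels marked as ink (1) in a binary 2D mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def within_ink_range(mask, lam1=0.05, lam2=0.95):
    """Keep a patch only if its ink density lies in [lam1, lam2] (inclusive by assumption)."""
    return lam1 <= ink_density(mask) <= lam2


def mean_binary_entropy(probs):
    """Mean per-pixel binary entropy (in bits) of predicted probabilities."""
    values = []
    for row in probs:
        for p in row:
            if p <= 0.0 or p >= 1.0:
                values.append(0.0)  # a fully certain prediction carries no entropy
            else:
                values.append(-(p * math.log2(p) + (1 - p) * math.log2(1 - p)))
    return sum(values) / len(values)


def rank_by_entropy(candidates):
    """candidates: list of (patch_id, ink_mask, predicted_probs) tuples.

    Returns the ids of the patches that pass the ink filter,
    ordered from most to least uncertain.
    """
    valid = [c for c in candidates if within_ink_range(c[1])]
    valid.sort(key=lambda c: mean_binary_entropy(c[2]), reverse=True)
    return [c[0] for c in valid]
```

With this ordering, a patch whose predictions sit near 0.5 ranks ahead of one predicted confidently, while fully saturated or empty patches never reach the ranking.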
It is important to note that while Figure 1 illustrates the selection of a single patch across all layers for simplicity, different patches may be chosen for each layer depending on the selection strategy. This applies to all selection criteria. For example, a human annotator may choose different patches for each layer based on which they consider most informative; in the ink density-based strategy, patches are ranked and selected according to the ink density of the target layer; and in the sequential and entropy-based strategies, although the patch order is predefined, only those within the threshold range [λ1, λ2] of the target layer are selected, which also results in different patches being chosen for each layer.
2.2 LA model
To maximize learning efficiency with minimal annotations, we adopt the Few-shot Selectional Auto-encoder (FSAE) approach proposed by [12]. This FSL architecture, specifically designed for LA, trains a specialized model for each information layer and then combines their predictions at the pixel level using a maximum a posteriori approach weighted by the confidence of each prediction. This makes FSAE particularly well-suited for the proposed active learning framework, as it allows selective annotation of each layer, optimizing the labeling effort.
¹ This strategy simulates human behavior, where annotators tend to focus on regions with more information (i.e., ink) in the layer. To mimic this behavior, ink density is calculated using the ground truth.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
To enhance learning and generalization with limited data, FSAE integrates two key techniques: a masking layer and an oversampling strategy. Each model is trained on the labeled patch set D_l, where l refers to the annotated patches of a specific layer. The oversampling strategy extracts multiple random samples around each labeled patch to enrich the training data, while the masking layer prevents unannotated regions outside the labeled areas from influencing the learning process.
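The pixel-level combination of the per-layer models mentioned for FSAE can be pictured with a minimal sketch. It assumes the simplest reading of a confidence-weighted maximum a posteriori rule, namely that each pixel is assigned to the layer whose specialized model reports the highest probability; the exact weighting used in [12] may differ, and `combine_layers` is a hypothetical name.

```python
def combine_layers(prob_maps):
    """Assign each pixel to the layer with the highest predicted probability.

    prob_maps: dict mapping a layer name to a 2D list of probabilities,
    all with the same shape (one map per specialized model).
    """
    layers = list(prob_maps)
    height = len(prob_maps[layers[0]])
    width = len(prob_maps[layers[0]][0])
    labels = []
    for y in range(height):
        row = []
        for x in range(width):
            # Pick the layer whose model is most confident at this pixel.
            best = max(layers, key=lambda name: prob_maps[name][y][x])
            row.append(best)
        labels.append(row)
    return labels
```

Because each layer has its own model, a new annotation for one layer only requires retraining that layer's model before recombining.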
Originally, this method was designed to learn from a fixed (and limited) dataset. We propose its integration into an active learning framework, enabling iterative training as described in Figure 1. Within this framework, two strategies are introduced to improve model training:
• Validation set for overfitting prevention: To mitigate overfitting with few samples, a validation set is also selected with one labeled sample. For this, in the first iteration, the annotator selects two samples: one for training and one for validation. In later iterations, the validation sample is added to D, and the new selected sample becomes the validation set. Experimental results indicate that this strategy enhances training robustness by enabling early stopping based on validation performance.
• Iterative model selection: From the second iteration onward, the performance of the current model is compared against the previous iteration. The best-performing model is retained as the baseline for subsequent training cycles. Empirical evaluations demonstrate that this strategy improves training stability and final performance.
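The two strategies above amount to simple bookkeeping between iterations, sketched below. The helper names `rotate_validation` and `keep_best` are hypothetical, and the tie-breaking rule (the newer model wins on equal scores) is an assumption, since the text does not specify it.

```python
def rotate_validation(train_set, validation_sample, new_sample):
    """Move the old validation sample into the training set D;
    the newly selected sample becomes the validation set."""
    if validation_sample is not None:
        train_set = train_set + [validation_sample]
    return train_set, new_sample


def keep_best(previous, current):
    """Each argument is a (model, score) pair; keep the better-scoring one.

    Ties go to the current model (an assumption made for this sketch).
    """
    if previous is None or current[1] >= previous[1]:
        return current
    return previous


# First iteration: one sample for training, one for validation.
train_set, val = ["p0"], "p1"
# Second iteration: a newly selected patch arrives.
train_set, val = rotate_validation(train_set, val, "p2")
# train_set is now ["p0", "p1"] and val is "p2"
```

Keeping the best model so far means a noisy retraining step cannot make the baseline for the next cycle worse than the previous one.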
3. EXPERIMENTAL SETUP
This section presents the experimental setup used in this study, including the datasets used for evaluation, the metrics employed to assess performance, and the implementation details of the model.
3.1 Corpora
For the experiments, we considered the following 3 datasets with manual pixel-wise annotations for 4 layers of information (staff, notes, text, and background). Table 1 includes a summary of their details, while Figure 2 shows examples of regions from the original images to better visualize their particularities.
• EINSIEDELN: 9 high-resolution scanned pages of neumatic notation belonging to the Einsiedeln, Stiftsbibliothek, Codex 611(89), from 1314. ²
• SALZINNES: 10 high-resolution images of pages from the Salzinnes Antiphonal manuscript (CDM-Hsmu M2149.14), in neumatic notation. It is available in the Cantus Ultimus platform. ³
• CAPITAN [18]: Set of 10 double-page images from music manuscripts of the 17th and 18th centuries, originally from the Cathedral of Our Lady of the Pillar in Zaragoza, Spain, using mensural notation. ⁴
² http://www.e-codices.unifr.ch/en/sbe/0611/
³ https://cantus.simssa.ca/manuscript/123723/
In all cases, we used 6 images for training and validation (from where the patches are selected), and the remaining for testing. Based on the original configuration of the LA model employed [12], we used a patch size of 256 × 256 pixels to extract samples from these images.
                                       Layers (%)
Corpus       # imgs   Resol.           Bg     St    No    Te
EINSIEDELN   9        6 496 × 4 872    87.9   3.5   2.7   5.9
SALZINNES    10       5 847 × 3 818    87.6   2.4   2.5   7.5
CAPITAN      10       2 126 × 3 065    85.7   6.6   5.1   2.6
Table 1: Details of the corpora considered including the number of images (# imgs), the average resolution and the proportion of pixels of each layer of interest, with Bg for background, St for staff lines, No for notes, and Te for text.
Figure 2: Examples of regions extracted from the original images in the corpora described in Table 1: (a) EINSIEDELN, (b) SALZINNES, (c) CAPITAN.
3.2 Metrics
To evaluate the performance, we resort to the F-score (F_1) as the evaluation metric, ensuring a balanced assessment despite class imbalances in the datasets (see Table 1). In a binary classification setting, F_1 is defined as:
F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}, \quad (1)
where TP, FP, and FN represent True Positives, False Positives, and False Negatives, respectively.
Since our task involves multiple layers rather than a simple binary classification, we employed the macro-averaged F-score (F_1^m). This metric computes the F_1 separately for each layer and averages the results, ensuring equal weighting regardless of class distribution.
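Equation (1) and its macro average translate directly into a few lines of code. The sketch below takes per-layer (TP, FP, FN) counts as given; returning 0.0 when a layer has no positives at all is a convention chosen here, not something stated in the paper, and the layer names in the example are only illustrative.

```python
def f1_score(tp, fp, fn):
    """Binary F1 as in Eq. (1): 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(counts_per_layer):
    """Unweighted mean of the per-layer F1 scores.

    counts_per_layer: dict mapping a layer name to a (TP, FP, FN) tuple.
    """
    scores = [f1_score(*counts) for counts in counts_per_layer.values()]
    return sum(scores) / len(scores)


# Example with made-up counts: a large layer and a small one weigh the same.
example = {"background": (800, 100, 100), "notes": (10, 10, 10)}
print(round(macro_f1(example), 3))
```

Because every layer contributes equally, a small layer such as notes influences the score as much as the dominant background.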
3.3 Implementation details
The LA model used follows the FSAE architecture proposed by [12], utilizing a separate Selectional Auto-Encoder (SAE) [19] for each of the four layers, with each model specialized in a specific layer. A masking layer is included at the input to ignore unannotated regions, and an oversampling process is applied during training to increase patch variability. After preliminary experiments, we determined to extract 1 024 random samples per epoch around the annotated areas of each iteration. Input images are normalized to [0, 1], with the masking layer allowing values in {−1} ∪ [0, 1], where −1 marks ignored pixels.
⁴ RISM Code “E-Zac” is accessible at https://rism.info.
Each network is trained using binary cross-entropy loss for up to 200 epochs, with early stopping after 20 epochs of no improvement on the validation set. The stochastic gradient descent optimizer [20] is used with a learning rate of 0.01 and a batch size of 32. Standard data augmentation techniques, including random rotations (±45°), zoom (0.8x–1.2x), and horizontal/vertical flips, are applied to increase data diversity.
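One way to picture how ignored pixels can be kept out of a binary cross-entropy loss is sketched below on flat lists of values. This is an assumption about the mechanism, not the FSAE implementation: pixels whose input value is −1 are simply skipped when averaging the loss, and `masked_bce` is a hypothetical helper.

```python
import math


def masked_bce(inputs, targets, predictions, eps=1e-7):
    """Mean binary cross-entropy over pixels whose input value is not -1.

    inputs:      pixel values in {-1} or [0, 1]; -1 marks an ignored pixel
    targets:     ground-truth labels (0 or 1) for the layer
    predictions: predicted probabilities in [0, 1]
    """
    total, count = 0.0, 0
    for x, t, p in zip(inputs, targets, predictions):
        if x == -1:
            continue  # unannotated pixel: contributes nothing to the loss
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        count += 1
    return total / count if count else 0.0
```

With this convention, a badly predicted pixel in an unannotated region has no effect on the value being minimized.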
Patches of size 256 × 256 pixels are extracted using thresholds λ1 = 5% and λ2 = 95%. The stopping criterion σ is set to F_1^m = 100%, so the training will only stop if perfect accuracy is achieved. However, a maximum of 32 iterations is imposed, as this is also the maximum considered in the original FSL approach [12].
4. RESULTS
This section presents the results of this study, beginning with an analysis of the annotation time, followed by a comparison of selection strategies, and concluding with a qualitative examination of the selected patches.
4.1 Annotation Time
To assess the time required for the annotation process, we conducted a study with 5 annotators with expertise in the task. Each annotator labeled 15 patches (5 per dataset) at the pixel level for the 4 layers considered, recording the time taken for each layer individually. For this purpose, we developed a specialized annotation tool, publicly available for the community (https://github.com/cperezs/pixel-level-annotator). While pixel-level annotation is often done using general-purpose graphic editors like GIMP, Photoshop, or Pixelmator, or proprietary tools developed by institutions [4], these are either too generic or not openly accessible. Some document-specific tools, like PixLabeler [21], are available, but they are too simplistic, limited to direct pixel filling, and lack features that facilitate annotation based on color similarity or automatic region completion.
Our tool is designed specifically for this task (see Figure 3) and includes a wide range of features to make annotation easier and faster. Users can define custom layers, use an adjustable brush tool, apply threshold selection, and auto-fill for completing layers. Additionally, it offers a quick-review mode to identify unannotated pixels, an option to prevent overwriting existing annotations, and intuitive keyboard shortcuts for switching between tools, layers, and settings such as brush size and threshold.
Table 2 presents the results of this study, reporting the average annotation time per layer and the total average annotation time per patch, along with the standard deviation, spent by each annotator. As shown, staff and notes take the longest to annotate due to overlapping regions requiring boundary marking. These are followed by the text layer, which does not usually overlap, and the background, which is easily auto-filled with minor edge adjustments. The average annotation time per patch is 8:18±2:06.
Figure 3: Developed tool for pixel-level annotation, showing a zoomed-in area of the image.
                        Time per layer
Ann.   Staff       Notes       Text        Bg.         Total
H1     2:33±1:02   2:18±1:42   1:34±1:03   1:10±1:05   7:36±3:14
H2     3:09±2:15   1:43±1:02   1:10±0:42   0:19±0:14   6:21±2:47
H3     2:28±2:22   2:26±0:45   1:26±1:02   1:05±0:43   7:26±3:43
H4     3:37±1:51   2:34±2:01   3:00±1:48   1:43±1:00   10:55±3:57
H5     2:49±2:06   3:21±0:53   1:57±0:58   1:06±1:24   9:13±3:58
Avg.   2:55±1:08   2:28±0:07   1:49±0:30   1:05±0:30   8:18±2:06
Table 2: Average annotation time (mm:ss) per patch of each annotator H_n. The table shows the mean time to manually annotate 15 patches for each individual layer, with the total average time per patch in the last column and the overall average across all annotators in the last row.
Based on these times, annotating the maximum of 32 patches considered in this study would take about 4.5 hours. However, annotating entire pages from one of the datasets, given their high resolution (see Table 1), could take up to a month of work for the CAPITAN dataset, or two to three months for SALZINNES and EINSIEDELN.
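The labeling-time figures follow from multiplying the number of patches by the 8:18 average per patch. The sketch below reproduces that arithmetic; the page-level helper additionally assumes a non-overlapping 256 × 256 tiling of a page, which is only one plausible way to arrive at a full-page estimate and is not described in the paper. All function names are hypothetical.

```python
import math


def to_seconds(mmss):
    """Convert an 'mm:ss' string into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def format_hms(total_seconds):
    """Format a number of seconds as 'h:mm:ss'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def labeling_time(n_patches, per_patch="8:18"):
    """Estimated time to label n_patches at the average time per patch."""
    return format_hms(n_patches * to_seconds(per_patch))


def patches_per_page(width, height, patch=256):
    """Number of non-overlapping patches needed to cover a page (assumed tiling)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


print(labeling_time(32))             # 4:25:36
print(patches_per_page(2126, 3065))  # 108
```

The first value is roughly 4.4 hours for the 32-patch budget; the second shows why a single full page requires far more annotation than that whole budget.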
As a reference, we compared the agreement between the annotations performed by the five annotators, resulting in an average agreement of 95.7%±3.2% across all layers. This is quite high, and the differences observed are limited to slight discrepancies along the edges of each layer.
4.2 Human vs. Automated Sample Selection
This section presents the results of the study comparing the human selection criterion with the automatic methods, following the active learning framework and experimental setup described in Sections 2 and 3.
Figure 4a shows the results obtained by human annotators, who sequentially selected patches from 1 to 32, retraining the LA model after each selection. Each curve represents the average performance of one expert across the 4 layers and 3 datasets, along with the overall mean (dashed line). The results show notable variability: 2 annotators achieved the best results, 2 intermediate, and 1 performed worse. The gap between the best and worst performance reaches up to 17%, with an average difference of 13±2% across all iterations. These differences are especially relevant considering that, as discussed in the next section, the selected samples seem quite similar across annotators.
Figure 4b compares the average results of human annotators (dashed line) with those from automatic selection methods, as well as the results from the original FSL approach (FSAE), which used sequential selection without incremental training. The automatic method based on entropy achieves the best results, followed by random selection, which even outperforms human experts. The ink-level-based method performs worse initially but improves after 15 labeled samples. The methods with the worst performance are those based on sequential selection, including the original FSL method. In contrast, a simple random selection criterion performs much better. Notably, the gap between the best (entropy) and worst (sequential) methods reaches up to 22%, with an average difference of 18±3% across all iterations.
This reinforces the importance of the order in which samples are chosen. Methods such as the sequential criterion fail to capture representative samples from the entire document. The original FSL method (based on this criterion) yields even worse results, likely due to the lack of iterative training, which hinders progressive model refinement. The ink-level-based selection is also not an adequate criterion. Human annotators likely use a similar approach, prioritizing patches with more representation of a layer's information. In contrast, methods like entropy and random selection aim to select samples with more variability of information. Among these, entropy performs the best by targeting the most informative samples while maintaining a balanced representation of the layer's information.
Table 3 summarizes the F_1^m scores for 1, 2, 4, 8, 16, and 32 labeled samples, along with the time required for labeling. The entropy-based method generally yields the best results, though random selection outperforms it for 2 and 4 samples. From 8 samples onward, entropy consistently outperforms all methods. According to these results, with just 8 labeled samples (or 16 for further improvements), a competitive model can be achieved, with up to a 4.3% F_1^m improvement over human selection and up to a 21.6% improvement over the current FSL state of the art [12].
4.3 Qualitative analysis
This section presents a qualitative comparison of the patches selected by human annotators and those chosen by the automatic methods. Figure 5 illustrates an example for the “Text” layer of the EINSIEDELN dataset, highlighting with different colors the patches that coincide between methods. Complete examples of the selected patches across all layers and datasets are available as supplementary material at Zenodo: https://doi.org/10.5281/zenodo.15735893.
             F_1^m per number of labeled samples
Method       1         2         4         8         16        32
Humans       57.0      63.4      69.1      76.8      82.0      85.3
Random       57.4      65.2      70.7      78.1      82.7      84.6
Ink-level    58.1      64.0      68.0      71.7      82.9      86.0
Sequent.     55.5      57.1      57.2      62.4      71.1      72.0
Entropy      59.5      63.4      68.9      81.1      84.4      86.4
FSAE [12]    53.7      49.3      57.4      59.7      67.7      68.4
Tmp.         0:08:18   0:16:36   0:33:12   1:06:24   2:12:48   4:25:36
Table 3: Results of F_1^m for the different selection methods compared, based on the number of labeled samples (1, 2, 4, 8, 16, and 32). The “Tmp.” row indicates the estimated time (hh:mm:ss) required for labeling based on Table 2.
In general, all the selected patches seem appropriate, as they contain text, which does not explain the worse performance of some criteria. A close look reveals that the ink-level-based method tends to select decorative letters, which have higher ink levels, or cases with two lines of text (last row). However, these cases are less common in the dataset. The random and sequential methods select similar patches, yet sequential selection performs worse. Their selections also resemble those made by the entropy-based method in terms of ink levels and layer examples, but with the difference that entropy selection seems to favor samples with more noise, stains, or degradation.
Regarding the human annotators, the selected patches are quite similar across annotators and also to the automatic methods, especially to the entropy-based criterion. Notably, annotators H4 and H5 (with H4 being the best performer and H5 having intermediate results) selected two patches that overlap with the entropy-based method. Even for H2, the poorest performer, the selected samples seem quite appropriate from a human perspective, showing text as well as examples of degradation. Thus, the reason for H2's poorer performance is not immediately apparent.
This analysis highlights that identifying the best samples is not always straightforward. The results suggest that selection criteria based on layer representation or ink level may not always capture the most informative samples, and that human judgment tends to focus on characteristics that may not be the most suitable for training neural networks.
5. CONCLUSIONS
This work proposes an active learning framework for layout analysis (LA) in Optical Music Recognition (OMR), building upon an existing Few-Shot Learning (FSL) method. By integrating this approach into an iterative training process, where new samples are progressively selected and annotated, we aim to make the most of limited labeled data and improve the original algorithm's accuracy.
A central focus of the study is the comparison between human and automated sample selection strategies within this active learning setup. While selection is typically carried out by human annotators in this context, our findings suggest that their choices do not always align with the requirements of neural networks for effective learning and reliable model performance. Results vary notably depending on the individual, with performance differences of up to 17%, and an average difference of 13% between annotators across all iterations.
Figure 4: Graphs showing the average F_1^m results for the four annotation layers across the three datasets. Each line represents a different selection method. The horizontal axis indicates the number of selected patches, ranging from 1 to 32. (a) Comparison of the results of the five human annotators in this study (H1 to H5), along with the overall mean (dashed line). (b) Comparison of the average results obtained by human annotators with those from the automatic selection methods (Random, Ink-level, Entropy, Sequential, FSAE). Axes in both panels: Iteration (horizontal) versus F_1^m (vertical).
Figure 5: Example of the patches selected for the EINSIEDELN dataset, specifically for the “Text” layer, by the five human annotators and the different automatic selection methods compared, for the samples 1, 2, 4, 8, 16, and 32. Patches that coincide across different selection criteria are highlighted with distinct colors.
In contrast, simple automated methods, particularly entropy-based selection, consistently outperform human strategies. Entropy selection reached a competitive F_1^m of 81.1% with only 8 labeled patches, while humans needed 16 or more for similar results. This represents a substantial reduction in annotation effort and time. Furthermore, our approach improves upon existing FSL methods by up to 21.6%, thanks to the enhanced selection strategy and its integration into an iterative and incremental training loop.
Overall, this study demonstrates that automated active learning strategies can remarkably optimize OMR workflows, reduce manual annotation time, and enhance model performance beyond what human intuition can achieve. These findings encourage the integration of machine-driven sample selection into annotation pipelines and question the assumption that human judgment is inherently superior in data selection tasks. As future work, we aim to explore more advanced, model-aware selection strategies that dynamically adapt to the current performance of the model to identify the most informative samples.
6. ACKNOWLEDGMENTS
This research was supported by the Spanish Ministry of Science and Innovation through the LEMUR project (PID2023-148259NB-I00). Juan P. Martinez-Esteso acknowledges support from the Generalitat Valenciana through the SmallOMR project (CIAICO/2023/255). Alejandro Galan-Cuenca also acknowledges support from the Generalitat Valenciana through grant CIACIF/2023/090.
7. REFERENCES
[1] D. Bainbridge and T. Bell, “The challenge of optical music recognition,” Computers and the Humanities, vol. 35, no. 2, pp. 95–121, 2001.
[2] G. M. Binmakhashen and S. A. Mahmoud, “Document layout analysis: a comprehensive survey,” ACM Computing Surveys (CSUR), vol. 52, no. 6, pp. 1–36, 2019.
[3] F. J. Castellanos, A. J. Gallego, and I. Fujinaga, “Deep learning for optical music recognition: A review,” Feb. 2025. [Online]. Available: http://dx.doi.org/10.36227/techrxiv.174077177.78767136/v1
[4] Z. Saleh, K. Zhang, J. Calvo-Zaragoza, G. Vigliensoni, and I. Fujinaga, “Pixel.js: Web-based pixel classification correction platform for ground truth creation,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 02, 2017, pp. 39–40.
[5] F. J. Castellanos, A. J. Gallego, and J. Calvo-Zaragoza, “Unsupervised domain adaptation for document analysis of music score images,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7-12, 2021, 2021, pp. 81–87.
[6] C. Wick, A. Hartelt, and F. Puppe, “Staff, symbol and melody detection of medieval manuscripts written in square notation using deep fully convolutional networks,” Applied Sciences, vol. 9, no. 13, p. 2646, 2019.
[7] F. J. Castellanos, C. Garrido-Munoz, A. Ríos-Vila, and J. Calvo-Zaragoza, “Region-based layout analysis of music score images,” Expert Syst. Appl., vol. 209, p. 118211, 2022.
[8] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM Comput. Surv., vol. 53, no. 3, 2020.
[9] X. Li, L. Yu, C.-W. Fu, M. Fang, and P.-A. Heng, “Revisiting metric learning for few-shot image classification,” Neurocomputing, vol. 406, pp. 49–58, 2020.
[10] A. De Nardin, S. Zottin, M. Paier, G. L. Foresti, E. Colombi, and C. Piciarelli, “Efficient few-shot learning for pixel-precise handwritten document layout analysis,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2023, pp. 3680–3688.
[11] A. De Nardin, S. Zottin, C. Piciarelli, E. Colombi, and G. L. Foresti, “Few-shot pixel-precise document layout segmentation via dynamic instance generation and local thresholding,” International Journal of Neural Systems, vol. 33, no. 10, p. 2350052, 2023.
[12] F. J. Castellanos, A. J. Gallego, and I. Fujinaga, “A few-shot neural approach for layout analysis of music score images,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, Milan, Italy, November 5-9, 2023, 2023, pp. 106–113.
[13] J. Zhu, H. Wang, and E. H. Hovy, “Learning a stopping criterion for active learning for word sense disambiguation and text classification,” in Third International Joint Conference on Natural Language Processing, IJCNLP 2008, Hyderabad, India, January 7-12, 2008. The Association for Computer Linguistics, 2008, pp. 366–372.
[14] B. Settles, “Active learning literature survey,” University of Wisconsin-Madison, Department of Computer Sciences, Tech. Rep. 1648, 2009.
[15] J. Calvo-Zaragoza, K. Zhang, Z. Saleh, G. Vigliensoni, and I. Fujinaga, “Music document layout analysis through machine learning and human feedback,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 02, 2017, pp. 23–24.
[16] I. Fujinaga and G. Vigliensoni, “The art of teaching computers: The SIMSSA optical music recognition workflow system,” in 27th European Signal Processing Conference, EUSIPCO, A Coruña, Spain, September 2-6. IEEE, 2019, pp. 1–5.
[17] R. M. Gray, Entropy and Information Theory. Springer, 2011.
[18] A. E. Esteban, Ed., Música de la Catedral de Barcelona a la Biblioteca de Catalunya. Barcelona: Biblioteca de Catalunya, 2001.
[19] A. J. Gallego and J. Calvo-Zaragoza, “Staff-line removal with selectional auto-encoders,” Expert Syst. Appl., vol. 89, pp. 138–148, 2017.
[20] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[21] E. Saund, J. Lin, and P. Sarkar, “Pixlabeler: User interface for pixel-level labeling of elements in document images,” in 2009 10th International Conference on Document Analysis and Recognition. IEEE, 2009, pp. 646–650.
