BARWISE SECTION BOUNDARY DETECTION IN SYMBOLIC MUSIC
USING CONVOLUTIONAL NEURAL NETWORKS
Omar Eldeeb
Technical University of Munich
[email protected]
Martin E. Malandro
Sam Houston State University
[email protected]
ABSTRACT
Current methods for Music Structure Analysis (MSA) focus primarily on audio data. While
symbolic music can be synthesized into audio and analyzed using existing MSA techniques,
such an approach does not exploit symbolic music’s rich explicit representation of pitch,
timing, and instrumentation. A key subproblem of MSA is section boundary detection:
determining whether a given point in time marks the transition between musical sections.
In this paper, we study automatic section boundary detection for symbolic music. First, we
introduce a human-annotated MIDI dataset for section boundary detection, consisting of
metadata from 6134 MIDI files that we manually curated from the Lakh MIDI dataset.
Second, we train a deep learning model to classify the presence of section boundaries
within a fixed-length musical window. Our data representation involves a novel encoding
scheme based on synthesized overtones to encode arbitrary MIDI instrumentations into
3-channel piano rolls. Our model achieves an F1 score of 0.77, improving over the
analogous audio-based supervised learning approach and the unsupervised correlation
block-matching (CBM) segmentation audio approach by 0.22 and 0.31, respectively. We
release our dataset, code, and models.¹
1. INTRODUCTION
Music is commonly structured over time in a hierarchical manner, ranging from short
repeating phrases and motifs to longer, non-overlapping sections such as verses, choruses,
or movements. The automatic analysis of this structure is known as Music Structure
Analysis (MSA) and can span a wide temporal range, from brief segments lasting a few
seconds to entire sections exceeding a minute in duration. A common first step in MSA is
the detection of section boundaries, which can then be used to group or label segments
based on principles such as homogeneity and repetition. In this work, we focus on
detecting section boundaries in symbolic music in a non-hierarchical manner; that is,
identifying the points in time where one musical section (e.g., verse, chorus, bridge)
ends and another begins.

¹ Dataset available at https://github.com/m-malandro/SLMS. Code and models available at
https://github.com/omareldeeb/midi-msa.

© O. Eldeeb and M. E. Malandro. Licensed under a Creative Commons Attribution 4.0
International License (CC BY 4.0). Attribution: O. Eldeeb and M. E. Malandro, “Barwise
Section Boundary Detection in Symbolic Music Using Convolutional Neural Networks”, in
Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South
Korea, 2025.

So far, most algorithms for MSA have focused on waveform audio as opposed to symbolic
data, possibly due to a current lack of human-annotated symbolic music. Exceptions include
[1], which focused on phrase-level segmentation in pop piano music (and proposed
identifying sections from the patterns of detected phrases), and [2], which focused on
phrase-level segmentation in melodies.
MSA of waveform audio is a central research topic in music information retrieval, and is
motivated by a number of downstream applications [3]. Here we highlight two motivations
for the study of MSA for symbolic music:
First, for researchers who are interested in MSA, symbolic data are more freely and openly
available than waveform data. While existing annotations for waveform audio are generally
freely and openly available [4–8], obtaining access to all of the associated audio
recordings can be expensive. Symbolic datasets, in contrast, can usually be obtained in
full at no cost. We release a new dataset, consisting of human annotations of section
boundaries for 6134 MIDI files from the Lakh MIDI Dataset (LMD) [9, 10]. All of the MIDI
referenced by these annotations is available in the LMD.
Second, MSA for symbolic music has the potential to improve the quality of symbolic music
generation. Previous works have identified a trend for outputs from generative symbolic
models to be repetitive or meandering, rather than having clear structure that drives
toward musical payoffs [11–13]. Recent work [14] introduced a structure-aware symbolic
music generation system, which is capable of writing contrasting musical sections. We note
that the authors of [14] used an audio-based method [15], applied to waveform data, to
compute section boundaries for their training data. If MSA techniques for symbolic music
outperform MSA techniques for audio, then such techniques would improve the training data
quality, and therefore likely also improve the output quality, of structure-aware
generative symbolic music systems.
2. RELATED WORK
Data for music structure analysis are scarce, as manually annotating music is
labor-intensive. Audio datasets for MSA include SALAMI [4], the Harmonix dataset [5], and
the RWC dataset [6–8]. Annotated symbolic datasets include the piano dataset Pop909 [16]
and its annotations in [1], the Essen Folksong dataset [17] (which consists of
phrase-level annotations of 8473 short folk melodies), and S3, an annotated dataset of 4
symphonies totaling 16 movements [18]. We aim to develop a method capable of segmenting
arbitrary MIDI files, and therefore developed our own dataset for this work; see Section 4
for details.
In 2014, Ullrich, Schlüter, and Grill [19] expressed the problem of section boundary
detection in audio as a binary classification task, and trained a convolutional neural
network (CNN) on mel-scaled spectrograms extracted from fixed-duration slices of audio to
predict whether a section boundary exists at the center of a given network input. At test
time, the network is applied over a sliding window of the audio and produces a boundary
probability for each frame. These boundary probabilities are finally decoded to boundary
positions using a simple peak-picking algorithm with a moving threshold.
In two follow-up papers [20, 21], Grill and Schlüter extend this approach by incorporating
additional input representations and multi-level annotations. They use
Harmonic-Percussive Source Separation (HPSS) to isolate harmonic and percussive components
and Self-Similarity Lag Matrices (SSLMs) to capture long-range temporal dependencies. This
approach remains the state of the art in supervised section boundary detection on the
SALAMI dataset [4]. Subsequent papers have explored self-supervised learning approaches
[22], hierarchical MSA [23, 24], and the functional labeling of segments [25–27]. More
recently, Transformer-based models have been proposed to jointly detect boundaries and
functional labels [26, 27], achieving state-of-the-art results on datasets like
Harmonix [5].
3. METHOD
Our method is inspired by the audio-based method in [21] and the line of work leading up
to it [19, 20]. In particular, our method involves training a convolutional neural network
(CNN) on piano rolls synthesized from MIDI data.
3.1 Feature Extraction
Given a MIDI file, we extract the time, pitch, duration, velocity, and program
(instrumentation) information of each note event. Additionally, we multiply the note’s
velocity by any preceding volume and expression control change values (scaled to the
interval [0,1]) on the corresponding message channel. We quantize all events to a fixed
temporal resolution of 4 ticks per beat and merge all tracks, except for drums, into a
single "piano roll" image. We split each piano roll into equal-size patches centered at
measure boundaries, which serve as the primary inputs to our neural network. The patches
have a height of 128 pixels, corresponding to the 128 MIDI pitch values, and a width of
512 pixels, corresponding to a duration of 128 quarter notes, or 32 bars in a 4/4 time
signature. We emphasize that our inputs do not have to be in 4/4: we accommodate files
containing any collection of time signatures, and we compute measure boundaries from the
time signature information within the files. As in [21], we separate drums into a distinct
channel in order to allow the network to easily distinguish between rhythmic and
harmonic/melodic content. Furthermore, we disregard the duration given by drum note events
and set them to an arbitrary but fixed duration of one 16th note.
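The patch geometry described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper names `ticks_per_measure` and `extract_patch` are hypothetical, the beat is assumed to be a quarter note, and zero-padding past the ends of the roll is an assumption made only to keep the example self-contained.

```python
TICKS_PER_BEAT = 4   # temporal resolution from the text (beat assumed to be a quarter note)
PATCH_WIDTH = 512    # 128 quarter notes * 4 ticks per quarter note
N_PITCHES = 128      # one row per MIDI pitch value


def ticks_per_measure(numerator, denominator):
    """Length of one measure in ticks for a time signature numerator/denominator."""
    quarter_notes = numerator * 4 / denominator
    return int(round(quarter_notes * TICKS_PER_BEAT))


def extract_patch(roll, center_tick, width=PATCH_WIDTH):
    """Cut a window of `width` columns centered at `center_tick`.

    `roll` is a list of N_PITCHES rows, each a list of per-tick velocities.
    Columns outside the roll are filled with zeros (an assumption of this sketch).
    """
    half = width // 2
    n_ticks = len(roll[0])
    return [
        [row[t] if 0 <= t < n_ticks else 0.0
         for t in range(center_tick - half, center_tick + half)]
        for row in roll
    ]
```

With these constants, a 4/4 measure spans 16 ticks, so a 512-column patch covers 32 such measures, matching the figures given in the text.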
3.2 Instrument Encoding
We hypothesize that explicitly encoding instrumentation (i.e., MIDI program numbers) into
the piano roll representation simplifies the learning task. Given that MIDI defines a
fixed number of 128 possible programs, a naive approach would be to assign each program
its own input channel. However, in most cases this results in extremely sparse tensors, as
most musical pieces use only a small number of instruments. Instead, inspired by audio
spectrograms, we propose a harmonic overtone encoding scheme:
• Each non-drum program is mapped to a random but fixed “harmonic overtone series.”
• Each played note generates K additional overtones (with decreasing velocity factors) at
integer multiples of the original note’s fundamental frequency, up to a maximum multiple
of 5.
For example: Piano programs could be mapped to the sequence (2, 3, 5) with velocity
factors (0.6, 0.4, 0.1). If any piano track plays a note with fundamental frequency f_0
and velocity v, three additional notes at frequencies (2f_0, 3f_0, 5f_0) are generated,
quantized to the closest MIDI note, and added to the piano roll with onset velocities
(0.6v, 0.4v, 0.1v).
• We apply a linear amplitude decay to each generated overtone over the note’s duration,
scaling its velocity from full strength at note onset down to zero at the note’s end. This
helps distinguish overtones from actual note-onset events.
• The overtone-based encodings are assigned to a dedicated input channel, separate from
the primary piano roll representation.
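The overtone scheme in the list above can be illustrated with a short sketch. It assumes standard 12-tone equal temperament, under which multiplying a frequency by m raises the MIDI pitch by 12·log2(m) semitones. The function name `overtone_events`, the per-tick decay formula, and the choice to drop overtones above MIDI pitch 127 are assumptions of this sketch, not details confirmed by the text; the multiples and factors are the piano example values given above.

```python
import math


def overtone_events(pitch, velocity, duration,
                    multiples=(2, 3, 5), factors=(0.6, 0.4, 0.1)):
    """Return {overtone_pitch: [velocity at each tick]} for one played note.

    `duration` is the note length in ticks. Each overtone starts at
    factor * velocity and decays linearly, reaching zero at the note's end.
    """
    events = {}
    for multiple, factor in zip(multiples, factors):
        # Closest MIDI pitch to `multiple` times the fundamental frequency.
        overtone_pitch = int(round(pitch + 12 * math.log2(multiple)))
        if overtone_pitch > 127:
            continue  # assumption: overtones above the MIDI range are dropped
        onset = factor * velocity
        events[overtone_pitch] = [onset * (1 - t / duration) for t in range(duration)]
    return events
```

For middle C (pitch 60), the multiples (2, 3, 5) land on pitches 72, 79, and 88, i.e. one octave, an octave plus a fifth, and two octaves plus a major third above the played note.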
3.3 Model Architecture
CNNs have achieved state-of-the-art results across various MIR tasks, including section
boundary detection for audio [21], onset detection [28], and beat tracking [29]. Given
their success, particularly for the closely related task of audio-based section boundary
detection, we adopt a CNN-based approach for our symbolic music task.
Recent work in MIR has demonstrated the effectiveness of finetuning CNNs pretrained on
computer vision tasks for music-related applications [30, 31]. Inspired by these findings,
we use MobileNetV3 [32], a lightweight yet effective CNN architecture originally designed
for efficient image recognition. We train the model starting from weights pretrained on
the ImageNet dataset [33]. We demonstrate in Section 5 that this approach yields slightly
better results than training the same architecture from scratch.

Figure 1. A patch from file ca05cc474fd2010484c1201bf57b3cfd from the training set. We
overlay the vertical red line in the center of the patch in this figure to indicate that
this is a positive (section boundary = True) training example.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

We adopt the learning task outlined in [19–21], where the network aims to predict whether
a section boundary is present at the center of the input patch. An example of an input to
our neural network is given in Figure 1.
4. DATASET
The Lakh MIDI dataset (LMD) [9, 10] is a dataset of approximately 170k MIDI files widely
used by the research community. In this section, we introduce a new subset of the LMD,
which we call the Segmented Lakh MIDI Subset (SLMS).
While exploring the LMD, we noticed that thousands of the files contain MIDI markers. The
“marker” event in MIDI is a meta event with a text string field that can be placed at any
time location in a MIDI file. In some of these files, the markers are intended by the MIDI
file authors to be section boundary markers, and in some of these, we found these markers
to serve as reasonable segmentations. The authors of [21] point out that there can be a
wide range of opinion in how to segment a piece of music: evaluating the section-level
dual annotations in SALAMI against each other with an evaluation tolerance of 0.5 seconds,
they found an F1 score of 74%. Therefore, we accept annotations as segmentations whenever
we find them to be reasonable, regardless of whether they agree with how we would have
segmented the file.
As far as we know, the existence of files containing annotated segmentations within the
LMD has gone unnoticed by other researchers until now. Indeed, in the thesis in which the
LMD was introduced, Raffel [9] found no files containing structural annotations, possibly
because he searched for “text” events rather than “marker” events.
We began by deduplicating the LMD using the method in [34], which uses silence removal and
quantized note onset chromagrams to identify when two files contain essentially the same
musical information. To filter out files that are unlikely to include valid segmentations,
we then excluded files that had fewer than 3 markers and files that had an unreasonably
low (less than 6) or high (more than 24) ratio of measures to markers. We also excluded
files which had no markers between the first and last note onsets.
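The marker-based filter described above amounts to three simple checks. The following sketch is illustrative only: the function name `passes_marker_filter` and its arguments are hypothetical, marker and onset positions are assumed to share one time unit (for example ticks), and treating "between" as a strict inequality is an assumption.

```python
def passes_marker_filter(marker_times, n_measures, first_onset, last_onset):
    """Return True if a file survives the marker-count and ratio filters."""
    n_markers = len(marker_times)
    if n_markers < 3:
        return False  # fewer than 3 markers
    ratio = n_measures / n_markers
    if ratio < 6 or ratio > 24:
        return False  # unreasonably low or high ratio of measures to markers
    # At least one marker must fall between the first and last note onsets.
    return any(first_onset < t < last_onset for t in marker_times)
```

For instance, a 48-measure file with four markers has a ratio of 12 measures per marker and passes, provided at least one marker lies inside the span of note onsets.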
We then found a single MIDI file author (Benjamin Robert Tubb) whose name appeared in a
majority (about 57%) of the files with markers. Tubb sequenced primarily 19th century
popular songs, and files bearing his name have a distinctive layout. We therefore present
the SLMS as two non-overlapping subsets: the Tubb files and the non-Tubb files. The
non-Tubb files are more diverse in terms of both annotation style and musical style, and
consist of styles including, but not limited to, rock, metal, jazz, solo classical piano,
symphonic, and karaoke music. We also identified 5 files without Tubb’s name in them that
we believe he sequenced. We include them in the Tubb files.
This left us with 4466 Tubb files and 3336 non-Tubb files. We then performed a manual
inspection of each of these files. We visualized each MIDI file in a digital audio
workstation (DAW) to see if the markers appeared to represent a valid segmentation, and
when we were unsure, we listened. We selected only files where all perceived segment
boundaries occurred at bar lines (as defined by the time signature information within the
files). After removing files containing clear errors (such as missing markers or misplaced
markers) or markers that are not intended to serve as segment boundary markers, we were
left with 3907 Tubb files (12.5% rejection rate) and 2227 non-Tubb files (33.2% rejection
rate). We quantized all markers to bar lines and extracted all resulting marker
information from these files, both in terms of seconds and in terms of beats elapsed since
the start of the file. We release this information as the SLMS, including our
train/validation/test split.²

Dataset            # Songs   # Hours
SLMS (Tubb)        3907      225.1
SLMS (non-Tubb)    2227      143.5
SALAMI             1359      105.8
Harmonix           912       56.1
RWC                315       23.5
Table 1. Some annotated multi-track music datasets

This release constitutes the largest collection of human-segmented multi-track music of
which we are aware. See Table 1 for information about how our dataset compares to other
datasets of human-segmented music. We mention that the section boundary markers in many of
these files also contain structural information (e.g., “verse”, “chorus”) in their text
labels, which we have also extracted and included in the SLMS. This information may be
useful for future work in Music Structure Analysis.
Despite the large size of our dataset, we also acknowledge some shortcomings of our
dataset relative to others. For example, SALAMI was created with a style guide for
annotators, and contains multi-level annotations. Our dataset has neither property. Also,
a majority of the files in SALAMI have two human annotations, while each of our files has
only one: the annotation of the original MIDI file author.
Aside from one programmatic correction we discuss below, while creating our dataset, we
resisted the urge to correct segmentations that we found to be incorrect. Instead, we
chose simply to exclude files containing incorrect segmentations from the SLMS. We wanted
our data contribution in this work to be primarily a record of what is already present in
the LMD, rather than our subjective corrections of that data. We note that we excluded
many files due to simple errors or omissions, and this provides an opportunity in future
work to obtain a large amount of additional training data at the expense of some
additional data correction effort. We release our starting list of 7802 files along with
our final list of 6134 curated files in case other authors wish to carry out this work.
We made one programmatic correction to the Tubb files. Tubb often split measures at
section boundaries containing pickups into two measures (e.g., a measure of 6/8 with a
pickup beginning after the fifth eighth note would be split into a measure of 5/8 and a
measure of 1/8, with the segment marker placed at the start of the 1/8 measure, rather
than at the start of the next measure). For the Tubb files in our dataset, we moved
markers forward to the start of the next measure when they occurred at a measure of less
than half note duration followed by a measure with greater than or equal to a half note
duration. Based on our experience looking at and listening to the files, we did not find
this correction to introduce any segmentation errors. We note that we changed only the
boundary marker locations, not the measure annotations themselves. There are many examples
of short odd-time-signature measures near segment boundaries in our non-Tubb files,
indicating that our model needs to be able to handle music with such embedded measure
annotations to generalize to unseen data.

² https://github.com/m-malandro/SLMS
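The marker correction rule for the Tubb files can be expressed compactly. The sketch below is an illustration under stated assumptions: the function name `correct_pickup_markers` is hypothetical, measure lengths are given in quarter notes, and each marker is represented by the index of the measure at whose start it sits.

```python
HALF_NOTE = 2.0  # duration of a half note, in quarter notes


def correct_pickup_markers(measure_lengths, marker_measures):
    """Move markers sitting on a short pickup measure to the next measure.

    A marker at measure i is moved to measure i + 1 when measure i is shorter
    than a half note and measure i + 1 is at least a half note long.
    """
    corrected = []
    for i in marker_measures:
        has_next = i + 1 < len(measure_lengths)
        if (has_next and measure_lengths[i] < HALF_NOTE
                and measure_lengths[i + 1] >= HALF_NOTE):
            corrected.append(i + 1)
        else:
            corrected.append(i)
    return corrected
```

In the 6/8 example from the text, the split measures have lengths 2.5 (5/8) and 0.5 (1/8) quarter notes, so a marker on the 1/8 measure moves forward to the following full measure.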
5. EXPERIMENTS
For all experiments, we train a MobileNetV3 [32] architecture using a binary cross-entropy
loss function and optimize using AdamW [35] with a learning rate of 10^-3 and weight decay
of 10^-2. We apply early stopping when no improvement in validation F1 score is observed
for 5 consecutive epochs.
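The early stopping rule described above can be sketched independently of any deep learning framework. The class name `EarlyStopping` is hypothetical, and counting an epoch that merely ties the best score as "no improvement" is an assumption of this sketch.

```python
class EarlyStopping:
    """Stop training after `patience` consecutive epochs without a new best score."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, val_f1):
        """Record one epoch's validation F1; return True when training should stop."""
        if val_f1 > self.best:
            self.best = val_f1
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience
```

In a PyTorch training loop, one plausible pairing would be an optimizer built as `torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)`, with `step` called once per epoch on the validation F1 score; the paper does not state which framework was used.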
We use 5359 songs for training: 3425 of the Tubb files and 1934 of the non-Tubb files. We
use 246 Tubb and 100 non-Tubb songs for validation. This leaves 236 Tubb and 193 non-Tubb
songs for our test set. To ensure consistency in both training and evaluation, we exclude
boundaries that occur near the beginning or end of a piece. Specifically, we disregard all
segment boundaries that fall within 16 bars of the first or last note onset. While prior
work (e.g., [19–21]) handles edge cases by padding the input with half a patch window, we
observed inconsistencies in segment annotations near the beginnings and endings of some
pieces in our dataset (e.g., whether the beginning of the final measure is marked as a
segment boundary) and therefore instead adopt this boundary exclusion strategy, applying
it uniformly to our method and all baselines. Hence, this paper focuses on identifying
section boundaries in the “middle” of songs. Inconsistent boundary annotations near the
beginnings and ends of songs in our dataset can be addressed in future work.
5.1 Our Method
We apply our piano roll-based method from Section 3 to our dataset, using K = 3 overtones
per note. Since section boundary events in our dataset are relatively sparse, we include
each positive example twice in each training epoch, while negative examples are included
only once per epoch.
Since we are interested in barwise section boundary classification and have access to
ground-truth measure positions via the MIDI files, we evaluate our approach using a
per-measure hit rate: for each measure in each MIDI file, we create an input patch
centered at that measure and ask the network to decide whether there is a section boundary
there. We compute precision, recall, and F1 score accordingly.

Model                          F1      Precision  Recall
Ours
  ensemble                     .7838   .8001      .7682
  our model                    .7675   .7704      .7647
  no pretraining               .7572   .7078      .8139
  no overtones                 .7593   .7140      .8108
  no overtones, no drum split  .7661   .7572      .7753
Analogous audio
  per-measure                  .5135   .6728      .4152
  0.5s tolerance               .5523   .5456      .5590
CBM [36] (audio)
  per-measure                  .4583   .4288      .4923
  0.5s tolerance               .4488   .4414      .4564
Table 2. Primary test results. F1 computed from section boundaries aggregated across Tubb
and non-Tubb files.

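The per-measure metrics can be computed directly from sets of measure indices. This is a minimal sketch with hypothetical function names (`f1_from`, `per_measure_prf`); returning zero when a denominator is empty is a convention chosen for this example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def per_measure_prf(predicted, reference):
    """Precision, recall, and F1 over measures flagged as section boundaries.

    `predicted` and `reference` are collections of measure indices.
    """
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall, f1_from(precision, recall)
```

As a consistency check, the precision and recall reported for "our model" in Table 2 (.7704 and .7647) give a harmonic mean of about .7675, matching the tabulated F1.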
For our approach, we consider a model’s prediction to be positive when the predicted
probability exceeds a fixed threshold, which we set to 0.5 for all of our model variants.
We experimented with more sophisticated post-processing techniques (including the
peak-picking method of [19]), but found that they did not improve the F1 score in our
setting.
5.2 Ablation
To isolate the contributions of individual components of our approach, we train four model
variants in an incremental ablation setup. Our final model is initialized with pretrained
weights (Section 3.3) from MobileNetV3 [32], and uses both overtone encoding (Section 3.2)
and drum track separation in the input representation. For the ablations, we experiment
with omitting only pretraining, only overtones, and both overtones and drum separation.
Additionally, we combined all four variants into a single bagged ensemble, averaging their
output probabilities at inference time.
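Averaging output probabilities across models, followed by the fixed 0.5 threshold from Section 5.1, can be sketched as follows. The function names are hypothetical, and each model is assumed to output one probability per measure, in the same measure order.

```python
def ensemble_probabilities(per_model_probs):
    """Average per-measure boundary probabilities across models.

    `per_model_probs` is a list with one probability list per model.
    """
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def threshold(probabilities, cutoff=0.5):
    """Flag a measure as a boundary when its probability exceeds the cutoff."""
    return [p > cutoff for p in probabilities]
```

Averaging probabilities before thresholding lets a confident model outweigh uncertain ones, which is not the case when averaging hard decisions.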
5.3 Audio-based Approaches
To compare our method with audio-based methods, we render the MIDI files in our dataset to
audio using FluidSynth [37] and the Arachno soundfont [38].
In addition to evaluating the following audio-based approaches on a per-measure F1 basis,
following the evaluation procedures of the Boundary Retrieval task in the Music
Information Retrieval Evaluation eXchange (MIREX),³ we also evaluate with F1 scores with
tolerances of ±0.5 seconds and ±3 seconds.
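Tolerance-based boundary scoring can be sketched as a one-to-one matching between estimated and reference boundary times. The code below uses a greedy two-pointer matching over sorted times; this is a simplification chosen for illustration, not necessarily the exact MIREX procedure, and the function name `tolerance_prf` is hypothetical.

```python
def tolerance_prf(estimated, reference, tolerance=0.5):
    """Precision, recall, and F1 where a hit is an estimate within `tolerance`
    seconds of a reference boundary; each boundary is matched at most once."""
    est, ref = sorted(estimated), sorted(reference)
    i = j = hits = 0
    while i < len(est) and j < len(ref):
        if abs(est[i] - ref[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif est[i] < ref[j]:
            i += 1  # this estimate is too early to match any remaining reference
        else:
            j += 1  # this reference boundary was missed
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Widening the tolerance from 0.5 to 3 seconds can only add hits, which is why the 3-second scores in the result tables are never lower than the 0.5-second scores for the same model.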
5.3.1 Supervised Audio Baseline
For the first audio-based baseline, we implement an approach that is analogous to our
MIDI-based approach and is similar to the approach in [19–21], replacing the piano rolls
in our model inputs with synthesized audio. As in [19–21], we extract mel-scaled magnitude
spectrograms from the synthesized audio using the same parameters. As in our symbolic
approach, we separate harmonic and percussive content into distinct channels. Instead of
applying HPSS as in [21], we render drums separately from the other instruments in each
file, giving us the cleanest possible separation.

³ https://www.music-ir.org/mirex/wiki/MIREX_HOME accessed 27 Mar 2025

As with our models, we finetune a pretrained MobileNetV3 [32] on these inputs. We
attempted to train the model only on patches centered on measure boundaries, as we did for
our model, but the training did not converge. Hence, we trained on patches centered on all
time frames of the input spectrogram as in [19–21]. In [19], the probability of sampling a
positive example during training was increased by a factor of three. We found this to hurt
the model’s performance in our case, so we omit this aspect in our implementation. The
remainder of the pipeline (including input resolution, training setup, and the proposed
peak-picking method for boundary extraction) is kept identical to [19], with the exception
of model bagging, which we also omit. Therefore, a fair comparison is between this model
and any of our individual models.
For evaluation on a per-measure basis, we provide the trained model only with inputs
centered at measure boundaries. In this setting, we evaluated both thresholding and
peak-picking, and found peak-picking to perform best on the non-Tubb files in our
validation set. Hence, we use peak-picking to post-process the outputs of this model for
all reported results.
5.3.2 Unsupervised Audio Baseline
For our second audio-based baseline we use the correlation block-matching (CBM)
segmentation algorithm [36], which is competitive with [21] on the RWC Pop dataset [6] and
marginally worse than [21] on SALAMI [4].
The CBM algorithm requires two parameters: the number of bands n and the penalty weight w
for the modulo-8 penalty function. The CBM algorithm also requires as input the list of
bar onset times, which we provide as extracted from our MIDI files. We performed a grid
search with n ∈ {7, 15} and w ∈ {0, 0.04, 0.25, 0.375, 0.5, 0.75, 1} and found that
n = 15, w = 0.25 works best on the non-Tubb files in our validation set. We therefore
apply the CBM algorithm with these parameters to the rendered audio of our test set. We
also apply the CBM algorithm to our test data without providing the list of bar onset
times, instead using the default bar-detection algorithm from their code (specifically,
the downbeat estimator from the madmom toolbox [39] together with the bar tracking model
from [40]).
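The grid search over the two CBM parameters can be sketched generically. The scoring function is supplied by the caller and stands in for "run CBM with (n, w) on the validation files and return the F1 score"; no CBM library call is shown, and the name `grid_search` is hypothetical.

```python
from itertools import product

N_BANDS = (7, 15)                                   # candidate values of n from the text
PENALTY_WEIGHTS = (0, 0.04, 0.25, 0.375, 0.5, 0.75, 1)  # candidate values of w


def grid_search(score_fn):
    """Return (best_score, n, w) over all 14 parameter combinations.

    `score_fn(n, w)` is a caller-supplied function returning a validation score;
    higher is better. Ties keep the first combination encountered.
    """
    best = None
    for n, w in product(N_BANDS, PENALTY_WEIGHTS):
        score = score_fn(n, w)
        if best is None or score > best[0]:
            best = (score, n, w)
    return best
```

A call such as `grid_search(my_validation_f1)` would then return the best-scoring pair, where `my_validation_f1` is whatever evaluation routine the reader has available.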
5.4 Results and Discussion
An overview of results is given in Table 2, with a more detailed breakdown between the
Tubb and non-Tubb files presented in Table 3. In these tables, “Analogous audio” refers to
the supervised audio baseline described in Section 5.3.1, and “CBM” refers to the
unsupervised audio baseline described in Section 5.3.2.

Non-Tubb files
Model                          F1      Precision  Recall
Ours
  ensemble                     .7160   .7320      .7007
  our model                    .6981   .7015      .6947
  no pretraining               .6974   .6413      .7644
  no overtones                 .6893   .6415      .7449
  no overtones, no drum split  .6905   .6879      .6931
Analogous audio
  per-measure                  .4435   .6729      .3309
  1 bar tolerance              .4635   .7031      .3457
  0.5s tolerance               .4466   .4440      .4493
  3s tolerance                 .6274   .6207      .6342
CBM [36] (audio)
  per-measure                  .5436   .4994      .5962
  1 bar tolerance              .6525   .6001      .7150
  0.5s tolerance               .4856   .4634      .5101
  3s tolerance                 .6290   .6010      .6597

Tubb files
Model                          F1      Precision  Recall
Ours
  ensemble                     .8559   .8722      .8401
  our model                    .8413   .8434      .8393
  no pretraining               .8234   .7844      .8665
  no overtones                 .8358   .7951      .8809
  no overtones, no drum split  .8457   .8288      .8633
Analogous audio
  per-measure                  .5678   .6728      .4912
  1 bar tolerance              .7730   .9159      .6687
  0.5s tolerance               .6424   .6313      .6538
  3s tolerance                 .7911   .7712      .8120
CBM [36] (audio)
  per-measure                  .3718   .3544      .3911
  1 bar tolerance              .6628   .6321      .6966
  0.5s tolerance               .4105   .4171      .4041
  3s tolerance                 .6919   .7040      .6802
Table 3. Breakdown of test results between Tubb and non-Tubb files.

Among our ablations, removing either overtone encoding or pretraining results in slight
drops in performance. Interestingly, the variant without both overtone encoding and drum
separation performs only marginally worse than the full model, suggesting that the core
piano roll representation already provides a strong foundation. Performance on the Tubb
files is higher than on the non-Tubb files for our models. As discussed in Section 4, the
relative style homogeneity of the Tubb subset, as well as its higher representation in the
training set, likely contribute to these results. Overall, these results indicate that
while each component contributes incrementally to performance, even the simpler variants
of our approach outperform strong baselines. Moreover, as in [19], ensemble averaging
provides a practical and effective strategy to boost performance.
The fairest comparison between our method and the audio-based baselines is on the non-Tubb
files in our test set, as these represent a wide range of musical genres and annotation
styles, and therefore likely better represent generalizability to unseen data. As a
secondary comparison, we compare results on the Tubb files in our test set as well.
Our model outperforms both audio-based baselines on both the non-Tubb and Tubb files in
our test set. On the non-Tubb files, the CBM algorithm outperforms the supervised audio
approach, with F1 scores of 0.5436 and 0.4435, respectively. This F1 score obtained by the
CBM algorithm is in line with the results obtained by its authors in [36], between their
results (with a 0.5 second tolerance) on the SALAMI (0.42) and RWC Pop (0.64) datasets.
When not supplying bar onset times to the CBM algorithm, performance was evaluated using
0.5 second and 3 second tolerances, and is similar to the performance obtained by
supplying the bar onset times. On the Tubb files, the supervised audio baseline
outperforms the CBM algorithm, with F1 scores of 0.5678 and 0.3718, respectively. The
performance of our approach was considerably higher, with an F1 score of 0.6981 on the
non-Tubb files and 0.8413 on the Tubb files in our test set (Table 3).
We note that even if we apply loose tolerances to the outputs of our audio-based baselines
(specifically, 1-bar or 3-second tolerances), the resulting F1 scores are still below
those achieved by our approach with strict tolerance.
6. CONCLUSION AND FUTURE WORK
We have introduced a new symbolic music dataset (the SLMS) for Music Structure Analysis,
containing 6134 human-annotated MIDI files. We extracted and manually curated this dataset
from the Lakh MIDI dataset. We used this dataset to train a CNN to detect section
boundaries in symbolic music, and have shown that our network outperforms both the
analogous audio-based learning approach and the competitive correlation block-matching
segmentation algorithm.
Our work was based on adapting the audio-based methods in [19–21] to MIDI data. The ideas
from the audio-based approach in [21] that we have not yet explored are the use of
two-level annotations (which are not present in our dataset) and the use of SSLMs as
additional inputs to the model [20]. Based on our results in this work and the improvement
in [21] over previous audio-based approaches, we do not expect that incorporating SSLMs
into the audio-based supervised approach implemented in this paper would close the wide
gap between its performance and the performance of our models. However, developing
MIDI-based SSLMs may improve the performance of our models. In future work, we plan to
clean and expand the SLMS via manual corrections, explore alternative model architectures,
and explore the use of MIDI-based SSLMs as model inputs.
7. REFERENCES
[1] S. Dai, H. Zhang, and R. B. Dannenberg, “Automatic Analysis and Influence of
Hierarchical Structure on Melody, Rhythm and Harmony in Popular Music,” in Proceedings of
the 2020 Joint Conference on AI Music Creativity (CSMC-MuMe), 2020.
[2] S. Bassan, Y. Adi, and J. Rosenschein, “Unsupervised Symbolic Music Segmentation using
Ensemble Temporal Prediction Errors,” in Interspeech, 2022, pp. 2423–2427.
[3] O. Nieto, G. J. Mysore, C.-i. Wang, J. B. L. Smith, J. Schlüter, T. Grill, and
B. McFee, “Audio-Based Music Structure Analysis: Current Trends, Open Challenges, and
Applications,” Transactions of the International Society for Music Information Retrieval,
Dec 2020.
[4] J. B. L. Smith, J. A. Burgoyne, I. Fujinaga, D. D. Roure, and J. S. Downie, “Design
and Creation of a Large-Scale Database of Structural Annotations,” in Proc. 12th Int.
Society for Music Information Retrieval Conf., Miami, Florida, USA, 2011, pp. 555–560.
[5] O. Nieto, M. McCallum, M. Davies, A. Robertson, A. Stark, and E. Egozy, “The Harmonix
Set: Beats, Downbeats, and Functional Segment Annotations of Western Popular Music,” in
Proc. 20th Int. Society for Music Information Retrieval Conf., 2019.
[6] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC Music Database: Popular,
Classical and Jazz Music Databases,” in Proc. 3rd Int. Conf. on Music Information
Retrieval. ISMIR, 2002.
[7] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, “RWC Music Database: Music Genre
Database and Musical Instrument Sound Database,” in Proc. 4th Int. Conf. on Music
Information Retrieval. ISMIR, 2003.
[8] M. Goto, “AIST Annotation for the RWC Music Database,” in Proc. 7th Int. Conf. on
Music Information Retrieval. ISMIR, 2006, pp. 359–360.
[9] C. Raffel, “Learning-Based Methods for Comparing Sequences, with Applications to
Audio-to-MIDI Alignment and Matching,” Ph.D. dissertation, 2016.
[10] C. Raffel, “The Lakh MIDI Dataset v0.1,” https://colinraffel.com/projects/lmd/.
[11] S. Dai, H. Yu, and R. B. Dannenberg, “What is missing in deep music generation? A
study of repetition and structure in popular music,” in Proc. of the 23rd Int. Society for
Music Information Retrieval Conf., Bengaluru, India, 2022.
[12] L. Casini and B. L. T. Sturm, “Tradformer: A Transformer Model of Traditional Music
Transcriptions,” in Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, IJCAI-22, L. D. Raedt, Ed. International Joint Conferences on
Artificial Intelligence Organization, 2022, pp. 4915–4920, AI and Arts. [Online].
Available: https://doi.org/10.24963/ijcai.2022/681
[13] S. Mossmyr, E. Hallström, B. L. Sturm, V. H. Vegeborn, and J. Wedin, “From Jigs and
Reels to Schottisar och Polskor: Generating Scandinavian-like Folk Music with Deep
Recurrent Networks,” in 16th Sound and Music Computing Conference (SMC2019), 2019.
[14] H. Chen, J. B. L. Smith, J. Spijkervet, J.-C. Wang, P. Zou, B. Li, Q. Kong, and
X. Du, “SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints,” in Proc.
25th Int. Society for Music Information Retrieval Conf., San Francisco, CA, United States,
2024, pp. 1029–1036.
[15] J. Foo e, “Au oma ic Audio Segmen a ion Using a
Measu e o Audio No el y,” in 2000 IEEE In-
e na ional Con e ence on Mul imedia and Expo.
ICME2000. P oceedings. La es Ad ances in he Fas
Changing Wo ld o Mul imedia (Ca . No.00TH8532),
ol. 1, 2000, pp. 452–455.
[16] Z. Wang*, K. Chen*, J. Jiang, Y. Zhang, M. Xu, S. Dai,
G. Bin, and G. Xia, “POP909: A Pop-Song Da ase
o Music A angemen Gene a ion,” in P oc. 21s In .
Socie y o Music In o ma ion Re ie al Con ., 2020.
[17] H. Schaffrath. (Accessed: 9 Mar 2025) The Essen
Folksong Collection. [Online]. Available:
https://kern.humdrum.org/cgi-bin/browse?l=/essen
[18] Z.-S. Lin, Y.-C. Kuo, T.-Y. Hung, W.-Y. Lin, Y.-H.
Chu, T.-K. Wang, J.-H. Huang, C. Chang, C. Julio,
G. Hsieh, and L. Su, “S3: A Symbolic Music Dataset
for Computational Music Analysis of Symphonies,” in
Extended Abstracts for the Late-Breaking Demo Session
of the 25th Int. Society for Music Information
Retrieval Conf., San Francisco, CA, USA, 2024.
[19] K. Ullrich, J. Schlüter, and T. Grill, “Boundary
Detection in Music Structure Analysis using Convolutional
Neural Networks,” in 15th Int. Soc. Music Information
Retrieval Conf., 2014.
[20] T. Grill and J. Schlüter, “Music boundary detection
using neural networks on spectrograms and
self-similarity lag matrices,” in 2015 23rd European Signal
Processing Conference (EUSIPCO), 2015,
pp. 1296–1300.
[21] T. Grill and J. Schlüter, “Music Boundary Detection
Using Neural Networks on Combined Features and
Two-Level Annotations,” in Proc. 16th Int. Society for
Music Information Retrieval Conf., 2015, pp. 531–537.
[22] M. C. McCallum, “Unsupervised Learning of Deep
Features for Music Segmentation,” in ICASSP 2019
- 2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2019, pp.
346–350.
[23] M. Buisson, B. McFee, S. Essid, and H.-C. Crayencour,
“Learning Multi-Level Representations for
Hierarchical Music Structure Analysis,” in Proceedings of
ISMIR 2022, Bengaluru, India, Dec. 2022. [Online].
Available: https://hal.science/hal-03780032
[24] M. Buisson, B. McFee, S. Essid, and H. C. Crayencour,
“Self-supervised learning of multi-level audio
representations for music segmentation,” IEEE/ACM
Transactions on Audio, Speech, and Language Processing,
vol. 32, pp. 2141–2152, 2024.
[25] M. Buisson, C. Ick, T. Xi, and B. McFee,
“Zero-Shot Structure Labeling with Audio And Language
Model Embeddings,” in Extended Abstracts for the
Late-Breaking Demo Session of the 25th International
Society for Music Information Retrieval Conference
(ISMIR), San Francisco, CA, United States, Nov. 2024.
[Online]. Available: https://hal.science/hal-04764247
[26] J.-C. Wang, Y.-N. Hung, and J. B. L. Smith, “To catch
a chorus, verse, intro, or anything else: Analyzing a
song with structural functions,” in ICASSP 2022 - 2022
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2022, pp. 416–420.
[27] T. Kim and J. Nam, “All-in-one metrical and
functional structure analysis with neighborhood attentions
on demixed audio,” in 2023 IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics
(WASPAA), 2023, pp. 1–5.
[28] J. Schlüter and S. Böck, “Improved musical onset
detection with convolutional neural networks,” in 2014
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), 2014,
pp. 6979–6983.
[29] M. E. P. Davies and S. Böck, “Temporal
convolutional networks for musical audio beat tracking,”
in 2019 27th European Signal Processing Conference
(EUSIPCO), 2019, pp. 1–5.
[30] S. Lattner, “SampleMatch: Drum Sample Retrieval by
Musical Context,” in Proc. of the 23rd Int. Society for
Music Information Retrieval Conf., Bengaluru, India,
2022, pp. 781–788.
[31] G. Argüello, L. A. Lanzendörfer, and R. Wattenhofer,
“Cue Point Estimation using Object Detection,” in
Proc. of the 25th Int. Society for Music Information
Retrieval Conf., 2024, pp. 405–412.
[32] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen,
M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan
et al., “Searching for MobileNetV3,” in Proceedings of
the IEEE/CVF International Conference on Computer
Vision, 2019, pp. 1314–1324.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and
L. Fei-Fei, “ImageNet: A large-scale hierarchical
image database,” in 2009 IEEE Conference on Computer
Vision and Pattern Recognition, 2009, pp. 248–255.
[34] M. Malandro, “Composer’s Assistant: An Interactive
Transformer for Multi-Track MIDI Infilling,” in Proc.
24th Int. Society for Music Information Retrieval Conf.,
Milan, Italy, 2023, pp. 327–334.
[35] I. Loshchilov and F. Hutter, “Decoupled weight
decay regularization,” in International Conference on
Learning Representations, 2019. [Online]. Available:
https://openreview.net/forum?id=Bkg6RiCqY7
[36] A. Marmoret, J. E. Cohen, and F. Bimbot, “Barwise
Music Structure Analysis with the Correlation
Block-Matching Segmentation Algorithm,” Transactions of
the International Society for Music Information
Retrieval, Nov. 2023.
[37] “FluidSynth,” accessed: 23 Mar 2025. [Online].
Available: https://www.fluidsynth.org/
[38] “Arachno SoundFont,” accessed: 23 Mar 2025.
[Online]. Available:
https://www.arachnosoft.com/main/download.php?id=soundfont-sf2
[39] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and
G. Widmer, “madmom: a new Python Audio and Music
Signal Processing Library,” in Proceedings of the
24th ACM International Conference on Multimedia,
Amsterdam, The Netherlands, Oct. 2016,
pp. 1174–1178.
[40] F. Krebs, S. Böck, and G. Widmer, “An Efficient
State-Space Model for Joint Tempo and Meter Tracking,” in
16th Int. Soc. for Music Information Retrieval Conf.,
2015.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025