
Simple and Effective Semantic Song Segmentation

Author: Filip Korzeniowski; Richard Vogl
Publisher: Zenodo
DOI: 10.5281/zenodo.17706573
Source: https://zenodo.org/records/17706573/files/000084.pdf
SIMPLE AND EFFECTIVE SEMANTIC SONG SEGMENTATION
Filip Korzeniowski∗ and Richard Vogl∗
Music AI
ABSTRACT
We propose a simple, yet effective approach to semantic song segmentation.
Our model is a convolutional neural network trained to jointly predict
frame-wise boundary activation functions and segment label probabilities.
The input features consist of a log-magnitude log-frequency spectrogram and
self-similarity lag matrices, combining modern deep learning approaches with
hand-crafted features.
To evaluate our approach, we first examine commonly used datasets and find
substantial overlap (up to 22%) between training and testing sets (SALAMI
vs. RWC-Pop). As this overlap invalidates meaningful comparisons, we propose
using the previously unexplored McGill Billboard dataset for testing. We
carefully eliminate duplicate entries between McGill Billboard and other
datasets through both audio fingerprinting and string-matching of song
titles and artist names. Using the resulting set of 719 tracks, we
demonstrate the effectiveness of our approach.
1. INTRODUCTION
Music Structure Analysis (MSA) is the task of dividing a piece of music into
non-overlapping segments that correspond to a human’s analysis or perception
of the structure of the piece. Such segments can optionally be labeled
semantically (for example, exposition, solo, chorus, intro, etc.), by
similarity (A, B, B′, etc.) or both. They can also be subdivided into finer
subsegments, forming a hierarchical segmentation of a piece [1].
Historically, MSA relied on hand-crafted features and careful design of
segment or boundary detection algorithms. These include methods that detect
changes in diagonals of self-similarity matrices [2, 3], hidden Markov
models [4], matrix factorization [5] or clustering algorithms [6, 7].
Recently, research has focused on models that detect segment boundaries and
predict segment labels from a fixed vocabulary using deep learning. Such
approaches often out-perform classical methods but require large amounts of
data to train on, e.g., datasets such as SALAMI [8] and Harmonix [9]. To
further expand the available training data, researchers explored
unsupervised [10, 11] and semi-supervised [12] training, as well as
partially labeled data [13] to improve their models.
* Equal contribution.
© F. Korzeniowski and R. Vogl. Licensed under a Creative Commons Attribution
4.0 International License (CC BY 4.0). Attribution: F. Korzeniowski and
R. Vogl, “Simple and Effective Semantic Song Segmentation”, in Proc. of the
26th Int. Society for Music Information Retrieval Conf., Daejeon, South
Korea, 2025.
Work in music segmentation has also explored strategies that leverage metric
learning. In particular, Salamon et al. [14] introduce an approach that
replaces handcrafted features with deep audio embeddings learned via
few-shot and auto-tagging frameworks. While such metric learning–based
techniques are attractive because they reduce annotation requirements and
open the door to unsupervised training, they usually perform sub-par
compared with supervised methods.
In this work, we focus on semantic segmentation of audio recordings of
mostly Western music. For simplicity, we consider only a single level of
boundary annotations and a simple, flat taxonomy of high-level functional
segment labels, such as verse, chorus, and bridge. We rely only on
supervised learning and consider scaling up through self-supervised learning
or partially labeled data as orthogonal avenues to explore in the future.
Furthermore, we examine the evaluation protocols used in previous work and
highlight three issues: first, datasets often used for “cross-dataset”
evaluation overlap significantly; second, papers often do not specify
whether they used “trimmed” metrics or not, and we found instances where
invalid comparisons have been made due to this factor; third, models are
typically trained using different datasets and dataset sizes, creating a
confounding factor besides the method itself. These issues invalidate clean,
direct comparisons between methods.
Our contributions can be summarized as:
• We propose a simple, yet effective approach to semantic song segmentation
which surpasses state-of-the-art results on a variety of datasets.
• We identify and review issues in evaluation related to both metrics and
data, and suggest ways to overcome some of them.
• We propose to use a dataset yet unexplored for structural segmentation as
unseen test set: the McGill Billboard dataset.
The remainder of the work is structured as follows: First we discuss
techniques in previous works relevant to this task in Section 2. In
Section 3 we explain our approach to MSA in detail, while in Section 4 we
discuss the used datasets, evaluation metrics, and the experimental setup.
Finally, we present and discuss our results in Section 5 and conclude the
paper in Section 6.
2. RELATED WORK
Ullrich et al. [15] pioneered supervised training of deep learning models
for boundary detection, improving the state-of-the-art detection F1 score by
40%. They also introduced a Gaussian loss weighting scheme to account for
annotation inaccuracy and oversampling to mitigate scarceness of positive
examples. Building upon this model, [16] proposed using self-similarity lag
matrices as additional input legs of the network, separating the input
spectrogram into harmonic and percussive parts, as well as training the
model using multiple sources and levels of ground truth available in the
SALAMI dataset.
Wang et al. [17] proposed adding a classification head for chorus detection,
which is trained jointly with segment boundary prediction on top of the same
deep learning backbone. They also introduced Hann-window smoothing to
account for annotation inaccuracy and widened positive boundary targets to
0.5s. Following up on their work, in [18], they extended the chorus
detection head to a larger set of section labels, used SpecTNT [19] as a
backbone, and adopted the Connectionist Temporal Localization (CTL) loss
[20] as an additional training objective.
Kim et al. [21] argue that MSA and beat-tracking are connected tasks and
propose a model that jointly predicts beats, downbeats, segment boundaries,
and labels. Their model requires a stem separation model for preprocessing
to split the input audio into four stems. Then, it processes them using a
convolutional front-end followed by transformer blocks with dilated
neighborhood attention to capture temporal and inter-instrument
dependencies. Finally, four classification heads predict the presence of
(down-)beats and segments. Among other optimization tricks, they employ
stochastic weight averaging (SWA) [22], which tends to improve the
generalization of a model by searching for wide local optima.
Chen et al. [23] introduce a transformer-in-transformer model inspired by
SpecTNT, aiming to analyze both spectral and long-term temporal
dependencies. Their model features alternating spectral and temporal encoder
layers with specialized multi-head self-attention (MHSA) mechanisms,
tailored to deal with the specificities of spectral and temporal data. They
claim improved performance on three datasets, enabled by the improved
ability of the system to analyze non-local dependencies in music.
Buisson et al. [24] redefine structure analysis as a pairwise link
prediction problem, where the system learns to classify pairs of beats as
belonging to the same structural segment or not. Their method integrates
graph attention networks (GATs) to combine link and node features,
leveraging a self-similarity matrix to capture temporal dependencies. The
model is further refined through MinCut regularization and a multi-task
training objective, enabling it to predict both segment boundaries and
section labels.
3. METHOD
We craft our segmentation model by integrating ideas from previous work into
a new model and training recipe. Our aim is to create a lightweight model
that is able to efficiently process large amounts of data while minimizing
computational cost and memory footprint. To achieve this, we deviate in two
ways from current trends in deep learning (which tend to produce
ever-so-large end-to-end models): first, we employ temporal convolutional
networks (TCN) instead of transformer-based models; secondly, we resort to
meaningful, hand-crafted features as additional input instead of relying on
increased model complexity to learn such features. See Fig. 1 for a model
overview.
[Figure 1: block diagram. Labels: Spectrogram, SSLM1, SSLM2; Front-End (×3);
1×1 Conv; 1×5 Max-Pool; 11× TCN block (2× Dilated Conv, ELU, Ch-Dropout 0.1,
Linear); Linear output heads; Boundary Prob; Label Probs.]
Figure 1. Model Overview. Three identical convolutional front-end networks
(cf. Table 1) pre-process each of the input features. The concatenated
result is down-sampled in time by 5 (1×5 max-pooling) and then fed into a
TCN stack. The TCN stack consists of 11 blocks of two parallel dilated
convolutions with increasing dilation rates, ELU activations, dropout and a
residual connection. The dilation rates start at 1 and 2 for the parallel
convolutions and are doubled for each of the 11 blocks.
3.1 Input Features
3.1.1 Spectrogram
The main input to the model is, similar to [25], a log-frequency
log-magnitude spectrogram. The magnitude spectrogram is computed from
44.1kHz audio using Hann-windows of size 2048 at 100 frames per second
(fps). We apply a filterbank of logarithmically spaced triangular filters
with 12 bands per octave between 30Hz and 17kHz, resulting in $F = 81$
frequency bands. Finally, the natural logarithm (adding $\epsilon = 10^{-6}$
for numerical stability) is applied to compress the magnitude, resulting in
a spectrogram denoted as $x_t$ for $t = 1 \ldots T$.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.1.2 Self-Similarity Lag Matrices
From this spectrogram, we extract two self-similarity lag matrices (SSLMs),
similar as described in [16]. First, the time axis is down-sampled by a
factor of four to 25 fps, using max-pooling (pooling factor $p = 4$,
$T' = T/p$). Then, the discrete cosine transform (DCT) is applied
frame-wise, discarding the DC coefficient (first bin). This results in a
MFCC-like representation of the signal.
Next, frames of this representation within a temporal context of $\pm C$
($C = 2$ equaling 0.08s of audio at 25 fps, 5 frames total) are stacked, to
build overlapping feature blocks $\hat{x}_t$. Using these feature blocks,
the cosine distance $d_{\cos}$ is calculated between feature blocks up to a
lag $L$ of 90 seconds ($L = 2250$ frames at 25 fps), resulting in a distance
matrix of size $T' \times L$.
$$D_{t,l} = d_{\cos}(\hat{x}_t, \hat{x}_{t-l}), \quad t = 1 \ldots T', \quad l = 1 \ldots L. \tag{1}$$
The resulting matrix is then normalized using an adaptive threshold
$\epsilon_{t,l}$, the mean of the quantile $Q_\kappa$ with $\kappa = 0.1$ of
the distances in rows $t$ and $t-l$ of $D$:
$$\epsilon_{t,l} = Q_\kappa(D_{t,1}, \ldots, D_{t,L}, D_{t-l,1}, \ldots, D_{t-l,L}). \tag{2}$$
The input signal is considered time-circular, thus indices $(t-l) < 1$ are
wrapped around to $t' = (t-l) + T'$.
After normalization, the sigmoid function $\sigma(\cdot)$ is applied for
smoothing, resulting in the final relationship matrix:
$$R_{t,l} = \sigma\left(1 - \frac{D_{t,l}}{\epsilon_{t,l}}\right). \tag{3}$$
For the two different SSLM matrices, the relationship matrix is down-sampled
in feature dimension by max-pooling. The first uses a kernel size of 6,
capturing local similarity, the other a kernel size of 22, capturing
longer-term dependencies. The final number of feature values is limited to
the closest 100 values. Both SSLM matrices are up-sampled again by a factor
of four in time dimension using bicubic interpolation, to match the original
100 fps.
The two SSLMs and the filtered spectrogram represent the three features used
as inputs to the convolutional front-end. While for some experiments in
[16], an additional harmonic percussive separation is applied to the
spectrogram, we found that this does not further improve performance. All
features share the same time resolution, which simplifies further
processing.
3.2 Convolutional Front-End
Each of the three input features is processed by a separate but identical
front-end module similar to those proposed in [26]: three convolutions with
20 kernels of size (3×3), (10×1), and (3×3), respectively, each followed by
an ELU activation [27], element-wise dropout [28] with probability 0.1, and
finally, (3×1) frequency-wise max-pooling. After each convolution, padding
along the time axis is applied to maintain its full length, while the
feature dimension is reduced to 1 by convolutions and max-pooling. See
Table 1 for a concise summary of the front-end modules.
Concatenating the three 20-channel outputs along the channel dimension
results in 60 channels, which are subsequently reduced to 30 using a (1×1)
convolution. The time axis is then down-sampled by a factor of 5 using
max-pooling. This reduces the frame rate from 100 to 20 fps, which is
sufficient for precise boundary detection while reducing the computational
load of the network’s backbone.
Convolutional Front-end
  2D Conv. (3×3), 20 channels w/ ELU
  Padding (0,1)
  Max-Pooling (3×1), 0.1 dropout
  Conv. Kernel (10×1), 20 channels w/ ELU
  Max-Pooling (3×1), 0.1 dropout
  Conv. Kernel (3×3), 20 channels w/ ELU
  Padding (0,1)
  Max-Pooling (3×1), 0.1 dropout
Table 1. Convolutional front-end configuration. Each block consists of a 2D
convolution layer with ELU activation followed by max-pooling and dropout.
No batch or other normalization is used. The time dimension (last) is always
padded to keep its size, while the frequency dimension is reduced from 81 to
size 1.
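The reduction of the frequency dimension from 81 to 1 can be checked with a few lines of arithmetic. The sketch below assumes that convolutions are unpadded along frequency, that the 10-tap kernel spans the frequency axis (as listed in Table 1), and that max-pooling uses a stride equal to its kernel size with floor rounding; these details are assumptions, not statements from the paper.

```python
def conv_out(size, kernel):
    """Axis length after an unpadded ('valid') convolution."""
    return size - kernel + 1


def pool_out(size, kernel):
    """Axis length after non-overlapping max-pooling (floor division)."""
    return size // kernel


def frontend_frequency_sizes(n_bands=81):
    """Frequency-axis length after every conv and pooling step of Table 1."""
    sizes = [n_bands]
    for conv_kernel in (3, 10, 3):      # frequency extent of the three kernels
        sizes.append(conv_out(sizes[-1], conv_kernel))
        sizes.append(pool_out(sizes[-1], 3))
    return sizes


print(frontend_frequency_sizes())  # [81, 79, 26, 17, 5, 3, 1]
```

Under these assumptions the frequency axis collapses to exactly one bin, so each front-end emits a 20-channel sequence over time.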
3.3 TCN Backbone
The combined features are processed through 11 sequential 1D TCN blocks (as
described in [29]). Each block contains two parallel dilated
1D-convolutional layers, each with 30 kernels of size 5. The dilation rate
of the second convolutional layer is twice that of the first, and these
rates double progressively across blocks, starting from one. The dilated
convolutions are followed by channel-wise dropout with probability 0.1 and
an ELU activation function. Within each block, the outputs of both
convolutions are concatenated and reduced from 60 to 30 channels using a
linear projection (implemented by a 1D-convolution with kernel size 1). A
residual connection, processed through another linear projection, is added
to the final output of the block.
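To get a feeling for the temporal context this backbone covers, the following sketch lists the dilation rates and estimates the receptive field. It assumes that the wider of the two parallel convolutions determines the context added by each block; the function names are illustrative only.

```python
def tcn_dilations(n_blocks=11):
    """Dilation pairs (first conv, second conv) per block: (1, 2), (2, 4), ..."""
    return [(2 ** i, 2 ** (i + 1)) for i in range(n_blocks)]


def receptive_field(n_blocks=11, kernel_size=5):
    """Receptive field in frames, driven by the larger dilation in each block."""
    frames = 1
    for _, wide_dilation in tcn_dilations(n_blocks):
        frames += (kernel_size - 1) * wide_dilation
    return frames


frames = receptive_field()
seconds = frames / 20  # backbone runs at 20 fps
print(frames, seconds)  # 16377 818.85
```

Under these assumptions the stack spans roughly 13.6 minutes of audio, which is longer than most pop songs, so every output frame can in principle see the whole track.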
3.4 Multi-Task Output
The output of the backbone network is passed to two separate output heads.
Each output head consists of a single linear layer followed by a sigmoid
activation. They project the 30-dimensional input into a single boundary
probability and eight probabilities corresponding to possible segment
labels, respectively.¹ Using these output heads, we obtain boundary and
label probabilities at 20 frames per second.
¹ In theory, the segment label output should be a single categorical
probability distribution, as labels are mutually exclusive; however, we
found in preliminary experiments that individual probabilities for each
segment label work better in practice.
3.5 Post-Processing
Finally, given the probability time series, we must determine the concrete
segment boundaries and assign segment labels. For segment boundaries, we use
the peak-picking method introduced in [15]. We empirically optimize the
detection threshold as well as the lengths of average- and max-filters used
for peak picking on the validation set, and choose the following values: a
symmetric 6s max filter, 16s pre-avg filter, 8s post-avg filter, and a
detection threshold of 0.2. To choose the segment labels, we follow [18] and
select the label with the highest average probability within segment
boundaries.
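The post-processing steps can be sketched as follows. This is a simplified illustration, not the implementation of [15] or [18]: the exact window alignment, the handling of plateaus, and the rounding of boundary times to frames are assumptions.

```python
def pick_boundaries(probs, fps=20, max_len=6.0, pre_avg=16.0, post_avg=8.0,
                    threshold=0.2):
    """Return boundary times (s): local maxima that exceed the local average."""
    n = len(probs)
    half_max = int(round(max_len * fps / 2))   # symmetric max filter
    pre = int(round(pre_avg * fps))            # frames averaged before the peak
    post = int(round(post_avg * fps))          # frames averaged after the peak
    peaks = []
    for i, p in enumerate(probs):
        lo, hi = max(0, i - half_max), min(n, i + half_max + 1)
        if p < max(probs[lo:hi]):
            continue                           # surpassed within the max window
        a, b = max(0, i - pre), min(n, i + post + 1)
        local_mean = sum(probs[a:b]) / (b - a)
        if p - local_mean >= threshold:
            peaks.append(i / fps)
    return peaks


def label_segments(boundaries, label_probs, labels, fps=20):
    """Assign each segment the label with the highest mean probability."""
    n = len(label_probs)
    edges = [0] + [int(round(b * fps)) for b in boundaries] + [n]
    segments = []
    for start, end in zip(edges[:-1], edges[1:]):
        if end <= start:
            continue
        means = [sum(frame[k] for frame in label_probs[start:end]) / (end - start)
                 for k in range(len(labels))]
        segments.append((start / fps, end / fps, labels[means.index(max(means))]))
    return segments


# Toy example: 20 s of activations with one clear boundary at 10 s.
boundary_probs = [0.0] * 400
boundary_probs[200] = 0.9
label_probs = [[0.8, 0.1]] * 200 + [[0.2, 0.7]] * 200
found = pick_boundaries(boundary_probs)
segments = label_segments(found, label_probs, ["verse", "chorus"])
```

Subtracting the asymmetric local average compensates for slowly varying activation levels, so the threshold acts on how much a peak stands out rather than on its absolute height.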
3.6 Training Setup
We train the model for a total of 100 epochs (each epoch is defined as 1000
updates) using stochastic gradient descent with a batch size of 1. We apply
look-ahead optimization [30] with the RAdam update rule [31] using an
initial learning rate of 0.002, which is divided by 5 after 60 epochs. We
also clip gradients at a norm of 0.5.
To improve the generalization of the final model, we use stochastic weight
averaging [22]. Starting from epoch 70, we increase the learning rate over
10 epochs to 0.001 using a cosine schedule and continue training for another
20 epochs. The final model weights are the average weights observed during
the last 30 epochs.
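The learning-rate schedule described above can be written as a small function. The sketch assumes zero-based epoch indices, per-epoch (not per-update) granularity, and a half-cosine ramp shape; the paper only states that a cosine schedule is used.

```python
import math


def learning_rate(epoch, base_lr=0.002, drop_epoch=60, drop_factor=5.0,
                  swa_start=70, ramp_epochs=10, swa_lr=0.001):
    """Learning rate for a zero-based epoch index (assumed schedule shape)."""
    if epoch < drop_epoch:
        return base_lr                      # epochs 0..59
    low = base_lr / drop_factor             # 0.0004 after the step decay
    if epoch < swa_start:
        return low                          # epochs 60..69
    # Half-cosine ramp from `low` up to `swa_lr`, then constant for SWA.
    progress = min(1.0, (epoch - swa_start) / ramp_epochs)
    return low + (swa_lr - low) * 0.5 * (1.0 - math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(100)]
```

With these values the rate stays at 0.002 for 60 epochs, drops to 0.0004, and rises again to 0.001 between epochs 70 and 80, after which it stays constant while the weights are averaged.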
3.7 Training Criterion
We employ binary cross-entropy as the loss function for both segment labels
and boundaries. Given the scarcity of segment boundaries, we enhance
training by smoothing the frame-wise targets over time with an exponential
kernel of size 3 (0.15s). Additionally, we weigh positive targets by a
factor of 2.
We also observed that segment labels provide a stronger learning signal than
segment boundaries. Consequently, simply summing the losses results in lower
boundary detection accuracy. To address this, we apply a weight of 15 to the
boundary loss. This adjustment ensures effective boundary detection with
minimal impact on label accuracy.
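A minimal sketch of this training criterion is given below. Several details are assumptions rather than facts from the paper: the exact shape of the 3-frame exponential kernel (here `exp(-1), 1, exp(-1)`), the way smoothed targets are combined (maximum), and applying the positive weight as a multiplier on the target term of the cross-entropy.

```python
import math

# Hypothetical 3-frame exponential kernel centred on the annotated frame.
EXP_KERNEL = (math.exp(-1.0), 1.0, math.exp(-1.0))


def smooth_targets(targets, kernel=EXP_KERNEL):
    """Spread each positive frame to its neighbours (element-wise maximum)."""
    half = len(kernel) // 2
    out = [0.0] * len(targets)
    for i, value in enumerate(targets):
        if value <= 0:
            continue
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(out):
                out[j] = max(out[j], value * weight)
    return out


def weighted_bce(probs, targets, pos_weight=2.0, eps=1e-7):
    """Mean binary cross-entropy with the positive term scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)     # avoid log(0)
        total -= pos_weight * y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


def total_loss(boundary_loss, label_loss, boundary_weight=15.0):
    """Weighted sum of the two task losses."""
    return boundary_weight * boundary_loss + label_loss


smoothed = smooth_targets([0.0, 0.0, 1.0, 0.0, 0.0])
example_loss = weighted_bce([0.5], [1.0])
```

The large boundary weight compensates for boundary frames being rare compared with label frames, which would otherwise dominate the summed loss.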
Finally, to deal with double annotations in the SALAMI dataset, we follow
[16] and use both annotation variants per track by duplicating the audio and
associating it with one of the annotation variants each. Both samples are
used in each epoch.
4. EXPERIMENT SETUP
4.1 Training & Evaluation Data
A variety of datasets have been used for training and evaluating structural
segmentation systems. Among the most popular datasets are Beatles [32],
Harmonix [9], RWC [33], SALAMI [8], and Queen [32].
These datasets were gathered over extended periods by independent research
groups. As a result, they were annotated using diverse and often
non-transparent guidelines. For example, while boundaries are aligned with
downbeats in the Harmonix dataset, no such alignment is evident in SALAMI
[34]. Similarly, the definitions of segment labels can vary between
datasets. Consequently, some datasets may arguably represent different tasks
(e.g. “find downbeat-aligned segment boundaries” vs. “find segment
boundaries whenever a new phrase starts”).
Additionally, some datasets overlap significantly. For instance, SALAMI was
designed to contain tracks from both RWC and Isophonics. As a result, it
includes 22 of the 100 tracks in RWC-Pop, as well as 21 out of 180 songs in
the Beatles dataset.
All of this raises concerns about the reliability of cross-dataset
evaluation, which is intended to assess a model’s ability to generalize.
However, if training and test sets overlap significantly, key assumptions of
machine learning theory are compromised, and thus, evaluations become
invalid. Similarly, if the test set represents a different task, the results
do not reflect generalization to different music.
In this paper, we aim at building a model that works well for a wide variety
of western-style pop music and evaluate it in a methodologically correct
way. To this end, we employ 8-fold cross-validation on a mix of all the
datasets mentioned above. To prevent data bleeding between test and training
sets, we remove duplicates in the data, keeping the annotations of smaller
datasets (e.g., we remove RWC and Beatles songs from SALAMI). We share our
cross-validation partitions, annotations and model predictions online², in
order to facilitate more rigorous and reliable evaluations in future
research.
For training, we use the original annotations from the Beatles dataset.
Previous works typically rely on the “corrected” versions provided by TU
Tampere and UPF. However, upon qualitative examination, we did not find
these annotations to be “better” than the original ones (often to the
contrary). This is supported by the results shown in Sec. 5, where
evaluation using the original annotations typically yields higher scores,
which indicates that the original annotations are more predictable (and
thus, one may argue, more coherent) than the TUT ones.
4.2 Hold-Out Test Set
For comparison with existing work, we propose to use a dataset that has not
been featured in the context of music segmentation at all: the McGill
Billboard dataset [35]. Although this dataset has seen widespread usage for
chord recognition and key identification, its structural segmentation
annotations have not been utilized for training or evaluation. We remove
duplicate tracks and eliminate overlap with the datasets mentioned
previously using string-matching and audio fingerprinting methods, resulting
in 719 tracks for testing.
² https://github.com/fdlm/ismir2025
     Keyword      Label          Keyword      Label
 1   silence      silence    15  solo         solo
 2   prechorus    verse      16  pre-chorus   verse
 3   chorus       chorus     17  refrain      chorus
 4   stutter      chorus     18  theme        chorus
 5   rap          verse      19  verse        verse
 6   slow         verse      20  section      verse
 7   dialog       verse      21  build        verse
 8   fadein       intro      22  intro        intro
 9   bridge       bridge     23  opening      intro
10   out          outro      24  trans        bridge
11   ending       outro      25  coda         outro
12   inst         inst       26  break        inst
13   impro        inst       27  interlude    inst
14   end          silence    28  guitars      inst
                             29               inst
Table 2. Segment label normalization. If an incoming label contains the
keyword, it is mapped to the label provided in the second column. The order
of the list indicates priority. Compared to the method from [18], we
introduced a “solo” label, and added more mappings.
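The keyword mapping of Table 2 amounts to a first-match lookup. The sketch below assumes that keywords are tried in ascending row number, that matching is a case-insensitive substring test, and that labels matching no keyword fall back to `inst` (the keyword of row 29 is blank here, so this fallback is an assumption); the authors' actual tie-breaking may differ.

```python
# (keyword, label) pairs in the assumed priority order of Table 2, rows 1-28.
KEYWORD_TO_LABEL = [
    ("silence", "silence"), ("prechorus", "verse"), ("chorus", "chorus"),
    ("stutter", "chorus"), ("rap", "verse"), ("slow", "verse"),
    ("dialog", "verse"), ("fadein", "intro"), ("bridge", "bridge"),
    ("out", "outro"), ("ending", "outro"), ("inst", "inst"),
    ("impro", "inst"), ("end", "silence"), ("solo", "solo"),
    ("pre-chorus", "verse"), ("refrain", "chorus"), ("theme", "chorus"),
    ("verse", "verse"), ("section", "verse"), ("build", "verse"),
    ("intro", "intro"), ("opening", "intro"), ("trans", "bridge"),
    ("coda", "outro"), ("break", "inst"), ("interlude", "inst"),
    ("guitars", "inst"),
]


def normalize_label(raw_label, default="inst"):
    """Map a raw annotation label to the 8-class vocabulary (first match wins)."""
    text = raw_label.lower()
    for keyword, label in KEYWORD_TO_LABEL:
        if keyword in text:
            return label
    return default  # hypothetical fallback for labels without any keyword


examples = {raw: normalize_label(raw)
            for raw in ["Chorus A", "Guitar Solo", "Outro", "Verse 1", "Intro"]}
```

Because matching is by substring, the order of the list matters: a label is assigned by the first keyword it contains, not by the most specific one.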
4.3 Evaluation Metrics
4.3.1 Segment Boundaries
We use a subset of the default metrics for semantic segmentation, computed
using mir_eval [36]. For boundaries, we use F1 score with a tolerance of
0.5s. We opt for the trimmed version, i.e. we ignore the first and last
“boundaries”, which correspond to the beginning and end times of a track.
These “boundaries” are semantically void and, to our belief, should not be
evaluated, as they inflate detection results by up to 9% in F1 score (see
Table 3).
Unfortunately, there is no clear consensus in previous work on whether to
use trimmed or un-trimmed metrics to report results. Notably, the default
parameters in the mir_eval library disable trimming, and therefore any paper
that uses the library with default values reports inflated metrics. To
complicate things further, many papers do not even report whether they used
trimmed metrics or not. To ensure the best comparability in this work, we
indicate if trimmed metrics were used by reference papers, and if not
specified in the original paper, we examined the source code (if available)
or contacted the authors to obtain that information.
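The effect of trimming can be illustrated with a small, self-contained hit-rate computation. This is a simplified sketch and not mir_eval's implementation: it matches sorted boundaries greedily within the tolerance window, and the example boundary times are made up for illustration.

```python
def count_hits(reference, estimate, window):
    """Greedy one-to-one matching of sorted boundary times within +-window."""
    i = j = hits = 0
    while i < len(reference) and j < len(estimate):
        if abs(reference[i] - estimate[j]) <= window:
            hits += 1
            i += 1
            j += 1
        elif estimate[j] < reference[i]:
            j += 1
        else:
            i += 1
    return hits


def boundary_f1(reference, estimate, window=0.5, trim=False):
    """Boundary hit-rate F1; trim=True drops the first and last boundary."""
    reference, estimate = sorted(reference), sorted(estimate)
    if trim:
        reference, estimate = reference[1:-1], estimate[1:-1]
    if not reference or not estimate:
        return 0.0
    hits = count_hits(reference, estimate, window)
    precision, recall = hits / len(estimate), hits / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ref = [0.0, 10.0, 20.0, 30.0]   # includes track start and end
est = [0.0, 10.2, 25.0, 30.0]
untrimmed = boundary_f1(ref, est)            # 0.75
trimmed = boundary_f1(ref, est, trim=True)   # 0.5
```

In this toy case the trivially correct start and end "boundaries" lift the score from 0.5 to 0.75, which is the kind of inflation the trimmed metric avoids.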
4.3.2 Segment Labels
For segment labels, we use normalized conditional entropy (NCE), and label
accuracy. For the latter, we compute the total duration of correct label
predictions within a track and divide it by the length of the track. We
extend the label vocabulary proposed in [18] by a solo label that indicates
instrumental solos, in contrast to generic instrumental sections, resulting
in an 8-class vocabulary consisting of bridge, chorus, inst, intro, outro,
silence, solo, and verse. We also extend the normalization of the original
labels by introducing new mappings to the method presented in [18], see
Table 2.
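Label accuracy as defined above can be computed directly from labeled intervals. The sketch assumes segments are given as `(start, end, label)` tuples that do not overlap within one list, and that the track length is the extent of the reference annotation; both are assumptions for illustration.

```python
def label_accuracy(ref_segments, est_segments):
    """Fraction of the track duration where the predicted label is correct."""
    track_start = min(start for start, _, _ in ref_segments)
    track_end = max(end for _, end, _ in ref_segments)
    correct = 0.0
    for r_start, r_end, r_label in ref_segments:
        for e_start, e_end, e_label in est_segments:
            if r_label != e_label:
                continue
            overlap = min(r_end, e_end) - max(r_start, e_start)
            if overlap > 0:
                correct += overlap
    return correct / (track_end - track_start)


reference = [(0, 10, "intro"), (10, 30, "verse"), (30, 40, "chorus")]
estimate = [(0, 12, "intro"), (12, 30, "verse"), (30, 40, "verse")]
accuracy = label_accuracy(reference, estimate)  # 28 s correct of 40 s -> 0.7
```

Here 10 s of intro and 18 s of verse are labeled correctly, while the mislabeled chorus and the 2 s of late intro count as errors.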
4.3.3 Multiple Annotations
The SALAMI dataset includes two human-generated annotations for most tracks.
Because both are human-labeled, both should be regarded as “correct”
segmentations according to the task definition. Therefore, if a model
accurately predicts either annotation, it should achieve the highest
possible score for that track, regardless of any disagreement with the other
annotation. To capture this, we evaluate model predictions against all
annotations, group results by track, and report the highest score for each
metric.³
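The aggregation over multiple annotations reduces to a per-metric maximum, as in the following sketch. The dictionary layout and the metric names are illustrative assumptions; note that the maximum is taken independently for each metric, so the best values may come from different annotations.

```python
def best_scores(scores_per_annotation):
    """Per-metric maximum over all annotations of one track."""
    metrics = scores_per_annotation[0].keys()
    return {m: max(scores[m] for scores in scores_per_annotation) for m in metrics}


# Hypothetical scores of one prediction against two annotations of a track.
track_scores = [
    {"f1_trimmed": 0.6, "accuracy": 0.7},
    {"f1_trimmed": 0.8, "accuracy": 0.5},
]
best = best_scores(track_scores)
```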
5. RESULTS AND DISCUSSION
5.1 Overall Results
Table 3 presents the overall results across all datasets used for
cross-validation in this study (except Queen due to its limited size). As
indicated earlier, there are some caveats to consider when interpreting
these results. These caveats are not specific to this study, but for
transparency we deliberately want to draw attention to them.
First, most papers use different training sets and setups for reporting the
results. For example, in the case of Harmonix, although most studies perform
cross-validation, they use a different number of partitions (4 in [13, 18],
8 in others) and/or include additional training data like in [13, 18] and
our work, while others only use Harmonix itself. The effect of using
additional training data is not straightforward to assess: on the one hand,
additional data for training often improves generalization and performance
of a model; on the other hand, including other datasets might shift the data
distribution in a way that reduces results for one particular dataset,
especially if annotation guidelines or genre distributions vary. Second, as
mentioned in previous sections, some datasets overlap; this can indicate
that even in “cross-dataset” evaluations, which purport to indicate the
generalization of a model, there might be train-test overlap, which could
lead to inflated results. Finally, evaluation methods may differ subtly in
their implementation or the hyper-parameters used. We already discussed the
example of F1 boundary hit rate, where trimming has a huge impact on the
absolute results.
In Tab. 3, we indicate which setup was used by each method (see the table
caption for details). We also indicate which results were taken directly
from the original paper, and which were re-produced using publicly available
inference code and models and our own evaluation code (which is based on
mir_eval). We also denote for which comparisons there is a train/test
overlap which may inflate results. While we believe that these measures
increase transparency, they do not solve the underlying issues which lead to
limited comparability in the first place.
With all the caveats in mind, we can still identify tendencies in the
metrics. Our proposed model performs best in detecting boundaries for all
datasets except RWC-Pop, and exhibits the highest label prediction accuracy.
Notably, in terms of NCE, LinkSeg [24] often out-performs our model,
indicating that while our proposed method predicts the correct label more
often, LinkSeg offers more consistent labeling (disregarding its semantic
meaning). This indicates that the graph-link approach is better capable of
identifying which sections are similar, even if the assigned label is
incorrect.
Strikingly, LinkSeg outperforms every compared method on the RWC-Pop dataset
by a large margin in boundary detection F1 and label NCE. A more thorough,
qualitative examination of the predicted segments may be required to
determine why their approach lends itself so well to this particular
dataset.
³ For reference, using the average instead of max reduces F1 by 5%.

Method           ○  Setup   F1 (t)  F1      NCE     Acc
Harmonix: 911 tracks
[16] G&S         ○  CD      0.578   0.644   0.717   -
[18] SpecTNT        CV-4+   -       0.570   0.714   0.701
[13] MuSFA          CV-4+   -       0.595   -       0.714
[24] LinkSeg        CV-8            -       0.772   0.742
[21] All-In-1       CV-8    -       0.660   0.769   -
[21] All-In-1    ○  CV-8    0.583   0.646   0.740   0.729
Proposed            CV-8+   0.630   0.682   0.790   0.773
RWC-Pop: 100 tracks
[16] G&S         ○  CD      0.507   0.571   0.744   -
[18] SpecTNT        CD      -       0.623   0.728   0.675
[13] MuSFA          CD      -       0.643   -       0.677
[24] LinkSeg        CD      0.648   -       0.812   0.747
[23] MF-Sim         CD      -       0.570   -       0.589
[21] All-In-1    ○  CD      0.557   0.613   0.727   0.720
Proposed            CV-8+   0.557   0.608   0.729   0.770
Beatles (TUT): 174 tracks
[16] G&S         ○  CD      0.457   0.566   0.659   -
[23] MF-Sim         CD      -       0.521   -       0.495
[24] LinkSeg     ○  CD      0.463   0.559   0.747   0.495
[21] All-In-1    ○  CD      0.437   0.549   0.639   0.439
Proposed            CV-8    0.549   0.626   0.721   0.598
Beatles (Orig): 180 tracks
[16] G&S         ○  CD      0.550   0.639   0.674   -
[24] LinkSeg     ○  CD      0.467   0.562   0.741   0.485
[21] All-In-1    ○  CD      0.455   0.563   0.637   0.450
Proposed            CV-8    0.613   0.681   0.719   0.597
SALAMI-Pop (Clean): 191 tracks
[18] SpecTNT        CD      -       0.490   0.632   0.544
[13] MuSFA          CD      -       0.532   -       0.551
[23] MF-Sim         CD      0.505   -       -       0.497
[24] LinkSeg     ○  CD      0.503   0.584   0.743   0.575
[21] All-In-1    ○  CD      0.507   0.596   0.700   0.545
Proposed            CV-8+   0.607   0.674   0.731   0.682
SALAMI (Clean): 1239 tracks
[24] LinkSeg     ○  CD      0.413   0.494   0.694   0.467
[21] All-In-1    ○  CD      0.415   0.507   0.659   0.426
Proposed            CV-8+   0.555   0.632   0.720   0.614
SALAMI (Clean, G&S Test): 452 tracks
[16] G&S         ○  TT      0.519   0.603   0.664   -
[24] LinkSeg     ○  CD      0.412   0.488   0.703   0.479
[21] All-In-1    ○  CD      0.407   0.496   0.673   0.462
Proposed            CV-8+   0.554   0.628   0.732   0.629
Table 3. Overall results. CD indicates a cross-dataset setup, CV-N means
cross-validation with N partitions, TT indicates a train-test split,
+ indicates usage of additional data, and a separate marker indicates
train/test overlap. Results of methods marked with ○ are calculated using
original checkpoints and inference code. F1 is shown for reference only, and
F1 (t) should be considered instead. For SALAMI (Clean), we removed
overlaps. SALAMI (Clean, G&S Test) is the intersection of SALAMI (Clean) and
the test set used in [16].

Method           F1 (t)  NCE     Acc
[16] G&S         0.569   0.678   -
[24] LinkSeg     0.461   0.752   0.629
[21] All-In-1    0.491   0.683   0.590
Proposed         0.647   0.754   0.668
Table 4. Results on McGill Billboard dataset (719 tracks).
5.2 McGill Billboard Results
To circumvent some of the issues discussed above, we propose to use the
McGill Billboard dataset, which has been used for neither evaluation nor
training in any of the methods we compare against. Here, we can only compare
methods for which we have access to the inference code and models. The
results are shown in Tab. 4.
On this dataset, our proposed model clearly out-performs the compared
methods. Interestingly, even the decade-old G&S model from [16] gives better
results than the more modern All-In-1 and LinkSeg, mimicking similar
tendencies seen in Tab. 3 for Harmonix (G&S > LinkSeg), Beatles (TUT)
(G&S > All-In-1), Beatles (Orig) (G&S > LinkSeg and All-In-1), and SALAMI
(Clean, G&S Test) (G&S > LinkSeg and All-In-1).
6. CONCLUSION
In this paper, we introduced a simple, yet effective method for semantic
song segmentation based on a multi-leg TCN architecture that combines a raw
log-frequency log-magnitude spectrogram and hand-crafted self-similarity lag
matrices to predict segment boundaries and labels.
We identified, explored, and proposed remedies to challenges in evaluation,
most notably train/test overlap and inconsistent configurations of
evaluation metrics. With this in mind, we evaluated and compared our model
against state-of-the-art methods on a wide variety of datasets. We also
attempted to provide a cleaner comparison by proposing to use the hitherto
unused McGill Billboard dataset as test set, for which we eliminated
overlaps with existing datasets used for training. Across these scenarios,
our method yields superior performance on most datasets.
Future work could explore the addition of link-prediction methods like the
ones used in [24], as they have been shown to achieve promising results in
terms of label prediction consistency. Also, scaling up data using partially
labeled datasets may further improve results.
7. REFERENCES
[1] O. Nieto, G. J. Mysore, C.-i. Wang, J. B. L. Smith, J. Schlüter,
T. Grill, and B. McFee, “Audio-Based Music Structure Analysis: Current
Trends, Open Challenges, and Applications,” Transactions of the
International Society for Music Information Retrieval, vol. 3, no. 1,
Dec. 2020.
[2] M. Müller, “Audio Structure Analysis,” in Information Retrieval for
Music and Motion, ser. SpringerLink: Springer e-Books. Berlin, Heidelberg:
Springer-Verlag, 2007.
[3] J. Paulus, M. Müller, and A. Klapuri, “Audio-based Music Structure
Analysis,” in Proceedings of the 11th International Conference on Music
Information Retrieval (ISMIR), Utrecht, Netherlands, Sep. 2010.
[4] M. Levy and M. Sandler, “Structural Segmentation of Musical Audio by
Constrained Clustering,” IEEE Trans. Audio Speech Lang. Process., vol. 16,
no. 2, Feb. 2008.
[5] O. Nieto and T. Jehan, “Convex Non-Negative Matrix Factorization for
Automatic Music Structure Identification,” in Proceedings of the 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP), Vancouver, Canada, May 2013.
[6] O. Nieto and J. P. Bello, “Music Segment Similarity Using 2D-Fourier
Magnitude Coefficients,” in Proceedings of the 2014 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence,
Italy, May 2014.
[7] B. McFee and D. Ellis, “Analyzing Song Structure with Spectral
Clustering,” in Proceedings of 15th International Society for Music
Information Retrieval Conference (ISMIR), Taipei, Taiwan, Oct. 2014.
[8] J. B. L. Smith, J. A. Burgoyne, I. Fujinaga, D. D. Roure, and
J. S. Downie, “Design and Creation of a Large-Scale Database of Structural
Annotations,” in Proceedings of the 12th International Society for Music
Information Retrieval Conference (ISMIR), Miami, USA, Oct. 2011.
[9] O. Nieto, M. McCallum, M. E. P. Davies, A. Robertson, A. Stark, and
E. Egozy, “The HARMONIX Set: Beats, Downbeats, and Functional Segment
Annotations of Western Popular Music,” in Proceedings of the 20th
International Society for Music Information Retrieval Conference (ISMIR),
Delft, The Netherlands, Nov. 2019.
[10] M. C. McCallum, “Unsupe ised Lea ning o Deep
Fea u es o Music Segmen a ion,” in P oceedings o
he 2019 IEEE In e na ional Con e ence on Acous ics,
Speech and Signal P ocessing (ICASSP), B igh on,
Uni ed Kingdom, May 2019.
[11] M. Buisson, B. McFee, S. Essid, and H. C. Crayencour,
“A Repetition-Based Triplet Mining Approach for
Music Segmentation,” in Proceedings of the 24th
International Society for Music Information Retrieval
Conference (ISMIR), Milan, Italy, Nov. 2023.
[12] Y.-N. Hung, J.-C. Wang, M. Won, and D. Le, “Scaling
Up Music Information Retrieval Training with
Semi-Supervised Learning,” arXiv, vol. arXiv:2310.01353,
Oct. 2023.
[13] J.-C. Wang, J. B. L. Smith, and Y.-N. Hung, “MuSFA:
Improving Music Structural Function Analysis with
Partially Labeled Data,” in Late-Breaking/Demo
Session of the 23rd International Society for Music
Information Retrieval Conference (ISMIR), Bengaluru,
India, Dec. 2022.
[14] J. Salamon, O. Nieto, and N. J. Bryan, “Deep
embeddings and section fusion improve music segmentation,”
in Proceedings of the 22nd International Society for
Music Information Retrieval Conference (ISMIR),
Online, Nov. 2021.
[15] K. Ullrich, J. Schlüter, and T. Grill, “Boundary
Detection in Music Structure Analysis Using Convolutional
Neural Networks,” in Proceedings of the 15th
International Society for Music Information Retrieval
Conference (ISMIR), Taipei, Taiwan, Oct. 2014.
[16] T. Grill and J. Schlüter, “Music Boundary Detection
using Neural Networks On Combined Features and
Two-Level Annotations,” in Proceedings of the 16th
International Society for Music Information Retrieval
Conference (ISMIR), Málaga, Spain, Oct. 2015.
[17] J.-C. Wang, J. B. L. Smith, J. Chen, X. Song, and
Y. Wang, “Supervised Chorus Detection for Popular
Music Using Convolutional Neural Network and
Multi-Task Learning,” in Proceedings of the 2021
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), Toronto, Canada,
Apr. 2021.
[18] J.-C. Wang, Y.-N. Hung, and J. B. L. Smith, “To Catch
a Chorus, Verse, Intro, or Anything Else: Analyzing
a Song with Structural Functions,” in Proceedings of
the 2022 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Singapore,
May 2022.
[19] W.-T. Lu, J.-C. Wang, M. Won, K. Choi, and X. Song,
“SpecTNT: A Time-Frequency Transformer for Music
Audio,” in Proceedings of the 22nd International
Society for Music Information Retrieval Conference
(ISMIR), Online, Nov. 2021.
[20] Y. Wang and F. Metze, “Connectionist Temporal
Localization for Sound Event Detection with
Sequential Labeling,” in Proceedings of the 2019 IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP), Brighton, United Kingdom,
May 2019.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[21] T. Kim and J. Nam, “All-In-One Metrical and
Functional Structure Analysis with Neighborhood
Attentions on Demixed Audio,” in IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics
(WASPAA), New Paltz, USA, Oct. 2023.
[22] P. Izmailov, D. Podoprikhin, T. Garipov, D. Vetrov,
and A. G. Wilson, “Averaging Weights Leads to Wider
Optima and Better Generalization,” in Proceedings of
the 34th Conference on Uncertainty in Artificial
Intelligence (UAI), Monterey, USA, Aug. 2018.
[23] T.-P. Chen and K. Yoshii, “Learning Multifaceted
Self-Similarity over Time And Frequency for Music
Structure Analysis,” in Proceedings of the 25th International
Society for Music Information Retrieval Conference
(ISMIR), San Francisco, USA, Nov. 2024.
[24] M. Buisson, B. McFee, and S. Essid, “Using Pairwise
Link Prediction and Graph Attention Networks
for Music Structure Analysis,” in Proceedings of the
25th International Society for Music Information
Retrieval Conference (ISMIR), San Francisco, USA, Nov.
2024.
[25] M. E. P. Davies and S. Böck, “Temporal Convolutional
Networks for Musical Audio Beat Tracking,” in
Proceedings of the 2019 27th European Signal Processing
Conference (EUSIPCO), A Coruña, Spain, Sep. 2019.
[26] S. Böck and M. E. P. Davies, “Deconstruct, Analyze,
Reconstruct: How to Improve Tempo, Beat, and
Downbeat Tracking,” in Proceedings of the 21st
International Society for Music Information Retrieval
Conference (ISMIR), Montréal, Canada, Oct. 2020.
[27] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast
and Accurate Deep Network Learning by Exponential
Linear Units (ELUs),” in Proceedings of the
International Conference on Learning Representations
(ICLR), San Juan, Puerto Rico, May 2016.
[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever,
and R. Salakhutdinov, “Dropout: A Simple Way to
Prevent Neural Networks from Overfitting,” The Journal
of Machine Learning Research, vol. 15, no. 56, Jun.
2014.
[29] S. Böck, M. E. P. Davies, and P. Knees, “Multi-Task
Learning Of Tempo And Beat: Learning One To
Improve The Other,” in Proceedings of the 20th
International Society for Music Information Retrieval
Conference (ISMIR), Delft, The Netherlands, Nov. 2019.
[30] M. R. Zhang, J. Lucas, G. Hinton, and J. Ba,
“Lookahead Optimizer: k steps forward, 1 step back,” in
Advances in Neural Information Processing Systems 32
(NeurIPS 2019), Vancouver, Canada, 2019.
[31] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and
J. Han, “On the Variance of the Adaptive Learning Rate
and Beyond,” in Proceedings of the 8th International
Conference On Learning Representations (ICLR),
Addis Ababa, Ethiopia, Apr. 2020.
[32] M. Mauch, C. Cannam, M. Davies, S. Dixon, C. Harte,
S. Kolozali, D. Tidhar, and M. Sandler, “OMRAS2
Metadata Project 2009,” in Late-Breaking/Demos
Session of the 10th International Conference on Music
Information Retrieval (ISMIR), Kobe, Japan, Oct. 2009.
[33] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka,
“RWC Music Database: Popular, Classical and Jazz
Music Databases,” in Proceedings of the 3rd
International Conference on Music Information Retrieval
(ISMIR), Paris, France, Oct. 2002.
[34] A. Marmoret, J. E. Cohen, and F. Bimbot, “Barwise
Music Structure Analysis with the Correlation
Block-Matching Segmentation Algorithm,” Transactions of
the International Society for Music Information
Retrieval, vol. 6, no. 1, Nov. 2023.
[35] J. A. Burgoyne, J. Wild, and I. Fujinaga, “An Expert
Ground Truth Set for Audio Chord Recognition and
Music Analysis,” in Proceedings of the 12th
International Society for Music Information Retrieval
Conference (ISMIR), Miami, USA, Oct. 2011.
[36] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon,
O. Nieto, D. Liang, and D. P. W. Ellis, “Mir_eval:
A Transparent Implementation of Common MIR
Metrics,” in Proceedings of the 15th International
Conference on Music Information Retrieval (ISMIR), Taipei,
Taiwan, Oct. 2014.