Lose the Frames: Exact Metrics for More Responsible Music Structure Analysis Evaluations

Author: Qingyang Xi; Brian McFee
Publisher: Zenodo
DOI: 10.5281/zenodo.17706471
Source: https://zenodo.org/records/17706471/files/000049.pdf
LOSE THE FRAMES: EVENT-BASED METRICS FOR EFFICIENT
MUSIC STRUCTURE ANALYSIS EVALUATIONS
Qingyang (Tom) Xi
Music and Audio Research Lab
New York University
[email protected]
Brian McFee
Music and Audio Research Lab
New York University
[email protected]
ABSTRACT
Many evaluation metrics in Music Information Retrieval (MIR) rely on uniform time sampling of phenomena that unfold over time. While uniform sampling is suitable for continuously varying concepts such as pitch or dynamic envelope, it is suboptimal for inherently discrete or piecewise constant events, such as labeled segments. Current Music Structure Analysis (MSA) metrics for label evaluation are all implemented with time sampling, which can be inexact and inefficient. In this work, we propose event-based implementations of the three most widely used MSA metrics. Our approach yields evaluations that are more accurate, more computationally efficient, and more reproducible, streamlining MSA research workflows.
1. INTRODUCTION
Efficient and accurate evaluation metrics are vital to progress in MIR. Currently, many MIR metrics rely on uniform time sampling. While suitable for continuously varying phenomena like pitch, this approach introduces inaccuracies and computational inefficiencies for discrete or piecewise-constant annotations, such as chord estimation, sound event detection, and other labeled intervals. Moreover, sampling introduces an arbitrary hyperparameter, the frame size, that forces a trade-off between numerical accuracy and computational efficiency.
These issues are especially pronounced for Music Structure Analysis (MSA) metrics, where labeled segmentations are routinely evaluated using the Pairwise Frame Clustering (PFC) score [1], the Normalized Conditional Entropy (NCE) score [2], the V-measure [3], and the L-measure [4]. Although the current implementations provided by the standard MIR evaluation toolkit mir_eval [5] are widely adopted and optimized, they still operate with a paradigm based on uniform time sampling. Under this sampling-based paradigm, metric computation can become prohibitively expensive, often quadratic or worse in the number of frames, and numerically sensitive to the choice of frame size. This inefficiency hinders robust hyperparameter tuning, slows iterative model development, and makes large-scale studies impractical.

© Q. Xi and B. McFee. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Q. Xi and B. McFee, “Lose the Frames: Event-Based Metrics for Efficient Music Structure Analysis Evaluations”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
We introduce a new event-based paradigm for implementing three common MSA metrics: the PFC, the V-measure, and the L-measure.¹ Our event-based paradigm offers a general strategy for evaluating time series annotations by moving away from frame-based sampling. This approach decouples numerical precision from computational cost, providing exact evaluations at speeds that are orders of magnitude faster than frame-based methods. We use MSA metrics to demonstrate how this paradigm inherently improves numerical stability and reproducibility. This unlocks the potential for more scalable and robust research not only for MSA but also for other common MIR tasks such as chord estimation or sound event detection.
2. RELATED WORKS
The motivation behind our event-based metrics originates from a long-standing reliance on sampling-based evaluation within the MIR community. This practice, established in the Music Information Retrieval Evaluation eXchange (MIREX) campaigns, is now codified in widely adopted toolkits such as mir_eval [5]. This reliance on a user-defined frame rate requires prior knowledge of a task's characteristic timescale, forcing a domain-specific heuristic into what should be an objective measurement.
Besides being a practical nuisance, the act of picking an evaluation frame rate introduces a methodological vulnerability into the evaluation itself, as reported performances are influenced by arbitrary parameter choices. Reported performances can then become susceptible to the "Clever Hans" effect, where a system's apparent success stems from exploiting arbitrary choices or artifacts in the test design, rather than from true understanding [6]. Our work focuses on removing this layer of methodological variance.
Music segmentation is not the only task that can benefit from an event-based formulation; related areas, such as sound event detection (SED), have also faced challenges associated with frame-based evaluations. The SED community's standard evaluation toolkit, sed_eval [7], employs a relatively coarse default frame size of 1 second, due to the longer durations of audio recordings involved with the task. This has led to recent work by Bilen et al. [8] and Lostanlen and McFee [9] that explicitly proposes more efficient event-based metrics for SED.

¹ https://github.com/tomxi/frameless-eval
Despite this clear trend towards event-based evaluation in related fields, the core metrics for Music Structure Analysis within the field's standard toolkit have remained exclusively dependent on the sampling paradigm. This paper addresses this gap by presenting event-based implementations of the canonical MSA metrics, removing a long-standing inefficiency in MSA evaluation, and enabling more robust, reproducible research.
3. SEGMENTATION METRICS
Different metrics have been proposed to evaluate flat and hierarchical music segmentations. Flat segmentations are often evaluated using metrics that are based on the concept of clustering.² The pairwise frame clustering score (PFC) [1] and the V-measure [3] both fall into this category. For evaluating hierarchical segmentations, the L-measure [4] has been used in many recent works on MSA [10–13]. All three metrics are implemented in mir_eval [5] by sampling, with a default sampling rate of 10 Hz. Although subtle, the frame size of these metrics also affects the evaluation results in unexpected ways. We now review these three metrics and provide our event-based formulations.
3.1 Pairwise Frame Clustering
For a piece of music with time span $T = [t_0, t_1]$, a segmentation has a label mapping $S(t)$ that maps time points in $T$ to a set of $k$ unique labels $\gamma = \{y_1, y_2, \ldots, y_k\}$:

$$S : T \to \gamma \tag{1}$$
When comparing an estimated segmentation $\hat{S}$ with a reference $S$, the two do not need to share the same set of labels. Instead, their comparison relies on internal labeling consistency, which identifies points labeled identically within each segmentation.
This consistency is captured by the label agreement map, defined as:

$$M_S(u, v) := \mathbb{1}\left[S(u) = S(v)\right], \tag{2}$$

where $\mathbb{1}[\cdot]$ is the indicator function that returns 1 if the condition is true and 0 otherwise. It should be noted that although $M_S(u, v)$ is piecewise constant, it is a continuous-time function mapping $M_S : T^2 \to \{0, 1\}$.
We use $M_S$ to define the set of time pairs that meet, which forms a set of significant time pairs that can be considered as information to be recalled:

$$\mathcal{A}(S) := \{(u, v) \mid M_S(u, v) = 1\} \tag{3}$$
Figure 1 shows a simple example of a set of reference and estimated segmentations $S$, $\hat{S}$ and their corresponding sets of meeting positions $\mathcal{A}(S)$, $\mathcal{A}(\hat{S})$ in its top two rows.
² The boundary hit rate metric already uses an event-based formulation, and therefore we focus our attention on the metrics that focus on labeling.
[Figure 1 image: panels "Reference Segmentation" and "Estimated Segmentation" (top), their "Label Agreement Map" (middle), and "Intersection" (bottom); time axes run from 0:00 to 2:30.]
Figure 1. Visualizing PFC as ratio of area. Top: reference $S$ and estimated $\hat{S}$. Middle: meeting positions for each segmentation: $\mathcal{A}(S)$, $\mathcal{A}(\hat{S})$. Bottom: intersection $\mathcal{A}(S) \cap \mathcal{A}(\hat{S})$ highlighted in red.
Introduced by Levy and Sandler [1], the pairwise frame clustering (PFC) metric evaluates segmentation agreement by considering these meeting pairs. The PFC metric quantifies the proportion of meeting pairs common to both segmentations relative to the meeting pairs of each. The time pairs $(u, v)$ that meet in both segmentations are colored red in the bottom of Figure 1. PFC recall and precision are then defined as the ratio of these areas.
PFCR=A(S)∩ A(ˆ
S)
|A(S)|,PFCP=A(S)∩ A(ˆ
S)
|A(ˆ
S)|
(4)
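Here, $|\mathcal{A}(S)|$ represents the size of the set of meeting positions under $S$. The size of this region can be computed by integrating over the time pair space $T^2$.

$$|\mathcal{A}(S) \cap \mathcal{A}(\hat{S})| = \int_{T^2} M_S(u, v) \cdot M_{\hat{S}}(u, v) \, d(u, v)$$

$$|\mathcal{A}(S)| = \int_{T^2} M_S(u, v) \, d(u, v)$$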
Notice that since $S(t)$ consists of discrete events and is therefore piecewise constant, the integrals can be computed as the sum of areas of rectangles, which have simple closed-form solutions.
To achieve this efficiently, we define a common set of intervals from the union of both segmentations' boundaries. Within each resulting interval, the segment labels remain jointly constant, allowing us to query each segmentation's label exactly once per interval, irrespective of interval duration. This strategy leverages the piecewise constant property of segmentations, ensuring exact computations with minimal sampling.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
In terms of computational complexity, the PFC metric requires sampling the label agreement map $M$ over pairs, which is quadratic in the number of frames, $O(n^2)$, for frame-based approaches. For the continuous-time approach, the complexity is $O((s + \hat{s})^2)$, where $s$ and $\hat{s}$ are the number of segments of reference and estimation, respectively.
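The rectangle-sum computation described in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released implementation: the function names `labels_on_common_grid` and `event_based_pfc` are hypothetical, segmentations are assumed to be given as sorted boundary times plus one label per segment, and both are assumed to cover the same time span.

```python
import numpy as np


def labels_on_common_grid(ref_bounds, ref_labels, est_bounds, est_labels):
    """Return (durations, ref_ids, est_ids) on the union of both boundary sets."""
    ref_bounds = np.asarray(ref_bounds, dtype=float)
    est_bounds = np.asarray(est_bounds, dtype=float)
    grid = np.union1d(ref_bounds, est_bounds)   # sorted, de-duplicated boundaries
    durations = np.diff(grid)
    mids = 0.5 * (grid[:-1] + grid[1:])         # one query point per interval
    # Integer label ids: reference and estimate need not share label names.
    ref_ids = np.unique(ref_labels, return_inverse=True)[1]
    est_ids = np.unique(est_labels, return_inverse=True)[1]
    # Index of the segment containing each interval midpoint.
    ref_seg = np.searchsorted(ref_bounds, mids, side="right") - 1
    est_seg = np.searchsorted(est_bounds, mids, side="right") - 1
    return durations, ref_ids[ref_seg], est_ids[est_seg]


def event_based_pfc(ref_bounds, ref_labels, est_bounds, est_labels):
    """Exact pairwise clustering precision and recall as ratios of areas."""
    durations, ref_ids, est_ids = labels_on_common_grid(
        ref_bounds, ref_labels, est_bounds, est_labels
    )
    area = np.outer(durations, durations)             # area of each rectangle in T^2
    meet_ref = ref_ids[:, None] == ref_ids[None, :]   # M_S on the grid
    meet_est = est_ids[:, None] == est_ids[None, :]   # M_S-hat on the grid
    both = area[meet_ref & meet_est].sum()            # area of the intersection
    recall = both / area[meet_ref].sum()
    precision = both / area[meet_est].sum()
    return precision, recall


precision, recall = event_based_pfc(
    [0, 10, 20], ["A", "B"], [0, 5, 20], ["a", "b"]
)
print(precision, recall)  # precision 0.6, recall 0.75
```

In the usage example the common grid has three intervals (5 s, 5 s, 10 s), so only a 3 × 3 table of rectangles is summed, regardless of how long the track is.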
3.2 V-measure
While conceptually simple, PFC overlooks the issues of over- and under-segmentation, which motivated the adoption of Normalized Conditional Entropy (NCE) [2], and the closely related V-measure [3]. For comparing flat segmentations' labels, the V-measure is a modern metric that improves upon NCE with proper normalization.
The entropy of a segmentation $S : T \to \gamma$ can be examined by randomly sampling along its duration. We denote the sampled label as a random variable $Y$:

$$P[Y = y] = P_{t \sim T}[S(t) = y], \qquad H(S) = \mathbb{E}\left[-\log P[Y]\right]$$

In particular, when the segmentation is constant (that is, $S(t) = y$ for all $t$), $H(S) = 0$.
The conditional entropy $H(\hat{S} \mid S)$ measures the average entropy of the estimated labels $\hat{S}$ for each given segment label in the reference annotation $S$:

$$H(\hat{S} \mid S) = \mathbb{E}_{y \sim \gamma}\left[H(\hat{S} \mid S = y)\right] \tag{5}$$
The conditional entropy in the other direction, $H(S \mid \hat{S})$, is defined analogously and estimates the amount of uncertainty left in predicting the reference labels given the estimated segmentation. When the uncertainty of the reference label is low given the estimate, the estimate recalls the labeling information presented in the reference.
The V-measure score reflects how much information is shared between the two segmentations, relative to the amount of information contained within each one:³

$$V_R = 1 - \frac{H(S \mid \hat{S})}{H(S)}, \qquad V_P = 1 - \frac{H(\hat{S} \mid S)}{H(\hat{S})} \tag{6}$$
The probability and entropy are estimated by sampling in the frame-based paradigm but can be calculated exactly since $S(t)$ consists of discrete labeled sections and is piecewise constant.

$$P[Y = y] = \frac{1}{|T|} \int_T \mathbb{1}\left[S(t) = y\right] \, dt$$
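To make the difference between the two estimates concrete, the following sketch compares the exact label probabilities (total duration of each label divided by $|T|$) with a frame-sampled estimate. The function names are hypothetical, and the sampling scheme (one sample at the start of each frame) is an assumption chosen for illustration; it is not necessarily the scheme used by any particular toolkit.

```python
import numpy as np


def exact_label_probs(bounds, labels):
    """P[Y = y] from segment durations: the closed form of the integral."""
    bounds = np.asarray(bounds, dtype=float)
    total = bounds[-1] - bounds[0]
    probs = {}
    for label, duration in zip(labels, np.diff(bounds)):
        probs[label] = probs.get(label, 0.0) + float(duration) / total
    return probs


def sampled_label_probs(bounds, labels, frame_size):
    """P[Y = y] estimated from samples taken at the start of each frame (assumed scheme)."""
    bounds = np.asarray(bounds, dtype=float)
    times = np.arange(bounds[0], bounds[-1], frame_size)
    segment = np.searchsorted(bounds, times, side="right") - 1
    counts = {}
    for i in segment:
        counts[labels[i]] = counts.get(labels[i], 0) + 1
    return {label: count / len(times) for label, count in counts.items()}


bounds, labels = [0.0, 1.25, 3.0], ["A", "B"]
print(exact_label_probs(bounds, labels))         # A is 1.25/3, about 0.417
print(sampled_label_probs(bounds, labels, 0.5))  # A is 3/6 = 0.5
```

With a 0.5-second frame the boundary at 1.25 s falls between samples, so the sampled estimate for label A is off by about 0.08, while the duration-based value is exact for any boundary position.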
To calculate the V-measure, a contingency table is used to represent the co-occurrence of segment labels between the reference and estimated segmentations. This table allows for the straightforward calculation of joint and marginal probabilities of label assignments.
³ NCE differs from V-measure only by normalizing relative to the uniform distribution instead of the marginals.
Populating the $k \times \hat{k}$ contingency table has a time complexity of $O(n)$, as it requires a single pass through all $n$ frames. Once this $k \times \hat{k}$ table is created, the final V-measure score is calculated by computing the marginal entropies from the contingency table, an $O(k \times \hat{k})$ step, leading to an overall complexity of $O(k \times \hat{k} + n)$. Similarly, the event-based method results in a complexity of $O(k \times \hat{k} + s + \hat{s})$.
3.3 L-measure
Unlike flat segmentation metrics, which evaluate whether pairs of time points are assigned the same label, hierarchical metrics assess relationships across multiple levels of granularity. In this setting, the goal is not just to determine whether a pair of time points have matching labels, but to evaluate how well the hierarchical structure of an estimated segmentation aligns with that of a reference.
The L-measure [4] addresses the task by explicitly considering differences in hierarchical depth as a ranking problem. Rather than evaluating pairs of points, it uses a triplet-based comparison that asks, for a given anchor time $t$, if a second time point $u$ is more closely related to $t$ than a third point $v$. If both the reference and estimated hierarchies agree on this ranking, then the structural information contained at time $t$ is recalled.
In its current implementation within mir_eval, L-measure computes precision and recall densities separately at each anchor time point, before aggregating these values throughout the time domain. This is done to ensure that each time point contributes equally to the overall metric. This calculation per time point is illustrated in Figure 2, which can be interpreted as a local recall density before aggregating it over the time domain.
We start by extending the notion of segmentations and their label agreement mapping $M(u, v)$ to hierarchies and their counterparts. A hierarchy of depth $d$ is a sequence of progressively finer flat segmentations and has a label mapping $H : T \to (\gamma_1, \gamma_2, \ldots, \gamma_d)$:

$$H(t) = (S_1(t), S_2(t), \ldots, S_d(t)).$$
The label agreement mapping of a hierarchy is defined for each time pair as the deepest level at which they receive the same label. For any pair of time points $u, v \in T$, the depth of their shared label is defined as

$$M_H(u, v) := \max \{ d \mid S_d(u) = S_d(v) \}. \tag{7}$$
We plot two hierarchies and their label agreement maps in the first two rows of Figure 2.
Using this depth mapping, we can define a triplet $(t, u, v)$ to be significant under hierarchy $H$ if $u$ is more closely related to $t$ than $v$ is:

$$\mathcal{A}(H; t) := \{(t, u, v) \in T^3 \mid M_H(t, u) > M_H(t, v)\}. \tag{8}$$
The third row in Figure 2 shows $M_H(t, \cdot)$ for all times in $T$, and the fourth row shows significant triplets associated with query time $t$, where maroon marks $M_H(t, u) > M_H(t, v)$, i.e. $\mathcal{A}(H; t)$.
[Figure 2 image: columns "Reference Hierarchy $H$" and "Estimated Hierarchy $\hat{H}$"; rows show the hierarchies, their Label Agreement Maps with the query time marked, the relevance to $t$, the relevant triplets, and their intersection; axes are in seconds from 0 to 5, with a "Meet Depth" color scale from 0 to 3.]
Figure 2. Visualizing L-measure's density at time $t$ as ratio of area. Row 1: two hierarchies $H$ and $\hat{H}$. Row 2: label agreement maps $M_H$ and $M_{\hat{H}}$. Row 3: local relevance $M_H(t, \cdot)$. Row 4: significant triplets $\mathcal{A}(H; t)$, $\mathcal{A}(\hat{H}; t)$. Row 5: matching significant triplets $\mathcal{A}(H; t) \cap \mathcal{A}(\hat{H}; t)$ highlighted in red.
The precision and recall scores are then defined for each instant of time $t$ by counting the proportions of triplets shared between the two sets of triplets $\mathcal{A}(H; t) \cap \mathcal{A}(\hat{H}; t)$, shown in red on the bottom row of Figure 2.
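For an estimated hierarchy $\hat{H}$ and a reference $H$, the density of L-measure at time $t$ is defined as follows:

$$\rho_{\text{recall}}(\hat{H} \mid H; t) = \frac{|\mathcal{A}(\hat{H}; t) \cap \mathcal{A}(H; t)|}{|\mathcal{A}(H; t)|}, \qquad \rho_{\text{precision}}(\hat{H} \mid H; t) = \frac{|\mathcal{A}(\hat{H}; t) \cap \mathcal{A}(H; t)|}{|\mathcal{A}(\hat{H}; t)|}. \tag{9}$$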
Notice how $\rho$ is not defined when its denominator is $|\mathcal{A}(H; t)| = 0$. In the continuous-time formulation, this would imply that $M_H(t, u) = M_H(t, v)$ for all $(u, v)$; or, in other words, a flat segmentation with constant labeling.
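With $\rho$ defined, the overall L-measure recall and precision are defined as the average of the corresponding density over $T$:

$$L_{\text{recall}}(\hat{H} \mid H) = \frac{1}{|T|} \int_T \rho_{\text{recall}}(\hat{H} \mid H; t) \, dt, \qquad L_{\text{precision}}(\hat{H} \mid H) = \frac{1}{|T|} \int_T \rho_{\text{precision}}(\hat{H} \mid H; t) \, dt. \tag{10}$$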
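Equations (7) to (10) can be checked on small examples with a brute-force sketch that works on the intervals of a common boundary grid. It is written for clarity, not speed, and is not the inversion-counting algorithm used in practice. The function names are hypothetical; each hierarchy is assumed to be given as an integer array of shape (depth, number of grid intervals) holding label ids per level, and anchors whose density is undefined are skipped and excluded from the average, which is an assumption of this sketch.

```python
import numpy as np


def meet_depth(level_ids):
    """M_H on the grid: deepest level (1-based) at which two intervals share a label."""
    level_ids = np.asarray(level_ids)
    n = level_ids.shape[1]
    meet = np.zeros((n, n), dtype=int)          # 0 means the pair never meets
    for depth, ids in enumerate(level_ids, start=1):
        meet[ids[:, None] == ids[None, :]] = depth   # deeper levels overwrite
    return meet


def l_measure_recall(ref_levels, est_levels, durations):
    """Duration-weighted average of the recall density over anchor intervals."""
    durations = np.asarray(durations, dtype=float)
    area = np.outer(durations, durations)       # area of each (u, v) rectangle
    meet_ref = meet_depth(ref_levels)
    meet_est = meet_depth(est_levels)
    total, weight = 0.0, 0.0
    for i, dt in enumerate(durations):
        # Significant triplets for an anchor t in interval i: M(t, u) > M(t, v).
        sig_ref = meet_ref[i][:, None] > meet_ref[i][None, :]
        sig_est = meet_est[i][:, None] > meet_est[i][None, :]
        denom = area[sig_ref].sum()
        if denom > 0:                           # density undefined otherwise (skipped)
            total += dt * area[sig_ref & sig_est].sum() / denom
            weight += dt
    return total / weight if weight > 0 else float("nan")


durations = [5.0, 5.0, 10.0]
ref = [[0, 0, 1], [0, 1, 2]]   # two-level reference hierarchy
est = [[0, 1, 1], [0, 1, 2]]   # two-level estimated hierarchy
print(l_measure_recall(ref, est, durations))  # recall, about 0.8
print(l_measure_recall(est, ref, durations))  # precision (arguments swapped), about 0.8
```

Precision is obtained by swapping the roles of the two hierarchies, since the two densities in Equation (9) differ only in their denominator.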
In practical terms, naively identifying the set of temporal triplets has a computational complexity of $O(n^3)$ for $n$ frames, which becomes particularly challenging for long sequences. To improve efficiency, mir_eval implements an inversion counting algorithm to assess ranking discrepancies between two lists, thereby reducing the computational complexity to $O(n^2 \log n)$ in the sampling case. Taking advantage of the same inversion counting algorithm, the complexity of the continuous approach is $O((s + \hat{s})^2 \log(s + \hat{s}))$.
Similarly to PFC, the L-measure also requires sampling the hierarchical label agreement map $M_H$, which is quadratic in the number of segments $s + \hat{s}$ or the number of frames $n$ for each of the $d + \hat{d}$ levels.
3.4 Computational Complexity
Table 1 describes the computational time complexity of the three MSA metrics analyzed in this paper. We list the current frame-based approaches and the new event-based approaches for each metric side by side. We will see in Section 5 that when $s < n$, our event-based implementations are significantly faster than the sampled versions, while maintaining accuracy.
Metric       Frame-Based             Event-Based
Pairwise     O(n^2)                  O(s^2)
V-measure    O(k^2 + n)              O(k^2 + s)
L-measure    O(n^2 (d + log n))      O(s^2 (d + log s))

Table 1. Computational complexity of frame-based and event-based versions of MSA metrics. k is the number of unique labels, s is the number of segments, n is the number of frames, d is the depth of the hierarchy.
4. EXPERIMENTS
To empirically evaluate our event-based implementation of MSA metrics, we performed experiments using hierarchical structural annotations from the SALAMI dataset [14], a widely used corpus containing hierarchical music structure annotations. We used 884 tracks from SALAMI, each featuring two separate human-generated two-level hierarchies, allowing us to benchmark the proposed metrics using real annotations. We used the lower level of the hierarchy to compare flat segmentation metrics. Segmentations produced by Salamon et al.'s hierarchical MSA method [15] on the SALAMI dataset were also considered, providing deeper hierarchies that match realistic computational scenarios typical in current MIR research. We will refer to their approach as the Segment Fusion method.

[Figure 3 image: three panels (Pairwise, V-measure, L-measure) plotting Runtime (sec, log scale) against Track Duration (sec); legend "Frame Size": event-based, 0.1s, 0.2s, 0.5s, 1.0s, 2.0s.]
Figure 3. Runtime of MSA metrics against annotation duration, with colors representing different metric resolutions. Shaded regions indicate the 95% confidence interval.
4.1 Benchmarking Setup
We benchmarked our event-based implementations against the frame-based versions available in mir_eval [5]. All experiments were conducted on a 2021 MacBook Pro equipped with an M1 Max chip, 32 GB RAM, and Python 3.9 using mir_eval version 0.8.2. Large frame sizes increase computation efficiency by reducing the number of frames to process, facilitating rapid prototyping iterations. However, these large frames do not achieve the same level of accuracy as finer frame sizes. We conducted an assessment of both computational efficiency (in terms of runtime) and accuracy (in terms of metric score consistency). Our study evaluated the accuracy and runtime performance of mir_eval's implementation across five distinct frame sizes [0.1, 0.2, 0.5, 1.0, 2.0], measured in seconds.
4.2 Scenario: Hyperparameter Tuning
To contextualize the computational overhead in a real-world research scenario, we model our experiment on the hyperparameter tuning process of the Segment Fusion method [15]. Their optimization process involved evaluating 100 parameter combinations on a development set of 471 tracks. We report the cumulative runtime required to evaluate a single hyperparameter combination across this dataset using both our event-based and the standard frame-based methods.
4.3 L-measure and Depth
We also assessed the L-measure's runtime dependency on hierarchical depth, a concern for evaluating very deep hierarchies produced by segmentation algorithms. For this experiment, we took the 12-layer hierarchies from the Segment Fusion method and systematically reduced their depth, one layer at a time. At each depth, we reported the average runtime to compute the L-measure against a two-layer reference hierarchy.
Frame Size   Pairwise   V-measure   L-measure
0.1s         -0.0009    0.0003      -0.0004
0.2s         -0.0019    0.0010      -0.0009
0.5s         -0.0047    0.0030      -0.0021
1.0s         -0.0093    0.0059      -0.0035
2.0s         -0.0187    0.0111      -0.0076

Table 2. Average metric deviation between the frame-based and event-based approaches.
5. RESULTS
5.1 Computational Efficiency
Figure 3 plots the runtime of different framing schemes versus the duration of the annotation in seconds. This shows that our event-based implementation consistently outperforms the frame-based method in terms of runtime. Specifically, the runtime of our event-based implementation of L-measure remained close to 10 milliseconds per computation, regardless of the duration of the track, whereas the runtime of frame-based computations can sometimes exceed 10 seconds and grows super-linearly with duration. This substantial improvement in computational efficiency highlights the potential of integrating these metrics into larger-scale analyses or iterative workflows, such as hyperparameter tuning. Although not as pronounced, the V-measure and PFC metrics also show a reasonable speed-up, especially with increasing track duration. This is not surprising, as our implementation's computational complexity depends on the number of boundaries, as opposed to the number of frames.
5.2 Frame Size Sensitivity
Although coarse frame sizes make the original sampling-based implementation faster to compute, they compromise accuracy. Our experiments reveal considerable sensitivity of the original frame-based metrics to the chosen frame size.
Table 2 illustrates the average deviation of the frame-based implementation of the three metrics when evaluated with different frame sizes. Furthermore, Figure 4 presents a detailed analysis of the numerical discrepancies observed between our precise implementation and the sampling-based methods, which are sensitive to the frame size employed.

[Figure 4 image: three panels (Pairwise, V-measure, L-measure) showing Metric Deviation for each Frame Size (sec): 0.1, 0.2, 0.5, 1.0, 2.0.]
Figure 4. Deviation of frame-based metrics under different frame sizes from event-based metrics. The deviation is shown on a symmetric log scale, where a negative (positive) value indicates under- (over-) prediction of the event-based score.
The results reveal systematic biases that vary by metric and frame size. Both PFC and L-measure tend to underpredict the true scores with increasing frame size, while the V-measure is consistently inflated. Notice that while the average biases are systematic, the individual errors are two-sided; for any given frame size and metric, the deviations can be either positive or negative relative to the precise value, as shown in Figure 4.
These opposing biases suggest that the reported ranking of different systems could change based on the chosen frame size, potentially undermining the reproducibility of comparative evaluations.
5.3 Scenario: Hyperparameter Tuning
To assess the practical impact on iterative research workflows, we evaluated the computational cost following the hyperparameter tuning process used to develop the Segment Fusion algorithm [15]. For one evaluation pass on the 471-track development set, using our event-based implementation took only 40 seconds. In stark contrast, the frame-based implementation of mir_eval (with the recommended 0.1-second frame size) required approximately 90 minutes to perform the same evaluation. This would mean that using our implementation during this development pipeline could save hundreds of hours of compute time, directly affecting the feasibility of certain research activities that depend on heavy evaluation.
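As a rough check of the scale involved, the reported per-pass timings can be combined with the 100 parameter combinations mentioned in Section 4.2. The only assumption here is that each combination costs one full evaluation pass; repeated tuning rounds would scale these figures further.

$$\text{speed-up per pass} \approx \frac{90 \times 60\ \text{s}}{40\ \text{s}} = 135$$

$$100 \times 90\ \text{min} = 150\ \text{h (frame-based)}, \qquad 100 \times 40\ \text{s} \approx 67\ \text{min (event-based)}$$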
We also found that this overhead is disproportionately affected by track duration; the ten longest tracks alone accounted for 22 minutes of the frame-based evaluation time, which provides an upper bound on how much efficiency could be gained by parallelism. Our event-based formulation avoids this dependency, offering not only significant computational savings but also enhancing the practicality of using large-scale datasets in modern MIR workflows.
5.4 L-measure and Depth
As noted in Section 3.4, the computational complexity of both event-based and frame-based implementations depends on the depth of the hierarchies. We confirm this empirically by showing the average runtime for increasingly deeper estimated hierarchies against the same annotations in Figure 5.
[Figure 5 image: "Effect of Depth on L-measure Runtime"; x-axis: Depth of Estimated Hierarchy (1 to 12); y-axis: Runtime (sec, log scale); legend "Frame Size": 0.1s, 0.2s, 0.5s, 1.0s, 2.0s, event-based.]
Figure 5. Runtime comparison of frame-based and event-based L-measure implementations as a function of hierarchy depth of $\hat{H}$. The y-axis shows the metric runtime in seconds (log scale), with shaded regions indicating the 95% confidence interval.
6. DISCUSSION
We introduced an event-based paradigm for Music Structure Analysis (MSA) evaluation metrics. This new paradigm eliminates the arbitrary choice of frame size inherent in traditional frame-based approaches while drastically improving computational efficiency.
Our empirical evaluation using the SALAMI dataset demonstrated the substantial benefits of our event-based implementations. We showed significant gains in computational efficiency and, crucially, identified systematic biases present in frame-based metrics. These findings underscore that even seemingly minor choices regarding frame granularity can lead to non-trivial errors, potentially compromising evaluation integrity. While frame-based metrics can approach the accuracy of event-based counterparts with sufficiently small frame sizes, this precision comes at a considerable computational cost, severely limiting their utility in large-scale or iterative research workflows.
Looking ahead, our event-based paradigm can be readily extended to other evaluation metrics and broader MIR tasks. Adopting such an approach broadly within MIR evaluation frameworks will not only enhance accuracy and reproducibility, but will also foster the sustainable growth and scalability of future research efforts.
7. REFERENCES
[1] M. Levy and M. Sandler, “Structural segmentation of musical audio by constrained clustering,” IEEE transactions on audio, speech, and language processing, vol. 16, no. 2, pp. 318–326, 2008.
[2] H. M. Lukashevich, “Towards quantitative measures of evaluating song segmentation.” in ISMIR, 2008, pp. 375–380.
[3] A. Rosenberg and J. Hirschberg, “V-measure: A conditional entropy-based external cluster evaluation measure,” in Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), 2007, pp. 410–420.
[4] B. McFee, O. Nieto, M. M. Farbood, and J. P. Bello, “Evaluating hierarchical structure in music annotations,” Frontiers in Psychology, vol. 8, p. 1337, 2017.
[5] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, “Mir_eval: A transparent implementation of common MIR metrics,” in ISMIR, 2014, pp. 367–372.
[6] B. L. Sturm, “A simple method to determine if a music information retrieval system is a "horse",” IEEE Trans. Multim., vol. 16, no. 6, pp. 1636–1644, 2014.
[7] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[8] Ç. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, and S. Krstulović, “A framework for the robust evaluation of sound event detection,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 61–65.
[9] V. Lostanlen and B. McFee, “Efficient evaluation algorithms for sound event detection,” in 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2023), 2023.
[10] M. Buisson, B. McFee, S. Essid, and H. C. Crayencour, “Self-supervised learning of multi-level audio representations for music segmentation,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 32, pp. 2141–2152, 2024.
[11] C. J. Tralie and B. McFee, “Enhanced hierarchical music structure annotations via feature level similarity fusion,” in ICASSP. IEEE, 2019, pp. 201–205.
[12] J. de Berardinis, M. Vamvakaris, A. Cangelosi, and E. Coutinho, “Unveiling the hierarchical structure of music by multi-resolution community detection,” Trans. Int. Soc. Music. Inf. Retr., vol. 3, no. 1, pp. 82–97, 2020.
[13] T. Chen, L. Su, and K. Yoshii, “Learning multifaceted self-similarity for musical structure analysis,” in APSIPA ASC. IEEE, 2023, pp. 165–172.
[14] J. B. L. Smith, J. A. Burgoyne, I. Fujinaga, D. De Roure, and J. S. Downie, “Design and creation of a large-scale database of structural annotations.” in ISMIR, vol. 11. Miami, FL, 2011, pp. 555–560.
[15] J. Salamon, O. Nieto, and N. J. Bryan, “Deep embeddings and section fusion improve music segmentation,” in ISMIR, 2021, pp. 594–601.