dPLP: A DIFFERENTIABLE VERSION OF PREDOMINANT LOCAL
PULSE ESTIMATION
Ching-Yu Chiu, Sebastian Strahl, and Meinard Müller
International Audio Laboratories Erlangen, Germany
{ching-yu.chiu, sebastian.strahl, meinard.mueller}@audiolabs-erlangen.de
ABSTRACT
Predominant Local Pulse (PLP) estimation is a key technique in rhythmic analysis of music recordings, designed to identify the most salient pulse in an audio signal while adapting to local tempo variations. Unlike global tempo estimation, which assumes a fixed tempo, PLP dynamically adjusts to changes in tempo and rhythm, making it particularly effective as a post-processing strategy to enhance the locally periodic structure of a given input novelty or activity function. Traditional PLP estimation relies on a max operation to select the most prominent periodicity, limiting its use in differentiable learning frameworks. In this paper, we introduce dPLP, a differentiable version of PLP estimation that replaces the max operation when selecting a locally optimal periodicity kernel with a softmax-based weighting scheme. This modification ensures good gradient flow, allowing PLP to be seamlessly integrated into deep learning pipelines as an intermediate layer or as part of the loss function. We provide technical insights into its differentiable formulation and present experiments comparing it to the original non-differentiable PLP approach. Additionally, case studies in beat tracking highlight the advantages of dPLP in improving periodicity-aware representations within neural network architectures.
1. INTRODUCTION
Rhythm, a fundamental component of music, is shaped by beats (regular pulses), tempo (the rate at which those beats occur), and meter (the grouping of beats into measures). As rhythm involves the organization of elements across multiple hierarchical levels, its analysis remains a challenging task in MIR [1, 2]. Predominant Local Pulse (PLP) estimation, designed to analyze and enhance the local periodicity of musical novelty functions [3, 4], serves as an effective tool for rhythm analysis [5–7] and beat tracking [8–10]. Relying on the idea of the Fourier tempogram, the method of PLP analyzes an input novelty function and derives for each time position an optimal sinusoidal kernel that best represents the local peak structure of the novelty function. By overlap-adding these derived sinusoids for all time positions and applying rectification, a PLP function which represents the periodicity enhancement of the original novelty function can be derived. However, the process of determining at each time position the optimal sinusoidal kernel representing predominant periodicity relies on a non-differentiable max operation, restricting PLP’s integration with modern deep-learning frameworks. Consequently, existing studies employ PLP as a post-processing technique, isolated from the system’s training process. For example, in beat tracking, current neural networks often lack an explicit mechanism to learn and produce periodic outputs, thus depending on a separate post-processor like PLP [8, 9] or a dynamic Bayesian Network (DBN) [11–13], which enforces periodicity through stronger tempo assumptions. This two-stage architecture not only reveals the limitations of what existing neural networks can learn but also necessitates manual adjustments of post-processing settings when their tempo assumptions are violated.¹

© C.-Y. Chiu, S. Strahl, and M. Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: C.-Y. Chiu, S. Strahl, and M. Müller, “dPLP: A Differentiable Version of Predominant Local Pulse Estimation”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
With the growing demand for interpretable, efficient, and controllable models, researchers are increasingly developing differentiable variants of model-based approaches. For instance, by replacing the minimal-cost alignment in dynamic time warping (DTW) with a soft-minimum calculation, Cuturi and Blondel [14] introduced soft-DTW, enabling its use as a differentiable loss function for training neural networks on weakly aligned data [15, 16]. Similarly, differentiable digital signal processing (DDSP) methods [17–21] have emerged following this trend. Building on these advancements and addressing existing limitations, we introduce dPLP, a differentiable variant of PLP estimation. Designed for seamless integration into deep learning pipelines, dPLP replaces the non-differentiable max operation with a softmax-based weighting scheme, enabling smooth optimization. To evaluate its benefits, we conduct a proof-of-concept beat tracking experiment on a small dataset of popular music. We introduce a lightweight, differentiable spectral flux variant as a trainable activity estimator. By integrating this module with dPLP, we establish a model-based, interpretable framework that enhances the model’s ability to capture periodicity.

¹ The DBN, for instance, requires hyperparameters to define a tempo change distribution, affecting the model’s flexibility in handling tempo variations. Likewise, the PLP requires a predefined kernel size to estimate local periodicity. If the selected kernel size is too short, it may fail to capture periodicity from the input novelty function; if too long, it may introduce noise by capturing inconsistent periodicity from different regions.

Figure 1. Comparison of the original PLP (left) and dPLP (right) calculation pipelines. (Top) Input novelty function, duplicated in (b) and (c) for reference. (a) Fourier magnitude tempogram (left) and its frame-wise softmax-transformed variant (right). (b) Optimal (left) vs. weighted-summed (right) sinusoidal kernels at four time positions. (c) Kernel accumulation. (d) Derived PLP/dPLP functions (black curves) with peak positions identified by a peak picker. Annotated beat positions are marked by vertical red dashed lines. The yellow region highlights differences between PLP and dPLP.
The remainder of this paper is structured as follows. Section 2 introduces the formulation, computation, and key parameters of dPLP. Section 3 presents a beat tracking case study, outlining the research questions and baseline architectures. Section 4 analyzes the experimental results, providing both quantitative and qualitative evaluations. Finally, Section 5 concludes the study.
2. MATHEMATICAL FORMULATION OF dPLP
In this section, we introduce the mathematical notation and formulas for both the classical PLP and the differentiable PLP.
2.1 Original PLP
Figure 1 (left) illustrates the computation of the original PLP function [3]. Given a novelty function $\Delta : \mathbb{Z} \to \mathbb{R}$ (Figure 1, top), representing the onset envelope or beat likelihood, PLP estimates a periodicity-enhanced version of $\Delta$ (Figure 1d, left). The process applies a discrete STFT to $\Delta$ using a window function $W : \mathbb{Z} \to \mathbb{R}$. This window, for example a Hann window, is of length $K \in \mathbb{N}$, centered at $n = 0$, and zero outside. For frequency $\omega \in \mathbb{R}_{\geq 0}$ and time $n \in \mathbb{Z}$, the Fourier coefficient $F(n, \omega)$ is defined as
$$F(n, \omega) = \sum_{m \in \mathbb{Z}} \Delta(m)\, W(m - n)\, e^{-2\pi i \omega m}. \tag{1}$$
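Let $\Theta \subset \mathbb{R}_{>0}$ be a finite set of tempi, specified in beats per minute (BPM). The discrete Fourier tempogram $T : \mathbb{Z} \times \Theta \to \mathbb{R}_{\geq 0}$ is defined as the magnitude of the Fourier coefficient, given by
$$T(n, \tau) = |F(n, \tau/60)|. \tag{2}$$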
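As a rough illustration of Equations 1 and 2, the following pure-Python sketch evaluates the tempogram at a single frame. The function names, the particular Hann formula, and the feature rate `fs` are assumptions made for this example: `fs` converts the frame index $m$ to seconds so that $\tau/60$ can be read in Hz, a normalization the formulas above leave implicit.

```python
import cmath
import math


def hann(k, length):
    """Hann window of `length` frames centered at k = 0, zero outside."""
    if abs(k) > length // 2:
        return 0.0
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * k / length))


def fourier_coefficient(novelty, n, tempo_bpm, win_len, fs=100.0):
    """F(n, tau/60) from Equation 1, restricted to the window support."""
    omega = tempo_bpm / 60.0  # tempo expressed in Hz
    half = win_len // 2
    coeff = 0j
    for m in range(max(0, n - half), min(len(novelty), n + half + 1)):
        # m / fs turns the frame index into seconds (assumed normalization)
        coeff += novelty[m] * hann(m - n, win_len) * cmath.exp(-2j * math.pi * omega * m / fs)
    return coeff


def tempogram_frame(novelty, n, tempi, win_len, fs=100.0):
    """T(n, tau) from Equation 2 for every tau in `tempi`."""
    return [abs(fourier_coefficient(novelty, n, tau, win_len, fs)) for tau in tempi]
```

For an impulse train with one pulse every 50 frames at 100 Hz (that is, 120 BPM), the largest magnitude among the candidates 60, 90, 120, and 150 BPM falls on 120 BPM.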
Let $\phi(n, \tau)$ denote the phase of $F(n, \tau/60)$. The corresponding windowed sinusoidal kernel at time $n$ with tempo $\tau$ is
$$\kappa_{n,\tau}(m) := W(m - n) \cos\big(2\pi m \cdot \tau/60 - \phi(n, \tau)\big), \tag{3}$$
where $W$ is the same window function as in the STFT computation. The original (argmax-based) PLP estimation finds for each $n$ the tempo $\tau_n \in \Theta$ that maximizes the magnitude tempogram $T(n, \tau)$ (Figure 1a, left):
$$\tau_n := \operatorname*{argmax}_{\tau \in \Theta} T(n, \tau). \tag{4}$$
Using $\tau_n$ and the corresponding phase $\phi(n, \tau_n)$, the optimal sinusoidal kernel $\kappa_{n,\tau_n}(m)$ (Figure 1b, left) can be derived by Equation 3. The derived sinusoids are accumulated over time by overlap-adding (Figure 1c, left), preserving periodicity while allowing local tempo variations. Finally, half-wave rectification (omitting negative values) yields the original PLP function $\Gamma : \mathbb{Z} \to \mathbb{R}_{\geq 0}$ (Figure 1d, left):
$$\Gamma(m) = \Big| \sum_{n \in \mathbb{Z}} \kappa_{n,\tau_n}(m) \Big|_{\geq 0}. \tag{5}$$
In summary, PLP estimation employs Fourier coefficients and the Fourier tempogram to extract local sinusoidal kernels that model periodicity. Overlap-adding these kernels reconstructs a periodicity-enhanced function. The resulting PLP function depends on the quality of $\Delta$, the window size $K$, and the tempo set $\Theta$, requiring careful parameter selection.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
2.2 Differentiable PLP
One step that makes the PLP computation non-differentiable is the argmax operation in Equation 4, which retains only the windowed sinusoid that fits best. To obtain a soft and differentiable approximation of the optimal windowed sinusoid, we instead apply the softmax function, replacing the optimal windowed sinusoid in Equation 3 by a weighted sum of all windowed sinusoids.
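To this end, we compute weight factors for all windowed sinusoids using the softmax function
$$\sigma^{\gamma}_{n}(\tau) := \frac{\exp\big(T(n, \tau)/\gamma\big)}{\sum_{\tau' \in \Theta} \exp\big(T(n, \tau')/\gamma\big)}, \tag{6}$$
where $\gamma > 0$ is a temperature hyperparameter that controls the softness of the distribution.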
For $\gamma \to 0$, $\sigma^{\gamma}_{n}(\tau)$ approximates the argmax operation, meaning the largest value of $T(n, \tau)$ dominates, resulting in a one-hot distribution. Conversely, for large $\gamma$, the softmax output becomes more uniform, with all values of $\sigma^{\gamma}_{n}(\tau)$ tending toward $1/|\Theta|$.
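The two limiting cases can be checked numerically with a small sketch of Equation 6 for a single frame. The function name and the example magnitudes are illustrative assumptions; subtracting the maximum before exponentiating is a standard stabilization that leaves the weights unchanged.

```python
import math


def softmax_weights(magnitudes, gamma=1.0):
    """Softmax weights over tempogram magnitudes T(n, tau) of one frame n."""
    peak = max(magnitudes)  # shift by the maximum for numerical stability
    exps = [math.exp((t - peak) / gamma) for t in magnitudes]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical magnitudes for three candidate tempi at one frame.
example = [0.2, 1.0, 0.5]
near_one_hot = softmax_weights(example, gamma=0.01)    # small gamma: close to argmax
near_uniform = softmax_weights(example, gamma=1000.0)  # large gamma: close to 1/3 each
```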
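Using these weights, we compute a soft approximation of the optimal windowed sinusoid as
$$\kappa^{\gamma}_{n}(m) := \sum_{\tau \in \Theta} \sigma^{\gamma}_{n}(\tau) \cdot \kappa_{n,\tau}(m), \tag{7}$$
where $\sigma^{\gamma}_{n}(\tau)$ is the softmax output, representing the weights for all windowed sinusoids.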
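Since the softmax function is differentiable, $\kappa^{\gamma}_{n}$ is differentiable with respect to $T(n, \tau)$. The dPLP function is then:
$$\Gamma^{\gamma}(m) = \Big| \sum_{n \in \mathbb{Z}} \kappa^{\gamma}_{n}(m) \Big|_{\geq 0}. \tag{8}$$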
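Equations 6 to 8 can be put together in a compact, unoptimized pure-Python sketch. It is not the authors' implementation: the function names, the feature rate `fs`, the skipping of negligible weights, and the phase sign are assumptions. In particular, the kernel phase is taken as the negative argument of the Fourier coefficient so that kernel maxima line up with novelty peaks; the text only states that the phase of $F$ is used. The hop of 10 frames follows the value given in the paper's footnote on the hop size.

```python
import cmath
import math


def hann(k, length):
    """Hann window of `length` frames centered at k = 0, zero outside."""
    if abs(k) > length // 2:
        return 0.0
    return 0.5 * (1.0 + math.cos(2.0 * math.pi * k / length))


def dplp(novelty, tempi, win_len, gamma=1.0, hop=10, fs=100.0):
    """Soft PLP curve; a very small gamma approaches the argmax-based PLP."""
    num, half = len(novelty), win_len // 2
    acc = [0.0] * num
    for n in range(0, num, hop):
        lo, hi = max(0, n - half), min(num, n + half + 1)
        # Fourier coefficients F(n, tau/60) for all candidate tempi
        coeffs = [
            sum(novelty[m] * hann(m - n, win_len)
                * cmath.exp(-2j * math.pi * (tau / 60.0) * m / fs)
                for m in range(lo, hi))
            for tau in tempi
        ]
        # softmax over the tempogram magnitudes (Equation 6)
        mags = [abs(c) for c in coeffs]
        peak = max(mags)
        exps = [math.exp((t - peak) / gamma) for t in mags]
        total = sum(exps)
        for e, tau, c in zip(exps, tempi, coeffs):
            weight = e / total
            if weight < 1e-12:  # skip negligible kernels (shortcut, assumption)
                continue
            shift = cmath.phase(c)  # equals -phi under the assumed sign convention
            for m in range(lo, hi):
                # weighted kernel (Equation 7), overlap-added into the accumulator
                acc[m] += weight * hann(m - n, win_len) * math.cos(
                    2.0 * math.pi * (tau / 60.0) * m / fs + shift)
    # half-wave rectification (Equation 8)
    return [max(0.0, v) for v in acc]
```

Because the accumulator is a sum of products of softmax weights and cosines, the same structure written with tensor operations in an automatic differentiation framework would let gradients reach the novelty input, which is the property the section relies on.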
Figure 1 (right) illustrates the dPLP computation. Given the softmax-normalized tempogram (Figure 1a, right), the weighted-summed sinusoidal kernel at time $n$ is a weighted sum of the kernels of all tempi. Constructive or destructive interference modifies kernel shapes compared to the argmax case (Figure 1b, left). For time positions with ambiguous tempo (e.g., orange and green dots), $\kappa^{\gamma}_{n}$ preserves fewer peaks due to destructive interference. For positions with a dominant tempo (e.g., red dot), the softmax and argmax kernels are nearly identical. As shown in Figure 1d (yellow regions), these kernel differences affect beat estimates when PLP/dPLP functions serve as beat novelty functions. Overall, dPLP behaves as an intermediary between the original novelty function and the original PLP, offering a differentiable module for periodicity enhancement, with the softness adjustable via the softmax temperature parameter $\gamma$.
2.3 Hyperparameters
As indicated in Sections 2.1 and 2.2, the properties of the original PLP and dPLP depend largely on two hyperparameters: the window (kernel) length $K$ and the tempo range $\Theta$.² In this study, we experiment with kernel sizes of 3, 5, and 10 seconds (corresponding to $K \in \{300, 500, 1000\}$ frames at a frame rate of 100 Hz) to cover varying lengths of local temporal context. For $\Theta$, we consider tempi ranging from 20 to 320 BPM, using two types of scales: linear (LN) and logarithmic (LG). In the LN scale, $\Theta$ is defined as $\{\tau \in \mathbb{N} \mid 20 \leq \tau \leq 320\}$, resulting in a total of 301 tempo classes. In the LG scale, $\Theta$ consists of 81 values, also ranging from 20 to 320 BPM, spaced evenly on a logarithmic scale. Since humans are sensitive to relative changes in tempo rather than absolute differences [22], the LG scale aligns better with human perception. It achieves good coverage of the tempo range with fewer tempo classes than the LN scale, reducing the computational cost of tempogram and dPLP computation.
Additionally, since our focus is to explore the properties and potential benefits of incorporating dPLP rather than optimizing hyperparameters for a specific case, we fix the softmax temperature at $\gamma = 1$ in this study. The effectiveness of these choices is evaluated in Section 3.
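The two tempo grids and the kernel lengths in frames can be reproduced with a few lines of Python. The helper names are hypothetical, and the geometric spacing with both endpoints included is an assumption, since the text only says the 81 values are evenly spaced on a logarithmic scale. Under that assumption, 320/20 = 2^4 gives exactly 20 tempo classes per octave.

```python
def linear_tempo_grid(lo=20, hi=320):
    """LN scale: every integer tempo from lo to hi BPM (301 values)."""
    return [float(t) for t in range(lo, hi + 1)]


def log_tempo_grid(lo=20.0, hi=320.0, num=81):
    """LG scale: `num` tempi with a constant ratio between neighbors."""
    ratio = (hi / lo) ** (1.0 / (num - 1))
    return [lo * ratio ** i for i in range(num)]


def kernel_length_frames(seconds, frame_rate=100.0):
    """Kernel size in frames for a given duration at the feature rate."""
    return int(round(seconds * frame_rate))


kernel_sizes = [kernel_length_frames(s) for s in (3, 5, 10)]
```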
3. CASE STUDY IN BEAT TRACKING
We conduct a case study on beat tracking, following the conventional architecture, which consists of an activity estimator and a post-processor [23, 24]. The activity estimator converts audio features (e.g., spectrograms) into real-valued novelty curves, indicating the likelihood of each time frame containing a beat. The post-processor then refines these curves into final binary beat estimates. This experiment aims to illustrate the advantages of the dPLP method, which enables backpropagation. Rather than advancing the state of the art in beat tracking, it serves as a controlled demonstration. To ensure efficient training and controlled analysis, we use a small toy dataset (Section 3.1), keep all network components minimal, integrate dPLP in various ways (Section 3.2), and employ a peak-picking-based post-processing method (Section 3.3). The resulting beat estimates are evaluated in Section 3.4 to assess dPLP’s impact and functionality.
3.1 Datasets
The GTZAN dataset [25, 26] is a widely used benchmark for music genre classification and various audio analysis tasks, including beat tracking [13, 24, 27, 28]. It comprises 1,000 audio tracks, each 30 seconds long, spanning ten genres, offering a diverse collection of musical styles. In this study, we specifically focus on the 100 tracks of popular music, providing a simplified scenario to examine the behaviors and effects of the proposed ideas. For the following beat tracking experiments, we randomly split these 100 tracks into 60 for training, 20 for validation, and 20 for testing, reporting results on the test data.

² Note that when calculating the Fourier tempogram and the corresponding PLP function, the hop size is also a hyperparameter that affects temporal resolution and computational cost. For simplicity, we empirically fix the hop size to 10 frames without further discussion.
3.2 Beat Activity Estimators
Given a 44.1 kHz audio recording, we compute STFT-based spectrograms using librosa [29] with an FFT size of 2048, a window length of 1024, and a hop size of 441, resulting in spectrograms with a 100 Hz temporal resolution. As shown in Figure 2, these spectrograms serve as the primary input feature for subsequent experiments.
3.2.1 Spectral Flux
Spectral flux [3, 4, 30] is a widely used model-based technique for onset detection. Given an STFT spectrogram, it applies logarithmic compression, discrete differentiation, half-wave rectification, and accumulation to generate a novelty curve. To further refine its quality, baseline subtraction, Gaussian smoothing, and normalization are often incorporated. To evaluate the impact of dPLP’s differentiability, we use bandwise spectral flux as an example of novelty function computation. As a minor contribution of this study, we implement a lightweight, trainable version of spectral flux (SFX) by formulating it as a PyTorch module. The SFX module processes an STFT spectrogram by dividing it into eight frequency bands, computing spectral flux independently for each band [3, 4, 30], applying a weighted sum, and performing Gaussian smoothing, rectification, and max-normalization. To introduce learnability, we make the differentiation convolution kernels, log compression parameters, and bandwise weighting parameters trainable, resulting in a total of 64 trainable parameters.
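The classical, non-trainable processing chain named above (logarithmic compression, discrete differentiation, half-wave rectification, accumulation, max-normalization) can be sketched for a single band as follows. This is not the SFX module itself: the function name, the list-of-frames input format, and the compression form log(1 + gamma * |X|) are assumptions, and band splitting, learned kernels, and smoothing are left out.

```python
import math


def spectral_flux(spectrogram, gamma=10.0):
    """Single-band spectral flux novelty.

    spectrogram: list of frames, each a list of non-negative magnitudes.
    gamma: log compression strength (assumed form log(1 + gamma * x)).
    """
    logspec = [[math.log(1.0 + gamma * v) for v in frame] for frame in spectrogram]
    novelty = [0.0]  # no predecessor for the first frame
    for prev, cur in zip(logspec, logspec[1:]):
        # first-order difference, half-wave rectified, summed over frequency
        novelty.append(sum(max(0.0, c - p) for p, c in zip(prev, cur)))
    peak = max(novelty)
    return [v / peak for v in novelty] if peak > 0 else novelty
```

Energy increases produce positive novelty, while decreases are discarded by the rectification, so a sustained note contributes only at its onset.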
We train SFX for beat tracking using 60 training tracks from GTZAN (Section 3.1), with a batch size of 8, a learning rate of 0.1, the Adam optimizer, and weighted binary cross-entropy (BCE) loss.³ The module is initialized with first-order differentiation convolution kernels, a log compression parameter of 10, and average-weighted summation, referred to as SFX-I. After training, the resulting model is denoted as SFX-T.
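The weighted BCE loss, with the weight of 3 on the beat class mentioned in the paper's footnote on class imbalance, can be written out as below. The function name, the clipping constant `eps`, and the averaging over frames are assumptions; the paper does not spell out these details.

```python
import math


def weighted_bce(predictions, targets, beat_weight=3.0, eps=1e-7):
    """Mean binary cross-entropy with extra weight on beat frames (target 1)."""
    total = 0.0
    for p, y in zip(predictions, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        total -= beat_weight * y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(predictions)
```

With this weighting, missing a beat with a given confidence costs three times as much as an equally confident false alarm. PyTorch's `torch.nn.BCEWithLogitsLoss` exposes a `pos_weight` argument with a comparable effect, although the text does not say which loss class was used.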
3.2.2 Argmax PLP and Softmax PLP
To compare the properties and behavior of the original argmax-based PLP (A-*) and the proposed differentiable softmax-based PLP (S-*), we process the beat novelty generated by the trained SFX-T using both methods separately. Following Section 2.3, we experiment with three PLP kernel sizes $K$ (3, 5, and 10 seconds) and two tempo scales: linear (LN) and logarithmic (LG). The resulting PLP functions are peak-picked (Section 3.3) and evaluated (Section 3.4) for comparison (Section 4.1).
3.2.3 dPLP Incorporated Architecture
Figure 2 illustrates our proposed dPLP-incorporated architecture, consisting of a spectral flux module (S, identical to SFX), a dPLP module, and a fuser (F). This design integrates an onset-based activity estimator (S), a periodicity analyzer (dPLP), and a fuser (F) that learns to combine information from both components. Given an STFT spectrogram, the S module generates a beat novelty function $\Delta_S$. The dPLP module processes $\Delta_S$ with three kernel sizes ($K \in \{3, 5, 10\}$ seconds), producing three dPLP curves. The fuser (F) concatenates $\Delta_S$ with these dPLP curves ($\Gamma^{\gamma}$-K*), applies weighted summation and smoothing, and outputs the final beat novelty function $\Delta_F$. The fuser (F) comprises a linear layer, a convolutional layer, and a sigmoid activation, totaling 21 trainable parameters.

³ To address class imbalance, we assign a weight of 3 to the beat class in the BCE loss, as non-beat frames dominate.
We refer to this architecture, where both S and F are trainable, as M1. During training, the beat novelty function $\Delta_F$ generated by the fuser (denoted as M1-F) is compared with reference beat annotations, and the BCE loss guides the learning process. Since dPLP enables gradient backpropagation from M1-F through the dPLP to the S module (M1-S), M1-S is optimized by a combined loss function incorporating dPLP, the fuser, and BCE loss. We are particularly interested in whether M1-S behaves differently from the standalone-trained SFX-T.
To assess the complementarity between $\Delta_S$ and the dPLP curves ($\Gamma^{\gamma}$-K*), we modify region A (see Figure 2) and implement an ablation model, M2. In M2, the S module (denoted as M2-S) is initialized with SFX-T parameters and kept fixed, allowing only the fuser (M2-F) to be trained. This setup evaluates whether M2-F can effectively integrate information from M2-S and dPLP.
Finally, to evaluate the dPLP module’s ability to provide periodicity-based information and enhance beat tracking, we introduce M3 by modifying region B (also shown in Figure 2). In M3, the three dPLP curves are replaced with three duplicated M3-S beat novelty functions $\Delta_S$, keeping the model size identical to M1. Since M1 and M3 share the same trainable components (S and F) and model size, their only difference (the presence or absence of dPLP) allows us to directly assess dPLP’s effectiveness.
3.3 Peak-Picking-based Post-processing
For all the derived beat novelty functions, we follow [4] and apply one-dimensional Gaussian smoothing (with $\sigma = 3$ frames), max-normalization, and local average-based peak picking (using a 20-second averaging window) to obtain the beat estimates. Compared to conventional post-processors such as DBN [11, 24, 31], this peak picking method avoids strong tempo assumptions and better reflects the properties of the novelty functions, as in [10, 13, 32].
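A minimal sketch of this post-processing chain is given below. The smoothing width and the averaging window follow the values in the text, but the exact peak criterion of [4] is not reproduced here: treating a frame as a peak when it is a local maximum that exceeds the local average is an assumption, as are the function names and the truncation of the Gaussian at four standard deviations.

```python
import math


def gaussian_smooth(x, sigma=3.0):
    """Smooth x with a Gaussian truncated at 4 * sigma (zero padding)."""
    radius = int(4 * sigma)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    norm = sum(kernel)
    out = []
    for i in range(len(x)):
        acc = sum(w * x[i + k] for k, w in zip(offsets, kernel) if 0 <= i + k < len(x))
        out.append(acc / norm)
    return out


def pick_peaks(novelty, sigma=3.0, avg_seconds=20.0, frame_rate=100.0):
    """Return frame indices of local maxima lying above the local average."""
    smooth = gaussian_smooth(novelty, sigma)
    peak = max(smooth)
    if peak > 0:
        smooth = [v / peak for v in smooth]  # max-normalization
    half = int(avg_seconds * frame_rate / 2)
    prefix = [0.0]  # prefix sums give each local average in constant time
    for v in smooth:
        prefix.append(prefix[-1] + v)
    peaks = []
    for i in range(1, len(smooth) - 1):
        lo, hi = max(0, i - half), min(len(smooth), i + half + 1)
        local_avg = (prefix[hi] - prefix[lo]) / (hi - lo)
        is_local_max = smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
        if is_local_max and smooth[i] > local_avg:
            peaks.append(i)
    return peaks
```

Dividing the returned frame indices by the frame rate (100 Hz in this study) yields beat times in seconds.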
3.4 Evaluation
We evaluate beat estimates using the F1-score (F1), precision (P), and recall (R) as implemented in mir_eval [33], with a tolerance window of ±70 ms. Additionally, since PLP is incorporated to enhance the model’s ability to handle longer musical context, we report the L-correct metric [3]. This metric requires at least $L$ consecutive reference beats to be correctly detected rather than considering reference beats individually. For simplicity, we use $L = 2$ and report F-measure (F-L2), precision (P-L2), and recall (R-L2).

[Figure 2 diagram labels: STFT Spectrogram, Spectral Flux (S), dPLP, Fuser (F), Beat Novelty, Beat Target, BCE Loss; marked regions A and B.]
Figure 2. The proposed architecture (M1) and the regions (dotted squares) to modify for the two ablations (M2, M3).
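The tolerance-based F1-score, precision, and recall from Section 3.4 can be illustrated with a small stand-alone sketch. It is not mir_eval's implementation: the function name is hypothetical, and the one-to-one matching is done with a greedy two-pointer pass over sorted beat times, which is an assumption made for brevity.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """Return (F1, precision, recall) for beat times given in seconds."""
    ref, est = sorted(reference), sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1  # each reference beat matches at most one estimate
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1     # estimate too early to match any remaining reference
        else:
            i += 1     # reference beat was missed
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2.0 * precision * recall / (precision + recall) if hits else 0.0
    return f1, precision, recall
```

An onset-heavy novelty curve produces many extra estimates, which lowers precision while leaving recall high. That is the pattern reported for the untrained spectral flux module in Table 1.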
4. EXPERIMENT RESULTS
In the following, we analyze the beat tracking results both quantitatively and qualitatively, based on beat estimates derived from the aforementioned novelty functions (Section 3.2) and a simple peak-picking method (Section 3.3).
4.1 Comparison of PLP Settings
Table 1 (top) presents the F-measure and L-correct evaluation results of the standalone spectral flux module (SFX-*), comparing the original argmax PLP (A-*) with the differentiable softmax PLP (S-*). The results indicate that SFX-* performs as expected. As an onset-based module, SFX-I achieves high recall (0.950) but low precision (0.311), resulting in an unsatisfactory F1-score (0.464) and low L-correct values (all below 0.100). After training, SFX-T improves the precision-recall balance, increasing precision from 0.311 to 0.436 and decreasing recall from 0.950 to 0.888. This leads to a higher F1-score (0.576) and improved L-correct values (all above 0.100). Using the SFX-T beat novelty as input, the derived argmax PLP functions (A-*) and softmax PLP functions (S-*) generally improve beat tracking performance.⁴ Specifically, since the test tracks consist of popular music with stable tempi, the dPLP functions help filter out non-beat onsets that do not align with the locally detected periodicity while also enhancing weak onsets at beat positions. This results in improved precision (all above 0.520) and recall (all above 0.900) compared to SFX-T. Moreover, the similar F1-scores (around 0.664) indicate that there is little difference between the linear tempo scale (LN) and the logarithmic tempo scale (LG), as well as between the argmax PLP (A) and the softmax PLP (S). Therefore, we use LG, which is computationally more efficient, for the subsequent experiments involving dPLP-incorporated architectures.

⁴ For each setting (e.g., A-LG), we apply three kernel settings ($K \in \{3, 5, 10\}$ seconds), derive three sets of PLP curves, evaluate them separately, and report the averaged scores in Table 1.
Act.    | F-Measure             | L-Correct
        | F1     P      R       | F-L2   P-L2   R-L2
--------+-----------------------+----------------------
SFX-I   | 0.464  0.311  0.950   | 0.031  0.022  0.055
SFX-T   | 0.576  0.436  0.888   | 0.153  0.122  0.213
A-LG    | 0.663  0.531  0.925   | 0.169  0.152  0.196
A-LN    | 0.662  0.528  0.929   | 0.158  0.142  0.183
S-LG    | 0.671  0.550  0.908   | 0.196  0.182  0.220
S-LN    | 0.664  0.528  0.933   | 0.165  0.147  0.193
--------+-----------------------+----------------------
M1-F    | 0.707  0.660  0.809   | 0.470  0.445  0.519
M2-F    | 0.684  0.615  0.817   | 0.385  0.360  0.439
M3-F    | 0.664  0.576  0.819   | 0.417  0.375  0.489
--------+-----------------------+----------------------
M1-S    | 0.561  0.412  0.921   | 0.182  0.140  0.267
M2-S    | 0.576  0.436  0.888   | 0.153  0.122  0.213
M3-S    | 0.529  0.371  0.952   | 0.083  0.060  0.138

Table 1. Beat tracking results. A and S denote argmax and softmax PLP. LG and LN indicate log-scale and linear-scale tempo spaces. F and S represent the fuser and spectral flux module of the dPLP incorporated architecture in Figure 2.
4.2 Effectiveness of Differentiability
Table 1 (middle) presents the results of the fusers (F) of the dPLP-incorporated architectures (M1, M2, M3). Notably, compared to other fusers and baselines, M1-F achieves substantial improvements across all evaluation metrics except recall. Specifically, its superior F1-score (0.707), precision (0.660), and F-L2 (0.470) suggest that M1-F has learned a more effective beat-tracking mechanism. Alternatively, the observed improvements may indicate that the proposed architecture (M1) has a greater capacity to fit the relatively simple structure of popular music in GTZAN. The results from the ablation models further support this observation. Comparing M2-F with SFX-T and the softmax PLPs (S-LG), we find clear evidence of complementarity between the M2 dPLP curves and the M2-S beat novelty function, which M2-F effectively leverages. Specifically, M2-F aligns more closely with the consensus across all input curves, significantly improving precision (from at most 0.550 to 0.615) and L-correct metrics (from at most 0.220 to at least 0.360). However, since M2-S is fixed and non-trainable, M2-F does not benefit from the differentiability of dPLP, resulting in lower capacity compared to M1-F.
The results from M3-F are also noteworthy. Compared to SFX-T, M3-F, which has a larger model size but lacks a dPLP module, adopts a different precision-recall trade-off: it suppresses peaks from M3-S, leading to higher precision (0.576 vs. 0.436), lower recall (0.819 vs. 0.888), and significantly improved L-correct metrics (above 0.370 vs. below 0.220). This suggests that the performance gains observed in M1-F, M2-F, and M3-F over SFX-T may also be influenced by model size. Finally, when comparing M3-F to M1-F, the slightly lower F1-scores of M3-F suggest that, without the dPLP module, M3-F may have lower capacity than both M1-F and M2-F.⁵
⁵ At first glance, it may seem contradictory that M3-F achieves higher L-correct metrics than M2-F despite having lower precision (0.576 vs. 0.615). This can be attributed to two factors: (1) The lower L-correct of M2-F is partly due to dPLP’s bias toward faster tempi, which can cause taps to align with tempo harmonics (e.g., double tempo), disrupting the continuity required by L-correct. (2) The lower precision of M3-F results from non-beat onsets clustering around specific beats, introducing false positives. Unlike the evenly distributed octave errors in M2-F, these false positives are more localized, allowing M3-F to achieve higher L-correct values.

Figure 3. Novelty functions. (a) Beat novelty functions from the spectral flux modules (SFX-*). (b) dPLP curves from M1, where K* denotes the dPLP function computed with a kernel size of * seconds. (c) Beat novelty functions from fusers (F). (d) Beat novelty functions from the spectral flux heads (S). Black dashed lines indicate annotated reference beats.

Lastly, as shown in Table 1 (bottom), spectral flux modules trained with different loss functions exhibit distinct behaviors, with F1-score differences ranging from 0.03 to 0.05 and variations in other metrics between 0.02 and 0.10.
4.3 Comparison of the Novelty Functions
Figure 3 compares the output novelty functions calculated from a test track in the GTZAN dataset, summarizing our previous discussions. In Figure 3a, the alignment between the reference annotated beats (black vertical dashed lines) and the SFX-* novelty functions confirms the high recall and low precision of SFX-*, as observed in Table 1. Moreover, compared to SFX-I, the trained SFX-T learns to suppress several non-beat peaks (purple regions).
Figure 3b shows that M1 dPLP functions computed with different kernel sizes ($K$) exhibit varying peak distributions, yet they largely agree at beat positions. Figure 3c visualizes the distinct behaviors of the three fusers (M1-F, M2-F, M3-F). Notably, each fuser exhibits different false-positive errors (e.g., purple or green regions). These differences can be attributed to the presence of the dPLP module (M1 and M2 vs. M3) and to whether the module S is further optimized using the gradients backpropagated through the dPLP module (M1 vs. M2). Specifically, the false-positive error shared by all fusers around the 920th frame (yellow region) reveals that all fusers attempt to produce peaks at positions where the input novelty functions agree. In contrast, when the novelty functions from the S modules (M*-S) do not align with the dPLP curves (e.g., green region), M1-F and M2-F, which have access to the dPLP outputs, avoid making a false-positive error. Lastly, Figure 3d illustrates the different behaviors of the S modules when supervised by different loss functions. Specifically, since the M3 architecture lacks periodicity information, the M3-S head is trained to be more sensitive, generating more and stronger false-positive peaks at non-beat positions (purple regions) compared to M1-S and M2-S. In contrast, with the additional benefit of gradient backpropagation through dPLP, M1-S behaves differently from M2-S and M3-S, suppressing many non-beat onsets (e.g., around the 730th, 790th, and 860th frames).
5. CONCLUSION
In this paper, we presented a differentiable variant of Predominant Local Pulse (dPLP) estimation, replacing the non-differentiable selection of an optimal windowed sinusoid with a softmax-based weighted summation. While dPLP behaves similarly to the original algorithm in terms of enhancing periodicity in the input signal, its differentiability enables seamless integration into deep learning pipelines and supports end-to-end training.
The main contribution of this work lies on a conceptual level, namely in the formulation of dPLP as an interpretable, flexible, and differentiable module for periodicity enhancement. To illustrate the behavior and potential benefits of dPLP in a controlled setting, we conducted a proof-of-concept experiment on beat tracking. As part of this setup, we also introduced a lightweight differentiable variant of the spectral flux method, which serves as a simple but trainable activity estimator. While this differentiable spectral flux is a minor contribution, it demonstrates how model-based components can be incorporated into learning frameworks.
Our experimental results highlight the potential of combining differentiable modules like dPLP with trainable feature extractors in an end-to-end fashion. In future work, we plan to integrate dPLP into more advanced architectures and further investigate its interaction with other system components. Overall, we believe that dPLP can serve as a valuable building block for improving the transparency, controllability, and interpretability of rhythm analysis and beat tracking systems.
6. ACKNOWLEDGEMENTS
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No. 500643750 (MU 2686/15-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS.
7. REFERENCES
[1] P. Grosche, M. Müller, and F. Kurth, “Cyclic tempogram – a mid-level tempo representation for music signals,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, USA, 2010, pp. 5522–5525.
[2] G. T. Toussaint, “The geometry of musical rhythm,” in Proceedings of the Japanese Conference on Discrete and Computational Geometry (JCDCG), Tokyo, Japan, 2004, pp. 198–212.
[3] P. Grosche and M. Müller, “Extracting predominant local pulse information from music recordings,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6, pp. 1688–1701, 2011.
[4] M. Müller and C.-Y. Chiu, “A basic tutorial on novelty and activation functions for music signal processing,” Transactions of the International Society for Music Information Retrieval (TISMIR), vol. 7, no. 1, pp. 179–194, 2024.
[5] P. Grosche and M. Müller, “A mid-level representation for capturing dominant tempo and pulse information in music recordings,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Kobe, Japan, Oct. 2009, pp. 189–194.
[6] S. P. Bhatta, S. Nagaraj Bharadwaj, S. Shadakshari, and A. Bhat, “Laya estimation for Hindustani classical vocals, devoid of rhythmic indicators,” in Proceedings of the International Conference on Electronics, Computing and Communication Technologies (CONECCT), Bangalore, India, 2024.
[7] P. Meier, S. Schwär, and M. Müller, “A real-time approach for estimating pulse tracking parameters for beat-synchronous audio effects,” in Proceedings of the International Conference on Digital Audio Effects (DAFx), Guildford, Surrey, UK, 2024, pp. 314–321.
[8] P. Grosche, M. Müller, and C. S. Sapp, “What makes beat tracking difficult? A case study on Chopin Mazurkas,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Utrecht, The Netherlands, 2010, pp. 649–654.
[9] P. Meier, C.-Y. Chiu, and M. Müller, “A real-time beat tracking system with zero latency and enhanced controllability,” Transactions of the International Society for Music Information Retrieval (TISMIR), vol. 7, no. 1, pp. 213–227, 2024.
[10] C.-Y. Chiu, M. Müller, M. E. P. Davies, A. W.-Y. Su, and Y.-H. Yang, “Local periodicity-based beat tracking for expressive classical piano music,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2824–2835, 2023.
[11] F. Krebs, S. Böck, and G. Widmer, “An efficient state-space model for joint tempo and meter tracking,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Malaga, Spain, 2015, pp. 72–78.
[12] S. Böck and M. E. P. Davies, “Deconstruct, analyse, reconstruct: How to improve tempo, beat, and downbeat estimation,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada, 2020, pp. 574–582.
[13] F. Foscarin, J. Schlüter, and G. Widmer, “Beat this! Accurate beat tracking without DBN postprocessing,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), San Francisco, CA, United States, 2024, pp. 962–969.
[14] M. Cu u i and M. Blondel, “So -DTW: a di e en-
iable loss unc ion o ime-se ies,” in P oceedings
o he In e na ional Con e ence on Machine Lea ning
(ICML), Sydney, NSW, Aus alia, 2017, pp. 894–903.
[15] M. Krause, C. Weiß, and M. Müller, “Soft dynamic
time warping for multi-pitch estimation and beyond,”
in Proceedings of the IEEE International Conference
on Acoustics, Speech, and Signal Processing
(ICASSP), Rhodes Island, Greece, 2023.
[16] J. Zeitler, S. Deniffel, M. Krause, and M. Müller,
“Stabilizing training with soft dynamic time warping:
A case study for pitch class estimation with weakly
aligned targets,” in Proceedings of the International
Society for Music Information Retrieval Conference
(ISMIR), 2023, pp. 433–439.
[17] J. Engel, R. Swavely, L. H. Hantrakul, A. Roberts,
and C. Hawthorne, “Self-supervised pitch detection
by inverse audio synthesis,” in International Conference
on Machine Learning (ICML), Workshop on
Self-Supervision in Audio and Speech, Vienna, Austria,
2020.
[18] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP:
Differentiable digital signal processing,” in Proceedings
of the International Conference on Learning
Representations (ICLR), Virtual, 2020.
[19] Y. Yang, M. Hira, Z. Ni, A. Astafurov, C. Chen,
C. Puhrsch, D. Pollack, D. Genzel, D. Greenberg,
E. Z. Yang, J. Lian, J. Hwang, J. Chen, P. Goldsborough,
S. Narenthiran, S. Watanabe, S. Chintala, and
V. Quenneville-Bélair, “Torchaudio: Building blocks
for audio and speech processing,” in Proceedings of the
IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), Virtual and Singapore,
2022, pp. 6982–6986.
[20] M. Leiber, Y. Marnissi, A. Barrau, and M. E. Badaoui,
“Differentiable adaptive short-time Fourier transform
with respect to the window length,” in Proceedings
of the IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), Rhodes
Island, Greece, 2023, pp. 1–5.
[21] ——, “Differentiable short-time Fourier transform with
respect to the hop length,” in IEEE Statistical Signal
Processing Workshop (SSP), Hanoi, Vietnam, 2023,
pp. 230–234.
[22] K. Thomas, “Just noticeable difference and tempo
change,” Journal of Scientific Psychology, vol. 2, pp.
14–20, 2007.
[23] M. Fuentes, B. McFee, H. C. Crayencour, S. Essid,
and J. P. Bello, “Analysis of common design choices in
deep learning systems for downbeat tracking,” in
Proceedings of the International Society for Music
Information Retrieval Conference (ISMIR), Paris, France,
2018, pp. 106–112.
[24] S. Böck, F. Krebs, and G. Widmer, “Joint beat and
downbeat tracking with recurrent neural networks,” in
Proceedings of the International Society for Music
Information Retrieval Conference (ISMIR), New York
City, New York, USA, 2016, pp. 255–261.
[25] G. Tzanetakis and P. Cook, “Musical genre classification
of audio signals,” IEEE Transactions on Speech
and Audio Processing, vol. 10, no. 5, pp. 293–302,
2002.
[26] U. Marchand and G. Peeters, “Swing ratio estimation,”
in Proceedings of the International Conference on
Digital Audio Effects (DAFx), Trondheim, Norway, 2015,
pp. 423–428.
[27] D. Desblancs, V. Lostanlen, and R. Hennequin,
“Zero-note samba: Self-supervised beat tracking,”
IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 31, pp. 2922–2934, 2023.
[28] Y. Hung, J. Wang, X. Song, W. T. Lu, and M. Won,
“Modeling beats and downbeats with a time-frequency
transformer,” in Proceedings of the IEEE International
Conference on Acoustics, Speech, and Signal
Processing (ICASSP), Virtual and Singapore, 2022,
pp. 401–405.
[29] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar,
E. Battenberg, and O. Nieto, “Librosa: Audio and
music signal analysis in Python,” in Proceedings of the
Python Science Conference, Austin, Texas, USA,
2015, pp. 18–25.
[30] M. Müller and F. Zalkow, “libfmp: A Python package
for fundamentals of music processing,” Journal
of Open Source Software (JOSS), vol. 6, no. 63, pp.
3326:1–5, 2021.
[31] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and
G. Widmer, “madmom: A new Python audio and music
signal processing library,” in Proceedings of the ACM
International Conference on Multimedia (ACM-MM),
Amsterdam, The Netherlands, 2016, pp. 1174–1178.
[32] C.-Y. Chiu, L. Liu, C. Weiß, and M. Müller,
“Cross-modal approaches to beat tracking: A case study on
Chopin Mazurkas,” Transactions of the International
Society for Music Information Retrieval (TISMIR),
vol. 8, no. 1, pp. 55–69, 2025.
[33] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon,
O. Nieto, D. Liang, and D. P. W. Ellis, “MIR_EVAL: A
transparent implementation of common MIR metrics,”
in Proceedings of the International Society for Music
Information Retrieval Conference (ISMIR), Taipei,
Taiwan, 2014, pp. 367–372.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025