
Revisiting Meter Tracking in Carnatic Music using Deep Learning Approaches

Author: Prabhu, Satyajeet
Publisher: Zenodo
DOI: 10.5281/zenodo.17304733
Source: https://zenodo.org/records/17304733/files/Satyajeet-Prabhu_SMC_2025_Master_Thesis.pdf
Master Thesis on Sound and Music Computing
Universitat Pompeu Fabra
“Revisiting Meter Tracking in Carnatic
Music using Deep Learning Approaches”
Satyajeet Prabhu
Supervisor: Martín Rocamora
Co-Supervisor: Thomas Nuttall
August 2025
Acknowledgments
I would like to express my sincere gratitude to Prof. Xavier Serra for giving me the opportunity to be part of this prestigious program despite my limited experience in software development. His encouragement to explore research in Indian Art Music has been a source of inspiration to me and many aspiring music computation researchers from India.
I am deeply grateful to my supervisor, Dr. Martín Rocamora, whose constant guidance over the two years of this program has been invaluable. From teaching one of the most engaging courses in the program to offering me an internship opportunity at the MTG and ultimately supervising my thesis, he has played a pivotal role in shaping me as a researcher.
I am thankful to my supervisor Thomas Nuttall - Tom, as he is affectionately known - whose support began even before the program, sparked by our meeting at the ISMIR satellite workshop in India in 2022. My gratitude also extends to Genís, Adithi, Oguz, Behzad, Esteban, Jyoti, Alia and all the other PhD students and researchers at MTG, who have always been willing to offer assistance and guidance in both professional and personal matters.
I also want to sincerely thank Ajay Srinivasamurthy, the author of the work on which this study is based, for being generous with his time and constantly offering his support despite his busy schedule.
It has been a privilege to study alongside my incredibly talented colleagues in the SMC Masters program, who I now proudly call friends. Special thanks to Anmol Mishra, now my co-author as well, for the banter and for the constant encouragement to take on new challenges and to Robin Doerfler for some of the most philosophical and intellectually stimulating conversations I have ever had.
Lastly, I am immensely thankful to my parents for their unwavering support in my musical (mis)adventures over the years, and to my close family and friends, who continually encourage me to keep exploring and growing.
Abstract
Beat and downbeat tracking, jointly referred to as Meter Tracking, is a fundamental task in Music Information Retrieval (MIR). Deep learning models have far surpassed traditional signal processing and classical machine learning approaches in this domain, particularly for Western (Eurogenetic) genres, where large annotated datasets are widely available. These systems, however, perform less reliably on underrepresented musical traditions.
Carnatic music, a rich tradition from the Indian subcontinent, is renowned for its rhythmic intricacy and unique metrical structures (tāḷas). The most notable prior work on meter tracking in this context employed probabilistic Dynamic Bayesian Networks (DBNs). The performance of state-of-the-art (SOTA) deep learning models on Carnatic music, however, remains largely unexplored.
In this study, we evaluate two models for meter tracking in Carnatic music: the Temporal Convolutional Network (TCN), a lightweight architecture that has been successfully adapted for Latin rhythms, and Beat This!, a transformer-based model designed for broad stylistic coverage without the need for post-processing. Replicating the experimental setup of the DBN baseline on the Carnatic Music Rhythm (CMRf) dataset, we systematically assess the performance of these models in a directly comparable setting. We further investigate adaptation strategies, including fine-tuning the models on Carnatic data and the use of musically informed parameters.
Results show that while off-the-shelf models do not always outperform the DBN, their performance improves substantially with transfer learning, matching or surpassing the baseline. These findings indicate that SOTA deep learning models can be effectively adapted to underrepresented traditions, paving the way for more inclusive and broadly applicable meter tracking systems.
Contents
Abstract
1 Introduction
1.1 Background
1.1.1 Metrical Structure in Music
1.1.2 Rhythm in Carnatic Music
1.2 Motivation
1.2.1 Specific Challenges in Carnatic Music
1.3 Research Question and Objectives
1.3.1 Research Questions
1.3.2 Objectives
2 State of the Art
2.1 Signal Processing Approach
2.2 Bayesian Approach
2.2.1 Bar Pointer model
2.2.2 Inference in Bayesian Meter Tracking
2.3 Deep Learning Approach
2.3.1 DNN Pipeline for Meter Tracking
2.3.2 Overview of Architectures
2.4 Evaluation
2.4.1 F-Measure
2.4.2 Continuity-based Metrics
3 DNN Models for Meter Tracking
3.1 Temporal Convolutional Network
3.1.1 Architectural Details
3.1.2 Adaptation and Generalization
3.1.3 Multi-task Learning Formulation
3.2 Beat This! : Tracker without Post Processing
3.2.1 Architectural Details
3.2.2 Shift-tolerant Loss
3.3 Practical Considerations: TCN vs Beat This!
4 Methodology
4.1 Dataset
4.2 Baseline
4.2.1 Baseline Setup
4.3 Experiment Setup
4.3.1 TCN
4.3.2 Beat This!
4.4 Musically Informed Strategies
4.4.1 Music-Informed Training
4.4.2 Music-Informed Post-Processing
5 Results and Discussion
5.1 Model-wise Performance
5.2 Tāḷa-wise Performance
5.3 Outlier Analysis
5.4 Tempo and Tāḷa Cycle Duration Effects
6 Conclusions and Future Work
6.1 Summary of the Study
6.2 Conclusions
6.3 Future Work
List of Figures
List of Tables
Bibliography
Appendices
A Software and Other Resources
B Detailed Analysis Plots
Chapter 1
Introduction
1.1 Background
Rhythm analysis is a central topic in Music Information Retrieval (MIR), aimed at computationally analysing or modelling the temporal structure of music. It encompasses a variety of tasks such as onset detection, tempo estimation, beat and downbeat tracking, pattern analysis, microtiming analysis and synchronization among others, which together enable a comprehensive understanding of musical timing.
This work focuses on the task of automatic estimation of beats and downbeats, commonly referred to as Meter Tracking, critical to several higher-level MIR tasks such as music segmentation and structural analysis, as well as applications such as DJ mixing and automatic beat matching.
1.1.1 Metrical Structure in Music
Rhythm in music is perceived as pulsations organized at multiple hierarchical levels of differing timespans, known as its meter or metrical structure [Bilmes, 1993, London, 2012]. These levels range from very fast subdivisions to larger organizational units (for example, see figure 1). The different metrical levels are described as follows:
Figure 1: Perceived metrical levels in ’Twinkle, Twinkle, Little Star’
Tatum: The fastest regular pulse in the music that listeners can perceive as a meaningful subdivision of rhythm. Often corresponds to a 16th note in Western music, but the specific duration depends on the tempo and style.
Tactus (Beat): The perceptually most salient pulse level that a listener would naturally tap their foot to. It typically corresponds to quarter notes in Western music but again depends on tempo and context. The beat level is central to most rhythm perception tasks and often serves as the reference level for tempo.
Meter (Bar, Measure, Cycle): A grouping of beats into a recurring structure that establishes musical phrasing and form. Measures are typically marked by accent patterns and serve to shape listeners’ expectations of timing and emphasis.
In Western music, meter is commonly represented using time signatures, such as 4/4, which indicates four beats per measure, with each beat typically being a quarter note in duration (see figure 2). Other common meters include 3/4 (e.g., waltz).
Downbeat: The first beat of a bar or cycle, often marked by a strong accent or structural change. It acts as a temporal anchor and plays a crucial role in conveying the start of a measure. Accurate downbeat perception is essential to understanding musical form and phrasing.
Figure 2: Musical meter in Western music. Figure from Wikipedia, Meter (music).
1.1.2 Rhythm in Carnatic Music
Carnatic music, one of the two principal traditions of Indian art music (IAM), is predominantly practiced and appreciated in the southern regions of the Indian subcontinent. It is distinguished by its dedicated audiences, sophisticated theoretical framework, and high level of musicianship.
Traditionally, training in Carnatic music is transmitted orally through a lineage of teachers, with a strong emphasis on performance and improvisation. A typical Carnatic performance features a lead performer, a rhythmic accompaniment, a continuous background drone, and a melodic accompaniment. Unlike Western tonal music, Carnatic music does not employ harmony; instead, it is structured around the melodic framework of rāga and the rhythmic framework of tāḷa [Sambamoorthy, 1998]. Consequently, Carnatic music has become an important subject in MIR research as it presents unique challenges and opportunities for computational analysis [Rao et al., 2023].
Rhythmic organization in Carnatic music is governed by the tāḷa system, a hierarchical framework of time cycles that underlies melodic and rhythmic phrasing as well as improvisation. Within each tāḷa cycle, sub-structures are defined to track progression through the cycle. While there are some conceptual parallels with Western metrical structures, the terminology and organization within the tāḷa system differ significantly. Table 1 presents approximate correspondences between metrical hierarchies in Western and Carnatic frameworks.
Western     Carnatic
Tatum       akṣara
Beat        (indicated by hand gestures)
Measure     āvartana
Downbeat    sama
Table 1: Mapping of Western and Carnatic rhythmic concepts
Moreover, the tāḷa framework includes elements unique to the Carnatic tradition that lack direct equivalents in Western metrical theory, resulting in 175 theoretically possible tāḷas. In practice, however, a core set of 35 tāḷas is predominantly used in performance and pedagogy. Table 2 lists the four most commonly employed tāḷas in Carnatic music, alongside their total number of beats per cycle.
Tāḷa            #Beats
Ādi             8
Rūpaka          3
Miśra chāpu     7
Khaṇḍa chāpu    5
Table 2: Popular tāḷas in Carnatic music
Figure 3 illustrates some of the concepts from table 1 using the example of an Ādi tāḷa cycle (8 beats). It also demonstrates how beats are further grouped into sections called aṅgas. The progression through the tāḷa cycle is marked by distinctive hand gestures, which indicate both individual beats and the different types of sections within the cycle.
Figure 3: Illustration of Ādi tāḷa. Figure from [Srinivasamurthy, 2016]
The model also features prominently in Ajay Srinivasamurthy’s 2016 doctoral thesis [Srinivasamurthy, 2016] on automatic rhythm analysis of Indian Art Music, which remains the most comprehensive study of meter tracking in Carnatic music to date.
The bar pointer model introduces a hypothetical "pointer" that moves through the metrical cycle and resets at the downbeat.
Figure 5: The Bar Pointer model. Figure taken from [Srinivasamurthy, 2016].
Hidden Variables in the BP-model
• Bar Position ($\phi_k$): Variable indicating position in the bar; $\phi = 0$ denotes the downbeat.
• Tempo ($\dot{\phi}_k$): Rate of progression of the pointer through the bar; modeled stochastically to allow for natural tempo fluctuation.
• Rhythmic Pattern Index ($r_k$): Encodes discrete rhythmic templates, capturing expected accent structures across different metrical styles.
Transition Model
The transition model defines how the hidden state evolves over time:
$$P(x_k \mid x_{k-1}) = P(\phi_k \mid \phi_{k-1}, \dot{\phi}_{k-1}, r_{k-1}) \cdot P(\dot{\phi}_k \mid \dot{\phi}_{k-1}) \cdot P(r_k \mid r_{k-1}, \phi_k, \phi_{k-1})$$
The first term updates the bar position $\phi_k$ based on the previous position $\phi_{k-1}$ and tempo $\dot{\phi}_{k-1}$. The second term enforces smooth tempo changes by modelling $\dot{\phi}_k$ based on $\dot{\phi}_{k-1}$. The third term allows the rhythmic pattern $r_k$ to change, but only at the end of a bar (i.e., when $\phi_k < \phi_{k-1}$).
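The three terms can be illustrated with a noise-free sketch of a single state update. It assumes that the bar position is normalised to [0, 1) and wraps around with a modulo, that the tempo is expressed as a bar fraction per frame, and that the tempo stays constant between frames; the function name `step_bar_pointer` and the `choose_pattern` callback are hypothetical and not taken from the cited works.

```python
def step_bar_pointer(phi_prev, tempo_prev, r_prev, choose_pattern=None):
    """Advance the hidden state (phi, tempo, r) by one frame, without noise.

    phi_prev   -- bar position in [0, 1); 0 is the downbeat (assumed scaling)
    tempo_prev -- bar fraction advanced per frame (assumed unit)
    r_prev     -- index of the current rhythmic pattern
    """
    # First term: the position advances by the previous tempo and wraps at the bar end.
    phi = (phi_prev + tempo_prev) % 1.0
    # Second term: tempo evolves smoothly; this sketch simply keeps it constant.
    tempo = tempo_prev
    # Third term: the pattern may change only when the pointer has wrapped around.
    crossed_bar = phi < phi_prev
    if crossed_bar and choose_pattern is not None:
        r = choose_pattern(r_prev)
    else:
        r = r_prev
    return phi, tempo, r


# The pointer wraps past the downbeat, so the pattern is allowed to change.
state = step_bar_pointer(0.9, 0.2, 0, choose_pattern=lambda r: r + 1)
```

A probabilistic model would replace the constant-tempo line with a random perturbation and `choose_pattern` with a draw from a pattern transition matrix.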
Observation Model
The observation model $P(y_k \mid x_k)$ defines the likelihood of observing feature $y_k$ given the current state. It is often implemented using Gaussian Mixture Models (GMMs) trained on bar-position-aligned rhythmic patterns derived from annotated data. The model captures how likely an onset or spectral event is to occur at each position in the bar, for each pattern.

2.2.2 Inference in Bayesian Meter Tracking
Once a Bayesian model (like the bar pointer model) is defined, the core computational task is inference. Given the observations $y_{1:K}$, the goal is to estimate the hidden state sequence $x_{1:K}$ (tempo, bar position, and rhythmic pattern) that best explains the observed audio features.
Goal: $$\arg\max_{x_{1:K}} P(x_{1:K} \mid y_{1:K})$$
Depending on how the hidden state space is modeled (discretely or continuously), different inference techniques are used. The two dominant approaches are:
Viterbi Decoding
The Viterbi algorithm is a dynamic programming method that finds the single most likely, i.e. Maximum A Posteriori (MAP), sequence of hidden states. It assumes a discrete state space - the bar position $\phi$, tempo $\dot{\phi}$, and rhythmic pattern $r$ are discretized into a fixed grid.
This approach provides exact inference under the discrete model and is efficient when the state space is moderately sized. However, it becomes computationally expensive with fine discretization, for example in cases with long bars, and it is inflexible in real-time or online settings. To tackle these scalability challenges, Krebs et al. proposed an Efficient State Space Model [Krebs et al., 2015] that restructures the original bar pointer model, resulting in better accuracy and drastically reduced computational complexity.
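As an illustration of MAP decoding over a discretized state space, the following is a minimal log-domain Viterbi sketch for a generic discrete hidden Markov model. It is not the bar pointer implementation from the cited works: the states are plain integer indices, and the toy probabilities in the example are invented for illustration.

```python
import math


def viterbi(log_init, log_trans, log_obs):
    """Return the most likely (MAP) state sequence of a discrete HMM.

    log_init[s]     -- log P(x_1 = s)
    log_trans[p][s] -- log P(x_k = s | x_{k-1} = p)
    log_obs[k][s]   -- log P(y_k | x_k = s)
    """
    n_states = len(log_init)
    # Best log-probability of any path ending in each state at the first frame.
    delta = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backptr = []
    for k in range(1, len(log_obs)):
        new_delta, ptr = [], []
        for s in range(n_states):
            # Pick the predecessor that maximises the path score into state s.
            best_prev = max(range(n_states), key=lambda p: delta[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_delta.append(delta[best_prev] + log_trans[best_prev][s] + log_obs[k][s])
        delta = new_delta
        backptr.append(ptr)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy example with two states that tend to persist (values are illustrative).
log = math.log
init = [log(0.5), log(0.5)]
trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
obs = [[log(0.9), log(0.1)], [log(0.8), log(0.2)],
       [log(0.1), log(0.9)], [log(0.2), log(0.8)]]
best_path = viterbi(init, trans, obs)  # [0, 0, 1, 1]
```

Each frame costs on the order of the squared number of states, which is why fine discretization of position and tempo quickly becomes expensive in this naive form.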
Particle Filtering
When the hidden state space is modeled as continuous (or very high-dimensional), exact inference becomes intractable. Particle filtering provides an approximate solution by using a set of weighted samples, called particles, each representing a possible state trajectory. In other words, each particle represents a hypothesis of the hidden state at time $k$: $x_k^{(i)} = [\phi_k^{(i)}, \dot{\phi}_k^{(i)}, r_k^{(i)}]$.
Particle filtering naturally incorporates uncertainty and multimodality, such as multiple possible tempo hypotheses, making it better suited for online or real-time applications. However, it is computationally intensive, requires tuning the number of particles, and since it is an approximate method, the results may vary between runs.
2.3 Deep Learning Approach
Currently, data-driven deep learning approaches dominate the landscape of the meter tracking task as they offer several advantages. Deep Neural Networks (DNN) are capable of learning complex representations from raw input data, processing large-scale datasets more efficiently, generalising better across data and multi-task learning. The availability of GPUs and specialized deep learning frameworks (e.g., TensorFlow, PyTorch) has made training and deploying DNNs more practical.
2.3.1 DNN Pipeline for Meter Tracking
Figure 6: DNN based meter tracking pipeline. Figure adapted from Tempo, Beat and Downbeat ISMIR Tutorial 2021 [Davies et al., 2021].
A typical pipeline of a DNN-based meter tracking system (see Figure 6) consists of two stages - feature learning and temporal decoding. DNNs first learn features from the input audio or its time-frequency representation and output an activation or salience function containing the possible beat and downbeat candidates. This is similar to the novelty function, but while the novelty function is derived from hand-crafted features, the activation function is produced by the network’s complex internal representation.
The output activations of DNNs are often noisy and cannot be directly used for predictions. DBNs are proven in their ability to impose temporal consistency and metrical structure and are commonly used as a post-processing step to infer beats and downbeats from DNN activations. However, DBNs can also introduce several limitations due to their inherent properties. They do not work for music with time signature changes, tempo changes outside the prescribed range, and metric structures not represented in the state space. To overcome bias introduced by DBNs and generalise across music genres, recent efforts have attempted to remove this post-processing stage.
2.3.2 Overview of Architectures
Since beat and downbeat estimation is a sequence modelling problem, the most successful architectures applied to this task include Recurrent Neural Networks (RNNs), Temporal Convolutional Networks (TCNs) and transformer-based models, all of which are well-suited to capturing temporal dependencies in musical signals.
Böck et al. [Böck et al., 2016] utilise RNN, specifically Bidirectional Long Short-Term Memory (BLSTM) architecture, for a supervised classification task to simultaneously detect beats and downbeats. This significant work outperformed standalone DBN-based meter tracking, especially on the downbeat detection task for most Western music datasets.
Convolutional Neural Networks (CNN) are known to excel at extracting local features, such as transients, while having a relatively low model complexity. However, they suffer from a lack of long-term context, which makes it difficult to identify global rhythmic structures. Hybrid approaches that incorporate both spatial and temporal understanding are, therefore, utilised for meter tracking. BeatNet [Heydari et al., 2021] uses CRNN (Convolutional Recurrent Neural Network), which combines CNNs for feature extraction and recurrent layers for sequential modelling.
Temporal Convolutional Network has emerged as another powerful architecture for beat and downbeat tracking. TCNs utilise convolutional layers with dilations to achieve a large receptive field, allowing them to model long temporal contexts efficiently.
More recently, the transformer architecture - originally successful in natural language processing - has been applied to meter tracking. Transformers utilise a self-attention mechanism that allows the model to weigh the importance of different parts of the input sequence when making predictions. This enables them to capture both local and global dependencies effectively while covering the entire input sequence. Hung et al. [Hung et al., 2022] employ a spectral-temporal transformer (SpecTNT) architecture for this task. Beat This!, a transformer-based system that removes the post-processing stage, achieves state-of-the-art beat and downbeat tracking performance on a number of Western music datasets.
2.4 Evaluation
The evaluation of meter tracking systems typically involves comparing predicted beat and downbeat times against annotated ground truth. In order to account for the inherent imprecision in annotations and musical events, most evaluation metrics allow a tolerance window around the annotated times.
2.4.1 F-Measure
The F-measure, also known as the F1-score, evaluates the accuracy of predicted beat times by comparing them to ground truth annotations within a fixed temporal tolerance window (commonly ±70 ms). This metric intends to provide a measure of how many beats are correctly predicted without over or under predicting. Downbeats are evaluated similarly, but due to their lower frequency, errors are more impactful.
It is defined in terms of:
True Positives ($N_{TP}$): Number of predicted beats that fall within the tolerance window of a ground-truth beat.
False Positives ($N_{FP}$): Number of predicted beats that do not match any ground-truth beat within the tolerance window.
False Negatives ($N_{FN}$): Number of ground-truth beats for which no predicted beat lies within the tolerance window.
Figure 7: Tolerance window for F-measure. Figure from Tempo, Beat and Downbeat ISMIR Tutorial 2021 [Davies et al., 2021].
Precision and Recall are defined as:
$$\text{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}}, \quad \text{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}}$$
Then, the F-measure is the harmonic mean of precision and recall:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
While the F-measure is a widely used and intuitive metric, it is prone to systematic issues that can give a misleading impression of tracking quality. For instance, changing the size of the tolerance window can dramatically change the value of the measure. As a result of using a fixed tolerance window, beats inside the window are considered accurate regardless of their position inside the window. So, predictions that are consistently offset from the annotations would still result in a high F1 score.
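The definitions above, and the offset issue in particular, can be made concrete with a short sketch. It assumes a simple greedy one-to-one matching of sorted beat times; the function name `beat_f_measure` is hypothetical, and established evaluation packages may differ in matching and preprocessing details.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """F-measure for beat times in seconds with a fixed +/- tolerance window."""
    ref, est = sorted(reference), sorted(estimated)
    i = j = n_tp = 0
    # Walk through both sorted lists, pairing each prediction with at most one annotation.
    while i < len(ref) and j < len(est):
        if abs(est[j] - ref[i]) <= tolerance:
            n_tp += 1          # prediction falls inside the window: true positive
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1             # prediction with no annotation nearby: false positive
        else:
            i += 1             # annotation with no prediction nearby: false negative
    if n_tp == 0:
        return 0.0
    n_fp = len(est) - n_tp
    n_fn = len(ref) - n_tp
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    return 2 * precision * recall / (precision + recall)


annotations = [0.5, 1.0, 1.5, 2.0]
late_by_60ms = [t + 0.06 for t in annotations]
score_default = beat_f_measure(annotations, late_by_60ms)                 # 1.0
score_tight = beat_f_measure(annotations, late_by_60ms, tolerance=0.05)   # 0.0
```

In this example every prediction is 60 ms late, yet the score is perfect with the ±70 ms window and drops to zero with a ±50 ms window, which illustrates both sensitivities described above.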
2.4.2 Continuity-based Metrics
Continuity-based metrics were introduced to address some of these gaps by evaluating not just alignment accuracy, but also the consistency of metrical phase and tempo over extended regions. That is, evaluating not just whether beats are detected, but whether they are detected consistently across time and at the correct metrical level. This is especially important for applications such as real-time tracking, where maintaining stable and accurate beat information over time is crucial for synchronization and responsiveness.
Continuity Criteria
A predicted beat at time $\hat{b}_i$ is considered accurate only if it satisfies two conditions:
1. The predicted beat $\hat{b}_i$ must lie within a predefined tolerance window around the corresponding ground-truth beat $b_i$. This window is not absolute but relative to the inter-beat interval (IBI), typically set to ±17.5% of the local IBI.
2. The preceding beat $\hat{b}_{i-1}$ must also fall within its own tolerance window. Furthermore, the IBI between $\hat{b}_{i-1}$ and $\hat{b}_i$ must be consistent with the IBI between $b_{i-1}$ and $b_i$.
These conditions together define what is referred to as a continuous segment: a sequence of at least three consecutive beats that are temporally aligned, metrically consistent, and phase-correct. Only such segments contribute to the continuity-based metrics.
Metrical Ambiguity
Continuity metrics are designed to be sensitive to a range of metrical errors which may all have similar F-measure values but vastly different perceptual implications. To achieve this, continuity-based metrics introduce metrical variants of the reference annotation grid and evaluate predictions against each variant. The highest resulting score is selected. Commonly used metrical variants include:
• Same metrical level, in-phase (i.e., beats align exactly with annotations)
• Same metrical level, off-phase (i.e., beats occur halfway between annotations)
• Double tempo (twice the annotated metrical level)
• Half tempo (even-phase) (every other annotation starting from the first)
• Half tempo (odd-phase) (every other annotation starting from the second)
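The variants listed above can be derived mechanically from the annotated beat times. The sketch below assumes that off-phase and double-tempo beats are obtained by taking midpoints between consecutive annotations; the function name and dictionary keys are hypothetical.

```python
def metrical_variants(annotations):
    """Build alternative reference grids from annotated beat times (seconds)."""
    beats = sorted(annotations)
    # Midpoints between consecutive annotations (assumed linear interpolation).
    midpoints = [(a + b) / 2 for a, b in zip(beats, beats[1:])]
    return {
        "in_phase": beats,                           # annotations as given
        "off_phase": midpoints,                      # halfway between annotations
        "double_tempo": sorted(beats + midpoints),   # annotations plus midpoints
        "half_tempo_even": beats[0::2],              # every other beat from the first
        "half_tempo_odd": beats[1::2],               # every other beat from the second
    }


variants = metrical_variants([0.0, 1.0, 2.0, 3.0])
# variants["off_phase"] is [0.5, 1.5, 2.5]; variants["half_tempo_odd"] is [1.0, 3.0]
```

A lenient score would then be the maximum of the strict score evaluated against each of these grids.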
Definitions of Continuity Metrics
Let $N^{seg}_{correct}$ be the number of beats in the longest continuous correct segment, and $N^{all}_{correct}$ be the total number of correct beats (even across multiple segments). Four metrics are derived from this principle, distinguishing between strict (annotated) and lenient (allowed) metrical levels:
• CMLc (Correct Metrical Level - continuous): $$\text{CMLc} = \frac{N^{seg}_{correct}}{N_{pred}}$$
• CMLt (Correct Metrical Level - total): $$\text{CMLt} = \frac{N^{all}_{correct}}{N_{pred}}$$
• AMLc (Allowed Metrical Levels - continuous): Same as CMLc, but allows metrical ambiguities.
• AMLt (Allowed Metrical Levels - total): Same as CMLt, but allows metrical ambiguities.
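Given a per-beat verdict from the continuity criteria, the two ratios reduce to counting. The sketch below assumes that $N_{pred}$ is the number of predicted beats and that the input is one boolean flag per predicted beat; it does not enforce the minimum segment length, and the function name is hypothetical.

```python
def continuity_scores(correct_flags):
    """Return (continuous, total) scores from per-beat correctness flags.

    correct_flags -- one boolean per predicted beat, True if the beat satisfies
                     the continuity criteria (assumed to be computed elsewhere).
    """
    n_pred = len(correct_flags)
    if n_pred == 0:
        return 0.0, 0.0
    longest = run = 0
    for ok in correct_flags:
        run = run + 1 if ok else 0    # extend the current run or reset it
        longest = max(longest, run)   # remember the longest run seen so far
    continuous = longest / n_pred           # longest correct segment / predictions
    total = sum(correct_flags) / n_pred     # all correct beats / predictions
    return continuous, total


flags = [True, True, True, False, True, True, False, True]
cml_c, cml_t = continuity_scores(flags)  # (0.375, 0.75)
```

The AML variants would apply the same aggregation to each allowed metrical variant of the annotations and keep the highest score.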
Low continuity scores - especially when paired with a high F-measure - suggest that predictions are fragmented or metrically inconsistent, even if individual beats are frequently close to annotations. Comparing CML and AML variants can also reveal whether a system is making metrical-level errors (e.g., consistently tracking at half or double tempo) that still result in perceptually acceptable output. Overall, continuity metrics offer a more structurally aware evaluation than frame-level accuracy alone.
Chapter 3
DNN Models for Meter Tracking
This work focuses on two main architectures: Temporal Convolutional Network and Beat This!. The following sections take a close look at each system, explaining their key components and how they approach the task of meter tracking.
3.1 Temporal Convolutional Network
TCNs have been shown to outperform traditional RNN-based models such as BLSTMs in meter tracking tasks. The TCN architecture uses dilated convolutions to model temporal dependencies, allowing the model to process audio sequences in parallel. Unlike BLSTMs, which are inherently sequential and thus difficult to parallelize, TCNs enable parallel training across time steps, significantly reducing training times and computational costs. Through dilated convolutions, TCNs are capable of modelling long-range temporal dependencies (spanning entire bars or phrases) with significantly fewer parameters. These characteristics make TCNs not only more scalable but also better suited for real-time or low-latency applications. Figure 8 shows an overview of beat tracking pipelines for the two architectures.
Figure 8: Comparison of BLSTM and TCN architectures for beat tracking. Figure taken from [Davies and Böck, 2019].
3.1.1 Architectural Details
There are two main components at the heart of a TCN-based meter tracker:
Convolutional Block
The convolutional block acts as the front end feature extractor in the TCN-based meter tracking pipeline. Its role is to transform the input spectrogram into a more compact and informative set of learned features that emphasize the spectral-temporal patterns relevant to rhythm perception. Importantly, all convolution operations are performed without temporal downsampling. The convolutional block is designed to reduce spectral dimensionality while preserving the temporal resolution that is critical for tracking beat-related events. As seen in figure 9, a typical convolutional block includes:
• Multiple 2D convolutional layers, each with a small kernel size (e.g., 3×3) to capture local time-frequency patterns.
• Pooling along the frequency axis, which compresses the spectral dimension while maintaining the original temporal resolution.
• Nonlinear activation functions, such as ELU, applied after each convolution to introduce nonlinearity.
Figure 9: Convolutional block in a TCN-based meter tracker. Figure taken from [Böck and Davies, 2020].
TCN Block
The input to the TCN is a highly sub-sampled feature vector derived from the magnitude spectrogram by the convolutional block, but which retains the same temporal resolution. The TCN block is the core temporal modelling component of the architecture. Its primary function is to model the sequential dependencies and periodic structures required for beat and downbeat prediction. It does so by learning filters via dilated convolution. Dilation is equivalent to skipping samples in the input sequence.
In a standard 1D convolution, each filter “slides” across the time axis of the input feature map, processing a local window (e.g., 3 frames) at each step. This is analogous to scanning for repeating rhythmic motifs. However, to model longer contexts, TCNs introduce dilated convolutions. A dilation defines the spacing between the elements in the filter’s receptive field. For example, referring to figure 10:
A dilation of 1 corresponds to adjacent time steps ($t-1$, $t$, $t+1$).
A dilation of 2 looks at every second time step ($t-2$, $t$, $t+2$).
A dilation of 4 expands further ($t-4$, $t$, $t+4$).
By stacking layers with exponentially increasing dilations (e.g., 1, 2, 4, 8...), the network can effectively model patterns over a large time span without a proportional increase in the number of parameters.
Figure 10: Temporal Convolutional Network. Figure taken from ISMIR 2021 tutorial on Tempo, Beat, and Downbeat Estimation [Davies et al., 2021].
Advantages for Meter Tracking:
• Temporal resolution is preserved: Unlike RNNs, TCNs can maintain the full temporal granularity of the input.
• Efficient long-term modelling: Due to dilation, a TCN with 10 layers and kernel size 3 can access on the order of $2^{10} = 1024$ time steps (several seconds of music) without loss of resolution.
• Parallel training: All time steps can be processed simultaneously, making the model highly suitable for GPU acceleration.
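The growth of the temporal context can be checked with a few lines of arithmetic. The sketch below assumes non-causal filters centred on the current frame and exactly one dilated convolution per layer, with the dilation doubling from layer to layer; both function names are hypothetical, and actual TCN implementations may arrange their layers differently.

```python
def dilated_taps(t, dilation, kernel_size=3):
    """Time indices seen by one non-causal dilated filter centred on frame t."""
    half = kernel_size // 2
    return [t + k * dilation for k in range(-half, half + 1)]


def receptive_field(num_layers, kernel_size=3):
    """Frames covered by stacked layers with dilations 1, 2, 4, ..., 2**(num_layers - 1)."""
    dilations = [2 ** i for i in range(num_layers)]
    # Each layer widens the context by (kernel_size - 1) * dilation frames.
    return 1 + (kernel_size - 1) * sum(dilations)


taps = dilated_taps(10, dilation=4)   # [6, 10, 14]
span = receptive_field(10)            # 2047
```

Under these assumptions, ten layers cover 2047 frames, i.e. roughly $2^{10}$ frames on either side of the centre frame, while the number of filter weights grows only linearly with the number of layers.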
3.1.2 Adaptation and Generalization
In the context of meter tracking, Davies and Böck [Davies and Böck, 2019] first successfully repurposed the TCN design inspired by WaveNet [Van Den Oord et al.,
5. Preservation of Tāḷa Distribution: The experimental setup preserves the tāḷa distribution in each fold. Specifically, the distribution of tāḷas in the training and test folds mirrors the distribution of tāḷas in the full dataset.
4.3 Experiment Setup
For both models under evaluation, we first replicate the data splits and training setup as described in the previous section. Additionally, we establish common procedural guidelines to ensure a fair and consistent comparison between the models:
• The dataset, comprising 176 samples, is divided into two predetermined folds of 88 examples each, identical to those used in the baseline experiment’s two-fold cross-validation scheme. In each iteration, one fold is used for training while the other serves as the test set, with the folds alternating roles between iterations.
• The train fold is further subdivided into training (80%) and validation (20%) subsets. Consequently, each fold contains 70 training examples and 18 validation examples.
• We perform three training runs per fold. To ensure reproducibility of validation splits and network initializations, we set predetermined random seeds [42, 52, 62] for each respective run. As a result, six distinct models are generated for every training strategy, and the results are reported as the mean performance across these six models.
• Validation loss is employed as the primary metric for monitoring training progress and for early stopping. Training is terminated when no improvement in validation loss is observed.
• The models are evaluated using the F-measure as well as the continuity metrics CMLt and AMLt for both beat and downbeat. The evaluation process is carried out using the Python package mir_eval [Raffel et al., 2014].
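The bookkeeping behind these guidelines can be sketched as follows. This is an illustrative reconstruction, not the code used in the study: the fold contents are placeholder identifiers, the shuffle uses Python's `random` module, and the function names are hypothetical. Only the fold sizes, the 80/20 split and the seeds come from the text.

```python
import random

SEEDS = [42, 52, 62]  # seeds stated in the text


def split_train_val(train_fold, seed, val_fraction=0.2):
    """Shuffle one fold reproducibly and split it into training and validation parts."""
    items = list(train_fold)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_fraction)  # 88 * 0.2 = 17.6, rounded to 18
    return items[n_val:], items[:n_val]


def plan_runs(fold_a, fold_b, seeds=SEEDS):
    """Enumerate the runs: two fold assignments times three seeds gives six models."""
    runs = []
    for train_fold, test_fold in ((fold_a, fold_b), (fold_b, fold_a)):
        for seed in seeds:
            train, val = split_train_val(train_fold, seed)
            runs.append({"seed": seed, "train": train, "val": val,
                         "test": list(test_fold)})
    return runs


# Placeholder identifiers standing in for the 176 recordings.
fold_1 = [f"track_{i:03d}" for i in range(88)]
fold_2 = [f"track_{i:03d}" for i in range(88, 176)]
runs = plan_runs(fold_1, fold_2)
```

Each of the six runs then has 70 training, 18 validation and 88 test examples, and the reported score is the mean over the six resulting models.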
4.3.1 TCN
Model
The experimental setup of the TCN employed in this study is based on the open-source implementation of Deconstruct, Analyse, Reconstruct [Böck and Davies, 2020] made available by the authors as part of the ISMIR 2021 tutorial on Tempo, Beat and Downbeat Estimation. This implementation was subsequently repurposed in Adapting Meter Tracking Models to Latin American Music [Maia et al., 2022], and an updated, user-friendly version is provided in the Tutorial for LAMIR 2024 Hackathon [Morais et al., 2024]. The present study utilises these prior works and their respective experimental setups as the basis for the TCN implementation.
Trainable Parameters : 72.3K

Training Strategies
For the TCN model, we evaluate three strategies inspired by [Maia et al., 2022].
• Baseline (TCN-BL)
First, we train a model on the popular Western datasets for meter tracking - GTZAN, Ballroom, Beatles and RWC datasets following [Maia et al., 2022]. This model is assumed to be a good starting point for a baseline evaluation of the TCN on Carnatic data as well as for subsequent transfer learning experiments. Following protocol, we perform three training runs and report mean performance on the CMR dataset.
• Fine-tuning (TCN-FT)
Under this strategy, the model with the lowest validation loss from TCN-BL is used as a starting point for fine-tuning the network on Carnatic data. The assumption is that, although the model was pre-trained on Western datasets, it has learned a representation that can be adapted to a different musical tradition, as demonstrated in [Maia et al., 2022].
• Training from Scratch (TCN-FS)
This strategy involves training a randomly initialized network (using one of the predefined seed values) from scratch on each fold.
Loss Function
We employ a simple loss function defined as the sum of the binary cross-entropy
(BCE) of beat and downbeat predictions:
$$L = \mathrm{BCE}_{\mathrm{beat}} + \mathrm{BCE}_{\mathrm{downbeat}}$$
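As a minimal sketch of this loss, the snippet below computes the two BCE terms in plain Python and adds them. Averaging each term over frames and clamping probabilities with `eps` are assumptions made for the illustration; the function names and the toy activations are hypothetical and are not taken from the thesis code.

```python
import math


def bce(targets, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(targets)


def meter_loss(beat_t, beat_p, down_t, down_p):
    """Sum of the beat and downbeat BCE terms."""
    return bce(beat_t, beat_p) + bce(down_t, down_p)


# Hypothetical frame-wise targets and network activations.
beat_targets = [1, 0, 1, 0]
beat_probs = [0.9, 0.1, 0.8, 0.2]
downbeat_targets = [1, 0, 0, 0]
downbeat_probs = [0.7, 0.1, 0.2, 0.1]

loss = meter_loss(beat_targets, beat_probs, downbeat_targets, downbeat_probs)
```

Because the two terms are unweighted, beat and downbeat errors contribute equally to the gradient even though downbeat targets are much sparser.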
Table 6 provides a summary of the training configuration settings employed across
the different TCN strategies.
Acronym  Strategy                Models Trained  Epochs  Early Stoppage  Learning Rate  LR Reduction (Factor)
TCN-BL   TCN Baseline            3               100     20              0.005          0.2
TCN-FS   Train from Scratch      6               100     20              0.005          0.2
TCN-FT   Finetune from Baseline  6               50      10              0.001          0.2
Table 6: Training configurations for the TCN strategies
Post-Processing
For post-processing network activations, a DBN-based post-processor is used. The
Python library madmom [Böck et al., 2016] offers an open-source joint beat and
downbeat DBN post-processor approximated by a Hidden Markov Model (HMM),
based on [Böck et al., 2016, Krebs et al., 2015]. In this work, we use its offline mode
utilising the Viterbi algorithm for inference.
4.3.2 Beat This!
Model
For Beat This!, we use the stock implementation for baseline evaluation. For fine-
tuning, however, we build upon a modified implementation that facilitates fine-
tuning of the stock model. Despite these modifications, we retain the default training
configurations, including data augmentation schemes, and utilise the pre-trained
models provided by the original authors.
Trainable Parameters: 20.3M
Training Strategies
We adopt only two strategies: Baseline and Fine-tuning. Given that Beat This! is
a transformer-based architecture, it is highly data-intensive, which makes training
from scratch on a single dataset impractical.
• Baseline (BeatThis-BL)
We utilise the three pre-trained checkpoints provided with the stock model,
namely final0, final1 and final2, all of which have been trained on a large
corpus comprising 18 different datasets (excluding GTZAN). We evaluate these
models on the CMR dataset and report the mean performance across all three
models as the baseline performance of Beat This!.
• Fine-tuning (BeatThis-FT)
The fine-tuning process begins with the default (final0) checkpoint as the pre-
trained baseline, which is then further fine-tuned over the course of 50 epochs.
Lastly, we use the built-in Shift-tolerant weighted BCE loss and skip post-processing.
4.4 Musically Informed Strategies
Incorporating musicological insights into meter tracking systems can enhance their
performance. This section explores strategies used during training and post-
processing to help our DNN meter tracking systems adapt better to Carnatic Music.
4.4.1 Music-Informed Training
These strategies, applied at the training stage, aim to ensure that the model is exposed
to rhythmic diversity consistently during the training process:
Stratified Tāḷa-based Train/Validation Split
To ensure consistent performance across tāḷas, we implement a stratified
train/validation split based on tāḷas. This balances representation of all tāḷas in
training and validation, enabling per-tāḷa error analysis and targeted improvements
for under-performing tāḷas.
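A stratified split of this kind can be sketched with the standard library alone. The validation fraction of 0.2, the seed, the function name, and the toy track metadata below are all assumptions for illustration; the thesis does not state these values here.

```python
import random
from collections import defaultdict


def stratified_split(track_talas, val_fraction=0.2, seed=0):
    """Split track ids so that each tala keeps roughly the same share in both sets."""
    by_tala = defaultdict(list)
    for track_id, tala in track_talas.items():
        by_tala[tala].append(track_id)

    rng = random.Random(seed)
    train, val = [], []
    for tala in sorted(by_tala):
        ids = sorted(by_tala[tala])
        rng.shuffle(ids)
        # keep at least one validation track per tala
        n_val = max(1, round(len(ids) * val_fraction))
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val


# Hypothetical metadata: ten tracks for each of the four talas.
talas = ("adi", "rupaka", "misra_chapu", "khanda_chapu")
tracks = {f"{tala}_{i}": tala for tala in talas for i in range(10)}
train_ids, val_ids = stratified_split(tracks)
```

Splitting within each tāḷa group, rather than over the pooled track list, is what guarantees that a rare tāḷa cannot end up missing from the validation set by chance.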
Interleaved Train Data Loader
An issue with the CMR dataset is its imbalanced class distribution: some tāḷas appear
more frequently than others. We use an interleaved data loader that proportionally
spaces rare tāḷas like khaṇḍa chāpu in training, ensuring a more balanced learning
process.
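The exact interleaving scheme is not spelled out in the text, so the following is only one possible interpretation: each tāḷa's tracks are placed at evenly spaced positions across the epoch and the result is sorted by position. The function name and the toy track ids are hypothetical.

```python
def interleave_by_tala(tracks_by_tala):
    """Order tracks so that every tala is spread evenly across one epoch."""
    positioned = []
    for tala, tracks in tracks_by_tala.items():
        n = len(tracks)
        for k, track in enumerate(tracks):
            # place the k-th track of this tala at the centre of its 1/n slot
            positioned.append(((k + 0.5) / n, tala, track))
    positioned.sort()
    return [(tala, track) for _, tala, track in positioned]


# Hypothetical imbalance: six adi tracks versus two khanda chapu tracks.
toy = {
    "adi": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "khanda_chapu": ["k1", "k2"],
}
order = interleave_by_tala(toy)
rare_positions = [i for i, (tala, _) in enumerate(order) if tala == "khanda_chapu"]
```

With this ordering the two rare tracks land in the first and second half of the epoch instead of appearing back to back, which is the behaviour the paragraph above describes.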
4.4.2 Music-Informed Post-Processing
One impactful strategy for enhancing the performance of meter tracking systems is
post-processing. The DBNDownBeatTrackingProcessor from the madmom library
allows us to tune parameters to reflect musical characteristics of the data being
processed. We set the following parameters based on musicological knowledge as
well as insights from the dataset:
• beats_per_bar = [3, 5, 7, 8] based on the four tāḷas in our dataset instead of
the default [3, 4]
• min_tempo = 55 and max_tempo = 230, reflecting the tempo range observed
in the dataset (see Figure 12). This range, chosen based on preliminary
experiments, covers 99% of all tempos and provides a constrained search space
for the post-processor, with results showing slight performance improvement
compared to a max tempo of 300 BPM (99.9% of all tempos).
Figure 12: Distribution of tempos in the CMR dataset.
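To illustrate how tempo bounds constrain the search space, the sketch below converts a BPM range into the corresponding range of inter-beat intervals measured in activation frames. The frame rate of 100 frames per second is an assumption, and the helper name is hypothetical; this is plain arithmetic, not a call into madmom.

```python
def tempo_to_interval_range(min_bpm, max_bpm, fps=100):
    """Return (shortest, longest) inter-beat interval in frames for a tempo range."""
    shortest = 60.0 * fps / max_bpm   # fastest tempo gives the shortest interval
    longest = 60.0 * fps / min_bpm    # slowest tempo gives the longest interval
    return shortest, longest


narrow = tempo_to_interval_range(55, 230)   # range used in this work
wide = tempo_to_interval_range(55, 300)     # wider alternative mentioned above
```

Lowering the maximum tempo from 300 to 230 BPM raises the shortest admissible interval from 20 frames to roughly 26 frames, so the post-processor has fewer fast-tempo hypotheses to consider.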
For reproducibility, all relevant code repositories, software resources, and dataset
references utilised in the experiments are catalogued in Appendix A.
Chapter 5
Results and Discussion
This chapter presents and examines the performance of models trained with the
various strategies described in the previous chapter. Each approach is evaluated
using quantitative metrics, including F-measure and continuity scores, to provide
a robust comparison. In addition, a detailed breakdown by tāḷa is conducted to
highlight how each model responds to the unique rhythmic structures of Carnatic
music, allowing their respective strengths and weaknesses to emerge more clearly.
5.1 Model-wise Performance
Table 7 below shows the overall performance of the two models and their
strategies against the Bar Pointer model baseline, with the highest performing metrics
highlighted in bold.
Model        Beat                      Downbeat
             F-measure  CML   AML      F-measure  CML   AML
BP-HMM       71.8       -     72.2     44.0       -     -
BP-AMPF      82.5       -     90.6     57.4       -     -
TCN-BL       77.1       51.6  77.9     28.9       21.6  33.8
TCN-FT       80.7       50.2  91.9     52.9       35.3  57.8
TCN-FS       84.6       62.9  88.0     63.9       52.1  67.0
BeatThis-BL  71.3       39.2  56.8     27.6       2.0   8.7
BeatThis-FT  90.3       78.0  80.0     66.8       38.2  53.7
Table 7: Model-wise Performance Comparison
Both the TCN-BL and BeatThis-BL models, which were trained on Western music
datasets, fail to achieve baseline performance levels in meter tracking for Carnatic
music. Although the beat tracking accuracy of these models approximates
baseline performance, their downbeat tracking performance remains substantially below
baseline, despite being trained on extensive datasets.
This disparity reveals the fundamental differences in rhythmic structures between
Western and Carnatic music and illustrates the challenges faced by neural networks
in directly transferring learned knowledge across distinct musical traditions.
In contrast, the TCN-FT model nearly attains baseline performance, notably
achieving the highest beat AML score among all evaluated models. Interestingly,
preliminary experiments demonstrated comparable results when fine-tuning a model
initially trained only on the GTZAN dataset.
These findings further highlight the necessity for the network to re-optimize its
hyperparameters when adapting from Western to Carnatic music, indicating that
the quantity of data used during pretraining may be less significant compared to
the subsequent fine-tuning on Carnatic music. In fact, the performance of the TCN-
FT model may be hindered by being undertrained. Additional fine-tuning could
enhance the results, effectively equating to training the model from scratch.
Both TCN-FS and BeatThis-FT significantly outperform the DBN baseline in beat
and downbeat tracking, with BeatThis-FT establishing itself as the most effective
approach for achieving raw accuracy in meter tracking for Carnatic music. Meanwhile,
TCN-FS excels in maintaining temporal continuity, particularly in the downbeat
tracking task.
The difference in performance between the two models is as expected, given their
respective architectures (see Chapter 3) and the application of post-processing in
the TCN model. The Beat This! architecture emphasizes the accuracy of local
predictions, while the TCN model paired with the post-processor promotes globally
coherent predictions. Additionally, the powerful transformer architecture used by
Beat This! is able to extract more meaningful features from the Carnatic data
compared to the relatively lightweight TCN model, although this advantage comes
with increased computational demands.
5.2 Tāḷa-wise Performance
With TCN-FS and BeatThis-FT identified as the two leading strategies for tracking
Carnatic meter, we proceed to analyze their performance on each tāḷa to gain a
thorough understanding of their capabilities. Tables 8 and 9 provide breakdowns of
the performance of TCN-FS and BeatThis-FT, respectively, across the four tāḷas.
In terms of beat tracking, BeatThis-FT demonstrates relatively consistent
performance across tāḷas compared to TCN-FS. This consistency is also evident in the
continuity scores. TCN-FS struggles particularly with ādi (8) and rūpaka (3) tāḷas,
especially the latter. Although the beat F-measures are reasonable, the low CML
scores indicate that while many beats are detected correctly, the system often loses
correct tempo continuity throughout the sequence. The higher AML scores suggest
that the system frequently predicts tempos that are rhythmically related, implying
metrical ambiguity, likely due to the post-processing stage, which is absent in Beat
This!.

Tāḷa              Beat                      Downbeat
                  F-measure  CML   AML      F-measure  CML   AML
Ādi (8)           77.8       52.7  84.8     62.7       42.5  84.3
Rūpaka (3)        75.8       32.8  85.0     40.7       19.5  23.8
Mīśra chāpu (7)   95.6       92.3  93.9     86.7       88.8  94.5
Khaṇḍa chāpu (5)  93.5       84.5  88.7     68.5       64.6  65.9
Overall           84.6       62.9  88.0     63.9       52.1  67.0
Table 8: TCN-FS: Tāḷa-wise Performance Comparison
Tāḷa              Beat                      Downbeat
                  F-measure  CML   AML      F-measure  CML   AML
Ādi (8)           86.6       74.3  77.5     49.3       2.2   55.8
Rūpaka (3)        89.5       74.7  77.5     81.6       68.3  68.3
Mīśra chāpu (7)   94.2       86.0  86.1     72.8       49.7  50.6
Khaṇḍa chāpu (5)  91.4       76.6  78.5     61.1       28.8  29.3
Overall           90.3       78.0  80.0     66.8       38.2  53.7
Table 9: BeatThis-FT: Tāḷa-wise Performance Comparison
Interestingly, even in Beat This!, the tāḷas ādi (8) and rūpaka (3) score lower in
beat tracking accuracy than mīśra chāpu (7) and khaṇḍa chāpu (5). This may seem
counter-intuitive, as one might expect systems to struggle more with rare and more
complex meters. However, the difference is due to the variety of patterns within
a given tāḷa. In Carnatic music, performers often improvise and vary grouping
structures within a cycle, while maintaining the core framework and overall length.
As explained by [Srinivasamurthy, 2016], multiple rhythmic patterns that depart
from the traditional tāḷa structure can be played. For example, a musician might
perform a pattern grouped as 7, 7, 4, 6, and 8 akṣaras, totaling 32 akṣaras within an
ādi tāḷa cycle. Popular tāḷas like ādi and rūpaka tend to have more such variations,
making them more difficult for beat tracking systems to generalize. This complexity
is visible in the plots in Figure 13, which shows the average cycle length spectral flux
patterns of ādi and mīśra chāpu tāḷas in the CMR dataset. The patterns indicate
varying accent strengths at different metrical positions, reflecting the rhythmic
variation within each tāḷa. Also, both models achieve high performance on mīśra chāpu
for both beat and downbeat detection. This consistency implies that mīśra chāpu's
rhythmic pattern is relatively easier to model accurately for both architectures.
When examining the downbeat detection task, the results are more nuanced. While
BeatThis-FT slightly outperforms TCN-FS in overall F-measure (66.8 vs 63.9), both
exhibit mixed results with wide variation across tāḷas. For example, BeatThis-FT
excels dramatically on rūpaka, whereas it struggles on ādi downbeats. This suggests
that each model may be more adept at handling certain rhythmic structures but less
consistent across all tāḷa types. For further granularity, Appendix B includes
comprehensive violin plots illustrating per-track performance of both training strategies
on each tāḷa.
Figure 13: Spectral flux pattern comparison of tāḷas. Figure taken from
[Srinivasamurthy, 2016].
5.3 Outlier Analysis
Due to the relatively poor performance of TCN-FS on ādi and rūpaka tāḷas, we
perform a preliminary outlier analysis. Table 10 lists the tracks with the lowest
beat and downbeat F-measure scores for these tāḷas.
track id  tāḷa     Beat
                   F-measure  CML       AML
10047     adi      0.192975   0.100241  0.458864
11024     rupakam  0.605962   0.000145  0.903054

track id  tāḷa     Downbeat
                   F-measure  CML       AML
10048     adi      0.000000   0.0       1.000000
11040     rupakam  0.032501   0.0       0.000000
Table 10: TCN-FS: Worst Performing Tracks by Beat and Downbeat F-Measure
Figure 14 visualizes the ground truth annotations alongside the model predictions
over the spectrogram for sections of the two ādi tāḷa outliers with the lowest beat
(track id: 10047) and downbeat F-measure (track id: 10048) scores, respectively. In
the first case, it is clear that the beat predictions are consistently shifted by half
a beat, resulting in a low beat F-measure score. This is a common occurrence in
Carnatic music, where creative phase offsets (eḍupu) are often employed: percussive
onsets are shifted from the actual beats of the tāḷa (i.e., played on the off-beat),
making it challenging for the network to detect the true beats. Here, the AML
score provides a more realistic measure of beat tracking performance.
For the downbeat fail case, the detected downbeats are displaced by exactly half a
cycle. Though it scores zero in accuracy, it achieves a perfect AML score. This issue
can be attributed to the post-processor, which enforces global prediction constraints
and may produce scores unrepresentative of the model's actual performance.
Figure 14: TCN-FS: Worst performing tracks of ādi tāḷa visualised
The rūpaka outliers were analyzed by listening, as it is more difficult to visually
identify the reasons for their poor performance. In the beat tracking outlier, the
percussion switches creatively between triple and quadruple meter through metric
modulation, challenging the post-processor's ability to handle these rapid shifts.
The downbeat outlier is particularly challenging because the percussion is performed
at double tempo compared to the ground truth annotations, while also employing
polymeter. The subpar performance of TCN-FS on rūpaka is likely due to these
factors and the difficulties they pose to the post-processing stage.
5.4 Tempo and Tāḷa Cycle Duration Effects
Lastly, we delve deeper to identify the possible effects of track tempo and tāḷa-cycle
duration on model performance.
Figure 15 presents a box plot of the median track tempos in the CMR dataset,
grouped by tāḷa. For each track, inter-beat intervals (IBI) are converted to BPM
values, and the median BPM is plotted. Notably, the majority of tracks in the ādi
(8) and rūpaka (3) tāḷas fall within a narrow tempo range of approximately 50 to 100
bpm. In contrast, the two best-performing tāḷas, mīśra chāpu (7) and khaṇḍa chāpu
(5), cluster around 160 bpm but exhibit a wide tempo distribution. This stark
difference raises an important question: do the ground truth annotations reflect the
actual tempo, or are the ādi (8) and rūpaka (3) tracks annotated at half tempo,
potentially contributing to their underperformance?
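The IBI-to-BPM conversion described above can be sketched as follows. The function name and the beat times are hypothetical; the second call simply drops every other beat to show how a half-tempo annotation would appear in such a plot.

```python
from statistics import median


def median_tempo_bpm(beat_times):
    """Median tempo of a track, from its annotated beat times in seconds."""
    ibis = [b - a for a, b in zip(beat_times, beat_times[1:])]
    return median(60.0 / ibi for ibi in ibis if ibi > 0)


# Hypothetical beat annotations spaced 0.75 s apart.
beats = [0.0, 0.75, 1.5, 2.25, 3.0]
tempo = median_tempo_bpm(beats)

# The same audio annotated on every other beat reads as half the tempo.
half_tempo = median_tempo_bpm(beats[::2])
```

Using the median rather than the mean keeps the per-track tempo robust to a few irregular intervals, for example around expressive pauses.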
Next, we measure the cycle duration of each track as the interval between
consecutive downbeats. Table 11 summarizes the cycle durations by tāḷa. The ādi tāḷa
distinctly features longer and slower cycles compared to the others, with a median
cycle duration of 5.4 seconds, more than double that of rūpaka (2.1s) and khaṇḍa
chāpu (1.8s). This has important implications for downbeat tracking: longer cycle
lengths demand a larger context window for accurate detection, which can increase
the complexity of the model's task. Additionally, longer cycles mean fewer downbeat
annotations per track, potentially limiting the amount of training data available to
the network for learning. Another insight is that the variability in cycle duration is
also greatest in ādi tāḷa, with a range from 2.9 to 7.1 seconds.
Figure 15: Distribution of median track tempo by tāḷa
Tāḷa              Min. cycle    Max. cycle    Median cycle
                  duration (s)  duration (s)  duration (s)
Ādi (8)           2.9           7.1           5.4
Rūpaka (3)        1.2           3.1           2.1
Miśra Chāpu (7)   1.6           3.6           2.6
Khaṇḍa Chāpu (5)  0.9           2.9           1.8
Table 11: Tāḷa-wise summary of cycle durations (in seconds).
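The cycle-duration statistics can be computed from downbeat annotations in the same way as tempo is computed from beats. This is a minimal sketch for a single track with hypothetical downbeat times; how the thesis aggregates across the tracks of a tāḷa is not stated here, so the per-track summary below is an assumption.

```python
from statistics import median


def cycle_duration_stats(downbeat_times):
    """Return (min, max, median) cycle duration in seconds for one track."""
    durations = [b - a for a, b in zip(downbeat_times, downbeat_times[1:])]
    return min(durations), max(durations), median(durations)


# Hypothetical downbeat annotations (seconds) for a slow cycle.
downbeats = [0.0, 5.0, 10.5, 16.0, 21.0]
shortest, longest, typical = cycle_duration_stats(downbeats)
```

A track of 21 seconds yields only four cycles here, which illustrates why long cycles leave the network with comparatively few downbeat examples.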
To investigate the potential effects of tempo and cycle duration on model
performance, we plotted track tempo and cycle duration against beat and downbeat
accuracy for both TCN-FS and BeatThis-FT. These plots, provided in Appendix B,
also reflect variability in tempo and cycle duration through the point sizes. Overall,
no conclusive evidence emerged linking model performance directly with tempo or
cycle duration variations.
Nonetheless, TCN-FS, which relies on post-processing, can benefit from informed
tempo constraints that narrow the search space and reduce ambiguities like tempo
octave errors. Currently, the post-processor operates in a broad range of tempos to
accommodate the four tāḷas. However, narrowing this range based on tempo analysis
of the training data, especially in tāḷa-informed meter tracking, is likely to enhance
performance. This highlights the importance of tempo profiling as a preparatory
step for post-processor dependent tracking systems.
Mojtaba Heydari, Frank Cwitkowitz, and Zhiyao Duan. BeatNet: CRNN and Particle Filtering for Online Joint Beat Downbeat and Meter Tracking. In 22nd International Society for Music Information Retrieval Conference, ISMIR, 2021. URL https://arxiv.org/abs/2209.07140.
Andre Holzapfel, Florian Krebs, and Ajay Srinivasamurthy. Tracking the “odd”: Meter inference in a culturally diverse music corpus. In ISMIR-International Conference on Music Information Retrieval, pages 425–430. ISMIR, 2014.
Yun-Ning Hung, Ju-Chiang Wang, Xuchen Song, Wei-Tsung Lu, and Minz Won. Modeling beats and downbeats with a time-frequency transformer. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 401–405. IEEE, 2022.
Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
Florian Krebs, Sebastian Böck, and Gerhard Widmer. Rhythmic Pattern Modelling for Beat and Downbeat Tracking from Musical Audio. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), Curitiba, Brazil, 2013.
Florian Krebs, Sebastian Böck, and Gerhard Widmer. An Efficient State Space Model for Joint Tempo and Meter Tracking. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), Malaga, Spain, 2015.
Justin London. Hearing in Time: Psychological Aspects of Musical Meter. Oxford University Press, 05 2012. ISBN 9780199744374. doi: 10.1093/acprof:oso/9780199744374.001.0001. URL https://doi.org/10.1093/acprof:oso/9780199744374.001.0001.
Lucas S. Maia, Martín Rocamora, Luiz W. P. Biscainho, and Magdalena Fuentes. Adapting meter tracking models to Latin American music. In Proceedings of the 23rd International Society for Music Information Retrieval Conference, pages 361–368. ISMIR, December 2022. doi: 10.5281/zenodo.7385261. URL https://doi.org/10.5281/zenodo.7385261.
Giovana Morais, Richa Namballa, Xavier Juanola, Martín Rocamora, and Magdalena Fuentes. LAMIR HAckathon: Adapting Deep Learning Models to Latin American Music Tasks. https://lamir-workshop.github.io/lamir_hackathon/, December 2024. URL https://lamir-workshop.github.io/lamir_hackathon/.
Kevin Patrick Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley, 2002. URL https://www.cs.ubc.ca/~murphyk/Thesis/thesis.pdf.
Meinard Müller. Fundamentals of Music Processing: Using Python and Jupyter Notebooks. Springer International Publishing, Cham, 2021. ISBN 978-3-030-69807-2 978-3-030-69808-9. doi: 10.1007/978-3-030-69808-9. URL https://link.springer.com/10.1007/978-3-030-69808-9.
Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis. mir_eval: A transparent implementation of common mir metrics. In Proceedings of the 15th International Conference on Music Information Retrieval (ISMIR), pages 367–372, 2014.
P. Rao, H.A. Murthy, and S.R.M. Prasanna. Indian Art Music: A Computational Perspective. Sriranga Digital Software Technologies Pvt. Ltd., 2023. ISBN 9789391408091. URL https://books.google.es/books?id=g-2 EAAAQBAJ.
P. Sambamoorthy. South Indian Music, Volumes I–VI. The Indian Music Publishing House, Madras, India, 1998.
Ajay Srinivasamurthy. A Data-driven Bayesian Approach to Automatic Rhythm Analysis of Indian Art Music. PhD Thesis, Universitat Pompeu Fabra, Barcelona, Spain, 2016.
Ajay Srinivasamurthy and Xavier Serra. A supervised approach to hierarchical metrical cycle tracking from audio music recordings. In Proceedings of the 39th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), pages 5237–5241, Florence, Italy, May 2014. URL https://compmusic.upf.edu/carnatic-rhythm-dataset.
Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, et al. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 12, 2016. URL https://arxiv.org/pdf/1609.03499.
Nick Whiteley, Ali Taylan Cemgil, and Simon J Godsill. Bayesian Modelling of Temporal Structure in Musical Audio. In ISMIR, pages 29–34, 2006.
Appendix A
Software and Other Resources
This appendix provides a comprehensive list of key software resources, datasets,
and code repositories utilised in this study. These materials enable reproducibility
of the experiments and analyses presented. Additionally, relevant reference works
and helpful study resources are included for further exploration.
Datasets
The Carnatic Music Rhythm (CMR) dataset [Srinivasamurthy and Serra, 2014] is
available for download upon request from the CompMusic Project website:
https://compmusic.upf.edu/carnatic-rhythm-dataset
Links to important Western music datasets commonly used for meter tracking, some
of which were utilised for pretraining models in this study, can be accessed at:
https://ismir.net/resources/datasets/
Reproducible Code
The codebase for training the Temporal Convolutional Network (TCN) on the CMR
dataset is available at:
https://github.com/satyajeetprabhu/tcn-carnatic-tracker
The repository for fine-tuning the Beat This! model on the CMR dataset can be
found at:
https://github.com/satyajeetprabhu/beat-this-carnatic
Both repositories include the trained models from this study, evaluation results, and
notebooks for reproducing the analyses and plots.
Reference Implementations
The TCN implementation employed, developed in PyTorch Lightning, is based on
the LAMIR 2024 Hackathon Tutorial [Morais et al., 2024] on adapting deep learning
models to Latin American music tasks with limited data:
https://lamir-workshop.github.io/lamir_hackathon/intro.html
The original Beat This! [Foscarin et al., 2024] implementation is available at:
https://github.com/CPJKU/beat_this
The fine-tuning code for Beat This! was adapted from work by SMC Master students
Milo Beuzeval and Navid Hallajian:
https://github.com/smilo7/more-beats-for-this
Key Software Libraries
The madmom Python audio and music signal processing library [Böck et al., 2016]
used for audio preprocessing tasks:
https://github.com/CPJKU/madmom
The madmom Dynamic Bayesian Network (DBN) post-processor [Böck et al., 2016,
Krebs et al., 2015] employed alongside the TCN:
https://madmom.readthedocs.io/en/v0.16/modules/features/downbeats.html
The mirdata Python library [Bittner et al., 2019] used for dataset loading, validation,
and parsing:
https://github.com/mir-dataset-loaders/mirdata
Documentation:
https://mirdata.readthedocs.io/en/stable/
The mir_eval Python library [Raffel et al., 2014] used for evaluation:
https://github.com/mir-evaluation/mir_eval
Documentation:
https://mir-eval.readthedocs.io/latest/
Additional Study Resources
The ISMIR 2021 tutorial on tempo, beat, and downbeat estimation [Davies et al.,
2021] provides a comprehensive overview of deep learning models for beat and
downbeat tracking. It also includes an open-source, TensorFlow-based implementation of
the TCN model described in Deconstruct, Analyse, Reconstruct [Böck and Davies,
2020], which is the basis of the TCN model employed in this study:
https://tempobeatdownbeat.github.io/tutorial/intro.html
The Python notebooks accompanying the textbook Fundamentals of Music
Processing (FMP) [Müller, 2021] provide foundational material on computational music
analysis using signal processing techniques. Chapter 6 (Tempo and Beat Tracking)
is of special relevance to this study:
https://www.audiolabs-erlangen.de/resources/MIR/FMP/C6/C6.html
Appendix B
Detailed Analysis Plots