Revisiting Meter Tracking in Carnatic Music using Deep Learning Approaches

Author: Prabhu, Satyajeet

Publisher: Zenodo

DOI: 10.5281/zenodo.17304733

Source: https://zenodo.org/records/17304733/files/Satyajeet-Prabhu_SMC_2025_Master_Thesis.pdf

Master Thesis on Sound and Music Computing
Universitat Pompeu Fabra
“Revisiting Meter Tracking in Carnatic
Music using Deep Learning Approaches”
Satyajeet Prabhu
Supervisor: Martín Rocamora
Co-Supervisor: Thomas Nuttall
August 2025
Acknowledgments
I would like to express my sincere gratitude to Prof. Xavier Serra for giving me the opportunity to be part of this prestigious program despite my limited experience in software development. His encouragement to explore research in Indian Art Music has been a source of inspiration for me and many aspiring music computation researchers from India.
I am deeply grateful to my supervisor, Dr. Martín Rocamora, whose constant guidance over the two years of this program has been invaluable. From teaching one of the most engaging courses in the program to offering me an internship opportunity at the MTG and ultimately supervising my thesis, he has played a pivotal role in shaping me as a researcher.
I am thankful to my supervisor Thomas Nuttall - Tom, as he is affectionately known - whose support began even before the program, sparked by our meeting at the ISMIR satellite workshop in India in 2022. My gratitude also extends to Genís, Adithi, Oguz, Behzad, Esteban, Jyoti, Alia and all the other PhD students and researchers at MTG, who have always been willing to offer assistance and guidance in both professional and personal matters.
I also want to sincerely thank Ajay Srinivasamurthy, the author of the work on which this study is based, for being generous with his time and constantly offering his support despite his busy schedule.
It has been a privilege to study alongside my incredibly talented colleagues in the SMC Master's program, who I now proudly call friends. Special thanks to Anmol Mishra, now my co-author as well, for the banter and for the constant encouragement to take on new challenges and to Robin Doerfler for some of the most philosophical and intellectually stimulating conversations I have ever had.
Lastly, I am immensely thankful to my parents for their unwavering support in my musical (mis)adventures over the years, and to my close family and friends, who continually encourage me to keep exploring and growing.
Abstract
Beat and downbeat tracking, jointly referred to as Meter Tracking, is a fundamental task in Music Information Retrieval (MIR). Deep learning models have far surpassed traditional signal processing and classical machine learning approaches in this domain, particularly for Western (Eurogenetic) genres, where large annotated datasets are widely available. These systems, however, perform less reliably on underrepresented musical traditions.
Carnatic music, a rich tradition from the Indian subcontinent, is renowned for its rhythmic intricacy and unique metrical structures (tāḷas). The most notable prior work on meter tracking in this context employed probabilistic Dynamic Bayesian Networks (DBNs). The performance of state-of-the-art (SOTA) deep learning models on Carnatic music, however, remains largely unexplored.
In this study, we evaluate two models for meter tracking in Carnatic music: the Temporal Convolutional Network (TCN), a lightweight architecture that has been successfully adapted for Latin rhythms, and Beat This!, a transformer-based model designed for broad stylistic coverage without the need for post-processing. Replicating the experimental setup of the DBN baseline on the Carnatic Music Rhythm (CMRf) dataset, we systematically assess the performance of these models in a directly comparable setting. We further investigate adaptation strategies, including fine-tuning the models on Carnatic data and the use of musically informed parameters.
Results show that while off-the-shelf models do not always outperform the DBN, their performance improves substantially with transfer learning, matching or surpassing the baseline. These findings indicate that SOTA deep learning models can be effectively adapted to underrepresented traditions, paving the way for more inclusive and broadly applicable meter tracking systems.
Contents
Abstract
1 Introduction
1.1 Background
1.1.1 Metrical Structure in Music
1.1.2 Rhythm in Carnatic Music
1.2 Motivation
1.2.1 Specific Challenges in Carnatic Music
1.3 Research Question and Objectives
1.3.1 Research Questions
1.3.2 Objectives
2 State of the Art
2.1 Signal Processing Approach
2.2 Bayesian Approach
2.2.1 Bar Pointer model
2.2.2 Inference in Bayesian Meter Tracking
2.3 Deep Learning Approach
2.3.1 DNN Pipeline for Meter Tracking
2.3.2 Overview of Architectures
2.4 Evaluation
2.4.1 F-Measure
2.4.2 Continuity-based Metrics
3 DNN Models for Meter Tracking
3.1 Temporal Convolutional Network
3.1.1 Architectural Details
3.1.2 Adaptation and Generalization
3.1.3 Multi-task Learning Formulation
3.2 Beat This!: Tracker without Post Processing
3.2.1 Architectural Details
3.2.2 Shift-tolerant Loss
3.3 Practical Considerations: TCN vs Beat This!
4 Methodology
4.1 Dataset
4.2 Baseline
4.2.1 Baseline Setup
4.3 Experiment Setup
4.3.1 TCN
4.3.2 Beat This!
4.4 Musically Informed Strategies
4.4.1 Music-Informed Training
4.4.2 Music-Informed Post-Processing
5 Results and Discussion
5.1 Model-wise Performance
5.2 Tāḷa-wise Performance
5.3 Outlier Analysis
5.4 Tempo and Tāḷa Cycle Duration Effects
6 Conclusions and Future Work
6.1 Summary of the Study
6.2 Conclusions
6.3 Future Work
List of Figures
List of Tables
Bibliography
Appendices
A Software and Other Resources
B Detailed Analysis Plots
Chapter 1
Introduction
1.1 Background
Rhythm analysis is a central topic in Music Information Retrieval (MIR), aimed at computationally analysing or modelling the temporal structure of music. It encompasses a variety of tasks such as onset detection, tempo estimation, beat and downbeat tracking, pattern analysis, microtiming analysis and synchronization among others, which together enable a comprehensive understanding of musical timing.
This work focuses on the task of automatic estimation of beats and downbeats, commonly referred to as Meter Tracking, critical for several higher-level MIR tasks such as music segmentation and structural analysis, as well as applications such as DJ mixing and automatic beat matching.
1.1.1 Metrical Structure in Music
Rhythm in music is perceived as pulsations organized at multiple hierarchical levels of differing timespans, known as its meter or metrical structure [Bilmes, 1993, London, 2012]. These levels range from very fast subdivisions to larger organizational units (for example, see figure 1). The different metrical levels are described as follows:
Figure 1: Perceived metrical levels in 'Twinkle, Twinkle, Little Star'
Tatum: The fastest regular pulse in the music that listeners can perceive as a meaningful subdivision of rhythm. Often corresponds to a 16th note in Western music, but the specific duration depends on the tempo and style.
Tactus (Beat): The perceptually most salient pulse level that a listener would naturally tap their foot to. It typically corresponds to quarter notes in Western music but again depends on tempo and context. The beat level is central to most rhythm perception tasks and often serves as the reference level for tempo.
Meter (Bar, Measure, Cycle): A grouping of beats into a recurring structure that establishes musical phrasing and form. Measures are typically marked by accent patterns and serve to shape listeners' expectations of timing and emphasis.
In Western music, meter is commonly represented using time signatures, such as 4/4, which indicates four beats per measure, with each beat typically being a quarter note in duration (see figure 2). Other common meters include 3/4 (e.g., waltz).
Downbeat: The first beat of a bar or cycle, often marked by a strong accent or structural change. It acts as a temporal anchor and plays a crucial role in conveying the start of a measure. Accurate downbeat perception is essential for understanding musical form and phrasing.
Figure 2: Musical meter in Western music. Figure from Wikipedia, Metre (music).
1.1.2 Rhythm in Carnatic Music
Carnatic music, one of the two principal traditions of Indian art music (IAM), is predominantly practiced and appreciated in the southern regions of the Indian subcontinent. It is distinguished by its dedicated audiences, sophisticated theoretical framework, and high level of musicianship.
Traditionally, training in Carnatic music is transmitted orally through a lineage of teachers, with a strong emphasis on performance and improvisation. A typical Carnatic performance features a lead performer, a rhythmic accompaniment, a continuous background drone, and a melodic accompaniment. Unlike Western tonal music, Carnatic music does not employ harmony; instead, it is structured around the melodic framework of rāga and the rhythmic framework of tāḷa [Sambamoorthy, 1998]. Consequently, Carnatic music has become an important subject in MIR research as it presents unique challenges and opportunities for computational analysis [Rao et al., 2023].
Rhythmic organization in Carnatic music is governed by the tāḷa system, a hierarchical framework of time cycles that underlies melodic and rhythmic phrasing as well as improvisation. Within each tāḷa cycle, sub-structures are defined to track progression through the cycle. While there are some conceptual parallels with Western metrical structures, the terminology and organization within the tāḷa system differ significantly. Table 1 presents approximate correspondences between metrical hierarchies in Western and Carnatic frameworks.
Western     Carnatic
Tatum       akṣara
Beat        (indicated by hand gestures)
Measure     āvartana
Downbeat    sama
Table 1: Mapping of Western and Carnatic rhythmic concepts
Moreover, the tāḷa framework includes elements unique to the Carnatic tradition that lack direct equivalents in Western metrical theory, resulting in 175 theoretically possible tāḷas. In practice, however, a core set of 35 tāḷas is predominantly used in performance and pedagogy. Table 2 lists the four most commonly employed tāḷas in Carnatic music, alongside their total number of beats per cycle.
Tāḷa            #Beats
Ādi             8
Rūpaka          3
Miśra chāpu     7
Khaṇḍa chāpu    5
Table 2: Popular tāḷas in Carnatic music
Figure 3 illustrates some of the concepts from table 1 using the example of an Ādi tāḷa cycle (8 beats). It also demonstrates how beats are further grouped into sections called aṅgas. The progression through the tāḷa cycle is marked by distinctive hand gestures, which indicate both individual beats and the different types of sections within the cycle.
Figure 3: Illustration of Ādi tāḷa. Figure from [Srinivasamurthy, 2016]
The model also features prominently in Ajay Srinivasamurthy's 2016 doctoral thesis [Srinivasamurthy, 2016] on automatic rhythm analysis of Indian Art Music, which remains the most comprehensive study of meter tracking in Carnatic music to date.
The bar pointer model introduces a hypothetical "pointer" that moves through the metrical cycle and resets at the downbeat.
Figure 5: The Bar Pointer model. Figure taken from [Srinivasamurthy, 2016].
Hidden Variables in the BP-model
• Bar Position ($\phi_k$): Variable indicating position in the bar; $\phi = 0$ denotes the downbeat.
• Tempo ($\dot{\phi}_k$): Rate of progression of the pointer through the bar; modeled stochastically to allow for natural tempo fluctuation.
• Rhythmic Pattern Index ($r_k$): Encodes discrete rhythmic templates, capturing expected accent structures across different metrical styles.
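Transition Model
The transition model defines how the hidden state evolves over time:
$$P(x_k \mid x_{k-1}) = P(\phi_k \mid \phi_{k-1}, \dot{\phi}_{k-1}, r_{k-1}) \cdot P(\dot{\phi}_k \mid \dot{\phi}_{k-1}) \cdot P(r_k \mid r_{k-1}, \phi_k, \phi_{k-1})$$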
The first term updates the bar position $\phi_k$ based on the previous position $\phi_{k-1}$ and tempo $\dot{\phi}_{k-1}$. The second term enforces smooth tempo changes by modelling $\dot{\phi}_k$ based on $\dot{\phi}_{k-1}$. The third term allows the rhythmic pattern $r_k$ to change, but only at the end of a bar (i.e., when $\phi_k < \phi_{k-1}$).
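The three factors of the transition model can be sketched as a single sampling step. This is a minimal illustration only: the bar position normalised to [0, 1), the Gaussian random walk on tempo, the switch probability `p_switch`, and all function and parameter names are assumptions made for the example, not the parameterisation used in [Srinivasamurthy, 2016].

```python
import random

def bar_pointer_step(phi, phi_dot, r, n_patterns, rng,
                     tempo_sigma=0.0005, p_switch=0.1):
    """Sample (phi_k, phi_dot_k, r_k) from the previous state.

    phi      -- bar position in [0, 1); 0 denotes the downbeat
    phi_dot  -- tempo, as the bar fraction advanced per frame
    r        -- rhythmic pattern index
    """
    # First term: advance the pointer by the previous tempo, wrapping at the bar end.
    new_phi = (phi + phi_dot) % 1.0
    # Second term: smooth tempo change (assumed Gaussian random walk, kept positive).
    new_phi_dot = max(1e-6, rng.gauss(phi_dot, tempo_sigma))
    # Third term: the pattern may change only when the pointer wrapped,
    # i.e. when phi_k < phi_{k-1}.
    new_r = r
    if new_phi < phi and rng.random() < p_switch:
        new_r = rng.randrange(n_patterns)
    return new_phi, new_phi_dot, new_r
```

Iterating this step from a state such as `(0.0, 0.01, 0)` with `rng = random.Random(0)` produces one hypothetical trajectory of the pointer through successive bars.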
Observation Model
The observation model $P(y_k \mid x_k)$ defines the likelihood of observing feature $y_k$ given the current state. It is often implemented using Gaussian Mixture Models (GMMs) trained on bar-position-aligned rhythmic patterns derived from annotated data. The model captures how likely an onset or spectral event is to occur at each position in the bar, for each pattern.

2.2.2 Inference in Bayesian Meter Tracking
Once a Bayesian model (like the bar pointer model) is defined, the core computational task is inference. Given the observations $y_{1:K}$, the goal is to estimate the hidden state sequence $x_{1:K}$ (tempo, bar position, and rhythmic pattern) that best explains the observed audio features.
Goal: $$\arg\max_{x_{1:K}} P(x_{1:K} \mid y_{1:K})$$
Depending on how the hidden state space is modeled, discretely or continuously, different inference techniques are used. The two dominant approaches are:
Viterbi Decoding
The Viterbi algorithm is a dynamic programming method that finds the single most likely, i.e. Maximum A Posteriori (MAP), sequence of hidden states. It assumes a discrete state space - the bar position $\phi$, tempo $\dot{\phi}$, and rhythmic pattern $r$ are discretized into a fixed grid.
This approach provides exact inference under the discrete model and is efficient when the state space is moderately sized. However, it becomes computationally expensive with fine discretization, for example in cases with long bars, and it is inflexible in real-time or online settings. To tackle these scalability challenges, Krebs et al. proposed an Efficient State Space Model [Krebs et al., 2015] that restructures the original bar pointer model, resulting in better accuracy and drastically reduced computational complexity.
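For illustration, Viterbi decoding over a generic discrete hidden Markov model can be sketched in a few lines. The sketch assumes the state space has already been flattened into S discrete states and that all probabilities are supplied as logarithms; the function and argument names are hypothetical and this is not the implementation used in the cited works.

```python
import math

def viterbi(log_init, log_trans, log_obs):
    """Return the most likely (MAP) state sequence of a discrete HMM.

    log_init  -- log P(x_1 = s), list of length S
    log_trans -- log P(x_k = j | x_{k-1} = i), S x S nested list
    log_obs   -- log P(y_k | x_k = s), K x S nested list
    """
    n_states = len(log_init)
    # delta[s]: best log-probability of any path ending in state s.
    delta = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backptr = []
    for frame in log_obs[1:]:
        prev = delta
        pointers, delta = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best_i)
            delta.append(prev[best_i] + log_trans[best_i][j] + frame[j])
        backptr.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

The cost grows with K times S squared, which is why a fine discretization of bar position and tempo quickly becomes expensive.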
Particle Filtering
When the hidden state space is modeled as continuous (or very high-dimensional), exact inference becomes intractable. Particle filtering provides an approximate solution by using a set of weighted samples, called particles, each representing a possible state trajectory. In other words, each particle represents a hypothesis of the hidden state at time $k$: $x_k^{(i)} = [\phi_k^{(i)}, \dot{\phi}_k^{(i)}, r_k^{(i)}]$.
Particle filtering naturally incorporates uncertainty and multimodality, such as multiple possible tempo hypotheses, making it better suited for online or real-time applications. However, it is computationally intensive, requires tuning the number of particles, and since it is an approximate method, the results may vary between runs.
2.3 Deep Learning Approach
Currently, data-driven deep learning approaches dominate the landscape of the meter tracking task as they offer several advantages. Deep Neural Networks (DNN) are capable of learning complex representations from raw input data, processing large-scale datasets more efficiently, generalising better across data and multi-task learning. The availability of GPUs and specialized deep learning frameworks (e.g., TensorFlow, PyTorch) has made training and deploying DNNs more practical.
2.3.1 DNN Pipeline for Meter Tracking
Figure 6: DNN based meter tracking pipeline. Figure adapted from Tempo, Beat and Downbeat ISMIR Tutorial 2021 [Davies et al., 2021].
A typical pipeline of a DNN-based meter tracking system (see Figure 6) consists of two stages - feature learning and temporal decoding. DNNs first learn features from the input audio or its time-frequency representation and output an activation or salience function containing the possible beat and downbeat candidates. This is similar to the novelty function, but while the novelty function is derived from hand-crafted features, the activation function is produced by the network's complex internal representation.
The output activations of DNNs are often noisy and cannot be directly used for predictions. DBNs are proven in their ability to impose temporal consistency and metrical structure and are commonly used as a post-processing step to infer beats and downbeats from DNN activations. However, DBNs can also introduce several limitations due to their inherent properties. They do not work for music with time signature changes, tempo changes outside the prescribed range, and metric structures not represented in the state space. To overcome bias introduced by DBNs and generalise across music genres, recent efforts have attempted to remove this post-processing stage.
2.3.2 Overview of Architectures
Since beat and downbeat estimation is a sequence modelling problem, the most successful architectures applied to this task include Recurrent Neural Networks (RNNs), Temporal Convolutional Networks (TCNs) and transformer-based models, all of which are well-suited for capturing temporal dependencies in musical signals.
Böck et al. [Böck et al., 2016] utilise RNN, specifically Bidirectional Long Short-Term Memory (BLSTM) architecture, for a supervised classification task to simultaneously detect beats and downbeats. This significant work outperformed standalone DBN-based meter tracking, especially on the downbeat detection task for most Western music datasets.
Convolutional Neural Networks (CNN) are known to excel at extracting local features, such as transients, while having a relatively low model complexity. However, they suffer from a lack of long-term context, which makes it difficult to identify global rhythmic structures. Hybrid approaches that incorporate both spatial and temporal understanding are, therefore, utilised for meter tracking. BeatNet [Heydari et al., 2021] uses CRNN (Convolutional Recurrent Neural Network), which combines CNNs for feature extraction and recurrent layers for sequential modelling.
Temporal Convolutional Network has emerged as another powerful architecture for beat and downbeat tracking. TCNs utilise convolutional layers with dilations to achieve a large receptive field, allowing them to model long temporal contexts efficiently.
More recently, the transformer architecture - originally successful in natural language processing - has been applied to meter tracking. Transformers utilise a self-attention mechanism that allows the model to weigh the importance of different parts of the input sequence when making predictions. This enables them to capture both local and global dependencies effectively while covering the entire input sequence. Hung et al. [Hung et al., 2022] employ a spectral-temporal transformer (SpecTNT) architecture for this task. Beat This!, a transformer-based system that removes the post-processing stage, achieves state-of-the-art beat and downbeat tracking performance on a number of Western music datasets.
2.4 Evaluation
The evaluation of meter tracking systems typically involves comparing predicted beat and downbeat times against annotated ground truth. In order to account for the inherent imprecision in annotations and musical events, most evaluation metrics allow a tolerance window around the annotated times.
2.4.1 F-Measure
The F-measure, also known as the F1-score, evaluates the accuracy of predicted beat times by comparing them to ground truth annotations within a fixed temporal tolerance window (commonly ±70 ms). This metric intends to provide a measure of how many beats are correctly predicted without over or under predicting. Downbeats are evaluated similarly, but due to their lower frequency, errors are more impactful. It is defined in terms of:
True Positives ($N_{TP}$): Number of predicted beats that fall within the tolerance window of a ground-truth beat.
False Positives ($N_{FP}$): Number of predicted beats that do not match any ground-truth beat within the tolerance window.
False Negatives ($N_{FN}$): Number of ground-truth beats for which no predicted beat lies within the tolerance window.
Figure 7: Tolerance window for F-measure. Figure from Tempo, Beat and Downbeat ISMIR Tutorial 2021 [Davies et al., 2021].
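Precision and Recall are defined as:
$$\text{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}}, \qquad \text{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}}$$
Then, the F-measure is the harmonic mean of precision and recall:
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$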
While the F-measure is a widely used and intuitive metric, it is prone to systematic issues that can give a misleading impression of tracking quality. For instance, changing the size of the tolerance window can dramatically change the value of the measure. As a result of using a fixed tolerance window, beats inside the window are considered accurate regardless of their position inside the window. So, predictions consistently offset from the annotation would still result in a high F1 score.
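The definitions above can be turned into a small pure-Python sketch. It uses a greedy one-to-one matching of annotations to predictions, which is a simplification chosen for brevity; evaluation libraries may match differently, and the function name and arguments are hypothetical.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """F-measure for beat times in seconds with a fixed +/- tolerance window."""
    reference, estimated = sorted(reference), sorted(estimated)
    matched = [False] * len(estimated)
    n_tp = 0
    for ref in reference:
        # Greedily match each annotation to the closest unused prediction
        # that lies within the tolerance window.
        best, best_err = None, tolerance
        for j, est in enumerate(estimated):
            err = abs(est - ref)
            if not matched[j] and err <= best_err:
                best, best_err = j, err
        if best is not None:
            matched[best] = True
            n_tp += 1
    n_fp = len(estimated) - n_tp   # predictions without a matching annotation
    n_fn = len(reference) - n_tp   # annotations without a matching prediction
    if n_tp == 0:
        return 0.0
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    return 2 * precision * recall / (precision + recall)
```

With this sketch, shifting every prediction by a constant 60 ms relative to the annotations leaves the score at 1.0, which illustrates the limitation described above.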
2.4.2 Continuity-based Metrics
Continuity-based metrics were introduced to address some of these gaps by evaluating not just alignment accuracy, but also the consistency of metrical phase and tempo over extended regions. That is, evaluating not just whether beats are detected, but whether they are detected consistently across time and at the correct metrical level. This is especially important for applications such as real-time tracking, where maintaining stable and accurate beat information over time is crucial for synchronization and responsiveness.
Continuity Criteria
A predicted beat at time $\hat{b}_i$ is considered accurate only if it satisfies two conditions:
1. The predicted beat $\hat{b}_i$ must lie within a predefined tolerance window around the corresponding ground-truth beat $b_i$. This window is not absolute but relative to the inter-beat interval (IBI), typically set to ±17.5% of the local IBI.
2. The preceding beat $\hat{b}_{i-1}$ must also fall within its own tolerance window. Furthermore, the IBI between $\hat{b}_{i-1}$ and $\hat{b}_i$ must be consistent with the IBI between $b_{i-1}$ and $b_i$.
These conditions together define what is referred to as a continuous segment: a sequence of at least three consecutive beats that are temporally aligned, metrically consistent, and phase-correct. Only such segments contribute to the continuity-based metrics.
Metrical Ambiguity
Continuity metrics are designed to be sensitive to a range of metrical errors which may all have similar F-measure values but vastly different perceptual implications. To achieve this, continuity-based metrics introduce metrical variants of the reference annotation grid and evaluate predictions against each variant. The highest resulting score is selected. Commonly used metrical variants include:
• Same metrical level, in-phase (i.e., beats align exactly with annotations)
• Same metrical level, off-phase (i.e., beats occur halfway between annotations)
• Double tempo (Twice the annotated metrical level)
• Half tempo (even-phase) (every other annotation starting from the first)
• Half tempo (odd-phase) (every other annotation starting from the second)
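The variants listed above can be generated from an annotated beat sequence as follows. This is a minimal sketch with hypothetical names; it places off-phase beats at the midpoints between consecutive annotations, which is one straightforward reading of "halfway between annotations".

```python
def metrical_variants(beats):
    """Build the allowed metrical variants of an annotated beat sequence (seconds)."""
    beats = list(beats)
    # Midpoints between consecutive annotations.
    midpoints = [(a + b) / 2 for a, b in zip(beats, beats[1:])]
    return {
        "in_phase": beats,
        "off_phase": midpoints,
        "double_tempo": sorted(beats + midpoints),
        "half_tempo_even": beats[0::2],   # every other beat from the first
        "half_tempo_odd": beats[1::2],    # every other beat from the second
    }
```

A lenient score would then be obtained by evaluating the predictions against each variant and keeping the highest result.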
Definitions of Continuity Metrics
Let $N^{seg}_{correct}$ be the number of beats in the longest continuous correct segment, and $N^{all}_{correct}$ be the total number of correct beats (even across multiple segments). Four metrics are derived from this principle, distinguishing between strict (annotated) and lenient (allowed) metrical levels:

• CMLc (Correct Metrical Level - continuous):
$$\text{CMLc} = \frac{N^{seg}_{correct}}{N_{pred}}$$
• CMLt (Correct Metrical Level - total):
$$\text{CMLt} = \frac{N^{all}_{correct}}{N_{pred}}$$
• AMLc (Allowed Metrical Levels - continuous): Same as CMLc, but allows metrical ambiguities.
• AMLt (Allowed Metrical Levels - total): Same as CMLt, but allows metrical ambiguities.
Low continuity scores - especially when paired with a high F-measure - suggest that predictions are fragmented or metrically inconsistent, even if individual beats are frequently close to annotations. Comparing CML and AML variants can also reveal whether a system is making metrical-level errors (e.g., consistently tracking a half or double tempo) that still result in perceptually acceptable output. Overall, continuity metrics offer a more structurally aware evaluation than frame-level accuracy alone.
Chapter 3
DNN Models for Meter Tracking
This work focuses on two main architectures: Temporal Convolutional Network and Beat This!. The following sections take a close look at each system, explaining their key components and how they approach the task of meter tracking.
3.1 Temporal Convolutional Network
TCNs have been shown to outperform traditional RNN-based models such as BLSTMs in meter tracking tasks. The TCN architecture uses dilated convolutions to model temporal dependencies, allowing the model to process audio sequences in parallel. Unlike BLSTMs, which are inherently sequential and thus difficult to parallelize, TCNs enable parallel training across time steps, significantly reducing training times and computational costs. Through dilated convolutions, TCNs are capable of modelling long-range temporal dependencies (spanning entire bars or phrases) with significantly fewer parameters. These characteristics make TCNs not only more scalable but also better suited for real-time or low-latency applications. Figure 8 shows an overview of beat tracking pipelines of the two architectures.
3.1.1 Architectural Details
There are two main components at the heart of a TCN-based meter tracker:
Convolutional Block
The convolutional block acts as the front end feature extractor in the TCN-based meter tracking pipeline. Its role is to transform the input spectrogram into a more compact and informative set of learned features that emphasize the spectral-temporal patterns relevant to rhythm perception. Importantly, all convolution operations are performed without temporal downsampling. The convolutional block is designed to reduce spectral dimensionality while preserving the temporal resolution that is critical for tracking beat-related events. As seen in figure 9, a typical convolutional block includes:
Figure 8: Comparison of BLSTM and TCN architectures for beat tracking. Figure taken from [Davies and Böck, 2019].
• Multiple 2D convolutional layers, each with a small kernel size (e.g., 3×3) to capture local time-frequency patterns.
• Pooling along the frequency axis, which compresses the spectral dimension while maintaining the original temporal resolution.
• Nonlinear activation functions, such as ELU, applied after each convolution to introduce nonlinearity.
Figure 9: Convolutional block in a TCN-based meter tracker. Figure taken from [Böck and Davies, 2020].
TCN Block
The input to the TCN is a highly sub-sampled feature vector derived from the magnitude spectrogram by the convolutional block, but which retains the same temporal resolution. The TCN block is the core temporal modelling component of the architecture. Its primary function is to model the sequential dependencies and periodic structures required for beat and downbeat prediction. It does so by learning filters via dilated convolution. Dilation is equivalent to skipping samples in the input sequence.
In a standard 1D convolution, each filter “slides” across the time axis of the input feature map, processing a local window (e.g., 3 frames) at each step. This is analogous to scanning for repeating rhythmic motifs. However, to model longer contexts, TCNs introduce dilated convolutions. A dilation defines the spacing between the elements in the filter's receptive field. For example, referring to figure 10:
A dilation of 1 corresponds to adjacent time steps (t−1, t, t+1).
A dilation of 2 looks at every second time step (t−2, t, t+2).
A dilation of 4 expands further (t−4, t, t+4).
By stacking layers with exponentially increasing dilations (e.g., 1, 2, 4, 8...), the network can effectively model patterns over a large time span without a proportional increase in the number of parameters.
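To make the tap spacing concrete, the following pure-Python sketch applies a single dilated filter to a 1D sequence. It is an illustration only, assuming zero padding and an odd kernel length; real TCN layers operate on multi-channel feature maps with learned weights, and the function name is hypothetical.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Same'-length 1D dilated convolution (cross-correlation) with zero padding."""
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            # For a 3-tap kernel with dilation d the taps sit at t - d, t and t + d.
            idx = t + (k - half) * dilation
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out
```

The output has the same length as the input, so the temporal resolution is unchanged; only the spacing of the frames each filter looks at grows with the dilation.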
Figure 10: Temporal Convolutional Network. Figure taken from ISMIR 2021 tutorial on Tempo, Beat, and Downbeat Estimation [Davies et al., 2021].
Advantages for Meter Tracking:
• Temporal resolution is preserved: Unlike RNNs, TCNs can maintain the full temporal granularity of the input.
• Efficient long-term modelling: Due to dilation, a TCN with 10 layers and kernel size 3 can access $2^{10} = 1024$ time steps (several seconds of music) without loss of resolution.
• Parallel training: All time steps can be processed simultaneously, making the model highly suitable for GPU acceleration.
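As a rough check of the temporal context quoted above, the receptive field of a stack of stride-1 dilated convolutions can be computed directly. The sketch assumes one convolution per layer and dilations doubling from 1 to 512; both are assumptions for illustration, and the actual TCN blocks may be organised differently.

```python
def receptive_field(kernel_size, dilations):
    """Input frames that influence one output frame of stacked dilated convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

dilations = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512
frames = receptive_field(3, dilations)       # total frames seen by one output frame
context_per_side = (frames - 1) // 2         # frames on either side of the centre
```

Under these assumptions the context reaches about $2^{10}$ frames to either side of the centre frame, which matches the order of magnitude given in the list above.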
3.1.2 Adaptation and Generalization
In the context of meter tracking, Davies and Böck [Davies and Böck, 2019] first successfully repurposed the TCN design inspired by WaveNet [Van Den Oord et al.,
5. Preservation of Tāḷa Distribution: The experimental setup preserves tāḷa distribution in each fold. Specifically, the distribution of tāḷas in the training and test folds mirrors the distribution of tāḷas in the full dataset.
4.3 Experiment Setup
For both models under evaluation, we first replicate the data splits and training setup as described in the previous section. Additionally, we establish common procedural guidelines to ensure a fair and consistent comparison between the models:
• The dataset, comprising 176 samples, is divided into two predetermined folds of 88 examples each, identical to those used in the baseline experiment's two-fold cross-validation scheme. In each iteration, one fold is used for training while the other serves as the test set, with the folds alternating roles between iterations.
• The train fold is further subdivided into training (80%) and validation (20%) subsets. Consequently, each fold contains 70 training examples and 18 validation examples.
• We perform three training runs per fold. To ensure reproducibility of validation splits and network initializations, we set predetermined random seeds [42, 52, 62] for each respective run. As a result, six distinct models are generated for every training strategy, and the results are reported as the mean performance across these six models.
• Validation loss is employed as the primary metric for monitoring training progress and for early stopping. Training is terminated when no improvement in validation loss is observed.
• The models are evaluated using the F-measure as well as the continuity metrics CMLt and AMLt for both beat and downbeat. The evaluation process is carried out using the Python package mir_eval [Raffel et al., 2014].
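The fold and seed protocol above can be sketched as follows. The helper names and the shuffling procedure are assumptions made for illustration, and the thesis does not show its actual splitting code. Rounding 20% of 88 to the nearest integer gives the 18 validation and 70 training examples stated above.

```python
import random

def split_train_val(train_fold, seed, val_fraction=0.2):
    """Shuffle a training fold with a fixed seed and split off a validation subset."""
    items = list(train_fold)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_fraction)
    return items[n_val:], items[:n_val]

def cross_validation_runs(fold_a, fold_b, seeds=(42, 52, 62)):
    """Yield (seed, train, val, test) for every fold and seed combination."""
    for train_fold, test_fold in ((fold_a, fold_b), (fold_b, fold_a)):
        for seed in seeds:
            train, val = split_train_val(train_fold, seed)
            yield seed, train, val, list(test_fold)
```

Iterating over `cross_validation_runs(fold_a, fold_b)` yields six configurations, one per trained model, whose scores would then be averaged.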
4.3.1 TCN
Model
The experimental setup of the TCN employed in this study is based on the open-source implementation of Deconstruct, Analyse, Reconstruct [Böck and Davies, 2020] made available by the authors as part of the ISMIR 2021 tutorial on Tempo, Beat and Downbeat Estimation. This implementation was subsequently repurposed in Adapting Meter Tracking Models to Latin American Music [Maia et al., 2022], and an updated, user-friendly version is provided in the Tutorial for LAMIR 2024 Hackathon [Morais et al., 2024]. The present study utilises these prior works and their respective experimental setups as the basis for the TCN implementation.
Trainable Parameters: 72.3K

Training Strategies
For the TCN model, we evaluate three strategies inspired by [Maia et al., 2022].
• Baseline (TCN-BL)
First, we train a model on the popular Western datasets for meter tracking - GTZAN, Ballroom, Beatles and RWC datasets following [Maia et al., 2022]. This model is assumed to be a good starting point for a baseline evaluation of the TCN on Carnatic data as well as for subsequent transfer learning experiments. Following protocol, we perform three training runs and report mean performance on the CMR dataset.
• Fine-tuning (TCN-FT)
Under this strategy, the model with the lowest validation loss from TCN-BL is used as a starting point for fine-tuning the network on Carnatic data. The assumption is that, although the model was pre-trained on Western datasets, it has learned a representation that can be adapted to a different musical tradition, as demonstrated in [Maia et al., 2022].
• Training from Scratch (TCN-FS)
This strategy involves training a randomly initialized network (using one of the predefined seed values) from scratch on each fold.
Loss Function
We employ a simple loss function defined as the sum of the binary cross-entropy
(BCE) of beat and downbeat predictions:
$$L = \mathrm{BCE}_{\mathrm{beat}} + \mathrm{BCE}_{\mathrm{downbeat}}$$
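As an illustration of this loss, the sketch below computes the sum of two mean binary cross-entropy terms in plain Python. The function names `bce` and `joint_loss`, the clipping constant `eps`, and the averaging over frames are assumptions made for the example; the actual training code relies on the deep learning framework's own BCE implementation.

```python
import math


def bce(targets, probabilities, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


def joint_loss(beat_targets, beat_probs, downbeat_targets, downbeat_probs):
    """L = BCE_beat + BCE_downbeat, with both terms weighted equally."""
    return bce(beat_targets, beat_probs) + bce(downbeat_targets, downbeat_probs)
```

With every frame predicted at probability 0.5, each term equals ln 2 regardless of the targets, so the joint loss is 2 ln 2 (about 1.386), which is a convenient sanity check for an untrained output layer.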
Table 6 provides a summary of the training configuration settings employed across
the different TCN strategies.
Acronym   Strategy                  Models    Epochs   Early      Learning   LR Reduction
                                    Trained            Stoppage   Rate       (Factor)
TCN-BL    TCN Baseline              3         100      20         0.005      0.2
TCN-FS    Train from Scratch        6         100      20         0.005      0.2
TCN-FT    Fine-tune from Baseline   6         50       10         0.001      0.2
Table 6: Training configurations for the TCN strategies
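The interaction between early stopping and learning-rate reduction can be sketched as a small simulation over a sequence of validation losses. This is a simplified illustration, not the training loop used in the study: it assumes that the "Early Stoppage" column is a patience in epochs, and the `lr_patience` value is hypothetical because Table 6 specifies only the reduction factor.

```python
def simulate_schedule(val_losses, lr=0.005, factor=0.2,
                      lr_patience=10, stop_patience=20):
    """Return (epoch, learning_rate) pairs until early stopping triggers.

    lr_patience is an assumed value; Table 6 gives only the factor.
    """
    best = float("inf")
    epochs_without_improvement = 0
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            # Shrink the learning rate every `lr_patience` stalled epochs.
            if epochs_without_improvement % lr_patience == 0:
                lr *= factor
        history.append((epoch, lr))
        if epochs_without_improvement >= stop_patience:
            break  # early stopping: validation loss has stalled too long
    return history
```

Using the TCN-BL settings (learning rate 0.005, factor 0.2, stop patience 20), a validation loss that stops improving after epoch 1 leads to a reduced rate of 0.001 at epoch 11 and termination at epoch 21.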
Post Processing
For post-processing network activations, a DBN-based post-processor is used. The
Python library madmom [Böck et al., 2016] offers an open-source joint beat and
downbeat DBN post-processor approximated by a Hidden Markov Model (HMM),
based on [Böck et al., 2016, Krebs et al., 2015]. In this work, we use its offline mode
utilising the Viterbi algorithm for inference.
4.3.2 Beat This!
Model
For Beat This!, we use the stock implementation for baseline evaluation. For fine-
tuning, however, we build upon a modified implementation that facilitates fine-
tuning of the stock model. Despite these modifications, we retain the default training
configurations, including data augmentation schemes, and utilise the pre-trained
models provided by the original authors.
Trainable Parameters: 20.3M
Training Strategies
We adopt only two strategies: Baseline and Fine-tuning. Given that Beat This! is
a transformer-based architecture, it is highly data-intensive, which makes training
from scratch on a single dataset impractical.
• Baseline (BeatThis-BL)
We utilise the three pre-trained checkpoints provided with the stock model,
namely final0, final1 and final2, all of which have been trained on a large
corpus comprising 18 different datasets (excluding GTZAN). We evaluate these
models on the CMR dataset and report the mean performance across all three
models as the baseline performance of Beat This!.
• Fine-tuning (BeatThis-FT)
The fine-tuning process begins with the default (final0) checkpoint as the pre-
trained baseline, which is then further fine-tuned over the course of 50 epochs.
Lastly, we use the built-in shift-tolerant weighted BCE loss and skip post-processing.
4.4 Musically Informed Strategies
Incorporating musicological insights into meter tracking systems can enhance their
performance. This section explores strategies used during training and post-
processing to help our DNN meter tracking systems adapt better to Carnatic Music.
4.4.1 Music-Informed Training
These strategies, applied at the training stage, aim to ensure that the model is exposed
to rhythmic diversity consistently during the training process:
Stratified Tāḷa-based Train/Validation Split
To ensure consistent performance across tāḷas, we implement a stratified
train/validation split based on tāḷas. This balances representation of all tāḷas in
training and validation, enabling per-tāḷa error analysis and targeted improvements
for under-performing tāḷas.
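A stratified split of this kind can be sketched as follows. This is an illustrative outline rather than the code used in the study: the function name `stratified_split`, the 20% validation fraction, and the track identifiers in the usage example are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(track_talas, val_fraction=0.2, seed=0):
    """Split track ids into train/validation lists, stratified by tala.

    track_talas: dict mapping track id -> tala label.
    """
    by_tala = defaultdict(list)
    for track_id, tala in sorted(track_talas.items()):
        by_tala[tala].append(track_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val = [], []
    for tala in sorted(by_tala):
        ids = by_tala[tala]
        rng.shuffle(ids)
        # Keep at least one validation track for every tala.
        n_val = max(1, round(len(ids) * val_fraction))
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val
```

For ten ādi and five khaṇḍa chāpu tracks, a validation fraction of 0.2 yields two ādi tracks and one khaṇḍa chāpu track in the validation set, so each tāḷa keeps the same proportion in both partitions.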
Interleaved Train Data Loader
An issue with the CMR dataset is its imbalanced class distribution: some tāḷas appear
more frequently than others. We use an interleaved data loader that proportionally
spaces rare tāḷas like khaṇḍa chāpu in training, ensuring a more balanced learning
process.
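One way to space rare tāḷas proportionally is to give every track a fractional position within its own tāḷa and then sort all tracks by that position. The sketch below is an assumption about how such interleaving could be done, not a description of the loader used in the study; the function name `interleave_by_tala` is hypothetical.

```python
from collections import defaultdict


def interleave_by_tala(tracks):
    """Order (track_id, tala) pairs so each tala is spread evenly.

    The k-th of n tracks in a tala is placed at fractional position
    (k + 0.5) / n, so rare talas are spaced out instead of clustered.
    """
    by_tala = defaultdict(list)
    for track_id, tala in tracks:
        by_tala[tala].append(track_id)

    keyed = []
    for tala, ids in by_tala.items():
        for k, track_id in enumerate(ids):
            keyed.append(((k + 0.5) / len(ids), tala, track_id))
    keyed.sort()  # ties are broken by tala name, then track id
    return [(track_id, tala) for _, tala, track_id in keyed]
```

With six ādi tracks and two khaṇḍa chāpu tracks, the two rare tracks land at positions 2 and 6 of the eight-track sequence instead of appearing back to back.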
4.4.2 Music-Informed Post-Processing
One impactful strategy for enhancing the performance of meter tracking systems is
post-processing. The DBNDownBeatTrackingProcessor from the madmom library
allows us to tune parameters to reflect musical characteristics of the data being
processed. We set the following parameters based on musicological knowledge as
well as insights from the dataset:
• beats_per_bar = [3, 5, 7, 8] based on the four tāḷas in our dataset instead of
the default [3, 4]
• min_tempo = 55 and max_tempo = 230, reflecting the tempo range observed
in the dataset (see Figure 12). This range, chosen based on preliminary
experiments, covers 99% of all tempos and provides a constrained search space
for the post-processor, with results showing slight performance improvement
compared to a max tempo of 300 BPM (99.9% of all tempos).
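The parameter choices above can be collected into a small configuration helper. In the sketch below, the tempo limits are passed under madmom's keyword names `min_bpm` and `max_bpm`, and the frame rate `fps=100` is an assumption that must match the rate of the network activations. The madmom import is deferred so that the configuration can be inspected without the library installed.

```python
TALA_BEATS_PER_BAR = [3, 5, 7, 8]  # rupaka, khanda chapu, misra chapu, adi


def dbn_kwargs(fps=100):
    """Keyword arguments for the tala-informed DBN post-processor.

    fps is an assumed activation frame rate, not a value from the text.
    """
    return {
        "beats_per_bar": list(TALA_BEATS_PER_BAR),
        "min_bpm": 55,
        "max_bpm": 230,
        "fps": fps,
    }


def make_postprocessor(fps=100):
    """Build the madmom processor (requires madmom to be installed)."""
    from madmom.features.downbeats import DBNDownBeatTrackingProcessor

    return DBNDownBeatTrackingProcessor(**dbn_kwargs(fps))
```

The resulting processor is applied to a two-column array of beat and downbeat activations and returns beat times together with their position in the bar.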
Figure 12: Distribution of tempos in the CMR dataset.
For reproducibility, all relevant code repositories, software resources, and dataset
references utilised in the experiments are catalogued in Appendix A.
Chapter 5
Results and Discussion
This chapter presents and examines the performance of models trained with the
various strategies described in the previous chapter. Each approach is evaluated
using quantitative metrics, including F-measure and continuity scores, to provide
a robust comparison. In addition, a detailed breakdown by tāḷa is conducted to
highlight how each model responds to the unique rhythmic structures of Carnatic
music, allowing their respective strengths and weaknesses to emerge more clearly.
5.1 Model-wise Performance
Table 7 below shows the overall performance of the two models and their
strategies against the Bar Pointer model baseline with the highest performing metrics
highlighted in bold.
Model          Beat                         Downbeat
               F-measure   CML    AML       F-measure   CML    AML
BP-HMM         71.8        -      72.2      44.0        -      -
BP-AMPF        82.5        -      90.6      57.4        -      -
TCN-BL         77.1        51.6   77.9      28.9        21.6   33.8
TCN-FT         80.7        50.2   91.9      52.9        35.3   57.8
TCN-FS         84.6        62.9   88.0      63.9        52.1   67.0
BeatThis-BL    71.3        39.2   56.8      27.6        2.0    8.7
BeatThis-FT    90.3        78.0   80.0      66.8        38.2   53.7
Table 7: Model-wise Performance Comparison
Both the TCN-BL and BeatThis-BL models, which were trained on Western music
datasets, fail to achieve baseline performance levels in meter tracking for Carnatic
music. Although the beat tracking accuracy of these models approximates baseline
performance, their downbeat tracking performance remains substantially below
baseline, despite being trained on extensive datasets.
This disparity reveals the fundamental differences in rhythmic structures between
Western and Carnatic music and illustrates the challenges faced by neural networks
in directly transferring learned knowledge across distinct musical traditions.
In contrast, the TCN-FT model nearly attains baseline performance, notably
achieving the highest beat AML score among all evaluated models. Interestingly,
preliminary experiments demonstrated comparable results when fine-tuning a model
initially trained only on the GTZAN dataset.
These findings further highlight the necessity for the network to re-optimize its
hyperparameters when adapting from Western to Carnatic music, indicating that
the quantity of data used during pretraining may be less significant compared to
the subsequent fine-tuning on Carnatic music. In fact, the performance of the
TCN-FT model may be hindered by being undertrained. Additional fine-tuning could
enhance the results, effectively equating to training the model from scratch.
Both TCN-FS and BeatThis-FT significantly outperform the DBN baseline in beat
and downbeat tracking, with BeatThis-FT establishing itself as the most effective
approach for achieving raw accuracy in meter tracking for Carnatic music. Meanwhile,
TCN-FS excels in maintaining temporal continuity, particularly in the downbeat
tracking task.
The difference in performance between the two models is as expected, given their
respective architectures (see Chapter 3) and the application of post-processing in
the TCN model. The Beat This! architecture emphasizes the accuracy of local
predictions, while the TCN model paired with the post-processor promotes globally
coherent predictions. Additionally, the powerful transformer architecture used by
Beat This! is able to extract more meaningful features from the Carnatic data
compared to the relatively lightweight TCN model, although this advantage comes
with increased computational demands.
5.2 Tāḷa-wise Performance
With TCN-FS and BeatThis-FT identified as the two leading strategies for tracking
Carnatic meter, we proceed to analyze their performance on each tāḷa to gain a
thorough understanding of their capabilities. Tables 8 and 9 provide breakdowns of
the performance of TCN-FS and BeatThis-FT, respectively, across the four tāḷas.
In terms of beat tracking, BeatThis-FT demonstrates relatively consistent
performance across tālas compared to TCN-FS. This consistency is also evident in the
continuity scores. TCN-FS struggles particularly with ādi (8) and rūpaka (3) tālas,
especially the latter. Although the beat F-measures are reasonable, the low CML
scores indicate that while many beats are detected correctly, the system often loses
correct tempo continuity throughout the sequence. The higher AML scores suggest
that the system frequently predicts tempos that are rhythmically related, implying
metrical ambiguity, likely due to the post-processing stage, which is absent in Beat
This!.

Tāḷa                Beat                         Downbeat
                    F-measure   CML    AML       F-measure   CML    AML
Ādi (8)             77.8        52.7   84.8      62.7        42.5   84.3
Rūpaka (3)          75.8        32.8   85.0      40.7        19.5   23.8
Mīśra chāpu (7)     95.6        92.3   93.9      86.7        88.8   94.5
Khaṇḍa chāpu (5)    93.5        84.5   88.7      68.5        64.6   65.9
Overall             84.6        62.9   88.0      63.9        52.1   67.0
Table 8: TCN-FS: Tāḷa-wise Performance Comparison
Tāḷa                Beat                         Downbeat
                    F-measure   CML    AML       F-measure   CML    AML
Ādi (8)             86.6        74.3   77.5      49.3        2.2    55.8
Rūpaka (3)          89.5        74.7   77.5      81.6        68.3   68.3
Mīśra chāpu (7)     94.2        86.0   86.1      72.8        49.7   50.6
Khaṇḍa chāpu (5)    91.4        76.6   78.5      61.1        28.8   29.3
Overall             90.3        78.0   80.0      66.8        38.2   53.7
Table 9: BeatThis-FT: Tāḷa-wise Performance Comparison
Interestingly, even in Beat This!, the tālas ādi (8) and rūpaka (3) score lower in
beat tracking accuracy than mīśra chāpu (7) and khaṇḍa chāpu (5). This may seem
counter-intuitive, as one might expect systems to struggle more with rarer and more
complex meters. However, the difference is due to the variety of patterns within
a given tāla. In Carnatic music, performers often improvise and vary grouping
structures within a cycle, while maintaining the core framework and overall length.
As explained by [Srinivasamurthy, 2016], multiple rhythmic patterns that depart
from the traditional tāla structure can be played. For example, a musician might
perform a pattern grouped as 7, 7, 4, 6, and 8 akṣaras, totaling 32 akṣaras within an
ādi tāḷa cycle. Popular tālas like ādi and rūpaka tend to have more such variations,
making it more difficult for beat tracking systems to generalize to them. This complexity
is visible in the plots in Figure 13, which show the average cycle-length spectral flux
patterns of ādi and mīśra chāpu tāḷas in the CMR dataset. The patterns indicate
varying accent strengths at different metrical positions, reflecting the rhythmic
variation within each tāla. Also, both models achieve high performance on mīśra chāpu
for both beat and downbeat detection. This consistency implies that mīśra chāpu's
rhythmic pattern is relatively easier to model accurately for both architectures.
When examining the downbeat detection task, the results are more nuanced. While
BeatThis-FT slightly outperforms TCN-FS in overall F-measure (66.8 vs 63.9), both
exhibit mixed results with wide variation across tāḷas. For example, BeatThis-FT
excels dramatically on rūpaka, whereas it struggles on ādi downbeats. This suggests
that each model may be more adept at handling certain rhythmic structures but less
consistent across all tāḷa types. For further granularity, Appendix B includes
comprehensive violin plots illustrating per-track performance of both training strategies
on each tāḷa.
Figure 13: Spectral flux pattern comparison of tālas. Figure taken from
[Srinivasamurthy, 2016].
5.3 Outlier Analysis
Due to the relatively poor performance of TCN-FS on ādi and rūpaka tāḷas, we
perform a preliminary outlier analysis. Table 10 lists the tracks with the lowest
beat and downbeat F-measure scores for these tāḷas.
track id   tāḷa       Beat
                      F-measure   CML        AML
10047      adi        0.192975    0.100241   0.458864
11024      rupakam    0.605962    0.000145   0.903054

track id   tāḷa       Downbeat
                      F-measure   CML        AML
10048      adi        0.000000    0.0        1.000000
11040      rupakam    0.032501    0.0        0.000000
Table 10: TCN-FS: Worst Performing Tracks by Beat and Downbeat F-Measure
Figure 14 visualizes the ground truth annotations alongside the model predictions
over the spectrogram for sections of the two ādi tāḷa outliers with the lowest beat
(track id: 10047) and downbeat F-measure (track id: 10048) scores, respectively. In
the first case, it is clear that the beat predictions are consistently shifted by half
a beat, resulting in a low beat F-measure score. This is a common occurrence in
Carnatic music, where creative phase offsets (eḍupu) are often employed: percussive
onsets are shifted from the actual beats of the tāḷa (i.e., played on the off-beat),
making it challenging for the network to detect the true beats. Here, the AML
score provides a more realistic measure of beat tracking performance.
For the downbeat fail case, the detected downbeats are displaced by exactly half a
cycle. Though it scores zero in accuracy, it achieves a perfect AML score. This issue
can be attributed to the post-processor, which enforces global prediction constraints
and may produce scores unrepresentative of the model's actual performance.
Figure 14: TCN-FS: Worst performing tracks of ādi tāḷa visualised
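The effect of a half-beat shift on tolerance-based scoring can be reproduced with a toy example. The sketch below is not the mir_eval computation: `hit_rate` and `offbeat_variant` are hypothetical helpers, and the 0.6 s beat period is an arbitrary choice. It only illustrates why predictions on the off-beat score zero against the annotations, yet score perfectly against the off-beat variant of the same annotations, which is one of the metrical variants that AML-style measures accept.

```python
def hit_rate(annotations, predictions, tolerance=0.07):
    """Fraction of predictions within `tolerance` seconds of any annotation."""
    hits = sum(
        1 for p in predictions
        if any(abs(p - a) <= tolerance for a in annotations)
    )
    return hits / len(predictions)


def offbeat_variant(annotations):
    """Midpoints between consecutive annotated beats (the off-beats)."""
    return [(a + b) / 2 for a, b in zip(annotations, annotations[1:])]


period = 0.6  # hypothetical inter-beat interval in seconds (100 BPM)
annotations = [i * period for i in range(20)]
# Predictions consistently shifted by half a beat, as in track 10047.
predictions = [t + period / 2 for t in annotations[:-1]]

on_beat_score = hit_rate(annotations, predictions)
off_beat_score = hit_rate(offbeat_variant(annotations), predictions)
```

Here `on_beat_score` is 0.0 because every prediction lies 0.3 s from the nearest annotation, while `off_beat_score` is 1.0.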
The rūpaka outliers were analyzed by listening, as it is more difficult to visually
identify the reasons for their poor performance. In the beat tracking outlier, the
percussion switches creatively between triple and quadruple meter through metric
modulation, challenging the post-processor's ability to handle these rapid shifts.
The downbeat outlier is particularly challenging because the percussion is performed
at double tempo compared to the ground truth annotations, while also employing
polymeter. The subpar performance of TCN-FS on rūpaka is likely due to these
factors and the difficulties they pose to the post-processing stage.
5.4 Tempo and Tāḷa Cycle Duration Effects
Lastly, we delve deeper to identify the possible effects of track tempo and tāḷa-cycle
duration on model performance.
Figure 15 presents a box plot of the median track tempos in the CMR dataset,
grouped by tāḷa. For each track, inter-beat intervals (IBI) are converted to BPM
values, and the median BPM is plotted. Notably, the majority of tracks in the ādi
(8) and rūpaka (3) tāḷas fall within a narrow tempo range of approximately 50 to 100
bpm. In contrast, the two best-performing tāḷas, mīśra chāpu (7) and khaṇḍa chāpu
(5), cluster around 160 bpm but exhibit a wide tempo distribution. This stark
difference raises an important question: do the ground truth annotations reflect the
actual tempo, or are the ādi (8) and rūpaka (3) tracks annotated at half tempo,
potentially contributing to their underperformance?
Figure 15: Distribution of median track tempo by tāḷa
Next, we measure the cycle duration of each track as the interval between
consecutive downbeats. Table 11 summarizes the cycle durations by tāḷa. The ādi tāḷa
distinctly features longer and slower cycles compared to the others, with a median
cycle duration of 5.4 seconds, more than double that of rūpaka (2.1s) and khaṇḍa
chāpu (1.8s). This has important implications for downbeat tracking: longer cycle
lengths demand a larger context window for accurate detection, which can increase
the complexity of the model's task. Additionally, longer cycles mean fewer downbeat
annotations per track, potentially limiting the amount of training data available to
the network for learning. Another insight is that the variability in cycle duration is
also greatest in ādi tāḷa, with a range from 2.9 to 7.1 seconds.
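Both per-track quantities used in this section follow directly from the annotation times. The sketch below shows one way to compute them, assuming beat and downbeat annotations are available as sorted lists of times in seconds. The function names are hypothetical.

```python
from statistics import median


def median_tempo_bpm(beat_times):
    """Median tempo of a track: each inter-beat interval converted to BPM."""
    ibis = [b - a for a, b in zip(beat_times, beat_times[1:])]
    return median(60.0 / ibi for ibi in ibis)


def cycle_durations(downbeat_times):
    """Cycle durations in seconds: intervals between consecutive downbeats."""
    return [b - a for a, b in zip(downbeat_times, downbeat_times[1:])]
```

For instance, beats annotated every 0.5 s give a median tempo of 120 BPM, and downbeats at 0, 4 and 8 s give two cycles of 4 s each. The minimum, maximum and median of `cycle_durations` are the statistics reported in Table 11.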
Tāla                Min. cycle     Max. cycle     Median cycle
                    duration (s)   duration (s)   duration (s)
Ādi (8)             2.9            7.1            5.4
Rūpaka (3)          1.2            3.1            2.1
Miśra Chāpu (7)     1.6            3.6            2.6
Khaṇḍa Chāpu (5)    0.9            2.9            1.8
Table 11: Tāla-wise summary of cycle durations (in seconds).
To investigate the potential effects of tempo and cycle duration on model
performance, we plotted track tempo and cycle duration against beat and downbeat
accuracy for both TCN-FS and BeatThis-FT. These plots, provided in Appendix B,
also reflect variability in tempo and cycle duration through the point sizes. Overall,
no conclusive evidence emerged linking model performance directly with tempo or
cycle duration variations.
Nonetheless, TCN-FS, which relies on post-processing, can benefit from informed
tempo constraints that narrow the search space and reduce ambiguities like tempo
octave errors. Currently, the post-processor operates in a broad range of tempos to
accommodate the four tāḷas. However, narrowing this range based on tempo analysis
of the training data, especially in tāḷa-informed meter tracking, is likely to enhance
performance. This highlights the importance of tempo profiling as a preparatory
step for post-processor-dependent tracking systems.
Mojtaba Heydari, Frank Cwitkowitz, and Zhiyao Duan. BeatNet: CRNN and
Particle Filtering for Online Joint Beat Downbeat and Meter Tracking. In 22nd
International Society for Music Information Retrieval Conference, ISMIR, 2021. URL
https://arxiv.org/abs/2209.07140.
Andre Holzapfel, Florian Krebs, and Ajay Srinivasamurthy. Tracking the "odd":
Meter inference in a culturally diverse music corpus. In ISMIR-International
Conference on Music Information Retrieval, pages 425–430. ISMIR, 2014.
Yun-Ning Hung, Ju-Chiang Wang, Xuchen Song, Wei-Tsung Lu, and Minz Won.
Modeling beats and downbeats with a time-frequency transformer. In ICASSP
2022-2022 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 401–405. IEEE, 2022.
Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and
techniques. MIT press, 2009.
Florian Krebs, Sebastian Böck, and Gerhard Widmer. Rhythmic Pattern Modelling
for Beat and Downbeat Tracking from Musical Audio. In Proceedings of the
14th International Society for Music Information Retrieval Conference (ISMIR),
Curitiba, Brazil, 2013.
Florian Krebs, Sebastian Böck, and Gerhard Widmer. An Efficient State Space
Model for Joint Tempo and Meter Tracking. In Proceedings of the 16th
International Society for Music Information Retrieval Conference (ISMIR), Malaga,
Spain, 2015.
Justin London. Hearing in Time: Psychological Aspects of Musical Meter.
Oxford University Press, 05 2012. ISBN 9780199744374. doi:
10.1093/acprof:oso/9780199744374.001.0001. URL
https://doi.org/10.1093/acprof:oso/9780199744374.001.0001.
Lucas S. Maia, Martín Rocamora, Luiz W. P. Biscainho, and Magdalena Fuentes.
Adapting meter tracking models to Latin American music. In Proceedings of
the 23rd International Society for Music Information Retrieval Conference, pages
361–368. ISMIR, December 2022. doi: 10.5281/zenodo.7385261. URL
https://doi.org/10.5281/zenodo.7385261.
Giovana Morais, Richa Namballa, Xavier Juanola, Martín Rocamora, and
Magdalena Fuentes. LAMIR Hackathon: Adapting Deep Learning Models to Latin
American Music Tasks. https://lamir-workshop.github.io/lamir_hackathon/,
December 2024. URL https://lamir-workshop.github.io/lamir_hackathon/.
Kevin Patrick Murphy. Dynamic Bayesian Networks: Representation, Inference
and Learning. PhD thesis, University of California, Berkeley, 2002. URL
https://www.cs.ubc.ca/~murphyk/Thesis/thesis.pdf.
Meinard Müller. Fundamentals of Music Processing: Using Python and Jupyter
Notebooks. Springer International Publishing, Cham, 2021. ISBN 978-3-030-
69807-2 978-3-030-69808-9. doi: 10.1007/978-3-030-69808-9. URL
https://link.springer.com/10.1007/978-3-030-69808-9.
Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen
Liang, and Daniel P. W. Ellis. mir_eval: A transparent implementation of
common mir metrics. In Proceedings of the 15th International Conference on Music
Information Retrieval (ISMIR), pages 367–372, 2014.
P. Rao, H.A. Murthy, and S.R.M. Prasanna. Indian Art Music: A Computational
Perspective. Sriranga Digital Software Technologies Pvt. Ltd., 2023. ISBN
9789391408091. URL https://books.google.es/books?id=g-2 EAAAQBAJ.
P. Sambamoorthy. South Indian Music, Volumes I–VI. The Indian Music Publishing
House, Madras, India, 1998.
Ajay Srinivasamurthy. A Data-driven Bayesian Approach to Automatic Rhythm
Analysis of Indian Art Music. PhD Thesis, Universitat Pompeu Fabra, Barcelona,
Spain, 2016.
Ajay Srinivasamurthy and Xavier Serra. A supervised approach to hierarchical
metrical cycle tracking from audio music recordings. In Proceedings of
the 39th IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP 2014), pages 5237–5241, Florence, Italy, May 2014. URL
https://compmusic.upf.edu/carnatic-rhythm-dataset.
Aaron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals,
Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu, et al.
Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499,
12, 2016. URL https://arxiv.org/pdf/1609.03499.
Nick Whiteley, Ali Taylan Cemgil, and Simon J Godsill. Bayesian Modelling of
Temporal Structure in Musical Audio. In ISMIR, pages 29–34, 2006.
Appendix A
Software and Other Resources
This appendix provides a comprehensive list of key software resources, datasets,
and code repositories utilised in this study. These materials enable reproducibility
of the experiments and analyses presented. Additionally, relevant reference works
and helpful study resources are included for further exploration.
Datasets
The Carnatic Music Rhythm (CMR) dataset [Srinivasamurthy and Serra, 2014] is
available for download upon request from the CompMusic Project website:
https://compmusic.upf.edu/carnatic-rhythm-dataset
Links to important Western music datasets commonly used for meter tracking, some
of which were utilised for pretraining models in this study, can be accessed at:
https://ismir.net/resources/datasets/
Reproducible Code
The codebase for training the Temporal Convolutional Network (TCN) on the CMR
dataset is available at:
https://github.com/satyajeetprabhu/tcn-carnatic-tracker
The repository for fine-tuning the Beat This! model on the CMR dataset can be
found at:
https://github.com/satyajeetprabhu/beat-this-carnatic
Both repositories include the trained models from this study, evaluation results, and
notebooks for reproducing the analyses and plots.
Reference Implementations
The TCN implementation employed, developed in PyTorch Lightning, is based on
the LAMIR 2024 Hackathon Tutorial [Morais et al., 2024] on adapting deep learning
models to Latin American music tasks with limited data:
https://lamir-workshop.github.io/lamir_hackathon/intro.html
The original Beat This! [Foscarin et al., 2024] implementation is available at:
https://github.com/CPJKU/beat_this
The fine-tuning code for Beat This! was adapted from work by SMC Master students
Milo Beuzeval and Navid Hallajian:
https://github.com/smilo7/more-beats-for-this
Key Software Libraries
The madmom Python audio and music signal processing library [Böck et al., 2016]
used for audio preprocessing tasks:
https://github.com/CPJKU/madmom
The madmom Dynamic Bayesian Network (DBN) post-processor [Böck et al., 2016,
Krebs et al., 2015] employed alongside the TCN:
https://madmom.readthedocs.io/en/v0.16/modules/features/downbeats.html
The mirdata Python library [Bittner et al., 2019] used for dataset loading, validation,
and parsing:
https://github.com/mir-dataset-loaders/mirdata
Documentation:
https://mirdata.readthedocs.io/en/stable/
The mir_eval Python library [Raffel et al., 2014] used for evaluation:
https://github.com/mir-evaluation/mir_eval
Documentation:
https://mir-eval.readthedocs.io/latest/
Additional Study Resources
The ISMIR 2021 tutorial on tempo, beat, and downbeat estimation [Davies et al.,
2021] provides a comprehensive overview of deep learning models for beat and
downbeat tracking. It also includes an open-source, TensorFlow-based implementation of
the TCN model described in Deconstruct, Analyse, Reconstruct [Böck and Davies,
2020], which is the basis for the TCN model employed in this study:
https://tempobeatdownbeat.github.io/tutorial/intro.html
The Python notebooks accompanying the textbook Fundamentals of Music
Processing (FMP) [Müller, 2021] provide foundational material on computational music
analysis using signal processing techniques. Chapter 6 (Tempo and Beat Tracking)
is of special relevance to this study:
https://www.audiolabs-erlangen.de/resources/MIR/FMP/C6/C6.html
Appendix B
Detailed Analysis Plots