Master thesis on Sound and Music Computing
Universitat Pompeu Fabra
“The Essence Remains the Same:
Generative Modeling of Expressive
Percussion”
Anmol Mishra
Supervisor: Martin Rocamora, Behzad Haki
August 2025
Acknowledgments
I would first like to express my deepest gratitude to my supervisors, Martin and
Behzad, for their invaluable guidance and support throughout my master's. I am
equally grateful to Robin and Satya, with whom I had the pleasure of collaborating
on several projects.
I am indebted to my alumni mentor, Justin, whose advice gave me the clarity and
conviction to move forward with the audio phase of this project. Special thanks to
Hugo and Patrick, whose help was crucial in bringing this work to completion.
At the MTG, I am thankful to the entire community, in particular: Pedro, for
our countless conversations; Oguz, for his constant guidance; Hyon, for keeping my
Korean skills intact; and Adithi, for being a trusted support. I am also grateful to
Michael and Genís for their insightful discussions and steady encouragement, and to
Esteban and David for being such wonderful friends. My thanks also go to Dmitry
and Alastair for the technical tricks they shared during our supervision sessions, and
to Lonce and Rafael for always being there when I needed support.
Maria deserves a very special mention, not only for our amazing conversations but
also for reintroducing me to my high school sweetheart, Physics, which eventually
led me to explore diffusion models in depth. I will always cherish our discussions.
I would also like to thank my Google Summer of Code supervisors, especially Jörg,
for mentoring me on model serialization and export. I am sincerely grateful to
Google for its generous TPU Research Cloud program and cloud credits, without
which this work would not have been possible.
Beyond academia, I am grateful to Mehul, Sahdev, Goka, and Bhavesh, as well
as the entire Seoul Music Meetup community. A special thanks to Alexander for
teaching me enough sheet music to thrive in this program. My heartfelt thanks also
go to the members of the Spotlight DJ Group, who welcomed me wholeheartedly as
the sole expat in the group. I especially want to thank President Jeong Jaeyoon for
teaching me DJing, and Kevin for his constant support.
Two people in particular deserve special thanks. First, Xavier, for everything you
have done for me, from giving me the chance to pursue this master's, to offering
me an office at the MTG, and supporting me in various unconventional and deeply
meaningful ways.
Second, Valerio, whose work truly changed the course of my life. Your Audio Signal
Processing for Music course transformed what was a minor curiosity into my true
calling. From attending the Generative Music Workshop twice to later serving as a
TA, this sequence of events set everything into motion.
Finally, I would like to thank my family. Over the years, as I have met people from
different walks of life, I have come to realize how fortunate I am to have been raised
in such a peaceful and supportive environment. My parents' unwavering trust in my
choices gave me the courage to quit my big-tech job and pursue my dreams. To my
parents and my sister, thank you.
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation and Research Question
1.1.1 Research Question
1.1.2 Motivation
1.2 Objectives and Contributions
1.2.1 Objectives
1.2.2 Contributions
1.3 Structure of the Thesis
2 Learning Microrhythm in Uruguayan Candombe using Transformers
2.1 Abstract
2.2 Introduction
2.3 Related Work
2.3.1 Analytical studies of microtiming
2.3.2 Symbolic and audio-based representations
2.3.3 Computational modeling and generation
2.4 Dataset
2.4.1 IEMP Candombe Dataset
2.4.2 HVO representation
2.5 Method
2.6 Experiments
2.6.1 Distribution of Chico onsets in beat
2.6.2 Distribution of microtiming in cycle
2.7 Discussion
2.8 Conclusion and Future Work
3 Got That Flow: Flow-based Beatbox-to-Drum Generation
3.1 Abstract
3.2 Introduction
3.3 Related Work
3.4 Dataset
3.4.1 Rhythm Representation
3.4.2 TRIA Feature Extraction
3.5 Method
3.5.1 Architecture
3.5.2 Conditioning Mechanisms
3.5.3 Pretrained Autoencoder for Audio Latents
3.5.4 Diffusion Transformer Architecture for Latent Diffusion (Flow) Modeling
3.5.5 Inference
3.5.6 Training Objective and Conditioning
3.6 Experiments and Evaluation
3.6.1 Rhythm Prompt Adherence
3.6.2 Timbre Prompt Adherence
3.7 Discussion
3.8 Conclusion and Future Work
4 Model Deployment for End Users
4.1 Introduction
4.2 Learning Microrhythm in Uruguayan Candombe using Transformers
4.2.1 Exporting via TorchScript
4.2.2 Wrapping in NeuralMidiFX VST
4.2.3 Real-World Deployment
4.3 Got That Flow: Flow-based Beatbox-to-Rhythm Generation
4.3.1 ONNX Export
4.3.2 Gradio Demo Deployment
4.3.3 Toward a VST Plugin
4.4 Conclusion
5 Conclusion
5.1 Summary of Contributions
5.1.1 Learning Microrhythm in Uruguayan Candombe
5.1.2 Got That Flow: Flow-based Beatbox-to-Rhythm Generation
5.2 Practical Impact and Deployment
5.3 Broader Perspectives and Future Directions
5.4 Closing Remarks
Bibliography
List of Figures
1 Model architecture
2 Chico pattern of the performance shown in Figure 3 in music notation
(the lower line represents the hand, and the upper line represents the
stick). The pattern is repeated for each of the four beats of the
rhythmic cycle.
3 Chico actual (top) vs predicted onsets (bottom) for all beats in one
of the performances of the dataset.
4 Predicted and actual velocities of chico onsets for all beats of the
same performance of Figure 3.
5 Example of madera patterns actual vs predicted onsets
6 The two madera patterns played by the repique drum in the performance
of Figure 5, shown in music notation (× symbol indicates a
madera hit). The performance starts with the top pattern and then
transitions to the bottom pattern.
7 Architecture of the diffusion transformer (DiT). Cross attention includes
text and rhythm conditioning. Prepend conditioning includes
timbre conditioning and also the signal conditioning on the current
timestep of the diffusion process.
8 Architecture of the autoencoder used for latent representation learning.
The encoder maps raw audio waveforms to a compressed latent
space, while the decoder reconstructs the audio from this latent
representation.
9 Interface of the Candombe VST plugin. The plugin processes incoming
MIDI events and applies microrhythmic transformations based on
the learned Candombe grooves.
10 Integration of the Candombe VST plugin within the MTG Toolbox
environment. This setup allows users to easily install and utilize the
plugin alongside other MTG Toolbox applications.
11 Deployment of the Got That Flow model as a Gradio web demo. Users
can upload beatbox audio and receive generated drum patterns in real
time.
List of Tables
1 Input/output sequence representation for 2-bar beats in 4/4 with 16th
note resolution for a total of 32 time steps (i), and 3 drum voices (j).
2 Chico actual and predicted mean, standard deviation and histogram
intersection of offset distribution across beats computed for the entire
dataset.
3 Chico actual and predicted mean, standard deviation and histogram
intersection of offset distribution across rhythmic cycles computed for
the entire dataset.
4 Channelwise F1 scores for each model across CFG scales.
5 Mean timbre similarity for each model across CFG scales.
Chapter 1
Introduction
1.1 Motivation and Research Question
1.1.1 Research Question
This thesis focuses on using generative models for capturing the expressiveness in
percussion music, the nuances that make human performances unique and engaging,
also known as groove. Groove is a natural part of human musical performance. It
encompasses the subtle timing deviations, dynamics, and articulations that make a
performance feel alive and engaging. As a central aspect of music perception and
appreciation, it is closely connected to the main functional uses of music; namely,
dance, drill, and ritual. When seeking to find a relationship between music and
the behavior that groove induces, synchronization and coordination, the temporal
properties of the music signal are crucial to our understanding of groove [Davies
et al., 2013].
Generative models have shown great promise in various creative domains, including
music generation. They can learn from existing performances and generate new,
expressive musical content. They can also learn to mimic specific styles or techniques,
allowing for greater control and customization.
Percussion music, in particular, offers a rich landscape for exploring expressivity.
The nuances in timing, dynamics, and articulation are crucial to conveying the
intended feel and groove of a piece. Percussion instruments, with their diverse
timbres and playing techniques, present unique challenges and opportunities for
generative modeling. The rich variety of sounds and rhythms in percussion music
can be difficult to capture and reproduce accurately, but they also provide a wealth
of material for training generative models. This diversity is what makes percussion
music so compelling, and it is essential to develop methods that can effectively model
and generate these intricate patterns.
My research focuses on leveraging generative models to capture the nuances of
percussion and enhance its expressivity.
1.1.2 Motivation
The motivation behind this research lies in the desire to bridge the gap between
human expressivity and machine-generated music. By understanding and modeling
the nuances of percussion performance, we can create systems that not only generate
music but also enhance the creative process of musicians. This has the potential to
revolutionize the way we compose, perform, and interact with music technology.
While generative models have made significant strides in music generation quality,
there is still a gap in the ways we can interact with these systems. Text input methods,
while great for high-level descriptions, cannot be used to effectively describe
the temporal evolution at a fine granularity. Talking about music often involves
discussing its structure, harmony, and melody, but the subtleties of rhythm and
timing are more challenging to convey through text alone. Talking about music also
requires a shared understanding of its cultural and contextual nuances, which can
be difficult to articulate.
Music generation systems today are trained on large paired datasets of text and
music, but these datasets often lack the depth needed to capture the intricacies of
percussion performance. Moreover, the unique characteristics of percussion instruments,
such as their diverse timbres and playing techniques, are often underrepresented
in these datasets. This underrepresentation can lead to generative models
that lack the ability to produce authentic and expressive percussion music.
Another motivation is the desire to make these tools more accessible to a wider
range of musicians and producers. A large number of these models are trained and
deployed on cloud infrastructure, which can be a barrier for many users due to cost,
latency, and privacy concerns. Music generation systems that can run efficiently on
local hardware would enable more spontaneous and intimate music-making experiences.
In addition, if the system can be adapted to the specific musical context and
preferences of the user, it could lead to more personalized and meaningful musical
experiences.
1.2 Objectives and Contributions
1.2.1 Objectives
This research employs a multi-faceted approach to address the challenges identified
in the previous sections.
1. Rhythms can be described to the model the way people naturally communicate
about them, using a combination of verbal descriptions, visual notations, and
audio examples. Verbal descriptions can include terms like "syncopation,"
"polyrhythm," and "groove," while visual notations can involve standard drum
2.4.1 IEMP Candombe Dataset
The corpus consists of 12 performances: nine trios and three quartets. Trio performances
feature three channels corresponding to the chico (C), piano (P), and
repique 1 (R1) drums. Quartet recordings include an additional repique 2 (R2)
channel. For modeling purposes, the quartets were reduced to three channels by
discarding the R2 drum, allowing a uniform representation across all performances
while preserving the essential rhythmic interactions of the ensemble. Each recording
varies in duration, ranging approximately from 150 to 228 seconds per performance.
The dataset provides rich temporal and dynamic annotations. Meter annotations
are based on manually tapping the first beat of each cycle, establishing the location
of downbeats and the overall cycle structure. Onset annotations are available at
the level of sixteenth-note subdivisions, including both the precise onset time in
seconds and peak amplitude for each instrument. Additional metadata specifies which
repique channel is playing the clave versus the repique pattern, event density over
the preceding two seconds, and performer identities. This structured information
enables detailed analyses of both timing and dynamic interactions between ensemble
members.
To extract microtiming information, we first construct an isochronous grid for each
rhythmic cycle, aligned to the manually tapped downbeats. For each annotated
onset, the deviation from the corresponding sixteenth-note subdivision is computed,
yielding the microtiming offsets that form the target for learning. Peak amplitude
values provide a measure of expressive dynamics and are used in parallel with timing
deviations to model the nuanced performance characteristics of each drum. By
combining onset timing, dynamics, and ensemble configuration, this dataset allows the
models to learn both the temporal and expressive structure of candombe drumming.
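The grid construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the thesis code: the function name `microtiming_offsets`, the 16-step cycle, and the convention of expressing each offset as a fraction of one grid step are assumptions made for the example.

```python
def microtiming_offsets(onsets, cycle_start, cycle_end, steps_per_cycle=16):
    """Return (grid step, offset) pairs for the onsets inside one rhythmic cycle.

    Assumption: the cycle is split into `steps_per_cycle` equal sixteenth-note
    steps between two tapped downbeats. Offsets are fractions of one step and
    lie in [-0.5, 0.5); negative values are early, positive values are late.
    """
    step = (cycle_end - cycle_start) / steps_per_cycle  # isochronous grid spacing
    result = []
    for t in onsets:
        if not (cycle_start <= t < cycle_end):
            continue  # onset belongs to another cycle
        pos = (t - cycle_start) / step  # position measured in grid steps
        idx = int(pos + 0.5)            # nearest grid step (pos is non-negative)
        # idx == steps_per_cycle means the onset anticipates the next downbeat.
        result.append((idx, pos - idx))
    return result


# Hypothetical cycle lasting 4 seconds, so one grid step is 0.25 s.
for idx, offset in microtiming_offsets([0.0, 0.3, 0.95], 0.0, 4.0):
    print(idx, round(offset, 3))
```

Onsets that anticipate the first downbeat fall before `cycle_start` and would need to be assigned from the previous cycle; the sketch leaves that case out.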
2.4.2 HVO representation
Data         Matrix                 Values
Hits         $H_{32 \times 3}$      $h_{ij} \in \{0, 1\}$
Velocities   $V_{32 \times 3}$      $v_{ij} \in [0, 1]$
Offsets      $O_{32 \times 3}$      $o_{ij} \in [-0.5, 0.5)$
Table 1: Input/output sequence representation for 2-bar beats in 4/4 with 16th note
resolution for a total of 32 time steps (i), and 3 drum voices (j).
Following prior work [Gillick et al., 2019, Haki et al., 2022], we represent the
annotated candombe performances using the HVO (HitsVelocitiesOffsets) matrix
representation, which has proven effective for capturing expressive percussion
performance in machine learning models. This representation separates the key dimensions
of performance: hits, indicating whether a drum is struck at a particular time step;
velocities, encoding the relative strength or loudness of each hit; and offsets,
representing the microtiming deviation of each onset from the underlying metrical grid.
By decoupling these aspects, the model can learn not only the rhythmic structure
but also the subtle expressive variations that characterize human performances.
Each performance is converted into three matrices of size T × M, where T corresponds
to the number of time steps and M to the number of drum channels. In our
work, each time step corresponds to a sixteenth-note subdivision, and each matrix
has three channels corresponding to chico, piano, and repique (R1, with R2 discarded
for quartets). The hits matrix is obtained by quantizing the annotated onset
times to the nearest sixteenth-note subdivision, assigning a value of 1 when a note
is present and 0 otherwise. The offsets matrix captures the microtiming deviations,
normalized to lie between -0.5 and 0.5 relative to a sixteenth-note duration, with
negative values indicating anticipations and positive values delays. The velocities
matrix encodes the strength of each hit, scaled linearly between 0 and 1 based on
the peak amplitude of the onset annotations.
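As a rough illustration of how the three matrices could be filled from annotated events, the sketch below uses plain nested lists. The `build_hvo` helper, the `(step, channel, offset, amplitude)` event tuple, and the normalization by a single `max_amplitude` value are assumptions for the example; the text only states that velocities are scaled linearly to [0, 1].

```python
def build_hvo(events, n_steps, n_channels, max_amplitude):
    """Build hits, velocities and offsets matrices of shape n_steps x n_channels.

    `events` holds hypothetical (step, channel, offset, amplitude) tuples, where
    `step` is the quantized sixteenth-note index and `offset` is the deviation
    from that grid point as a fraction of a step, in [-0.5, 0.5).
    """
    hits = [[0] * n_channels for _ in range(n_steps)]
    vels = [[0.0] * n_channels for _ in range(n_steps)]
    offs = [[0.0] * n_channels for _ in range(n_steps)]
    for step, ch, offset, amp in events:
        if not 0 <= step < n_steps:
            continue  # event falls outside this sequence
        hits[step][ch] = 1
        # Assumption: linear scaling by one reference amplitude, clipped at 1.
        vels[step][ch] = min(amp / max_amplitude, 1.0)
        # If two onsets quantize to the same cell, the later one overwrites.
        offs[step][ch] = offset
    return hits, vels, offs


# Channels in this example: 0 = chico, 1 = piano, 2 = repique.
H, V, O = build_hvo([(0, 0, 0.05, 0.4), (1, 0, -0.10, 0.8)], 32, 3, 0.8)
```

Cells without a hit keep velocity 0 and offset 0, so the hits matrix is what distinguishes a silent cell from a very soft, perfectly on-grid hit.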
Based on prior work [Gillick et al., 2019], we segment the performances into
overlapping two-bar sequences, with a stride of one bar between consecutive segments.
Given that each bar contains 16 sixteenth-note subdivisions, each HVO sequence
has a length of T = 32 time steps. This overlapping segmentation ensures that the
model can learn dependencies across bar boundaries while also increasing the total
number of training samples. Across the 12 recorded performances, this procedure
yields a total of 1,070 two-bar sequences, which form the ground truth dataset for
training and evaluation. A summary of the HVO representation is provided in Table
1.
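The overlapping segmentation can be expressed as a simple sliding window. The sketch below is illustrative only; the helper name `two_bar_windows` is hypothetical, and it assumes the performance has already been trimmed to a whole number of 16-step bars.

```python
def two_bar_windows(n_steps, steps_per_bar=16, bars_per_window=2):
    """Return (start, end) step indices of overlapping windows with a one-bar stride."""
    window = steps_per_bar * bars_per_window  # 32 time steps per sequence
    return [(s, s + window) for s in range(0, n_steps - window + 1, steps_per_bar)]


# A hypothetical 4-bar excerpt (64 steps) gives three overlapping sequences.
print(two_bar_windows(64))  # [(0, 32), (16, 48), (32, 64)]
```

Under these assumptions a performance with B complete bars contributes B - 1 sequences, which is how the one-bar stride increases the number of training samples relative to non-overlapping windows.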
2.5 Method
We approach the problem of modeling expressive drum performances as a
sequence-to-sequence prediction task. Given a binary hits matrix indicating the presence or
absence of hits across multiple drum channels over time, our goal is to predict both
the velocity and micro-timing offset of each hit. By framing it this way, we can
capture the temporal structure and inter-channel interactions that define a drum
performance's micro-rhythmic feel.
To accomplish this, we employ a transformer encoder that takes the hits matrix as
input and outputs velocity and offset matrices of the same shape. The architecture
is based on the encoder of the transformer from [Vaswani et al., 2017], and is
illustrated in Figure 1. We process drum patterns over T=32 time steps, encoding the
hit patterns into expressive performance outputs. The encoder uses multi-head
self-attention with 4 heads and a model dimension of 128. Feed-forward layers also have
dimension 128, and the network consists of 11 stacked encoder blocks. We opt for
a transformer encoder rather than a full encoder-decoder because the task involves
predicting aligned sequences: each input hit corresponds directly to an output
velocity and offset. The self-attention mechanism enables the model to capture both
local and long-range temporal dependencies, as well as cross-channel interactions,
which are crucial for modeling nuanced microtiming and groove.
Figure 1: Model architecture
The model jointly predicts velocities ($\hat{V}$) and timing offsets ($\hat{O}$). At each time step $t$,
the output is split into two branches, both using tanh activations: one for velocity
$\hat{v}_t$ and one for offset $\hat{o}_t$. Ground truth velocities and offsets ($V$, $O$) are scaled to
$(-1, 1)$ to match the output range of tanh. A squared error loss is computed at each
time step $t$ for drum channel $k$ as follows:
$$L_{t,k} = (v_{t,k} - \hat{v}_{t,k})^2 + (o_{t,k} - \hat{o}_{t,k})^2,$$
and the mean is computed across all time steps and channels to obtain the final loss.
To handle the sparsity of drum hits, we apply the hits matrix as a mask during
loss computation, ensuring that only positions with actual hits contribute to the
error. The network is trained end-to-end using teacher forcing, and parameters are
updated with the Adam optimizer.
We found that tanh activations provide better performance than sigmoid, as their
range (−1, 1) is zero-centered, unlike sigmoid's (0, 1), which helps optimization by
reducing bias in the activations.
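To make the masked objective concrete, the following sketch computes the squared error only at positions where the hits matrix is 1. It is written with plain Python lists rather than a tensor library, and two details are assumptions rather than statements from the text: the helper names, and the choice to average over hit positions only instead of over every time step and channel.

```python
def to_tanh_range(x, lo, hi):
    """Linearly map x from [lo, hi] to [-1, 1], matching the tanh output range."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0


def masked_loss(hits, v_true, o_true, v_pred, o_pred):
    """Mean of (v - v_hat)^2 + (o - o_hat)^2 over positions where hits == 1.

    Assumption: the mean is taken over hit positions only. Dividing by the
    total number of cells instead would rescale the loss by a constant factor
    for a fixed hits matrix.
    """
    total, count = 0.0, 0
    for t, row in enumerate(hits):
        for k, hit in enumerate(row):
            if hit:
                total += (v_true[t][k] - v_pred[t][k]) ** 2
                total += (o_true[t][k] - o_pred[t][k]) ** 2
                count += 1
    return total / count if count else 0.0


# Velocities in [0, 1] and offsets in [-0.5, 0.5) are rescaled before the loss.
v_scaled = to_tanh_range(0.5, 0.0, 1.0)    # maps to the centre of the tanh range
o_scaled = to_tanh_range(-0.5, -0.5, 0.5)  # maps to the lower end of the range
```

Predictions at cells without a hit are ignored entirely, so the model is never penalized for the values it emits at silent positions.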
2.6 Experiments
For a qualitative evaluation of the microrhythms learned by our model, we focus on
the musicological structure of candombe drumming. We select a single representative
performance from the dataset, which includes multiple drums (Chico, Repique,
and Piano), and extract its onset data at a resolution of sixteenth-note subdivisions.
From these onsets, we infer velocities and timing offsets using our trained model.
Candombe drumming exhibits repeating rhythmic structures at two distinct temporal
levels: the beat and the full rhythmic cycle. We analyze the model's inferences at
both scales, evaluating its ability to reproduce the characteristic rhythmic patterns
of the original performance.
2.6.1 Distribution of Chico onsets in beat
In candombe, the chico assumes the role of the timekeeper, playing repeating
patterns at the level of the beat throughout the entire performance (see
Figure 2) [Fuentes et al., 2019]. To evaluate whether our model can capture its
characteristic microtiming, we compare the distributions of chico onsets at the beat level
between the actual performances and the model's predictions. For this analysis, we
employ the carat Python package [Jure and Rocamora, 2019], which allows precise
examination of timing deviations across metric subdivisions.
Figure 3 shows the distribution of chico onsets for both actual (green) and predicted
(red) data at the beat level. Each beat is divided into four sixteenth-note
subdivisions, labeled as .1, .2, .3, and .4. We compute the mean offsets at each subdivision
as a percentage of the beat duration, providing an intuitive representation of the
average microtiming deviations relative to the isochronous metric grid. The results
indicate that the model successfully captures the characteristic timing deviations of
the chico, reproducing the subtle micro-rhythmic articulations present in the original
performance.
Table 2 summarizes the mean, standard deviation, and histogram intersection values
for each subdivision across the dataset, demonstrating that the predicted onset
distributions closely match the ground truth. This confirms that the model effectively
learns microtiming patterns unique to the chico, consistent with prior studies
on candombe microtiming [Jure and Rocamora, 2016, 2019, Fuentes et al., 2019].
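For readers unfamiliar with histogram intersection, the sketch below shows one common way to compute it between two sets of onset positions. The binning range, the number of bins, and the normalization of each histogram to unit sum are assumptions made for the example; the text does not state the exact settings used for Table 2, and the helper names are hypothetical.

```python
def histogram(values, lo, hi, n_bins):
    """Normalized histogram of `values` over [lo, hi) with equal-width bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            # Clamp the index to guard against floating-point rounding at the edge.
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def histogram_intersection(p, q):
    """Sum of bin-wise minima: 1.0 for identical normalized histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(p, q))


# Hypothetical onset positions (as fractions of a beat) around the second subdivision.
actual = histogram([0.24, 0.25, 0.26, 0.27], 0.2, 0.3, 10)
predicted = histogram([0.25, 0.26, 0.26, 0.28], 0.2, 0.3, 10)
score = histogram_intersection(actual, predicted)
```

Because both histograms sum to one, the score can be read as the fraction of probability mass the two distributions share, which is why values close to 1 indicate closely matching onset distributions.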
Accents play a central role in expressing groove [Danielsen et al., 2024], and since our
model also predicts velocity, we compare actual and predicted velocity distributions
at each beat subdivision. Figure 4 illustrates that the model captures the velocity
trends observed in the ground truth. Notably, there is a discrepancy between the
ground truth velocities and the theoretical pattern depicted in Figure 2, where an
accent on the second subdivision is expected but not consistently reflected in the
data. This highlights an interesting aspect of expressive performance that may
warrant further investigation.
Figure 2: Chico pattern of the performance shown in Figure 3 in music notation
(the lower line represents the hand, and the upper line represents the stick). The
pattern is repeated for each of the four beats of the rhythmic cycle.
Figure 3: Chico actual (top) vs predicted onsets (bottom) for all beats in one of the
performances of the dataset.
Figure 4: Predicted and actual velocities of chico onsets for all beats of the same
performance of Figure 3.
Sub Div   Mean               Std                Hist Int
          Actual   Pred.     Actual   Pred.
.1        0.01     0.01      0.02     0.02      0.84
.2        0.25     0.26      0.03     0.03      0.94
.3        0.48     0.49      0.02     0.02      0.81
.4        0.72     0.73      0.02     0.02      0.84
Table 2: Chico actual and predicted mean, standard deviation and histogram
intersection of offset distribution across beats computed for the entire dataset.
2.6.2 Distribution of microtiming in cycle
We extend our analysis to examine microtiming trends across the duration of the
full rhythmic cycle. First, we compute the mean, standard deviation, and histogram
intersection of the distributions captured by the model for the chico drum against the
ground truth distributions for all subdivisions within the cycle. As shown in Table 3,
the distributions repeat every four subdivisions (i.e., every beat), consistent with the
beat-level analysis, confirming that the chico pattern is preserved across the cycle.
Next, we focus on the madera pattern, which spans the full rhythmic cycle, as
illustrated in Figure 6. This pattern is initially played by all drums as an introduction
and preparation of the rhythm, but during the main performance it is performed
solely by the repique drum between phrases [Jure and Rocamora, 2016]. The IEMP
candombe dataset provides annotations of sections containing the madera pattern.
For this analysis, we consider the same performance used in Section 2.6.1, but focus
on the cycles where the repique plays the madera pattern.
We analyze 59 cycles of repique madera hits to evaluate whether the model captures
cycle-level microtiming patterns. Figure 5 displays the distribution of repique onsets
for both the ground truth and the model's predictions. The model successfully
reproduces the characteristic microtiming of the madera pattern. Notably, the onsets
at the 4th subdivision of the first and fourth beats (1.4 and 4.4) occur slightly ahead
of the isochronous grid, consistent with the expressive deviations observed in the
original performance. This demonstrates that our model can learn and replicate not
only beat-level timing but also subtle microtiming patterns that emerge across the
full rhythmic cycle.
Figure 5: Example of madera patterns actual vs predicted onsets
Figure 6: The two madera patterns played by the repique drum in the performance
of Figure 5, shown in music notation (× symbol indicates a madera hit). The
performance starts with the top pattern and then transitions to the bottom pattern.
2.7 Discussion
In this work, we investigated the problem of learning microrhythmic characteristics
in Uruguayan candombe drumming. We represented onset timing and strength
data as hits, velocity, and offset (HVO) matrices and trained a transformer model
on sequences of 2-bar length. After training, the model was used to infer velocities
and timing offsets from hit information, allowing us to evaluate its performance at
multiple temporal scales.
Our qualitative analysis focused on two levels of rhythmic structure: the beat level,
using the Chico drum as the timekeeper, and the cycle level, using the Madera
pattern played by the Repique drum. At the beat level, we observed that the model
accurately captured the microtiming deviations of the Chico drum, reproducing both
the mean and standard deviation of the original distributions. Additionally, the
predicted velocity profiles closely matched those of the ground truth, demonstrating
that the model is capable of learning not only the timing but also the dynamic
accents that contribute to groove in candombe.
Sub Div   Mean               Std                Hist Int
          Actual   Pred.     Actual   Pred.
1.1       0.01     0.01      0.02     0.02      0.71
1.2       0.26     0.26      0.03     0.03      0.83
1.3       0.49     0.49      0.02     0.02      0.65
1.4       0.73     0.73      0.02     0.02      0.84
2.1       1.01     1.01      0.02     0.02      0.79
2.2       1.25     1.25      0.03     0.03      0.85
2.3       1.48     1.48      0.02     0.02      0.84
2.4       1.73     1.73      0.02     0.02      0.86
3.1       2.01     2.01      0.02     0.02      0.81
3.2       2.25     2.25      0.03     0.03      0.84
3.3       2.48     2.48      0.02     0.02      0.83
3.4       2.72     2.72      0.02     0.02      0.74
4.1       3.01     3.01      0.02     0.02      0.79
4.2       3.26     3.26      0.03     0.03      0.82
4.3       3.48     3.49      0.02     0.02      0.86
4.4       3.72     3.73      0.02     0.02      0.84
Table 3: Chico actual and predicted mean, standard deviation and histogram
intersection of offset distribution across rhythmic cycles computed for the entire dataset.
At the cycle level, the Chico microtiming patterns were preserved across the full
rhythmic cycle, reflecting the timekeeper behavior of this instrument. Furthermore,
the model successfully reproduced the microtiming of the Madera pattern played by
the Repique drum, capturing subtle deviations such as the anticipatory onsets on
the 4th subdivision of specific beats. These results demonstrate that the model is
capable of learning complex, hierarchical rhythmic structures and expressive timing
across multiple temporal scales.
2.8 Conclusion and Future Work
The transformer architecture used in this study proves effective for learning
microtiming patterns from HVO representations, suggesting that the approach may
generalize to other datasets and musical genres. By capturing both timing and velocity
distributions, our model provides a tool for analyzing, generating, and manipulating
rhythmic microstructure in a data-driven manner. This has potential applications
in algorithmic rhythm creation, music production, and the study of performance
practice across different musical traditions.
In conclusion, our work demonstrates that deep learning models can successfully
learn the microrhythmic structure of complex rhythmic traditions such as candombe.
The model reproduces both timing deviations and dynamic accents at multiple
temporal levels, highlighting the capability of data-driven approaches to capture
expressive performance characteristics. Future work will focus on extending this
methodology to other Latin American music genres, contributing to the development of
more versatile tools for rhythm modeling and creative music applications.
Chapter 3
Got That Flow: Flow-based
Beatbox-to-Drum Generation
In this chapter, I present the application of diffusion models to generating drum
audio conditioned on rhythmic inputs such as beatboxing or tapping. This work has
been carried out in collaboration with Patrick O’Reilly and Hugo Flores Garcia from
the Interactive Audio Lab at Northwestern University. I have led the implementation
of the code and the execution of the experiments, while benefiting greatly from the
insightful guidance and suggestions provided by Patrick and Hugo in shaping the
methodology and refining the experimental procedures.
Throughout the chapter, the pronoun “we” is used to acknowledge this collaborative
effort while reflecting my role in leading this work.
3.1 Abstract
Musicians frequently rely on intuitive rhythmic gestures, such as tapping or
beatboxing, to communicate drum patterns. While these vocal or percussive sketches
convey hits and timing effectively, converting them into high quality drum tracks
often requires substantial manual effort. To address this, we present Got That Flow,
a flow based generative model that maps raw rhythmic gestures to realistic drum
recordings. Got That Flow is developed over the Stable Audio Open Small model
which is finetuned following the Sketch2Sound paradigm. The model is conditioned
using a combination of prepend conditioning and cross conditioning, enabling it to
capture both timbral structure and rhythmic variation. Users can supply one audio
input that encodes the rhythm (e.g., a beatboxing track) and another that specifies
the target drumkit timbre. The result is a system capable of producing rhythmically
coherent drum performances from unseen timbres in a zero shot manner.
(2) Timbre Conditioning via Latent Overwrite. Timbre control is achieved
through a direct latent overwriting mechanism. Given a timbre prompt (a reference
drum recording), we pass the audio through the SAO autoencoder to obtain
latent embeddings that capture the target drumkit's timbral characteristics. During
training, a randomly selected prefix of the diffusion latent sequence (between 20%
and 50%) is overwritten with the corresponding segment from the timbre latent. A
binary mask is constructed to mark the overwritten positions, ensuring that no
reconstruction loss is computed over these regions. This procedure constitutes a form
of explicit latent conditioning, which circumvents the complexity of concatenation
based methods, while robustly enforcing inheritance of timbral properties from the
reference prompt.
By combining these two conditioning pathways, TRIA-based additive rhythm control
and latent overwrite timbre conditioning, our model learns to disentangle rhythmic
structure from timbral style, enabling flexible generation of drum performances
from simple sound gestures. Importantly, unlike transcription based approaches,
this framework does not rely on symbolic intermediates, but rather leverages
self-supervised conditioning directly in the audio domain.
3.5.3 Pretrained Autoencoder for Audio Latents
At the core of the latent generative framework lies a pretrained autoencoder,
provided by SAO, which maps raw waveforms into a perceptually meaningful and
temporally compressed representation. The encoder consists of five convolutional blocks,
each performing strided convolutions for downsampling while simultaneously
expanding the number of channels. Prior to each downsampling stage, the model
applies a sequence of residual layers with dilated convolutions and Snake activations
[Ziyin et al., 2020], which improve the representation of oscillatory signals and fine
temporal structures. Figure 8 illustrates the architecture of the autoencoder.
The bottleneck of the autoencoder is parameterized as a variational latent space
with dimensionality 64, enabling stochastic sampling and regularization of the
latent manifold. The decoder mirrors the encoder in structure, employing transposed
strided convolutions to progressively upsample while reducing the channel
dimension. This architecture yields a 64-channel latent representation operating at a
temporal resolution of 21.5 Hz.
Operating in this low rate latent space substantially reduces the computational
burden of downstream generative modeling, while preserving high-quality
reconstructions. The pretrained autoencoder comprises approximately 156 million parameters
and is frozen during all subsequent training stages in our system.
Figure 8: Architecture of the autoencoder used for latent representation learning.
The encoder maps raw audio waveforms to a compressed latent space, while the
decoder reconstructs the audio from this latent representation.
3.5.4 Diffusion Transformer Architecture for Latent Diffusion (Flow) Modeling
The generative backbone of the model is a Diffusion Transformer (DiT) [Evans et al.,
2024], which extends the transformer paradigm to the diffusion setting. At its core,
the DiT is composed of stacked transformer blocks, each containing serially
connected self attention and gated multi layer perceptrons (MLPs), with residual skip
connections around each sublayer. Bias-less layer normalization is applied before
both the attention and MLP components, stabilizing training and improving
generalization. Rotary positional embeddings [Su et al., 2023] are applied to half of the
attention keys and queries, providing relative position encoding.
To incorporate conditioning, each block includes a cross attention mechanism.
Conditioning signals include text, timing, and diffusion timesteps. Text features are
extracted via a pretrained T5-base encoder [Raffel et al., 2023], and timestep
embeddings follow sinusoidal encodings [Ho et al., 2020]. Conditioning signals are
introduced through a combination of cross attention and prepended embeddings,
with text applied in cross attention, and timestep prepended to the input sequence.
Linear mappings are applied both at the input and output of the transformer to
project between the autoencoder latent space and the transformer's embedding
dimension. For efficiency, block wise attention [Dao et al., 2022] and gradient
checkpointing [Chen et al., 2016] are employed, reducing memory and compute
requirements.
The specific variant adopted here builds on the SAO framework [Evans et al., 2025],
but with architectural modifications to improve efficiency while retaining generation
quality. The DiT operates on the 64-channel latent representations produced by
the pretrained autoencoder, and is conditioned on 109M-parameter T5 embeddings
[Raffel et al., 2023]. Compared to the original 1.06B-parameter DiT, the model
reduces the embedding dimension from 1536 to 1024, and the depth from 24 to
16 layers, while additionally incorporating QK-LayerNorm [Henry et al., 2020] and
removing the “seconds start” embedding. These adjustments reduce the parameter
count to 340M while maintaining synthesis quality and improving training stability.
During inference, the base DiT is compiled using torch.compile, yielding further
gains in runtime efficiency.
3.5.5 Inference
At inference time, our system requires two inputs: a timbre prompt in the form of
a short drum recording, and a rhythm prompt provided as a sound gesture (e.g., a
tapping pattern, beatboxing, or another percussive input). The timbre prompt is
first passed through the Stable Audio autoencoder to obtain its latent
representation, which we use to initialize the prefix of the generation buffer. This prefix is
left unmasked and remains fixed throughout the process, ensuring that the timbral
identity of the final output faithfully matches the given recording. The remaining
portion of the buffer, corresponding in length to the rhythm prompt, is fully masked
and designated as the suffix to be generated. Rhythm features are then computed
from the rhythm prompt and temporally aligned to this masked suffix, providing
the model with explicit rhythmic conditioning.
Generation proceeds through an iterative denoising procedure based on Euler
sampling, carried out over 50 discrete steps. At each step, the model progressively refines
the masked suffix latents, gradually transforming them from noise into coherent
representations consistent with the provided rhythm. After every denoising update, the
original timbre prefix is reinserted into the buffer before continuing to the next step.
This replacement step is crucial: without it, the model could inadvertently alter the
timbral prompt during sampling, drifting away from the intended drum identity. By
continually restoring the prefix, we guarantee that the model is always conditioned
on the correct timbre context while generating the suffix.
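The sampling loop described above can be sketched as follows, assuming the noising convention x_tau = (1 - tau) * x0 + tau * eps so that tau = 1 is pure noise and tau = 0 is clean data. The callable `predict_x0` stands in for the conditioned DiT and is hypothetical, as are the list-based buffers; the sketch only illustrates the Euler update and the prefix reinsertion, not the actual implementation.

```python
import random


def euler_sample(predict_x0, prefix, suffix_len, dim, rng, num_steps=50):
    """Euler sampler with a fixed timbre prefix.

    predict_x0(latents, tau) is a hypothetical model call that returns an
    estimate of the clean latents for every frame in the buffer.
    """
    prefix_len = len(prefix)
    # Buffer = clean prefix followed by a suffix initialised with Gaussian noise.
    latents = [list(frame) for frame in prefix]
    latents += [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(suffix_len)]
    for step in range(num_steps):
        tau = 1.0 - step / num_steps
        tau_next = 1.0 - (step + 1) / num_steps
        x0_hat = predict_x0(latents, tau)
        for frame, frame_hat in zip(latents, x0_hat):
            for d in range(dim):
                # Under the convention above, d x_tau / d tau = (x_tau - x0) / tau.
                velocity = (frame[d] - frame_hat[d]) / tau
                frame[d] += (tau_next - tau) * velocity
        # Reinsert the clean timbre prefix after every update.
        latents[:prefix_len] = [list(frame) for frame in prefix]
    return latents


def toy_predict_x0(latents, tau):
    """Toy stand-in for the model: always predicts a constant clean latent."""
    return [[0.25] * len(frame) for frame in latents]


prefix = [[1.0, -1.0] for _ in range(4)]
result = euler_sample(toy_predict_x0, prefix, suffix_len=6, dim=2, rng=random.Random(0))
print(result[0], result[-1])
```

With the toy predictor, the suffix converges to the constant prediction while the prefix stays untouched, which is the behaviour the reinsertion step is meant to guarantee.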
The result is a sequence of latents in which the suffix is coherently “filled in” using
both sources of conditioning: timbral characteristics from the prefix and rhythmic
structure from the computed TRIA features. The tradeoff between strict
adherence to conditioning and creative variability is controlled by classifier-free guidance
(CFG). In our implementation, the CFG scale can be adjusted to emphasize either
timbre or rhythm conditioning, or to allow for looser interpretations that produce
more diverse outputs. To make this process accessible to end users, we developed
a Gradio interface in which the number of sampling steps and the CFG scale are
exposed as adjustable parameters. This allows musicians and producers to
experiment with different configurations, from faithful reconstructions to more imaginative
generations.
3.5.6 Training Objective and Conditioning
For training, we follow the flow matching framework [Lipman et al., 2023], where
noise is applied to the encoded audio latents and the model is trained to denoise
them. Let $x_0 \in \mathbb{R}^{F \times D}$ denote the encoded audio latents of dimension $D$ with $F = 256$
latent frames. We sample a timestep $\tau \sim \mathcal{U}(0, 1)$ and construct noisy latents by
convex combination with Gaussian noise:
$$x_\tau = (1 - \tau)\, x_0 + \tau\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I). \tag{3.1}$$
The model is trained to predict $x_0$ from $x_\tau$ using the rectified flow objective, i.e.,
$$\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{x_0, \epsilon, \tau}\left[ \lVert \hat{x}_0(x_\tau, \tau, c) - x_0 \rVert_2^2 \right], \tag{3.2}$$
where $\hat{x}_0(\cdot)$ denotes the model prediction and $c$ are the conditioning signals. Note that the
loss is computed only over the suffix portion of the latents (outside the timbre prefix
span).
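Equations (3.1) and (3.2) can be made concrete with a small sketch in plain Python. The helper names `make_noisy_latents` and `masked_mse` are illustrative assumptions, and the loss is averaged over the unmasked elements, which differs from the squared norm in (3.2) only by a constant factor.

```python
import random


def make_noisy_latents(x0, tau, rng):
    """Eq. (3.1): x_tau = (1 - tau) * x0 + tau * eps, with eps ~ N(0, I)."""
    return [[(1.0 - tau) * v + tau * rng.gauss(0.0, 1.0) for v in frame] for frame in x0]


def masked_mse(x0_hat, x0, loss_mask):
    """Mean squared error over frames whose mask entry is 1 (the suffix)."""
    total, count = 0.0, 0
    for pred_frame, target_frame, keep in zip(x0_hat, x0, loss_mask):
        if keep:
            for p, t in zip(pred_frame, target_frame):
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


x0 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
loss_mask = [0, 1, 1]  # first frame belongs to the timbre prefix
x_tau = make_noisy_latents(x0, tau=0.3, rng=random.Random(0))
prediction = [[9.0, 9.0], [3.0, 4.0], [5.0, 8.0]]  # stand-in for the model output
loss = masked_mse(prediction, x0, loss_mask)
print(loss)
```

The large error on the first frame does not affect the loss because that frame lies inside the prefix span.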
Training. At each iteration, we begin by sampling a drum recording from the
MusDB training set and extracting a random 11.89 second segment of the isolated
drum track. This segment is passed through the pretrained Stable Audio
autoencoder (SAO) to obtain a sequence of continuous latents, which serve as the
reconstruction target. Flow matching noise is then applied to the latent sequence, and the
model is trained using the rectified flow objective, with mean squared error (MSE)
computed between predicted and target latents [Chang et al., 2022].
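The segment length of 11.89 seconds is consistent with F = 256 latent frames at the 21.5 Hz latent rate quoted for the autoencoder. The short check below assumes a 44.1 kHz sample rate and a total downsampling factor of 2048; neither value is stated in the text, so both should be read as assumptions chosen to reproduce the quoted figures.

```python
SAMPLE_RATE_HZ = 44_100      # assumption: not stated in the text
DOWNSAMPLING_FACTOR = 2_048  # assumption: gives a latent rate of about 21.5 Hz
NUM_LATENT_FRAMES = 256      # F in the training objective

latent_rate_hz = SAMPLE_RATE_HZ / DOWNSAMPLING_FACTOR
segment_seconds = NUM_LATENT_FRAMES / latent_rate_hz
print(round(latent_rate_hz, 2), round(segment_seconds, 2))
```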
To condition generation, we incorporate both timbre and rhythm information in
complementary ways. To avoid redundant information, rhythm features are zeroed
out in the prefix region so that the model relies solely on timbre latents there, while
in the suffix region the model must reconstruct the target latents by combining
timbre information from the prefix with rhythmic cues from TRIA.
To improve generalization to rhythm prompts from diverse sources such as tapping,
beatboxing, or low quality recordings, we apply a set of augmentations to audio when
computing rhythm features. These include additive Gaussian noise, pitch shifting,
and high or low-pass filtering. Importantly, these augmentations are never applied
to the audio used for encoding target latents, ensuring that the model always learns
to predict clean latents while adapting to noisy or degraded rhythm conditioning.
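The asymmetry between the two branches can be sketched as below: only the copy of the audio used for rhythm features is degraded, while the copy used for target latents is left untouched. The one-pole low-pass filter, the noise level, and all function names are illustrative assumptions; pitch shifting is omitted for brevity, and the thesis does not specify the filters or parameters used.

```python
import random


def add_gaussian_noise(samples, std, rng):
    """Return a copy of the signal with additive Gaussian noise."""
    return [s + rng.gauss(0.0, std) for s in samples]


def one_pole_lowpass(samples, alpha):
    """Simple low-pass: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1], y[-1] = 0."""
    out, prev = [], 0.0
    for s in samples:
        prev = alpha * s + (1.0 - alpha) * prev
        out.append(prev)
    return out


def augment_rhythm_audio(samples, rng, noise_std=0.01, alpha=0.5):
    """Degrade only the audio that feeds the rhythm feature extractor."""
    return one_pole_lowpass(add_gaussian_noise(samples, noise_std, rng), alpha)


clean = [1.0, 1.0, 1.0, 1.0]
rhythm_input = augment_rhythm_audio(clean, random.Random(0))  # goes to rhythm features
target_input = clean                                          # goes to the autoencoder
print(rhythm_input, target_input)
```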
Finally, we employ classifier-free guidance (CFG) [Ho and Salimans, 2022] dropout
to control the degree of adherence to conditioning. During training, rhythm
conditioning is independently disabled in 10% of iterations by replacing the rhythm embeddings
with null prompts, enabling the model to learn both conditional and unconditional
mappings. At inference time, we apply CFG with a scale of 2, combining the
unconditional and conditional predictions to balance timbre prompt adherence with
rhythm prompt adherence.
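The two halves of classifier-free guidance can be sketched as follows. The guidance formula used here, unconditional plus scale times the difference, is one common parameterization in which a scale of 1 recovers the conditional prediction; the text does not state which parameterization is used, so this choice, along with the names `apply_cfg`, `maybe_drop_conditioning`, and `NULL_PROMPT`, is an assumption.

```python
import random

NULL_PROMPT = [0.0, 0.0]  # hypothetical null embedding


def maybe_drop_conditioning(cond_embedding, null_embedding, rng, drop_prob=0.1):
    """Training-time dropout: swap the rhythm embedding for the null prompt."""
    return null_embedding if rng.random() < drop_prob else cond_embedding


def apply_cfg(uncond, cond, scale):
    """Inference-time guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


rng = random.Random(0)
rhythm_embedding = [0.3, -0.7]
num_dropped = sum(
    maybe_drop_conditioning(rhythm_embedding, NULL_PROMPT, rng) is NULL_PROMPT
    for _ in range(10_000)
)
guided = apply_cfg(uncond=[0.0, 1.0], cond=[1.0, 3.0], scale=2.0)
print(num_dropped, guided)
```

Under this parameterization a scale above 1 pushes the prediction beyond the conditional output, away from the unconditional one.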
3.6 Experiments and Evaluation
We developed the system through a series of experimental refinements:
Initial model with no CFG control. This was our first attempt at implementing
the core architecture and training procedure. Even though the loss went down, the
generated samples were of low quality and did not adhere well to either timbre or
rhythm prompts. There was no mechanism to balance the two conditioning sources,
leading to outputs that were often muddled or incoherent.
Incorporation of classifier-free guidance (CFGScale). Adding CFG improved
our ability to control the influence of timbre and rhythm prompts. At very
high CFG scales (5-7), the model produced outputs that started to adhere
to the rhythm prompt, but the timbre was often lost or distorted.
One notable observation was that the model struggled to maintain a consistent
timbral identity when heavily conditioned on rhythm.
Switching controls projection from latent space to DiT embedding space
(DiTProj). Instead of adding our conditioning signals directly in the latent space,
we introduced linear projections to map both the TRIA features and the timbre
latents into the DiT's embedding space. This change significantly improved the
model's ability to integrate conditioning information, leading to better adherence to
both prompts. The outputs became more coherent, with clearer rhythmic structures
and more faithful timbral characteristics, even at normal CFG scales.
Data augmentations to enhance generalization to real world gestures
(DataAug). To ensure the model could handle a variety of rhythm prompts, we
applied augmentations such as noise addition, pitch shifting, and filtering when
computing TRIA features. This step was crucial for improving robustness, as it exposed
the model to a wide range of rhythmic inputs during training. The augmented
training led to better performance on unseen rhythm prompts, particularly those
derived from beatboxing or tapping.
We evaluate the model from each stage on the MusDB test set to assess its
performance. Evaluation focuses on two key aspects: Rhythm Prompt Adherence and
Timbre Prompt Adherence.
3.6.1 Rhythm Prompt Adherence
To assess how well our models preserve rhythmic information from the prompt, we
adopt an automatic drum transcription based evaluation, on the held out test split
of MusDB (50 tracks), which was not used for training. For each track, we consider
the ground truth drum stem as a reference and generate corresponding drum outputs
from our system.
Both the ground truth stems and the model outputs are transcribed using the
pretrained Frame-RNN drum transcription model [Zehren et al., 2023], yielding
symbolic onset sequences for kick, snare, and hi-hat. We measure the correspondence
between the two transcriptions using the onset F1 score at 30ms tolerance, as is
standard practice in drum transcription evaluation [Vogl et al., 2017, Heydari et al.,
2021]. Higher F1 values indicate closer alignment between the temporal placement
of drum hits in the reference stem and the generated output.
Model      CFG   F1 Kick   F1 Snare   F1 HiHats
CFGScale   1     0.00      0.00       0.00
CFGScale   2     0.00      0.00       0.00
CFGScale   3     0.05      0.01       0.02
DiTProj    1     0.25      0.30       0.00
DiTProj    2     0.43      0.46       0.16
DiTProj    3     0.43      0.52       0.24
DataAug    1     0.59      0.43       0.15
DataAug    2     0.82      0.65       0.38
DataAug    3     0.84      0.66       0.37
Table 4: Channelwise F1 scores for each model across CFG scales.
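The onset F1 measure behind these scores can be sketched for a single drum channel as follows. This is a simplified illustration that matches sorted onsets greedily within the 30 ms tolerance; it is not the evaluation code used in the thesis, and established toolkits may use a different matching procedure. The function name `onset_f1` and the example onset times are hypothetical.

```python
def onset_f1(reference, estimated, tolerance=0.03):
    """F1 score between two onset lists (seconds), one-to-one matching."""
    ref, est = sorted(reference), sorted(estimated)
    i = j = matches = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            # Onsets are close enough: count a hit and consume both.
            matches += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # estimated onset is too early to match anything later
        else:
            i += 1  # reference onset has no estimate nearby
    if matches == 0:
        return 0.0
    precision = matches / len(est)
    recall = matches / len(ref)
    return 2 * precision * recall / (precision + recall)


reference_kicks = [0.0, 0.5, 1.0, 1.5]
generated_kicks = [0.01, 0.52, 1.2]
score = onset_f1(reference_kicks, generated_kicks)
print(round(score, 3))
```

In this example two of the three generated onsets fall within tolerance of a reference onset, giving a precision of 2/3 and a recall of 1/2.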
3.6.2 Timbre Prompt Adherence
To evaluate timbre preservation, we again draw on the MusDB test set drum stems
as source material. For each evaluation instance, we supply the model with a timbre
prompt (a drum stem excerpt) and a rhythm prompt. The goal is to determine
whether the generated output retains the spectral and textural qualities of the
timbre prompt, while remaining unaffected by the rhythm prompt in terms of timbral
content.
We quantify timbre similarity using feature based measures computed on short time
spectral representations. Specifically, we compare the 80 dimensional temporally
averaged MFCC representations of the generated audio and timbre prompt via cosine
similarity. A higher similarity indicates stronger timbral adherence.
Model      CFG 1   CFG 2   CFG 3
DiTProj    0.94    0.95    0.96
DataAug    0.94    0.96    0.98
Table 5: Mean timbre similarity for each model across CFG scales.
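The similarity measure reported in Table 5 reduces to two steps once MFCC frames are available: averaging over time and taking a cosine similarity. The sketch below assumes the MFCC frames have already been computed by some external feature extractor; the toy two-dimensional frames and the function names are illustrative only.

```python
import math


def time_average(frames):
    """Average a list of feature frames over time, one value per coefficient."""
    num_frames = len(frames)
    return [sum(column) / num_frames for column in zip(*frames)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "MFCC" frames; the evaluation in the text uses 80 dimensions.
generated_frames = [[1.0, 2.0], [3.0, 6.0]]
prompt_frames = [[2.0, 4.0], [2.0, 4.0]]
similarity = cosine_similarity(time_average(generated_frames), time_average(prompt_frames))
print(round(similarity, 3))
```

Because cosine similarity ignores overall scale, this measure compares the shape of the averaged spectrum rather than loudness.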
3.7 Discussion
Tables 4 and 5 summarize the results of our evaluations. Rhythm prompt adherence,
as measured by onset F1 scores, shows a clear progression across model variants. The
CFGScale model exhibits minimal rhythmic fidelity, with F1 scores near zero
across all channels. The DiTProj variant demonstrates substantial improvements,
particularly at higher CFG scales, indicating that projection space modifications
significantly enhance the model's ability to follow rhythmic cues. The DataAug
model achieves the highest F1 scores, reaching 0.82 for kick and 0.65
for snare at a CFG scale of 2. Hi-hat F1 scores remain lower overall, reflecting the inherent
difficulty of accurately capturing high frequency, rapid percussion elements. Timbre
similarity remains consistently high for both models in Table 5, with mean cosine
similarities of 0.94 or higher. It must be noted that timbre evaluations here are restricted to drum
sounds present in the MusDB dataset. In future work we aim to quantify timbre
generalization to out of distribution drum sounds.
Nevertheless, our listening tests also surfaced important limitations. While
classifier-free guidance (CFG) and rhythm augmentations improved performance, timbre
generalization is imperfect, particularly for rare drum sounds or heavily processed stems
in the MusDB dataset. Similarly, transcription based evaluation introduces its own
sources of error, as the accuracy of onset detection directly impacts rhythm
adherence scores. Evaluation protocols that integrate perceptual listening tests or learned
embeddings more closely aligned with timbre may yield more reliable assessments.
Beyond the scope of quantitative metrics, qualitative experiments showed that the
system can recombine rhythmic and timbral cues in musically compelling ways,
often producing plausible and stylistically coherent drum tracks. This supports our
broader vision of generative rhythm timbre transfer: enabling musicians to impose
a desired groove structure onto a chosen drum sound, much like a producer layering
the feel of one drummer with the kit of another. However, further work is needed to
refine the model's ability to handle extreme or out of distribution inputs, as well as
to explore user interfaces that facilitate intuitive control over the generation process.
3.8 Conclusion and Future Work
Several avenues emerge for future work. First, more robust timbre similarity
measures, potentially based on pretrained audio embeddings, could complement MFCC
cosine metrics and capture perceptual qualities beyond spectral energy. Second,
user centric evaluation will be critical: perceptual studies with musicians and
producers could reveal how controllable and musically useful such systems truly are.
Finally, extending this framework beyond drum synthesis toward other instrument
classes may generalize the prefix suffix conditioning paradigm as a powerful tool for
controllable audio generation.
Chapter 4
Model Deployment for End Users
While the core of this thesis has focused on the design, training, and evaluation
of machine learning models for modeling expressive rhythms, an equally important
contribution lies in the exporting and deployment of these models into usable tools.
Making research artifacts available to musicians and practitioners outside the lab is
essential for bridging the gap between academic research and artistic practice.
4.1 Introduction
In this chapter, we describe the process of exporting and deploying the models
developed in the two independent works presented in this thesis: (1) Learning
Microrhythm in Uruguayan Candombe using Transformers, and (2) Got That Flow:
Flow-based Beatbox-to-Rhythm Generation. For both projects, we detail the export
formats used, the deployment contexts (plugins and demos), and the reception or
current status of these deployments.
4.2 Learning Microrhythm in Uruguayan Candombe using Transformers
4.2.1 Exporting via TorchScript
The trained Transformer model for microrhythm generation was exported using
TorchScript, a serialization format provided by PyTorch that allows models to be run
independently of Python. TorchScript enables integration into C++ environments
and plugin frameworks, which is crucial for deployment in digital audio workstations
(DAWs).
Unlike ONNX or other intermediate representations, TorchScript provides a
relatively seamless path for PyTorch-trained models to be used in C++ without
introducing third-party runtime dependencies. This reduces deployment friction and
improves stability within DAW environments.
Figure 9: Interface of the Candombe VST plugin. The plugin processes incoming
MIDI events and applies microrhythmic transformations based on the learned
Candombe grooves.
The PyTorch model was traced using torch.jit.trace and subsequently saved
as a TorchScript module. The exported module was validated against the Python
version to ensure parity of results.
4.2.2 Wrapping in NeuralMidiFX VST
To make the model accessible to musicians, the TorchScript model was integrated
into the NeuralMidiFX [Haki et al., 2023b] VST wrapper. This wrapper provides a
framework for embedding neural models inside VST plugins. Within this setup, the
plugin accepts incoming MIDI input and outputs rhythmically transformed MIDI
events, applying microrhythmic deviations learned from Candombe performances.
The plugin interface is shown in Figure 9.
Musicians can load the plugin in their preferred DAW (Ableton Live, Logic Pro,
etc.) and directly enrich MIDI sequences with authentic Candombe grooves.
Bibliog aphy
Guillaume Alain, Maxime Chevalier-Boisvert, Frederic Osterrath, and Remi
Piche-Taillefer. DeepDrummer: Generating Drum Loops using Deep Learning and a
Human in the Loop, August 2020.
Jeffrey Adam Bilmes. Timing Is of the Essence: Perceptual and Computational
Techniques for Representing, Learning, and Reproducing Expressive Timing in
Percussive Rhythm. Thesis, Massachusetts Institute of Technology, 1993.
Grigore Burloiu. Adaptive Drum Machine Microtiming with Transfer Learning and
RNNs. In Extended Abstracts for the Late-Breaking Demo Session of the 21st Int.
Society for Music Information Retrieval Conf., 2020.
Antoine Caillon and Philippe Esling. RAVE: A variational autoencoder for fast and
high-quality neural audio synthesis, December 2021.
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT:
Masked Generative Image Transformer, February 2022.
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training Deep Nets
with Sublinear Memory Cost, April 2016.
Martin Clayton, Kelly Jakubowski, Tuomas Eerola, Peter E. Keller, Antonio
Camurri, Gualtiero Volpe, and Paolo Alborno. Interpersonal Entrainment in Music
Performance: Theory, Method, and Model. Music Perception, 38(2):136–194,
November 2020. ISSN 0730-7829. doi: 10.1525/mp.2020.38.2.136.
Anne Danielsen, Ragnhild Brøvig, Kjetil Klette Bøhler, Guilherme Schmidt Câmara,
Mari Romarheim Haugen, Eirik Jacobsen, Mats S. Johansson, Olivier Lartillot,
Kristian Nymoen, Kjell Andreas Oddekalv, Bjørnar Sandvik, George Sioros, and
Justin London. There's More to Timing than Time: Investigating Musical
Microrhythm Across Disciplines and Cultures. Music Perception, 41(3):176–198,
February 2024. ISSN 0730-7829. doi: 10.1525/mp.2024.41.3.176.
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, June
2022.
Matthew Davies, Guy Madison, Pedro Silva, and Fabien Gouyon. The Effect of
Microtiming Deviations on the Perception of Groove in Short Rhythms. Music
Perception, 30(5):497–510, June 2013. ISSN 0730-7829. doi:
10.1525/mp.2013.30.5.497.
Nils Demerlé, Philippe Esling, Guillaume Doras, and David Genova. Combining
audio control and style transfer using latent diffusion. In Proceedings of the 25th
Int. Society for Music Information Retrieval Conference. arXiv, July 2024. doi:
10.48550/arXiv.2408.00196.
Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu, and Adam Roberts. DDSP:
Differentiable Digital Signal Processing. In International Conference on Learning
Representations, September 2019.
Zach Evans, Julian D. Parker, C. J. Carr, Zack Zukowski, Josiah Taylor, and Jordi
Pons. Long-form music generation with latent diffusion, July 2024.
Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi
Pons. Stable Audio Open. In ICASSP 2025 - 2025 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, April 2025. doi:
10.1109/ICASSP49660.2025.10888461.
Luis Ferreira. An Afrocentric Approach to Musical Performance in South Black
Atlantic: The Candombe Drumming. Trans: Transcultural Music Review =
Revista Transcultural de Música, ISSN 1697-0101, Nº. 11, 2007, January 2007.
Anders Friberg and Andreas Sundström. Swing Ratios and Ensemble Timing in
Jazz Performance: Evidence for a Common Rhythmic Pattern. Music Perception,
19(3):333–349, March 2002. ISSN 0730-7829. doi: 10.1525/mp.2002.19.3.333.
Magdalena Fuentes, Lucas S. Maia, Martín Rocamora, L. Biscainho, H. Crayencour,
S. Essid, and J. Bello. Tracking Beats and Microtiming in Afro-Latin American
Music Using Conditional Random Fields and Deep Learning. In International
Society for Music Information Retrieval Conference, 2019.
Alf Gabrielsson. Interplay between Analysis and Synthesis in Studies of Music
Performance and Music Experience. Music Perception, 3(1):59–86, October 1985.
ISSN 0730-7829. doi: 10.2307/40285322.
Hugo Flores García, Oriol Nieto, Justin Salamon, Bryan Pardo, and Prem
Seetharaman. Sketch2Sound: Controllable Audio Generation via Time-Varying Signals
and Sonic Imitations. In ICASSP 2025 - 2025 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, April 2025. doi:
10.1109/ICASSP49660.2025.10888184.
Jon Gillick, Adam Roberts, Jesse Engel, Douglas Eck, and David Bamman. Learning
to Groove with Inverse Sequence Transformations. In Proceedings of the 36th
International Conference on Machine Learning, pages 2269–2279. PMLR, May
2019.
Daniel Gómez-Marín, Sergi Jordà, and Perfecto Herrera. Network representations of
drum sequences for classification and generation. Frontiers in Computer Science,
6, January 2025. ISSN 2624-9898. doi: 10.3389/fcomp.2024.1476996.
Behzad Haki, Marina Nieto, Teresa Pelinski, and Sergi Jordà. Real-Time Drum
Accompaniment Using Transformer Architecture. In Proceedings of the 3rd
International Conference on AI and Musical Creativity. AIMC, September 2022.
doi: 10.5281/ZENODO.7088343.
Behzad Haki, Cheuk Lun Isaac Lee, and Sergi Jordà. Taptamdrum: A Dataset
for Dualized Drum Patterns. In Proceedings of the 24th International Society for
Music Information Retrieval Conference, 2023a.
Behzad Haki, Julian Lenz, and Sergi Jorda. NeuralMidiFx: A Wrapper Template
for Deploying Neural Networks as VST3 Plugins. AIMC 2023, August 2023b.
Holger Hennig, Ragnar Fleischmann, Anneke Fredebohm, York Hagmayer, Jan
Nagler, Annette Witt, Fabian J. Theis, and Theo Geisel. The Nature and Perception
of Fluctuations in Human Musical Rhythms. PLOS ONE, 6(10):e26457, October
2011. ISSN 1932-6203. doi: 10.1371/journal.pone.0026457.
Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen.
Query-Key Normalization for Transformers, October 2020.
Mojtaba Heydari, Frank Cwitkowitz, and Zhiyao Duan. BeatNet: CRNN and
Particle Filtering for Online Joint Beat Downbeat and Meter Tracking. In 22nd
International Society for Music Information Retrieval (ISMIR) Conference. arXiv,
August 2021. doi: 10.48550/arXiv.2108.03576.
Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance, July 2022.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic
Models, December 2020.
Vijay Iyer. Embodied Mind, Situated Cognition, and Expressive Microtiming in
African-American Music. Music Perception, 19(3):387–414, March 2002. ISSN
0730-7829. doi: 10.1525/mp.2002.19.3.387.
Luis Jure and Martín Rocamora. Microtiming in the rhythmic structure of
Candombe drumming patterns. In Fourth International Conference on Analytical
Approaches to World Music, New York, USA, June 2016.
Luis Jure and Martín Rocamora. Subir la llamada: Negotiating tempo and dynamics
in Uruguayan Candombe drumming. In International Workshop on Folk Music
Analysis, June 2018.
Luis Jure and Martín Rocamora. Carat: A toolbox for computer–aided rhythm
analysis. In Analytical Approaches to World Music Special Topics Symposium
(AAWM 2019). Zenodo, July 2019. doi: 10.5281/ZENODO.10030090.
Olivier Lartillot and Fred Bruford. Bistate Reduction and Comparison of Drum
Patterns. In Proceedings of the 21st International Society for Music Information
Retrieval Conference, 2021.
Stefan Lattner and Maarten Grachten. High-Level Control of Drum Track
Generation Using Learned Patterns of Rhythmic Interaction. In IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics (WASPAA 2019). arXiv,
August 2019. doi: 10.48550/arXiv.1908.00948.
Antoine Lavault, Axel Roebel, and Matthieu Voiry. StyleWaveGAN: Style-based
synthesis of drum sounds using generative adversarial networks for higher
audio quality. In 30th European Signal Processing Conference (EUSIPCO 2022),
Belgrade, Serbia, August 2022.
Kyungyun Lee, Wonil Kim, and Juhan Nam. PocketVAE: A Two-step Model for
Groove Generation and Control, July 2021.
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt
Le. Flow Matching for Generative Modeling, February 2023.
Anmol Mishra, Behzad Haki, Satyajeet Prabhu, and Martin Rocamora. Groove
Transfer VST For Latin American Rhythms. In Extended Abstracts for the
Late-Breaking Demo Session of the 25th Int. Society for Music Information Retrieval
Conf., San Francisco, November 2024.
Anmol Mishra, Satyajeet Prabhu, Behzad Haki, and Martín Rocamora. Learning
Microrhythm in Uruguayan Candombe using Transformers. In Proceedings of the
International Computer Music Conference (ICMC), Boston, 2025.
Luiz Naveda, Fabien Gouyon, Carlos Guedes, and Marc Leman. Microtiming
Patterns and Interactions with Musical Properties in Samba Music. Journal of
New Music Research, 40(3):225–238, September 2011. ISSN 0929-8215. doi:
10.1080/09298215.2011.603833.
Javier Nistal Hurlé, Stefan Lattner, and Gael Richard. DrumGAN: Synthesis of
drum sounds with timbral feature conditioning using Generative Adversarial
Networks. In 21st International Society for Music Information Retrieval Conference
(ISMIR), Toronto, Canada, August 2020.
Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, C. J. Carr, Julian
Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor
Berg-Kirkpatrick, and Jordi Pons. Fast Text-to-Audio Generation with Adversarial
Post-Training, May 2025.
Patrick O’Reilly, Hugo Flores Garcia, Prem Seetharaman, and Bryan Pardo. Masked
Token Modeling for Zero-shot Anything-to-drums Conversion. In Extended
Abstracts for the Late-Breaking Demo Session of the 25th Int. Society for Music
Information Retrieval Conf.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer
Learning with a Unified Text-to-Text Transformer, September 2023.
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis,
and Rachel Bittner. MUSDB18 - a corpus for music separation. December 2017.
doi: 10.5281/zenodo.1117371.
António Ramires, Rui Penha, and Matthew E. P. Davies. User Specific Adaptation
in Automatic Transcription of Vocalised Percussion, November 2018.
Martín Rocamora, Luis Jure, Bernardo Marenco, Magdalena Fuentes, Florencia
Lanzaro, and Alvaro Gomez. An audio-visual database of Candombe performances
for computational musicological studies. In Congreso Internacional de Ciencia y
Tecnología Musical, September 2015.
Vincent Rosinach and Traube Caroline. Measuring swing in Irish traditional fiddle
music. In International Conference on Music Perception and Cognition, 2006.
André C. Santos and F. Amilcar Cardoso. From Taps to Drums: Audio-to-audio
Percussion Style Transfer. In Extended Abstracts for the Late-Breaking Demo
Session of the 22nd Int. Society for Music Information Retrieval Conf, 2023.
William Sethares. The geometry of musical rhythm: What makes a “good” rhythm
good? Journal of Mathematics and the Arts, 8:135–137, December 2014. doi:
10.1080/17513472.2014.906116.
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu.
RoFormer: Enhanced Transformer with Rotary Position Embedding, November
2023.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In
Advances in Neural Information Processing Systems, volume 30. Curran Associates,
Inc., 2017.
Richard Vogl, Matthias Dorfer, and Peter Knees. Drum transcription from
polyphonic music with recurrent neural networks. In 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 201–205, March
2017. doi: 10.1109/ICASSP.2017.7952146.
I.-Chieh Wei, Chih-Wei Wu, and Li Su. Generating Structured Drum Pattern Using
Variational Autoencoder and Self-similarity Matrix. In International Society for
Music Information Retrieval Conference, 2019.
Mickaël Zehren, Marco Alunno, and Paolo Bientinesi. High-Quality and
Reproducible Automatic Drum Transcription from Crowdsourced Data. Signals, 4(4):
768–787, December 2023. ISSN 2624-6120. doi: 10.3390/signals4040042.
Liu Ziyin, Tilman Hartwig, and Masahito Ueda. Neural Networks Fail to Learn
Periodic Functions and How to Fix It, October 2020.