Generating Abstract Rhythm Streams

Author: Bosma, Justin

Publisher: Zenodo

DOI: 10.5281/zenodo.17304305

Source: https://zenodo.org/records/17304305/files/Justin-Bosma_SMC_2025_Master_Thesis.pdf

Master in Sound and Music Computing
Universitat Pompeu Fabra
Generating Abstract Rhythm Streams
Justin Bosma
Supervisor: Sergi Jordà
Co-Supervisor: Behzad Haki
August 2025
Contents
1 Introduction
1.1 Motivation
1.2 Outline
1.3 Supplementary Material
2 Related Works
2.1 Architectures
2.2 Transformers
2.2.1 Input Representation for Transformer Model
2.2.2 Positional Encoding
2.3 Variational Autoencoders
2.3.1 Loss Criteria
2.4 Transformer Based Models for Music Generation
2.5 Non-Tokenized Transformer Inputs and Representation
2.6 Utilizing the Latent Space for Generative Purposes
2.7 Rhythmic Features
2.8 GrooveTransformer
2.8.1 Overview
2.8.2 Architecture
2.8.3 Data
2.8.4 Streams/Representation
2.8.5 Groove
3 Design
3.1 Streams
3.2 Instrument Overview
3.2.1 Encoder
3.2.2 Latent Space
3.2.3 Decoder
4 Datasets
4.1 Annotated Candombe Recordings
4.2 LAKH
4.3 GrooveMIDI
4.4 El Bongosero
4.5 TapTamDrum
4.6 Dataset Pre-processing
5 Model
5.1 Architecture
5.2 Data and Training
5.3 Validation
5.3.1 Density
5.3.2 Output Stream Quality
5.3.3 Relationships between Feature Values
6 Discussion
6.1 Data
6.2 Features
7 Conclusion
7.1 Data Changes
7.2 Feature Changes
7.3 Future Implementation
List of Figures
Bibliography
A First Appendix
B Second Appendix
Dedication
I dedicate this work to my mother, Linnae. Without your constant support, I would never have been able to be the person I am today. I am so grateful to have a mother as wonderful as you.
Chapter 1. Introduction
melody creation, harmonic additions to existing melodies, the generation of rhythms related to the piece of music, and various tools to aid in the mixing and mastering activities. Although these systems have helped in the creative process, they tend to be very specific in their implementation, such as rhythm generation systems with a mapping of their outputs to specific drums like kick, snare, and hi-hat.
For our system, currently known as Triple Streams, we aim to create a real-time system that focuses on the creation of abstract rhythmic streams. Our intent is to push the user to think more methodically about how to apply these rhythmic streams to enhance their performance without sacrificing creative choices.
To achieve this goal, we pursue three main objectives:
1. Dataset Creation: Build a collection of diverse and interesting rhythmic sequences drawn from both drum-based and non-drum sources, moving beyond one-to-one mappings between instruments and output streams.
2. Model Design: Develop a generative model with real-time control features, enabling performers to manipulate outputs dynamically and treat the system as a playable instrument.
3. Evaluation: Ensure that the system produces diverse, musically meaningful rhythmic streams that relate coherently to the user's input.
1.2 Outline
• Chapter 2: Related Works – Overview of prior work that informs our architecture, data representation, and control features. We examine relevant machine learning models, rhythmic feature representations, and the GrooveTransformer system that serves as our foundation.
• Chapter 3: Instrument Design – High-level overview of the system's design, partitioned into encoder, latent space, and decoder components. We also describe the system's inputs, outputs, and interactive controls.
• Chapter 4: Data Curation – Description of datasets used for training and validation, preprocessing methods, challenges encountered, and the rationale behind design choices.
• Chapter 5: Model Overview and Validation – Presentation of the model's architecture, training process, and evaluation of rhythmic outputs, including the effect of control features on system performance.
• Chapter 6: Discussion – Reflection on the system's current state, identifying strengths, limitations, and possible improvements in data, control features, and output quality.
• Chapter 7: Conclusion – Summary of key findings, open challenges, and planned future work, including refinements to datasets, control mechanisms, and deployment.
1.3 Supplementary Material
Throughout this thesis, we will use supplementary material to provide additional insights and explanations of the text. All download links, processed data, figures, code, and other resources used in this thesis can be found in the following GitHub repository:
https://github.com/justinbosma/TripleStreams
Chapter 2
Related Works
Over the past decade, numerous contributions have been made to the intersection of audio, music, and machine learning, and each contribution can be seen as a stepping stone to our current work. However, in this section we will focus on the works most related to our current research, and how they directly impact our implementation. We will begin with a brief overview of the machine learning models employed in our system. Following this, we will examine some of the more recent generative models that have had a strong influence on our decisions. Finally, we will give an overview of the prior work of Behzad Haki, specifically the GrooveTransformer, allowing us to show where our current research develops from.
2.1 Architectures
Due to the sequential nature of symbolic notation, and specifically rhythm, we employ a transformer model for the modeling of temporal dependencies. To allow us to create variations on rhythms, we utilize a variational autoencoder (VAE) to encode the rhythms into a latent space, where we can interpolate between similar rhythms or introduce slight variance in the generations based on the input. Below, we will give an overview of both architectures.
2.2 Transformers
First described by Vaswani et al. in their 2017 paper "Attention is All You Need", the transformer is a deep learning model that has become fundamental to the area of Natural Language Processing (NLP) [3]. This model differs from Recurrent Neural Networks (RNN) in that it allows parallel processing of input sequences, making training and inference more efficient.
The two important aspects of transformer models are positional encoding and self-attention. In the example of text, positional encoding attaches a position-dependent value to each word, so the model does not have to infer the word's position from the order in which it is processed. This provides information about the token's place, allowing the model to consider sequential information. With self-attention, each word is compared against every other word in parallel and assigned weights that indicate how strongly the words relate. This allows the model to "learn" the grammar based on how words are used in everyday life.
2.2.1 Input Representation for Transformer Model
At the bottom-most encoder, the input is first broken down into tokens, which would be words and subwords in NLP. Each token is assigned a unique index from the model's vocabulary. After assigning the unique index, each token is converted into a dense embedding vector of size d by looking up its row in an embedding matrix of size V×d, where V is the vocabulary size and d is the embedding dimension.
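The lookup described above can be sketched as follows. This is a minimal illustration, assuming a toy three-word vocabulary and a randomly initialized embedding matrix (both hypothetical); in a trained model the matrix values are learned.

```python
import random

# Hypothetical toy vocabulary; a real model uses a much larger one.
vocab = {"kick": 0, "snare": 1, "hat": 2}
V, d = len(vocab), 4  # vocabulary size and embedding dimension

# Embedding matrix of shape V x d, randomly initialized for illustration only.
rng = random.Random(0)
embedding_matrix = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(V)]

def embed(tokens):
    """Map each token to its vocabulary index, then to its d-dimensional row."""
    return [embedding_matrix[vocab[token]] for token in tokens]

vectors = embed(["kick", "hat", "kick"])
```

Each token yields one vector of length d, and repeated tokens share the same row of the matrix.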
2.2.2 Positional Encoding
To keep track of the order of the sequence, Transformers use a positional encoding to maintain the information about the relative or absolute position of the token. This can be achieved by using the following equation:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
where pos is the position and i is the dimension index. Each dimension of the positional encoding corresponds to a sinusoid [3].
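As a rough sketch of how this equation turns into a table of values, the function below fills the even dimensions with the sine term given above. The odd dimensions use the matching cosine term from the original transformer paper, which the text here does not spell out. The sizes in the example (32 positions, 8 dimensions) are illustrative assumptions only.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal positional encoding as num_positions rows of length d_model."""
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for k in range(0, d_model, 2):        # k plays the role of 2i
            angle = pos / (10000 ** (k / d_model))
            row[k] = math.sin(angle)          # even dimension 2i
            if k + 1 < d_model:
                row[k + 1] = math.cos(angle)  # odd dimension 2i + 1
        table.append(row)
    return table

# Example sizes are assumptions chosen for illustration.
pe = positional_encoding(num_positions=32, d_model=8)
```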
2.3 Variational Autoencoders
Variational autoencoders are self-supervised deep learning models that allow us to isolate important information, known as latent variables, via an encoder. The latent variables are then used to decode our input. The input is not encoded directly, but is encoded in a probabilistic distribution, from which the numerical representation is sampled [4]. One of the most important aspects of Variational Autoencoders is the latent space, which is the space consisting of all of the latent variables from our dataset. The data is compressed into a smaller dimensional latent space, where only the meaningful information extracted from the autoencoder is represented. This encoding of information into a latent space allows us not only to recreate our original input but also to generate new data.
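Sampling from the encoded distribution is commonly done with the reparameterization trick. The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension; that parameterization, the function name, and the example values are assumptions for illustration, not details taken from this thesis.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(42)
mu = [0.5, -1.0, 0.0]         # hypothetical encoder means
log_var = [-2.0, -2.0, -2.0]  # hypothetical encoder log-variances
z = sample_latent(mu, log_var, rng)
```

Because the randomness enters only through eps, the same input can decode to slightly different outputs on each draw, which is what makes the latent space useful for generating variations.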
2.3.1 Loss Criteria
For training the VAE, we use the reconstruction loss to reconstruct the input from the encoded space. This can be written formally as:
$$\mathcal{L}_{\text{recon}}(x, \hat{x}) = \text{difference}(x, \hat{x})$$
where we attempt to minimize the difference between the input x and its reconstruction x̂.
The reconstruction loss will differ depending on the type of data that we are using. For example, with tokenized language, we could calculate the cross-entropy loss between the predicted and original sequences [5]:
$$\mathcal{L}_{\text{recon}}(x, \hat{x}) = -\sum_{t} x_t \log(\hat{x}_t)$$
Because we are interested in generating new samples from our input data, we cannot rely only on the reconstruction loss. We need to use another regularization term to ensure that we can sample from anywhere in the latent space between our original data points. To achieve this, we use the Kullback-Leibler (KL) Divergence, where we ensure the latent space follows a standard normal distribution. The KL Divergence is formally defined as:
$$\mathcal{L}_{\text{KL}} = D_{\text{KL}}\left(\mathcal{N}(\mu(x), \sigma^2(x)) \,\|\, \mathcal{N}(0, 1)\right)$$
From here, our total loss can be defined as follows:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{KL}}$$
Having a latent space that is continuous, where nearby points yield similar samples when decoded, and complete, where all the points in the latent space should have meaningful content when decoded, is an important aspect of our implementation. This allows us to interpolate between two data points in our space.
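The two loss terms above can be sketched in a few lines. The KL term uses the standard closed form for a diagonal Gaussian against a standard normal, and the log-variance parameterization is an assumption for illustration. The function names are hypothetical and the actual training code may weight or compute these terms differently.

```python
import math

def reconstruction_loss(x, x_hat, eps=1e-9):
    """Cross-entropy -sum_t x_t * log(x_hat_t) between target and prediction."""
    return -sum(xt * math.log(xh + eps) for xt, xh in zip(x, x_hat))

def kl_divergence(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def total_loss(x, x_hat, mu, log_var):
    """Sum of the reconstruction and KL terms, as in the total loss above."""
    return reconstruction_loss(x, x_hat) + kl_divergence(mu, log_var)
```

The KL term is zero exactly when the encoded distribution already matches the standard normal (mean 0, log-variance 0) and grows as it drifts away.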
2.4 Transformer Based Models for Music Generation
The Music Transformer, developed by Huang et al. in 2019, incorporates a relative positional attention mechanism, which allows for the creation of songs up to one minute in length. They show that their new implementation outperforms LSTM-based models like Performance RNN in maintaining motifs, repetition, and structure over long spans. It is also capable of continuing an existing melody or harmonizing with a given melody [6].
The Pop Music Transformer by Huang et al. is an extension of the Music Transformer that is more capable of capturing the rhythmic structure of pop music [7]. The implementation focuses on enhancing the representation of input/output sequences. This was made possible by the use of REMI (REvamped MIDI-derived events) tokenization, explicitly encoding bars, beat positions, tempo changes, and chords alongside typical note events.

2.5 Non-Tokenized Transformer Inputs and Representation
In their paper preceding the implementation of GrooVAE, Gillick et al. suggest the use of an enhanced piano roll representation for drums [8]. This representation uses a binary piano roll relative to a fixed grid, setting onsets to '1' and rests to '0'. They also incorporate velocity and offsets, the micro-time deviations relative to the grid, into their representation. This allows for a simple representation of drums that can be used as input to their system. In his work on GrooveTransformer, Behzad Haki uses this HVO (hit, velocity, offset) representation for the input and output of the system.
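A minimal sketch of such an HVO matrix is given below. The column layout (all hits, then all velocities, then all offsets), the value ranges, and the function name are assumptions made for illustration; the HVO library used in the project may organize its data differently.

```python
def build_hvo(events, num_steps=32, num_voices=9):
    """Return a num_steps x (3 * num_voices) matrix laid out as
    [hits | velocities | offsets] (an assumed layout)."""
    hvo = [[0.0] * (3 * num_voices) for _ in range(num_steps)]
    for step, voice, velocity, offset in events:
        hvo[step][voice] = 1.0                      # hit: onset present on the grid
        hvo[step][num_voices + voice] = velocity    # velocity, assumed in [0, 1]
        hvo[step][2 * num_voices + voice] = offset  # offset, assumed in [-0.5, 0.5)
    return hvo

# Two hypothetical onsets: voice 0 on step 0, voice 1 slightly early on step 4.
events = [(0, 0, 0.9, 0.0), (4, 1, 0.6, -0.1)]
hvo = build_hvo(events)
```

With 32 steps and 9 voices this gives a 32 x 27 matrix, where rests remain zero in all three sections.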
2.6 Utilizing the Latent Space for Generative Purposes
Over the past decade, there have been numerous real-time models for use in music and sound generation. We will look at some of these that relate to our project and show how these models can help in our research.
MusicVAE is a recurrent variational autoencoder model developed for the creation of symbolic music of varying lengths. The authors note that the application of VAEs to sequential data has been limited, and that such models struggle to model sequences with long-term structure. In an attempt to solve this issue, they employ a hierarchical decoder, which outputs embeddings for subsequences and then uses the embeddings to generate new subsequences [9]. This extension of MusicVAE can generate loops by sampling points in the latent space. It also allows one to interpolate between two loops by moving between points in the latent space, which allows one to mix between the loops for more musical control. In his GrooveTransformer implementation, Behzad Haki utilizes this technique of interpolating between two loops in the latent space, and it is something we will include in our new implementation.
GLSR-VAE is another approach to loop generation using recurrent VAEs. It employs regularization techniques to improve the distributions of the embeddings in the latent space. This allows for a smoother transition between loops in the space, giving more control and similarity, while still allowing interesting variation [10]. An important application of this work to our own is the observation that a dimension in the latent space can be associated with the density of a generated sequence, and moving in one direction can increase the density of the sequence. This will be utilized when we implement our control features for our final system.
Generative Adversarial Networks (GANs) have also been utilized in the creation of musical loops. One example of this is MidiNet (Yang et al.), where a convolutional GAN is used to create musical sequences [11]. They propose a conditional mechanism to use available prior knowledge, allowing the model to generate melodies from scratch, following input via chord sequences, or by conditioning on the prior melodies.
Other approaches to creating rhythmic spaces like the one mentioned above include RhythmVAE by Tokui and R-VAE by Vigliensoni et al. [12] [13]. Both of these use VAEs to map drum patterns into 2-dimensional latent space encodings. While moving through this space, the user can generate random patterns, slightly modify patterns, and interpolate between various patterns to create interesting rhythmic sequences.
2.7 Rhythmic Features
For our implementation of the Triple Streams rhythm generating system, we wish to create various control features to allow the performer to have more control over the generated streams. Gómez-Marín et al. have recorded a general set of rhythmic features, describing various similarity metrics of 1-bar drum patterns, which have been collected and tested according to human ratings [14]. The features mentioned in the paper include density of onsets of the total rhythm, as well as the density of onsets in low-, mid-, and high-frequency bins. They also define a means of looking at the syncopation on both a global level and in the individual frequency bins. We can utilize this idea of looking at the features of individual frequency bins and apply it to our individual streams. From their work, we will implement features based on onset density, rhythm similarity, and syncopation. Furthermore, we can use the features described in their paper for validation of our generated rhythmic streams, allowing us to ensure that we have a diverse set of rhythms after decoding.
2.8 GrooveTransformer
The GrooveTransformer, a controllable drum accompaniment generation system, was developed by Behzad Haki, with input from Catalan musician Raül Refree. It is a real-time rhythm generation system that outputs drum rhythms based on a user's input, intended for use in live performance. Our new implementation will be heavily based on this work, and we look to build on it for our implementation of the Triple Streams system.
Now, we will give a brief overview of the system as a whole, the model architecture, data sources, and the representation of the streams.
2.8.1 Overview
The current implementation of GrooveTransformer was created by Behzad Haki for their doctoral research at Universitat Pompeu Fabra. The system is based on years of experience working in the area of rhythm generation and representation, and in its current form, it is a redesign based on the feedback from the musician Raül Refree [5]. The focus of the redesign is to allow for a balance between performance and composition, which is attained by addressing the dualism of autonomy and control of the generated rhythmic streams. Initially, the system could only be updated by changing the input groove or changing the sampling parameters. This was modified to give the performer more control over the system, by the inclusion of two additional predetermined patterns that allowed the user to interpolate between the two additional patterns and the input groove. The addition of a VAE into the model's architecture allowed for the interpolation between these three rhythms, enabling the performer to modify the output in real-time [5]. In addition, two new controls were included in the new system implementation. The first of these, five individual mute controls, allows for the muting of the instrument groupings "kick, snare, hats, toms, cymbals". These are enabled by learning an embedding with the same dimension as the latent space, which is added to the latent vector before decoding. The other control, a genre mute/selector, is created by including genre features in the training process: the genre information is added to the encoder and, similarly to the instrument mutes, sent to the decoder.
2.8.2 Architecture
The model architecture of the current implementation of the GrooveTransformer consists of a Transformer VAE, where a 2-bar rhythmic pattern is encoded into a latent distribution, and is decoded into a 9-voice drum pattern. The model consists of three parallel decoders, one each for hits (onsets), velocities, and offsets (microtimings).
2.8.3 Data
GrooveMIDI was chosen as the dataset for the GrooveTransformer system, as it has a large amount of multi-drum MIDI files that can be used for training the model. In addition, the MIDI files have an associated genre tag, allowing for the creation of genre filtering in the system. Due to the genre imbalance of the dataset, additional private data was used to balance the dataset [5].
2.8.4 Streams/Representation
GrooveTransformer uses an HVO matrix to represent the hits, velocities, and offsets of both the input and output rhythms. The matrices are of size T × 3M, where T is the number of time steps in the representation in sixteenth notes, and M is the number of instruments. In the final GrooveTransformer implementation, these HVO matrices are of size 32 × 27. We get 32 time steps from 2-bar rhythm sequences composed of sixteenth notes in 4/4 timing, and 27 from multiplying the 9 instruments by 3 for the hits, velocities, and offsets. For our TripleStreams implementation, we will use
Chapter 3. Design
Expanding on the ideas in the GrooveTransformer, we want to allow the user to move beyond the stored latent variables in a linear direction. This gives the user a simple way to explore the space, hopefully yielding interesting results while still maintaining some relation to the two rhythms.
Looking at figure 6, we can see the user's input rhythm encoded in the top node labeled Z_G, and the two saved data points labeled Z_A and Z_B. The diagram on the right of this figure shows how we can interpolate between the two data points, but also allow the user to move in a linear direction beyond the path between the two points.
Figure 6: Representation of the Latent Space for Real-Time Interpolation
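Moving along and beyond the line between two saved points can be sketched as a single linear expression. The function name and the parameter t are hypothetical, and how the input groove Z_G enters the calculation in the actual system is not covered by this sketch.

```python
def move_along_line(z_a, z_b, t):
    """Point on the line through z_a and z_b: t=0 gives z_a, t=1 gives z_b.
    Values of t outside [0, 1] move beyond the segment between the two points."""
    return [a + t * (b - a) for a, b in zip(z_a, z_b)]

z_a = [0.0, 0.0]  # hypothetical saved latent point Z_A
z_b = [2.0, 4.0]  # hypothetical saved latent point Z_B

midpoint = move_along_line(z_a, z_b, 0.5)  # interpolation
beyond_b = move_along_line(z_a, z_b, 1.5)  # extrapolation past Z_B
```

Because the direction of travel is fixed by the two saved points, results for t slightly outside [0, 1] stay related to both rhythms while leaving the path between them.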
3.2.3 Decoder
Our final subsection is the decoder. In this region, rhythms that have been encoded in the latent space are decoded into a flattened stream. We then expand this flattened stream into our three outputs based on the density values we have selected.
Density
After decoding the latent variable, we receive our flattened output of the three streams. To keep the design as simple as possible, we decided to include only one set of controls on this end. We designate these controls as 'Density', which controls the number of onsets in each expanded stream. The number of onsets in each stream is bounded by the total number of hits in the flattened output. If the flattened output has 12 onsets, the maximum number of onsets each unflattened stream can have is 12. This allows the user to quickly modify the individual streams according to their needs during a performance.
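To calculate the density of each stream, we employ the Jaccard Similarity between each individual expanded stream and the original flattened output. We define this formally as
$$J(\text{Stream}_i, \text{Flattened}) = \frac{|\text{Stream}_i \cap \text{Flattened}|}{|\text{Stream}_i \cup \text{Flattened}|}$$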
Looking at figure 7, we can see the expansion of the flattened output into the three streams based on the density feature control values. Setting our density feature control to 70% for the top stream allows us to have 7 of the 10 onsets found in the flattened output. Both velocity and offset values will be retained from the flattened stream.
Figure 7: Expanding the output into the three streams based on the Density Feature Control Values
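The relationship between the density control and the Jaccard Similarity can be checked with a small sketch. When a stream only contains onsets taken from the flattened output, its Jaccard Similarity with that output equals the fraction of onsets kept. The rule used here to decide which onsets survive (loudest first) is purely an assumption for illustration, as are the function names and example values.

```python
def jaccard(stream_steps, flattened_steps):
    """Jaccard Similarity between two collections of onset steps."""
    a, b = set(stream_steps), set(flattened_steps)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def expand_stream(flattened, density):
    """Keep round(density * N) onsets of the flattened output.
    flattened maps grid step -> (velocity, offset)."""
    keep = round(density * len(flattened))
    # Assumed selection rule: keep the loudest onsets first.
    ranked = sorted(flattened, key=lambda step: flattened[step][0], reverse=True)
    # Velocity and offset values are carried over unchanged.
    return {step: flattened[step] for step in sorted(ranked[:keep])}

# Hypothetical flattened output with 10 onsets.
flattened = {0: (0.90, 0.0), 2: (0.40, 0.05), 4: (0.80, -0.02), 6: (0.30, 0.0),
             8: (0.95, 0.01), 10: (0.50, 0.0), 12: (0.85, -0.03), 14: (0.20, 0.0),
             16: (0.70, 0.02), 18: (0.60, 0.0)}
top_stream = expand_stream(flattened, density=0.7)
similarity = jaccard(top_stream, flattened)
```

With a density of 70%, the expanded stream keeps 7 of the 10 onsets, so the intersection has 7 elements and the union has 10.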
Chapter 4
Datasets
A central focus of this research is exploring different ways to extract rhythmic information from diverse sources, moving beyond a drum-centric perspective in which each generated stream corresponds to a specific drum. Instead, we investigate more abstract methods of constructing rhythmic streams. This includes drawing on datasets that are not strictly percussive as well as developing strategies for partitioning and recombining data to reconstruct rhythms.
During the data curation process, we applied several criteria. First, the dataset needed to carry a suitable license; those under the Creative Commons Attribution 4.0 International (CC BY 4.0) license were considered acceptable. Second, because our method requires accurate tempo and time signature information to compute offset values from MIDI files, only datasets that retained this metadata were usable. This restriction excluded some otherwise valuable collections. Third, we prioritized datasets with a broad range of velocities and offsets in order to provide richer training data for the model. Based on these criteria, we selected a set of annotated Candombe recordings, the LAKH MIDI dataset, GrooveMIDI, El Bongosero, and TapTamDrum.
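To illustrate why the tempo metadata matters, the sketch below derives a grid step and an offset from an onset time. It assumes a constant tempo, a sixteenth-note grid (four steps per beat), and offsets expressed as a fraction of a step in [-0.5, 0.5). These conventions and the function name are assumptions, not the project's actual preprocessing code.

```python
def quantize_onset(onset_seconds, tempo_bpm, steps_per_beat=4):
    """Return (grid_step, offset) for an onset time in seconds."""
    step_duration = 60.0 / tempo_bpm / steps_per_beat  # seconds per grid step
    position = onset_seconds / step_duration           # position in grid steps
    grid_step = int(position + 0.5)                    # nearest step (non-negative times)
    offset = position - grid_step                      # negative = early, positive = late
    return grid_step, offset

# At 120 BPM a sixteenth note lasts 0.125 s, so an onset at 0.26 s sits just after step 2.
late_step, late_offset = quantize_onset(0.26, tempo_bpm=120.0)
early_step, early_offset = quantize_onset(0.24, tempo_bpm=120.0)
```

If the stored tempo were wrong, step_duration would be wrong as well, and every offset computed this way would drift away from the performed micro-timing.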
The following sections detail our rationale for choosing these datasets, the specific processing steps used to reduce them into four streams, and the challenges encountered. We then describe the conversion of the curated material into the HVO representation used for training, along with the construction of a new dataset created specifically for this project.
All data manipulation was implemented in Python and is available in the project's GitHub repository. For MIDI-based datasets, we relied extensively on PrettyMIDI for reading, analyzing, modifying, and writing MIDI files [15]. All datasets used fall under the Creative Commons Attribution 4.0 International license, making them suitable for our purposes.
4.1 Annotated Candombe Recordings
The dataset of Candombe recordings was created by Luis Jure, Martín Rocamora, Simone Tarsitani, and Martin Clayton. It consists of multiple live recordings of Candombe artists recorded in a studio in Uruguay in 2018. Specifically, the recordings feature one player on the piano drum, one on the chico, and up to two on repique [16]. The dataset also contains annotations of onsets and velocities. Offset values were not included, but were calculated by Satyajeet Prabhu and Anmol Mishra. The offset values are only available for the tracks with one repique, so we omit the recordings with two repique players.
In the tradition of Candombe music, each drum has a specific role in the performance. The piano, which is the largest of the drums, has the lowest frequency and provides a melodic foundation for the other drums. The repique is the second largest of the three drums and is used to add variation and improvisation to the performance. The 'llamada', or call and response, is performed on these drums and is central to Candombe. Finally, the chico is the smallest of the three and is used to provide the stable rhythm of the performance. The Candombe rhythm comes from the interaction of these three drums and their functions, creating a very complex rhythmic structure.
Though we have stated we wish to move away from mapping one drum to one output stream, we found it was best to have each drum represent one stream, our fourth stream being a flattened combination of the three drums. This allows for the complex rhythmic structure to be maintained in our model training.
The data preprocessing code for our Candombe dataset can be found in the 'candombe.py' file in our repository.
4.2 LAKH
The LAKH dataset consists of 176,581 unique MIDI files, including 45,129 files that have been matched to songs in the Million Song Dataset [17]. For our purposes, the LAKH dataset offers an assortment of instruments that we can use for creating various rhythmic streams, allowing different instruments to fill the role of the 'Groove', e.g. using bass guitar as the main groove, and using the piano, electric guitar, and drum instruments as the related streams.
Due to the size of the LAKH dataset, we need to be precise with how we choose our combinations of instruments. First, to condense the large number of instruments into their families, like guitar, electric guitar, lead guitar into 'Guitar', we match on all the associated MIDI program numbers and merge these into one instrument with the name we will be using. We then look at the potential instruments to use and choose those that we assume to be better suited to our task. For example, a combination like "Piano", "Guitar", "Drums", and "Bass" seems like a classic example that would have a good underlying rhythmic structure that we can utilize. However, certain things like "Strings" or "Synth Pads" tend to have long sustained notes and may not have the most interesting onset and micro-timing information for our research, so these are excluded. Our original subset of instruments to use consists of "Piano", "Percussion", "Guitar", "Bass", "Brass", "Ethnic", "Percussive", "Sound Effects", "Drums", and "Synth Effects". To find which combinations of these instruments are most represented in our dataset, we create a small subset of 4,200 MIDI files, plot a histogram of the top ten combinations of four of the instruments we have selected, and choose these for creating our rhythmic streams. In figure 8 we can see that "Bass", "Drums", "Guitar", "Piano", "Brass", "Percussion", and "Percussive" have the highest numbers, so we will use these for our rhythmic stream creation.
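The family merging and combination counting described above can be sketched as follows. General MIDI groups its 128 programs into sixteen families of eight, which the lookup relies on. Treating the thesis's "Percussion" family as General MIDI's Chromatic Percussion block is an assumption, as are the function names and the example tracks. The (program, is_drum) pairs mirror the attributes PrettyMIDI exposes on each instrument.

```python
from collections import Counter
from itertools import combinations

# General MIDI groups programs in blocks of eight (0-indexed program numbers).
# "Percussion" stands in for GM's Chromatic Percussion block (an assumption).
GM_FAMILIES = [
    "Piano", "Percussion", "Organ", "Guitar", "Bass", "Strings", "Ensemble",
    "Brass", "Reed", "Pipe", "Synth Lead", "Synth Pads", "Synth Effects",
    "Ethnic", "Percussive", "Sound Effects",
]
SELECTED = {"Piano", "Percussion", "Guitar", "Bass", "Brass", "Ethnic",
            "Percussive", "Sound Effects", "Drums", "Synth Effects"}

def family_of(program, is_drum):
    """Family name for one track; drum tracks are flagged separately in MIDI."""
    return "Drums" if is_drum else GM_FAMILIES[program // 8]

def count_combinations(files, size=4):
    """Count how many files contain each combination of `size` selected families."""
    counts = Counter()
    for tracks in files:
        families = sorted({family_of(p, d) for p, d in tracks} & SELECTED)
        counts.update(combinations(families, size))
    return counts

# Two hypothetical files, each a list of (program, is_drum) pairs.
files = [
    [(0, False), (33, False), (25, False), (0, True)],
    [(4, False), (34, False), (27, False), (0, True), (56, False)],
]
top_combination = count_combinations(files).most_common(1)[0]
```

Sorting the family names before forming combinations keeps each combination in a canonical order, so the same set of instruments is always counted under one key.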
Figure 8: Top 10 instrument combinations from LAKH dataset.
Because the LAKH MIDI dataset consists of entire songs, it was decided to only use a portion of each MIDI file. There are many 2-bar loops that are repeated and these become redundant in our training. To accomplish this, we sort the HVO sequences of each MIDI file by the number of total hits. We then select the second half of the sorted list and take eight evenly spread HVO sequences. This will also help reduce instances where there are little to no onsets in the HVO sequences.
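One way to read the selection step above is sketched below: sort by hit count, keep the denser half, and pick eight entries at evenly spaced positions. The exact spacing scheme and the function name are assumptions; the project's own selection script may differ.

```python
def select_eight(sequences, count=8):
    """sequences: list of (sequence_id, total_hits) pairs for one MIDI file."""
    ranked = sorted(sequences, key=lambda item: item[1])  # ascending by total hits
    denser_half = ranked[len(ranked) // 2:]               # second half of the sorted list
    if len(denser_half) <= count:
        return denser_half
    # Evenly spaced positions from the first to the last entry of the denser half.
    step = (len(denser_half) - 1) / (count - 1)
    return [denser_half[round(i * step)] for i in range(count)]

# Hypothetical file with 40 two-bar sequences whose hit counts equal their ids.
selected = select_eight([(i, i) for i in range(40)])
```

Spreading the picks across the denser half avoids both near-empty sequences and long runs of identical repeated loops.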
The code for our LAKH MIDI preprocessing, as well as the code for selecting the eight HVO sequences, can be found in lakh_midi.py and lmd_best_eight.py.
4.3 GrooveMIDI
The GrooveMIDI dataset contains more than 13.6 hours of aligned MIDI of human-performed drumming and offers us a great deal of material to use for training [18]. Behzad Haki's GrooveTransformer system was trained on this dataset to create its output drum streams. For our purposes, we are looking for other ways we can merge these various drum tracks into three or four streams. In "Drum Rhythm Spaces" (Gómez-Marín et al. 2020), it is noted that humans perceive the roles of various drums in three frequency bands, i.e., low, mid, and high. We use this as a way to separate the MIDI recordings into three separate streams, using a flattened version of the three as the groove [19]. Our hope is that this will maintain some of the 'functions' of the drums, e.g., the kick drum as a pulse in western music, the snare drum to reinforce the meter in rock, etc. The three partitions we create include the low partition, including kick drum and low tom, the mid partition, including snare drums, mid toms, and high toms, and the high partition, including hi-hats, ride cymbals, and crash cymbals.
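The frequency-band partition can be sketched as a lookup from drum pitch to band. The MIDI pitch numbers below follow a common reduced nine-voice drum mapping and are an assumption, as are the names; the project's preprocessing may map additional pitches onto these voices.

```python
# Assumed MIDI pitches for the drum voices in each band.
BANDS = {
    "low": {36, 43},           # kick drum, low tom
    "mid": {38, 47, 50},       # snare, mid tom, high tom
    "high": {42, 46, 51, 49},  # closed hi-hat, open hi-hat, ride, crash
}

def partition_by_frequency(notes):
    """notes: list of (grid_step, pitch). Returns onset steps per band,
    plus a 'groove' stream that flattens the three bands together."""
    streams = {band: set() for band in BANDS}
    for step, pitch in notes:
        for band, pitches in BANDS.items():
            if pitch in pitches:
                streams[band].add(step)
    streams["groove"] = streams["low"] | streams["mid"] | streams["high"]
    return streams

# Hypothetical one-bar fragment.
notes = [(0, 36), (4, 38), (4, 42), (8, 36), (10, 46), (12, 50)]
streams = partition_by_frequency(notes)
```

The groove stream is the union of the three bands, so every onset in a band also appears in the groove.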
For our second grouping, we partition the recordings into velocity bins. Loudness and its MIDI representation, velocity, can be used to emphasize certain portions of the rhythm. These emphasized moments could be accents to reinforce the meter, or interesting events in a musical piece. Our intention was to partition these emphasized portions, where one stream would consist of onsets with high velocity that would stand out against the other sounds, one stream would be very subtle and could be used for minor changes, and one would fall somewhere in between. From a performance perspective, these could be very useful when mapped to parameters such as synth timbre and envelopes, or to control effects like delay.
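A velocity-based partition could look like the sketch below. The two thresholds are hypothetical values chosen only for illustration, since the bin boundaries are not given here, and the stream names are likewise assumptions.

```python
def partition_by_velocity(notes, soft_max=50, loud_min=100):
    """notes: list of (grid_step, velocity) with MIDI velocities 0-127.
    Thresholds are hypothetical, not the values used in the project."""
    streams = {"soft": [], "medium": [], "loud": []}
    for step, velocity in notes:
        if velocity >= loud_min:
            streams["loud"].append(step)    # accents that stand out
        elif velocity <= soft_max:
            streams["soft"].append(step)    # subtle onsets
        else:
            streams["medium"].append(step)  # everything in between
    return streams

# Hypothetical onsets with their velocities.
notes = [(0, 120), (2, 30), (4, 80), (6, 110), (8, 45)]
velocity_streams = partition_by_velocity(notes)
```

Each onset lands in exactly one of the three streams, so flattening them again recovers the original rhythm.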
The third grouping that we use is rooted in the traditions of western rock, pop, and electronic music. We focus on four instrument families, kick drum, snare, toms, and hi-hats, and create new MIDI files based on these groupings. This allows us to abstract the roles of the instruments slightly, not mapping each individual drum to a stream, but focusing on how they are used in these styles of music.
Our last two groupings were chosen as potentially interesting groupings that would also yield more sparse streams. One of these is a collection of the cymbal instruments, where one stream is the open hi-hat, one is the closed hi-hat, one is the crash cymbal, and the last is the ride cymbal. Certain styles of music, such as jazz, have a tendency to incorporate cymbals into the rhythm, and we assume this could be an interesting grouping. In a similar fashion, we make a grouping out of the three toms and the ride cymbal.
GrooveMIDI is a large dataset, and these five groupings allow us to diversify the HVO sequences that we create from it. After processing our dataset, we have 82,687 2-bar rhythms split into five different groups. We can see in figure 9 the velocity, functional, and pitch (frequency) groupings, each representing roughly a quarter of the dataset, which is expected. The other two have less representation because they are drum specific, and certain recordings may not have all the individual drums to create the grouping.
Figure 9: Percentages of each GrooveMIDI grouping of total dataset.
The code for our preprocessing of the GrooveMIDI dataset can be found in the file named groove_midi.py.
4.4 El Bongosero
El Bongosero is a large-scale symbolic dataset created by Behzad Haki et al. consisting of 6,035 crowd-sourced improvised drum performances by 3,184 participants of varying levels of experience [20]. The participants were asked to select a genre of a backing track from the GrooveMIDI dataset and improvise on a set of digital bongos over a 2-bar loop. Once finished, they could listen to their recording and overdub additional hits. The participants were then asked to specify their level of experience out of a maximum of 5 points, as well as rate their performance out of 5 points.
For our dataset, we choose to use the events from the left hand, right hand, a combination of both left and right hands, and the flattened GrooveMIDI track. The inclusion of the 'both hands' event allows us to extract interesting rhythms that are directly related to the two other rhythmic streams by sharing overlapping onsets, and can aid in the training of the model in regards to Rhythmic Similarity, Accent Similarity, and the Density features. To better curate our dataset, we include only sessions with an expert ranking of 4 or higher and a user rating of 3 or higher.
With our restrictions in place, the El Bongosero dataset is reduced to 2243 total recorded sessions. Of these remaining recorded sessions, 1917 contain events where the left hand plays, 1863 where the right hand plays, and 1221 where both hands play simultaneously. This can be seen in figure 10. Given the number of hits of each of our selections, we feel that our choice of events from the El Bongosero dataset is justified and will add interesting diversity to our final dataset for model training.
Figure 10: Recorded Session Counts for Left Hand, Right Hand, Both Hands, and total sessions of our filtered El Bongosero dataset.
The preprocessing code for our El Bongosero dataset can be found in the file bongosero.py.
4.5 TapTamDrum
TapTamDrum is another symbolic dataset created by Behzad Haki et al. where four expert drummers are tasked with improvising on two drum pads over 2-bar loops chosen from GrooveMIDI [21]. The sessions last approximately one hour, and each expert drummer plays a variety of genres. After each session, the performers are asked to rate their session out of five points.
We follow a method similar to the one used with the El Bongosero dataset, using the groupings left hand, right hand, both hands, and the flattened GrooveMIDI track. To add variety to our training data, we create two datasets from TapTamDrum, allowing the ’both hands’ category to be when both hands play simultaneously (Intersection) or when at least one hand is playing (Union). Additionally, we restrict our dataset to entries that have a user rating of 4 or higher. After restricting our dataset, we have 1116 sequences in each dataset, totaling 2232 TapTamDrum HVO sequences we can use for training.
Our code for the TapTamDrum dataset can be found in the file tap_tam_drum.py.
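The two ’both hands’ variants can be illustrated on binary onset grids. The sketch below assumes each hand has been reduced to a per-step 0/1 hit vector of equal length; the function name and the example patterns are hypothetical and are not taken from tap_tam_drum.py.

```python
def both_hands(left, right, mode):
    """Combine two binary onset vectors into a 'both hands' stream.

    mode="intersection": a hit only where both hands play on the same step.
    mode="union": a hit wherever at least one hand plays.
    """
    if len(left) != len(right):
        raise ValueError("left and right must have the same number of steps")
    if mode == "intersection":
        return [int(bool(l) and bool(r)) for l, r in zip(left, right)]
    if mode == "union":
        return [int(bool(l) or bool(r)) for l, r in zip(left, right)]
    raise ValueError("mode must be 'intersection' or 'union'")


# Illustrative 8-step patterns (not real data).
left_hand = [1, 0, 1, 0, 0, 1, 0, 0]
right_hand = [1, 1, 0, 0, 0, 1, 0, 1]

intersection = both_hands(left_hand, right_hand, "intersection")
union = both_hands(left_hand, right_hand, "union")
print(intersection)
print(union)
```

The Intersection stream is always a subset of either hand, while the Union stream is a superset of both, which is why the two variants yield differently dense training examples from the same recording.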
4.6 Dataset Pre-processing
The datasets described above are represented in various ways. LAKH and GrooveMIDI are both MIDI datasets, and require the additional usage of PrettyMIDI to open, merge the instruments, and rewrite to MIDI files. The Candombe annotations were stored in CSV files, and only required the use of Pandas for extracting the data and saving into HVO representations. Because they included the onsets, velocities, and offsets, we only needed to transfer this information to new HVO objects. Both El Bongosero and TapTamDrum were already in HVO representation, only requiring the transfer of the hit, velocity, and offset values into new HVO objects.
For the data preprocessing task, the MIDI datasets, LAKH and GrooveMIDI, took the longest to create. First, we had to group the various instruments into their larger family, such as electric piano and acoustic piano into ’Piano’. After we designated the groupings, we saved each individual instrument family for each file, to be loaded into HVO sequences via the HVO MIDI loader. It was observed that additional notes were written to the newly created MIDI files, which was caused by very short durations in the original MIDI file that were smaller than the time intervals supported by PrettyMIDI. To overcome this, we wrote a small method to remove any
34 Chapter 5. Model
the total velocities changed. This implies that the model struggles to create streams with changing accents of values between the maximum and minimum. Looking at figure 16 we see that our Accent Similarity feature is not working as well as we had expected. We would expect this to have a similar shape as the graph from the Rhythmic Similarity, growing monotonically. For the original model, there is some growth, but the rate of change is very slow until it reaches the feature values of 8 and 9, where it increases much more rapidly. In the flex model, we see no change in our Hamming Distance and we can assume that reducing the number of tokens is not a valid way to improve our model. The variance also increases between values of [−7,7]. This shows us that our current implementation is not a reliable control in its current state.
(a) Hamming Distance of various Accent Similarity values using original model
(b) Hamming Distance of various Accent Similarity values using flex model
Figure 16: Hamming Distance of various Accent Similarity values for original and flex models
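For readers unfamiliar with the metric plotted in figure 16, a Hamming distance counts the positions at which two equal-length sequences differ. The sketch below shows it on binary onset vectors, together with a hypothetical `flatten` helper that merges several streams into one; which exact vectors the thesis compares for the Accent Similarity is defined outside this section, so the inputs here are illustrative assumptions.

```python
def hamming_distance(a, b):
    """Count the steps at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if bool(x) != bool(y))


def flatten(streams):
    """Merge several binary streams: a step is 1 if any stream has a hit."""
    return [int(any(step)) for step in zip(*streams)]


# Illustrative 8-step example (not real data).
input_groove = [1, 0, 0, 0, 1, 0, 0, 0]
stream_1 = [1, 0, 0, 0, 0, 0, 1, 0]
stream_2 = [0, 0, 1, 0, 0, 0, 0, 0]
stream_3 = [0, 0, 0, 0, 1, 0, 0, 0]

flattened = flatten([stream_1, stream_2, stream_3])
distance = hamming_distance(input_groove, flattened)
print(flattened, distance)
```

Here the flattened output adds hits on two steps that are silent in the input and keeps both input onsets, so the distance is 2.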
Another issue can be observed in figure 17. When looking at stream 1 and stream 2 of the maximum distance value, we can see there are many values that are less than those of stream 3 (the darker shading represents higher velocities). This implies that there is a relationship between the density values and the Accent Similarity, and we shall explore these unintended relationships later in this section.

5.3. Validation 35
(a) Accent Similarity Minimum (b) Accent Similarity Half (c) Accent Similarity Max
Figure 17: Output Streams of Three Accent Similarity values
5.3.1 Density
In Section 3.2.3, we explain how the density feature is calculated for each of the output streams. Our intent was to have a control that could dictate the number of onsets in each stream, relative to the flattened output. For many of the outputs, our expectations are met by the number of onsets in each stream. We can see in figure 18 that setting stream 1 to two, stream 2 to five, and stream 3 to nine gives us a stream with very few onsets, and the others add more based upon the values specified. The same holds for the image on the right. Stream 1 and stream 3 are set to the maximum value and stream 2 is set to one. Both images are using a Rhythmic Similarity distance set to zero, so should mirror the groove displayed on the bottom.
(a) Density Values set to 2, 5, and 9 out of 9.
(b) Density Values set to 9, 1, and 9 out of 9.
Figure 18: Various Density Values of Output Streams
However, when we decide to maximize all the values, we can see there are some issues in the decoding. The pattern on the left of figure 19 has a density value of seven for stream 1, nine for stream 2, and nine for stream 3, out of a maximum of nine. All other controls are set to zero. The image on the right has all density values of the streams set to nine, the maximum value. We would assume that setting each stream’s density value to the maximum in this scenario would produce streams containing all the onsets of the input groove, and each stream would be the same as the others. However, both stream 1 and stream 3 have fewer onsets than stream 2, as well as the input. This may be an issue with training, or with how we calculate the onsets on the decoding side, and more investigation is needed.
(a) Density Values set to 7, 9, and 9 out of 9.
(b) Density Values set to 9, 9, and 9 out of 9.
Figure 19: Issues with Densities set to Maximum Values
5.3.2 Output Stream Quality
To judge the general quality of the output streams, we will look at three criteria. The first is that the onsets create an interesting pattern. By interesting, we mean a pattern that has rhythmic qualities, but has variations in the positions of the onsets. An example of uninteresting results would be having many patterns generated for various grooves, but all the generated onsets appear on the down beat, or where the same pattern is repeated each measure.
Our second criterion is to look at the velocities of the generated streams. We expect to see variation in the velocities, accenting certain onsets, and giving the rhythm multidimensionality.
For our third criterion, we are concerned with the offset values of the generated onsets. Variation here shows that our model is capable of recreating these offsets, and can output rhythms that consist of triplets that aren’t aligned to the grid.
In figure 20 we have four different piano rolls displaying the output of four different input grooves. The values of the feature controls have been randomly selected. The piano rolls of figures a, b, and c all show interesting variation of the onsets. We have instances where little to no repetition happens in the 2-bar loop, such as for figures a and b. In figure c, we see the pattern repeats itself for each bar, but this may be an issue with the input, which also includes three repeating notes for each bar. In figure d, we see the least amount of diversity. The second bar is a copy of the first for each of the three streams. Furthermore, Stream 2 contains onsets on all but one position in each bar. Our assumption is the input groove’s repetitive pattern, and a high density value for stream 2, is what causes this lack of diversity.
Looking at the velocity values of these four examples creates some concern. For each of the streams in each of the patterns, velocity values do not exceed a value of 0.7. The input grooves all contain various velocities between minimum and maximum values, so we would expect to see more varying values for our output streams. This has been a common occurrence in all of our attempts, and not specific to these four examples. Changing the Accent Similarity value will cause velocity values to be raised, but this differs from our expectations. The Accent Similarity should shift the velocity values. If we have a large difference, onsets that have a large velocity in the input should have smaller velocities in the output streams, and onsets that have small velocities in the input stream should have large velocity values in the output streams.
(a) (b)
(c) (d)
Figure 20: Four Examples of Generated Streams of Four Different Inputs
Finally, when we look at the offset values of our outputs, we see much less variation than we had hoped for. The only example that has offset values less than or greater than 0 is figure b. The other three examples only contain onsets snapped to the grid. Each of the input grooves has varying levels of offsets, so we cannot say that this is an issue with the inputs. Furthermore, after many attempts, we still find that most of our outputs contain little to no variation in the offset values. Our assumption is that this is caused by a lack of offset diversity in the LAKH MIDI training set, which is something we will discuss later in this paper.
5.3.3 Relationships between Feature Values
One of the unintended relationships we have noticed in our system is between the Rhythmic Similarity control and the Accent Similarity control. In figure 21 we look at how changing the Rhythmic Similarity distance affects the Accent Similarity and vice versa.
(a) Original Model - Rhythmic Similarity held constant while increasing Accent Similarity
(b) Flex Model - Rhythmic Similarity held constant while increasing Accent Similarity
Figure 21: Issues with Densities set to Maximum Values
If we set the Accent Similarity distance to the maximum value, and our Rhythmic Similarity distance to something low, we see we have onsets in the same positions as the input groove, and velocity values that are higher than the lower velocities in the input. However, as we increase the Rhythmic Similarity distance we notice the velocity values begin to decrease in our output. This can be seen in figure 22. The image on the left shows our output with a maximum distance for the Accent Similarity and a distance of 1/3 the maximum for Rhythmic Similarity. Our velocity values have a maximum of 0.754 and a minimum of 0.180. On the right we can see what happens when we raise our Rhythmic Similarity to the maximum value. Here our velocities have a maximum value of 0.528 and a minimum value of 0.088. We would expect to maintain the same level of velocity for our output, and to have high velocity when the same position in the input either contains an onset with a low velocity or no onset at all. This issue may be due to the similar ways we calculate these two features and we shall discuss this later in the paper.
(a) Density Values set to 2, 5, and 9 out of 9.
(b) Density Values set to 9, 1, and 9 out of 9.
Figure 22: Velocity values decreasing as we raise the Rhythmic Similarity Distance
A second unexpected relationship was found between the Rhythmic Similarity feature and the Density features of each of the output streams. In figure 23 we see three different scenarios when we have a very active input with many onsets.
(a) Rhythmic Similarity distance value of 8 out of 32
(b) Rhythmic Similarity distance value of 19 out of 32
(c) Rhythmic Similarity distance value of 32 out of 32
Figure 23: Relationship between Rhythmic Similarity and Density for active input
When the Rhythmic Similarity distance value is low, we allow our output onsets to happen at the same position as the input onsets. If we set our density values to large amounts, such as the maximum, we see very active outputs. This is expected behavior. More density implies more onsets. However, as we increase the amount of distance for our Rhythmic Similarity feature, we see the onsets in the output streams decreasing. Knowing how the features are calculated, this makes perfect sense. The Rhythmic Similarity measures the distance between the input and the flattened output. A maximum distance between the two would only allow onsets in our output to occur at locations where there are no onsets in the input. For an input containing many onsets, an output at the maximum Rhythmic Similarity distance would contain very few onsets. Because of this, the density of each of the streams would be bounded by the number of onsets in the flattened output stream. If there are only four locations for our onsets to occur, a maximum density value would allow four onsets in that stream. However, this can be very confusing from a user’s perspective. In most situations, increasing the density will increase the number of onsets in the output, and it would be safe to assume that many users will have this assumption.
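The bound described above can be made concrete with a small sketch. It assumes the Rhythmic Similarity distance is a Hamming distance over a 32-step binary onset grid (consistent with the "out of 32" values in figure 23, but an assumption nonetheless); the function name and the example input are hypothetical.

```python
def farthest_flattened_output(input_onsets):
    """Return the binary pattern at maximum Hamming distance from the input.

    Every step must differ from the input, so the output has an onset
    exactly where the input is silent.
    """
    return [1 - int(bool(x)) for x in input_onsets]


# A very active 32-step input: onsets everywhere except four steps.
input_onsets = [1] * 32
for silent_step in (3, 11, 19, 27):
    input_onsets[silent_step] = 0

output = farthest_flattened_output(input_onsets)
distance = sum(1 for a, b in zip(input_onsets, output) if a != b)

# Onsets available to the flattened output, and hence an upper bound on the
# onsets any single stream can hold, whatever its density setting.
onset_budget = sum(output)
print(distance, onset_budget)
```

With 28 of 32 steps occupied in the input, only four steps remain for the output at maximum distance, so even a maximum density setting cannot place more than four onsets in a stream.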
Chapter 6
Discussion
6.1 Data
In our discussion of the model we have seen that our system does not quite meet our expectations. Specifically, we have noticed issues around the rhythmic structure of our outputs being too repetitive, unintended relationships between our features for controlling the rhythm outputs, and a lack of variation in our training data. In this section, we will look at the specific problems of our current model and how we can improve it in the future.
One of the issues we raised is the lack of variation in our output rhythms. As we have seen, we often have outputs that do not contain much diversity of velocity values or of offset values. We have also noticed a tendency to create outputs that have repetitive patterns in the sequence and lack variation in the placement of the onsets. One of the biggest factors creating this issue is the datasets we have used, specifically the LAKH MIDI dataset. Although the dataset allowed us interesting groupings from the multiple instruments used, the patterns in the MIDI files are lacking in the variation of velocity and offsets that we were hoping to find in our output.
In figure 24 we have three examples of the LAKH MIDI training data.
(a) Example 1 of LAKH MIDI Training Data
(b) Example 2 of LAKH MIDI Training Data
(c) Example 3 of LAKH MIDI Training Data
Figure 24: Three Examples of LAKH MIDI Training Data
We can see there is very little variation in the velocity values, offset values, and the onset positions. Comparing these piano rolls with those generated from our other datasets (figure 25), we can find much more diversity in our hit, velocity, and offset values.
(a) Example of Candombe Training Data
(b) Example of El Bongosero Training Data
(c) Example of TapTamDrum Training Data
Figure 25: Training Data from Candombe, El Bongosero, and TapTamDrum Datasets
One of the issues with the LAKH dataset is that it contains full songs. When we split these full songs into 2-bar HVO sequences, we find that many of the sequences do not have the qualities we have mentioned, such as diversity in the velocities or offsets. Initially, we filtered the full songs from the original LAKH MIDI dataset by the number of velocity changes, but never did this for the 2-bar split HVO sequences. To rectify this, we can remove 2-bar HVO sequences that do not have a certain number of velocity or offset changes. In figure 26 we can see the number of velocity changes for a small subset of our LAKH MIDI 2-bar splits used for training. Out of 267713 HVO sequences, 174496 have only one velocity value. This will have a huge impact on training our model. However, if we restrict the HVO sequences to ones with six or more velocity changes, we still have 17,002 remaining sequences. Our full LAKH MIDI dataset contains sixty partitions of the 2-bar HVO sequences, which would provide more than enough data to meet our criteria to train our model.
Figure 26: Velocity Changes in 2-bar splits of a Subset of LAKH MIDI
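The proposed filter on 2-bar sequences could look roughly like the sketch below. It assumes, since the text does not define the term precisely, that the number of "velocity changes" is the count of distinct velocity values among the onsets of a sequence; the function names, the rounding precision, and the example sequences are all hypothetical, and only the threshold of six comes from the text.

```python
def count_distinct_velocities(velocities, decimals=3):
    """Count distinct non-zero velocity values in one 2-bar sequence.

    Zero entries are treated as steps without an onset. Values are rounded
    so tiny floating-point differences are not counted as distinct.
    """
    return len({round(v, decimals) for v in velocities if v > 0})


def filter_sequences(sequences, min_distinct=6):
    """Keep only sequences with at least `min_distinct` velocity values."""
    return [
        seq for seq in sequences
        if count_distinct_velocities(seq) >= min_distinct
    ]


# Illustrative per-step velocities (not real data).
static_sequence = [0.8, 0.0, 0.8, 0.0, 0.8, 0.0, 0.8, 0.0]
varied_sequence = [0.9, 0.2, 0.5, 0.0, 0.7, 0.3, 0.6, 0.4]

kept = filter_sequences([static_sequence, varied_sequence])
print(count_distinct_velocities(static_sequence))
print(count_distinct_velocities(varied_sequence))
print(len(kept))
```

Under this definition the first sequence has a single velocity value, like the 174496 sequences mentioned above, and is dropped, while the second has seven and is kept.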
6.2 Features
Another area of concern that we have mentioned is the relation between the control features of our system. Specifically, the relation between the Rhythmic Similarity and the Accent Similarity controls and the relation between the Rhythmic Similarity and the Density controls.
Our intention with the Accent Similarity control was to allow the user to inject a feeling of syncopation into the output, in relation to the input groove. However, we discovered that it is difficult to find a reasonable way to calculate and implement syncopation to be used as a control in our system. Furthermore, the similarities between the way we calculate the Rhythmic Similarity and Accent Similarity controls have made it very difficult to isolate how they impact the output of the system. For future work, we will remove the Accent Similarity, and focus on ensuring the Rhythmic Similarity works exactly as we intend. Once we are certain that there are no issues with our feature on the encoding side, we can begin to add additional features and be more aware of their impact on the output.
Another realization is the need for the generation of the output to be invariant to the feature controls. The controls should affect how the onsets are distributed and arranged, but not the number of onsets in the flattened output. Moving the Rhyth-
Bibliography
[1] Jiang, H. H. et al. AI art and its impact on artists. In Rossi, F., Das, S., Davis, J., Firth-Butterfield, K. & John, A. (eds.) Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2023, Montréal, QC, Canada, August 8-10, 2023, 363–374 (ACM, 2023). URL https://doi.org/10.1145/3600211.3604681.
[2] Nugroho, Y. Y. T. & Manggala, P. P. M. D. The use of ai in creating music compositions: A case study on suno application. In Proceedings of the 7th Celt International Conference (CIC 2024), 177–189 (Atlantis Press, 2024). URL https://doi.org/10.2991/978-2-38476-348-1_13.
[3] Vaswani, A. et al. Attention is all you need (2023). URL https://arxiv.org/abs/1706.03762. 1706.03762.
[4] Kingma, D. P. & Welling, M. Auto-encoding variational bayes (2022). URL https://arxiv.org/abs/1312.6114. 1312.6114.
[5] Haki, B. Design, development, and deployment of real-time drum accompaniment systems (2025).
[6] Huang, C. A. et al. An improved relative self-attention mechanism for transformer with application to music generation. CoRR abs/1809.04281 (2018). URL http://arxiv.org/abs/1809.04281. 1809.04281.
[7] Huang, Y. & Yang, Y. Pop music transformer: Generating music with rhythm and harmony. CoRR abs/2002.00212 (2020). URL https://arxiv.org/abs/2002.00212. 2002.00212.
[8] Gillick, J., Roberts, A., Engel, J. H., Eck, D. & Bamman, D. Learning to groove with inverse sequence transformations. CoRR abs/1905.06118 (2019). URL http://arxiv.org/abs/1905.06118. 1905.06118.
[9] Roberts, A., Engel, J. H., Raffel, C., Hawthorne, C. & Eck, D. A hierarchical latent vector model for learning long-term structure in music. In Dy, J. G. & Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, vol. 80 of Proceedings of Machine Learning Research, 4361–4370 (PMLR, 2018). URL http://proceedings.mlr.press/v80/roberts18a.html.
[10] Hadjeres, G., Nielsen, F. & Pachet, F. GLSR-VAE: geodesic latent space regularization for variational autoencoder architectures. In 2017 IEEE Symposium Series on Computational Intelligence, SSCI 2017, Honolulu, HI, USA, November 27 - Dec. 1, 2017, 1–7 (IEEE, 2017). URL https://doi.org/10.1109/SSCI.2017.8280895.
[11] Yang, L., Chou, S. & Yang, Y. Midinet: A convolutional generative adversarial network for symbolic-domain music generation using 1d and 2d conditions. CoRR abs/1703.10847 (2017). URL http://arxiv.org/abs/1703.10847. 1703.10847.
[12] Tokui, N. Towards democratizing music production with ai-design of variational autoencoder-based rhythm generator as a DAW plugin. CoRR abs/2004.01525 (2020). URL https://arxiv.org/abs/2004.01525. 2004.01525.
[13] Vigliensoni, M. L. M. E., G. & Fiebrink, R. R-vae: Live latent space drum rhythm generation from minimal-size datasets. Journal of Creative Music Systems 1(1) (2022). URL https://doi.org/10.5920/jcms.902.
[14] Gómez-Marín, D., Jordà, S. & Herrera, P. Drum rhythm spaces: From global models to style-specific maps. In Aramaki, M., Davies, M. E. P., Kronland-Martinet, R. & Ystad, S. (eds.) Music Technology with Swing - 13th International Symposium, CMMR 2017, Matosinhos, Portugal, September 25-28, 2017, Revised Selected Papers, vol. 11265 of Lecture Notes in Computer Science, 123–134 (Springer, 2017). URL https://doi.org/10.1007/978-3-030-01692-0_9.
[15] Raffel, C. & Ellis, D. P. W. Intuitive analysis, creation and manipulation of MIDI data with pretty_midi. In Proceedings of the 15th International Conference on Music Information Retrieval Late Breaking and Demo Papers (2014).
[16] Jure, L., Rocamora, M., Tarsitani, S. & Clayton, M. Iemp uruguayan candombe (2025). URL osf.io/w x7k.
[17] Raffel, C. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. Ph.D. thesis, Columbia University, USA (2016). URL https://doi.org/10.7916/D8N58MHV.
[18] Gillick, J., Roberts, A., Engel, J., Eck, D. & Bamman, D. Learning to groove with inverse sequence transformations. In International Conference on Machine Learning (ICML) (2019).
[19] Gómez-Marín, D., Jordà, S. & Herrera, P. Drum rhythm spaces: From polyphonic similarity to generative maps. Journal of New Music Research 49, 438–456 (2020).
[20] Evans, N., Haki, B., Gómez-Marín, D. & Jordà, S. El bongosero: A crowd-sourced symbolic dataset of improvised hand percussion rhythms paired with drum patterns. In Kaneshiro, B. et al. (eds.) Proceedings of the 25th International Society for Music Information Retrieval Conference, ISMIR 2024, San Francisco, California, USA and Online, November 10-14, 2024, 540–546 (2024). URL https://doi.org/10.5281/zenodo.14877393.
[21] Haki, B., Kotowski, B., Lee, C. L. I. & Jordà, S. Taptamdrum: A dataset for dualized drum patterns. In Sarti, A. et al. (eds.) Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023, Milan, Italy, November 5-9, 2023, 114–120 (2023). URL https://doi.org/10.5281/zenodo.10265237.
Appendix A
First Appendix
During our discussion around the validation of the model, we chose not to discuss the KL Divergence loss during training and validation. In figure 27 we can see that our original model is properly reducing the loss during training and validation.
(a) KL Divergence Training Loss (b) KL Divergence Test Loss
Figure 27: Kullback-Leibler Divergence of Training and Validation
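For reference, the KL term of a VAE is commonly computed in closed form. The sketch below assumes the usual setup from Kingma and Welling [4], a diagonal Gaussian posterior compared against a standard normal prior; this section does not state the exact form or weighting used in the thesis, so the function name and inputs are illustrative only.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian.

    `mu` and `log_var` are equal-length lists holding the mean and the log
    of the variance for each latent dimension.
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


# A posterior identical to the prior has zero divergence.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))

# Shifting one mean by 1 (unit variance) costs 0.5 nats.
print(kl_to_standard_normal([1.0], [0.0]))
```

A decreasing value of this term during training, as in figure 27, indicates that the encoded distributions are moving closer to the prior.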
In the Related Works section we introduced the transformer model. Because this thesis is focused on the design aspect of our system, and transformer models have been used substantially in the past few years, we chose not to go into too much detail with our description. For more information on transformer models, one can read Attention is All You Need (Vaswani et al. 2017) [3]. We can see the architecture of the transformer model in figure 28.
Figure 28: Transformer model architecture [3]
We also chose not to go into too much detail with the Variational Autoencoder model, for the same reasons as we gave for the transformer model. We include a figure of the VAE architecture in figure 29. In his doctoral dissertation, Behzad Haki gives an in-depth explanation of the VAE architecture and we refer the reader to his paper [5].
Figure 29: Variational Autoencoder Architecture [5]

Appendix B
Second Appendix